
Diagnosing Software Faults Using Multiverse Analysis

Prantik Chatterjee¹*, Abhijit Chatterjee¹, José Campos², Rui Abreu³ and Subhajit Roy¹
¹ Indian Institute of Technology, Kanpur, India
² LASIGE, Faculdade de Ciências, University of Lisbon, Portugal
³ INESC-ID and IST, University of Lisbon, Portugal
{prantik, abhijitc, subhajit}@cse.iitk.ac.in, jcmcampos@fc.ul.pt, rui@computer.org
(* Contact Author)

Abstract

Spectrum-based Fault Localization (SFL) approaches aim to efficiently localize faulty components by examining program behavior. This is done by collecting the execution patterns of various combinations of components, together with the corresponding test outcomes, into a spectrum. Efficient fault localization depends heavily on the quality of the spectra. Previous approaches, including the current state-of-the-art Density-Diversity-Uniqueness (DDU) approach, attempt to generate “good” test-suites by improving certain structural properties of the spectra. In this work, we propose a different approach, Multiverse Analysis, that considers multiple hypothetical universes, each corresponding to a scenario where one of the components is assumed to be faulty, and generates a spectrum that attempts to reduce the expected worst-case wasted effort over all the universes. Our experiments show that Multiverse Analysis not only improves the efficiency of fault localization but also achieves better coverage and generates smaller test-suites than DDU, the current state-of-the-art technique. On average, our approach reduces the developer effort over DDU by more than 16% for over 92% of the instances. Further, the improvements over DDU are statistically significant on the paired Wilcoxon signed-rank test.

1 Introduction

Spectrum-based Fault Localization (SFL) techniques [Abreu et al., 2009b] have proved to be immensely helpful at localizing faults in large code-bases [Pearson et al., 2017].

These techniques aim to identify the faulty components, e.g., faulty lines of code, based on a statistical analysis of the test spectra. A spectrum captures information on the activity pattern of each test case (which components are executed) and the resulting outcomes (which tests fail). Given a faulty program P with m components {c_1, c_2, ..., c_m}, the quality of the test suite is the key factor for SFL to produce accurate diagnostic reports [Campos et al., 2013]. Test suites can be created either manually or using automatic test generation techniques [Campos et al., 2014]. A test-suite generator can produce a test-suite T on P by optimizing a fitness function f; the function f is taken as a measure of the quality of a test-suite. The activity pattern of a test-case t ∈ T can be represented as an m-dimensional binary vector whose i-th element is 1 if the corresponding component c_i was executed (activated) in test t.

The behavior of a test-suite consisting of n such test-cases can, therefore, be captured by an n × m binary matrix A, where the element a_ij is set to 1 if the i-th test executed the j-th component of P. This matrix is referred to as an activity matrix; the rows of A correspond to the activity patterns of the test-cases, while the columns correspond to the involvement patterns of the corresponding components. The matrix A is complemented with an error vector E that captures the test outcomes: if test t_i passes, the corresponding entry E_i = 0, and if it fails, E_i = 1. The tuple (A, E) is referred to as the spectrum. SFL techniques use this spectrum to rank the components of P by their suspiciousness of being faulty (several formulae to quantify suspiciousness exist [Pearson et al., 2017; Lucia et al., 2014; Wong et al., 2016]). Developers are expected to examine the components in decreasing order of their suspiciousness scores until the faulty component is identified; thus, one generally constructs an ordered list, L, by ranking the components in descending order of their suspiciousness.

As mentioned before, the effectiveness of an SFL technique heavily depends on the quality of the test-suite for diagnosability, which can be measured by the rank of the (ground-truth) faulty component in L: if the rank of the faulty component is lower, a developer needs to examine fewer components before she identifies the faulty one. This effort, termed the cost of diagnosis (D), is measured as $D = \frac{r}{m}$, where r is the rank of the faulty component in L and m is the total number of components. We can also quantify the cost of diagnosis by the wasted effort (W), which captures the effort wasted in examining non-faulty components before hitting the faulty one: $W = \frac{r-1}{m-1}$.
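As a concrete illustration (a minimal Python sketch of our own, not part of the paper's tooling), both effort measures follow directly from the 1-based rank r of the faulty component:

    def cost_of_diagnosis(rank: int, m: int) -> float:
        """D = r/m: fraction of components examined, faulty one included."""
        return rank / m

    def wasted_effort(rank: int, m: int) -> float:
        """W = (r-1)/(m-1): fraction of non-faulty components examined first."""
        return (rank - 1) / (m - 1)

    # Example: a fault ranked 3rd among 10 components.
    assert cost_of_diagnosis(3, 10) == 0.3
    assert wasted_effort(3, 10) == 2 / 9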

A recent work [Perez et al., 2017b] claims that test-suites with good scores on adequacy metrics (like coverage) do not necessarily offer good diagnosability. In fact, the authors performed rigorous experiments to prove the contrary and proposed a new metric, DDU, that attempts to capture the quality of test-suites for SFL. Interestingly, such metrics can be plugged in as fitness functions within Search-Based Software Testing (SBST) [McMinn, 2011] to automatically generate test-suites with the desired properties. One such popular search-based unit test generator, EvoSuite [Fraser and Arcuri, 2011], accepts a program P and a fitness function f, and employs a genetic algorithm to generate good test-suites by optimizing f. EvoSuite also provides fault oracles in the form of a set of assertions that model the behavior of P. By examining deviations from these assertions, EvoSuite can produce test outcomes and generate the error vector for a test-suite.

In this work, we propose a new metric, Ulysis, to capture the quality of test-suites for diagnosability. Instead of relying on structural properties of the activity matrix that are likely to be good proxies of test-suite diagnosability (the road taken by prior efforts, like DDU [Perez et al., 2017b]), our metric directly considers multiple hypothetical universes (collectively defined as a multiverse), each universe assuming one component to be faulty, and computes the expected worst-case wasted effort over these hypothetical universes. We implement our metric in EvoSuite and perform experiments on real-life software faults from the Defects4J benchmark (version 1.4.0). Our experiments demonstrate that Ulysis outperforms the current state-of-the-art metric, DDU, on all the relevant criteria: diagnosability, coverage and size of test-suites. The Wilcoxon signed-rank test showed that our fault localization improvements over DDU are indeed statistically significant. The following are the contributions of this work:

• We propose a new metric, Ulysis, to measure the diagnosability of test-suites, essentially computing the expected worst-case wasted effort instead of using proxies for good diagnosability as in previous works;
• We implement our metric as a fitness function in EvoSuite and evaluate the test-suites generated by our metric against those generated by DDU and coverage.

2 Our Approach

2.1 Ulysis: Multiverse Analysis

Test-generation metrics accept an activity matrix A as input and provide a score that quantifies the goodness of the test-suite. (Note that test-generation metrics do not have access to the error vector, as the test-generation phase has no access to the fault oracles; fault localization is performed post test-generation.) Given an activity matrix A, as the faulty component is not known, we design a metric that attempts to reduce the worst-case wasted effort for all components. Given a program P with a set of m components C = {c_1, c_2, ..., c_m}, we consider m hypothetical universes (a multiverse): component c_i is assumed to be faulty in the i-th hypothetical universe. Hence, the hypothetical universe Z_i operates on a spectrum consisting of the activity matrix A and a hypothetical error vector E_i, reflecting what the error vector would have been if c_i were (persistently) faulty.

This synthesized error vector for Z_i is, thus, nothing but the involvement pattern of c_i: a test passes whenever c_i is not activated and fails whenever it is. For each such hypothetical universe Z_i, we compute the worst-case wasted effort, i.e., the effort we would waste, in the worst case, to localize c_i as the faulty component in Z_i. Clearly, c_i will be the most likely faulty candidate in Z_i, as c_i = E_i, i.e., its involvement pattern matches the error vector perfectly. However, all components from {c_1, c_2, ..., c_m} that have the same involvement pattern as c_i will also have the same likelihood of being faulty in Z_i.

Assuming the number of such components is r, we will, in the worst-case scenario, end up examining these r components before identifying c_i as the actual faulty component. From this observation, we define a highest ambiguity set L_i as:

$$
L_i =
\begin{cases}
\{\, c_j \mid c_j \in C,\ j \neq i \,\} & \text{if } c_i = \vec{0}, \\
\{\, c_j \mid c_j \in C,\ c_j = c_i,\ j \neq i \,\} & \text{otherwise.}
\end{cases}
\qquad (1)
$$

Hence, the highest ambiguity set L_i contains all components that, due to having the same involvement pattern as c_i, cannot be distinguished from c_i by any similarity-based fault localization algorithm. As the elements of L_i are components that, in the worst case, will be examined before c_i, the cardinality of L_i can be taken as a measure of the worst-case wasted effort for localizing c_i given Z_i. We thus compute the worst-case wasted effort in Z_i as $W_i = \frac{|L_i|}{m-1}$. In the ideal scenario, where we can perfectly localize c_i as the faulty component with zero wasted effort, W_i = 0 (the highest ambiguity set is empty), and in the worst-case scenario, where we end up examining all other candidates before finally identifying c_i as the faulty component, W_i = 1 (the highest ambiguity set contains all components other than c_i).

Therefore, our objective while generating test-suites is to minimize W_i for all components. We define the overall quality of the test-suite represented by A as the expectation over all W_i:

$$
\mathcal{W}_{Ulysis} = \sum_{i=1}^{m} p(c_i) \cdot W_i ,
$$

where p(c_i) is our prior belief about c_i being the actual faulty component. A previous work [Paterson et al., 2019] has attempted to extract such distributions from the past history of failures, the number of repository commits, etc. Without prior knowledge (as in this work), we assume an uninformed prior where all components are equally likely to be faulty, i.e., p(c_1) = p(c_2) = ... = p(c_m) = 1/m. Thus $\mathcal{W}_{Ulysis}$ becomes:

$$
\mathcal{W}_{Ulysis} = \frac{1}{m} \sum_{i=1}^{m} W_i .
$$

We refer to $\mathcal{W}_{Ulysis}$ as the Ulysis score.
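To make the computation concrete, the following is a minimal Python sketch of the Ulysis score under the uniform prior, following the definitions above (the function name and the NumPy-based representation are ours; the paper's actual implementation lives inside EvoSuite, in Java):

    import numpy as np

    def ulysis_score(A: np.ndarray) -> float:
        """Ulysis score of an n x m activity matrix A (rows are tests,
        columns are components). Lower is better: it is the expected
        worst-case wasted effort over the m hypothetical universes."""
        n, m = A.shape
        W = np.empty(m)
        for i in range(m):
            if not A[:, i].any():
                # c_i is never executed: E_i is the 0-vector, so L_i
                # contains all other components (first case of Eqn. 1).
                W[i] = 1.0
            else:
                # L_i contains every c_j (j != i) whose involvement
                # pattern equals that of c_i (second case of Eqn. 1).
                same = (A.T == A[:, i]).all(axis=1)
                W[i] = (same.sum() - 1) / (m - 1)
        return W.mean()  # uniform prior: p(c_i) = 1/m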

Since enhancing the quality of a test-suite can be expressed in terms of minimizing the Ulysis score, we can plug the Ulysis score as a fitness function into any SBST tool that generates test-suites by optimizing a given fitness function.

There is one major advantage of using $\mathcal{W}_{Ulysis}$ instead of DDU to measure the quality of A. Consider a situation where a particular component c_k is never executed in any test case. In such cases, the k-th column of A will contain all 0 values. If c_k is a 0-vector, then the corresponding imaginary error vector E_k in the hypothetical universe Z_k will also be a 0-vector. In that case, following Eqn. (1), L_k will contain all the (m − 1) components of C other than c_k. Consequently, the value of W_k, i.e., the worst-case wasted effort to identify c_k as faulty, will be 1. Therefore, to minimize W_k, the component c_k must be executed in at least one test case. This matters because, if c_k is the faulty component and it is never executed, there is no way for us to identify c_k as the faulty component. Optimizing our proposed metric will therefore also result in higher coverage of the program.

[Figure 1: Multiverse analysis.]

We give a brief demonstration of how to compute $\mathcal{W}_{Ulysis}$ using the example shown in Figure 1. We start by assuming a hypothetical universe Z_1 where c_1 is the faulty component. Then, the corresponding imaginary error vector will have the same pattern as c_1; this imaginary error vector in Z_1 is shown as E_1. Since components {c_2, c_3, c_4} share the same involvement pattern as c_1, L_1 = {c_2, c_3, c_4}. Therefore, $W_1 = \frac{|L_1|}{m-1} = \frac{3}{5}$.

Similarly, W_2 = W_3 = W_4 = 3/5. Now, when we assume c_5 to be the faulty component in the hypothetical universe Z_5, E_5 becomes the corresponding imaginary error vector and L_5 = ∅, as no other component shares the same involvement pattern as c_5; therefore, W_5 = 0. When we assume c_6 to be faulty, the corresponding imaginary error vector E_6 in Z_6 is a 0-vector, as c_6 is never executed in any test-case; therefore, W_6 = 1. Hence, $\mathcal{W}_{Ulysis} = \frac{1}{6}\left(\frac{3}{5} + \frac{3}{5} + \frac{3}{5} + \frac{3}{5} + 0 + 1\right) = \frac{17}{30}$.
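The worked example can be checked against the ulysis_score sketch above. The matrix below is a hypothetical two-test stand-in with the same involvement-pattern structure as Figure 1 (c_1 to c_4 share a pattern, c_5 is unique, c_6 is never executed), not the figure's exact rows:

    A = np.array([[1, 1, 1, 1, 0, 0],
                  [0, 0, 0, 0, 1, 0]])
    assert np.isclose(ulysis_score(A), 17 / 30)  # matches the computation above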

Our formulation makes two simplifying assumptions:

• Single fault: we assume that the program has a fault in a single component.
• Perfect detection: we assume that the tests do not exhibit flakiness [Bell et al., 2018], i.e., the outcome of a test case is failure if and only if the faulty component is triggered in that test case.

Note that the single-fault assumption is common to most SFL-based works; further, Perez et al. [Perez et al., 2017a] show that it is a fair assumption. The single-fault assumption keeps our model simple and efficient. We also assume perfect detection rather than injecting random noise because: (1) we would lose the predictable behavior of a deterministic fitness function, and (2) with no knowledge of the magnitude and direction of the noise in the output, noise introduced in the input will often turn additive, deteriorating performance further. Our evaluations on the Defects4J benchmark (which has real programs with multiple faults) show that Ulysis outperforms the state-of-the-art even when these assumptions (including perfect detection) are violated.

2.2 Algorithm

Algorithm 1: Ulysis
1   population ← Initialize();
2   while not converged and not timed out do
3       T ← ∅;
4       for each test-suite t in population do
5           for each involvement pattern A[i] (i = 1..m) of the activity matrix A of t do
6               W_i ← 1 if A[i] = 0⃗, otherwise W_i ← (1/(m−1)) · Σ_{j=1..m, j≠i} [A[j] = A[i]];
7           W_t ← (1/m) · Σ_{i=1..m} W_i;
8           T ← T ∪ ⟨t, W_t⟩;
9       population ← SelectTopTestSuites(T);
10  return population

Algorithm 1 describes our test-suite generation in EvoSuite using Ulysis. The underlying evolutionary algorithm in EvoSuite starts by initializing a population of test-suites (Line 1). Then, the algorithm is guided through the search-space by our fitness function (Lines 2-9), i.e., the Ulysis approach. In detail, for each test-suite t in the population (Line 4), the algorithm uses t's activity matrix A (Line 5) to compute the worst-case wasted effort W_i by assuming each component faulty, one at a time (Line 6). Then, it computes the Ulysis score W_t of t by averaging all worst-case wasted efforts W_i (Line 7). Finally, it collects the Ulysis scores of all test-suites (Line 8). On each iteration of the evolutionary algorithm, the population is refined by selecting the top test-suites based on W_t from T (Line 9), until the fitness function converges or the time budget is exceeded.
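A runnable toy rendering of Algorithm 1's loop is sketched below, building on the ulysis_score function above. It evolves activity matrices directly (as stand-ins for test-suites) and uses a hypothetical single-bit-flip mutation in place of EvoSuite's genetic operators; the fixed generation count stands in for the convergence/time-budget check:

    import numpy as np

    rng = np.random.default_rng(0)

    def mutate(A: np.ndarray) -> np.ndarray:
        """Toy mutation: flip one random cell of the activity matrix."""
        B = A.copy()
        i, j = rng.integers(B.shape[0]), rng.integers(B.shape[1])
        B[i, j] ^= 1
        return B

    def evolve(population, generations=100, keep=5):
        """Sketch of Algorithm 1: score each candidate with ulysis_score
        (Lines 4-8), keep the fittest (Line 9), refill by mutation."""
        for _ in range(generations):                       # Line 2 (fixed budget)
            scored = sorted(population, key=ulysis_score)  # Lines 3-8
            population = scored[:keep]                     # Line 9
            population += [mutate(A) for A in population]
        return min(population, key=ulysis_score)           # Line 10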

3 Ulysis versus DDU

Density-Diversity-Uniqueness (DDU) [Perez et al., 2017b], the state-of-the-art metric for diagnosability, uses three structural properties of an activity matrix A to generate test-suites that are good in terms of their fault localization capability.

Density. Given a program P with m components and a test-suite with n tests, the density of an activity matrix A is defined as:

$$
\rho = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij}}{n \times m} .
$$

This metric essentially attempts to improve the entropy of the activity matrix; hence, the ideal value of density is 0.5.

Diversity. Test-cases having the same activity pattern are redundant; they only increase the size of the test-suite. Test-cases should be diverse, i.e., execute different combinations of components. The diversity measure [Perez et al., 2017b] tries to ensure that each test pattern in A (rows of A) is unique. Mathematically, this is expressed as the Gini-Simpson index [Jost, 2006]:

$$
G = 1 - \frac{\sum_{i=1}^{k} n_i (n_i - 1)}{n (n - 1)} ,
$$

where k represents the number of groups of test cases with identical activity patterns, n_i is the number of test cases belonging to group i, and n is the total number of test cases. Essentially, G measures how likely it is for two test cases, chosen at random from A, to have distinct activity patterns. If all test cases are unique, the value of G is 1.

Uniqueness. The uniqueness measure [Baudry et al., 2006] aims to reduce the number of components having the same involvement pattern (columns of A). To formulate uniqueness, we first define ambiguity groups: components that share the same involvement pattern form an ambiguity group. Uniqueness of a test-suite is measured as $U = \frac{l}{m}$, where l is the number of ambiguity groups in A and m is the total number of components. For a good test-suite, the value of U should be 1, i.e., all component patterns should be unique.

Finally, DDU is defined as: $(1 - |1 - 2\rho|) \times G \times U$.
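For comparison, the three DDU ingredients translate into the following sketch (again our own array-based rendering of the published formulas, assuming n > 1 tests):

    import numpy as np

    def ddu(A: np.ndarray) -> float:
        """DDU = (1 - |1 - 2*rho|) * G * U for an n x m activity matrix A."""
        n, m = A.shape
        rho = A.sum() / (n * m)                                   # density
        _, counts = np.unique(A, axis=0, return_counts=True)      # row groups
        G = 1.0 - (counts * (counts - 1)).sum() / (n * (n - 1))   # Gini-Simpson
        l = np.unique(A.T, axis=0).shape[0]                       # ambiguity groups
        U = l / m                                                 # uniqueness
        return (1 - abs(1 - 2 * rho)) * G * U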

[Figure 2: This example demonstrates a scenario where Ulysis is able to judge (b) as a better test-suite than (a), whereas DDU considers them both to be of the same quality.]

Let us analyze DDU for diagnosability. Figures 2(a) and (b) show two test-suites for a program with six components where the value of U is less than 1 (U = 1 is generally infeasible due to branch correlations). For both test-suites, density (ρ) is 0.5 and diversity (G) is 1, as all the test cases have unique activity patterns. For the test-suite in Figure 2(a), there are three ambiguity groups: {c_1, c_2, c_3, c_4}, {c_5} and {c_6}; therefore, the uniqueness score is U = 3/6 = 0.5. The test-suite in Figure 2(b) also contains three ambiguity groups: {c_1, c_2}, {c_3, c_4} and {c_5, c_6}; hence, the value of U is again 0.5. Therefore, according to the DDU metric, both test-suites are equally good in terms of the effort spent to localize the faulty component, and either is equally likely to be selected. Now, let us analyze the two possible scenarios:

The test-suite in Figure 2(a) is chosen. If any of the components {c_1, c_2, c_3, c_4} is faulty, we would end up examining 60% of the program components before identifying the actual fault. For real programs, which may contain thousands of components, this would be disastrous. However, if c_5 is the faulty component, it will be identified with zero wasted effort (a lucky situation!).

The test-suite in Figure 2(b) is chosen. In this case, regardless of which component is faulty, we will never have to examine more than 20% of the program components before discovering the actual fault, even in the worst-case scenario.

Hence, DDU is incapable of discriminating test-suites based on their worst-case behavior. Given that we have no prior knowledge about which component is faulty, it is far more reasonable to select the second test-suite for efficient fault localization. The Ulysis scores of the test-suites in Figures 2(a) and (b), on the other hand, are 2/5 and 1/5 respectively. This clearly demonstrates that, unlike DDU, Ulysis is capable of discriminating test-suites based on worst-case scenarios.
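This difference can be reproduced with the sketches above on hypothetical matrices that share the ambiguity-group structure of Figures 2(a) and (b) (the figures' exact rows are not reproduced here, so the densities differ from the paper's example; the ambiguity groups, uniqueness and Ulysis scores match):

    # (a): groups {c1..c4}, {c5}, {c6};  (b): groups {c1,c2}, {c3,c4}, {c5,c6}.
    A_a = np.array([[1, 1, 1, 1, 1, 0],
                    [0, 0, 0, 0, 1, 1]])
    A_b = np.array([[1, 1, 0, 0, 1, 1],
                    [0, 0, 1, 1, 1, 1]])
    assert np.isclose(ulysis_score(A_a), 2 / 5)  # dominated by W = 3/5 for c1..c4
    assert np.isclose(ulysis_score(A_b), 1 / 5)  # W = 1/5 for every component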

Note that optimizing for the worst case reduces the chances of riding on lucky situations (as in the case of DDU picking the first test-suite and the component c_5 being faulty). This is seen in our experiments (Figure 3), where DDU performs exceedingly better in a few instances (7%), but the average decrease in effort while using Ulysis rather than Coverage is over 13% for more than 95% of all faults, and the decrease over DDU is over 16% for more than 92% of instances.

4 Experiments

Test-suites can be evaluated on three criteria:

• Coverage: though not a good diagnosability metric [Staats et al., 2012], coverage is still important because it allows faults to be triggered; diagnosability metrics are helpless unless failing tests are found.
• Diagnosability: this captures low wasted effort (or high suspiciousness scores) for ground-truth faults, given that fault-triggering tests are available.
• Cost: this captures the runtime cost of testing; smaller test-suites are preferred over bigger ones.

We pose three research questions to evaluate the performance of our proposal:

RQ1 What is the saving of developer effort by our proposal over prior techniques?
RQ2 Is the improvement in the ranking of the faulty component by our proposal statistically significant over prior techniques like DDU and coverage?
RQ3 Is the quality of the test-suites (size and coverage) produced by our technique better than that of existing techniques?

We performed our experiments on Defects4J version 1.4.0 [Just et al., 2014], a benchmark suite consisting of six diverse Java project repositories and containing 395 real-life software faults. Given that our experimental evaluations show an improvement in fault localization over the Defects4J benchmark suite, the results should generalize. We implemented Ulysis as a fitness function within EvoSuite [Fraser and Arcuri, 2011] (Ulysis is available in EvoSuite as part of pull request #293, https://github.com/EvoSuite/evosuite/pull/293) and compared it with other state-of-the-art fitness functions such as DDU and coverage [Fraser and Arcuri, 2015] (available within EvoSuite). DDU [Perez et al., 2017b] was shown to be better than current SFL techniques, so we restricted our comparison to DDU and Coverage. To account for the randomization within EvoSuite, for each fault we generated 5 test-suites using a time limit of 600 seconds for each fitness function. Post test-suite generation, we perform fault localization with the Ochiai metric using the GZoltar tool [Campos et al., 2012].

Table 1: Comparisons w.r.t. wasted effort and top-k fault localization.

  ∆W      cov       DDU
  > 0     59.05%    54.96%
  = 0     12.38%    13.51%
  < 0     28.57%    31.53%

  top-k   Ulysis    DDU
  5       34.11%    23.88%
  10      44.19%    35.07%
  50      70.54%    66.42%

We use Ochiai because it usually outperforms other similarity-based metrics [Abreu et al., 2006; Pearson et al., 2017] and it is as good as Bayesian reasoning techniques if we assume single faults [Abreu et al., 2009c]. Both Ochiai and spectrum-based reasoning approaches like Barinel [Abreu et al., 2009a] produce the theoretically optimal ranking under the perfect-detection and single-fault assumptions (assumptions made in most SFL work, including ours). Finally, it is known that other approaches, like model-based reasoning, are computationally prohibitive on large code-bases [Wong et al., 2016] such as the ones used in our experiments.
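The paper does not spell out the Ochiai formula itself; in its standard form, a component's suspiciousness is n_ef / sqrt((n_ef + n_nf)(n_ef + n_ep)), where n_ef and n_ep count failing and passing tests that execute the component, and n_nf counts failing tests that do not. A sketch:

    import numpy as np

    def ochiai(A: np.ndarray, E: np.ndarray) -> np.ndarray:
        """Standard Ochiai suspiciousness per column of the n x m activity
        matrix A, given the error vector E (1 marks a failing test)."""
        n_ef = A[E == 1].sum(axis=0)   # failing tests executing each component
        n_ep = A[E == 0].sum(axis=0)   # passing tests executing each component
        n_f = (E == 1).sum()           # total failing tests (n_ef + n_nf)
        denom = np.sqrt(n_f * (n_ef + n_ep))
        return np.divide(n_ef, denom, out=np.zeros(A.shape[1]), where=denom > 0)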

Our experimental methodology of generating tests on the golden version is a standard setup for all SFL-related work. Of course, this methodology assumes a test oracle that captures the program specification. Generation of the test oracle is orthogonal, and a concern of the base framework (EvoSuite, in our case). In real-world scenarios, for maintenance projects the previous version is often taken as the golden version for regression, and for others, oracles can be constructed from programmer annotations.

We did not consider all test-suites for fault localization, as in some cases either EvoSuite generated an empty test-suite or no failing test cases were present in the spectrum when executed on the faulty version, for either Ulysis or the state-of-the-art approaches (DDU and coverage). We found 111 valid instances, on which we compare the median effort over 5 test-suite generation attempts with each of Ulysis, DDU and coverage. Note that, in our experiments, both Ulysis and DDU detect a similar number of faults (in Defects4J), showing that Ulysis is comparable to the state-of-the-art in fault detection. This also demonstrates that the generation of empty test-suites or test-suites without fault-triggering test cases is a limitation of EvoSuite and not of the fitness functions [Shamshiri et al., 2015].

All experiments are performed at branch granularity, i.e., the program components are branches. We ran these experiments on a 16-core virtual machine with Intel Xeon processors at 2.1 GHz core frequency and 32 gigabytes of RAM.

Threats to validity. In this study, we used faults taken from only six Java open-source projects. Although our results may not generalize to other Java projects with different characteristics, we followed the setup proposed by others to ease the comparison with previous works. Also, our study considered only one test generation tool. There are, however, other tools that may generate test cases that further improve the performance of the underlying fault localization technique. We leave such analysis for future work.

4.1 RQ1: Fault Localization Performance

[Figure 3: Percentage decrease of effort in fault localization while using Ulysis instead of DDU, per Defects4J fault; the y-axis is truncated at −100% and 100%.]

We quantify the goodness of test-suites by their wasted effort W. Given a fault, two fitness functions, say A and B, are compared by the sign of the difference in wasted effort, ∆W = W_A − W_B; as a lower value of wasted effort is better, B is the better-performing metric if ∆W > 0. We denote ∆W_cov = W_Coverage − W_Ulysis and ∆W_DDU = W_DDU − W_Ulysis, where W_Coverage, W_DDU and W_Ulysis represent the wasted effort needed to localize the fault on test-suites generated by Coverage, DDU and Ulysis, respectively. Rows 1, 2 and 3 of Table 1 show the fraction of instances where the ∆W values are positive (Ulysis is better), zero (both metrics are equivalent, i.e., the difference is within 1e−15) and negative (Ulysis is worse). Ulysis is better than both competing fitness functions in more than about 55% of the instances, while being better or equivalent in about 68%. Table 1 also compares Ulysis and DDU using the top-k metric, which is standard practice [Pearson et al., 2017].

The top-k metric shows the percentage of cases, for each fitness function, where the faulty component appears within the top k positions of the ranked list after performing fault localization on the test-suites produced by that function. As the table shows, test-suites produced by Ulysis are usually superior with respect to the top-k metric; the comparison between Ulysis and Coverage w.r.t. the top-k metric is similar.

Figure 3 details the percentage decrease in fault localization effort while using W_Ulysis rather than W_DDU on all of our 111 (faulty) instances: Ulysis reduces effort in most instances. In some cases (around 7% of instances), the competing metrics get “lucky” and are able to significantly decrease effort; a few such cases are expected, as per our discussion in Section 3. Since Ulysis optimizes fault localization efficiency over all components, we treat such instances as outliers. For fairness, we assume that any instance where one metric outperforms another by over 100% is an outlier and is not considered while summarizing the results (the graphs are truncated at 100% and −100% on both ends of the y-axis). Our results against coverage are similar (omitted for brevity). In summary, the average decrease in effort while using Ulysis rather than Coverage is over 13% for more than 95% of all faults, and the decrease over DDU is over 16% for more than 92% of all faults in Defects4J.

4.2 RQ2: Statistical Significance

Having seen that Ulysis indeed seems to improve fault localization, we ask whether the improvement is statistically significant.

We take the effort needed to localize each fault by coverage, DDU and Ulysis as individual data columns (W_Coverage, W_DDU, W_Ulysis) and perform a Shapiro-Wilk test [Mohd Razali and Yap, 2011] for normality on each of these columns. The test refutes the null hypothesis that any of these data columns is drawn from a normal distribution, with 99% confidence; the corresponding p-values are 5.20e−14, 7.02e−14 and 1.98e−14, respectively. Having concluded that the effort data columns are not normally distributed, we perform a paired Wilcoxon signed-rank test on ∆W_cov = (W_Coverage − W_Ulysis) and ∆W_DDU = (W_DDU − W_Ulysis). Our null hypothesis is that the medians of both ∆W_cov and ∆W_DDU are 0, while the alternate hypothesis is that the medians are greater than 0. In both cases, we are able to refute the null hypothesis with 99% confidence, with p-values of 0.0015 for ∆W_cov and 0.0017 for ∆W_DDU.
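This procedure maps directly onto scipy.stats; a sketch, assuming W_cov, W_ddu and W_ulysis are NumPy arrays holding the per-fault efforts for the 111 valid instances:

    from scipy.stats import shapiro, wilcoxon

    def significance_report(W_cov, W_ddu, W_ulysis):
        """Shapiro-Wilk normality checks, then one-sided paired Wilcoxon
        signed-rank tests on the paired effort differences."""
        for name, col in (("cov", W_cov), ("DDU", W_ddu), ("Ulysis", W_ulysis)):
            print(name, "normality p-value:", shapiro(col).pvalue)
        # H0: the median difference is 0; H1: it is greater than 0.
        print("cov vs Ulysis:", wilcoxon(W_cov - W_ulysis, alternative="greater").pvalue)
        print("DDU vs Ulysis:", wilcoxon(W_ddu - W_ulysis, alternative="greater").pvalue)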

4.3 RQ3: Quality of Test-suites (Size and Coverage)

Table 2: Comparison between the quality of test-suites generated by Coverage, DDU and Ulysis. Each value is the median for the corresponding fitness function.

  Function   Cov    Uniq   Size   DDU    Ulysis
  DDU        0.88   0.69   25     0.65   0.13
  Coverage   0.93   0.35   10     0.14   0.16
  Ulysis     0.92   0.76   18     0.33   0.08

In Table 2, we show the median values of coverage (Cov), uniqueness (Uniq), test-suite size in number of test cases (Size), the DDU score and the Ulysis score of the test-suites generated by DDU, Coverage and Ulysis. Not surprisingly, the test-suites generated by coverage attain the highest coverage with the fewest tests; however, the diagnosability of these test-suites is poor. Ulysis is comparable to the coverage metric in terms of the percentage of components covered. Ulysis beats DDU, the current state-of-the-art fitness function for diagnosability, on uniqueness, coverage and test-suite size. As discussed previously, uniqueness is a very important metric: the diagnosability of a test-suite is directly correlated with its uniqueness score, and Ulysis scores higher than all the other fitness functions in this regard, even higher than DDU, which includes uniqueness as part of its fitness function. This indicates that optimizing the expected worst-case wasted effort automatically optimizes this very important metric. Note that the numbers are similar if we use the mean instead of the median.

5 Related Work

Related studies on fault localization primarily focus on two key aspects: test-suite generation and fault localization. Test-suite generation approaches can be broadly categorized into approaches that improve test-suite adequacy (such as branch coverage [Fraser and Arcuri, 2011]) versus test-suite diagnosability (e.g., [Perez et al., 2017b]). It has been shown that test-suite adequacy measures do not have a direct correlation with the effectiveness of fault localization [Staats et al., 2012]. Other studies have demonstrated that coverage and size of test-suites together exhibit a stronger non-linear relation with the fault localization capabilities of a test-suite [Namin and Andrews, 2009]. Our approach, while focused on enhancing the diagnosability of test-suites, can also improve their adequacy.

Approaches directed at improving diagnosability have attempted to maximize the entropy of the activity matrix [Abreu et al., 2009a], which, however, can lead to prohibitively large test-suites. Subsequent works [Gonzalez-Sanchez et al., 2011] have attempted to contain this explosion in size by optimizing the density of the activity matrix (towards an ideal 0.5). Other metrics, such as Uniqueness [Baudry et al., 2006] and DDU [Perez et al., 2017b], focus on enhancing certain structural properties of the activity matrix in order to improve the fault localization performance of test-suites. While the objective of our approach is also to improve the diagnosability of test-suites, we choose to directly attack the fault localization performance instead of optimizing proxies for it. The test-suites generated by our approach are smaller than those generated by diagnosability and entropy-optimization metrics such as DDU, while providing adequacy comparable to that of adequacy-enhancement metrics such as coverage. Since the fault localization performance of our approach is also better, our test-suites are both more efficient and, by virtue of being smaller, less computationally expensive and faster to execute. Another advantage is the interpretability of the Ulysis score of a test-suite: since we assume that each component may be faulty while computing the score, it can also be interpreted as an estimate of the fault localization performance of a test-suite even prior to its execution. No other test-suite generation metric provides such an estimate. Also, recent approaches that are orthogonal to test-suite generation but use SFL [Liu et al., 2019] can benefit from our work.

The diagnostic accuracy of similarity-based fault localization is dictated not just by the quality of the test suite, but also by the efficiency of the metric used to compare components' involvement patterns with error vectors. There are many such metrics [Lucia et al., 2014]; a recent study [Pearson et al., 2017] has identified Ochiai as being amongst the best. Note that, although we have chosen to perform fault localization using the Ochiai score, our approach is not dependent on any particular fault localization method.

6 Conclusion

We propose a test-suite diagnosability metric that, unlike previous state-of-the-art approaches, directly attacks the fault localization metric via a multiverse model, instead of using structural properties of the activity matrix as proxies (like density or uniqueness). The test-suites generated by our method are not only statistically better in terms of diagnostic accuracy, but also provide comparable or better code coverage.

Acknowledgements

This material is based upon work supported by Fundação para a Ciência e Tecnologia (FCT), through project with ref. PTDC/CCI-COM/29300/2017, and by the research units UIDB/50021/2020, UIDB/00408/2020 and UIDP/00408/2020.

References

[Abreu et al., 2006] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. An Evaluation of Similarity Coefficients for Software Fault Localization. In Proc. of the 12th PRDC, pages 39–46. IEEE Computer Society, 2006.

[Abreu et al., 2009a] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. Spectrum-Based Multiple Fault Localization. In Proc. of the 2009 IEEE/ACM ASE, pages 88–99. IEEE Computer Society, 2009.

[Abreu et al., 2009b] Rui Abreu, Peter Zoeteweij, Rob Golsteijn, and Arjan J. C. van Gemund. A Practical Evaluation of Spectrum-Based Fault Localization. J. Syst. Softw., 82(11):1780–1792, 2009.

[Abreu et al., 2009c] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. A New Bayesian Approach to Multiple Intermittent Fault Diagnosis. In Proc. of the 21st IJCAI, pages 653–658. Morgan Kaufmann Publishers Inc., 2009.

[Baudry et al., 2006] Benoit Baudry, Franck Fleurey, and Yves Le Traon. Improving Test Suites for Efficient Fault Localization. In Proc. of the 28th ICSE, pages 82–91. ACM, 2006.

[Bell et al., 2018] Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. DeFlaker: Automatically Detecting Flaky Tests. In Proc. of the 40th ICSE, pages 433–444. ACM, 2018.

[Campos et al., 2012] José Campos, André Riboira, Alexandre Perez, and Rui Abreu. GZoltar: An Eclipse Plug-in for Testing and Debugging. In Proc. of the 27th IEEE/ACM ASE, pages 378–381. ACM, 2012.

[Campos et al., 2013] José Campos, Rui Abreu, Gordon Fraser, and Marcelo d'Amorim. Entropy-Based Test Generation for Improved Fault Localization. In Proc. of the 28th IEEE/ACM ASE, pages 257–267. IEEE Press, 2013.

[Campos et al., 2014] José Campos, Andrea Arcuri, Gordon Fraser, and Rui Abreu. Continuous Test Generation: Enhancing Continuous Integration with Automated Test Generation. In Proc. of the 29th ACM/IEEE ASE, pages 55–66. ACM, 2014.

[Fraser and Arcuri, 2011] Gordon Fraser and Andrea Arcuri. EvoSuite: Automatic Test Suite Generation for Object-Oriented Software. In Proc. of the 19th ACM ESEC/FSE, pages 416–419. ACM, 2011.

[Fraser and Arcuri, 2015] Gordon Fraser and Andrea Arcuri. 1600 Faults in 100 Projects: Automatically Finding Faults While Achieving High Coverage with EvoSuite. Empirical Softw. Engg., 20(3):611–639, 2015.

[Gonzalez-Sanchez et al., 2011] Alberto Gonzalez-Sanchez, Hans-Gerhard Gross, and Arjan J. C. van Gemund. Modeling the Diagnostic Efficiency of Regression Test Suites. In Proc. of the 4th IEEE ICSTW, pages 634–643, 2011.

[Jost, 2006] Lou Jost. Entropy and Diversity. Oikos, 113(2):363–375, 2006.

[Just et al., 2014] René Just, Darioush Jalali, and Michael D. Ernst. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proc. of the 23rd ISSTA, pages 437–440. ACM, 2014.

[Liu et al., 2019] Kui Liu, Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, and Yves Le Traon. You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems. In Proc. of the 12th IEEE ICST, pages 102–113, 2019.

[Lucia et al., 2014] Lucia Lucia, David Lo, Lingxiao Jiang, Ferdian Thung, and Aditya Budi. Extended Comprehensive Study of Association Measures for Fault Localization. J. Softw. Evol. Process, 26(2):172–219, 2014.

[McMinn, 2011] Phil McMinn. Search-Based Software Testing: Past, Present and Future. In Proc. of the 4th IEEE ICSTW, pages 153–163, 2011.

[Mohd Razali and Yap, 2011] Nornadiah Mohd Razali and Bee Yap. Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests. J. Stat. Model. Analytics, 2:21–33, 2011.

[Namin and Andrews, 2009] Akbar Siami Namin and James H. Andrews. The Influence of Size and Coverage on Test Suite Effectiveness. In Proc. of the 18th ISSTA, pages 57–68. ACM, 2009.

[Paterson et al., 2019] David Paterson, José Campos, Rui Abreu, Gregory M. Kapfhammer, Gordon Fraser, and Phil McMinn. An Empirical Study on the Use of Defect Prediction for Test Case Prioritization. In Proc. of the 12th IEEE ICST, pages 346–357, 2019.

[Pearson et al., 2017] Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D. Ernst, Deric Pang, and Benjamin Keller. Evaluating and Improving Fault Localization. In Proc. of the 39th ICSE, pages 609–620. IEEE Press, 2017.

[Perez et al., 2017a] Alexandre Perez, Rui Abreu, and Marcelo d'Amorim. Prevalence of Single-Fault Fixes and Its Impact on Fault Localization. In Proc. of the 10th IEEE ICST, pages 12–22, 2017.

[Perez et al., 2017b] Alexandre Perez, Rui Abreu, and Arie van Deursen. A Test-Suite Diagnosability Metric for Spectrum-Based Fault Localization Approaches. In Proc. of the 39th ICSE, pages 654–664. IEEE Press, 2017.

[Shamshiri et al., 2015] Sina Shamshiri, René Just, José Miguel Rojas, Gordon Fraser, Phil McMinn, and Andrea Arcuri. Do Automatically Generated Unit Tests Find Real Faults? An Empirical Study of Effectiveness and Challenges. In Proc. of the 30th IEEE/ACM ASE, pages 201–211, 2015.

[Staats et al., 2012] Matt Staats, Gregory Gay, Michael Whalen, and Mats Heimdahl. On the Danger of Coverage Directed Test Case Generation. In FASE, pages 409–424. Springer Berlin Heidelberg, 2012.

[Wong et al., 2016] W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. A Survey on Software Fault Localization. IEEE TSE, 42(8):707–740, 2016.