DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
Introducing automatic software fault localization in a continuous integration environment
JOHANNES WIRKKALA WESTLUND
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
MASTER'S IN COMPUTER SCIENCE
DATE: JANUARY 28, 2020
SUPERVISOR: MATHIAS EKSTEDT
EXAMINER: ROBERT LAGERSTRÖM
SWEDISH TITLE: INTRODUKTION AV AUTOMATISK MJUKVARUFELSÖKNING I EN MILJÖ MED KONTINUERLIG INTEGRATION
Abstract
In this thesis we investigate the usefulness of neural networks to infer the relationship between co-changes of software modules and test verdicts with the goal of localizing software faults. Data for this purpose was collected from a continuous integration (CI) environment at the
telecommunications company Ericsson. The data consisted of test verdicts together with information about which software modules had been changed in a software product since the last successful test execution. Using this data, different types of neural network models were trained. Following training, an evaluation of the neural networks' fault localization and defect prediction capabilities was made. For comparison we also included two statistical approaches for fault localization, known as Tarantula and Ochiai, in the evaluation. There have been studies similar to this thesis in the domain of software fault localization. However, the previous studies all work on the code level. The contribution of this thesis is to examine whether the models used in the previous studies perform well when given a different kind of input, namely co-change information of the software modules making up the software product. One major obstacle with the thesis was that we only had data for the problem of software defect prediction. Because of this we had to evaluate the performance of our software fault localization models on the problem of predicting defects. The results were that all networks performed poorly when predicting defects. The models achieve an accuracy of around 50% and an AUC score of around 0.5. Interestingly, F-score values can reach as high as 0.632 for some models. However, this is most likely a result of properties of the data set used rather than the models learning a relationship between input and output. Comparing the models to Tarantula and Ochiai suggests that neural networks might be too complicated for our situation. Overall, the results suggest that the neural network approaches we have for fault localization do not work in our setting.
Sammanfattning
In this work we investigate whether neural networks can learn the relationship between code module changes and test case results in order to facilitate fault localization. Data for the study was collected from a continuous integration (CI) environment at the telecommunications company Ericsson. The data consisted of test execution results together with information about which code modules had been changed since the last successful test run of a software product. The data was used to train different types of neural networks. After training, the neural networks' ability to localize faults and predict defects was evaluated. For comparison, two statistical methods for fault localization were included: Tarantula and Ochiai. There is similar research in the area of software fault localization; the difference is that earlier works study this problem at the code level. The contribution of this work is to examine whether similar results can be obtained when the models from earlier studies are given different input data, in the form of information about the code modules that the software consists of. One obstacle in the work was that we only have data for the research problem of predicting software defects. We therefore had to evaluate our fault localization models on the problem of predicting software defects. The end result was that the machine learning models generally performed poorly when trying to predict defects. All models achieve an accuracy of 50% or lower and an AUC score of around 0.5. Interestingly, some models can reach F-scores as high as 0.632, but this is most likely due to properties of the data set we use rather than the models having learnt a relationship between input and output. When we compare the models with Tarantula and Ochiai, we note that neural networks can be considered too complex for our situation. Overall, the results indicate that neural network methods for software fault localization do not work in our situation.
Acknowledgement
I would like to thank my supervisor at KTH, the Royal Institute of Technology, Mathias Ekstedt, for his feedback and insightful comments during the thesis development. I would also like to thank Ericsson AB for their generosity in
allowing me to conduct my thesis at their office in Kista, Stockholm, Sweden. Special thanks to Tomas Borg, Ziver Koc and Conny Wickström for their supervision of my thesis work at Ericsson. You provided me with helpful feedback and insight during the execution of the thesis work. I would also like to thank Robert Lagerström for his role as examiner of my thesis.
Table of Contents
1 Introduction
1.1 Thesis problem and objectives
1.2 Data restrictions in the environment
1.3 Contributions to the research field
1.4 Choice of training data
1.5 Thesis's approach compared to previous studies
1.6 Research questions
1.7 Thesis's restrictions
1.8 Thesis structure
2 Background
2.1 Continuous integration
2.2 Continuous integration at large companies
2.3 Machine learning
2.31 Deep neural networks
2.32 Activation functions
2.33 Theory on training and testing neural networks
2.34 Convolutional neural networks
2.35 Recurrent neural networks
2.4 Software defect prediction
2.41 Accuracy
2.42 F-score
2.43 Area under the receiver operator characteristic curve
2.5 Software fault localization
2.51 Spectrum-based fault localization
2.52 Fault localization based on neural networks
3 Related work
3.1 Software defect prediction
3.2 Software fault localization
3.3 Previous work on co-change history
4 Method
4.1 Data collection
4.11 Issues with the data gathering
4.2 Data preprocessing
4.3 Training the machine learning models
4.31 Splitting the data set
4.32 The deep neural network
4.33 The convolutional neural network
4.34 The recurrent neural network
4.35 Parameters used during training
4.4 Evaluating the models on the software defect prediction problem
4.5 Comparing neural network models against spectrum-based approaches
4.6 Hardware and software configurations
5 Results
5.1 Resulting data
5.2 Resulting performance
5.21 Deep neural network performance
5.22 Convolutional network performance
5.23 LSTM network performance
5.3 Comparison with Spectrum-based approaches
6 Discussion
6.1 Discussion of software defect prediction results
6.2 Co-change as an input metric
6.3 Discussion of software fault localization results
6.4 Fault localization of neural networks
6.5 Social, ethical and economic impacts of the research
6.6 Relevance of results
6.7 Validity
6.71 Internal validity
6.72 External validity
6.8 Conclusion
6.9 Future work
7 Bibliography
8 Appendix
A Deep neural network results
B Convolutional network results
C Long Short-Term Memory (LSTM) network results
List of figures
Figure 1 The structure of a software product.
Figure 2 Flow inside the CI engine at Ericsson for how new software modules are integrated and tested in the software product.
Figure 3 Illustration of how we go from version history in the CI environment to the data fed to the machine learning models.
Figure 4 A general depiction of a CI pipeline consisting of one or more developers, a code repository, CI server and a public repository.
Figure 5 A network of CI pipelines, illustrating the situation at Ericsson and how an error can propagate through the system. Red boxes indicate what is defective in the network.
Figure 6 A traditional artificial/deep neural network.
Figure 7 A 1-D convolutional filter that aggregates two input features into a single feature.
Figure 8 An overview of a LSTM memory cell with a recurrent neuron (box labeled internal) and its respective gates. The memory cell is used as a layer in a recurrent neural network.
Figure 9 An overview of the data gathering process.
Figure 10 An example of how the change module lists were preprocessed. Note that all software versions #1-3 are together given as input to the Python script.
Figure 11 An illustration of the flow between the test execution API and the preprocessing script. Note that for a single software product version there might be multiple entries of test verdicts in the API which need to be merged.
Figure 12 An illustration of the structure of our deep neural networks. Note that the number of neurons in the hidden layers vary as well as the number of hidden layers.
Figure 13 An illustration of the convolutional neural network that we used.
Figure 14 A general illustration of the recurrent neural network we built for this study.
Figure 15 Performance of deep neural networks where each hidden layer consists of 4 neurons.
Figure 16 Performance of deep neural networks where each hidden layer consists of 20 neurons.
Figure 17 Performance of deep neural networks where each hidden layer consists of 1024 neurons.
Figure 18 Performance of CNN for 10 training sessions.
Figure 19 Performance of LSTM network for 10 training sessions.
Figure 20 Suspiciousness distribution of the 69 software modules as calculated by Tarantula.
Figure 21 Suspiciousness distribution of the 69 software modules as calculated by Ochiai.
Figure 22 Suspiciousness distribution of the 69 software modules as calculated by our CNN.
Figure 23 Suspiciousness distribution of the 69 software modules as calculated by our LSTM model.
Figure 24 Suspiciousness distribution of the 69 software modules as calculated by our deep neural network with 1 hidden layer and 4 neurons in each hidden layer.
Figure 25 Suspiciousness distribution of the 69 software modules as calculated by our deep neural network with 5 hidden layers and 20 neurons in each hidden layer.
Figure 26 Suspiciousness distribution of the 69 software modules as calculated by our deep neural network with 5 hidden layers and 1024 neurons in each hidden layer.
1 Introduction Continuous integration (CI) of software changes is a widely used development practice in today's software industry. As the idea of agile software development has become the norm in industry, continuous integration has been adopted to meet the demand for faster turnaround of software releases [1] [2]. Generally, continuous integration is defined as the development practice of automatically building and verifying software when a code change (also known as a code commit) to the software is made. Although there is a debate on continuous integration's exact meaning [3], we have chosen this definition based on previous research [4] [5] [6] [7]. The goal of CI is to allow developers to integrate changes to the software often and thus improve the development pace of the software project. However, various issues can arise when adopting CI, such as long response times for developers to get feedback on integrated code, keeping track of the integrations, and insufficient tools to cope with the complexity of CI [8]. Because of the severity of these issues a lot of research [4] [5] [6] [7] [9] [10] has gone into solving these problems. However, we would like to argue that there is one problem that has not been explored enough and is potentially a big bottleneck when developing software using CI. Namely, once the CI machinery has found a test that fails, how do you determine what caused that failure? The aim of this thesis is to investigate how supervised machine learning can be applied to solve this traceability issue in a continuous integration environment. Stated differently, can machine learning be useful to automate the localization of faults based on information in the CI environment? 1.1 Thesis problem and objectives The problem we are trying to solve in this thesis is the following: can we automate the localization of software faults in the telecommunications company Ericsson's CI environment? The proposed approach
to solve this problem is to train various types of neural networks on data gathered from Ericsson's continuous integration environment. We use various deep neural networks, a convolutional neural network and a Long Short-Term Memory network to search for the best type of neural network for the situation. The end goal is to find a model that can output the location of faults in a software product that Ericsson is developing. 1.2 Data restrictions in the environment The type of data we can feed the neural networks is restricted by the design of the CI system at Ericsson. In our specific situation, we have a software product which is made up of software modules as depicted in Figure 1. The product is developed by making new versions of the software modules and then swapping them with the older versions of the same software modules inside the software product, thus creating a new version of the software product. After the swap has been made the software product needs to be tested in
order to determine that no functionality has been lost. If all tests pass, we keep the new version of the software product and further revisions can be made on this version. If a test fails, we mark the new software modules that we swapped in as being defective and revert to using the older version of the software product for further development, see Figure 2. Because of this setup in the CI environment, the data that we can extract is a timeline of the software product's development from older to newer versions and what software modules have been changed between these versions. We can also extract the test verdicts of the tests that have been executed on each version of the software product. However, we cannot extract information such as what code instructions have been inserted or deleted with each new version of the software modules. Figure 1 The structure of a software product. Figure 2 Flow inside the CI engine at Ericsson for how new software modules are integrated and
tested in the software product. 1.3 Contributions to the research field There have been previous studies in which neural networks have been applied to localize software faults [11] [12] [13]. This supports our idea that machine learning is applicable to our situation and specific problem. What differentiates this study from the previous ones is the type of data that is fed to the machine learning models. Thus, our main contribution to the research area of fault localization is to investigate the usefulness of previously proposed neural network models within the field on a new type of data. The goal in this study is the same as in the previous studies. However, the intended data used to get there is vastly different. Because of this our neural networks will not be the same as in the previous studies. Nevertheless, we try to make our networks as similar to those of the previous studies as possible. The theories and ideas of this study will remain relatively the same as in the previous works and
we will be introducing these in more depth in chapter 2. 1.4 Choice of training data The type of data we decided to train our neural networks on stems from the field of software defect prediction and is referred to as the co-change history, change coupling history or evolutionary coupling in the literature [14] [15] [16]. Formally stated, co-change is the relationship between a software product's components that are frequently changed at the same time during the development of the product [14]. In our scenario the co-change history should be defined as follows. For a version of the software product, we can obtain from the CI environment which software modules have been swapped since the previous version. This is the co-change for that version. The co-change history is then a list of the co-changes for each software version in the development history of the software product. This together with the test verdicts for the different versions of the software product is used to train the
neural networks, see Figure 3. This was the most similar measurement to the test coverage information (explained in section 2.52) used in the previous studies on software fault localization that we could extract given the restrictions in the CI environment. Figure 3 Illustration of how we go from version history in the CI environment to the data fed to the machine learning models. 1.5 Thesis's approach compared to previous studies A major obstacle for this thesis was how we would evaluate the neural network models after training. In previous studies in software fault localization it is standard to evaluate the performance of the model by giving it data where the exact locations of the faults are known. Thus, you can evaluate the performance of your fault localization by checking that the model outputs the correct locations for the faults. We do not have this information in the CI environment. We can train the neural networks on the data described in the previous section. However,
once trained we cannot evaluate the models' performance on localizing faults because we do not know where the actual faults are located. This forced us to consider another problem very similar to software fault localization known as software defect prediction¹. These two problems are very similar to each other and, intuitively, if you can solve one of them you can also solve the other. In other words, if you can predict whether software is defective with absolute certainty, you can localize the faults responsible for the defect by simply testing each change introduced to the software since the last time it was stable. Similarly, if you can localize all faults with absolute certainty, then the software is considered non-defective if you cannot localize any faults; otherwise, it is defective. Based on this, we decided to evaluate our fault localization models on the software defect prediction problem when comparing the models against each other. For this we only need data on what software
product versions are faulty, data which we already use during the training of the models. Again, intuitively, the model that is best at predicting defectiveness should be the best model at localizing faults in the software. This means that the thesis is also investigating the suitability of co-change as an input feature for software defect prediction, something that has been studied in [14], which reported a correlation between change coupling and software defects. Of importance is that previous studies in software fault localization (that utilize neural networks) make a point of comparing their neural network approaches to other non-machine learning methods for fault localization. These are called spectrum-based methods, and because of their popularity in the literature we decided to make a similar comparison between these methods and our neural networks, as in the previous studies. 1.6 Research questions Based on the previous sections, we can summarize and itemize the questions we are
trying to answer into two research questions.
RQ1: Is co-change history of software modules on its own useful as input for predicting defects?
RQ2: Can neural networks utilizing co-change history be used for fault localization?
¹ To the best of our knowledge, the terms "fault" and "defect" refer to the same thing. However, depending on which research discipline you are studying, you either use only "fault" or only "defect". Throughout this thesis we have chosen to use both terms, reflecting which research domain we are currently talking about.
Note that the first research question stems from the fact that we must evaluate our models on the software defect prediction problem, as discussed in the previous section. Thus, it is not really related to our problem presented in section 1.1 but is a byproduct of our forced approach. Instead, the second research question is the primary focus of this thesis. 1.7 Thesis's restrictions Due to the restrictions of the CI environment
used at Ericsson, as well as other issues such as time, the thesis has some constraints. Some of them have already been discussed in the previous sections of this chapter. However, we summarize them all below, followed by a more thorough motivation of each one.
• Localization of defects is on a software module-level granularity.
• Only use data regarding the failure or passing of tests for different versions of the software product together with what software modules have been updated.
• Test three different types of neural network: a recurrent network, a convolutional network and deep networks.
• Test two different spectrum-based methods: Tarantula and Ochiai.
Due to the nature of how software is built at Ericsson, and the constraints of their CI environment, there is no interest at Ericsson for finer granularity than localizing which software modules contain the faults in a software product. A software product can fail in the CI environment for many different reasons, for
example a bad environment configuration. This does not always have anything to do with the software product itself and makes it meaningless to predict which software module is causing the failure. This is the reason for constraining the data to whether tests fail or pass: if the verification of the software product fails for other reasons, it has nothing to do with the product itself. The type of tests we are gathering are integration tests that check that the new software modules work together as intended in the software product. Due to time constraints, we could only test three different types of neural networks. It is not clear whether a different type of neural network would intuitively perform better than the ones we decided to try. Our motivation for picking networks was based on what type of networks had been used in previous studies on software fault localization and software defect prediction. In order to make a qualitative analysis of our machine learning
models' performance on localizing faults, it seemed important to compare them to other non-machine learning approaches within the research field of software fault localization. This has been done in previous work where machine learning models have been applied, such as [12]. There are, of course, many different non-machine learning approaches that one could try. However, due to time constraints we limit ourselves to only two spectrum-based approaches for comparison. 1.8 Thesis structure The rest of this thesis report is structured in the following way. Chapter 2 is the Background chapter, where we present the theoretical knowledge needed to understand the problem of fault localization and defect prediction. Following that is chapter 3, Related work, where we present the research already done in the domains of software defect prediction and software fault localization. Observe that it is from the previous research we have obtained much of our theoretical knowledge and thus there might
seem to be some overlap between chapters 2 and 3. A rule of thumb is that in chapter 2 we present the theoretical knowledge acquired and used by the previous studies, whereas in chapter 3 we focus on the experiments conducted in the studies and their results. Following Related work, we have chapter 4, Method, where we present the machine learning models that we build and details about the experiments conducted on these models. We also include details on the data gathering and data preprocessing. Our intention was to be as detailed as possible to allow for reproducibility of our experiments for comparison as well as motivating our choice of approach. Chapter 5, Results, will then present our findings from the experiments. Finally, we have chapter 6, Discussion, where we discuss the results from chapter 5 in a more general context and compare them to the results of the previous studies presented in chapter 3. We also try to answer our two research questions presented in section 1.6.
2 Background This chapter gives an overview of the theoretical knowledge within the field of continuous integration as well as machine learning within the domains of software defect prediction and software fault localization, respectively. First, the general concept of continuous integration is presented. Second, theory on the three different types of neural networks we have decided to use is described. Following that, we present the research field known as software defect prediction and, more specifically, describe the evaluation metrics used within the research field to evaluate neural networks. Last, we present the research field known as software fault localization. 2.1 Continuous integration Continuous integration (CI) is the development practice of integrating software changes continuously, thus getting a rapidly changing software which can be released at any given moment [3] [4] [5] [6] [7]. CI is usually realized by having a dedicated server (known as a CI server) get the code when a
developer has submitted their changes to the code base. It then compiles the code into a software product and runs a set of tests on the product to ensure that it behaves as intended. This stepwise process (get code, compile, run tests, etc.) is usually referred to as a CI pipeline. Should the building of the software product or one of the tests fail, the server can send that information back to the developer and revert the changes. In the case that all tests pass, the newly compiled software is considered stable and new code changes can be added, and the process is repeated. This allows the CI system to always have a version of the software product that works, as seen in Figure 4. The reason continuous integration is so popular in the software industry is because it allows for fast development of working software. This is in comparison with the traditional approach to software development, where each developer makes a lot of changes locally, spends a few months combining everyone's
changes, ensuring that it all works together. With CI you have one version of the software product that everyone is working on, thus minimizing the overhead of merging people's code together. 2.2 Continuous integration at large companies The setup depicted in Figure 4 works well in small projects, but if you scale up the project you begin to run into problems. To illustrate one of these issues we present an imaginary scenario from Ståhl's and Mårtensson's book Continuous Practices [3] on page 130: "Assume a large software system which takes ten minutes to fully compile, package and test any source code change to the utmost confidence (needless to say ten minutes is rather an optimistic figure, but stay with us for the sake of argument). This means that during a typical working day of about ten hours, 60 new release candidates can be sequentially created and evaluated." Now, if you have a company like Ericsson with thousands of developers this simply will not hold. New commits
will fill up the pipeline and those that are not yet evaluated will clog the CI machinery, resulting in developers having to wait days until they know whether their commit was successfully integrated or not [3]. One way to solve this issue is to introduce batching, where you allow developers to commit their code without integrating it into the product for each commit. Then at specific times the CI machinery gets the latest commit and starts the integration, effectively testing many commits (i.e. a batch) at the same time [3]. This partially solves the congestion problem of the CI machinery but introduces a traceability issue. When an integration now fails it is not clear which commit in the batch introduced the fault. This might still not seem like too big of an issue depending on the size of a batch. However, at a large company like Ericsson the pipeline of a software product will inherently have a large batch. Say, for example, that the software product produced by a CI pipeline is
the input to another pipeline, merging software products together to build a bigger software product which again is fed into yet another pipeline etc. It becomes apparent that the batch size grows for each CI pipeline and that at an early stage we are looking at batches consisting of millions of lines of code changes. This illustrates the importance of being able to localize a fault and the situation can be seen in Figure 5. Figure 4 A general depiction of a CI pipeline consisting of one or more developers, a code repository, CI server and a public repository. Figure 5 A network of CI pipelines, illustrating the situation at Ericsson and how an error can propagate through the system. Red boxes indicating what is defective in the network. 2.3 Machine learning As the problem is to find the relationship between co-change history of modules and test verdicts, a natural conclusion was that machine learning could potentially solve the problem. Machine learning is a field within
computer science interested in the problem of making computers learn from experience [17]. Of interest to our study are the artificial neural network models known as deep neural networks, convolutional neural networks and recurrent neural networks. All of these models are widely popular machine learning models used for solving many different types of research problems [11] [12] [13] [18] [19] [20]. 2.31 Deep neural networks Artificial neural networks are simple imitations of the brain process in biological creatures [21]. The networks consist of a set of artificial neurons, used to simulate single biological neurons, connected in such a way as to form neuron layers as seen in Figure 6. Each neuron in the network works like a computational unit. First it calculates a weighted sum of the input values (e.g. numbers) from the previous layer's neurons in the network. The neuron then adds a bias (fixed value) to the sum and optionally applies a non-linear function to the result to
determine what value it should output to the neurons in the next layer of the network [21]. There are various non-linear functions that you can use, but for this thesis it is enough to know about the ReLU (rectified linear unit) activation function and the sigmoid activation function, which are explained in section 2.32. Generally, you can divide an artificial neural network into three parts as seen in Figure 6. The first layer is the input layer, where outside data (hence referred to as input) is passed to the network. Each neuron in the input layer is mapped to one feature value in the input data. After that follows zero or more hidden layers which are used to process the input data into the output of the system. The output layer (or the final layer) is where you receive the output of the neural network [21]. Depending on how many hidden layers you have in your neural network it is either called an artificial neural network (usually zero or one hidden layer) or a deep neural network (more
than one hidden layer). However, for simplicity we will from here on call all the networks of the form depicted in Figure 6 deep neural networks. Figure 6 A traditional artificial/deep neural network. 2.32 Activation functions In this section we present the rectified linear unit (ReLU) activation function and the sigmoid activation function, which are used after certain layers in our neural networks. The ReLU activation function is a non-linear function mapping all values bigger than or equal to 0 to themselves and all negative values to 0. This ensures that the output is a non-negative number and can be expressed as shown in Equation 1, which was obtained from [12].

ReLU(x) = max(0, x)   (1)

The sigmoid activation function transforms the input value into a value on the interval [0, 1]. The closer the input is to positive infinity the closer the output is to 1, and the closer it is to negative infinity the closer the output is to 0. The formula used for the transformation can be seen in Equation 2, which was obtained from [12].

sigmoid(x) = 1 / (1 + e^(-x))   (2)

2.33 Theory on training and testing neural networks In theory, you can interpret a neural network as a non-linear function f that maps input data x into the desired output y. The way the network learns the correct behavior of f is through experience. Using samples of (x, y) pairs, called the training set, the network uses a cost function to measure how far its output f(x) is from the desired output y. Then, using gradient descent, the parameters of the network (i.e. the fixed values in the network used to transform input to output) are updated accordingly in order to get the desired output y given an input of x [21]. The process of determining the output f(x) is called a forward pass through the network, and the process of updating the weights and biases of the network given the cost function is done using the backpropagation algorithm [6].
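As a small illustration of Equations 1 and 2 and the forward pass just described, the sketch below computes the output of a network with one hidden layer using NumPy; the weight matrices and bias vectors are hypothetical placeholders, not values from this thesis.

```python
import numpy as np

def relu(x):
    # Equation 1: max(0, x), applied element-wise
    return np.maximum(0, x)

def sigmoid(x):
    # Equation 2: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: weighted sum of the inputs plus a bias, then ReLU.
    h = relu(w_hidden @ x + b_hidden)
    # Output layer: weighted sum plus a bias, then sigmoid, giving a value in [0, 1].
    return sigmoid(w_out @ h + b_out)
```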
Overall, the process of learning the behavior of f using samples of (x, y) pairs is referred to as training the network. Understand that the fixed values in the network that are updated during training are randomly initialized at the beginning. This can potentially lead to variations in the final network's performance between training sessions. After a network has been trained it is important to test whether the network behaves as expected, and thus it is common to have a testing phase after training. During the testing phase you supply the network with a new set of (x, y) pairs that has not been used during training (referred to as the testing set) and compare the output f(x) of the network with the desired output y in order to determine whether the relationship was successfully learnt or not. More details on how to make this comparison can be found in section 2.4. 2.331 Batching and epochs It is usually too computationally difficult to train the network on all (x, y) pairs at the same time. Because of this you have to split the samples into smaller groups (called batches) and train your network on one group at a time. The size of a group is referred to as the batch size. Once you have gone through all the batches, in other words used all samples of (x, y), you have completed what is known as one epoch of training. 2.332 Optimization algorithm During the updating phase, where parameters of the network are changed using the cost function and the backpropagation algorithm, it is important to determine how much to change the network. This is determined by using what is called an optimization algorithm. There are many kinds of optimization algorithms, but for this thesis it is enough to know that the Adam optimizer [22] is frequently used within the field of software defect prediction [6] [16] [20]. 2.333 The issue of overfitting a network One major issue that can happen during the training of a neural network is called overfitting. What this means is that
the network finds relations between input and output in your training data that do not actually exist in the real relationship you are trying to learn [17]. As an example, if you were to train a network to recognize cars in images and only used pictures of yellow cars in your training data, the network could potentially learn that all cars must be yellow, something that we know is not generally true for cars. The way to mitigate this issue is to use a method called early stopping, in which we take a small part of the training set and remove it from the training. This set is usually called the validation set. The idea with early stopping is that during training of the network we check how it performs on the validation set at certain time steps and record this performance. If we see an increase in performance on the validation set between time steps, we can be somewhat certain that the network is successively becoming better at understanding the relationship between input and output. However, if the performance on the validation set starts to decrease, it suggests that the network is overfitting on the training data. Hence, the rule of early stopping is to stop the training early in the case that the network's performance on the validation set starts to decrease [17].
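To show how the concepts from sections 2.331-2.333 (batches, epochs, the Adam optimizer and early stopping) fit together in practice, here is a minimal Keras-style training sketch; the batch size, epoch count, validation split and patience are illustrative assumptions, not the parameters actually used in this thesis (those are described in section 4.35).

```python
import tensorflow as tf

def train_with_early_stopping(model, x_train, y_train):
    # Adam optimizer and a binary cross-entropy cost function, since
    # defect prediction here is a binary classification problem.
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Early stopping: part of the training data is held out as a validation
    # set and training stops once validation performance stops improving.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=5,
                                                  restore_best_weights=True)

    # batch_size and epochs are illustrative values only.
    return model.fit(x_train, y_train,
                     validation_split=0.1,
                     batch_size=32,
                     epochs=100,
                     callbacks=[early_stop])
```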
2.34 Convolutional neural networks The network depicted in Figure 6 is that of a traditional deep neural network. There are many different variants of this traditional model, where one of the more popular versions is the convolutional neural network (CNN). The difference between convolutional networks and deep neural networks is that convolutional ones allow for two special types of neural layers to be used in the network together with the ordinary neural layers we find in a deep neural network. These special kinds of layers are called convolutional layers and pooling layers [17]. Convolutional layers are used to perform the mathematical convolution operation on the input. In a sense, a convolutional layer
aggregates the input by merging input features together by using what is called filters. A filter is simply a set of weights used in the weighted sum performed when aggregating the input and the number of weights in a filter is referred to as the filter’s size. You can imagine sliding over the input data, calculating a weighted sum over the data points as you are sliding over them [6], see Figure 7. Note that Figure 7 displays the use of one filter. Usually you apply many filters in a convolutional layer and the output of each applied filter is stacked together into the final output of the layer. It is also common to apply an activation function to the output of the convolutional layer. When applying a filter, you must also specify the stride, which is a value defining how the “window” should slide over the data as depicted in Figure 7. To illustrate, if we have a stride value of 1 in Figure 7 it means that we should apply the filter to input feature 1 and input feature 2,
followed by moving down one entry and applying the filter to input feature 2 and input feature 3, etc. However, if the stride was 2, what would happen in Figure 7 is that we would apply the filter to input feature 1 and input feature 2, followed by moving down two entries and applying the filter to input feature 3 and input feature 4, thus skipping certain combinations of input features that are included if you have a stride value of 1. In addition to using convolutional layers it is common to use pooling layers directly after a convolutional layer, which is another way of summarizing a certain region of the input data [6] [17]. Of importance to our study is the max pooling layer, which like the convolutional layer successively slides over a group of input values (using a stride value) and outputs the biggest value for each group [12]. The size of the group is set by the pooling layer's "window" size. Figure 7 A 1-D convolutional filter that aggregates two input features into a single feature.
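The following sketch shows what a small 1-D convolutional network over a co-change vector could look like in Keras, with a convolutional layer (filters, kernel size and stride), a max pooling layer and a sigmoid output; the layer sizes are assumptions chosen for illustration and this is not the network described in section 4.33.

```python
import tensorflow as tf

NUM_MODULES = 69  # dimensionality of the co-change vector

# Illustrative 1-D CNN: convolution, max pooling, then a sigmoid output
# interpreted as the probability that the software version is defective.
cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_MODULES, 1)),
    tf.keras.layers.Conv1D(filters=8, kernel_size=2, strides=1, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```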
2.35 Recurrent neural networks Another popular neural network model is the recurrent neural network (RNN). The difference between this kind of network and the deep neural network is that recurrent networks allow the network to have an internal state. Essentially, a recurrent neural network has at least one layer that keeps an internal state, see Figure 8 for an example of such a layer. This specializes the network to solve sequence classification problems, where it can use its state, which depends on what has previously been fed to the network, for the current classification [17]. We are using a variant of recurrent neural networks called the Long Short-Term Memory (LSTM) structure [23] because of its success in previous work in software defect prediction [16]. An LSTM consists of a memory cell structure as can be seen in Figure 8. The memory cell consists of a self-recurrent neuron (i.e. a neuron that feeds its internal state back into itself), a forget gate, an input gate and an output gate. The forget gate, together with the self-recurrent neuron and the input gate, keeps track of the internal state of the memory cell. They regulate what is to be forgotten over time and what should be remembered from the input. The output gate in conjunction with the internal state is then used to produce an output. This allows the LSTM to detect patterns in arbitrarily long sequences of data [16]. Figure 8 An overview of a LSTM memory cell with a recurrent neuron (box labeled internal) and its respective gates. The memory cell is used as a layer in a recurrent neural network.
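As with the CNN sketch above, the snippet below is only an illustrative Keras model with an LSTM layer over a sequence of co-change vectors; the sequence length and the number of memory cells are assumptions, not the configuration described in section 4.34.

```python
import tensorflow as tf

NUM_MODULES = 69   # size of each co-change vector
SEQ_LEN = 10       # illustrative number of consecutive software versions

# Illustrative LSTM classifier: the internal state carries information from
# earlier versions in the sequence into the classification of the current one.
lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, NUM_MODULES)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```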
2.4 Software defect prediction Software defect prediction is a research discipline where machine learning or statistical approaches are used to predict whether a piece of software is defective or not. As input the predictor takes a set of software features, some examples being the number of lines of code or which developers have changed the software [6] [16], and based on the input it predicts whether
the software is defective or not. The research area has been quite active during recent years, indicating that the problem of defect prediction has not yet been generally solved in a satisfactory way. This idea is further supported by meta-analysis of the field [24], which has concluded that the reliability of defect prediction approaches is questionable. This is because few replication studies are performed in the field [24]. Some approaches within the field that appear to yield interesting predictions are to use either a convolutional neural network [6] [20] or a recurrent neural network [16] to predict whether software is defective, using historical data about the software development and test verdict results as training data for the networks. The general trend within the research of these approaches seems to be focused on finding more input features (or information about the software) to feed into the neural network models, as this tends to increase performance, given enough data. The
field has numerous measurements to evaluate neural network models, some of which we list in the following subsections as we will use them to evaluate our own neural networks. 2.41 Accuracy One of the more simplistic metrics used to evaluate a neural network is how many instances in the test data the network classifies correctly. This is the accuracy of the network with respect to the test data and is defined mathematically as seen in Equation 3.

Accuracy = (number of instances classified correctly) / (total number of instances)   (3)

The range of accuracy is [0, 1], where an accuracy closer to 1 indicates a better neural network model [6]. 2.42 F-score A popular metric to use when doing binary predictions such as in software defect prediction is the F-score. It is the harmonic mean of precision and recall, where precision in software defect prediction is the number of instances correctly classified as defective out of all instances classified as defective, and recall is the number of instances correctly classified as defective out of all instances that are truly defective. The range of F-score is [0, 1], where a value closer to 1 is more desirable as it indicates a better performing neural network [6] [25]. Let us define the notation TP for the number of defective instances classified as defective, FP for the number of non-defective instances classified as defective and FN for the number of defective instances classified as non-defective. Then we can express precision, recall and F-score using Equation 4, Equation 5 and Equation 6.

Precision = TP / (TP + FP)   (4)

Recall = TP / (TP + FN)   (5)

F-score = 2 × (Precision × Recall) / (Precision + Recall)   (6)
2.43 Area under the receiver operator characteristic curve The receiver operator curve has been used to investigate the trade-off between the hit rate and false alarm rate in the signal processing domain [26]. These days it is also used to evaluate machine learning algorithms [6] and can be interpreted in our setting as representing the probability that a randomly chosen defective instance of the software is more likely to be considered defective than a randomly chosen non-defective instance of the software [26]. The area under the receiver operator curve (AUC) is in the range [0, 1], where a value of 0.5 means that the neural network performs as if randomly guessing whether the software is defective or not. A value higher than 0.5 means the classifier performs better than random and a value lower than 0.5 means that the classifier performs worse than random [6].
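The three metrics above can be computed directly with scikit-learn; the sketch below assumes the model outputs a probability in [0, 1] per software version and uses an illustrative threshold of 0.5 for the binary verdict.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_defect_predictions(y_true, y_prob, threshold=0.5):
    # y_true: 1 = defective, 0 = non-defective; y_prob: model outputs in [0, 1].
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # Equation 3
        "f_score": f1_score(y_true, y_pred),         # Equations 4-6
        "auc": roc_auc_score(y_true, y_prob),        # section 2.43
    }
```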
2.5 Software fault localization Software fault localization is the research area of determining which program elements need to be fixed for a failing test case to stop failing [27]. The definition of what a program element is depends on the scope and can range from single instructions (fine granularity) to packages (rough granularity). It is widely recognized that determining the location of a fault in the code is one of the more time consuming and demanding tasks of software development [27]. 2.51 Spectrum-based fault localization The most popular approach in the field of fault localization (as of 2018) was spectrum-based fault localization, which is a dynamic ranking of which program elements are most likely to contain the defect based on test case coverage information [27]. To summarize the approach, each program element has a tuple of four values associated with it: (e_f, e_p, n_f, n_p). e_f is the number of times the program element is executed and a test case fails. e_p is the number of times the program element is executed and a test case passes. Similarly, n_f is the number of times the program element is not executed and a test case fails. And n_p is the number of times the element is not executed and a test case passes [28]. Based on these values for each program element you can calculate a suspiciousness value. In this thesis work we have decided to use two different suspiciousness calculations: Tarantula and Ochiai [29] [28]. The calculations for Tarantula and Ochiai can be seen in Equation 7 and Equation 8 and were obtained from [28].

Tarantula = (e_f / (e_f + n_f)) / ((e_f / (e_f + n_f)) + (e_p / (e_p + n_p)))   (7)

Ochiai = e_f / sqrt((e_f + n_f) × (e_f + e_p))   (8)
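Equations 7 and 8 translate directly into code; the sketch below is a plain transcription of the two formulas, with guards against zero denominators added as an implementation choice that is not specified in [28].

```python
import math

def tarantula(e_f, e_p, n_f, n_p):
    # Equation 7: (e_f/(e_f+n_f)) / ((e_f/(e_f+n_f)) + (e_p/(e_p+n_p)))
    fail_ratio = e_f / (e_f + n_f) if (e_f + n_f) > 0 else 0.0
    pass_ratio = e_p / (e_p + n_p) if (e_p + n_p) > 0 else 0.0
    denominator = fail_ratio + pass_ratio
    return fail_ratio / denominator if denominator > 0 else 0.0

def ochiai(e_f, e_p, n_f):
    # Equation 8: e_f / sqrt((e_f + n_f) * (e_f + e_p))
    denominator = math.sqrt((e_f + n_f) * (e_f + e_p))
    return e_f / denominator if denominator > 0 else 0.0
```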
2.52 Fault localization based on neural networks In software fault localization, the goal of the machine learning algorithms is to learn to deduce the location of a fault based on input data about the software and test verdicts. To our understanding this approach is not as popular as the spectrum-based methods that exist [28], although new results within the field suggest that machine learning approaches are generally better than the spectrum-based techniques at finding the faults [12]. In general, the structure of the machine learning approach is as follows. The input to the neural network is test coverage information of code instructions. The idea is that you map the code to an input vector containing 0s and 1s, where each element in the vector represents whether a code instruction was executed during a test or not. From this you get a code coverage vector that you feed into your neural network, which returns as output a decimal value in the range [0, 1], where 0 means that the test passes and 1 means that the test fails. Now, given pairs of input vectors and output values representing executions of different tests on the program, the neural network can be trained [11] [12] [13]. After training, artificial input vectors are fed to the network where only one element in the vector is set to 1 and the rest are 0s. This represents that only one code instruction is executed. Feeding these artificial input vectors to the network, you get a suspiciousness score as output for each program element, which you can use to determine where the fault most likely is located [11] [12] [13].
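A minimal sketch of this querying step is shown below, assuming a trained feed-forward Keras-style model whose input is one coverage (or, in our setting, co-change) vector per sample; the helper name and the ranking step are illustrative.

```python
import numpy as np

def suspiciousness_ranking(model, num_elements):
    # One artificial input per program element: only that element is set to 1.
    virtual_inputs = np.eye(num_elements)
    # The network's output in [0, 1] is interpreted as a suspiciousness score.
    scores = model.predict(virtual_inputs).ravel()
    # Rank program elements from most to least suspicious.
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
```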
3 Related work In this chapter we present previous research within the fields of software defect prediction and software fault localization. It is from these studies that we have obtained most of our theoretical knowledge presented in chapter 2. In addition, we review the work done on co-change history as a feature to describe defects in software. 3.1 Software defect prediction One of the more recent studies in software defect prediction is the work of Wen et al. [16]. They use the recurrent neural network model Long Short-Term Memory (LSTM) to feed in sequences of changes to a software product for the model to learn how to predict defects. Their result is that the approach achieves an average F-score of 0.657 and an average
area under receiver operator curve (AUC) score of 0.892 over 10 different software projects. This is much better than the more traditional approaches to defect prediction. Furthermore, they conclude that the approach is better than the state-of-the-art technique of using deep learning to learn features from the static code [16]. The work is similar to ours because they view their input data as a time sequence, which we will also do in this thesis. This seems to be a rather new way to represent the data for defect prediction, as the work of Wen et al. [16] is the only work that we have found that considers the data in this way. Nevertheless, as the research area of software defect prediction is popular, there are many studies of interest to our thesis that should also be mentioned. We have a previous Master's Thesis work from 2018 by Sundström [6], who tried to apply the approach of extracting semantic code features via a convolutional neural network for defect prediction in an industry
setting. Most notable from the study was that poor performance was observed when compared to the work of Jian et al. [20], on which the approach was based. The best results observed were a maximum average F-score of 0.315 and an AUC score of 0.650 [6]. Furthermore, Sundström concluded that the trained model did not seem to generalize well and that this, together with the poor performance overall, might be because of insufficient data. In comparison, the original study by Jian et al. [20] that introduced the approach used by Sundström [6] achieved an average F-score of 0.608 over 6 different projects. Most notable about their approach is that they use a convolutional neural network (CNN) to extract semantic features from static code. These features are then combined with more traditional code metrics and fed to a logistic regression classifier which does the defect prediction [20]. This is different from the work of Wen et al. [16] and our own work since we strive to use neural
network models throughout the entire process and not only for feature extraction. 3.2 Software fault localization As for relevant studies within the research field of fault localization, we have the work of Wong and Qi [11], Zhang et al. [12] and Zheng et al. [13]. They all base their approach on utilizing different kinds of neural networks for learning which code statements are most likely to be responsible for test failures, as described in section 2.52. The difference between the studies is that [11] and [13] used a deep neural network and [12] went even further by introducing a convolutional neural network model. These studies have the same goal as this thesis, with the major difference being that the coverage metric in this thesis is less fine-grained, as we consider software modules rather than single code instructions. Of importance in the research field of fault localization is that applying machine learning does not seem to be as popular as we initially thought it would be when
considering the review studies of Wong et al. [28] and Zakari et al. [30]. In essence, the works of [11], [12] and [13] appear to be the major papers on the topic of utilizing machine learning for fault localization that have been published, although there are other works as well, such as Briand et al. [31] who used the C4.5 decision tree algorithm for suspiciousness ranking. Instead, the most popular approach to fault localization is spectrum-based techniques, which also use coverage metrics to rank code instructions similarly to the machine learning approach. The key difference is that spectrum-based techniques use statistical and probabilistic mathematical models to describe the relationship between the coverage statistics and fault proneness [28]. Some of the more popular mathematical models are the Tarantula model [28] and the Ochiai model [32]. 3.3 Previous work on co-change history Previous work by Kirbas et al. [15] studied how evolutionary coupling (similar to co-change) relates to
software defects. They mention that the literature is rather divided on whether evolutionary coupling is a useful metric for predicting defects. On one side we have the work of D'Ambros et al. [14], which concluded that the correlation between change coupling and defects is stronger than for more traditional code complexity metrics like the size of the code. The work of Kouroshfar [33] builds on this and concludes that change coupling that takes into consideration changes between subsystems appears to have a stronger correlation to defects than change coupling within a subsystem. At the same time, Kirbas et al. [15] point out that studies such as Graves et al. [34] and their own have found poor correlation between evolutionary coupling and software defects for some cross-subsystem couplings. 4 Method In this chapter we describe how the experimental setup is constructed. The first section presents how the data was gathered, preprocessed and structured. Following this is a description of the
machine learning models that were tested. Finally, an overview of the hardware and software used in this thesis is presented. 4.1 Data collection The first step to solving our research questions was to investigate what data could be gathered at Ericsson to train our machine learning algorithms. But first, we decided on the software product to focus on for our data gathering. Ericsson has implemented a very large CI machinery for the development of its products. This means that there was a lot of meta information that could potentially be gathered. There were also various APIs developed to expose the information for querying. It was decided that we would use an in-house developed REST API that exposes the test execution results of new software versions of the decided product. The API did, however, only contain information regarding the testing process of new product versions and nothing about which software modules had been changed between versions. Because of this we had to use software versioning data gathered from the test execution API to query another API exposing the artifact repositories for the software product. From the artifact API we could then obtain what software modules (which are the building blocks of a software product in the CI system) had been changed when compiling new products. The result of using these two APIs is that we can get a list of what software modules have been changed when a new software product version is created and what the test execution verdicts were when testing the new version, see Figure 9. Figure 9 An overview of the data gathering process. The collection of data was automated by writing a short Python script that used an initial software product ID supplied to the script. Using the product ID as a starting point, the script queried the two in-house APIs for data regarding the test verdicts of different versions of the software product together with, for each version, a list of software modules that had been changed since the previous version. To be more specific about the type of data that was saved, for each version of the software product that we could find in the testing API we saved the software product's name, its version, its previous version, what test suites it passed, what test suites it failed, the timestamp for when the testing began, what testing environment was used and what software modules had been changed since the previous version.
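For illustration, one saved data point could look like the hypothetical record below; the field names and values are made up and do not reflect the actual schema of the in-house APIs.

```python
# Hypothetical example of one saved data point (illustrative names and values).
record = {
    "product_name": "product-x",
    "version": "R42B",
    "previous_version": "R42A",
    "passed_test_suites": ["suite-1", "suite-3"],
    "failed_test_suites": ["suite-2"],
    "test_start_timestamp": "2019-06-01T08:15:00Z",
    "test_environment": "env-7",
    "changed_modules": ["module-a", "module-f"],
}
```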
4.11 Issues with the data gathering While doing the data gathering, we found that the CI system at Ericsson deletes data after some time. Because of this there was a server limit on the amount of data that could be collected by the script at any point in time. To mitigate this problem, we decided to run the Python script once a week, creating batches of gathered data files for each week. We then created a Python script to merge these data files together, also filtering the data so that no duplicate entries were present. By doing this we could amass data over a longer time period than was possible by only relying on the APIs at Ericsson. In total we amassed 15 months' worth of data. 4.2 Data preprocessing Having obtained the raw data that we wanted, the first step of the preprocessing consisted of mapping the list of changed software modules for each data point into a fixed-size integer vector consisting only of 0s and 1s. We wrote a Python script that takes the entire raw data set and preprocesses it. The script enumerated all software module lists to determine the number of unique software modules present in the data. At the same time, the script gave each distinct software module an index in an integer vector using the dictionary data structure. From there, the script enumerated the module lists again and used the dictionary to transform each list into an integer vector of fixed size containing only 0s and 1s, see Figure 10. Thus, we end up with a vector space of dimension equal to the number of unique software modules that exist in the data set and a set of points in this vector space where each point represents a version of the software product. Figure 10 An example of how the change module lists were preprocessed. Note that all software versions #1-3 are together given as input to the Python script.
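A minimal sketch of the two-pass vectorization just described (build a module-to-index dictionary, then turn each change list into a 0/1 vector) could look as follows; it is a simplified stand-in for the actual preprocessing script.

```python
def vectorize_change_lists(change_lists):
    # change_lists: for each software product version, the list of modules
    # changed since the previous version (cf. Figure 10).
    module_index = {}
    for modules in change_lists:                 # first pass: build the dictionary
        for module in modules:
            module_index.setdefault(module, len(module_index))

    vectors = []
    for modules in change_lists:                 # second pass: build 0/1 vectors
        vector = [0] * len(module_index)
        for module in modules:
            vector[module_index[module]] = 1
        vectors.append(vector)
    return vectors, module_index
```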
After the integer vectorization phase, we started aggregating data points together if they conveyed information about the same version of the software product. The testing of software at Ericsson is based on allocating test jobs for the software. When we looked through the data received from the testing API we noticed that a single test job is not necessarily responsible for testing an entire version of a software product, see Figure 11. As an example, assume we have a version of a software product. When testing this software product, we divide the test cases over 3 different test jobs in order to verify the product faster. However, due to concurrency issues one of the test jobs fails. A maintainer might notice this and decide to rerun the failed tests, spawning a new test job which now succeeds. This means that we have 4 test jobs, which when combined inform us that the new version of the software product works. Because of this setup it was important to aggregate the data points (each representing a test job) together to get a single data point for each version of the software product. Otherwise, we would get conflicting data points that might hinder the learning of the machine learning models. This is because data points referring to the same version of the software product will have the same input (the swapped software modules for that version) while different test jobs might have either failed or succeeded. In our previous example, only looking at the test job that failed would give us the impression that the new software product version is faulty. However, we have another test job that reran the same tests and succeeded, meaning that there was actually nothing wrong with the software product. Thus, we need to aggregate the data points dealing with the same version of the software product to infer whether that version is faulty or not. While aggregating the data points, we extract which tests the product version has passed and which tests have failed across the different test jobs. We then check whether the set of failed tests is a subset of the set of passed tests. If it is, we mark that software version as having passed the testing phase and thus being non-defective; otherwise we label it as defective. This subset test is necessary in order to check whether tests that have been rerun in different test jobs eventually passed.
Figure 11 An illustration of the flow between the test execution API and the preprocessing script. Note that for a single software product version there might be multiple test verdict entries in the API which need to be merged.
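A sketch of the aggregation and the subset test is given below. The field names (version, passed_tests, failed_tests) are assumptions made for illustration; the logic follows the description above.

from collections import defaultdict

def aggregate_versions(job_entries):
    """Merge test-job entries belonging to the same product version and
    label the version using the subset test described above."""
    passed = defaultdict(set)
    failed = defaultdict(set)
    for job in job_entries:
        version = job["version"]
        passed[version].update(job["passed_tests"])
        failed[version].update(job["failed_tests"])

    labels = {}
    for version in passed.keys() | failed.keys():
        # If every failed test also passed in some other job, it was rerun
        # successfully, so the version as a whole is considered non-defective.
        is_defective = not failed[version].issubset(passed[version])
        labels[version] = is_defective
    return labels

jobs = [
    {"version": "1.2", "passed_tests": {"suiteA"}, "failed_tests": {"suiteB"}},
    {"version": "1.2", "passed_tests": {"suiteB"}, "failed_tests": set()},
]
print(aggregate_versions(jobs))  # {'1.2': False} -> the rerun of suiteB succeeded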
4.3 Training the machine learning models
After we had gathered and preprocessed the data from the CI machinery at Ericsson, the next step was to train the machine learning models. When attempting to learn the relationship between swapped software modules and test verdicts, we utilized three different types of neural networks to determine which type of machine learning model was most effective in our situation. For each neural network model built, we ran the training and testing phase 10 times and then calculated the average performance over these 10 executions when comparing the models. This makes the comparison a bit more robust, as the random initialization of the networks makes their performance differ between training and testing sessions. To our knowledge, there is no standard way to mitigate the nondeterminism in performance that arises from the random initialization of the networks.
4.31 Splitting the data set
We started by dividing the gathered data set into a training set consisting of 70% of the data, followed by splitting the remaining 30% of the data in half to create the validation and test sets. This split was not random but was performed by first ordering the data points with respect to time. Then, starting with the oldest data point and moving forward in time, we picked the training set, followed by the validation set and finally the test set. The reason for doing this is that our interest in this study is whether historic co-change data can tell us anything about future versions of the software product. Therefore, randomly dividing the data into the different sets does not make sense in our situation.
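A minimal sketch of this chronological split, assuming each data point carries a timestamp field:

def chronological_split(data_points, train_frac=0.70):
    """Split the data 70/15/15 into train/validation/test sets in time order."""
    ordered = sorted(data_points, key=lambda p: p["timestamp"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_val = (n - n_train) // 2  # half of the remaining 30%
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test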
4.32 The deep neural network
The first neural network that we tested is the standard densely connected deep neural network (Figure 6). The motivation for including the deep neural network is that it is easy to increase the complexity of the model and that the network can be trained rather quickly. This type of network has also been used in the previous work on fault localization [11] [13], which further supported the choice of the deep neural network. Since it is easy to increase the complexity of the network, we assume that it should be easy to overfit the model to the training data. If this is not the case, it is an indication that the input data might not carry enough information about the relationship between input and output, either because an insufficient correlation exists or simply because of too much noise in the gathered data. Either way, we wanted to use the deep neural network as a baseline, as it does not contain the more modern features found in convolutional and LSTM networks. The network parameters we had to decide on when constructing the deep neural network are how many hidden layers the network should have and how many neurons should be in each hidden layer. The number of neurons in the input layer is dictated by the number of software modules of the software product, so no decision needs to be made for the input layer. The same goes for the output layer, as we want the network to output a single decimal number representing the probability of the software product being defective. To find a suitable number of hidden layers and a suitable number of neurons per hidden layer, we decided to build and evaluate many different deep neural networks, enumerating reasonable values for these two network parameters. Based on the work of [11], [12] and [13] we assume that a good number of hidden layers lies in the range [0, 5], and so we tried networks having between 0 and 5 hidden layers. For the number of neurons in each hidden layer, however, the previous studies differ from each other quite severely. One study [11] used 4 neurons in each hidden layer. Another study [13] calculates the number of neurons using Equation 9:

Number of neurons = round(Number of input nodes / 30) × 10    (9)

Meanwhile, the last study [12] used 1024 neurons for each of their fully connected layers. Given this, we tried all of these configurations from the previous studies for the number of neurons in each hidden layer. Note that a common feature of all three previous studies is that all hidden fully connected layers have the same number of neurons, so we conformed to this as well. We use only one activation function in these networks: the sigmoid function, which is applied to the output layer to ensure that the output lies in the interval [0, 1]. A depiction of the structure can be seen in Figure 12.
Figure 12 An illustration of the structure of our deep neural networks. Note that the number of neurons in the hidden layers varies, as does the number of hidden layers.
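As an illustration, the deep network configurations could be built with Keras (the library used in this thesis) roughly as follows. The exact layer options of the original scripts are not documented, so this is a sketch under the assumption that the hidden layers use a linear activation, in line with the statement that the sigmoid is only applied to the output layer.

from keras.models import Sequential
from keras.layers import Dense

N_MODULES = 69  # number of distinct software modules (input dimension)

def build_deep_network(n_hidden_layers, neurons_per_layer):
    """Densely connected network with a single sigmoid output neuron."""
    model = Sequential()
    if n_hidden_layers == 0:
        # No hidden layers: the input connects directly to the output neuron.
        model.add(Dense(1, activation="sigmoid", input_dim=N_MODULES))
    else:
        # Hidden layers use the default (linear) activation.
        model.add(Dense(neurons_per_layer, input_dim=N_MODULES))
        for _ in range(n_hidden_layers - 1):
            model.add(Dense(neurons_per_layer))
        model.add(Dense(1, activation="sigmoid"))
    return model

# The configurations enumerated in section 4.32.
configurations = [(layers, neurons)
                  for layers in range(0, 6)
                  for neurons in (4, 20, 1024)]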
4.33 The convolutional neural network
Having found the best deep neural network configuration, we then wanted to examine whether a convolutional neural network could perform better. Convolutional neural networks are perhaps the most popular alternative when it comes to defect prediction [6] [20] and have more recently shown promising results within the field of fault localization [12]. Hence, it seemed like a good idea to also evaluate a convolutional network for our problem. For our convolutional neural network model we used the model presented in [12] as a reference. The architecture in [12] is a convolutional neural network with an input layer of dimensionality equal to the number of code statements in the program under consideration. Following that is one convolutional layer with 32 filters, each with a filter size of 10 and an unspecified stride configuration. The ReLU activation function is applied to the output of this convolutional layer. After that comes a max pooling layer whose exact configuration is not described, and another convolutional layer with 64 filters, each with a size of 10 and an unspecified stride configuration. The ReLU activation function is again applied to the output of the convolutional layer and fed to another max pooling layer with unknown configuration. Finally, a variable number of fully connected layers are used before the output layer (consisting of 1 neuron), which applies the sigmoid function and outputs a value in the range [0, 1]. In our network, we start with an input layer of dimensionality equal to the number of distinct software modules in our data set. Following that is a convolutional layer with 32 filters, each with a size of 10 and a stride of 1. We then apply the ReLU activation function to the output of the convolutional layer. Following that is a max pooling layer with a window size of 2. After that follows another convolutional layer with 64 filters, each with a size of 10 and a stride of 1. Again, the ReLU activation function is applied to the output of this convolutional layer, followed by a global max pooling layer that takes the maximum value found among all input features. This layer had to be a global max pooling layer because we would otherwise not be able to connect it to the fully connected layers afterwards, since the dimensionality of the output from an ordinary max pooling layer would not conform to the allowed input dimension for fully connected layers. It is unclear how the previous study [12] solved this issue, as their configurations for the max pooling layers are not described. Finally, 3 fully connected layers with 1024 neurons each are used before the output layer consisting of a single neuron. The output layer uses the sigmoid activation function to ensure that the output value is in the range [0, 1]. These final fully connected layers are like the ordinary layers we use for the deep neural networks. For an overview of the network, see Figure 13.
Figure 13 An illustration of the convolutional neural network that we used.
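A sketch of how this architecture could be expressed in Keras is shown below; reshaping the 0/1 module vector into a single-channel 1D signal is our assumption about how the input was fed to the convolutional layers.

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense, Reshape

N_MODULES = 69  # distinct software modules in the data set

def build_cnn():
    """1D convolutional network mirroring the description in section 4.33."""
    model = Sequential()
    # Treat the 0/1 module vector as a 1D signal with a single channel.
    model.add(Reshape((N_MODULES, 1), input_shape=(N_MODULES,)))
    model.add(Conv1D(32, kernel_size=10, strides=1, activation="relu"))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Conv1D(64, kernel_size=10, strides=1, activation="relu"))
    # Global max pooling collapses the feature maps so dense layers can follow.
    model.add(GlobalMaxPooling1D())
    for _ in range(3):
        model.add(Dense(1024))  # like the ordinary layers of the deep networks
    model.add(Dense(1, activation="sigmoid"))
    return model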
4.34 The recurrent neural network
The third neural network we wanted to try is a simple recurrent neural network based on the Long Short-Term Memory (LSTM) model. The motivation for trying this model stems from the work in [16], which reported an increase in performance in software defect prediction when treating the data as a sequence labeling problem. It should be noted that [16] used many more software metrics in their change sequences for predicting software defects, and we therefore do not expect to reach their level of performance with our simple LSTM model. Nevertheless, for our research question it is interesting to evaluate this model as well. The model was built as follows. First, we have an LSTM memory unit (see Figure 8) which outputs a vector with dimensionality equal to the number of distinct software modules in the data set. We connect the output of the LSTM to the output layer, which consists of a single neuron that uses the sigmoid function to restrict its output to a value in the range [0, 1], see Figure 14.
Figure 14 A general illustration of the recurrent neural network we built for this study.
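A sketch of this model in Keras is given below. Whether the original model emitted one prediction per time step or only for the last one is not stated; this sketch assumes per-time-step (sequence labeling) outputs, in line with [16].

from keras.models import Sequential
from keras.layers import LSTM, Dense

N_MODULES = 69  # distinct software modules; also used as the LSTM output size

def build_lstm():
    """LSTM unit followed by a single sigmoid output neuron (section 4.34)."""
    model = Sequential()
    # Each time step is the 0/1 change vector of one product version; the
    # sequence length is left unspecified so arbitrarily long histories fit.
    model.add(LSTM(N_MODULES, input_shape=(None, N_MODULES), return_sequences=True))
    # One prediction per version in the sequence (sequence labeling).
    model.add(Dense(1, activation="sigmoid"))
    return model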
4.35 Parameters used during training
As mentioned, all networks use the sigmoid function as the activation function for the output neuron. Besides this, the ReLU activation function is used for all convolutional layers. The cost function used during training was the mean squared error loss. The batch size during training was 32, except for the LSTM model which considered the entire training data as one long sequence. Each model was trained for 15 epochs, in accordance with [6] and [20]. For optimization we used the Adam optimizer, as it has been used in other studies [6]. Early stopping is also applied during training, as it is a well-known technique to combat overfitting [17].
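Put together, the training setup could look roughly as follows in Keras; the early-stopping patience and the monitored quantity are assumptions, as they are not specified above.

from keras.callbacks import EarlyStopping

def train(model, x_train, y_train, x_val, y_val, batch_size=32):
    """Training setup of section 4.35: MSE loss, Adam, 15 epochs,
    batch size 32 and early stopping on the validation loss."""
    model.compile(loss="mean_squared_error", optimizer="adam", metrics=["accuracy"])
    stopper = EarlyStopping(monitor="val_loss", patience=3)  # patience is an assumption
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     batch_size=batch_size,
                     epochs=15,
                     callbacks=[stopper])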
4.4 Evaluating the models on the software defect prediction problem
After training each model we wanted to see how well it performed on the software defect prediction problem. In other words, we wanted to analyze how well the models could predict whether a version of the software product was defective or not, based on which software modules had been changed. For each data point in the test set we fed the models the vector of swapped software modules. We then rounded the output (which we also call the suspiciousness score) of each network to 1 if it was larger than 0.5, indicating that the model considers the software version defective. Otherwise, we rounded the output to 0, indicating that the software version is believed to be non-defective. From this we could calculate the accuracy, F-score and AUC score for each model on the test data.
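A minimal sketch of this evaluation using sklearn, where y_test holds 1 for defective versions and 0 otherwise:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, x_test, y_test):
    """Round the suspiciousness score at 0.5 and compute the three metrics."""
    scores = model.predict(x_test).ravel()     # values in [0, 1]
    predictions = (scores > 0.5).astype(int)   # 1 = predicted defective
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "f_score": f1_score(y_test, predictions),
        "auc": roc_auc_score(y_test, scores),
    }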
4.5 Comparing neural network models against spectrum-based approaches
After having evaluated the different types of neural networks on the defect prediction problem, we wanted to compare their suspiciousness scoring of the different software modules, as described in section 2.52. The suspiciousness scoring is obtained by feeding the models artificial input in which only one software module has been swapped. The output produced by the network is then considered that software module's suspiciousness score. Doing this for every software module allowed us to plot the suspiciousness score distribution of the different modules for each neural network.
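For the feed-forward models, this scoring can be sketched as follows (the LSTM additionally needs a sequence dimension on its input, which is omitted here):

import numpy as np

def network_suspiciousness(model, n_modules):
    """Feed fake inputs where exactly one module is swapped and read the
    network output as that module's suspiciousness score."""
    fake_inputs = np.eye(n_modules)            # row i: only module i changed
    return model.predict(fake_inputs).ravel()  # one score per software module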
For reference we also plotted the suspiciousness scores of the different modules as calculated by the spectrum-based approaches Tarantula and Ochiai, as described in chapter 2. Note that the comparison of suspiciousness scoring was done to investigate whether we could draw any conclusions regarding the models' fault localization capabilities. This is different from the models' defect prediction capabilities, as the problem of software defect prediction is not the same as the problem of software fault localization. However, they are related problems dealing with the same underlying issue, so it is interesting to see whether we can draw similar conclusions about the neural networks by viewing them through these two problems. The motivation for including the spectrum-based approaches Tarantula and Ochiai is that they have been heavily studied in software fault localization. Furthermore, to be able to compare the neural networks against the spectrum-based approaches on both software fault localization and software defect prediction, we created a simple method for classifying a version of the software product as defective or non-defective using Tarantula and Ochiai. The method is as follows. We used the data in the training set to calculate a suspiciousness score for each software module using Tarantula and Ochiai respectively. Then, for each data point in the test set, we checked whether a software module with suspiciousness higher than or equal to 0.5 had been changed for that data point. If so, we classified the software product version as defective; otherwise it was considered non-defective. Using this simple method allowed us to get an indication of whether Tarantula or Ochiai could be used for defect prediction in our situation.
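A sketch of this procedure, using the standard Tarantula and Ochiai formulas (here "failed" counts defective training versions in which a module was changed, and "passed" counts non-defective ones); the variable names are illustrative and the exact code used in the thesis is not reproduced:

import math

def suspiciousness(train_points, n_modules):
    """Per-module Tarantula and Ochiai scores computed from the training set.

    Each training point is assumed to be a (change_vector, is_defective) pair,
    where change_vector is the 0/1 module vector from section 4.2.
    """
    total_failed = sum(1 for _, defective in train_points if defective)
    total_passed = len(train_points) - total_failed
    tarantula, ochiai = [], []
    for m in range(n_modules):
        failed = sum(1 for vec, defective in train_points if vec[m] and defective)
        passed = sum(1 for vec, defective in train_points if vec[m] and not defective)
        fail_rate = failed / total_failed if total_failed else 0.0
        pass_rate = passed / total_passed if total_passed else 0.0
        # Tarantula: share of failing runs relative to all runs touching the module.
        tarantula.append(fail_rate / (fail_rate + pass_rate) if fail_rate + pass_rate else 0.0)
        # Ochiai: failed / sqrt(total_failed * (failed + passed)).
        denom = math.sqrt(total_failed * (failed + passed))
        ochiai.append(failed / denom if denom else 0.0)
    return tarantula, ochiai

def classify(change_vector, module_scores, threshold=0.5):
    """Defective if any changed module has suspiciousness >= threshold."""
    return any(changed and score >= threshold
               for changed, score in zip(change_vector, module_scores))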
4.6 Hardware and software configurations
All experiments were executed on an HP EliteBook 840 G5 with an Intel Core i7-8650U CPU at 1.90 GHz and 32 GB of RAM. All scripts for data gathering, preprocessing, and for creating, training and evaluating the models were written in Python 3.6.7, using the popular Python packages requests, pickle, numpy, sklearn, keras and tensorflow. Requests was used to query the internal REST APIs and parse the JSON data obtained from them. Pickle was used to save the data to file so that it was easily accessible after it had been gathered. Numpy, sklearn, keras and tensorflow were used to build the neural networks, train them and finally evaluate them. (Python 3.6.7: https://www.python.org/downloads/release/python-367/; requests: https://2.python-requests.org/en/master/; pickle: https://docs.python.org/3/library/pickle.html; numpy: https://numpy.org/; sklearn: https://pypi.org/project/sklearn/; keras: https://github.com/keras-team/keras; tensorflow: https://www.tensorflow.org/)
5 Results
This chapter presents the experimental results. The following section contains results about the data collection process. Thereafter follows a presentation of the performance of the different neural network models on the data.
5.1 Resulting data
In the end we extracted data about one software product from the CI system at Ericsson. As the CI system deletes data after roughly a month has passed, we gathered the data on multiple occasions and then merged it all into a single data set. Information about the final data set can be found in Table 1. In total, the data set represents the development process of the software product over roughly 1.5 months' time. It should be noted that the number of distinct software modules found in the data set was 69.

Before or after preprocessing | Number of entries | Percentage of entries defective
Before | 5485 | 35.6%
After  | 289  | 52.2%

Table 1 Meta information about the gathered data. For information about the preprocessing, see section 4.2.
5.2 Resulting performance
In this section we present how the deep neural networks, the convolutional neural network and the LSTM network performed on the test set.
5.21 Deep neural network performance
The average performance of the different deep neural network configurations over 10 training and testing sessions can be seen in Figure 15, Figure 16 and Figure 17. Note that since the number of unique software modules in the data set was 69, we obtained from Equation 9 that 20 neurons in each hidden layer could be a good configuration. The configurations for the number of neurons in each hidden layer that we tested were therefore 4, 20 and 1024 neurons respectively. Among the configurations, two different networks achieved the best performance with respect to the three measurements accuracy, F-score and AUC. The configuration with the highest accuracy and F-score was the 5 hidden layer network with 1024 neurons in each hidden layer, seen in Figure 17. It achieved an accuracy of 55.5% on the test data set, an F-score of 0.632 and an AUC score of 0.543. The configuration with the highest AUC score was the 2 hidden layer network with 1024 neurons, which can also be seen in Figure 17; it achieved an accuracy of 52.9%, an F-score of 0.5 and an AUC score of 0.605. The general trend, based on a comparison of Figure 15, Figure 16 and Figure 17, is that no configuration outperforms another by much. On average, all configurations have around 50% accuracy, an F-score of around 0.5 and an AUC score of around 0.5.
Figure 15 Performance of deep neural networks where each hidden layer consists of 4 neurons (accuracy, F-score and AUC plotted against the number of hidden layers).
Figure 16 Performance of deep neural networks where each hidden layer consists of 20 neurons (accuracy, F-score and AUC plotted against the number of hidden layers).
Figure 17 Performance of deep neural networks where each hidden layer consists of 1024 neurons (accuracy, F-score and AUC plotted against the number of hidden layers).
5.22 Convolutional network performance
The performance of the convolutional neural network (CNN) over 10 runs can be seen in Figure 18. On average, the CNN had an accuracy of 49.0% on the test data set, an F-score of 0.502 and an AUC score of 0.484.
Figure 18 Performance of the CNN over 10 training sessions (accuracy, F-score and AUC per training session).
5.23 LSTM network performance
The performance of the recurrent neural network using the LSTM structure over 10 runs can be seen in Figure 19. On average, the LSTM network had an accuracy of 46.4%, an F-score of 0.223 and an AUC score of 0.539.
Figure 19 Performance of the LSTM network over 10 training sessions (accuracy, F-score and AUC per training session).
5.3 Comparison with spectrum-based approaches
When applying our simple method from section 4.5 for converting the suspiciousness scores of Tarantula and Ochiai into software defect predictions, we obtained the results shown in Table 2.

Approach  | Accuracy | F-score | AUC
Tarantula | 56.8%    | 0.698   | 0.487
Ochiai    | 36.4%    | 0.0     | 0.5

Table 2 Performance of Tarantula and Ochiai on the test data set, having used the training data set to calculate suspiciousness.
The suspiciousness ranking of the 69 modules making up the software product under consideration is depicted in Figure 20 for Tarantula and Figure 21 for Ochiai. We also include the suspiciousness distributions of the convolutional network (Figure 22), the LSTM network (Figure 23) and the best configuration in each of Figure 15, Figure 16 and Figure 17 (Figure 24, Figure 25 and Figure 26).
Figure 20 Suspiciousness distribution of the 69 software modules as calculated by Tarantula.
Figure 21 Suspiciousness distribution of the 69 software modules as calculated by Ochiai.
Figure 22 Suspiciousness distribution of the 69 software modules as calculated by our CNN.
Figure 23 Suspiciousness distribution of the 69 software modules as calculated by our LSTM model.
Figure 24 Suspiciousness distribution of the 69 software modules as calculated by our deep neural network with 1 hidden layer and 4 neurons in each hidden layer.
Figure 25 Suspiciousness distribution of the 69 software modules as calculated by our deep neural network with 5 hidden layers and 20 neurons in each hidden layer.
Figure 26 Suspiciousness distribution of the 69 software modules as calculated by our deep neural network with 5 hidden layers and 1024 neurons in each hidden layer.
6 Discussion
In this chapter the results and the validity of the thesis work are discussed. Suggestions for improving the study are also presented in the future work section.
6.1 Discussion of software defect prediction results
Generally, when comparing the results of this work with previous studies within the field of software defect prediction [6] [16] [20], we can conclude that our machine
learning models perform poorly. Most noticeable is the low accuracy achieved by all models, the highest accuracy of 55.5% being achieved by the best deep neural network configuration. This is very close to random guessing, which would on average achieve 50% accuracy, and is very low in comparison to [6], whose neural network achieved accuracies between 85% and 90%. As for F-score, some of our networks perform almost equal to or better than some of the previous studies. The best average F-score of 0.632, achieved by the best deep neural network, is in the same range as [20], whose best convolutional network F-score was 0.778 and whose best model had an average F-score of 0.608 over 7 different projects. It is also very close to the average F-score of 0.657 reported by [16]. As we can see from the results, the LSTM model appears to have the hardest time learning from the data, achieving an average F-score of 0.223, which is lower than any of the models tested in [20]. However, our high F-scores are most likely a result of the fact that we have an almost even ratio of defective and non-defective instances in our data, as seen in Table 1. This means that with 50% accuracy we should get 50% of the defective instances correct, resulting in a high F-score. This is different from [6], where the defective ratio in the data set was 8.1%. Thus, it is not likely that the high F-scores observed are connected to the findings in [33] that co-changes between subsystems exhibit a correlation with defects, as presented in chapter 3. For the AUC measurements we get lower results than [6], which measured AUC scores between 0.45 and 0.75. Compared to [16], which had an average AUC score of 0.892, our models do not perform well. Our best AUC score of 0.605, from one of our deep neural network configurations, is not that high, and overall most of our models have an AUC score of around 0.50. To summarize, our neural network models do not seem to perform much better than random guessing when it comes to predicting whether a software version is defective or not. This is based on the low AUC scores and accuracies observed.
6.2 Co-change as an input metric
Returning to our first research question, namely: Is co-change history of software modules on its own useful as input for predicting defects? The results indicate that, although some information about the software product's stability might be carried in the co-change history, it is not enough on its own to achieve decent software defect prediction results. As we can see, the accuracies and AUC scores observed are low compared to other studies within the field, which indicates that the models generally perform similarly to random guessing. Our reasoning for why this is the case is as follows. The first, and most likely major, reason why we do not get good performance is that we do not have sufficiently many data points to learn the relationship. After the preprocessing step we have 289 data points (see Table 1), which
according to the rule of thumb presented in [17] is far too few to achieve decent performance, the rule being that you need approximately 10,000 data points to get adequate performance. Secondly, even with many data points it would still be hard to learn the relationship because of its non-static nature. To illustrate, we can compare our classification problem with the problem of classifying images. When classifying images, a picture of a car should always be classified as a car. However, in our scenario a software version where software modules A, B and C have been changed should not always be classified as defective; it all depends on how A, B and C have been changed. Thus, by working at the abstraction level of co-changed software modules we have a non-static relationship that we wish to learn, which dramatically increases the difficulty of learning it. Our suggestion for dealing with this problem is to decrease the abstraction level and, if possible, provide the entire code as input to the machine learning models, as done in [6] and [20]. Another approach would be to include the actual code changes for each software module change. The key point is to be able to differentiate the data points from each other, because over time you will probably end up with data points having the same input but different outputs, which will hinder learning.
6.3 Discussion of software fault localization results
As seen in Table 2, Tarantula together with our method of transforming suspiciousness into defect prediction (described in section 4.5) performs similarly to our best neural network models. Ochiai, on the other hand, does not perform well at all. This indicates that our neural network models do not learn a complex relationship between changed software modules and defects, as their performance can be imitated by simpler statistical approaches. The difference in performance between Tarantula and Ochiai can be understood by considering the suspiciousness distributions of the respective methods. Tarantula tends to score many modules higher than 0.5, which is our threshold for classifying a module as defective. Ochiai does not rate any module more suspicious than 0.45, which means that all instances in the test data set are classified as non-defective, resulting in poor performance.
6.4 Fault localization of neural networks
Returning to our second research question: Can neural networks utilizing co-change history be used for fault localization? When comparing the different models' suspiciousness distributions, we can see that the convolutional neural network differs from the other models, scoring all software modules close to 0.5. Two of the deep neural networks (Figure 24, Figure 25) also tend to score all software modules close to 0.5. Thus, these models do not consider any particular software module extra prone to be faulty; instead, all modules could potentially be faulty. Interestingly, the suspiciousness distribution of the deep neural network in Figure 26 has greater variance. Why this is the case is not well understood. This result makes us question whether using the problem of defect prediction to evaluate fault localization performance was a good strategy. The different deep neural network configurations are very similar when it comes to defect prediction performance but internally rank the software modules very differently in order of suspiciousness. Hence, evaluating fault localization through the defect prediction problem does not seem to be a useful approach. To summarize, the results suggest that neural networks together with co-change history cannot be used to localize faults in new software versions. Instead, the best approach we can recommend in this scenario is to use Tarantula for fault localization.
6.5 Social, ethical and economic impacts of the research
There are some social, ethical and economic aspects of the thesis work. Generally, there is an
economic interest in solving both the problem of software defect prediction and the problem of software fault localization, as this would lead to faster development of stable software. To illustrate the importance of this, it has been estimated that the cost of software bugs in 2016 was around 1.1 trillion US dollars [35]. Of course, social and ethical concerns can arise in this research field, for example by feeding developer information to the machine learning models. There is probably a correlation between who the developer is and what type of bugs will occur, something that could be useful for the models. In our case we do not feed this information directly to the models. However, it should not be impossible to use the co-change information together with other information in the CI environment to trace changes back to a group of developers. This follows from the fact that if you can solve the software fault localization problem, you can trace the problem back to the exact lines of code. Then, by using the information in the versioning system, you can determine who introduced the problem. This type of information could potentially create issues in the work environment.
6.6 Relevance of results
Considering the results from a company's perspective, this study illustrates that it is very important to have the right type of data if you wish to apply machine learning. Generally, this work is of importance to IT companies and the people working at these companies. The results and the problems we try to solve in this study are not of general importance to society; they are only relevant to companies working with large code bases, with many developers committing code frequently, and with a complex CI pipeline. The problem we try to solve arises with the scaling up of software development. Thus, from a single developer's point of view, this study is not that relevant. The most important result of the work is what type of data you should not use for fault localization. It also highlights the most difficult problem facing companies that want to use machine learning, namely the problem of generating and gathering the right data.
6.7 Validity
In this section we discuss issues with our approach that may compromise our findings. Our goal is to find generally applicable rules that can help others. However, as this is merely a case study, we can only guarantee that our results hold within the tight setting of our experiments.
6.71 Internal validity
The most notable issues with the thesis work are the size of our data set, the simplicity of our input feature and the lack of correct answers for the fault localization problem. As has been mentioned, using a limited representation of a software's development carries the risk of producing many conflicting data points in the data set. If a richer set of input features could be used, the number of conflicting data points would decrease, making the relationship between input and output more explicit to the machine
learning algorithms. This, together with more data points, is the preferred approach to increasing the performance of our system, a general trend you can observe when considering software defect prediction studies over time [16] [20] [36]. It would also be useful to create a scheme for determining which software modules are faulty at any given point in time. This would allow us to create a set of correct answers for the fault localization problem, in order to properly evaluate the machine learning models. Another issue with this study is that the performed F-score calculations only approximate the real F-score value. The Python library Keras, which we rely on heavily in this study, evaluates a network's performance using batches of testing data, similar to how a network is trained. Calculating the F-score over these batches has been argued to be misleading, which is a flaw in our method that should be fixed if this study were to be replicated (for more information, see the thread at https://github.com/keras-team/keras/issues/5794).
6.72 External validity
As previously mentioned, due to the limitations of the environment there is no guarantee that similar experiments conducted on different software products will reach the same conclusion. However, we strongly believe that replicated studies will reach the same conclusion as ours, namely that co-changes do not help neural networks perform much better than random when it comes to software defect prediction and software fault localization.
6.8 Conclusion
Using the metric of historical co-changes to predict test verdict failures produces poor results. The most obvious issue with the approach is the limited representation of a software product's development that historical co-changes provide. This limitation results in a non-static relationship between input and output, something that supervised machine learning models have trouble learning. This study has shown that historical co-changes as a feature do not help a neural network perform much better than random on the software defect prediction problem. Furthermore, as the relationship becomes non-static when only historical co-changes are used, networks such as the LSTM appear to suffer the most from this non-static attribute, which is expected. As for software fault localization, we concluded that different types of neural networks seem to assign different suspiciousness to the software modules when trained on the same data. However, there is no clear correlation between a network's performance on software defect prediction and its suspiciousness scoring. This makes us question our assumption that there is a relationship between the problems of software fault localization and software defect prediction. The results generally suggest that a network's performance on predicting software defects has no correlation with its performance on localizing software faults.
6.9 Future work
Extending the data gathering, changing the
input feature to the models and obtaining correct answers to the fault localization problem (see section 6.71) are necessary improvements to the study. Provided that these improvements increase the performance of the neural networks, it would be interesting to see how the relationship between the neural network models and the spectrum-based approaches (Tarantula and Ochiai) changes with respect to suspiciousness scoring. Of course, introducing richer input features makes the comparison of the approaches blurrier, as the neural network models will then use more information about the system than the spectrum-based approaches.
7 Bibliography
[1] P. Rodríguez, A. Haghighatkhah, L. E. Lwakatare, S. Teppola, T. Suomalainen, J. Eskeli, T. Karvonen, P. Kuvaja, J. M. Verner and M. Oivo, "Continuous deployment of software intensive products and services: A systematic mapping study," Journal of Systems and Software, vol. 123, pp. 263-291, 2017.
[2] M. V. Mäntylä, B. Adams, F. Khomh, E. Engström and K. Petersen, "On rapid releases and software testing: a case study and a semi-systematic literature review," Empirical Software Engineering, vol. 20, no. 5, pp. 1384-1425, 2015.
[3] D. Ståhl and T. Mårtensson, Continuous Practices - A Strategic Approach to Accelerating the Software Production System, Self-published, 2018.
[4] A. Haghighatkhah, M. Mäntylä, M. Oivo and P. Kuvaja, "Test prioritization in continuous integration environments," Journal of Systems and Software, vol. 146, 2018.
[5] A. Tummala, "Self-learning algorithms applied in Continuous Integration system," Master's thesis, 2018. [Online]. Available: http://www.diva-portal.org/smash/get/diva2:1229394/FULLTEXT02.pdf [Accessed 09 12 2019].
[6] A. Sundström, "Investigation into predicting unit test failure using syntactic source code features," Master's thesis, 2018. [Online]. Available: http://kth.diva-portal.org/smash/get/diva2:1239808/FULLTEXT01.pdf [Accessed 09 12 2019].
[7] S. Vangala, "Pattern Recognition applied to Continuous integration system," Master's thesis, 2018. [Online]. Available: http://www.diva-portal.org/smash/get/diva2:1185406/FULLTEXT01.pdf [Accessed 09 12 2019].
[8] A. Debbiche, M. Dienér and R. B. Svensson, "Challenges When Adopting Continuous Integration: A Case Study," Springer International Publishing, 2014.
[9] Y. Zhu, E. Shihab and P. C. Rigby, "Test Re-Prioritization in Continuous Testing Environments," in IEEE International Conference on Software Maintenance and Evolution, 2018.
[10] H. Spieker, A. Gotlieb, D. Marijan and M. Mossige, "Reinforcement learning for automatic test case prioritization and selection in continuous integration," in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2017.
[11] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, 2009.
[12] Z. Zhang, Y. Lei, X. Mao and P. Li, "CNN-FL: An Effective Approach for Localizing Faults using Convolutional Neural Networks," in IEEE 26th International Conference on Software Analysis, Evolution and Reengineering, 2019.
[13] W. Zheng, D. Hu and J. Wang, "Fault localization analysis based on deep neural network," Mathematical Problems in Engineering, 2016.
[14] M. D'Ambros, M. Lanza and R. Robbes, "On the Relationship Between Change Coupling and Software Defects," in IEEE 16th Working Conference on Reverse Engineering, 2009.
[15] S. Kirbas, B. Caglayan, T. Hall, S. Counsell, D. Bowes, A. Sen and A. Bener, "The relationship between evolutionary coupling and defects in large industrial software," Journal of Software: Evolution and Process, vol. 29, no. 4, 2017.
[16] M. Wen, R. Wu and S. Cheung, "How Well Do Change Sequences Predict Defects? Sequence Learning from Software Changes," IEEE Transactions on Software Engineering, 2018.
[17] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[18] H. He, Z. Zhu and E. Mäkinen, "A neural network model to minimize the connected dominating set for self-configuration of wireless sensor networks," IEEE Transactions on Neural Networks, vol. 20, no. 6, 2009.
[19] T. Guo, J. Dong, H. Li and Y. Gao, "Simple convolutional neural network on image classification," in IEEE 2nd International Conference on Big Data Analysis, 2017.
[20] J. Li, P. He, J. Zhu and M. R. Lyu, "Software Defect Prediction via Convolutional Neural Network," in IEEE International Conference on Software Quality, Reliability and Security, 2017.
[21] I. N. da Silva, D. H. Spatti, R. A. Flauzino, L. H. B. Liboni and S. F. dos Reis Alves, Artificial Neural Networks, Springer, Cham, 2017.
[22] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," 2014. [Online]. Available: https://arxiv.org/pdf/1412.6980.pdf [Accessed 09 12 2019].
[23] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
[24] Z. Mahmood, D. Bowes, T. Hall, P. C. R. Lane and J. Petrić, "Reproducibility and replicability of software defect prediction studies," Information and Software Technology, vol. 99, 2018.
[25] J. Ren, K. Qin, Y. Ma and G. Luo, "On Software Defect Prediction Using Machine Learning," Journal of Applied Mathematics, vol. 2014, 2014.
[26] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, 2005.
[27] G. Laghari, K. Dahri and S. Demeyer, "Comparing spectrum based fault localisation against test-to-code traceability links," in International Conference on Frontiers of Information Technology, 2018.
[28] W. E. Wong, R. Gao, Y. Li, R. Abreu and F. Wotawa, "A Survey on Software Fault Localization," IEEE Transactions on Software Engineering, vol. 42, no. 8, 2016.
[29] R. Abreu, P. Zoeteweij, R. Golsteijn and A. J. van Gemund, "A practical evaluation of spectrum-based fault localization," Journal of Systems and Software, vol. 82, no. 11, 2009.
[30] A. Zakari, S. P. Lee, K. A. Alam and R. Ahmad, "Software fault localisation: a systematic mapping study," IET Software, vol. 13, no. 1, 2019.
[31] L. C. Briand, Y. Labiche and X. Liu, "Using Machine Learning to Support Debugging with Tarantula," in The 18th IEEE International Symposium on Software Reliability, 2007.
[32] R. Abreu, P. Zoeteweij and A. J. V. Gemund, "An Evaluation of Similarity Coefficients for Software Fault Localization," in 12th Pacific Rim International Symposium on Dependable Computing, 2006.
[33] E. Kouroshfar, "Studying the Effect of Co-change Dispersion on Software Quality," in 35th International Conference on Software Engineering, 2013.
[34] T. L. Graves, A. F. Karr, J. Marron and H. Siy, "Predicting fault incidence using software change history," IEEE Transactions on Software Engineering, vol. 26, no. 7, 2000.
[35] L. Li, S. Lessmann and B. Baesens, "Evaluating software defect prediction performance: an updated benchmarking study," 2019. [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1901/1901.01726.pdf [Accessed 22 11 2019].
[36] Z. Li, X.-Y. Jing and X. Zhu, "Progress on approaches to software defect prediction," IET Software, vol. 12, no. 3, 2018.
8 Appendix
The appendix presents the raw data obtained from running the software defect prediction experiments described in chapter 4. This raw data was then used to create the figures presented in section 5.2.
A Deep neural network results

No hidden layer | Accuracy (training) | F-score (training) | AUC (training)
                | 50.8 % (52.8 %)     | 0.566 (0.546)      | 0.514 (0.532)

Table 3 Average performance obtained when testing a deep network with no hidden layers over 10 training sessions.

1 hidden layer         | Accuracy (training) | F-score (training) | AUC (training)
4 neurons per layer    | 55.5 % (53.2 %)     | 0.581 (0.529)      | 0.540 (0.530)
20 neurons per layer   | 47.5 % (52.9 %)     | 0.494 (0.529)      | 0.469 (0.558)
1024 neurons per layer | 54.1 % (61.3 %)     | 0.536 (0.598)      | 0.591 (0.654)

Table 4 Average performance obtained when testing a deep network with one hidden layer over 10 training sessions.

2 hidden layers        | Accuracy (training) | F-score (training) | AUC (training)
4 neurons per layer    | 53.1 % (54.9 %)     | 0.553 (0.527)      | 0.520 (0.551)
20 neurons per layer   | 50.9 % (53.3 %)     | 0.528 (0.529)      | 0.499 (0.548)
1024 neurons per layer | 52.9 % (64.3 %)     | 0.500 (0.605)      | 0.605 (0.697)

Table 5 Average performance obtained when testing a deep network with two hidden layers over 10 training sessions.

3 hidden layers        | Accuracy (training) | F-score (training) | AUC (training)
4 neurons per layer    | 53.0 % (52.7 %)     | 0.547 (0.526)      | 0.564 (0.555)
20 neurons per layer   | 52.6 % (55.1 %)     | 0.511 (0.525)      | 0.543 (0.560)
1024 neurons per layer | 51.2 % (65.9 %)     | 0.477 (0.615)      | 0.577 (0.708)

Table 6 Average performance obtained when testing a deep network with three hidden layers over 10 training sessions.

4 hidden layers        | Accuracy (training) | F-score (training) | AUC (training)
4 neurons per layer    | 49.8 % (53.4 %)     | 0.510 (0.508)      | 0.465 (0.530)
20 neurons per layer   | 42.0 % (56.1 %)     | 0.380 (0.517)      | 0.439 (0.568)
1024 neurons per layer | 52.1 % (58.6 %)     | 0.486 (0.535)      | 0.558 (0.606)

Table 7 Average performance obtained when testing a deep network with four hidden layers over 10 training sessions.

5 hidden layers        | Accuracy (training) | F-score (training) | AUC (training)
4 neurons per layer    | 46.9 % (54.9 %)     | 0.455 (0.522)      | 0.489 (0.567)
20 neurons per layer   | 54.5 % (53.7 %)     | 0.549 (0.500)      | 0.542 (0.545)
1024 neurons per layer | 57.0 % (55.5 %)     | 0.632 (0.587)      | 0.543 (0.562)

Table 8 Average performance obtained when testing a deep network with five hidden layers over 10 training sessions.
B Convolutional network results
Convolutional network | Accuracy (train) | F-score (train) | AUC (train)
Run 1   | 56.8 % (50.1 %) | 0.712 (0.659) | 0.513 (0.535)
Run 2   | 56.8 % (50.1 %) | 0.712 (0.659) | 0.335 (0.658)
Run 3   | 54.2 % (51.0 %) | 0.676 (0.660) | 0.532 (0.583)
Run 4   | 49.5 % (59.4 %) | 0.434 (0.510) | 0.547 (0.684)
Run 5   | 47.9 % (69.7 %) | 0.444 (0.653) | 0.549 (0.718)
Run 6   | 37.5 % (53.0 %) | 0.163 (0.273) | 0.459 (0.619)
Run 7   | 38.0 % (80.3 %) | 0.417 (0.795) | 0.482 (0.880)
Run 8   | 59.9 % (70.4 %) | 0.698 (0.737) | 0.525 (0.809)
Run 9   | 45.3 % (65.3 %) | 0.493 (0.642) | 0.471 (0.726)
Run 10  | 43.8 % (53.5 %) | 0.268 (0.241) | 0.430 (0.616)
Average | 49.0 % (60.3 %) | 0.502 (0.583) | 0.484 (0.683)

Table 9 Performance of the convolutional neural network model over 10 different training sessions.
C Long Short-Term Memory (LSTM) network results

LSTM network | Accuracy (train) | F-score (train) | AUC (train)
Run 1   | 45.5 % (88.6 %) | 0.205 (0.431) | 0.502 (0.946)
Run 2   | 38.6 % (92.6 %) | 0.159 (0.446) | 0.538 (0.981)
Run 3   | 47.7 % (88.6 %) | 0.250 (0.416) | 0.547 (0.961)
Run 4   | 50.0 % (91.6 %) | 0.250 (0.455) | 0.540 (0.973)
Run 5   | 52.3 % (88.1 %) | 0.318 (0.406) | 0.551 (0.973)
Run 6   | 52.3 % (89.6 %) | 0.273 (0.431) | 0.596 (0.961)
Run 7   | 40.9 % (91.1 %) | 0.159 (0.431) | 0.453 (0.979)
Run 8   | 43.2 % (86.1 %) | 0.205 (0.386) | 0.493 (0.963)
Run 9   | 47.7 % (87.1 %) | 0.227 (0.411) | 0.612 (0.958)
Run 10  | 45.5 % (91.6 %) | 0.182 (0.431) | 0.554 (0.972)
Average | 46.4 % (89.5 %) | 0.223 (0.424) | 0.539 (0.967)

Table 10 Performance of the LSTM network over 10 different training sessions.
TRITA-EECS-EX-2020:26 www.kth.se