A Study on Java/JML Open Source Systems

We evaluate our approach on 12 Java/JML open source projects with respect to the automatic detection and manual classification of nonconformances. Moreover, we investigate how practitioners write contracts in JML by analyzing all 12 systems evaluated with JmlOk2.

We make publicly available a replication package containing data for replication purposes and to support future studies. The package contains: (i) the source code of the JmlOk2 tool, along with a runnable version; (ii) the source code of all experimental units used in this study; and (iii) all test cases generated by JmlOk2 that enabled us to detect the 119 nonconformances reported on this page.

Definition

The goal of the first case study is to analyze one approach (implemented for JmlOk2) for the purpose of evaluation with respect to detection and manual classification of nonconformances from the point of view of researchers in the context of contract-based programming.

In particular, we address the following research questions:

Q1. How many nonconformances is the approach able to detect in real contract-based programs for a given time limit, and which are the most common types and likely causes of those nonconformances?

    We measure the number of detected nonconformances (#NCs) in each system within a given test generation time limit, and analyze the frequency of nonconformances by type and likely cause.


Q2. What does it take for tests to exercise the detected nonconformances?

    We measure and summarize two metrics that capture the complexity of the first failing test case for a given nonconformance: breadth (B) and depth (D).

In order to discover how practitioners write contracts in JML, some additional questions are defined:

Q3. What are the most common contract types written by developers?

    We measure the frequency of contract clauses (#CC) for each contract type in each experimental unit and discuss whether there are significant differences in the use of each contract type.

Q4. Is there any relationship between contract complexity and the number of detected nonconformances?

    To answer this question, we measure contract complexity (CCo) and relate this metric to the number of detected nonconformances (#NCs) for each experimental unit by using a correlation test.

Q5. Which are the most common nonconformance types for the units with the most complex contracts?

    We rank the top five systems by CCo and group the nonconformances detected in each of these units according to nonconformance type.

Q6. What is the relationship between contract type and number of detected nonconformances?

    We take the top five experimental units in terms of number of detected nonconformances, list the most common contract type of each unit, and discuss any relationship between contract type and number of nonconformances.

Q7. What is the relationship between contract type and nonconformance type?

    For each experimental unit, we identify the most common contract type and the most common nonconformance type, and try to find contract types that are more likely to reveal a specific type of nonconformance.


Planning
Experimental units
    We evaluate JMLOK2 on 12 JML projects, summing up to 26 KLOC and more than 4.2 K contract clauses (CC, henceforth). They include the sample programs available on the JML web site, composed of 11 example programs written for training purposes by specialists in the JML language. The study also includes programs collected from 11 open-source JML projects. Bank, PayCard, and TheSchorrWaiteAlgorithm are presented in the KeY approach book; Bomber is a mobile game; Dnivra-Jive is the set of programs used by the Jive JML Dynamic Program Verifier; HealthCard is an application that manages medical appointments on smart cards. JAccounting is a case study from the ajml compiler project, implementing an accounting system. Likewise, Javacard is a technology used to program smart cards, comprising a subset of desktop Java: a subset of the programming language itself and a cut-down version of the API (the current API implementation was developed for the KeY interactive verification system). Mondex is a system whose translation from the original Z specification was developed in the context of the Verified Software Repository. PokerTop, in turn, is a poker system. Finally, TransactedMemory is a specific feature of the Javacard API.

Table 1 shows the program versions used in the evaluation of JmlOk2 and summarizes all units in terms of lines of code and contract clauses, with the contract clauses grouped by contract type.



Study procedure

The study follows these steps: (1) run JmlOk2 over each contract-based program within a given time limit (the time allotted for test generation), and (2) manually classify each detected nonconformance in terms of its likely cause -- see Figure 1.

Figure 1. First, we use JmlOk2 for detecting nonconformances in each system. Then, we perform the manual classification of likely causes for each detected nonconformance.

    We performed the case study on a 3.6 GHz Core i7 with 16 GB RAM running Windows 8.1 Enterprise, backed by Java 1.7 update 80 and jml compiler 5.6_rc4. Since Randoop requires a time limit for test generation (the time after which the generation process stops) and we wanted to extend the evaluation of JMLOK2 performed in previous works, we varied time limits from 60s to 600s; aiming to stabilize the number of detected nonconformances over time, we ran JMLOK2 with time limits of up to 1,100s for the unit Samples. These limits were chosen because we test the systems as a whole, not a single class at a time as similar works do. Furthermore, based on the results on each experimental unit, we can suggest the best cost-benefit choice of time limit. As a rule of thumb, we suggest starting with a low time limit, such as 60s, and increasing this value based on the budget for the test generation step. All systems' source code, generated test cases, and JMLOK2 binaries are available as a replication package on the companion website.

    In addition, we collected two metrics for each nonconformance: breadth (B) and depth (D). The first is the number of method calls performed in a test method before the nonconformance is exposed; this metric is defined in Equation 1, where calls(tm) returns the sequence of method calls in the test method tm received as parameter. The second is the depth of the internal calls performed until the contract is violated; this metric is defined in Equation 2. Given the method at the position in which the nonconformance was revealed, if the nonconformance lies in the body of this method, D receives 1; otherwise, its value is recursively increased by one for each internal call down to the method that reveals the nonconformance.
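As an illustration of these two metrics (our own sketch, not JmlOk2's actual implementation), breadth can be computed as the length of the call sequence in the failing test method, while depth recursively counts the internal calls down to the method whose contract is violated:

```java
import java.util.List;

public class NCMetrics {
    // Breadth (B): the number of method calls in the failing test method,
    // i.e. |calls(tm)| in the spirit of Equation 1.
    public static int breadth(List<String> callsInTestMethod) {
        return callsInTestMethod.size();
    }

    // Depth (D), mirroring the recursive definition of Equation 2:
    // D = 1 if the nonconformance is revealed in the body of the first
    // method of the chain; otherwise 1 + the depth of the remaining
    // internal call chain down to the violating method.
    public static int depth(List<String> callChainToViolation) {
        if (callChainToViolation.size() <= 1) {
            return 1; // violation in the body of this method
        }
        return 1 + depth(callChainToViolation.subList(1, callChainToViolation.size()));
    }

    public static void main(String[] args) {
        // A test calling m1(); m2(); m3() has B = 3.
        System.out.println(breadth(List.of("m1", "m2", "m3")));
        // m3 -> helper -> check(), violation inside check(): D = 3.
        System.out.println(depth(List.of("m3", "helper", "check")));
    }
}
```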


R Script
The R script and data of this study are available online: RScript, Contract types data, Correlation data.

Results
JMLOK2 detects 119 nonconformances in the evaluated projects (see Table 2), of which 24 are new in experimental units that had been used in previous works with JMLOK2: Bank (3), Bomber (5), Dnivra-Jive (6), HealthCard (41), JAccounting (26), Javacard (7), Mondex (2), PokerTop (1), Samples (18), TheSchorrWaiteAlgorithm (2), and TransactedMemory (8). In PayCard, JMLOK2 was not able to detect nonconformances with the configuration used.
    Concerning types, those nonconformances are distributed as follows: 51 invariant, 47 postcondition, 8 constraint, 7 evaluation, and 6 precondition. Most of the 119 detected nonconformances manually received Weak precondition (51) as likely cause, followed by Code error (38). Four units (HealthCard, JAccounting, Samples, and TransactedMemory) presented Weak precondition as the most frequent likely cause. Average breadth ranges from 1.00 (in the PokerTop experimental unit) to 4.17 (in Dnivra-Jive); average depth varies from 1.00 (in TheSchorrWaiteAlgorithm) to 4.22 (in Samples) -- see Table 2. Concerning how JML developers write contracts, we found a small difference between the use of preconditions (0.442) and postconditions (0.419) over the evaluated projects -- see Table 2; on the other hand, the use of history constraints is limited to a few projects (4 out of 12), and correspondingly the number of contract clauses of this type is also small: only 40 of 4,230 contract clauses. In three of the top five units with respect to the CCo metric, Invariant is the most common type of nonconformance. Postcondition is the most common contract type for the projects with the highest number of detected nonconformances (four out of five) -- see Table 3. Regarding the relationship between contract type and nonconformance type, precondition clauses are more related to nonconformances of invariant type, whereas postcondition clauses are equally related to postcondition and invariant types -- see Table 3.

    Table 2 presents a summary of the case study on nonconformance detection and classification. For each experimental unit, the table presents the following information: LOC - lines of code, CC - the number of contract clauses, and the metrics breadth (B) and depth (D). Furthermore, the table also presents the CCo and NCsD metrics; the first is the ratio between contract clauses and lines of code, and the second is the ratio between the number of nonconformances and the number of contract clauses. Concerning the tests generated for each experimental unit, columns GT and TC show the time given for test generation and the average number of test cases generated within that time limit, respectively. Column NCs gives the number of nonconformances detected within the time limit. Regarding the nonconformances as a whole, columns PRE to EVAL present the number of detected nonconformances grouped by type -- PRE is an acronym for precondition, POST for postcondition, and INV for invariant. The group of columns Manual Classification presents the likely causes manually assigned to the detected nonconformances, grouped by likely cause -- following the classification model presented here. In this table we use the following acronyms: WPre for Weak precondition, SPre for Strong precondition, WPost for Weak postcondition, SPost for Strong postcondition, SInv for Strong invariant, and CodeErr for Code error.
Table 2. Results summary of running JmlOk2 over the experimental units.
    
    For each system, we present the ratio between each contract type and the total number of contract clauses, the value of CCo, the number of detected nonconformances (#NCs), and the most common contract and nonconformance types.
Table 3. Metrics over the JML systems.


Discussion
JMLOK2 detected a total of 119 nonconformances. The approach detected nonconformances even in the set of example programs written by JML specialists. Despite their best efforts, subtle nonconformances remained in the contracts and/or programs; some of those were indeed hard to catch only with visual analysis or simple tests. For instance, we detected four nonconformances in methods that invoke, in their contracts, JMLDouble.approximatelyEqualTo; JMLDouble is imported from the standard JML API, being visible only at the contract level, and its method approximatelyEqualTo compares two values within a given tolerance. The tolerance value (a constant related to the accepted error) may be inappropriately small (only 0.005); or this postcondition is too strong; or the implementation of type JMLDouble is too restrictive. All these possible reasons show how hard it is to detect those kinds of nonconformances.
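To illustrate the tolerance issue, the following plain-Java sketch mimics the idea behind a tolerance-based comparison such as approximatelyEqualTo (this is our own illustration with made-up values, not the actual JML API code): a tolerance as small as 0.005 rejects a result that a looser tolerance would accept.

```java
public class ApproxEq {
    // Sketch of a tolerance-based comparison in the spirit of
    // JMLDouble.approximatelyEqualTo (not the actual JML implementation).
    public static boolean approximatelyEqual(double a, double b, double tolerance) {
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        double computed = 10.012;  // hypothetical value returned by the code
        double specified = 10.0;   // hypothetical value expected by the contract
        // With the small tolerance 0.005 the postcondition check fails...
        System.out.println(approximatelyEqual(computed, specified, 0.005)); // false
        // ...while a looser tolerance accepts the same result.
        System.out.println(approximatelyEqual(computed, specified, 0.05));  // true
    }
}
```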

    From the 119 detected nonconformances, the most frequent type was invariant (51), followed by postcondition violations (47). These numbers indicate that most violations occur at the exit of operations, even if the cause is not in the operation itself. There are several possible explanations: specifiers expect certain behavior with which the code fails to comply, or a chain of previous calls fails to avoid certain undesirable states. More severe contract errors were not significantly frequent (only seven were evaluation errors). A few nonconformances (eight) were related to history constraints, all in HealthCard, one of the few programs that actually use such constraints. This result could indicate that using those constraints is not trivial and of questionable usefulness, since they can often be replaced by an invariant.

    Regarding likely causes, our manual classification established Weak precondition as the main cause across the experimental units (with 51 instances). In this case, a method allows values that should be denied for its correct execution -- as we could infer from the information available in the program. For instance, the default precondition (requires true) is often used in the JAccounting experimental unit: of all 98 precondition clauses, 44 are the default precondition. Weak precondition may be the most common likely cause because specifying preconditions is complex: since the specifier does not know the clients of the system in advance, an overly strong precondition denies access to several clients; on the other hand, if the precondition is too weak, many clients will not be able to get the expected result. Therefore, this represents an important trade-off for contract-based programs. However, Code error is also very recurrent, with 38 instances. This problem is probably related to the different levels of abstraction between programming and contract languages, in which the coder may not understand a contract written by someone else, the contract may be too complex to implement, or there may be a synchronization issue between code and contract evolution. For instance, in HealthCard we observed that a change in the code concerning dates made a class invariant obsolete, which resulted in a nonconformance.
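The trade-off can be illustrated with a hypothetical Account class (our own sketch, not code from the evaluated units) whose deposit method keeps the default precondition: a client passing a negative amount is accepted and silently breaks the balance invariant. The JML contracts are shown as comments, and the runtime check is our own stand-in for the JML invariant check.

```java
public class Account {
    //@ invariant balance >= 0;
    private int balance = 0;

    //@ requires true;  // default (weak) precondition: any amount is accepted
    //@ ensures balance == \old(balance) + amount;
    public void deposit(int amount) {
        balance += amount;  // a negative amount silently breaks the invariant
    }

    // Runtime stand-in for the JML invariant check above.
    public boolean invariantHolds() {
        return balance >= 0;
    }

    public static void main(String[] args) {
        Account a = new Account();
        a.deposit(-10);  // allowed by `requires true`
        System.out.println(a.invariantHolds()); // false: invariant violated
    }
}
```

A stronger precondition such as `requires amount > 0` would reject this client call before the invariant could be broken.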

    The results show that breadth and depth are, on average, higher than 1 for all experimental units. Those numbers suggest that tests have a higher chance of detecting nonconformances when the test cases present higher horizontal (call sequences) and vertical (internal calls) complexity. The method of test generation used in our approach favors those properties. The highest averages were obtained on Dnivra-Jive (4.17) for breadth and Samples (4.22) for depth. For instance, breadth = 17 for a method from CardUtil, a class from the HealthCard unit, where several modifications to the objects under test were performed until the nonconformance emerged; and depth = 12 in some classes from the Samples unit, where several calls are made before the postcondition problems arise.

    The smallest mean for breadth (1.00) was obtained on PokerTop and the smallest for depth (1.00) on TheSchorrWaiteAlgorithm; this result indicates that the nonconformances in these units were simpler to detect. It seems reasonable to conclude that there is a relationship between those metrics and the most frequent type of nonconformance present in those units -- invariant, which specifies that all fields of the class must be non-null (JML's default behavior). A single execution of a constructor that does not initialize all fields conflicts with the assertion based on this invariant. The complete manual categorization is available in our technical report.
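This situation can be sketched with a hypothetical class (our own illustration) whose constructor forgets to initialize a reference field; under JML's default non-null semantics a single constructor call already violates the implicit invariant, so a test of breadth 1 and depth 1 suffices. The runtime check below is our own stand-in for the JML check.

```java
public class Customer {
    // Under JML's default semantics, reference fields are implicitly
    // non-null, as if `//@ invariant name != null;` were written.
    private String name;  // never initialized by the constructor

    public Customer() {
        // forgot: this.name = ...;
    }

    // Runtime stand-in for the implicit non-null invariant.
    public boolean implicitInvariantHolds() {
        return name != null;
    }

    public static void main(String[] args) {
        // One constructor call is enough to expose the violation.
        System.out.println(new Customer().implicitInvariantHolds()); // false
    }
}
```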

    JML developers appear to prefer pre- and postconditions over invariants and history constraints. To check this impression, we performed a paired t test -- both the pre- and postcondition data are Normal -- obtaining a p-value of 0.7216, and thus could not reject, at a 95% confidence level, the null hypothesis that the samples have the same mean. So, for the 12 experimental units used throughout this case study, the use of preconditions is statistically indistinguishable from the use of postconditions.
    
    We performed a correlation test with the data from the CCo metric and #NCs. Evaluating the normality of the CCo and #NCs data, we found by means of a Shapiro-Wilk test that the #NCs data set is not Normal at a 95% confidence level; therefore, we could not use Pearson's correlation test. Instead, we used the non-parametric tests -- Spearman and Kendall -- to check the correlation between CCo and #NCs. The result of the Spearman test was ρ = -0.34, and of the Kendall test, τ = -0.29. Both tests indicate a weak negative correlation between contract complexity (measured by the CCo metric) and the number of detected nonconformances: as CCo increases, the number of detected nonconformances tends to decrease weakly.
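For tie-free data, Spearman's coefficient reduces to ρ = 1 - 6 Σ d_i² / (n(n² - 1)), where d_i is the difference between the ranks of the i-th pair. A minimal Java sketch of this formula (our own illustration; the study itself used the R script linked above):

```java
import java.util.Arrays;

public class SpearmanSketch {
    // 1-based rank of each value within its array (assumes no ties).
    static double[] ranks(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        double[] r = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            r[i] = Arrays.binarySearch(sorted, xs[i]) + 1;
        }
        return r;
    }

    // Spearman's rho for tie-free data: 1 - 6 * sum(d_i^2) / (n (n^2 - 1)).
    public static double rho(double[] x, double[] y) {
        double[] rx = ranks(x), ry = ranks(y);
        double sumD2 = 0;
        for (int i = 0; i < x.length; i++) {
            double d = rx[i] - ry[i];
            sumD2 += d * d;
        }
        int n = x.length;
        return 1.0 - 6.0 * sumD2 / (n * (double) (n * n - 1));
    }

    public static void main(String[] args) {
        // Perfectly opposite rankings give rho = -1.
        System.out.println(rho(new double[]{1, 2, 3, 4}, new double[]{8, 6, 4, 2}));
    }
}
```

A negative ρ, as obtained in the study, means that units with a higher ratio of contract clauses to lines of code tend to rank lower in number of detected nonconformances.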

    Invariant and postcondition were the most common types of nonconformances in the projects with the highest values of the CCo metric. This result may be explained by the amount of weak preconditions in the evaluated projects; for example, in the JAccounting unit, of all 98 precondition clauses, 44 are the default precondition, meaning that no obligation is imposed on the client; in consequence, the supplier may not be able to deliver the expected result (postconditions or even invariants may be violated).

    With respect to the top five systems in number of detected nonconformances, three of the five systems with the most precondition clauses were also among those with the highest number of detected nonconformances (HealthCard, JAccounting, and TransactedMemory). This result goes against our expectations: since preconditions act as a filter avoiding method executions with forbidden values, we did not expect projects with more preconditions to present more nonconformances. To confirm this result, we performed Spearman and Kendall correlation tests (since the nonconformance ratio data is not Normal, according to the Shapiro-Wilk normality test), obtaining ρ = 0.37 and τ = 0.26, both indicating a weak positive correlation between the number of precondition clauses and the number of detected nonconformances.

    Concerning the relationship between contract and nonconformance types, systems with more postcondition clauses usually have more postcondition and invariant nonconformances -- indicating that the postconditions may be hard to satisfy, yet not strong enough to avoid invariant errors. Considering the systems with more precondition clauses, more precondition clauses imply more nonconformances of invariant type -- indicating that the preconditions are not enough to avoid invariant errors.

Threats to validity

Conclusion validity
    With respect to conclusion validity, we use Randoop as the test generator for JMLOK2, and Randoop may generate different test suites for a given time limit; to mitigate this threat, we ran the tool 10 times for each experimental unit and each time limit. In addition, the CCo metric, based on the ratio between the number of contract clauses and LOC, may not be representative as a measure of contract complexity. However, the way of counting contract clauses here is similar to that presented by Estler et al..

Internal validity
    As Randoop is a random-based test generator, we ran each experimental unit 10 times for each time limit (varying from 60 to 600 seconds) to gain more confidence in the results of this work. Moreover, as the unit Samples presented an ascending curve in the number of detected nonconformances, we ran JMLOK2 on this unit with time limits of up to 1,100 seconds -- we considered the detection stable when the number of detected nonconformances did not change for at least three time limits.

Construct validity
    Concerning the metrics for evaluating how hard it is to detect a nonconformance, B and D may not be representative of the real difficulty of detecting a nonconformance. As for the metric measuring contract complexity, although the way of measuring is quite simple, the approach used in the present work follows the same idea as Estler et al.. We created the classification model for nonconformances based on our own experience in manually classifying nonconformances. To reduce this threat, all nonconformances were classified by three researchers separately, and in a discussion session the most suitable likely cause was established for each nonconformance.

    The number of nonconformances detected may not reflect the real number of errors in a given experimental unit. In addition, some nonconformances may be related to updates in the contract-based language that occurred after the implementation of a given experimental unit. Moreover, JMLOK2 is not able to detect all nonconformances of a system.

External validity
    Since we use Randoop as the test generator and this tool is completely random and dependent on the machine setup, we cannot argue for the repeatability of the findings of this work on different machine setups; however, we were able to replicate 77 of the 84 nonconformances (almost 92%) previously detected in experiments using JMLOK2. Furthermore, the randomness of Randoop may allow the detection of new nonconformances in the experimental units presented in this work.
    
    To mitigate other threats related to external validity, we evaluated different kinds of JML systems, ranging from a few to thousands of lines of code and from a few to thousands of contract clauses. Nevertheless, the experimental units presented in this work may not be representative of the actual use of JML.


Comparison between the manual classification and the heuristics-based classification from JmlOk2
Table 4 lists all 119 nonconformances, presenting the automatic and the manual classification (baseline) side by side.
In summary, the heuristics-based approach had an accuracy of 75%. The main differences occurred due to the simplicity of the heuristics used for classification. In contract-based methods with tricky postconditions, if the precondition is the default one, our classification suggests Weak precondition as the likely cause. This is one limitation of our approach, but we believe that if a method has a complex postcondition, it needs some precondition to guarantee the properties required to satisfy the postcondition. In addition, since context knowledge may be important for understanding the real source of contract violations, a heuristics-based approach is hardly able to find that real source. Furthermore, our heuristics-based approach returns as likely cause the first match found, and we observed that a more detailed analysis of the contract-based program may be necessary. Nevertheless, JmlOk2 is a step towards the correction of contract violations aided by automation.
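The first-match behavior can be sketched as follows (a hypothetical, simplified rule set of our own, not JmlOk2's actual heuristics): a default precondition matches first and therefore masks any other candidate cause.

```java
public class CauseHeuristic {
    // Hypothetical first-match sketch of a heuristics-based classifier:
    // the first rule that matches determines the likely cause, which is
    // why a default precondition can mask other possible causes.
    public static String likelyCause(boolean defaultPrecondition,
                                     boolean complexPostcondition) {
        if (defaultPrecondition) return "Weak precondition"; // first match wins
        if (complexPostcondition) return "Strong postcondition";
        return "Code error"; // fallback when no rule matches
    }

    public static void main(String[] args) {
        // A tricky postcondition combined with a default precondition is
        // still classified as Weak precondition by the first-match rule.
        System.out.println(likelyCause(true, true));
    }
}
```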


