COMETS (Code Metrics Time Series) is a dataset of source code metrics collected from several systems to support empirical studies on source code evolution. The dataset includes information on the evolution of the following Java-based systems:

For each system, the dataset includes the source code of its versions, extracted at bi-weekly intervals over the time frame indicated in the following table.

Table 1
Moreover, the dataset includes time series with the values of the following source code metrics measured at the level of classes:

  • Size Metrics:
    • Number of attributes (NOA)
    • Number of public attributes (NOPA)
    • Number of private attributes (NOPRA)
    • Number of attributes inherited (NOAI)
    • Number of lines of code (LOC)
    • Number of methods (NOM)
    • Number of public methods (NOPM)
    • Number of private methods (NOPRM)
    • Number of methods inherited (NOMI)
  • Coupling Metrics:
    • fan-in
    • fan-out
  • CK Metrics:
    • Weighted Methods per Class (WMC)
    • Depth of Inheritance Tree (DIT)
    • Number Of Children (NOC)
    • Coupling Between Objects (CBO)
    • Response For a Class (RFC)
    • Lack of Cohesion in Methods (LCOM)

For each system S and metric M, the COMETS dataset contains a CSV file whose rows represent the classes of S and whose columns represent the bi-weeks considered when extracting the versions of S. A cell (c, t) in this file contains the value of M measured for class c in bi-week t.
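As a rough illustration of this layout, the following Python sketch loads one metric file into a per-class time series and sums the metric over all classes for each bi-week. The sample data, the `;` delimiter, the header row, the empty-cell convention for classes that do not exist in a given bi-week, and the function names `load_metric_series` and `system_total` are all assumptions made for this example; check them against the actual files before relying on it.

```python
import csv
import io

# Hypothetical sample in the assumed layout: one row per class,
# one column per bi-week. Real COMETS files may differ in delimiter,
# header, and how missing values are marked.
SAMPLE = """class;1;2;3
org.example.Foo;3;4;4
org.example.Bar;;2;5
"""


def load_metric_series(text, delimiter=";"):
    """Return {class_name: [value or None for each bi-week]}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)          # first cell labels the class column
    n_biweeks = len(header) - 1
    series = {}
    for row in reader:
        if not row:
            continue               # skip blank lines
        name, cells = row[0], row[1:]
        # Assumption: an empty cell means the class has no value in that bi-week.
        values = [float(c) if c.strip() else None for c in cells]
        values += [None] * (n_biweeks - len(values))  # pad short rows
        series[name] = values
    return series


def system_total(series):
    """Sum the metric over all classes for each bi-week, skipping missing cells."""
    n_biweeks = max(len(v) for v in series.values())
    return [
        sum(v[t] for v in series.values() if t < len(v) and v[t] is not None)
        for t in range(n_biweeks)
    ]


series = load_metric_series(SAMPLE)
totals = system_total(series)
```

Here `series["org.example.Foo"]` is the time series of a single class, and `totals` is the system-wide series obtained by adding up the values of all classes per bi-week.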


    The following figures show examples of the time series provided in the dataset. The first figure shows the values of two metrics (NOA and NOM) collected for one of the classes in the Eclipse JDT system. The second figure shows the evolution of the same class in terms of lines of code. Finally, the third figure shows the evolution of the number of attributes (NOA) and the number of methods (NOM) over all classes in Eclipse JDT.

    Figure 1
    Figure 2
    Figure 3

    Extraction Process

    To create the dataset, we extracted the source code of each system from its revision control platform at bi-weekly intervals. We used the Moose platform to extract the metric values for each class of each considered version, excluding only test classes. In particular, we relied on VerveineJ, a Moose application, to parse the source code of each version and to generate MSE files. MSE is the file format that Moose uses to persist source code models. Because the current version of Moose does not calculate three of the CK metrics (CBO, LCOM, and RFC), we extended the platform with new routines for this purpose. The dataset also includes the MSE files we used to extract the metric time series.
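To give an idea of what one of the three added metrics measures, the sketch below computes LCOM following the original Chidamber and Kemerer definition: the number of method pairs that share no attribute minus the number of pairs that share at least one, with a floor of zero. This is not the Moose routine used to build the dataset. It is written in Python rather than in Moose's own environment, and the function name `lcom_ck` and its input format (a mapping from method name to the attributes the method accesses) are hypothetical.

```python
from itertools import combinations


def lcom_ck(method_attrs):
    """Chidamber-Kemerer LCOM for one class.

    method_attrs maps each method name to the collection of attribute
    names that the method accesses (hypothetical input format).
    """
    attr_sets = [set(attrs) for attrs in method_attrs.values()]
    disjoint = 0  # pairs of methods with no attribute in common
    sharing = 0   # pairs of methods with at least one attribute in common
    for a, b in combinations(attr_sets, 2):
        if a & b:
            sharing += 1
        else:
            disjoint += 1
    # LCOM is defined as zero when sharing pairs are at least as numerous.
    return max(disjoint - sharing, 0)


example = {
    "getX": {"x"},
    "move": {"x", "y"},
    "log": {"z"},
}
lcom_value = lcom_ck(example)
```

In this example, `getX` and `move` share `x`, while `log` shares nothing with either of them, giving two disjoint pairs and one sharing pair.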