Data provided in this part of the study are firmly based on hard data collected by automated source code analysis. The source code is analyzed using both static and dynamic methods: we look not only at the current state of the source code (static view) but also at how the source code was developed over time (dynamic view). The dynamic viewpoint is especially important because it allows us to estimate project progress and therefore to predict the future development of individual projects.
The source code analysis is automated and this page is updated every day. The report therefore has a life of its own: not even the team that created it can predict what data it will show tomorrow. Both the methodology and the code that generates these pages are publicly available, so anyone can independently verify that this report is correct.
Computer programs are very abstract things, so visualizing them correctly is a challenge, especially when we also consider the evolution of a program over time. We therefore considered many methods of data visualization and finally decided to use the popular "quadrant" form of presentation, mostly because it is intuitively understood by many readers. This is the result:
The "zero axes" that divide the chart into four quadrants represent average values computed across all projects. We believe this is roughly equivalent to the similar concept used by industry analysts. However, unlike the charts presented by analysts, all the metrics we use are firmly based on real-world data refined from the project source code. A detailed explanation of these metrics is provided below.
A detailed explanation of the concepts used in this study, including the specific formulas for maturity and progress and the details of source code size analysis, is provided in the methodology section.
The size of each bubble in the quadrant chart directly shows the size of the project's source code. According to our methodology, the relevant parts of the source code of each individual project are selected to make the projects comparable. Only non-blank lines of source code are counted. All programming languages, comments and test code are included.
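The counting rule above is simple enough to sketch directly. This assumes the per-methodology file selection has already happened; the snippet only shows the "count every non-blank line, comments and test code included" part, with a made-up sample input.

```python
# Minimal sketch of the line-counting rule: every line with at least one
# non-whitespace character counts, regardless of language or whether it
# is a comment or test code. File selection is assumed to be done already.
def count_non_blank_lines(text):
    """Count lines that contain at least one non-whitespace character."""
    return sum(1 for line in text.splitlines() if line.strip())

sample = "int main() {\n\n    return 0;  // comment lines count too\n}\n"
print(count_non_blank_lines(sample))  # → 3 (the blank line is skipped)
```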
Results of a slightly more detailed analysis of the source code size are illustrated in the following chart.
The following two charts illustrate project activity. The first chart shows the number of commits per month. The second chart shows the size of the team that contributed code to the project each month. We assume this is a good approximation of the technical core team: the team that actually develops the product as their day-to-day job.
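The team-size approximation described above can be sketched as counting distinct authors per month. The commit records below are invented for illustration; how author identities are actually extracted and deduplicated from the repositories is outside the scope of this sketch.

```python
# Illustrative sketch of the team-size approximation: the monthly team
# size is the number of distinct authors who committed that month.
# The (month, author) commit records are made up for the example.
from collections import defaultdict

commits = [
    ("2015-01", "alice"), ("2015-01", "bob"), ("2015-01", "alice"),
    ("2015-02", "alice"), ("2015-02", "carol"), ("2015-02", "dave"),
]

def team_size_per_month(commits):
    """Map month -> number of distinct committing authors."""
    authors = defaultdict(set)
    for month, author in commits:
        authors[month].add(author)
    return {month: len(people) for month, people in sorted(authors.items())}

print(team_size_per_month(commits))  # → {'2015-01': 2, '2015-02': 3}
```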
Both the number of commits and the team size seem to be very volatile for most of the projects. Therefore the charts show a 6-month average of the respective values to improve readability. Charts showing the raw data are available here and here.
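One plausible reading of the 6-month smoothing above is a trailing moving average over the monthly values; whether the study's charts use a trailing or centered window is not stated, so this sketch assumes a trailing one. The monthly commit counts are illustrative.

```python
# Hedged sketch of 6-month smoothing: a trailing moving average over
# monthly values. Early months average over however many months exist.
def moving_average(values, window=6):
    """Trailing moving average with a shrinking window at the start."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

monthly_commits = [40, 10, 55, 5, 70, 20, 60, 15]  # made-up raw data
print([round(v, 1) for v in moving_average(monthly_commits)])
```

The smoothed series damps the month-to-month swings that make the raw charts hard to read, which matches the stated purpose of the averaging.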
Detailed numeric results of the source code analysis are summarized in this table.
Based on the source code analysis, we are able to make the following predictions:
These predictions are based on the data gathered from the source code analysis. They are predictions in the scientific sense of the word: they may or may not come true. Their only goal is to validate the usability of the model. We publish these predictions not to harm other projects; our goal is transparency. We want the general public to be able to evaluate whether these predictions were correct, and therefore also to evaluate our model. See also the disclaimer.
This is the first version of our model and it makes a lot of assumptions: e.g. that the projects are continually developed, that the development teams are more or less stable, that fluctuation is reasonably low, etc. As far as we know, all the evaluated projects currently satisfy these assumptions. Therefore we believe that our model is applicable to them.
However, this model has one major drawback: we do not know of any reliable method to actually validate it. We strongly believe it is correct, but we do not know how to be entirely sure. Therefore we kindly leave the evaluation to the reader. We have also made several predictions based on the model. These predictions may be used in the future to falsify the applicability of the model; conversely, if the predictions prove to be true, then that will be a sign that the model works.
However, we reserve the right to modify this model in the future if we notice that it needs adjustments, e.g. in case the assumptions are no longer met. This is usual scientific practice: create a model, subject it to verification, fix any issues and repeat. That is our plan.
We have been able to successfully differentiate core developers from external contributors for most projects, with the exception of OpenIDM. OpenIDM is the only project still stuck with a centralized version control system. We expect that OpenIDM actually has some external contributors; however, we intentionally do not account for them when computing the project score. We use this as a penalty for the contribution slowdown caused by the very high entry barrier of centralized version control systems.
Small commits vs. big commits: during the preparation of a predecessor of this study there were discussions about how much the size of a particular commit matters. A small commit changes only a couple of source code lines in a couple of files; a big commit changes a lot of lines in a lot of files. However, we have found that commit size is of very little importance. Firstly, each and every commit has to bring in a complete unit of work. All the evaluated projects use continuous integration systems (as far as we know), so it is unlikely that any significant number of commits deliver vastly unfinished or broken functionality. Secondly, big commits are considered very bad open source practice. Their use is likely to raise suspicion about how open and transparent the development process really is. Big chunks of functionality that a vendor "throws over the fence" from time to time are a reliable way to break the code of all contributors, partners and customized deployments. Therefore, even if big commits can theoretically deliver more functionality, their use should be penalized by the model. In our view, treating all commits as equal is the most appropriate penalization.
Source code lines as a metric: our use of the number of source code lines as a metric has caused a lot of heated discussion in the past. The opponents are partially right: the KLOC (thousands of lines of code) metric is not the smartest source code metric. But we believe its use is appropriate in this case because:
See also "Are you sure this study is correct?".