
Cruft4J


  Overview
  How to Run
    Command Line
    Maven
  Benchmarks
  About
    Source Code
    Score
    License

Overview

Cruft4J is a static source code analysis tool that reports on the maintainability (or "cruftiness") of your Java code. Whereas other tools generate an unwieldy list of maintainability violations that makes it hard to know where to start, Cruft4J generates a single score based on just two things: cyclomatic complexity and copy-paste.

With this score, you can better understand how your code base measures up against other systems (see open source benchmarks) and also track over time whether things are getting better or worse.

Obviously this score isn't perfect, but that's not the point. The goal is to get the conversation started with managers and developers about code quality, and that's easier to do with one number than with a list of obscure violations. Once the conversation is started, however, the team can discuss how to improve, whether by using excellent, finer-grained tools like Sonar and PMD, or by adopting development best practices like code reviews.

Running Cruft4J is easy: currently you can run it from the command line or within Maven, and in the future I hope to support Ant, Gradle, Jenkins, and possibly Eclipse.

How to Run

Command Line

The easiest way to run Cruft4J is from the command line. First, make sure you have Java 1.5 or higher installed. Download Cruft4J from here, and unzip it to some directory (e.g. C:\Cruft4J). This directory will be referred to below as your CRUFT4J_HOME.

Next, open a command prompt, and cd to your CRUFT4J_HOME directory. From here, type...

> cruft4j.bat -sourceDir C:\some_project\src\

That's it! This will analyze all Java code within the specified source directory, and then generate a set of HTML reports in the CRUFT4J_HOME/output/ directory. If you want to store your output in a different directory, try...

> cruft4j.bat -sourceDir C:\some_project\src\ -outputDir C:\some_directory

And finally, if you'd like to analyze multiple projects, you'll probably want to keep the output reports separate. To do this, pass in a project name...

> cruft4j.bat -sourceDir C:\some_project\src\ -outputDir C:\some_directory -projectName myProject

Optionally, you may set your CRUFT4J_HOME as an environment variable, called (of course!) CRUFT4J_HOME, and add it to your PATH; you can then run Cruft4J from any directory.
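
For example, on Windows a session might look like the following (the install location is just illustrative, and set only affects the current command prompt):

> set CRUFT4J_HOME=C:\Cruft4J
> set PATH=%PATH%;%CRUFT4J_HOME%
> cruft4j.bat -sourceDir C:\some_project\src\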

Maven

Through the magic of Maven, hooking Cruft4J into your build is quite easy. As of now, Cruft4J is not in the public repo, but I'm working on this. In the meantime, you can get the source from GitHub here, and just build both the cruft4j-calculator and cruft4j-maven projects to install them to your local repo.

Once Cruft4J is in your repo, it's just a matter of configuring it in your pom.xml. A full sample build file is here, but the important part is configuring the Cruft4J plugin, like so:

<plugins>
  <plugin>
    <groupId>org.summalabs.cruft4j</groupId>
    <artifactId>cruft4j-maven-plugin</artifactId>
    <version>1.0</version>
    <executions>
      <execution>
        <phase>verify</phase>
        <goals>
          <goal>calculate-cruft</goal>
        </goals>
      </execution>
    </executions>
    <configuration>
      <scoreThreshold>10</scoreThreshold>
    </configuration>
  </plugin>
</plugins>

There are a couple of important things to note. First, given the configuration above, Cruft4J will run during the verify phase (which, according to Maven, is the time to "run any checks to verify the package is valid and meets quality criteria"), but you're obviously free to change this. To test it out, run the following in the directory of your pom.xml:

> mvn verify

...and this will calculate a Cruft4J score, and generate a set of reports within the target/cruft4j directory.

Now, in terms of improving overall software quality, it's often helpful to draw the proverbial line in the sand, and say "we may not be ecstatic about the level of quality, but we commit to not letting it get any worse." To this end, a "cruft threshold" can be set, using the following configuration:

  <scoreThreshold>40</scoreThreshold>

Given this configuration, if the code's cruft score ever goes above 40, then the build will fail. Pretty harsh, but perhaps necessary if code quality is important enough to you! To see what a reasonable Cruft Score is, check out the open source benchmarks, where you can see how popular open source projects score.

You can also set a threshold for the raw Cruft Score (i.e. before being scaled by lines of code) with this configuration:

  <rawScoreThreshold>40</rawScoreThreshold>

Finally, if you want to take advantage of Cruft4J's trend reports to see how code quality has been tracking over time, then you'll want to specify the output directory where the reports and, more importantly, the Cruft4J database will be stored:

  <outputDirectory>C:\Cruft4J\output</outputDirectory>
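
Putting these options together, a fuller plugin configuration might look like this sketch (the threshold values and output path are just illustrative):

<configuration>
  <scoreThreshold>40</scoreThreshold>
  <rawScoreThreshold>40</rawScoreThreshold>
  <outputDirectory>C:\Cruft4J\output</outputDirectory>
</configuration>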

Benchmarks

As mentioned in the overview, after running most static code analysis tools, you're still left in a quandary: "ok, I have 3,000 violations and 70k lines of code...but is that good or bad?"

To help answer this question, I went on an archeology tour of sorts, and ran Cruft4J against 81 open source projects. With this data, it's possible to see how your code measures up! Just run Cruft4J against your own Java project, and see where you fall on the distribution below:

[Figure: distribution of Cruft4J scores across the 81 open source projects]

Across all projects, the average score was 51. Note that the Cruft4J calculation scales by lines of code, meaning that it takes into account that a larger code base would have more instances of "cruftiness" than a smaller code base. To see all projects that were measured, go to the projects page.

About

Source Code

The source code for Cruft4J is stored in GitHub here. Development is slow, but steady. Pull requests are welcome!

Cruft4J was written (ironically, I guess) in Groovy, not Java. Under the hood it uses two excellent tools: PMD CPD to find copy-paste instances, and JavaNCSS to calculate complexity. All credit goes to those projects - Cruft4J is just a dumb aggregator.

Score

The Cruft4J score is a measure of how "crufty" or un-maintainable a given set of Java source code is. The higher the score, the more cruft was found, and so like golf, lower is better! The calculation is simple:

Cruft = ((Cyclomatic Complexity Points + Copy-Paste Points) / Lines of Code) x 1,000

Why only use cyclomatic complexity and copy-paste, and not other good metrics like unit test coverage, comments, or lines of code? Well, while there's disagreement about which code should be unit tested, or whether code should have comments or be self-documenting, high complexity and copy-paste are incontrovertible evidence of poor code quality. Everyone agrees: they are always bad. No "if"s, "and"s, or "but"s.

For example, there is never a good reason for a method to have a complexity of 20; it can always be broken down into smaller, more manageable sub-methods. This is the principle of decomposition. Likewise, there is never a good reason for a block of 100 lines of code to be copied from one class to another; the logic can always be re-used through composition or delegation. This is the DRY principle.
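
As a toy illustration of decomposition (this example is mine, not from Cruft4J, and a real offender would have far more branches), the tangled method below can be split into sub-methods so that each stays simple:

class ShippingExample {

    // Before: one method carrying every branch (its cyclomatic
    // complexity grows with each new rule)
    static double costTangled(boolean international, boolean express, double weightKg) {
        if (international) {
            if (weightKg > 20) return 80.0;
            return 40.0;
        }
        if (express) return 25.0;
        if (weightKg > 20) return 15.0;
        return 8.0;
    }

    // After: decomposed into sub-methods, each trivially simple on its own
    static double cost(boolean international, boolean express, double weightKg) {
        return international ? internationalCost(weightKg)
                             : domesticCost(express, weightKg);
    }

    static double internationalCost(double weightKg) {
        return weightKg > 20 ? 80.0 : 40.0;
    }

    static double domesticCost(boolean express, double weightKg) {
        if (express) return 25.0;
        return weightKg > 20 ? 15.0 : 8.0;
    }

    public static void main(String[] args) {
        // Same inputs, same answer, simpler methods
        System.out.println(costTangled(false, true, 5));  // 25.0
        System.out.println(cost(false, true, 5));         // 25.0
    }
}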

Under the hood, Cruft4J uses two excellent tools to calculate complexity and copy-paste, respectively: JavaNCSS and PMD CPD. When given a directory of Java source code, Cruft4J will use these tools to find all instances of high complexity and copy-paste, and then for each instance increment the overall cruft score using the following buckets:

Bucket   Complexity   Copy-Paste (tokens)   Points
Green    0-5          0-50                  0
Yellow   6-10         51-100                1
Orange   11-20        101-200               5
Red      21+          201+                  10

For example, every method with a cyclomatic complexity between 11 and 20 adds 5 points to the raw cruft score, and every block of 201 or more tokens that was copied and pasted somewhere else adds 10 points. In this way, more egregious violations are counted as more "crufty" than less egregious ones. Note that the bucket ranges for complexity were based on some investigation of industry standards (see here, here, and here), and the ranges for copy-paste were based on my own analysis of many code bases.
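
As a sketch, the bucketing amounts to something like the following (the method names are mine, and Cruft4J itself is written in Groovy, so the actual source will differ):

class CruftBuckets {

    // Points added to the raw cruft score for one method's complexity
    static int complexityPoints(int cyclomaticComplexity) {
        if (cyclomaticComplexity <= 5)  return 0;   // Green
        if (cyclomaticComplexity <= 10) return 1;   // Yellow
        if (cyclomaticComplexity <= 20) return 5;   // Orange
        return 10;                                  // Red
    }

    // Points added for one copied-and-pasted block, by token count
    static int copyPastePoints(int duplicatedTokens) {
        if (duplicatedTokens <= 50)  return 0;      // Green
        if (duplicatedTokens <= 100) return 1;      // Yellow
        if (duplicatedTokens <= 200) return 5;      // Orange
        return 10;                                  // Red
    }

    public static void main(String[] args) {
        System.out.println(complexityPoints(15));   // 5 (Orange)
        System.out.println(copyPastePoints(250));   // 10 (Red)
    }
}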

Now, you might ask, wouldn't this give an unfair advantage to smaller code bases, since, by virtue of having less source code, they will have fewer instances of complexity and copy-paste violations? Yes! And to compensate for this, the raw cruft score is then scaled by dividing by lines of code (and then multiplying by 1,000 to get a nice round number). So a sample score would be calculated like so:

Complexity:

  Yellow        16 x 1  = 16
  Orange         2 x 5  = 10
  Red            4 x 10 = 40
  Raw total               66
  / 10,000 LOC            .007
  x 1,000               = 7

Copy-Paste:

  Yellow        33 x 1  = 33
  Orange         1 x 5  =  5
  Red            1 x 10 = 10
  Raw total               48
  / 10,000 LOC            .005
  x 1,000               = 5

Adding the complexity and copy-paste scores together gives an overall Cruft score of 12 - which would be exceptionally good when you compare it to open source code (see benchmarks).
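
To make the arithmetic concrete, here is a minimal sketch that reproduces the worked example above (the per-component rounding is my reading of the example, not verified against the actual source):

class CruftScoreExample {

    // Scale a raw point total by lines of code, then multiply by 1,000
    static long scaledScore(int rawPoints, int linesOfCode) {
        return Math.round(rawPoints * 1000.0 / linesOfCode);
    }

    public static void main(String[] args) {
        int loc = 10_000;

        // Complexity: 16 Yellow, 2 Orange, and 4 Red instances
        int complexityRaw = 16 * 1 + 2 * 5 + 4 * 10;   // = 66
        // Copy-paste: 33 Yellow, 1 Orange, and 1 Red instances
        int copyPasteRaw  = 33 * 1 + 1 * 5 + 1 * 10;   // = 48

        long score = scaledScore(complexityRaw, loc)   // = 7
                   + scaledScore(copyPasteRaw, loc);   // = 5
        System.out.println(score);                     // prints 12
    }
}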

In the end, it's obvious that the score isn't air-tight or perfect, but that's not the point. The goal is just to generate a number that starts the conversation about code quality. "Our system scored a 115, but the industry average is 50...something's going on here." With this number, you can hopefully begin to make the case to management, architects, or other developers for best practices, like code reviews, continuous development, or even further investigation with more comprehensive tools like Sonar. That's the goal, at least, and I hope it makes even a small, positive difference - because as a developer in industry for 15 years, I know first-hand the pain of maintaining crufty code!

License

Cruft4J is licensed under a "BSD-style" license:

Copyright (c) 2013, Ben Northrop
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met: 

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer. 
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution. 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies, 
either expressed or implied, of the FreeBSD Project.








Comments (2)



Carl Lajeunesse June 06, 2013
Nice tool - it gives us a quick and easy score to go by.

But when I run it on my whole project, I get an out-of-memory exception when Cruft4J tries to generate the copy-paste table. :(

At least I can run it on each sub-package.

Thanks

Ben June 07, 2013
@Carl - Thanks for the heads-up. I know the PMD CPD process is pretty intensive. Wondering if I can optimize a bit - will update the site if I can. The largest code base I've run against was about 400k lines of code. Anyway, more soon. Thanks again!