Refactoring Legacy Code - Part 10: Dissecting Long Methods with Extractions

In the sixth part of our series we talked about attacking long methods by leveraging on pair programming and viewing code from different levels. We continuously zoomed in and out, and observed both small things like naming as well as form and indentation.

Today, we will take another approach: We will assume we are alone, no colleague or pair to help us. We will use a technique called "Extract till you drop" that breaks code in very small pieces. We will make all the efforts we can to make these pieces as easy to understand as possible so the future us, or any other programmer will be able to easily understand them.

Extract Till You Drop

I first heard about this concept from Robert C. Martin. He presented the idea in one of his videos as a simple way to refactor code that is hard to understand.

The basic idea is to take small, understandable pieces of code and extract them. It doesn't matter if you identify four lines or four characters that can be extracted. When you identify something that can be encapsulated in a clearer concept, you extract. You continue this process both on the original method and on the newly extracted pieces until you can find no piece of code that can be encapsulated as a concept.

This technique is particularly useful when you are working alone. It forces you to think about both small and larger chunks of code. It has another nice effect: It makes you think about the code - a lot! Besides the extract method or variable refactoring mentioned above, you will find yourself renaming variables, functions, classes, and more.

Let's see an example on some random code from the Internet. Stackoverflow is a good place to find small pieces of code. Here is one that determines if a number is prime:

//Check if a number is prime
function isPrime($num, $pf = null)
{
    if(!is_array($pf))
    {
        for($i=2;$i<intval(sqrt($num));$i++) {
            if($num % $i==0) {
                return false;
            }
        }
        return true;
    } else {
        $pfCount = count($pf);
        for($i=0;$i<$pfCount;$i++) {
            if($num % $pf[$i] == 0) {
                return false;
            }
        }
        return true;
    }
}

At this point, I have no idea how this code works. I just found it on the Internet while writing this article, and I will discover it along with you. The process that follows may not be the cleanest. Instead, it will reflect my reasoning and refactoring as it happens, without upfront planning.

Refactoring the Prime Number Checker

According to Wikipedia:

A prime number (or a prime) is a natural number greater than 1 that has no positive divisors other than 1 and itself.

As you can see this is a simple method for a simple mathematical problem. It returns true or false, so it should also be easy to test.

class IsPrimeTest extends PHPUnit_Framework_TestCase {

    function testItCanRecognizePrimeNumbers() {
		$this->assertTrue(isPrime(1));
	}

}

// Check if a number is prime
function isPrime($num, $pf = null)
{
	// ... the content of the method as seen above
}

When we are just playing with example code, the easiest way to go is to put everything in a test file. This way we don't have to think about what files to create, in what directories they belong, or how to include them in the other. This is just a simple example to use in order to familiarize ourselves with the technique before we apply it on one of the trivia game methods. So, all goes in a test file, you can name as you wish. I've chosen IsPrimeTest.php.

This test passes. My next instinct is to add a few more prime numbers and than write another test with not prime numbers.

function testItCanRecognizePrimeNumbers() {
    $this->assertTrue(isPrime(1));
	$this->assertTrue(isPrime(2));
	$this->assertTrue(isPrime(3));
	$this->assertTrue(isPrime(5));
	$this->assertTrue(isPrime(7));
	$this->assertTrue(isPrime(11));
}

That passes. But what about this?

function testItCanRecognizeNonPrimes() {
    $this->assertFalse(isPrime(6));
}

This fails unexpectedly: 6 is not a prime number. I was expecting the method to return false. I do not know how the method works, or the purpose of the $pf parameter - I was simply expecting it to return false based on its name and description. I have no idea why it is not working nor how to fix it.

This is quite a confusing dilemma. What should we do? The best answer is to write tests that pass for a decent volume of numbers. We may need to try and guess, but at least we will have some idea about what the method does. Then we can start refactoring it.

function testFirst20NaturalNumbers() {
    for ($i=1;$i<20;$i++) {
		echo $i . ' - ' . (isPrime($i) ? 'true' : 'false') . "\n";
	}
}

That outputs something interesting:

1 - true
2 - true
3 - true
4 - true
5 - true
6 - true
7 - true
8 - true
9 - true
10 - false
11 - true
12 - false
13 - true
14 - false
15 - true
16 - false
17 - true
18 - false
19 - true

A pattern starts to emerge here. All true up to 9, then alternating up to 19. But is this pattern repeating? Try to run it for 100 numbers and you will immediately see that it is not. It is actually seems to be working for numbers between 40 and 99. It misfires once between 30-39 by nominating 35 as prime. Same is true in the 20-29 range. 25 is considered prime.

This exercise that started as a simple code to demonstrate a technique proves to be much more difficult than expected. I decided though to keep it because it reflects real life in a typical way.

How many times did you start working on a task that looked easy just to discover that it is extremely difficult?

We don't want to fix the code. Whatever the method does, it should continue doing it. We want to refactor it to make others understand it better.

As it does not tell prime numbers in a correct way, we will use the same Golden Master approach we learned in Lesson One.

function testGenerateGoldenMaster() {
    for ($i=1;$i<10000;$i++) {
		file_put_contents(__DIR__ . '/IsPrimeGoldenMaster.txt', $i . ' - ' . (isPrime($i) ? 'true' : 'false') . "\n", FILE_APPEND);
	}
}

Run this once to generate the Golden Master. It should run fast. If you need to re-run it, don't forget to delete the file before you execute the test. Otherwise the output will be attached to the previous content.

function testMatchesGoldenMaster() {
    $goldenMaster = file(__DIR__ . '/IsPrimeGoldenMaster.txt');
	for ($i=1;$i<10000;$i++) {
		$actualResult = $i . ' - ' . (isPrime($i) ? 'true' : 'false'). "\n";
		$this->assertTrue(in_array($actualResult, $goldenMaster), 'The value ' . $actualResult . ' is not in the golden master.');
	}
}

Now write the test for the golden master. This solution may not be the fastest, but it is easy to understand and it will tell us exactly what number does not match if break something. But there is a little duplication in the two test methods we could extract into a private method.

class IsPrimeTest extends PHPUnit_Framework_TestCase {

    function testGenerateGoldenMaster() {
		$this->markTestSkipped();
		for ($i=1;$i<10000;$i++) {
			file_put_contents(__DIR__ . '/IsPrimeGoldenMaster.txt', $this->getPrimeResultAsString($i), FILE_APPEND);
		}
	}

	function testMatchesGoldenMaster() {
		$goldenMaster = file(__DIR__ . '/IsPrimeGoldenMaster.txt');
		for ($i=1;$i<10000;$i++) {
			$actualResult = $this->getPrimeResultAsString($i);
			$this->assertTrue(in_array($actualResult, $goldenMaster), 'The value ' . $actualResult . ' is not in the golden master.');
		}
	}

	private function getPrimeResultAsString($i) {
		return $i . ' - ' . (isPrime($i) ? 'true' : 'false') . "\n";
	}
}

Now we can move un to our production code. The test runs in about two seconds on my computer, so it is manageable.

Extracting All We Can

First we can extract an isDivisible() method in the first part of the code.

if(!is_array($pf))
{
    for($i=2;$i<intval(sqrt($num));$i++) {
		if(isDivisible($num, $i)) {
			return false;
		}
	}
	return true;
}

That will enable us to reuse the code in the second part like this:

} else {
    $pfCount = count($pf);
	for($i=0;$i<$pfCount;$i++) {
		if(isDivisible($num, $pf[$i])) {
			return false;
		}
	}
	return true;
}

And as soon as we started to work with this code we observed that it is carelessly aligned. Braces are sometimes at the begining of the line, other times at the end.

Sometimes, tabs are used for indentation, sometimes spaces. Sometimes there are spaces between operand and operator, sometimes not. And no, this is not specially created code. This is real life. Real code, not some artificial exercise.

//Check if a number is prime
function isPrime($num, $pf = null) {
    if (!is_array($pf)) {
		for ($i = 2; $i < intval(sqrt($num)); $i++) {
			if (isDivisible($num, $i)) {
				return false;
			}
		}
		return true;
	} else {
		$pfCount = count($pf);
		for ($i = 0; $i < $pfCount; $i++) {
			if (isDivisible($num, $pf[$i])) {
				return false;
			}
		}
		return true;
	}
}

That looks better. Immediately the two if statements look very similar. But we can't extract them because of the return statements. If we don't return we will break the logic.

If the extracted method would return a boolean and we compare it to decide if we should or not return from isPrime(), that would not help at all. There may be a way to extract it by using some functional programming concepts in PHP, but maybe later. We can do something simpler first.

function isPrime($num, $pf = null) {
    if (!is_array($pf)) {
		return checkDivisorsBetween(2, intval(sqrt($num)), $num);
	} else {
		$pfCount = count($pf);
		for ($i = 0; $i < $pfCount; $i++) {
			if (isDivisible($num, $pf[$i])) {
				return false;
			}
		}
		return true;
	}
}

function checkDivisorsBetween($start, $end, $num) {
	for ($i = $start; $i < $end; $i++) {
		if (isDivisible($num, $i)) {
			return false;
		}
	}
	return true;
}

Extracting the for loop as a whole is a bit easier, but when we try to reuse our extracted method in the second part of the if we can see that it won't work. There is this mysterious $pf variable about which we know almost nothing.

It seems like it checks if the number is divisible by a set of specific divisors instead of taking all numbers up to the other magical value determined by intval(sqrt($num)). Maybe we could rename $pf into $divisors.

function isPrime($num, $divisors = null) {
    if (!is_array($divisors)) {
		return checkDivisorsBetween(2, intval(sqrt($num)), $num);
	} else {
		return checkDivisorsBetween(0, count($divisors), $num, $divisors);
	}
}

function checkDivisorsBetween($start, $end, $num, $divisors = null) {
	for ($i = $start; $i < $end; $i++) {
		if (isDivisible($num, $divisors ? $divisors[$i] : $i)) {
			return false;
		}
	}
	return true;
}

This is one way to do it. We added a forth, optional, parameter to our checking method. If it has a value, we use it, otherwise we use $i.

Can we extract anything else? What about this piece of code: intval(sqrt($num))?

function isPrime($num, $divisors = null) {
    if (!is_array($divisors)) {
		return checkDivisorsBetween(2, integerRootOf($num), $num);
	} else {
		return checkDivisorsBetween(0, count($divisors), $num, $divisors);
	}
}

function integerRootOf($num) {
	return intval(sqrt($num));
}

Isn't that better? Somewhat. It is better if the person coming after us doesn't know what intval() and sqrt() are doing, but it doesn't help make the logic easier to understand. Why do we end our for loop at that specific number? Maybe this is the question our function name should answer.

[PHP]//Check if a number is prime
function isPrime($num, $divisors = null) {
    if (!is_array($divisors)) {
		return checkDivisorsBetween(2, highestPossibleFactor($num), $num);
	} else {
		return checkDivisorsBetween(0, count($divisors), $num, $divisors);
	}
}

function highestPossibleFactor($num) {
	return intval(sqrt($num));
}[PHP]

That is better as it explains why we stop there. Maybe in the future we can invent a different formula to determine that number. The naming also introduced a little inconsistency. We called the numbers factors, which is a synonym of divisors. Maybe we should pick one and use that only. I will let you make the renaming refactoring as an exercise.

The question is, can we extract any further? Well, we must try till we drop. I mentioned the functional programming side of PHP a few paragraphs above. There are two main functional programming characteristics that we can easily apply in PHP: first class functions and recursion. Whenever I see an if statement with a return inside a for loop, like in our checkDivisorsBetween() method, I think about applying one or both techniques.

function checkDivisorsBetween($start, $end, $num, $divisors = null) {
    for ($i = $start; $i < $end; $i++) {
		if (isDivisible($num, $divisors ? $divisors[$i] : $i)) {
			return false;
		}
	}
	return true;
}

But why should we go through such a complex thought process? The most annoying reason is that this method does two distinct things: It cycles and it decides. I want it only to cycle and leave the decision to another method. A method should always do a single thing and do it well.

function checkDivisorsBetween($start, $end, $num, $divisors = null) {
    $numberIsNotPrime = function ($num, $divisor) {
		if (isDivisible($num, $divisor)) {
			return false;
		}
	};
	for ($i = $start; $i < $end; $i++) {
		$numberIsNotPrime($num, $divisors ? $divisors[$i] : $i);
	}
	return true;
}

Our first attempt was to extract the condition and the return statement into a variable. This is local, for the moment. But the code doesn't work. Actually the for loop complicates things quite a bit. I have a feeling that a little recursion will help.

function checkRecursiveDivisibility($current, $end, $num, $divisor) {
    if($current == $end) {
		return true;
	}
}

When we think about recursivity we must always start with the exceptional cases. Our first exception is when we reach the end of our recursion.

function checkRecursiveDivisibility($current, $end, $num, $divisor) {
    if($current == $end) {
		return true;
	}

	if (isDivisible($num, $divisor)) {
		return false;
	}
}

Our second exceptional case that will break the recursion is when the number is divisible. We don't want to continue. And that is about all the exceptional cases.

ini_set('xdebug.max_nesting_level', 10000);
function checkDivisorsBetween($start, $end, $num, $divisors = null) {
    return checkRecursiveDivisibility($start, $end, $num, $divisors);
}

function checkRecursiveDivisibility($current, $end, $num, $divisors) {
	if($current == $end) {
		return true;
	}

	if (isDivisible($num, $divisors ? $divisors[$current] : $current)) {
		return false;
	}

	checkRecursiveDivisibility($current++, $end, $num, $divisors);
}

This is another attempt to use recursion for our problem, but unfortunately, recurring 10.000 times in PHP leads to a crash of PHP or PHPUnit on my system. So this seems to be another dead end. But if it would have been working, it would have been a nice replacement of the original logic.

Challenge

When I wrote the Golden Master, I intentionally overlooked something. Let's just say the tests are not covering as much code as they should. Can you spot the problem? If yes, how would you approach it?

Final Thoughts

"Extract till you drop" is a good way to dissect long methods. It forces you to think about small pieces of code and to give the pieces a purpose by extracting them into methods. I find it amazing how this simple procedure, together with frequent renaming, can help me discover that some code does things I never thought possible.

In our next and final tutorial about refactoring, we will apply this technique to the trivia game. I hope you liked this tutorial that turned out to be a little bit different. Instead of talking on textbook examples, we took some real code and we had to fight with the real problems we face every day.

HIGHLIGHTS OF THE DAY