1

Recently I am using a software package which seems pretty useful, but which I don't understand how it works internally.

This package contains software generated by several PhD students and lecturers of an university among several years, which reflects in code style. There are several papers related to pieces of this software, which I have read them but still, I am not sure I understood them correctly.

I decided to look inside the source code to understand what and how does it work:
The package lacks of test, sometimes there are defined functions that the programming language has already implemented (it is R if that matters), or two functions with overlapping functionality, different ways of notifying the user (it uses message, cat or write), and functions with 30 parameters. I found also some errors in functions (using a variable name which is the same as a base function name instead of a variable name defined inside the function). Or it defines a new class but doesn't define methods for such class, thus making many lines of code duplicated.

I went on to try to understand the package and improve its coherence, I started by unifying the style and adding test (first time using tests), but since I am not sure of the expected output I am just testing against the results it gives for each option, or for the structure. After some time I would like to start modifying some functions, and check with the already written test that they don't break any.

What else should I do to improve the package and my understanding of it? How can I efficiently improve my understanding of the package?

llrs
  • 171
  • 11
  • 2
    Possible duplicate of [I've inherited 200K lines of spaghetti code -- what now?](http://softwareengineering.stackexchange.com/questions/155488/ive-inherited-200k-lines-of-spaghetti-code-what-now) – gnat Dec 11 '16 at 12:03
  • I have seen this post, (but I didn't thought this package was spaghetti code), but there are two differences 1) I am not responsible of this code 2) I would like to understand the code, not just put it in better format. I want to improve the code in order to understand the package but if there is any other way I would rather prefer it :D – llrs Dec 11 '16 at 12:21
  • 1
    There are literally whole books on this topic (Michael Feathers' *"Working Effectively with Legacy Code"* is the canonical resource), making this much too broad to cover here. But if you don't know what the code is *supposed* to do you can only test that it keeps doing what it *currently* does when refactoring. – jonrsharpe Dec 11 '16 at 13:02

1 Answers1

2

The mission

You have a nice software package that was made by different people, over a longer period (in which the R language has evolved, bringing in functions that had to be initially programmed from scratch) and which obviously doesn't follow the principle of clean code.

So it's tempting to refactor it and harmonize the code style.

The challenge

You don't fully understand how it works.

You can't can't predict expected results, as you don't fully master the subject.

The advise

It seems very ambitious to start immediately with a major refactoring under these conditions.

You should first start more modestly, by increasing your understanding. So start to build a reliable test suite for the software and its major components, without touching the internals.

Run the code with your suite, and check for any error in the test suite to understand if it's a problem in your suite, in the test data or in the current code.

Once you are there, you can consider starting the refactoring, always testing the changes with your extensive test suite. So at least you can check that you're not breaking (i.e. producing different results) the working code.

Take the opportunity to add more unit tests during the refactoring, to avoid that your successor will be in the same case as you.

Christophe
  • 74,672
  • 10
  • 115
  • 187
  • The most abundant refactoring I am doing is using `<-` instead of `=` whenever is possible or some other "minor" changes, I wouldn't say it is a major refactoring, but I could be wrong. As you mention maybe is due to changes in R and I couldn't found when a certain function was introduced (and I don't know so well the history of R to know in which version) but I also change `dim(x)[1]` for `nrow` which may break compatibility with older versions of R. I have been building the test suite and raised the coverage of 0% to 12%, but this hasn't increase my levels of understanding. – llrs Dec 11 '16 at 21:28
  • @Llopis I'm affraid there's no way round if it: if 88% of the code is not covered (so almost not undersood), your brainpower will encounter too much unknowns to build a solid understanding of these 12%. There is a critical mass that must be reached. I was rather thinking about 60% (i.e. only a minority of things to guess). Ok refactoring `=` into `<-` will do. But is it worth it ? Wouldn't it be more beneficial to get back to standard where it exist ? To deduplicate code, factorizing class methods where possible ? To rewrite functions to use less than 30 parameters ? And so on ? – Christophe Dec 11 '16 at 22:29