How to analyze and understand the use/application of a "class" in a colossal million-line legacy code base?

Question

I am working on a huge code base (more than a million lines of code with a sophisticated architecture) written in C++ over the span of a couple of decades. My current task is to understand the use of a specific class whose functionality is unknown to almost every developer on the team. The reason? As I mentioned, the code has been in development for decades and has gone through major changes, upgrades, and so on, so you can imagine how messy a million lines of code developed by hundreds of developers can get.

I need to analyse and understand the structure and utility of a file called CLASS_inc.hxx.

Here are the details of my challenge:

A class called A_CLASS is declared in the header file CLASS_inc.hxx with all its member functions. The members of this class are called in a couple of different parts of the code using scope resolution, CLASS::member_function (it's more complicated than that, but I'm simplifying; you can also see a simplified snippet of the code below). I could determine that some of the member functions are completely useless: I ran the command grep -rwin member_function in the src directory of the code, which returned no trace of member_function anywhere, because it is declared but never called. So I deleted these useless member functions, compiled my code, and ran the test cases (there is a huge test base in the code), and all tests passed without problems. Now here comes the challenging part: the remaining member functions, around 70% of the original ones, are called in other functions in the code, and I have no idea how to understand what they do!

So is there any methodology or tool or strategy in such cases to attack such problems?

For those of you who might suggest "read the documentation", "read the comments in the code", or "try to understand from the names of the member functions or the class": there are absolutely no comments in the code, and the names of the variables, classes and functions don't suggest anything (for instance, one member function is called LADP).

Here is a simplified snippet of the code just to give you an idea, this is our CLASS_inc.hxx

#ifndef CLASS_inc_included
#define CLASS_inc_included
#include "blabla1.hxx"
#include "blabla2.hxx"

namespace CLASS_inc

//    COMMON CLASS : VARIOUS WORK VARIABLES

class A_CLASS : public A_Base
{
 public:
  A_CLASS();
  void constructor();
  ~A_CLASS();
  void destructor()
  {
    delete[] _container_of_double;
    _container_of_double = NULL;
  }
  const double& LADP_get() const
  {
    return *_LADP;
  }
  const double& LADC_get() const
  {
    return *_LADC;
  }
.
.
.
.
.
 private:
  double* _container_of_double;
  double* _LADP;
  double* _LADC;
.
.
.
};
extern A_CLASS* _CLASS;
.
.
.

And then the members of the above class somewhere in the code in other functions are called as the following example:

&CLASS_inc::CLASS().LADP_set()

The above snippets are a very simplified depiction of the code, but the pattern is the same.

I'm working on Ubuntu 20.04 and the code is in C++.

Does this answer your question? [I've inherited 200K lines of spaghetti code -- what now?](https://softwareengineering.stackexchange.com/questions/155488/ive-inherited-200k-lines-of-spaghetti-code-what-now) — Doc Brown, May 04 '23 at 20:36
Worth reading: https://stackoverflow.blog/2022/08/15/how-to-interrogate-unfamiliar-code/ — Ben Cottrell, May 04 '23 at 22:07
I'm not convinced the proposed duplicate is, in fact, a duplicate. It might be 200k lines of code (or a million), but limiting this to analyzing a single class is an interesting twist. — Greg Burghardt, May 05 '23 at 02:21
Do I correctly understand that they are static member functions (based on how they are called) and there are no non-static data members in the class? In that case the class may be a "Utility" class for people who don't like "free functions", and thus some of the functions may be totally unrelated to the others. — Hans Olsson, May 05 '23 at 06:34
A good IDE, such as JetBrains' CLion, might be able to help. I'm not sure how good CLion is at analysing and indexing code (as that's harder to do in C++ than in Java, for example), especially for a project of this magnitude. But I'd hope that once it has completed indexing, you can simply do 'Find usages' on each of the member functions in the class and see where they are called. — tjalling, May 05 '23 at 12:23
Adding to what @tjalling said, your IDE might also give you some insights on this specific file's VCS history. Maybe you can extract some information from commit messages or the history itself. For example what steps were taken to write the class over time, or when certain parts of the code were added, or what certain parts of the code were changed etc. — QBrute, May 06 '23 at 17:01
Never underestimate the power of adding some temporary `printf()` calls to the methods you are interested in, and then running the program to see when they get called and with what arguments. (You can even go a bit further and use `backtrace_symbols()` to generate and print a stack trace when they are called, so you can see the entire call stack that led to the execution.) — Jeremy Friesner, May 07 '23 at 05:46
I'm surprised that I didn't see a reference to [Working Effectively with Legacy Code](http://www.amazon.co.uk/Working-Effectively-Legacy-Robert-Martin/dp/0131177052) - Lots of great patterns and references that can be used to do various refactors that don't change the behavior of the code, but allow the introduction of 'seams' for adding tests. — Cinderhaze, May 08 '23 at 14:44
This isn't specific to classes but applies to *any* part of a large code base. — user253751, May 10 '23 at 11:22
Try Doxygen (www.doxygen.nl). It is most often used to generate documentation from markup in the source code, but it does a great job of creating very useful information about the source even when there is no markup. — Roger House, May 12 '23 at 01:54

candied_orange · Answer 1 · 2023-05-05T15:47:17.250

36

One overriding pattern has emerged in every interaction I've ever had with technology: I learn more when it breaks.

So break it.

I mean, this is software. You can't hurt it. Sneak a copy off somewhere that no one will care about and abuse the heck out of it. Make this class produce little tattletale messages that make it easy to track what's going on. Dump debugging output when it gets called. Take a peek at the stack and see what called you. Send back nonsense and watch where the nonsense goes.
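As a rough illustration of the tattletale idea, here is a minimal sketch that assumes Linux with glibc (which matches Ubuntu 20.04). The names `tattle`, `tattle_counts`, `TATTLE` and `A_CLASS_sandbox` are hypothetical and not part of the code base in the question; `backtrace` and `backtrace_symbols_fd` come from glibc's `<execinfo.h>`.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc-specific)
#include <unistd.h>    // STDERR_FILENO

#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical helper: counts how often each instrumented function is entered.
inline std::map<std::string, int>& tattle_counts() {
  static std::map<std::string, int> counts;
  return counts;
}

// Prints the function name and the current call stack to stderr.
inline void tattle(const char* function_name) {
  assert(function_name != nullptr);
  ++tattle_counts()[function_name];
  std::fprintf(stderr, "[tattle] %s called; stack:\n", function_name);
  void* frames[32];
  const int depth = backtrace(frames, 32);
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);
}

// Drop this on the first line of any member function under investigation.
#define TATTLE() tattle(__func__)

// Stand-in for the real class; only the instrumentation pattern matters.
class A_CLASS_sandbox {
 public:
  const double& LADP_get() const {
    TATTLE();
    return ladp_value_;
  }

 private:
  double ladp_value_ = 0.0;  // placeholder for whatever the real member points at
};
```

Compiling with `-g` and linking with `-rdynamic` lets glibc print function names instead of bare addresses. Since this lives in a sandbox copy, the macro can be removed again without a trace once the class fits in your head.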

Give things better names as you think of them. Break things into smaller things. Write tests that show how much you broke it. If there are no comments to show the author's thinking, then add comments that show what you're thinking: "I have no idea what this function does, but without it the GUI won't load."
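One way to write tests that show how much you broke it is to pin down what the code does today, before you understand why. The sketch below assumes a hypothetical stand-in function, `legacy_mystery`; in the real sandbox you would call the undocumented member function instead and copy the expected values from a run of the unmodified code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for an undocumented legacy function.
double legacy_mystery(double x) { return 2.0 * x + 1.0; }

// Floating-point comparison with a small absolute tolerance.
bool nearly_equal(double a, double b) { return std::fabs(a - b) <= 1e-12; }

// Pins current behavior: the expected values record what the code
// does now, not what it is supposed to do.
void characterize_legacy_mystery() {
  assert(nearly_equal(legacy_mystery(0.0), 1.0));
  assert(nearly_equal(legacy_mystery(1.5), 4.0));
  assert(nearly_equal(legacy_mystery(-2.0), -3.0));
}
```

If a later experiment in the sandbox trips one of these assertions, you have learned that the change altered observable behavior, which is exactly the kind of breakage worth investigating.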

Just be aware that what you get out of this mostly happens in your head. Sandboxes are fun to play in but usually don't produce useful artifacts. But it may inform your more typical work.

edited May 05 '23 at 15:47

answered May 04 '23 at 21:22

candied_orange


“Give things better names”… “Break things into smaller things. Write tests”… “add comments”… “usually don't produce useful artifacts” — Don't good names, refactored code, tests, and comments count as useful artifacts? They sound like good results to me! – gidds May 05 '23 at 12:46
@gidds in a sandbox you enjoy a lot a freedom that you don’t when doing the more typical work. If your new names and comments can survive that transition great. But spend some time doing this just for yourself until this object fits in your head before worrying about explaining it to others. – candied_orange May 05 '23 at 13:32
If the OP knows that no one else at his company knows what the class does, then once they figure out what part of the code does, adding comments to the code (even with question marks) could always be helpful to someone else down the line. Renaming functions in production is a no-go, but comments and even setting up alias functions with verbose names are nearly always appropriate. Comments are awesome!...... Present comment excluded – mpag May 06 '23 at 19:13

score 13 · Answer 2 · answered May 05 '23 at 03:00

First, you need a goal when analyzing a class. If you don't have this, you have no idea when to stop. And with a million-line codebase, you could go on forever. Since we don't often read code for the sake of reading code, presumably you need to make changes to the class. Keep this goal in mind as you trace through the code.

Knowing where in the codebase these functions get called is good. The biggest challenge I've had analyzing a single class is understanding the use cases that are impacted if you make changes.

You haven't specified much about the application, but generally you need to identify where the major use cases of the application begin. For a web application, an HTTP request kicks things off. A GUI application will likely start a use case with some kind of event (application-generated or user-initiated). Think of the locations where use cases begin as one end of a spectrum, and where member functions of this class get called as the other end of the spectrum. Your challenge as a developer is to find the path from one end to the other.

To accomplish this you will need to:

- Understand the big-picture architecture of the application. Where does data access go? Data validations? User interaction logic? Raw business logic?
- Understand the major modules of the application.
- Determine where this application interfaces with other systems or subsystems.
- Know where this class resides within the architecture.
- Figure out if this is an algorithm class (it calculates stuff) or a coordinator (it coordinates the actions of some number of other objects).

Once you have a picture in your head of how the code is organized, think of the codebase in terms of use cases. Add logging to this class¹. Execute use cases and look for the log messages. Continue adding log messages further up the call stack until you get a meaningful picture of where this class gets used, and for what purposes. This allows you to understand the technical constraints you have when making changes to the class. For example, does it have ownership over any memory? Does it manage file handles, or allocate other resources? If you wanted to refactor the code to use dependency injection for testing purposes, does each impacted use case have an object that satisfies that dependency (and if not, how much of a pain is it to get one)?

Once you can see a plan to safely make code changes to this class, you can stop analyzing and start making those code changes. And hopefully verifying those changes with tests.

¹ A quick note about logging: keep it simple here. You don't need a robust enterprise solution. You should be playing around locally or on some machine designated for development. Hard code the silly path to the log file if necessary. Writing a simple logger should be quick and easy so you spend more time looking through code than building out logging infrastructure.
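A throwaway logger along these lines can be very small. The following is only a sketch under stated assumptions: the path `/tmp/a_class_trace.log` and the names `trace_log_path`, `trace_log`, `ScopeTrace` and `some_use_case_step` are hypothetical, chosen for illustration, and nothing here comes from the code base in the question.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <utility>

// Hard-coded on purpose: throwaway instrumentation for a local sandbox.
// The path is an assumption; point it anywhere writable.
inline const char* trace_log_path() { return "/tmp/a_class_trace.log"; }

// Appends one line to the log file. The stream is closed on return,
// so the message is on disk even if the program crashes right after.
inline void trace_log(const std::string& message) {
  std::ofstream out(trace_log_path(), std::ios::app);
  assert(out.is_open() && "cannot open trace log; fix trace_log_path()");
  out << message << '\n';
}

// RAII helper: logs when the enclosing scope is entered and left.
class ScopeTrace {
 public:
  explicit ScopeTrace(std::string name) : name_(std::move(name)) {
    trace_log("enter " + name_);
  }
  ~ScopeTrace() { trace_log("leave " + name_); }
  ScopeTrace(const ScopeTrace&) = delete;
  ScopeTrace& operator=(const ScopeTrace&) = delete;

 private:
  std::string name_;
};

// Hypothetical caller further up the stack, instrumented the same way.
inline double some_use_case_step(double input) {
  ScopeTrace trace("some_use_case_step");
  trace_log("  input=" + std::to_string(input));
  return input;  // placeholder for the real work
}
```

Putting a `ScopeTrace` at the top of each member function of the class, and then in its callers one level at a time, gives nested enter/leave lines in the log that show which use case reached the class and in what order. Reopening the file for every message is slow, which is acceptable for a local experiment but not for production code.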

Thanks a bunch for your interesting proposed methodology. I feel like I can get somewhere with this method, so I would like to learn more about it. Could you suggest some sources where I can see examples of this method in practice, please? Also, the "logging" method that you suggest seems very interesting, although it's my first time hearing about it (sorry, I'm a newbie, I know). Could you suggest some useful links or sources? I'll also try to google it myself. Thanks again. — Dude, May 09 '23 at 20:12
@Dude: I don't know if this is really a "standard practice". I don't think there is a commonly accepted pattern or practice here. And "logging" is also a very basic tool every programmer should learn (basically saving log messages to a text file). This is just good old fashioned code analysis. There are as many ways to do this as there are programmers. — Greg Burghardt, May 09 '23 at 20:42

score 9 · Answer 3 · answered May 05 '23 at 18:36

If it is available, look at the history of commits related to the class in your version control system (CVS, SVN, git, mercurial...). Commit messages might help. The context of commits (i.e. other commits nearby) may hold hints as to what features are implemented (or bugs fixed) by the changes being committed. Also, code parts added or changed together might be related in function or purpose.

answered May 05 '23 at 18:36

Pablo H


This is the correct answer. A multi-million-line code base written over decades cannot be developed without source control. And without source control and a corresponding bug database, it is impossible to know what corner case a particular line or file of code was written to address. So, if they did not keep history, they are doomed to repeat it. – jhnlmn May 06 '23 at 20:54
@Pablo H, Your proposed method seems like a very practical and useful strategy, a colleague of mine also proposed the same method. thanks a lot for your useful suggestion, I'll for sure look into it. – Dude May 09 '23 at 20:14

score 1 · Answer 4 · answered May 05 '23 at 08:18

Debug it

Put in a few breakpoints. Run some unit tests in debug mode. Check two things: first the input and output of the functions, then the stack trace, to see how you got there.

Try repeating the test a few times, each time on a broader scope. I mean that after understanding when the function was called, you should also check what the caller was doing and then go to a higher level.
