
I am practicing algorithms and data structures, so I keep profiling my programs. Here is an example output from gprof:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 29.76      2.58     2.58                             _IO_vfscanf_internal
 14.30      3.82     1.24                             ____strtof_l_internal
 11.71      4.84     1.02                             memcpy
  7.84      5.52     0.68  8400000    80.95    80.95  insertorupdate
  5.94      6.03     0.52                             _int_malloc
  5.77      6.53     0.50                             round_and_return
  3.81      6.86     0.33                             _IO_sputbackc
  3.23      7.14     0.28                             strlen
  2.88      7.39     0.25                             __strcmp_ia32
  2.13      7.58     0.19                             __isoc99_scanf
  2.13      7.76     0.19                             fgets
  2.02      7.94     0.18                             _IO_getline_info
  1.27      8.05     0.11                             __mpn_lshift
  1.21      8.15     0.11                             __memchr_sse2_bsf
  0.87      8.23     0.08                             malloc
  ... rest ...

But I find it hard to understand what this means. For example, my program spends a lot of time in standard library functions, and I guess that should be good, right? Because if one of my own functions were consuming most of the time, that could mean I was doing something wrong. But as I say, I am not sure I am interpreting this correctly, so any suggestions or pointers on reading this output are welcome.

einpoklum
yasar

4 Answers


It's neither good nor bad.

Imagine I tell you that I spent 45 minutes travelling from point A to point B. Is that fast? Is that slow? Did I spend too much time, or not? Until I say what point A is, what point B is, and what my speed expectations are, you can't tell anything.

Profiler results are similar. They don't tell you that something is absolutely bad, or that your code is perfect performance-wise. They just show you raw numbers and leave it to you to interpret and study them.

You may have a perfectly valid case where one of your methods spends a large amount of time doing something because, well, it is expected to: it has lots of work to do. Or it may be a sign that you must optimize this part of the code, because you expected the method to be at least a thousand times faster than it actually is.

Profiling helps by giving you hints. SQL profilers are perhaps the easiest to understand in this respect:

  • If you see that when loading a web page, your web application sends the exact same query with the exact same parameters ten times, there is probably something wrong in your code, because you should do the query only once.

  • If you see that when loading a web page, a simple query takes 1.2 seconds while you expected it to take a few milliseconds, there is a good chance that you or your colleagues forgot an index on a table or did something wrong in the query.

In the same way, code profilers:

  • Show that the same method is called again and again: sometimes, that's what you want. Sometimes, it may indicate a bug.

  • Show the time spent by a method. Depending on the case, either it's expected, or there is a bug.

This means that the purpose of a profiler is not to tell you whether your code is good or bad (that is what requirements and acceptance testing are for), but to identify the bottlenecks when you fail to meet one of the non-functional requirements related to performance.

Arseni Mourzenko

In this case, what it's telling you is that (as it's currently written) your program is spending most of its time doing I/O. The top two functions both appear to be involved in simply reading data from a file.

The part that would (or at least might) be somewhat disturbing is memcpy being the third highest, consuming almost 12% of the time. That may also be a more or less unavoidable part of reading input, but if you're calling it directly, you might want to look at where and how much you're using it, to see whether you can leave data in place more often and (for example) rearrange pointers to it rather than moving blocks of data around. Keep in mind, though, that even at best this will improve overall speed by less than 12%.
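The pointer-rearranging idea might look like this rough sketch (a hypothetical `Record` type, not anything from the original program): "moving" a large record by copying touches every byte, while swapping pointers touches only two words.

```cpp
#include <cassert>
#include <cstring>
#include <utility>

struct Record { char payload[4096]; };  // hypothetical large record

// Copying version: three 4 KiB memcpy calls per swap.
void swap_by_copy(Record& a, Record& b) {
    Record tmp;
    std::memcpy(&tmp, &a, sizeof(Record));
    std::memcpy(&a, &b, sizeof(Record));
    std::memcpy(&b, &tmp, sizeof(Record));
}

// Pointer version: the data never moves; only two pointers change.
void swap_by_pointer(Record*& a, Record*& b) {
    std::swap(a, b);
}
```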

As far as your data structures and algorithms go, this is telling you they're (apparently) fast enough to contribute little to the overall run time. The major place to look is the I/O. From the looks of things, most of the time is spent in a call to fscanf (or one of its cousins) doing floating-point conversion. If you were concerned with overall speed (not just the algorithms you wrote), an obvious possibility would be to store that data in binary format, avoiding the string-to-binary conversion entirely.
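The binary-format idea can be sketched roughly like this (hypothetical helper functions, not the original program's code): the text reader pays for a string-to-float conversion on every value, while the binary reader just copies bytes into place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Text version: strtod-style conversion runs for every value.
std::vector<double> read_text(const char* path) {
    std::vector<double> v;
    FILE* f = std::fopen(path, "r");
    if (!f) return v;
    double x;
    while (std::fscanf(f, "%lf", &x) == 1)
        v.push_back(x);
    std::fclose(f);
    return v;
}

// Binary version: raw bytes, no parsing or conversion at all.
std::vector<double> read_binary(const char* path) {
    std::vector<double> v(1024);  // capacity assumption for the sketch
    FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::size_t n = std::fread(v.data(), sizeof(double), v.size(), f);
    v.resize(n);
    std::fclose(f);
    return v;
}
```

The trade-off is that the binary file is no longer human-readable and is tied to the machine's float representation, which is usually acceptable for intermediate data you generate yourself.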

Jerry Coffin

If you're working with algorithms and trying to show big-O behavior, it's better to count basic operations, such as comparisons or memory allocations, than to measure wall-clock time.
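A minimal sketch of that idea, using a hypothetical global counter to instrument a linear search: run it at several input sizes and the counts grow in proportion to n, which exposes the O(n) behavior directly without any timer noise.

```cpp
#include <cassert>
#include <cstddef>

static long comparisons = 0;  // hypothetical instrumentation counter

// Linear search that counts every key comparison it performs.
long linear_search(const int* a, std::size_t n, int key) {
    for (std::size_t i = 0; i < n; ++i) {
        ++comparisons;
        if (a[i] == key) return static_cast<long>(i);
    }
    return -1;  // not found
}
```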

What you're seeing in your gprof output is that your program spends most of its time reading input. And that's just unblocked CPU time: it doesn't count time blocked on I/O, even though you are still waiting for it. (I/O is also CPU time, just on a different CPU.) If it counted blocked I/O time as well, you would see an even larger fraction in that code.

gprof estimates how much time each routine held the program counter, and it counts how many times each routine was called from each other routine. From that it tries to figure out how much time each routine was responsible for, in the sense of either spending time itself or causing time to be spent. The way it calculates this is riddled with problems, as the original authors freely admitted, and it has basically stayed that way for 30 years. It is venerable, but it is not going to tell you much that you can depend on.

Mike Dunlavey

Generally I find CPU sample-based profiling useless if the profiler doesn't include a way to get from callee to caller. Your top hotspots are often not the functions that actually need to be optimized; it's the functions calling them, one or two levels down.

As a recent example, I had a session where the top hotspot was a comparison function between a couple of integers. There's hardly any value in trying to optimize that. But when I drilled down to the caller, I found it was a qsort call in a part of the system that was invoked frequently. Replacing the qsort with C++'s std::sort reduced the time for the total user-end operation from 4 seconds to 2, and switching to a radix sort then reduced it from about 2 seconds to 1.2 seconds.
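The first swap described above can be sketched like this (generic integer sorting, not the original codebase's code): qsort must call the comparison through a function pointer on every compare, while std::sort's comparator can be inlined, which is where much of the win typically comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// qsort comparator: invoked through a function pointer on every compare.
static int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_with_qsort(std::vector<int>& v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
}

void sort_with_std(std::vector<int>& v) {
    std::sort(v.begin(), v.end());  // comparison inlined, no indirect call
}
```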

Another example I ran into in past codebases was seeing malloc as one of the top hotspots. It would be extremely counter-productive to conclude from that that you need a faster general-purpose memory allocator. Instead the question should be why the codebase is performing so many heap allocations; typically they'll be tiny ones if malloc is a huge hotspot. You can then drill down to the caller that spends the most time allocating through malloc and see whether you can, say, allocate big blocks of memory in advance and pool the pre-allocated memory. At other times it might be as simple as using the stack in the common case, when the data fits, and only falling back to the heap in the rare cases when it doesn't.
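The stack-for-the-common-case idea can be sketched like this (a hypothetical function with made-up per-element work): the fixed local buffer handles small inputs with no malloc at all, and the heap is touched only when the data doesn't fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Doubles each input value into a scratch buffer and sums the results.
// Small inputs use the stack; only large ones pay for a heap allocation.
int sum_doubled(const int* src, std::size_t n) {
    int small[64];                     // common case: no malloc at all
    int* buf = (n <= 64) ? small
                         : static_cast<int*>(std::malloc(n * sizeof(int)));
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = src[i] * 2;           // stand-in for real per-element work
        total += buf[i];
    }
    if (buf != small) std::free(buf);  // free only the rare heap path
    return total;
}
```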

Anyway, to interpret profiling results, you often need to be able to drill down from callee to caller. The real culprit is usually not the "leaf" callee but one or two levels down the call stack.