9

I write code in R, and often find myself attempting to optimize the code for better performance. In a given script that tackles a specific problem, I test different code alternatives and compare them to each other with benchmarking. At the end, I select the most performant method. However, I don't know how to document those benchmark tests.

I'll use an example to demonstrate (based on a real problem I asked about). In R, I want to write code that nests a dataframe by group. I have three possible methods I compare:

bench::mark(dplyr = mpg %>% group_by(manufacturer, year) %>% summarise(nest_cty = list(cty)),
            data.table = {MPG <- data.table(mpg); MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)] },
            collapse = mpg %>% fgroup_by(manufacturer, year) %>% fsummarise(nest_cty = list(cty)),
            check = FALSE)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        4.69ms    5.5ms      184.    2.38MB     5.56
#> 2 data.table   2.37ms   2.51ms      391.    2.16MB     0   
#> 3 collapse     95.2us  101.8us     9560.  206.56KB     6.22

The benchmarking table reveals that the third option is the most performant. So when finalizing my script, I'll choose that method. But I still want to document that I've tested different methods, and the results of the benchmarking, so future-me or collaborators could understand my choices.

How should I document this? I understand that writing such "story" in comments, inside the script, is considered a bad practice. Other option is via git commits. However, I find it too verbose to include such explanation in the description of a git commit. Furthermore, git commits have to do with tracking changes in code, but my need here is more of a metainformation about general strategy rather than specific change in code.

Christophe
  • 74,672
  • 10
  • 115
  • 187
Emman
  • 209
  • 1
  • 4
  • 4
    Does your code have actual english-language (or other-language) documentation? A section on 'theory' or 'overview' wouldn't be too unusual. – 2e0byo Nov 05 '21 at 21:26
  • @2e0byo, what I gather from your comment (correct me if you meant otherwise) is an overarching documentation about the code on the project-level. But my question is aimed at very high resolution, focusing on certain lines of codes: e.g., *"the method used in lines L25-L32 is `x` rather than `y` due to such and such benchmarking".* – Emman Nov 06 '21 at 06:26
  • 1
    Emman I meant something very like what @Christophe's answer describes, although I didn't actually know the phrase 'design document'. In any case the answers and comments cover the issues much more thoroughly than I likely could :) Before working on an opensource project, though, I really do look for a 'design' or 'implementation' or 'contribuing' or 'considerations' section and read it if it's there. Nobody has really mentioned the commit suggestions, but I (one data point!) very rarely run git blame before doing something; maybe I should. This might not be relevant to your field/usecase. – 2e0byo Nov 06 '21 at 17:13
  • You can create an alternate git branch for it. This way it is only there when you want it to be. In the end, it is up to you. There is no hard and fast rule for the process. Just ensure that, if you need it for whatever reason, it is easily obtainable. If you don't need it, it is a distraction from what you do need; hide it. – Nate T Nov 07 '21 at 00:08
  • @NateT, it's for me as well as for others. I want to back my decisions on the highest resolution, i.e., referring to specific lines of code to explain why I did `x` whereas I could've done `y` (maybe `y` is a more intuitive choice when first seeing the code). Problem is, `x` and `y` *aren't* general concepts such as *"I decided to take the `sum` rather than the `mean`"*, which is a simple thing to do using comments. Instead, `x` and `y` are actual bits of code, but it is important to show that those very specific bits of code were considered and contrasted against each other. – Emman Nov 07 '21 at 10:08
  • ... and for this reason, this "documentation" needs to be very conspicuous. One concern that I have about using git branches is that nobody will know such benchmarking even exists on a parallel universe (i.e., a separate git branch); and here I echo the point made by MartinMaat in his answer below. – Emman Nov 07 '21 at 10:16
  • 1
    Where I work, we use branches extensively, but we also have well documented standards that govern the our of branches. For this reason, in my case it makes sense to use them. If you are a part of a larger team working on this project and you are not a tech lead or project manager, then it is not your place to decide how to document this. It is the responsibility of the team coordinator to decide _ensure that everyone is on the same page_. The method doesn't matter, so long as it makes sense and everyone knows it. This is what standards are for: to ensure that everyone is on the same page. – Nate T Nov 07 '21 at 19:53
  • ...You could write a book about each one of your design decisions, but if nobody knows how to access and read them, they are useless. I suggest that you bring it up at your next scrum meeting, or the next time everyone is together. Ideally, you will be able to agree on a standard, or at least agree on a time/place to get it figured out. The result should be written out in a detailed document, which should be made freely available to all involved. – Nate T Nov 07 '21 at 19:54

4 Answers4

18

It is not bad practice to put this as a comment in code. Your goal is not to show everyone you did your homework, you want to prevent any successor to go "this is silly, I can do this more elegantly" => typer-de-type, fixed!. The only place to make this work is in the code. You can be brief, no tables with test results, just "it turned out this was about x times faster than that so I went with this solution instead of that one". Putting this in a design document is pointless, nobody will read that at the time it matters.

Martin Maat
  • 18,218
  • 3
  • 30
  • 57
  • I agree that a short sharp sentence "used x because ten times faster than y" would be an ideal comment. However, if this access strategy is used in 20 different places, would you repeat this comment 20 times just for not risking the point being missed? Anyway, the sharp sentence seems not sufficient for summarizing a benchmark and explaining the rationale: aren't comments supposed to document the current state of the code? Wouldn't historical reasons and pros and cons, benchmarks etc... clutter the code with unnecessary lengthy wording and make browsing the code unpractical? – Christophe Nov 06 '21 at 11:18
  • @Christophe A particular pattern being a corner stone of the application and being all over the place was indeed not what I had in mind. So as often "it depends". – Martin Maat Nov 06 '21 at 11:42
  • 4
    @Christophe - if the "access strategy" or whatever it is is used in 20 places then you need to use procedural abstraction to put it in _one_ place. And then, that's where you put your documentation. Once. (And then you can give it an appropriate name, too.) – davidbak Nov 06 '21 at 13:23
  • 3
    Also, you can embrace the power of "and". As in: put it in a short document (a one page note) that just describes this particular design choice (with its benchmarks) **and** leave a comment that says "performs better than alternatives attempted, see `access-foo-benchmarks.md`". Don't lose the information - someone else may discover that there's yet another alternative you haven't thought of, or a later version of the software has changed the behavior sufficiently that something you rejected now works better. But _in the code_: just a brief comment with the pointer to the doc. – davidbak Nov 06 '21 at 13:28
  • 1
    @davidbak an access strategy is a general principle and it may be too general for a procedural abstraction. The trick with the `access-foo-benchmark.md` is indeed a great idea for explanations that are not important enough for being in a design document and too lengthy to be in the comments (my point here is: too lengthy comments shall be avoided). Alternatively, if benchmarks are used to back a general technique that you may reuse across projects, you may even consider a comment that refers to `bigtable-grouping-pattern.md`. After a while OP could publish a book on "Effective R patterns" ;-) – Christophe Nov 06 '21 at 13:41
  • 1
    We can all get along. How about a URL in a comment that links to a page in an online design document? Peace.✌ – Martin Maat Nov 06 '21 at 14:49
  • A short comment reasoning about alternative code and design decisions is great and useful. But I think it is important to enphasize that one should definetely not leave commented chunks of code in the final design. – ianmandarini Nov 24 '21 at 11:46
  • “Lengthy comments should be avoided”: it’s not a comment, it’s code that is prevented from running. In C or C++ you’d write #if / #endif or /* … */. One is officially a comment. But really, both are code prevented to execute. – gnasher729 Nov 24 '21 at 21:39
6

This is the kind of information that you could document in some design document. The purpose is to keep track of important choices that were made and why.

Choices with alternatives that were seriously considered would be good candidates in such documents, especially if it's a key element in your algorithm or a general strategy/technique/pattern that you may use in several places (as your real problem suggest). This avoids loosing time reassessing over and over again the same questions.

Comments in the code should remain concise and sharp and should not distract with lengthy justification and historical reasons.

Edit: Strategies/techniques/patterns that are too specific for the general design, but worth to be known and reusable, could be explained in a separate design pattern document with all the justifications needed. As suggested by @davidbak in the comments, you may refer to this document in a concise comment, having the advantages of both approaches.

Christophe
  • 74,672
  • 10
  • 115
  • 187
  • 3
    I would regard this kind of choice an implementation detail. A design document should not become a "the making of" documentary. – Martin Maat Nov 06 '21 at 08:11
  • @MartinMaat Thank you for the opportunity to clarify. From what I read in the question, and not being myself R-literate, I am not able to judge if this is implementation detail or a key element in the algorithm. Moreover, I don't know if it's a general choice (a strategy that may be reapplied in many different places) or a purely opportunistic single-use one (I hope OP does not benchmark every single statement). This is why I clarified immediately in the second sentence "keep track of **important** choices" (and not "every choices like in a making-off"). – Christophe Nov 06 '21 at 11:06
  • 2
    @MartinMaat - There are design documents of various kinds and various _scopes_. This kind isn't the describe-the-system-detailed-design-with-section-headings-and-formal-signoffs kind. This kind is the one-pager with focus on one particular design _decision_. And the OP's issue isn't an implementation detail if it affects overall performance sufficiently that he had to (i.e., felt it worthwhile to) discover and benchmark various alternatives. It is worth a design note so successors can _know_ what was attempted (so they can try something else, or so they can see if things have changed). – davidbak Nov 06 '21 at 13:33
  • @davidbak I edited the answer to add the result of our exchanges across the different answers. I think it's an interesting alternative worth to be considered, for elements that are not important enough for being in the design document: it would allow to keep the comments clean and at the same time keep relevant knowledge that might otherwise get lost. – Christophe Nov 06 '21 at 14:01
  • @MartinMaat I agree with David: there's perhaps a misunderstanding about the kind of design documents that I mean. I meant concise long term assets, of key decisions (that are not so quickly obsolete btw) and that really help new team members. – Christophe Nov 06 '21 at 14:13
  • @Christophe, regarding your [comment](https://softwareengineering.stackexchange.com/questions/433329/how-to-document-alternative-code-i-considered-but-didnt-go-with-due-to-performa#comment952991_433337) above: when I come up with a strategy that I reapply in various places, I typically write a dedicated function for it, and call this function whenever I need it. So the documentation for that strategy needs to be written in one place. Therefore, using inline comments is a possibility. – Emman Nov 06 '21 at 16:17
  • @Emman I understand what you say. But as explained, a strategy can be something very general, like "filter first on the biggest table, then groupy by then summarize". Perhaps sometimes more than 2 tables are involved. Perhaps several filters or groups need to be applied in sequence, etc. What I mean is that perhaps, you can write procedural code in some cases, but maybe in other cases, it'll just be overkill. But now the other way round: if you'd manage to write a nice generic function, where would you inform about the benchmark? – Christophe Nov 06 '21 at 16:35
2

I would suggest a different approach. You did work benchmarking collecting numbers from which you made a decision.

If you could put that in with the rest of your code in a form that can be rerun later at will, it will allow others to see what you did, considered and the conclusion. It will also allow others to rerun the script at a later time where things might have changed to see if one of the other approaches is now more viable, instead of just having future maintainers blindly accepting an ancient conclusion (also known as https://en.wikipedia.org/wiki/Cargo_cult_programming)

  • Thanks. I don't understand your suggestion. What do you mean by `put that in a script`? The script is already there, containing the method I chose to go with. How do I express the benchmarking that led to choosing this method over others I tested? – Emman Nov 07 '21 at 09:56
  • 1
    Well, how did you do it? Write a single program that generates the output comparing the methods you showed. – Thorbjørn Ravn Andersen Nov 07 '21 at 12:07
  • Or in the very least, a script or small program that makes an assertion about how long it should take - a performance test, essentially. – Greg Burghardt Nov 08 '21 at 18:18
  • @ThorbjørnRavnAndersen very good idea. May I suggest that you update your answer to improve it and better explain what you mean by "the script" that they can rerun? (I understood what you meant, but clearly OP and others did not.) – user949300 Nov 09 '21 at 22:27
1

Git is a fine place to store this alternative functionality

Either in a feature branch or as a commit in the history

Most code quality tools discourage commented out code, comments should explain code not be code itself

Some answers here encourage a bad practice, going against industry wide and heavily scrutinized code quality tools, which makes me wonder if their rationale is extremely niche or even valid in thier own niche (had they even challenged thier opinion?)

Git itself is a favourite of mine for these exact use cases, and i also quite like the idea of adding a design document as markdown in the corresponding repo or as a pull request write up

Stof
  • 141
  • 3
  • I often create a folder called "OldStuff" or "Misc" to hold such "no longer used" code. Yes, the name could be better. Depending on you and your coworkers expertise with git (I'm not a star) this may be simpler for them than a branch or named commit. – user949300 Nov 07 '21 at 06:33
  • Long lived branches scales very badly. And how exactly would you document a single commit? – Thorbjørn Ravn Andersen Nov 08 '21 at 19:29
  • For all SO questions one assumes the OP doesn't (yet) have the expertise, or they'd not be asking. So what's your point? Do you post this comment on every SO answer @user949300 – Stof Nov 08 '21 at 21:28
  • There's no part of the question suggesting the alternative code must be long lived, and the entire premise of using git means the changes remin in the git history not the current source, ergo you may merge the branch and if you have a perception of scale challenges but i assure you that you also don't prune the git commit depth which is equal to keeping a branch but you probably have no perception of issues with that do you @Thorbjørn Ravn Andersen?? – Stof Nov 08 '21 at 21:33
  • As described, a rich documentation for the commit can be the PR, it's functionally equivalent to any other documentation method – Stof Nov 08 '21 at 21:34
  • @stof why would you spend documenting the alternative code at all if it wasn't for future maintainers? – Thorbjørn Ravn Andersen Nov 09 '21 at 12:04
  • Mind trying to write that question again? It's word soup, I don't understand the question. – Stof Nov 09 '21 at 21:41
  • @stof The only reason to document things like this is for somebody to eventually read it. Otherwise it is wasted work. In order to read it, they need to be able to find it. Long lived branches turn into a forest over time... Was it easier to understand now? – Thorbjørn Ravn Andersen Nov 11 '21 at 14:04
  • To be clear, you're saying that the PR is hard to find? The fastest way to know where a code feature came from is using gitblame, then grab the hash and look at the PR. every other method is an abstraction, a summary, and most likely an incomplete second hand piece of documentation, onlyl the commit is the truth and it take exactly 5ms to get a commit in a codebase like google that is famously the largest monorepo, so you're saying that truth and speed is not good enough? – Stof Nov 12 '21 at 01:27
  • So readers are clear, a 'forest' of branches is completely nonsensical. If you don't like using branches you're eliminating 99.999% companies using git. Only single person, hobbyist, and beginners avoid branches because they are solo and dont need to work with anyone else. I.e. you can still create a branch open a PR, commit the change, revent the unwanted code, merge the branch, delete the branch, and the PR documentation and commit is still there unless you chose to rebase and delete it on purpose. So it's unreasonable to say branches are messy when they're very clean by design if used right – Stof Nov 12 '21 at 01:38
  • So I see this code and think “it’s time critical and there is an obviously better method” (except you tried it, and it didn’t work). What are the chances that I look into git? Especially into some git branch, that hasn’t been updated for two years? – gnasher729 Nov 24 '21 at 21:42
  • what are the chances you 'use git'? what, do you use svn or cvs? sorry can't help you. If you 'use' git then use it.. don;t have git in use and avoid it's most basic functions like `git blame` and `git history`, or you might not be make it far in software developement – Stof Nov 25 '21 at 03:34