Why are functions in Emacs so large?

Question

I am a beginner in ELisp, but have programmed in C++ and a number of other programming languages before. My rule of thumb (and I think it is a common one) that a function should fit on the screen. However, it should preferably be even smaller (20 lines is on the long side). However, I notice that Emacs has functions that are very long, with lots of opportunities for decomposition. An example of what I mean follows.

The question is: Is there something I am missing with regards to the difference between programming in ELisp and programming in procedural languages? Are those who wrote Emacs bad programmers? (hard to believe) Other reasons?

 (defun forward-paragraph (&optional arg)
  "Move forward to end of paragraph.
With argument ARG, do it ARG times;
a negative argument ARG = -N means move backward N paragraphs.

A line which `paragraph-start' matches either separates paragraphs
\(if `paragraph-separate' matches it also) or is the first line of a paragraph.
A paragraph end is the beginning of a line which is not part of the paragraph
to which the end of the previous line belongs, or the end of the buffer.
Returns the count of paragraphs left to move."
  (interactive "^p")
  (or arg (setq arg 1))
  (let* ((opoint (point))
     (fill-prefix-regexp
      (and fill-prefix (not (equal fill-prefix ""))
           (not paragraph-ignore-fill-prefix)
           (regexp-quote fill-prefix)))
     ;; Remove ^ from paragraph-start and paragraph-sep if they are there.
     ;; These regexps shouldn't be anchored, because we look for them
     ;; starting at the left-margin.  This allows paragraph commands to
     ;; work normally with indented text.
     ;; This hack will not find problem cases like "whatever\\|^something".
     (parstart (if (and (not (equal "" paragraph-start))
                (equal ?^ (aref paragraph-start 0)))
               (substring paragraph-start 1)
             paragraph-start))
     (parsep (if (and (not (equal "" paragraph-separate))
              (equal ?^ (aref paragraph-separate 0)))
             (substring paragraph-separate 1)
           paragraph-separate))
     (parsep
      (if fill-prefix-regexp
          (concat parsep "\\|"
              fill-prefix-regexp "[ \t]*$")
        parsep))
     ;; This is used for searching.
     (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
     start found-start)
    (while (and (< arg 0) (not (bobp)))
      (if (and (not (looking-at parsep))
           (re-search-backward "^\n" (max (1- (point)) (point-min)) t)
           (looking-at parsep))
      (setq arg (1+ arg))
    (setq start (point))
    ;; Move back over paragraph-separating lines.
    (forward-char -1) (beginning-of-line)
    (while (and (not (bobp))
            (progn (move-to-left-margin)
               (looking-at parsep)))
      (forward-line -1))
    (if (bobp)
        nil
      (setq arg (1+ arg))
      ;; Go to end of the previous (non-separating) line.
      (end-of-line)
      ;; Search back for line that starts or separates paragraphs.
      (if (if fill-prefix-regexp
          ;; There is a fill prefix; it overrides parstart.
          (let (multiple-lines)
            (while (and (progn (beginning-of-line) (not (bobp)))
                (progn (move-to-left-margin)
                       (not (looking-at parsep)))
                (looking-at fill-prefix-regexp))
              (unless (= (point) start)
            (setq multiple-lines t))
              (forward-line -1))
            (move-to-left-margin)
            ;; This deleted code caused a long hanging-indent line
            ;; not to be filled together with the following lines.
            ;; ;; Don't move back over a line before the paragraph
            ;; ;; which doesn't start with fill-prefix
            ;; ;; unless that is the only line we've moved over.
            ;; (and (not (looking-at fill-prefix-regexp))
            ;;      multiple-lines
            ;;      (forward-line 1))
            (not (bobp)))
        (while (and (re-search-backward sp-parstart nil 1)
                (setq found-start t)
                ;; Found a candidate, but need to check if it is a
                ;; REAL parstart.
                (progn (setq start (point))
                   (move-to-left-margin)
                   (not (looking-at parsep)))
                (not (and (looking-at parstart)
                      (or (not use-hard-newlines)
                      (bobp)
                      (get-text-property
                       (1- start) 'hard)))))
          (setq found-start nil)
          (goto-char start))
        found-start)
          ;; Found one.
          (progn
        ;; Move forward over paragraph separators.
        ;; We know this cannot reach the place we started
        ;; because we know we moved back over a non-separator.
        (while (and (not (eobp))
                (progn (move-to-left-margin)
                   (looking-at parsep)))
          (forward-line 1))
        ;; If line before paragraph is just margin, back up to there.
        (end-of-line 0)
        (if (> (current-column) (current-left-margin))
            (forward-char 1)
          (skip-chars-backward " \t")
          (if (not (bolp))
              (forward-line 1))))
        ;; No starter or separator line => use buffer beg.
        (goto-char (point-min))))))

    (while (and (> arg 0) (not (eobp)))
      ;; Move forward over separator lines...
      (while (and (not (eobp))
          (progn (move-to-left-margin) (not (eobp)))
          (looking-at parsep))
    (forward-line 1))
      (unless (eobp) (setq arg (1- arg)))
      ;; ... and one more line.
      (forward-line 1)
      (if fill-prefix-regexp
      ;; There is a fill prefix; it overrides parstart.
      (while (and (not (eobp))
              (progn (move-to-left-margin) (not (eobp)))
              (not (looking-at parsep))
              (looking-at fill-prefix-regexp))
        (forward-line 1))
    (while (and (re-search-forward sp-parstart nil 1)
            (progn (setq start (match-beginning 0))
               (goto-char start)
               (not (eobp)))
            (progn (move-to-left-margin)
               (not (looking-at parsep)))
            (or (not (looking-at parstart))
            (and use-hard-newlines
                 (not (get-text-property (1- start) 'hard)))))
      (forward-char 1))
    (if (< (point) (point-max))
        (goto-char start))))
    (constrain-to-field nil opoint t)
    ;; Return the number of steps that could not be done.
    arg))

I don't feel that your `forward-paragraph` example is a *large* function; and it is certainly a matter of opinion and taste. And I don't believe that your rule of thumb that every function should fit in 20 lines is universal (or even a good one). I have seen in several languages functions of many hundreds of lines (and the code still being readable and well written) — Basile Starynkevitch, May 19 '17 at 11:39
@BasileStarynkevitch Don't you think that the readability of `forward-paragraph` could be improved drastically? For example, handling the forward and the backward cases in two loops is a suspect for code duplication. In any case, the structure would be greatly clarified by handling these cases in separate functions. Also, 20 lines is quite lenient. I recall that I read a book (it could be "The C++ Programming Language" by B. Stroustrup) that argued for 7-10 lines. — AlwaysLearning, May 19 '17 at 11:51
Length of named code blocks might be a good metric. We know that at more lines of code, more potential errors. However if you use this metric, you should consider your thresholds independently for each language, because languages expressiveness varies. - It could be possible to improve the presented code. Yet there could be other concerns such as performance, encapsulation, namespace pollution, or version history readability. Or most likely time has not been put into it because it works and they rather work on fixing something or adding something new, or have a life. Btw idk Lisp. — Theraot, May 19 '17 at 12:27
@Theraot This could be a reply. I especially appreciate the fact that you gave validity to the question, unlike those who down-vote without even leaving a comment. — AlwaysLearning, May 19 '17 at 12:55
There is no need to search for hidden reasons why this function is so large. I am pretty sure the true reason is the same reason as why there are so many systems with a [big ball of mud](https://blog.codinghorror.com/the-big-ball-of-mud-and-other-architectural-disasters/) architecture - functions, systems and programs grow over time, and are often maintained by devs who do not value refactoring and clean code enough (just look at the reply you got above - do I have to say more)? — Doc Brown, May 19 '17 at 19:08
Note you'll also find lots of C and C++ code with huge functions, this isn't tied to the language nor the project. Though those projects with a longer history tend to have such "book-length" functions. — Daniel Jour, May 20 '17 at 02:17
Whether this question is opinion-based or not is opinion-based. Is it a bad question? Why so many down-votes? — AlwaysLearning, May 20 '17 at 18:28
The code seems to be going through several major steps in sequence. If these steps are not individually reusable by other functions, there is no reason to make them available that way. Using separate functions will require communication of all of the state which those pieces currently have "under one lexical roof" (unless local functions are used: "let over lambda"). **Note that the steps mutate some shared variables, so a simple refactoring with pass-by-value arguments isn't directly applicable**. We have to think about how the state changes are communicated out of the functions. — Kaz, Jan 13 '18 at 02:13
@Kaz Short functions are much better for maintainability. The rule of not having functions that do not fit on screen is well known and accepted. For example, in "The C++ Programming Language" (the C++11 edition), B. Stroustrup explains this at length and says that his ideal is functions that are 7 lines long on average. I am wondering whether Emacs Lisp is different for some reason. Shared context is not a reason good enough for causing a mild (well, depending on the length of the function) headache to future maintainers of the code. — AlwaysLearning, Jan 13 '18 at 18:22

score 3 · Accepted Answer · edited May 23 '17 at 11:33

Length of named code blocks might be a good metric. We know that at more lines of code, more potential errors^※1. This is easily understood as each line of code being a possible point of failure.

In fact, bugs per lines of code (for a particular team) is a very stable metric^※2. The main factors for this metric are developer experience and tool selection^※3. Yet, developer experience does not have sudden changes (diregarding changes of the team) and the tool selection change with low frequency.

However if you use this metric, you should consider your thresholds^※4 independently for each language, because languages expressiveness varies (For example, you will not be able to do as much in 10 lines of Assembly as you do in 10 lines of Python).

Regarding improving the presented code. It must be possible (guessing that there is room for improvement is a good heuristic). Yet, I will not be able to tell what improvements would be good, because I am not familiar with Lisp.

Why haven't they improved this code?

There could be other concerns aside from the number of lines of code, such as:

Performance: Assuming that refactoring introduces function calls, it might have a noticeable impact on performance, depending on language implementation. This usually is not a problem, but it might be for performance sensitive software (such as video games, real times systems, and software meant to run in older hardware).
Version history readability: If we start moving code around, it will make it harder to track in version control. This will it make it hard to find where they introduced a bug. If this is a concern for the team, they will discourage refactoring that does not fix a bug or add new functionality. One example is the project Mono, which includes in its guidelines:

In general, we do not accept patches that merely shuffle code around, split classes in multiple files, reindent the code or are the result of running a refactoring tool on the source code. (...) it destroys valuable history that is often used to investigate bugs, regressions and problems.
Opportunity Cost: Instead of refactoring code to reduce number of lines, it is better to use the time and resources of the team in improving the software. That means fixing something or adding something new to the project. In addition, when the team are not busy doing that… well, they are human beings and they have a life.

Also these, but not particularly in Lisp^※5:

Encapsulation: Depending of the language and platform it might not be possible to break the code into function without exposing them. If this may allow clients (perhaps via reflection, audit mechanisms or other trickery) to leave the system in an invalid state, cause it to stop working, or access sensitive information, then you have a security problem.
Namespace pollution: The problem might not be breaking the function safety or efficiently, but the amount of new functions we would introduce if we do this. In addition, they need sensible names. Coming up with names can be problematic in out of it self (in particular if the platform has a max length for identifiers, looking at you Oracle ¬¬), aside from that it may also affect performance in an interpreted implementations of the language.

^※1: El Emam et al (2001): The confounding effect of class size on the validity of object-oriented metrics

^※2: Hatton, Leslie (1995): Computer Programming Languages and Safety-Related Systems

Why are functions in Emacs so large?

1 Answers1