How the rope data structure works when doing syntax highlighting

Question

I am looking into the Rope Data Structure, used by some text editors like the xi editor.

I get the basics of how it works, and have seen some sample implementations such as here or here. But I am confused as to how you properly (optimally) handle "upgrading" a rope to be syntax highlighted. It seems to work in the following steps.

You start of with a string, not a rope, which is an array of contiguous characters.
You then convert that to a rope, maybe by breaking the string every 128 characters (I saw that recommendation somewhere). So now you have a basic, non-syntax-highlighted rope which is just a bunch of 128 character chunks concatenated together.
Then somehow you convert this rope into a syntax highlighted rope.

So it is as if you go from here (the "original" tree):

        _ 1 _
      /       \
    2           5
  /   \       /   \
 3     4     6     7
 |     |     |      |
abc   xyz   foo    bar

To something like this lets say (the "target" tree):

         _ 1 _
       /       \
     2           5
   /   \       /   \
  3     4     6     7
  |     |     |      |
 / \   xyz   foo    / \
R   G              B   R
|   |              |   |
a   bc             ba  r

where R = red, G = green, B = blue.

That is, some extra "style" nodes were inserted to give the letters color. Basically just simple syntax highlighting for demonstrating the question.

So the question is, how you treat the two different trees. I am confused about if you should be keeping the "original" tree unmodified, and then construct a new tree for the syntax highlighted version, and somehow keep a mapping between the two. Or if you should just modify the original tree to create the target tree directly from it, so you end up with only 1 tree instead of 2. But then the tricky part is if you want to then save the string back into unsyntax-highlighted form, you have to constantly iterate through the syntax-highlighted target tree and regenerate an original / basic / unsyntax-highlighted tree from it. If you are doing text editing, then this means every file save it would have to iterate through the whole syntax highlighted tree, generate a new tree, and override the file contents with the new basic trees bytes. That somehow seems ineffecient and I'm getting confused in thinking about what to do here.

Another way to look at this is if you had 2 or n views rendering the same rope. In view 1 you want unsyntax-highlighted, and view 2 you want syntax-highlighted. Not sure what to do here. If they referenced the same rope data structure, then it seems you would have to construct an additional tree for the syntax-highlighted version separate from the original tree, so it didn't mess with the original tree for the first view. So then it seems like there would need to be some sort of mapping between these "derived" ropes.

I'm confused why you would need the original tree to save. Wouldn't saving to file mean just getting concatenated strings? If you want to save the tree, why not just save the highlight nodes too? — JimmyJames, Aug 20 '18 at 14:37
I am just thinking like a text editor like Sublime Text. It shows me syntax highlighted code, but then I do `cat file` in the terminal and it outputs unsyntax-highlighted code. So it saves the "rope" without the syntax highlighting. (Sublime I don't think uses ropes, but either way). — Lance, Aug 20 '18 at 14:41
When you `cat file` it's just printing the text, right? You don't see any tree structure in terminal, do you? — JimmyJames, Aug 20 '18 at 14:44
No, yeah it just prints the text. So the tree is converted back into a string of bytes, but how to do it properly. — Lance, Aug 20 '18 at 14:45

score 1 · Answer 1 · answered Aug 20 '18 at 14:50

Unless there is some other requirement here, I don't see any need to ever return to the original 'undecorated' rope. The rope, based on your question, exists only in order to support these decorations. An undecorated rope is therefore a needless complication.

Instead what you need to do is have 2 (or more) routines for how you turn the tree into an output. One converter will walk the tree and keep track of the decorations in order to display in the editor. The other version will completely ignore any decorations and simply find the leaves in order to concatenated them together for the raw output.

score 0 · Answer 2 · answered Apr 02 '21 at 18:00

Consider having two parallel structures: A rope for the characters, something else for the syntax highlighting. The something else could be another rope, or it could be an interval tree, or whatever else you'd find that was appropriate. Package the two together in a class that does your edit and highlighting and iterating operations against them - the class' responsibility is to keep them in sync during the editing operations, as well as to supply whatever accessors you need.

You might find this conceptually simpler, thus easier to get right.

You might also find this allows more interesting operations to be available. For example, you could easily throw away the syntax highlighting structure and rebuild it, if that was interesting to do for various reasons you might discover. You could independently optimize the character rope and the syntax highlighting (e.g., perhaps, in an editor, your rope would benefit from periodic restructuring to eliminate small nodes). You could have a rope where parts of the text were in a file on disk and only part in memory - and you could have a syntax highlighter structure that had only entries for the current visible page (and maybe the previous and next page).

How the rope data structure works when doing syntax highlighting

2 Answers2