How and when should I design a simple mark-up language parser?

Question

I want to write a simple markup language with its rendering engine.

First, I am not completely sure when I should try this... I am only 12... But I am competent in C++ having learned through the Web and books.

I am also good with JavaScript, PHP and HTML. I am currently learning Ruby and Haskell for a change.

I understand the low-level and high-level concepts. The one thing that has always confused me is how people design parsers that understand and then compile or interpret things like markup languages and programming languages.

My question is when should I start writing a simple rendering engine for an even simpler markup language?

I mean something like the custom XML-like languages that frameworks use for their interfaces (Qt, for example, uses XML-based .ui files to define its forms).

Am I up to designing something like that? Are there any good papers, articles or books to read?

Preferred Languages: C++, JavaScript, Haskell, Ruby

Age doesn't matter. Knowledge does. Design a simple language markup parser *when* you believe it will benefit your learning. — Robert Harvey, May 01 '14 at 20:13
parsers require something called "formal languages theory". Regular expressions are one example of such things. — tp1, May 01 '14 at 20:15
A look into BNF (Backus-Naur Form) might give you some -insight- into understanding parsing of languages perhaps. (At least it helped me when it was introduced to me in 'Organizations of Programming Languages' class). — Shelby115, May 01 '14 at 20:27
For someone your age, I'd suggest just diving in and trying it. You'll do it all wrong, but your experience at doing it wrong will make learning the right way to do it later sink in faster. It will set you up to really understand the *whys* of how it is done. — Gort the Robot, May 01 '14 at 20:31
@tp1 Actually parsing something doesn't require anything beyond the most basic ground work of the theory of formal languages. You *can* use that theoretical framework, but you can also ignore most of it and create something no worse (or even better). — , May 01 '14 at 20:51
@tp1: It's not strictly required. There is a lot of valuable information to be learned from taking a stab at it, and figuring everything out on your own, and having to work through many of the problems that the originators of formal language theory had to work out. — whatsisname, May 01 '14 at 22:38

Ben H · Accepted Answer · 2014-05-04T21:59:07.073

For your specific circumstance, I'd just go ahead and try to write a parser if you have the spare time. I'd advise starting with a parser for an XML-based language, as these are the simplest to write: the nesting of the tags already spells out the syntax tree for you.
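To give a feel for how small a first attempt can be, here is a minimal sketch in Python of a stack-based parser for a tiny XML-like subset, plus a trivial text "renderer". It is an illustration under stated assumptions, not a real XML parser: it only handles tags with double-quoted attributes, ignores text between tags, and silently skips anything that does not look like a tag. The names `Node`, `parse` and `render`, and the `window`/`label`/`button` tags, are made up for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the syntax tree: a tag, its attributes, its children."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Groups: 1 = "/" for a closing tag, 2 = tag name,
#         3 = raw attribute text, 4 = "/" for a self-closing tag.
TAG = re.compile(r'<(/?)(\w+)((?:\s+\w+="[^"]*")*)\s*(/?)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse(source):
    """Build a Node tree; the stack holds the elements that are still open."""
    root = None
    stack = []
    for match in TAG.finditer(source):
        closing, name, raw_attrs, self_closing = match.groups()
        if closing:
            # A closing tag must match the innermost open element.
            if not stack or stack[-1].tag != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
            continue
        node = Node(name, dict(ATTR.findall(raw_attrs)))
        if stack:
            stack[-1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("more than one root element")
        if not self_closing:
            stack.append(node)
    if stack:
        raise ValueError(f"unclosed <{stack[-1].tag}>")
    if root is None:
        raise ValueError("no elements found")
    return root


def render(node, depth=0):
    """'Render' the tree as indented text lines, one line per element."""
    attrs = "".join(f" {key}={value!r}" for key, value in node.attrs.items())
    lines = ["  " * depth + node.tag + attrs]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = '<window title="Hello"><label text="Hi" /><button label="OK" /></window>'
    print("\n".join(render(parse(doc))))
```

A real rendering engine would replace `render` with code that draws widgets, but the shape stays the same: parse the text into a tree, then walk the tree.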

For the more general question of when it's valid to write a parser, I'd argue that the following must be true:

1. The inputs to the parser change often, and it would take longer to make the changes to the hardcoded equivalent of the parser's outputs than it would to change the parser's inputs
2. The parser tackles a finite, well-understood problem domain, which changes rarely and gives notice when it is about to change
3. The total time it would take to write the hardcoded equivalents of all the parser's outputs, for all its input files, is greater than the time it will take to write the parser itself
4. The language dealt with by the parser is simpler or more convenient for its end-user than the language the equivalent hardcoded output would be written in

This may seem a tad opinion-based and complex, but my reasoning is essentially that a parser takes a very long time to write well. In order for the parser to pay off its debt (in terms of time taken to write it), it must be dealing with a problem domain where the alternative to the parser would be to write a lot of code to deal with each potential input to the parser. So let's run through the above criteria with the example of HTML and HTML parsers:

1. HTML pages do indeed change often, and it would take longer to change the visual tree as written in C++ than it does to change the visual tree as written in HTML. To move a div in HTML, one can simply cut and paste it somewhere else in the tree; to change its style, one can simply apply a new CSS class. Doing the equivalent in C++ would be a lot harder, because it wouldn't be anywhere near as easy as just cutting and pasting the same code to some other part of the C++ file.
2. The HTML specification is finite and well understood. It's well known when the specification will change, because W3C convenes many meetings prior to each change. This means that the writers of HTML parsers know when it's about to change, so they can be prepared for changes and don't waste large amounts of time anticipating changes in the problem domain. The fact that the problem domain is well understood and finite also gives parser-writers a good basis from which to say that their parser is complete, i.e. an HTML parser is complete when it handles all of the known HTML elements that it will read. Imagine attempting to write a parser for something that changes constantly and is vaguely defined; how would you ever know your parser was complete?
3. Similarly to point 1, imagine attempting to write a web page as a set of C++ instructions. Coming up with a consistent way of handling the layouts of elements on the screen would take longer than writing a simple div! Additionally, given that there are ~2.51 billion web pages, imagine the time lost writing each web page in its own C++ files, with its own frameworks. If a parser saves huge amounts of time over the alternative choice, and the parser will be used often, then it's a good sign that the parser may be a net positive.
4. Again, if web pages were written in C++ then the pool of people capable of writing them would be severely diminished. Not to be snobby, but I think we can all agree C++, with its numerous complex pitfalls and segfaults, is a lot harder than HTML. If only inveterate C++ developers could write web pages, then I'd hazard a guess that we sure as heck wouldn't have ~2.51 billion web pages.
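The first and third points are, at heart, a break-even comparison: the parser costs a lump of time up front, and then each change is cheaper than it would have been in hardcoded form. As a rough sketch in Python (the function name `parser_pays_off` and every figure below are made up purely for illustration, not measurements):

```python
def parser_pays_off(parser_hours, hours_per_markup_change,
                    hours_per_hardcoded_change, expected_changes):
    """Return True if writing the parser costs less overall than hardcoding.

    All arguments are estimates in hours, except expected_changes,
    which is the number of input changes you expect over the project's life.
    """
    with_parser = parser_hours + expected_changes * hours_per_markup_change
    without_parser = expected_changes * hours_per_hardcoded_change
    return with_parser < without_parser


if __name__ == "__main__":
    # Hypothetical estimates: 80 hours to write the parser, 15 minutes per
    # markup change, 2 hours per equivalent hardcoded change.
    print(parser_pays_off(80, 0.25, 2, expected_changes=10))   # 82.5 vs 20 hours
    print(parser_pays_off(80, 0.25, 2, expected_changes=100))  # 105 vs 200 hours
```

With these invented numbers the parser loses at 10 changes and wins at 100, which is why how often the inputs change matters so much.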

As a bit of personal anecdata, my company has written a parser for a client which takes XML and uses it to move data between SQL stored procedures and spreadsheets, in both directions. The client is able to understand something like:

<Workbook name="SomeWorkbook">
    <Sheet name="SomeWorksheet">
        <DataCell range="A1" name="employee" input="SPGetEmployees" />
        <DataCell range="A2" name="salary" input="SPGetEmployees" />
        <DataCell range="B3" name="total" input="SPGetEmployees" />
        <DataCell range="B4" name="isApproved" output="SPApproveWorksheet" />
    </Sheet>
    <DataSources>
        <DataSource direction="input" type="SP" database="someDatabase" name="SPGetEmployees">
            <Parameters>
                <Parameter name="financialYear" type="DateTime" isDataCell="false" />
            </Parameters>
        </DataSource>
        <DataSource direction="output" type="SP" database="someDatabase" name="SPApproveWorksheet">
            <Parameters>
                <Parameter name="isApproved" type="Bit" isDataCell="true" />
            </Parameters>
        </DataSource>
    </DataSources>
</Workbook>

because it all looks familiar to them in their job role (semi-technical systems administrators), but the client definitely wouldn't understand the C# code that would otherwise generate this workbook. Their data sources for their worksheets change often, too, and it's quicker to change some XML than it is to change a lot of C# code. The problem domain is also well understood, because we're just reading and writing from some well-understood data sources to some well-understood outputs (Excel files), so we can write an XML-based language which provides for all the client's needs and doesn't have to be changed very often.
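To show how little code it takes to get from a file like this to something a program can act on, here is a hedged sketch in Python using the standard library's `xml.etree.ElementTree`. It is not the actual implementation described above (that one is in C#), and it uses an abbreviated copy of the XML without the parameters; the function name `load_bindings` and the tuple layout it returns are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the workbook definition, for illustration only.
WORKBOOK_XML = """
<Workbook name="SomeWorkbook">
    <Sheet name="SomeWorksheet">
        <DataCell range="A1" name="employee" input="SPGetEmployees" />
        <DataCell range="B4" name="isApproved" output="SPApproveWorksheet" />
    </Sheet>
    <DataSources>
        <DataSource direction="input" type="SP" database="someDatabase" name="SPGetEmployees" />
        <DataSource direction="output" type="SP" database="someDatabase" name="SPApproveWorksheet" />
    </DataSources>
</Workbook>
"""


def load_bindings(xml_text):
    """Return (sheet, range, direction, data source) for every DataCell.

    Raises ValueError if a cell refers to a data source that is not declared.
    """
    root = ET.fromstring(xml_text.strip())
    declared = {source.get("name") for source in root.iter("DataSource")}
    bindings = []
    for sheet in root.iter("Sheet"):
        for cell in sheet.iter("DataCell"):
            # A cell either reads from an input source or writes to an output one.
            direction = "input" if cell.get("input") else "output"
            source_name = cell.get(direction)
            if source_name not in declared:
                raise ValueError(f"cell {cell.get('range')}: unknown data source {source_name!r}")
            bindings.append((sheet.get("name"), cell.get("range"), direction, source_name))
    return bindings


if __name__ == "__main__":
    for binding in load_bindings(WORKBOOK_XML):
        print(binding)
```

The XML library does the actual parsing here; the part you write yourself is the interpretation step that gives the tags their meaning, and that is usually where the real design work is.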

I'll leave you with this final caution from xkcd on the topic of optimizations such as parsers: http://xkcd.com/1205/

That's exactly what I was looking for, a simple language like that one. But with different purposes (rendering). — , May 02 '14 at 12:16
@404NotFound Good, well in that case I'd recommend taking a look at XAML. Microsoft uses the XAML markup language to define its user interfaces for its WPF framework (Windows desktop applications). Even if you aren't interested in learning how to develop for Windows desktop, seeing how Microsoft uses an XML-based markup language to define graphical applications might be useful for your study. Also, if you genuinely are 12 and manage to get this working, then that'll be a very impressive achievement; I've worked with many professional developers who cannot implement a basic parser. — Ben H, May 04 '14 at 22:06
