5

I'm looking for a little guidance. Until this morning, this was all over my head. After spending today researching Wikipedia, StackOverflow, etc., I'd say I've got my nose above the water. I'm tasked with researching a project that, as I understand it, would involve reverse-engineering a domain-specific language that is basically a type of regular expression syntax and translating that into conventional (meaning, Perl-like) regular expressions. I'm hoping someone can fill in the gaps in my understanding, so let me explain what I've got and my understanding, thus far.

Regarding the DSL, it basically has two components to it. One is grep-like regular expressions, more or less. The other component consists of what you could call macros. These macros are used when you want to find Social Security numbers, credit card numbers and so forth in the text being analyzed. The DSL is a simpler way for the end-user to specify matching text and exceptions. (Honestly, it's still pretty complicated to use -- but easier than regular expressions.)

The DSL is used to write "rules," a rule being a one-line statement describing target text.

What I want to do is, taking a rule as input, translate that rule into a conventional regular expression where possible. If a macro doesn't break down into a single regular expression, I will translate the macro into one or more RE's, or any kind of text processing code needed.

My largest point of ignorance is the "translating." Here's my first question: will that involve what's called "parsing"? (As you can guess, I'm a self-taught programmer.) Would I be using a tool like ANTLR or PyParsing, or something like that? (I just found out about these tools today, during my research.) I'm guessing yes, but I'd appreciate confirmation.

How much do I have to learn? I'm not going to need to get the "Dragon book," right? I am not writing a DSL, but if I'm reverse-engineering one, then I'm guessing I will more or less need to learn how to create a DSL in the first place (since how else would I know how to do the parsing). Yes?

What do I need to get under my belt? That's what I'm trying to figure out. I've parsed configuration files; I've written a simple YAML to Java Bean code generator; I've written a simple HTML-stripper; I know a bit about regular expressions and text processing; but this particular endeavor is more sophisticated than anything I've done before. It seems doable, but as a sanity check please tell me if, in my ignorance, I've miscalculated and what I'm describing actually needs 4 years of being a computer science major to accomplish.

Can anyone point me in the right direction or give me some pointers? Thanks.

Mario
  • 151
  • 4
  • 2
    "will that involve what's called parsing" - yep. "Would I be using a tool like ANTLR or PyParsing?" - Yep. "I'm not going to need to get the "Dragon book, right?" Right (because you are not building a compiler), but I guess it would not harm. Seems you already have the answers to most of your questions. – Doc Brown Jun 20 '14 at 20:36
  • 1
    Could you provide some typical expressions of your DSL? – siledh Jun 25 '14 at 07:04
  • As I said, I'm reverse-engineering a DSL and translating it to Perl-like regular expressions or custom code, when regexes won't handle the whole thing. So, some examples: %SSN% will recognize a Social Security Number, %CC% will recognize a credit card number (checking the Luhn), %EMAIL% will recognize an email address -- you get the idea. Those are examples of the "macros" I was talking about. In addition to those is a more standard regular expression syntax, like Grep or something similar. – Mario Jun 25 '14 at 16:11
  • Is your DSL externally used (by clients or by some community)? Or is it just a matter of transforming a bunch of ad-hoc unmaintained scripts (written in this DSL) into something better? – Basile Starynkevitch Jul 22 '14 at 16:43
  • It is not clear at all if the DSL already exist, then show some example of old code in that DSL, or if you are in the process of inventing your DSL. – Basile Starynkevitch Jul 22 '14 at 16:51
  • @BasileStarynkevitch The DSL exists. However, what I will be doing is reusing the syntax. I will have to build the lexing, parsing, and then the actual program that executes the syntax. (I believe you assumed this and answered it below.) – Mario Jul 23 '14 at 20:23

2 Answers2

2

I have never written a compiler, so I am out of my depth here, but here is a try:

I would start by writing a lexer and parser for the language. There are many tools for this. ANTLR is one of them; it can handle both lexing and parsing. Alternatively, you can use Lex or GNU Flex to create a lexer and create a parse tree with Yacc and its implementation Bison are others. I am sure that there are many other lexer and parser generators I have not covered here.

As far as the code generation, I don't really know what the best way to do this is. However, there are many resources on the subject.

The main answer to this question lists many resources that you could use for both the lexing/parsing and code generation.

If you know a functional language (such as OCaml), this would be a good language to implement your compiler in, as I believe (never done it) they make working with trees relatively easy.

Demi
  • 826
  • 7
  • 18
2

I recommend become familiar with several programming languages (including Scheme or CommonLisp and Ocaml or Haskell).

It is probable that implementing your translator in such languages would be good for you.

Then I suggest to read Programming Language Pragmatics (by M.Scott) and Lisp In Small Pieces (by C.Queinnec).

Of course, you'll need to read a good compiler textbook.

The lexing & parsing phase is probably the simplest. Most of the issues are elsewhere.

Of course you may want to use a parser generator like menhir or antlr. You could also avoid them and write your own parser by hand (this answer is explaining why you would want to do that). But parsing is quite "easy" (but a lot of work!). Most of the issues are elsewhere.

BTW, you are re-implementing a DSL. You might try to (compatibly) improve it. And it is a significant amount of work (months, not days).

Of course your translator will involve making some internal abstract syntax tree of the parsed DSL code (and the evil is in the details) and transforming them to some other AST representing the output code.

My DSL2011 paper on MELT might be relevant for you.

Reading the Garbage Collection Handbook should also be useful.

Basile Starynkevitch
  • 32,434
  • 6
  • 84
  • 125