This is part of a series of questions which focuses on the sister project to the Abstraction Project, which aims to abstract the concepts used in language design in the form of a framework. The sister project is called OILexer, which aims to construct a parser from grammar files, without the use of code injection on matches.
Some other pages associated with these questions can be viewed here (structural typing) and here (ease of use). The meta-topic about the framework and the proper place to post can be found here.
I'm getting to the point where I'm about to start extracting the parse tree out of a given grammar, followed by a recursive descent parser which uses DFAs to discern forward paths (similar to ANTLR's LL(*)), so I figured I'd open it up to get insight.
In a parser compiler, what kinds of features are ideal?
So far here is a brief overview of what's implemented:
- Templates
- Lookahead prediction: knowing what's valid at a given point.
- Rule 'deliteralization': taking the literals within rules and resolving which token they come from.
- Nondeterministic Automata
- Deterministic Automata
- Simple lexical state machine for token recognition
- Token automation methods:
  - Scan - useful for comments: Comment := "/*" Scan("*/");
  - Subtract - useful for identifiers: Identifier := Subtract(IdentifierBody, Keywords);
    - Ensures the identifier doesn't accept keywords.
  - Encode - encodes an automaton as a series of X base-N transitions.
    - UnicodeEscape := "\\u" BaseEncode(IdentifierCharNoEscape, 16, 4);
      - Makes a Unicode escape in hexadecimal, using four hex transitions. The difference between this and [0-9A-Fa-f]{4} is that the automaton resulting from Encode limits the allowed hexadecimal values to those within IdentifierCharNoEscape. So if you give it \u005c, the Encode version will not accept the value. Things like this come with a serious caveat: use them sparingly, because the resulting automaton can be quite complex.
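To make the Encode point concrete, here is a minimal Python sketch. It is not OILexer's implementation. It assumes IdentifierCharNoEscape is just ASCII letters, digits, and underscore (a stand-in; the real set would be much larger), and the names base_encode and accepts are made up for illustration. It builds a trie-shaped automaton that accepts only the four-digit hex spellings of characters in that set.

```python
import string


def base_encode(code_points, base=16, width=4):
    """Build a trie-shaped automaton (nested dicts) that accepts exactly the
    fixed-width base-N spellings of the given code points.

    Hypothetical helper for illustration only, not OILexer's API.
    """
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"[:base]
    root = {}
    for cp in sorted(code_points):
        if cp >= base ** width:
            continue  # cannot be written in `width` digits
        node = root
        for pos in reversed(range(width)):
            digit = digits[(cp // base ** pos) % base]
            child = node.get(digit)
            if child is None:
                child = {}
                node[digit] = child
                # Letter digits are accepted in either case.
                node[digit.upper()] = child
            node = child
    return root


def accepts(automaton, text, width=4):
    """Walk the automaton one digit at a time; reject on a missing edge."""
    if len(text) != width:
        return False
    node = automaton
    for ch in text:
        node = node.get(ch)
        if node is None:
            return False
    return True


# Assumed stand-in for IdentifierCharNoEscape: ASCII letters, digits, '_'.
identifier_chars = {ord(c) for c in string.ascii_letters + string.digits + "_"}
escape_body = base_encode(identifier_chars)

print(accepts(escape_body, "0041"))  # 'A' is in the set
print(accepts(escape_body, "005c"))  # backslash is not in the set
```

A plain [0-9A-Fa-f]{4} would accept both inputs. The trie here is not minimized, so with a large Unicode class the state count grows quickly, which is the complexity caveat noted above.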
What isn't implemented is CST generation; I need to adjust the deterministic automata to carry over the proper context to get this working.
For anyone interested, I've uploaded a pretty-printed version of the original form of the T*y♯ project. Each file should link to every other file. I started to link in the individual rules so they could be followed, but it would've taken far too long (it would've been simpler to automate!).
If more context is needed, please post accordingly.
Edit 5-14-2013: I've written code to create GraphViz graphs for the state machines within a given language. Here is a GraphViz digraph of the AssemblyPart. The members linked in the language description should have a rulename.txt in their relative folder with the digraph for that rule. Some of the language description has changed since I posted the example, because I simplified parts of the grammar. Here's an interesting GraphViz image.
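For readers unfamiliar with the format, emitting a digraph for a state machine can be sketched roughly as follows. This is a Python illustration, not the actual OILexer code; the to_dot function and the hand-written automaton for Comment := "/*" Scan("*/") are assumptions made up for the example.

```python
def to_dot(name, start, accepting, transitions):
    """Render a state machine as GraphViz DOT text.

    `transitions` is a list of (source, label, target) triples.
    Hypothetical helper for illustration only.
    """
    def quote(value):
        # DOT quoted strings need backslashes and double quotes escaped.
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return '"' + text + '"'

    states = sorted({start, *accepting,
                     *(src for src, _, _ in transitions),
                     *(dst for _, _, dst in transitions)})
    lines = [f"digraph {quote(name)} {{", "  rankdir=LR;",
             "  __start [shape=point];"]
    for state in states:
        shape = "doublecircle" if state in accepting else "circle"
        lines.append(f"  {quote(state)} [shape={shape}];")
    lines.append(f"  __start -> {quote(start)};")
    for src, label, dst in transitions:
        lines.append(f"  {quote(src)} -> {quote(dst)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)


# Hand-written automaton for a block comment: "/*" ... "*/".
comment_transitions = [
    (0, "/", 1),
    (1, "*", 2),
    (2, "*", 3),
    (2, "other", 2),
    (3, "*", 3),
    (3, "/", 4),
    (3, "other", 2),
]
dot = to_dot("Comment", 0, {4}, comment_transitions)
print(dot)
```

Saving the printed text as a rulename.txt file and running it through the dot tool (for example, dot -Tpng) renders the graph.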