2

Hello fellow programmers, I was discussing a project the other day with a colleague of mine and I was curious to see what others had to say or if such a thing already existed.

Background

There are many programming languages. There are many IDE's and source editors that highlight and edit source code. Following perfectly and exactly the rules of a language to present auto-complete options and understand scopes in the code is rather complex. This task is complex enough that most IDE's implement different source-editors as plugins that often re-implement the same features over and over but in a different way (netbeans).

From what I can tell most IDE's and source editors re-implement parsers that use regular expressions, or some meta-syntax Naur Form to describe the languages grammer generically. These parsers are implemented over and over and over again.

Question

Has anyone attempted to unify or describe a set of features through an API and have a consistent interface to parsing various programming languages and dialects. I'm not describing an IDE - but a consistent API for any program to use to parse and obtain meta-information from the source code.

I realize various programming languages offer many different features which are difficult to 'abstract' into a set of features, but I feel this would be a worthwhile venture.

It seems to me that this could possibly allow the authors of interpreters to help maintain a central grammer intepreter for their language. the Python foundation could maintain the Python grammer api, ANSI the C grammer api, Oracle the Java grammer API, etc

Example usage

If this was API existed code documentation generators could theoretically work across all dialects and languages to some level. It wouldn't matter if your project used 5 different languages a single application could document all of them and the comments and doc-tags within.

Has anyone attempted this comprehensively?

Ben DeMott
  • 601
  • 3
  • 9
  • Yes, the API exists. It's called http://www.gnu.org/software/bison/ and it almost works for a wide variety of languages. – S.Lott Mar 09 '11 at 23:00
  • 1
    @S.Lott, bison will handle only a tiny fraction of grammar classes used in the real world languages. – SK-logic Mar 09 '11 at 23:20
  • @SK-logic: That sounds like an answer. Please provide an answer so it can be updated. – S.Lott Mar 10 '11 at 00:10
  • @S.Lott: I've got no answer to the question. There are commercial frameworks (like http://www.semdesigns.com/Products/DMS/DMSToolkit.html), there are incomplete, toy academic solutions, but of course nobody yet came up with something that might be a unifying API. But it is not impossible at all, you just have to employ more modern parsing techniques. – SK-logic Mar 10 '11 at 00:28
  • I want to make it clear I'm not calling for a unified parser - I'm calling for a unified API to access many different parsers, so the parsers can be re-used instead of re-implemented. Obviously the parsers cannot be unified to one grammer parsing API construct, paradigm, etc there are just too many languages. – Ben DeMott Mar 10 '11 at 03:39
  • @Ben DeMott: In that case, I repeat the suggestion in my answer: try designing the API to be useful. Don't worry about details too much. I don't have any good ideas myself. – David Thornley Mar 10 '11 at 14:32

2 Answers2

3

One issue is that different languages have different parsing requirements. It would be nice if all existing computer languages had nice LL(1) context-free grammars, but they don't. Some languages, like C and C++, cannot be parsed without input from other parts of the compiler. Also, before the parser comes lexical analysis, which divides input into tokens, and different languages will require different tokenization (** is two tokens in C, and one in Fortran). Any parser general enough to parse all common computer languages would be slow and hard to write a grammar for.

Another is that all the parser can do on its own, really, is confirm whether the program is syntactically correct or not. The parser will have actions attached to certain parser productions, and it is these that provide the semantics. For example, given a + b, the parser can determine that a, +, and b go into one expression. The attached actions will determine that this is a result of applying '+' to 'a' and 'b', and determining that this is an addition of two variables is in a later conceptual stage.

Then there's the question of what the API returns. It can return the next token, but that won't tell us much. If it is to come up with any sort of meaning, such as something like Intellisense would use, it has to have a lot more specifications, and different languages have different types of entities. Write the parser for C and C++ constructs, and you have trouble with Java's finally and probably real trouble with Haskell's monads. The common thing compilers spit out is some sort of executable code, usually stripped of much or all of the more abstract semantics.

In short, it doesn't look feasible to me. If you want to continue to play with the idea, try writing the API and seeing what it needs. If you arrive at more concrete problems, feel free to ask about them here or on SO.

David Thornley
  • 20,238
  • 2
  • 55
  • 82
  • GLR or PEG can actually solve most of the problems you've listed, with one (but important) exception - C and C++ ambiguities still can't be resolved without quite a substantial part of semantic analysis. – SK-logic Mar 09 '11 at 23:15
1

You may want to check: http://en.wikipedia.org/wiki/Antlr

It's a tool for parsing several languages. There are some related tools, and I.D.E.

I have some background in compiler design, and as the previous answers, and many other people may tell you, its practically imposible to make a single A.P.I., for several programming languages, since each syntax differs from one to another.

The most practical approach, as you already mention, its to build a "partial scanner / parser". It's used by several editor with code highlighting. Instead of parsing an entire source code file, the parser looks for particular "keywords" or expressions.

umlcat
  • 2,146
  • 11
  • 16