What are good ways to parse a large amount of input for a C++ application

Question

For small command line applications I can get away with simple command line input.

./test input.png output.png

But for larger applications that require a lot of input, using the command line alone becomes messy very fast.

./test cam1.png cam1_focal cam1_principle_x cam1_principle_y cam1_k1 cam1_k2 cam2.png cam2_focal cam2_principle_x cam2_principle_y cam2_k1 cam2_k2 ... cam15.png cam15_focal cam15_principle_x cam15_principle_y cam15_k1 cam15_k2 output.png

What is the typical approach to deal with this? A friend mentioned writing all the input arguments into an XML or JSON file, so the invocation becomes something like

./test config.xml

or

./test config.json

and the application will parse the XML file or JSON file to get the correct input.

If he is right and if I were to pick one to start with, should I go with XML or JSON?

If he is wrong, what is the correct way to handle a large number of inputs?

EDIT: Ideally this method should be

Cross-platform, meaning it works on Windows, Linux and Mac.
Very readable and straightforward, so that an external developer can use the application easily.
Very flexible: for example, it should handle optional arguments, nested arguments, and different types such as int, float, string and boolean.
Able to handle complex input, for example a 3x3 rotation matrix.
Light on parsing code in the C++ application itself. If an external library can be linked to do the parsing, so that even less code has to be written, that's even better.

Honestly, the size of the application is not much correlated to the number of input arguments it can process. And I don't see anything C++ related in this question, this seems just to be a distraction. — Doc Brown, Sep 17 '17 at 08:09
FWIW, the "correct way" is not to imagine some scenario, but to clarify your real requirements first, for a real use case. There are literally thousands of ways to pass multiple input data to a program. If you know your use case, you can much more easily pick an approach which fits. — Doc Brown, Sep 17 '17 at 08:13
What operating system, and what kind of application (and for what kind of users)? Please **edit your question** to improve it and give more context and motivations — Basile Starynkevitch, Sep 17 '17 at 08:44
Not sure why all the downvotes. This is a very real problem. A lot of people who also upvoted my question will agree with me. Despite all the negativity, at least I figured out that XML is better than JSON, and the alternatives described below don't seem to be better than using XML so far. Thanks a lot for everyone's input. — user3667089, Sep 17 '17 at 19:00
I don't think that XML or JSON are good in your case. Very few programs have configuration files in such formats. — Basile Starynkevitch, Sep 18 '17 at 07:02

score 3 · Answer 1 · answered Sep 17 '17 at 08:20

When dealing with command-line programs, there are two conventions that together allow us (and the program) to work effectively with a large number of arguments.

The program uses mostly named arguments/options rather than positional ones.
This means that most arguments to the program would be of the form --focal=<some value> --principle_x=<some value> --principle_y=<some value> -o <output file>. This lets parameters be optional and be given in any order. It is customary that input files can be provided as positional arguments (following all the others).
The program accepts a parameter that specifies an arguments-file. This arguments file is then parsed as if its contents were provided as arguments on the command line.

For the large majority of programs, these two conventions are sufficient, because usually the same set of named arguments applies to all input files.
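The two conventions above could be sketched roughly as follows. This is only an illustration under some assumptions of mine: the `@path` spelling for the arguments file, the names `expand_args`, `parse_args` and `ParsedArgs`, and the decision to support only the `--name=value` form (not `-o <file>`) are all hypothetical choices, not something prescribed by any standard.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Assumed convention: an argument "@path" names an arguments file whose
// whitespace-separated tokens are spliced in at that position.
std::vector<std::string> expand_args(const std::vector<std::string>& args) {
    std::vector<std::string> out;
    for (const std::string& arg : args) {
        if (arg.size() > 1 && arg[0] == '@') {
            std::ifstream file(arg.substr(1));
            std::string token;
            while (file >> token) out.push_back(token);
        } else {
            out.push_back(arg);
        }
    }
    return out;
}

struct ParsedArgs {
    std::map<std::string, std::string> options;  // from "--name=value"
    std::vector<std::string> positionals;        // everything else, in order
};

// Splits arguments into named options and positional arguments.
ParsedArgs parse_args(const std::vector<std::string>& args) {
    ParsedArgs parsed;
    for (const std::string& arg : args) {
        if (arg.rfind("--", 0) == 0) {  // starts with "--"
            const std::size_t eq = arg.find('=');
            if (eq == std::string::npos) {
                parsed.options[arg.substr(2)] = "true";  // bare flag
            } else {
                parsed.options[arg.substr(2, eq - 2)] = arg.substr(eq + 1);
            }
        } else {
            parsed.positionals.push_back(arg);
        }
    }
    return parsed;
}
```

From `main` you would build the vector with `std::vector<std::string>(argv + 1, argv + argc)`, pass it through `expand_args`, then through `parse_args`. Tokens in the arguments file cannot contain spaces in this sketch; a real implementation would need a quoting rule.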

If you have more complex requirements, such as different argument values for different inputs, then you could

accept the arguments using a more structured file format, like XML or JSON. Either one big file or a configuration file per input file, possibly augmented with regular command-line arguments for global options.
specify a rule that the command-line arguments will be interpreted as if they were structured like <global options> <input1> <input1 options> <input2> <input2 options> ... <global options>. This means that options like --focal can be specified multiple times and will always be applied to the image that was most recently seen on the command line.
If the inputs can't be combined into a single output, just accept only a single set of parameters and require that the program is executed multiple times for inputs that need separate sets of parameters.
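The "options apply to the most recently seen input" rule could look something like the sketch below. The names `InputSpec`, `CommandLine` and `group_by_input` are hypothetical, and I simplify the rule: options before the first input are treated as global, every later option attaches to the preceding input, and the output file is assumed to be passed as an option (e.g. `--output=...`) so that it is not mistaken for an input.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct InputSpec {
    std::string path;                            // e.g. "cam1.png"
    std::map<std::string, std::string> options;  // options that followed it
};

struct CommandLine {
    std::map<std::string, std::string> global_options;
    std::vector<InputSpec> inputs;
};

// Attaches each "--name=value" option to the most recently seen input,
// or to the global options if no input has been seen yet.
CommandLine group_by_input(const std::vector<std::string>& args) {
    CommandLine cl;
    for (const std::string& arg : args) {
        if (arg.rfind("--", 0) == 0) {
            const std::size_t eq = arg.find('=');
            const std::string name =
                (eq == std::string::npos) ? arg.substr(2) : arg.substr(2, eq - 2);
            const std::string value =
                (eq == std::string::npos) ? "true" : arg.substr(eq + 1);
            std::map<std::string, std::string>& target =
                cl.inputs.empty() ? cl.global_options : cl.inputs.back().options;
            target[name] = value;
        } else {
            cl.inputs.push_back(InputSpec{arg, {}});
        }
    }
    return cl;
}
```

With this, `./test --threads=4 cam1.png --focal=800 cam2.png --focal=750 --k1=0.1` would yield one global option and two inputs, each carrying its own `focal`.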

user1118321 · Answer 2 · 2017-09-17T14:58:28.083

There are a few things that would make this easier for users. One would be to have named arguments. If you look at the Unix "find" command, for example, you write the name of the argument and then its value, like:

find <directory to find in> -name <name pattern> -print

The parameter for the name you want to match is called "-name".

If you do want to have a file that contains the arguments, one that names what they are is helpful. Something as simple as an .ini file with sections and keys and values would be easy to implement:

imageFile = cam1.png
cam1Focal = 75
cam1_principle_x = 100
cam1_principle_y = -64
... etc.

This allows a reader of the file to quickly find a value they want to change and allows someone unfamiliar with the format to have a better chance of understanding it. This could also work with XML or JSON, especially if you already have a library to read those formats.
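To give an idea of how little code a flat `key = value` file needs, here is a minimal sketch. It handles only the layout shown above (no sections, no comments, no quoting), and the function names `trim` and `parse_key_values` are my own, not from any library.

```cpp
#include <map>
#include <sstream>
#include <string>

// Removes leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parses "key = value" lines; lines without '=' are silently skipped.
std::map<std::string, std::string> parse_key_values(const std::string& text) {
    std::map<std::string, std::string> values;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        const std::string key = trim(line.substr(0, eq));
        if (!key.empty()) values[key] = trim(line.substr(eq + 1));
    }
    return values;
}
```

Values come back as strings, so the caller converts them where needed, for example `std::stod(values.at("cam1Focal"))`. Optional parameters fall out naturally: check `values.count("cam1_k1")` and fall back to a default.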

Another question to ask is why you need so many inputs. Does it make more sense to present the user with a UI where they can select options and defaults are obvious? Or does this application generally get called from another application, so the arguments are always generated rather than typed? If so, then maybe it doesn't matter. (Well, as pointed out in the comments, you'll still need to do error handling, but readability is less of a factor.)

You also don't have to choose between the two. You could do as other commands do: accept arguments from the command line, allow one argument to be the path to a file containing the other arguments (like "-c <config file>"), or have an argument that puts the app into "interactive" mode (like "rm -i" on Unix) and prompts the user, or whatever is appropriate.

If the arguments are generated, OP does still need to contend with `E2BIG`. — Kevin, Sep 17 '17 at 07:23

Basile Starynkevitch · Answer 3 · 2017-09-18T07:06:14.573

On a POSIX system you would use globbing (see glob(7)...). So you could type

 ./mytest foo*.png bar*.jpeg

and the shell would expand foo*.png and bar*.jpeg, so your program (its main function) would get an array of program arguments which could be quite long (e.g. hundreds or perhaps thousands of arguments). There are limitations (read execve(2), getrlimit(2)...) but quite often you won't care (a typical limit could be several hundred thousand bytes for the program arguments). See also xargs(1).

(If you want your program to handle a very large number of files, e.g. several million, you should use an indirect approach: it is unreasonable to expect your main to get megabytes of program arguments.)

The shell is also able to do some even more powerful expansion, e.g. like using find(1) in

 ./mytest $(find a* -type f \( -name '*.jpeg' -o -name '*.png' \) )

Hint: use echo to understand expansion, e.g. replace ./mytest with echo in the line above to be told how that is expanded.

BTW, you may appreciate shells with good autocompletion; this is why I prefer zsh as my interactive shell, and I often use the tab key while typing a command (if you want similar facilities inside your program, look into the GNU readline library).

And you can always design your program specifically: have it accept a configuration file, a file (or even a database, e.g. sqlite) containing a list of file names, a command producing such a file list (e.g. run with popen(3)), or a file pattern given in a configuration file or as a (quoted) program argument (use glob(3), wordexp(3), fnmatch(3), ...).

(Handling file names containing spaces or newline characters, or weird characters like ~ or # or |, or a leading -, can be tricky but is doable. Beware of code injection and have quoting conventions, or decide to avoid such weird file names, but document what you expect and what you forbid.)

You could also have your program recursively scan some file tree (e.g. using nftw(3) or readdir(3)).
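Since the question asks for something cross-platform, here is a sketch of the recursive scan using C++17 `std::filesystem` in place of nftw(3) or readdir(3), which are POSIX-only. The function name `find_images` and the hard-coded extensions are just assumptions for the example.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Collects regular files under `root` whose extension is .png or .jpeg.
// Returns an empty list if `root` is not an existing directory.
std::vector<std::filesystem::path> find_images(const std::filesystem::path& root) {
    namespace fs = std::filesystem;
    std::vector<fs::path> found;
    if (!fs::is_directory(root)) return found;
    for (const fs::directory_entry& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file()) continue;
        const std::string ext = entry.path().extension().string();
        if (ext == ".png" || ext == ".jpeg") found.push_back(entry.path());
    }
    return found;
}
```

The iterator can throw `std::filesystem::filesystem_error` (for instance on an unreadable subdirectory), so a real program would catch that or use the overloads taking a `std::error_code`. The extension comparison here is case-sensitive.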

You could even embed some scripting interpreter (like Guile, Lua, etc...) in your application and enable a power user to code a script driving the work of your program (but embedding an interpreter is a strong architectural decision).

For larger applications that requires a lot of input, simply using command line becomes messy very fast.

It is not a matter of large vs. small applications. For example, find(1) is a small program by current standards (only 240K bytes on my Debian/x86-64), and cat(1) or wc(1) are even smaller, yet they are often used with many program arguments. A power user would write a shell script to drive an application he uses often, and that is not messy. So command-line arguments are usually very convenient.

(However, it really depends on your audience: a developer is not scared by GCC accepting many options, but your grandma might be scared of the command line; so YMMV.)

... the application will parse the XML file or JSON file

Configuration or script files could be in JSON or XML format, but usually are not (since both formats are verbose, and comments are forbidden in standard JSON). Look at existing practice (e.g. the many configuration files under /etc/) for inspiration (e.g. the INI format). It is sensible to accept some kind of comments in configuration files (usually starting with #, or perhaps // or ;, and running to the end of the line), so that the sysadmin or power user can explain the configuration.
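Supporting end-of-line comments costs very little. A naive sketch, assuming `#` and `;` as the comment markers and a hypothetical helper name `strip_comment`:

```cpp
#include <string>

// Drops everything from the first '#' or ';' to the end of the line,
// then removes trailing whitespace. Naive: a '#' or ';' inside a value
// is also treated as the start of a comment.
std::string strip_comment(const std::string& line) {
    const auto pos = line.find_first_of("#;");
    const std::string kept =
        (pos == std::string::npos) ? line : line.substr(0, pos);
    const auto end = kept.find_last_not_of(" \t\r");
    return (end == std::string::npos) ? "" : kept.substr(0, end + 1);
}
```

You would call this on each line before parsing it. If values may legitimately contain the marker characters (a file name with `#` in it, say), you need a quoting or escaping rule on top of this.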

You can find libraries that help with parsing configuration files, e.g. libconfig or GLib's GScanner. There are also many functions that help with parsing program arguments (and you can combine several approaches). Cross-platform frameworks like Qt, POCO, GTK and Boost also provide support for configuration.

You could make your program a server (e.g. using JSON-RPC) that listens for messages or requests (see also socket(7)).

You could use some other forms of inter-process communication.

You could adopt (or not) the Unix philosophy and expect the power user to combine several programs (including yours), perhaps in some command pipeline.

See also this answer to a related question about "good habits for designing command line arguments".

If on Linux (or POSIX) read also Advanced Linux Programming.

Read also more about Operating Systems, notably Operating Systems : Three Easy Pieces (since your question is linked to the relation between OSes and application programs and how they are started).

Sometimes you write a shell script that does preliminary argument parsing (e.g. using getopt(1) or the getopts builtin) and other setup, and then drives some real executable (perhaps in /usr/libexec/). For example, on my Debian box, /usr/bin/firefox is a script.

(PS. Things are probably different on Windows (which I don't know), which is rumored to handle main argument expansion in some startup routine à la crt0 and provides a central registry. My answer is implicitly focused on Linux.)

Ideally the input parsing should be cross platform and work on Windows, Linux and Mac. You kind of indirectly answered one of my questions, which is XML is better than JSON because JSON doesn't allow comments. — user3667089, Sep 17 '17 at 18:20

score 0 · Answer 4 · answered Sep 16 '17 at 23:43

I would actually choose a simpler file format than JSON or XML. Use a text file with one parameter per line. No need to overcomplicate things with JSON or XML.

So, your text file config.txt would have:

cam1.png
cam1_focal
cam1_principle_x
cam1_principle_y
cam1_k1
cam1_k2
cam2.png
cam2_focal
cam2_principle_x
cam2_principle_y
cam2_k1
cam2_k2
...
cam15.png
cam15_focal
cam15_principle_x
cam15_principle_y
cam15_k1
cam15_k2
output.png

And you would run the program using ./test config.txt.
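Reading such a file takes only a few lines. The sketch below assumes exactly the layout shown above: six lines per camera (image name followed by five numbers), then one final line with the output file. The names `Camera`, `Config` and `parse_positional` are made up for the example.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct Camera {
    std::string image;
    double focal = 0, principle_x = 0, principle_y = 0, k1 = 0, k2 = 0;
};

struct Config {
    std::vector<Camera> cameras;
    std::string output;
};

// Parses N groups of 6 lines followed by one output line.
// Returns false if the line count or a number is malformed.
bool parse_positional(const std::string& text, Config& config) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) lines.push_back(line);
    }
    if (lines.empty() || (lines.size() - 1) % 6 != 0) return false;
    config.output = lines.back();
    try {
        for (std::size_t i = 0; i + 6 < lines.size(); i += 6) {
            Camera cam;
            cam.image = lines[i];
            cam.focal = std::stod(lines[i + 1]);
            cam.principle_x = std::stod(lines[i + 2]);
            cam.principle_y = std::stod(lines[i + 3]);
            cam.k1 = std::stod(lines[i + 4]);
            cam.k2 = std::stod(lines[i + 5]);
            config.cameras.push_back(cam);
        }
    } catch (const std::exception&) {
        return false;  // std::stod rejected a line as a number
    }
    return true;
}
```

The meaning of each line depends purely on its position, so no parameter can be omitted and the file does not describe itself.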

Obviously, if you want richer settings than what this one argument per line format can provide, you can consider JSON or XML.

juhist

This to me is command line arguments in a bash script. What if the k1 and k2 can be optional? Also the focal, principle_x, principle_y, k1, k2 in the end are just numbers. If I am a new developer looking at the bash script I would have no idea what's going on. – user3667089 Sep 16 '17 at 23:50
I would go with JSON. It is easier for a human to read than XML. There are a lot of ready-made parsers with a simple API for the programmer. And it is easily extensible in comparison to a plain text format. – Artemy Vysotsky Sep 17 '17 at 03:46
