
I'm writing an application in C++ and found that storing its data in an XML file is very wasteful. XML has its benefits: the file can be read without any special application, the DOM or VTD libraries give random access, and a backup is a simple copy of the XML file. But it is very RAM-expensive (even with VTD it needs 17 GB of RAM). There are 12,000 entries now, and I'm planning to store up to a million. Each entry contains 37 fields (for now, maybe more in the future) of different types: string, double, float and 128-bit float. These fields are divided into 5 groups (a hierarchy expressed through XML tags). I tried to find something more efficient with the same benefits. Unfortunately, googling didn't help, because there are so many DBs and DBMSs that I'm totally confused.

The XML file structure:

<patient>
   <analysis name="">
      <result type="">some_data</result>
   </analysis>
   <diagnosis>
      <preliminary></preliminary>
      <final></final>
   </diagnosis>
...
</patient>

So could you advise me on a solution for my problem?

Eugene
  • There are other factors involved in answering this question. How much money are you looking to spend on the DBMS (some are extremely expensive, and an acceptable answer to this question could be "free", especially for a hobby project)? Which DBMSs have you worked with in the past? If this is for an organization, which DBMSs do they already have and support? – Thomas Stringer Jan 19 '15 at 17:57
  • @ThomasStringer I was planning to make this app open-source. – Eugene Jan 19 '15 at 18:01
  • @Eugene Then go with an open-source DBMS (PostgreSQL or MySQL are two popular ones). Make sure you really understand SQL, though, as a table with 37 fields is probably designed wrong (use that same hierarchy to build foreign key relationships back to your main entities). – mgw854 Jan 19 '15 at 18:03
  • Are you looking for an *embedded* database? If so, start with SQLite, though there are many others. – GrandmasterB Jan 19 '15 at 18:09
  • @mgw854 Why an open-source DBMS? I can understand a free one for an OSS project, but not sure if I'm following why the DBMS itself should be OSS. – Thomas Stringer Jan 19 '15 at 18:10
  • @mgw854: But the user (or, worse, he himself) will have to *configure* MySQL/PostgreSQL. It's a bother. He should go with a "zero configuration" solution, like SQLite (more so if he won't be serving many users simultaneously, which seems to be the case). – Niccolo M. Jan 19 '15 at 18:12
  • @ThomasStringer I've added the structure to the question. As you can see, I can't reduce the number of fields. And is it possible to use user-defined variables in these DBMSs? – Eugene Jan 19 '15 at 18:16
  • It depends... You could have a look at [Introduction to NoSql by Martin Fowler](https://www.youtube.com/watch?v=qI_g07C_Q5I). It gives a short introduction to your options. You are the only one who can decide which architecture fits your needs: SQL, document-oriented databases or graph databases. If your schema changes, you may opt for a "schema-less" database, with its drawbacks. – thepacker Jan 19 '15 at 18:53
  • @mgw854: 37 fields is not a lot. I worked on a CRM system once that had that many fields in the Customer table. Every one of them was essential. I'd look more at the actual design, rather than a metric that's less meaningful than SLOC. – Robert Harvey Jan 19 '15 at 22:03
  • Eugene, will the data be accessed only by your application from one running process? Or are you looking for a client/server database with remote access from multiple clients / workstations? That is one of the first questions you should ask yourself when you pick a database. – Doc Brown Jan 20 '15 at 07:26
  • @DocBrown Only by one process, on local machine. – Eugene Jan 20 '15 at 20:47
  • Then I suggest following Mark Shevchenko's advice. – Doc Brown Jan 20 '15 at 21:33

3 Answers


If your application is local (doesn't support remote access), you could use an embedded DB engine.

It gives you easy installation and independence from other installed programs. So you need to choose an appropriate library. What should you look for when choosing?

  1. As I said, the library should be embedded. It lets you create a single executable file without any difficult configuration.

  2. The library should support C/C++ to integrate with your existing code.

  3. The library should be well-known and widely used. That makes it likely that the worst bugs have already been discovered and fixed.

  4. It's optional, but it would be cool if you could write SQL queries.

  5. It's optional, but... open-source!

Well, what's the choice?

I suggest looking at SQLite and Berkeley DB. Both are embedded, both are open-source, and both support C/C++ (also Java, Python, etc.).

SQLite is a relational DB engine, whereas Berkeley DB is not (it is a key-value store). In my opinion SQLite is easier to learn, but I could be wrong.

Try both. Use whichever one you get working first.


First of all, do not worry too much about it. If you separate your concerns well, keeping data retrieval apart from processing and displaying the data, you will always have the chance to switch to a "better" system.

About 10,000 entries is nothing a database will complain about, and even a million records aren't a big deal. Your database will have to grow as your application/system grows. Think about the most common use cases: which data is presented together and which data is used together. This should guide you in choosing between a column database, a document database, a graph database or simply a relational database.

Do not think too long about performance issues or theoretical problems that might occur. Define what's important and select a system for which you can get the most support. Remember to decouple your data (model) from your application (logic) and presentation (view); then you can always make a better-informed choice next time. If you make your application open source, a decoupled design will probably encourage others to provide a database implementation, when required, that is better suited to their (or your) needs.
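To make the decoupling concrete, here is a minimal C++ sketch. All names in it (`Patient`, `PatientStore`, `InMemoryPatientStore`, `has_final_diagnosis`) are made up for illustration, and the record is cut down to three fields instead of the 37 in the question. The application logic only sees the abstract interface, so an XML-, SQLite- or Berkeley DB-backed class could later replace the in-memory one without touching that logic.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, heavily reduced record; the real entries have 37 fields.
struct Patient {
    std::string id;
    std::string preliminary_diagnosis;
    std::string final_diagnosis;
};

// Storage interface: the only thing the application logic depends on.
class PatientStore {
public:
    virtual ~PatientStore() = default;
    virtual void save(const Patient& p) = 0;
    // Copies the record into `out` and returns true if `id` is known.
    virtual bool find(const std::string& id, Patient& out) const = 0;
};

// One possible backend. A database-backed class would implement the
// same two functions with queries instead of a std::map.
class InMemoryPatientStore : public PatientStore {
public:
    void save(const Patient& p) override {
        assert(!p.id.empty() && "a record needs a key");
        rows_[p.id] = p;  // insert or overwrite
    }
    bool find(const std::string& id, Patient& out) const override {
        auto it = rows_.find(id);
        if (it == rows_.end()) return false;
        out = it->second;
        return true;
    }
private:
    std::map<std::string, Patient> rows_;
};

// Application logic: it does not know how or where patients are stored.
bool has_final_diagnosis(const PatientStore& store, const std::string& id) {
    Patient p;
    return store.find(id, p) && !p.final_diagnosis.empty();
}
```

Switching the storage system then means writing one new class that derives from `PatientStore`; callers such as `has_final_diagnosis` stay unchanged.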

There is a lot of knowledge around. Aim for "good enough" first. Be pragmatic.

Please check the link I shared in the comment section. It gives a nice overview of the pros and cons of the types of database you might need.

thepacker

First of all, where you see a hierarchy, a database expert sees relations. Take some time to understand the relational database model and make a simple model that fits your data. This classical model is tried and tested, rests on mathematical foundations and is the basis for virtually every large system out there. Do not be misled by people claiming a graph database or NoSQL database will solve your immediate problems. Most often they are oblivious to proper database design and medium- to long-term maintenance. Graph databases and NoSQL databases shine in very specific niches of software engineering, but the choice to use them should generally be made after conventional solutions prove insufficient.
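As a rough illustration of "hierarchy becomes relations", here is how the XML from the question could be flattened into rows that point at their parent by key. This is only a sketch in plain C++ structs, not a real schema: the type and field names are made up, the values are kept as strings for brevity, and the `assert` merely stands in for the foreign key constraint a DBMS would enforce.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row per <patient> element.
struct PatientRow { int id; };

// One row per <analysis>; patient_id refers to PatientRow::id.
struct AnalysisRow { int id; int patient_id; std::string name; };

// One row per <result>; analysis_id refers to AnalysisRow::id.
struct ResultRow { int id; int analysis_id; std::string type; std::string value; };

// One row per <diagnosis>; patient_id refers to PatientRow::id.
struct DiagnosisRow { int patient_id; std::string preliminary; std::string final_text; };

// The "database": one flat table per kind of element, no nesting.
struct Tables {
    std::vector<PatientRow> patients;
    std::vector<AnalysisRow> analyses;
    std::vector<ResultRow> results;
    std::vector<DiagnosisRow> diagnoses;
};

// Adds a result row after checking that its parent analysis exists.
void add_result(Tables& t, const ResultRow& r) {
    bool parent_exists = false;
    for (const AnalysisRow& a : t.analyses) {
        if (a.id == r.analysis_id) parent_exists = true;
    }
    assert(parent_exists && "result must reference an existing analysis");
    t.results.push_back(r);
}

// Roughly what a join of results with analyses, filtered by patient,
// would return: every result that belongs to the given patient.
std::vector<ResultRow> results_for_patient(const Tables& t, int patient_id) {
    std::vector<ResultRow> out;
    for (const AnalysisRow& a : t.analyses) {
        if (a.patient_id != patient_id) continue;
        for (const ResultRow& r : t.results) {
            if (r.analysis_id == a.id) out.push_back(r);
        }
    }
    return out;
}
```

In a relational DBMS each struct would be a table, the `*_id` members would be foreign key columns, and `results_for_patient` would be a single SQL query rather than hand-written loops.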

Performance should clearly be taken into account, as one million records might already pose a problem in certain request-response scenarios. E.g., you might want to deliver a response within 200 milliseconds, which would likely entail the use of indices.
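To show what an index buys you, here is a small C++ sketch that compares a full scan with a lookup through a key-to-position map. It is an analogy, not how any particular DBMS implements indices (those are typically B-trees kept on disk), and the names `Entry`, `find_by_scan` and `PatientIndex` are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct Entry {
    std::string patient_id;
    double value;
};

// Without an index: every lookup walks the rows until it finds a match,
// so the cost grows linearly with the number of entries.
const Entry* find_by_scan(const std::vector<Entry>& rows, const std::string& id) {
    for (const Entry& e : rows) {
        if (e.patient_id == id) return &e;
    }
    return nullptr;
}

// With an index: a map from key to row position is built once, after
// which a lookup costs roughly the same regardless of table size.
// The index keeps a reference to `rows`, so `rows` must outlive it and
// must not be modified after the index is built.
class PatientIndex {
public:
    explicit PatientIndex(const std::vector<Entry>& rows) : rows_(rows) {
        for (std::size_t i = 0; i < rows.size(); ++i) {
            // emplace keeps the first position for a duplicate key,
            // which matches what find_by_scan returns.
            pos_.emplace(rows[i].patient_id, i);
        }
    }
    const Entry* find(const std::string& id) const {
        auto it = pos_.find(id);
        if (it == pos_.end()) return nullptr;
        assert(it->second < rows_.size() && "index is stale");
        return &rows_[it->second];
    }
private:
    const std::vector<Entry>& rows_;
    std::unordered_map<std::string, std::size_t> pos_;
};
```

The trade-off is the same as in a database: the index costs extra memory and has to be kept in sync with the data, in exchange for much cheaper lookups on the indexed key.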

When choosing a DBMS, you should start with PostgreSQL. It has a very good track record, comes with a performant query optimizer, is reasonably compliant with the SQL standard and is used in many (larger) production settings. I've personally designed and implemented PostgreSQL setups containing tens of billions of records, hundreds of tables and probably thousands of indices, driving SaaS applications with very respectable performance characteristics.

Dibbeke
  • Well, I did not propose NoSQL, but an introduction in which MF talks about all types of databases. I would not go so far as to recommend a certain database when I do not know much about the data. It could be useful to create a relational model, but it could also be useful to think about a key-value store. I personally think performance is not the first thing to think about, since each new version will become faster and better over time. Time works in your favour. You may compare that to Java VMs, where each one is more optimized than its predecessor. – thepacker Jan 19 '15 at 23:27