7

We needed persistent storage for API keys, and I came up with reading and writing JSON in plain text. The user thinks it works, but is it very inefficient compared to a relational DBMS?

I just save the data as JSON and then edit it manually for this use case.

[
    {
        "241103000000056": {
            "Value": "xxx-4c85-xxx-8420-xxx",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000516": {
            "Value": "String",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000562": {
            "Value": "Text",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000564": {
            "Value": "100",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        },
        "2411030000000566": {
            "Value": "ZZZ",
            "account1": "3000",
            "account2": "1910",
            "series": "A"
        }
    }
]

This has the advantage of quickly getting a working use case, since my user can manually add API keys instead of waiting for an admin UI for a database or NoSQL storage. The variables are:

  • Value - an API key that the program uses per user
  • account1 - the debit account of the payment
  • account2 - the credit account of the payment

The data is only read and written once a day, in a batch process for payments, and the data set is not very large (fewer than 100 records, and probably always fewer than 1000, because the API keys belong to merchants and businesses, not consumers).
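
To give an idea of the access pattern, here is a minimal sketch of reading the file once and writing it back once (Python; the file name and helper names are made up for illustration, not the real code):

    import json

    KEYS_FILE = "api_keys.json"  # assumed file name, purely illustrative

    def load_keys(path=KEYS_FILE):
        """Read the whole file once and return the dict keyed by record number."""
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data[0]  # the example wraps a single object in a list

    def save_keys(records, path=KEYS_FILE):
        """Write everything back in one go, keeping the same outer structure."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump([records], f, indent=4)

    # Daily batch: read once, use the keys, write once if anything changed.
    records = load_keys()
    api_key = records["241103000000056"]["Value"]
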

Niklas Rosencrantz

5 Answers

14

Is it less efficient to store a small bit of data like your example encoded in a string than as binary? Yes. How much less? Not enough to care.

Is it less efficient to store thousands of such records in a string than in binary? Oh god yes.

Here's why: I can't predict the index of "account1" in the 42nd record because the fields of the previous records didn't have a fixed length. That means I can't find it without parsing everything before it.
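
To make that concrete: with a fixed record size, the position of any field is simple arithmetic, no parsing required. A minimal sketch (the 80-byte layout and the file name are assumptions for illustration, not something from the question):

    RECORD_SIZE = 80          # hypothetical fixed-size record
    ACCOUNT1_OFFSET = 8 + 32  # id (8 bytes) + value (32 bytes) come first in that layout

    def account1_position(record_index):
        """Byte offset of "account1" in any record: arithmetic, not parsing."""
        return record_index * RECORD_SIZE + ACCOUNT1_OFFSET

    # The 42nd record (index 41): seek straight there instead of reading records 1..41.
    with open("records.bin", "rb") as f:
        f.seek(account1_position(41))
        account1_raw = f.read(4)
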

Now sure, you could do text with fixed length as well but no one ever respects that so no one does it. For some bizarre reason though they respect it when it's done in binary. Why? I don't know, maybe you get a better class of coder when you force them to bang away in a hex editor instead of notepad.

That right there is the big thing databases really give you that's of any value over your file system. Well, besides transactions.

As for 80 bytes vs 150 bytes? Feh! If I cared about an O(n) factor like that (which I don't) I'd just zip the thing.

What I do care about is running into a Schlemiel the Painter problem where I can no longer do small things that I used to do just fine, because the amount of data has gotten large. This isn't something you can fix just by throwing a bigger hard drive at the problem. Think about how these records get used.

This is why people who ask about the maximum file size of a system before creating JSON files really need to be sat down and talked to.

candied_orange
  • Gzipping tends to work well for text-based formats and since it's so well supported by HTTP clients, you can even serve up the compressed files directly, if you want to get fancy. – JimmyJames Jul 28 '17 at 15:20
6

The question to ask is: efficient in what way? It appears that your records could be stored in a more space-efficient way by assuming a fixed record structure (like a database would do) and saving the values in binary.

The record numbers look like they'd fit in a 64-bit int, whereas currently they're stored as a 16-character string plus 2 double quotes, plus formatting. If the text fields can have a length limit, that would help, too. The account values look like they'd fit in 16-bit ints, though you'd probably want 32-bit ints for scaling. So let's say the "Value" and "series" strings can be limited to 31 characters plus a length byte. You're looking at all records being:

  • 8 bytes for record number
  • 32 bytes for value
  • 4 bytes for account 1
  • 4 bytes for account 2
  • 32 bytes for series

That's 80 bytes per record. The first record in your list is 150 bytes. Of course, if the strings need to be at least 1k but they average 150 bytes, then that changes the equation. You don't have to give the strings a fixed length, either: you could store a length byte and have a variable record size. Now it's very efficient to store on disk, but it may take longer to read and write, particularly if you need to do a lot of random access.
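
Those field widths are easy to sanity-check in Python with the struct module; a sketch, where the layout and encoding are the assumptions above rather than anything mandated:

    import struct

    # 8-byte record number, 32-byte value, two 4-byte accounts, 32-byte series.
    layout = struct.Struct("<q32sii32s")
    print(layout.size)  # 80

    packed = layout.pack(
        241103000000056,            # record number as a 64-bit integer
        b"xxx-4c85-xxx-8420-xxx",   # value, null-padded to 32 bytes by struct
        3000,                       # account1
        1910,                       # account2
        b"A",                       # series, null-padded to 32 bytes by struct
    )
    assert len(packed) == 80        # versus roughly 150 bytes of JSON text
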

Does the entire set of records fit into memory? Then maybe slow read/write times don't matter because you only read once at app start up and write once at app shutdown. (Probably not, but I'm just giving an extreme example.)

There are ways you can optimize each of these things or find reasonable compromises for situations where optimizing any one thing would compromise efficiency otherwise. But it all depends on what you're going to do with this data.

user1118321
  • Nice analysis, but at the numbers OP is talking about (100s) they could store the values in Sanskrit and still be plenty fast enough. :-) – user949300 Jul 28 '17 at 04:19
  • Ha! Too true. That information was added while I was typing up my answer. But I think the answer's still relevant if they want to scale it up in the future. – user1118321 Jul 28 '17 at 04:32
  • Yes, 100 000 records should be possible. I'm just glad that you didn't call me crazy. – Niklas Rosencrantz Jul 28 '17 at 17:09
3

For batch processing, efficiency is not the top concern. Store the data in a database. It is a lot easier to write correct data using SQL than it is to edit a text file, and there are more consistency checks with a database (if you use foreign keys and the correct data types). A relational database also happens to be faster, but you are doing batch processing, so processing a text file will probably be fast enough.

I realize this is counter to the comments below (JSON is so easy to use with any language, it is human readable, ...). Many of the problems that I have encountered have boiled down to someone not editing a text file correctly. And yes, despite my age I have used JSON. I just feel that with JSON it is deceptively easy to make errors that common database constraints would catch.
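
To illustrate the kind of checks a database gives you essentially for free, here is a small sketch using SQLite from Python (the table name and the particular CHECK constraints are assumptions based on the fields in the question):

    import sqlite3

    conn = sqlite3.connect("payments.db")  # illustrative file name
    conn.execute("""
        CREATE TABLE IF NOT EXISTS api_keys (
            record_id TEXT PRIMARY KEY,
            value     TEXT NOT NULL,
            account1  TEXT NOT NULL CHECK (length(account1) = 4),
            account2  TEXT NOT NULL CHECK (length(account2) = 4),
            series    TEXT NOT NULL
        )
    """)

    # A slip of the keyboard that a hand-edited JSON file would silently accept
    # is rejected here: the 3-digit account1 violates its CHECK constraint.
    try:
        conn.execute(
            "INSERT INTO api_keys VALUES (?, ?, ?, ?, ?)",
            ("241103000000056", "xxx-4c85-xxx-8420-xxx", "300", "1910", "A"),
        )
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
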

Robert Baron
  • I disagree. At least for most languages, interfacing with a JSON parser is much simpler than using a database. – Jules Jul 28 '17 at 08:37
  • Multiple failed projects due to using a text file? How about a failed project because it wasn't built to make changing the data source easy? – JeffO Jul 28 '17 at 11:20
  • While in this case efficiency doesn't seem to matter much, I have to question the assertion that efficiency in batching doesn't matter. Batching tends to need to happen in a specific window of time and therefore narrows the amount of time available to process records. It's not that efficiency is not important, it's that you need to worry more about throughput than latency. – JimmyJames Jul 28 '17 at 15:52
  • "Many of the problems that I have encountered have boiled down to someone not editing a text file correctly." -- your integration tests should include a sanity check on your configuration files, so that this situation can't occur. Also, I'll note that it's just as easy to deploy an incorrect SQL file that renders your database unusable if you don't have any way of testing such things. – Jules Jul 28 '17 at 16:32
  • @JimmyJames - yes, the overall efficiency of the processing system needs to be fast enough to do the job (a daily job cannot take 27 hours to run), but it doesn't need to be the most efficient code. Online processing has very strict time limits, whereas batch processing tends to have much looser time limits. – Robert Baron Jul 28 '17 at 17:40
  • @Jules, why implement sanity checks on a JSON file that could be done using constraints on a relational database? This seems to be more work than needed. Furthermore, it is my experience that it is harder to write an incorrect SQL file that only contains inserts and updates than it is to write an incorrect JSON file. And not all text/JSON files are configuration; some contain user-provided data - for example, merchant credit information. – Robert Baron Jul 28 '17 at 17:57
  • @RobertBaron In my experience it's common to have problems getting batches to fit inside the processing window. I've seen this problem drive millions of dollars to be blown on big iron while those batches were using bubble sort on data on the order of millions of items, because "efficiency didn't matter". It kind of drove me nuts. – JimmyJames Jul 28 '17 at 18:03
  • In my experience, SQL constraints are inadequate to prevent all possible errors. Only testing the running system and ensuring it is able to successfully communicate with third party back ends, etc, is enough to ensure that a configuration change can be safely deployed. – Jules Jul 28 '17 at 19:07
  • @JimmyJames OK, I have edited it. I too have worked on a project where the company had just bought bigger iron (unbeknownst to me), and I wound up optimizing daily processes that took 27+ hours to run and got them running in about 1 minute. This was done primarily by caching to avoid IO - something that the multitude of people with master's degrees who worked there could have/should have done but didn't. That said, I think they would have been happier if it took more like 1/2 hour. – Robert Baron Jul 28 '17 at 19:27
3

From a different perspective than the other answers:

I agree that a DBMS is likely overkill in terms of space complexity, but a DBMS can also give you data integrity guarantees as well as security. They often have built-in backup features, and allow you to have the data on disk encrypted (on top of any OS-level encryption that may or may not be in place).

This may not be a direct answer to your question, but whenever I see XML, JSON, INI or other human-readable text formats storing potentially sensitive data (matching account numbers with API keys, for example), I always think of how easy it would be for an attacker to get that file and use its contents. The file even spells out what each piece of data means when it's in the format in the OP.

If an attacker gets that JSON file, they have everything. They know all of your customers and have access to their API keys, which means they can easily get access to all data provided via that API. A DBMS can be configured to encrypt the entries individually. So if an attacker gets the database: Oh well, it's encrypted. If the attacker gets a customer record: Ok that's a bit bad, but they don't have access to all customer records.
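
An off-the-shelf DBMS handles that for you. If the flat file has to stay around for a while, the very least is not to keep the API keys in plaintext; here is a sketch using the Fernet recipe from the cryptography package (key handling and names are illustrative only):

    from cryptography.fernet import Fernet

    # The key must live somewhere safer than the JSON file itself
    # (environment variable, OS keyring, KMS, ...); generated inline only for this demo.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt the "Value" field before it is written to disk...
    token = fernet.encrypt(b"xxx-4c85-xxx-8420-xxx").decode("ascii")

    # ...and decrypt it when the daily batch run needs the real API key.
    api_key = fernet.decrypt(token.encode("ascii")).decode("ascii")
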

If you try to roll your own data integrity, backups, encryption, etc., I can almost guarantee you'll get something wrong. DBMSs have already 'got it wrong' many times, and they've all had to fix those problems. Any perceived difficulty interfacing with a DBMS programmatically is nothing compared to securing your data properly.

Maybe it's not a big concern - I don't know what access these API keys provide - but mentioning "accounts" and "payments" puts me a bit on edge. This sounds like data that needs to be secured properly. Space-wise everyone is absolutely right that the amount saved is trivial. But security-wise a JSON file is pretty scary.

Shaz
  • I think that we will use a DBMS and Python for the next version. This version was more like a proof of concept that we can do this without a DBMS. – Niklas Rosencrantz Jul 28 '17 at 17:10
2

For 100-1000 records processed a few times a day, efficiency is totally irrelevant. It will be faster than pressing a button in any case.

JacquesB