241

Instead of a database I just serialize my data to JSON, saving and loading it to disk when necessary. All the data management is made on the program itself, which is faster AND easier than using SQL queries. For that reason I have never understood why databases are necessary at all.

Why should one use a database instead of just saving the data to disk?

MaiaVictor
  • 5,820
  • 7
  • 27
  • 45
  • 70
    If managing the relationships of your data in your application is actually faster than doing it in a database (which I find extremely hard to believe) then you need to read up on SQL and database normalization. What you are experiencing is most probably the side-effect of a horribly designed database. – yannis Mar 14 '13 at 03:09
  • 5
    Well, try an example: imagine you're making a page on your site that shows a full list of members. How can you implement this? a) first query the database to get an array with the list of members then use that data to answer the request... b) just have that data stored on your program already and simply send it! See where I'm confused? Involving a database is just an additional step. How can an additional step be faster than not having it at all? – MaiaVictor Mar 14 '13 at 03:18
  • 78
    You don't need a database in the scenario you are describing because your data set is trivial. Databases are meant for more complex data sets, if all you do is read and show a list, your approach works. – yannis Mar 14 '13 at 03:24
  • 3
    @YannisRizos Perhaps: it's a list of members of a site, their passwords, login info, permissions, a list of exams and the grades of each member on those exams, a list of movies and their locations on the site and a list of messages sent between members, etc. Is this trivial? It seems. Am I fine storing that as files? Also, I'm curious: that's pretty much the kind of data a standard website needs. I've worked in some sites and it's never much different. What would be "non-trivial" data? What would need a database? Oh and thank you! – MaiaVictor Mar 14 '13 at 03:30
  • 5
    That sounds like enough complexity to justify a database. The great thing about a database is that it will work when the data is small, and it will still work when the data gets big (if it's designed properly). – Robert Harvey Mar 14 '13 at 03:32
  • 6
    Your schema is complex enough, and you should really start considering a database. Querying a database for a set of messages send between two members on specific date ranges shouldn't take more than a few milliseconds (assuming your database is properly designed), even if you have thousands of members and millions of messages. But if we are talking about, for example, 100 members and a couple of thousands messages between them, then you could probably make it work without a database. Anything larger than that and your app will start suffering, along with your users. – yannis Mar 14 '13 at 03:36
  • 22
    What race conditions could you encounter, and are you ready for that? Will you want to to scale past a single webserver? What is your backup plan if your server fails? Your answer to all of these questions is likely to be better if you have a database than if you don't. Also if you ever went over the hump of learning how to use databases, my guess is that you'd find your "easier than using SQL queries" should be amended to "easier than using SQL queries if you don't understand SQL." – btilly Mar 14 '13 at 05:21
  • 1
    @btilly I'm really getting many of what you're saying but there are some things I still don't understand like the race condition thing. Will it happen if I store my data as a JavaScript object, and access it only from the node.js application running it? That is, you mean the request handling events on node.js are spawning threads all the time? So I just can't access global objects from those events safely? Is that it? – MaiaVictor Mar 14 '13 at 09:30
  • 1
    I like using sqlite for such things where data is small and I want to keep dependencies as small as possible. – Xolve Mar 14 '13 at 09:54
  • 44
    Database stores data to disk anyway. It's just the end result of a natural evolution of systems for storing structured data to file. Chances are if you set out to use files to store your structured data you are going to find yourself reinventing features that have already been developed in databases. So why not just use a database from the start? – Benedict Mar 14 '13 at 10:00
  • 4
    your question should essentially be when should you store data in disk instead of databases – minusSeven Mar 14 '13 at 11:45
  • 16
    Depending on how your project evolves, you may find yourself having to deal with things like concurrent access and rollbacks. They sound trivial, but aren't. By the time you get done solving them, you will find you have basically written a database. Do you really want to be in the database business, or another business? – jwernerny Mar 14 '13 at 12:35
  • 1
    Some points in favor of file systems: https://news.ycombinator.com/item?id=5229883 – arxanas Mar 14 '13 at 12:43
  • 3
    @Dokkat Node.js uses cooperative multitasking, so as long as you're only using **one** CPU, you have no race conditions. As soon as you start a second process up, you have the possibility that both processes try to manipulate the same data at the same time, and hence race conditions. The most common being that both read it, both update data, both write it, and one of the two gets lost. But there are worse possibilities in which the file becomes corrupted and unreadable. – btilly Mar 14 '13 at 15:08
  • You couldn't find a faster way to write data to a disk than JSON? – JeffO Mar 14 '13 at 16:04
  • 3
    @YannisRizos comment 1. Could you write that in an answer so that I can downvote it? Databases in general incur massive overheads, if you can do something faster with a database than with in-process data in a program dedicated to that task, then it is your dedicated program that sucks. – aaaaaaaaaaaa Mar 14 '13 at 16:04
  • 1
    Can you do spatial queries on a file or in-memory data structure? Because it is trivial to do it on spatial DBs. – sakisk Mar 14 '13 at 17:38
  • Just because a 'cycle' serves your purpose doesn't imply that a car is useless :) – PhD Mar 15 '13 at 00:45
  • @btilly but why would I use another process? That node.js instance is the responsible for controlling the data, serving it and building up the state of the site/game. – MaiaVictor Mar 15 '13 at 01:08
  • @Dokkat If you ever have a popular website, then you'll find that one process does not have sufficient CPU. Likewise if redundancy becomes a business requirement, you will want to have multiple webservers able to go at a moment's notice. – btilly Mar 15 '13 at 05:40
  • @btilly oh that's interesting. So people actually open several node.js applications in different computers to serve a huge site? That makes a lot of sense now, but I have no idea on how this would be implemented. I mean, when someone makes a HTTP request it goes directly to the IP/port of the specific application... – MaiaVictor Mar 15 '13 at 05:43
  • 1
    @Dokkat Look up load balancing. Making many machines seem like one IP/port for web requests is a long-solved problem. – btilly Mar 15 '13 at 07:18
  • The OP is pretty much describing a homemade Mongo DBMS. It's a perfectly legitimate use case and was the basis for the NoSQL movement that was popularized when Google published its Map Reduce paper in the early 2000s. – Sridhar Sarnobat Mar 10 '16 at 00:18
  • This discussion is also useful I guess http://stackoverflow.com/questions/3748/storing-images-in-db-yea-or-nay – DmitryBoyko Apr 29 '16 at 15:41
  • @btilly you can have race condition even in cooperative multitasking system running in one CPU. Parallelism is not a necessary condition for race conditions. – Lie Ryan Mar 09 '20 at 09:14
  • Can't believe such an subjective/opinion-based question is still open after 7 years and no one voted it to be closed... – Tseng Mar 09 '20 at 10:55
  • @LieRyan I agree when dealing with external resources such as databases. When dealing with just node.js code race conditions are *possible* but really easy to avoid. – btilly Mar 09 '20 at 16:35
  • Which language are you using? If your program already keeps the data in objects AND the data is small enough to fit in computer's memory, then you don't need an extra layer of complexity (DB) at all. – Gabriel Sep 18 '22 at 05:45

13 Answers13

306
  1. You can query data in a database (ask it questions).
  2. You can look up data from a database relatively rapidly.
  3. You can relate data from two different tables together using JOINs.
  4. You can create meaningful reports from data in a database.
  5. Your data has a built-in structure to it.
  6. Information of a given type is always stored only once.
  7. Databases are ACID.
  8. Databases are fault-tolerant.
  9. Databases can handle very large data sets.
  10. Databases are concurrent; multiple users can use them at the same time without corrupting the data.
  11. Databases scale well.

In short, you benefit from a wide range of well-known, proven technologies developed over many years by a wide variety of very smart people.

If you're worried that a database is overkill, check out SQLite.

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
  • 4
    Thanks for the input, but I still don't get why it's necessary. 1. What's the point? 2. How is this faster than "looking up" data that is already on the program? It's a step less. 3. You can do the equivalent, better, using maps and reduces. 4. I see that as a point but you can do it without databases aswell. 5. You can perfectly structure your data without databases. It's trivial to store, say, JSON objects as tables. 6. 7. 8. What? 9. So can not using it at all? 10. Programs can be concurrent. Also I don't think it's very safe to have 2 programs editing the same data at once? – MaiaVictor Mar 14 '13 at 03:03
  • 27
    6. Normalization, 7. See the link, 8. Read up on fault-tolerance. Oh, and before you get sucked up into the NoSQL craze, learn about SQL databases; get to know them on their own terms. You will understand. If you're just talking about simple configuration data, JSON may be all you need. But there are many other types of data out there besides program settings. – Robert Harvey Mar 14 '13 at 03:04
  • 30
    As far as it not being safe to have two programs editing the data at once, well, that's partly why databases exist. If you ever have this need (and some or all of the other needs I mentioned), you're going to be very glad that you don't have to re-invent all this. – Robert Harvey Mar 14 '13 at 03:07
  • 3
    I kind of "understand" you. There ARE clearly many benefits in using a database. I can see that. Yet, I think I just fail to see where not using it at all will be a problem for me. I can visualize myself saving all my data on disk and loading it, without ever having a problem. I think this is the problem: I see why it's "good", but why it's "necessary"? Mainly giving how MUCH, MUCH faster it is to manipulate data from memory rather than querying a database all the time. It seems like a cost that is not paid at all, even after the benefits. – MaiaVictor Mar 14 '13 at 03:10
  • 14
    I think you're thinking in terms of small amounts of memory and data. I routinely work with data sets that are many gigabytes in size, and it's just not practical to load up all that data and tie it up on a single machine every time someone wants to work on it. I would imagine the practical working limit for a JSON data set without any indexing is probably 50 megabytes or so. – Robert Harvey Mar 14 '13 at 03:11
  • 24
    @Dokkat It's not necessary, nothing is. If your approach works for you, by all means go for it. I should mention however that most half decent rdbms support memory based storages, you can load everything you need in memory when your app wakes up (as you already do), and query them as you would a typical database (keeping all the benefits Robert mentioned). – yannis Mar 14 '13 at 03:13
  • 30
    To put it another way, sometimes you need a tent, but sometimes you need a house, and building a house is a whole different ball game than pitching a tent. – Robert Harvey Mar 14 '13 at 03:13
  • 1
    I'm getting it - indeed, I'm not even considering gbs of data at all, so I see. But lets try an example? Imagine you're making a page on your site that shows a full list of members. How can you implement this? a) first query the database to get an array with the list of members then use that data to answer the request... b) just have that data stored on your program already and simply send it! This is the basic use case for a DB for me, so involving it is just an additional step. How can an additional step be faster than not having it at all? That's what is confusing me. Thoughts on this? – MaiaVictor Mar 14 '13 at 03:21
  • 1
    @Dokkat: OK. So how do you deal with the concurrency issue? People are going to be adding new accounts while you retrieve your user list. How will you deal with that? – Robert Harvey Mar 14 '13 at 03:26
  • 2
    @Dokkat Well, you are comparing in-memory data (read 'volatile') vs. database (read 'stable')... What happens if the application crashes: there is suddenly no "list of members". Having unnecessary data in memory slows down your application... until you run out of it. Then: puff! – rae1 Mar 14 '13 at 03:26
  • 1
    @rae1n if it crashes you save the data to disk and reload it? Does having data in memory actually slows down your application if you don't run out of it? – MaiaVictor Mar 14 '13 at 03:34
  • @RobertHarvey I don't get it... how is this a problem...? If you save your data as a JavaScript object in a node.js environment, write and read it on requests, will it have concurrency issues? What will happen? (Also, would this be solved if I used immutable data? Clojure?) – MaiaVictor Mar 14 '13 at 03:36
  • 11
    @Dokkat I doubt you'll have to time to save during a crash... by definition code execution just stops! – rae1 Mar 14 '13 at 03:37
  • @Dokkat And you get a limited amount of memory by the OS, who can take it away at a moment's notice (hopefully not)... and not even mention GBs of data! – rae1 Mar 14 '13 at 03:39
  • 56
    @Dokkat when people are referring to crashes, they mean stuff like... your CPU blew up halfway through writing your "database" file. What happens now? Most likely your file is corrupt / unreadable (at least, it may no longer conform to your own format), and you need to restore form a backup (while most "real" DBs would only lose the last transaction). Of course, you can write code to make it handle this. Then you can write code for all the other stuff. And then you realise you've spent 6 months writing a DB, which you could have used from the start, for very little effort. – Daniel B Mar 14 '13 at 06:44
  • OK, I get the crash problem, but how about the race condition thing? What do you guys mean? I can't access global objects from within node.js's request handler events safely? – MaiaVictor Mar 14 '13 at 09:31
  • 10
    @Dokkat - Is your program going to be used by more than one person at a time? (Hint - if it's web-based, the answer is very likely "yes".) As soon as more than one person has access to the data at the same time, you run the risk of race conditions. Database management systems already have mechanisms in place to help manage these situations. – Shauna Mar 14 '13 at 17:37
  • 2
    @Shauna sorry but this is just not true. Node.js is single threaded (as I've just learned) and can perfectly serve data to thousands of people without having concurrency problems. – MaiaVictor Mar 15 '13 at 01:09
  • 2
    @Dokkat: Not quite. See http://stackoverflow.com/q/7018093 – Robert Harvey Mar 15 '13 at 02:04
  • 9
    @Dokkat: On any non-hobby project, you'd want to use more than one machine, usually for availability and sometimes for scalability (lots of users using it, etc.). Using a file system will choke right there (can't see your user JSON from the other machine!). You can use a shared disk, but now every single query goes to disk. And you have be weary of race conditions (multi machines == multi threads!). etc. etc. Believe me, RDBMS didn't became de-facto standard of DB for no reason. SQL is highly useful as well. MR paradigm isn't "better" than SQL. They are just great for different things. – Enno Shioji Mar 15 '13 at 11:02
  • 3
    @Dokkat - re "You can do the equivalent, better, using maps and reduces." Not really. A map/reduce is at best equivalent to a database table scan. The database engine is very good about making use of information and indexes to do better than a table scan. Which, of course, you can do yourself, but then you've re-invented what a DBMS gives you. – parsifal Mar 20 '13 at 13:18
  • @JMoore for purposes of this answer, just substitute RDBMS everywhere you see the word "database." – Robert Harvey May 20 '13 at 00:54
  • Don't forget about views! – Jack Henahan Apr 11 '14 at 00:53
  • @Daniel Writing data to a file in a way that is crash safe was a solved problem 20 years ago. – gnasher729 Mar 09 '20 at 08:48
  • If your program already keeps the data in objects AND the data is small enough to fit in computer's memory, then you don't need an extra layer of complexity (DB) at all. – Gabriel Sep 18 '22 at 05:45
226

Whilst I agree with everything Robert said, he didn't tell you when you should use a database as opposed to just saving the data to disk.

So take this in addition to what Robert said about scalability, reliability, fault tolerance, etc.

For when to use a RDBMS, here are some points to consider:

  • You have relational data, i.e. you have a customer who purchases your products and those products have a supplier and manufacturer
  • You have large amounts of data and you need to be able to locate relevant information quickly
  • You need to start worrying about the previous issues identified: scalability, reliability, ACID compliance
  • You need to use reporting or intelligence tools to work out business problems

As for when to use a NoSQL

  • You have lots of data that needs to be stored which is unstructured
  • Scalability and speed needs
  • You generally don't need to define your schema up front, so if you have changing requirements this might be a good point

Finally, when to use files

  • You have unstructured data in reasonable amounts that the file system can handle
  • You don't care about structure, relationships
  • You don't care about scalability or reliability (although these can be done, depending on the file system)
  • You don't want or can't deal with the overhead a database will add
  • You are dealing with structured binary data that belongs in the file system, for example: images, PDFs, documents, etc.
Sam
  • 6,152
  • 2
  • 20
  • 34
  • 18
    +1, I think its important that you pointed out there's times when files actually are suitable for storage. – GrandmasterB Mar 14 '13 at 04:13
  • 18
    You could add another example to your third list: When data actually *is* files, e.g. uploaded images, pdf documents and such. It may seem obvious but I did see cases where images were stored in a database blob for no good reason whatsoever. – Goran Jovic Mar 14 '13 at 09:46
  • 2
    Good answer, but when would a web application *not* need to at least be thinking ahead about scalability? And why would *any* real application not need to think about reliability? – John M Gant Mar 14 '13 at 11:15
  • 5
    Well, there was never any explicit mention made to it being a web app but I did infer it from the JSON comment. However, sometimes something will only be used by a few people and you can justify the scope of the application to not worry about scalability and reliability. By this I mean, not worrying about things such as clustering and redundancy. – Sam Mar 14 '13 at 11:16
  • 8
    @GoranJovic it sometimes makes sense. Store 10,000+ images in a directory and some filesystems will grind to a halt - a DB might be easier than a manual sub-directory partition scheme. – Martin Beckett Mar 14 '13 at 19:48
  • 1
    Good mention of binary data and RDBMS instead of just database (explained for those still learning here, for SQL Server - http://www.youtube.com/watch?v=bXbm0qGwgAw). Sometimes a simple JSON/XML/INI file is good enough. – Bratch Mar 14 '13 at 20:48
  • Do your rules for binary files apply to heirarchical data formats like netCDF and HDF? Where do these fit in? They seem to work more like databases than binary files, but then I really don't know anything about how information in other binary files is used. – Abe Mar 17 '13 at 03:41
  • 2
    @MartinBeckett: which filesystem of the past decade does that? – Eamon Nerbonne May 19 '13 at 07:44
  • Of course, "embedded" databases like sqlite and [web storage](http://www.w3.org/TR/webstorage/) blur the lines between using a DB and using a file, you have the ease of a file and the power of SQL all at the same time. I can't see the point of writing your own file-content read/write system when someone's already done it for you, and done it way better. – gbjbaanb May 19 '13 at 13:11
  • The use of "unstructured" could be confusing. Both NoSQL and files can store data structures just fine (Json for example), but has limited possibility to query data or refer to data outside the structure. – JonasH Mar 10 '20 at 10:56
  • The necessity to access and edit the same data at the same time may be another reason to go for RDBMS. – Sergey Bushmanov Apr 05 '20 at 10:44
  • I think that storing telemetry data (which is not required immediately, probably EOD) to blob storage/ file system is much cheaper and can easily be processed by ELT job later when required is a good use case. – Kid101 Sep 13 '20 at 19:38
  • Scalability of what? data-size? sure, writes? probably, reads?... a small enough data-set should scale to lots of reads. – Myster May 31 '21 at 04:27
61

One thing that no one seems to have mentioned is indexing of records. Your approach is fine at the moment, and I assume that you have a very small data set and very few people accessing it.

As you get more complex, you're actually creating a database. Whatever you want to call it, a database is just a set of records stored to disk. Whether you're creating the file, or MySQL, SQLite or whatever is creating the file(s), they're both databases.

What you're missing is the complex functionality that has been built into the database systems to make them easier to use.

The main thing that springs to mind is indexing. OK, so you can store 10 or 20 or even 100 or 1000 records in a serialised array, or a JSON string and pull it out of your file and iterate it relatively quickly.

Now, imagine you have 10,000, 100,000, or even 1,000,000 records. When someone tries to log in you're going to have to open a file which is now several hundred megabytes large, load it into memory in your program, pull out a similarly sized array of information and then iterate 100s of thousands of records just to find the one record you want to access.

A proper database will allow you to set up indexes on certain fields in records allowing you to query the database and receive a response very quickly, even with huge data sets. Combine that with something like Memcached, or even a home-brew caching system (for example, store the results of a search in a separate table for 10 minutes and load those results in case someone else searches for the same thing soon afterwards), and you'll have blazing fast queries, something you won't get with such a large dataset when you're manually reading/writing to files.

Another thing loosely related to indexing is transfer of information. As I said above, when you've got files of hundreds or thousands of megabytes you're having to load all of that information into memory, iterate it manually (probably on the same thread) and then manipulate your data.

With a database system it will run on its own thread(s), or even on its own server. All that is transmitted between your program and the database server is an SQL query and all that is transmitted back is the data you want to access. You're not loading the whole dataset into memory - all you're sending and receiving is a tiny fraction of your total data set.

Peter Mortensen
  • 1,050
  • 2
  • 12
  • 14
Thomas Clayson
  • 1,013
  • 1
  • 7
  • 12
  • I'm not sure I get this one too. On the login/password validation example. In JS and ignoring encryption: you just create an object like logins = {login1:pass1,login2:pass2...}. Then a simple logins["test_login"] === "test_pass" will do it. Hash access is O(1), so this would be ridiculously fast, regardless of how many users you have. The data would be already on memory, the loading happens only once, when the program opens. Also that struct is lightweight, I guess you could have millions of users without needing much memory. So what's the point? – MaiaVictor Mar 14 '13 at 09:24
  • 1
    1. Please don't ever load all of your user information into client side code! (I'm sure it was only an example) 2. Loading that in the first place from a file 100s of MB large will take a while. 3. You're example is correct, however it assume that you're only ever going to search by username. What happens if you want to store more data about a user? e.g. Age. Now you want to search for all users who are between the ages of 20-30. Or even simpler, find a user by address when your json looks like this: {login:{pass:pass,add1:"123 sasd",city:"Wherever"}}. – Thomas Clayson Mar 14 '13 at 09:33
  • 2
    Your last point is potentially correct, but then I could be working from old data - specifically, if I open your program, load the current database then 5 mins later someone else logs on and edits something, my database is now a later version until I quit the program and start it again. If I then edit my database and save it again I'll overwrite any changes that the other user made. When you've got a user's database this could be anything from just changing your password. If two users change their password during each others sessions then one user will have their change reversed. – Thomas Clayson Mar 14 '13 at 09:37
  • You don't get the example, that code is supposed to run on node.js, server-side, for the validation of login/password pairs in constant time. That is, the data is only there, centralized. To access the data you'd communicate with the program handling it via TCP requests. Anyway, fair enough, I'm getting your point - databases index everything automatically, so anything I want to do it will do as fast as possible. Say, I need to find a user that has certain age and weight: the database manages to do it in constant time, regardless of it's size. Is that it? – MaiaVictor Mar 14 '13 at 09:40
  • 4
    I've learned a lot after searching some things about indexing. It was really enlightening. Databases make a little more sense now. There are still some things I don't understand, but that's a big progress. Thanks for that answer! – MaiaVictor Mar 14 '13 at 10:06
  • Yes exactly. (Didn't realise it was Node.js). Its not just that, in its most basic form it means that you don't have to write specific code for each lookup you want to do to find the information you need to find. Its the same process (1 sql query) to look up any field of any record, aggregate/average the results and much more. If you need any more insight let me know, I'm not a database expert but I have used them for as long as I can remember. – Thomas Clayson Mar 14 '13 at 11:16
  • 5
    About indices, no, database doesn't index everything automatically. Only few things are automatically indexed while the rest require explicit "please make this indexed". And indices reduce search to logarithmic time, O(log(n)) which is slightly slower than constant. – Emperor Orionii Mar 14 '13 at 12:46
  • @EmperorOrionii why O(log(n))? I always thought table lookup was O(1). – MaiaVictor Mar 14 '13 at 13:31
  • Because they are usually implemented as a B-trees (not to be confused with binary tree) and trees, if balanced, have O(1) lookup complexity in the best case and O(log n) in an average and the worst case. I've heard of hash based indices (O(1) lookup on average) but don't know if any DB supports them. – Emperor Orionii Mar 14 '13 at 15:42
  • @EmperorOrionii honestly that just complicates the things for databases in my judgement! I'll trust you guys but I admit I'm tempted to use files, memory and mapreduces at least once and learn by myself what will break first ;p – MaiaVictor Mar 15 '13 at 01:12
  • Same can be said for quicksort, binary search and hash tables. They are more complicated then selection sort, thorough search and plain arrays but when somebody implements them for you, there is no reason not to use them. After you make your app with ordinary files, try to remake it or part of it with databases. It would be great way to learn. – Emperor Orionii Mar 15 '13 at 08:47
  • 1
    Worrying about the difference between a hash-based and b-tree based implementation is a premature optimization. If data is in the index, it'll still be a dozen times faster than reading it off of disk. – SilverbackNet May 18 '13 at 19:16
  • Yes, my take is that at some point you would have to stop loading all the data into memory and need some kind of indexing, then you will find yourself reinventing a database system. Another point is relational integrity which you will have to re-implement. Then there are race conditions / locking. And finally I did experiment with using the FS for storage of documents without a database. I did not run into any concurrency problems but ran into some challenges regarding indexing and ordering, but most of all, this setup tended to make my code disorganized and hard to understand. – Rolf Mar 14 '18 at 01:46
  • 1
    @EmperorOrionii Postgres supports both BTree and hash indexes, but B-Tree is the default and recommended for most situations. B-Tree indexes are more versatile than hash tables indexes, because the query planner can use B-Tree indexes to serve range queries, queries that requires sorting, and joins that requires merge sorts in `O(n)`. Hash table indexes would require explicit sorting step for those adding `O(n*log(n))`. Real world measurements also don't see much actual performance differences between Hash and BTree even if you only need simple lookups. – Lie Ryan Mar 09 '20 at 22:26
16

A file system fits the description of a NoSQL database, so I'd say you should definitely consider using that when deciding on how to store your data and not just dismiss it off hand in favor of RDBMS, like some answers seems to suggest here.

One issue with file systems (and NoSQL in general) is handling relationships between data. If that is not major blocker here, then I'd say skip the RDBMS for now. Also remember the positive sides of using a file system as storage:

  • Zero administration
  • Low complexity, easy to set up
  • Works with any operating system, language, platform, libraries etc
  • Only configuration setting is the directory
  • Trivial to test
  • Trivial to examine with existing tools, backup, modify etc
  • Good performance characteristics and well tuned by the operating system
  • Easy for any developer to understand
  • No dependencies, no extra drivers
  • Security model is trivial to understand and is a base part of operating system
  • Data is not externally accessible
Martin Wickman
  • 13,305
  • 3
  • 31
  • 66
16

TLDR

It sounds like you made an essentially valid, short term data-store technical decision for your application - you chose to write a custom data store management tool.

You're sitting on a continuum, with options to move in either direction.

In the long term, you'll likely (almost, but not 100% certainly) find yourself running into trouble, and may be better off to change to using existing data store solutions. There are specific, very common, predictable, performance problems you will be forced to deal with, and you're better off using existing tools instead of rolling your own.


It sounds like you've written a (small) custom-purpose database, built into and directly used by your application. I assume you're relying on an OS and file system to manage the actual disk writing and reading, and treating the combination as a data-store.

When to do what you did

You're sitting at a sweet-spot for data storage. An OS and file system data store is incredibly convenient, accessible, and cross-platform portable. The combination has been around for so long, that you're certain to be supported, and have your application run, on almost any standard deployment configuration.

It's also an easy combination to write code for - the API is fairly straight-forward and basic, and it takes relatively few lines of code to get it working.

Generally, it's ideal to do what you've done when:

  • Prototyping new ideas
  • Building applications which are highly unlikely to need to scale, performance wise
  • Constrained by unusual circumstances, such as lack of resources for installing a database

Alternatives

You're on a continuum of options, and there are two 'directions' you can go from here, what I think of as 'down' and 'up':

Down

This is the least likely option to apply, but it's here for completeness sake:

You can, if you want, go down, that is, bypass the OS and filesystem altogether and really write and read directly from disk. This choice is usually relevant only in cases where extreme efficiency is required - think, for example, of a minimal/tiny MP3 player device, without enough RAM for a fully functional OS, or of something like the Wayback Machine, which requires incredibly efficient mass data write operations (most data stores trade off slower writes for faster reads, since that's the overwhelmingly more common use case for almost all applications).

Up

There are several sub-categories here - these aren't exactly exclusive, though. Some tools span both, providing some functionality in each, some can completely switch from working in one mode to working in the other, and some can be layered on top of each other, providing different functionality to different parts of your application.

More powerful data stores

You may find yourself needing to store higher and higher volumes of data, while still relying on your own application for managing the data manipulation complexity. A whole range of key-value stores are available to you, with varying extents of support for related functions. NoSQL tools fall into this category, as well as others.

This is the obvious path to scale up on when the following describe your application:

  • It is unusually heavy read reliant
  • You're OK with trading off higher performance for lower (short term) consistency guarantees (many offer "eventual consistency").
  • Is "directly" managing most of the data manipulation and lack of consistency (in practice, you'll probably end up using a third party tool at first, though eventually you'll bring this into your application or into a custom written intermediate layer).
  • You're looking to massively scale the amount of data you're storing and/or your ability to search through it, with "relatively simple" data manipulation requirements.

There is some wiggle room here - you can force better read consistency, for slower reads. Various tools and options provide data manipulation apis, indexing and other options, which may be more or less suited for easily writing your specific application. So if the above points almost completely describe your application, you might be "close enough" to work with a more powerful data store solution.

Well-known examples: CouchDB, MongoDB, Redis, cloud storage solutions like Microsoft's Azure, Google App Data Store and Amazon's ECE.

More complex data manipulation engines

The "SQL" family of data storage application, as well as a range of others, are better described as data manipulation tools, than pure storage engines. They provide a wide range of additional functionality, beyond storage of data, and often beyond what's available in the key-value store side of things. You'll want to take this path when:

  • You absolutely have to have read consistency, even if it means you'll take a performance hit.
  • You're looking to efficiently perform highly complex data manipulation - think of very complex JOIN and UPDATE operations, data cubes and slicing, etc...
  • You're OK with trading off rigidity for performance (think forced, fixed data storage formats, such as tables, which cannot easily and/or efficiently be altered).
  • You have the resources to deal with an often times more complex set of tools and interfaces.

This is the more "traditional" way of thinking of a database or data store, and has been around for much longer - so there is a lot that's available here, and there's often a lot of complexity to deal with. It's possible, though it takes some expertise and knowledge, and build simple solutions/avoid much of the complexity - you most likely will end up using third-party tools and libraries to manage most of it for you, though.

Well known examples are MySQL, SQL Server, Oracle's Database, and DB2.

Outsource the work

There are several, modern, third-party tools and libraries, which interpose themselves between your data storage tools and your application, to help you manage the complexity.

They attempt to initially take away most or all of the work that goes into managing and manipulating data stores, and, ideally, allow you to make a smooth transition into complexity only when and if it is required. This is an active area of entrepreneurship and research, with a few recent results that are immediately accessible and useable.

Well-known examples are MVC tools (Django, Yii), Ruby on Rails, and Datomic. It is hard to be fair here as there are literally dozens of tools and libraries which act as wrappers around the APIs of various data stores.


PS: if you prefer videos to text, you might want to watch some of Rich Hickey's database related videos; he does a good job of elucidating most of the thinking that goes into choosing, designing and using a data store.

blueberryfields
  • 13,200
  • 8
  • 51
  • 87
14

When you have simple data, like a list of things as you describe in the comments of your question, then an SQL database won't give you much. A lot of people still use them, because they know their data can get more complicated over time, and there are a lot of libraries that make working with database trivial.

But even with a simple list that you load, hold in memory, then write when needed, can suffer from a number of problems:

Abnormal program termination can lose data, or while writing data to disk something goes wrong, and you can end up killing the whole file. You can roll your own mechanisms to handle this, but databases handle this for you using battle-proven techniques.

If your data starts growing too big and updating too often, serializing all your data and saving is going to be a big resource hog and slow everything down. You'd have to start working out how to partition things, so it won't be so expensive. Databases are optimized to save just the things that change to disk in a fault tolerant way. Also they are designed, so you can quickly just load the little bits of data you need at any given time.

Also, you don't have to use SQL databases. You can use NoSQL "databases" which many do, just use JSON to store the data. But it is done in a fault-tolerant way, and in a way where the data can intelligent split up, queried, and intelligently split across multiple computers.

Also, some people mix things up. They might use a NoSQL data store like Redis for storing login information. Then use relational databases to store more complex data where they need to do more interesting queries.

Peter Mortensen
  • 1,050
  • 2
  • 12
  • 14
Keith Nicholas
  • 690
  • 1
  • 6
  • 12
12

I see a lot of answers focus on the problem of concurrency and reliability. Databases provide other benefits beside concurrency, reliability and performance. They allow to not to bother how bytes and chars are represented in the memory. In other words, databases allow programmer to focus himself on "what" rather than "how".

One of the answers mentions queries. "Asking SQL database a question" scales well with the complexity of a question. As code evolves during the development simple queries such as "fetch all" can easily expand to "fetch all where property1 equals this value and then sort by property2" without making it programmer's concern to optimize data structure for such query. Performance of most queries can be speed up by making index for a certain property.

Other benefit are relations. With queries it's cleaner to cross-reference data from different data sets then having nested loops. For example searching for all forum posts from users that have less then 3 posts in a system where users and posts are different data sets (or DB tables or JSON objects) can be done with a single query without sacrificing readability.

All in all, SQL databases are better then plain arrays if data volume can be big (let's say more than 1000 objects), data access in non-trivial and different parts of code access to different subset of data.

Emperor Orionii
  • 481
  • 2
  • 7
  • I'm a little leery about the idea that you can just ignore how stuff is represented. While you *can* ignore this, if you do, and esp. if you do write a slightly more complex query, it's exceedingly likely your application can no longer scale. "Adding an index" is not always possible - you've got writes to contend with, and it simply doesn't help that much with queries whose complexity spans multiple tables. When indexes are *necessary* that implies you've lost the benefit of interactive queryability since only specifically structured queries are answerable in reasonable time. – Eamon Nerbonne May 19 '13 at 07:42
  • 1
    @EamonNerbonne Perhaps better phrased, with an RDBMS, you can ignore how stuff is represented until it actually shows up as a real world problem. Only when things becomes a problem, then you can analyse the issue and in most cases the solution would be in the form of just add a few well placed index, and the RDBMS' query planner will take care of taking advantage of that index on all other operations that might benefit from it, and updating it. OTOH, if you're writing your own data storage using filesystem, you'll have to update the rest of the application. – Lie Ryan Mar 09 '20 at 23:03
9

File systems are a type of database. Maybe not a RDBMS like everyone else is talking about, but certainly a DB in the strictest sense. You're provide keys (file name) to look-up data (file contents), which has abstracted storage and an API by which your program communicates.

So, you are using a Database. The other posts can argue about the virtues of different types of database...

Chris S
  • 207
  • 1
  • 6
  • 1
    database and storage can't really be used interchangeably. A database is a type of storage, but a file systems is certainly not a type of database – GWed Mar 15 '13 at 15:57
  • 3
    "storage" is where bits and bytes are held. A database does not necessarily use files on a file system. A file system is most definitely a type of database in the strictest sense of the term. – Chris S Mar 15 '13 at 15:59
  • 7
    For someone who is arguing that there's no use in databases when they're alternative is to *use a database*; yes. It seems helpful to explain to them that their argument is based on a preconceived notion that is wrong. Once they have a better understanding of their initial situation we can help them move forward with more complete understanding of available technologies. File systems are hierarchical databases, there's good reasons relation and object database systems have supplanted them as faster, better organized, and more efficient data storage/retrieval. – Chris S Mar 15 '13 at 17:10
  • 3
    @Gaz_Edge The data is already in an inefficient "database" of sorts by being stored in a bunch of files whose structure and content are both managed by the OP's application. Trying to get the OP to understand and accept *that* is a useful first step to getting them to understand the use case for a "real" database system; once they understand that a "database" of some kind is happening anyway, it's easier to start talking about where a properly structured and managed service is more efficient than letting the app do its own thing. I'd suggest this answer does help, very much so. – Rob Moir Mar 17 '13 at 10:29
8

A database is needed if you have multiple processes (users/servers) modifying the data. Then the database serves to prevent them from overwriting each others changes.

You also need a database when your data is larger than memory. Nowadays with the memory we have available, this does indeed makes the use of databases in many applications obsolete.

Your approach is definitely better than the nonsense of "in-memory databases". Which are essentially your approach, but with a lot of overhead added.

funql.org
  • 81
  • 2
  • To be honest I love this answer and would like that to be true, but I'm not sure that's the case. For example, some users (and you) raised a concern about memory. Of course, if I'm storing GBs worth of data I can't keep it all on memory. But what if I'm certain the data would never be that large, should I just use memory? Well, there are other things too. For example, I've learned about CouchDB's incremental views. That is certainly something that, differently from indexing, would NOT be trivial to implement yourself, and is certainly a huge speedup when you are using a view model, – MaiaVictor Mar 14 '13 at 13:37
  • which I guess I am. For example, when I transform data from "player list" to "ranking", this is nothing but a map reduce operation. When creating a game or an interactive site, pretty much everything you present is a mapReduce operation from your core data! So having that kind of optimization could be really desirable. Well, I have no idea if any of what I'm talking proceeds, but that makes sense. Learning a lot today, and I'm really liking the NoSQL concepts. Thanks for the answer (: – MaiaVictor Mar 14 '13 at 13:42
7

You should always ask yourself if a particular application needs an RDBMS. Too many applications are built with a design process that automatically assumes all the required tools and frameworks at the beginning. Relational databases are so common and many developers have worked on similar applications as before, that they're automatically included before the project starts. Many projects can get away with this, so don't judge too harshly.

You started your project without one, and it works. It was easier for you to get this up and running without waiting until you SQL. There is nothing wrong with that.

As this project expands and the requirements become more complicated, some things are going to become difficult to build. Until you research and test alternate methods, how do you know which is better? You can ask on Programmers and weed through the flames and 'it depends' to answer this question. Once you learn it, you can consider how many lines of code you're willing to write in your language to handle some of the benefits of a database. At some point, you're reinventing the wheel.

Easy is often relative. There are some frameworks that can build a web page and connect a form to a database table without requiring the user to write any code. I guess if you struggle with the mouse, this could be a problem. Everyone knows, this isn't scalable or flexible because god forbid you've tightly coupled everything to the GUI. A non-programmer just built a prototype; lots of YAGNI to be found here.

If you'd rather learn an ORM manipulated by your language of choice instead of learning SQL, go for it, but try to install, create a table and pull some data out of a popular database with SQL (Select * From ; isn't mindblowing stuff). It's easy to do. That's why someone created them in the first place. It doesn't seem like such a huge investment in order to make an informed decision. You could probably do a performance test as well.

JeffO
  • 36,816
  • 2
  • 57
  • 124
  • Just to note, I've actually used mysql for years when I hosted an "otserv". Guess what? All it brought was problems. People could "clone" items using a dirty trick after they realized their characters was saved when they logged out but not when the server crashed. This is a serious problem for otservs. And the otserv community is HUGE. That wouldn't happen if they just stored data on memory and serialized it periodically. So I modified the source by myself, those long C++ files and started saving to mysql periodically, instead of when characters logged out. Guess what? It was SLOW! – MaiaVictor Mar 15 '13 at 01:18
  • Mysql simply couldn't handle fully saving state each 2 minutes or so. It was pretty clear when the saving happened - the whole server "lagged" for a second. Now I'd really appreciate if people posting here had an answer for that one! – MaiaVictor Mar 15 '13 at 01:20
  • 1
    Don't judge RDBMSs by what happened with a single application that was probably coded poorly. Especially when the modifications to support a database were made by someone with no database experience. – alroc Mar 15 '13 at 15:37
  • 1
    @Dokkat, I hope that nobody kicks the power cord in between depositing funds in your bank account and "periodically" writing the account balance to disk. You've described a guaranteed data loss architecture. That is fine for some applications, but most database applications give users the power to choose. You can run a single database node with backups and risk some data loss or use replication to eliminate data loss if a single node fails. – mikerobi May 18 '13 at 22:45
  • @Dokkat so you don;t use MySql or any other full-featured "server" style DB. You use Sqlite (or similar) and it will persist to disk every time, whilst giving you a DB embedded in your app (so no need for a separate install) and still giving you sql access, transactional integrity and disk persistence. – gbjbaanb May 19 '13 at 12:58
6

Saving the data to disk IS writing it to a database, especially if you put each object in its own file with the name of the file being the key to the record. And to minimize lookup times for reading the file, create subdirectories based on the first few characters of the key.

For instance key=ghostwriter would go in g/ho/stwriter.json or g/h/o/stwriter.json or g/ho/ghostwriter.json or g/h/o/ghostwriter.json. Choose your naming scheme based on the distribution of your keys. If they are sequence numbers then 5/4/3/12345.json is better than the other way around.

That is a database and if it does all that you need, then do it that way. Nowadays that would be called a NoSQL database like GDBM, or Berkeley db. So many choices. First figure out what you need, then build an interface library to deal with the details, perhaps a get/set interface like memcached or a CRUD interface, and then you will be able to swap libraries if you need to change the database format for one with different characteristics.

Note that some SQL databases like PostgreSQL and Apache Derby DB, will allow you to do SQL queries on top of many NoSQL formats including your own homegrown databases. Not sure about MyBatis but it may be similar.

Avoid NoSQL hype. Read about the features, test the performance and capability and then choose based upon how well it matches your application needs.

http://www.hdfgroup.org/HDF5/ is yet another interesting and widely used datastore format that people do not often consider.

Michael Dillon
  • 273
  • 1
  • 4
4

As soon as the data are updated concurrently, the approach using a database (it could well be an in memory database) will likely be more correct and more performant, while at the same time your code remains easy, because you simply don't have to worry about concurrent updates, transactions, caching, asynchronous I/O and all that.

Ingo
  • 3,903
  • 18
  • 23
  • Concurrent modification within a process will be more efficient using in-process locks rather than IPC to a database daemon that acquires a bunch of locks. But you're presumably talking about multiple processes modifying the data. – dhasenan Mar 15 '13 at 16:18
  • @dhasenan - This is another advantage of good database systems. You get the concurrency, and it works in all cases: Multi-threaded, multi-process, multiple clients on different servers, or any combination thereof. Your well though out multi-threaded program may be "more efficient" in certain cases, yet it simply won't scale. – Ingo Mar 15 '13 at 21:16
-5

You need a databse to store / retrieve QAs like the ones we are posting here ! A simple file is unable to organize data related to different topics.

joe
  • 207
  • 1
  • 2
  • 10
  • 3
    No, "topics" could be folders, and the "posts" on the site could be files. It's definitely possible to run a site like this off a filesystem. It is not efficient: slow and complicated to develop, run queries, insert new data, etc. – Chris S Mar 15 '13 at 17:12
  • slow + complicated = unable ? – joe Mar 15 '13 at 17:13
  • Slow and complicated to build != slow and complicated to function – joe Mar 15 '13 at 17:22
  • 1
    @joe, it's really not true that a file (maybe not a "simple" file, but what does that mean?) can't be used to organize data related to different topics. You could use JSON, as Dokkat suggests, or XML, or mixed-record files like we used to do back in the pre-XML days, or whatever file format you can dream up. I wouldn't recommend any of these approaches for most scenarios, but that doesn't mean they can't be done. – John M Gant Mar 15 '13 at 17:30
  • @John M Gant: totally agree with you, databases cannot replace single (since you do not like simple) files, and vice versa, for the only reason that a car cannot replace a bicycle. i speak 3 "human" languages, and my choice of words and vocabulary is the reason why i was misunderstood...i guess – joe Mar 15 '13 at 17:38
  • @joe, "single" makes more sense to me. I understand what you're saying now. I think I mostly agree with you in principle. The downvoters (which I'm not) probably object to your saying you *can't* build a site without a database. Most of seem to agree you *shouldn't*, but you technically can. – John M Gant Mar 15 '13 at 17:51