Create my own database system

Question

I need to learn how databases work in order to use them more efficiently, and my way of learning is by doing.

I want to create my own database system. I am not referring to creating a pseudo-database that would use query to parse files; this would simply be a filesystem interface with a query language. I am talking about the actual structure of a database engine. And since what I have in mind is neither relational nor document-oriented (it's "node-oriented", if that even exists), I would need any resource to be as abstract and high-level as possible.

So how would I go about creating that? What resources/tutorials/books can I read to understand?

The language does not matter in the slightest. Ideally, the code would be pseudo-code to illustrate the concept, not tied to a particular language, but anything would do. I was not able to find anything on the matter on google (since I am so illiterate on the subject, maybe I am just not entering the right search).

If such resources are not available, then I guess something about how to create a client would at least be a step in the right direction.

Why not write a compiler instead? Or even better, your own operating system? If you are really serious about writing your own database, there are a thousand and one open source databases out there: Study their source code, contribute a few patches. Then start thinking of building your own. — yannis, Nov 25 '11 at 06:20
I knew I was going to get bashed :) As I said, the purpose is to learn. I studied open-source DBs, but their codebase is too huge to get a real grasp; that's why I want to build from the ground up, beginning small and expanding until I get a real feel of how things work. THEN it will be useful for me to deeply study what has been done. — Xananax, Nov 25 '11 at 06:30
You can take some college level and graduate level database courses. There are many open source courses online. You can also buy a few textbooks and study them in spare time. This will give you some ideas and starting points. Reading the history and news about PostgreSQL will also help (in terms of imagination, although it will not give you any idea how those features are actually implemented) — rwong, Nov 25 '11 at 06:35
By "node-oriented", are you referring to [Graph database](http://en.wikipedia.org/wiki/Graph_database)? — rwong, Nov 25 '11 at 06:37
`I studied open-source DBs, but their codebase is too huge`: If something like [redis](https://github.com/antirez/redis/tree/unstable/src) or [flockdb](https://github.com/twitter/flockdb) is too huge for you to read, I don't see how you'll cope writing or own database. — yannis, Nov 25 '11 at 06:37
@rwong: Yes. I did not use the term because I never used graph databases and I'm not sure they are what I think they are. I have experience in relational/document-oriented DBs only. Also, I have a hunch that understanding graph DBs might be even more difficult so I try to not be looking for that exclusively. — Xananax, Nov 25 '11 at 06:39
@YannisRizos In fairness, reading code (imo) is much more difficult than writing it yourself. — , Nov 25 '11 at 06:44
@YannisRizos: Ok, I have to confess I had no idea code for a db could be so small. Thanks for the pointer, this is actually something I can study. — Xananax, Nov 25 '11 at 06:45
@AlexWebr Yeap that's why I suggested it. Op wants to learn, choosing the easy way is almost never the best way to learn. — yannis, Nov 25 '11 at 06:47
@YannisRizos that's not an argument. One could argue that reading code is much easier than writing custom code yourself. It might depend on people. I know that I can rarely learn only by reading. I have to do. If, for example, I begin to understand the code sources you provided, the first thing I am going to do is close the source, open a blank project and try to build again in another language, only going back to the source if I really get stuck. That's the only way what I learn sticks. — Xananax, Nov 25 '11 at 06:54
@Xananax I'm not interested in debating learning methodologies. Good luck building your database. — yannis, Nov 25 '11 at 07:01
@YannisRizos Agreed. This was not my intention either. Sorry if I offended you and thanks for the severe but useful help & advice. — Xananax, Nov 25 '11 at 07:05
@Xananax Didn't offend me, the "Good luck building your database" was sincere. I'm actually hoping you get there... — yannis, Nov 25 '11 at 07:08
@Xananax: don't listen to the frogs (http://www.crystal-reflections.com/stories/story_73.htm). Do whatever you enjoy and it is not necessary to have an objective to take pleasure in the process. — , Nov 25 '11 at 09:15
I know where you're coming from @Xananax, because I wanted to write my own RDBMS too, purely for my own understanding, and the suggestion to study the source of SQLite, Postgres or MySQL, for example, is just far to overwhelming when you don't even know the basics. In my case, I wanted to build something that uses Tutorial D as the query language, instead of SQL. I started picking SQLite apart, but I need more of a basic start first. FWIW, the documentation for how SQLite's internals work is *excellent*. — d11wtq, Nov 25 '11 at 11:35
@Yannis Rizos: I wrote my own compiler as a Uni project and it was one of the best lessons in my IT career. I guess writing own RDBMS is equally educating. But of course - if you need to ship the product in some finite time then use standard solution. — MaR, Nov 25 '11 at 12:59
@MaR Me too :) Read the last paragraph of [my answer](http://programmers.stackexchange.com/q/121706/25936) — yannis, Nov 25 '11 at 13:00
Very old post (and answer is set) - however if somebody else comes up with the same idea - here is the approach to 'understand databases by doing' without even coding: — Quicker, Sep 10 '18 at 13:18
1. make sure you REALLY UNDERSTOOD the concepts of ACID (see wikipedia) - that implies you'd be able to explain every sentence on that wikipage 2. think through how to read and/or write tiny portions of a huge files by multiple users/sessions and how to get that ACID compliant a break down the processes of reading and of writing data portions to/from the file by defining the kind of processors, components and flows between them b in your thinking process run reads and writes at the same time -> if both are ACID compliant, congrats, you are there, else refine what you defined under a — Quicker, Sep 10 '18 at 13:19
If you got ACID compliance without understanding the terms below you are a perfect candidate for any world-xyz-expert-price: - transactions (in the context of databases) - log files (in the context of databases) - locks (in the context of databases) - memory caches - memory pages (in the context of databases) - OS services or deamons - queues and pipes - seek latency and I/O throughput — Quicker, Sep 10 '18 at 13:19
Also to shortcut your way on what makes a database complex: - ACID demands immediate persistence on writes - immediate persistence implies disk access, multi redundant RAM access or flash access - disk is cheap; write disk seek latency is creepy; once seeked, dumping bytes in bulks is relatively fast -> that is what log file concepts build on - understand the flow picture at https://dba.stackexchange.com/questions/142224/in-mysql-how-exactly-does-data-flows-from-query-to-disk - ACID demands transactions to not interfer with each other — Quicker, Sep 10 '18 at 13:19
If you can ignore A and C and I and D in your 'database' use case do not do database but store your information in plain files! — Quicker, Sep 10 '18 at 13:19
I've recently been trying to do the same thing, and have stumbled across this fantastic tutorial: https://cstack.github.io/db_tutorial/ -- hopefully this helps someone — Toastrackenigma, Jul 06 '20 at 14:06

score 82 · Accepted Answer · edited Nov 03 '20 at 05:36

(it's "node-oriented", if that even exists)

Start here. When dealing with a complex application like a database (even a simple database is a complex application), you should be familiar with the history of the domain and the proper terminology and have at least a very high level idea of the architecture. You could start from the Wikipedia article on Database. Spent a few days reading all the articles on the related concepts and the different database types.

And since what I have in mind is neither relational nor document-oriented

Next, you pick Relational or NoSQl. If you pick NoSQL, you should pick one type of NoSQL. That's extremely important, you won't find any architectural documents that discuss all different database families. It doesn't really matter which one you pick, just pick one and stick with it.

The language does not matter in the slightest.

Yes it does (unfortunately), because after you pick a database family you should start exploring code from open source databases of that family. There are a few generic guidelines on what to look for:

Relatively small codebase,
Architectural documents or at least a development blog,
The database you pick should be close to what's considered generic in the family, it'd be harder to learn from if it's highly specialised.

A few examples that fit:

Get the source, compile it and play around with it. You don't have to submit patches or anything that fancy, just explore the code and make small alterations here and there to see what happens. It's an incremental process, the more you play around with it the easier it'll be to understand what the code does. If the first project you picked seems extremely hard to understand, just move on to the next one.

Another great option would be to concentrate on building an engine for MySQL, as @N.B. suggests in an earlier answer.

If you do reach a point where you are able to do something useful with the codebase, get involved in the project's community, that's the easiest way to find more detailed resources on the concepts involved.

And then, finally, start working on your database. At first you could just write up an extremely scaled down clone of the code you've been exploring. It doesn't have to be original, quite a few great projects started out as clones or forks.

What resources/tutorials/books can I read to understand?

There are quite a few books:

And a few hundred others, plus a myriad of academic papers you could easily trace via Google. You need to define what you want to do first, and then search for a book. Getting involved with a community of fellow database authors will also help you narrow down the list of books and perhaps get a lot better suggestions than the above.

Good luck! I'm expecting a comment with a link to your repository when you're done. And if you're never done, make sure you leave a comment reminding me that I still haven't finished that compiler I started writing in 2001.

This is super! Even more coming from you :) I'd like to accept almost each other answer but since I gotta pick one this has to be it. `I'm expecting a comment with a link to your repository when you're done`: most definitely! Thanks again, to you and everyone else, this was really uplifting. — Xananax, Nov 25 '11 at 19:49
And for anyone coming here looking for the same answers: I find flockDB to be the best candidate to learn, the codebase is really small, the code very readable (although I don't speak scala) and easy to understand. — Xananax, Nov 25 '11 at 19:55
@Yannis, Btw which of the books you recommend are the ones that you have read? — Pacerier, Nov 02 '14 at 14:31
@Xananax Sooo how is it coming along ? Any repository we can look at ? :) — Radu Murzea, Sep 29 '16 at 11:19
@RaduMurzea Woah this is so long ago. Yes, I did create my own database, several times too, crappy crappy stuff, but working ok enough for me to grok the general principles. I don't have a repo though and don't intend to release any code as it wouldn't be useful to anybody. It also wouldn't reflect my current expertise. — Xananax, Nov 23 '16 at 10:32

score 33 · Answer 2 · 2011-11-25T11:57:33.513

33

You should just do it and stop thinking too much. Enjoying the learning process and enthousiasm are gifts.

Asking others if it's a good idea is certainly not a good strategy. If I had listened to all the frogs, I would still work at Ikea today pushing shopping cart from the parking to the depot.

You don't have to justify yourself like Ayende did in that interesting post. The question was:

However as a pragmatic developer, I am wondering what new this project is offering in a saturated market where you have quite mature alternatives like CouchDB, MongoDB, Tokyo, Redis, and many more ? Many of these products are also cross platform and run at C speed with a proven record, being used in very big web sites where their sharding capabilities and fault tolerance have been pushed far.

If you take your pleasure in the process, don't be worry about the objective, you already won.

edited Nov 25 '11 at 11:57

answered Nov 25 '11 at 09:21

6

+1 , very thoughtful and teaching reply :-) ... really nice answer to someone who wants to do somehting – Pankaj Upadhyay Nov 25 '11 at 11:51
Very nice indeed. I was on the verge of accepting this. I didn't because I thought yannis's answer to be more to the point, and more likely to help people having the same question. But this was definitely encouraging. Thanks a lot. – Xananax Nov 25 '11 at 19:51
Yannis answer is better than mine and deserve your choice – Nov 25 '11 at 21:13

score 7 · Answer 3 · answered Nov 25 '11 at 10:25

"(it's "node-oriented", if that even exists)". - This may be why you are not finding much!

Dive in with version 0.1 and see where you get. You may learn more from trying to produce what you want that from asking what you "should" do. Give it a few days and then review where you've got.

About 18 years ago I wrote a basic database system (for fun, go figure) with btree indexes and learned an awful lot.

score 5 · Answer 4 · answered Nov 26 '11 at 01:07

5

Sounds like a great project. Apparently your goal is not to create a production software, but to learn about databases and the process of creating a database system.

I don't really think you need to do a lot of research. It seems like the purpose is to get the experience of what goes in to creating a node-based database system.

Here's how I would get started:

Choose your favorite language or a language that you want to improve.
Create the node object (or whatever is closest in your language). Figure out how to link them.
Make a short list of SQL statements that you will implement first.
Decide how to save the data. One obvious solution is to serialize all the nodes, load them when the program starts, and save them when the program ends.

After you get the basics working, you will have much more insight into what is difficult or problematic. Then you can do some research about it, find some improvements, and integrate them.

answered Nov 26 '11 at 01:07

B Seven

3,105
3
16
14

1

I am following the steps you suggested. 1,2,3 are no problem. However, I am stumped by #4. In the case of large amounts of data, how would I load in memory only the relevant parts? I thought of storing everything as binary data and keep another index file, but what if the user is not querying by index? I still would have to loop through the whole file...I guess it's time for another question – Xananax Nov 27 '11 at 01:48
Why not just store everything in memory? If you dedicate 1 GB, that would hold a lot of data. Alternately, store each binary data node on disk, and then you would have 1 GB for text. That is a lot of text. Anyway, I think handling a database larger than 1GB is not central to the exercise. – B Seven Nov 27 '11 at 02:11
Also, you can create many methods to efficiently work within 1 GB of memory. You can improve that part of the system later. One solution is to load all the node information into memory, but store the data of each node on disk. That way you can efficiently traverse the nodes, and only access the disk to retrieve the data you really need. – B Seven Nov 27 '11 at 02:11

score 4 · Answer 5 · answered Nov 25 '11 at 09:41

4

MySQL has pluggable storage engine structure, it might be an idea to check out how engines are created to work for MySQL.

answered Nov 25 '11 at 09:41

Mjh

218
1
7

Link is broken. – Joseph Garvin Nov 28 '21 at 01:18
@JosephGarvin then you can google the updated one and suggest an edit. – Mjh Nov 15 '22 at 13:58
Or I could just downvote your answer. That's the time I have to contribute today. FWIW, I did google, and there is not an obvious direct hit in the first page of results. – Joseph Garvin Nov 15 '22 at 22:44

score 4 · Answer 6 · answered Nov 25 '11 at 11:52

Writing your own database “so as to understand how it all works” is the only good reason to do so (since databases are crazy hard to get right, and difficult to prove correct). You're crazy, but in a good way!

In order to see how it's done, I suggest looking at SQLite. The SQLite source is only about 1.3MB compressed, and it's a fully ACID-compliant transactional database. It's also public domain and the main author's a nice guy who I'm sure will be happy to answer your questions. (I think the trickiest bits come in how to truly commit information to disk; persuading OSes and hardware to stop lying and really write the transaction NOW is surprisingly difficult and why I'm so glad I never have to write a DB.)

The SQLite codebase approaches 0,2*10^6 loc (5mb decompressed), it's not _that_ small. — yannis, Nov 25 '11 at 12:14
@Yannis: Compare with MySQL or PostgreSQL or any of the other server DB systems… — Donal Fellows, Nov 26 '11 at 14:51

score 3 · Answer 7 · answered Nov 25 '11 at 12:16

Buy a book:

http://www.springer.com/computer/database+management+%26+information+retrieval/book/978-1-84628-394-9

Indexes are imho the most important aspect of databases today. Study the alternatives that exists like binary trees.

Also read about cartesian product which is a way to calculate how complex joins will be.

score 3 · Answer 8 · answered Nov 25 '11 at 15:40

3

I need to learn how databases work

Learn relational algebra.

Find a small DB engine, learn the source.

in order to use them more efficiently

Nope. You need to learn how to efficiently use a database. You might be a better driver if you understand how your car works, but you'll be a much better driver if you really focus on driving.

Take the traditional routes: take a course, read a book, peer review, ask questions, use the index luke.

answered Nov 25 '11 at 15:40

Incognito

3,458
2
25
38

6

`Learn relational algebra.` Assuming op is interested in relational databases of course... – yannis Nov 25 '11 at 16:06

Create my own database system

8 Answers8