65

I am trying to improve the backup situation for my application. I have a Django application and a MySQL database. I read an article suggesting backing up the database in Git.

On the one hand I like it, as it will keep a copy of the data and the code in sync.

But Git is designed for code, not for data. As such it will be doing a lot of extra work diffing the MySQL dump on every commit, which is not really necessary. If I compress the file before storing it, will Git still try to diff the files?

(The dump file is currently 100MB uncompressed, 5.7MB when bzipped.)
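
Concretely, the workflow I have in mind is something like this (database and path names are just placeholders):

```bash
# Sketch of the workflow under consideration; names are placeholders.
mysqldump --single-transaction myapp_db > backup/dump.sql
bzip2 --force backup/dump.sql               # replaces it with backup/dump.sql.bz2
git add backup/dump.sql.bz2
git commit -m "Database backup $(date +%F)"
```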

Edit: the code and database schema definitions are already in Git; it is really the data I am concerned about backing up now.

Léo Lam
wobbily_col
  • If your company has an IT (ops) department, they should be handling this. – Michael Hampton May 26 '14 at 14:09
  • Is the data part of the application, or is it what is created through the application? – Winston Ewert May 26 '14 at 18:25
  • Git will attempt to diff all files when you run `git gc` (or its underlying `git repack`; git will, by configurable default, occasionally run it automatically). It will also _always deflate them_, so it might actually be better to store them uncompressed. – Jan Hudec May 26 '14 at 21:46
  • What kind of database is it: a production or a development database? – el.pescado - нет войне May 27 '14 at 08:47
  • http://viget.com/extend/backup-your-database-in-git (he is a "senior developer"). – wobbily_col May 27 '14 at 13:18
  • Well, those guys apparently are all Dev and no Ops. Pushing your live database to GitHub? (Hope it's a private repo.) Now taking bets on the exact date this blows up in their faces... – Michael Hampton May 27 '14 at 21:34
  • @Michael Git != GitHub – wobbily_col Jun 17 '14 at 11:01
  • @wobbily_col While size is not really an issue with Git, I wonder how the viget developer handles restoring the dump -- in my experience it's really much slower once you use `--skip-extended-insert` (and without it, it's hard to keep track of changes, and the stored size will start to grow much faster). – michel-slm Sep 09 '14 at 09:11

5 Answers

107

Before you lose any data, let me try to introduce a sysadmin perspective to this question.

There is only one reason we create backups: to make it possible to restore when something goes wrong, as it invariably will. As such, a proper backup system has requirements that go far beyond what git can reasonably handle.

Here are some of the issues I can foresee with trying to back up your database in git:

  • The repository will grow dramatically with every "backup". Since git stores entire objects (albeit compressed) and only delta-compresses them later (e.g. when you run git gc), and keeps history forever, you will have a very large amount of data stored that you don't actually need or even want (the sketch after this list shows one way to watch that growth). You might need to limit the amount or retention period of backups to save disk space or for legal reasons, but it's difficult to remove old revisions from a git repo without a lot of collateral damage.
  • Restoring is limited to points in time that you have stored in the repository, and since the data is so large, going back more than a trivial amount of time may be slow. A backup system designed for the purpose limits the amount of data stored while potentially providing more granularity, and provides faster restores, reducing downtime in the event of a disaster. Database-aware backup solutions (example) can also provide continuous backup, ensuring that not a single transaction is lost.
  • Commits are likely to be slow as well, and to get slower as the database grows. Remember that git is essentially a key-value data store mapped onto a filesystem, and is thus subject to the performance characteristics of the underlying filesystem. It is possible for this length of time to eventually exceed the backup interval, at which point you can no longer meet your SLA. Proper backup systems also take longer to back up as the data grows, but not nearly so dramatically, since they automatically manage their own size based on the retention policy you have configured.
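
To put a number on the first point, here is a minimal sketch (the repository path and database name are assumptions, not part of the original setup) that commits one dump and then reports what the accumulated history actually costs on disk:

```bash
#!/bin/sh
# Minimal sketch: one "backup" commit, then measure the repository.
# The repository path and database name are assumptions.
cd /var/backups/db-repo || exit 1
mysqldump --single-transaction myapp > dump.sql
git add dump.sql
git commit --quiet -m "backup $(date +%F)"
git gc --quiet          # repack so the figure below reflects real disk usage
git count-objects -vH   # 'size-pack' is what the accumulated history costs
```

Run it daily for a while and watch `size-pack` climb; unlike a rotating backup scheme, nothing here ever shrinks.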

Despite the fact that there are apparently several interesting things you can do with a database dump if you put it into git, overall I can't recommend it for the purpose of keeping backups, especially since backup systems are widely available (many are even open source) and work much better at keeping your data safe and making it possible to recover as quickly as possible.

Michael Hampton
  • This is the best answer, as Michael has covered consistency issues. Depending on the size and usage of the database, a snapshot can't reliably reproduce the data at a given point in time, and you're likely to run into constraint issues. Replication may be something you want to look into: http://dev.mysql.com/doc/refman/5.0/en/replication.html – Aaron Newton May 27 '14 at 05:16
  • This isn't just the best answer, it's the only answer. As a general rule, you're a developer, so backups aren't your business; somebody else is (or should be) already looking after them, and if you start getting involved you may be interfering with a system that already works. These boxes should *already* be being backed up, so then you'll have a backup, your own backup, and a backup of your own backup, all with ever-increasing size. That's just nuts. Plus: you're a developer; why are you (probably) going near production boxes anyway? – Maximus Minimus May 27 '14 at 12:50
  • @JimmyShelter There's a school of thought that DevOps means not that Dev and Ops work closely together, but that Dev actually _does_ Ops. It usually doesn't work well, but that doesn't stop people trying it. – Michael Hampton May 27 '14 at 12:55
  • This should be the accepted answer. It clearly explains the requirements for and purpose of a backup system, then shows how git does not fit. Extra bonus points for the discussion of consistency and performance. – Gabriel Bauman May 27 '14 at 18:55
  • Let me remark that I posted my answer assuming that the OP does not have an Operations team that can handle this issue for him. I agree with you that this kind of task is best left to those who are actually operating the system and know their way around it. But there are situations where you have to put on a hat that is not exactly yours, and I believe that in such a situation it is better to try to learn some best practices than to come up with your own contrived solution. I have to say I have also found your answer very instructive! – logc May 28 '14 at 13:06
  • I believe the output from mysqldump can be plain ASCII, which should work well with git if not too many changes happen. Note that large files may require a beefier machine. – Thorbjørn Ravn Andersen Apr 13 '23 at 14:53
39

My two cents: I do not think it is a good idea. GIT does something like "storing snapshots of a set of files at different points in time", so you could certainly use GIT for something like that, but that doesn't mean you should. GIT is designed to store source code, so you would be missing most of its functionality, and you would be trading a lot of performance for just a little bit of convenience.

Let me assume that the main reason you are thinking about this is to "keep a copy of the data and the code in sync", and that this means you are worried that version 2.0 of your code needs a different database schema than version 1.0. A simpler solution would be to store the database schema, as a set of SQL scripts with CREATE statements, alongside the source code in your Git repository. Then part of your installation procedure would be to execute those scripts on a previously installed database server.
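
A minimal sketch of that installation step, assuming the scripts live in a `schema/` directory and sort correctly by file name (all names here are assumptions):

```bash
#!/bin/sh
# Hypothetical install step: create an empty database, then replay the
# versioned CREATE scripts in order. Names and paths are assumptions.
mysqladmin create myapp
for script in schema/*.sql; do
    mysql myapp < "$script"
done
```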

The actual contents of those freshly created tables have nothing to do with the version of your source code. Imagine you install your software, version 1.0, on server A and on server B, which are used by different teams in different companies. After some weeks, the contents of the tables will be very different, even though the schemas are exactly the same.

Since you want to back up the contents of the database, I would suggest using a backup script that tags each backup dump with the current version of the software it belongs to. The script should live in the GIT repository (so that it has access to the source code version string), but the dumps themselves do not belong in a version control system.
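
Such a script could be as small as this sketch (paths and database name are assumptions; `git describe` supplies the version string from the repository the script lives in):

```bash
#!/bin/sh
# Hypothetical backup script, kept in the repository so it can read the
# source code version. Paths and database name are assumptions.
VERSION=$(git describe --tags --always)
STAMP=$(date +%Y%m%d-%H%M%S)
mysqldump --single-transaction myapp | bzip2 > "/var/backups/myapp-${VERSION}-${STAMP}.sql.bz2"
```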

EDIT:

After reading the original post that motivated the question, I find this an even more dubious idea. Its key point is that the mysqldump command transforms the current state of a DB into a series of SQL INSERT statements, and that GIT can diff them to get only the updated table rows.

The mysqldump part is sound, since it is one of the backup methods listed in MySQL's documentation. The GIT part is where the author fails to notice that database servers, including MySQL, keep a transaction log in order to recover from crashes. It is from this log, not with GIT, that you should create incremental backups for your database. This has, first and foremost, the advantage that you can rotate or flush the logs after recovery, instead of bloating a GIT repository to infinity and beyond ...
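
For MySQL, that log is the binary log. A rough sketch of this route (it assumes binary logging is enabled with `log-bin` in `my.cnf`; the file names are made up):

```bash
# Full backup that also rotates to a fresh binary log:
mysqldump --single-transaction --flush-logs --master-data=2 myapp > full.sql
# Recovery: restore the dump, then roll forward through the binary logs
# written since the dump was taken:
mysql myapp < full.sql
mysqlbinlog /var/log/mysql/mysql-bin.000042 | mysql myapp
```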

logc
  • I am not sure I see any point in storing the database schema without the data in version control. The data is the most important thing, and that is what I want to back up. I like the idea of tagging the database backup with the current software version, however. I'll try to implement something like that. – wobbily_col May 26 '14 at 09:36
  • The point of storing the schema without the data is that, right after installation, your software should be "ready to be used". If it is a wiki, then it should be ready to start creating wiki pages and writing something into them. If you install the schema *and* the contents, then your wiki is already filled with X wiki pages after installation... That is not exactly "installing a wiki system to write our content", but "copying a wiki from somewhere to read it". – logc May 26 '14 at 09:40
  • OK, I see what you mean, but that doesn't really apply in my situation. It is an in-house app, and without the data it won't be much use. – wobbily_col May 26 '14 at 09:43
  • It may be a good idea to modify your question with the actual situation you are in. Even if you can't post all the details, it would be important to state whether you need a lot of the data to appear unmodified in each installation, or whether there is a single installation ... – logc May 26 '14 at 09:51
  • @wobbily_col A non-text, binary-based format has limited value in the context of source control. You can't *diff* it, you can't *branch* / *merge* it, etc. So while you certainly CAN use git to store the DB, most people prefer to script the DB structure as well as the necessary data. It's a compromise: a bit more work, but it provides the features above. You'll have to weigh up whether or not this is a good idea for your solution. Otherwise, you can probably get GIT to store the DB directly; it's just not exactly the best fit for the task. – Daniel B May 26 '14 at 11:16
  • By "can't diff it", I mean "meaningfully". – Daniel B May 26 '14 at 11:24
  • What about a very small database dump (< 1 MB)? – Radu Murzea May 26 '14 at 12:35
  • @RaduMurzea: I think this is a question of principle. A version control system is designed to manage source code, not binaries; that's all. It's not a question of size. No, database dumps should not be checked into the repository, just like training videos should not be checked in either. But no one is stopping you from doing so. :) – logc May 26 '14 at 12:38
  • @wobbily_col Even with an in-house LOB application, it can be useful to see how the database schema has changed over time. Having the schema available also helps in emergency situations, and makes it easier to create development environments (without data), though you might not need this. I would absolutely keep this in revision control along with the source code. – Michael Hampton Jul 26 '14 at 16:49
  • I can smell the database admin in this answer. If everything you know is a hammer and you have been using it for 50 years, you see everything as a nail. I still think it's OK to use git (and it does not store full objects; only for a short time before they are pushed into the packfile), though I wouldn't use it as the main backup strategy. Storing the last full backup in git and then doing incrementals in your DB backup solution seems nice. A git repo *is* the software company nowadays, especially for small companies. – Lothar Apr 12 '23 at 18:52
  • @Lothar: I do not enjoy reading that my answer "smells" of anything. It is an "ad hominem" attack: you are not addressing my argumentation but me as a person. To clarify, I have never been a DBA; I am a data engineer and do work with current tech. But even if I were a grumpy old man in a cave with a hammer, I could be making valid points. Please address them, not me. Also: sending incremental backups? Sounds like `rsync`. If everything you know is `git`, using it for 5 or for 50 years, as a hammer or as a database backup solution, is not really nailing it. – logc Apr 16 '23 at 18:10
7

Personally, I don't think it's a good idea to use a source control system to store backup files. GIT version control is designed for source files, not for binaries or dump files like a MySQL backup dump; the fact that you can do it doesn't automatically mean that you should. Moreover, with a new database backup in every commit, your repository will grow dramatically, using a lot of disk space, and GIT's performance will suffer, resulting in a slow source control system. For me it's fine to have a backup strategy and to always have a backup file ready for restoring the database when something in your code goes wrong, but source control tools aren't made to store binary data.

For these reasons, I don't see any utility in storing the backup file for day 1 and the backup file for day 2 and then inspecting the differences between them; it would require a lot of extra, useless work. Instead of using GIT to store database backups when you commit new code, store the database backups in a separate path, organized by date and time, and keep a reference in your code to the database backups created for each version, using tags, as someone already suggested. A scheduled job along the lines sketched below would do.
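
A minimal sketch of such a job, assuming cron and made-up paths:

```
# Hypothetical crontab entry: nightly dump into a date-stamped file,
# kept outside version control (cron requires the % signs to be escaped).
0 2 * * * mysqldump --single-transaction myapp | gzip > /var/backups/myapp/backup-$(date +\%Y\%m\%d).sql.gz
```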

My final note about database backups and GIT: a database administrator who needs to restore a database because some data has been lost does not need to check the differences between the day-1 and day-2 backup files; he just needs to know which is the latest backup file that will let him restore the database without errors or data loss, reducing downtime. Indeed, the task of a database administrator is to make the data available for recovery as soon as possible when the system fails for some reason. If you store the database backups in GIT, linked to your commits, you prevent the database administrator from restoring the data quickly, because your backups are limited to the points in time stored in the GIT repository, and from reducing the system's downtime, because the performance of your GIT repository will degrade dramatically with so much data to store.

So, I don't recommend storing the backups using GIT. Use instead a good backup software solution (there are some of them here) that will provide more granularity, keep your data safe and secure, and make your data recovery simple and fast in case of disaster.

Alberto Solano
  • Maybe the downvoter will explain why he/she downvoted. – Alberto Solano May 26 '14 at 09:25
  • Not the downvoter, but I think this approach introduces an ever-present merge conflict, which is not particularly conducive to the branch-often, merge-often workflow that most git users prefer. – Daniel B May 26 '14 at 11:25
  • @DanielB I propose not to use the version control system to store the database backup files. I think the database backup problem could easily be solved without using any version control system. Version control systems (GIT, TFS, SVN and so on) are designed for software, not for dump files or database backups or for just storing data (there are a lot of solutions for that). – Alberto Solano May 26 '14 at 13:20
  • I think most users read the first few sentences and downvote, as it seems you are saying it's OK to use. –  May 26 '14 at 20:28
  • @AlbertoSolano I see; but reading the question ("can I back up my DB in GIT?") and then your first statement ("it's fine to store the backup file..."), it *seems* like you're saying the opposite. The rest of the answer seems to be saying that it's neither here nor there, while I suspect most people think it's a train wreck waiting to happen. – Daniel B May 27 '14 at 06:13
  • @edvinas.me Thanks. I updated the question. I hope my opinion is clear now. – Alberto Solano May 27 '14 at 07:14
  • @DanielB Thanks for your comment. Yes, maybe my opinion was misunderstood. I hope it is clear now. – Alberto Solano May 27 '14 at 07:15
1

You should not store binary data in Git, especially a database.
Code changes and database DML changes are totally different things.

MySQL and Oracle can write archive logs for the purpose of restoring to any point in time. Just back up those logs somewhere safe and you will be okay.

Using Git to back up these "archive logs" does not make sense. Archive logs in production environments are rather heavy and should be removed after making regular full backups. Also, it is useless to put them in git; they are already a repository in some sense.

gnat
Jehy
  • Why wouldn't one use Git to back up these "archive logs" created by MySQL? – gnat May 26 '14 at 14:39
  • Just because it does not make sense. Archive logs in production environments are rather heavy and should be removed after making regular full backups. Also, it is useless to put them in git; they are already a repository in some sense. Michael Hampton gives a pretty good answer on this issue (on this page). – Jehy May 26 '14 at 14:44
  • Why bother rotating logs if you are going to keep a copy of everything in git? Might as well just keep one monster log file. – wobbily_col May 27 '14 at 11:46
0

You SHOULD store a database dump.

In fact, I'd say you may even have to. How else can developers play with the real data without destroying something?

But that is a set of data from some snapshots, not a backup (unless it is your last line of defense). There is also an interesting privacy problem here.

Lothar