24

Our project is about 11 GB, 10 GB of which are binary data (.png images). Consequently, git diff and git status operations take more than a minute. Fortunately, all data files are separated into a folder with the wonderful name data. The assignment is: "Avoid compressing, diffing and other costly operations on binary files."

  • We considered splitting the project into two repos: data would become an external repo that is checked out by the main source-code repo. We decided that the overhead of keeping the repos in sync would be too much, especially for the artists, who work with the data files.

  • We also considered explicitly telling git those files are binary and excluding them from diffs, but those seem like only partial solutions to the question.

I feel that gitattributes is the solution, but how? Or is there a better architecture than a monolithic repo?
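Since the question mentions gitattributes: a minimal `.gitattributes` sketch (the `data/` path is taken from the question) that tells git to treat the assets as opaque binaries. Note that this stops git from attempting text diffs and merges on those files, but by itself it does not shrink the repository or speed up `git status`, which still has to stat every file:

```
# .gitattributes at the repo root
# "binary" is a built-in macro attribute equivalent to: -diff -merge -text
data/** binary
```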

Vorac
    The first big question here is how important are those data files. Does your program *need* all of those images available in order to do anything useful, or can it get away with a small subset during typical development/testing? – Ixrec Mar 11 '16 at 17:02
  • @Ixrec, the images are actually more important than the source code. All of them must be present, and .png checksums are checked always for corrupt files. – Vorac Mar 14 '16 at 09:56
    Why isn't this question on Stack Overflow? The Q seems exactly suited to it. – spirc Mar 17 '16 at 00:41
    @spirc this question straddles the line between "help with a software tool" which is on-topic at SO, and "version control strategy" which is on-topic here. Since it is not asking for what git command to execute to do something, it is not clearly on the SO side of the line so I voted to leave it open here. –  Mar 18 '16 at 21:49
  • @Snowman thanks for the response. Which item of the on-topic list does that fit into? http://programmers.stackexchange.com/help/on-topic – spirc Mar 21 '16 at 00:06
    @spirc Questions on source control fall under the topic of "software configuration management." As long as a source control question is not also about "how to use specific tools" it should be on-topic. If you reread my previous comment I mention that it is not asking about which git command to execute so it is not about software tools: it appears to fall under the SCM category. –  Mar 21 '16 at 04:41

3 Answers

18

You can use git-lfs or similar tools (git-fat, git-annex, etc.). These tools basically replace the binary files in your repo with small text files containing hashes, and store the actual binary data in a non-git way, like a network share.

This makes diffs and everything else super fast, as only hashes get compared, and it is, at least for git-lfs, transparent to the user (after a one-time install).
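A rough sketch of the pointer-file mechanism these tools share (illustrative only; git-lfs's real pointer format and storage protocol differ, and the `lfs-store` directory and file names here are made up):

```python
import hashlib
from pathlib import Path

STORE = Path("lfs-store")  # stand-in for the external blob store


def clean(binary_path: Path) -> str:
    """Replace a binary file's content with a small text pointer;
    store the real bytes in a content-addressed side store."""
    data = binary_path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    (STORE / digest).write_bytes(data)  # "upload" to the blob store
    return f"oid sha256:{digest}\nsize {len(data)}\n"  # what git tracks


def smudge(pointer_text: str) -> bytes:
    """Resolve a pointer back to the original bytes on checkout."""
    digest = pointer_text.splitlines()[0].split(":", 1)[1]
    return (STORE / digest).read_bytes()


# Demo: a fake 1 MB "image" round-trips through the store, while git
# would only ever diff the tiny text pointer instead of the megabyte.
img = Path("big.png")
img.write_bytes(b"\x89PNG" + b"\x00" * 1_000_000)
pointer = clean(img)
assert smudge(pointer) == img.read_bytes()
print(len(pointer), "bytes tracked instead of", img.stat().st_size)
```

Because git only ever sees the pointer, status and diff touch a few dozen bytes per asset instead of megabytes.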

As far as I know, git-lfs is supported by GitHub, GitLab, and Visual Studio, and it is open source.

kat0r
    Have you tried using `git-lfs` on a project with many gigabytes of assets with a mixed developer/artist team? I'm interested to know if people are using git-lfs for projects such as games and animation, since it's still fairly new at the time of writing. From my own experience the barrier of entry to git for less technical users is *already* very high, so having an extra layer for file management on top of it may be difficult for people to use unless they're already comfortable with git. – ideasman42 Mar 15 '16 at 03:19
  • Only for up to around ~1GB of data, sorry. But git-lfs should add no additional steps for end users; it should be completely transparent. – kat0r Mar 15 '16 at 08:40
  • This seems to be the correct answer, if some problems arise during the integration I will report back here. So the installation procedure needs to be completed only once on the server, and not on each client machine? – Vorac Mar 15 '16 at 12:06
  • Afaik you need to install a small client addin, too, check the github page. But that should be easy to roll out with a group policy/simpler than any alternative. – kat0r Mar 15 '16 at 12:12
2

Use both Git & SVN repos

If the binary files can be separated logically from the source, you might consider using git for text files and a non-DVCS such as Subversion for the binary files.
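As a concrete sketch of that layout (the URL and paths are placeholders, not from the answer): keep the Subversion working copy inside the git worktree and tell git to ignore it entirely, so git never scans the assets:

```
# one-time setup per machine (placeholder URL)
svn checkout https://svn.example.com/assets/trunk data

# keep git from tracking or diffing the svn-managed assets
echo "data/" >> .gitignore
```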

A project I work on does this, since we have many GB of pre-compiled libraries (OSX/Win32 dependencies) which we need to keep versioned.


On the other hand, if you have non-technical users, using two version control systems may be problematic. However, if the artists aren't working on code, you could provide a script to perform the update, and they can use Subversion to commit binary assets.
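Such an update script for the artists could be as small as this sketch (the `~/project` layout is a hypothetical example); it writes a one-command wrapper they can run without knowing either tool:

```shell
# Generate a tiny update wrapper for non-technical users.
cat > update_project.sh <<'EOF'
#!/bin/sh
set -e
# pull the latest source code (fast-forward only, no merge prompts)
git -C "$HOME/project" pull --ff-only
# update the binary assets kept in Subversion
svn update "$HOME/project/data"
EOF
chmod +x update_project.sh
```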

Use SVN (with git svn)

While this trade-off isn't always so nice for developers who are used to using regular git, you could use SVN for the main repository, and developers can use git svn tools.

This does mean a little more work for developers using git, but everyone who isn't familiar with DVCS (or VCS in general) can use SVN's simple model without having to juggle multiple complex version control systems.
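For the developers, the day-to-day git svn round trip looks like this (the URL is a placeholder):

```
# one-time: clone the SVN repo, mapping trunk/branches/tags (--stdlayout)
git svn clone --stdlayout https://svn.example.com/project

# fetch new SVN revisions and replay local commits on top of them
git svn rebase

# push each local git commit back upstream as an SVN revision
git svn dcommit
```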


git-lfs is an option too, but I haven't used it, so I can't speak to how well it works.

ideasman42
0

Specifically for PNG you are lucky, as you can use png-inflate as a git filter, which "uncompresses" PNGs when checking them into the repo. Since git's own delta compression works much better on uncompressed data, this might reduce the repo size by quite a bit. Every user has to install the filter separately, but you could provide a script for that, and they only have to run it once.
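Wiring a clean/smudge filter up looks roughly like this; the filter name and the exact png-inflate invocation below are assumptions, so check the tool's own docs. Since an uncompressed PNG is still a valid PNG, the smudge side can simply pass the file through:

```
# .gitattributes -- route PNGs through the filter
*.png filter=png-inflate
```

```
# per-clone, per-user setup (hypothetical invocation)
git config filter.png-inflate.clean  "png-inflate"
git config filter.png-inflate.smudge cat
```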

I'd recommend applying it to the whole history of your project, and then checking whether the repo is any smaller, whether the diffs are faster, and so on.


Side note: the same thing can be done for ZIP-based files (which include MS Word, LibreOffice and FreeCAD formats, amongst others) with ReZipDoc.

hoijui