Do sync programs like Dropbox typically track file changes by doing byte by byte comparisons, or using hashes, or using diff
/ keeping local commit logs like version control, or what?

- 191
- 6
3 Answers
On Windows there's a mechanism to have the OS alert you when there's a change to a 'watched' directory structure - FindFirstChangeNotification(). When that indicates a file has changed, an application can then go about comparing files in the changed directory to find the actual files that have changed by looking at size, modified date, hash, etc.
This (as Michael points out below) is something that each platform would provide in some manner. I wasnt saying this sort of thing was unique to Windows.

- 37,990
- 7
- 78
- 131
-
1And there's [inotify](https://en.wikipedia.org/wiki/Inotify) on Linux. – Michael Oct 02 '12 at 20:22
-
But what if the sync program wasn't running at the time of the file change? Wouldn't it miss the opportunity of being notified by the hook? – mcandre Oct 02 '12 at 20:46
-
You used the specific example of dropbox, which generally is running in the background. Obviously if the program isnt running, it wont get notified. Then it has to use other methods (modified date, perhaps). I dont think you're going to get a concrete answer about what the specific methods used ARE, because different programs do things differently. Better to ask the creators of the specific programs in question. – GrandmasterB Oct 02 '12 at 21:03
-
@mcandre Pretty much what GrandmasterB said - if you missed the notification, you'd have to scan the folder. Depending on how "accurate" you want to be, this can mean simply looking for new files and modified timestamps / file sizes (these are relatively inexpensive to do), or in the worst case, comparing the entire file. Programs like rsync typically hash the file in chunks, so changes early on in the data can be detected earlier, but in the worst case (files are identical), you're going to read the whole thing in. – Daniel B Oct 03 '12 at 06:57
Ultimately to compare files you need to compare every byte - how else would you notice a single byte change?
In reality you read blocks of bytes and compute a hash value, you then check against a list of hashes. A good example is "rsync"
As far as I know dropbox only dedupes entire files, so will compute a hash of the entire file to check fro the same file

- 15,776
- 3
- 42
- 69
-
1
-
1Wouldn't hashing create a small but real risk of collisions, resulting in a file not being synced? Dropbox apparently uses a diff-like implementation. https://www.dropbox.com/help/8/en – mcandre Oct 02 '12 at 20:47
-
1
-
@ratchetfreak: date modified, on some systems, isn't necessarily reliable for this kind of problem. A simple touch would cause the date modified to be different, where a sync may not actually be required. – Steven Evers Oct 02 '12 at 22:28
-
1@SnOrfus then double check the changes when the date modified gets changed – ratchet freak Oct 02 '12 at 22:33
-
@mcandre - yes, but if you used a sufficently large hash, and/or combine with the file size the risk is low. Hashes are a pretty intensively studied field. – Martin Beckett Oct 02 '12 at 22:41
-
@ratchetfreak: Certainly. I assume that minimizing network traffic is priority 1 for a sync service, so I wouldn't use date modified as the sole criteria for syncing. That's all I was saying. – Steven Evers Oct 02 '12 at 22:44
-
@MartinBeckett, I recognize that the risk is low. It's called a corner case, and some rsync user out there will be unable to sync his file change. – mcandre Oct 02 '12 at 23:53
-
@mcandre, calculate the probability, and weigh it against your risk. For example, the chance of collisions in MD5 is about 2^-128; so if a user has a trillion files (~2^40) and backs up every file every second, then in only 2^88 seconds (i.e. 10^19 years) there is likely to be a collision. I'm willing to take that risk! – Alex Feinman Oct 03 '12 at 14:25
-
@AlexFeinman, MD4 and MD5 collisions have been found. I don't know which particular hash algorithms rsync uses now, but collisions aren't just theoretical probabilities, they're very real. – mcandre Oct 03 '12 at 22:25
.NET for instance has a FileSystemWatcher class. I'm sure other low-level languages and runtimes can provide similar capabilities.

- 546
- 5
- 6