29

In light of recent revelations about widespread government monitoring of data stored by online service providers, zero-knowledge services are all the rage now.

A zero-knowledge service is one where all data is stored encrypted with a key that is not stored on the server. Encryption and decryption happens entirely on the client side, and the server never sees either plaintext data or the key. As a result, the service provider is unable to decrypt and provide the data to a third party, even if it wanted to.
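The flow can be sketched in a few lines. Here is a minimal illustration using the Python `cryptography` package's Fernet recipe; the `server_store` dict is a hypothetical stand-in for the provider's storage API:

```python
from cryptography.fernet import Fernet

# The key is generated and kept on the client; the server never sees it.
key = Fernet.generate_key()
f = Fernet(key)

plaintext = b"secret source code"
ciphertext = f.encrypt(plaintext)      # encryption happens client-side

# Stand-in for the provider's storage: it only ever receives ciphertext.
server_store = {"file.txt": ciphertext}

# Later, the client downloads the blob and decrypts it locally.
assert f.decrypt(server_store["file.txt"]) == plaintext
```

The provider can serve `server_store` contents to anyone, including a subpoenaing third party, without being able to reveal the plaintext.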

To give an example: SpiderOak can be viewed as a zero-knowledge version of Dropbox.

As programmers, we rely heavily on, and entrust some of our most sensitive data - our code - to a particular class of online service provider: code hosting providers (like Bitbucket, Assembla, and so on). I am of course talking about private repositories here - the concept of zero-knowledge does not make sense for public repositories.

My questions are:

  1. Are there any technological barriers to creating a zero-knowledge code hosting service? For example, is there something about the network protocols used by popular version control systems like SVN, Mercurial, or Git that would make it difficult (or impossible) to implement a scheme where the data being communicated between the client and the server is encrypted with a key the server does not know?

  2. Are there any zero-knowledge code hosting services in existence today?

HighCommander4
  • 1
    Without [homomorphic encryption](http://en.wikipedia.org/wiki/Homomorphic_encryption), I don't see how a zero-knowledge code hosting site could provide any sort of benefit over a zero-knowledge version of drop-box. I don't believe anyone has yet come up with such a scheme which is both secure (i.e., secure enough that the experts trust it) and fast enough to be usable. – Brian Aug 16 '13 at 10:38
  • 1
    There's a write-up on diff-management (a fundamental part of VCS) up at Security.SE's [Applying file deltas to an encrypted file](http://security.stackexchange.com/q/6639/13146). – apsillers Aug 16 '13 at 14:51
  • @Brian I took a look at SpiderOak's website and they make a claim I don't understand (which seems related to your comment about homomorphic encryption): they claim they do NOT know the plaintext file, but that they can "detect similarities between files" and only store "deltas" (if you resubmit a file with a change, for example). But your link states that there is currently no know way of doing `E(f) + E(d) = E(f')`. Is SpiderOak's claim bogus then? – Andres F. Aug 16 '13 at 16:41
  • @Brian this is the specific claim: `"SpiderOak's historical versions are space efficient. Even though your historical versions are encrypted and only stored on the server, SpiderOak detects the similarity between those historical versions and your new versions - only saving the parts that actually changed."` I don't see how they can do this, if they have no way of inspecting the stored encrypted file... Unless somehow this is all solved on the _client_? – Andres F. Aug 16 '13 at 16:44
  • 2
    @AndresF. I can only assume SpiderOak means that diff-generation occurs on the client, the server stores encrypted diffs, and then diff-to-base application occurs again on the client when the diff and base are encrypted. I agree that their language is very unclear. – apsillers Aug 16 '13 at 16:56
  • @apsillers I'd still need an explanation of how that works. The server cannot do anything with the encrypted diff, so would the client store the relation between the original and the diff? If you delete this tracking info from the client, are you somehow "corrupting" the storage on the server? (The server cannot do anything with the diff, because I don't think it can merge two chunks of encrypted info, even if it's told one is a diff of the other). – Andres F. Aug 16 '13 at 17:12
  • @AndresF. Could they simply do the following: break up your file into chunks of some granularity (say 256 KB) on the client, encrypt each individually (with the same key across all files), and send the encrypted chunks to the server. The server hashes each encrypted chunk, and if it receives a new chunk with the same hash, it just points to the old chunk rather than storing it again. This would effectively give you deltas at the granularity of the chunk size. – HighCommander4 Aug 16 '13 at 18:28
  • @HighCommander4: Block cyphers tend to use a random Initialization vector (or at least, a nonce), causing a given plaintext to generate a different cypher each time it is encrypted. The motivation is to ensure that someone cannot detect if any given pair of blocks in a particular cypher were generated from an identical plaintext. I doubt this issue is always applicable (and you could just weaken your security promises). I would imagine the techniques used to support performant disk-level encryption are similar to those used to support performant hosting of differential uploads. – Brian Aug 16 '13 at 18:42
  • The issue @Brian describes is applicable if you are using a "probabilistic" cipher mode. If you use a "deterministic" block cipher mode like ECB (so that `E(p)` always produces the same `c`), you could detect changes in a block of ciphertext -- that's basically like your idea, but the dividing-into-chunks is already a built-in property of block ciphers. Note that deterministic modes like ECB are really bad, because you can directly compare blocks (e.g., a long run of zeros that spans multiple blocks would produce two identical blocks side-by-side). – apsillers Aug 16 '13 at 19:57
  • @apsillers Is that bad because a snoop could analyze the blocks and see that they have repeated content? (even if they don't know what that content is, though of course they could guess it's zeros, for example) – Andres F. Aug 16 '13 at 21:08
  • 1
    @AndresF. Right, exactly. The classic example is that certain executable examples might have long runs of zeros in particular points in their file layout. By noting repeated blocks at certain points in the file, you might be able to identify that a certain ECB-encrypted file was using a particular format. – apsillers Aug 16 '13 at 22:13
  • 2
    @apsillers: Or you could deliberately stuff such content into a file and use it to identify the file itself (e.g., if someone was trying to use encryption to hide piracy). – Brian Aug 16 '13 at 22:57
  • 4
    It's not something i have any experience in, but i can imagine one possible technological barrier to having a zero-knowledge code hosting service: won't all users need to know/use the exact same key? And if that's the case, what will be the authentication mechanism that ensures different levels of user access? – C.B. Aug 19 '13 at 14:55
  • Questions asking us to recommend a tool, library or favorite off-site resource are off-topic for Programmers as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. – gnat Aug 20 '13 at 21:16
  • 2
    @gnat: I'm not asking for a recommendation. I'm merely asking for whether a service of the sort I described exists. The existence of such a service would provide evidence that the technological barriers that I ask about earlier in the question are overcomable. – HighCommander4 Aug 20 '13 at 22:14
  • @HighCommander4 if the barriers are "overcomable" indeed, nothing would stop answers to your question from becoming an ever-growing list of "evidences", so that "every answer is equally valid", as explicitly prohibited by [help/dont-ask]. If your intent is different from that, consider [edit]ing the question to avoid this – gnat Aug 20 '13 at 22:21
  • @gnat: By that logic... would a question that asks whether C++ is Turing-complete have the same problem, because nothing would stop answers from becoming an ever-growing list of C++ Turing machine implementations? – HighCommander4 Aug 21 '13 at 00:06
  • @HighCommander4 well by that logic, question like this would have this problem too, if (note: _if_) asker [poisons it](http://meta.stackexchange.com/a/184574/165773 "'Recipe to spoil an otherwise great question...'") with additions like "Are there any C++ Turing machine implementations today?" – gnat Aug 21 '13 at 00:14
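The chunk-level deduplication scheme discussed in the comments (and the ECB-style leak it implies) can be simulated. This is a toy, deliberately insecure construction - a keyed XOR stream standing in for a deterministic block cipher like AES-ECB - meant only to show the mechanics:

```python
import hashlib

KEY = b"client-side key, never sent to the server"

def toy_deterministic_encrypt(chunk: bytes) -> bytes:
    # Keystream derived only from the key, so equal chunks always produce
    # equal ciphertext -- the same property (and weakness) as ECB mode.
    stream = b""
    block = hashlib.sha256(KEY).digest()
    while len(stream) < len(chunk):
        block = hashlib.sha256(KEY + block).digest()
        stream += block
    return bytes(a ^ b for a, b in zip(chunk, stream[: len(chunk)]))

# Server side: content-addressed storage keyed by ciphertext hash.
store = {}

def server_put(ciphertext: bytes) -> str:
    h = hashlib.sha256(ciphertext).hexdigest()
    store.setdefault(h, ciphertext)   # identical chunks are stored once
    return h

chunk_a = b"A" * 64
chunk_b = b"A" * 64                   # duplicate content
server_put(toy_deterministic_encrypt(chunk_a))
server_put(toy_deterministic_encrypt(chunk_b))
print(len(store))                     # 1 -- deduplicated, but the server
                                      # also *learned* the chunks were equal
```

This is exactly the trade-off the thread describes: deterministic encryption buys deduplication, at the cost of leaking equality between chunks.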

4 Answers

3

You can encrypt each line separately. If you can afford to leak your file names, approximate line lengths, and the line numbers on which changes occur, you can use something like this:

https://github.com/ysangkok/line-encryptor

As each line is encrypted separately (but with the same key), the uploaded changes will, as usual, only involve the relevant lines.

If that is not convenient enough, you could maintain two Git repositories, one with plaintext and one with ciphertext. When you commit in the plaintext repository (which is local), a commit hook could take the diff, run it through the line encryptor referenced above, and apply the result to the ciphertext repository. The ciphertext repository's changes would then be committed and uploaded.

The line encryptor above is SCM-agnostic: it can read unified diff files (of plaintext), encrypt the changes, and apply them to the ciphertext. This makes it usable with any SCM that can generate a unified diff (like Git).
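The per-line idea can be sketched as follows. This is a toy construction (a SHA-256-based keystream with a random per-line nonce, not a vetted cipher, and not the linked tool's actual code) meant only to show that a one-line edit re-encrypts only that line:

```python
import hashlib, os

KEY = b"shared repository key"

def encrypt_line(line: bytes) -> bytes:
    nonce = os.urandom(16)                  # fresh IV per line
    stream = b""
    block = nonce
    while len(stream) < len(line):
        block = hashlib.sha256(KEY + block).digest()
        stream += block
    return nonce + bytes(a ^ b for a, b in zip(line, stream))

old = [b"def f():", b"    return 1"]
enc_old = [encrypt_line(l) for l in old]

# Edit line 2 only: re-encrypt just that line, keep the others as-is.
new = [b"def f():", b"    return 2"]
enc_new = [enc_old[i] if new[i] == old[i] else encrypt_line(new[i])
           for i in range(len(new))]

print(enc_new[0] == enc_old[0])   # True  -- unchanged line, unchanged ciphertext
print(enc_new[1] == enc_old[1])   # False -- only the edited line differs
```

Because untouched lines keep their old ciphertext, a line-based diff of the two encrypted files touches only the edited lines, which is what keeps the uploads small.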

Janus Troelsen
  • Couldn't you use [git's smudge-clean](http://git-scm.com/book/ch7-2.html#Keyword-Expansion) for this? – svick Aug 22 '13 at 19:51
  • @svick: You could, but that way, I don't see how you would nicely allow avoiding re-encrypting the whole file. But of course, it wouldn't matter much for code since the file sizes are small. But there is no need for a "line encryptor" then, you can just use any encryption tool. – Janus Troelsen Aug 22 '13 at 20:01
  • Wouldn't lots of text samples (with a known structure) be something that would make it *easier* to attack the key? Every blank line would encrypt the same. Every start and end of a javadoc would be the same. Now you know the clear text and the cipher text for some segment of the code which can be used. This likely wouldn't be useful against any but hobbyists (anyone with trained crypto types or sufficient computing power could break it with enough effort). –  Aug 22 '13 at 20:02
  • @MichaelT: No, because of IV's. Try it out yourself :) Using the linked implementation, lines encrypt to `,`. – Janus Troelsen Aug 22 '13 at 20:02
  • @JanusTroelsen Except if you do that, delta compression won't work anymore. And delta compression is the reason why `.git` directories tend to be quite small, even with long history. – svick Aug 22 '13 at 20:11
  • @svick: Exactly, I want to take advantage of delta's too. That's why I want to avoid redoing the whole file like smudge-clean does. Did I misunderstand you? – Janus Troelsen Aug 22 '13 at 20:31
  • @JanusTroelsen I didn't take IVs into account and assumed that encryption is deterministic. So you would reencrpyt the whole file, but deltas would still work. And I don't know enough about using IVs correctly to suggest an approach that would work (if there is one). – svick Aug 22 '13 at 20:41
  • 1
    @svick: Lines are encrypted individually. If you change a line, the whole *line* would get re-encrypted, but with a new IV (as always). But the rest of the file won't be touched! Encryption is deterministic, but the IV's are inputs too, and they are pseudo-randomly chosen. – Janus Troelsen Aug 22 '13 at 20:43
1

I hate to do one of those 'this isn't quite going to answer your question' answers.. but..

I can think of two ready solutions which should address these worries.

  1. Host a private Git server on your own. Then put that server on a VPN to which you give your team members access. All communication to and from the server would be encrypted, and you could of course encrypt the server at the OS-level.

  2. BitSync should do the trick as well. Everything would be encrypted, and stored in a huge network available from anywhere. It might actually be a really good application of all this BitCoin/BitMessage/BitSync technology.

Lastly, the folks over at https://security.stackexchange.com/ might have some more insight.

Rubber Duck
  • Regarding BitSync: are you suggesting that it be used as a replacement for a version control system, or somehow used together with a version control system? If the former, then sure, but that's not very interesting. I could just as well share the files over SpiderOak and it would be centralized, but still zero-knowledge. If the latter, then how? – HighCommander4 Aug 20 '13 at 22:17
  • 1
    @HighCommander4 Haven't tried it, but shouldn't be any reason for it to not work.. Couldn't you setup sync to share your initialized git folder, then just do a normal `'git push ./syncedFolderActingAsServer/MyAwesomeProject/src/'`? You could also do git level permissions, etc.. someone should try this! – Rubber Duck Aug 21 '13 at 03:13
1

I don't think there are any barriers. Consider SVN: what gets sent to the server for storage is the delta between the previous and current versions of your code - so if you change one line, just that line gets sent to the server. The server then 'blindly' stores it without inspecting the data itself. If you encrypted the delta and sent that instead, there would be no impact on the server; in fact, you wouldn't even need to modify the server at all.
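That client-side flow might look like this. `difflib` generates the delta, and `encrypt` is a hypothetical stand-in (a toy XOR, for illustration only) for whatever real cipher the client would use:

```python
import difflib

old = "print('hello')\nprint('world')\n".splitlines(keepends=True)
new = "print('hello')\nprint('there')\n".splitlines(keepends=True)

# Only the delta leaves the client...
delta = "".join(difflib.unified_diff(old, new, "r1", "r2"))

# ...and only after client-side encryption (toy XOR standing in for a
# real cipher; the server never holds the key).
def encrypt(data: bytes) -> bytes:
    key = 0x5A
    return bytes(b ^ key for b in data)

server_blob = encrypt(delta.encode())

# The server stores server_blob blindly; it cannot read the delta.
assert b"there" not in server_blob
assert encrypt(server_blob).decode() == delta   # XOR is its own inverse
```

The server's job is unchanged: append an opaque blob to the revision history and hand it back on request.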

There are other bits that might matter, such as metadata properties that are not easily encryptable - such as MIME type - but others could be encrypted, e.g. comments in the history log, as long as you know you have to decrypt them on the client to view them. I'm not sure whether the directory structure would be visible; I think it would not be, due to the way SVN stores directories, but it's possible I'm wrong. This might not matter to you if the contents are secure, however.

This would mean you couldn't have a web site with the various code-viewing features: no server-side repository browser or log viewer, no code diffs, no online code review tools.

Something like this already exists, to a point: Mozy stores your data encrypted with a private key (you can use your own, and they make noises about "if you lose your key, too bad, we can't restore your data for you", but that's more targeted at the average user). Mozy also stores a history of your files, so you can retrieve previous versions. Where it falls down is that uploads happen on a regular schedule, not at check-in time, and I believe it discards old versions when you run out of storage space. But the concept is there; they could modify it to provide secure source control using their existing system.

gbjbaanb
  • Re: "This would mean you couldn't have a web site with the various code view features, no server-side repository browser or log viewer. No code diffs, no online code review tools." - You could still have these if the application logic was in client-side JS and it made you enter your password/key (but not send it to the server), right? – HighCommander4 Aug 21 '13 at 13:32
  • Yes, it could.... Anything would as long as it knew it was receiving encrypted data over the network. It's just an obvious limitation of the server that it cannot decrypt the data. – gbjbaanb Aug 21 '13 at 22:59
1

As I understand it, the way git pull works is that the server sends you a pack file that contains all the objects that you want, but don't have currently. And vice versa for git push.

I don't think you could do it like this directly (because it means the server would have to understand the objects). What you could do instead is have the server work just with a series of encrypted pack files.

To do pull, you download all the pack files that were added since your last pull, decrypt them and apply to your git repo. To do push, you first have to do pull, so that you know the state of the server. If there are no conflicts, you create a pack file with your changes, encrypt it and upload it.

With this approach, you would end up with a large number of tiny pack files, which would be quite inefficient. To fix that, you could download a series of pack files, decrypt them, combine them into one pack file, then encrypt it and upload it to the server, marking it as a replacement for that series.
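The server-side bookkeeping this answer describes can be modeled with a server that stores nothing but an ordered list of opaque blobs; the encryption and the actual `git pack-objects` step are elided, and the class and method names are illustrative:

```python
class BlindPackServer:
    """Stores encrypted pack files in order; never inspects their contents."""

    def __init__(self):
        self.packs = []                 # list of opaque (encrypted) blobs

    def push(self, encrypted_pack: bytes) -> int:
        self.packs.append(encrypted_pack)
        return len(self.packs)          # new cursor position for the client

    def pull_since(self, cursor: int) -> list:
        # Everything added since the client's last pull.
        return self.packs[cursor:]

    def compact(self, start: int, end: int, combined: bytes):
        # The client downloaded packs[start:end], decrypted them, repacked
        # them into one pack, re-encrypted it, and uploads the replacement.
        self.packs[start:end] = [combined]

server = BlindPackServer()
server.push(b"enc(pack1)")
server.push(b"enc(pack2)")
server.push(b"enc(pack3)")
print(server.pull_since(1))             # [b'enc(pack2)', b'enc(pack3)']
server.compact(0, 3, b"enc(pack1+2+3)")
print(server.packs)                     # [b'enc(pack1+2+3)']
```

All merging, conflict detection, and repacking happens on the client, which is what keeps the server zero-knowledge.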

svick