Detecting Internet code leaks

Question

I got curious after reading today about a minor code leak from a large project to some blogs and forums (in short: the developers forgot to anonymize the code before asking for help). The customer, who reasonably wanted to protect the IP they had paid for, detected all of it and complained (angrily).

First of all, I know it's good practice (correct me if I'm wrong) to anonymize code before posting it on public sites to ask for help, for example by renaming, in every snippet, com.thecompanyiworkfor to com.somecompany, or com.bank.MortgageRiskCalculatorClass to com.somecompany.SomeGenericRiskIndicatorClass, and so on.
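
A minimal sketch of that renaming step, assuming the sensitive names are known in advance. The `REPLACEMENTS` table and the `anonymize` helper are hypothetical names used only for illustration:

```python
import re

# Hypothetical mapping from real identifiers to generic stand-ins.
REPLACEMENTS = {
    "com.thecompanyiworkfor": "com.somecompany",
    "com.bank.MortgageRiskCalculatorClass":
        "com.somecompany.SomeGenericRiskIndicatorClass",
}


def anonymize(snippet, replacements=REPLACEMENTS):
    """Return the snippet with every known sensitive name replaced."""
    # Try longer names first so a short prefix never shadows a longer match.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: replacements[match.group(0)], snippet)
```

For example, `anonymize("import com.thecompanyiworkfor.util.Foo;")` returns `"import com.somecompany.util.Foo;"`. A plain text replacement like this only catches the names listed in the table, so comments, string literals and data still need a manual look.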

Now, suppose I would like to find out if and where (and maybe by whom, but that's not important now and not part of the question) the originally restricted code was leaked, in order to react properly (read "send everyone an email telling them to either delete/clean up their posts or something bad will happen to the culprit", haha).

I suppose that a good way could be googling something unique that you could find in the code. For example, if I worked for Inintech I would try to google for com.inintech to see if somebody was stupid enough to paste code complete with its import/using directives.
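
A rough sketch of how such search terms could be collected automatically, assuming Java-style `package`/`import` lines. The helper names and the `ignore` list are hypothetical, and the output is only a list of quoted strings to paste into a search engine, not a call to any search API:

```python
import re
from collections import Counter

# Matches Java-style "package a.b.c;" and "import a.b.c.D;" lines.
DECLARATION = re.compile(
    r"^\s*(?:package|import)\s+(?:static\s+)?([A-Za-z_][\w.]*)",
    re.MULTILINE,
)


def namespace_prefixes(source, depth=2):
    """Return the first `depth` segments of every declared namespace."""
    prefixes = []
    for name in DECLARATION.findall(source):
        parts = name.split(".")
        if len(parts) >= depth:
            prefixes.append(".".join(parts[:depth]))
    return prefixes


def search_queries(sources, ignore=("java.", "javax.", "org.apache")):
    """Build quoted search strings, most frequent namespace first."""
    counts = Counter()
    for source in sources:
        counts.update(namespace_prefixes(source))
    # Skip well-known public namespaces that would match half the Internet.
    return ['"%s"' % prefix for prefix, _ in counts.most_common()
            if not prefix.startswith(tuple(ignore))]
```

Given a file that declares `package com.inintech.billing;` and imports `java.util.List`, this yields the single query `"com.inintech"`.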

It's not a comprehensive method: it assumes the goal is to protect the link between the company and the code (i.e. for security-by-obscurity reasons, public image...) rather than to keep the intellectual property itself from being freely available to the public.

My straight questions are:

Do you know/think there are other good practices to perform such investigations? How would you go about it if your boss asked you to find out whether someone leaked the company's code? I don't think someone would ever try to google for an entire source code file in one query string :)

Do you know if there are companies performing such investigations? If so, what can you tell me about them beyond their names, such as how they work?

What is the purpose of the anonymization? To avoid embarrassment? – David Schwartz, Sep 23 '11 at 08:08
I am pretty sure that more code is copied from the Internet than is leaked to it, and that can be just as serious or worse if the copied code is under a GPL-like license, which makes redistributing the source of derived works mandatory... – Xavier T., Sep 23 '11 at 08:23
Of course, but mainly "to avoid telling all the world that code in the Acme repository is so bad that it crashes once a microsecond" from the NDA's point of view :) – usr-local-ΕΨΗΕΛΩΝ, Sep 23 '11 at 08:25
I often see web.config files posted in full on SO. I would not be comfortable with something like that sitting out in the open with connection strings out there for anyone to see. – Morgan Herlocker, Sep 23 '11 at 18:22

score 2 · Answer 1 · answered Sep 27 '11 at 10:01

Teachers use software to detect plagiarism in their students' work. Maybe this can work with code too. However, be careful not to disclose the code yourself by using unreliable software...
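
To give an idea of how such a comparison can work on code, here is a minimal sketch of one common approach: break both texts into overlapping runs of k tokens ("shingles") and measure the overlap with Jaccard similarity. This is an illustration under that assumption, not a description of how any particular plagiarism tool is implemented, and the function names are hypothetical:

```python
import re


def tokens(code):
    """Split code into identifiers, numbers and punctuation, ignoring whitespace."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def shingles(code, k=5):
    """Return the set of all runs of k consecutive tokens."""
    toks = tokens(code)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def similarity(a, b, k=5):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0  # too short to compare
    return len(sa & sb) / len(sa | sb)
```

Because whitespace is dropped during tokenization, a snippet that was merely reindented before being pasted on a forum still scores 1.0 against the original. Renamed identifiers lower the score, so this complements the namespace search rather than replacing it. Running it locally also avoids handing the code to a third party.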

Clement J.


:) This seems to answer the part of the question in which I ask if there are companies providing such services. Copyscape provides services for documents, I guess that the method could be readapted to code... +1 – usr-local-ΕΨΗΕΛΩΝ Sep 27 '11 at 13:43

Tom Squires · Answer 2 · 2011-09-23T09:40:54.940

1

Ultimately I can't see any damage that has been done. So a small number of techies saw your company's name in a block of code, big deal.

If you make a huge fuss over this, it will alienate your staff and discourage them from using Internet resources like SO. That really will damage the company.

My advice would be to act as if you haven't seen it. Since it seems to bother you, if you do happen to find out exactly who is doing it, then send them a private email asking them to take more care, but go no further. Don't waste time and resources finding out who it is.

EDIT: This advice is only relevant if your company is developing internally, not for clients. See the comments below.

edited Sep 23 '11 at 09:40

answered Sep 23 '11 at 08:11

Tom Squires


You answered my question by supposing I am on the PM's side. I agree with your answer. But my question focuses on "**how** do companies usually find small fragments of their own code scattered across the Internet?", since the episode I mentioned is not a huge Wikileaks-style dump of code, but rather people posting small fragments on forums. I gave myself an answer, "search for the namespace on Google and see if you find anything", but I wonder if there is more than that. Nothing else. +1 for you, I would behave the same – usr-local-ΕΨΗΕΛΩΝ Sep 23 '11 at 08:20
4

Partially disagree. As a company owner, I would stay silent in public, of course, because I don't want to invoke the Streisand effect. But when the developers are working for me, they've usually signed some IP and/or NDA contract. By posting non-anonymized code, they've breached these contracts. Even more so if it contains our customer's data, for which my company could have signed a separate NDA that is also breached by this. "No damage done" doesn't cut it then. – Secure Sep 23 '11 at 09:25
1

This is not very useful advice when the developing company is not the final recipient. Fictional example: PiskvorInternetWidgets Ltd. may not care that their code written for NigiDotar is crap and SQL-injectable, but hey, what's the worst that could happen, that system is not facing the Internet anyway (I see this excuse a lot: "internal code doesn't have to be secure at all, we trust our staff"). The client, OTOH, could be *slightly* unhappy to see `import nl.nigidotar.piskvorwidgets` plastered over the interwebs, for various reasons (security by obscurity/bad publicity). – Piskvor left the building Sep 23 '11 at 09:31
3

@Secure: Also, consider that, from the PoV of a customer, finding a breach of contract could be seen as a good way to avoid paying or, more generally, to impose penalties on the contractor. This explains why the customer actively performs investigations. – usr-local-ΕΨΗΕΛΩΝ Sep 23 '11 at 09:33
@Secure Good point. – Tom Squires Sep 23 '11 at 09:41
-1 The question was "how to do it?" not "should I do it?" – Jacek Prucia Sep 27 '11 at 13:43

score 0 · Answer 3 · answered Sep 27 '11 at 11:02

Unless you find someone actually stealing code and building their own product on your work, it's pretty much a waste of time. The Internet is already like a big coding cookbook. Finding snippets and pieces of code that fit into your own puzzle isn't that hard. Optimizing them to fit into your own project is the hard part.

Writing software is easy. Writing easy software that performs well is hard.

As long as no private keys are actually leaked in any way, I'd rather not care.
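
If private keys are the one thing worth checking for, a quick test on any text you find is to look for the PEM header that key files start with. A minimal sketch, assuming the key was pasted in PEM form; the helper name is hypothetical:

```python
import re

# PEM-encoded private keys start with a line such as
# "-----BEGIN RSA PRIVATE KEY-----" or "-----BEGIN PRIVATE KEY-----".
PEM_PRIVATE_KEY = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def contains_private_key(text):
    """Return True if the text contains a PEM private key header."""
    return PEM_PRIVATE_KEY.search(text) is not None
```

This only catches keys in PEM form. Passwords, connection strings and keys in other encodings need their own checks.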
