3

I'm interested in obfuscating native code as part of a build process. By "native code" I mean the actual machine code in x86/x64/ARM executable files (PE, ELF, Mach-O, etc.), not the sources. I can think of several ways of doing it:

  1. Writing a compiler plugin that receives the internal representation after all optimization phases, transforms it, and then passes it to the compiler's back end. In this case, could you recommend compilers that support plugins like that? The programming language doesn't really matter, but I would like to try something other than C/C++ (I know that both GCC and Clang support plugins, but I could never get them working properly on Windows).

  2. Dumping the internal representation after all optimization phases to a file, parsing and processing it with my obfuscation tool, and passing it back to the compiler. I couldn't find out how to do this with GCC (I successfully dumped various internal representations, but I couldn't find a way to make the compiler generate code from them). I know that Clang can generate LLVM bitcode, which can be processed and compiled with the LLVM tools, but again I couldn't make it work properly on Windows. Do you know of any compiler that allows this?

  3. Currently I'm doing obfuscation by preprocessing the sources. It works, but it takes too long for a large code base. So I'm looking for alternatives, and I would like to hear your opinions on the best way to do native code obfuscation.

PS: Please bear in mind that I'm asking a concrete question. I'm not asking whether I need obfuscation or not. Obfuscation is done for a reason and I have a reason to do it. I'm not asking how to perform actual native code obfuscation or about algorithms of code obfuscation. I know how to do it and how to break it.

I'm asking about native code compilers that either have good support for plugins or can dump and recompile some low-level internal representation such as an AST (abstract syntax tree), RTL (register transfer language), three-address code, or something else.
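
As a rough illustration of option 2 on an LLVM toolchain, the round trip could be driven by a small build script like the one below. This is a sketch under assumptions, not a tested recipe for Windows: `ObfPass.so` and the pass name `my-obf` are hypothetical placeholders for an obfuscation pass plugin, the file names are made up, and the `opt` flags assume a recent LLVM with the new pass manager. The script only builds the command lines, so it can be inspected without a toolchain installed.

```python
import shlex


def build_pipeline(source, plugin="./ObfPass.so", pass_name="my-obf"):
    """Return the command lines for a source -> bitcode -> object round trip.

    `plugin` and `pass_name` are hypothetical: they stand for whatever
    obfuscation pass is plugged in, not for an existing tool.
    """
    stem = source.rsplit(".", 1)[0]
    bitcode = stem + ".bc"
    obfuscated = stem + ".obf.bc"
    obj = stem + ".o"
    return [
        # 1. Optimize, but stop at LLVM bitcode instead of machine code.
        ["clang", "-O2", "-emit-llvm", "-c", source, "-o", bitcode],
        # 2. Run the (hypothetical) obfuscation pass over the bitcode.
        ["opt", "-load-pass-plugin=" + plugin, "-passes=" + pass_name,
         bitcode, "-o", obfuscated],
        # 3. Hand the transformed bitcode back to the code generator.
        ["llc", "-filetype=obj", obfuscated, "-o", obj],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("foo.c"):
        print(shlex.join(cmd))
```

Because each step is an ordinary command, the same three-stage shape fits into most build systems as a custom rule per translation unit.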

Dave Hillier
user2102508
  • 2
    Just outsource your development to unskilled programmers -- then nobody will understand the code ever. Seriously though, optimized machine code is seriously hard to understand; any obfuscation process wouldn't make it much harder, while drastically slowing down your execution. – James Anderson Aug 21 '13 at 08:02
  • 3
    You mean you want your machine code to become more obfuscated than it already is coming out of an optimizing compiler? Because you believe it is still "too readable"? – Doc Brown Aug 21 '13 at 08:33
  • 1
    @DocBrown: Exactly! It is in fact too readable. I've been doing reverse engineering for 6 years now, believe me, I know what I'm saying. – user2102508 Aug 21 '13 at 08:45
  • @user2102508: Have you picked all the low-hanging fruit of eliminating most symbols (setting visibility to hidden wherever at all possible) and maxing out inlining? – Jan Hudec Aug 21 '13 at 08:54
  • @user2102508: GCC _intentionally_ does _not_ load internal representation. It's a licensing policy and it's one of the main reasons why clang was started. – Jan Hudec Aug 21 '13 at 08:58
  • @JanHudec: Sorry, but your question isn't related to what I'm asking here. I still can't figure out why you guys keep misunderstanding what I'm asking here... I know about GCC, as I tried it myself; I wrote about it in the question. I'm looking for a compiler that can do it. – user2102508 Aug 21 '13 at 08:59
  • 1
    @user2102508: My question is very much related to what you are asking here. You say you are currently obfuscating by preprocessing sources. The only thing I can imagine you doing there (and it's what high-level language obfuscators do) is renaming symbols. Now the linker can eliminate most symbols from ever appearing in the output. So I am asking whether you did everything possible to eliminate symbols from the output altogether. And what else do you want to obfuscate? – Jan Hudec Aug 21 '13 at 09:04
  • 2
    @JanHudec: What else? The code! I'm obfuscating the code. I'm generating trash code, encrypting strings, using antidisassembly tricks and all this is done by preprocessing source files. No symbols are left after the compilation to native code, what are you talking about? – user2102508 Aug 21 '13 at 09:12
  • 7
    @user2102508: That last answer should definitely go into the question as an edit. When you ask a question, you **must** explain **why** you need it. Because for 99% of questions of the form "how do I use X to do Y" the correct answer is "use Z". I read the question carefully. But I am not going to answer it before I am certain there is no easier way. And neither are most other people here. – Jan Hudec Aug 21 '13 at 09:37
  • 1
    @user2102508: As for symbols, there are many symbols left after compilation to native code. It depends on the binary format, but ELF specifically leaves _all non-static symbols_ in the symbol table by default. And that's the easiest hint for disassembling most of the code. Telling the linker not to do that is the first step in protecting native code from being successfully disassembled (yes, PE is different, it does not do by-name late binding; ELF does). – Jan Hudec Aug 21 '13 at 09:43
  • Buy an obfuscation tool. Plug it in. Done. What am I missing? You should probably separate this into 3 questions. – Dave Hillier Aug 21 '13 at 09:44
  • 2
    _"asking about native code compilers"_ -- Questions asking us to recommend a tool, library or favorite off-site resource are off-topic for Programmers as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. – gnat Aug 21 '13 at 09:49
  • 1
    I found [Collberg and Nagra's book](http://www.amazon.com/Surreptitious-Software-Obfuscation-Watermarking-Tamperproofing/dp/0321549252) pretty interesting, even though I don't work in this field. I recommend also reading [this article](http://www.gamasutra.com/view/feature/3030/keeping_the_pirates_at_bay.php) on one of the very few success stories of obfuscation. – Gilles 'SO- stop being evil' Aug 21 '13 at 10:10
  • @DaveHillier: It is an easy solution, but the problem is that I'm interested in doing it myself. – user2102508 Aug 21 '13 at 10:12
  • @gnat: I don't really see how to describe the problem in another way. I'm just asking for a tool that meets specific criteria. – user2102508 Aug 21 '13 at 10:14
  • 1
    @user2102508 If you're interested in doing it yourself then try an approach and ask specific questions about problems you encounter (that's what this site is for). After that, delete this question :P – Dave Hillier Aug 21 '13 at 10:21
  • 1
    What's wrong with Clang on Windows? It's pretty easy to build, and it works well with the MSVS toolchain. – SK-logic Aug 21 '13 at 11:27
  • @user2102508 - I think you will get better answers and fewer off-topic comments if you remove "obfuscation" from your question, and focus on transforming intermediate representations (personally, I think AST may be too high level). – kdgregory Aug 21 '13 at 13:38
  • I agree with @SK-logic. Clang is the only suitable compiler out there and _should_ work on Windows. So if you are having problems getting it to work, ask a specific question (on Stack Overflow) about what does not work for you there (note: clang works fine with MinGW libraries, but with Microsoft libraries it has some problems calling the linker from MSVC++; I didn't investigate further) – Jan Hudec Aug 22 '13 at 08:53
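
The source-preprocessing approach user2102508 describes in the comments above (encrypting strings before compilation) might look roughly like the following build-time helper. This is only a sketch: the repeating-key XOR is a deliberately simple stand-in, not the asker's actual scheme, and the function names and the emitted C array layout are assumptions made for illustration.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR `data` with a repeating key (placeholder for a real scheme)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def emit_c_array(name: str, plaintext: str, key: bytes) -> str:
    """Emit a C array holding the encoded, NUL-terminated string.

    The generated source would still need a runtime decode routine;
    that part is not shown here.
    """
    encoded = xor_bytes(plaintext.encode("utf-8") + b"\0", key)
    body = ", ".join(f"0x{b:02x}" for b in encoded)
    return f"static unsigned char {name}[{len(encoded)}] = {{ {body} }};"


if __name__ == "__main__":
    print(emit_c_array("msg_license", "License expired", b"\x5a\xc3\x17"))
```

Running a step like this over every string literal in a large code base is the kind of per-file work that makes source preprocessing slow, which is the motivation for moving the transformation into the compiler.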

2 Answers

5

Have a look at Stuxnet for a starting point. I realize you didn't imply or ask about malware techniques, but it's the only example I can think of that will meet your requirements. Regardless of the political intent for Stuxnet, the technical structure is quite interesting.

Symantec's analysis provides a really good introduction to the structure of that malware. There are quite a few additional analyses available that go deeper into the code.

> The heart of Stuxnet consists of a large .dll file that contains many different exports and resources. In addition to the large .dll file, Stuxnet also contains two encrypted configuration blocks.
> The dropper component of Stuxnet is a wrapper program that contains all of the above components stored inside itself in a section name “stub”. This stub section is integral to the working of Stuxnet. When the threat is executed, the wrapper extracts the .dll file from the stub section, maps it into memory as a module, and calls one of the exports.

And I think the generalized approach would work for you. You'll have a stub or wrapper that hosts your application. Each separable component is encrypted and packed within the wrapper. An attacker would then need to decrypt the packed component(s) before they could read the machine code it contains.

For additional obfuscation, use separate encryption keys for each component. Keep in mind that an attacker will still have access to the keys, since you have to include them for the wrapper to decrypt the components. However, if you are distributing based upon a licensing model, then you don't have to distribute all the keys. Likewise, you could set up keys on a per-client basis, so if a copy of your code escapes into the wild, you could potentially track down the source of the escape.
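
A minimal sketch of that packing scheme, with one key per component, might look like this. Everything here is an assumption made for illustration: the function names are invented, the "components" are just byte strings, and the SHA-256 counter keystream is a toy stand-in for a real cipher so that the example needs only the standard library.

```python
import hashlib
import os


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + counter (NOT a vetted cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def crypt(data: bytes, key: bytes) -> bytes:
    """XOR with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def pack(components: dict) -> tuple:
    """Encrypt every component under its own random key."""
    keys = {name: os.urandom(32) for name in components}
    blob = {name: crypt(code, keys[name]) for name, code in components.items()}
    return blob, keys


def unpack(blob: dict, keys: dict) -> dict:
    """Decrypt only the components for which a key was shipped."""
    return {name: crypt(blob[name], keys[name]) for name in keys if name in blob}


if __name__ == "__main__":
    blob, keys = pack({"core": b"\x90\x90\xc3", "pro": b"\xcc" * 16})
    licensed = {"core": keys["core"]}  # the "pro" key is withheld
    print(sorted(unpack(blob, licensed)))
```

Withholding a key is what ties the scheme to licensing: the encrypted component still ships in the wrapper, but it cannot be recovered from that copy alone.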


Alternatively, take a look at what mainframe programmers of old used to do with code that would overwrite itself.

A conditional would jump to another location which would insert instructions in the area that just executed. The code then jumps back to the beginning of the modified area. It was "useful" when hotfix patches needed to be applied to a system. It was also frequently perilous to the application's health, so be cautious with this technique.

The obfuscation comes into play because the program has to run and write over the existing instructions before the "actual" program can operate. I don't think this would scale for a larger project.

  • Hmm, interesting. I hope it is a coincidence that this is a good example and that the OP isn't writing malware ;) – Dave Hillier Aug 21 '13 at 11:54
  • @DaveHillier - I pondered a bit about that one too. However, I thought it was fairly unlikely that a malware author would seek advice in a public forum _and_ the details about stuxnet are already pretty commonly known in those circles anyway. Most malware authors who are writing the kits that get sold are reasonably intelligent and stay on top of events like that. –  Aug 21 '13 at 13:45
  • The problem with self-rewriting code is that current operating systems greatly limit it. You can generate code on the heap and run it (virtual machines need that), but you can't write over code loaded from the binary file. – Jan Hudec Aug 22 '13 at 08:47
-1

You seem to be confused.

Code obfuscation is a (weak) method of protecting your source code from being understood by others if they get their hands on it. It can be a superficial barrier to amateurs when variable names, formatting etc. are distorted to be "unreadable", but the attackers you should worry about - hackers and crackers - aren't fazed by such cosmetic changes. They will just normalize the code automatically and read it with little difficulty.

Obfuscating native code makes even less sense. There are only so many ways you can add two numbers, and if your program needs additions, it will have to use one of them. Assembler code is never printed in anything but normalized one-op-per-line format, and it rarely uses long, meaningful identifiers in the first place. In other words, obfuscated native code looks just like normal native code. (And yes, professional software thieves can read machine code like they read the newspaper.) While it is possible to make the actual operations performed by the program more obscure and needlessly complicated, this carries a heavy risk of introducing errors and inefficiency. Is this really worth the perceived additional safety against code thieves?
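
To make "more obscure and needlessly complicated" concrete, here is one small illustration using addition. It relies on the standard identity x + y = (x XOR y) + 2·(x AND y); the 32-bit mask and the function names are assumptions for the sketch, and Python stands in for the machine instructions a tool would actually emit.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit wraparound arithmetic


def add_plain(x: int, y: int) -> int:
    """Ordinary 32-bit addition."""
    return (x + y) & MASK


def add_obscured(x: int, y: int) -> int:
    """Same result via x + y == (x ^ y) + 2 * (x & y).

    The XOR adds the bits without carries; the AND, shifted left by
    one, supplies the carries.
    """
    return ((x ^ y) + ((x & y) << 1)) & MASK


if __name__ == "__main__":
    for x, y in [(1, 2), (0xFFFFFFFF, 1), (123456, 654321)]:
        print(x, y, add_plain(x, y), add_obscured(x, y))
```

Both functions compute the same value for every input, but the second costs more instructions, which is the trade-off between obscurity and efficiency described above.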

In short, I suggest you reconsider what exactly your threat model is and what measures make sense to defend yourself.

Kilian Foth
  • 10
    I'm not confused and I'm not asking whether I need obfuscation or not. I'm asking a concrete question about native compilers. And by the way there are a lot of ways to perform native code obfuscation actually, some of them may be very complex, like control flow flattening for example. People write dissertations on this topic, and you are saying that it doesn't make sense... – user2102508 Aug 21 '13 at 07:57
  • 1
    -1. Not because of the angry comment by the asker that it's not what he wanted, but because it's an answer given before you knew. – Jan Hudec Aug 21 '13 at 09:45
  • This is not true. Code obfuscation can go way beyond things like symbol names, and can obscure things like how a program comes up with a certain number. Look up white-box cryptography for an extreme case. It's difficult, it makes debugging all but impossible, it doesn't protect against copying the program unless you do it right for this purpose, all in all it's a huge increase in development costs that almost always goes way beyond any additional revenue that it might bring, but it is possible. – Gilles 'SO- stop being evil' Aug 21 '13 at 10:06
  • @Gilles: Yes, it makes debugging impossible, but you don't really need to debug obfuscated code (you just disable obfuscation when you are developing, and enable it when you build release version), unless the obfuscation itself introduces some bugs to your code, which is unacceptable. – user2102508 Aug 21 '13 at 10:17
  • @user2102508 “Unless the obfuscation itself introduces some bugs to your code, which is unacceptable.” Hahahaha, good luck with that. Have you never tracked a hard-to-reproduce issue due to a missing initializer, use-after-free, concurrency, etc.? Your code may well seem to work until you run it through the obfuscator. And then the obfuscator itself can be buggy. – Gilles 'SO- stop being evil' Aug 21 '13 at 10:25
  • If you know native code, you know that it all maps to simple operations on the CPU, registers and memory segments. All one has to do is keep track of what operations are performed on what registers/memory segments, and one has worked out what your code does. That you can't "obfuscate". Unless crypto is involved. – Nisk Aug 21 '13 at 10:29
  • 1
    @Nisk: Of course you can obfuscate native code without crypto. For example, this [Automated x86 instruction obfuscation](http://stackoverflow.com/q/7947353/18192) post. Obfuscating native code just means translating into code which has the same semantics but different, more difficult (if only marginally so) to understand instructions. Whether such obfuscations are of value from a cost-benefit analysis is a separate question. – Brian Aug 21 '13 at 17:15
  • @Brian I think you're misunderstanding; what I meant was that it's no use in the end. You can write all the Rube Goldberg code you want, it's an exercise in futility and in outputting inefficient software/code. And you're not obfuscating native code either - you're rewriting it to do more useless stuff. You can't obfuscate native code, only the task that you're trying to achieve with code. – Nisk Aug 22 '13 at 09:39
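
For readers unfamiliar with control flow flattening, which user2102508 mentions in the comments above, the idea can be sketched as follows. This is a hand-written illustration, not the output of any tool: the Collatz step counter is an arbitrary example function, the state numbering is made up, and Python stands in for the native code a real obfuscator would transform. The input is assumed to be a positive integer.

```python
def collatz_steps(n: int) -> int:
    """Straightforward version: structured loop and branch."""
    steps = 0
    while n != 1:
        if n % 2 == 0:
            n //= 2
        else:
            n = 3 * n + 1
        steps += 1
    return steps


def collatz_steps_flat(n: int) -> int:
    """Flattened version: every basic block hangs off one dispatcher.

    The loop and branch structure is replaced by a `state` variable,
    so the control flow graph no longer mirrors the original logic.
    """
    steps = 0
    state = 0
    while True:
        if state == 0:    # loop test
            state = 1 if n != 1 else 4
        elif state == 1:  # parity test
            state = 2 if n % 2 == 0 else 3
        elif state == 2:  # even branch
            n //= 2
            steps += 1
            state = 0
        elif state == 3:  # odd branch
            n = 3 * n + 1
            steps += 1
            state = 0
        else:             # exit block
            return steps


if __name__ == "__main__":
    print(collatz_steps(27), collatz_steps_flat(27))
```

The two functions return the same value for every positive input; what changes is that a reader of the flattened version has to reconstruct the transitions between states before the original loop becomes visible.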