Introduction
I recently abandoned IDA Pro, a professional disassembler, for Binary Ninja. Binary Ninja (or Binja) is a complete disassembler, offering more or less the same capabilities as IDA Pro.
One of the strengths of this software lies in its API, quite apart from the fact that the basic license includes support for all common architectures (x86, x64, ARM), a debugger and a decompiler. The API can be used to automate many aspects of a malware analyst’s workflow: it interacts not only with the binary statically (which goes without saying) but also with the debugger, which makes it possible to build automated unpackers, for example.
To learn how to use this software and its API effectively, what better way than to analyze a sample of IcedID, a trojan that’s particularly in vogue at the moment, and see what Binja has to offer?
Unpacking
Initially, I thought I’d do a complete analysis on this sample. IcedID is back from the dead and I thought I’d see what had changed.
But I came across a buggy sample. I unpacked it manually the first time, before sending the sample to Unpac.me to save time.
To make a long story short, here’s how the unpacking of this sample goes:
- Decrypt a shellcode and execute it
- The shellcode overwrites the sections of the original malware, replacing its entire code.
- The malware then hooks ZwCreateProcess with a custom function
- CreateProcessW is called to launch svchost.exe
- The hooked function will create the svchost.exe process in suspended mode, then inject code into it
- Svchost will decrypt the next stage and create a scheduled task to execute the binary, triggered by the next user login.
It will repeat this process several times, each time using the same technique, but removing more and more obfuscation from the binaries. This technique works very well, as it prevents a sandbox from gaining access to the binary’s “plaintext” code, as several user logins are required.
So why couldn’t I continue the analysis?
|
00401c79 push dword [esp+0x4 {size}] {var_4}
00401c7d push 0x8 {var_8}
00401c7f call dword [GetProcessHeap]
00401c85 push eax {var_c}
00401c86 call dword [HeapAlloc]
00401c8c retn {__return_addr}
|
This function is very simple: it’s responsible for allocating memory in the process. However, at some point during execution HeapAlloc starts returning 0 (NULL) on every call, so the program can no longer allocate memory and the unpacking chain stops.
If we refer to the Joe Sandbox analysis, we can see that no domain or IP was contacted and that no behavior beyond the unpacking itself took place, which leads me to believe that this specific sample is buggy.
But I didn’t waste any time analyzing it.
API Hashing
This sample uses API hashing, a technique widely used by malware to hide its behavior from antivirus software and analysts.
As a reminder, a Windows API is a function provided by Microsoft to interact with the operating system. For example, CreateThread is the API normally used to create a thread in Windows.
Instead of calling Windows APIs in the normal way (which would make them appear in the import table and thus be usable for detection purposes), malware uses a technique that resolves the addresses of these APIs dynamically.
APIs are functions exported by Windows DLLs (kernel32.dll and NTDLL.dll, to name only the most common), so it is possible to retrieve their addresses and use them dynamically without having to import them.
Here’s how API hashing works:
- During the malware development phase, the author chooses a hashing algorithm to hash the names of the various APIs they wish to use (e.g. CreateProcessW becomes 5c856c47 using CRC32).
- At runtime, the malware will first retrieve the base address of the DLL containing the functions to be imported. This can be done using the GetModuleHandle('dll_name') API or by walking the PEB.
- Next, the program walks the list of functions exported by the DLL, retrieving their names and addresses.
- For each API name, the malware hashes it with the same algorithm used to precompute the hashes, then compares the result with the hash it is looking for.
- If they match, the function address will be returned to the program or stored in a variable.
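The resolution loop described above can be simulated in a few lines of Python. This is only a sketch: the export table is a hypothetical dictionary with made-up addresses, `resolve_by_hash` is an illustrative name, and CRC32 stands in for whatever algorithm a given sample actually uses.

```python
import zlib

def api_hash(name: str) -> int:
    """Hash an API name. CRC32 is an assumption here; real samples vary."""
    return zlib.crc32(name.encode("ascii")) & 0xFFFFFFFF

def resolve_by_hash(exports: dict, wanted: int):
    """Walk the export list and return the address whose name hashes to `wanted`."""
    for name, address in exports.items():
        if api_hash(name) == wanted:
            return address
    return None  # no export matched

# Hypothetical export table: the names are real APIs, the addresses are made up.
FAKE_EXPORTS = {
    "CreateProcessW": 0x7FF00010,
    "CreateThread": 0x7FF00020,
    "VirtualAlloc": 0x7FF00030,
}

# The "malware" only embeds the hash, never the API name itself.
wanted_hash = api_hash("CreateThread")
print(hex(resolve_by_hash(FAKE_EXPORTS, wanted_hash)))
```

The point to notice is that the string "CreateThread" never needs to appear in the binary: only the 32-bit constant does, which is exactly what makes static analysis harder.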
How to reverse an API hashing function efficiently?
In this sample, here’s how API hashing is used:
|
{
    HINSTANCE eax = GetModuleHandleA("NTDLL.DLL");
    if (eax != 0)
    {
        data_402000 = 1;
        int32_t esi_3 = (((sub_4012c2(eax, eax, 0, 0xe50cd451, &data_402030, 0x402076) | sub_4012c2(eax, eax, 0, 0xb0d89fb2, &data_402028, 0x402070)) | sub_4012c2(eax, eax, 0, 0xd37bdaeb, &data_402008, 0x402058)) | sub_4012c2(eax, eax, 0, 0xf4b15f66, &data_402040, 0x402082));
        int32_t esi_7 = ((((esi_3 | sub_4012c2(eax, eax, 0, 0x8c795ddf, &data_402018, 0x402064)) | sub_4012c2(eax, eax, 0, 0xc5509c94, &data_402010, 0x40205e)) | sub_4012c2(eax, eax, 0, 0xae46d1e4, &data_402020, 0x40206a)) | sub_4012c2(eax, eax, 0, 0xfd06b77e, &data_402048, 0x402088));
        int32_t esi_8 = (esi_7 | sub_4012c2(eax, eax, 0, 0x2d7fdd26, &data_402038, 0x40207c));
        int32_t eax_11 = (sub_4012c2(eax, eax, 0, 0x530c1aee, &data_402050, 0x40208e) | esi_8);
        int32_t eax_12 = (-eax_11);
        return ((eax_12 - eax_12) + 1);
    }
    return eax;
}
|
First, we can see that the address of the NTDLL.dll library is retrieved.
Then come several calls to the same function with telling arguments: the first two are the handle of NTDLL, the fourth is the hash to be resolved and the fifth is the variable where the address of the resolved API will be stored.
When you identify a function with similar arguments, check for cross-references. If it is called many times, it may well be an API hashing function.
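One way to apply this heuristic is to rank functions by how often they are referenced. The sketch below assumes it runs in Binary Ninja’s Python console, where `bv` is the current BinaryView; `rank_by_callers` is a name chosen for illustration, and it relies only on `bv.functions`, `func.start`, `func.name` and `bv.get_code_refs`.

```python
def rank_by_callers(bv, top=10):
    """Return (reference count, address, name) for the most-referenced functions."""
    ranked = []
    for func in bv.functions:
        # get_code_refs yields every code location that references this address
        count = len(list(bv.get_code_refs(func.start)))
        ranked.append((count, func.start, func.name))
    ranked.sort(reverse=True)
    return ranked[:top]

# In the Binary Ninja console:
# for count, start, name in rank_by_callers(bv):
#     print(f"{count:4d}  {start:#x}  {name}")
```

A small function that is referenced many times and always receives a constant that looks like a hash is a good candidate for closer inspection.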
Find the hashing algorithm
The most important part will be to find the hashing algorithm used. This is often a common algorithm, and tools like capa may be able to detect it. But sometimes, authors can modify already known algorithms or create one from scratch.
That’s why it’s useful to use a debugger once you’ve found the hashing function, which makes the task easier.
In my case, the function used to hash the API name was ROR13 (each step rotates the running hash right by 13 bits, then adds the next character), with the final value XORed against a constant hardcoded in the binary (0x401056):
|
while (true)
{
    int32_t eax;
    eax = *esi;
    if (eax == 0)
    {
        break;
    }
    ecx = ((RORD(ecx, 0xd)) + eax);
    esi = &esi[1];
}
...
if (*eax_1 != (ecx ^ 0x401056))
|
Replicating the hashing algorithm and creating a lookup table
Once I understood the algorithm, I replicated it in Python and created a lookup table. The table contains the names of all the APIs exported by the chosen DLL (here NTDLL) and their corresponding hashes:
|
import json

XOR_KEY = 0x401056  # constant hardcoded in the sample (4198486)

def ROR(x, n, bits=32):
    """Rotate the `bits`-wide value x right by n bits."""
    x &= (1 << bits) - 1
    return ((x >> n) | (x << (bits - n))) & ((1 << bits) - 1)

def hash_api(api):
    h = 0
    for a in api:
        # Mask after the add: the sample works on a 32-bit register
        h = (ROR(h, 13) + ord(a)) & 0xFFFFFFFF
    return hex(XOR_KEY ^ h)

apis = []
with open("NTDLL_apis.txt", "r") as f:
    for line in f:
        name = line.strip()
        if not name:
            continue  # skip blank lines
        apis.append({"api": name, "icedid_hash": hash_api(name)})

# "w" rather than "a": appending on a second run would produce invalid JSON
with open("hash_apis.json", "w") as f:
    json.dump(apis, f)
|
This script goes through a list of APIs, extracted with a little program I coded in C (it’s also possible, and much simpler, to do it in Python), and hashes all of them before producing a JSON file which will be our lookup table:
|
...
{"api": "EtwEnumerateProcessRegGuids", "icedid_hash": "0x91246a5e"},
{"api": "EtwEventActivityIdControl", "icedid_hash": "0xd5072de7"},
{"api": "EtwEventEnabled", "icedid_hash": "0xf0906dff"},
{"api": "EtwEventProviderEnabled", "icedid_hash": "0xb51b3f6e"},
{"api": "EtwEventRegister", "icedid_hash": "0xfb73ec0e"},
{"api": "EtwEventSetInformation", "icedid_hash": "0x4f3be0bb"},
{"api": "EtwEventUnregister", "icedid_hash": "0xf55690dc"},
{"api": "EtwEventWrite", "icedid_hash": "0x2007d3b8"},
{"api": "EtwEventWriteEndScenario", "icedid_hash": "0x72faa875"},
{"api": "EtwEventWriteEx", "icedid_hash": "0x1458ec56"},
{"api": "EtwEventWriteFull", "icedid_hash": "0xbdeefe7"},
{"api": "EtwEventWriteNoRegistration", "icedid_hash": "0xdd452084"}
...
|
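The export list itself can also be produced in Python. A minimal sketch, assuming the third-party pefile package is installed and a copy of the DLL is available locally; the helper names and the example path are illustrative and are not taken from the original C program.

```python
def names_from_symbols(symbols):
    """Keep named exports only (ordinal-only exports have name None)."""
    return [sym.name.decode("ascii") for sym in symbols if sym.name]

def dump_exports(dll_path, out_path):
    """Write one exported name per line, the format the hashing script reads."""
    import pefile  # third-party: pip install pefile

    pe = pefile.PE(dll_path)
    names = names_from_symbols(pe.DIRECTORY_ENTRY_EXPORT.symbols)
    with open(out_path, "w") as f:
        f.write("\n".join(names) + "\n")
    return len(names)

# Example (the path is an assumption):
# dump_exports(r"C:\Windows\System32\ntdll.dll", "NTDLL_apis.txt")
```

The output file can then be fed to the hashing script as NTDLL_apis.txt.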
Automation
Now that we have the lookup table, it’s time to resolve these APIs in Binary Ninja. Of course, this process won’t affect the binary in any way; it will just make it easier for us to find our way around during static analysis.
We have all the information we need:
- The address of the API resolution function
- The desired hash
- The variable where the API address will be stored
The first step is to retrieve the list of all cross-references to the resolution function:
|
>>> refs = list(bv.get_code_refs(0x4012c2))  # list() so refs can be iterated again later
>>> for ref in refs:
...     print(ref.mlil)
...
eax_1 = 0x4012c2(var_24, var_20, 0, 0xe50cd451, 0x402030, 0x402076)
eax_2 = 0x4012c2(var_3c, var_38, 0, 0xb0d89fb2, 0x402028, 0x402070)
eax_3 = 0x4012c2(var_54, var_50, 0, 0xd37bdaeb, 0x402008, 0x402058)
eax_4 = 0x4012c2(var_24_1, var_20_1, 0, 0xf4b15f66, 0x402040, 0x402082)
eax_5 = 0x4012c2(var_3c_1, var_38_1, 0, 0x8c795ddf, 0x402018, 0x402064)
eax_6 = 0x4012c2(var_54_1, var_50_1, 0, 0xc5509c94, 0x402010, 0x40205e)
eax_7 = 0x4012c2(var_24_2, var_20_2, 0, 0xae46d1e4, 0x402020, 0x40206a)
eax_8 = 0x4012c2(var_3c_2, var_38_2, 0, 0xfd06b77e, 0x402048, 0x402088)
eax_9 = 0x4012c2(var_54_2, var_50_2, 0, 0x2d7fdd26, 0x402038, 0x40207c)
eax_10 = 0x4012c2(var_24_3, var_20_3, 0, 0x530c1aee, 0x402050, 0x40208e)
eax_1 = 0x4012c2(nullptr, var_24_1, var_20_1, 0xe50cd451, 0x4020c8, 0x40210e)
eax_2 = 0x4012c2(nullptr, var_3c_1, var_38_1, 0xb0d89fb2, 0x4020c0, 0x402108)
eax_3 = 0x4012c2(nullptr, var_54_1, var_50_1, 0xd37bdaeb, 0x4020a0, 0x4020f0)
eax_4 = 0x4012c2(nullptr, var_24_2, var_20_2, 0xf4b15f66, 0x4020d8, 0x40211a)
eax_5 = 0x4012c2(nullptr, var_3c_2, var_38_2, 0x8c795ddf, 0x4020b0, 0x4020fc)
eax_6 = 0x4012c2(nullptr, var_54_2, var_50_2, 0xc5509c94, 0x4020a8, 0x4020f6)
eax_7 = 0x4012c2(nullptr, var_24_3, var_20_3, 0xae46d1e4, 0x4020b8, 0x402102)
eax_8 = 0x4012c2(nullptr, var_3c_3, var_38_3, 0xfd06b77e, 0x4020e0, 0x402120)
eax_9 = 0x4012c2(nullptr, var_54_3, var_50_3, 0x2d7fdd26, 0x4020d0, 0x402114)
eax_10 = 0x4012c2(nullptr, var_24_4, var_20_4, 0x530c1aee, 0x4020e8, 0x402126)
|
Where Binary Ninja proves more efficient than IDA is that we can work directly on the IL (Intermediate Language), where the code looks more like C, making it easier to retrieve function arguments, for example.
Here’s how to retrieve the hash to be resolved for each call to the function:
|
>>> for ref in refs:
...     print(ref.mlil.params[3])
...
0xe50cd451
0xb0d89fb2
0xd37bdaeb
0xf4b15f66
0x8c795ddf
0xc5509c94
0xae46d1e4
0xfd06b77e
0x2d7fdd26
0x530c1aee
0xe50cd451
0xb0d89fb2
0xd37bdaeb
0xf4b15f66
0x8c795ddf
0xc5509c94
0xae46d1e4
0xfd06b77e
0x2d7fdd26
0x530c1aee
|
Finally, here’s how to retrieve the variable where the function address will be stored:
|
>>> for ref in refs:
...     print(ref.mlil.params[4].constant)
...
4202544
4202536
4202504
4202560
4202520
4202512
4202528
4202568
4202552
4202576
4202696
4202688
4202656
4202712
4202672
4202664
4202680
4202720
4202704
4202728
|
If we put all this together in a Python script:
|
from binaryninja import log_info
import json

ADDRESS = 0x4012c2  # address of the API resolution function

# Load the lookup table from the JSON file
with open('/Users/lordtmk/Malwares/IcedID/hash_apis.json', 'r') as f:
    lookup = json.load(f)

# `bv` is the current BinaryView provided by the Binary Ninja scripting console
refs = list(bv.get_code_refs(ADDRESS))
log_info("Starting...")
for ref in refs:
    # Mask to 32 bits so the text matches the hex() strings stored in the lookup table
    hash_value = hex(ref.mlil.params[3].constant & 0xFFFFFFFF)
    address_pointer = ref.mlil.params[4].constant
    # Find the lookup table entry for the current hash
    key = [x for x in lookup if x["icedid_hash"] == hash_value]
    if not key:
        continue  # hash not in the table (API probably exported by another DLL)
    log_info(f"Found api {key[0]['api']} for this hash !")
    # Rename the variable after the resolved API
    bv.define_data_var(address_pointer, "void*", key[0]['api'])
|
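The list comprehension in the script rescans the whole table for every call site. That is fine for twenty references, but indexing the table by hash once is a cheap alternative. A sketch, assuming the JSON layout shown earlier; `build_index` is an illustrative name and the two entries are copied from the lookup table excerpt.

```python
import json

def build_index(lookup):
    """Map each hash string to its API name for constant-time lookups."""
    return {entry["icedid_hash"]: entry["api"] for entry in lookup}

# Two entries copied from the lookup table excerpt
sample = json.loads(
    '[{"api": "EtwEventWrite", "icedid_hash": "0x2007d3b8"},'
    ' {"api": "EtwEventWriteEx", "icedid_hash": "0x1458ec56"}]'
)
index = build_index(sample)
print(index.get("0x2007d3b8"))  # EtwEventWrite
print(index.get("0xdeadbeef"))  # None: hash not in the table
```

Using `.get` also makes the "hash not found" case explicit instead of raising an IndexError.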
Here’s the result after using the script:
|
[Python Console] Running script from file: /Users/lordtmk/Malwares/IcedID/decode_api_hashing.py
[Default] Starting...
[Default] Found api LdrGetProcedureAddress for this hash !
[Default] Found api LdrLoadDll for this hash !
[Default] Found api NtAllocateVirtualMemory for this hash !
[Default] Found api NtCreateUserProcess for this hash !
[Default] Found api NtProtectVirtualMemory for this hash !
[Default] Found api NtWriteVirtualMemory for this hash !
[Default] Found api NtWaitForSingleObject for this hash !
[Default] Found api RtlDecompressBuffer for this hash !
[Default] Found api RtlExitUserProcess for this hash !
[Default] Found api NtFlushInstructionCache for this hash !
[Default] Found api LdrGetProcedureAddress for this hash !
[Default] Found api LdrLoadDll for this hash !
[Default] Found api NtAllocateVirtualMemory for this hash !
[Default] Found api NtCreateUserProcess for this hash !
[Default] Found api NtProtectVirtualMemory for this hash !
[Default] Found api NtWriteVirtualMemory for this hash !
[Default] Found api NtWaitForSingleObject for this hash !
[Default] Found api RtlDecompressBuffer for this hash !
[Default] Found api RtlExitUserProcess for this hash !
[Default] Found api NtFlushInstructionCache for this hash !
[Analysis] Analysis update took 0.025 seconds
|
|
{
    HINSTANCE eax = GetModuleHandleA("NTDLL.DLL");
    if (eax != 0)
    {
        data_402000 = 1;
        int32_t esi_3 = (((mw_resolve_api_by_hash(eax, eax, 0, 0xe50cd451, &LdrGetProcedureAddress, 0x402076) | mw_resolve_api_by_hash(eax, eax, 0, 0xb0d89fb2, &LdrLoadDll, 0x402070)) | mw_resolve_api_by_hash(eax, eax, 0, 0xd37bdaeb, &NtAllocateVirtualMemory, 0x402058)) | mw_resolve_api_by_hash(eax, eax, 0, 0xf4b15f66, &NtCreateUserProcess, 0x402082));
        int32_t esi_7 = ((((esi_3 | mw_resolve_api_by_hash(eax, eax, 0, 0x8c795ddf, &NtProtectVirtualMemory, 0x402064)) | mw_resolve_api_by_hash(eax, eax, 0, 0xc5509c94, &NtWriteVirtualMemory, 0x40205e)) | mw_resolve_api_by_hash(eax, eax, 0, 0xae46d1e4, &NtWaitForSingleObject, 0x40206a)) | mw_resolve_api_by_hash(eax, eax, 0, 0xfd06b77e, &RtlDecompressBuffer, 0x402088));
        int32_t esi_8 = (esi_7 | mw_resolve_api_by_hash(eax, eax, 0, 0x2d7fdd26, &RtlExitUserProcess, 0x40207c));
        int32_t eax_11 = (mw_resolve_api_by_hash(eax, eax, 0, 0x530c1aee, &NtFlushInstructionCache, 0x40208e) | esi_8);
        int32_t eax_12 = (-eax_11);
        return ((eax_12 - eax_12) + 1);
    }
    return eax;
}
|
The APIs have been correctly resolved, making it easier to understand the malware’s capabilities.
Conclusion
This concludes my very first article using Binary Ninja. There’s still a lot to discover about it, and its API gives me plenty of ideas for the future.
For those who want to try out Binary Ninja, there’s a fairly complete demo version.