Solving API Hashing with BinaryNinja

Introduction

I recently abandoned IDA Pro, a professional disassembler, for Binary Ninja. Binary Ninja (or Binja) is a complete disassembler, offering more or less the same capabilities as IDA Pro.

One of the strengths of this software, apart from the fact that the basic license includes support for all common architectures (x86, x64, ARM), a debugger and a decompiler, lies in its API. It can be used to automate many aspects of a malware analyst’s workflow, interacting not only with the binary in static mode (which goes without saying), but also with the debugger, enabling the creation of automated unpackers, for example.

To learn how to use this software and its API effectively, what better way than to analyze a sample of IcedID, a trojan that’s particularly in vogue at the moment, and see what Binja has to offer?

Unpacking

Initially, I thought I’d do a complete analysis on this sample. IcedID is back from the dead and I thought I’d see what had changed.

But I came across a buggy sample. I unpacked it manually the first time, before sending the sample to Unpac.me to save time.

To make a long story short, here’s how the unpacking of this sample goes:

The sample decrypts a shellcode and executes it.
The shellcode overwrites the sections of the original malware, replacing the original code entirely.
The malware then hooks ZwCreateProcess with a custom function.
A call to CreateProcessW is made to launch svchost.exe.
The hooked function creates the svchost.exe process in suspended mode, then injects code into it.
svchost.exe decrypts the next stage and creates a scheduled task that executes the binary at the next user login.

It will repeat this process several times, each time using the same technique, but removing more and more obfuscation from the binaries. This technique works very well, as it prevents a sandbox from gaining access to the binary’s “plaintext” code, as several user logins are required.

So why couldn’t I continue the analysis?

00401c79  push    dword [esp+0x4 {size}] {var_4}
00401c7d  push    0x8 {var_8}
00401c7f  call    dword [GetProcessHeap]
00401c85  push    eax {var_c}
00401c86  call    dword [HeapAlloc]
00401c8c  retn     {__return_addr}

This function is very simple: it allocates memory on the process heap. At some point during execution, however, HeapAlloc starts returning 0, so the program can no longer allocate memory and the process described above cannot continue.

If we refer to the Joe Sandbox analysis, we can see that no domain or IP was contacted and that no behavior beyond the unpacking itself took place, which leads me to believe that this specific sample is possibly buggy.

Still, the time spent analyzing it was not wasted.

API Hashing

This sample uses API Hashing, a technique widely used by malware to obscure its operation from antivirus software or analysts.

As a reminder, a Windows API is a function provided by Microsoft to interact with the operating system. For example, CreateThread is the documented Win32 API for creating a thread in the calling process.

Instead of calling Windows APIs in the normal way (which would make them appear in the import table and thus be usable for detection purposes), malware uses a technique that resolves the addresses of these APIs dynamically.

APIs are functions exported by Windows DLLs (kernel32.dll and ntdll.dll, to name only the most common), so it is possible to retrieve their addresses and use them dynamically without having to import them.

Here’s how API hashing works:

During the malware development phase, the author chooses a hashing algorithm and hashes the names of the various APIs they wish to use (e.g. CreateProcessW becomes 5c856c47 using CRC32).
At runtime, the malware first retrieves the base address of the DLL containing the functions to be imported. This can be done using the GetModuleHandle('dll_name') API or by walking the PEB.
Next, the program walks the list of functions exported by the DLL, retrieving their names and addresses.
The malware hashes each exported name with the same algorithm that was used to precompute the hashes, then compares the result with the hash it is looking for.
If they match, the function address is returned to the program or stored in a variable.
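
The steps above can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not code from the sample: the export table is simulated with a dictionary, the addresses are invented, and CRC32 (through the standard zlib module) stands in for whatever algorithm a given author picks.

```python
import zlib

def crc32_hash(name: str) -> int:
    """Stand-in hash function; real samples often use custom algorithms."""
    return zlib.crc32(name.encode("ascii")) & 0xFFFFFFFF

def resolve_by_hash(exports: dict, wanted: int, hash_fn=crc32_hash):
    """Walk the export names, hash each one, return the address on a match."""
    for name, address in exports.items():
        if hash_fn(name) == wanted:
            return address
    return None  # no export matches the hash

# Hypothetical export table: the names are real APIs, the addresses are invented.
fake_exports = {
    "CreateFileW": 0x7FF00010,
    "CreateProcessW": 0x7FF00020,
    "CreateThread": 0x7FF00030,
}

# The malware only embeds this constant, never the string "CreateProcessW".
wanted_hash = crc32_hash("CreateProcessW")
resolved = resolve_by_hash(fake_exports, wanted_hash)
print(hex(wanted_hash), hex(resolved))
```

The point to notice is that the API name never appears in the lookup itself: only the precomputed constant does, which is why strings and import-table based detection miss it.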

How to reverse an API hashing function efficiently?

Here’s how API hashing is used in this sample:

{
    HINSTANCE eax = GetModuleHandleA("NTDLL.DLL");
    if (eax != 0)
    {
        data_402000 = 1;
        int32_t esi_3 = (((sub_4012c2(eax, eax, 0, 0xe50cd451, &data_402030, 0x402076) | sub_4012c2(eax, eax, 0, 0xb0d89fb2, &data_402028, 0x402070)) | sub_4012c2(eax, eax, 0, 0xd37bdaeb, &data_402008, 0x402058)) | sub_4012c2(eax, eax, 0, 0xf4b15f66, &data_402040, 0x402082));
        int32_t esi_7 = ((((esi_3 | sub_4012c2(eax, eax, 0, 0x8c795ddf, &data_402018, 0x402064)) | sub_4012c2(eax, eax, 0, 0xc5509c94, &data_402010, 0x40205e)) | sub_4012c2(eax, eax, 0, 0xae46d1e4, &data_402020, 0x40206a)) | sub_4012c2(eax, eax, 0, 0xfd06b77e, &data_402048, 0x402088));
        int32_t esi_8 = (esi_7 | sub_4012c2(eax, eax, 0, 0x2d7fdd26, &data_402038, 0x40207c));
        int32_t eax_11 = (sub_4012c2(eax, eax, 0, 0x530c1aee, &data_402050, 0x40208e) | esi_8);
        int32_t eax_12 = (-eax_11);
        return ((eax_12 - eax_12) + 1);
    }
    return eax;
}

First, we can see that the address of the NTDLL.dll library is retrieved.

Then come several calls to the same function, with fairly clear arguments: the first two are the NTDLL handle, the fourth is the hash to resolve, and the fifth is the variable where the address of the resolved API will be stored.

When you identify a function with similar arguments, check for cross-references. If it is called many times, it may well be an API hashing function.
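
This cross-reference check can be automated. The sketch below ranks functions by the number of code references to their entry point. It relies only on `bv.functions`, `func.start`, `func.name` and `bv.get_code_refs`; the function name `rank_by_xrefs` and the threshold of 10 are arbitrary choices for illustration, not something taken from the sample.

```python
def rank_by_xrefs(bv, min_refs=10):
    """Return (ref_count, start, name) tuples for heavily referenced functions."""
    ranked = []
    for func in bv.functions:
        # get_code_refs yields one reference per call site, so count them lazily
        count = sum(1 for _ in bv.get_code_refs(func.start))
        if count >= min_refs:
            ranked.append((count, func.start, func.name))
    # Most referenced functions first
    return sorted(ranked, reverse=True)

# In the Binary Ninja Python console, where `bv` is the current BinaryView:
# for count, start, name in rank_by_xrefs(bv):
#     print(f"{name} @ {start:#x}: {count} references")
```

A high count alone proves nothing (allocation wrappers and string helpers are also called often), so each candidate still needs a look at its arguments.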

Find the hashing algorithm

The most important part will be to find the hashing algorithm used. This is often a common algorithm, and tools like capa may be able to detect it. But sometimes, authors can modify already known algorithms or create one from scratch.

That’s why it’s useful to use a debugger once you’ve found the hashing function, which makes the task easier.

In my case, the function used to hash the API name is ROR13 (each step rotates the running hash right by 13 bits, then adds the next character), and the result is XORed with a value hardcoded in the binary before the comparison:

while (true)
    {
        int32_t eax;
        eax = *esi;
        if (eax == 0)
        {
            break;
        }
        ecx = ((RORD(ecx, 0xd)) + eax);
        esi = &esi[1];
    }
...
if (*eax_1 != (ecx ^ 0x401056))

Replicating the hashing algorithm and creating a lookup table

Once I understood the algorithm, I replicated its operation in Python and created a lookup table. This will contain all the names of the APIs of the chosen DLL (here NTDLL) and their corresponding hash:

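import json

XOR_KEY = 0x401056  # value hardcoded in the binary

def ror(x, n, bits=32):
    """Rotate the bits-wide value x right by n positions."""
    mask = (1 << bits) - 1
    x &= mask
    return ((x >> n) | (x << (bits - n))) & mask

def hash_api(api):
    h = 0
    for c in api:
        # Same as the loop in the sample: rotate right by 13, then add the character.
        # The mask keeps the sum on 32 bits, as the ecx register does.
        h = (ror(h, 13) + ord(c)) & 0xFFFFFFFF
    return hex(h ^ XOR_KEY)

apis = []

# One API name per line
with open("NTDLL_apis.txt", "r") as f:
    for line in f:
        name = line.strip()
        if name:
            apis.append({"api": name, "icedid_hash": hash_api(name)})

# "w" rather than "a": appending on a second run would produce invalid JSON
with open("hash_apis.json", "w") as f:
    json.dump(apis, f)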

This script goes through a list of APIs, extracted with a small program I wrote in C (it is also possible, and much simpler, to do this in Python), and hashes each of them before producing a JSON file that will serve as our lookup table:

...
{"api": "EtwEnumerateProcessRegGuids", "icedid_hash": "0x91246a5e"}, 
{"api": "EtwEventActivityIdControl", "icedid_hash": "0xd5072de7"}, 
{"api": "EtwEventEnabled", "icedid_hash": "0xf0906dff"}, 
{"api": "EtwEventProviderEnabled", "icedid_hash": "0xb51b3f6e"}, 
{"api": "EtwEventRegister", "icedid_hash": "0xfb73ec0e"}, 
{"api": "EtwEventSetInformation", "icedid_hash": "0x4f3be0bb"}, 
{"api": "EtwEventUnregister", "icedid_hash": "0xf55690dc"}, 
{"api": "EtwEventWrite", "icedid_hash": "0x2007d3b8"}, 
{"api": "EtwEventWriteEndScenario", "icedid_hash": "0x72faa875"}, 
{"api": "EtwEventWriteEx", "icedid_hash": "0x1458ec56"}, 
{"api": "EtwEventWriteFull", "icedid_hash": "0xbdeefe7"}, 
{"api": "EtwEventWriteNoRegistration", "icedid_hash": "0xdd452084"}
...
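
As for the input list itself, the export names can also be dumped from Python. The sketch below is not the C program used for this article: it assumes the third-party `pefile` package is installed, and both `export_names` and `dump_exports` are hypothetical helper names. It writes one name per line, which is the layout the hashing script reads.

```python
def export_names(symbols):
    """Decode export names, skipping ordinal-only exports (their name is None)."""
    return [s.name.decode("ascii") for s in symbols if s.name]

def dump_exports(dll_path, out_path):
    """Write one exported function name per line to out_path."""
    import pefile  # third-party package, assumed installed: pip install pefile

    pe = pefile.PE(dll_path)
    names = export_names(pe.DIRECTORY_ENTRY_EXPORT.symbols)
    with open(out_path, "w") as f:
        f.write("\n".join(names) + "\n")
    return names

# Example call (the DLL path is an assumption, adjust it to your system):
# dump_exports(r"C:\Windows\System32\ntdll.dll", "NTDLL_apis.txt")
```

Exports without a name are skipped because a name-based hash can never match them.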

Automation

Now that we have the lookup table, it’s time to resolve these APIs in Binary Ninja. Of course, this process doesn’t modify the binary in any way; it only makes it easier for us to find our way around during static analysis.

We have all the information we need:

The address of the API resolution function
The desired hash
The variable where the API address will be stored

The first step is to retrieve the list of all cross-references to the resolution function:

>>> refs = bv.get_code_refs(0x4012c2)
>>> for ref in refs:
... 	print(ref.mlil)
... 
eax_1 = 0x4012c2(var_24, var_20, 0, 0xe50cd451, 0x402030, 0x402076)
eax_2 = 0x4012c2(var_3c, var_38, 0, 0xb0d89fb2, 0x402028, 0x402070)
eax_3 = 0x4012c2(var_54, var_50, 0, 0xd37bdaeb, 0x402008, 0x402058)
eax_4 = 0x4012c2(var_24_1, var_20_1, 0, 0xf4b15f66, 0x402040, 0x402082)
eax_5 = 0x4012c2(var_3c_1, var_38_1, 0, 0x8c795ddf, 0x402018, 0x402064)
eax_6 = 0x4012c2(var_54_1, var_50_1, 0, 0xc5509c94, 0x402010, 0x40205e)
eax_7 = 0x4012c2(var_24_2, var_20_2, 0, 0xae46d1e4, 0x402020, 0x40206a)
eax_8 = 0x4012c2(var_3c_2, var_38_2, 0, 0xfd06b77e, 0x402048, 0x402088)
eax_9 = 0x4012c2(var_54_2, var_50_2, 0, 0x2d7fdd26, 0x402038, 0x40207c)
eax_10 = 0x4012c2(var_24_3, var_20_3, 0, 0x530c1aee, 0x402050, 0x40208e)
eax_1 = 0x4012c2(nullptr, var_24_1, var_20_1, 0xe50cd451, 0x4020c8, 0x40210e)
eax_2 = 0x4012c2(nullptr, var_3c_1, var_38_1, 0xb0d89fb2, 0x4020c0, 0x402108)
eax_3 = 0x4012c2(nullptr, var_54_1, var_50_1, 0xd37bdaeb, 0x4020a0, 0x4020f0)
eax_4 = 0x4012c2(nullptr, var_24_2, var_20_2, 0xf4b15f66, 0x4020d8, 0x40211a)
eax_5 = 0x4012c2(nullptr, var_3c_2, var_38_2, 0x8c795ddf, 0x4020b0, 0x4020fc)
eax_6 = 0x4012c2(nullptr, var_54_2, var_50_2, 0xc5509c94, 0x4020a8, 0x4020f6)
eax_7 = 0x4012c2(nullptr, var_24_3, var_20_3, 0xae46d1e4, 0x4020b8, 0x402102)
eax_8 = 0x4012c2(nullptr, var_3c_3, var_38_3, 0xfd06b77e, 0x4020e0, 0x402120)
eax_9 = 0x4012c2(nullptr, var_54_3, var_50_3, 0x2d7fdd26, 0x4020d0, 0x402114)
eax_10 = 0x4012c2(nullptr, var_24_4, var_20_4, 0x530c1aee, 0x4020e8, 0x402126)

Where BinaryNinja proves more efficient than IDA is that we can interact directly in IL (Intermediate language) mode, where the code looks more like C, making it easier to retrieve function arguments, for example.

Here’s how to retrieve the hash passed in each call to the resolution function:

for ref in refs:
	print(ref.mlil.params[3])

0xe50cd451
0xb0d89fb2
0xd37bdaeb
0xf4b15f66
0x8c795ddf
0xc5509c94
0xae46d1e4
0xfd06b77e
0x2d7fdd26
0x530c1aee
0xe50cd451
0xb0d89fb2
0xd37bdaeb
0xf4b15f66
0x8c795ddf
0xc5509c94
0xae46d1e4
0xfd06b77e
0x2d7fdd26
0x530c1aee

Finally, here’s how to retrieve the variable where the function address will be stored:

for ref in refs:
	print(ref.mlil.params[4].constant)
 
4202544
4202536
4202504
4202560
4202520
4202512
4202528
4202568
4202552
4202576
4202696
4202688
4202656
4202712
4202672
4202664
4202680
4202720
4202704
4202728
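
Both lookups assume that every reference maps to an MLIL call whose fourth and fifth arguments are constants. That holds for this sample, but a slightly more defensive variant may be useful on other binaries. The helper below is a sketch: `resolver_args` is a hypothetical name, and it relies only on the `params` and `constant` attributes already used above.

```python
def resolver_args(instr, hash_index=3, dest_index=4):
    """Return (hash, destination address) for one call, or None if not recoverable."""
    params = getattr(instr, "params", None)
    if params is None or len(params) <= max(hash_index, dest_index):
        return None  # no MLIL for this reference, or not a call with enough arguments
    values = []
    for param in (params[hash_index], params[dest_index]):
        constant = getattr(param, "constant", None)
        if constant is None:
            return None  # argument is not a constant (e.g. passed through a variable)
        values.append(constant)
    return tuple(values)

# In the Binary Ninja Python console:
# for ref in bv.get_code_refs(0x4012c2):
#     print(resolver_args(ref.mlil))
```

Returning None instead of raising lets a batch run skip the odd call site and keep going.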

If we put all this together in a Python script:

import json
from binaryninja.log import log_info, log_warn

ADDRESS = 0x4012c2  # address of the API resolution function

# Load the lookup table from the JSON file, indexed by hash value
with open('/Users/lordtmk/Malwares/IcedID/hash_apis.json', 'r') as f:
    lookup = {int(x["icedid_hash"], 16): x["api"] for x in json.load(f)}

log_info("Starting...")

# bv is the BinaryView that the Binary Ninja Python console provides
for ref in bv.get_code_refs(ADDRESS):
    hash_value = ref.mlil.params[3].constant
    address_pointer = ref.mlil.params[4].constant
    api = lookup.get(hash_value)  # look up the API name for the current hash
    if api is None:
        log_warn(f"No API found for hash {hash_value:#x}")
        continue
    log_info(f"Found api {api} for this hash !")
    bv.define_data_var(address_pointer, "void*", api)  # rename the variable after the resolved API

Here’s the result after using the script:

[Python Console] Running script from file: /Users/lordtmk/Malwares/IcedID/decode_api_hashing.py
[Default] Starting...
[Default] Found api LdrGetProcedureAddress for this hash !
[Default] Found api LdrLoadDll for this hash !
[Default] Found api NtAllocateVirtualMemory for this hash !
[Default] Found api NtCreateUserProcess for this hash !
[Default] Found api NtProtectVirtualMemory for this hash !
[Default] Found api NtWriteVirtualMemory for this hash !
[Default] Found api NtWaitForSingleObject for this hash !
[Default] Found api RtlDecompressBuffer for this hash !
[Default] Found api RtlExitUserProcess for this hash !
[Default] Found api NtFlushInstructionCache for this hash !
[Default] Found api LdrGetProcedureAddress for this hash !
[Default] Found api LdrLoadDll for this hash !
[Default] Found api NtAllocateVirtualMemory for this hash !
[Default] Found api NtCreateUserProcess for this hash !
[Default] Found api NtProtectVirtualMemory for this hash !
[Default] Found api NtWriteVirtualMemory for this hash !
[Default] Found api NtWaitForSingleObject for this hash !
[Default] Found api RtlDecompressBuffer for this hash !
[Default] Found api RtlExitUserProcess for this hash !
[Default] Found api NtFlushInstructionCache for this hash !
[Analysis] Analysis update took 0.025 seconds

{
    HINSTANCE eax = GetModuleHandleA("NTDLL.DLL");
    if (eax != 0)
    {
        data_402000 = 1;
        int32_t esi_3 = (((mw_resolve_api_by_hash(eax, eax, 0, 0xe50cd451, &LdrGetProcedureAddress, 0x402076) | mw_resolve_api_by_hash(eax, eax, 0, 0xb0d89fb2, &LdrLoadDll, 0x402070)) | mw_resolve_api_by_hash(eax, eax, 0, 0xd37bdaeb, &NtAllocateVirtualMemory, 0x402058)) | mw_resolve_api_by_hash(eax, eax, 0, 0xf4b15f66, &NtCreateUserProcess, 0x402082));
        int32_t esi_7 = ((((esi_3 | mw_resolve_api_by_hash(eax, eax, 0, 0x8c795ddf, &NtProtectVirtualMemory, 0x402064)) | mw_resolve_api_by_hash(eax, eax, 0, 0xc5509c94, &NtWriteVirtualMemory, 0x40205e)) | mw_resolve_api_by_hash(eax, eax, 0, 0xae46d1e4, &NtWaitForSingleObject, 0x40206a)) | mw_resolve_api_by_hash(eax, eax, 0, 0xfd06b77e, &RtlDecompressBuffer, 0x402088));
        int32_t esi_8 = (esi_7 | mw_resolve_api_by_hash(eax, eax, 0, 0x2d7fdd26, &RtlExitUserProcess, 0x40207c));
        int32_t eax_11 = (mw_resolve_api_by_hash(eax, eax, 0, 0x530c1aee, &NtFlushInstructionCache, 0x40208e) | esi_8);
        int32_t eax_12 = (-eax_11);
        return ((eax_12 - eax_12) + 1);
    }
    return eax;
}

The APIs have been correctly resolved, making it easier to understand the malware’s capabilities.
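
One way to double-check a resolved name is to hash it again and compare the result with the constant at the call site. The sketch below re-implements the rotate-and-add loop with explicit 32-bit masking (`ror32` and `icedid_hash` are names chosen here for illustration) and checks one pair taken from the output above: 0xb0d89fb2 and LdrLoadDll.

```python
XOR_KEY = 0x401056  # constant XORed with the hash in the sample's comparison

def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    value &= 0xFFFFFFFF
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def icedid_hash(name):
    """Rotate right by 13, add the character, repeat; XOR the key at the end."""
    h = 0
    for ch in name:
        h = (ror32(h, 13) + ord(ch)) & 0xFFFFFFFF
    return h ^ XOR_KEY

# Pair taken from the resolved calls: hash constant and recovered API name
assert icedid_hash("LdrLoadDll") == 0xb0d89fb2
print("LdrLoadDll matches 0xb0d89fb2")
```

The same check can be run on any other name from the log; a mismatch would point to an error in the replicated algorithm or in the lookup table.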

Conclusion

This concludes my very first article using Binary Ninja software. There’s still a lot to discover about it, and its API gives me plenty of ideas for the future.

For those who want to try out BinaryNinja, there’s a rather complete demo version.