Reversing Malicious Code

  • Goal is to understand common malware characteristics at a code level

  • Code analysis may uncover potential branches of execution

  • Overview of the code lifecycle

  • Source code is translated into object code by a compiler

  • Object code is then combined with libraries and an executable file is created

  • To run the file, the operating system reads various information from the executable file, allocates memory, and loads required libraries into memory

  • Control is transferred to the code to execute

  • This final stage is where we examine the code with a debugger

  • Note: Libraries may also be loaded during the program's execution

Ghidra

  • Developed by NSA

  • Its decompiler produces a C representation of the code to speed up analysis

  • Includes support for writing Java and Python scripts to automate analysis

  • Help is accessed via F1 key

  • Ghidra v10 includes a debugger

Create a new project

  • File --> New Project

  • Choose the project type

  • Click Finish

  • Drag and drop the specimen into the project window

  • Accept the defaults in the Import window and click OK

Launch the code browser and begin the auto-analysis

  • Make sure to enable WindowsPE x86 Propagate External Parameters option

  • Finally click the Analyze button and wait for Ghidra to finish

  • Once auto analysis is completed an Auto Analysis summary will show any warnings or issues encountered during the process

  • A common warning is that the file does not contain debug information

  • This is common and not an issue

Before proceeding, save the project and take a snapshot

Ghidra Overview

  • Main window is the Listing View, which presents the target program's code and data

  • Ghidra initially brings you to the beginning of the file in the Listing View --> notice the MZ string

  • If you scroll down from there you can examine the program's header
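
The MZ string and the header that follows it can also be checked outside Ghidra. The sketch below is a minimal illustration, assuming the standard PE layout in which the 4-byte field at offset 0x3C (e_lfanew) holds the file offset of the PE signature. The function name and the synthetic buffer are made up for this example and do not come from a real specimen.

```python
import struct

def find_pe_header(image: bytes) -> int:
    """Return the file offset of the PE signature, or raise ValueError."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: 4-byte little-endian field at offset 0x3C that holds
    # the file offset of the PE header.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    return e_lfanew

# Synthetic 0x100-byte buffer standing in for a specimen (hypothetical data).
fake = bytearray(0x100)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x80)
fake[0x80:0x84] = b"PE\x00\x00"

pe_offset = find_pe_header(bytes(fake))
print(hex(pe_offset))
```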

Program Tree

  • Window is in the top left and shows the different sections and headers

  • Section names are typically:

.text - Contains executable code
.rdata - Contains read-only data
.data - Contains data 
.reloc - Contains relocation data to fix up addresses in the file if it is not loaded at the preferred address

FUN in Ghidra

  • In Ghidra the FUN_ prefix generically refers to a function while the numeric value refers to the address where the function is loaded into memory

  • Original name of the function is normally lost during compilation

  • Execution occurs linearly one instruction after the next

  • On the far left you will have a 32 bit address such as 00401007 (hex)

  • This address represents the location of code in memory after the program is loaded, not the address of a location on disk i.e. within a file hex editor

  • On the right there are x86 assembly instructions

  • Note: This is the beginning of the .text section, not the beginning of the program; execution begins at the entry point
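
To relate an in-memory address such as 00401007 to a position in a hex editor, the address has to be translated through the section table. The sketch below shows the arithmetic only. The image base, section addresses, and raw pointers are assumed values chosen for illustration, not values read from a real file.

```python
from typing import NamedTuple

class Section(NamedTuple):
    name: str
    virtual_address: int  # RVA where the section is mapped in memory
    virtual_size: int
    raw_pointer: int      # file offset of the section's data on disk

# Assumed layout for illustration only.
IMAGE_BASE = 0x400000
SECTIONS = [
    Section(".text", 0x1000, 0x2000, 0x400),
    Section(".rdata", 0x3000, 0x1000, 0x2400),
]

def va_to_file_offset(va: int) -> int:
    """Translate a virtual address into an offset within the file on disk."""
    rva = va - IMAGE_BASE
    for section in SECTIONS:
        start = section.virtual_address
        if start <= rva < start + section.virtual_size:
            return section.raw_pointer + (rva - start)
    raise ValueError(f"address {va:#x} is not inside any section")

print(hex(va_to_file_offset(0x401007)))
```

With these assumed values, the function at 00401007 sits 7 bytes into .text, so it lands 7 bytes past the section's raw pointer in the file.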

Function graph view provides a visual perspective on code

  • Click on the function you want i.e. FUN_00401007

  • Browse to Window --> Function Graph menu item

  • Helpful for visualizing loops and complex conditionals within a function, but the Listing view is more compact and easier for some people to navigate

  • The colors of the arrows indicate code flow

  • If the code block ends in a conditional jump, green arrows indicate the path where execution will continue if the condition is met

  • If the condition is not met a red arrow will show where execution continues

  • If the arrow is blue the code ends in an unconditional jump

  • View Imports to review a program's external dependencies

  • The import address table (IAT) helps direct code analysis

  • You can view imports in the Symbol Tree window but we will access this information via Window --> Symbol References

  • Filter symbols by "Imported" to focus on dependencies

Look for API call patterns associated with malware behavior

  • We can examine imports to identify potential functionality associated with common malware characteristics

  • Learn more about an API call at microsoft.com

  • Types of API Calls:

A --> (ANSI)
W --> (Wide)
Ex --> (Extended)
  • ANSI means the function takes strings of 8 bit characters

  • Wide refers to a two byte character representation (UTF-16)

  • Extended is when MSFT updates a function and the new function is not compatible with the old one
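
The suffix rules above can be applied mechanically when triaging an import list. The helper below is a rough heuristic sketch: its name is made up, and an API whose real name happens to end in a capital A or W would be misread, so treat the result as a hint rather than a fact.

```python
def describe_api(name: str) -> dict:
    """Heuristically split a Win32 API name into base name and suffixes."""
    charset = None
    base = name
    # A trailing A or W selects the ANSI or Wide variant.
    if base.endswith("A"):
        charset, base = "ANSI", base[:-1]
    elif base.endswith("W"):
        charset, base = "Wide", base[:-1]
    # Ex marks an extended version of an older function.
    extended = base.endswith("Ex")
    if extended:
        base = base[:-2]
    return {"base": base, "extended": extended, "charset": charset}

for api in ("RegOpenKeyExA", "GetTempFileNameW", "InternetReadFile"):
    print(api, describe_api(api))
```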

  • Instructions reference registers, immediate values and memory

  • Instructions have two components: operation and operand

  • Instructions can have 0-3 operands

  • An Operand can be:

A register
A memory location 
An immediate value (e.g. 0x6453)
  • Consider MOV EAX, 0x6453

  • EAX is the destination (first)

  • 0x6453 is the source (second)

  • You are setting EAX to the value 0x6453

  • Operands may be implied

Intel processors use registers to track the state of computation as instructions are executed

  • Registers are on chip memory locations

  • Instructions act on registers and memory locations

  • A CPU has a series of registers

Some registers are general purpose
Some have a particular use
Some are both
  • We monitor registers to track arguments, variables, and function return values

  • The x86 architecture uses the following general purpose registers to hold code and data

EAX --> Used for addition, multiplication, and return values
ECX --> Used as a counter 
EBP --> Used to reference arguments and local variables
ESP --> Points to the last item on the stack 
ESI/EDI --> Used by memory transfer instructions

Special use registers hold flags and track program execution

  • EIP points to the next instruction to execute

  • EFLAGS bits represent the outcome of computations and control CPU operations

Segment registers include:

CS - Code segment
DS - Data segment 
SS - Stack segment 
  • 32 bit registers can also be accessed as 16 and 8 bit registers

  • On 32 bit arch, registers can be accessed by their default dword size

  • To access a registers lower 16 bits the leading E is omitted from the name e.g. EAX becomes AX

  • The naming scheme for EAX, EBX, ECX, and EDX is as follows

  • E<letter>X --> dword 32 bit value of the register

  • <letter>X --> lower word 16 bit value of the register

  • <letter>H --> high byte 8 bit of the <letter>X value of the register

  • <letter>L --> low byte 8 bit of the <letter>X value of the register

EAX means 32 bits 
AX means the low 16 bit value 
AH means the high 8 bits of AX
AL means the low 8 bits of AX
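
The relationship between EAX, AX, AH, and AL is just masking and shifting, which the short sketch below makes concrete. The function name and the sample value 0x12345678 are arbitrary choices for illustration.

```python
def sub_registers(eax: int) -> dict:
    """Return the AX, AH, and AL views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF            # keep the value within 32 bits
    return {
        "EAX": eax,
        "AX": eax & 0xFFFF,      # low 16 bits
        "AH": (eax >> 8) & 0xFF, # high byte of AX
        "AL": eax & 0xFF,        # low byte of AX
    }

views = sub_registers(0x12345678)
print({name: hex(value) for name, value in views.items()})
```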
  • The lengths of a word, dword, and qword are 16, 32, and 64 bits

  • Strictly, a word is the natural size for a unit of data on a given processor

  • A 16 bit processor has 16-bit words

  • Many tools consider a word to be 16 bits regardless of processor size

  • Additional common data sizes:

8 bits --> 1 byte 
32 bits --> dword 
64 bits --> qword
  • The operand for one push instruction is a pointer to a string

  • A pointer is a variable that holds a memory address (it points to a memory location)

  • When the address that the pointer points to is accessed it is called dereferencing because the pointer references another location in memory

  • Pointers are efficient: rather than copying a data structure around in memory, it is cheaper to copy the value of a pointer (4 bytes on 32 bit systems)

  • A PUSH instruction before a CALL often represents arguments passed to the function specified by the CALL

Memory can be accessed directly by many assembly instructions

  • Example:

MOV EAX, [0x410230]
  • Brackets mean fetch data at the specified address (dereference)

  • This is direct addressing because we are dereferencing an immediate value

  • The result is that 4 bytes of data at 0x410230 will be moved to EAX

  • Some tools like IDA omit brackets for direct addresses (IDA: dword_410230)
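
To make the dereference concrete, the sketch below models a small region of memory as a byte array and reads 4 bytes at 0x410230 the way MOV EAX, [0x410230] would. The region base, the stored bytes, and the helper name are assumptions for illustration. x86 stores multi-byte values in little-endian order, which is why the bytes appear reversed in memory.

```python
import struct

REGION_BASE = 0x410000          # assumed start of the modeled memory region
memory = bytearray(0x1000)

# Bytes as they would appear in a hex dump at 0x410230 (made-up contents).
memory[0x230:0x234] = bytes.fromhex("78563412")

def read_dword(address: int) -> int:
    """Dereference an address: fetch 4 little-endian bytes from memory."""
    (value,) = struct.unpack_from("<I", memory, address - REGION_BASE)
    return value

eax = read_dword(0x410230)      # models MOV EAX, [0x410230]
print(hex(eax))
```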

  • Memory may also be addressed indirectly

  • The address may be calculated or in a register

  • This is called an Effective Address and it enables us to work efficiently with data structures

  • Format: Base + (Index * Scale) + Displacement

Base:         EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI
Index:        EAX, EBX, ECX, EDX, EBP, ESI, EDI
Scale:        1, 2, 4, 8
Displacement: none, 8 bit value, 16 bit value, 32 bit value
  • Indirect Referencing: address of the destination is calculated or it resides in a register. The calculated address is called the effective address (EA)

  • Even when the address sits in a register, this differs from register addressing, where the register itself is the destination

  • In indirect memory addressing the register holds the address of the destination.

  • Large advantage of indirect memory addressing is the capability to efficiently work with data structures

  • You can increment the value of a single register to step through fields of a data structure or the same field of an array of data structures

  • If a scale is used, an index register must also be used

  • Examples of indirectly addressing memory

  • [EAX] : Access dynamically allocated memory (base)

  • [EBP + 0x10] : Access data on the stack (base + displacement)

  • [EAX + EBX * 8] : Access an array of 8-byte structures (base + index * scale)

  • [EAX + EBX + 0xC] : Access fields of a two dimensional array of structures (base + index + displacement)

  • Indirect memory addressing may pose challenges for static code analysis because registers are not populated until runtime
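
The effective address formula is simple enough to compute by hand during analysis once register values are known, for example from a debugger. The sketch below evaluates Base + (Index * Scale) + Displacement with 32-bit wraparound. The function name and register values are hypothetical.

```python
def effective_address(base: int = 0, index: int = 0,
                      scale: int = 1, displacement: int = 0) -> int:
    """Compute Base + (Index * Scale) + Displacement as a 32-bit address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF

# Hypothetical register values, as might be observed at runtime.
ebp, eax, ebx = 0x0019FF40, 0x00500000, 3

print(hex(effective_address(base=ebp, displacement=0x10)))            # [EBP + 0x10]
print(hex(effective_address(base=eax, index=ebx, scale=8)))           # [EAX + EBX * 8]
print(hex(effective_address(base=eax, index=ebx, displacement=0xC)))  # [EAX + EBX + 0xC]
```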

  • Strings are an example of a data structure

  • Data structures group simple variables into more complex types

  • Examples of data structures include: strings, linked lists, sockets, and file handles

  • When reversing determine the type of data structure by usage

  • Data structures enable us to group bytes and advance our understanding of the code

Code vs Data

  • Context determines the answer

  • RegOpenKeyExA Example

  • The API call takes a symbolic constant as an argument, which appears in the disassembly as a number e.g. PUSH 0x80000001

  • During compilation the symbolic constant is replaced with its hex value

  • Right click the hex value, choose Set Equate and then choose HKEY_CURRENT_USER to change it back to the symbolic constant

  • Will bring clarity to the code
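
Set Equate is essentially a reverse lookup from a number to its symbolic name. The sketch below shows that lookup for the predefined registry root keys. The dictionary and function names are made up, and only four well-known roots are included.

```python
# Predefined registry root key values from the Windows headers.
HKEY_EQUATES = {
    0x80000000: "HKEY_CLASSES_ROOT",
    0x80000001: "HKEY_CURRENT_USER",
    0x80000002: "HKEY_LOCAL_MACHINE",
    0x80000003: "HKEY_USERS",
}

def equate_for(value: int) -> str:
    """Return the symbolic name for a pushed constant, or its hex form."""
    return HKEY_EQUATES.get(value, hex(value))

print(equate_for(0x80000001))  # the PUSH operand from the RegOpenKeyExA example
```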

Branch instructions direct code execution to another location

  • The flow of execution i.e. control flow is sequential until a branching instruction is reached

  • Then the EIP is updated and execution is transferred to another location in memory

  • The code under review contains two types of jumps

  • Jumps are an example of a branching instruction

  • Unconditional jumps always transfer control: JMP, CALL, RET

  • Conditional jumps only jump if a condition is met: Jcc, LOOP

  • Conditional jump represents a decision point

  • Conditional jumps require that we review multiple instructions

  • To evaluate whether a condition is true, arithmetic and Boolean instructions are used

  • sub ecx, 8 tests whether ECX is equal to 8 (the result is zero when it is)

  • and eax, eax tests whether EAX is equal to zero

  • If the result is zero then the ZF bit is set in the flags register
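
A small model of the two instructions shows why they work as equality and zero tests. The sketch below returns the 32-bit result together with the zero flag. The function names are made up, and only ZF is modeled; the real instructions update other flags as well.

```python
MASK32 = 0xFFFFFFFF

def sub(dest: int, src: int) -> tuple:
    """Model SUB dest, src: return (result, ZF)."""
    result = (dest - src) & MASK32
    return result, result == 0

def and_(dest: int, src: int) -> tuple:
    """Model AND dest, src: return (result, ZF)."""
    result = dest & src & MASK32
    return result, result == 0

print(sub(8, 8))    # ECX == 8, so the result is zero and ZF is set
print(sub(9, 8))    # ECX != 8, so ZF is clear
print(and_(0, 0))   # EAX == 0, so ZF is set
print(and_(5, 5))   # EAX != 0, so ZF is clear
```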

Jumps

  • A Jcc instruction jumps only if its condition is met

  • Form: Jcc, where cc is a condition code built from the letters below

A --> jump if above
B --> jump if below
E --> jump if equal
G --> jump if greater
L --> jump if less than
Z --> jump if zero
N --> jump if not (inverts the condition), e.g. JNZ is jump if not zero
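
Each condition code corresponds to a test on the flags register. The sketch below encodes the flag tests for the mnemonics listed above, following the Intel definitions: above and below are the unsigned comparisons (CF, ZF), while greater and less are the signed ones (SF, OF, ZF). The function name is made up and only a handful of mnemonics are covered.

```python
def jcc_taken(mnemonic: str, zf: bool = False, cf: bool = False,
              sf: bool = False, of: bool = False) -> bool:
    """Return True if the given conditional jump would be taken."""
    conditions = {
        "JA": not cf and not zf,   # above (unsigned)
        "JB": cf,                  # below (unsigned)
        "JE": zf,                  # equal
        "JZ": zf,                  # zero (same test as JE)
        "JG": not zf and sf == of, # greater (signed)
        "JL": sf != of,            # less (signed)
        "JNE": not zf,             # not equal
        "JNZ": not zf,             # not zero (same test as JNE)
    }
    return conditions[mnemonic.upper()]

# After "sub ecx, 8" with ECX == 8, ZF is set:
print(jcc_taken("JZ", zf=True), jcc_taken("JNZ", zf=True))
```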

Comments

  • Use the ; key to add a comment

  • Can add EOL comments, Pre, Post or other types of comments

HTTP Command and Control

  • These APIs enable HTTP C2

InternetOpen, InternetConnect --> Create an HTTP connection
HttpOpenRequest, HttpAddRequestHeaders (Optional) --> Build an HTTP request
HttpSendRequest --> Send an HTTP request
InternetReadFile --> Read a response 
  • To view the API calls

Window --> Symbol References --> Locate APIs of interest in the Symbol Table
  • The code references variables, which hold code or data not known at compile time

  • Local variables are only valid within the current function and are not preserved after it returns

  • Local variables are stored on the stack relative to ESP and EBP

  • Global variables are accessible from all functions e.g. DAT_00403374

  • Static variables can only be used from within the function that declares them, but unlike local variables their storage is not marked for reuse when the function exits

Viewing Function Call Trees

  • Window --> Function Call Tree

  • View the outgoing calls on the right side

  • View is ideal for determining which functions are called from the current function

  • Once you determine what the current function is being used for, make sure to Right Click --> Edit Label and give it a meaningful name

GetTempFileNameW

  • Creates a file name for a temp file

  • Can explore other function references to find new IOCs

  • Look for a PUSH to lpPrefixString_XXXXXX

  • MSFT documentation states the first three characters make up the temp file name prefix

  • To assist Ghidra:

Right click on the lpPrefixString --> Click Data --> TerminatedUnicode
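
Marking the data as a terminated Unicode string tells Ghidra to read two-byte characters until it reaches a two-byte null. The sketch below does the same thing by hand on a raw byte buffer. The function name is made up, and the prefix "upd" is an invented example, not a value from a real specimen.

```python
def read_terminated_unicode(data: bytes, offset: int = 0) -> str:
    """Decode UTF-16LE characters from offset up to a two-byte null."""
    end = offset
    while end + 1 < len(data) and data[end:end + 2] != b"\x00\x00":
        end += 2
    return data[offset:end].decode("utf-16-le")

# Hypothetical bytes at lpPrefixString, followed by unrelated data.
raw = "upd".encode("utf-16-le") + b"\x00\x00" + b"\xff\xff"

prefix = read_terminated_unicode(raw)
print(prefix, raw[:6].hex(" "))
```

A temp file name prefix recovered this way can serve as a host-based indicator when searching a file system for dropped files.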

Functions

  • A function is a group of instructions that performs a specific task (read, write files, send network data, log keystrokes)

  • Three Basic Components

Input: values passed in
Body: code to perform tasks
Return: value passed back
  • Calling a function involves a jump to another memory location

  • After the function is done execution continues at the instruction after the original function call

  • Calling a function involves two control transfers

  • Function format: return = function(arg0, arg1)

  • Specific events occur when calling a function

Pass in parameters (stack/register)
Save the return pointer 
Transfer control to the function
  • Specific events occur when returning from a function

Set up a return value (typically EAX)
Clean up the stack and restore registers 
Transfer control to the saved return pointer
  • Within a function, the prologue and epilogue perform setup and cleanup activities

  • Most functions contain a standard prologue and epilogue

  • The prologue occurs at the start of the function

Allocates space for variables
Saves registers that will be reused in the function body
  • Function epilogue occurs at the end of the function

It cleans up the stack e.g. POP allocated variables
It restores registers
  • The stack is a section in memory used to store saved registers, local variables and function parameters

  • The stack is LIFO Last in First out

  • PUSH adds an element and POP removes one

  • ESP points to the top item on the stack (the last one pushed) and changes with instructions like PUSH, POP, CALL, LEAVE, and RET

  • EBP a.k.a frame pointer serves as an unchanging reference

  • EBP - value = local variable (registers may also be used)

  • EBP + value = parameter

  • When EBP is set up in the function prologue in this manner, it means that when you see code reference EBP minus some value i.e. [EBP -8] it is accessing a local variable

  • When it is EBP plus some value i.e. [EBP +8] it is referencing a parameter that was passed in

  • When cleaning up the stack compilers use some tricks

  • Compilers may POP off a value i.e. POP EDX which has the result of adding four to ESP

  • It is also very common to see a value added to ESP, the use of RET (which can also pop items off the stack), and the LEAVE instruction
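
The stack frame layout is easier to follow with concrete numbers. The sketch below is a toy model, assuming the common PUSH EBP / MOV EBP, ESP / SUB ESP prologue and a single 4-byte argument that the caller removes afterwards. The class, the addresses, and the values are all invented for illustration.

```python
class Stack:
    """Toy 32-bit stack: a dict of address -> dword plus an ESP value."""

    def __init__(self, esp: int = 0x0019FF80):
        self.mem = {}
        self.esp = esp

    def push(self, value: int) -> None:
        self.esp -= 4                      # the stack grows toward lower addresses
        self.mem[self.esp] = value & 0xFFFFFFFF

    def pop(self) -> int:
        value = self.mem[self.esp]
        self.esp += 4
        return value

stack = Stack()
start_esp = stack.esp
old_ebp = 0x0019FFC0                       # caller's frame pointer (made up)

# Caller: PUSH argument, then CALL pushes the return address.
stack.push(0x41)
stack.push(0x00401234)

# Callee prologue: PUSH EBP ; MOV EBP, ESP ; SUB ESP, 4
stack.push(old_ebp)
ebp = stack.esp
stack.esp -= 4                             # room for one local variable
stack.mem[ebp - 4] = 0x99                  # local variable at [EBP - 4]

param = stack.mem[ebp + 8]                 # first parameter at [EBP + 8]
local = stack.mem[ebp - 4]

# Callee epilogue: MOV ESP, EBP ; POP EBP ; RET
stack.esp = ebp
restored_ebp = stack.pop()
return_address = stack.pop()

# Caller removes its argument: ADD ESP, 4
stack.esp += 4

print(hex(param), hex(local), hex(return_address), stack.esp == start_esp)
```

[EBP + 4] holds the saved return address, which is why the first parameter is found at [EBP + 8] rather than [EBP + 4].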

Functions are called according to calling conventions

  • The convention describes how data is passed into and out of functions

  • The implementation of the convention may vary by compiler

  • The cdecl convention (most common) has these characteristics

The arguments are placed onto the stack right to left
The return value is placed into EAX
The caller cleans up the stack (removes the arguments)
  • The stdcall convention has the following characteristics

Similar to cdecl but the callee cleans up the stack 
This is the convention used in Win32 APIs
  • Additional calling conventions include fastcall and thiscall

  • fastcall

  • Arguments are stored in registers

  • Any extra arguments are placed on the stack

  • The callee cleans up arguments on the stack

  • thiscall

  • Used in C++ code (member functions)

  • This convention includes a reference to the "this" pointer

  • For MSFT compilers, ECX holds the "this" pointer and the callee cleans up the arguments on the stack

  • For GNU compilers the "this" pointer is pushed onto the stack last and the caller cleans up

  • Reviewing strings reveals filenames and directories of interest

  • To locate a reference to a string, right click on it and choose to show references

Loops in malware

  • Used to encrypt and decrypt network traffic --> loop over each character in the string to send

  • Attempt to connect to C2 server --> loop over a list of servers

  • Perform a port scan --> try to connect to each port 1-65535

  • Log keystrokes --> Check state for each key code 0...92

  • Similar to Jcc, the cc in LOOPcc represents the condition code that must be met for the loop instruction to branch to the address specified

  • The conditions are:

Z --> Loop if zero 
E --> Loop if equal
N --> Inverts the logic of the looping condition
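
As an illustration of the first use case, the sketch below decodes a buffer one byte at a time with a counter that plays the role of ECX in a LOOP-driven routine. Single-byte XOR is an assumption made for this example; the notes do not say which algorithm a given specimen uses, and the key and sample data are invented.

```python
def xor_decode(data: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key, counting down like a LOOP would."""
    out = bytearray(data)
    ecx = len(out)        # loop counter, as ECX would be
    i = 0                 # position in the buffer, as ESI/EDI would track
    while ecx != 0:
        out[i] ^= key
        i += 1
        ecx -= 1          # LOOP decrements ECX and branches while it is nonzero
    return bytes(out)

encoded = xor_decode(b"GET /index", 0x5A)   # hypothetical traffic and key
decoded = xor_decode(encoded, 0x5A)
print(encoded.hex(), decoded)
```

Because XOR is its own inverse, the same loop both encodes and decodes, which is one reason it shows up so often in practice.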

Reviewing imports to direct our code analysis

  • The import table lists functions used to access the resource section

FindResourceW --> determine the location of a resource
SizeofResource --> obtain the size of a resource
LockResource --> obtain a pointer to a resource
  • The resource .rsrc section is often used to store information like icons, dialog boxes, and version information

  • However malware may hide executables here

  • Malware that drops files is called a dropper

CreateMutexA

  • CreateMutexA --> creates or opens a mutex object

  • Malware authors often use a mutex to avoid re-infecting a machine

Keylogging

  • GetKeyState and GetAsyncKeyState --> Determine if a particular key is pressed

  • GetWindowText --> Retrieves text from a window's title bar

  • OpenClipboard, GetClipboardData, and CloseClipboard --> Opens the clipboard for access, gathers data, and then closes the clipboard

  • Combined with GetKeyState and GetAsyncKeyState, GetWindowText lets an attacker learn which keys are pressed and what the application context is

  • GetAsyncKeyState determines if a key is currently up or down or if it was pressed since the last call to the API
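
The API groups covered in these notes can be turned into a quick triage check over an import list. The sketch below is a rough heuristic: the capability labels and function names are made up, the groupings simply mirror the APIs named in these notes, and a match only suggests where to start reading code. Plenty of benign software imports the same functions.

```python
# API groups named in these notes, stored without the A/W suffix.
CAPABILITIES = {
    "HTTP C2": {"InternetOpen", "InternetConnect", "HttpOpenRequest",
                "HttpSendRequest", "InternetReadFile"},
    "Resource access": {"FindResource", "SizeofResource", "LockResource"},
    "Keylogging": {"GetKeyState", "GetAsyncKeyState", "GetWindowText"},
    "Clipboard": {"OpenClipboard", "GetClipboardData", "CloseClipboard"},
    "Mutex": {"CreateMutex"},
}

def strip_charset_suffix(name: str) -> str:
    """Drop a trailing A (ANSI) or W (Wide) so both variants match."""
    return name[:-1] if name.endswith(("A", "W")) else name

def match_capabilities(imports: list) -> dict:
    """Map each capability to the imported APIs that suggest it."""
    normalized = {strip_charset_suffix(name) for name in imports}
    return {
        capability: sorted(apis & normalized)
        for capability, apis in CAPABILITIES.items()
        if apis & normalized
    }

# Hypothetical import list for illustration.
sample_imports = ["InternetOpenA", "InternetConnectA", "HttpOpenRequestA",
                  "HttpSendRequestA", "InternetReadFile", "CreateMutexA",
                  "GetAsyncKeyState"]
result = match_capabilities(sample_imports)
for capability, apis in result.items():
    print(capability, apis)
```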

64 Bit Malware

  • Vast majority is 32 bit

  • We will see more 64 bit malware in the future as 64 bit systems become the standard

  • Two types of 64 bit malware have been common

Browser Helper Objects for 64 bit Internet Explorer
Device Drivers (rootkits) for Windows x64

Analyze 32-bit malware on 64-bit OS with caution

  • 32 bit code running on a 64 bit operating system runs in the WOW64 subsystem

  • 32 bit executables load 32 bit DLLs

  • 32 bit DLLs are located in %SystemRoot%\Syswow64

  • 32 bit processes reference Software hive registry values in Wow6432Node using registry redirection

  • Some executables run subtly differently under WOW64 than on a native 32 bit OS

64-Bit Assembly Differences

  • All general purpose registers are expanded to 64 bits

  • EAX --> RAX

  • There are eight new general purpose registers R8 --> R15

  • Special use registers are extended and renamed: EIP --> RIP

  • RSP not RBP is often used to access parameters and variables

  • Calling convention resembles fastcall (parameters via registers)

First four parameters are passed in RCX RDX R8 R9
Additional parameters are stored on the stack 
  • There is a new addressing mode (RIP + displacement)
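
When reading a 64 bit call site, it helps to map each argument to where it travels. The sketch below assigns arguments to RCX, RDX, R8, and R9 in order and leaves the rest for the stack. It assumes integer or pointer arguments only, and the function name is made up for illustration.

```python
REGISTER_ORDER = ("RCX", "RDX", "R8", "R9")

def place_arguments(args: list) -> tuple:
    """Split arguments into (register assignments, stack arguments)."""
    in_registers = dict(zip(REGISTER_ORDER, args))  # first four arguments
    on_stack = list(args[4:])                       # anything beyond four
    return in_registers, on_stack

registers, stack_args = place_arguments([1, 2, 3, 4, 5, 6])
print(registers, stack_args)
```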
