Analyzing Malicious Documents

PDF files can possess powerful capabilities that adversaries misuse to infect systems

  • The structure and contents of a PDF file are defined using objects, which issue directives using ASCII based keywords

  • Some risky keywords include:

Execute embedded JavaScript --> /JS /JavaScript /AcroForm /XFA
Launch external or embedded programs --> /Launch /EmbeddedFiles
Take action automatically when the PDF file is opened --> /AA /OpenAction
Interact with websites --> /URI /SubmitForm
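
As a rough illustration of what a keyword scan involves, the Python sketch below counts the risky names in a file's raw bytes. It is not how pdfid.py is implemented. The function name `count_risky_keywords` and the keyword list are assumptions based on the list above, and the sketch ignores obfuscated names (for example, names written with #xx hex escapes).

```python
import re

# Keyword list based on the notes above; adjust as needed.
RISKY_KEYWORDS = [
    "/JS", "/JavaScript", "/AcroForm", "/XFA",
    "/Launch", "/EmbeddedFiles",
    "/AA", "/OpenAction",
    "/URI", "/SubmitForm",
]


def count_risky_keywords(data: bytes) -> dict:
    """Count occurrences of each risky PDF name in raw bytes (no real parsing)."""
    counts = {}
    for keyword in RISKY_KEYWORDS:
        # Require that the next byte is not alphanumeric, so that a short
        # name such as /AA does not match inside a longer name.
        pattern = re.escape(keyword.encode()) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    sample = (b"1 0 obj << /OpenAction 2 0 R >> endobj "
              b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj")
    for name, hits in count_risky_keywords(sample).items():
        if hits:
            print(name, hits)
```

A non-zero count is only a hint about where to look next; it says nothing about whether the object is actually malicious.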

A PDF file is a collection of elements

header --> %PDF-1.6
object --> object delimited with: 
X Y obj 
endobj
...
xref --> Table with offsets of objects in the file
trailer --> Lists the number of objects and the offset of xref

PDF objects can reference each other and specify actions

  • Indirect object 1 0 references 43 0

1 0 obj 
Type: /Page
<<
  /AA /O 43 0 R
>>
endobj
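
To make the idea of following references concrete, here is a minimal Python sketch that pulls indirect references (the `N G R` form) out of an object body with a regular expression. The helper name `find_references` is hypothetical, and a regex like this can be fooled by text inside strings or streams, so treat it as an illustration rather than a parser.

```python
import re

# Matches an indirect reference such as "43 0 R":
# object number, generation number, then the letter R.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_references(object_body: bytes) -> list:
    """Return (object number, generation) pairs referenced by an object body."""
    return [(int(num), int(gen)) for num, gen in REFERENCE.findall(object_body)]


if __name__ == "__main__":
    body = b"<< /Type /Page /AA << /O 43 0 R >> /Parent 2 0 R >>"
    print(find_references(body))
```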

Streams can encode various data

44 0 obj 
<<
  /Filter 
    [/FlateDecode]
  /Length 463
>>
stream
    encoded contents
endstream 
endobj
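
For a stream that uses only /FlateDecode, the encoded contents are zlib-compressed data. The sketch below shows one way to recover them in Python; the function name `inflate_first_stream` is an assumption for illustration, and it handles neither filter chains nor encrypted files.

```python
import re
import zlib


def inflate_first_stream(pdf_bytes: bytes) -> bytes:
    """Decompress the first stream found, assuming a single /FlateDecode filter."""
    match = re.search(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL)
    if match is None:
        raise ValueError("no stream found")
    # decompressobj stops at the end of the zlib data, so the end-of-line
    # bytes that sit before "endstream" are ignored rather than causing an error.
    return zlib.decompressobj().decompress(match.group(1))


if __name__ == "__main__":
    payload = b"app.alert('hello');"
    encoded = zlib.compress(payload)
    pdf = (b"44 0 obj\n<< /Filter [/FlateDecode] /Length "
           + str(len(encoded)).encode() + b" >>\nstream\n"
           + encoded + b"\nendstream\nendobj\n")
    print(inflate_first_stream(pdf))
```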

Always start by opening the sample in VS Code

unzip steel1.zip
code steel1.pdf
  • Use pdfid.py for an initial perspective to check for risky keywords

  • pdfid.py scans for suspicious keywords without formally parsing the PDF file

  • It's useful for an initial review to inform the next steps

  • The /URI keyword indicates clickable URLs, which can be used in PDFs as phishing bait

  • We use "keyword" in a generic sense, though the PDF specs use other terms

pdfid.py steel1.pdf
  • Use pdf-parser.py for a more detailed look at the PDF file

  • The -a parameter to pdf-parser.py shows statistics

  • Because pdf-parser.py properly parses PDF syntax, its output is more accurate than that of pdfid.py

pdf-parser.py steel1.pdf -a 
  • The -k parameter shows just the values for the given key

pdf-parser.py steel1.pdf -k /URI

Images in PDF Documents

  • The attacker tries to persuade the victim to click on the picture

  • To locate images in the PDF file, look for objects of type /XObject

Examine an Object

  • Use the -o parameter to pdf-parser.py to examine object 6 which contains /XObject

pdf-parser.py steel1.pdf -o 6
obj 6 0
  Type: /XObject
  Referencing 7 0 R
  Contains Stream     <-- Object includes encoded data
  
  <<
    /Type /XObject
    /Subtype /Image
    /Width 625      <-- Image size is 625 x 155 pixels
    /Height 155
    /BitsPerComponent 8
    /ColorSpace /DeviceRGB
    /Length 7 0 R
    /Filter /DCTDecode      <-- This decoding is used for JPEG images
  >>

Extract and view the image object

pdf-parser.py steel1.pdf -o 6 -d object6.jpg
  • Follow the trail of references that leads to object 6 to see if the trail starts with a link

  • The -r parameter finds a reference to the specified object

  • Object 6, which is of type /XObject, is referenced by object 13

obj 13 0 
  Type:
  Referencing: 4 0 R, 3 0 R, 8 0 R, 9 0 R, 6 0 R
  
  <<
    /ColorSpace
      <<
        /PCSp 4 0 R
        /CSp /DeviceRGB
        /CSpg /DeviceGray
      >>
    /ExtGState
  • Note: /Annots offers a way to associate a link with an object

  • Continue to follow the trail of references

  • If you see /Annots 14 0 R --> Look at object 14 next

Dealing with Malicious Websites / Retrieving malicious 2nd stages

  • One-by-one requests using wget or curl

  • Recommend spoofing HTTP headers to make these requests look more like a normal web browser, especially the User-Agent string, since wget and curl identify themselves by default

  • Can also tweak the config files of wget and curl

~/.wgetrc, ~/.curlrc
  • Specialized tools such as Pinpoint or Scout

  • Honeyclient software such as Thug

  • Real browser on a purposefully vulnerable Windows system, enabling the website to infect the lab machine

Activate behavioral monitoring tools to observe the infection
Capture network traffic 
If using a sniffer such as Fiddler configure it to save SSL keys 
Visit the website from several different IPs to see if its behavior changes 
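
The header-spoofing advice above can also be applied when scripting one-by-one requests. The following Python sketch builds a request with browser-like headers using only the standard library. The header values are illustrative assumptions (copy real ones from the browser you want to mimic), and `build_request` and `fetch` are hypothetical helper names.

```python
import urllib.request

# Example header values only; these are assumptions, not captured from a real browser.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url: str, timeout: float = 15.0) -> bytes:
    """Send the request; run this only from an isolated lab network."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from sending makes it easy to inspect the headers before any traffic leaves the lab.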

View PDF Object Streams

pdf-parser.py steel2.pdf -O -a
  • If the output of pdf-parser.py steel2.pdf -a lists /ObjStm, the file contains object streams, and you need to look inside them

  • pdf-parser.py does not examine object streams by default; the -O parameter enables this

Find all objects that refer to object 10

pdf-parser.py steel2.pdf -O -r 10

Additional Considerations with PDFs

  • Look for risky objects, examine them, follow the trail of referenced or otherwise related objects

  • If you see a suspicious object with a stream you can dump that stream to a file using parameters -f -w -d

  • Malicious PDFs can include JavaScript --> look for /JS /JavaScript /AcroForm /XFA

  • PDF files could be password protected

  • The structure will be visible, but you'll need to decrypt streams to examine them

  • You'll need to determine the password, then decrypt with tools such as qpdf and pdftk

VBA Macros in Microsoft Office Documents

  • Note: Even if the document or VBA project is password protected, the macros are not stored in an encrypted way

  • Office documents can follow two different formats

  • The "legacy" binary format is OLE2 (a.k.a. structured storage)

  • OLE2 mimics capabilities of a file system using the concepts of storages (like folders) and streams (like files)

  • The more modern XML-based format, OOXML, stores the document's contents as multiple files inside a ZIP archive

  • Both formats can carry macros

  • Macros in an OOXML file are inside a binary OLE2 file, which is inside the ZIP archive

  • Normally VBA macro code is embedded inside streams as compiled code (p-code) and compressed source code
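
Because the file extension can be misleading, a quick look at the leading bytes helps tell the two containers apart. This Python sketch checks the well-known OLE2 and ZIP signatures; the function name `guess_container` is an assumption, and a ZIP signature only suggests OOXML, it does not prove it.

```python
# Leading bytes ("magic numbers") of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def guess_container(data: bytes) -> str:
    """Classify file contents by their first bytes."""
    if data.startswith(OLE2_MAGIC):
        return "OLE2"
    if data.startswith(ZIP_MAGIC):
        return "ZIP (possibly OOXML)"
    return "unknown"


if __name__ == "__main__":
    print(guess_container(OLE2_MAGIC + b"\x00" * 8))
    print(guess_container(b"PK\x03\x04" + b"\x14\x00"))
```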

Initial Triage

file particulars.doc
trid particulars.doc

trid

  • Open XML Format --> means it's an OOXML file

Examine the files that comprise the OOXML document using unzip or zipdump.py

zipdump.py particulars.doc 
unzip particulars.doc -d particulars-files
  • Can extract individual files as well with zipdump.py

  • -s --> specify the file

  • -d --> extract or dump it

zipdump.py particulars.doc -s 5 -d > image1.jpeg
  • Use feh image viewer to view the image

feh image1.jpeg &

olevba to extract VBA Macros

olevba particulars.doc > particulars.olevba #extract
code particulars.olevba #view
  • olevba utility can locate, decode, and extract VBA macros from Office files. The tool also shows a summary of the risky keywords it located in the macro

  • Any line that starts with ' is a comment in VBA

  • When Office sees AutoOpen, it automatically executes that procedure as soon as macros are allowed to run

  • Example:

Sub AutoOpen()
g
End Sub 
-----------------------------------------
Sub g()
' useless comment 
' another useless comment for obfuscation 
y
' blah
' blah blah 
B
End Sub
  • AutoOpen() calls Sub g(), which then calls y and B, which are defined later

  • For deeper visibility into VBA macros and related artifacts examine streams

  • Use oledump.py

oledump.py particulars.doc -i
  • M means there is a macro present

  • 2823+809: the first number is the size of the compiled code, the second is the size of the compressed source code

  • Example:

A3: M 3632 2823+809 'VBA/Pj
  • Use the -s a parameter to oledump.py to extract VBA macros from all streams in particulars.doc

oledump.py particulars.doc -s a -v | more
  • Pass the oledump.py output through grep to eliminate the comments

oledump.py particulars.doc -s a -v | grep -v "^'" | more
  • Sometimes minor aspects of the document can offer additional context for your investigation

  • They can sometimes reveal artifacts used in its previous version

  • Use oledump.py to extract them

Macros via LOLBin

  • Be on the lookout for obfuscated strings that are stored backwards

Public Const O As String = 
" 23rvsger"
...
Function U5(qe)
Dim bT As New WshShell
bT.exec StrReverse(O) & " " & DU(1)
End Function
  • When this is executed, it will use the LOLBin regsvr32

  • Be aware of LOLBin mshta as well
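
To check what a reversed string resolves to without running the macro, the same operation can be reproduced in Python. The helper `str_reverse` is a hypothetical stand-in for VBA's StrReverse, applied here to the constant from the example above.

```python
def str_reverse(text: str) -> str:
    """Python equivalent of VBA's StrReverse."""
    return text[::-1]


if __name__ == "__main__":
    # The constant O from the macro above, including its leading space.
    print(repr(str_reverse(" 23rvsger")))  # 'regsvr32 '
```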

Viewing MetaData

exiftool filename.doc
  • XML source code files sometimes include details such as:

  • Hidden comments such as URLs from which images were pasted

  • The language code of the system where the document was created

Analyzing OOXML

  • You can unzip its contents and examine individual XML files

  • Start with zipdump.py with no command line arguments

zipdump.py particulars.doc 
  • Once you have identified the index of the file you'd like to examine, call zipdump.py again, specifying the desired file's index using -s

  • -d parameter will direct the tool to dump the file to STDOUT

  • Can then pipe to xmldump.py with the parameter pretty to reformat the file

zipdump.py particulars.doc -s 9 -d | xmldump.py pretty | more

Vipermonkey can emulate VBA macros

vmonkey particulars.doc > particulars.vmonkey
code particulars.vmonkey
  • Tool will auto decode the VBA macros

Numbers to Strings

  • After performing analysis you notice a macro in A3

  • When extracting it with oledump.py

oledump.py mydoc.docm -s A3 -v | more
  • You see a lot of these lines

exec = exec & ChrW(112) & ChrW(111)...
  • You can use numbers-to-string.py

oledump.py mydoc.docm -s A3 -v | numbers-to-string.py -j | more
  • Make sure to add new lines and examine the output

oledump.py mydoc.docm -s A3 -v | numbers-to-string.py -j | sed "s/;/;\n/g" > mydoc.oledump
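
The underlying conversion is simple enough to sketch by hand: each ChrW(n) call stands for the character with code n. The Python below illustrates the idea only and is not how numbers-to-string.py works internally; `decode_chr_calls` is a hypothetical helper name.

```python
import re

# Matches Chr(65) and ChrW(65), capturing the numeric character code.
CHR_CALL = re.compile(r"ChrW?\((\d+)\)", re.IGNORECASE)


def decode_chr_calls(vba_line: str) -> str:
    """Concatenate the characters named by every Chr()/ChrW() call on a line."""
    return "".join(chr(int(code)) for code in CHR_CALL.findall(vba_line))


if __name__ == "__main__":
    line = "exec = exec & ChrW(112) & ChrW(111) & ChrW(119)"
    print(decode_chr_calls(line))  # pow
```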

Password protected VBA Macros

  • You can view the VBA macro using oledump.py even though MSFT Office refuses to show the code because a password is set

oledump.py invoice.doc -i
oledump.py invoice.doc -s 7 -v | more

Remove the distracting junk code, then examine the macro

oledump.py invoice.doc -s 7 -v | grep -v "^GoTo" | grep -v ":$" > invoice.oledump

xor-kpa.py

  • The tool xor-kpa.py is designed to derive an XOR key from the supplied plaintext and cipher text

  • It can also XOR a string with its multi-byte key which mimics the algorithm employed by our malicious macro

  • -x tells the tool to XOR the data with the key

  • Start each parameter with #h# to designate it as a hex-encoded string, and enclose it in single quotes

xor-kpa.py -x '#h#89789FD89AF897AKJHF43HK23' '#h#66546F'
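
For reference, XOR with a repeating multi-byte key and the known-plaintext idea behind key recovery can both be expressed in a few lines of Python. This is a minimal sketch under the assumption that the key simply repeats across the data; `xor_bytes` and `derive_keystream` are hypothetical names and do not reflect xor-kpa.py's internals.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating multi-byte key (same call encodes and decodes)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def derive_keystream(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    """XOR ciphertext with plaintext known to sit at its start.

    The result is the keystream for those bytes; a repeating key shows up
    as a repeating pattern in it.
    """
    return bytes(c ^ p for c, p in zip(ciphertext, known_plaintext))


if __name__ == "__main__":
    key = bytes.fromhex("66546F")
    secret = xor_bytes(b"http://example.test/payload", key)
    print(derive_keystream(secret, b"http://").hex())
    print(xor_bytes(secret, key))
```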

Auto deobfuscation with oledump.py

  • plugin_http_heuristics --> will automatically decode embedded URLs if they are encoded using a common obfuscation method

oledump.py invoice.doc -p plugin_http_heuristics
  • Sometimes a faster approach to deobfuscating macros involves the VBA debugger built into MSFT Office

evilclippy -uu invoice.doc
  • Will remove the macro password with -uu flag

  • Then open MSFT Word click View tab --> Macros --> View Macros --> edit

  • Bring up the locals window so you can see the variables

  • Add the following at the beginning of the macro (e.g. at the start of the AutoOpen procedure) so the macro breaks into the debugger

Sub AutoOpen() <-- Line already there
Debug.Assert False <-- Line you add
GoTo jlskdffjieoajioehjfueahfekjanufiw <-- Start of obfuscated mess
  • Save the macro so the line you added doesn't get lost

  • Switch to the MSFT Word main view and enable macros

  • Once you enable macros, the macro will run and pause in the AutoOpen procedure on the line you added

  • Set a breakpoint on the line that interests you

  • Then click Run > Continue

  • Once it hits your breakpoint, examine the Locals window; it shows the current variables, where you should see the values you are looking for

VBA Stomping

  • When a macro is added to an Office Document MSFT Office compiles it into a bytecode form known as p-code

  • This is the code that is actually executed when the macro is run (most of the time: https://github.com/bontchev/pcodedmp)

  • Malware authors could modify or fully delete the source code version of the macro while keeping the p-code version intact

  • Our analysis tools focus on the source code of the macro and won't recognize the true nature of the file

Extract the file as always

olevba order.docm
  • Now extract the file structure info

oledump.py order.docm -i
  • A ! in the output indicates an unusual start of source code

  • Another sign of VBA stomping is a compressed source code size of 0

  • oledump.py can extract the p-code but it cannot decode it

oledump.py order.docm -s A3 -v <-- Will get an error "Cannot decompress"
oledump.py order.docm -s A3s -A <-- -A will show the contents the way a hex editor might show them 
oledump.py order.docm -s A3c | more <-- appending c to the stream index selects the compiled code (what c stands for)
  • Use pcodedmp.py to disassemble VBA p-code

pcodedmp order.docm > order.pcodedmp
code order.pcodedmp
  • Use pcode2code to decompile VBA p-code

pcode2code order.docm | more
  • Note:

  • MSFT Office can automatically decompile the p-code to regenerate the VBA source code, however:

  • Macros without the source code will only run in the specific version of Office for which the p-code was created

  • If you want to debug the macros, decompile the p-code using pcode2code, then embed the resulting macro in a document

Base64 PowerShell

  • If you identify Base64-encoded PowerShell, use base64dump.py to decode it

oledump.py checkbox.doc -s 7 -d | base64dump.py -s 1 -t utf16 > checkbox.ps1
more checkbox.ps1
  • However, when you view the dump, you can see that it also contains gzip-compressed data

  • Extract the gzip data

base64dump.py checkbox.ps1 -s 3 -d | gunzip - > checkbox2.ps1
code checkbox2.ps1
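
The two decoding layers described above (Base64 holding UTF-16 text, then Base64 holding gzip data) can be reproduced with the Python standard library. This is a sketch with hypothetical helper names, and it assumes UTF-16LE text, which is what PowerShell's encoded commands normally use.

```python
import base64
import gzip


def decode_powershell_b64(encoded: str) -> str:
    """Decode a Base64 string that holds UTF-16LE text."""
    return base64.b64decode(encoded).decode("utf-16-le")


def decode_gzip_b64(encoded: str) -> bytes:
    """Decode a Base64 string that holds gzip-compressed data."""
    return gzip.decompress(base64.b64decode(encoded))


if __name__ == "__main__":
    # Build a two-layer sample locally so the round trip can be observed.
    inner = b"Write-Output 'stage two'"
    layer2 = base64.b64encode(gzip.compress(inner)).decode()
    layer1 = base64.b64encode(("$data = '" + layer2 + "'").encode("utf-16-le")).decode()
    print(decode_powershell_b64(layer1))
    print(decode_gzip_b64(layer2))
```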

Shellcode

  • Shellcode is machine code that the CPU can understand

  • It is represented as a series of bytes stored in a memory region

base64dump.py checkbox2.ps1 -n 10 
  • -n parameter directs base64dump.py to only consider strings that when decoded are at least 10 bytes long

  • You should now see the long shellcode string; since it is the second entry, use -s 2 to extract it

base64dump.py checkbox2.ps1 -n 10 -s 2 -d | translate.py "byte ^ 35" > checkbox.bin
  • Use scdbgc to emulate the execution of shellcode to understand its capabilities

scdbgc /f checkbox.bin /s -1
  • Can now use yara-rules to identify known malware patterns in file

yara-rules checkbox.bin
1768.py checkbox.bin

Examining Malicious RTF Documents

  • RTF documents are supported by MSFT Word and many non-MSFT applications

  • RTF does not support macros but it allows attackers to embed other dangerous files as OLE objects and other binary contents

  • Users can be persuaded to open and execute the embedded file

  • RTF files can also directly target a vulnerability using an exploit to execute the embedded shellcode payload

  • When examining RTF documents, focus on the objects or other embedded artifacts

RTF format

  • Usually formatted as ASCII plaintext and includes control words and groups

  • Control words start with \ and specify how the RTF rendering application should format and display the characters

  • A group encloses other elements in {} delimiters and specifies the text affected by the group and its formatting

  • Groups can be nested

  • Objects and other binary content are embedded as serialized strings that represent hex values

  • You will see the \objdata control word followed by a string encoded in hex

  • Use rtfdump.py and | more to get an overview of the RTF file's groups and to spot embedded objects

  • -O will allow you to examine the objects

rtfdump.py new-order.doc -O
  • -s parameter specifies the index of the object

  • -d tells the tool to dump the object in its raw form

rtfdump.py new-order.doc -O -s 1 -d > new-order.object
  • Use oledump.py to examine the extracted object

  • oledump.py new-order.object -i

  • If you now want to examine a specific stream, select it with -s and use the -A parameter to view it as a hex/ASCII dump

  • oledump.py new-order.object -s 4 -A

  • When analyzing malicious documents that might have exploits look for shellcode to understand the payload of the attack

  • Use the -S parameter to examine the strings

  • oledump.py new-order.object -s 4 -S

  • To parse Equation Editor 3.0 data, pass the option -f name=eqn1 to format-bytes.py

  • oledump.py new-order.object -s 4 -d | format-bytes.py -f name=eqn1

Shellcode searching in Binary files

  • When looking for shellcode, look out for long runs of 0x90 bytes, also known as a NOP sled

  • Use xorsearch to spot shellcode patterns in binary files

  • xorsearch -W -d 3 qa.bin

  • EIP points to the current instruction but assembly code cannot read it directly, so malware authors do it indirectly

Call followed by a POP allows code to get its EIP contents
CALL 00401024
POP EAX
Shellcode developers attempt to evade detection by using other instructions to perform GetEIP
00401027 JMP SHORT 0040102C #Happens first and moves down to the CALL
00401029 POP ESI
0040102A JMP SHORT 00401031
0040102C CALL 00401029 #Call is made and it moves back up to the POP
00401031 ADD ESI, 9 
This code succeeds at making the CALL and then POP in an indirect manner
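
The simplest CALL/POP form has a recognizable byte pattern: E8 followed by a zero displacement (a CALL to the very next instruction) and then a one-byte POP of a general register (58 to 5F). The Python sketch below searches for just that variant. It is a simplified illustration, not what xorsearch does, and it will not match the JMP/CALL/POP arrangement shown above; `find_call_pop` is a hypothetical name.

```python
import re

# E8 00 00 00 00 = CALL to the next instruction; 58..5F = POP EAX..EDI.
CALL_NEXT_POP = re.compile(rb"\xE8\x00\x00\x00\x00[\x58-\x5F]")


def find_call_pop(data: bytes) -> list:
    """Return the offsets of CALL-next-instruction / POP register sequences."""
    return [match.start() for match in CALL_NEXT_POP.finditer(data)]


if __name__ == "__main__":
    blob = b"\x90" * 4 + b"\xE8\x00\x00\x00\x00\x58" + b"\xC3"
    print([hex(offset) for offset in find_call_pop(blob)])
```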

Shellcode Requirements

  • Shellcode needs to do some work before it can make API calls

  • To load DLLs and resolve API function names, shellcode often seeks kernel32.dll for LoadLibrary and GetProcAddress

  • Shellcode looks for the Process Environment Block (PEB) to locate kernel32.dll in memory of the exploited process

  • For every process the Windows OS creates a structure called the PEB

  • This data structure contains information about the process, including the list of modules (DLLs) that have been loaded or mapped into the process's memory

  • The FS register contains the address of the data structure called the Thread Information Block (TIB), which contains information about the currently running thread

  • A pointer to the PEB resides within the TIB at offset 0x30 with respect to the beginning of the TIB

  • Therefore a pointer to PEB is always located at FS:[0x30]

  • This syntax directs the processor to look for the address stored 0x30 bytes away from the beginning of the TIB structure

  • Two methods to retrieve the PEB

MOV EAX, DWORD PTR FS:[30h]

PUSH 30h
POP EBX
MOV EAX, FS:[EBX]
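
Both instruction sequences above assemble to short, fixed byte strings on 32-bit x86, which makes them easy to search for in an extracted blob. The Python sketch below looks for exactly these two encodings. The function name `find_peb_access` is an assumption, and shellcode that uses other registers or instruction orderings will not match.

```python
# 32-bit x86 encodings of the two sequences shown above.
PEB_ACCESS_PATTERNS = {
    "MOV EAX, FS:[30h]": bytes.fromhex("64A130000000"),
    "PUSH 30h / POP EBX / MOV EAX, FS:[EBX]": bytes.fromhex("6A305B648B03"),
}


def find_peb_access(data: bytes) -> dict:
    """Map each matching pattern label to the offsets where it occurs."""
    hits = {}
    for label, pattern in PEB_ACCESS_PATTERNS.items():
        offsets = []
        start = data.find(pattern)
        while start != -1:
            offsets.append(start)
            start = data.find(pattern, start + 1)
        if offsets:
            hits[label] = offsets
    return hits


if __name__ == "__main__":
    blob = b"\x90\x90" + bytes.fromhex("64A130000000") + b"\xCC"
    print(find_peb_access(blob))
```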

scdbgc

  • Use scdbgc to analyze shellcode by emulating its execution

  • The /foff parameter specifies the hex offset within the file where the shellcode starts

  • This can be determined by xorsearch

  • Press CTRL+C three times if scdbgc gets stuck

scdbgc /f a.bin /s -1 /foff 3B
  • /s -1 parameter indicates to continue the emulation without restricting the max number of instructions

  • Direct scdbgc to open a handle to the malicious file so the shellcode can find the overlay where it likely stores additional contents

  • Hit CTRL+C three times after it starts to avoid too many repeating instructions from filling your screen

  • Can hide the numerous READ/WRITE events with /norw

scdbgc /f qa.bin /s -1 /foff 3B /fopen qa.doc /norw
  • If you see shellcode attempting to drop another file such as an exe, we can allow the shellcode to execute in order to capture the file

  • use runsc

runsc32 runsc64

  • It can also be used on REMnux because Wine is installed

  • To execute shellcode:

runsc32 -f qa.bin -o 0x3B -d qa.doc -n 
find ~/.wine -name WINWORD.EXE -exec cp "{}" . \;

XLM Macros

  • Microsoft Excel 4.0 (XLM) macros are legacy technology that can offer attackers an alternative to VBA macros

  • Were built in 1992 before the introduction of VBA in 1993

  • Are being retired by MSFT but work in recent versions of Excel

  • Are defined as formulas in cells of sheets

  • Sheets are often hidden

  • The formulas are often in white text on white background

  • To see where the XLM macro execution starts use zipdump.py with -s parameter to examine the xl/workbook.xml

zipdump.py koti.xlsm -s "xl/workbook.xml" -d | xmldump.py pretty
  • To see where execution starts look for:

<definedNames>
  <definedName name="_xlnm.Auto_Open">Lodet!$A$154</definedName>
 </definedNames>
  • Execution starts in cell A154 in sheet Lodet

  • Look above at the <sheet name=> parameter to figure out which rId number is assigned to our sheet Lodet and whether it is hidden or not

  • To see which XML files represent the sheets Lodet and kOTI, look at the xl/_rels/workbook.xml.rels file

zipdump.py koti.xlsm -s "xl/_rels/workbook.xml.rels" -d | xmldump.py pretty
  • It will show you:

  • <Relationship Id="rId3"...Target="worksheets/sheet2.xml"/>

  • Now examine the worksheets/sheet2.xml
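
The lookup chain described above (Auto_Open defined name, then sheet name, then rId, then target file) can be followed programmatically once the two XML files have been dumped. The Python sketch below assumes the standard OOXML namespaces and takes the two files as strings. The function name `locate_auto_open` and the sample XML in the demo are illustrative assumptions, not output from the sample.

```python
import xml.etree.ElementTree as ET

NS = {
    "main": "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
    "rel": "http://schemas.openxmlformats.org/package/2006/relationships",
}
# The r:id attribute lives in the officeDocument relationships namespace.
R_ID = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id"


def locate_auto_open(workbook_xml: str, rels_xml: str) -> dict:
    """Resolve the Auto_Open entry point to its sheet, cell, and XML file."""
    workbook = ET.fromstring(workbook_xml)
    rels = ET.fromstring(rels_xml)

    entry = None
    for defined in workbook.findall("main:definedNames/main:definedName", NS):
        if defined.get("name") == "_xlnm.Auto_Open":
            entry = defined.text
    if entry is None:
        raise ValueError("no Auto_Open defined name found")
    sheet_name, cell = entry.split("!", 1)
    sheet_name = sheet_name.strip("'")

    for sheet in workbook.findall("main:sheets/main:sheet", NS):
        if sheet.get("name") == sheet_name:
            rid = sheet.get(R_ID)
            state = sheet.get("state", "visible")
            break
    else:
        raise ValueError("sheet named in Auto_Open not found")

    targets = {r.get("Id"): r.get("Target")
               for r in rels.findall("rel:Relationship", NS)}
    return {"sheet": sheet_name, "cell": cell.replace("$", ""),
            "state": state, "target": targets.get(rid)}


if __name__ == "__main__":
    wb = ('<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" '
          'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
          '<sheets><sheet name="Lodet" sheetId="2" state="hidden" r:id="rId3"/></sheets>'
          '<definedNames><definedName name="_xlnm.Auto_Open">Lodet!$A$154</definedName>'
          '</definedNames></workbook>')
    rl = ('<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
          '<Relationship Id="rId3" Type="t" Target="macrosheets/sheet1.xml"/></Relationships>')
    print(locate_auto_open(wb, rl))
```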

  • Now extract the contents of Lodet which is macrosheets/sheet1.xml using zipdump.py

  • zipdump.py koti.xlsm

  • zipdump.py koti.xlsm -s 6 -d | xmldump.py pretty | more

  • For easier analysis, direct xmldump.py to display just the cell text

zipdump.py koti.xlsm -s 6 -d | xmldump.py celltext > koti.csv
  • XLM macro obfuscation techniques include the following:

  • Use formulas to compute sensitive values such as strings during the runtime of the macro

  • Compute some values randomly during runtime i.e. the URL

Static analysis to compute possible values can be complex and time consuming 
Cached value saves time but displays only one possible outcome
  • Instead of including a string in the formula include a reference to a string that is stored in a shared table elsewhere in the document

  • The shared strings are always in xl/sharedStrings.xml

  • Shared strings can reveal IOCs

  • You can direct xmldump.py to look up the strings for you by using the -j parameter and pointing to a stream that has the macros

zipdump.py koti.xlsm -j | xmldump.py -j 6 celltext
  • MSFT Office is very helpful for decoding XLM macros

  • Use the built-in debugger to examine and deobfuscate code

  • Convert the file format from OOXML to OLE2 and the other way

  • Execute the macro the way a victim would to observe effects on the system from a behavioral perspective

  • Use Windows AMSI functionality to observe which script commands end up executing

logman start AMSITrace -p Microsoft-Antimalware-Scan-Interface Event1 -o AMSITrace.etl -ets
  • Run the suspicious script or macro you wish to examine

  • Stop AMSI Monitoring

logman stop AMSITrace -ets
  • Examine the AMSI data saved to the file

  • AMSIScriptContentRetrieval

  • Additional tools and considerations for XLM macro analysis

  • oledump.py can examine XLM macros in OLE2 files

  • oledump.py file.xls -p plugin_biff --pluginoptions "-x"
