Analyzing Malicious Documents
PDF files can possess powerful capabilities that adversaries misuse to infect systems
The structure and contents of a PDF file are defined using objects, which issue directives using ASCII based keywords
Some risky keywords include
A PDF file is a collection of elements
PDF objects can reference each other and specify actions
Indirect object 1 0 references 43 0
Streams can encode various data
Always start by opening the sample in vs-code
Use pdfid.py for an initial perspective to check for risky keywords
pdfid.py scans for suspicious keywords without formally parsing the PDF file
It's useful for an initial review to inform the next steps
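To illustrate what scanning "without formally parsing" means, here is a minimal Python sketch that counts keyword occurrences in the raw bytes. It is not how pdfid.py is implemented; the function name count_keywords and the keyword list (taken from keywords mentioned in these notes) are assumptions for illustration, and it ignores name obfuscation such as hex-escaped characters.

```python
import re

# Keywords named in these notes; extend the list for your own triage.
RISKY_KEYWORDS = (b"/JS", b"/JavaScript", b"/AcroForm", b"/XFA", b"/URI")


def count_keywords(data: bytes, keywords=RISKY_KEYWORDS) -> dict:
    """Count raw occurrences of each keyword without parsing the PDF structure."""
    counts = {}
    for keyword in keywords:
        # Reject matches that continue with a letter or digit, so /JS does not
        # also match a longer name such as /JSON.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

A nonzero count only says the bytes appear somewhere in the file; confirm with a real parser before drawing conclusions.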
The /URI keyword indicates clickable URLs, which can be used in PDFs as phishing bait
We use "keyword" in a generic sense, though PDF specs use other terms
Use pdf-parser.py for a more detailed look at the PDF file
The -a parameter to pdf-parser.py shows statistics
Because pdf-parser.py properly parses PDF syntax, its output is more accurate than that of pdfid.py
The -k parameter shows just the values for the given key
Images in PDF Documents
The attacker tries to persuade the victim to click on the picture
To locate images in the PDF file, look for objects of type /XObject
Examine an Object
Use the -o parameter to pdf-parser.py to examine object 6, which contains /XObject
Extract and view the image object
Follow the trail of references that leads to object 6 to see if the trail starts with a link
The -r parameter finds references to the specified object
Object 6, which was of type /XObject, is referenced by object 13
Note: /Annots offers a way to associate a link with an object
Continue to follow the trail of references
If you see /Annots 14 0 R, look at object 14 next
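Following a reference trail can be sketched in Python as a search for "N G R" tokens inside each object body. This is a rough illustration, assuming an uncompressed PDF with plain "obj ... endobj" blocks; the function name find_referrers is hypothetical, and the sketch does not look inside object streams or encoded streams.

```python
import re

# Matches "<number> <generation> obj ... endobj" blocks in the raw file.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def find_referrers(data: bytes, target: int, generation: int = 0) -> list:
    """Return the numbers of objects whose body contains '<target> <generation> R'."""
    # The lookbehind stops "16 0 R" from being counted as a reference to object 6.
    ref = re.compile(rb"(?<!\d)%d\s+%d\s+R\b" % (target, generation))
    return [int(m.group(1)) for m in OBJ_RE.finditer(data) if ref.search(m.group(3))]
```

Calling find_referrers repeatedly on each result walks the trail backwards, one hop at a time.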
Dealing with Malicious Websites / Retrieving malicious 2nd stages
One-by-one requests using wget or curl
Recommend spoofing HTTP headers to make these requests look more like a normal web browser, especially the User-Agent strings for wget and curl
Can also tweak the config files of wget and curl
Specialized tools such as Pinpoint or Scout
Honeyclient software such as Thug
A real browser on a purposefully vulnerable Windows system, enabling the website to infect the lab machine
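The header-spoofing idea from the one-by-one request approach can be sketched in Python with the standard library. The header values below are illustrative assumptions, not values from these notes, and the helper names build_request and fetch are hypothetical. Only run fetch from an isolated lab network.

```python
import urllib.request

# Example browser-like headers; the exact strings are assumptions for illustration.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> urllib.request.Request:
    """Create a request that replaces the default Python-urllib User-Agent."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url: str, timeout: float = 15.0) -> bytes:
    """Retrieve the body of a single URL using the spoofed headers."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Separating build_request from fetch makes it possible to inspect the outgoing headers before any traffic leaves the lab machine.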
View PDF Object Streams
If you see /ObjStm in the output of the pdf-parser.py steel2.pdf -a command, you need to view the object stream
pdf-parser.py does not examine object streams by default
Find all objects that refer to object 10
Additional Considerations with PDFs
Look for risky objects, examine them, follow the trail of referenced or otherwise related objects
If you see a suspicious object with a stream, you can dump that stream to a file using the parameters -f -w -d
Malicious PDFs can include JS --> look for /JS /JavaScript /AcroForm /XFA
PDF files could be password protected
The structure will be visible, but you'll need to decrypt streams to examine them
You'll need to determine the password, then decrypt with tools such as qpdf and pdftk
VBA Macros in Microsoft Office Documents
Note: Even if the document or VBA project is password protected, the macros are not stored in an encrypted way
Office documents can follow two different formats
The "legacy" binary format is OLE2 (a.k.a. structured storage, etc.)
OLE2 mimics capabilities of a file system using the concepts of storages (like folders) and streams (like files)
The more modern XML-based format, OOXML, stores multiple files that hold the document's contents in a ZIP archive
Both formats can carry macros
Macros in an OOXML file are inside a binary OLE2 file, which is inside the ZIP archive
Normally VBA macro code is embedded inside streams as compiled code (p-code) and compressed source code
Initial Triage
trid
Open XML Format --> means it's an OOXML file
Examine the files that comprise the OOXML document using unzip or zipdump.py
Can extract individual files as well with zipdump.py
-s --> specify the file
-d --> extract or dump it
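Because an OOXML document is a ZIP archive, the same listing and extraction can be sketched with Python's zipfile module. This is an illustration only, not a replacement for zipdump.py; the function names are hypothetical, the 1-based index is chosen to resemble a numbered listing, and treating any member named vbaProject.bin as the macro container is an assumption based on the usual OOXML layout.

```python
import io
import zipfile


def list_members(ooxml_bytes: bytes) -> list:
    """Return (index, name, size) for every file inside the OOXML archive."""
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as archive:
        return [
            (index, info.filename, info.file_size)
            for index, info in enumerate(archive.infolist(), start=1)
        ]


def read_member(ooxml_bytes: bytes, index: int) -> bytes:
    """Return the decompressed contents of the member at the given 1-based index."""
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as archive:
        return archive.read(archive.infolist()[index - 1])


def find_macro_containers(ooxml_bytes: bytes) -> list:
    """Return member names that look like the OLE2 file holding VBA macros."""
    return [
        name
        for _, name, _ in list_members(ooxml_bytes)
        if name.lower().endswith("vbaproject.bin")
    ]
```

Working on bytes in memory avoids writing untrusted content to disk during triage.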
Use the feh image viewer to view the image
olevba to extract VBA macros
The olevba utility can locate, decode, and extract VBA macros from Office files. The tool also shows a summary of the risky keywords it located in the macro
Any line that starts with ' is a comment in VBA
When Office sees AutoOpen, it automatically executes that function as soon as macros are allowed to run
Example:
Can see that AutoOpen() calls Sub g(), which then calls function y and function B, which are defined later
For deeper visibility into VBA macros and related artifacts, examine streams
Use oledump.py
M means there is a macro present
2823+809: the first number is the size of the compiled code, the second number is the size of the compressed source code
Example:
Use the -s a parameter to oledump.py to extract VBA macros from all streams in particulars.doc
Pass the oledump.py output through grep to eliminate the comments
Sometimes minor aspects of the document can offer additional context for your investigation
They can sometimes reveal artifacts used in its previous version
Use oledump.py to extract them
Macros via LOLBin
Be on the lookout for obfuscated strings that are backwards
When this is executed it will use the LOLBin regsvr32
Be aware of the LOLBin mshta as well
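Spotting backwards strings can be partly automated. The Python sketch below reverses every double-quoted literal in macro source and reports those that mention a LOLBin once restored. The function name reversed_hits and the default needle list are assumptions for illustration, and the sketch ignores VBA's doubled-quote escaping.

```python
import re

# Double-quoted string literals; VBA's "" escape is not handled in this sketch.
STRING_RE = re.compile(r'"([^"]*)"')


def reversed_hits(macro_source: str, needles=("regsvr32", "mshta")) -> list:
    """Return string literals that contain a needle after being reversed."""
    hits = []
    for literal in STRING_RE.findall(macro_source):
        restored = literal[::-1]
        if any(needle in restored.lower() for needle in needles):
            hits.append(restored)
    return hits
```

For example, a literal such as "lld.a s/ 23rvsger" is reported as "regsvr32 /s a.dll".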
Viewing MetaData
XML source code files sometimes include details such as:
Hidden comments such as URLs from which images were pasted
The language code of the system where the document was created
Analyzing OOXML
You can unzip its contents and examine individual XML files
Start with zipdump.py with no command line arguments
Once you have identified the index of the file you'd like to examine, you can call zipdump.py again, specifying the desired file's index using -s
The -d parameter will direct the tool to dump the file to STDOUT
Can then pipe to xmldump.py with the parameter pretty to reformat the file
ViperMonkey can emulate VBA macros
The tool will automatically decode the VBA macros
Numbers to Strings
After performing analysis you notice a macro in A3
When extracting it with oledump.py you see a lot of these lines
You can use numbers-to-strings.py
Make sure to add new lines and examine the output
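The idea behind converting numbers to strings can be sketched in Python: pull the decimal numbers out of each line and treat them as character codes. This is a simplified illustration rather than the logic of numbers-to-strings.py; the function name numbers_to_string and the minimum-count threshold are assumptions.

```python
import re


def numbers_to_string(line: str, minimum: int = 3):
    """Decode a line's decimal numbers as character codes.

    Returns None when the line has fewer than `minimum` numbers or when any
    number is outside the single-byte range.
    """
    numbers = [int(token) for token in re.findall(r"\d+", line)]
    if len(numbers) < minimum or any(number > 255 for number in numbers):
        return None
    return "".join(chr(number) for number in numbers)
```

Applied line by line to the extracted macro, lines that decode to readable text are the ones worth examining first.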
Password protected VBA Macros
Can see the VBA macro using oledump.py even though MSFT Office refuses to show you the code due to the password being set
Remove the distracting junk code, then examine the macro
xor-kpa.py
The tool xor-kpa.py is designed to derive an XOR key from the supplied plaintext and ciphertext
It can also XOR a string with its multi-byte key, which mimics the algorithm employed by our malicious macro
-x tells the tool to XOR the data with the key
Start each param with #h# to designate it as a hex-encoded string and enclose it in ''
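The two operations described above, XOR with a repeating multi-byte key and key recovery from known plaintext, can be sketched in a few lines of Python. This shows the general technique, not the internals of xor-kpa.py; the function names are hypothetical.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating multi-byte key (the same call encodes and decodes)."""
    return bytes(byte ^ key_byte for byte, key_byte in zip(data, cycle(key)))


def derive_keystream(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    """Recover the key bytes that cover the span of the known plaintext.

    The result repeats the key as many times as fit, so a short key shows up
    as a visible repeating pattern.
    """
    return bytes(c ^ p for c, p in zip(ciphertext, known_plaintext))
```

Hex-encoded inputs can be converted first with bytes.fromhex. The known plaintext must line up with the start of the ciphertext in this sketch; searching other offsets is left out.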
Auto deobfuscation with oledump.py
oledump.py plugin_http_heuristics --> will automatically decode embedded URLs if they are encoded using a common obfuscation method
Sometimes a faster approach to deobfuscate macros involves the VBA debugger built into MSFT Office
Will remove the macro password with the -uu flag
Then open MSFT Word and click View tab --> Macros --> View Macros --> Edit
Bring up the locals window so you can see the variables
Add the following at the beginning of the macro (e.g. at the start of the AutoOpen function) so the macro starts the debugger
Save the macro so the line you added doesn't get lost
Switch to the MSFT Word main view and enable macros
Once you enable the macros, the macro will run and pause in the AutoOpen function on the line you added
Set the breakpoint on the line that interests you
Then click Run > Continue
Once it hits your breakpoint, examine the locals window at the bottom; it shows the current variables, and you should see what you are looking for
VBA Stomping
When a macro is added to an Office document, MSFT Office compiles it into a bytecode form known as p-code
This is the code that is actually executed when the macro is run (most of the time: https://github.com/bontchev/pcodedmp)
Malware authors could modify or fully delete the source code version of the macro while keeping the p-code version intact
Our analysis tools focus on the source code of the macro and won't recognize the true nature of the file
Extract the file as always
Now extract the file structure info
You will see a ! which indicates an unusual start of source code
Another sign of VBA stomping is the size of the compressed source code being 0
oledump.py can extract the p-code but it cannot decode it
Use pcodedmp.py to disassemble VBA p-code
Use pcode2code to decompile VBA p-code
Note: MSFT Office automatically decompiles the p-code, generating the VBA source code, however:
Macros without the source code will only run in the specific version of Office for which the p-code was created
If you want to debug the macros, you can decompile the p-code using pcode2code
You can then embed the macro in a document
Base64 PowerShell
If you identify some base64 encoded PowerShell, make sure to use base64dump.py to convert it
However, when you view the dump you can see that it is also gzip encoded data
Extract the gzip data
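The two decoding layers described above (base64, then gzip) can be reproduced with Python's standard library. This is a minimal sketch assuming the payload is plain base64 wrapping a single gzip member; the function name decode_b64_gzip is hypothetical.

```python
import base64
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def decode_b64_gzip(blob: str) -> bytes:
    """Base64-decode the blob, verify the gzip header, and decompress it."""
    raw = base64.b64decode(blob)
    if not raw.startswith(GZIP_MAGIC):
        raise ValueError("decoded data does not start with the gzip magic bytes")
    return gzip.decompress(raw)
```

Checking the magic bytes first gives a clear error when the inner layer uses a different encoding than expected.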
Shellcode
Shellcode is machine code that the CPU can understand
It is represented as a series of bytes stored in a memory region
The -n parameter directs base64dump.py to only consider strings that, when decoded, are at least 10 bytes long
You should now see the long shellcode string; since it is the second stream, use -s 2 to extract that stream
Use scdbgc to emulate the execution of shellcode to understand its capabilities
Can now use yara-rules to identify known malware patterns in the file
The yara-rules command will scan the file to see if it matches any rules
1768.py is designed for parsing Cobalt Strike artifacts and is installed on REMnux
In CS files the License ID is stored as a 32-bit integer in the last 4 bytes of the shellcode
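Reading a 32-bit integer from the last 4 bytes of a shellcode blob can be sketched with Python's struct module. These notes do not state the byte order, so the sketch returns both interpretations rather than assuming one; the function name license_id_candidates is hypothetical and this is not the logic of 1768.py.

```python
import struct


def license_id_candidates(shellcode: bytes) -> dict:
    """Interpret the final 4 bytes as an unsigned 32-bit integer in both byte orders."""
    if len(shellcode) < 4:
        raise ValueError("shellcode is shorter than 4 bytes")
    tail = shellcode[-4:]
    return {
        "big_endian": struct.unpack(">I", tail)[0],
        "little_endian": struct.unpack("<I", tail)[0],
    }
```

Compare both values against the output of 1768.py to see which interpretation applies to your sample.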
Examining Malicious RTF Documents
RTF documents are supported by MSFT Word and many non-MSFT applications
RTF does not support macros but it allows attackers to embed other dangerous files as OLE objects and other binary contents
Users can be persuaded to open and execute the embedded file
RTF files can also directly target a vulnerability using an exploit to execute the embedded shellcode payload
When examining RTF documents, focus on the objects or other embedded artifacts
RTF format
Usually formatted as ASCII plaintext and includes control words and groups
Control words start with \ and specify how the RTF rendering application should format and display the characters
A group encloses other elements in {} delimiters and specifies the text affected by the group and its formatting
Groups can be nested
Objects and other binary content are embedded as serialized strings that represent hex values
You will see the \objdata control word followed by a string encoded in hex
Use rtfdump.py and | more to get an overview of the RTF file's groups and to spot embedded objects
-o will allow you to examine the object
The -s parameter specifies the index of the object
-d tells the tool to dump the object in its raw form
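How hex-serialized object data turns back into raw bytes can be sketched in Python. The sketch assumes the simple case where the objdata control word is followed only by hex digits and whitespace. Malicious RTF files often interleave other control words or nested groups to defeat exactly this kind of matching, so treat it as an illustration of the encoding rather than a substitute for rtfdump.py. The function name extract_objdata is hypothetical.

```python
import binascii
import re

# The objdata control word, then a run of hex digits that may contain whitespace.
OBJDATA_RE = re.compile(rb"\\objdata\b\s*([0-9A-Fa-f\s]+)")


def extract_objdata(rtf: bytes) -> list:
    """Return the decoded bytes of each simple hex-encoded objdata payload."""
    blobs = []
    for match in OBJDATA_RE.finditer(rtf):
        hex_digits = re.sub(rb"\s+", b"", match.group(1))
        if len(hex_digits) % 2:
            # Drop a trailing unpaired digit so the remainder decodes cleanly.
            hex_digits = hex_digits[:-1]
        blobs.append(binascii.unhexlify(hex_digits))
    return blobs
```

The decoded bytes can be written to a file and examined with the same tools used for other extracted objects.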
Use oledump.py to examine the extracted object
oledump.py new-order.object -i
If you now want to examine a specific stream, use the -A parameter
oledump.py new-order.object -s 4 -A
When analyzing malicious documents that might have exploits, look for shellcode to understand the payload of the attack
Use the -S parameter to examine the strings
oledump.py new-order.object -s 4 -S
For parsing Equation Editor 3.0 data we have the option -f name=eqn1
oledump.py new-order.object -s 4 -d | format-bytes.py -f name=eqn1
Shellcode searching in Binary files
When looking for shellcode, look out for a lot of 0x90 bytes, also known as a NOP sled
Use xorsearch to spot shellcode patterns in binary files
xorsearch -W -d 3 qa.bin
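Looking for a NOP sled amounts to finding long runs of the byte 0x90, which can be sketched in Python as follows. The minimum run length of 16 is an arbitrary assumption and the function name find_nop_sleds is hypothetical; this covers only the plain 0x90 case, not the wider set of patterns xorsearch looks for.

```python
import re


def find_nop_sleds(data: bytes, minimum: int = 16) -> list:
    """Return (offset, length) for each run of at least `minimum` 0x90 bytes."""
    pattern = re.compile(rb"\x90{%d,}" % minimum)
    return [(match.start(), match.end() - match.start()) for match in pattern.finditer(data)]
```

The offset just past a sled is a reasonable first guess for where to start emulating the shellcode.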
EIP points to the current instruction, but assembly code cannot read it directly, so malware authors do it indirectly
Shellcode Requirements
Shellcode needs to do some work before it can make API calls
To load DLLs and resolve API function names, shellcode often seeks kernel32.dll for LoadLibrary and GetProcAddress
Shellcode looks for the Process Environment Block (PEB) to locate kernel32.dll in the memory of the exploited process
For every process, the Windows OS creates a structure called the PEB
This data structure contains information about the process, including the list of modules (DLLs) that have been loaded or mapped into the process's memory
The FS register contains the address of the data structure called the Thread Information Block (TIB), which contains information about the currently running thread
A pointer to the PEB resides within the TIB at offset 0x30 with respect to the beginning of the TIB
Therefore a pointer to the PEB is always located at FS:[0x30]
This syntax directs the processor to look for the address stored 0x30 bytes away from the beginning of the TIB structure
Two methods to retrieve the PEB
scdbgc
Use scdbgc to analyze shellcode by emulating its execution
The -foff parameter specifies the hex offset within the file where the shellcode starts
This can be determined by xorsearch
Press CTRL+C three times if scdbgc gets stuck
The /s -1 parameter indicates to continue the emulation without restricting the max number of instructions
Direct scdbgc to open a handle to the malicious file so the shellcode can find the overlay where it likely stores additional contents
Hit CTRL+C three times after it starts to avoid too many repeating instructions from filling your screen
Can hide the numerous READ/WRITE events with /norw
If you see shellcode attempting to drop another file such as an exe, you can allow the shellcode to execute in order to capture the file
Use runsc (runsc32, runsc64)
Can also use it on REMnux because wine is installed
To execute shellcode:
XLM Macros
Microsoft Excel 4 (XLM) macros are legacy technology that can offer attackers an alternative to VBA macros
Were built in 1992 before the introduction of VBA in 1993
Are being retired by MSFT but work in recent versions of Excel
Are defined as formulas in cells of sheets
Sheets are often hidden
The formulas are often in white text on white background
To see where the XLM macro execution starts, use zipdump.py with the -s parameter to examine xl/workbook.xml
To see where execution starts look for:
Execution starts in cell A154 in sheet Lodet
Look above at the <sheet name=> entry to figure out which rId number is assigned to our sheet Lodet and whether it is hidden or not
To see which XML files represent the sheets Lodet and kOTI, look at the xl/_rels/workbook.xml.rels file
It will show you:
<Relationship Id="rId3"...Target="worksheets/sheet2.xml"/>
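The manual lookup from sheet name to rId to XML file can be sketched in Python with ElementTree. The sketch assumes you have already extracted the text of xl/workbook.xml and of its relationships file; the function name map_sheets is hypothetical, and sheets without a state attribute are reported as "visible".

```python
import xml.etree.ElementTree as ET

NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def map_sheets(workbook_xml: str, rels_xml: str) -> list:
    """Join each <sheet> entry with the Target of its relationship Id."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter("{%s}Relationship" % NS_PKG_REL)
    }
    sheets = []
    for sheet in ET.fromstring(workbook_xml).iter("{%s}sheet" % NS_MAIN):
        relationship_id = sheet.get("{%s}id" % NS_DOC_REL)
        sheets.append(
            {
                "name": sheet.get("name"),
                "state": sheet.get("state", "visible"),
                "target": targets.get(relationship_id),
            }
        )
    return sheets
```

Each returned entry shows at a glance whether a sheet is hidden and which file inside the archive holds its contents.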
Now examine worksheets/sheet2.xml
Now extract the contents of Lodet, which is macrosheets/sheet1.xml, using zipdump.py
zipdump.py koti.xlsm
zipdump.py koti.xlsm -s 6 -d | xmldump.py pretty | more
For easier analysis, direct xmldump.py to display just the cell text
XLM macro obfuscation techniques include the following:
Use formulas to compute sensitive values such as strings during the runtime of the macro
Compute some values randomly during runtime i.e. the URL
Instead of including a string in the formula include a reference to a string that is stored in a shared table elsewhere in the document
The shared strings are always in xl/sharedStrings.xml
Shared strings can reveal IOCs
You can direct xmldump.py to look up the strings for you by using the -j parameter and pointing to a stream that has the macros
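Resolving a shared-string reference by hand can be sketched in Python: load the table from xl/sharedStrings.xml, then index into it with the number stored in the cell. The function names are hypothetical, and the sketch assumes the XML text has already been extracted from the archive.

```python
import xml.etree.ElementTree as ET

NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def load_shared_strings(shared_strings_xml: str) -> list:
    """Return the shared string table as a list, in index order."""
    root = ET.fromstring(shared_strings_xml)
    strings = []
    for item in root.iter("{%s}si" % NS_MAIN):
        # An entry may be split into several runs; join the text of every <t>.
        strings.append("".join(t.text or "" for t in item.iter("{%s}t" % NS_MAIN)))
    return strings


def resolve_shared(index_text: str, strings: list) -> str:
    """Look up the zero-based index that a cell stores in place of the string."""
    return strings[int(index_text)]
```

Dumping the whole table is often enough on its own to surface URLs and file names worth treating as IOCs.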
MSFT Office is very helpful for decoding XLM macros
Use the built-in debugger to examine and deobfuscate code
Convert file format from OOXML to OLE2 and the other way
Execute the macro the way a victim would to observe effects on the system from a behavioral perspective
Use Windows AMSI functionality to observe which script commands end up executing
Run the suspicious script or macro you wish to examine
Stop AMSI Monitoring
Examine the AMSI data saved to the file
AMSIScriptContentRetrieval
Additional tools and considerations for XLM macro analysis
oledump.py can examine XLM macros in OLE2 files
oledump.py file.xls -p plugin_biff --pluginoptions "-x"