The Malware
I received a malicious attachment in my email yesterday that uses a technique that I've started to see more and more in documents - utilizing the metadata fields to hold some of the malicious code. The advantage to this technique is that it spreads the code throughout the document and makes it more difficult to analyze. Despite this, all signs pointed to this being an easy document to analyze. As you'll see, I was wrong.resume.doc
MD5: e618b9ef551fe10bf83f29f963468ade
SHA1: 93993320c636c884e6f1b53f9f878410efca02da
SHA256: d400d6392a17311460442e76b26950a0a07e8a85c210c31e87a042a659dc9c52
Once more, I used REMNux to statically analyze the file. Yes, I could have executed it with Lazy Office Analyzer to speed up my analysis, but frankly my Windows VM is temporarily fubar'd, so I was stuck doing it this way.
The first step in my analysis was to figure out what type of document I was dealing with. Relying on the trusty-old file command, we see that this is a composite document file. This means that we can use the oletools suite to analyze it.
Notice that file also gives us some information obtained from the metadata of the document. The metadata of a document is stored inside the document and contains information about it, such as the last time it was saved, how many characters are in it, and the author. As analysts, we can use this information as indicators, to track actors across multiple documents, and sometimes to provide attribution.
In this case, the author is set to "Caleb Cheng" and was last saved by the username "caive". Are these legitimate? There's no way for us to tell as they could have been modified after the document was saved, but we could at least create a yara rule to track these usernames and see if they pop up in other documents!
Speaking of Yara, my next step was to use the rules from the Yara Rules Project and see what they found inside of the document.
Two of the Yara rules indicated the document contained embedded VBA code. Not surprising as many malicious documents use VBA to execute their second stage malware. Yara didn't find any embedded executables, so this probably meant the document downloads its next stage to execute. The only way to find out was to extract the VBA and analyze it.
The olevba.py script from the oletools suite can be used to extract VBA code from Office documents. Initially, I didn't give it any options to so I could see how the code looked natively. Fortunately in this document, the VBA code was pretty short and easy to understand.
The code can be broken down as follows:
- Declares a number of variables.
- Reads the value of the metadata title field with the ThisDocument.BuiltInDocumentProperties("title") command.
- Converts the title to ASCII from Unicode. In reality, this is just converting the characters to their integer equivalents and putting them into an array.
- Loops through each letter of the title and subtracts 7 from its ASCII value.
- Converts the array back to Unicode. This is just converting the array back to a string.
- Reverses the string (e.g. turns "abc" to "cba").
- Uses the shell() function to execute some of the string, after replacing and splitting it up with multiple values.
As expected, the title field contained obfuscated text that needed to be decoded in order to see exactly what the malicious document was doing. Before I could decode it, however, I needed to extract it properly.
The Problem
Exiftool will display periods for both actual period characters and binary data, so I first attempted to use some command-line fu in order to properly extract the characters. Normally, this can be done with the following command, which will convert all characters into a hex-encoded string that you can put directly into a python script.
exiftool -title -b resume.doc | hexdump -v -e '"\\" "x" 1/1 "%02X" ' ;Since the VBA code was fairly straight-forward, I wrote a quick python script to decode it. However, when I ran the script I didn't get the results I expected. While I saw some PowerShell commands and an obfuscated URL, there were some binary characters that shouldn't have been there; it should have decoded cleanly to text.
This is where I spent the next few hours trying to figure out what was going on.
At first I thought the VBA code was doing some value conversions when it converted from Unicode to ASCII integer values; this was not the case. I tested this by writing some similar VBA code, launching it in Word, and debugging it to see the values before and after the conversion - all was as expected.
Then I went over my python script to make sure I hadn't made a programming error (this would not have been the first time). Everything was good.
Finally, I went back to the Word document itself to see if I could figure out if my command-line fu had worked correctly. Thats when I noticed something interesting. If I opened the document in a hex editor and compared the location where the title string was located at to what was extracted by exiftool, some of the binary characters were different.
In the example above, the top hex dump is from the document itself and shows exactly what is within the document. The bottom is what exiftool extracted. You can see that in the original document, the highlighted byte was 0x83 but when exiftool extracted it, the byte was converted to 0xC6 0x92. This occurred multiple times in the extraction.
I'm currently not sure why exiftool did this. I tested it with multiple options and the latest version and all did the same thing. I'm waiting to hear back from the developers to see if I found a bug.
Unfortunately, I was unable to come up with some genius command line fu to extract the real title string in one fell swoop. So how did I do it? I copied the bytes from the hexdump command above and did some copy and pasting to get it in the right format. Sometimes thats just the easiest option. If anyone comes up with anything, please let me know.
I should also note that some other metadata extraction tools, like olemeta.py from oletools, crashed when attempting to extract the title. I suspect this is because they expect this to be a string and not have binary characters in it.
The final python code I came up with is below and can be downloaded from here.
When run, it gave me the output I expected.
As initially thought, the malicious document downloads an executable, saves it to the file system, and executes it. At the time of this writing, the file (61a5af5acea342ee5ca8dbd2fba0de06) is still accessible at that IP address. We'll save that analysis for another day.
This analysis is a prime example as to why you should trust, but verify, your tool results. In the beginning I assumed that exiftool was extracting the data properly - and why not? It had never failed me before. However, when the results were not what I was expecting I had to dig deeper to see what the issue was. Without the knowledge on how to look into a file and compare what my tool was giving me to what I was actually getting, I wouldn't have been able to figure out that my tool was giving me incorrect data and work towards a process of getting it rectified.
Update
I posted on the exiftool forum asking about the potential bug I found. Phil Harvey, the creator of exiftool, said the changing of the characters is because exiftool is attempting to convert the binary character to UTF-8. Unfortunately, outside of using the -v4 option to dump the output in hex and carve it back (which is what I did with hexdump), there is other workaround in exiftool at this time. I'll keep looking for a better way to do this going forward.