Why does TextCensor fail when tested against a file, despite the text being in the file?

This article applies to:

Trustwave MailMarshal (SEG)
Trustwave ECM/MailMarshal Exchange

Question:

How does the MailMarshal Engine process different types of attachments within a message?
My TextCensor script is not triggering on a message that contains the specified text.
Why is a TextCensor script not triggering on an attachment?

Symptoms:

Testing of a TextCensor script fails when tested against an MML or other file, despite the text being in the email
Testing of a TextCensor script triggers when tested against an MML or other file, despite the text not existing anywhere in the email

Information:

A common reason for this issue is that the TextCensor test function does not "unpack" text from files in the same way as the MailMarshal Engine.

The test function built into TextCensor scripts is intended for readable text, not for an entire MML file or other complex file. A test against such a file can therefore suggest that a script is faulty, and that it will incorrectly trigger or fail to trigger on real email. In many cases the script behaves correctly once the MailMarshal Engine service processes an actual message.

To test a script against a message with attachments, send an actual test mail through MailMarshal.

Technical details:

If you examine an email message (MailMarshal MML file) using a text editor such as Notepad, you may see various sections within the message. Sections containing non-text attachments are specified as base64 encoded. The attachment will appear to be arbitrary non-human-readable characters; this is the base64 encoded section. This encoding allows binary data to be sent within an email message.
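The effect of base64 encoding on searchable text can be illustrated with a short Python sketch. It is not part of MailMarshal; the sample phrase and the helper names `encode_attachment` and `decode_attachment` are hypothetical and exist only for this illustration. It shows why a phrase inside an attachment does not appear literally in the raw message text.

```python
import base64


def encode_attachment(data: bytes) -> str:
    """Return the base64 text form in which binary data travels inside a message."""
    return base64.b64encode(data).decode("ascii")


def decode_attachment(encoded: str) -> bytes:
    """Reverse the encoding to recover the original bytes."""
    return base64.b64decode(encoded)


# Hypothetical attachment content used only for this illustration.
original = b"Project Falcon is confidential"
encoded = encode_attachment(original)

print(encoded)                                      # unreadable run of letters and digits
print("confidential" in encoded)                    # literal search of the encoded form
print(b"confidential" in decode_attachment(encoded))  # search after decoding
```

A literal search of the encoded form misses the phrase, while the same search after decoding finds it. This is the gap between testing against the raw file and the processing described in this article.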

When processing the message, the Engine attempts to decode the base64 section into binary data. Once this is done, the Engine checks the binary signature to determine what type of file it is (for instance PDF, XLS, executable) and applies rules accordingly. You can see the results of this analysis in the service text logs for the message.
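The idea of identifying a file by its binary signature can be sketched as follows. This is a minimal illustration using a few widely documented leading byte sequences ("magic numbers"). It does not represent the Engine's actual detection logic, and the table `SIGNATURES` and the functions `identify` and `identify_base64` are hypothetical names.

```python
import base64

# A few well-known leading byte sequences. This list is illustrative only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy DOC/XLS)"),
    (b"PK\x03\x04", "ZIP container (also DOCX/XLSX)"),
    (b"MZ", "Windows executable"),
]


def identify(data: bytes) -> str:
    """Return a label for the first signature that matches the start of data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def identify_base64(encoded: str) -> str:
    """Decode a base64 section first, then inspect the resulting bytes."""
    return identify(base64.b64decode(encoded))


print(identify(b"%PDF-1.4\n..."))
print(identify_base64(base64.b64encode(b"MZ\x90\x00").decode("ascii")))
```

The point of the sketch is the order of operations: the type is determined from the decoded bytes, not from the encoded text or the file name.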

In certain cases, such as with a PDF or Microsoft Word document, the Engine continues by unpacking the text from the file. Below is a sample of what might appear within the Engine log when a message such as this is unpacked:

0696 16:02:13.872 Thread 2 Starting to unpack <B418690980001.000000000001.0001.mml>
0696 16:02:15.528 Type=MAIL, size=265313, Name=B418690980001.000000000001.0001.mml
0696 16:02:15.528   Type=MHDR, size=3175, Name=MsgHeader.txt
0696 16:02:15.528   Type=MBODY, size=110, Name=Plain.txt
0696 16:02:15.528   Type=MBODY, size=562, Name=Quoted-Printable.txt
0696 16:02:15.528   Type=MAIL, size=260828, Name=Unknown.txt
0696 16:02:15.528     Type=MHDR, size=3842, Name=MsgHeader.txt
0696 16:02:15.528     Type=MBODY, size=1443, Name=Quoted-Printable.txt
0696 16:02:15.528     Type=MBODY, size=8642, Name=Quoted-Printable_1.txt
0696 16:02:15.528     Type=PDF, size=63339, Name=Sample PDF file.pdf
0696 16:02:15.528       Type=TEXT, size=4041, Name=ExtractedText0.txt

Whenever a TextCensor script is set to apply to the "Message Attachment" portion, it actually scans the text extracted from the PDF file (in this example, ExtractedText0.txt). This temporary text file will appear completely different from the MML file in its entirety.
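To find which extracted text files the Engine produced for a message, the unpacking lines in the log can be read programmatically. The sketch below assumes the line layout shown in the sample above, and assumes that nesting is indicated by two extra spaces of indentation per level. Both assumptions are inferred from this one sample and are not a documented log format. The names `LINE`, `parse_unpack_log`, and `sample` are hypothetical.

```python
import re

# Matches lines such as:
# "0696 16:02:15.528     Type=PDF, size=63339, Name=Sample PDF file.pdf"
LINE = re.compile(
    r"^\d+ \d{2}:\d{2}:\d{2}\.\d{3}( +)Type=(\w+), size=(\d+), Name=(.+)$"
)


def parse_unpack_log(lines):
    """Return one dict per unpacked item, including the name of its parent item."""
    entries = []
    stack = []  # names of the ancestors of the current line
    for line in lines:
        m = LINE.match(line.rstrip("\r\n"))
        if not m:
            continue  # e.g. the "Starting to unpack" line
        depth = (len(m.group(1)) - 1) // 2  # assumed: 2 spaces per nesting level
        del stack[depth:]
        entries.append({
            "depth": depth,
            "type": m.group(2),
            "size": int(m.group(3)),
            "name": m.group(4),
            "parent": stack[-1] if stack else None,
        })
        stack.append(m.group(4))
    return entries


sample = [
    "0696 16:02:13.872 Thread 2 Starting to unpack <B418690980001.000000000001.0001.mml>",
    "0696 16:02:15.528 Type=MAIL, size=265313, Name=B418690980001.000000000001.0001.mml",
    "0696 16:02:15.528   Type=MHDR, size=3175, Name=MsgHeader.txt",
    "0696 16:02:15.528   Type=MAIL, size=260828, Name=Unknown.txt",
    "0696 16:02:15.528     Type=PDF, size=63339, Name=Sample PDF file.pdf",
    "0696 16:02:15.528       Type=TEXT, size=4041, Name=ExtractedText0.txt",
]

for entry in parse_unpack_log(sample):
    if entry["type"] == "TEXT":
        print(entry["name"], "was extracted from", entry["parent"])
```

Listing the TEXT entries and their parents shows which attachments yielded scannable text, which is the content a script applied to the attachment portion would see.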

In order to accurately test the TextCensor script against only the body of a message that contains an attachment, the base64 encoded sections would need to be removed. If attempting to test the TextCensor script against the text contained within the attachment, the text would have to be manually extracted from the attachment and placed in a text file.
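One possible way to prepare a body-only test file is sketched below using Python's standard `email` package. It assumes that the message can be parsed as a standard MIME message, which may not hold for every MML file. The function name `body_text_only` and the sample message are hypothetical. The sketch keeps only text parts that are not marked as attachments. It does not extract text from inside attachments such as PDF files.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def body_text_only(raw: bytes) -> str:
    """Return the readable text parts of a MIME message, skipping attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts have no text of their own
        if part.get_content_disposition() == "attachment":
            continue  # skip attached files, including base64 encoded ones
        if part.get_content_maintype() != "text":
            continue  # skip inline binary content
        chunks.append(part.get_content())
    return "\n".join(chunks)


# Build a small sample message with a text body and a binary attachment.
sample_msg = EmailMessage()
sample_msg["Subject"] = "Report"
sample_msg.set_content("Please review the quarterly report.")
sample_msg.add_attachment(
    b"%PDF-1.4 secret phrase",
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)
raw_message = sample_msg.as_bytes()

print(body_text_only(raw_message))
```

The resulting text could be saved to a plain text file and used with the TextCensor test function. Sending an actual test message through MailMarshal remains the way to confirm real behavior.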

Notes:

This article was previously published as: NETIQKB46650

Last Modified 4/1/2020.
https://support.trustwave.com/kb/KnowledgebaseArticle10952.aspx