This article applies to:
- Trustwave SEG 7.2 and above
- MailMarshal SPE 3.6 and above
- WebMarshal 6.9.6 and above
- TextCensor functionality
Question:
- What are the changes in behavior in TextCensor 2?
- What differences in behavior could be caused by upgrading to TextCensor 2?
- Error message: Script too big, exceeded maximum number of DFA states
Information:
The product versions named above use TextCensor 2.
When you upgrade to the new versions, existing TextCensor scripts and expressions are migrated to the new syntax.
- For SEG, to validate the changes and list any scripts that require manual update, use the TextCensor Upgrade Preview tool. You can download this tool from the product upgrade page (requires customer login).
- For MailMarshal SPE upgrade, see the upgrade document on the SPE upgrade page.
Unicode and wide characters
In WebMarshal 6.9.6 and above, TextCensor 2 works with double-byte text, including non-Roman alphabets (such as Hebrew and Arabic), and non-alphabetic languages (such as Chinese).
SEG and SPE do not currently support searching for double-byte text. Trustwave is investigating adding this functionality to SEG in a later release; no release has been announced yet.
Upgrade Considerations
- Customized scripts that contain thousands of items can cause an error on upgrade ("Script too big, exceeded maximum number of DFA states").
- If you encounter this issue, revise the script to use fewer expressions. Eliminate old and irrelevant entries. Trustwave strongly recommends against TextCensor scripts with large numbers of items, since they can cause significant performance issues.
- Alternatively, split the script into two or more segments.
- Be aware that wildcard matching provides additional options in TextCensor 2.
- "Increasing" and "decreasing" score options are no longer supported. Scripts are automatically updated to use the "maximum matches" option.
- In rare cases, an item contains a series of keywords that cannot be upgraded automatically. You must change these items manually. You can contact Support for assistance.
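The "split the script into two or more segments" suggestion above can be sketched as follows. This is a minimal illustration only: it assumes the expressions have been exported to a plain Python list, and the function name `split_script` and the `max_items` parameter are hypothetical. TextCensor does not document a fixed item limit here, since the DFA state count depends on the complexity of the expressions and not only on how many there are, so a suitable segment size has to be found by trial.

```python
from typing import List


def split_script(expressions: List[str], max_items: int) -> List[List[str]]:
    """Split one long list of expressions into consecutive segments.

    max_items is a hypothetical tuning value, not a documented TextCensor limit.
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    # Slice the list into consecutive chunks of at most max_items entries.
    return [
        expressions[start:start + max_items]
        for start in range(0, len(expressions), max_items)
    ]


# Example: five expressions split into segments of at most two.
segments = split_script(["alpha", "beta", "gamma", "delta", "epsilon"], 2)
for number, segment in enumerate(segments, start=1):
    print(f"Segment {number}: {segment}")
```

Each resulting segment would then be entered as a separate TextCensor script. Removing old and irrelevant entries before splitting remains the preferred fix.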
Syntax Differences
There are minor differences in the treatment of word boundaries and quoted text between the two versions of TextCensor. These differences affect the upgrade of TextCensor "items" (now called "expressions").
- In the earlier version of TextCensor, all non-word characters (such as punctuation) were treated as word breaks by default, unless added to a list of special characters. The new version of TextCensor does not treat most symbols as word breaks (for details see the User Guide).
- Double quotes are used in both versions to mark text as case sensitive.
- In the new version, double quotes MUST be preceded by or followed by white space (or the beginning or end of the expression).
- You can "escape" the quote by preceding it with a \ character. It will then be treated as a literal.
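The quoting rule above can be illustrated with a small check. This is a sketch based on a literal reading of the rule as stated in this article, not the TextCensor parser: it assumes that each unescaped double quote needs white space (or the start or end of the expression) on at least one side, and that a backslash immediately before a quote makes that quote a literal. The function name `quotes_are_legal` is hypothetical.

```python
def quotes_are_legal(expression: str) -> bool:
    """Return True if every unescaped double quote touches white space
    or the start or end of the expression (assumed reading of the rule)."""
    length = len(expression)
    index = 0
    while index < length:
        char = expression[index]
        # A backslash followed by a quote is an escaped, literal quote: skip both.
        if char == "\\" and index + 1 < length and expression[index + 1] == '"':
            index += 2
            continue
        if char == '"':
            left_ok = index == 0 or expression[index - 1].isspace()
            right_ok = index == length - 1 or expression[index + 1].isspace()
            if not (left_ok or right_ok):
                return False
        index += 1
    return True


for sample in ['a "b" c', 'a"b"c', 'a\\"b', '"Confidential"']:
    print(repr(sample), quotes_are_legal(sample))
```

Under these assumptions, `a "b" c` and `"Confidential"` pass, `a"b"c` fails because both quotes sit between word characters, and `a\"b` passes because the quote is escaped.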
Following from the above considerations, when a TextCensor item is migrated, spaces are added as necessary to make a legal TextCensor 2 expression. In some cases this changes the matching behavior, as the following examples show.
Original item | Expression after migration | Comments
a"b"c | a "b" c | Closely matches the old behavior
a"&@#"c | a&@#c | Case sensitivity and boundaries lost
a"&@#" c | a&@# c | Case sensitivity and first boundary lost
a "&@#"c | a &@#c | Case sensitivity and second boundary lost
The original TextCensor would match continuous sequences or split sequences. The migration/upgrade process assumes that where symbols are combined in the absence of spaces, the intent was to match the sequence literally. Case sensitivity is lost on sequences where symbols combine with alphanumeric strings.
Case sensitivity could be lost in some other cases where non-alphanumeric characters are used in literal sequences.
However, an expression consisting entirely of a single quoted string will preserve case sensitivity across the entire expression.
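To make the pattern in the table easier to see, the following toy heuristic reproduces the four example rows and the single-quoted-string case. It is inferred only from those examples and is not the actual migration logic, which this article does not describe. The function name `migrate_item` is hypothetical, and the rule "quoted alphanumeric text keeps its quotes and gains spaces, quoted text containing symbols loses its quotes" is an assumption.

```python
import re

# A double-quoted run with no embedded quotes (escaped quotes are not handled).
QUOTED = re.compile(r'"([^"]*)"')


def migrate_item(item: str) -> str:
    """Toy approximation of the migration examples; not the real upgrade code."""
    # An item that is entirely one quoted string is left unchanged.
    if QUOTED.fullmatch(item):
        return item

    def rewrite(match: re.Match) -> str:
        body = match.group(1)
        # Assumed rule: quoted text containing symbols is merged literally,
        # so the quotes (and with them case sensitivity) are dropped.
        if not body.isalnum():
            return body
        # Assumed rule: quoted alphanumeric text keeps its quotes and is
        # padded with spaces so that each quote touches white space.
        before = "" if match.start() == 0 or item[match.start() - 1].isspace() else " "
        after = "" if match.end() == len(item) or item[match.end()].isspace() else " "
        return f'{before}"{body}"{after}'

    return QUOTED.sub(rewrite, item)


for original in ['a"b"c', 'a"&@#"c', 'a"&@#" c', 'a "&@#"c', '"&@#"']:
    print(f"{original!r} -> {migrate_item(original)!r}")
```

To see what the upgrade actually does to a given script, use the TextCensor Upgrade Preview tool and do not rely on a sketch like this one.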
Note:
These syntax issues do not affect most uses of TextCensor to look for phrases in ordinary language. They are more likely to affect scripts aimed specifically at JavaScript or other code.