7.6 Identifying Web Content Using TextCensor Scripts
TextCensor scripts allow you to check for the presence of particular content in a file or text. You can apply TextCensor to any text, including web pages, plain text files, and text extracted from PDF or Office documents.
TextCensor supports Unicode and wide characters, and is designed to work intuitively with a wide variety of languages.
A script can include many text-based conditions, combined using logical and positional operators. Whether the script triggers depends on the weighted result of all conditions.
TextCensor scripts are used in Content Analysis rules. The rule can block a request, classify a site, or add a site to a URL Category. To see a list of rules that use a script, view the properties of the script.
TextCensor scripts contain one or more expressions, each consisting of a word or phrase.
You can use two wildcard characters anywhere in a word or phrase:
•* matches zero or more letter or digit characters.
•? matches one letter or digit.
Wildcards match only letters and digits, and apostrophes or hyphens that are treated as part of words (see “Word Breaks”). Wildcards do not match other symbol characters.
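The wildcard rules above can be approximated in Python. This is only an illustrative sketch, not TextCensor's implementation: the helper names `wildcard_to_regex` and `matches_word` are hypothetical, and the sketch assumes that every apostrophe or hyphen counts as part of a word, which is a simplification of the word break rules.

```python
import re

# Assumption: a "word character" is a letter, a digit, an apostrophe, or a hyphen.
WORD_CHAR = r"(?:[^\W_]|['\-])"

def wildcard_to_regex(term):
    """Translate a wildcard term into a compiled, case-insensitive regex."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(WORD_CHAR + "*")   # zero or more word characters
        elif ch == "?":
            parts.append(WORD_CHAR)         # exactly one word character
        else:
            parts.append(re.escape(ch))     # literal character
    return re.compile("".join(parts), re.IGNORECASE)

def matches_word(term, word):
    """Return True when the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

print(matches_word("hous*", "houses"))  # True
print(matches_word("b?t", "bt"))        # False: ? needs exactly one character
print(matches_word("b?t", "b$t"))       # False: $ is not a letter or digit
```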
If you want to set the order of evaluation of a complex expression that uses more than one operator, use parentheses ( ).
Each TextCensor expression can include logical and positional operators. The operators must be entered in UPPERCASE.
TextCensor works with the positions of words or phrases within a file. For example, in the sentence “The quick brown fox jumps over the lazy dog” the word “quick” starts and ends at position 2, and the phrase “jumps over” starts at position 5 and ends at position 6.
A positional operator works with expressions that evaluate to sets of positions. It takes two sets of positions as parameters, and returns a new set of positions.
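To make the idea of word positions concrete, the following Python sketch numbers the words in a sentence and reports where a word or phrase starts and ends. The tokenizer is a deliberate simplification (runs of letters and digits only), and `tokenize` and `find_phrase` are hypothetical helper names, not part of TextCensor.

```python
import re

def tokenize(text):
    """Split text into lowercase words (simplified: runs of letters and digits)."""
    return [w.lower() for w in re.findall(r"[^\W_]+", text)]

def find_phrase(text, phrase):
    """Return a list of (start, end) word positions, counting from 1."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return []
    n = len(target)
    return [(i + 1, i + n)
            for i in range(len(words) - n + 1)
            if words[i:i + n] == target]

sentence = "The quick brown fox jumps over the lazy dog"
print(find_phrase(sentence, "quick"))       # [(2, 2)]
print(find_phrase(sentence, "jumps over"))  # [(5, 6)]
```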
Tip: In a simple TextCensor expression, you can think of the expression result as “true” or “matched” if the word or phrase is found in any position in the text. When the word or phrase is found in more than one position, this counts as more than one match of the expression. When you combine positional operators to make a complex expression, note the explanations of the sets returned by each operator (see below). Test your script before applying it in production.
You can specify a distance for many positional operators. The default distance (if you do not specify a value) is 4.
Operator | Syntax | Matching Results |
---|---|---|
FOLLOWEDBY | A FOLLOWEDBY[=distance] B | The start of B occurs within distance words from the end of A. Returns a set of positions spanning from the start of A to the end of B. Example: dog FOLLOWEDBY hous* matches “Dog in the house”. |
NOT FOLLOWEDBY | A NOT FOLLOWEDBY[=distance] B | The start of B does not occur within distance words from the end of A. Returns a set containing the positions in A that are not followed by B. Example: dog NOT FOLLOWEDBY=1 hous* matches “Dog in the house”. |
PRECEDEDBY | A PRECEDEDBY[=distance] B | The end of B occurs within distance words from the start of A. Returns a set of positions spanning from the start of B to the end of A. Example: dog PRECEDEDBY cat matches “Cat chasing dog”. |
NOT PRECEDEDBY | A NOT PRECEDEDBY[=distance] B | The end of B does not occur within distance words from the start of A. Returns a set containing the positions in A that are not preceded by B. Example: dog NOT PRECEDEDBY=2 cat matches “Cat was not chasing dog”. |
NEAR | A NEAR[=distance] B | If A occurs within distance words before B, the resulting position spans from the start of A to the end of B. If B occurs within distance words before A, the resulting position spans from the start of B to the end of A. Example: dog NEAR cat matches “Cat chasing dog” and also matches “Dog chasing cat”. |
NOT NEAR | A NOT NEAR[=distance] B | Returns the positions of all instances of A where B is not found within distance words from A. Example: dog NOT NEAR=2 cat matches “Cat was not chasing dog” and also matches “Dog was not chasing cat”. |
OR | A OR B | This form of the OR operator is applied when both A and B are sets of positions, even if one or both are empty sets. It returns the union of position sets A and B. For the sentence “A rose is a rose”, the expression (rose OR is) returns the position set 2,3,5. |
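The behavior of the positional operators can be sketched in Python by treating each position set as a list of (start, end) word spans. This is a minimal model under stated assumptions, not the product's implementation: the function names are hypothetical, and the sketch assumes that "within distance words" means the gap between the end of A and the start of B is at least 1 and at most distance.

```python
def followed_by(a, b, distance=4):
    """Spans from start of A to end of B, where B starts within `distance` words after A ends."""
    return sorted({(a_start, b_end)
                   for a_start, a_end in a
                   for b_start, b_end in b
                   if 1 <= b_start - a_end <= distance})

def not_followed_by(a, b, distance=4):
    """Spans of A that have no B starting within `distance` words after them."""
    return [span for span in a
            if not any(1 <= b_start - span[1] <= distance for b_start, _ in b)]

def near(a, b, distance=4):
    """A before B or B before A, within `distance` words."""
    return sorted(set(followed_by(a, b, distance)) | set(followed_by(b, a, distance)))

def positional_or(a, b):
    """Union of two position sets."""
    return sorted(set(a) | set(b))

# "Dog in the house": dog is word 1, house is word 4.
print(followed_by([(1, 1)], [(4, 4)]))                  # [(1, 4)]
print(not_followed_by([(1, 1)], [(4, 4)], distance=1))  # [(1, 1)]
# "Cat chasing dog": cat is word 1, dog is word 3.
print(near([(3, 3)], [(1, 1)]))                         # [(1, 3)]
# "A rose is a rose": rose is at 2 and 5, is at 3.
print(positional_or([(2, 2), (5, 5)], [(3, 3)]))        # [(2, 2), (3, 3), (5, 5)]
```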
7.6.1.2 Logical (Boolean) and Special Operators
A logical operator takes Boolean (true/false) values as input, and returns a Boolean result. These results cannot be used as parameters of a positional operator.
When one of the parameters to a logical operator is an expression that returns a position set, the parameter is treated as a logical value. A set with at least one position match is treated as true. A set that has no matches is treated as false.
TextCensor also supports the special operator INSTANCES.
Operator | Syntax | Matching Results |
---|---|---|
OR | A OR B | Returns true if A or B (or both) is true. This form of the OR operator is applied when either A or B (or both) are logical expressions. If both A and B are position sets then the positional OR operator is used instead. |
AND | A AND B | Returns true if both A and B are true. |
NOT | NOT A | Returns the opposite of A (true if A is false). |
INSTANCES | A INSTANCES=count | A must be an expression that returns a position set. The result is true if A contains count or more word positions; otherwise the result is false. |
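The conversion of position sets to logical values, and the INSTANCES test, can be modeled as follows. This Python sketch is illustrative only: `as_bool` and `instances` are hypothetical names, and it assumes that each matched span counts as one position for INSTANCES.

```python
def as_bool(value):
    """A non-empty position set is true, an empty one is false; booleans pass through."""
    if isinstance(value, bool):
        return value
    return len(value) > 0

def instances(position_set, count):
    """True when the position set holds `count` or more matches (assumed: one per span)."""
    return len(position_set) >= count

# Model of (dog FOLLOWEDBY hous*) AND NOT cat for a text that mentions
# "dog house" once and never mentions "cat".
dog_house = [(1, 2)]   # one match spanning words 1 to 2
cat = []               # no matches
print(as_bool(dog_house) and not as_bool(cat))  # True
print(instances(dog_house, 2))                  # False: only one match
```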
7.6.1.3 Anchored Regular Expressions
TextCensor supports use of Regular Expressions through the ARX operator.
An anchored regular expression is a regular expression (regex) that must be preceded by a word on the left-hand side of the ARX operator. Matching of the regex begins at the character immediately following the word. Regex patterns must always begin by matching one or more non-word characters. In most cases you can start the regex pattern with \W to match non-word characters.
Notes:
•ARX is based on Google RE2. The syntax generally follows well known Regular Expression syntax.
•Distance parameters for ARX operators are specified in characters.
•Regular expressions are case insensitive by default. You can force case sensitive matching with the operator ?-i
•ARX does not support lookahead or lookbehind.
•ARX does not support capture groups. Capture groups will be ignored or converted to non-capturing sequences.
•ARX does not support \Q...\E literal text.
•For further details of the ARX syntax, see the RE2 wiki.
Operator | Syntax | Matching Results |
---|---|---|
ARX | A ARX[=distance] /pattern/ | Locates instances of A where it is followed by text matching the regex pattern within distance characters of the end of A. The entire pattern must occur within the specified distance. The resulting position list spans from the beginning of A to any content matched by the regex. |
NOT ARX | A NOT ARX[=distance] /pattern/ | Locates instances of A where text matching the regex pattern does not occur within distance characters of the end of A. When this expression matches, the resulting position list is the position list of A. |

Example of ARX: dog chasing ARX /\W(one|two|10) cat(s*)/
Multiple anchored regular expressions can be combined with other expressions and operators to create complex statements. For example,
((dog OR boy) FOLLOWEDBY=1 ((chasing OR leading) ARX /\W(one|two|10)/) NEAR big) FOLLOWEDBY (white ARX /\W(horse|cat)s*/)
matches the phrase: dog chasing one or more big white cats.
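The anchoring behavior of ARX can be approximated with Python's `re` module. Note the assumptions: Python's `re` stands in for RE2 here (their syntax differs in places), the function name `arx` is hypothetical, and the distance is passed explicitly because the default ARX distance is not stated in this section.

```python
import re

def arx(text, word, pattern, distance):
    """Return text spans where `word` is immediately followed by a match of
    `pattern` that ends within `distance` characters of the end of `word`."""
    regex = re.compile(pattern, re.IGNORECASE)
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for m in word_re.finditer(text):
        # Anchor the regex at the character right after the word.
        tail = regex.match(text, m.end())
        if tail and tail.end() - m.end() <= distance:
            hits.append(text[m.start():tail.end()])
    return hits

sample = "The dog chasing two cats ran"
print(arx(sample, "chasing", r"\W(one|two|10) cat(s*)", 20))  # ['chasing two cats']
print(arx(sample, "chasing", r"\W(one|two|10) cat(s*)", 5))   # []
```

The second call returns nothing because the matched text is 9 characters long, which exceeds the distance of 5.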
The following concepts clarify how TextCensor expressions are evaluated.
Note: This section does not apply to Regular Expression patterns.
A word is made up of one or more letters and digits, and sometimes symbols.
•In alphabetic languages, a word is a group of letters or digits separated by other characters (such as punctuation, other symbols, and white space).
•In Chinese, or Japanese kanji, a word or “token” may be composed of one or more characters (ideographs).
A phrase is made up of a series of words separated by word break characters.
7.6.2.3 Symbols and Punctuation
Symbols other than letters and digits are not treated as part of a word unless they appear in the specific statement being evaluated. A group of symbols is not treated as a word.
Tip:
•The text word$deed is matched as two words by the expression word FOLLOWEDBY deed, and also by the exact expression word$deed
•The text $word$ is matched by any of word, $word, word$, or $word$
•The text Save $$$ Now is matched by save FOLLOWEDBY=1 now
The sets of characters that are treated as word and number break characters generally follow Unicode standards.
A word break character can also be matched exactly or by a wildcard.
Tip:
•Each of the following strings is treated as one word:
•The text half-baked is treated as two words and is matched by any of the following expressions:
TextCensor treats each accented character as a single letter. A letter with additional composed accent characters is normalized to a single character before the text is evaluated.
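The effect of this normalization can be illustrated with Python's standard `unicodedata` module. This is a sketch that assumes Unicode NFC composition approximates the behavior described above; the exact normalization form TextCensor uses is not stated here, and `normalize_text` is a hypothetical helper name.

```python
import unicodedata

def normalize_text(text):
    """Compose base letters with following combining accents (assumed: NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "cafe\u0301"               # 'e' followed by a combining acute accent
composed = normalize_text(decomposed)
print(len(decomposed), len(composed))   # 5 4
print(composed == "caf\u00e9")          # True: the accented letter is one character
```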
Some characters have special meanings in TextCensor. These characters are parentheses, square brackets, the asterisk, the equals sign, the double quote character, and the question mark. You can place a backslash character (‘\’) before any of these characters in order to use the character’s normal meaning. To use a normal backslash character, place two of them together (“\\”).
Within ARX expressions, the Regular Expression reserved characters apply (in particular the forward slash ‘/’, which marks the start and end of the regex pattern).
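If you generate TextCensor expressions from other data, the backslash escaping described above can be applied mechanically. The following Python sketch uses a hypothetical helper, `escape_literal`, and covers only the special characters listed for ordinary expressions, not the regex reserved characters that apply inside ARX patterns.

```python
# Characters with special meaning in TextCensor expressions, plus the backslash itself.
SPECIAL = set('()[]*=?"\\')

def escape_literal(text):
    """Prefix each special character with a backslash so it is taken literally."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

print(escape_literal("2*2=4?"))   # 2\*2\=4\?
print(escape_literal("a\\b"))     # a\\b
```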
TextCensor evaluation is NOT case sensitive by default. To perform a case sensitive match, quote the content using double quote characters. All special characters and escape characters retain their meaning within double quotes.
You can use TextCensor Classes to match specific types of characters inside a word, or special types of words.
Class | Matching Results |
---|---|
[LETTER] | Matches any single letter inside a word. |
[DIGIT] | Matches any single digit inside a word. For example, A[LETTER]B[DIGIT]C would match both “axb0c” and “aab9c”. |
[NUM] | Use in place of a word to match any number made up of one or more digits. This class does not match numbers with a decimal point, or Asian language numbers that use words between characters. |
[CCARD] | Use in place of a word to match a series of digits that look like credit/payment card numbers. These numbers consist of up to 5 groups of digits, are up to 19 digits in length, and must pass checksum validation (using the Luhn algorithm). This class should match most card numbers. |
[US-SSN] | Use in place of a word to match a series of digits that look like US Social Security Numbers. Valid numbers must follow a specific format. However, the format is loosely defined and it is not possible to prevent accidental matching of other numbers. |
[CAN-SIN] | Use in place of a word to match a series of digits that looks like a Canadian Social Insurance Number. Valid numbers must follow a specific format and pass a Luhn check. |
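The Luhn checksum used by the [CCARD] and [CAN-SIN] classes is a public algorithm. The following Python sketch shows the checksum only; it does not reproduce the grouping, length, or format checks that TextCensor applies, and `luhn_valid` is a hypothetical helper name.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum.
    Spaces and other separators are ignored."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test card number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```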
You can give a TextCensor statement a name. When a named statement is executed, the result is stored. You can reference it in later statements within the same script.
If a statement contains only words or only uses positional operators, the stored result is the set of word positions found by that statement. If the statement uses any other operators then the result is logical.
You can reference the result of a statement by using [@name] inside a statement. This can be used anywhere that you would otherwise use the bracketed result of an operator.
Note: Naming a statement does not affect the statement’s score. To use a named statement as a macro expression, in most cases you should set the statement’s score to zero. When using named statements within other expressions, remember that the result must match the required parameter type. If a statement returns a logical result you cannot use it as a parameter to a positional operator. Test your scripts before applying them in production.
7.6.3 Scoring a TextCensor Script
Each script is given a trigger threshold, expressed as a number. Each expression in a script is given a positive or negative score. If the total score of the content being checked reaches or exceeds the trigger threshold, the script is triggered.
The total score is determined by summing the scores resulting from evaluation of the individual expressions in the script.
For each expression, if the result is a true logical value, the expression score is the base score.
If the expression result is a position set (the word or phrase was found one or more times in the text), by default the final score of the expression is the base score. You can choose how to add the score when the expression is matched more than once. The options are:
Option | Description |
---|---|
Every time | Each match of the words or phrases adds the score to the total. |
First Match Only | Only the first match of the words or phrases adds the score to the total. |
First N Matches | Each match, up to the number you set, adds the score to the total. For instance if the expression score is 5 and you select “first 3 matches,” then the expression can contribute up to 15 to the total score, but never more than 15. |
Negative scores and trigger levels allow you to compensate for the number of times a word could be used in text that you do not want to match. For instance: if breast is given a positive score in an “offensive words” script, cancer could be assigned a negative score (since the presence of this word suggests the use of breast is medical/descriptive).
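The scoring arithmetic can be sketched as follows. This Python model is illustrative only: the function names, the mode labels, and the specific scores and threshold in the example are hypothetical values chosen to show how a negative score offsets a positive one.

```python
def expression_score(base_score, match_count, mode="first", limit=1):
    """Score contributed by one expression, given how many times it matched."""
    if match_count <= 0:
        return 0
    if mode == "every":      # Every time
        return base_score * match_count
    if mode == "first":      # First Match Only (the default)
        return base_score
    if mode == "first_n":    # First N Matches
        return base_score * min(match_count, limit)
    raise ValueError("unknown mode: " + mode)

def total_score(expressions):
    """Sum the contributions of all expressions in a script."""
    return sum(expression_score(**expr) for expr in expressions)

# Score 5, "first 3 matches", 7 matches found: capped at 15.
print(expression_score(5, 7, mode="first_n", limit=3))  # 15

# Hypothetical script: a positive score offset by a negative one.
script = [
    {"base_score": 10, "match_count": 2, "mode": "every"},   # e.g. breast, found twice
    {"base_score": -15, "match_count": 1, "mode": "first"},  # e.g. cancer, found once
]
threshold = 10
total = total_score(script)
print(total, total >= threshold)  # 5 False
```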
Note: Script evaluation always checks all expressions to obtain the final score. The order of expressions in a script is not significant. This is a change from earlier versions.
7.6.4 Adding a TextCensor Script
1.Select Policy Elements > TextCensor Scripts.
2.Click the New TextCensor Script icon in the tool bar to open the New TextCensor Script wizard. If necessary click Next to continue to the TextCensor Expressions window.
3.Click New to open the Add TextCensor expression window.
4.Enter the expression, optionally using the operators described earlier. For example:
(Dog FOLLOWEDBY hous*) AND NOT cat
In this example the expression score is added to the script total if the document contains the words dog house (or dog houses, and so forth) in order, and does not contain the word cat.
Note: TextCensor expressions are not case sensitive by default. However, quoted content is case sensitive. So textcensor would match TextCensor, but “textcensor” would not.
5.Select a score and contribution method for this expression (see “Scoring a TextCensor Script” for more information).
6.Click Add (or press Enter) to add the expression to this script. The window remains open so you can create additional expressions.
7.When all expressions have been entered, click Close to return to the New TextCensor Script window.
8.Select a trigger threshold. If the total score of the script reaches or exceeds this level, the script is triggered. The total score is determined by evaluation of all expressions in the script.
9.Click Next.
10.On the TextCensor Script Information page, enter a name and optional description for the script.
11.Click Next, then Finish, to add the script.
7.6.5 Editing a TextCensor Script
1.Select TextCensor Scripts in the left pane.
2.Double-click the script name in the right pane to open the script properties window.
3.Double-click an expression to edit it.
4.To delete a line, select it and click Delete.
5.Change the script name and trigger level as necessary.
6.Click OK to accept changes or Cancel to revert to the stored script.
7.6.6 Importing a TextCensor Script
You can import TextCensor scripts from files.
Note: You can import scripts in the format used by WebMarshal 6.9.5 and above, as well as scripts in the format used by earlier versions. The earlier version scripts will be upgraded to the new format automatically. Any problems with upgrading will be reported.
To import a Script:
1.Select TextCensor Scripts in the left pane.
2.Click the New TextCensor Script icon in the tool bar to open the New TextCensor Script wizard.
3.On the TextCensor Expressions window, click Import.
4.Select the file you wish to import, and click Open.
5.Complete the Wizard to add the script.
7.6.7 Exporting a TextCensor Script
You can export TextCensor scripts to XML or text files.
To export a Script:
1.Select TextCensor Scripts in the left pane.
2.Double-click the name of the script you want to export in the right pane to open the script properties window.
3.Click Export.
4.Enter the name of the file you want to create, and click Save.
5.In the script properties window, click OK.
7.6.8 Using TextCensor Effectively
The effective use of TextCensor scripts depends on understanding how the TextCensor facility works and what it does.
TextCensor evaluates rules against plain text or HTML documents. The rules can be used to block a request, classify a site or add it to a URL Category. If a Content Analysis rule includes a “block” action, TextCensor scripts are evaluated before the material is returned to the user.
Blocking does not apply to content cached on the local computer.
7.6.8.1 Constructing TextCensor Scripts
The key to creating good TextCensor scripts is to enter words and phrases that are not ambiguous. They must match the content you want to block. Also, if certain words and phrases are more relevant to the match than others, those words and phrases should be given a higher score to reflect the greater relevance.
In creating TextCensor scripts, you should strike a balance between overly general and overly specific terms. For instance, suppose a script is required to check for sports-related sites. Entering the words score and college alone would be ineffective because those words are likely to be used on non-sports sites. The script would trigger too often, potentially stopping access to acceptable sites such as general news sites.
The same script (to find sports-related sites) would be better constructed using the phrases extreme sports, college sports and sports scores, as these phrases are sport-specific. However, using only a few very specific terms may mean that the script does not trigger often enough.
In the same sports example, the initials NBA and NFL, which are very sport-specific, should be given a suitably higher score (that is, promoting earlier triggering) than, for example, college sports.
7.6.8.2 Decreasing Unwanted Triggering
TextCensor scripts might trigger on pages that are not obviously related to the content types they are intended to match.
To troubleshoot this problem:
1.Use the problem script in a TextCensor Rule that classifies sites and adds them to a URL Category, such as “suspected sports sites.”
2.After using this rule for a while, check the sites that have triggered the script and determine which ones are triggering it falsely.
•In the URL Categories display of the console, double-click a URL to view the reasons it was added (for example, TextCensor details).
3.Revise the script by changing the score or key words, so as to decrease false triggering.
4.When satisfied, create a Standard Rule which denies access to sites in the URL Category generated by the script, and/or add a Block action to the original TextCensor rule.
7.6.9 Testing TextCensor Scripts
To test the operation of a TextCensor script:
1.Click Test on the New TextCensor Script or script properties window to open the Test TextCensor window.
2.In the Test TextCensor window, enter the sample text you want to test using one of two methods.
•Select Test script against file. Enter the name of a file containing the test text (or browse using the button provided).
•Select Test script against text. Type or paste the text in the field.
3.Click Test. The result of the test (including details of the expressions which triggered and their scores) displays in the Test Results pane.
You can also test a script as part of the test of a content rule. This method allows you to test using pages drawn directly from the Web. See “Testing Access Policy” for detailed information on Rule testing.