This is my project page on datamining techniques as applied to spam and other network security problems.

Find unique strings: substringMine

This is a demonstration/prototype tool designed to scan two large corpuses of text to identify substrings that occur frequently in one corpus but not in the other. This is unlike word frequency histogram programs because it doesn't do any tokenization at all. If it is told to look for 10-character strings, it will brute-force all 300 million 10-character strings in a 300mb corpus.

Note that this program is an untuned demo and test for the techniques used, I offer it because it may be useful. It requires about 100-200 megabytes of RAM to operate, and can scan corpuses up to about 1 gigabyte.

Input

substringMine  <width> <posRate> <negRate>  <posFile> <negFile>

width
is the length of string the program should search for. A reasonable range is from 8 to 200.
posRate
is the minimum occurance rate of a substring in the positive file. Note that setting this too low will lead to high memory consumption.
negRate
is the maximum occurance rate of a substring in the negative file. Note that setting this too low will lead to the probabilistic algorithm missing some strings. An estimate of the number missed is output during processing.
posFile
the positive file
negFile
the negative file

The program will scan both files and output all strings that occur more freqently than posRate times per megabyte and occur no more frequently than negRate in the other file. The output consists of some informational messages to stderr and the found strings to stdout. It takes about a minute to analyze two 20 megabyte corpuses.

A sample invocation to analyze some spam is ./substringMine 20 3.2 1.5 /tmp/SPAM /tmp/HAM > /tmp/log . This will output all strings of length 20 that occur at least 3.2 times per megabyte in spam, but no more than 1.5 times per megabyte in ham. The positive and negative rates are a reasonable starting point. Adjust upward or downward as needed.

Notes

This application, like many machine learning applications, is very good at identifying any artificially induced cruft in the dataset, such as extra headers induced by spam filtering, and any other bias. Clean datasets are essential.

Processing time for this algorithm is strictly a function of the input length, at about 30 megabytes a minute. A future versions should be about ten to twenty times faster or up to a third of the memory use.

Outputs

The informational message consistes an estimate of the number of matches and the estimated percentage of matches that it will fail to output. (this happens because the probabilistic algorithm may make a mistake) The estimated number of matches should not be over 200,000. The false miss rate shouldn't be over 10%. If either of these is too high, posRate and negRate higher.
 TAB <Number of hits in positive file> TAB <Number of hits in negative file> TAB <Escaped string> 
An excerpt of an output follows:
 
        1110    3       800\nX-Priority:
        1108    3       0800\nX-Priority
        1107    3       +0800\nX-Priorit
        1106    3        +0800\nX-Priori
        402     0       -Mailer: FoxMai
        402     0       X-Mailer: FoxMa
        402     0       Mailer: FoxMail
        820     2       iority: 3\nX-Mai

Availability

This program is currently only available as an OCaml binary under linux x86. It may be fetched here

The version that uses more RAM can find more infrequent patterns allowing you to set posRate and negRate about two-thirds lower.