This is my project page on datamining techniques as applied to spam and other network security problems.
This is a demonstration/prototype tool designed to scan two large corpuses of text to identify substrings that occur frequently in one corpus but not in the other. This is unlike word frequency histogram programs because it doesn't do any tokenization at all. If it is told to look for 10-character strings, it will brute-force all 300 million 10-character strings in a 300mb corpus.
Note that this program is an untuned demo and test for the techniques used, I offer it because it may be useful. It requires about 100-200 megabytes of RAM to operate, and can scan corpuses up to about 1 gigabyte.
substringMine <width> <posRate> <negRate> <posFile> <negFile>
width | is the length of string the program should search for. A reasonable range is from 8 to 200. |
posRate | is the minimum occurance rate of a substring in the positive file. Note that setting this too low will lead to high memory consumption. |
negRate | is the maximum occurance rate of a substring in the negative file. Note that setting this too low will lead to the probabilistic algorithm missing some strings. An estimate of the number missed is output during processing. |
posFile | the positive file |
negFile | the negative file |
The program will scan both files and output all strings that occur more freqently than posRate times per megabyte and occur no more frequently than negRate in the other file. The output consists of some informational messages to stderr and the found strings to stdout. It takes about a minute to analyze two 20 megabyte corpuses.
A sample invocation to analyze some spam is ./substringMine 20 3.2 1.5 /tmp/SPAM /tmp/HAM > /tmp/log . This will output all strings of length 20 that occur at least 3.2 times per megabyte in spam, but no more than 1.5 times per megabyte in ham. The positive and negative rates are a reasonable starting point. Adjust upward or downward as needed.
Processing time for this algorithm is strictly a function of the input length, at about 30 megabytes a minute. A future versions should be about ten to twenty times faster or up to a third of the memory use.
TAB <Number of hits in positive file> TAB <Number of hits in negative file> TAB <Escaped string>An excerpt of an output follows:
1110 3 800\nX-Priority:
1108 3 0800\nX-Priority
1107 3 +0800\nX-Priorit
1106 3 +0800\nX-Priori
402 0 -Mailer: FoxMai
402 0 X-Mailer: FoxMa
402 0 Mailer: FoxMail
820 2 iority: 3\nX-Mai
The version that uses more RAM can find more infrequent patterns allowing you to set posRate and negRate about two-thirds lower.