BooleSuite - a program that applies a variety of learning algorithms to binary-featured data

By Bob Marinier


Files

Source code for the program is available for download. The test.scr file is an example script file. Use the Makefile to compile on Unix. On Windows, add the files to a console application project (the code should compile on both platforms without modification).

Background

This program was developed in collaboration with Prof. David C. Page, Jr. at UW-Madison for my CS699 (directed study) class in the Spring of 2001. The program was written in Visual C++ 6.0 on Windows 2000, but was specifically designed to compile and run on Unix (since that's what Prof. Page uses).

What can this program do? It can learn using ensembles, decision trees, focus trees, and focus ensembles (more on the focus algorithms later), all with boosting. It does stratified cross validation to any number of folds, uses the AdaBoost.M1 algorithm to do boosting, reports true and false positive rates, and can generate gnuplot files to make ROC curves if you want. It can also report the most frequently missed examples and various debugging info.

Acknowledgement

This work was supported in part by NSF grant 9987841.

Legal Stuff

The legal stuff can be found in the legal.lic file.

Final Note

If you use the program or make useful modifications, and especially if you do any research with it, we'd love to hear about it! Professor Page can be contacted at page@biostat.wisc.edu.

Using the Program

There are two parts to using the program: the command line and the script file. To run the program, type

boolesuite -file {example file} -script {script file} {other options}

Command line arguments:

-file {filename} : this is the file that contains the examples to learn from. This is required.

-script {script filename} : this is the file that contains the script to run. This is required.

-test {filename} : if you want to learn from one file and test on another file, then you can specify the test file here.

-name {namesfile} : specify this if the features have names different from their positions (i.e. bit column 1 isn't really feature 1 but is actually feature 5245). If specified, the names will be used in reporting final results. This is required if you are using the focus tree algorithm.

-plot {file prefix} : specify this if you want the various TP and FP rates written to a gnuplot file so you can generate ROC curves. The data points will be put into files named {file prefix}0.data, {file prefix}1.data, etc., and the gnuplot file itself will be {file prefix}. Running gnuplot on {file prefix} generates a postscript file, learn.ps, which contains the actual plot. Each data set is plotted twice, once with lines and once with points. To differentiate the line and point types, I start at 1 and skip every other value, so the types used for lines and points are 1, 3, 5, etc. I don't know what the maximum type is, so if you try to plot too many lines at once, you may have gnuplot problems. Also, the scale is always 0-1 (marked as 0-100%) on both axes; if you want some other scale, you'll have to manually edit the gnuplot file.

-ssize {positive integer} : this is the size of the foci to use for the focus algorithms (see the FAQ for details on foci). The default is 51300.
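
For example, a typical invocation might look like this (the filenames here are hypothetical):

boolesuite -file train.data -script test.scr -name train.names -plot roc

This would learn from the examples in train.data using the runs specified in test.scr, report results using the feature names in train.names, and write the plot data to roc0.data, roc1.data, etc., along with a gnuplot file named roc.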

Script file commands: A script file is simply a file that specifies any number of learning runs to do. The possible flags are as follows.

learn [ensemble|tree|sensemble|stree] : specifies the learning algorithm to use. ensemble and tree are the standard ensemble and decision tree algorithms; sensemble and stree are the focus versions of those algorithms (see the FAQ for what a focus is). The default is tree.

s [gini|entropy|entropyratio|x2] : specifies the scoring type to use. The default is entropyratio, which is what C5.0 uses. (Typical formulas for these measures are sketched after this list.)

min [number] : specifies the minimum score allowed in a decision or focus tree. This is used to terminate the tree, and is not meaningful in ensemble runs. The default is 0.

pk [number, 0<=number<=1] : specifies the fraction of top scoring results to keep in an ensemble run. The default is .0001. Regardless of the percentage specified, it will always keep at least 1 result.

cost [number, number>0] : specifies the initial ratio of the total weight of the positive examples to the total weight of the negative examples, for any learning type. The default is 1.

pv [number, 0<=number<=1] : specifies the fraction of the vote needed from the result sets in a boosted ensemble run to call an example positive. The default is 0.25.

c [non-negative integer] : specifies the number of folds to use for cross validation. Specifying 0 does n-fold (leave-one-out) cross validation. Specifying 1 does no cross validation. The default is 1.

i [non-negative integer] : specifies the number of boosting iterations to do. Specifying 0 does no boosting. The default is 0.

force : forces a run to finish, even if the boosting error exceeds 0.5. Result sets with error >0.5 get 0 weight in the final vote (see the AdaBoost.M1 sketch after this list). If this is not specified, then runs which exceed the 0.5 error threshold are terminated (other runs in a script will still be executed).

show : calculates a final result (no cross validation) and shows it. If no cross validation was specified, then this is on by default.

// : marks the beginning of a comment. If a comment is on the same line as a parameter, then there must be a space separating the end of the parameter and the //.

END : marks the end of a learning run. You can specify several learning runs, each separated by an END.

ENDGROUP : if you're going to plot the results, you can use this to mark off groups of runs whose results should be plotted together on the same line.
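
The scoring types correspond to standard impurity and association measures (x2 is presumably a chi-squared test of association). Here is a minimal C++ sketch of how gini, entropy (information gain), and entropyratio (gain ratio, the C4.5/C5.0 measure) are typically computed for a binary split. This is my own summary of the standard formulas, not the program's actual scoring code, and the split counts are hypothetical:

#include <cmath>
#include <cstdio>

// Binary entropy of a class distribution with pos and neg examples.
double entropy(double pos, double neg) {
    if (pos == 0 || neg == 0) return 0.0;
    double n = pos + neg, p = pos / n, q = neg / n;
    return -p * std::log2(p) - q * std::log2(q);
}

// Gini impurity of a class distribution.
double gini(double pos, double neg) {
    double n = pos + neg, p = pos / n, q = neg / n;
    return 1.0 - p * p - q * q;
}

int main() {
    // Hypothetical split: of 60 examples (26 pos, 34 neg), 20 have the
    // feature set (16 pos, 4 neg) and 40 do not (10 pos, 30 neg).
    double pos = 26, neg = 34, n = pos + neg;
    double lpos = 16, lneg = 4, l = lpos + lneg;   // feature present
    double rpos = 10, rneg = 30, r = rpos + rneg;  // feature absent

    // entropy: information gain = parent entropy - weighted child entropy.
    double gain = entropy(pos, neg)
                - (l / n) * entropy(lpos, lneg)
                - (r / n) * entropy(rpos, rneg);

    // entropyratio: gain ratio = gain / split info.
    double splitInfo = -(l / n) * std::log2(l / n)
                       - (r / n) * std::log2(r / n);

    // gini: reduction in Gini impurity.
    double giniGain = gini(pos, neg)
                    - (l / n) * gini(lpos, lneg)
                    - (r / n) * gini(rpos, rneg);

    std::printf("gain %.4f, gain ratio %.4f, gini gain %.4f\n",
                gain, gain / splitInfo, giniGain);
}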
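
Similarly, since the i and force flags interact with boosting, here is a minimal C++ sketch of how AdaBoost.M1 weights result sets (again, my own summary of the standard algorithm, not the program's actual code). In particular, it shows why a result set with error above 0.5 contributes nothing to the final vote:

#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical weighted training errors for three boosting iterations.
    double errors[] = {0.20, 0.35, 0.55};

    for (double e : errors) {
        // AdaBoost.M1 sets beta = e / (1 - e) and gives the hypothesis a
        // final vote weight of log(1/beta). When e > 0.5, beta > 1 and the
        // weight would be negative, so it is clamped to 0 -- which is what
        // happens to over-0.5 result sets when force is specified.
        double beta = e / (1.0 - e);
        double weight = std::log(1.0 / beta);
        if (weight < 0.0) weight = 0.0;
        std::printf("error %.2f -> vote weight %.4f\n", e, weight);
        // Between iterations, AdaBoost.M1 multiplies the weight of each
        // correctly classified example by beta and renormalizes, so missed
        // examples carry relatively more weight in the next iteration.
    }
}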

Here's an example of a script file. The first two runs will be plotted on the same line, and the last two runs will be plotted together on a different line. All options have been specified for all runs, even though they don't all apply (it's easier to cut and paste that way). Options that don't apply to a run are ignored.

***Script file example begins on next line***
//this is a script file
//note that for comments, there must be a space before the // when it's on the same line as a parameter
learn tree // learning alg to use
min .1 // min score to terminate decision tree
pk .0001 // percent of the results to keep
c 5 // number of folds to use
i 3 // number of boosting iterations
pv .25 // percent of vote required for ensemble to call it positive
cost 1 // ratio of weight of active to weight of inactive examples for boosting weight initialization
s inew // scoring type
force // force boosting to finish
show
END //end of this run
learn tree // learning alg to use
min .1 // min score to terminate decision tree
pk .0001 // percent of the results to keep
c 5 // number of folds to use
i 3 // number of boosting iterations
pv .25 // percent of vote required for ensemble to call it positive
cost 2 // ratio of weight of active to weight of inactive examples for boosting weight initialization
s inew // scoring type
force // force boosting to finish
show
END //end of this run
ENDGROUP //end of this group
learn ensemble // learning alg to use
min .1 // min score to terminate decision tree
pk .0001 // percent of the results to keep
c 5 // number of folds to use
i 3 // number of boosting iterations
pv .25 // percent of vote required for ensemble to call it positive
cost 1 // ratio of weight of active to weight of inactive examples for boosting weight initialization
s inew // scoring type
force // force boosting to finish
show
END //end of this run
learn ensemble // learning alg to use
min .1 // min score to terminate decision tree
pk .0001 // percent of the results to keep
c 5 // number of folds to use
i 3 // number of boosting iterations
pv .25 // percent of vote required for ensemble to call it positive
cost 2 // ratio of weight of active to weight of inactive examples for boosting weight initialization
s inew // scoring type
force // force boosting to finish
show
END //end of this run
ENDGROUP //end of this group
***Script file example ended on previous line***

When the program runs

The program will load the example file(s). Progress is indicated by a dot for every 10 examples loaded and a count of examples loaded so far every 100. The program will show the number of positive and negative examples loaded as well as the number of features (counting the first column). The program will then begin the runs. Each run is preceded by information showing the options for the run. The run itself will show progress by printing the letters c, b, e, t, and f, as well as digits. The c means the cross validation loop has been entered (this should happen once per run). The b means the boosting loop has been entered (once per fold). The digit specifies the current fold, starting at 0. The e means the ensemble loop has been entered (applies only to ensembles). The t means the boost testing has begun (once per boost iteration). The f means the final fold testing has begun (once per fold). If specified, the program will show final results. An example run might look like this:

Loading examples......
Number positive examples: 26
Number negative examples: 34
Number features: 139352
Loading names
***Run Info***
Learn Alg: decision tree
Folds: 5
Iterations: 3
Force boost finish: true
Vote percent for pos: 0.25
Percent results to keep: 0.0001
Cost: 2
Score type: inew
Min tree score: 0.1
Show results: true
Plot group: 0
Progress: c
b0tttf
b1tttf
b2tttf
b3tttf
b4tttf
Avg percent correct: 0.765967
TP: 13
FP: 1
TN: 33
FN: 13
TP Rate: 0.5
FP Rate: 0.0294118
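
As a sanity check on these numbers: the TP rate is TP/(TP+FN) = 13/26 = 0.5, and the FP rate is FP/(FP+TN) = 1/34 = 0.0294118, matching the last two lines above.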

It should be noted that, depending on the options specified in the script, on the command line, or in the code, the output can look significantly different.

File Formats

Example File Format: This format is used in the files specified by the -file and -test command line options. Each example is on its own line. The first character is either A or I, specifying an Active or Inactive example. This is followed by a string of 0's and 1's separated by commas, one for each feature. A 0 means the feature is not present; a 1 means the feature is present. A line can optionally end with a period. Each example must have the same number of features. Here's an example:
A,1,0,1,0.
A,0,0,1,1.
A,0,1,1,1.
A,0,0,0,0.
A,0,0,1,1.
I,1,1,1,1.
I,1,1,0,0.
I,1,0,0,1.
I,0,1,1,1.
I,0,0,1,0.

Optionally, you can take all of the commas and periods out. I recommend this for large files, since they will load faster. Here's the previous file in this format:
A1010
A0011
A0111
A0000
A0011
I1111
I1100
I1001
I0111
I0010

Note on how this was programmed: The file is read one line at a time. For files with a small number of features and a large number of examples, this is not very efficient; however, the case I programmed for was the opposite, so it worked out fine. Also, you need not use commas and periods as separators: the parser ignores anything that's not a 1, 0, I, or A (see the sketch below). You also need not use I and A in the first column; a 0 or a 1 will work fine there, too.
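
To illustrate that tolerance, here is a minimal C++ sketch of a parser that skips everything except 1, 0, I, and A. This is a reconstruction from the description above, not the program's actual code; in particular, I'm guessing that a 1 in the label position means active, like A:

#include <iostream>
#include <string>
#include <vector>

struct Example {
    bool active = false;        // class label
    std::vector<bool> features; // one entry per bit column
};

Example parseLine(const std::string& line) {
    Example ex;
    bool haveLabel = false;
    for (char c : line) {
        if (c != '0' && c != '1' && c != 'A' && c != 'I')
            continue; // commas, periods, whitespace, etc. are ignored
        if (!haveLabel) {
            ex.active = (c == 'A' || c == '1'); // first kept char is the label
            haveLabel = true;
        } else {
            ex.features.push_back(c == '1');
        }
    }
    return ex;
}

int main() {
    Example e = parseLine("A,1,0,1,0.");
    std::cout << (e.active ? "active" : "inactive")
              << ", " << e.features.size() << " features\n";
}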

Names File Format: The names file format specifies the actual names of each column, one on each line. There must be a name for every bit column (excluding the first column), and the names must be numbers that fit into whatever an unsigned long is on your machine. Furthermore, the numbers must be in increasing order. Here's an example for the previous example file:
19175
42397
929568
993485

FAQ

Q: What is a focus tree and focus ensemble?
A: A focus tree is a special modification of the decision tree algorithm that takes advantage of the particular way in which some data is structured. Some data has related features grouped together in what we call foci. If the highest scoring feature appears in a particular focus, then we may want to consider only other features in that focus for further splits. That is what the focus tree and focus ensemble are for. The focus size (set with the -ssize command line option) is the size of each focus, given the names file. So if the focus size is 1000, then features 0-999 would be focus 1, features 1000-1999 would be focus 2, etc. A names file is required for this: it contains what the features are actually numbered as, so that irrelevant features can be removed. So if the focus size is 1000, and the first few named features are 10, 323, 899, 1103, etc., then only the first 3 of those features would be in focus 1, since only they are in the range 0-999.
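
Here is a minimal C++ sketch of that mapping, based on my reading of the description above (not the program's actual code):

#include <iostream>
#include <vector>

int main() {
    const unsigned long fsize = 1000; // the focus size (-ssize)
    // Hypothetical names file contents: actual feature numbers, increasing.
    std::vector<unsigned long> names = {10, 323, 899, 1103, 2047};

    for (unsigned long name : names) {
        // Focus 1 covers names 0-999, focus 2 covers 1000-1999, etc.
        unsigned long focus = name / fsize + 1;
        std::cout << "feature " << name << " is in focus " << focus << "\n";
    }
}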

Q: What's the maximum number of features my examples can have?
A: By default the max is 140000. If you would like to change this value (either because you need a larger one or you want the program to be more memory efficient), you can change it in the constants.h file.

Q: What's the maximum number of examples I can have?
A: By default the max is 10000. If you would like to change this value, you can change it in the constants.h file. Changing this will not have a significant impact on memory usage.

Q: What's the maximum size of a line in a script file?
A: By default this is 1000. This is already quite large to allow room for comments, but if you want to change it, you can do so in the constants.h file.

Q: Are there other defaults I can change?
A: Yes, there are lots, although I doubt that most people will find the rest very useful. Some defaults are specified in the constants.h file. Others are in the RunInfo.h file in the constructor. I suppose any constructor that initializes values is setting defaults, though, so if you really need absolute control, then you'll need to look through all the code. Finally, there are defines in the compilerOptions.h file that take parts of the code in or out. Most of this is for debugging, although two of the defines are on by default. The options that are turned on report the most frequently missed examples and show the boosting error after each boosting iteration (scale 0-1). All of the options are explained in the compilerOptions.h file. If you're going to turn on some of the other debug defines, I highly suggest that you only run the resulting executable on very small datasets. Some of the options show some or all of the examples, and if you have lots of them, or large ones, your screen will scroll forever.

Q: I ran the program and got a message about missing, invalid, or conflicting arguments. What's up?
A: If there's a typo in the script file, you may get this message. The program tries to recover by using default values, and so will probably run anyway, but not necessarily how you wanted it to.

Q: I ran the program and got unexpected results. What's up?
A: Make sure that you're using the correct script and data set. If you're running everything from a batch file, it's really easy to forget to change those. Another thing you might try is recompiling with fewer optimization options. On some Unix machines, -O3 may give NaN for all the scores; on those, try -O2 or no optimization at all (although it will run pretty slowly that way). The -finline-functions option also seems to cause problems on some machines.

Q: How do I determine which examples get missed all the time?
A: If the corresponding compiler option is turned on (it is by default), then the most frequently missed examples will be printed after each run. The output looks like this:
Most unpredictable results:
45 weight: 15
50 weight: 15
23 weight: 0.517241
5 weight: 0.517241
7 weight: 0.517241

The first number is the example number from the file, starting at 1 (so 45 is the 45th example in the file). The weight indicates how disproportionately large that example's weight is compared to the rest; this usually means it was boosted a lot because it was missed repeatedly. That said, if you set the cost option to something other than 1, this interpretation may be incorrect. Even if you don't do boosting, the program still changes the weights internally as if it were going to boost, so the list is still meaningful. If you want to change the number of examples it reports, there's a constant in the constants.h file.

Q: Why does the progress show that it's entering the cross validation loop and/or the boosting loop when I have cross validation and/or boosting turned off?
A: The program always goes through all the loops; in those cases, it only goes through the loops once.

Q: How do I interpret the tree output?
A: Here's a sample decision tree output:
Set number 0
Weight: 4.85798
Prev Bit Set: 0 bit: 1154046 score: 0.420105 examples: 60
Prev Bit Set: 1 examples: 16 vote: A
Prev Bit Set: 0 bit: 7310226 score: 0.283515 examples: 44
Prev Bit Set: 1 examples: 5 vote: A
Prev Bit Set: 0 bit: 8845981 score: 0.266763 examples: 39
Prev Bit Set: 1 examples: 3 vote: A
Prev Bit Set: 0 examples: 36 vote: I
The first line specifies the set (if there is boosting, there will be multiple sets, i.e. multiple trees). The next line specifies the weight this set has in the final voting; the weight is higher for more accurate sets. The remaining lines show the actual tree. Prev Bit Set indicates whether this branch is for examples that do (1) or do not (0) have the previous splitting feature; in the root case, this means nothing. The bit is the actual name of the splitting bit. The score is the score of the splitting bit, based on the examples that made it that far. The examples count is the number of examples in that branch at that point. The vote (in leaf nodes only) is whether an example that makes it there is called Active or Inactive. Each sub-branch is indented. In some cases (low minimum score, boosting) both leaf branches may vote the same way.

Q: Can I save the output to a file?
A: This functionality is not built into the program, although in Unix you can redirect the output from the screen to a file. In fact, I recommend writing a batch file that does this, and then just running the batch file every time. For example, create a file that contains the following (depending on your shell):
boolesuite [whatever your args are] >& [output file]
-OR-
boolesuite [whatever your args are] > [output file] 2>&1

Then, at the command line, type "chmod +x [batch file]" (no quotes, of course). Finally, to run the batch file, type "./[batch file]" or just "[batch file]" (again, no quotes). Note that I am redirecting both standard output and standard error here. All debug information (including the output of the two options that are on by default) is written to standard error, so if you want that saved too, you need to do this. If you would prefer to see error messages, like possible loading errors, on the screen instead, then leave the "&" out of the script. If you want to save the output in Windows, I recommend increasing the buffer size of the console window so none of the output is lost, and then copying and pasting the output to a text file.

Q: The program just crashed on me! What gives?
A: We've run the program a lot, and I'm pretty sure that if it's being used correctly it won't crash. I've tried to have the code catch most cases where required arguments were left out, but that's definitely a possible cause. If you do find a bug, though, you can send us a report.

Q: What are the compiling options?
A: There are lots of compiling options, mostly for debugging. Two of them are turned on by default, though, since they are so useful (to us, at least). The options that are turned on report the most frequently missed examples and show the boosting error at each step (scale 0-1). All of the options are in the compilerOptions.h file, and are explained there. If you're going to turn on some of the other debug defines, I highly suggest that you only run the resulting executable on very small datasets. Some of the options show some or all of the examples, and if you have lots of them, or large ones, your screen will scroll forever.

Q: Why do some of the comments in the code not make any sense?
A: This program was written and then modified over several months. Some comments may not apply to the current code any longer. I made some attempt to update the comments, but I'm not making any promises.

Q: How optimized is this?
A: At one point I had an evaluation version of a profiling tool, which I used to optimize the ensemble routine quite a bit. The code I wrote later (trees, foci) did not go through this process, but a lot of the code is shared, so overall I think it's pretty fast. Then again, I did all my optimizing on Windows, though it should carry over to Unix fine. I use the STL a lot, which may slow things down in some places, but it really makes my life easier. Finally, I'm not an expert in g++ compiler options, so the ones I used may or may not make things faster. If you try some others and it makes a big difference, let us know!

Q: How do I compile this?
A: In Unix, use the makefile I provide. I'm no makefile expert, though, so if you are, you might find problems with it. Rather than rely on it to recompile only the necessary parts, I always do a make all. In Visual C++, you'll have to create a Win32 Console project and add all the files.

Q: I have tons of questions about all of this!
A: You can always email me, and I'll try to answer your questions, although I'm not sure how much I'll remember several months from now.

History

v1.0 - Initial release