First a short intro…
What is this?
It’s a perl native support vector machine module with threading. So it’s pure perl no C that then talks to perl via bindings. It’s coded to use threading in a few areas and for general training. It also comes with front end perl scripts that for the most part (see the weight options) mirror libsvm’s command line scripts. So you “could” use this as a drop in replacement of libsvm (with one exception, the pre-computed kernel, I never got around to it).
Why I did this
First and foremost I wanted to hands on build an SVM engine, to learn the structure. Second was to play around with threading svm model generation without needing to worry about memory cleanup. Finally, not the original goal but a nice side effect, improve some of my coding style as it relates to efficiency.
My job and my undergraduate degree are not focused or related to coding. I do however end up needing to create data mining and log parsing scripts on a regular basis and while I can code in many other languages I’ve become most adept at perl. (Mostly because it’s quick, and dirty in a way I find amusing (crazy one liners that do ridiculously complicated tasks).
As you can guess though perl is a terrible choice for computing SVMs and as with most of my side projects, I’ll likely rewrite things in C when it becomes painfully obvious I have no other choice.
It’s a really bad idea
I can’t stress enough how bad perl is at doing this. It really upset me to realize I was out of things I could do to optimize the code any further and even with threading there was a 30x to 500x increase in computation time.
If on the other hand you can read perl more easily than C and want to learn how to compute SVMs there is this.
Seriously though, if your implementing any of these types of algos for real use stick to C and at most create bindings to other languages. (by real use I mean if you run a test set and it consumes your entire memory space, you shouldn’t even be considering what I wrote here as implementable)
Building, Testing, Sadness…
The code is in large part based on libsvm’s svm.cpp file. It’s a near 1-1 for large portions. I went through and modified and optimized a few things and added threading where it made sense (e.g . when the overhead of threading wouldn’t actually slow things down).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
I tried to keep threading and non threading branches by wrapping threading points into functions so I could run the linear version as a reference and for nytprof.
1 2 3
Testing was done using libsvm as a comparison and computting the same SVMs and diffing the results. Perl uses full sized floats by default so it’s results mismatched C using the epsilon-svr kernel.
Much to my dismay I was never able to get perl even close to the same performance as C. Now I know that seems crazy but the svm-scale.pl does out preform the C version purely because of threading. I had hoped the same might be true of large data sets using the perl version but as the chart below shows it grows at the samse rate as C and never converges. I have some ideas that could help get the curve down a bit but I’m about ready to call it done and forcus on the C version and get threading working there.
Start with C on the first run, it was interesting and I found I had been doing some bad things in perl for a long time, but it just didn’t result in anything all that useful.
NYTProf is your friend. This is some super nice perl proffiling. You can’t use threads with this (it’ll just get confused) but use it linearly and it’ll defentaly help weed out performance issues.
Don’t do this
1 2 3 4 5 6 7
I was surprised to find out just how much cputime doing shift on the global
@_ variable does. Nearly 1/3 of the cpu time was being wasted on that alone. Weird thing is I used to do it the right way till a year ago I saw someone using
shift and thought to myself “hey that looks pretty”, lesson learned.
There’s some open bugs I need to fix still. Mostly because I haven’t been using them in this exact way.
- svr_probability isn’t working - numbers dumped are wrong. Likely a typo somewhere in the calculation. I did a dead 1-1 copy from libsvm on this and just haven’t gone back to fix it yet.
- Shrinking - I don’t have it though for almost all my svm types it never helped, and I already cache the Q/Q_bar into memory so this is of little value in reducing cycle time.
- Cntrl-C handle - I intended to have a handler to allow you to stop the svm process mid run, dump the contents to a file and then allow you to pickup latter. (In case you sleep in the same room as your server and can’t sleep with it running all night). Current tasks I run haven’t needed this as I usually trim down things using test sets and pca but it would still be nice.
- Memory optimizations - When it threads it does a full memory copy of the problem so if I’m using 1GB and I spawn 4 threads I’m using 5GB, which is rediculous since the sub threads don’t need the full training set, at most 2x would be ok (it’s still not), so I’m working on it.
- Perl Doc - It’s good practice to document things, and well I haven’t :(
Finally with perl (and yes python) being unviable because of memory usage and runtime, the C version is still the best option. I’m working on cleaning up a C version based on libsvm using pthreads. libsvm code isn’t all that bad but I don’t like the structure, and I’m also going to use some libs to handel some trivial things like reading in options and parsing files, theirs no reason to do this by hand. I also want it to handle various compression formats like mine does.
Break down of the SVM
Now to the meat of this exercise.
You can walk the libsvm code yourself and I did a couple of times, but like learning a foreign languadge it’s best you know the foundation (the math itself) and then practice writing it yourself.
I’ll walk through the simple one_class svm case using the rbf kernel.
svm-train.pl is the wrapper it calls out the Algorythim::ML::SVM module. I’ve decided to cary around the training data as part of the svm engine itself instead of creating another object or data structure to do this. Inside this scipt you’ll see a set of default variables being set they setup the conditions for the SVM but most of this gets auto set on SVM creation too so a lot can be omitted.
1 2 3 4 5
Now we’ll call out the training, this is where all the work happens. the
$save_file variable isn’t needed. This is something I added to retain the original support vectors used to build the model. I re-use these in other tools.
Inside this there’s an if/else that selects the SVM type. if it will have multiple categories of Y then it splits each and trains on each grouping. In the single and the multiple Y version they then call on
train_one will in threaded mode create sub groups of evenly seperated vectors for each grouping to train on (this might no always be the best choice but on large enough same sets the error is very low). In non-threaded mode it calls out
train_one_lin calls out the solver based on the selected mode. We’ll follow
@alpha = $self->solve_one_class($model);. An array of the alpha values (weights per vector) is returned and used later to decide which vectors will be kept as “Support Vectors”. This sets up some arrays needed in the computation. It’s seperated here to deal with each of the svm types. It finally calls out the actual
solver function that does the work (I know it took a while to widdle down to here).
Now things get interesting… (I’ll continue on the next write up)