The battle for Greece’s next top model may have been over for this year…😂
However, the battle’s still on in almost every machine learning task a data scientist comes across in their daily (and nightly) lives.
The long-standing questions in these cases are:
- Which model is the best?
- Can I know that in advance?
- If not, how am I supposed to try every possible different model and decide afterwards, without creating the messiest 🍝 spaghetti code in the whole of Italy?
- And most importantly, should I eat more cake?…err sorry, that’s always a yes!
Fortunately, the machine learning community offers some real engineering gems for deploying your machine learning models easily and with the least possible pain, my favourites being keras/tensorflow and scikit-learn (sklearn). Today, I’ll focus on scikit-learn only, which offers great implementations of a large set of popular “traditional” machine learning classifiers (i.e., not deep-learning based).
Yet, if I had to try every possible classifier integrated within scikit-learn, would I be able to do that in a coherent/non-redundant/straightforward way?…🤔 i.e. write my code once and seamlessly run it for every possible classifier? (unlike Java…)
Today’s our lucky day as the answer to that is: (you guessed it) yes!
The very first step for doing that is defining a class (let’s call it SklearnWrapper) that exposes the standard estimator methods, such as fit, predict, etc., seamlessly across multiple sklearn classifiers.
This class would look like this:
class SklearnWrapper(object):
    def __init__(self, clf, params=None):
        # Instantiate the underlying sklearn classifier with its parameters
        # (avoid a mutable default argument for params)
        self.clf = clf(**(params or {}))

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def predict(self, x):
        return self.clf.predict(x)

    def predict_proba(self, x):
        return self.clf.predict_proba(x)

    def evaluate(self, x, y):
        # sklearn estimators have no `evaluate` (that's a Keras method);
        # `score` returns the mean accuracy for classifiers
        return self.clf.score(x, y)

    def feature_importances(self, x, y):
        return self.clf.fit(x, y).feature_importances_

    def get_coef_(self):
        return self.clf.coef_
The critical part here is passing the clf and params arguments, which are then combined into a specific sklearn object by calling the clf(**params) constructor.
For example, you can create a new Random Forest classifier like this:
from sklearn.ensemble import RandomForestClassifier
# Random Forest Classifier parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 'sqrt',  # 'auto' in older sklearn versions; removed in 1.3
    'max_depth': 15,
    'min_samples_leaf': 2,
    'min_samples_split': 4,
    'warm_start': False,
    'verbose': 0
}
rf_model = SklearnWrapper(clf=eval("RandomForestClassifier"), params=rf_params)
Note that what clf receives is not the string itself: the string "RandomForestClassifier" is eval‘ed, and thus interpreted into the respective sklearn classifier class, which SklearnWrapper then instantiates. This means that the string passed to eval has to name a valid sklearn classifier class that has been explicitly imported into the current scope! The clf argument is merely a placeholder for that class, while params is a dictionary of classifier-specific input parameters.
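Because the classifier is resolved from a string, switching models becomes a matter of editing a dictionary entry rather than rewriting code. Here’s a minimal sketch of that idea (the model_registry below is a hypothetical example of mine, not part of the wrapper itself):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# The wrapper class from above (constructor only, for brevity)
class SklearnWrapper(object):
    def __init__(self, clf, params=None):
        self.clf = clf(**(params or {}))

# Hypothetical registry: classifier names mapped to their parameters.
# Each name must match a class imported into the current scope.
model_registry = {
    'RandomForestClassifier': {'n_estimators': 50, 'n_jobs': -1},
    'LogisticRegression': {'C': 1.0, 'max_iter': 200},
}

# One comprehension instantiates every model in the registry
models = {name: SklearnWrapper(clf=eval(name), params=params)
          for name, params in model_registry.items()}
```

Adding a new candidate model is then just one more registry entry, with no changes to the surrounding code.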
Similarly, we can create a Support Vector Classifier model as such:
from sklearn.svm import SVC
# Support Vector Classifier parameters
svc_params = {
    'C': 0.01,
    'kernel': 'linear',
    'gamma': 'auto',
    'probability': True,
    'shrinking': True
}
svc_model = SklearnWrapper(clf=eval("SVC"), params=svc_params)
So far so good… All that’s missing now is hooking our model definitions up with data (train/test) to actually train our models and make predictions. In an upcoming post, I’ll talk about creating a higher-level class that implements further functionality on top of SklearnWrapper, providing a standard interface for building, training, evaluating and extracting predictions from a model. 🙂
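To give a small taste of that, here’s a minimal sketch of running both wrappers over the same train/test split. The iris dataset and the plain loop below are illustrative choices of mine, not the interface of the upcoming post:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# The wrapper class from above (trimmed to the methods used here)
class SklearnWrapper(object):
    def __init__(self, clf, params=None):
        self.clf = clf(**(params or {}))
    def fit(self, x, y):
        return self.clf.fit(x, y)
    def predict(self, x):
        return self.clf.predict(x)
    def evaluate(self, x, y):
        return self.clf.score(x, y)  # mean accuracy for classifiers

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# The exact same three lines run for every classifier in the list
scores = {}
for name, params in [('RandomForestClassifier', {'n_estimators': 100}),
                     ('SVC', {'C': 1.0, 'kernel': 'linear'})]:
    model = SklearnWrapper(clf=eval(name), params=params)
    model.fit(x_train, y_train)
    scores[name] = model.evaluate(x_test, y_test)

print(scores)  # accuracy per model name
```

The loop body never mentions a concrete classifier, which is exactly the write-once-run-everywhere property we were after.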