Decision Tree (Machine Learning)

In this section, I use the decision tree method to predict two taxi-trip quantities: low-speed time and trip duration. I wrote a Python class (DecisionTree) that extracts the predictor and response variables from a dataframe and trains the model with cross validation.

Predictor and Response Variables

In this example, the predictor variables are hour of the day, day of the week, pickup latitude, pickup longitude, dropoff latitude, dropoff longitude, and trip distance. The response variable is either low-speed time or trip duration.

The following code snippet creates the predictor and response variables:

    def get_x_and_y_predict_timecost(self, df):
        # Build the feature matrix: one [lat, lng, lat, lng, hour, distance]
        # row per trip. list(zip(...)) replaces map(), which returns a lazy
        # iterator in Python 3 and cannot be passed to scikit-learn directly.
        x = list(zip(df.pickup_latitude, df.pickup_longitude,
                     df.dropoff_latitude, df.dropoff_longitude,
                     df.hourofday, df.trip_distance))
        y = df.timecost  # response: low-speed time
        return x, y

    def get_x_and_y_predict_tripduration(self, df):
        # Same feature matrix as above, different response variable.
        x = list(zip(df.pickup_latitude, df.pickup_longitude,
                     df.dropoff_latitude, df.dropoff_longitude,
                     df.hourofday, df.trip_distance))
        y = df.triplength  # response: trip duration
        return x, y
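As a quick illustration of how these extraction methods are used, the sketch below builds a tiny sample dataframe with the same column names and pulls out the feature rows and response the same way. The column values are made up for demonstration, and the units of `timecost` and `triplength` are an assumption.

```python
import pandas as pd

# Hypothetical two-trip sample; column names match those the
# extraction methods expect.
df = pd.DataFrame({
    'pickup_latitude':   [40.75, 40.73],
    'pickup_longitude':  [-73.99, -73.98],
    'dropoff_latitude':  [40.76, 40.70],
    'dropoff_longitude': [-73.97, -74.00],
    'hourofday':         [8, 17],
    'trip_distance':     [1.2, 4.5],
    'timecost':          [120.0, 300.0],   # low-speed time (assumed seconds)
    'triplength':        [600.0, 1500.0],  # trip duration (assumed seconds)
})

# The same feature matrix the methods build: one 6-element row per trip.
x = list(zip(df.pickup_latitude, df.pickup_longitude,
             df.dropoff_latitude, df.dropoff_longitude,
             df.hourofday, df.trip_distance))
y = df.timecost
```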

Training model

The decision tree model is trained on x and y; the max_depth of the tree is limited to prevent overfitting. We also calculate the mean squared residual of the prediction.

    def train_model_predict_timecost(self, x, y):
        import numpy as np
        from sklearn import tree
        # cross_val_score moved from sklearn.cross_validation to
        # sklearn.model_selection in scikit-learn 0.18.
        from sklearn.model_selection import cross_val_score
        clf = tree.DecisionTreeRegressor(max_depth=15)
        clf = clf.fit(x, y)
        scores = cross_val_score(clf, x, y, cv=5)
        # Mean squared residual on the training data.
        residual = np.mean((clf.predict(x) - y) ** 2)
        return clf, scores.mean(), residual

Cross Validation

Cross validation is used to choose the regularization parameter (the max_depth of the tree).
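The depth selection described above can be sketched as a simple grid search: train a tree at each candidate max_depth and keep the depth with the best mean cross-validation score. The data here is synthetic (standing in for the six taxi features), and the candidate depth list is an assumption, not the values used in this work.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the taxi data: 200 trips, 6 features,
# response driven mainly by the last column (trip distance).
rng = np.random.RandomState(0)
x = rng.rand(200, 6)
y = x[:, 5] * 10 + rng.rand(200)

best_depth, best_score = None, -np.inf
for depth in [3, 5, 10, 15, 20]:  # hypothetical candidate depths
    clf = DecisionTreeRegressor(max_depth=depth, random_state=0)
    score = cross_val_score(clf, x, y, cv=5).mean()  # mean R^2 over 5 folds
    if score > best_score:
        best_depth, best_score = depth, score
```

A shallower tree that cross-validates as well as a deeper one is usually preferable, since the depth cap is what keeps the tree from memorizing the training set.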

Cross Validation splitting strategy

In this work, I use the K-fold method to calculate the cross validation score. KFold divides all the samples into k groups of samples, called folds, of equal size (if possible). The model is trained using k-1 folds, and the fold left out is used for testing.

Cross Validation Score

In this work, we use coefficient of determination (\({R^2}\) of the prediction) as the cross validation score (cv score).

A data set has n values marked \({y_1} \ldots {y_n}\), each associated with a predicted value \({f_1} \ldots {f_n}\). \(\bar y = \frac{1}{n}\sum\limits_{i = 1}^n {{y_i}}\) is the mean of the observed data.

The definition of the Cross Validation Score (coefficient of determination) is:

\(R^2 = 1 - SS_{res}/SS_{tot}\)

where \(SS_{tot}\) is the total sum of squares:

\(SS_{tot} = \sum\limits_i (y_i - \bar y)^2\)

and \(SS_{res}\) is the residual sum of squares (RSS):

\(SS_{res} = \sum\limits_i (y_i - f_i)^2\)
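The two sums above can be computed in a few lines, and the result matches scikit-learn's r2_score (which is what cross_val_score reports for a regressor by default). The observed and predicted values below are made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])  # observed values y_i
f = np.array([2.8, 5.2, 7.1, 8.7])  # predicted values f_i

ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
ss_res = np.sum((y - f) ** 2)         # residual sum of squares
r2 = 1 - ss_res / ss_tot              # coefficient of determination
```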