Decision Tree (Machine Learning)¶
In this section, I use a decision tree to predict taxi trip metrics (low-speed time and trip duration). I wrote a Python class (DecisionTree) that extracts the predictor and response variables from a DataFrame and trains the model with cross-validation.
Predictor and Response Variables¶
In this example, the predictor variables are hour of the day, day of the week, pickup latitude, pickup longitude, dropoff latitude, dropoff longitude, and trip distance. The response variables are low-speed time and trip duration.
The following code snippet creates the predictor and response variables:
def get_x_and_y_predict_timecost(self, df):
    # Predictors: pickup/dropoff coordinates, hour of day, trip distance
    picklat = df.pickup_latitude
    picklng = df.pickup_longitude
    droplat = df.dropoff_latitude
    droplng = df.dropoff_longitude
    hourofday = df.hourofday
    trip_distance = df.trip_distance
    # Zip the columns into one feature row per trip; list(...) gives
    # scikit-learn a concrete sequence rather than an iterator
    x = list(zip(picklat, picklng, droplat, droplng, hourofday, trip_distance))
    # Response: low-speed time
    y = df.timecost
    return x, y
def get_x_and_y_predict_tripduration(self, df):
    # Predictors: same features as above
    picklat = df.pickup_latitude
    picklng = df.pickup_longitude
    droplat = df.dropoff_latitude
    droplng = df.dropoff_longitude
    hourofday = df.hourofday
    trip_distance = df.trip_distance
    x = list(zip(picklat, picklng, droplat, droplng, hourofday, trip_distance))
    # Response: trip duration
    y = df.triplength
    return x, y
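As a quick sanity check, the extraction logic above can be exercised on a toy DataFrame. This is a minimal sketch: the column names follow the snippet above, and all the values are made up.

```python
import pandas as pd

# Toy DataFrame with the columns the extraction methods expect (values made up)
df = pd.DataFrame({
    "pickup_latitude":   [40.75, 40.71],
    "pickup_longitude":  [-73.99, -74.00],
    "dropoff_latitude":  [40.76, 40.73],
    "dropoff_longitude": [-73.98, -73.99],
    "hourofday":         [8, 17],
    "trip_distance":     [1.2, 3.4],
    "timecost":          [300.0, 540.0],   # low-speed time in seconds
    "triplength":        [600.0, 1200.0],  # trip duration in seconds
})

# Build the feature rows the same way as the class methods
x = list(zip(df.pickup_latitude, df.pickup_longitude,
             df.dropoff_latitude, df.dropoff_longitude,
             df.hourofday, df.trip_distance))
y = df.timecost

# One tuple per trip: (pickup lat, pickup lng, dropoff lat, dropoff lng, hour, distance)
print(x[0])
```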
Training the Model¶
The decision tree model is trained on x and y; the tree's max_depth is capped to prevent overfitting. We also calculate the mean squared residual of the prediction.
def train_model_predict_timecost(self, x, y):
    from sklearn import tree
    from sklearn.model_selection import cross_val_score
    import numpy as np
    # Cap the tree depth to prevent overfitting
    clf = tree.DecisionTreeRegressor(max_depth=15)
    clf = clf.fit(x, y)
    # 5-fold cross-validation; the default score for regressors is R^2
    scores = cross_val_score(clf, x, y, cv=5)
    # Mean squared residual on the training data
    residual = np.mean((clf.predict(x) - y) ** 2)
    return clf, scores.mean(), residual
Cross Validation¶
Cross validation is used to choose the regularization parameter (the max_depth of the tree).
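The depth selection can be sketched as a simple sweep over candidate max_depth values, keeping the one with the best mean cross-validation score. This is a hedged sketch on synthetic data; the real notebook runs the same loop on the taxi DataFrame, and the candidate depths here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the taxi features
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

# Sweep candidate depths and keep the best mean CV score (R^2 by default)
best_depth, best_score = None, -np.inf
for depth in [2, 5, 10, 15, 20]:
    scores = cross_val_score(
        DecisionTreeRegressor(max_depth=depth, random_state=0), X, y, cv=5)
    if scores.mean() > best_score:
        best_depth, best_score = depth, scores.mean()

print(best_depth, round(best_score, 3))
```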
Cross Validation splitting strategy¶
In this work, I use the K-fold method to calculate the cross-validation score. KFold divides all the samples into k groups of samples, called folds, of equal size (if possible). The model is trained using k-1 folds, and the fold left out is used for testing.
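The splitting strategy can be illustrated directly with scikit-learn's KFold (a minimal sketch on ten dummy samples):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 dummy samples
kf = KFold(n_splits=5)

# Record (train size, test size) for each of the 5 splits
folds = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]

# Each fold trains on k-1 = 4 groups (8 samples) and holds out 1 group (2 samples)
print(folds)
```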
Cross Validation Score¶
In this work, we use the coefficient of determination (\({R^2}\) of the prediction) as the cross-validation score (CV score).
A data set has n values marked \({y_1} \ldots {y_n}\), each associated with a predicted value \({f_1} \ldots {f_n}\). \(\bar y = \frac{1}{n}\sum\limits_{i = 1}^n {{y_i}}\) is the mean of the observed data.
The definition of the Cross Validation Score (coefficient of determination) is:
\({R^2} = 1 - S{S_{res}}/S{S_{tot}}\)
where \(SS_{tot}\) is the total sum of squares:
\(S{S_{tot}} = \sum\limits_i {{{({y_i} - \bar y)}^2}}\)
and \(SS_{res}\) is the residual sum of squares (RSS):
\(S{S_{res}} = \sum\limits_i {{{({y_i} - {f_i})}^2}}\)
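The definition can be checked numerically against scikit-learn's r2_score (a small sketch with made-up observed and predicted values):

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])  # observed values y_i (made up)
f = np.array([2.8, 5.3, 6.9, 9.2])  # predicted values f_i (made up)

ss_res = np.sum((y - f) ** 2)            # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both values agree: 1 - 0.18/20 = 0.991
print(round(r2_manual, 6), round(r2_score(y, f), 6))
```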