Decision Tree (Machine Learning)
================================

In this section, I use the decision tree method to predict two properties of a taxi trip: low speed time and trip duration. I wrote a Python class (``Decision_tree``) that extracts the predictor and response variables from a dataframe and trains the model with cross validation.

Predictor and Response Variables
--------------------------------

In this example, the predictor variables are hour of the day, day of the week, pickup lat, pickup lng, dropoff lat, dropoff lng, and trip distance. The response variables are low speed time and trip duration.

The following code snippets create the predictor and response variables:

.. literalinclude:: machine_learning.py
    :pyobject: Decision_tree.get_x_and_y_predict_timecost

.. literalinclude:: machine_learning.py
    :pyobject: Decision_tree.get_x_and_y_predict_tripduration

Training model
--------------

The decision tree model is trained with x and y. The max_depth of the tree is limited to prevent overfitting. We also calculate the residual of the prediction.

.. literalinclude:: machine_learning.py
    :pyobject: Decision_tree.train_model_predict_timecost

.. _cross-label:

Cross Validation
----------------

Cross validation is used to choose the regularization parameter (max_depth of the tree).

Cross Validation splitting strategy
```````````````````````````````````

In this work, I use the K-fold method to calculate the :ref:`cvscore-label`. KFold divides all the samples into k groups of samples, called folds, of equal sizes (if possible). The model is trained using k-1 folds, and the fold left out is used for testing.

.. _cvscore-label:

Cross Validation Score
``````````````````````

In this work, we use the coefficient of determination (:math:`{R^2}` of the prediction) as the cross validation score (cv score). A data set has n values marked :math:`{y_1} \ldots {y_n}`, each associated with a predicted value :math:`{f_1} \ldots {f_n}`.
:math:`\bar y = \frac{1}{n}\sum\limits_{i = 1}^n {{y_i}}` is the mean of the observed data.

The cross validation score (coefficient of determination) is defined as:

:math:`{R^2} = 1 - S{S_{res}}/S{S_{tot}}`

where :math:`SS_{tot}` is the total sum of squares:

:math:`S{S_{tot}} = \sum\limits_i {{{({y_i} - \bar y)}^2}}`

and :math:`SS_{res}` is the residual sum of squares (RSS):

:math:`S{S_{res}} = \sum\limits_i {{{({y_i} - {f_i})}^2}}`
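
To make the definitions above concrete, here is a minimal plain-Python sketch of the K-fold split and the :math:`{R^2}` score. It is only an illustration and is not taken from ``machine_learning.py``: the helper names ``r_squared`` and ``k_fold_indices`` and the sample values are hypothetical, and the split assumes no shuffling, with any leftover samples going to the first folds.

```python
def r_squared(y, f):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)                          # mean of the observed data
    ss_tot = sum((y_i - y_bar) ** 2 for y_i in y)    # total sum of squares
    ss_res = sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, f))  # residual sum of squares
    return 1 - ss_res / ss_tot


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k consecutive folds.

    Fold sizes differ by at most one: the first n_samples % k folds
    receive one extra sample. No shuffling is done (an assumption).
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Made-up observed and predicted values, for illustration only.
y_obs = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 12.0, 14.0, 19.0]
score = r_squared(y_obs, y_pred)

# Two folds over five samples: each fold is held out for testing once.
folds = list(k_fold_indices(5, 2))
```

A perfect prediction gives a score of 1, and always predicting the mean :math:`\bar y` gives a score of 0. When choosing max_depth, the score would be computed on each held-out fold and averaged over the k folds.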