Preprocessing
=============

Machine learning algorithms learn from data, so choosing the right algorithm starts with having the right data. The data must be in a useful format and scale, and it must include useful features. In this section, I show how to prepare the NYC Taxi Trip data for a machine learning algorithm, including cleaning, formatting, decomposition and aggregation. I wrote a Python class (``Preprocessing``) to do the preprocessing job.

Cleaning
--------

Cleaning data is the removal or fixing of missing data. Some data instances are incomplete or nonsensical. By “incomplete” I mean that at least one column of the data instance contains NaN. By “nonsensical” I mean that the data instance is not possible in the real world. For example, some instances in the NYC Taxi Trip data have a trip distance larger than 500 miles. The following code snippet is for data cleaning.

.. literalinclude:: preprocessing.py
   :pyobject: Preprocessing.clean_data

Formatting
----------

The format of the date column in the NYC Taxi Trip data is not suitable for me to work with, so I need to convert the date column to timestamps. The following code snippet is for formatting:

.. literalinclude:: preprocessing.py
   :pyobject: Preprocessing.format_date

.. _decomp-label:

Decomposition
-------------

The dates in the NYC Taxi Trip data have hour, day, week, month and year components that could in turn be split out further. For this problem, I think the hour of the day and the day of the week are relevant and can be used as categorical features (seasonal variables) for the machine learning algorithm.

.. literalinclude:: preprocessing.py
   :pyobject: Preprocessing.get_hour_of_the_day

.. _aggre-label:

Aggregation
-----------

There are some features in the NYC Taxi Trip data that can be aggregated into a single feature.
For instance, we can calculate the time spent in slow traffic (charged at 50 cents per 60 seconds) using the fare amount and trip distance. We can calculate the trip duration using the pickup datetime and dropoff datetime. We can also calculate the taxi trip speed using the trip distance and trip duration. After getting the trip speed, we can do further cleaning on the data. For example, we assume that the taxi speed should be less than 80 mi/hour and larger than 5 mi/hour. The following code snippet is for data aggregation and further cleaning.

.. literalinclude:: preprocessing.py
   :pyobject: Preprocessing.get_trip_duration

.. literalinclude:: preprocessing.py
   :pyobject: Preprocessing.get_time_cost

.. literalinclude:: preprocessing.py
   :pyobject: Preprocessing.furthur_clean_data
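
To make the aggregation steps concrete, here is a minimal standalone sketch of the duration, speed and speed-filter calculations described above. It is not the code from ``preprocessing.py``. The field names (``pickup_datetime``, ``dropoff_datetime``, ``trip_distance``), the timestamp format and the helper function names are assumptions made for illustration only, and the real data set may differ.

```python
from datetime import datetime

# Assumed timestamp format; the real data set may use a different one.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Speed bounds taken from the text: faster than 5 and slower than 80 mi/hour.
MIN_SPEED_MPH = 5.0
MAX_SPEED_MPH = 80.0


def trip_duration_seconds(pickup, dropoff):
    """Return the trip duration in seconds from two timestamp strings."""
    start = datetime.strptime(pickup, DATETIME_FORMAT)
    end = datetime.strptime(dropoff, DATETIME_FORMAT)
    return (end - start).total_seconds()


def trip_speed_mph(distance_miles, duration_seconds):
    """Return the average speed in mi/hour, or None if the duration is not positive."""
    if duration_seconds <= 0:
        return None
    return distance_miles / (duration_seconds / 3600.0)


def has_plausible_speed(trip):
    """Return True if the trip's average speed lies strictly within the bounds."""
    duration = trip_duration_seconds(trip["pickup_datetime"], trip["dropoff_datetime"])
    speed = trip_speed_mph(trip["trip_distance"], duration)
    return speed is not None and MIN_SPEED_MPH < speed < MAX_SPEED_MPH


# Hypothetical sample trips (field names are assumed, not taken from the data set).
trips = [
    # 6 miles in 30 minutes: plausible
    {"pickup_datetime": "2013-01-01 10:00:00",
     "dropoff_datetime": "2013-01-01 10:30:00",
     "trip_distance": 6.0},
    # 5 miles in 1 minute: too fast
    {"pickup_datetime": "2013-01-01 10:00:00",
     "dropoff_datetime": "2013-01-01 10:01:00",
     "trip_distance": 5.0},
    # 1 mile in 1 hour: too slow
    {"pickup_datetime": "2013-01-01 10:00:00",
     "dropoff_datetime": "2013-01-01 11:00:00",
     "trip_distance": 1.0},
]

# Keep only the trips that pass the speed filter.
kept = [trip for trip in trips if has_plausible_speed(trip)]
```

In this sketch a trip with a zero or negative duration is treated as implausible and dropped, which is one possible design choice. The actual class may handle such rows differently.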