Preprocessing
Machine learning algorithms learn from data, so before choosing an algorithm I need to have the right data. I need to make sure the data is in a usable format, is on a sensible scale, and includes useful features.
In this section, I show how to prepare the NYC Taxi Trip data for a machine learning algorithm, including cleaning, formatting, decomposition and aggregation. I wrote a Python class (Preprocessing) to do the preprocessing.
Cleaning
Cleaning data means removing or fixing missing and invalid data. Some data instances are incomplete or nonsensical. By “incomplete” I mean that one or more columns of the instance contain NaN. By “nonsensical” I mean that the instance is not possible in the real world. For example, some instances in the NYC Taxi Trip data have a trip distance larger than 500 miles.
The following code snippet is for data cleaning.
def clean_data(df):
    """
    Clean the data: drop rows containing NaN values,
    drop trips whose trip_distance is larger than 100 miles,
    and drop rows whose draws value is not positive.
    :param df: pandas dataframe
    :return: pandas dataframe
    """
    df_copy = df.dropna()
    df_copy = df_copy[df_copy.trip_distance <= 100]
    df_copy = df_copy[df_copy.draws > 0]
    return df_copy
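To see how much each filter removes, it can help to count the rows left after every step. The following is a minimal sketch under stated assumptions: `cleaning_report` is a hypothetical helper that is not part of the Preprocessing class, and the toy values (including the `draws` column, whose name is taken from the snippet above) are made up for illustration.

```python
import numpy as np
import pandas as pd


def cleaning_report(df, max_distance=100):
    """Count the rows remaining after each cleaning step (hypothetical helper)."""
    counts = {"raw": len(df)}
    df = df.dropna()                            # incomplete rows
    counts["after_dropna"] = len(df)
    df = df[df.trip_distance <= max_distance]   # impossible distances
    counts["after_distance_filter"] = len(df)
    df = df[df.draws > 0]                       # same draws filter as clean_data
    counts["after_draws_filter"] = len(df)
    return counts


# Made-up rows: one valid, one with NaN, one 523-mile trip, one with draws == 0.
toy = pd.DataFrame({
    "trip_distance": [1.2, np.nan, 523.0, 3.4],
    "draws": [1, 2, 1, 0],
})
report = cleaning_report(toy)
print(report)
```

A large drop at a single step is a hint that the threshold for that step deserves a second look before training on the result.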
Formatting
The date columns in the NYC Taxi Trip data are not in a format that is convenient to work with, so I convert them to timestamps.
The following code snippet is for formatting:
import pandas as pd

def format_date(cls, df):
    tempdf = df.copy()
    # convert the date strings to pandas timestamps
    tempdf['tpep_pickup_datetime'] = pd.to_datetime(tempdf.tpep_pickup_datetime)
    tempdf['tpep_dropoff_datetime'] = pd.to_datetime(tempdf.tpep_dropoff_datetime)
    return tempdf
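On a large file it may be worth telling `pd.to_datetime` the expected format and how to treat unparseable values. The sketch below assumes the raw strings look like `2015-01-15 19:05:39`; the format string, the `errors="coerce"` choice and the toy rows are my assumptions, not something taken from the original class.

```python
import pandas as pd

# Made-up rows; the second pickup value is deliberately malformed.
raw = pd.DataFrame({
    "tpep_pickup_datetime": ["2015-01-15 19:05:39", "not a date"],
    "tpep_dropoff_datetime": ["2015-01-15 19:23:42", "2015-01-15 20:00:00"],
})

parsed = raw.copy()
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    parsed[col] = pd.to_datetime(
        parsed[col], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )

print(parsed.dtypes)
print(parsed["tpep_pickup_datetime"].isna().sum(), "unparseable pickup value(s)")
```

Rows that end up as NaT can then be removed with the same `dropna()` call used in the cleaning step.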
Decomposition
The dates in the NYC Taxi Trip data have hour, day, week, month and year components that could in turn be split out further. For this problem, I think the hour of the day and the day of the week are relevant and can be used as categorical features (seasonal variables) for the machine learning algorithm.
def get_hour_of_the_day(cls, df):
    # extract the pickup hour (0-23); the column must already be a timestamp
    df['hourofday'] = df.tpep_pickup_datetime.dt.hour
    return df
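The text also mentions the day of the week, and both components are meant to be used as categorical features. A minimal sketch of that idea follows; `add_time_features`, the `dayofweek` column name and the use of `pd.get_dummies` for the encoding are my assumptions, since the original class does not show this step.

```python
import pandas as pd


def add_time_features(df):
    """Add hour-of-day and day-of-week columns (hypothetical helper)."""
    out = df.copy()
    pickup = out["tpep_pickup_datetime"]
    out["hourofday"] = pickup.dt.hour        # 0-23
    out["dayofweek"] = pickup.dt.dayofweek   # Monday=0 ... Sunday=6
    return out


# Made-up pickups: a Thursday evening and a Sunday night.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2015-01-15 19:05:39", "2015-01-18 02:30:00"]
    )
})
features = add_time_features(trips)

# One indicator column per observed hour and weekday, so the model does not
# read the integer codes as ordered quantities.
encoded = pd.get_dummies(features, columns=["hourofday", "dayofweek"])
print(encoded.columns.tolist())
```

One-hot encoding is only one option; tree-based models can often use the integer codes directly.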
Aggregation
Some features in the NYC Taxi Trip data can be aggregated into a single feature. For instance, we can calculate the time spent in slow traffic (charged at 50 cents per 60 seconds) from the fare amount and trip distance. We can calculate the trip duration from the pickup and dropoff datetimes. We can also calculate the taxi trip speed from the trip distance and trip duration. After getting the trip speed, we can clean the data further. For example, we assume that the taxi speed should be less than 80 mi/hour and larger than 5 mi/hour.
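To make the slow-traffic estimate concrete, here is a small worked example of the arithmetic. It is a sketch that assumes the fare consists only of a 2.50 initial charge, 50 cents per started-and-completed 0.2 miles, and 50 cents per 60 seconds in slow traffic (no surcharges, tolls or tips); `slow_traffic_minutes` is a hypothetical name used for illustration.

```python
def slow_traffic_minutes(fare_amount, trip_distance):
    """Estimate minutes spent in slow traffic from the metered fare."""
    initial_charge = 2.5
    distance_units = int(trip_distance / 0.2)   # completed 0.2-mile units
    distance_charge = distance_units * 0.5      # 50 cents per unit
    # Whatever is left of the fare is attributed to 50-cent, 60-second units.
    return (fare_amount - initial_charge - distance_charge) / 0.5


# A made-up 2.1-mile trip with a 10.50 fare:
# 10 distance units cost 5.00, the initial charge is 2.50,
# so 3.00 remains, which corresponds to 6 minutes in slow traffic.
minutes = slow_traffic_minutes(fare_amount=10.5, trip_distance=2.1)
print(minutes)
```

Because the real fare can include other components, the estimate can come out negative or longer than the trip itself, which is why such rows are filtered out afterwards.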
The following code snippet is for data aggregation and further cleaning.
def get_trip_duration(cls, df):
    # trip duration in minutes, from the pickup and dropoff timestamps
    duration = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df["tripduration"] = duration.dt.total_seconds() / 60.0
    return df
def get_time_cost(cls, df):
    # minutes in slow traffic: fare minus 2.5 base and 0.5 per 0.2 miles, at 0.5 per minute
    df['timecost'] = (df.fare_amount - (df.trip_distance / 0.2).astype(int) * 0.5 - 2.5) / 0.5
    return df
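def furthur_clean_data(cls, df):
    # slow-traffic time cannot exceed the trip duration (both in minutes) or be negative
    df = df[df.timecost < df.tripduration]
    df = df[df.timecost >= 0]
    # keep only plausible speeds, between 5 and 80 mi/hour
    df = df[df.speed > 5]
    df = df[df.speed < 80]
    return df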
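The speed filter relies on a speed column whose computation is not shown in this section. A minimal sketch of how it could be derived, assuming trip_distance is in miles and tripduration is in minutes; `get_speed` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def get_speed(df):
    """Add average trip speed in miles per hour (hypothetical helper)."""
    out = df.copy()
    hours = out["tripduration"] / 60.0          # minutes -> hours
    out["speed"] = out["trip_distance"] / hours
    return out


# Made-up trips: a normal one, one that barely moved, one implausibly fast.
toy = pd.DataFrame({
    "trip_distance": [3.0, 0.2, 30.0],
    "tripduration": [12.0, 10.0, 15.0],
})
with_speed = get_speed(toy)

# Same bounds as in the text: faster than 5 and slower than 80 mi/hour.
plausible = with_speed[(with_speed.speed > 5) & (with_speed.speed < 80)]
print(plausible)
```

A zero duration yields an infinite or NaN speed, and neither passes both comparisons, so those rows are dropped by the same filter.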