Project presentation

Nowadays, more than half of the world's population is said to live in urban areas. Combined with the increasing amount of available urban data, this allows data-driven analyses to greatly improve the lives of citizens. We therefore decided to focus on taxi trips: they are valuable sensors that tell us about economic activity and human behavior in cities.

We found a dataset in a Kaggle competition supported by Google. The goal of this competition is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City. The dataset initially contains taxi trips from 2009 to mid-2015. Each row corresponds to a trip and provides as covariates the pickup and drop-off geolocations, the number of passengers, and the date and time of the trip. We considered these data valuable as they appeared rather exhaustive, covering more than 54 million trips. Moreover, there were no missing data, although some values seemed erroneous, as we will see later.

On the competition homepage, it was stated that basic predictors achieve RMSEs of about \(5-8\$\). Our objective was to improve on such scores. However, the full dataset was too heavy for us to analyze, so we reduced it to match our computational resources. Therefore, as you will see, we first aggregated the data and performed predictions on these aggregates. After that, we sampled \(50000\) trips from the original dataset. These were then split between a train set (trips before June 2014) and a test set (trips from June 2014 onwards).

The code we have produced is available at https://github.com/TRandrianarisoa/NYCTaxiFare.

Exploratory data analysis and feature engineering

Geographical insights

First, we looked at the locations of the taxi trips. Looking at extremal values, we saw that some trips were located outside of New York, sometimes even on the other side of the world. In addition, we only kept trips with fewer than 10 passengers, as more does not seem plausible. The offending rows were discarded.

Below, we have plotted the locations most likely to be associated with a trip. The bigger the dot and the redder its color, the more trips have their pickup (or drop-off) located there.

As we can see, most trips start in the center of Manhattan or at the airports. As for the drop-offs, the distribution is still concentrated on Manhattan, although a bit more spread out. All in all, the airports and Manhattan account for most of the trip origins and destinations. If we look at similar plots for the fare amounts below, the prominent role of the airports is even more obvious: trips coming from the airports seem more expensive. We therefore added new features to the data indicating whether a trip came from or was going to an airport (Newark, LaGuardia or JFK).

Temporal insights

We tried to discover trends in the data with regard to the temporal characteristics of trips. We found that the data were evenly distributed with regard to most of the features we extracted (day of the week, year, …). The fare amount was likewise not really influenced by most of these features.

Still, we managed to find some interesting facts in the data. First, the histogram of the mean fare observed per month below shows a significant price increase in September 2012. We found that a new law in New York resulted in this price increase. Consequently, to better forecast the prices, we rescaled the post-legislation data and adjusted their fare mean.

Then, among the different time features that we extracted, the hour is the only one that seemed to impact the price. Indeed, we can see a peak in the trip price for trips around 5 a.m. Moreover, only a few trips are observed at this time of the day. It is therefore still valuable to add features such as the hour of the trip as covariates for our models.

Model on aggregated data: Temporal component estimation

In an attempt to highlight a trend over the whole city, we aggregated the data to obtain the median fare observed over the city per hour. As explained at the beginning of this report, these new data were then split between train and test sets according to their position relative to June 2014. From the autocorrelation plot below, we can clearly see two different seasonalities (daily and weekly). At a larger scale, we also discovered a yearly seasonality. Following the paper by R.J. Hyndman et al. (2012), we fitted a TBATS model, as this method is built to forecast complex seasonal time series. This preliminary step allows us to analyze the temporal component of our data. Indeed, the aggregation yields data with even time steps, which is a prerequisite for most time series analysis tools.
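A minimal sketch of this aggregation, assuming a data frame trips with columns pickup_datetime and fare_amount (the Kaggle column names), could look as follows:

library(dplyr)
library(lubridate)

# Median fare per hour over the whole city
hourly <- trips %>%
  mutate(hour = floor_date(pickup_datetime, unit = "hour")) %>%
  group_by(hour) %>%
  summarise(median_fare = median(fare_amount)) %>%
  arrange(hour)

# Split around June 2014, as described above
train_agg <- filter(hourly, hour <  as.POSIXct("2014-06-01", tz = "UTC"))
test_agg  <- filter(hourly, hour >= as.POSIXct("2014-06-01", tz = "UTC"))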

The new time series is presented in the plot below (with the effect of the new legislation taken into account and offset). We tried to understand what could have caused the extreme values observed in mid-2011. It turned out that these happened when hurricane Irene struck New York. This led us to include weather data as covariates, as you will see later.

TBATS

As shown above, there are numerous seasonalities in the aggregated series. We decided to first fit a TBATS model on these data. The underlying BATS model is described by the following equations, including a Box-Cox transformation of the observed series \(y_t\):

\[y_t^{(\omega)} = \begin{cases} \dfrac{y_t^\omega-1}{\omega}, & \omega\neq 0\\ \log(y_t), & \omega=0 \end{cases}\]

\[y_t^{(\omega)} = l_{t-1} + \phi b_{t-1} + \sum_{i=1}^T s_{t-m_i}^{(i)} + d_t\] \[l_t = l_{t-1} + \phi b_{t-1} + \alpha d_t\] \[b_t = (1-\phi)b + \phi b_{t-1} + \beta d_t\] \[s_t^{(i)} = s_{t-m_i}^{(i)} + \gamma_i d_t\] \[d_t = \sum_{i=1}^p \varphi_i d_{t-i} + \sum_{i=1}^q \theta_i \epsilon_{t-i} + \epsilon_t\] Here, \(m_i\) denotes the seasonal period of the \(i^{th}\) seasonal component \(s_t^{(i)}\), \(l_t\) is the local level, \(b\) is the long-term trend, \(b_t\) is the short-term trend, \(d_t\) is an ARMA\((p,q)\) process and \(\epsilon_t\) is Gaussian white noise. Besides, \(\alpha\), \(\beta\) and \(\gamma_i\) are smoothing parameters and \(\phi\) is a damping parameter. The model is then associated with the identifier BATS\((\omega,\phi,p,q,m_1,m_2,...,m_T)\). One can note that this model generalizes the Holt–Winters additive seasonal model, which is given by BATS\((1, 1, 0, 0, m_1)\). One interesting feature is that it allows autocorrelation in the residuals to be taken into account.

In the TBATS model, one uses a trigonometric representation of seasonal components based on Fourier series with smoothing parameters \(\gamma_1^{(i)}\) and \(\gamma_2^{(i)}\) and \(\lambda_j^{(i)}=\frac{2\pi j}{m_i}\), which results in:

\[s_t^{(i)} = \sum_{j=1}^{k_i} s_{j,t}^{(i)}\] \[s_{j,t}^{(i)} = s_{j,t-1}^{(i)}\cos(\lambda_j^{(i)}) + s_{j,t-1}^{*(i)}\sin(\lambda_j^{(i)}) + \gamma_1^{(i)}d_t\] \[s_{j,t}^{*(i)} = -s_{j,t-1}^{(i)}\sin(\lambda_j^{(i)}) + s_{j,t-1}^{*(i)}\cos(\lambda_j^{(i)}) + \gamma_2^{(i)}d_t\]

\[y_t^{(\omega)} = l_{t-1} + \phi b_{t-1} + \sum_{i=1}^T s_{t-1}^{(i)} + d_t\]

Such a representation with a finite number of harmonics \(k_i\) allows a much simpler estimation procedure. The model is then designated by the identifier TBATS\((\omega,\phi,p,q,\{m_1, k_1\}, \{m_2, k_2\},...,\{m_T,k_T\})\). Applied to our time series, with daily, weekly and yearly seasonalities, we obtain the model TBATS\((0,0.8,0,0,\{24, 5\}, \{168, 4\},\{8766,7\})\). On the train data, we obtained an RMSE of \(0.4607317\) and an \(R^2\) coefficient of \(0.5318792\). The following plots show the results obtained.

There, one can see that the model struggled to fit extreme values. In addition, there seems to be some heteroscedasticity in the residuals. Classically, this suggests that some useful covariates are missing from our model. As explained before, extreme values may be explained by weather conditions. However, this was the least of the issues with this model, as its predictions on the test set were completely absurd (RMSE of \(1.314827\) on the test set).
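For reference, a sketch of how such a model can be fitted with the forecast package, reusing the hourly aggregate train_agg built above (the seasonal periods are the ones selected in our model):

library(forecast)

# Hourly series with daily, weekly and yearly seasonal periods
y <- msts(train_agg$median_fare, seasonal.periods = c(24, 168, 8766))

fit_tbats <- tbats(y)                     # Box-Cox, damping and ARMA errors chosen automatically
fc_tbats  <- forecast(fit_tbats, h = nrow(test_agg))
accuracy(fc_tbats, test_agg$median_fare)  # train and test error measures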

ARIMA with regressors

We needed to find another model. We therefore added weather data retrieved with the riem package, as we have already explained their interest (a retrieval sketch is given after the list). We extracted the following features, which we considered useful:

  • tmpf: Air Temperature in Fahrenheit
  • relh: Relative Humidity
  • sknt: Wind Speed in knots
  • vsby: Visibility in miles
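A sketch of the weather retrieval with riem, assuming the Central Park ASOS station id ("NYC"); the actual station and date range used may differ:

library(riem)
library(dplyr)

weather <- riem_measures(station = "NYC",
                         date_start = "2009-01-01",
                         date_end   = "2015-07-01") %>%
  select(valid, tmpf, relh, sknt, vsby)   # timestamp plus the four retained features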

However, some of these variables had missing values on the train set.

Variable   Proportion of missing observations
relh       0.04923700
sknt       0.04820420
vsby       0.01802125
tmpf       0.01207740

Traditionally, a feature is dropped when more than \(5\%\) of its observations are missing. As all features are below this threshold, we decided to keep all of them and perform imputation. To do so, we used the random forest method from the mice package. For one variable to be imputed, the algorithm is as follows (a call sketch is given after the list):

  • First, replace the missing values with the mean of the observed values.
  • For i = 1,…,M
    • With the imputed values so far, train a Random Forest.
    • Compute the proximity matrix from the random forest.
    • Using the proximity as the weight, impute missing values as the weighted average of non-missing values.
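A call sketch with mice, using the weather data frame retrieved above (the number of imputations m is illustrative):

library(mice)

imp <- mice(weather[, c("tmpf", "relh", "sknt", "vsby")],
            method = "rf", m = 5, seed = 123)    # random forest imputation
densityplot(imp)                                 # imputed (magenta) vs observed (blue) densities
weather_complete <- complete(imp, action = 1)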

The density of the imputed data for each imputed dataset is shown below in magenta, while the density of the observed data is shown in blue. As these are quite similar, it is consistent with the hypothesis that the data are MCAR, which legitimates our imputation of the missing values.

Following the advice of R.J. Hyndman, we fitted an ARIMA process to our data, including our regressors and Fourier terms to take the seasonality of our data into account. The previous TBATS model selected \(5\), \(4\) and \(7\) harmonics to approximate the seasonal components, so we considered that taking \(10\) harmonics for each seasonality was a safe choice. The principle is to regress the fare on our different regressors and then fit an ARIMA model on the residuals. This gives us the following results on the train set:

We obtained an RMSE of \(0.5119916\) and an \(R^2\) of \(0.4219209\), which is slightly worse than our previous model. The model still struggles to fit extreme values, which is even more striking on the test set, on which we get an RMSE of \(0.76609\):

Nevertheless, one can see that some peaks in the data are reproduced in the fit (e.g. during hurricane Irene), so the weather data seem valuable here. On the test set, our model correctly predicts a price slump. The plot of the residuals shows strong heteroscedasticity, so the model may still be incomplete. Weather data are valuable for forecasting extreme values, but do not seem to help much for the prediction of high fares.
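For reference, a sketch of this regression-with-ARIMA-errors fit using the forecast package, reusing the msts object y from the TBATS sketch and assuming aligned weather matrices weather_train and weather_test:

library(forecast)

# Fourier terms for the three seasonal periods, 10 harmonics each
K <- c(10, 10, 10)
xreg_train <- cbind(fourier(y, K = K), weather_train)

fit_arima <- auto.arima(y, xreg = xreg_train, seasonal = FALSE)

xreg_test <- cbind(fourier(y, K = K, h = nrow(test_agg)), weather_test)
fc_arima  <- forecast(fit_arima, xreg = xreg_test)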

Residuals prediction

Now, let's go back to our initial goal. The previous model provides us with a prediction per hour. Let's refine it to predict the fare of individual trips in New York. To accomplish this, we will try to predict the error made by our first rough model, which assigns to each trip the hourly prediction of the median fare over New York. The hyperparameters of the following models are calibrated with a moving-window cross-validation optimizing the RMSE, using the caret package. In the following parts, the RMSE values shown are valid both for the standalone models predicting the ‘errors’ and for the composite model predicting the median fare over New York plus the errors. However, the \(R^2\) values are only valid for the former.
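A sketch of the moving-window control with caret; the window sizes (in rows of the time-ordered training trips) are illustrative, not the ones actually used:

library(caret)

tc <- trainControl(method        = "timeslice",
                   initialWindow = 20000,   # size of each training window
                   horizon       = 5000,    # size of each validation window
                   skip          = 5000,    # shift between consecutive windows
                   fixedWindow   = TRUE)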

Elasticnet

The first model that we fitted on the errors is the elastic net. It yields an RMSE of \(3.884417\) and an \(R^2\) of \(0.7976696\) on the train set, the cross-validation selecting a mixing parameter of \(1\) and a penalization coefficient of \(0.05623413\). The fitted model is therefore a Lasso regression, and one can look at the relative importance of the variables it selects. A plain linear regression performed on these features yields a slightly worse result. The Lasso regression achieves a test RMSE of \(4.375342\). Surprisingly, the variables related to the airports do not help much with the prediction.
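A fitting sketch for this step, assuming the trip-level training set train_trips contains a column error (fare minus the hourly median prediction) and using the control tc defined above; the tuning grid is illustrative:

fit_enet <- train(error ~ . - fare - pickup_datetime,
                  data      = train_trips,
                  method    = "glmnet",
                  metric    = "RMSE",
                  trControl = tc,
                  tuneGrid  = expand.grid(alpha  = seq(0, 1, by = 0.25),
                                          lambda = 10^seq(-3, 0, length.out = 10)))
fit_enet$bestTune   # selected mixing parameter (alpha) and penalty (lambda)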

Below, we can see that the residuals are not evenly distributed across the fitted values. There is a cluster of points whose fares are increasingly overestimated as the fitted values increase. Once again, it seems that the model struggles to fit extreme values and tends to mostly underestimate the fare.

Decision Tree

The second model we analyzed is a CART decision tree. We obtained a train RMSE of \(4.095963\) and a train \(R^2\) of \(0.7732267\). On the test set, the RMSE was about \(4.082969\), so this model apparently suffers less from overfitting. The distance is once again the most important variable, followed by the spatial features, as shown below.

Still, the remarks made on the previous model's residuals remain valid and the situation has not improved. One can also see on the second plot below that the model struggles to predict low fares on long trips: some long trips are less expensive than usual and the model tends to overestimate their price. In addition, the distribution of residuals for short trips (almost null distance) has a large variance compared with other trips. As the model mostly relies on the distance to predict the fare, this may be because the model cannot predict the fare in less usual situations (cancellation fees when no trip has been made, discounted prices, …).

XGBoost

We then decided to use ensemble methods, starting with boosting. This model improves on the previous ones as it achieves an RMSE of \(3.762135\) and an \(R^2\) of \(0.8091787\) on the train set. On the test set, the RMSE was about \(3.790195\). However, the residuals show the same patterns as with the decision tree.
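A boosting sketch with the xgboost package directly (the report tuned the model through caret); hyperparameter values are illustrative and the column names are the ones assumed in the elastic net sketch:

library(xgboost)

X <- model.matrix(error ~ . - fare - pickup_datetime - 1, data = train_trips)
fit_xgb <- xgboost(data      = X,
                   label     = train_trips$error,
                   nrounds   = 200,
                   max_depth = 6,
                   eta       = 0.1,
                   objective = "reg:squarederror",
                   verbose   = 0)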

Random Forest

The last model that we analyzed is a random forest, which yields an RMSE of \(3.75737\) and an \(R^2\) of \(0.8092412\) on the train set (RMSE of \(3.693425\) on the test set) with \(20\) trees (hyperparameter chosen by cross-validation, as shown below). This is the best model so far, even though the results are very similar to those of XGBoost.

However, the residual plots look just like the preceding ones, and the model most likely still cannot fit fares related to unusual trips.

Looking at the feature importance, we once again found that the most important variable is the distance, followed by the locations of the pickups and drop-offs.

Composite model: conclusion

Our best model yields an RMSE close to those reported on the competition webpage. Throughout this first analysis, we found that extreme values were quite hard to fit and predict. We highlighted a global price trend over New York in the first part. We were short of time to improve the model on the aggregated data, and better fits might have been obtained had we been able to test higher values of \(p\) and \(q\) for the ARMA processes.

We also came to the conclusion that we lacked some important variables that could have helped us distinguish special trips (e.g. cancellations), as each of our models showed trends in the residuals. Interestingly, the airport-related variables that we added at the beginning proved useless. Given the importance of the distance in the prediction of the fare, we think that airport trips are more expensive only because they are longer (the airports are far from downtown). We thought about other variables related to events in New York or to traffic data, but we did not manage to find reliable datasets. Finally, we could also have worked with functionals of the distance if we had had more time.

All in all, this first approach, which takes the temporal dimension of the data into account before the other ones, showed its limits, as our last models seemed limited to a test RMSE of about \(3.7\$\).

Second Approach: Direct focus on individual trips

We now turn to another approach, directly tackling individual trips. To evaluate the performance of each of the models presented in what follows, we implement a “time-series” cross-validation strategy, illustrated below.

The metric observed will be the mean RMSE (Root Mean Squared Error) between the true and the fitted prices over the train/test pairs. Finally, each model will predict the price of the taxi trips after the 1st of June 2014 (roughly 8k rows), just as before, and will be evaluated by its test error, i.e. the RMSE of the predicted fares on the test dataset.

Additional feature engineering

Based on the dataset features, others have been produced (a sketch of the computations is given after the list):

  • trip_distance: the distance (in km) between the pickup and the drop-off locations.
  • Distance from JFK, LaGuardia, Newark: the distances to the airports are meaningful for explaining the price.
  • log_trip_distance: the logarithm of the trip distance.
  • direction: an angle value in degrees between -180 and 180.
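A sketch of these spatial features with the geosphere package, assuming pickup/drop-off coordinate columns named pickup_longitude, pickup_latitude, dropoff_longitude and dropoff_latitude (the Kaggle names); the JFK coordinates are approximate:

library(geosphere)

pick <- cbind(trips$pickup_longitude,  trips$pickup_latitude)
drop <- cbind(trips$dropoff_longitude, trips$dropoff_latitude)

trips$trip_distance     <- distHaversine(pick, drop) / 1000   # great-circle distance in km
trips$log_trip_distance <- log1p(trips$trip_distance)         # log1p avoids -Inf on zero-distance trips
trips$direction         <- bearing(pick, drop)                # bearing in degrees, (-180, 180]

jfk <- c(-73.7781, 40.6413)                                   # (longitude, latitude)
trips$pickup_from_jfk <- distHaversine(pick, jfk) / 1000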

To capture the temporal influence on the taxi fare price, we add several features engineered from the timestamp of the taxi ride. First, w_day and hour are respectively a number between 1 and 7 coding the day of the week and a number between 0 and 23 for the hour.
The features night_surcharge and wday_peak_hours are binary and are built to encode the peak-hour and night-hour price surcharges (information found at https://www1.nyc.gov/site/tlc/passengers/taxi-fare.page).
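A sketch of these temporal indicators with lubridate; the surcharge windows follow the TLC fare page cited above (overnight surcharge 20:00-06:00, weekday peak 16:00-20:00), and the regression output below actually splits the night surcharge into two separate variables:

library(lubridate)

trips$w_day <- wday(trips$pickup_datetime)    # 1 (Sunday) to 7 (Saturday)
trips$hour  <- hour(trips$pickup_datetime)    # 0 to 23

trips$night_surcharge <- trips$hour >= 20 | trips$hour < 6
trips$wday_peak_hours <- trips$w_day %in% 2:6 & trips$hour >= 16 & trips$hour < 20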

Linear model

To begin with, let's predict our taxi fare prices with linear models. First, we fit a linear model of the logarithm of the price, log(price), on the given features plus the weather and pollution features.

## 
## Call:
## lm(formula = log_fare ~ . - fare - date, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2627 -0.1875 -0.0022  0.1769  3.2967 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.375e+03  5.859e+02   2.346  0.01896 *  
## ilong                  1.835e+01  7.987e+00   2.297  0.02160 *  
## ilat                  -1.889e+00  2.099e-01  -9.000  < 2e-16 ***
## olong                 -7.300e-01  5.428e-02 -13.448  < 2e-16 ***
## olat                   1.775e-01  7.375e-02   2.407  0.01607 *  
## pcount                 3.679e-03  1.224e-03   3.007  0.00264 ** 
## tmpc                  -1.710e-03  1.246e-03  -1.372  0.16999    
## dwpc                   2.056e-03  1.333e-03   1.543  0.12295    
## relh                  -3.827e-04  3.406e-04  -1.124  0.26116    
## mslp                  -5.278e-06  6.403e-06  -0.824  0.40980    
## p01m                   1.056e-03  1.919e-03   0.550  0.58213    
## w_day                  4.849e-03  8.044e-04   6.027 1.68e-09 ***
## hour                  -1.604e-04  5.734e-04  -0.280  0.77972    
## night_surcharge1TRUE  -3.556e-02  5.997e-03  -5.929 3.07e-09 ***
## night_surchage2TRUE   -7.589e-02  7.634e-03  -9.941  < 2e-16 ***
## wday_peak_hoursTRUE    4.137e-03  4.908e-03   0.843  0.39929    
## direction              5.279e-05  2.005e-05   2.633  0.00847 ** 
## trip_distance          1.498e-01  5.664e-04 264.551  < 2e-16 ***
## pickup_from_jfk        1.169e-01  1.488e-02   7.858 4.01e-15 ***
## pickup_from_laguardia -5.181e-02  1.646e-03 -31.478  < 2e-16 ***
## pickup_from_newark    -8.850e-02  6.955e-02  -1.273  0.20319    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3129 on 39693 degrees of freedom
##   (453 observations deleted due to missingness)
## Multiple R-squared:  0.709,  Adjusted R-squared:  0.7089 
## F-statistic:  4836 on 20 and 39693 DF,  p-value: < 2.2e-16

As we can see from the p-values, the spatial features are very relevant, whereas the weather data are of less importance. There is still information left in the residuals, as shown in the plot above.

The mean RMSE obtained is \(6.69\)$ and the test error is \(6.6\)$. The following plot represents the predicted and true fares for the first 1000 observations in the test set.

We notice that the linear model tends to underestimate the fare price. This is probably due to the high number of outliers. Indeed, a density estimate of the fare price shows a heavy tail and a highly positive skewness. With such a fare price distribution, fitting a linear model is difficult.

Let's try to cope with this issue by giving zero weight to the observations with a fare price above \(50\)$. With this second linear model, we obtain a mean RMSE of \(4.79\)$, but the test error only goes down to \(6.4\)$.
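A sketch of this weighted fit, reusing the formula of the first linear model:

fit_lm_w <- lm(log_fare ~ . - fare - date,
               data    = train,
               weights = as.numeric(train$fare <= 50))   # zero weight for fares above $50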

This model seems to be a bit less conservative than the previous one. A main issue that we will try to tackle with the following models is the heteroskedasticity of the fare price with respect to the trip distance, as well as a bias, as we can see in this plot:

Our model seems to underestimate the fare price for small trip distances. Thus, this variable seems to be of little use to predict prices in this range of values. In fact, it is reasonable to think that when the distance of the taxi ride is close to zero, the price might depend more on other variables (traffic, weather, etc.) than on the distance. This effect has already been seen with the first approach.

Linear models are clearly not efficient, all the more so as the exponential transformation leads to error propagation. Let's now turn to a new family of models, which will be more robust and allow us to detect non-linearities.

CART and Random Forest

The random forest method is based on bagging decision trees. Let's look at a single decision tree first.
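A sketch of a single regression tree with rpart; the tree plot mentioned below can be produced with rpart.plot:

library(rpart)
library(rpart.plot)

fit_cart <- rpart(fare ~ . - date, data = train, method = "anova")   # regression tree
rpart.plot(fit_cart)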

With this basic plot, one can see that the trip_distance variable seems to have a lot of influence. With the “time-series” cross-validation method, we find a mean RMSE of \(3.33\)$ and a test error of \(4.67\)$, a lot better than our linear models. We now have the feeling that a random forest on our dataset could prove quite satisfying. Let's have a look at how much the mean RMSE can be improved.

A random forest has a lot of hyperparameters to tune. First, let's determine the number of trees by performing a grid search on the train dataset.

num.trees = 300 seems to be efficient enough in terms of OOB (Out Of Bag) error.

Let’s do a grid search to find which hyperparameters are optimal in terms of OOB (Out Of Bag) error.

grid_search <- expand.grid(
  mtry       = seq(1, 9, by = 1),
  node_size  = seq(3, 9, by = 2), 
  sample_size = 0.5
)
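A sketch of how each grid point can be scored by its OOB error with ranger (the grid column names map onto ranger's min.node.size and sample.fraction arguments):

library(ranger)

grid_search$oob_error <- NA
for (i in seq_len(nrow(grid_search))) {
  fit <- ranger(formula         = fare ~ . - date,
                data            = train,
                num.trees       = 300,
                mtry            = grid_search$mtry[i],
                min.node.size   = grid_search$node_size[i],
                sample.fraction = grid_search$sample_size[i],
                seed            = 123)
  grid_search$oob_error[i] <- fit$prediction.error   # OOB mean squared error
}
grid_search[which.min(grid_search$oob_error), ]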

The final random forest is then the following:

rf <- ranger(formula = fare ~ . - date,
                     data = train,
                     num.trees = 300,
                     mtry = 8,
                     min.node.size = 9,
                     sample.fraction = 0.5)

With the “time series” cross-validation method, we obtain a mean RMSE of \(2.93\)$ and a test error of \(3.316\)$ for the optimal random forest, which is a bit better than with a basic CART.

Now, an interesting idea to help our algorithm could be to use an unsupervised learning method such as clustering to add useful features to feed our random forest and improve the mean RMSE. We therefore ran the K-means algorithm three times (with \(K = 5\)): on the drop-off coordinates, on the pickup coordinates, and on both. We then obtain a mean RMSE of \(2.64\)$ and a test error of \(3.314\)$. The K-means features seem to capture valuable information about the spatial distribution of the taxi rides, which helps our random forest learn.
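A sketch of these clustering features with kmeans, using the coordinate column names from the linear model output (ilong/ilat for pickups, olong/olat for drop-offs); in practice the centers learned on the train set would also be applied to the test set:

km_pick <- kmeans(train[, c("ilong", "ilat")], centers = 5, nstart = 10)
km_drop <- kmeans(train[, c("olong", "olat")], centers = 5, nstart = 10)
km_both <- kmeans(train[, c("ilong", "ilat", "olong", "olat")], centers = 5, nstart = 10)

train$cluster_pickup  <- factor(km_pick$cluster)
train$cluster_dropoff <- factor(km_drop$cluster)
train$cluster_both    <- factor(km_both$cluster)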

As we see on this plot, our clustered random forest fits a lot better than a linear model. At this point, we need to know where the remaining difficulties lie for a better prediction. To observe variable importance, we compute how much each variable reduces the node impurity.

As shown in the plot above, the variable trip_distance seems to be very impactful compared to the others in predicting the fare price. This is a good reason to plot the relation between the fare price and the trip distance.

This plot is to be compared with our random forest prediction.

Looking at this plot, one may notice that the random forest predicts quite well the prices that depend heavily on the distance, but not the other ones. An interpretation could be that prices in this region depend on latent variables that are not in our dataset (just as we concluded with our first approach).

To confirm this hypothesis, a new linear model is fitted only on the trip_distance variable.

## 
## Call:
## lm(formula = log_fare ~ trip_distance, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6218 -0.2013 -0.0017  0.1887  2.8803 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.7249035  0.0022691   760.2   <2e-16 ***
## trip_distance 0.1383768  0.0004771   290.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3309 on 40165 degrees of freedom
## Multiple R-squared:  0.6768, Adjusted R-squared:  0.6768 
## F-statistic: 8.412e+04 on 1 and 40165 DF,  p-value: < 2.2e-16

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.725   1.898   2.020   2.189   2.267   5.476

The information left in the residuals seems to be of less importance compared to the previous linear model trained on all the features. But the test error is very high (\(8.65\)$).

Conclusion of the second approach

Our work was meant to be exploratory: we observed the performance of several models, looked at where the predictions could be improved, and identified the limits of our dataset. In terms of prediction error, our work can be compared with some of the best Kaggle kernels from the original competition. Indeed, the best test error on Kaggle lies around \(3\)$. In terms of interpretability, looking at the variable importance of the random forest gives very good insight into why nobody seems to get a performance metric lower than \(3\)$ on test sets. The trip_distance variable is so influential that it might hide other dependencies of the fare price on the dataset's features. Another interpretation could be that some observations are independent of the dataset variables and thus very hard to predict. A solution could be to enrich our dataset with other features (like traffic in New York, for instance).