Time Series Data – Exploratory Data Analysis & Forecast

This is my first data science project. I would like to thank Zillow for making the data available to aspiring data scientists like us. I also referenced Lisa & Jason's material on forecasting time series data; that information was invaluable.

Import & Clean Data

Let us read the State Time Series data and do some basic analysis. We will restrict the analysis to records from 2010 onwards, so let us drop the earlier rows from the data set.

In [94]:
import pandas as pd
df = pd.read_csv('zecon/State_time_series.csv')
df.Date = pd.to_datetime(df.Date)
df = df[df['Date'] >= '2010-01-01']
df.head()
Out[94]:
        Date        RegionName  DaysOnZillow_AllHomes  InventorySeasonallyAdjusted_AllHomes  InventoryRaw_AllHomes  ...  ZHVI_TopTier
8157    2010-01-31  Alabama     168.0                  32618.0                               29765.0                ...  232600.0
8158    2010-01-31  Alaska      130.0                  3477.0                                2748.0                 ...  356200.0
8159    2010-01-31  Arizona     112.0                  50231.0                               50482.0                ...  298500.0
8160    2010-01-31  Arkansas    121.0                  18473.0                               17416.0                ...  185000.0
8161    2010-01-31  California  108.0                  138662.0                              132187.0               ...  655700.0

5 rows × 79 columns

Exploratory Data Analysis (EDA)

Let us explore the data state by state. Is there a trend for a specific state, and which are the top states?

Top 7 States by Median Listing Price per Sqft

These are the top 7 states by Median Listing Price per Sqft for All Homes.

In [95]:
dfallhomes = df.groupby('RegionName', as_index=False)['MedianListingPricePerSqft_AllHomes'].\
    mean().dropna().sort_values('MedianListingPricePerSqft_AllHomes', ascending=False)
dfallhomes.head(7)
Out[95]:
    RegionName          MedianListingPricePerSqft_AllHomes
8   DistrictofColumbia  419.093624
11  Hawaii              410.104882
4   California          267.298446
21  Massachusetts       205.758438
32  NewYork             172.085814
39  RhodeIsland         171.417274
5   Colorado            171.243077

Bottom 5 states of Median Listing Price per Sqft All Homes

List the bottom 5 states from the results

In [105]:
dfallhomes.tail(5)
Out[105]:
    RegionName   MedianListingPricePerSqft_AllHomes
36  Oklahoma     87.865113
35  Ohio         87.469543
3   Arkansas     87.355016
24  Mississippi  84.641021
14  Indiana      82.010350

Trend in State Time Series

The most important EDA on time series data is to identify trend, seasonality and correlation. Let us check whether there is any trend in the data for the top states.

In [23]:
%matplotlib inline
import matplotlib.pyplot as plt
statelist = ['Hawaii', 'DistrictofColumbia', 'California', 'Massachusetts', 'NewYork', 'Colorado']
stateseries = (df[df['RegionName'].isin(statelist)]
    [['Date', 'RegionName', 'MedianListingPricePerSqft_AllHomes']]
    .dropna()
    .groupby(['Date', 'RegionName'])['MedianListingPricePerSqft_AllHomes']
    .mean()
    .unstack())
stateseries.plot(figsize=(15,8), linewidth=3)
plt.show()

From the above graph, we can see that:

  1. Hawaii and the District of Columbia show a strong upward trend despite the ups and downs, which also suggests seasonality. However, investing in homes in these two states is costly.
  2. There is little point in investing in homes in New York; the series is almost flat.
  3. Colorado mirrors the trend of the top two states and, since it rises steadily over the years, looks like the best investment for homes among the top states.
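One quick way to separate the trend from the seasonal swings noted above is a rolling mean; a minimal sketch on a hypothetical monthly series (in the notebook, a column of `stateseries` would take its place):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series standing in for one state's
# MedianListingPricePerSqft_AllHomes: linear trend + yearly cycle + noise.
rng = np.random.default_rng(0)
idx = pd.date_range('2010-01-01', periods=60, freq='MS')
prices = pd.Series(200 + 0.5 * np.arange(60)
                   + 5 * np.sin(np.arange(60) * 2 * np.pi / 12)
                   + rng.normal(0, 1, 60), index=idx)

# A centered 12-month rolling mean averages out the seasonal swing,
# leaving the underlying trend.
trend = prices.rolling(window=12, center=True).mean().dropna()
print(trend.iloc[-1] > trend.iloc[0])  # the smoothed series rises over time
```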

Let us check whether there is any seasonality in the trends. This is important for the predictions.

In [100]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set()
season = df.copy()
season['Year'] = season['Date'].dt.year
season['Month'] = season['Date'].dt.month
spivot = pd.pivot_table(season, index='Month', columns='Year', values='MedianListingPricePerSqft_AllHomes', aggfunc='mean')
spivot.plot(figsize=(20,10), linewidth=3)
plt.show()

From the above graph, we can see that:

  1. Prices started decreasing in 2010 and kept falling through 2012; prices in 2013 are roughly back at 2010 levels.
  2. The best time to sell a house is from June to October: prices peak in these months consistently in all years, even during the downturn. Conversely, the best time to buy a house is in December and January.
  3. Whether the winter price drop is due to the holiday season or some other reason is a problem for us to solve another day.

Correlation

I have always wondered whether the various bedroom types move together during these trends. Let's find out.

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt

brtypes = df.groupby('Date')[['MedianListingPricePerSqft_1Bedroom',
                              'MedianListingPricePerSqft_2Bedroom',
                              'MedianListingPricePerSqft_3Bedroom',
                              'MedianListingPricePerSqft_4Bedroom',
                              'MedianListingPricePerSqft_5BedroomOrMore']].mean().dropna()
pd.plotting.autocorrelation_plot(brtypes)
plt.show()

The above graph shows positive autocorrelation for all the bedroom types, but it is still not clear how the bedroom types correlate with one another. Let us find out.

In [34]:
brtypes.corr()
Out[34]:
MedianListingPricePerSqft_*   1Bedroom  2Bedroom  3Bedroom  4Bedroom  5BedroomOrMore
1Bedroom                      1.000000  0.869769  0.834149  0.776416  0.748781
2Bedroom                      0.869769  1.000000  0.995036  0.980379  0.970255
3Bedroom                      0.834149  0.995036  1.000000  0.993445  0.983750
4Bedroom                      0.776416  0.980379  0.993445  1.000000  0.995176
5BedroomOrMore                0.748781  0.970255  0.983750  0.995176  1.000000

The above table confirms the correlation. To be more specific, let us remove the trend by taking the first-order difference of the data: correlating the month-over-month changes rather than the levels avoids spurious correlation driven by the shared upward trend.

In [35]:
brtypes.diff().corr()
Out[35]:
MedianListingPricePerSqft_*   1Bedroom  2Bedroom  3Bedroom  4Bedroom  5BedroomOrMore
1Bedroom                      1.000000  0.648134  0.688513  0.634148  0.558673
2Bedroom                      0.648134  1.000000  0.936531  0.914260  0.872121
3Bedroom                      0.688513  0.936531  1.000000  0.951192  0.878442
4Bedroom                      0.634148  0.914260  0.951192  1.000000  0.910900
5BedroomOrMore                0.558673  0.872121  0.878442  0.910900  1.000000

Even after first-order differencing, the bedroom types remain well correlated. You can see that 1 Bedroom and 2 Bedroom are more highly correlated than 1 Bedroom and 5 Bedroom.

Forecast

We will forecast the Median Listing Price per Sqft for All Homes. Let us look at the trend first.

In [104]:
allhomes = df.groupby('Date')[['MedianListingPricePerSqft_AllHomes']].mean().dropna()
allhomes.plot(figsize=(10,8))
plt.show()

In the above graph you can see a clear trend, but there is also seasonality within it. A time series should be stationary before forecasting; otherwise the predictions may not be accurate.

ARIMA forecast model

One of the common models used to forecast time series data is ARIMA, which stands for AutoRegressive Integrated Moving Average; its parameters are p, d and q. Since the data has seasonality, let us use seasonal ARIMA (SARIMAX) to fit the model. There is a separate process to identify the optimum parameters: I ran a grid search on a GPU machine, and it stopped after 700+ iterations.

In [ ]:
import warnings
import itertools
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
warnings.filterwarnings("ignore")

# evaluate a SARIMAX model for a given order & seasonal order
def evaluate_sarimax_model(dataset, param, seasonal_param):
    # train/test split
    train_size = int(len(dataset) * 0.60)
    train, test = dataset[0:train_size], dataset[train_size:]
    try:
        mod = sm.tsa.statespace.SARIMAX(dataset,
                                        order = param,
                                        seasonal_order = seasonal_param,
                                        enforce_stationarity = False,
                                        enforce_invertibility = False)
        results = mod.fit()
        pred = results.get_prediction(start = test.index[0], dynamic = False)
        error = mean_squared_error(test, pred.predicted_mean)
        return error
    except Exception:
        return float("inf")

def evaluate_models(dataset, p_values, d_values, q_values, s_value):
    pdq = list(itertools.product(p_values, d_values, q_values))
    seasonal_pdq = [(p, d, q, s_value) for (p, d, q) in pdq]
    best_score = float("inf")
    best_param = best_seasonal_param = None
    for param in pdq:
        for seasonal_param in seasonal_pdq:
            mse = evaluate_sarimax_model(dataset, param, seasonal_param)
            if mse < best_score:
                best_score, best_param, best_seasonal_param = mse, param, seasonal_param
            print('SARIMAX results for {} {} MSE {}'.format(param, seasonal_param, mse))
    print('Best results are MSE {} order {} seasonal order {}'.format(best_score, best_param, best_seasonal_param))

The output of the above cell is suppressed, as the process may run for hours depending on the capacity of the system. This is the code used for the hyperparameter grid search.

In [103]:
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
mod = sm.tsa.statespace.SARIMAX(allhomes,
                                    order = (2, 0, 4),
                                    seasonal_order = (3, 1, 2, 12),
                                    enforce_stationarity = False,
                                    enforce_invertibility = False)
results = mod.fit()
results.plot_diagnostics(figsize=(15,12))
plt.show()

From the above diagnostics, we can see that:

  1. The histogram of the residuals differs only slightly from the KDE.
  2. The Q-Q plot deviates from the straight line at the tails, so the fit can be improved.
  3. The correlogram still shows some positive autocorrelation in the residuals; this can be optimized further.

Validate the Model

Let us validate the model with a train/test split.

In [102]:
train_size = int(len(allhomes) * 0.60)
train, test = allhomes[0:train_size], allhomes[train_size:]

pred = results.get_prediction(start = test.index[0], dynamic = False)
pred_ci = pred.conf_int()

ax = allhomes.plot(label='actual', figsize=(10,8))
pred.predicted_mean.plot(ax=ax, label='Forecast', alpha=0.7, color='red')
plt.legend()
plt.show()

The predictions above are not perfect, but they are still better than those from other parameter combinations. Let us compute the mean squared error of the model.

In [98]:
from sklearn.metrics import mean_squared_error
error = mean_squared_error(test, pred.predicted_mean)
print('MSE {}'.format(error))
MSE 0.17096716457360572
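Since MSE is in squared units, the root mean squared error is easier to read against the price-per-sqft scale; a minimal sketch with hypothetical actual and predicted values standing in for `test` and `pred.predicted_mean`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual vs. predicted values, in $/sqft.
actual = np.array([120.0, 121.5, 123.0, 122.0])
predicted = np.array([119.5, 121.9, 122.6, 122.5])

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)  # back in the original $/sqft units
print(round(rmse, 3))  # → 0.453
```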

The mean squared error is 0.171, which means the forecast can still be improved. Now let us predict the future.

In [101]:
pred_uc = results.get_forecast(steps=24)
pred_ci = pred_uc.conf_int()

ax = allhomes.plot(label = 'Actual', figsize=(15,8))
pred_uc.predicted_mean.plot(ax=ax, label='Forecasted')
ax.fill_between(pred_ci.index,
                   pred_ci.iloc[:,0],
                   pred_ci.iloc[:,1],
                   color='k', alpha=0.25)
ax.set_xlabel('Date')
ax.set_ylabel('MedianListingPricePerSqft_AllHomes')
plt.legend()
plt.show()

In the above graph you can see the forecast for the next 24 months; the confidence interval is reasonably tight, without too much variance.
