
PJME TIME SERIES
We take a look at the PJME hourly load data to predict New Jersey's energy usage for the year to come. The predictions are made with XGBoost.
Data Collection
Data was collected using PJM's Data Miner 2. Due to the terms of use, the data cannot be distributed on the GitHub page, but it should remain available indefinitely through Data Miner 2.
Data Cleaning
The data from PJM is already very clean; however, the CSVs are downloaded in one-year intervals, so after download they are merged, and the data for New Jersey is isolated and saved.
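The merge-and-isolate step might look like the following sketch. The column and zone names here are assumptions standing in for the actual Data Miner 2 schema, and two tiny in-memory frames stand in for the yearly CSVs:

```python
import pandas as pd

# Each year is a separate CSV from Data Miner 2; two small in-memory
# frames stand in for them here (column/zone names are assumptions).
year_2017 = pd.DataFrame({
    "datetime_beginning_ept": pd.to_datetime(
        ["2017-01-01 00:00", "2017-01-01 01:00"]),
    "zone": ["NJ", "PA"],
    "mw": [5200.0, 8100.0],
})
year_2018 = pd.DataFrame({
    "datetime_beginning_ept": pd.to_datetime(["2018-01-01 00:00"]),
    "zone": ["NJ"],
    "mw": [5350.0],
})

# Merge the yearly files, then isolate the New Jersey rows.
merged = pd.concat([year_2017, year_2018], ignore_index=True)
nj = merged[merged["zone"] == "NJ"].reset_index(drop=True)
```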
Outlier Removal
Through a quick visual inspection of the data, we can see a few times when the meter readings for the load are 0. These may be the result of sensor failure or power outages. Either way, they are outliers, and if we left them in the data, the model would attempt to learn them, which would be an error. To remove them, we simply filter the data to include only nonzero values. This leaves us with:

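The filter itself is a one-liner; here is a runnable sketch on a toy series (the column name `PJME_MW` is an assumption):

```python
import pandas as pd

# Five hourly readings; the zeros stand in for sensor dropouts/outages.
df = pd.DataFrame({"PJME_MW": [5200.0, 0.0, 5350.0, 0.0, 5100.0]})

# Keep only the nonzero readings.
clean = df[df["PJME_MW"] > 0]
```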
Data Preparation
Because we intend to use XGBoost, we need to engineer features for the regression algorithm. To do this, we derive features from the date, such as the day of the week, quarter, month, whether it is a holiday, and so on. Our final dataset with the added features has the following columns.

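A sketch of the feature engineering, assuming the load series is indexed by timestamp; the holiday flag here uses pandas' built-in US federal holiday calendar, which is an assumption about how that flag was computed:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def make_features(df):
    # Derive calendar features from the DatetimeIndex.
    out = df.copy()
    out["hour"] = out.index.hour
    out["dayofweek"] = out.index.dayofweek
    out["quarter"] = out.index.quarter
    out["month"] = out.index.month
    out["year"] = out.index.year
    out["dayofyear"] = out.index.dayofyear
    # Holiday flag: compare each normalized date against the calendar.
    holidays = USFederalHolidayCalendar().holidays(
        start=out.index.min(), end=out.index.max())
    out["is_holiday"] = out.index.normalize().isin(holidays)
    return out

idx = pd.date_range("2018-07-04", periods=3, freq=pd.Timedelta(hours=1))
feats = make_features(
    pd.DataFrame({"PJME_MW": [5200.0, 5300.0, 5400.0]}, index=idx))
```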
As with many other machine learning problems, we want k-fold validation. To do this, we make three train/test splits using sklearn's TimeSeriesSplit. The three folds are displayed below.

Unlike other k-fold methods, for a time series the splits must be made so that no training data comes after the test data. If it did, the model would be tainted with information about the test data, invalidating the validation.
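A small sketch showing that TimeSeriesSplit honors this ordering constraint:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # stand-in for 12 ordered observations
folds = list(TimeSeriesSplit(n_splits=3).split(X))

# In every fold, all training indices precede all test indices.
for train_idx, test_idx in folds:
    assert train_idx.max() < test_idx.min()
```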
Model Training/Tuning
After the data is prepared, we train our model and use the root mean squared error (RMSE) to measure its effectiveness. These error measurements are taken before any hyperparameter tuning and will serve as a benchmark to see whether tuning has decreased the error.

We then tune our model. To do this, I use sklearn's GridSearchCV to find hyperparameters that optimize the model's performance. It is important to note that this process is extremely computationally expensive, taking about 60 times as long to run as the entire rest of the project. It does, however, yield the best hyperparameters, shown below.

We use these hyperparameters to train a new model on the same folds as our benchmark model, obtaining the following performance.

The decrease in error means that model tuning was a success! Using the hyperparameters found by the grid search, we were able to reduce the error by 18%, which is quite a large improvement. Because of the huge time investment that GridSearchCV requires, it is not always worth performing, but for us it has paid off nicely!
Forecasting
We now approach the pinnacle of the project, the whole reason we undertook this expedition: using the historical data to predict hourly energy usage in New Jersey. To do this, we create another data frame of dates extending one year into the future, then run it through our model to obtain predictions, which are graphed after the historical data.

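The future frame can be built like this sketch; the end date is a placeholder for the dataset's actual last timestamp:

```python
import pandas as pd

# Build an hourly index covering one year past the end of the history
# (the "last" timestamp here is a placeholder, not the real last row).
last = pd.Timestamp("2018-08-03 00:00")
future_idx = pd.date_range(
    start=last + pd.Timedelta(hours=1),
    end=last + pd.DateOffset(years=1),
    freq=pd.Timedelta(hours=1),
)
future = pd.DataFrame(index=future_idx)

# The same calendar features used in training would be derived here, e.g.:
future["hour"] = future.index.hour
future["month"] = future.index.month
# ...and then: future["prediction"] = model.predict(future[FEATURES])
```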
With this data, a stakeholder could get an informed prediction of when upcoming spikes and dips in power requirements will occur and adjust the capacity of the grid accordingly. This forecast of demand could also be used to inform pricing. Ultimately, it is difficult to judge the accuracy of the prediction, since the future has not happened yet, but visually it does appear to fall in line with previous trends!
Feature Importance
Beyond predictions for the future, our model also reports something called feature importance: the weight the model places on each feature. The feature importances are plotted below.

Feature importance gives us unique insight into trends in energy usage. From our graph, we can see that the month heavily affects observed energy usage; for example, summer months show enormous peaks, which makes sense, because that is when most people in New Jersey turn on the air conditioning. The feature importance also tells us that the time of day matters: late at night, when people are asleep, the grid is under less load. It also shows which features do not have as large an impact on the model; holiday or not, people appear to use about the same amount of energy.