Competition Link: WiDS Datathon 2022
Teammates:
In January 2022, my friend Kristin and I got together to participate in the Women in Data Science (WiDS) Datathon on Kaggle.com. Kristin was a PhD candidate at UC Davis at the time, and we teamed up with Jennifer Worrall, CTO of Community Energy Labs.
This was my first time taking a Kaggle competition seriously. Kristin and I wrote the Python code for the data analysis and model training, and Jennifer led us as the industry professional. The competition was about predicting the energy use intensity (EUI) of the buildings in the given test set.
The provided data came from the Lawrence Berkeley National Laboratory and consisted of 76,000 building IDs along with the corresponding weather data for a given time period.
What made this data set difficult was the low correlation between the weather data and building energy usage. Jennifer pointed out that the biggest indicators of energy usage are the age of the equipment and the construction material, both of which were missing from our data.
Kristin and I started with multivariate linear regressions in scikit-learn, trying different combinations of the most highly correlated variables. We then switched our modeling framework to XGBoost and stuck with it for the rest of the competition.
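For readers who want to see what that first approach looks like, here is a minimal sketch rather than our actual competition code. It assumes "highest correlated" means the largest absolute Pearson correlation with the target, and it runs on synthetic data. The feature names and the `top_correlated` helper are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 buildings, 6 numeric features (hypothetical names).
feature_names = [
    "floor_area", "year_built", "avg_temp",
    "heating_degree_days", "elevation", "precipitation",
]
X = rng.normal(size=(500, len(feature_names)))
# The fake target depends mostly on the first and fourth features, plus noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(scale=0.5, size=500)


def top_correlated(X, y, k):
    """Return indices of the k columns with the largest |Pearson r| against y."""
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(corrs))[:k]


# Keep the two most correlated features and fit an ordinary linear regression.
selected = top_correlated(X, y, k=2)
selected_names = [feature_names[j] for j in selected]
model = LinearRegression().fit(X[:, selected], y)
r2 = model.score(X[:, selected], y)

print("Selected features:", selected_names)
print("R^2 on the training data:", round(r2, 3))
```

Changing `k` or the list of candidate columns is a quick way to try different combinations, which is roughly what we did by hand.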
After cleaning and encoding the data, we split the training set 80/20 into training and test subsets and passed the arrays to an XGBoost regressor model. The competition's scoring metric was root mean squared error (RMSE).
We made a number of submissions and scored around the competition average, with an RMSE of about 41. Overall, I had a great time learning more about the XGBoost library as well as the building energy industry. An interesting quote from the competition:
… the lifecycle of buildings from construction to demolition were responsible for 37% of global energy-related and process-related CO2 emissions in 2020.
Kaggle Page