ML · 2025
Road Accident Risk Predictor
Regression model on real crash data. Compared linear baseline, Random Forest, and Gradient Boosting; tuned with GridSearchCV on a held-out split. Random Forest won at 90.5% R². Wrapped in a Streamlit UI with a gauge so non-technical users can score a scenario.
The problem
Public road-safety data is messy: missing fields, inconsistent units, and a long tail of rare conditions (low-light, wet roads, holidays). The question I wanted to answer was practical — given a set of road characteristics, environmental conditions, and traffic context, what is the predicted risk of an accident?
The harder problem was making it usable. A trained model sitting in a notebook is useless; a non-technical user — a road planner, a learner driver, anyone — should be able to drive it and get a clear answer.
Approach
I started with feature engineering on real crash data: cleaning sentinels, normalising units, encoding categorical features (road type, weather, lighting). Stratified the split so the rare-event distribution survived intact, then trained three baselines: linear regression, Random Forest, and Gradient Boosting.
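The cleaning-encoding-splitting flow can be sketched with scikit-learn. All column names and values here are hypothetical stand-ins for the real crash dataset; the stratification key (`lighting`) is one plausible categorical proxy for rare conditions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the real crash data (columns are hypothetical).
df = pd.DataFrame({
    "road_type": ["urban", "rural", "highway", "urban", "rural", "highway"] * 10,
    "weather":   ["clear", "rain", "clear", "fog", "rain", "clear"] * 10,
    "lighting":  ["day", "night", "day", "night", "day", "night"] * 10,
    "speed_limit": [30, 60, 100, 30, 60, 100] * 10,
    "risk":      [0.2, 0.7, 0.5, 0.6, 0.8, 0.4] * 10,
})

# Replace sentinel values (e.g. -1 for "unknown") with NA before imputing/dropping.
df = df.replace(-1, pd.NA)

categorical = ["road_type", "weather", "lighting"]
numeric = ["speed_limit"]

# One-hot encode categoricals, normalise numeric units.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Stratify on a categorical proxy so rare conditions keep their share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    df[categorical + numeric], df["risk"],
    test_size=0.2, random_state=42, stratify=df["lighting"],
)

X_train_enc = preprocess.fit_transform(X_train)
```

`handle_unknown="ignore"` matters in production: a category the model never saw at training time encodes to all zeros instead of crashing the app.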
GridSearchCV tuned both tree ensembles with cross-validation on the training split; the final score came from the untouched test split. Random Forest came out ahead — 90.5% R² — with explainable feature contributions via SHAP for the post-prediction breakdown.
For deployment I picked Streamlit. The whole UI is a single Python file: sliders for road conditions, dropdowns for weather and lighting, an animated risk gauge for the result, plus a SHAP contributions chart so users can see WHY the model scored their scenario the way it did.
Key decisions
Random Forest over Gradient Boosting
Gradient Boosting matched RF on raw R² but was more sensitive to the hyperparameter grid and slower to retrain. RF landed within the margin of error and was far more forgiving — the right choice for a project where retraining is part of the iteration loop.
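The trade-off is easy to measure directly. A minimal sketch on synthetic data, timing a cross-validated fit of each model (defaults and sizes here are illustrative, not the project's actual grid):

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

results = {}
for name, model in [
    ("RandomForest", RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)),
    ("GradientBoosting", GradientBoostingRegressor(n_estimators=200, random_state=0)),
]:
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=3, scoring="r2")
    elapsed = time.perf_counter() - start
    results[name] = scores.mean()
    print(f"{name}: mean R2={scores.mean():.3f} in {elapsed:.1f}s")
```

Note the structural reason RF retrains faster here: its trees are independent and parallelise across cores (`n_jobs=-1`), while boosting builds trees sequentially.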
Streamlit over a custom React UI
A React + FastAPI stack would have been more impressive on paper, but Streamlit shipped in a fraction of the time and let me iterate on the model and the interface in the same Python file. The result was a usable demo in days, not weeks.
SHAP for explainability
Risk scores without justification are useless to users — and dangerous in a safety context. SHAP feature contributions make every prediction auditable.
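The project uses SHAP's tree explainer for per-prediction breakdowns; as a dependency-light sketch of the same idea, here is a global importance ranking via scikit-learn's permutation importance — shuffle one feature at a time on held-out data and measure how much the score drops. Feature names and data are hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature names standing in for the real engineered columns.
feature_names = ["speed_limit", "curvature", "traffic_density", "lighting_score"]
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the R² drop it causes.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The difference in the app itself: permutation importance is global (which features matter overall), while SHAP attributes each individual prediction — which is what makes a single risk score auditable to the user looking at it.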
Outcome
Live on Streamlit at the link above. Final test-set R² of 90.5% with Random Forest and tuned hyperparameters. The interactive UI handles arbitrary scenarios in real time and shows both the predicted risk score and the top contributing features.
What I’d do differently
The gap between "model that works in a notebook" and "model anyone can use" is most of the work. Next iteration: a TensorFlow time-series variant for predicting risk over a route, not a single point.