ML · 2025
Diabetes Prediction Model
Classification on the Pima Indians dataset. Replaced the dataset’s sentinel zeros with KNN imputation (median was misleading the minority class), stratified the split, and tuned with GridSearchCV. ROC AUC = 0.81; Glucose was the dominant signal in feature importance, with BMI and Pregnancies trailing.
The problem
The Pima Indians Diabetes dataset is small, imbalanced, and full of sentinel zeros that look like real values but aren't (a glucose reading of 0 is impossible — it means missing). The class imbalance is the trap: a model that predicts "no diabetes" for everyone will hit ~65% accuracy and look fine. Accuracy alone is not a useful number here.
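The accuracy trap is easy to demonstrate. A minimal sketch, using synthetic labels at roughly the Pima class ratio (the exact ratio here is illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# ~65/35 non-diabetic/diabetic split, similar to Pima's 768 rows
y_true = rng.choice([0, 1], size=768, p=[0.65, 0.35])
y_naive = np.zeros_like(y_true)  # always predict "no diabetes"

print(f"accuracy: {accuracy_score(y_true, y_naive):.2f}")  # looks respectable
print(f"recall:   {recall_score(y_true, y_naive):.2f}")    # catches no one
```

The constant predictor scores ~65% accuracy while missing every diabetic case, which is why recall and ROC AUC carry the evaluation here.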
Approach
First step was distinguishing real zeros from missing zeros across Glucose, BloodPressure, SkinThickness, Insulin, and BMI. I went with KNN imputation rather than median — median was pulling the imputed values toward the majority class and degrading recall on the minority. KNN borrowed the local structure of nearby points and produced more honest imputations.
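The imputation step can be sketched as: mark the sentinel zeros as missing, then let KNNImputer fill them from nearby rows. The `n_neighbors=5` value is an assumption (scikit-learn's default), not necessarily what the project used:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Columns where a literal 0 is physiologically impossible, i.e. a sentinel for missing.
SENTINEL_COLS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def impute_sentinels(df: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Replace sentinel zeros with NaN, then impute from the k nearest rows."""
    df = df.copy()
    df[SENTINEL_COLS] = df[SENTINEL_COLS].replace(0, np.nan)
    imputer = KNNImputer(n_neighbors=n_neighbors)
    # In practice, fit the imputer on training data only to avoid leakage.
    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
```

KNNImputer fills each missing cell from the rows closest in the remaining feature space, which is how the imputed values end up respecting the local structure the text describes.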
Stratified train/test split so the class ratio survived. GridSearchCV tuned a Logistic Regression over regularisation strength and class weights. Evaluation reported precision, recall, F1, and ROC AUC together, since accuracy alone hides the imbalance.
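The split-tune-evaluate loop looks roughly like this. `make_classification` stands in for the imputed Pima features, and the specific grid values, test size, and random seeds are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 features, ~65/35 class split, like Pima.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65], random_state=0)

# stratify=y keeps the class ratio identical in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = {"clf__C": [0.01, 0.1, 1, 10],             # regularisation strength
        "clf__class_weight": [None, "balanced"]}  # imbalance handling
search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
print(f"test ROC AUC: {roc_auc_score(y_test, proba):.2f}")
print(classification_report(y_test, search.predict(X_test)))
```

Scoring the grid search on ROC AUC rather than accuracy keeps the tuning itself honest about the imbalance, not just the final report.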
Key decisions
KNN imputation over median
Median imputation flattened the distinction between the diabetic and non-diabetic distributions. KNN preserved local structure and improved recall on the minority class, which is the class that matters in a screening context.
Logistic Regression as the final model
I trialled tree-based models too, but for a small dataset with 8 numerical features and a need for interpretable coefficients, regularised Logistic Regression was the right tool. ROC AUC of 0.81 is competitive on this dataset and the coefficients tell a clean story.
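Reading that story off the model is only valid after standardisation, which puts the coefficients on a comparable scale. A sketch of the mechanics, with placeholder names and synthetic data (the printed ranking here is meaningless; on the real data it is where Glucose dominates):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# The eight Pima columns, used here purely as labels for synthetic features.
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X = StandardScaler().fit_transform(X)  # same scale -> comparable coefficients
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rank by |coefficient|, keeping the sign for direction of effect.
coefs = pd.Series(clf.coef_[0], index=FEATURES).sort_values(key=abs, ascending=False)
print(coefs)
```

Coefficient magnitude is the "feature importance" the Outcome section refers to; the sign says whether a feature pushes toward or away from a diabetes prediction.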
Threshold tuning at 0.39, not 0.5
For screening, false negatives cost more than false positives. The optimal-threshold marker on the ROC curve sits at 0.39, trading a few false positives for materially better recall on actual diabetic cases.
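Applying the tuned threshold is a one-liner over the predicted probabilities. A minimal sketch on synthetic stand-in data; the 0.39 value is from this project, everything else is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

THRESHOLD = 0.39  # screening threshold read off the ROC curve

def predict_at(model, X, threshold=THRESHOLD):
    """Flag positive whenever P(diabetes) clears the screening threshold."""
    return (model.predict_proba(X)[:, 1] >= threshold).astype(int)

# Synthetic stand-in for the real train/test data.
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

r_default = recall_score(y_te, predict_at(model, X_te, threshold=0.5))
r_tuned = recall_score(y_te, predict_at(model, X_te))
```

Lowering the threshold can only convert predictions from negative to positive, so recall is monotonically non-decreasing as the threshold drops; the cost shows up as extra false positives, which is the right trade for screening.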
Outcome
ROC AUC = 0.81. Confusion matrix shows 73% true-negative rate and 70.4% true-positive rate at the tuned threshold. Glucose dominates feature importance (~2x the next feature), with BMI and Pregnancies trailing — consistent with clinical priors.
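The two rates quoted above fall directly out of the confusion matrix. A sketch with toy labels (in the project these would be the test labels and the thresholded probabilities at 0.39):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and thresholded predictions, purely to show the arithmetic.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# scikit-learn's 2x2 layout raveled in label order: tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tnr = tn / (tn + fp)  # true-negative rate (specificity)
tpr = tp / (tp + fn)  # true-positive rate (sensitivity / recall)
print(f"TNR={tnr:.1%}  TPR={tpr:.1%}")
```

Reporting both rates side by side makes the threshold trade-off explicit: pushing the threshold down moves cases from the TN column to the FP column while rescuing FNs into TPs.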
What I’d do differently
Imbalanced screening problems live and die on the choice of metric and the choice of threshold. The model is fine; the framing is what makes it useful.