According to the American Diabetes Association, recent national surveys show that American adults…

3 min read · May 28, 2020

Pima women celebrating after winning a traditional game of field hockey (shinny).

In the U.S., Pima Indians have the highest known rates of diabetes and obesity. This concerned researchers enough that they have studied Type 2 diabetes and obesity in the community since 1965, which led to one of the oldest and most widely used datasets in the medical field. The results have provided insight into the role of genetics and the environment in metabolic diseases such as Type 2 diabetes.

The women who had diabetes during pregnancy were more obese and had higher glucose levels than the women who developed diabetes after pregnancy. Although no new datasets exist, several later publications repeated the analyses on expanded data sets.

At the outset, the dataset indicates that of the 768 women tested, roughly 35% (268) were diagnosed with Type 2 diabetes, as indicated in the following graph.
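
The split quoted above can be checked with a few lines of arithmetic. This is only a sketch built on the two counts stated in this article (768 women, 268 diagnosed); the variable names are my own and nothing here is read from the actual data file.

```python
# Counts quoted in the article: 768 women tested, 268 diagnosed.
total_women = 768
diabetic_women = 268

diabetic_share = diabetic_women / total_women
non_diabetic_women = total_women - diabetic_women

print(f"diabetic: {diabetic_women} ({diabetic_share:.1%})")
print(f"non-diabetic: {non_diabetic_women} ({1 - diabetic_share:.1%})")
```

The classes are not balanced, so a model that always answered "non-diabetic" would already be right about 65% of the time. That is a useful baseline to keep in mind when reading the accuracy figures later on.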

Based on the evidence viewed and tested (link to the colab provided at the bottom), it’s evident that women who were diagnosed as diabetic tended to have more pregnancies than the women who were not. There was one problem, though: there was no clear-cut relationship between the number of pregnancies and the occurrence of diabetes.

0 represents “non-diabetic”; 1 represents “diabetic”.

The above graph represents the overall goal. The occurrence and number of pregnancies obviously vary by age, but in this data the risk of being diagnosed as diabetic increases with each additional pregnancy.

The purpose of the above graph is to show how well the model distinguishes between two classes, in this case diabetic vs. non-diabetic. The higher the AUC (the area under the ROC curve), the better the model is at predicting Pima women as non-diabetic (0s) and diabetic (1s). By extension, the higher the AUC, the better the model is at predicting whether someone has a disease. After running this model, my AUC score came in at 0.86 (86%).
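
To make the AUC idea concrete, here is a minimal sketch of one way to compute it by hand: the AUC equals the share of (diabetic, non-diabetic) pairs in which the diabetic woman receives the higher score, with ties counting as half. The function name `roc_auc` and the toy labels and scores are made up for illustration; they are not taken from my colab or from the Pima data.

```python
def roc_auc(labels, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = diabetic, 0 = non-diabetic, scores = model output.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC: {toy_auc:.3f}")
```

An AUC of 0.5 means the scores rank the two classes no better than chance, and 1.0 means every diabetic case is scored above every non-diabetic case. The pairwise loop is fine for a toy list but slow on large data; a library routine such as scikit-learn's `roc_auc_score` is the practical choice there.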

The Pima Indian dataset was analyzed and explored in detail. The patterns identified using exploratory data analysis (EDA) were validated using the modeling techniques in my colab. Classification models such as logistic regression (LR), random forest (RF), and K-Neighbors were built and evaluated with K-Fold cross-validation, with the purpose of identifying the best model to predict the occurrence of diabetes, based on pregnancies, in Pima Indian women. By the cross-validation performance measure, the LR model (78%) was the best-performing model.
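
K-Fold, mentioned above, is the procedure behind a cross-validation score: the data is cut into k slices, each slice takes one turn as the test set, and the k accuracies are averaged. The sketch below shows the idea in plain Python with a tiny nearest-neighbors classifier on a single feature. Everything in it is an assumption for illustration: the helper names, the made-up pregnancy counts and labels, and the choice of 5 folds and 3 neighbors. It is not the code from my colab.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes_for_diabetic = sum(train_y[i] for i in nearest)
    return 1 if votes_for_diabetic * 2 > k else 0


def cross_val_accuracy(xs, ys, n_folds=5, k=3):
    """Average test accuracy over the folds."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test_idx
        )
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / len(fold_scores)


# Made-up, already-shuffled toy data: pregnancies and 0/1 diabetes labels.
toy_pregnancies = [1, 8, 2, 9, 0, 7, 3, 10, 1, 6]
toy_outcomes = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
toy_cv_accuracy = cross_val_accuracy(toy_pregnancies, toy_outcomes)
print(f"toy cross-validated accuracy: {toy_cv_accuracy:.0%}")
```

The toy labels are deliberately clean, so the score here says nothing about the real dataset. The sketch also does not shuffle, so real data should be shuffled first (or split with a library helper such as scikit-learn's `KFold` with `shuffle=True`) to avoid folds that contain only one class.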

Please feel free to take a look at my colab by clicking the link below.

https://drive.google.com/file/d/1XEot8mbe0CisfO0pYeCm9g9MuYn2oPTe/view?usp=sharing

Written by Aaron Huizenga