Learn how to perform Exploratory Data Analysis, apply mean imputation, build a classification algorithm, and interpret the results.
Breast cancer is the most commonly diagnosed cancer among women in the United States (after skin cancer) and the second leading cause of cancer death in women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor is not necessarily cancerous: it can be benign (not breast cancer) or malignant (breast cancer). Tests such as an MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.
In this tutorial, we are going to create a model that predicts whether or not a patient has a positive breast cancer diagnosis based on the tumor's characteristics.
This dataset contains the following features:
- id (patientid)
- radius (the distance from the center to the circumference of the tumor)
- texture (standard deviation of gray-scale values)
- perimeter (circumference of the tumor, approximately 2 * 3.14 * radius)
- smoothness (local variation in radius lengths)
- concavity (severity of concave portions of the contour)
- diagnosis (1 if the patient has breast cancer, 0 if not)
The dataset and my full code are available on GitHub.
Import Libraries and Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print(f'Libraries have been imported! :)\n')
Now that our libraries have been imported, let’s go ahead and import our data using pandas.
train = pd.read_csv('breastcancer.csv')
As a side note, F-strings are amazing! They allow you to print strings and expressions in a more concise manner. The \n part means to add a new line. I do this to create more white space.
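For example, here is a quick sketch of an f-string with an embedded expression and a trailing \n (the column name and count are made up for illustration):

```python
name = "radius"
missing = 3

# f-strings evaluate the expressions inside the braces; \n appends a newline
message = f'The {name} column has {missing} missing values.\n'
print(message)
```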
Exploratory Data Analysis (EDA) answers the “What are we dealing with?” question. EDA is where we try to understand our data first. We want to gain insights before messing around with it.
Visualizations are a great way to do this.
Visualization #1: Heat Map
# simple heat map showing where we are missing data
heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar = True, cmap = "PuRd", vmin = 0, vmax = 1)
plt.show()
- train.isnull() checks for nulls in the train df
- yticklabels = False hides the row index labels on the y-axis
- cbar = True adds a color bar
- cmap maps data values to a color space
- vmin sets 0 as the minimum of the color bar
- vmax sets 1 as the maximum of the color bar
This heat map is interpreted as the following:
- 0 (white color) means we have a value
- 1 (dark red color) means we have a null
Looks like we only have nulls in the radius column! Not bad at all and easily fixable 🙂
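Alongside the heat map, isnull().sum() gives exact per-column null counts. A minimal sketch on a toy frame (column names taken from the tutorial; the real counts come from the CSV):

```python
import pandas as pd
import numpy as np

# toy stand-in for the train df, with two missing radius values
train = pd.DataFrame({
    'radius': [12.0, np.nan, 17.5, np.nan],
    'diagnosis': [0, 0, 1, 1],
})

# exact number of nulls in each column
print(train.isnull().sum())
```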
Visualization #2: Count Plot
# a count plot shows the counts of observations in each categorical bin using bars
# think of it as a histogram across a categorical, instead of quantitative, variable
sns.set_style("whitegrid")
sns.countplot(data = train, x = 'diagnosis', palette = 'husl')
- style is affecting the color of the axes, whether a grid is enabled by default, and other aesthetic elements
- data is the df, array, or list of arrays to plot
- x is the name of the variable in the data parameter
- palette is the color you want to use (palette name, list, or dict)
- 0 indicates no breast cancer
- 1 indicates breast cancer
Note that 0 doesn’t always indicate an absence of something and that 1 means a presence of something. Make sure you are reading your data correctly.
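To get the exact class counts behind the bars, value_counts() complements the plot. A minimal sketch with a toy stand-in for the train df:

```python
import pandas as pd

# toy stand-in for the train df; the real counts come from the CSV
train = pd.DataFrame({'diagnosis': [0, 0, 0, 1, 1]})

# number of observations in each class
counts = train['diagnosis'].value_counts()
print(counts)
```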
Visualization #3: Histogram
# let's check out the spread of ages using a histogram
train['age'].plot.hist(bins = 25, figsize = (10,6))
- we are looking at the age column within the train df
- bins sets the number of class intervals
- figsize = (width, height) creates a figure 10 inches wide and 6 inches tall
Data is not skewed and doesn’t have a distinct shape — doesn’t tell us too much. Let’s move on to cleaning our data.
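describe() backs up the histogram with summary statistics. A minimal sketch on a toy age column (the real numbers come from the CSV):

```python
import pandas as pd

# toy stand-in for the train df's age column
train = pd.DataFrame({'age': [34, 45, 52, 61, 47, 39]})

# count, mean, std, min, quartiles, and max in one call
summary = train['age'].describe()
print(summary)
```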
The missing data in the radius column needs to be filled in. We are going to do this by imputing the mean radius, not just dropping all null values. To impute a value simply means we are going to replace missing values with our newly calculated value. For our method specifically, it is referred to as mean imputation.
Let’s visualize the average radius of a tumor by diagnosis via a box plot.
plt.figure(figsize = (10,7))
sns.boxplot(x = "diagnosis", y = "radius", data = train)
Women who were diagnosed with breast cancer (diagnosis = 1) tend to have a higher tumor radius size, which is the distance from the center to the circumference of the tumor.
# calculate the average radius size by diagnosis (0 or 1)
train.groupby('diagnosis')["radius"].mean()
This is interpreted as…
“Women who are not diagnosed with breast cancer have an average/mean tumor radius size of 12.34.”
“Women who are diagnosed with breast cancer have an average/mean tumor radius size of 17.89.”
Now that we have found our average tumor radius by diagnosis, let’s impute them into our missing (aka our null) values.
# create a function that imputes average radius into missing values
def impute_radius(cols):
    radius = cols['radius']
    diagnosis = cols['diagnosis']
    # if value in radius column is null
    if pd.isnull(radius):
        # if woman is diagnosed with breast cancer
        if diagnosis == 1:
            return 17.89
        # if woman was not diagnosed with breast cancer
        else:
            return 12.34
    # when value in radius column is not null, return that same value
    else:
        return radius
After creating our function, we need to apply it like so:
train['radius'] = train[['radius', 'diagnosis']].apply(impute_radius, axis = 1)
In English, this means that for each row, we pass the radius and diagnosis values to our function, and whatever it returns becomes the new radius value.
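For reference, the same mean imputation can be written as one vectorized line with groupby and transform. A minimal sketch on a toy frame (column names assumed from the tutorial):

```python
import pandas as pd
import numpy as np

# toy stand-in for the train df, with one missing radius per diagnosis group
train = pd.DataFrame({
    'radius': [12.0, np.nan, 17.5, np.nan],
    'diagnosis': [0, 0, 1, 1],
})

# fill each null with the mean radius of its diagnosis group
train['radius'] = train['radius'].fillna(
    train.groupby('diagnosis')['radius'].transform('mean')
)
print(train)
```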
We can visualize whether our function worked by checking our heat map again:
# check the heat map again after applying the above function
heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar = True, cmap = "PuRd", vmin = 0, vmax = 1)
plt.show()
All rows that were missing data have now been imputed (aka substituted) with the average radius size, which was determined by whether the woman was diagnosed with breast cancer. No need to drop other columns or impute more missing values.
Let’s now look at a concise summary of our data:
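The summary the text refers to is most likely produced by pandas' info() method. A minimal sketch on a toy frame (the real output comes from the tutorial's CSV):

```python
import pandas as pd

# toy stand-in for the train df with string-valued id and name columns
train = pd.DataFrame({
    'id': ['p1', 'p2'],
    'name': ['a', 'b'],
    'radius': [12.0, 17.5],
})

# prints each column's dtype and non-null count
train.info()
```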
See how the id and name columns are of the object data type? They are non-numeric identifiers that carry no predictive information, so we drop them like so:
# dropping identifier columns
train.drop(['id', 'name'], axis = 1, inplace = True)
Checking out what our dataframe looks like:
Step 1: Split data into X and y
X = train.drop('diagnosis', axis = 1)
y = train['diagnosis']
Check out what X and y look like:
Step 2: Split data into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
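A quick way to sanity-check the 70/30 split is to look at the resulting shapes. A minimal sketch with toy arrays standing in for X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data standing in for X and y: 10 samples, 3 features
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# 70% of the samples land in train, 30% in test
print(X_train.shape, X_test.shape)
```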
Step 3: Train and predict
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)
A classification report checks our model’s precision, recall, and F1 score. The support is the number of samples of the true response that lies in that class.
- Precision and recall are not the same. Precision is the fraction of predicted positives that are truly positive. Recall is the fraction of actual positives that the model correctly identified.
- F1 score is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (best).
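These definitions can be checked by hand. The counts below (56 true positives, 10 false positives, 7 false negatives) are derived from the test-set breakdown discussed at the end of this tutorial:

```python
# worked example of precision, recall, and F1 from raw counts
tp, fp, fn = 56, 10, 7

precision = tp / (tp + fp)  # fraction of predicted positives that are truly positive
recall = tp / (tp + fn)     # fraction of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```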
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))
We had 171 women in our test set. Of the 105 women predicted to not have breast cancer, 7 actually did have it (false negatives, a Type II error). Of the 66 women predicted to have breast cancer, 10 actually did not (false positives, a Type I error). In a nutshell, our model was roughly 90% accurate (154 correct predictions out of 171).
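Overall accuracy can be read straight off the confusion matrix. Using sklearn's layout ([[TN, FP], [FN, TP]]) and the counts derived above (98 TN, 10 FP, 7 FN, 56 TP):

```python
import numpy as np

# confusion matrix built from the test-set counts described above
cm = np.array([[98, 10],
               [7, 56]])

# accuracy = correct predictions (the diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```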