Predicting Breast Cancer Using Logistic Regression The Startup – Medium

Learn how to perform Exploratory Data Analysis, apply mean imputation, build a classification algorithm, and interpret the results.

Source: DataCamp

Breast cancer is the second most common cancer and has the highest cancer death rate among women in the United States. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: it can be benign (no breast cancer) or malignant (breast cancer). Tests such as an MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

In this tutorial, we are going to create a model that predicts whether or not a patient has a positive breast cancer diagnosis based on the tumor characteristics.

This dataset contains the following features:

  • id (patientid)

Click here to get the dataset and see my full code on GitHub.

Import Libraries and Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
print(f'Libraries have been imported! :)')

Now that our libraries have been imported, let’s go ahead and import our data using pandas.

train = pd.read_csv('breastcancer.csv')

As a side note, F-strings are amazing! They allow you to print strings and expressions in a more concise manner. The \n part means to add a new line. I do this to create more white space.
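To make the side note concrete, here is a small sketch of an f-string that embeds expressions and uses \n for line breaks. The variable names and values are made up for illustration and do not come from the dataset.

```python
# Hypothetical values, used only to demonstrate f-string formatting.
n_rows, n_cols = 5, 3

# Expressions inside {} are evaluated; \n starts a new line in the output.
message = f'Rows: {n_rows}\nColumns: {n_cols}\nCells: {n_rows * n_cols}'
print(message)
```

The same pattern works with any expression, such as a DataFrame's shape.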

Exploratory Data Analysis (EDA) answers the “What are we dealing with?” question. EDA is where we try to understand our data first. We want to gain insights before messing around with it.

Visualizations are a great way to do this.

Visualization #1: Heat Map

# simple heat map showing where we are missing data
heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar = True, cmap = "PuRd", vmin = 0, vmax = 1)


  • train.isnull() is checking for nulls in the train df

This heat map is interpreted as the following:

  • 0 (white color) means we have a value

Looks like we only have nulls in the radius column! Not bad at all and easily fixable 🙂
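A heat map is a quick visual check, but the same question can also be answered with numbers. The sketch below counts missing values per column on a small invented frame that stands in for train; the column names other than radius and diagnosis are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tutorial's `train` DataFrame (values are invented).
toy = pd.DataFrame({
    'radius': [14.2, np.nan, 11.8, np.nan],
    'texture': [20.1, 17.3, 15.9, 22.4],   # hypothetical column
    'diagnosis': [1, 0, 0, 1],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Running the same two calls on train would show how many radius values are missing, not just where they are.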

Visualization #2: Count Plot

# a count plot shows the counts of observations in each categorical bin using bars
# think of it as a histogram across a categorical, instead of quantitative, variable
sns.countplot(data = train, x = 'diagnosis', palette = 'husl')


  • style is affecting the color of the axes, whether a grid is enabled by default, and other aesthetic elements


  • 0 indicates no breast cancer

Note that 0 doesn’t always indicate the absence of something, and 1 doesn’t always indicate its presence. Check how your dataset encodes its labels before interpreting them.
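One way to guard against misreading the encoding is to count the classes and map the codes to readable names. This is a sketch on invented labels; the mapping of 1 to breast cancer follows this tutorial's dataset and may differ elsewhere.

```python
import pandas as pd

# Invented labels standing in for train['diagnosis'].
diagnosis = pd.Series([0, 1, 0, 0, 1, 0], name='diagnosis')

# Count how many rows fall in each class.
counts = diagnosis.value_counts()

# Map the codes to readable names; this mapping is specific to this dataset.
labels = diagnosis.map({0: 'no breast cancer', 1: 'breast cancer'})
print(counts)
print(labels.value_counts())
```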

Visualization #3: Histogram

# let's check out the spread of ages using a histogram
train['age'].plot.hist(bins = 25, figsize = (10,6))


  • we are looking at the age column within the train df

The data is not skewed and doesn’t have a distinct shape, so it doesn’t tell us much. Let’s move on to cleaning our data.

The missing data in the radius column needs to be filled in. We are going to do this by imputing the mean radius, not just dropping all null values. To impute a value simply means we are going to replace missing values with our newly calculated value. For our method specifically, it is referred to as mean imputation.

Let’s visualize how tumor radius is distributed for each diagnosis via a box plot.

plt.figure(figsize = (10,7))
sns.boxplot(x = "diagnosis", y = "radius", data = train)

Women who were diagnosed with breast cancer (diagnosis = 1) tend to have a higher tumor radius size, which is the distance from the center to the circumference of the tumor.

# calculate the average radius size by diagnosis (0 or 1)
train.groupby('diagnosis')["radius"].mean()

This is interpreted as…

“Women who are not diagnosed with breast cancer have an average/mean tumor radius size of 12.34.”

“Women who are diagnosed with breast cancer have an average/mean tumor radius size of 17.89.”

Now that we have found our average tumor radius by diagnosis, let’s impute them into our missing (aka our null) values.

# create a function that imputes average radius into missing values
def impute_radius(cols):
    # each row arrives as a Series, so look the values up by column name
    radius = cols['radius']
    diagnosis = cols['diagnosis']

    # if value in radius column is null
    if pd.isnull(radius):
        # if woman is diagnosed with breast cancer
        if diagnosis == 1:
            return 17
        # if woman was not diagnosed with breast cancer
        return 12
    # when value in radius column is not null
    # return that same value
    return radius

After creating our function, we need to apply it like so:

train['radius'] = train[['radius', 'diagnosis']].apply(impute_radius, axis = 1)

In English, this means we are applying our function to both the radius column and diagnosis column.
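As an alternative to the row-by-row function, the same idea can be sketched in a vectorized form. This is not the approach used above: it fills each gap with the exact mean of its diagnosis group instead of the hardcoded 17 and 12, and it runs on invented data standing in for train.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `train`; the numbers are invented.
toy = pd.DataFrame({
    'radius': [12.0, np.nan, 13.0, 18.0, np.nan, 17.0],
    'diagnosis': [0, 0, 0, 1, 1, 1],
})

# Mean radius of each row's diagnosis group, aligned to the original index.
# Missing values are skipped when the mean is computed.
group_mean = toy.groupby('diagnosis')['radius'].transform('mean')

# Replace only the missing radii with their group mean.
toy['radius'] = toy['radius'].fillna(group_mean)
print(toy)
```

Because the means are computed from the data, nothing needs updating if the dataset changes.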

We can visualize whether our function worked by checking our heat map again:

# check the heat map again after applying the above function
heat_map = sns.heatmap(train.isnull(), yticklabels = False, cbar = True, cmap = "PuRd", vmin = 0, vmax = 1)

All rows that were missing data have now been imputed (aka substituted) with the average radius size, which was determined by whether the woman was diagnosed with breast cancer. No need to drop other columns or impute more missing values.

Let’s now look at a concise summary of our data:
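A summary of this kind is presumably what pandas' DataFrame.info() prints: column names, non-null counts, and data types. The sketch below runs it on a small invented frame (not the real data) and also lists which columns are numeric.

```python
import pandas as pd

# Toy frame standing in for `train`; column values are invented.
toy = pd.DataFrame({
    'id': ['a1', 'b2'],
    'name': ['Ann', 'Bea'],
    'radius': [12.5, 17.5],
    'diagnosis': [0, 1],
})

# info() prints column names, non-null counts, and dtypes.
toy.info()

# Columns with a numeric dtype; the text columns are left out.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```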

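See how the id and name columns are of object data type? That means they hold text rather than numbers. They identify a patient but say nothing about the tumor, and the model expects numeric inputs, so we need to drop those like so: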

# dropping categorical variables
train.drop(['id', 'name'], axis = 1, inplace = True)

Checking out what our dataframe looks like:


Step 1: Split data into X and y

X = train.drop('diagnosis', axis = 1)
y = train['diagnosis']

Check out what X and y look like:


Step 2: Split data into train set and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
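To see what test_size and random_state do, here is a small sketch on invented data (100 rows, 2 features). The names X_toy and y_toy are placeholders, not part of the tutorial's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented feature matrix and labels: 100 rows, 2 features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0, 1] * 50)

# Same arguments as the tutorial: hold out about 30% and fix the seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101
)
print(X_tr.shape, X_te.shape)

# Re-running with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101
)
print(np.array_equal(X_te, X_te2))
```

Fixing random_state makes the results reproducible from run to run.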

Step 3: Train and predict

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)
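The fit-then-predict pattern can be tried in isolation on a tiny invented dataset. This sketch uses a single made-up feature, not the tutorial's columns, and also shows predict_proba, which returns the class probabilities behind each predicted label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented one-feature data: larger values belong to class 1.
X_toy = np.array([[-4.0], [-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_toy, y_toy)

# predict() returns class labels; predict_proba() returns one probability per class.
new_points = np.array([[-5.0], [5.0]])
labels = model.predict(new_points)
probabilities = model.predict_proba(new_points)
print(labels)
print(probabilities)
```

Each row of probabilities sums to 1, and predict() picks the class with the larger value.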

A classification report checks our model’s precision, recall, and F1 score. The support is the number of samples of the true response that lies in that class.

  • Precision and recall are not the same. Precision is the fraction of predicted positives that are actually positive. Recall is the fraction of actual positives that the model correctly identified.

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))

We had 171 women in our test set. Out of the 105 women predicted to not have breast cancer, 7 were classified as not having breast cancer when they actually did (false negatives, a Type II error). Out of the 66 women predicted to have breast cancer, 10 were classified as having breast cancer when they did not (false positives, a Type I error). In a nutshell, our model was about 90% accurate.
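The figures quoted above are enough to work out accuracy, precision, and recall by hand. The sketch below does only that arithmetic, assuming breast cancer (1) is the positive class; the variable names are illustrative.

```python
# Counts taken from the figures quoted in the text.
total = 171
predicted_negative = 105
predicted_positive = 66
false_negatives = 7    # predicted no cancer, actually had cancer
false_positives = 10   # predicted cancer, actually did not

true_negatives = predicted_negative - false_negatives   # 98
true_positives = predicted_positive - false_positives   # 56

accuracy = (true_positives + true_negatives) / total
precision = true_positives / predicted_positive
recall = true_positives / (true_positives + false_negatives)

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
```

The 154 correct predictions out of 171 are where the roughly 90% accuracy comes from.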








Thanks for reading! Please feel free to follow me on Medium and LinkedIn. I’d love to continue the conversation and hear your thoughts/suggestions.


