Case Studies

In this section, we explore applications of logistic regression in healthcare, finance, and marketing. These case studies demonstrate the model's practical utility and versatility on real-world problems.

Real-World Examples

Healthcare

Logistic regression is widely used in healthcare for predicting patient outcomes, understanding disease progression, and optimizing treatment plans.

  1. Predicting Disease Presence: Logistic regression can be used to predict the presence or absence of a disease based on clinical parameters and patient history. For example, it can predict the likelihood of a patient having diabetes based on features such as age, BMI, blood pressure, and glucose levels.

  2. Understanding Disease Progression: Logistic regression can help explain how different factors contribute to the progression of diseases such as diabetes, heart disease, and cancer. For example, it can model the relationship between lifestyle factors and the progression of diabetes; a sketch of this kind of coefficient interpretation follows the prediction example below.

# Predicting Disease Presence

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (e.g., Pima Indians Diabetes dataset)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
    "Outcome",
]
data = pd.read_csv(url, names=column_names)

# Split the dataset into features and target variable
X = data.drop("Outcome", axis=1)
y = data["Outcome"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.75
Precision: 0.64
Recall: 0.67
F1 Score: 0.65
ROC-AUC: 0.81
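
The fitted model also speaks to the second use case above: understanding how each factor contributes to the outcome. Exponentiating a coefficient gives an odds ratio, the multiplicative change in the odds of diabetes for a one-unit increase in that feature with the others held fixed. A minimal sketch, reusing model and X from the example above (the features are unscaled, so each ratio is per raw unit of the feature):

# Interpreting the fitted model as odds ratios
import numpy as np

# model.coef_ has shape (1, n_features) for a binary problem;
# exp(coefficient) is the odds ratio per one-unit feature increase
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False))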

Finance

In finance, logistic regression is used for risk management tasks such as credit scoring and fraud detection.

  1. Credit Scoring: Logistic regression can be used to develop credit scoring models that assess the creditworthiness of individuals based on factors such as income, debt, and credit history. These models help financial institutions make informed lending decisions.

  2. Fraud Detection: Logistic regression can detect fraudulent transactions by modeling the relationship between transaction features and the likelihood of fraud. This helps financial institutions identify and prevent fraudulent activity (see the imbalanced-data sketch after the credit scoring example below).

# Credit Scoring

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Generate a synthetic dataset for credit scoring
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.83
Precision: 0.87
Recall: 0.82
F1 Score: 0.84
ROC-AUC: 0.91
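
Fraud data is rarely as balanced as the synthetic credit scoring set above; fraudulent transactions are typically a small fraction of all transactions. A minimal sketch of the fraud detection use case, again on synthetic data, where the weights argument of make_classification simulates a rare positive class and class_weight="balanced" reweights samples so the optimizer does not ignore it:

# Fraud Detection (sketch on synthetic, imbalanced data)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Simulate ~2% fraud among 10,000 transactions
X_fraud, y_fraud = make_classification(
    n_samples=10000, n_features=10, weights=[0.98, 0.02], random_state=42
)

# Stratify so the rare class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_fraud, y_fraud, test_size=0.2, random_state=42, stratify=y_fraud
)

# class_weight="balanced" weights each class inversely to its frequency,
# so the rare fraud class contributes meaningfully to the loss
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))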

Marketing

In marketing, logistic regression is used to analyze consumer behavior and optimize campaigns, for example by predicting customer churn or filtering spam.

  1. Customer Churn Prediction: Logistic regression can predict whether a customer will churn (i.e., stop using a service) based on factors such as usage patterns, customer service interactions, and demographic information. This helps businesses identify at-risk customers and take proactive measures to retain them.

  2. Email Spam Detection: Logistic regression can classify emails as spam or not spam based on email content features. This helps email service providers filter out spam messages and improve user experience.

# Customer Churn Prediction

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (e.g., Telco Customer Churn dataset)
url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
data = pd.read_csv(url)

# Preprocess the data: drop the customer identifier, coerce TotalCharges
# (stored as text, with a few blank entries) to numeric, then one-hot
# encode the remaining categorical columns
data = data.drop("customerID", axis=1)
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors="coerce")
data = data.dropna()
data["Churn"] = data["Churn"].map({"Yes": 1, "No": 0})
data = pd.get_dummies(data, drop_first=True)

# Split the dataset into features and target variable
X = data.drop("Churn", axis=1)
y = data["Churn"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
# Email Spam Detection

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (e.g., SMS Spam Collection Dataset)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep="\t", header=None, names=["label", "message"])

# Convert labels to binary
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# Split the dataset into features and target variable
X = data["message"]
y = data["label"]

# Split the raw text first so the vectorizer is fit only on training data
# (fitting TF-IDF on the full corpus would leak test-set vocabulary)
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.98
Precision: 1.00
Recall: 0.87
F1 Score: 0.93
ROC-AUC: 0.99
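
Because each TF-IDF feature corresponds to a single token, the model is also easy to inspect. A minimal sketch, reusing model and vectorizer from above, that lists the ten tokens with the largest positive coefficients, i.e., the strongest spam signals under this model:

# Inspect the strongest spam indicators
import numpy as np

feature_names = vectorizer.get_feature_names_out()
top_spam = np.argsort(model.coef_[0])[-10:][::-1]  # indices of the 10 largest coefficients
for idx in top_spam:
    print(f"{feature_names[idx]}: {model.coef_[0][idx]:.2f}")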

Summary

These case studies highlight the versatility and practical utility of logistic regression across fields. In healthcare, it can predict disease presence and model disease progression. In finance, it can assess credit risk and detect fraud. In marketing, it can predict customer churn and classify emails as spam. By applying logistic regression to real-world problems, businesses and organizations can make data-driven decisions and optimize their operations.