Case Studies

In this section, we explore applications of logistic regression in healthcare, finance, and marketing. These case studies demonstrate the model's practical utility and versatility on real-world problems.

Real-World Examples

Healthcare

Logistic regression is widely used in healthcare for predicting patient outcomes, understanding disease progression, and optimizing treatment plans.

  1. Predicting Disease Presence: Logistic regression can be used to predict the presence or absence of a disease based on clinical parameters and patient history. For example, it can predict the likelihood of a patient having diabetes based on features such as age, BMI, blood pressure, and glucose levels.

  2. Understanding Disease Progression: Logistic regression can help explain how different factors contribute to the progression of diseases such as diabetes, heart disease, and cancer. For example, it can model the relationship between lifestyle factors and the progression of diabetes; a sketch of this kind of coefficient interpretation follows the prediction example below.

# Predicting Disease Presence

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (e.g., Pima Indians Diabetes dataset)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = [
    "Pregnancies",
    "Glucose",
    "BloodPressure",
    "SkinThickness",
    "Insulin",
    "BMI",
    "DiabetesPedigreeFunction",
    "Age",
    "Outcome",
]
data = pd.read_csv(url, names=column_names)

# Split the dataset into features and target variable
X = data.drop("Outcome", axis=1)
y = data["Outcome"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.75
Precision: 0.64
Recall: 0.67
F1 Score: 0.65
ROC-AUC: 0.81
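
The fitted model also speaks to the second use case above: understanding how each factor contributes to the outcome. Exponentiating a coefficient gives an odds ratio, the multiplicative change in the odds of diabetes for a one-unit increase in that feature with the others held fixed. A minimal sketch, reusing model and X from the example above (the features are unscaled, so each ratio is per raw unit of the feature):

# Interpreting the fitted model as odds ratios
import numpy as np

# model.coef_ has shape (1, n_features) for a binary problem;
# exp(coefficient) is the odds ratio per one-unit feature increase
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False))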

Finance

In finance, logistic regression is used for risk management tasks such as credit scoring and fraud detection.

  1. Credit Scoring: Logistic regression can be used to develop credit scoring models that assess the creditworthiness of individuals based on factors such as income, debt, and credit history. These models help financial institutions make informed lending decisions.

  2. Fraud Detection: Logistic regression can detect fraudulent transactions by modeling the relationship between transaction features and the likelihood of fraud. This helps financial institutions identify and prevent fraudulent activity (see the imbalanced-data sketch after the credit scoring example below).

# Credit Scoring

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Generate a synthetic dataset for credit scoring
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.83
Precision: 0.87
Recall: 0.82
F1 Score: 0.84
ROC-AUC: 0.91
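
Fraud data is rarely as balanced as the synthetic credit scoring set above; fraudulent transactions are typically a small fraction of all transactions. A minimal sketch of the fraud detection use case, again on synthetic data, where the weights argument of make_classification simulates a rare positive class and class_weight="balanced" reweights samples so the optimizer does not ignore it:

# Fraud Detection (sketch on synthetic, imbalanced data)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Simulate ~2% fraud among 10,000 transactions
X_fraud, y_fraud = make_classification(
    n_samples=10000, n_features=10, weights=[0.98, 0.02], random_state=42
)

# Stratify so the rare class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_fraud, y_fraud, test_size=0.2, random_state=42, stratify=y_fraud
)

# class_weight="balanced" weights each class inversely to its frequency,
# so the rare fraud class contributes meaningfully to the loss
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))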

Marketing

In marketing, logistic regression is used to analyze consumer behavior and optimize campaigns, for example by predicting customer churn or filtering spam.

  1. Customer Churn Prediction: Logistic regression can predict whether a customer will churn (i.e., stop using a service) based on factors such as usage patterns, customer service interactions, and demographic information. This helps businesses identify at-risk customers and take proactive measures to retain them.

  2. Email Spam Detection: Logistic regression can classify emails as spam or not spam based on email content features. This helps email service providers filter out spam messages and improve user experience.

# Customer Churn Prediction

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (e.g., Telco Customer Churn dataset)
url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
data = pd.read_csv(url)

# Preprocess the data: drop the customer identifier, coerce TotalCharges
# (stored as text, with a few blank entries) to numeric, then one-hot
# encode the remaining categorical columns
data = data.drop("customerID", axis=1)
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors="coerce")
data = data.dropna()
data["Churn"] = data["Churn"].map({"Yes": 1, "No": 0})
data = pd.get_dummies(data, drop_first=True)

# Split the dataset into features and target variable
X = data.drop("Churn", axis=1)
y = data["Churn"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
# Email Spam Detection

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Load the dataset (e.g., SMS Spam Collection Dataset)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep="\t", header=None, names=["label", "message"])

# Convert labels to binary
data["label"] = data["label"].map({"ham": 0, "spam": 1})

# Split the dataset into features and target variable
X = data["message"]
y = data["label"]

# Split the raw text first so the vectorizer is fit only on training data
# (fitting TF-IDF on the full corpus would leak test-set vocabulary)
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.98
Precision: 1.00
Recall: 0.87
F1 Score: 0.93
ROC-AUC: 0.99
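
Because each TF-IDF feature corresponds to a single token, the model is also easy to inspect. A minimal sketch, reusing model and vectorizer from above, that lists the ten tokens with the largest positive coefficients, i.e., the strongest spam signals under this model:

# Inspect the strongest spam indicators
import numpy as np

feature_names = vectorizer.get_feature_names_out()
top_spam = np.argsort(model.coef_[0])[-10:][::-1]  # indices of the 10 largest coefficients
for idx in top_spam:
    print(f"{feature_names[idx]}: {model.coef_[0][idx]:.2f}")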

Summary

These case studies highlight the versatility and practical utility of logistic regression across fields. In healthcare, it can predict disease presence and model disease progression. In finance, it can assess credit risk and detect fraud. In marketing, it can predict customer churn and classify emails as spam. By applying logistic regression to real-world problems, businesses and organizations can make data-driven decisions and optimize their operations.