Case Studies#
In this section, we explore real-world applications of logistic regression in healthcare, finance, and marketing. These case studies demonstrate its practical utility and versatility.
Real-World Examples#
Healthcare#
Logistic regression is widely used in healthcare for predicting patient outcomes, understanding disease progression, and optimizing treatment plans.
Predicting Disease Presence: Logistic regression can be used to predict the presence or absence of a disease based on clinical parameters and patient history. For example, it can predict the likelihood of a patient having diabetes based on features such as age, BMI, blood pressure, and glucose levels.
Understanding Disease Progression: Logistic regression can help quantify how different factors contribute to the progression of diseases such as diabetes, heart disease, and cancer. For example, it can model the relationship between lifestyle factors and the progression of diabetes; a coefficient-interpretation sketch follows the example below.
# Predicting Disease Presence
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Load the dataset (e.g., Pima Indians Diabetes dataset)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = [
"Pregnancies",
"Glucose",
"BloodPressure",
"SkinThickness",
"Insulin",
"BMI",
"DiabetesPedigreeFunction",
"Age",
"Outcome",
]
data = pd.read_csv(url, names=column_names)
# Split the dataset into features and target variable
X = data.drop("Outcome", axis=1)
y = data["Outcome"]
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.75
Precision: 0.64
Recall: 0.67
F1 Score: 0.65
ROC-AUC: 0.81
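The disease-progression use case above is about interpreting the fitted model, not just scoring it. A common approach is to exponentiate the coefficients to obtain odds ratios; the sketch below reuses model and X from the example above. Because the features are unscaled, each ratio is a per-unit effect and the values are not directly comparable across features.
# Interpreting coefficients as odds ratios (illustrative sketch; reuses model and X from above)
import numpy as np
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns).sort_values(ascending=False)
print(odds_ratios)
# An odds ratio above 1 means each unit increase in that feature raises the
# odds of diabetes, holding the other features constant.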
Finance#
In finance, logistic regression underpins risk management tasks such as credit scoring and fraud detection, where the goal is to estimate the probability of a binary outcome such as default or fraud.
Credit Scoring: Logistic regression can be used to develop credit scoring models that assess the creditworthiness of individuals based on factors such as income, debt, and credit history. These models help financial institutions make informed lending decisions.
Fraud Detection: Logistic regression can detect fraudulent transactions by modeling the relationship between transaction features and the likelihood of fraud, helping financial institutions identify and prevent fraudulent activity; a sketch for imbalanced fraud data follows the credit-scoring example below.
# Credit Scoring
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Generate a synthetic dataset standing in for applicant features (income, debt, credit history, etc.)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.83
Precision: 0.87
Recall: 0.82
F1 Score: 0.84
ROC-AUC: 0.91
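Fraud detection differs from credit scoring mainly in class balance: fraudulent transactions are rare. A minimal sketch under that assumption, using a synthetic imbalanced dataset (roughly 2% positives, an assumed rate) and class_weight="balanced" so the model is not rewarded for always predicting "not fraud":
# Fraud Detection (illustrative sketch on synthetic, imbalanced data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# ~2% positive class mimics the rarity of fraudulent transactions
X_fraud, y_fraud = make_classification(n_samples=5000, n_features=15, weights=[0.98, 0.02], random_state=42)
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_fraud, y_fraud, test_size=0.2, stratify=y_fraud, random_state=42)
# class_weight="balanced" reweights samples inversely to class frequency
fraud_model = LogisticRegression(max_iter=1000, class_weight="balanced")
fraud_model.fit(X_train_f, y_train_f)
print(classification_report(y_test_f, fraud_model.predict(X_test_f)))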
Marketing#
In marketing, logistic regression is used to analyze consumer behavior, optimize marketing campaigns, and forecast sales.
Customer Churn Prediction: Logistic regression can predict whether a customer will churn (i.e., stop using a service) based on factors such as usage patterns, customer service interactions, and demographic information. This helps businesses identify at-risk customers and take proactive measures to retain them.
Email Spam Detection: Logistic regression can classify emails as spam or not spam based on email content features. This helps email service providers filter out spam messages and improve user experience.
# Customer Churn Prediction
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Load the dataset (e.g., Telco Customer Churn dataset)
url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
data = pd.read_csv(url)
# Preprocess the data
data = data.drop("customerID", axis=1)  # unique IDs carry no signal and would one-hot into thousands of columns
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors="coerce")  # blank strings become NaN
data = data.dropna(subset=["TotalCharges"])
data["Churn"] = data["Churn"].map({"Yes": 1, "No": 0})
data = pd.get_dummies(data, drop_first=True)
# Split the dataset into features and target variable
X = data.drop("Churn", axis=1)
y = data["Churn"]
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
# Email Spam Detection
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Load the dataset (e.g., SMS Spam Collection Dataset)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep="\t", header=None, names=["label", "message"])
# Convert labels to binary
data["label"] = data["label"].map({"ham": 0, "spam": 1})
# Split the dataset into features and target variable
X = data["message"]
y = data["label"]
# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X_tfidf = vectorizer.fit_transform(X)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
Accuracy: 0.98
Precision: 1.00
Recall: 0.87
F1 Score: 0.93
ROC-AUC: 0.99
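Because logistic regression is linear in the TF-IDF features, the largest positive coefficients identify the terms most indicative of spam. A quick illustrative sketch, reusing model and vectorizer from above (get_feature_names_out requires scikit-learn 1.0+):
# Inspect the most spam-indicative terms (illustrative; reuses model and vectorizer)
import numpy as np
feature_names = vectorizer.get_feature_names_out()
top = np.argsort(model.coef_[0])[-10:][::-1]  # indices of the ten largest positive coefficients
for idx in top:
    print(f"{feature_names[idx]}: {model.coef_[0][idx]:.2f}")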
Summary#
These case studies highlight the versatility and practical utility of logistic regression across fields. In healthcare, it can predict disease presence and shed light on disease progression. In finance, it can assess credit risk and detect fraud. In marketing, it can predict customer churn and filter spam. By applying logistic regression to real-world problems, organizations can make data-driven decisions and optimize their operations.