# Binary X

## Binary X → Continuous Y

  • Binary X: HouseAge > 30

  • Continuous Y: Median Income (MedInc)

```python
from sklearn.datasets import fetch_california_housing
from scipy.stats import pointbiserialr, ttest_ind, mannwhitneyu
from sklearn.feature_selection import mutual_info_regression
import pandas as pd

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

df["X_bin"] = (df["HouseAge"] > 30).astype(int)
y = df["MedInc"]
X = df["X_bin"]

# Point-Biserial correlation
pb, _ = pointbiserialr(X, y)

# t-test (pooled variances by default)
t_stat, _ = ttest_ind(y[X == 0], y[X == 1])

# Mann-Whitney U (nonparametric)
u_stat, _ = mannwhitneyu(y[X == 0], y[X == 1])

# Mutual Information (kNN-based estimate; varies slightly
# between runs unless random_state is set)
mi = mutual_info_regression(X.values.reshape(-1, 1), y)[0]

print("Point-Biserial:", pb)
print("t-test:", t_stat)
print("Mann–Whitney:", u_stat)
print("Mutual Information:", mi)
```

Output:

```
Point-Biserial: -0.08891425114393547
t-test: 12.824153614760043
Mann–Whitney: 59254034.5
Mutual Information: 0.009422289272053463
```

| Metric | Value | What It Measures | Strength / Meaning | Interpretation |
|---|---|---|---|---|
| Point-Biserial | −0.0889 | Linear association between a continuous variable and a binary variable | Very weak negative | The two groups have almost the same mean; MedInc is only very slightly lower in the X = 1 group. |
| t-test (t-statistic) | 12.824 | Difference in group means assuming normality | Statistically significant (given the large sample) | The means differ statistically, but the effect size is small (r ≈ −0.09); the large t comes from the large sample size, not from a strong effect. |
| Mann–Whitney U | 59,254,034.5 | Difference in distributions (non-parametric) | Significance depends on group sizes | Indicates a distributional difference but does not quantify its strength; with large samples even tiny effects produce large U values. |
| Mutual Information | 0.00942 | General dependency (linear + nonlinear) | Extremely weak | The variables share almost no information; the relationship is effectively negligible. |
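
The table notes that a large t does not imply a large effect. A standardized effect size such as Cohen's d makes this explicit. The sketch below (synthetic data, not the housing set; `cohens_d` is an illustrative helper, not a library function) shows two large groups whose means barely differ, so d stays small even though a t-test would be highly "significant":

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized mean difference between two groups."""
    na, nb = len(a), len(b)
    # pooled variance with unbiased per-group estimates (ddof=1)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
# two large groups whose means differ only slightly
g0 = rng.normal(loc=3.9, scale=1.9, size=10_000)
g1 = rng.normal(loc=3.7, scale=1.9, size=10_000)

d = cohens_d(g0, g1)
print(round(d, 3))  # a "small" effect by the usual |d| < 0.2 rule of thumb
```

Reporting d (or the point-biserial r itself) alongside the t-statistic avoids over-reading significance driven by sample size.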

## Binary X → Binary Y

  • Binary X: mean radius > median

  • Binary Y: cancer class (malignant / benign)

```python
from sklearn.datasets import load_breast_cancer
from scipy.stats import chi2_contingency
from math import sqrt
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target    # binary: 0 = malignant, 1 = benign

df["X_bin"] = (df["mean radius"] > df["mean radius"].median()).astype(int)

X = df["X_bin"]
y = df["target"]

# Chi-square test of independence (Yates' continuity correction
# is applied by default for 2x2 tables)
table = pd.crosstab(X, y)
chi2_value, p, _, _ = chi2_contingency(table)

# Phi coefficient: sqrt(chi2/n) gives the magnitude only;
# the sign must be read from the contingency table
phi = sqrt(chi2_value / len(df))

# Mutual Information
mi = mutual_info_classif(X.values.reshape(-1, 1), y)[0]

print("Phi Coefficient:", phi)
print("Chi-Square:", chi2_value)
print("Mutual Information:", mi)
```

Output:

```
Phi Coefficient: 0.644740755602814
Chi-Square: 236.52797526117857
Mutual Information: 0.22088305825323062
```

| Metric | Value | What It Measures | Strength | Interpretation |
|---|---|---|---|---|
| Phi Coefficient | 0.6447 | Association between two binary variables | Strong | Strong relationship; because sqrt(chi2/n) drops the sign, the direction comes from the table itself: a large mean radius pairs with the malignant class (target = 0). |
| Chi-Square | 236.528 | Test of independence for categorical variables | Very large → statistically significant | Observed frequencies differ sharply from expected frequencies; strong evidence of dependence. |
| Mutual Information | 0.2209 | Information shared between the two variables (linear + nonlinear) | Moderate | The variables share a meaningful amount of information; knowing one reduces uncertainty about the other. |
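
Since sqrt(chi2/n) loses the sign, it is worth knowing that for two 0/1 variables the signed phi coefficient is exactly their Pearson correlation. A minimal sketch on synthetic binary data (not the breast-cancer set):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.integers(0, 2, size=1000)
# y agrees with x 90% of the time -> strong positive association
flip = rng.random(1000) < 0.10
y = np.where(flip, 1 - x, x)

# for two 0/1 variables, Pearson's r IS the signed phi coefficient
phi_signed = np.corrcoef(x, y)[0, 1]
print(round(phi_signed, 3))  # close to 1 - 2 * 0.10 = 0.8
```

This gives both magnitude and direction in one number, without building a contingency table.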

## Binary X → Ordinal Y

  • Binary X: Median Income (MedInc) > 3, i.e. above $30,000 (MedInc is in units of $10,000)

  • Ordinal Y: target (price) converted to 3-level ordinal bins: Low, Medium, High

```python
from sklearn.datasets import fetch_california_housing
from scipy.stats import spearmanr, kendalltau
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

df["X_bin"] = (df["MedInc"] > 3).astype(int)

# Ordinal Y: equal-frequency bins (1 = Low, 2 = Medium, 3 = High)
df["Y_ord"] = pd.qcut(df["target"], q=3, labels=[1, 2, 3]).astype(int)

X = df["X_bin"]
y = df["Y_ord"]

# Spearman rank correlation
spearman, _ = spearmanr(X, y)

# Kendall's tau
kendall, _ = kendalltau(X, y)

# Mutual Information
mi = mutual_info_classif(X.values.reshape(-1, 1), y)[0]

print("Spearman:", spearman)
print("Kendall:", kendall)
print("Mutual Information:", mi)
```

Output:

```
Spearman: 0.49811001629852997
Kendall: 0.46962263368309787
Mutual Information: 0.13921068369562795
```

| Metric | Value | What It Measures | Strength | Interpretation |
|---|---|---|---|---|
| Spearman | 0.4981 | Monotonic (rank-based) correlation | Moderate–Strong | As X increases, Y tends to increase in a consistent ranked pattern; a clear upward trend. |
| Kendall | 0.4696 | Pairwise concordance (rank agreement) | Moderate–Strong | Most observation pairs move in the same direction; Kendall's tau typically runs somewhat lower than Spearman's rho on the same data. |
| Mutual Information | 0.1392 | Overall dependency (linear + nonlinear) | Moderate | X carries meaningful predictive information about Y; noticeable shared dependency. |
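
One practical note: `mutual_info_classif` uses a kNN-based estimator by default, so the MI value above drifts slightly between runs. When both variables are discrete (a binary X and an ordinal Y both are), passing `discrete_features=True` selects the exact contingency-based estimator, which is deterministic. A sketch on synthetic ordinal data (not the housing set):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2000)          # binary X
noise = rng.integers(-1, 2, size=2000)     # -1, 0, or 1
y = np.clip(1 + x + noise, 1, 3)           # ordinal Y in {1, 2, 3}, shifted upward by X

# discrete_features=True -> exact plug-in MI, no kNN randomness
mi = mutual_info_classif(x.reshape(-1, 1), y, discrete_features=True)[0]
print(round(mi, 4))
```

With the default `discrete_features='auto'`, a dense input like this is treated as continuous, which is why repeated runs of the housing example give slightly different MI values.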

## Binary X → Categorical Nominal Y

  • Binary X: petal length > 2.5

  • Nominal Y: species (3 classes)

```python
from sklearn.datasets import load_iris
from scipy.stats import chi2_contingency
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from math import sqrt

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["species"] = data.target

df["X_bin"] = (df["petal length (cm)"] > 2.5).astype(int)

X = df["X_bin"]
y = df["species"]

# Chi-square test of independence
table = pd.crosstab(X, y)
chi2_value, p, _, _ = chi2_contingency(table)

# Cramér's V (chi-square rescaled to [0, 1])
n = table.sum().sum()
k = min(table.shape)
cramers_v = sqrt(chi2_value / (n * (k - 1)))

# Mutual Information
mi = mutual_info_classif(X.values.reshape(-1, 1), y)[0]

print("Cramér’s V:", cramers_v)
print("Chi-Square:", chi2_value)
print("Mutual Information:", mi)
```

Output:

```
Cramér’s V: 1.0
Chi-Square: 150.0
Mutual Information: 0.7281284489676516
```

| Metric | Value | What It Measures | Strength | Interpretation |
|---|---|---|---|---|
| Cramér’s V | 1.0 | Strength of association between two categorical variables | Perfect association | The binary split aligns perfectly with one species boundary: X = 0 exactly for setosa. Species fully determines X, although X alone cannot distinguish versicolor from virginica. |
| Chi-Square | 150.0 | Test of independence (categorical–categorical) | Very large → statistically significant | Observed frequencies differ sharply from expected; confirms strong dependence. |
| Mutual Information | 0.7281 | Amount of shared information (0 = none; bounded above by the smaller variable's entropy, not by 1) | High | The variables share substantial information; knowing one reduces uncertainty about the other. |
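
Raw mutual information is measured in nats and bounded by the variables' entropies rather than by 1, which makes a value like 0.7281 hard to judge on its own. scikit-learn's `normalized_mutual_info_score` rescales MI into [0, 1]. The sketch below mirrors the structure of this example with synthetic labels: a binary X that is 0 exactly for one of three equally sized classes:

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

# binary X determined by one class of a 3-class Y,
# mirroring the petal-length split above
y = np.array([0] * 50 + [1] * 50 + [2] * 50)
x = (y > 0).astype(int)          # 0 exactly for class 0, else 1

mi = mutual_info_score(x, y)              # raw MI, in nats
nmi = normalized_mutual_info_score(x, y)  # rescaled to [0, 1]
print(round(mi, 4), round(nmi, 4))
```

Here the raw MI equals the entropy of X (X is a deterministic function of Y), while the normalized score stays below 1 because X cannot recover all three classes.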

## Binary X → Discrete Numeric Y

  • Binary X: AveBedrms (average bedrooms per household) > median

  • Discrete numeric Y: Population (integer)

```python
from sklearn.datasets import fetch_california_housing
from scipy.stats import spearmanr, pointbiserialr, ttest_ind
from sklearn.feature_selection import mutual_info_regression
import pandas as pd

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

df["X_bin"] = (df["AveBedrms"] > df["AveBedrms"].median()).astype(int)
y = df["Population"]
X = df["X_bin"]

# Spearman rank correlation
spearman, _ = spearmanr(X, y)

# Point-Biserial correlation
pb, _ = pointbiserialr(X, y)

# t-test (two groups defined by binary X)
t_stat, _ = ttest_ind(y[X == 0], y[X == 1])

# Mutual Information
mi = mutual_info_regression(X.values.reshape(-1, 1), y)[0]

print("Spearman:", spearman)
print("Point-Biserial:", pb)
print("t-test:", t_stat)
print("Mutual Information:", mi)
```

Output:

```
Spearman: 0.02537753340197663
Point-Biserial: 0.030026196631505037
t-test: -4.315488768612021
Mutual Information: 0.0012569075816197817
```

| Metric | Value | What It Measures | Strength | Interpretation |
|---|---|---|---|---|
| Spearman | 0.0254 | Monotonic (rank-based) association | None / Extremely weak | No meaningful monotonic trend; the ranks of X and Y are essentially unrelated. |
| Point-Biserial | 0.0300 | Linear association between binary X and numeric Y | None / Extremely weak | The two groups defined by X have nearly identical mean Population values. |
| t-test (t-statistic) | −4.3155 | Difference in means between two groups | Statistically significant (due to large n) | The means differ slightly, but the significance comes from the large sample size, not from a strong effect. |
| Mutual Information | 0.00126 | Any dependency (linear + nonlinear) | Essentially zero | X and Y share almost no information; practically independent. |
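
The t-statistic and the point-biserial correlation seen throughout this page are linked by an exact identity: for a pooled two-sample t-test, r = t / sqrt(t² + df) with df = n − 2, which is why a |t| of 4.3 on 20,000+ samples corresponds to an r of only 0.03. A sketch verifying the identity on synthetic data:

```python
import numpy as np
from scipy.stats import pointbiserialr, ttest_ind

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=5000)
y = rng.normal(size=5000) + 0.05 * x   # tiny true shift for the x = 1 group

# pooled two-sample t-test, ordered so a higher x = 1 mean gives positive t
t_stat, _ = ttest_ind(y[x == 1], y[x == 0])

# point-biserial correlation between the binary x and y
r, _ = pointbiserialr(x, y)

# exact identity for the pooled t-test: r = t / sqrt(t^2 + df)
df = len(y) - 2
r_from_t = t_stat / np.sqrt(t_stat**2 + df)
print(round(r, 4), round(r_from_t, 4))
```

The identity makes the recurring lesson concrete: t grows with sqrt(n) at a fixed effect size, while r does not, so only r (or a similar effect size) tells you whether the relationship is practically meaningful.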