Exploratory Data Analysis (EDA) for Liquefaction Assessment¶
Understanding geotechnical data for machine learning
Introduction¶
Exploratory Data Analysis (EDA) is a critical first step in any machine learning project. For geotechnical engineering applications, EDA helps us:
- Understand the physical characteristics of our data
- Identify data quality issues (missing values, outliers, errors)
- Discover relationships between geotechnical parameters
- Make informed decisions about preprocessing and feature engineering
- Validate that data aligns with our engineering understanding
Liquefaction and Lateral Spreading¶
Soil liquefaction occurs in saturated loose sandy soils subjected to rapid loading conditions, such as earthquakes. The generation of excess pore water pressure can lead to a sudden reduction in soil strength and stiffness. In gently sloping ground, this may generate lateral displacements known as lateral spreading.
Dataset¶
We'll analyze the lateral spreading dataset from:
Durante, M. G., & Rathje, E. M. (2021). An exploration of the use of machine learning to predict lateral spreading. Earthquake Spectra, 37(4), 2288-2314.
Classification criteria: Sites with >0.3 m lateral displacement are classified as experiencing lateral spreading (Target=1).
Key features:
- GWD (m): Ground Water Depth - distance from surface to water table
- L (km): Distance to nearest free face (river, channel)
- Slope (%): Ground slope, expressed as a gradient in percent
- PGA (g): Peak Ground Acceleration from earthquake
- Elevation: Site elevation (will be removed as Durante & Rathje found it less significant)
- Target: Binary classification (0 = no lateral spreading, 1 = lateral spreading)
!pip3 install pandas numpy matplotlib seaborn scipy scikit-learn --quiet
Load Data and Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Suppress library warnings for cleaner notebook output
import warnings
warnings.filterwarnings('ignore')
# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/kks32-courses/ai-geotech/refs/heads/main/docs/00-mlp-dtree/RF_YN_Model3.csv')
print(f"Dataset: {df.shape[0]} samples, {df.shape[1]} features\n")
df.head()
Dataset: 7291 samples, 7 features
| | Test ID | GWD (m) | Elevation | L (km) | Slope (%) | PGA (g) | Target |
|---|---|---|---|---|---|---|---|
| 0 | 182 | 0.370809 | 0.909116 | 0.319117 | 5.465739 | 0.546270 | 0 |
| 1 | 15635 | 1.300896 | 1.123009 | 0.211770 | 0.905948 | 0.532398 | 0 |
| 2 | 8292 | 1.300896 | 0.847858 | 0.195947 | 0.849104 | 0.532398 | 0 |
| 3 | 15629 | 1.788212 | 2.044325 | 0.115795 | 0.451034 | 0.542307 | 0 |
| 4 | 183 | 1.637517 | 2.003797 | 0.137265 | 0.941866 | 0.545784 | 1 |
# Dataset info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7291 entries, 0 to 7290
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Test ID    7291 non-null   int64
 1   GWD (m)    7291 non-null   float64
 2   Elevation  7291 non-null   float64
 3   L (km)     7291 non-null   float64
 4   Slope (%)  7291 non-null   float64
 5   PGA (g)    7291 non-null   float64
 6   Target     7291 non-null   int64
dtypes: float64(5), int64(2)
memory usage: 398.9 KB
Data Overview¶
# Basic statistics
features = ['GWD (m)', 'L (km)', 'Slope (%)', 'PGA (g)']
df[features + ['Target']].describe()
| | GWD (m) | L (km) | Slope (%) | PGA (g) | Target |
|---|---|---|---|---|---|
| count | 7291.000000 | 7291.000000 | 7291.000000 | 7291.000000 | 7291.000000 |
| mean | 2.075960 | 1.018893 | 1.139434 | 0.439751 | 0.419010 |
| std | 0.641597 | 0.646159 | 1.133215 | 0.042140 | 0.493431 |
| min | 0.370809 | 0.000000 | 0.000000 | 0.328797 | 0.000000 |
| 25% | 1.636752 | 0.487862 | 0.452506 | 0.407541 | 0.000000 |
| 50% | 1.983148 | 0.948660 | 0.806480 | 0.439723 | 0.000000 |
| 75% | 2.460715 | 1.478798 | 1.401399 | 0.470760 | 1.000000 |
| max | 6.047182 | 3.289537 | 10.922902 | 0.567631 | 1.000000 |
Geotechnical Interpretation of Summary Statistics¶
Let's interpret these statistics from a geotechnical perspective:
GWD (Ground Water Depth):
- Range: Check if values are physically plausible (typically 0-5m for liquefaction-prone sites)
- Mean vs. Median: Indicates distribution symmetry
PGA (Peak Ground Acceleration):
- Typical range: 0.1g - 1.0g for significant earthquakes
- Higher PGA → More severe shaking → Higher liquefaction risk
Slope:
- Range: Should be non-negative, typically 0-30%
- Steeper slopes → More lateral spreading potential
Distance (L):
- Sites closer to free faces show more lateral movement
# Class balance
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
df['Target'].value_counts().plot(kind='bar', ax=ax[0], color=['steelblue', 'coral'])
ax[0].set_title('Class Distribution')
ax[0].set_xticklabels(['No Spreading', 'Spreading'], rotation=0)
ax[0].set_ylabel('Count')
df['Target'].value_counts(normalize=True).plot(kind='pie', ax=ax[1], autopct='%1.1f%%',
colors=['steelblue', 'coral'])
ax[1].set_ylabel('')
plt.tight_layout()
plt.show()
print(f"\nClass balance: {df['Target'].value_counts(normalize=True).values}")
Class balance: [0.58099026 0.41900974]
Note on Class Imbalance: If one class significantly outnumbers the other, we may need to consider:
- Stratified sampling during train-test split
- Class weighting in models
- Appropriate evaluation metrics (precision, recall, F1-score, AUC-ROC)
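The first two mitigations can be sketched with scikit-learn (synthetic stand-in data; in the notebook you would pass df[features] and df['Target'] instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in with roughly our 58/42 class split
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.42).astype(int)

# Stratified split preserves the class ratio in both train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 'balanced' weights are inversely proportional to class frequencies,
# so the minority class gets a weight > 1
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)

print(f"Train positive rate: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
print(f"Class weights: {weights.round(3)}")
```

The `weights` array can be passed to most scikit-learn classifiers via their `class_weight` parameter.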
Missing data analysis¶
# Missing values
print("Missing Values:")
missing = df[features].isnull().sum()
if missing.sum() == 0:
print("✓ No missing values!\n")
else:
print(missing[missing > 0])
# Physical plausibility
print("\nPhysical Plausibility Checks:")
print(f" Negative GWD: {(df['GWD (m)'] < 0).sum()}")
print(f" Negative Slope: {(df['Slope (%)'] < 0).sum()}")
print(f" Negative Distance: {(df['L (km)'] < 0).sum()}")
print(f" PGA > 1.0g: {(df['PGA (g)'] > 1.0).sum()}")
# Duplicates
dup = df.duplicated(subset=features).sum()
print(f"\nDuplicate rows: {dup} ({100*dup/len(df):.1f}%)")
Missing Values:
✓ No missing values!

Physical Plausibility Checks:
  Negative GWD: 0
  Negative Slope: 0
  Negative Distance: 0
  PGA > 1.0g: 0

Duplicate rows: 87 (1.2%)
Handling Missing Data - Geotechnical Considerations¶
Types of Missingness:
- MCAR (Missing Completely At Random): No pattern to missing data
- MAR (Missing At Random): Missingness related to observed variables
- MNAR (Missing Not At Random): Missingness related to the missing value itself
Strategies for Geotechnical Data:
- Deletion: Remove samples if <5% missing and MCAR
- Imputation:
- Mean/Median: For continuous variables (GWD, PGA)
- Mode: For categorical variables (soil type)
- Domain knowledge: Use typical values from similar geological conditions
- Advanced: KNN or iterative imputation
- Keep as feature: Sometimes missingness itself is informative (e.g., no GWD measurement might indicate deep water table)
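Two of these strategies as a quick sketch with scikit-learn's imputers (toy values, not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy GWD-like column with two gaps (hypothetical values)
gwd = np.array([[1.6], [2.0], [np.nan], [2.5], [np.nan], [1.9]])

# Median imputation: robust choice for skewed geotechnical data
gwd_filled = SimpleImputer(strategy='median').fit_transform(gwd).ravel()
print(gwd_filled)  # gaps become the median, 1.95

# KNN imputation borrows from the k most similar samples
# (needs more than one feature to find neighbours)
X = np.array([[1.6, 0.5], [2.0, 0.9], [np.nan, 1.0], [2.5, 1.5], [1.9, 0.8]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled[2])  # gap filled with the mean of its two nearest rows
```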
Feature Distributions by Class¶
Key question: Can we visually distinguish spreading from non-spreading sites?
# Histograms comparing spreading vs non-spreading
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for idx, feature in enumerate(features):
# Plot both classes
df[df['Target']==0][feature].hist(ax=axes[idx], bins=30, alpha=0.6,
label='No Spreading', color='steelblue', edgecolor='black')
df[df['Target']==1][feature].hist(ax=axes[idx], bins=30, alpha=0.6,
label='Spreading', color='coral', edgecolor='black')
axes[idx].set_xlabel(feature, fontweight='bold')
axes[idx].set_ylabel('Frequency')
axes[idx].legend()
axes[idx].grid(alpha=0.3)
plt.suptitle('Feature Distributions by Lateral Spreading', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()
Understanding Skewness and Kurtosis¶
Skewness measures asymmetry of the distribution:
- Skewness ≈ 0: Symmetric distribution (normal)
- Skewness > 0: Right-skewed (tail extends to right) - e.g., extreme high PGA values
- Skewness < 0: Left-skewed (tail extends to left)
Kurtosis measures the "tailedness" of the distribution (the values below are excess kurtosis, the convention scipy and pandas report):
- Kurtosis ≈ 0: Normal distribution
- Kurtosis > 0: Heavy tails (more outliers) - leptokurtic
- Kurtosis < 0: Light tails (fewer outliers) - platykurtic
Geotechnical Implications:
- Skewed data might need transformation (log, sqrt) for some models
- High kurtosis indicates potential outliers or extreme events (major earthquakes)
- Tree-based models handle skewness well; neural networks may benefit from normalization
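As a quick check, scipy reports both statistics directly (a toy right-skewed sample, not the dataset):

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Toy right-skewed sample (a stand-in for a slope-like feature)
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.4, 2.0, 4.5, 9.0])

sk = skew(x)              # > 0: long right tail
ek = kurtosis(x)          # scipy returns excess kurtosis (normal ≈ 0)
sk_log = skew(np.log(x))  # log transform pulls in the right tail

print(f"skewness: {sk:.2f}, excess kurtosis: {ek:.2f}")
print(f"skewness after log transform: {sk_log:.2f}")
```

pandas offers `df[features].skew()` and `.kurt()` as well (with a slightly different bias correction).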
Quartiles and Box Plots by Class¶
Quartiles: Robust statistics not affected by outliers
- Q1, Q2 (median), Q3 divide data into 4 equal parts
- IQR = Q3 - Q1 (middle 50% of data)
- Outlier rule: Beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
# Box plots by class
fig, axes = plt.subplots(1, 4, figsize=(14, 4))
for idx, feature in enumerate(features):
df.boxplot(column=feature, by='Target', ax=axes[idx], patch_artist=True)
axes[idx].set_title(f'{feature}')
axes[idx].set_xlabel('')
axes[idx].set_xticklabels(['No Spread', 'Spread'])
axes[idx].get_figure().suptitle('')
plt.suptitle('Box Plots by Target Class', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# Compute quartiles
print("\nFive-Number Summary:\n")
print(f"{'Feature':<12} {'Min':>8} {'Q1':>8} {'Median':>8} {'Q3':>8} {'Max':>8} {'IQR':>8}")
print("-" * 70)
for feature in features:
q1 = df[feature].quantile(0.25)
median = df[feature].median()
q3 = df[feature].quantile(0.75)
iqr = q3 - q1
print(f"{feature:<12} {df[feature].min():>8.2f} {q1:>8.2f} {median:>8.2f} {q3:>8.2f} {df[feature].max():>8.2f} {iqr:>8.2f}")
Five-Number Summary:

Feature           Min       Q1   Median       Q3      Max      IQR
----------------------------------------------------------------------
GWD (m)          0.37     1.64     1.98     2.46     6.05     0.82
L (km)           0.00     0.49     0.95     1.48     3.29     0.99
Slope (%)        0.00     0.45     0.81     1.40    10.92     0.95
PGA (g)          0.33     0.41     0.44     0.47     0.57     0.06
What Do These Numbers Mean?¶
Reading the Five-Number Summary:
GWD (Ground Water Depth):
- Middle 50%: Between 1.64 m and 2.46 m (IQR = 0.82 m)
- Median = 1.98 m: Typical water table depth
- Interpretation: Most sites have water tables around 2 m deep, with relatively tight clustering (IQR = 0.82 m)
L (Distance to Free Face):
- Middle 50%: Between 0.49 km and 1.48 km (IQR = 0.99 km)
- Median = 0.95 km: Typical distance to river/channel
- Interpretation: Most sites are roughly 0.5-1.5 km from water bodies
Slope:
- Middle 50%: Between 0.45% and 1.40% (IQR = 0.95%)
- Median = 0.81%: Very gentle slopes
- Note: Max = 10.92% shows some steep sites exist (but rare)
PGA (Peak Ground Acceleration):
- Middle 50%: Between 0.41g and 0.47g (IQR = 0.06g)
- Median = 0.44g: Moderate earthquake shaking
- Interpretation: Very tight clustering (IQR = 0.06g) - earthquakes in this dataset are similar strength
Key Insight: IQR tells you the SPREAD of the middle 50% of data
- Small IQR (like PGA = 0.06g): Data is tightly clustered → consistent conditions
- Large IQR (like L = 0.99 km): Data is more spread out → variable conditions
The 1.5×IQR Rule¶
A point is an outlier if it falls:
- Below: Q1 - 1.5 × IQR
- Above: Q3 + 1.5 × IQR
Why 1.5? John Tukey's empirical rule - it balances detection against false alarms.
What's a good outlier percentage?
- 1-5%: Typical for clean data
- 5-10%: Acceptable, natural variability
- >10%: Investigate - may indicate data quality issues or highly variable conditions
Quick Reference - p-value interpretation¶
- p < 0.001 (***): Extremely strong evidence
- p < 0.01 (**): Strong evidence
- p < 0.05 (*): Moderate evidence
- p ≥ 0.05: No significant difference
Statistical Significance Testing¶
The Question: Are features truly different between spreading and non-spreading sites?
Mann-Whitney U Test - How It Works¶
U counts the "wins" - how often one group beats the other.
The Algorithm:
- Mix all values from both groups together
- Rank them from smallest to largest (1, 2, 3, ...)
- Sum ranks for each group
- Calculate U: count how often a value from one group exceeds a value from the other (ties count as ½)
The p-value answers: "If groups were actually the same, what's the chance we'd see this much separation by random luck?"
- p < 0.001 → Less than 0.1% chance → Strong evidence of real difference!
- p > 0.05 → More than 5% chance → Could just be randomness
Why it's awesome:
- ✓ No assumptions about distribution shape
- ✓ Robust to outliers (uses ranks, not values)
- ✓ Perfect for comparing two groups
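To make the "wins" counting concrete, a tiny sketch (toy numbers, not the dataset) checks a manual pairwise count against scipy:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy groups (hypothetical values)
a = np.array([1.2, 3.4, 2.2, 5.0])
b = np.array([0.8, 2.9, 1.5])

# U for group a = number of (a, b) pairs where a wins (ties count 0.5)
u_manual = sum(float(x > y) + 0.5 * float(x == y) for x in a for y in b)

# scipy computes the same statistic via rank sums
u_scipy, p_value = mannwhitneyu(a, b)

print(u_manual, u_scipy)  # both 9.0
```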
# Mann-Whitney U test (non-parametric)
from scipy.stats import mannwhitneyu
print("Mann-Whitney U Test Results:\n")
print(f"{'Feature':<12} {'p-value':>12} {'Significant?':>15}")
print("-" * 45)
for feature in features:
no_spread = df[df['Target'] == 0][feature]
spread = df[df['Target'] == 1][feature]
u_stat, p_value = mannwhitneyu(no_spread, spread)
sig = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else "No"
print(f"{feature:<12} {p_value:>12.2e} {sig:>15}")
print("\n*** p<0.001 ** p<0.01 * p<0.05")
Mann-Whitney U Test Results:

Feature           p-value    Significant?
---------------------------------------------
GWD (m)          4.76e-15             ***
L (km)           4.90e-46             ***
Slope (%)        9.83e-02              No
PGA (g)          1.35e-16             ***

*** p<0.001 ** p<0.01 * p<0.05
Analysis of Results¶
Mann-Whitney U Results:
| Feature | p-value | Significant? | Conclusion |
|---|---|---|---|
| GWD | 4.76e-15 | *** | Strong predictor - water depth clearly differs between classes |
| L | 4.90e-46 | *** | Strongest predictor - distance to free face is critical |
| Slope | 9.83e-02 | No | Weak predictor - distributions overlap too much |
| PGA | 1.35e-16 | *** | Strong predictor - shaking intensity matters |
Key Insight: Slope is NOT significantly different → May not help prediction much!
The other three features show extremely low p-values → These are our key predictors.
Outlier Detection¶
We'll flag outliers using two complementary methods:
- Method 1 - IQR rule: outliers fall beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
- Method 2 - Z-score: outliers have $|z| > 3$, i.e. lie more than 3 standard deviations from the mean (only about 0.3% of normally distributed data does)
Beyond outlier detection, z-scores also standardize features to mean=0, std=1 for models like neural networks and SVMs (more on this below).
# Simple IQR outlier detection
print("IQR Outlier Detection:\n")
print(f"{'Feature':<12} {'Q1':>8} {'Q3':>8} {'IQR':>8} {'Outliers':>10} {'%':>8}")
print("-" * 60)
for feature in features:
q1 = df[feature].quantile(0.25)
q3 = df[feature].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df[feature] < lower) | (df[feature] > upper)]
n_outliers = len(outliers)
pct = 100 * n_outliers / len(df)
print(f"{feature:<12} {q1:>8.2f} {q3:>8.2f} {iqr:>8.2f} {n_outliers:>10} {pct:>7.1f}%")
IQR Outlier Detection:

Feature            Q1       Q3      IQR   Outliers        %
------------------------------------------------------------
GWD (m)          1.64     2.46     0.82        133     1.8%
L (km)           0.49     1.48     0.99          8     0.1%
Slope (%)        0.45     1.40     0.95        542     7.4%
PGA (g)          0.41     0.47     0.06          3     0.0%
Evaluating IQR Results¶
Our Results:
| Feature | IQR | Outliers | % | Assessment |
|---|---|---|---|---|
| GWD | 0.82 m | 133 | 1.8% | ✓ Excellent |
| L | 0.99 km | 8 | 0.1% | ✓ Excellent |
| Slope | 0.95% | 542 | 7.4% | ✓ Good |
| PGA | 0.06 g | 3 | 0.0% | ✓ Excellent |
Analysis:
- GWD (1.8% outliers): Valid extreme water table depths → Keep
- L (0.1% outliers): Very few unusual distances → Keep
- Slope (7.4% outliers): Natural field variability → Keep
- PGA (0.0% outliers): Perfect - earthquake data well-controlled → Keep
Decision: All outliers < 10% and physically plausible → Keep all data
Why? Model needs to learn extreme events (major earthquakes, steep slopes)
Method 2: Z-Score Method (Standardization)¶
The Z-score method identifies outliers as points with |z-score| > 3.
What is a Z-score?
The z-score measures how many standard deviations a data point is from the mean:
$$z = \frac{x - \mu}{\sigma}$$
where:
- $x$ = individual data point
- $\mu$ = population mean (estimated by sample mean $\bar{x}$)
- $\sigma$ = population standard deviation (estimated by sample std $s$)
Interpretation:
- $z = 0$: Value is exactly at the mean
- $z = 1$: Value is 1 standard deviation above the mean
- $z = -2$: Value is 2 standard deviations below the mean
- $|z| > 3$: Commonly used threshold for outlier detection (about 99.7% of normally distributed data falls within ±3σ)
Why use Z-scores?
- Dimensionless: Allows comparison across different units (meters, g-forces, etc.)
- Standardization: Transforms all features to same scale (mean=0, std=1)
- Statistical foundation: Based on properties of normal distribution
Limitations:
- Assumes approximately normal distribution
- Sensitive to extreme outliers (they affect μ and σ)
- May not work well for highly skewed distributions
Geotechnical Application: For example, if GWD has mean 2.0m and std 0.64m:
- A value of 3.28m would have z = (3.28-2.0)/0.64 = 2.0 (within normal range)
- A value of 6.0m would have z = (6.0-2.0)/0.64 = 6.25 (likely an outlier)
# Z-score outliers
print("Z-Score Outlier Detection (|z| > 3):\n")
print(f"{'Feature':<12} {'Mean':>8} {'Std':>8} {'Outliers':>10} {'%':>8}")
print("-" * 50)
for feature in features:
z_scores = np.abs(stats.zscore(df[feature]))
outliers = df[z_scores > 3]
n_outliers = len(outliers)
pct = 100 * n_outliers / len(df)
print(f"{feature:<12} {df[feature].mean():>8.2f} {df[feature].std():>8.2f} {n_outliers:>10} {pct:>7.1f}%")
Z-Score Outlier Detection (|z| > 3):

Feature          Mean      Std   Outliers        %
--------------------------------------------------
GWD (m)          2.08     0.64         66     0.9%
L (km)           1.02     0.65          8     0.1%
Slope (%)        1.14     1.13        153     2.1%
PGA (g)          0.44     0.04          2     0.0%
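Since the same transform standardizes features for ML, here is a minimal StandardScaler sketch (toy values; in practice, fit on the training split only and reuse the fitted scaler on the test split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (a stand-in for df[features])
X = np.array([[1.6, 0.45],
              [2.0, 0.53],
              [2.5, 0.41],
              [1.9, 0.47]])

scaler = StandardScaler()   # z = (x - mean) / std, column by column
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))   # ≈ [0, 0]
print(X_std.std(axis=0))    # ≈ [1, 1]
```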
Decision Framework for Outliers¶
Keep outliers if:
- Values are physically plausible (within engineering bounds)
- Represent real extreme events (major earthquakes, unusual site conditions)
- Only detected by one method (might be boundary cases)
- Critical for model to learn rare but important scenarios
Remove outliers if:
- Physically impossible values (negative GWD, PGA > 2g)
- Obvious data entry or measurement errors
- Detected by multiple methods consistently
- Would severely bias the model
For this dataset: Based on the geotechnical evaluation above, most outliers appear to be valid extreme events. We'll keep them for model training but flag them for monitoring.
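One simple way to "flag for monitoring" is to add a boolean column (a sketch on a toy frame; `slope_outlier` is a hypothetical column name):

```python
import pandas as pd

# Toy frame standing in for df; 'slope_outlier' is a hypothetical column name
toy = pd.DataFrame({'Slope (%)': [0.4, 0.5, 0.6, 0.7, 0.8, 10.0]})

q1, q3 = toy['Slope (%)'].quantile([0.25, 0.75])
iqr = q3 - q1

# Flag (don't drop) points outside the 1.5×IQR fences
toy['slope_outlier'] = ((toy['Slope (%)'] < q1 - 1.5 * iqr) |
                        (toy['Slope (%)'] > q3 + 1.5 * iqr))
print(toy)  # only the 10.0 row is flagged
```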
Correlation Analysis¶
# Correlation matrix
corr = df[features + ['Target']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.3f', cmap='coolwarm', center=0,
square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nCorrelations with Target:")
print(corr['Target'].sort_values(ascending=False))
Correlations with Target:
Target       1.000000
PGA (g)      0.067607
Slope (%)   -0.012796
GWD (m)     -0.113653
L (km)      -0.167671
Name: Target, dtype: float64
Pair Plot - All Relationships at Once¶
# Pair plot
g = sns.pairplot(df[features + ['Target']], hue='Target',
palette=['steelblue', 'coral'],
plot_kws={'alpha': 0.5}, height=2)
g.fig.suptitle('Pairwise Feature Relationships', y=1.02, fontsize=14, fontweight='bold')
plt.show()
Summary: Key Insights¶
Most Important Features (based on correlation and statistical tests):
- GWD: Shallow water table → more spreading
- PGA: Higher shaking → more spreading
- L: Closer to river → more spreading
- Slope: Effect less clear
Data Quality: ✓ Good - no missing values, physically plausible
Next Steps: Ready for ML modeling!