Exploratory Data Analysis (EDA) for Liquefaction Assessment¶
Understanding geotechnical data for machine learning
Introduction¶
Exploratory Data Analysis (EDA) is a critical first step in any machine learning project. For geotechnical engineering applications, EDA helps us:
- Understand the physical characteristics of our data
- Identify data quality issues (missing values, outliers, errors)
- Discover relationships between geotechnical parameters
- Make informed decisions about preprocessing and feature engineering
- Validate that data aligns with our engineering understanding
Liquefaction and Lateral Spreading¶
Soil liquefaction occurs in saturated loose sandy soils subjected to rapid loading conditions, such as earthquakes. The generation of excess pore water pressure can lead to a sudden reduction in soil strength and stiffness. In gently sloping ground, this may generate lateral displacements known as lateral spreading.
Dataset¶
We'll analyze the lateral spreading dataset from:
Durante, M. G., & Rathje, E. M. (2021). An exploration of the use of machine learning to predict lateral spreading. Earthquake Spectra, 37(4), 2288-2314.
Classification criteria: Sites with >0.3 m lateral displacement are classified as experiencing lateral spreading (Target=1).
Key features:
- GWD (m): Ground Water Depth - distance from surface to water table
- L (km): Distance to nearest free face (river, channel)
- Slope (%): Ground slope, expressed as a gradient in percent
- PGA (g): Peak Ground Acceleration from earthquake
- Elevation: Site elevation (will be removed as Durante & Rathje found it less significant)
- Target: Binary classification (0 = no lateral spreading, 1 = lateral spreading)
!pip3 install pandas numpy matplotlib seaborn scipy scikit-learn --quiet
Load Data and Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Suppress library warnings for cleaner notebook output
import warnings
warnings.filterwarnings('ignore')
# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/kks32-courses/ai-geotech/refs/heads/main/docs/00-mlp-dtree/RF_YN_Model3.csv')
print(f"Dataset: {df.shape[0]} samples, {df.shape[1]} features\n")
df.head()
Dataset: 7291 samples, 7 features
| | Test ID | GWD (m) | Elevation | L (km) | Slope (%) | PGA (g) | Target |
|---|---|---|---|---|---|---|---|
| 0 | 182 | 0.370809 | 0.909116 | 0.319117 | 5.465739 | 0.546270 | 0 |
| 1 | 15635 | 1.300896 | 1.123009 | 0.211770 | 0.905948 | 0.532398 | 0 |
| 2 | 8292 | 1.300896 | 0.847858 | 0.195947 | 0.849104 | 0.532398 | 0 |
| 3 | 15629 | 1.788212 | 2.044325 | 0.115795 | 0.451034 | 0.542307 | 0 |
| 4 | 183 | 1.637517 | 2.003797 | 0.137265 | 0.941866 | 0.545784 | 1 |
# Dataset info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7291 entries, 0 to 7290
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Test ID    7291 non-null   int64
 1   GWD (m)    7291 non-null   float64
 2   Elevation  7291 non-null   float64
 3   L (km)     7291 non-null   float64
 4   Slope (%)  7291 non-null   float64
 5   PGA (g)    7291 non-null   float64
 6   Target     7291 non-null   int64
dtypes: float64(5), int64(2)
memory usage: 398.9 KB
Data Overview¶
# Basic statistics
features = ['GWD (m)', 'L (km)', 'Slope (%)', 'PGA (g)']
df[features + ['Target']].describe()
| | GWD (m) | L (km) | Slope (%) | PGA (g) | Target |
|---|---|---|---|---|---|
| count | 7291.000000 | 7291.000000 | 7291.000000 | 7291.000000 | 7291.000000 |
| mean | 2.075960 | 1.018893 | 1.139434 | 0.439751 | 0.419010 |
| std | 0.641597 | 0.646159 | 1.133215 | 0.042140 | 0.493431 |
| min | 0.370809 | 0.000000 | 0.000000 | 0.328797 | 0.000000 |
| 25% | 1.636752 | 0.487862 | 0.452506 | 0.407541 | 0.000000 |
| 50% | 1.983148 | 0.948660 | 0.806480 | 0.439723 | 0.000000 |
| 75% | 2.460715 | 1.478798 | 1.401399 | 0.470760 | 1.000000 |
| max | 6.047182 | 3.289537 | 10.922902 | 0.567631 | 1.000000 |
Geotechnical Interpretation of Summary Statistics¶
Let's interpret these statistics from a geotechnical perspective:
GWD (Ground Water Depth):
- Range: Check if values are physically plausible (typically 0-5m for liquefaction-prone sites)
- Mean vs. Median: Indicates distribution symmetry
PGA (Peak Ground Acceleration):
- Typical range: 0.1g - 1.0g for significant earthquakes
- Higher PGA → More severe shaking → Higher liquefaction risk
Slope:
- Range: Should be non-negative, typically 0-30%
- Steeper slopes → More lateral spreading potential
Distance (L):
- Sites closer to free faces show more lateral movement
# Class balance
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
df['Target'].value_counts().plot(kind='bar', ax=ax[0], color=['steelblue', 'coral'])
ax[0].set_title('Class Distribution')
ax[0].set_xticklabels(['No Spreading', 'Spreading'], rotation=0)
ax[0].set_ylabel('Count')
df['Target'].value_counts(normalize=True).plot(kind='pie', ax=ax[1], autopct='%1.1f%%',
colors=['steelblue', 'coral'])
ax[1].set_ylabel('')
plt.tight_layout()
plt.show()
print(f"\nClass balance: {df['Target'].value_counts(normalize=True).values}")
Class balance: [0.58099026 0.41900974]
Note on Class Imbalance: If one class significantly outnumbers the other, we may need to consider:
- Stratified sampling during train-test split
- Class weighting in models
- Appropriate evaluation metrics (precision, recall, F1-score, AUC-ROC)
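The first two mitigations can be sketched with scikit-learn (synthetic stand-in data; in the notebook you would pass df[features] and df['Target'] instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in with roughly our 58/42 class split
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.42).astype(int)

# Stratified split preserves the class ratio in both train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 'balanced' weights are inversely proportional to class frequencies,
# so the minority class gets a weight > 1
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)

print(f"Train positive rate: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
print(f"Class weights: {weights.round(3)}")
```

The `weights` array can be passed to most scikit-learn classifiers via their `class_weight` parameter.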
Missing data analysis¶
# Missing values
print("Missing Values:")
missing = df[features].isnull().sum()
if missing.sum() == 0:
print("✓ No missing values!\n")
else:
print(missing[missing > 0])
# Physical plausibility
print("\nPhysical Plausibility Checks:")
print(f" Negative GWD: {(df['GWD (m)'] < 0).sum()}")
print(f" Negative Slope: {(df['Slope (%)'] < 0).sum()}")
print(f" Negative Distance: {(df['L (km)'] < 0).sum()}")
print(f" PGA > 1.0g: {(df['PGA (g)'] > 1.0).sum()}")
# Duplicates
dup = df.duplicated(subset=features).sum()
print(f"\nDuplicate rows: {dup} ({100*dup/len(df):.1f}%)")
Missing Values:
✓ No missing values!

Physical Plausibility Checks:
  Negative GWD: 0
  Negative Slope: 0
  Negative Distance: 0
  PGA > 1.0g: 0

Duplicate rows: 87 (1.2%)
Handling Missing Data - Geotechnical Considerations¶
Types of Missingness:
- MCAR (Missing Completely At Random): No pattern to missing data
- MAR (Missing At Random): Missingness related to observed variables
- MNAR (Missing Not At Random): Missingness related to the missing value itself
Strategies for Geotechnical Data:
- Deletion: Remove samples if <5% missing and MCAR
- Imputation:
- Mean/Median: For continuous variables (GWD, PGA)
- Mode: For categorical variables (soil type)
- Domain knowledge: Use typical values from similar geological conditions
- Advanced: KNN or iterative imputation
- Keep as feature: Sometimes missingness itself is informative (e.g., no GWD measurement might indicate deep water table)
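Two of these strategies as a quick sketch with scikit-learn's imputers (toy values, not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy GWD-like column with two gaps (hypothetical values)
gwd = np.array([[1.6], [2.0], [np.nan], [2.5], [np.nan], [1.9]])

# Median imputation: robust choice for skewed geotechnical data
gwd_filled = SimpleImputer(strategy='median').fit_transform(gwd).ravel()
print(gwd_filled)  # gaps become the median, 1.95

# KNN imputation borrows from the k most similar samples
# (needs more than one feature to find neighbours)
X = np.array([[1.6, 0.5], [2.0, 0.9], [np.nan, 1.0], [2.5, 1.5], [1.9, 0.8]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled[2])  # gap filled with the mean of its two nearest rows
```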
Feature Distributions by Class¶
Key question: Can we visually distinguish spreading from non-spreading sites?
# Histograms comparing spreading vs non-spreading
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for idx, feature in enumerate(features):
# Plot both classes
df[df['Target']==0][feature].hist(ax=axes[idx], bins=30, alpha=0.6,
label='No Spreading', color='steelblue', edgecolor='black')
df[df['Target']==1][feature].hist(ax=axes[idx], bins=30, alpha=0.6,
label='Spreading', color='coral', edgecolor='black')
axes[idx].set_xlabel(feature, fontweight='bold')
axes[idx].set_ylabel('Frequency')
axes[idx].legend()
axes[idx].grid(alpha=0.3)
plt.suptitle('Feature Distributions by Lateral Spreading', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()
Understanding Skewness and Kurtosis¶
Skewness measures asymmetry of the distribution:
- Skewness ≈ 0: Symmetric distribution (normal)
- Skewness > 0: Right-skewed (tail extends to right) - e.g., extreme high PGA values
- Skewness < 0: Left-skewed (tail extends to left)
Kurtosis measures the "tailedness" of the distribution (the values below are excess kurtosis, the convention scipy and pandas report):
- Kurtosis ≈ 0: Normal distribution
- Kurtosis > 0: Heavy tails (more outliers) - leptokurtic
- Kurtosis < 0: Light tails (fewer outliers) - platykurtic
Geotechnical Implications:
- Skewed data might need transformation (log, sqrt) for some models
- High kurtosis indicates potential outliers or extreme events (major earthquakes)
- Tree-based models handle skewness well; neural networks may benefit from normalization
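As a quick check, scipy reports both statistics directly (a toy right-skewed sample, not the dataset):

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Toy right-skewed sample (a stand-in for a slope-like feature)
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.4, 2.0, 4.5, 9.0])

sk = skew(x)              # > 0: long right tail
ek = kurtosis(x)          # scipy returns excess kurtosis (normal ≈ 0)
sk_log = skew(np.log(x))  # log transform pulls in the right tail

print(f"skewness: {sk:.2f}, excess kurtosis: {ek:.2f}")
print(f"skewness after log transform: {sk_log:.2f}")
```

pandas offers `df[features].skew()` and `.kurt()` as well (with a slightly different bias correction).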
Quartiles and Box Plots by Class¶
Quartiles: Robust statistics not affected by outliers
- Q1, Q2 (median), Q3 divide data into 4 equal parts
- IQR = Q3 - Q1 (middle 50% of data)
- Outlier rule: Beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
# Box plots by class
fig, axes = plt.subplots(1, 4, figsize=(14, 4))
for idx, feature in enumerate(features):
df.boxplot(column=feature, by='Target', ax=axes[idx], patch_artist=True)
axes[idx].set_title(f'{feature}')
axes[idx].set_xlabel('')
axes[idx].set_xticklabels(['No Spread', 'Spread'])
axes[idx].get_figure().suptitle('')
plt.suptitle('Box Plots by Target Class', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# Compute quartiles
print("\nFive-Number Summary:\n")
print(f"{'Feature':<12} {'Min':>8} {'Q1':>8} {'Median':>8} {'Q3':>8} {'Max':>8} {'IQR':>8}")
print("-" * 70)
for feature in features:
q1 = df[feature].quantile(0.25)
median = df[feature].median()
q3 = df[feature].quantile(0.75)
iqr = q3 - q1
print(f"{feature:<12} {df[feature].min():>8.2f} {q1:>8.2f} {median:>8.2f} {q3:>8.2f} {df[feature].max():>8.2f} {iqr:>8.2f}")
Five-Number Summary:

Feature           Min       Q1   Median       Q3      Max      IQR
----------------------------------------------------------------------
GWD (m)          0.37     1.64     1.98     2.46     6.05     0.82
L (km)           0.00     0.49     0.95     1.48     3.29     0.99
Slope (%)        0.00     0.45     0.81     1.40    10.92     0.95
PGA (g)          0.33     0.41     0.44     0.47     0.57     0.06
What Do These Numbers Mean?¶
Reading the Five-Number Summary:
GWD (Ground Water Depth):
- Middle 50%: Between 1.64 m and 2.46 m (IQR = 0.82 m)
- Median = 1.98 m: Typical water table depth
- Interpretation: Most sites have water tables around 2 m deep, with relatively tight clustering (IQR = 0.82 m)
L (Distance to Free Face):
- Middle 50%: Between 0.49 km and 1.48 km (IQR = 0.99 km)
- Median = 0.95 km: Typical distance to river/channel
- Interpretation: Most sites are roughly 0.5-1.5 km from water bodies
Slope:
- Middle 50%: Between 0.45% and 1.40% (IQR = 0.95%)
- Median = 0.81%: Very gentle slopes
- Note: Max = 10.92% shows some steep sites exist (but rare)
PGA (Peak Ground Acceleration):
- Middle 50%: Between 0.41g and 0.47g (IQR = 0.06g)
- Median = 0.44g: Moderate earthquake shaking
- Interpretation: Very tight clustering (IQR = 0.06g) - earthquakes in this dataset are similar strength
Key Insight: IQR tells you the SPREAD of the middle 50% of data
- Small IQR (like PGA = 0.06g): Data is tightly clustered → consistent conditions
- Large IQR (like L = 0.99 km): Data is more spread out → variable conditions
The 1.5×IQR Rule¶
A point is an outlier if it falls:
- Below: Q1 - 1.5 × IQR
- Above: Q3 + 1.5 × IQR
Why 1.5? John Tukey's empirical rule - it balances detection against false alarms.
What's a good outlier percentage?
- 1-5%: Typical for clean data
- 5-10%: Acceptable, natural variability
- >10%: Investigate - may indicate data quality issues or highly variable conditions
Quick Reference - p-value interpretation¶
- p < 0.001 (***): Extremely strong evidence
- p < 0.01 (**): Strong evidence
- p < 0.05 (*): Moderate evidence
- p ≥ 0.05: No significant difference
Statistical Significance Testing¶
The Question: Are features truly different between spreading and non-spreading sites?
Mann-Whitney U Test - How It Works¶
U counts the "wins" - how often one group beats the other.
The Algorithm:
- Mix all values from both groups together
- Rank them from smallest to largest (1, 2, 3, ...)
- Sum ranks for each group
- Calculate U: count how often a value from one group exceeds a value from the other (ties count as ½)
The p-value answers: "If groups were actually the same, what's the chance we'd see this much separation by random luck?"
- p < 0.001 → Less than 0.1% chance → Strong evidence of real difference!
- p > 0.05 → More than 5% chance → Could just be randomness
Why it's awesome:
- ✓ No assumptions about distribution shape
- ✓ Robust to outliers (uses ranks, not values)
- ✓ Perfect for comparing two groups
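To make the "wins" counting concrete, a tiny sketch (toy numbers, not the dataset) checks a manual pairwise count against scipy:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy groups (hypothetical values)
a = np.array([1.2, 3.4, 2.2, 5.0])
b = np.array([0.8, 2.9, 1.5])

# U for group a = number of (a, b) pairs where a wins (ties count 0.5)
u_manual = sum(float(x > y) + 0.5 * float(x == y) for x in a for y in b)

# scipy computes the same statistic via rank sums
u_scipy, p_value = mannwhitneyu(a, b)

print(u_manual, u_scipy)  # both 9.0
```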
# Mann-Whitney U test (non-parametric)
from scipy.stats import mannwhitneyu
print("Mann-Whitney U Test Results:\n")
print(f"{'Feature':<12} {'p-value':>12} {'Significant?':>15}")
print("-" * 45)
for feature in features:
no_spread = df[df['Target'] == 0][feature]
spread = df[df['Target'] == 1][feature]
u_stat, p_value = mannwhitneyu(no_spread, spread)
sig = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else "No"
print(f"{feature:<12} {p_value:>12.2e} {sig:>15}")
print("\n*** p<0.001 ** p<0.01 * p<0.05")
Mann-Whitney U Test Results:

Feature           p-value    Significant?
---------------------------------------------
GWD (m)          4.76e-15             ***
L (km)           4.90e-46             ***
Slope (%)        9.83e-02              No
PGA (g)          1.35e-16             ***

*** p<0.001 ** p<0.01 * p<0.05
Analysis of Results¶
Mann-Whitney U Results:
| Feature | p-value | Significant? | Conclusion |
|---|---|---|---|
| GWD | 4.76e-15 | *** | Strong predictor - water depth clearly differs between classes |
| L | 4.90e-46 | *** | Strongest predictor - distance to free face is critical |
| Slope | 9.83e-02 | No | Weak predictor - distributions overlap too much |
| PGA | 1.35e-16 | *** | Strong predictor - shaking intensity matters |
Key Insight: Slope is NOT significantly different → May not help prediction much!
The other three features show extremely low p-values → These are our key predictors.
Outlier Detection¶
We'll flag outliers using two complementary methods:
- Method 1 - IQR rule: outliers fall beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
- Method 2 - Z-score: outliers have $|z| > 3$, i.e. lie more than 3 standard deviations from the mean (only about 0.3% of normally distributed data does)
Beyond outlier detection, z-scores also standardize features to mean=0, std=1 for models like neural networks and SVMs (more on this below).
# Simple IQR outlier detection
print("IQR Outlier Detection:\n")
print(f"{'Feature':<12} {'Q1':>8} {'Q3':>8} {'IQR':>8} {'Outliers':>10} {'%':>8}")
print("-" * 60)
for feature in features:
q1 = df[feature].quantile(0.25)
q3 = df[feature].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = df[(df[feature] < lower) | (df[feature] > upper)]
n_outliers = len(outliers)
pct = 100 * n_outliers / len(df)
print(f"{feature:<12} {q1:>8.2f} {q3:>8.2f} {iqr:>8.2f} {n_outliers:>10} {pct:>7.1f}%")
IQR Outlier Detection:

Feature            Q1       Q3      IQR   Outliers        %
------------------------------------------------------------
GWD (m)          1.64     2.46     0.82        133     1.8%
L (km)           0.49     1.48     0.99          8     0.1%
Slope (%)        0.45     1.40     0.95        542     7.4%
PGA (g)          0.41     0.47     0.06          3     0.0%
Evaluating IQR Results¶
Our Results:
| Feature | IQR | Outliers | % | Assessment |
|---|---|---|---|---|
| GWD | 0.82 m | 133 | 1.8% | ✓ Excellent |
| L | 0.99 km | 8 | 0.1% | ✓ Excellent |
| Slope | 0.95% | 542 | 7.4% | ✓ Good |
| PGA | 0.06 g | 3 | 0.0% | ✓ Excellent |
Analysis:
- GWD (1.8% outliers): Valid extreme water table depths → Keep
- L (0.1% outliers): Very few unusual distances → Keep
- Slope (7.4% outliers): Natural field variability → Keep
- PGA (0.0% outliers): Perfect - earthquake data well-controlled → Keep
Decision: All outliers < 10% and physically plausible → Keep all data
Why? Model needs to learn extreme events (major earthquakes, steep slopes)
Method 2: Z-Score Method (Standardization)¶
The Z-score method identifies outliers as points with |z-score| > 3.
What is a Z-score?
The z-score measures how many standard deviations a data point is from the mean:
$$z = \frac{x - \mu}{\sigma}$$
where:
- $x$ = individual data point
- $\mu$ = population mean (estimated by sample mean $\bar{x}$)
- $\sigma$ = population standard deviation (estimated by sample std $s$)
Interpretation:
- $z = 0$: Value is exactly at the mean
- $z = 1$: Value is 1 standard deviation above the mean
- $z = -2$: Value is 2 standard deviations below the mean
- $|z| > 3$: Commonly used threshold for outlier detection (about 99.7% of normally distributed data falls within ±3σ)
Why use Z-scores?
- Dimensionless: Allows comparison across different units (meters, g-forces, etc.)
- Standardization: Transforms all features to same scale (mean=0, std=1)
- Statistical foundation: Based on properties of normal distribution
Limitations:
- Assumes approximately normal distribution
- Sensitive to extreme outliers (they affect μ and σ)
- May not work well for highly skewed distributions
Geotechnical Application: For example, if GWD has mean 2.0m and std 0.64m:
- A value of 3.28m would have z = (3.28-2.0)/0.64 = 2.0 (within normal range)
- A value of 6.0m would have z = (6.0-2.0)/0.64 = 6.25 (likely an outlier)
# Z-score outliers
print("Z-Score Outlier Detection (|z| > 3):\n")
print(f"{'Feature':<12} {'Mean':>8} {'Std':>8} {'Outliers':>10} {'%':>8}")
print("-" * 50)
for feature in features:
z_scores = np.abs(stats.zscore(df[feature]))
outliers = df[z_scores > 3]
n_outliers = len(outliers)
pct = 100 * n_outliers / len(df)
print(f"{feature:<12} {df[feature].mean():>8.2f} {df[feature].std():>8.2f} {n_outliers:>10} {pct:>7.1f}%")
Z-Score Outlier Detection (|z| > 3):

Feature          Mean      Std   Outliers        %
--------------------------------------------------
GWD (m)          2.08     0.64         66     0.9%
L (km)           1.02     0.65          8     0.1%
Slope (%)        1.14     1.13        153     2.1%
PGA (g)          0.44     0.04          2     0.0%
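Since the same transform standardizes features for ML, here is a minimal StandardScaler sketch (toy values; in practice, fit on the training split only and reuse the fitted scaler on the test split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (a stand-in for df[features])
X = np.array([[1.6, 0.45],
              [2.0, 0.53],
              [2.5, 0.41],
              [1.9, 0.47]])

scaler = StandardScaler()   # z = (x - mean) / std, column by column
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))   # ≈ [0, 0]
print(X_std.std(axis=0))    # ≈ [1, 1]
```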
Decision Framework for Outliers¶
Keep outliers if:
- Values are physically plausible (within engineering bounds)
- Represent real extreme events (major earthquakes, unusual site conditions)
- Only detected by one method (might be boundary cases)
- Critical for model to learn rare but important scenarios
Remove outliers if:
- Physically impossible values (negative GWD, PGA > 2g)
- Obvious data entry or measurement errors
- Detected by multiple methods consistently
- Would severely bias the model
For this dataset: Based on the geotechnical evaluation above, most outliers appear to be valid extreme events. We'll keep them for model training but flag them for monitoring.
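One simple way to "flag for monitoring" is to add a boolean column (a sketch on a toy frame; `slope_outlier` is a hypothetical column name):

```python
import pandas as pd

# Toy frame standing in for df; 'slope_outlier' is a hypothetical column name
toy = pd.DataFrame({'Slope (%)': [0.4, 0.5, 0.6, 0.7, 0.8, 10.0]})

q1, q3 = toy['Slope (%)'].quantile([0.25, 0.75])
iqr = q3 - q1

# Flag (don't drop) points outside the 1.5×IQR fences
toy['slope_outlier'] = ((toy['Slope (%)'] < q1 - 1.5 * iqr) |
                        (toy['Slope (%)'] > q3 + 1.5 * iqr))
print(toy)  # only the 10.0 row is flagged
```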
Correlation Analysis¶
# Correlation matrix
corr = df[features + ['Target']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.3f', cmap='coolwarm', center=0,
square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nCorrelations with Target:")
print(corr['Target'].sort_values(ascending=False))
Correlations with Target:
Target       1.000000
PGA (g)      0.067607
Slope (%)   -0.012796
GWD (m)     -0.113653
L (km)      -0.167671
Name: Target, dtype: float64
Pair Plot - All Relationships at Once¶
# Pair plot
g = sns.pairplot(df[features + ['Target']], hue='Target',
palette=['steelblue', 'coral'],
plot_kws={'alpha': 0.5}, height=2)
g.fig.suptitle('Pairwise Feature Relationships', y=1.02, fontsize=14, fontweight='bold')
plt.show()
Summary: Key Insights¶
Most Important Features (based on correlation and statistical tests):
- GWD: Shallow water table → more spreading
- PGA: Higher shaking → more spreading
- L: Closer to river → more spreading
- Slope: Effect less clear
Data Quality: ✓ Good - no missing values, physically plausible
Next Steps: Ready for ML modeling!