Reproducible PySpark starter: data reading → cleaning → feature engineering → group aggregations → artifact export.
Example dataset: Telco Customer Churn (CSV).
- Clear structure (`src/`, `configs/`, `notebooks/`, `artifacts/`)
- Config-driven (YAML)
- One-command run: reproducible ETL / features / reports
- Notebook is demo-only (no business logic inside)
 
```
.
├─ src/
│  ├─ session.py         # SparkSession init
│  ├─ etl.py             # read / basic cleaning
│  ├─ features.py        # Telco-specific features & aggregates
│  └─ job.py             # entrypoint (config -> run)
├─ configs/
│  └─ config.yaml        # pipeline parameters
├─ data/
│  └─ WA_Fn-UseC_-Telco-Customer-Churn.csv
├─ artifacts/            # outputs (gitignored)
├─ notebooks/
│  └─ 01_demo.ipynb      # demonstration notebook
├─ requirements.txt
├─ .gitignore
└─ LICENSE
```
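For orientation, a minimal sketch of what `src/session.py` might provide; the helper name and signature are illustrative, not the repo's exact API:

```python
from pyspark.sql import SparkSession


def get_spark(app_name: str, shuffle_partitions: int = 8) -> SparkSession:
    """Create (or reuse) a local SparkSession configured from config.yaml values."""
    return (
        SparkSession.builder
        .master("local[*]")
        .appName(app_name)
        .config("spark.sql.shuffle.partitions", shuffle_partitions)
        .getOrCreate()
    )
```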
Requirements: Python 3.9+; Java (Temurin 11/17).
Windows: winutils.exe / hadoop.dll are optional in local[*] mode.
```bash
python -m pip install --upgrade pip
pip install -r requirements.txt
```

If Spark does not start, verify `JAVA_HOME` and `PATH`.
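A quick way to confirm Java and PySpark are wired up before running the full job (a minimal smoke test, not part of the pipeline):

```python
from pyspark.sql import SparkSession

# Start a local session and run a trivial query; if a row prints,
# Java and PySpark are configured correctly.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.range(1).show()
spark.stop()
```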
```bash
python -m src.job --config configs/config.yaml
```

Outputs:

- `artifacts/features/` — engineered features (Parquet)
- `artifacts/report/` — grouped aggregates (Parquet)
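To inspect the results interactively (e.g. in the demo notebook), the Parquet outputs can be read back with Spark; a minimal sketch, assuming the default `artifacts` output dir:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Engineered features and grouped report, as written by src/job.py
features = spark.read.parquet("artifacts/features")
report = spark.read.parquet("artifacts/report")

features.printSchema()
report.show(truncate=False)
```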
Cleaning (`src/etl.py`):

- Trim string columns, drop duplicates
- Safe cast for `TotalCharges` (handles empty strings); see the sketch below
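A minimal sketch of this cleaning step, assuming the Telco CSV column names (`TotalCharges` arrives as a string with blanks for new customers); the actual logic lives in `src/etl.py`:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import DoubleType, StringType


def clean(df: DataFrame) -> DataFrame:
    # Trim every string column
    for field in df.schema.fields:
        if isinstance(field.dataType, StringType):
            df = df.withColumn(field.name, F.trim(F.col(field.name)))

    # Empty strings in TotalCharges become NULL, then cast to double
    df = df.withColumn(
        "TotalCharges",
        F.when(F.col("TotalCharges") == "", None)
         .otherwise(F.col("TotalCharges"))
         .cast(DoubleType()),
    )

    # Drop exact duplicate rows
    return df.dropDuplicates()
```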
Features (`src/features.py`), sketched below:

- `MonthlyCharges_log1p`, `TotalCharges_log1p`
- `TenureBucket`: `0–12`, `13–24`, `25–48`, `49+`
- `MCharges_per_Tenure` = `MonthlyCharges / tenure` (zero-safe)
- Service flags: `HasFiber`, `HasStreamingTV`, `HasOnlineSecurity`, `HasTechSupport`, etc.
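A minimal sketch of these transforms (column names as listed above; clamping `tenure` to at least 1 is one possible zero-safe choice, and the real implementation in `src/features.py` may differ):

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def add_features(df: DataFrame) -> DataFrame:
    return (
        df
        # log1p transforms for skewed charge columns
        .withColumn("MonthlyCharges_log1p", F.log1p("MonthlyCharges"))
        .withColumn("TotalCharges_log1p", F.log1p("TotalCharges"))
        # tenure buckets: 0-12, 13-24, 25-48, 49+
        .withColumn(
            "TenureBucket",
            F.when(F.col("tenure") <= 12, "0-12")
             .when(F.col("tenure") <= 24, "13-24")
             .when(F.col("tenure") <= 48, "25-48")
             .otherwise("49+"),
        )
        # zero-safe monthly charges per month of tenure
        .withColumn(
            "MCharges_per_Tenure",
            F.col("MonthlyCharges") / F.greatest(F.col("tenure"), F.lit(1)),
        )
        # binary service flags
        .withColumn("HasFiber", (F.col("InternetService") == "Fiber optic").cast("int"))
        .withColumn("HasStreamingTV", (F.col("StreamingTV") == "Yes").cast("int"))
        .withColumn("HasOnlineSecurity", (F.col("OnlineSecurity") == "Yes").cast("int"))
        .withColumn("HasTechSupport", (F.col("TechSupport") == "Yes").cast("int"))
    )
```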
Report (grouped aggregates):

- Default slice: `Contract × InternetService × Churn`
- Metrics: `n`, `mean` / `std` / `p50` for charges (see the sketch below)
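A minimal sketch of the grouped report, assuming the default `group_by` columns from the config and `percentile_approx` (Spark 3.1+) for the median:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def build_report(df: DataFrame, group_by: list[str]) -> DataFrame:
    return df.groupBy(*group_by).agg(
        F.count("*").alias("n"),
        F.mean("MonthlyCharges").alias("MonthlyCharges_mean"),
        F.stddev("MonthlyCharges").alias("MonthlyCharges_std"),
        F.percentile_approx("MonthlyCharges", 0.5).alias("MonthlyCharges_p50"),
        F.mean("TotalCharges").alias("TotalCharges_mean"),
        F.stddev("TotalCharges").alias("TotalCharges_std"),
        F.percentile_approx("TotalCharges", 0.5).alias("TotalCharges_p50"),
    )


# Example: the default slice from configs/config.yaml
# report = build_report(features_df, ["Contract", "InternetService", "Churn"])
```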
Adjust paths and groupings in `configs/config.yaml`:

```yaml
app_name: "SparkTelcoChurn"
shuffle_partitions: 8
input:
  path: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  fmt: "csv"
  header: true
  infer_schema: true
report:
  group_by: ["Contract", "InternetService", "Churn"]
output:
  dir: "artifacts"
  fmt: "parquet"notebooks/01_demo.ipynb is a lightweight, interactive walkthrough (df.show(), basic checks).
All pipeline logic lives under src/.
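For orientation, a minimal sketch of how the `src/job.py` entrypoint could wire this config to the pipeline. The helper names (`get_spark`, `read_input`, `clean`, `add_features`, `build_report`) are illustrative, and PyYAML is assumed to be available via `requirements.txt`:

```python
import argparse

import yaml  # PyYAML, assumed to be in requirements.txt

from src.session import get_spark            # hypothetical helper names
from src.etl import read_input, clean
from src.features import add_features, build_report


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    # Load pipeline parameters from configs/config.yaml
    with open(args.config) as fh:
        cfg = yaml.safe_load(fh)

    spark = get_spark(cfg["app_name"], cfg["shuffle_partitions"])

    df = clean(read_input(spark, cfg["input"]))
    features = add_features(df)
    report = build_report(features, cfg["report"]["group_by"])

    # Write both artifacts in the configured format (Parquet by default)
    out = cfg["output"]
    features.write.mode("overwrite").format(out["fmt"]).save(f"{out['dir']}/features")
    report.write.mode("overwrite").format(out["fmt"]).save(f"{out['dir']}/report")


if __name__ == "__main__":
    main()
```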
- Java/Spark: ensure Java 11/17 is installed and `JAVA_HOME` is set.
- Windows: `winutils` is optional for local mode; if used, place it under `D:\hadoop\bin` and add it to `PATH`.
- Performance: tune `spark.sql.shuffle.partitions` via config.
- Spark ML step (feature pipeline + logistic regression on `Churn`); a possible shape is sketched below
- CI (GitHub Actions): smoke `pytest` + linters
- Docker image for offline runs
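This step is not implemented yet; a minimal sketch of what the planned ML stage could look like, using an illustrative subset of the engineered features:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Target: Churn ("Yes"/"No") -> numeric label
label_indexer = StringIndexer(inputCol="Churn", outputCol="label")

# Assemble a few engineered numeric features (illustrative subset)
assembler = VectorAssembler(
    inputCols=["MonthlyCharges_log1p", "TotalCharges_log1p",
               "MCharges_per_Tenure", "HasFiber"],
    outputCol="features",
    handleInvalid="skip",
)

lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[label_indexer, assembler, lr])
# model = pipeline.fit(train_df)
```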
 
MIT — see LICENSE.
Dataset: IBM Telco Customer Churn (educational sample).