ghfri-code/Data-Preprocessing


FIFA 2019 Data Processing & Exploration

This project demonstrates a complete data preprocessing pipeline for the FIFA 2019 player dataset, including cleaning, feature engineering, encoding, outlier handling, and basic visualization.
The notebook prepares the dataset for downstream machine learning tasks such as player performance prediction or clustering.


🧩 Project Overview

The main objectives of this notebook are:

  • Clean and standardize raw FIFA player data
  • Handle missing values and inconsistent formats
  • Convert textual and categorical features into numerical form
  • Explore feature distributions, correlations, and outliers
  • Prepare a structured dataset ready for modeling
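
As a rough illustration of the missing-value and categorical-conversion objectives, the sketch below applies median imputation and one-hot encoding, the techniques named under Key Steps, to a toy frame. The column names and values are made up for the example and are not taken from the notebook; the `"Unknown"` placeholder is likewise an assumption about how missing categories might be filled.

```python
import pandas as pd

# Toy frame; column names and values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "Age": [31, 33, None, 27],
    "Preferred Foot": ["Left", "Right", "Right", None],
})

# Numeric column: fill gaps with the median, which is less sensitive
# to extreme values than the mean.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill gaps with an explicit placeholder (an assumption;
# dropping the rows is the alternative the README mentions).
df["Preferred Foot"] = df["Preferred Foot"].fillna("Unknown")

# One-hot encode: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Preferred Foot"], dtype=int)
print(encoded)
```

Filling before encoding keeps missing categories visible as their own indicator column instead of silently producing all-zero rows.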

βš™οΈ Key Steps Performed

  1. Loading and Inspecting Data

    • Checked dataset shape, missing values, and data types.
    • Identified key player attributes for numeric and categorical processing.
  2. Data Cleaning & Conversion

    • Converted monetary (Value, Wage, Release Clause) and physical (Height, Weight) attributes into numeric formats.
    • Normalized inconsistent units and symbols (€, K, M, etc.).
  3. Handling Missing Values

    • Applied mean/median imputation for numeric features.
    • Filled or dropped missing categorical values when appropriate.
  4. Feature Engineering

    • Created grouped attributes (e.g., attacking, defending, passing).
    • Binned continuous variables (like height) for interpretability.
  5. Encoding Categorical Variables

    • Applied one-hot encoding to convert categories into numerical features.
  6. Outlier Detection & Treatment

    • Identified outliers using Z-score and IQR methods.
    • Visualized outliers with histograms and boxplots.
    • Marked or removed extreme values depending on their impact.
  7. Data Visualization

    • Used matplotlib and seaborn for correlation plots, distributions, and pairplots (with sampling to reduce output size).
  8. Exporting Clean Dataset

    • Saved the final processed dataset as:
      FIFA-2019-processed.csv
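
The conversion in step 2 can be sketched with a few small parsers. This is a minimal, dependency-free sketch, not the notebook's actual code: the function names are hypothetical, and the raw formats (`€110.5M`, `5'7`, `159lbs`) are assumptions about how the source columns look.

```python
import re

# Multipliers for the suffixes mentioned in step 2 (K = thousand, M = million).
_SUFFIX = {"K": 1_000, "M": 1_000_000}


def parse_money(text):
    """Convert a string such as '€110.5M' or '€565K' to a float in euros."""
    match = re.fullmatch(r"€?\s*(\d+(?:\.\d+)?)\s*([KM]?)", str(text).strip())
    if match is None:
        return None  # leave missing or unparseable values for the imputation step
    number, suffix = match.groups()
    return float(number) * _SUFFIX.get(suffix, 1)


def parse_height_cm(text):
    """Convert an assumed feet'inches string such as "5'7" to centimetres."""
    match = re.fullmatch(r"(\d+)'(\d+)", str(text).strip())
    if match is None:
        return None
    feet, inches = map(int, match.groups())
    return round((feet * 12 + inches) * 2.54, 1)


def parse_weight_kg(text):
    """Convert an assumed pounds string such as '159lbs' to kilograms."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*lbs", str(text).strip())
    if match is None:
        return None
    return round(float(match.group(1)) * 0.45359237, 1)


print(parse_money("€110.5M"), parse_height_cm("5'7"), parse_weight_kg("159lbs"))
```

With pandas, each parser could be applied column-wise, for example `df["Value"].map(parse_money)`; values that fail to parse come back as `None` and can then be handled in the missing-value step.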

🧠 Tools & Libraries

  • Python 3.x
  • pandas, numpy: data cleaning and manipulation
  • matplotlib, seaborn: visualization
  • scikit-learn: encoding and scaling
  • scipy: robust statistical measures
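
To make the outlier step concrete, here is a small sketch of the two rules named in step 6, Z-score and IQR. It uses only the standard library so it runs anywhere; in the notebook's stack the usual equivalents would be `scipy.stats.zscore` and pandas' `quantile`. The function names, the sample values, and the thresholds (1.5 × IQR, |z| > 3) are conventional defaults assumed for illustration, not values confirmed by the notebook.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" uses linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [False] * len(values)  # constant column: nothing to flag
    return [abs((v - mean) / std) > threshold for v in values]


# Hypothetical wage-like values with one extreme entry.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 60]
print(iqr_outliers(sample))
print(zscore_outliers(sample, threshold=2.5))
```

The IQR rule depends only on quartiles, so a single extreme value barely shifts its fences, whereas the Z-score rule uses the mean and standard deviation, which the extreme value itself inflates. That difference is one reason to check both before removing rows.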

📈 Next Steps

  • Feature selection and dimensionality reduction (VIF, PCA)
  • Train/test split and model building
  • Model evaluation (regression or classification)
  • Feature importance analysis using SHAP or permutation importance

πŸ—‚οΈ Files

  • FIFA2019_Preprocessing.ipynb: main notebook containing all preprocessing steps and explanations
  • FIFA-2019-processed.csv: cleaned dataset (generated after running the notebook)

About

Data cleaning, preprocessing, and feature engineering on FIFA 2019 player dataset
