This Jupyter notebook conducts an Exploratory Data Analysis (EDA) on the "AI Job Dataset" to uncover insights, patterns, and relationships within the data. The primary goals are to:
- Understand the structure and characteristics of the dataset.
- Identify key trends, distributions, and correlations in AI-related job postings.
- Visualize relationships between variables, such as experience level and years of experience.
- Prepare the data for potential future analysis or modeling by cleaning and transforming it.
- Answer specific questions about the dataset, such as the average years of experience by experience level.
The notebook serves as an educational tool for data analysts and scientists to practice EDA techniques and gain insights into the AI job market.
- Source: The dataset is loaded from `C:\Users\suryansh\Downloads\archive\ai_job_dataset.csv`.
- Description: Contains 15,000 job postings related to AI roles, with details on job titles, salaries, experience levels, employment types, company locations, and more.
- Columns:
  - `job_id`: Unique identifier for each job posting.
  - `job_title`: Title of the job (e.g., AI Research Scientist, Data Analyst).
  - `salary_usd`: Salary in USD (or another currency as specified).
  - `salary_currency`: Currency of the salary.
  - `experience_level`: Level of experience (e.g., EN for Entry, SE for Senior).
  - `employment_type`: Type of employment (e.g., FT for Full-Time, PT for Part-Time).
  - `company_location`: Location of the company.
  - `company_size`: Size of the company (S, M, L).
  - `employee_residence`: Residence of the employee.
  - `remote_ratio`: Percentage of remote work allowed.
  - `required_skills`: List of required skills.
  - `education_required`: Required education level.
  - `years_experience`: Years of experience required.
  - `industry`: Industry of the job.
  - `posting_date`: Date the job was posted.
  - `application_deadline`: Application deadline.
  - `job_description_length`: Length of the job description.
  - `benefits_score`: Score representing job benefits.
  - `company_name`: Name of the company.
The notebook is organized into the following steps:
- Imports and Reading Data:
  - Imports the necessary Python libraries (`pandas`, `numpy`, `matplotlib`, `seaborn`).
  - Loads the dataset into a pandas DataFrame (sketch below).
- Data Understanding:
  - Examines the DataFrame's shape, head, tail, data types, and descriptive statistics (sketch below).
- Data Preparation:
  - Renames columns for consistency (e.g., `Years_Experience` to `years_experience`).
  - Drops irrelevant columns and duplicates.
  - Handles missing values and creates new features if needed (sketch below).
- Feature Understanding:
  - Conducts univariate analysis using value counts and visualizations such as KDE plots and boxplots to explore individual feature distributions (sketch below).
- Feature Relationships:
  - Performs bivariate analysis with scatter plots, pairplots, and correlation heatmaps to identify relationships between variables (sketch below).
- Asking a Question:
  - Analyzes the average years of experience by experience level and visualizes it with a horizontal bar plot (sketch below).
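The sketches below illustrate each step in order. They assume the column names listed in the dataset description above; specific choices (the file path, fill strategies, which columns get plotted) are illustrative rather than the notebook's exact code. First, imports and loading:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a DataFrame; adjust the path for your machine.
df = pd.read_csv(r"C:\Users\suryansh\Downloads\archive\ai_job_dataset.csv")
```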
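A minimal sketch of the Data Understanding checks. In a notebook, each expression would typically sit in its own cell so its output displays; here they are wrapped in `print` for clarity.

```python
# Shape should match the dataset description: 15,000 rows, 19 columns.
print(df.shape)

# Peek at the data and its types.
print(df.head())
print(df.tail())
print(df.dtypes)

# Descriptive statistics for the numeric columns.
print(df.describe())
```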
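A sketch of the Data Preparation step. The snake_case rename, the median fill for `benefits_score`, and the date parsing are illustrative assumptions, not necessarily the notebook's exact choices.

```python
# Normalize column names to snake_case (e.g., Years_Experience -> years_experience).
df.columns = df.columns.str.strip().str.lower()

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Count missing values per column, then fill one numeric column as an example
# (median fill is an assumption; the right strategy depends on the column).
print(df.isna().sum())
df["benefits_score"] = df["benefits_score"].fillna(df["benefits_score"].median())

# Parse dates so time-based features can be derived later.
df["posting_date"] = pd.to_datetime(df["posting_date"])
```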
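Univariate sketches for Feature Understanding, using the plot types the step names; which columns to plot is an assumption based on the schema above.

```python
# Bar chart of experience-level frequencies.
df["experience_level"].value_counts().plot(kind="bar", title="Experience Level Counts")
plt.show()

# KDE plot of the salary distribution.
sns.kdeplot(data=df, x="salary_usd", fill=True)
plt.title("Salary (USD) Distribution")
plt.show()

# Boxplot to surface outliers in required years of experience.
sns.boxplot(data=df, x="years_experience")
plt.show()
```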
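Bivariate sketches for Feature Relationships; the numeric subset chosen for the pairplot is an illustrative assumption.

```python
# Scatter plot: years of experience vs. salary.
sns.scatterplot(data=df, x="years_experience", y="salary_usd", alpha=0.3)
plt.show()

# Pairplot over a small numeric subset.
sns.pairplot(df[["salary_usd", "years_experience", "benefits_score"]])
plt.show()

# Correlation heatmap restricted to numeric columns.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```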
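The question step reduces to a groupby-mean followed by a horizontal bar plot:

```python
# Average years of experience per experience level, sorted for readability.
avg_exp = (
    df.groupby("experience_level")["years_experience"]
      .mean()
      .sort_values()
)
avg_exp.plot(kind="barh", title="Average Years of Experience by Experience Level")
plt.xlabel("Years of experience")
plt.show()
```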
To run the notebook, ensure the following are installed:
- Python: Version 3.x
- Libraries:
  - `pandas`: for data manipulation and analysis.
  - `numpy`: for numerical operations.
  - `matplotlib`: for creating visualizations.
  - `seaborn`: for enhanced statistical visualizations.
Install dependencies using:

```bash
pip install pandas numpy matplotlib seaborn
```