Advanced Text Analysis using NLP: Sentiment Analysis, Named Entity Recognition (NER) & Topic Modeling (LDA)
- This project analyzes a dataset of articles stored in
data/Text.csvto uncover insights through various text analysis techniques: Word Cloud, Sentiment Analysis, Named Entity Recognition (NER), and Latent Dirichlet Allocation (LDA) Topic Modeling. The analysis was performed using Python in a Jupyter Notebook environment with libraries likepandas,spacy,textblob,plotly,wordcloud, andscikit-learn.
The dataset (data/Text.csv) contains articles with titles and text content. A preview of the first few rows:
| Article (Excerpt) | Title |
|---|---|
| Data analysis is the process of inspecting and... | Best Books to Learn Data Analysis |
| The performance of a machine learning algorith... | Assumptions of Machine Learning Algorithms |
| You must have seen the news divided into categ... | News Classification with Machine Learning |
| When there are only two classes in a classific... | Multiclass Classification Algorithms in Machine... |
| The Multinomial Naive Bayes is one of the vari... | Multinomial Naive Bayes in Machine Learning |
- Encoding: Loaded with
latin-1encoding. - Size: Contains 34 articles (based on sentiment analysis count).
A word cloud was generated to visualize frequent terms in article titles.

- Key Words:
recorder,frameworks,stock,apis,multinomial,values,resources,apple,deep,computer,networks,assumptions,bayes,prediction,analysis,best,health,applications,news,machine,learning,clustering,insurance,introduction,learn,classification,algorithm,python,data,sentiment. - Insights:
- Dominant terms include
machine,learning,data,python, andalgorithm, reflecting a focus on machine learning and data science. - Specific terms like
multinomial,bayes, andclusteringsuggest technical content.
- Dominant terms include
Sentiment analysis was performed on the Article column using TextBlob, with polarity scores categorized as Positive (>0.1), Neutral (-0.1 to 0.1), or Negative (<-0.1).
-
Sentiment Summary: count 34.000000 mean 0.160138 (slightly positive) std 0.296732 min -0.800000 25% 0.000000 50% 0.134464 75% 0.332299 max 0.666667
-
Category Distribution: Positive 18 (52.9%) Neutral 11 (32.4%) Negative 5 (14.7%)
- Visualization Insights (Plotly Histogram with Rug Plot):
- Positive (Green): Peaks in the 0.2 to 0.6 range, indicating many articles have a positive tone.
- Neutral (Purple): Concentrated around 0, forming a significant portion.
- Negative (Red): Fewer articles, mostly between -0.1 and -0.2, with rare outliers near -1.
- Outliers: Rug plot shows a few strongly positive (near 1) and negative (near -1) articles.
- Conclusion: The articles are predominantly positive or neutral, with a mean sentiment of 0.16, suggesting an optimistic or informative tone.
NER was applied using spaCy to identify and count named entities in the articles.
- Top Individual Entities:
today: Most frequent, likely a temporal reference.Machine Learning: Second most frequent, highlighting its prominence.- Others:
one(twice),September,Instagram,Pfizer,Apple,the K-Means,Data Scientist. - Insights:
- Temporal/Numerical:
today,September, andonesuggest time and quantity references. - Companies/Products:
Instagram,Pfizer,Appleindicate mentions of specific entities. - Technical Terms:
Machine Learning,the K-Means,Data Scientistpoint to data science and ML focus. - Frequency Trend: Visualized with a 'Viridis' color scale (yellow for high frequency, purple for low), showing a steep drop-off after
today.
LDA was used to identify 5 topics from the preprocessed article text, with top words extracted for each.
- Topics and Top Words:
- Topic 1:
learn, python, insurance, application, machine, want, algorithm, assumption, tutorial, science- Focus: Learning machine learning with Python, possibly tutorials or assumptions.
- Topic 2:
algorithm, python, learning, machine, classification, api, base, clustering, bayes, datum- Focus: ML algorithms (classification, clustering, Bayes) and Python.
- Topic 3:
analysis, sentiment, datum, want, learn, people, python, data, analyze, today- Focus: Data and sentiment analysis with Python.
- Topic 4:
news, category, computer, learning, machine, website, good, know, popular, task- Focus: News categorization and ML applications.
- Topic 5:
learning, machine, clustering, algorithm, stock, learn, language, deep, python, price- Focus: Advanced ML (clustering, deep learning) and applications like stock prices.
- Insights:
- Overlap in terms like
machine,learning,python, andalgorithmacross topics, indicating a cohesive ML theme. - Diverse applications: insurance, news, sentiment, stock prices.
- Visualized with a 'Viridis' bar chart, showing top words per topic.
- Theme: The articles center on machine learning and data science, with frequent mentions of Python, algorithms, and applications like classification and clustering.
- Tone: Mostly positive (52.9%) or neutral (32.4%), with a mean sentiment of 0.16.
- Entities:
todayandMachine Learningdominate, alongside companies (e.g.,Apple) and technical terms (e.g.,Data Scientist). - Topics: Five distinct yet overlapping topics highlight learning, analysis, and practical ML applications.
- Tools: Python with
pandas,spacy,textblob,wordcloud,plotly,scikit-learn. - Preprocessing: Lemmatization, stopword/punctuation/digit removal for LDA; raw text for sentiment and NER.
- Visualizations: Word Cloud, Plotly histograms/bars with 'Viridis' color scale.
- Setup: Install dependencies (
pip install pandas spacy textblob wordcloud plotly scikit-learn) and spaCy model (python -m spacy download en_core_web_sm). - Run: Execute in Jupyter Notebook with
data/Text.csvin thedata/directory. - Explore: Interact with Plotly charts (zoom, hover) and review printed outputs.
- Expand Dataset: Analyze more articles for broader insights.
- Refine Topics: Adjust LDA
n_componentsor preprocessing for sharper topics. - Sentiment Granularity: Explore subjectivity alongside polarity.
- NER Enhancement: Filter entity types (e.g., only PERSON, ORG) for specific focus.
Author: [Pavan Yellathakota]
Date: FEB 2025
You can reach out to me through the following channels:
- Email: pavanyellathakota@gmail.com
- LinkedIn: Pavan Yellathakota
For more projects and resources, check out:
- GitHub: Pavan Yellathakota
- Portfolio: pye.pages.dev


