The problem

Transformer models, among the best-performing NLP architectures, are typically applied to texts that revolve around one primary topic. How effective are they on sparse, heterogeneous text? This thesis compares a family of transformer models to legacy methods applied to user clustering. As a byproduct, it also presents a preprocessing method that improves information capture in this setting by 20%.

The outcome

Transformer-powered clustering (BERTopic) is no better than legacy clustering for this problem; worse, BERTopic's own clustering algorithm may fail to capture relationships in the data correctly. Filtering tweets by cosine similarity can improve information capture by 20%, as long as embeddings are created at the sentence level. However, all tested methods hit an accuracy ceiling of 62-64%.

Structure

Data

The "Data" .zip archive contains 3 datasets:

- ExtractedTweets: the data downloaded from the Kaggle source ([43] in the thesis).
- Dataset_versions: abridged versions of the original dataset generated through CSF.
- Dataset_random_selection: a single abridged version of the dataset generated through random sampling.

Tweet processing

Both notebooks below reduce the volume of data to <= 256 tokens (the input limit of the embedding model 'all-MiniLM-L6-v2').

"CSF" ("Contextual Similarity Filtering") processes the input dataset "ExtractedTweets" into "Dataset_versions" by filtering out the least similar tweets. It produces a CSV file with different volumes of data corresponding to different filtering thresholds.
"Random_selection" generates a random sample from "ExtractedTweets" and serves as a baseline for checking the effectiveness of CSF. It generates a CSV file, "Dataset_random_selection".
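
The filtering idea and its random baseline can be sketched as follows. This is a minimal illustration, not the code in the notebooks. It assumes each tweet already has a sentence-level embedding (for example from 'all-MiniLM-L6-v2') and that "least similar" is measured against the mean embedding of a user's tweets. That reference point, the function names `csf_filter` and `random_baseline`, and the toy 2-D vectors are all assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def csf_filter(tweets, embeddings, threshold):
    """Keep tweets whose embedding is at least `threshold`-similar to the mean embedding.

    Assumption: the mean embedding stands in for the user's overall context.
    """
    if not tweets:
        return []
    dim = len(embeddings[0])
    centroid = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    return [t for t, vec in zip(tweets, embeddings) if cosine(vec, centroid) >= threshold]


def random_baseline(tweets, k, seed=0):
    """Draw up to `k` tweets uniformly at random; seeded so the sample is reproducible."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(k, len(tweets)))


# Toy data: two tweets on a shared theme and one off-topic tweet.
tweets = ["tax bill vote", "budget vote today", "happy birthday mom"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]]

kept = csf_filter(tweets, embeddings, threshold=0.7)
baseline = random_baseline(tweets, k=len(kept))
print(kept)
print(baseline)
```

With these toy vectors, a threshold of 0.7 keeps the first two tweets and drops the off-topic one. Drawing the baseline at the same size as the filtered set is one way to make the two comparable.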

Clustering

Clustering-related analysis (both embedding-based and BERTopic-based) is gathered in "Clustering". It uses the DBCV file, credited to [30], and two datasets: "Dataset_random_selection" and "Dataset_versions".

Results

Legacy clustering on SBERT embeddings:

Cluster accuracy for unprocessed data

VS

Cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions").

Clustering with BERTopic:

Cluster accuracy for unprocessed data

VS

Cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions").
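
This README does not state how cluster accuracy is computed. One common way to score clusters against known labels is majority-vote accuracy, sketched below under that assumption: each cluster predicts its most frequent true label, and points marked as noise count as errors. The function name `cluster_accuracy`, the noise id `-1` (the convention used by HDBSCAN-style clusterers), and the toy labels are illustrative.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, true_labels, noise_id=-1):
    """Majority-vote accuracy: each cluster predicts its most common true label.

    Points assigned to `noise_id` are treated as misclassified.
    """
    if not true_labels:
        return 0.0
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)
    correct = 0
    for cid, labels in by_cluster.items():
        if cid == noise_id:
            continue  # noise points never count as correct
        correct += Counter(labels).most_common(1)[0][1]
    return correct / len(true_labels)


# Toy example: six users, two clusters, one noise point.
cluster_ids = [0, 0, 0, 1, 1, -1]
true_labels = ["Democrat", "Democrat", "Republican", "Republican", "Republican", "Democrat"]
print(cluster_accuracy(cluster_ids, true_labels))
```

Here four of the six users fall in a cluster whose majority label matches their own, so the score is 4/6.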

References

[30] D. Moulavi, P. A. Jaskowiak, R. J. Campello, A. Zimek, and J. Sander, "Density-based clustering validation," Proceedings of the 2014 SIAM International Conference on Data Mining, 2014. doi:10.1137/1.9781611973440.96

[43] K. Pastor, "Democrat Vs. Republican Tweets," Kaggle, 2018. [Online]. Available: https://www.kaggle.com/datasets/kapastor/democratvsrepublicantweets/data
