Transformer models, among the strongest NLP architectures, are typically applied to texts that revolve around one primary topic. How effective are they on sparse, heterogeneous text? This thesis compares a family of transformer models with legacy methods applied to user clustering. As a byproduct, it also presents a preprocessing method that improves information capture in this setting by 20%.
Transformer-powered clustering (BERTopic) is no better than legacy clustering for this problem; worse, BERTopic's own clustering algorithm may fail to capture relationships in the data correctly. Filtering tweets by cosine similarity can improve information capture by 20%, as long as embeddings are created at the sentence level. However, all tested methods hit an accuracy ceiling of 62-64%.
The .zip "Data" contains 3 datasets:
- ExtractedTweets, housing data downloaded from the Kaggle source ([43] in the thesis).
- Dataset_versions, housing abridged versions of the original dataset generated through CSF.
- Dataset_random_selection, housing a single abridged version of the dataset generated through random sampling.
Reducing the volume of data to <= 256 tokens (embedding model "all-MiniLM-L6-v2"):
- "CSF" stands for "Contextual Similarity Filtering". It processes the input dataset "ExtractedTweets" into "Dataset_versions" by filtering out the least similar tweets. Contains a csv with different volumes of data corresponding to different filtering thresholds.
- "Random_selection" generates a random sample from "ExtractedTweets". It serves as a baseline for checking the effectiveness of CSF. Generates a csv file "Dataset_random_selection".
Clustering-related analysis (both embedding- and BERTopic-based) is gathered in "Clustering". It makes use of the DBCV file, credited to [30], and of two datasets: "Dataset_random_selection" and "Dataset_versions".
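Both datasets used here come out of the filtering and sampling steps listed earlier. As a rough illustration of how the two differ, here is a minimal sketch, not the code shipped in "CSF" or "Random_selection". It assumes that "least similar" is measured by cosine similarity against the mean embedding of a user's tweets, and that a threshold such as 0.7 is a lower bound on that similarity. The names `cosine`, `csf_filter` and `random_selection` are hypothetical, and the toy 2-d vectors stand in for real sentence embeddings.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def csf_filter(tweets, embeddings, threshold):
    """Keep tweets whose embedding is at least `threshold`-similar to the centroid.

    Assumption: the reference point is the mean embedding of the user's tweets.
    """
    if not tweets:
        return []
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return [t for t, e in zip(tweets, embeddings) if cosine(e, centroid) >= threshold]


def random_selection(tweets, k, seed=0):
    """Baseline: keep k tweets chosen uniformly at random, in original order."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(tweets)), min(k, len(tweets))))
    return [tweets[i] for i in keep]


# Toy 2-d vectors standing in for real sentence embeddings (assumption).
tweets = ["tax bill vote", "tax reform debate", "happy birthday mom"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]

filtered = csf_filter(tweets, embeddings, threshold=0.7)
baseline = random_selection(tweets, k=len(filtered))
```

In this toy case the off-topic tweet falls below the threshold and is dropped, while the random baseline keeps the same number of tweets regardless of content, which is what makes it a fair comparison point for the filter.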
Cluster accuracy for unprocessed data vs. cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions").
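One common way to score clusters against known labels is majority-vote accuracy: each cluster is assigned the label most frequent among its members, and accuracy is the share of items that match their cluster's label. The sketch below assumes this definition, which may differ from the one used in "Clustering". The function name `cluster_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, true_labels):
    """Share of items whose label matches the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)
    # For each cluster, count the members that carry its most frequent label.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in by_cluster.values())
    return correct / len(true_labels) if true_labels else 0.0


# Toy example: two clusters of four users each, with party labels "D" / "R".
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
parties = ["D", "D", "D", "R", "R", "R", "R", "D"]
acc = cluster_accuracy(clusters, parties)  # 6 of 8 users match -> 0.75
```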

[30] D. Moulavi, P. A. Jaskowiak, R. J. Campello, A. Zimek, and J. Sander, "Density-based clustering validation," Proceedings of the 2014 SIAM International Conference on Data Mining, 2014. doi:10.1137/1.9781611973440.96
[43] K. Pastor, "Democrat Vs. Republican Tweets," Kaggle, 2018. [Online]. Available: https://www.kaggle.com/datasets/kapastor/democratvsrepublicantweets/data