Skip to content

nolnolon/User-Clustering-with-BERT-Models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The problem

One of the best NLP architectures, transformer models, are typically applied to texts that revolve around one primary topic. How effective are they on sparse, heterogeneous text? This thesis compares a family of transformer models to legacy methods in application to user clustering. As a byproduct, it also presents a preprocessing method that improves information capture in this setting by 20%.

The outcome

Transfomer-powered clustering (BERTopic) is no better than legacy clustering for this problem; even worse, BERTopic's own clustering algorithm may fail to capture data relationships right. Filtering tweets based on cosine similarity can improve information capture by 20% as long as embeddings are created on a sentense level. However, all tested methods hit an accuracy ceiling of 62-64%.

Structure

Data

3 datasets in .zip "Data":

  • ExtractedTweets, housing data downloaded from the Kaggle source ([43] in the thesis).
  • Dataset_versions, housing abridged versions of the original dataset generated through CSF.
Data - Dataset_random_selection, housing a single abridged version of the dataset generated through random sampling.

Tweet processing

Reducing volume of data to <= 256 tokens (embedding model ’all-MiniLM-L6-v2’).

  • "CSF" "Contextual Similarity Filtering". Processes the input dataset "ExtractedTweets" into "Dataset_versions" by filtering out least similar tweets. Contains a csv with different volumes of data corresponding to different filtering thresholds.

  • "Random_selection" generates a random sample from "ExtractedTweets" Serves as a baseline to check for effectiveness of CSF. Generates a csv file "Dataset_random_selection".

Clustering

Clustering-related analysis (both embeddings- and BERTopic-based) is gathered in "Clustering". Makes use of the DBCV file, credited to [30]. Makes use of two datasets: "Dataset_random_selection" and "Dataset_versions".

Results

Results table

Legacy clustering on SBERT embeddings:

Cluster accuracy for unprocessed data random_agglo_clusters

VS

Cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions"). selected_agglo_clusters

Clustering with BERTopic:

Cluster accuracy for unprocessed data random_single_BERTopic_NN_clusters

Cluster accuracy for CSF-processed data (threshold_0.7 from "Dataset_versions"). select_single_BERTopic_NN_clusters

References

[30] D. Moulavi, P. A. Jaskowiak, R. J. Campello, A. Zimek, and J. Sander, “Density-based clustering validation,” Proceedings of the 2014 SIAM International Conference on Data Mining, 2014. doi:10.1137/1.9781611973440.96

[43] K. Pastor, 'Democrat Vs. Republican Tweets', Kaggle, 2018. [Online]. Available: https://www.kaggle.com/datasets/kapastor/democratvsrepublicantweets/data

Releases

No releases published

Packages

No packages published