End-to-end pipeline that extracts, transforms and loads documents, then embeds, searches and ranks them against a query.
Uses NLTK for document processing and TF-IDF to find the top-k documents most relevant to a query, based on cosine similarity in the vector space model (VSM).
This is a simple query processing pipeline for text documents, built in Python. It uses NLTK for text preprocessing and TfidfTransformer from scikit-learn to embed documents into a vector space. Cosine similarity is used to find the documents most relevant to a given query.
The dataset is the Movie Scripts dataset, which includes over one thousand English movie scripts. Each document contains a movie name and its full script. The dataset provides diverse, rich natural language text suitable for retrieval tasks.
Each document is converted to lowercase and cleaned by removing punctuation, HTML tags, and non-alphabetic and special characters. NLTK is used for tokenization, lemmatization and stopword removal. This step reduces noise and improves matching accuracy.
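The preprocessing step could look roughly like the sketch below. The function names `clean_text` and `normalize` and the regular expressions are assumptions for illustration, not the project's actual code. `normalize` also assumes the NLTK tokenizer, stopword and WordNet resources have already been fetched with `nltk.download`.

```python
import re

# Hypothetical patterns for the cleaning step described above
TAG_RE = re.compile(r"<[^>]+>")         # HTML tags
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything that is not a lowercase letter or whitespace
SPACE_RE = re.compile(r"\s+")           # runs of whitespace


def clean_text(raw):
    """Lowercase the text, then strip HTML tags, punctuation and non-alphabetic characters."""
    text = raw.lower()
    text = TAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def normalize(text):
    """Tokenize, lemmatize and drop stopwords with NLTK.

    NLTK is imported lazily so that clean_text works without it.
    Assumes the required NLTK data has been downloaded beforehand.
    """
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in word_tokenize(text)]
    return [lemma for lemma in lemmas if lemma not in stop_words]
```

A document would then pass through `clean_text` first and `normalize` second, and the resulting tokens can be joined back into a string before vectorization.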
Documents are transformed into numerical vectors using CountVectorizer followed by TfidfTransformer. CountVectorizer produces raw term counts, and TfidfTransformer reweights them into TF-IDF representations that reflect each word's importance within the corpus.
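As a minimal sketch of this two-stage vectorization, the example below runs scikit-learn's `CountVectorizer` and `TfidfTransformer` with their default settings on a tiny made-up corpus. The three sample strings are placeholders, not data from the movie scripts dataset, and the project may configure the vectorizer differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder corpus standing in for the preprocessed movie scripts
docs = [
    "don corleone family business",
    "family dinner kitchen",
    "space ship crew alien",
]

# Stage 1: raw term counts, one row per document and one column per vocabulary term
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Stage 2: reweight the counts by inverse document frequency (rows are L2-normalized by default)
tfidf_transformer = TfidfTransformer()
doc_vectors = tfidf_transformer.fit_transform(counts)

print(doc_vectors.shape)
```

A term that occurs in several documents, such as "family" here, gets a lower weight than a term that occurs in only one, such as "don".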
The query is processed in the same way as the documents and embedded in the same TF-IDF space. Cosine similarity is then computed between the query vector and all document vectors. The top k (default 5) documents with the highest similarity scores are returned as the most relevant results.
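The ranking step might be sketched as follows. The function name `rank_documents` and the sample corpus are hypothetical. The key point is that the query goes through `transform` on the already fitted vectorizer and transformer, so it lands in the same TF-IDF space as the documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity


def rank_documents(query, docs, names, k=5):
    """Return the k (name, score) pairs with the highest cosine similarity to the query."""
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    doc_vectors = transformer.fit_transform(vectorizer.fit_transform(docs))

    # Reuse the fitted vocabulary and IDF weights; the query is never used for fitting
    query_vector = transformer.transform(vectorizer.transform([query]))

    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    best = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(names[i], float(scores[i])) for i in best]


# Placeholder corpus and file names for illustration only
names = ["a.txt", "b.txt", "c.txt"]
docs = [
    "don corleone family business",
    "family dinner kitchen",
    "space ship crew alien",
]

for name, score in rank_documents("don corleone", docs, names, k=2):
    print(f"{score:.4f} - {name}")
```

In a real run the vectorizer would be fitted once during the ETL step and reused for every query. Fitting inside the function here only keeps the sketch self-contained.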
The application takes two inputs from the command line: the path to the folder containing the text files, and the user query. It returns the top five files ranked by cosine similarity, along with their similarity scores.
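Reading the text files from the given folder could be done along these lines. The helper name `load_documents` is an assumption for illustration, and the project's own loader may differ, for example in how it handles encodings or nested folders.

```python
from pathlib import Path


def load_documents(folder):
    """Return (file names, file contents) for every .txt file in `folder`, sorted by name."""
    paths = sorted(Path(folder).glob("*.txt"))
    names = [path.name for path in paths]
    # Undecodable bytes are ignored so that one bad file does not stop the run
    texts = [path.read_text(encoding="utf-8", errors="ignore") for path in paths]
    return names, texts
```

The returned names can then be paired with similarity scores to print result lines in the `score - file name` form shown in the sample session.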
input:
Choose:
[1] ETL
[2] Search
[exit] Quit
>>> 1
output:
Start ETL Pipeline...
Creating json from Arrow format: 100%|██████████| 2/2 [00:02<00:00, 1.38s/ba]
Extracted 1172 Documents.
Transformed 100/1172 Documents.
Transformed 200/1172 Documents.
Transformed 300/1172 Documents.
Transformed 400/1172 Documents.
Transformed 500/1172 Documents.
Transformed 600/1172 Documents.
Transformed 700/1172 Documents.
Transformed 800/1172 Documents.
Transformed 900/1172 Documents.
Transformed 1000/1172 Documents.
Transformed 1100/1172 Documents.
Transformed 1172/1172 Documents.
Loaded 1172 Documents.
input:
Choose:
[1] ETL
[2] Search
[exit] Quit
>>> 2
>>> movies
>>> don corleone
output:
Searching...
Top 5 results for: 'don corleone'
0.5320 - Godfather.txt
0.1768 - Godfather Part II.txt
0.0664 - The Godfather Part III.txt
0.0030 - Do The Right Thing.txt
0.0027 - Who's Your Daddy.txt
We demonstrated how to build a simple yet effective information retrieval system using TF-IDF and cosine similarity.
By combining NLTK preprocessing with scikit-learn vectorization tools, the system can process text documents and return relevant search results for a given query.