IsmaelMousa/query-processing-pipeline

Query Processing Pipeline

End-to-end pipeline from extracting, transforming, and loading documents (ETL) to embedding, searching, and ranking based on a query.

Abstract

Uses NLTK for document preprocessing and TF-IDF to find the top-k documents most relevant to a query, based on cosine similarity in the vector space model (VSM).

Introduction

A simple query processing pipeline for text documents, built with Python. It uses NLTK for text preprocessing and scikit-learn's TfidfTransformer to embed documents into a vector space. Cosine similarity is then used to find the documents most relevant to a given query.

Methodology

Dataset

The dataset used is the movie scripts dataset, which includes over one thousand English movie scripts. Each document contains a movie name and its full script. This dataset provides diverse, rich natural-language text suitable for retrieval tasks.

More about the dataset: read

Preprocessing

Each document is converted to lowercase and cleaned by removing punctuation, HTML tags, and non-alphabetical or special characters. NLTK is used for tokenization, lemmatization, and stopword removal. This step reduces noise and improves matching accuracy.
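The cleaning step above can be sketched as follows. This is a minimal stand-in: it uses a plain regex cleaner, a whitespace tokenizer, and an abridged stopword list rather than NLTK's tokenizer, lemmatizer, and full English stopword list.

```python
import re

# Abridged stopword list; the project uses NLTK's full English list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip HTML tags and non-alphabetical characters,
    then tokenize and drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = text.split()                  # simple whitespace tokenizer
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The Godfather makes an offer!</p>"))
# → ['godfather', 'makes', 'offer']
```

In the actual pipeline, NLTK's `WordNetLemmatizer` would additionally reduce tokens to their base forms before stopword removal.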

Embedding

Documents are transformed into numerical vectors using CountVectorizer followed by TfidfTransformer. This converts raw text into TF-IDF-weighted representations based on each word's importance within the corpus.
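A minimal sketch of this two-step vectorization with scikit-learn; the toy documents here are illustrative, not from the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the godfather makes an offer",
    "the offer cannot be refused",
    "a script about a family",
]

# Step 1: raw term counts per document.
counts = CountVectorizer().fit_transform(docs)

# Step 2: reweight counts by inverse document frequency.
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # (number of documents, vocabulary size)
```

The same fitted vectorizer and transformer must later be applied to the query so that it lands in the same TF-IDF space as the documents.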

Search

The query is processed in the same way as the documents and embedded in the same TF-IDF space. Cosine similarity is then computed between the query vector and all document vectors, and the top-k (default 5) documents with the highest similarity scores are returned as the most relevant results.
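The ranking step can be sketched in plain NumPy: normalize the query and document vectors, take dot products to get cosine similarities, and sort. The vectors below are toy values, not real TF-IDF output.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=5):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                          # cosine similarity per document
    order = np.argsort(sims)[::-1][:k]    # indices of the k best scores
    return [(int(i), float(sims[i])) for i in order]

docs = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])
print(top_k(query, docs, k=2))
```

In the real pipeline, scikit-learn's `cosine_similarity` can replace the manual normalization, and the indices map back to file names.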

Usage

The application takes two inputs from the command line: the path to the folder containing the text files, and the user query. It returns the top five files ranked by cosine similarity, along with their similarity scores.

input:

Choose:
[1] ETL
[2] Search
[exit] Quit

>>> 1

output:

Start ETL Pipeline...
Creating json from Arrow format: 100%|██████████| 2/2 [00:02<00:00,  1.38s/ba]
Extracted 1172 Documents.
Transformed 100/1172 Documents.
Transformed 200/1172 Documents.
Transformed 300/1172 Documents.
Transformed 400/1172 Documents.
Transformed 500/1172 Documents.
Transformed 600/1172 Documents.
Transformed 700/1172 Documents.
Transformed 800/1172 Documents.
Transformed 900/1172 Documents.
Transformed 1000/1172 Documents.
Transformed 1100/1172 Documents.
Transformed 1172/1172 Documents.
Loaded 1172 Documents.

input:

Choose:
[1] ETL
[2] Search
[exit] Quit

>>> 2
>>> movies
>>> don corleone

output:

Searching...

Top 5 results for: 'don corleone'

0.5320 - Godfather.txt
0.1768 - Godfather Part II.txt
0.0664 - The Godfather Part III.txt
0.0030 - Do The Right Thing.txt
0.0027 - Who's Your Daddy.txt

More about usage: read

Conclusion

We demonstrated how to build a simple yet effective information retrieval system using TF-IDF and cosine similarity. By combining NLTK preprocessing with scikit-learn vectorization tools, the system can process text documents and return accurate search results for a given query.
