Query Processing Pipeline

An end-to-end pipeline that extracts, transforms, and loads documents, then embeds them and searches and ranks them against a query.

Abstract

Uses NLTK for document preprocessing and TF-IDF weighting to find the top-k documents most relevant to a query, ranked by cosine similarity in the vector space model (VSM).

Introduction

This project builds a simple query processing pipeline for text documents in Python. It uses NLTK for text preprocessing and TfidfTransformer from scikit-learn to embed documents into a vector space. Cosine similarity is used to find the documents most relevant to a given query.

Methodology

Dataset

The dataset is the movie scripts dataset, which includes over one thousand English movie scripts (1,172 documents in the run shown under Usage). Each document contains a movie name and its full script. The dataset provides diverse natural-language text suitable for retrieval tasks.

More about the dataset: see DATASET.md.

Preprocessing

Each document is lowercased and cleaned by removing punctuation, HTML tags, and other non-alphabetic characters. NLTK is then used for tokenization, stopword removal, and lemmatization. This step reduces noise and improves matching between query terms and document terms.
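The cleaning steps described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function name `preprocess`, the regular expressions, and the tiny `STOPWORDS` set are assumptions, a whitespace split stands in for NLTK tokenization, and lemmatization is left out because it needs NLTK's downloaded corpora.

```python
import html
import re

# Tiny illustrative stopword set (an assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <p> or </b>
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # anything that is not a-z or whitespace


def preprocess(text: str) -> list[str]:
    """Strip tags, lowercase, drop non-letters, then remove stopwords."""
    text = TAG_RE.sub(" ", text)         # remove markup first
    text = html.unescape(text).lower()   # decode entities like &amp;
    text = NON_ALPHA_RE.sub(" ", text)   # punctuation, digits, symbols
    tokens = text.split()                # stand-in for NLTK tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("<p>The Godfather makes an offer in 1972 &amp; beyond!</p>"))
```

Without lemmatization, inflected forms such as "makes" pass through unchanged; in the pipeline described here NLTK would reduce them to a base form.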

Embedding

Documents are transformed into numerical vectors using CountVectorizer followed by TfidfTransformer. CountVectorizer builds raw term counts, and TfidfTransformer reweights them so that terms frequent in a document but rare across the corpus receive higher weight.
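To make the weighting concrete, here is a pure-Python sketch of what a count step followed by a TF-IDF step computes, assuming scikit-learn's documented defaults for TfidfTransformer (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization). The helper names `fit_idf` and `tfidf_vector` are hypothetical, and documents are assumed to be already tokenized; the project itself relies on the scikit-learn classes rather than code like this.

```python
import math
from collections import Counter


def fit_idf(docs: list[list[str]]) -> dict[str, float]:
    """Smoothed idf per term: ln((1 + n_docs) / (1 + df)) + 1."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in df.items()
    }


def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse, L2-normalized tf-idf vector; unknown terms are ignored."""
    counts = Counter(tok for tok in tokens if tok in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


docs = [["don", "corleone", "family"], ["family", "dinner"], ["space", "ship"]]
idf = fit_idf(docs)
vector = tfidf_vector(docs[0], idf)
```

In this toy corpus "family" appears in two documents and "don" in one, so "don" ends up with the larger weight in the first document's vector.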

Search

The query is preprocessed in the same way as the documents and embedded in the same TF-IDF space. Cosine similarity is then computed between the query vector and every document vector. The k documents with the highest similarity scores (k defaults to 5) are returned as the most relevant results.
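The ranking step can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: vectors are represented as sparse `term -> weight` dictionaries, and the names `cosine` and `top_k` are hypothetical.

```python
import heapq


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as term -> weight."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = sum(w * w for w in u.values()) ** 0.5
    norm_v = sum(w * w for w in v.values()) ** 0.5
    # An empty vector has no direction, so treat its similarity as 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec, doc_vecs, k=5):
    """Return up to k (score, name) pairs, highest score first."""
    scored = ((cosine(query_vec, vec), name) for name, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)


doc_vecs = {
    "a.txt": {"don": 1.0, "corleone": 1.0},
    "b.txt": {"don": 1.0, "family": 1.0},
    "c.txt": {"space": 1.0},
}
for score, name in top_k({"don": 1.0, "corleone": 1.0}, doc_vecs, k=2):
    print(f"{score:.4f} - {name}")
```

`heapq.nlargest` avoids sorting the whole score list when k is small relative to the number of documents; for a corpus of about a thousand scripts a plain sort would work equally well.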

Usage

The application takes two inputs from the command line: the path to the folder containing the text files and the user query. It returns the top five files ranked by cosine similarity, along with their similarity scores.

input:

Choose:
[1] ETL
[2] Search
[exit] Quit
>>> 1

output:

Start ETL Pipeline...
Creating json from Arrow format: 100%|██████████| 2/2 [00:02<00:00,  1.38s/ba]
Extracted 1172 Documents.
Transformed 100/1172 Documents.
Transformed 200/1172 Documents.
Transformed 300/1172 Documents.
Transformed 400/1172 Documents.
Transformed 500/1172 Documents.
Transformed 600/1172 Documents.
Transformed 700/1172 Documents.
Transformed 800/1172 Documents.
Transformed 900/1172 Documents.
Transformed 1000/1172 Documents.
Transformed 1100/1172 Documents.
Transformed 1172/1172 Documents.
Loaded 1172 Documents.

input:

Choose:
[1] ETL
[2] Search
[exit] Quit
>>> 2
>>> movies
>>> don corleone

output:

Searching...

Top 5 results for: 'don corleone'

0.5320 - Godfather.txt
0.1768 - Godfather Part II.txt
0.0664 - The Godfather Part III.txt
0.0030 - Do The Right Thing.txt
0.0027 - Who's Your Daddy.txt

more about usage: read

Conclusion

This project demonstrates how to build a simple yet effective information retrieval system using TF-IDF and cosine similarity. By combining NLTK preprocessing with scikit-learn vectorization tools, the system can process text documents and return relevant, ranked search results for a given query.
