I made a search engine to rank Wikipedia articles using tf-idf and PageRank.
The search engine has three components:
To make search ranking efficient, I created an inverted index that maps each term to the documents containing it, along with tf-idf scores.
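As a rough illustration of the idea (not the project's actual code, and the exact tokenization and idf formula here are assumptions), an inverted index with tf-idf weights can be built like this:

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Build an inverted index: term -> list of (doc_id, tf-idf) pairs.

    docs: dict mapping doc_id to a list of tokens (hypothetical input format).
    Uses idf = log10(N / df); the real project's formula may differ.
    """
    n = len(docs)
    df = Counter()  # document frequency of each term
    for tokens in docs.values():
        df.update(set(tokens))
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf * math.log10(n / df[term])))
    return index
```

At query time, only the postings lists for the query's terms need to be touched, which is what makes ranking efficient.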
I created a MapReduce server that follows the same structure as the Hadoop Streaming interface. It contains a manager that receives MapReduce jobs and distributes each job among many workers, with fault-tolerance mechanisms and redundancy protocols. This was one of my first experiences with distributed systems and multi-threaded programming.
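In the Hadoop Streaming style, mappers and reducers are plain programs that read and write tab-separated lines, with a shuffle/sort step grouping mapper output by key in between. A minimal sketch of that contract (illustrative only; the input format and the document-frequency job shown here are assumptions, not the project's actual pipeline):

```python
import itertools

def map_phase(lines):
    """Mapper: for each 'doc_id<TAB>text' line, emit one 'term<TAB>doc_id'
    pair per token, mirroring Hadoop Streaming's line-based stdin/stdout."""
    for line in lines:
        doc_id, _, text = line.partition("\t")
        for term in text.split():
            yield f"{term}\t{doc_id}"

def reduce_phase(sorted_pairs):
    """Reducer: input arrives sorted (as after the shuffle/sort step), so
    each term's pairs are contiguous; emit 'term<TAB>document_frequency'."""
    keyfunc = lambda pair: pair.split("\t")[0]
    for term, group in itertools.groupby(sorted_pairs, key=keyfunc):
        doc_ids = {pair.split("\t")[1] for pair in group}
        yield f"{term}\t{len(doc_ids)}"
```

Because each phase only touches streams of lines, the manager can split the input across many worker processes and merge their sorted outputs.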
I created a REST API using Python Flask to provide search results for search queries. The REST API looks up tf-idf scores in the inverted index, computes search result rankings using tf-idf and PageRank, and returns the results in JSON format.
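One common way to combine the two signals is a weighted sum of a document's PageRank and its summed tf-idf relevance for the query terms. This sketch assumes that combination (the weight parameter and the linear mix are my assumptions, not necessarily how the project scored results):

```python
def rank(query_terms, index, pagerank, weight=0.5):
    """Rank documents by weight * PageRank + (1 - weight) * tf-idf relevance.

    index: term -> list of (doc_id, tf-idf) pairs (as in an inverted index).
    pagerank: doc_id -> PageRank score.
    Returns doc_ids sorted from best to worst match.
    """
    relevance = {}
    for term in query_terms:
        for doc_id, tfidf in index.get(term, []):
            relevance[doc_id] = relevance.get(doc_id, 0.0) + tfidf
    return sorted(
        relevance,
        key=lambda d: weight * pagerank.get(d, 0.0) + (1 - weight) * relevance[d],
        reverse=True,
    )
```

Setting `weight=0` ranks purely by query relevance, while `weight=1` ranks purely by link popularity; a middle value blends the two.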
The final component is the search server, the user-facing part of the system. It features a user interface rendered with server-side dynamic pages (Flask, Jinja, HTML, CSS). When the user submits a search query, the back end sends a request to the REST API and renders an HTML page with the search results.
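The flow is: take the query, call the REST API for JSON results, and fill an HTML template with them. A stdlib-only sketch of that flow (the `fetch_json` callable stands in for the HTTP request to the REST API, and the template is a stand-in for the project's Jinja templates; all names here are hypothetical):

```python
from string import Template

RESULT_TEMPLATE = Template('<li><a href="$url">$title</a></li>')

def render_results(query, fetch_json):
    """Server-side rendering sketch: fetch_json(query) is assumed to return
    the REST API's parsed JSON, e.g. {"hits": [{"url": ..., "title": ...}]}."""
    hits = fetch_json(query)["hits"]
    items = "\n".join(
        RESULT_TEMPLATE.substitute(url=hit["url"], title=hit["title"])
        for hit in hits
    )
    return f"<ul>\n{items}\n</ul>"
```

In the real project, Flask's `render_template` with a Jinja template would play the role of `string.Template` here.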
This was a part of a school project, so unfortunately I cannot show my code for this project.
Patrick Halim - 12/26/23
I built a search engine to rank Wikipedia articles using both tf-idf and PageRank. I implemented a custom MapReduce server and pipeline to create a tf-idf inverted index. In the process, I learned about service-based architecture, multi-threading and concurrency, redundancy protocols, and fault tolerance.
Technologies used: Python, Flask, SQLite, JavaScript, HTML, CSS