Search Engine

Overview

I made a search engine to rank Wikipedia articles using tf-idf and PageRank.

The search engine consists of three parts:

  • MapReduce server and pipeline to build the tf-idf inverted index
  • Python Flask server to serve the REST API
  • Search server to deliver the user interface with server-side dynamic pages

MapReduce Server

To make search ranking efficient, I created an inverted index that maps each term to the documents containing it, along with the tf-idf statistics needed for ranking.

I created a MapReduce server that follows the same structure as the Hadoop streaming interface. The server consists of a manager that receives MapReduce jobs and distributes them across many workers, with fault-tolerance mechanisms and redundancy protocols. This was one of my first experiences with distributed systems and multi-threaded programming.
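
As a rough illustration of what one stage of such a pipeline can look like, here is a minimal Hadoop-streaming-style mapper and reducer that count term frequencies per document, the raw material for tf-idf. The file names, input format (tab-separated doc_id and text), and output format are assumptions for this sketch, not the project's actual code.

    # map.py (illustrative): emit "term<TAB>doc_id" for every word in a document
    import re
    import sys

    for line in sys.stdin:
        doc_id, _, text = line.partition("\t")
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            print(f"{term}\t{doc_id.strip()}")

    # reduce.py (illustrative): group by term (streaming input arrives sorted by key)
    # and build one posting list per term: "term<TAB>doc_id:tf doc_id:tf ..."
    import sys
    from itertools import groupby

    for term, group in groupby(sys.stdin, key=lambda line: line.split("\t", 1)[0]):
        counts = {}
        for line in group:
            _, doc_id = line.rstrip("\n").split("\t")
            counts[doc_id] = counts.get(doc_id, 0) + 1
        postings = " ".join(f"{d}:{c}" for d, c in counts.items())
        print(f"{term}\t{postings}")

In a full pipeline, the manager would feed the mapper's sorted output into the reducer, and later stages would fold in document counts and idf values.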

REST API

I created a REST API using Python Flask to provide results for search queries. The REST API looks up query terms in the tf-idf inverted index, computes search result rankings using tf-idf and PageRank, and returns the results in JSON format.
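
A minimal sketch of what such a route can look like, assuming the inverted index and PageRank scores have already been loaded into in-memory dictionaries; the route path, query parameters, data structures, and the particular way the two scores are blended are illustrative assumptions, not the project's actual API.

    # Illustrative Flask REST API: rank documents by tf-idf blended with PageRank
    import math

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Assumed in-memory data; in practice this would come from the MapReduce
    # output and a PageRank file or SQLite database.
    INVERTED_INDEX = {"apple": {"doc1": 3, "doc2": 1}}  # term -> {doc_id: term frequency}
    PAGERANK = {"doc1": 0.7, "doc2": 0.3}               # doc_id -> PageRank score
    NUM_DOCS = 100                                      # assumed corpus size

    @app.route("/api/v1/search")
    def search():
        terms = request.args.get("q", "").lower().split()
        weight = float(request.args.get("w", 0.5))  # PageRank vs. tf-idf blend
        tfidf = {}
        for term in terms:
            postings = INVERTED_INDEX.get(term, {})
            if not postings:
                continue
            idf = math.log(NUM_DOCS / len(postings))
            for doc_id, tf in postings.items():
                tfidf[doc_id] = tfidf.get(doc_id, 0.0) + tf * idf
        ranked = sorted(
            tfidf,
            key=lambda d: weight * PAGERANK.get(d, 0.0) + (1 - weight) * tfidf[d],
            reverse=True,
        )
        return jsonify({"results": [{"doc_id": d, "score": tfidf[d]} for d in ranked]})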

Search Server

The final component is the search server, the user-facing part of the system. It serves a user interface rendered with server-side dynamic pages (Flask, Jinja, HTML, CSS). When a user submits a search query, the back end issues a REST API request and renders the results as an HTML page.
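
A minimal sketch of how that round trip can work, assuming the REST API from the previous section runs at a known URL and that an index.html Jinja template exists; both names are assumptions for illustration, not the project's actual code.

    # Illustrative search server: forward the query to the REST API, render results
    import requests
    from flask import Flask, render_template, request

    app = Flask(__name__)

    SEARCH_API_URL = "http://localhost:8000/api/v1/search"  # assumed API location

    @app.route("/")
    def index():
        query = request.args.get("q", "")
        results = []
        if query:
            # Ask the REST API for ranked results, then render them server-side
            response = requests.get(SEARCH_API_URL, params={"q": query})
            results = response.json().get("results", [])
        return render_template("index.html", query=query, results=results)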

This was a part of a school project, so unfortunately I cannot show my code for this project.

Patrick Halim - 12/26/23

tl;dr

I built a search engine to rank Wikipedia articles using both tf-idf and PageRank. I implemented a custom MapReduce server and pipeline to create a tf-idf inverted index. In the process, I learned about service-based architecture, multi-threading and concurrency, redundancy protocols, and fault tolerance.

Technologies used: Python, Flask, SQLite, JavaScript, HTML, CSS