Tiny Search Engine

A working search engine written in C that crawls and caches web pages, indexes HTML attributes of each page, and answers search queries with ranked pages.

Search Engine Architecture

The architecture is based on the 2001 paper “Searching the Web” by Arasu et al., published by the Association for Computing Machinery.

A schematic describing different components of the search engine design (from Arasu et al.)

Implementation of Search Engine

See my C implementation at https://github.com/srb-private-org/tiny-search-engine (email me for access).

The implementation is broken up into three modules: Crawler, Indexer, and Querier.

Crawler

The Crawler module includes a standalone program, crawler, which crawls the web starting from a “seed” URL, follows the links on each page up to a specified depth, and caches the fetched pages in a specified folder.
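The depth-limited crawl described above can be sketched as a breadth-first traversal over a frontier queue. This is a minimal illustration, not the code from the repository: the names crawl, frontier_item, and link_fn are hypothetical, and demo_links stands in for the real work of fetching a page over HTTP and extracting its links, using a small hard-coded link graph so the sketch runs without network access.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 8

/* One discovered page: its URL and its distance from the seed. */
typedef struct { const char *url; int depth; } frontier_item;

/* Hypothetical callback: stores up to cap outgoing links of url in out
 * and returns how many were stored. A real crawler would fetch and
 * parse the page here. */
typedef size_t (*link_fn)(const char *url, const char *out[], size_t cap);

/* Linear scan of the pages discovered so far (fine for a small sketch). */
static int seen(const frontier_item pages[], size_t n, const char *url)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(pages[i].url, url) == 0) return 1;
    return 0;
}

/* Breadth-first crawl from seed, following links no deeper than max_depth.
 * pages[] doubles as the queue and the record of discovered pages.
 * Returns the number of pages discovered (at most cap). */
size_t crawl(const char *seed, int max_depth, link_fn get_links,
             frontier_item pages[], size_t cap)
{
    size_t head = 0, tail = 0;
    assert(seed != NULL && get_links != NULL);
    if (cap == 0) return 0;
    pages[tail++] = (frontier_item){ seed, 0 };
    while (head < tail) {
        frontier_item cur = pages[head++];
        /* A real crawler would save the page for cur.url to the cache folder here. */
        if (cur.depth >= max_depth) continue;   /* depth limit reached: do not expand */
        const char *links[MAX_LINKS];
        size_t n = get_links(cur.url, links, MAX_LINKS);
        for (size_t i = 0; i < n; i++)
            if (tail < cap && !seen(pages, tail, links[i]))
                pages[tail++] = (frontier_item){ links[i], cur.depth + 1 };
    }
    return tail;
}

/* Stand-in link source over a tiny made-up graph: a->b,c  b->d  c->a,d  d->e. */
size_t demo_links(const char *url, const char *out[], size_t cap)
{
    static const char *graph[][3] = {
        { "a", "b", "c" }, { "b", "d", NULL }, { "c", "a", "d" }, { "d", "e", NULL },
    };
    size_t n = 0;
    for (size_t i = 0; i < sizeof graph / sizeof graph[0]; i++) {
        if (strcmp(graph[i][0], url) != 0) continue;
        for (size_t j = 1; j < 3 && graph[i][j] != NULL && n < cap; j++)
            out[n++] = graph[i][j];
    }
    return n;
}
```

With the demo graph, crawl("a", 1, demo_links, pages, 16) discovers a, b, and c; raising the depth to 2 also reaches d. Tracking the depth alongside each queued URL is what lets the loop stop expanding pages at the limit while still recording them.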

Command line usage
Crawler in action

Indexer

The Indexer module reads the document files produced by the Crawler, builds an index, and writes that index to a specified index file.
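One common way to build such an index is an inverted index that maps each word to a per-document occurrence count. The sketch below assumes that design; the integer document IDs, the fixed-size tables, the function names, and the one-line-per-word output format ("word docID count ...") are all illustrative assumptions rather than details taken from the repository.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 256
#define MAX_DOCS  16
#define WORD_LEN  32

typedef struct {
    char word[WORD_LEN];
    int  counts[MAX_DOCS];      /* counts[d] = occurrences of word in document d */
} index_entry;

typedef struct {
    index_entry entries[MAX_WORDS];
    size_t      nwords;
} inverted_index;

/* Find the entry for word; optionally create it. NULL if absent or table full. */
static index_entry *index_lookup(inverted_index *idx, const char *word, int create)
{
    for (size_t i = 0; i < idx->nwords; i++)
        if (strcmp(idx->entries[i].word, word) == 0) return &idx->entries[i];
    if (!create || idx->nwords == MAX_WORDS) return NULL;
    index_entry *e = &idx->entries[idx->nwords++];
    memset(e, 0, sizeof *e);
    strncpy(e->word, word, WORD_LEN - 1);
    return e;
}

/* Split text into lowercase alphabetic words and count each under doc_id.
 * Words longer than WORD_LEN - 1 characters are truncated. */
void index_add_document(inverted_index *idx, int doc_id, const char *text)
{
    char word[WORD_LEN];
    size_t len = 0;
    assert(doc_id >= 0 && doc_id < MAX_DOCS);
    for (const char *p = text; ; p++) {
        if (isalpha((unsigned char)*p)) {
            if (len < WORD_LEN - 1) word[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) {                      /* a word just ended */
                word[len] = '\0';
                index_entry *e = index_lookup(idx, word, 1);
                if (e != NULL) e->counts[doc_id]++;
                len = 0;
            }
            if (*p == '\0') break;
        }
    }
}

/* Occurrences of word in document doc_id (0 if the word is not indexed). */
int index_count(inverted_index *idx, const char *word, int doc_id)
{
    assert(doc_id >= 0 && doc_id < MAX_DOCS);
    index_entry *e = index_lookup(idx, word, 0);
    return e != NULL ? e->counts[doc_id] : 0;
}

/* Write one line per word in the assumed format: word docID count [docID count]... */
void index_save(const inverted_index *idx, FILE *fp)
{
    for (size_t i = 0; i < idx->nwords; i++) {
        fprintf(fp, "%s", idx->entries[i].word);
        for (int d = 0; d < MAX_DOCS; d++)
            if (idx->entries[i].counts[d] > 0)
                fprintf(fp, " %d %d", d, idx->entries[i].counts[d]);
        fputc('\n', fp);
    }
}
```

A linear word table keeps the sketch short; a hash table keyed by word would be the natural replacement once the vocabulary grows beyond a few hundred entries.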

Command line usage

Querier

The Querier module reads the index file produced by the Indexer and the page files produced by the Crawler, and answers search queries read from stdin.
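The ranking step can be illustrated with a simple scoring rule. The sketch below assumes that every query word must appear in a page (AND semantics), scores a page by the smallest count among the query words, and lists pages best first. That rule, the per-word count table, and the names ranked_doc and rank_and_query are assumptions made for illustration; the text above does not specify how pages are actually ranked.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DOCS 16

typedef struct { int doc_id; int score; } ranked_doc;

/* qsort comparator: higher score first, ties broken by lower doc_id. */
static int by_score_desc(const void *a, const void *b)
{
    const ranked_doc *x = a, *y = b;
    if (x->score != y->score)
        return (y->score > x->score) - (y->score < x->score);
    return (x->doc_id > y->doc_id) - (x->doc_id < y->doc_id);
}

/* counts[w][d] holds the occurrences of query word w in document d,
 * as looked up from the index. A document matches only if every query
 * word occurs in it; its score is the minimum count over the words.
 * Writes matching documents to out, best first, and returns how many. */
size_t rank_and_query(int counts[][MAX_DOCS], size_t nwords,
                      ranked_doc out[MAX_DOCS])
{
    size_t n = 0;
    assert(nwords > 0);
    for (int d = 0; d < MAX_DOCS; d++) {
        int score = counts[0][d];
        for (size_t w = 1; w < nwords; w++)
            if (counts[w][d] < score) score = counts[w][d];
        if (score > 0) out[n++] = (ranked_doc){ d, score };
    }
    qsort(out, n, sizeof out[0], by_score_desc);
    return n;
}
```

For a two-word query where document 1 has counts (3, 1) and document 2 has counts (2, 4), document 2 ranks first with score 2 and document 1 follows with score 1; a document missing either word is dropped.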

Command line usage
Querier in action