The Crux of a Search Engine
The job of a search engine is given a term to find all documents within a document collection which contain the term. To answer queries in a reasonable amount of time even in the presence of a vast document collection, an index of all search terms is prepared in advance. The index is in its simplest form nothing more than a sorted list of search terms; where each term has the list of the documents, wherein it is found, associated.
If the total size of the document collection is so small that the entire index fits into main memory it is straightforward to generate the index. The real challenge is to handle large document collections on small machines. Using a harddisk to store the index trades memory for time. Compressing the stored index will keep the space usage down. If the CPU is fast compared to disk throughput it is quite possible that compressing the index will achieve a time saving too.
The book Managing Gigabytes by Ian H. Witten, Alistair Moffat and Timothy C. Bell has the following table over predicted resource requirements to invert (i.e. build an index) a document collection with a total size of 5 Gbytes. The algorithms are more-or-less listed in order of increasing sophistication. The somewhat arbitrary limit of 40 Mbytes of main memory stems from the first edition of the book from 1993.
The first two entries illustrate the the space/time tradeoff really well.
The "Linked lists (memory)" algorithm builds an index in memory in the naïve way: all terms are stored as keys in a hash table, their associated value is the list of documents containing the term. It uses the almost as much memory as the entire collection.
The "Linked lists (disk)" uses the same approach, but stores the linked lists on a disk. Memory usage is fine, but due to the speed difference between main memory and disk the time usage explodes to the unreasonable.
The "Sort-based" approach builds the index in three tempi. First a temporary file consisting of (term-number, document-number) tuples are generated by looping over all terms in all documents. Second, the temporary file is sorted by an external sort; two tuples are compared by looking at the terms, the document number is used to break ties. Finally the sorted file is used to generate the index. This approach brings the execution time down in a reasonable range again. Due to tempoary file the space usage is almost twice the size of the total collection.
The "Sort-based compressed" algorithm refines the previous technique and reduces the disk space needed by using compression techniques.
"Sort-based multiway compressed" improves upon the sorting methods with much improved time usage as consequence.
For this project I'll implement the "Sort-based compressed" method.
If the total size of the document collection is so small that the entire index fits into main memory it is straightforward to generate the index. The real challenge is to handle large document collections on small machines. Using a harddisk to store the index trades memory for time. Compressing the stored index will keep the space usage down. If the CPU is fast compared to disk throughput it is quite possible that compressing the index will achieve a time saving too.
The book Managing Gigabytes by Ian H. Witten, Alistair Moffat and Timothy C. Bell has the following table over predicted resource requirements to invert (i.e. build an index) a document collection with a total size of 5 Gbytes. The algorithms are more-or-less listed in order of increasing sophistication. The somewhat arbitrary limit of 40 Mbytes of main memory stems from the first edition of the book from 1993.
Method | Memory (Mbytes) | Disk (Mbytes) | Time (hours) |
Linked lists (memory) | 4,000 | 0 | 6 |
Linked lists (disk) | 30 | 4,000 | 1,100 |
Sort-based | 40 | 8,000 | 20 |
Sort-based compressed | 40 | 680 | 26 |
Sort-based multiway compressed | 40 | 540 | 11 |
In-memory compressed | 420 | 1 | 12 |
Lexcion-based, no extra disk | 0 | 0 | 79 |
Lexcion-based, extra disk | 40 | 4,000 | 12 |
Text-based partition | 40 | 35 | 15 |
The first two entries illustrate the the space/time tradeoff really well.
The "Linked lists (memory)" algorithm builds an index in memory in the naïve way: all terms are stored as keys in a hash table, their associated value is the list of documents containing the term. It uses the almost as much memory as the entire collection.
The "Linked lists (disk)" uses the same approach, but stores the linked lists on a disk. Memory usage is fine, but due to the speed difference between main memory and disk the time usage explodes to the unreasonable.
The "Sort-based" approach builds the index in three tempi. First a temporary file consisting of (term-number, document-number) tuples are generated by looping over all terms in all documents. Second, the temporary file is sorted by an external sort; two tuples are compared by looking at the terms, the document number is used to break ties. Finally the sorted file is used to generate the index. This approach brings the execution time down in a reasonable range again. Due to tempoary file the space usage is almost twice the size of the total collection.
The "Sort-based compressed" algorithm refines the previous technique and reduces the disk space needed by using compression techniques.
"Sort-based multiway compressed" improves upon the sorting methods with much improved time usage as consequence.
For this project I'll implement the "Sort-based compressed" method.