Developing a Large-Scale Search Engine
Leonidas Akritidis
Monday 6/10/2008
Department of Computer and Communication Engineering, University of Thessaly, Volos, Hellas
Introduction
The current landscape:
Web Search: Google dominates, with innovative technologies contributed by distinguished scientists.
Vertical Search: sporadic efforts, few publications and plenty of room for research.
Vertical Search Engines: several exist on the web, covering various topics: Products, News, Blogs, Tickets, Hotels, Scientific Articles.
They largely rely on the ready-made full-text indexes of RDBMSs (MySQL etc.).
Motivations
We want more flexibility in vertical search. For example, full-text indexes do not support stemming or ranked queries (Boolean only). Moreover, not all languages are supported.
For large data volumes they require data centers:
There is no text or index compression.
They end up being quite slow (partly because of the above).
There is no control over parallelization; the horizontal-partitioning technique of the RDBMS is applied.
There is no control over caching; RDBMSs cache the most recent data, not the hottest.
They target well-controlled collections.
Existing vertical engines are developed in scripting languages that run in layers above C/C++ (e.g. PHP: no control over the memory we consume, nor can we keep data permanently in memory).
A new search engine
We are developing a new search engine that aims to:
Return better results than the existing ones.
Be faster.
Enable searching in thematic areas that the others ignore.
Solve technological/technical problems in indexing and/or querying.
XML
Almost all semi-structured information on the web today is exchanged through XML:
RSS Feeds (News).
Product Feeds.
Blog Articles.
Airline tickets.
Sitemaps (XML documents built specifically for search-engine crawlers, informing them which pages exist on a site).
Hotel reservations.
XML Document Collections
Not very well controlled: we do not know exactly what we will encounter, but we have an idea of the document structure (neither a TREC document nor a web page).
The web documents indexed by a search engine are usually limited in size and rarely exceed 1 MB; XML documents can be hundreds of MB in size.
In web search engines a file is identified with a single document and receives one document id; XML files contain multiple documents, each of which must be assigned its own id.
Additional parsing has to be applied to extract the attributes of an entry.
A product index needs to be updated more often than the standard structures that store document collections; for example, the price of a product can change several times within a week or a month.
Architecture: Overview
The URL Server and Fetcher
URL Server: sends a list of XML Documents to be downloaded by the Fetcher. Optional: just send URLs to the Fetcher (Crawling Mode).
Fetcher: contains a URL Client that reads the XML Documents to be fetched.
For XML Documents: simply download them.
For URLs: crawl the whole web site for XML Documents (useful for RSS feeds, e.g. crawl a whole site and download all of its RSS feeds).
Sends the XML Documents to the Store Server.
Could work in a distributed fashion: run multiple URL Clients and a single URL Server.
The Fetcher is developed in PHP, but I need to recode it in Python to follow the Client/Server model.
The Store Server
The Store Server runs multiple modules:
XML Parser: parses the XML file and converts records to documents with uniform tagging.
XML parsers for Windows (tinyXML mysteriously slows down after several operations, DOM parsing eats huge amounts of memory, and cuteXML collapses when parsing large documents).
XML parsers for Linux (still investigating: Expat is the fastest but very complex).
Assigns a unique ID to each Document by computing a checksum of its URL (Google implements it this way, but duplicate checksums may occur!).
Compresses the Documents (zlib) and appends them to an ever-growing repository.
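As a rough illustration of this admission step, the sketch below derives a document id from the URL and compresses the body with zlib. crc32 is only a stand-in for the unspecified checksum function (which is exactly why duplicate ids can occur), and all structure and field names are assumptions, not the engine's actual code.

```cpp
/* Minimal sketch of the Store Server's document admission step, assuming
 * zlib for compression and crc32 as a stand-in URL checksum. */
#include <zlib.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long  doc_id;      /* checksum of the document's URL  */
    unsigned long  raw_len;     /* uncompressed length             */
    unsigned long  packed_len;  /* compressed length               */
    unsigned char *packed;      /* zlib-compressed document body   */
} stored_doc;

static int store_document(const char *url, const char *body, size_t body_len,
                          stored_doc *out)
{
    /* 1. Document id = checksum of the URL (duplicates may occur). */
    out->doc_id = crc32(0L, (const Bytef *)url, (uInt)strlen(url));

    /* 2. Compress the document body with zlib before it is appended
     *    to the repository file. */
    uLongf cap = compressBound((uLong)body_len);
    out->packed = (unsigned char *)malloc(cap);
    if (!out->packed) return -1;

    if (compress2(out->packed, &cap, (const Bytef *)body, (uLong)body_len,
                  Z_BEST_SPEED) != Z_OK) {
        free(out->packed);
        return -1;
    }
    out->raw_len = (unsigned long)body_len;
    out->packed_len = (unsigned long)cap;
    return 0;
}
```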
The Repository
The repository can reside on a single PC or on a disk array.
Technically, it consists of simple binary files with 64-bit descriptors.
Careful file-system selection is needed: Ext3 and XFS on Linux support such big files.
Google implemented GFS, suitable for handling very large files spanning multiple file systems.
File handles are returned by fopen64().
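A sketch of appending one compressed document to such a binary repository file and recording its 64-bit offset is shown below, using the glibc large-file interfaces (fopen64, ftello64) mentioned above; the record layout and error handling are assumptions for illustration.

```cpp
/* Sketch: append a compressed document record and return its 64-bit offset. */
#define _LARGEFILE64_SOURCE
#include <stdio.h>

static long long repo_append(FILE *repo, unsigned long doc_id,
                             const unsigned char *packed, unsigned long packed_len)
{
    long long offset = (long long)ftello64(repo);   /* current end of file */

    if (fwrite(&doc_id, sizeof doc_id, 1, repo) != 1) return -1;
    if (fwrite(&packed_len, sizeof packed_len, 1, repo) != 1) return -1;
    if (fwrite(packed, 1, packed_len, repo) != packed_len) return -1;

    return offset;  /* kept elsewhere so the document can be reread later */
}

/* Usage (assuming the repository file was opened for appending):
 *   FILE *repo = fopen64("repository.bin", "ab");                       */
```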
The Indexer
The most complex part of the engine, mainly developed in C-style C++ (no classes; use of low-level malloc, fopen, fread etc.).
Debugged with sophisticated debuggers that attach to the process and monitor memory allocations and deallocations.
Intensive optimization for speed, effectiveness and memory consumption.
Follows the Single Pass In Memory Inversion indexing approach (Justin Zobel, S. Heinz), but with some interesting variations.
Single Pass In Memory Inversion (1)
1. Access the Repository for reading.
2. Create a new, empty Dictionary structure.
3. For each Document:
   1. Decompress the document.
   2. Tokenize the document and find all distinct terms.
   3. For each distinct term, search the Dictionary:
      If the term is not in the Dictionary, add it, compress the Posting and append it to the term's Posting List.
      Else, increase the term's frequency, then compress and append the Posting to the term's Posting List.
4. When no memory is available, sort the Dictionary.
5. Write the Dictionary AND the posting lists to disk (a sorted run).
6. Deallocate all used memory and go to step 2.
7. When all documents have been processed, merge the sorted runs into the final Inverted File.
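A compact sketch of this loop follows. The real indexer uses the custom hash table and compressed bit vectors described in later slides; std::map and in-memory "runs" stand in for them here only so the example stays self-contained, and the memory limit is approximated by counting postings.

```cpp
// Sketch of the single-pass, memory-limited inversion loop (steps 2-7).
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Posting { int doc_id; int tf; };                      // hit lists omitted
typedef std::map<std::string, std::vector<Posting> > Run;    // sorted by term
typedef std::pair<int, std::vector<std::string> > Doc;       // (doc id, tokens)

std::vector<Run> build_runs(const std::vector<Doc> &docs, size_t postings_limit)
{
    std::vector<Run> runs;
    Run dict;                                  // step 2: new empty dictionary
    size_t postings_in_memory = 0;

    for (size_t d = 0; d < docs.size(); ++d) {               // step 3
        int doc_id = docs[d].first;
        const std::vector<std::string> &terms = docs[d].second;
        for (size_t i = 0; i < terms.size(); ++i) {
            std::vector<Posting> &plist = dict[terms[i]];
            if (!plist.empty() && plist.back().doc_id == doc_id) {
                plist.back().tf++;             // term already seen in this doc
            } else {
                Posting p = { doc_id, 1 };     // new posting for this document
                plist.push_back(p);
                ++postings_in_memory;
            }
        }
        if (postings_in_memory >= postings_limit) {          // steps 4-6
            runs.push_back(dict);              // a std::map is already term-sorted
            dict.clear();
            postings_in_memory = 0;
        }
    }
    if (!dict.empty()) runs.push_back(dict);
    return runs;   // step 7: merge the runs into the final inverted file (not shown)
}
```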
Single Pass In Memory Inversion (2)
Can index collections of any size.
Does not require the Lexicon structure to reside constantly in memory.
Faster than the Witten/Moffat/Bell sort-based inversion.
Indexing is limited only by the available disk space.
Works within a limited amount of memory: I ran it successfully with memory limits of 100, 200, 350, 500 and 1000 MB.
Hit Lists
For each distinct term of a document, one posting is created. The posting contains the document id, the term's frequency in the document, and the Hit List.
The posting could also contain a flag that indicates the type of document the term appears in (DOC, PDF, HTML etc.).
The Hit List is an array that describes each occurrence of the term:
1. The zone of the document the term appears in.
2. The term's offset within the zone.
3. Google introduced some other flags, such as font size and capitalization; these are of no use when indexing XML documents, since no text markup exists.
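The struct below is one way the posting and hit layout described above could be expressed; the field widths and names are illustrative, not the engine's actual in-memory or on-disk format.

```cpp
/* Illustrative layout of a posting and its hit list. */
typedef struct {
    unsigned short zone;     /* which zone of the document (title, body, ...) */
    unsigned int   offset;   /* term offset (position) within that zone       */
} hit;

typedef struct {
    unsigned int  doc_id;    /* document identifier                           */
    unsigned int  f_dt;      /* frequency of the term in this document        */
    unsigned char doc_type;  /* optional flag: DOC, PDF, HTML, ...            */
    hit          *hits;      /* one entry per occurrence of the term          */
} posting;
```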
Example: Tokenizing a Document
Document 45: "The only non-positional bodies in the sky, are the sun and the moon."
the : 45:4[ ]
only : 45:1[ ]
non : 45:1[ ]
positional : 45:1[ ]
non-positional : 45:1[ ]
nonpositional : 45:1[ ]
bodies : 45:1[ ]
in : 45:1[ ]
sky : 45:1[ ]
are : 45:1[ ]
sun : 45:1[ ]
and : 45:1[ ]
moon : 45:1[ ]
Example posting: 45 : 4 [ ], where 45 is the docId, 4 is f_d,t and each Hit List entry holds (zone, term offset).
Temporary Data Structure for Document Parsing (1)
The distinct terms of a single document are stored in a temporary data structure along with their Postings. One such data structure exists per document.
Current implementations: a hash table with separate chaining (512 slots) and a binary search tree.
Why do we need this intermediate structure?
1. To calculate the per-document term frequencies.
2. To find the Hits.
Temporary Data Structure for Document Parsing (2)
Remember: one intermediate data structure is created for each parsed document and contains all of its distinct terms and their Postings.
Documents are processed one by one. Before parsing the next document:
1. Traverse the temporary data structure and update the global Lexicon data structure (next slide).
2. Destroy the temporary data structure.
3. Recreate the temporary data structure (empty) for the next document.
No paper describes this temporary data structure, but indexing cannot be done without it! The papers refer to it as "the parsing stream". How do we assign offsets to the terms?
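A minimal sketch of this per-document structure (the hash-table variant with 512 slots and separate chaining) is given below. The function names, the fixed hit capacity and the fact that only offsets (not zones) are recorded are simplifications for illustration.

```cpp
/* Per-document temporary structure: 512-slot hash table with separate
 * chaining, accumulating each distinct term's frequency and hits. */
#include <stdlib.h>
#include <string.h>

#define TMP_SLOTS 512

typedef struct tmp_node {
    char            *term;
    unsigned int     f_dt;       /* frequency of the term in this document    */
    unsigned int     hits[32];   /* offsets (fixed cap only for this sketch)  */
    struct tmp_node *next;
} tmp_node;

typedef struct { tmp_node *slot[TMP_SLOTS]; } tmp_table;

static unsigned int tmp_hash(const char *s)
{
    unsigned int h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TMP_SLOTS;
}

/* Record one occurrence of 'term' at position 'offset' in the document. */
static void tmp_add(tmp_table *t, const char *term, unsigned int offset)
{
    unsigned int h = tmp_hash(term);
    tmp_node *n;
    for (n = t->slot[h]; n; n = n->next)
        if (strcmp(n->term, term) == 0) break;
    if (!n) {                                     /* first occurrence in doc */
        n = (tmp_node *)calloc(1, sizeof *n);
        n->term = strdup(term);
        n->next = t->slot[h];
        t->slot[h] = n;
    }
    if (n->f_dt < 32) n->hits[n->f_dt] = offset;  /* remember the hit */
    n->f_dt++;
}
```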
The Global Lexicon Data Structure
The Global Lexicon data structure is implemented as a fast variant of a standard hash table with separate chaining. The variation is on insertion: move-to-front is applied for new entries and for successful searches.
1,046,527 slots.
Hash function by J. Zobel (other candidates, not tested: FNV, Bob Jenkins, Paul Hsieh).
The Global Lexicon is updated from the intermediate data structure; only distinct terms enter the hash table.
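The sketch below shows the two ingredients of this structure: a bitwise string hash in the form commonly attributed to J. Zobel, and chained lookup with move-to-front on both successful searches and new entries. The seed value, node fields and the omission of the posting BitVector are assumptions made only for the illustration.

```cpp
/* Sketch of the Global Lexicon: separate chaining with move-to-front. */
#include <stdlib.h>
#include <string.h>

#define LEX_SLOTS 1046527          /* prime number of slots, as in the deck */

typedef struct lex_node {
    char            *term;
    unsigned long    c_ft;         /* collection frequency of the term  */
    struct lex_node *next;         /* posting BitVector omitted here    */
} lex_node;

static lex_node *lexicon[LEX_SLOTS];

/* Commonly cited form of Zobel's bitwise string hash. */
static unsigned int bitwise_hash(const char *w, unsigned int seed)
{
    unsigned int h = seed;
    for (; *w != '\0'; w++)
        h ^= (h << 5) + (unsigned char)*w + (h >> 2);
    return (h & 0x7fffffff) % LEX_SLOTS;
}

/* Find or insert a term; successful searches and new entries both end up
 * at the front of their chain (move-to-front). */
static lex_node *lexicon_get(const char *term)
{
    unsigned int h = bitwise_hash(term, 1159241);    /* seed is an assumption */
    lex_node *prev = NULL, *n = lexicon[h];

    for (; n; prev = n, n = n->next) {
        if (strcmp(n->term, term) == 0) {
            if (prev) {                              /* move-to-front */
                prev->next = n->next;
                n->next = lexicon[h];
                lexicon[h] = n;
            }
            return n;
        }
    }
    n = (lex_node *)calloc(1, sizeof *n);            /* new term at chain head */
    n->term = strdup(term);
    n->next = lexicon[h];
    lexicon[h] = n;
    return n;
}
```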
Index Compression
For each distinct term there is a pointer to a BitVector: a sequence of bytes used to store the compressed postings.
When the Global Lexicon is updated from the temporary data structure, the posting of each term is compressed on the fly into its BitVector.
Four compression schemes have been implemented:
1. Unary
2. Elias Gamma
3. Elias Delta
4. Golomb Codes
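As an illustration of on-the-fly compression into a BitVector, here is a sketch of an Elias gamma encoder. The bit order, buffer growth policy and names are assumptions; the engine's other coders (unary, delta, Golomb) follow the same append-bits pattern.

```cpp
/* Sketch of a growing bit vector with Elias gamma encoding. Bits are
 * appended most significant first within each byte. */
#include <stdlib.h>

typedef struct {
    unsigned char *bytes;
    size_t         capacity;   /* allocated bytes               */
    size_t         bit_len;    /* number of bits written so far */
} bitvector;

static void bv_append_bit(bitvector *bv, int bit)
{
    if (bv->bit_len / 8 >= bv->capacity) {                   /* grow buffer */
        bv->capacity = bv->capacity ? bv->capacity * 2 : 16;
        bv->bytes = (unsigned char *)realloc(bv->bytes, bv->capacity);
    }
    if (bv->bit_len % 8 == 0) bv->bytes[bv->bit_len / 8] = 0;
    if (bit) bv->bytes[bv->bit_len / 8] |= (unsigned char)(0x80 >> (bv->bit_len % 8));
    bv->bit_len++;
}

/* Elias gamma: for x >= 1, emit floor(log2 x) zero bits, then x in binary. */
static void bv_append_gamma(bitvector *bv, unsigned int x)
{
    int nbits = 0;
    unsigned int tmp = x;
    while (tmp > 1) { tmp >>= 1; nbits++; }       /* nbits = floor(log2 x) */

    for (int i = 0; i < nbits; i++) bv_append_bit(bv, 0);
    for (int i = nbits; i >= 0; i--) bv_append_bit(bv, (x >> i) & 1);
}

/* Example: a doc-gap of 9 is written as 000 1001 (7 bits). */
```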
Sorted Runs and Merging
As documents are parsed, the Global Lexicon keeps storing new terms and their compressed inverted lists, occupying more and more memory.
When the available memory is exhausted, sort the Global Lexicon (if it is not already a tree structure) and store it, along with the compressed postings, in a temporary file as a Sorted Run.
When all documents have been parsed, merge the temporary files to create the final Inverted File. The Inverted File is already compressed.
The code for merging the temporary files is very tricky!
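One way the merge can be organized is a k-way merge driven by a min-heap keyed on (term, run index). The sketch below operates on the in-memory runs of the earlier SPIMI sketch and only prints list sizes instead of writing the real compressed inverted file; the real code merges temporary files on disk.

```cpp
/* Sketch of the k-way merge of sorted runs into the final inverted file. */
#include <cstdio>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Posting { int doc_id; int tf; };
typedef std::map<std::string, std::vector<Posting> > Run;

void merge_runs(const std::vector<Run> &runs)
{
    typedef Run::const_iterator It;
    typedef std::pair<std::string, size_t> HeapItem;   // (term, run index)
    std::priority_queue<HeapItem, std::vector<HeapItem>,
                        std::greater<HeapItem> > heap;  // smallest term on top

    std::vector<It> cursor(runs.size());
    for (size_t r = 0; r < runs.size(); ++r) {
        cursor[r] = runs[r].begin();
        if (cursor[r] != runs[r].end())
            heap.push(std::make_pair(cursor[r]->first, r));
    }

    while (!heap.empty()) {
        std::string term = heap.top().first;
        std::vector<Posting> merged;
        // Pull this term from every run that contains it; runs were produced
        // in document order, so the concatenation stays doc-id sorted.
        while (!heap.empty() && heap.top().first == term) {
            size_t r = heap.top().second;
            heap.pop();
            merged.insert(merged.end(),
                          cursor[r]->second.begin(), cursor[r]->second.end());
            if (++cursor[r] != runs[r].end())
                heap.push(std::make_pair(cursor[r]->first, r));
        }
        std::printf("%s: %zu postings\n", term.c_str(), merged.size());
    }
}
```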
Inverted Index
Two files (in this implementation):
The Dictionary File contains all the distinct terms, their collection frequencies and a long integer that represents the address of each Inverted List in the Inverted File. If the index occupies more than one file (shards), an extra short int pointing to the appropriate file is needed.
The Inverted File contains the compressed Inverted Lists.
Future work: build the index in shards.
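For illustration, a dictionary record along the lines described above might look like the struct below; the exact record layout (field widths, fixed-width terms, absence of front coding) is an assumption.

```cpp
/* Illustrative dictionary-file entry. */
typedef struct {
    char          term[64];    /* the distinct term (fixed width here only
                                  for simplicity; no front coding)          */
    unsigned int  c_ft;        /* collection frequency of the term          */
    long long     list_offset; /* 64-bit address of the inverted list
                                  inside the Inverted File                  */
    /* short int  shard;          needed only if the index is sharded       */
} dict_entry;
```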
Document Collection and Indexing
Self-crawled collection of 268,000 web documents (10 GB).
8 million distinct terms (with stemming and case folding) and 95 million token pointers.
Stemming: English only (Martin Porter's algorithm).
After indexing:
Dictionary File: 199 MB (without front coding).
Inverted File: 493 MB (Elias Delta codes for doc-gaps, offset-gaps and term document frequencies).
Three test PCs:
Intel E6600 2.4 GHz Core 2 Duo, 2 GB RAM, 160 GB disk, Windows XP
Intel E4200 1.6 GHz Core 2 Duo, 1 GB RAM, 500 GB disk, Ubuntu
Intel P4 1.8 GHz, 1.5 GB RAM, 80 GB disk, Ubuntu
Query Server and Client
To answer queries, the Lexicon of the collection must reside constantly in memory; the Query Server is developed for this reason.
It is developed using threads:
1. Support for concurrent query processing.
2. Shared-memory environment (forking a new process would require copying the data to the new process).
The Query Client writes the query to the Server's listening socket and reads the returned results.
The Query Client sends the results to the user's browser.
The Query Client could also work as a load balancer across multiple Query Servers.
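A minimal sketch of the thread-per-query accept loop is shown below, assuming a plain TCP socket and POSIX threads; the port number, buffer sizes and the placeholder result string are illustrative, and the actual lookup against the shared in-memory lexicon is elided.

```cpp
/* Sketch of the Query Server accept loop: one thread per connection,
 * all threads sharing the in-memory lexicon. Error handling omitted. */
#include <netinet/in.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void *handle_query(void *arg)
{
    int client = *(int *)arg;
    free(arg);

    char query[1024];
    ssize_t n = read(client, query, sizeof query - 1);    /* read the query */
    if (n > 0) {
        query[n] = '\0';
        /* ... look the terms up in the shared lexicon, intersect lists ... */
        const char *results = "doc ids would be written back here\n";
        write(client, results, strlen(results));           /* return results */
    }
    close(client);
    return NULL;
}

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8400);                 /* port is an assumption */

    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 16);

    for (;;) {
        int *client = (int *)malloc(sizeof *client);
        *client = accept(srv, NULL, NULL);
        pthread_t tid;
        pthread_create(&tid, NULL, handle_query, client);  /* thread per query */
        pthread_detach(tid);
    }
}
```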
Query Processing
Only Boolean queries have been implemented so far.
Inverted lists are ordered by doc id, so intersection runs in linear time, proportional to the size of the lists.
Future work: order the inverted lists by term document frequency for efficient ranked query processing.
Technical issue: we must traverse the whole inverted list in order to deallocate the memory occupied by its nodes.
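The Boolean AND over two doc-id ordered lists is the classic linear merge; a sketch follows, assuming the lists have already been decompressed into plain arrays of doc ids.

```cpp
/* Sketch of Boolean AND over two doc-id ordered inverted lists: a linear
 * merge that advances whichever cursor holds the smaller doc id. */
#include <stddef.h>

/* Writes the common doc ids into 'out' (assumed large enough) and returns
 * how many were found. */
static size_t intersect(const unsigned int *a, size_t na,
                        const unsigned int *b, size_t nb,
                        unsigned int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])        i++;          /* advance the smaller doc id */
        else if (b[j] < a[i])   j++;
        else { out[k++] = a[i]; i++; j++; }   /* doc contains both terms    */
    }
    return k;
}
```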
Document 1
Title: "The USA Government funds the collapsing banks"
Body: "The American Government decided to support the problematic banks with 800 billion USD."
the : 1:4[ ]  usa : 1:1[ ]  government : 1:2[ ]  funds : 1:1[ ]  collapsing : 1:1[ ]  banks : 1:2[ ]  american : 1:1[ ]  decided : 1:1[ ]  to : 1:1[ ]  support : 1:1[ ]  problematic : 1:1[ ]  with : 1:1[ ]  800 : 1:1[ ]  billion : 1:1[ ]  usd : 1:1[ ]
Document 2
Title: "The American banks collapse"
Body: "Panic in the major global stock markets after the collapse of two commercial banks."
the : 2:3[ ]  american : 2:1[ ]  banks : 2:2[ ]  collapse : 2:2[ ]  panic : 2:1[ ]  in : 2:1[ ]  major : 2:1[ ]  global : 2:1[ ]  stock : 2:1[ ]  markets : 2:1[ ]  after : 2:1[ ]  of : 2:1[ ]  two : 2:1[ ]  commercial : 2:1[ ]
Document 3
Title: "Jim Banks, a great American novel writer"
Body: "The great novel "collapse" now only 20 usd."
jim : 3:1[ ]  banks : 3:1[ ]  a : 3:1[ ]  great : 3:2[ ]  american : 3:1[ ]  novel : 3:1[ ]  writer : 3:1[ ]  the : 3:1[ ]  collapse : 3:1[ ]  now : 3:1[ ]  only : 3:1[ ]  20 : 3:1[ ]  usd : 3:1[ ]
Query: American Banks. Which is the best result?
Document 1: has all query terms in the body, but only "banks" in the title.
Document 2: has all query terms in the title.
Document 3: has all the query terms in the title, but in the wrong order.
Document 2 is the best result, but how do we support this in a ranked model?
References (1)
Paul Hsieh's Hash Function
Bob Jenkins' Hash Function
Efficient Single Pass Index Construction for Text Databases (S. Heinz, J. Zobel)
The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin, Page)
Introduction to Information Retrieval (Cambridge preliminary draft, 2008)
Managing Gigabytes (Witten, Moffat, Bell)
Inverted Files for Text Search Engines (J. Zobel, A. Moffat)
Performance in Practice of String Hashing Functions (Ramakrishna, Zobel)
In-Memory Hash Tables for Accumulating Text Vocabularies (Zobel, Heinz, Williams)
Cache Conscious Collision Resolution in String Hash Tables (Askitis, Zobel)
Breaking the Hash Table Speed Limit (Cooper, Duncan, Gregg)
Porter Stemmer