In this chapter we begin with a very simple example of an information retrieval problem, and introduce the idea of a term-document matrix (Section 1.1) and the central inverted index data structure (Section 1.2). We will then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).

1.1 An example information retrieval problem

A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs it. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections (the size of Shakespeare’s Collected Works is a bit under one million words of text in total), you really need nothing more.

The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document (here a play of Shakespeare’s) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of them as words.

[Figure 1.1: term-document incidence matrix. Surviving column headers begin: Antony, Julius, The, Hamlet, Othello, Macbeth.]

2 The term vocabulary and postings lists

Recall the major steps in inverted index construction: 1. [...] Index the documents that each term occurs in.
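To make the contrast in Section 1.1 concrete, here is a minimal Python sketch of both approaches to the query Brutus AND Caesar AND NOT Calpurnia: a grep-style linear scan, and a binary term-document incidence matrix built in advance. Everything in it is an assumption made for illustration: the `PLAYS` texts are invented one-line stand-ins rather than Shakespeare's text, the function names are hypothetical, and storing each matrix row as a Python integer used as a bit vector is just one convenient representation.

```python
import re

# Toy stand-ins for plays: invented one-line texts, NOT Shakespeare's actual text.
PLAYS = {
    "Play A": "Brutus speaks with Caesar while Calpurnia waits",
    "Play B": "Brutus and Caesar meet in the forum",
    "Play C": "Caesar is mentioned but the other is absent",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def linear_scan(plays, required, excluded):
    """Grep-style retrieval: reread every play in full for every query."""
    hits = []
    for title, text in plays.items():
        words = set(tokenize(text))
        has_required = all(term in words for term in required)
        has_excluded = any(term in words for term in excluded)
        if has_required and not has_excluded:
            hits.append(title)
    return hits


def build_incidence_matrix(plays):
    """Return (titles, matrix); matrix[term] is an int used as a bit vector,
    with bit i set when the term occurs in titles[i]."""
    titles = list(plays)
    matrix = {}
    for i, title in enumerate(titles):
        for term in set(tokenize(plays[title])):
            matrix[term] = matrix.get(term, 0) | (1 << i)
    return titles, matrix


def boolean_query(titles, matrix, required, excluded):
    """AND the rows of the required terms, then AND with the complement
    of each excluded term's row."""
    all_docs = (1 << len(titles)) - 1  # one bit per document, all set
    result = all_docs
    for term in required:
        result &= matrix.get(term, 0)
    for term in excluded:
        result &= all_docs & ~matrix.get(term, 0)
    return [title for i, title in enumerate(titles) if (result >> i) & 1]


titles, matrix = build_incidence_matrix(PLAYS)
print(linear_scan(PLAYS, ["brutus", "caesar"], ["calpurnia"]))
print(boolean_query(titles, matrix, ["brutus", "caesar"], ["calpurnia"]))
```

On this toy data both calls print `['Play B']`. The difference is where the work happens: the scan rereads every text for each query, while the matrix is built once and each query then reduces to a few bitwise operations on its rows.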