GetTextbooks.co.uk  
 Compare Prices & Save up to 90%
Search by ISBN, title, author, etc ...

Login | Sign up | My Wish List  


Information Retrieval: Data Structures and Algorithms

by William B. Frakes, Ricardo Baeza-Yates

ISBN-10: 0134638379
ISBN-10: 0-13-463837-9
ISBN-13: 9780134638379
ISBN-13: 978-0-13-463837-9
Paperback
1992-06-22
Prentice Hall PTR


Find Lowest Price

Editorials


Book Description
Information retrieval is a sub-field of computer science that deals with the automated storage and retrieval of documents. Providing the latest information retrieval techniques, this guide discusses Information Retrieval data structures and algorithms, including implementations in C. Aimed at software engineers building systems with book processing components, it provides a descriptive and evaluative explanation of storage and retrieval systems, file structures, term and query operations, document operations and hardware. Contains techniques for handling inverted files, signature files, and file organizations for optical disks. Discusses such operations as lexical analysis and stoplists, stemming algorithms, thesaurus construction, and relevance feedback and other query modification techniques. Provides information on Boolean operations, hashing algorithms, ranking algorithms and clustering algorithms. In addition to being of interest to software engineering professionals, this book will be useful to information science and library science professionals who are interested in text retrieval technology.

Reviews


Covers Basics with Varying Depth and Quality
Rather than a coherent textbook about information retrieval, this book contains 18 papers by individual authors which vary wildly in depth, quality and relevance today. The basic issues are covered each with their own chapters: inverted files, vector comparison techniques, stoplists, stemming, tehsauri, string searching, relevance feedback, boolean operations, ranking, clustering and hashing.

The introduction covers hashing and automata for string matching in detail, but doesn't mention vector-based techniques other than Hamming distance (!) and in one paragraph provides the only mention of edit distance (aka Levenstein distance) in the book. The chapter on PAT trees and the one on optical disks seem out of place due to their depth and obscurity. On the other hand, there's no mention of caching anywhere. The chapter on lexical analysis and stoplists by Fox has a nice introduction, but then devolves into page after page of C code. Ditto for Frakes' chapter on stemming -- good introduction, but we didn't need ten pages of code. Same for the thesaurus chapter -- a few pages of introduction, and then 40+ pages of code for some kind of hierarchical clustering. Baeza-Yates' chapter on string searching covers Knuth-Morris-Pratt and Boyer-Moore briefly and even contains some interesting empirical data, but again, we didn't really need the C code. Harman's chapter on relevance feedback (query modification) stands out as being entirely sensible, high level and informative, but is a decade behind the times. The chapter on boolean operations provides a few pages of info and then mysteriously spends 10 pages on bit vector code and then another handful on hashing. Then the following chapter on hashing has 40 pages of C code for perfect hashing! Harman's later chapter on ranking algorithms is a useful overview of scoring (though very high level). Rasmussen's chapter on clustering is also thoughtful, but rather non-standard -- you don't even get k-means, everyone's favorite clustering algorithm, and it also recaps the definitions of many of the other chapters.

Unfortunately, you don't get any higher-order graph analysis techniques that power web search engines like Google. You won't get any kind of help for load balancing servers or databases, which is critical. You also don't get any dimensionality reducing and smoothing techniques like latent semantic analysis or principal components analysis. There's also no analysis from a users' perspective on usability and the different kinds of tasks that peopel might be using information retrieval for. And of course, there's no discussion of natural language understanding techniques or crosslingual or multilingual retrieval techniques. Finally, it's all text based and you won't get any information on retrieving audio or images.

If you're serious about information retrieval, this book lacks the depth and recency to leave you feeling like an expert. The statistical language processing book by Manning and Schuetze contains an excellent introduction to information retrieval algorithms, as well as reams of background on statistical language processing you'll want to understand before getting into information retrieval. For more details on information retrieval itself, check out the collection of primary source papers edited by Karen Sparck-Jones: Readings in Information Retrieval.


Useful reference book
I bought this book while working on some informaitno retrieval related project, and it turned out as a useful reference for explaining terminology, suggesting efficient data structures, and offering good references for further reading.

However, the book turned out yet more useful to me as, during my M.A. studies (in CS) I had to write a work on "Suffix Trees" and "Suffix Arrays" and I found that Gonnet, Baeza-Yates and Snider describe equivalent ideas they call "PAT trees" and "PAT arrays".

I found this book useful too for working on computational linguistics related projects as well.

In short - I like keeping this book always in reach, as a reference, though, I found this book not so friendly as an introduction book to the subject ("Managing Gigabytes", might turn out to be a more welcomming).


Good coverage and treatment of algorithms for I.R.
I adopted this book as the primary textbook for my course on information retrieval. It covers a substantial part of core topics in IR: models of information retrieval system (boolean and best-match systems); implementations (inverted files, tries, signature files, hashing), indexing and retrieval algorithms (lexical analysis, stemming, ranking, relevance feedback, boolean operations) and somewhat more advanced topics like clustering and automatic thesaurus construction. These topics are dealt with varying level of detail: for some of them there are also C code examples that are rather useful to students; other topics are less well detailed (eg. relevance feedback and probabilistic models). These topics are dealt with sufficient clarity and reasonable conciseness. Some shortcomings are: (i) the weak treatment of the probabilistic models (I would have liked a deeper analysis of the underlying principles and how they lead to certain kinds of systems). Consequences of some techniques are discussed with insufficient depth. (ii) In my view too much attention is devoted to low-level string processing, like what is done in chapter 10, centered on string searching algorithms (not relevant to the main topic of the book). (iii) Other important topics have not been dealt at all, unfortunately. These include almost everything that goes under the topic of user-centered information retrieval and user interfaces. Another missing topic is "passage retrieval".

good tips on how to get the right bits of knowledge to user
Although this is more of a hardcore software book, the techniques and issues presented are transferable to the general problem of knowledge management. In particular, the various software tricks for how to handle very large databases provide useful ideas on how to get the right bits of knowledge to the user.


Home | Browse | Professors | Merchants | Webmasters | Contact Us

[ United States | Canada ]

Copyright © 2003-2008 GetTextbooks.co.uk