Introduction to Information Storage and Retrieval

Chapter One
Overview of Information Retrieval

10/16/2017 Introduction to Information Retrieval 1
IR and IR Systems
❖ Information retrieval (IR) is the process of searching a large,
unstructured corpus for relevant documents that satisfy a user's
information need.
❖ According to Baeza-Yates & Ribeiro-Neto, information retrieval
deals with the representation, storage, organization of, and
access to information items.
❖ The organization of and access to information items should
give the user easy access to the information in which he or she
is interested.
❖ The definition incorporates all the important features of a
good information retrieval system:
➢ Representation
➢ Storage
➢ Organization
➢ Access
Examples of IR systems
❖ Conventional (library catalog): Search by keyword,
title, author, etc.
❖ Text-based (Lexis-Nexis, Google, FAST): Search by
keywords. Limited search using queries in natural
language.
❖ Multimedia (QBIC, WebSeek, SaFe): Search by
visual appearance (shapes, colors, …).
❖ Question answering systems (AskJeeves,
Answerbus): Search in (restricted) natural language.
❖ Web search systems: Lycos, Excite, Yahoo, Google,
Live, Northern Light, Teoma, HotBot, Baidu, …
General Goal of Information Retrieval

❖ To help users find useful information based on
their information needs (with minimum effort)
despite
➢ the increasing complexity of information
➢ the changing needs of users
❖ To provide immediate random access to the
document collection.
➢ Retrieval systems such as Google and Yahoo are
developed with this aim.



Information Retrieval vs. Data Retrieval
❖ The emphasis of IR is on the retrieval of information rather than
on the retrieval of data.
Data retrieval
➢ Consists mainly of determining which documents contain a set
of keywords in the user query (which is not enough to satisfy
the user's information need)
➢ Aims at retrieving all objects that satisfy well-defined semantics
➢ A single erroneous object among a thousand retrieved objects
implies failure
Information retrieval
➢ Is concerned with retrieving information about a subject or topic
rather than retrieving data that satisfies a given query
➢ Semantics are frequently loose: the retrieved objects might be
inaccurate
➢ Small errors are tolerated
An example of a data retrieval system is a relational database.
Information Retrieval vs. Data Retrieval
                      Data Retrieval                         Info Retrieval
Data organization     Structured                             Unstructured
Fields                Clear semantics (ID, name, age, …)     No fields (other than text)
Query language        Artificial (defined, SQL)              Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)   Partial match, best match
Query specification   Complete                               Incomplete
Items wanted          Matching                               Relevant
Accuracy              100%                                   < 50%
Error response        Sensitive                              Insensitive
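
The contrast between exact matching and best matching can be illustrated with a minimal Python sketch. The records, the documents, and the function names `data_retrieval` and `best_match` are made up for illustration; counting shared words is only one simple way to score a partial match.

```python
# Toy records and documents; all names and data here are illustrative.
records = [
    {"id": 1, "name": "Abebe", "age": 30},
    {"id": 2, "name": "Sara", "age": 25},
]

docs = {
    "d1": "car racing results from the weekend",
    "d2": "history of car manufacturers",
    "d3": "tourism in Ethiopia",
}


def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row for row in rows if row[field] == value]


def best_match(collection, query):
    """Partial match: score each document by the number of shared query words."""
    query_words = set(query.lower().split())
    scores = {}
    for doc_id, text in collection.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Here `data_retrieval(records, "age", 25)` returns only the rows that satisfy the condition exactly, while `best_match(docs, "car racing")` also returns a document that shares just one word with the query, ranked lower.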
Why is IR so hard?
❖ Traditional information retrieval (IR) systems attempt to
find relevant documents in response to a user’s request.
❖ Information retrieval problem: locating relevant
documents based on user input, such as keywords or
example documents
➢ The real problem boils down to matching the language
of the query to the language of the document.
➢ Simply matching on words is a very brittle (inelastic)
approach. One word can have different meanings.
Consider “take”:
➢ “take a place at the table”
➢ “take money to the bank”
➢ “take a picture”



Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of Documents
The User Task:
Two user tasks: retrieval and browsing
[Figure: the USER interacts with the DB through two tasks, Retrieval and Browsing]



User Task: Retrieval
❖ Retrieval is the process of retrieving information
whereby the main objective is clearly defined
from the onset of the search.
❖ The user of a retrieval system has to translate his or her
information need into a query in the language
provided by the system.
❖ In this context (i.e. by specifying a set of words),
the user searches for useful information by executing
a retrieval task.
❖ English-language statement:
I want a book by J. K. Rowling titled The Chamber
of Secrets



User Task: Browsing
❖ Browsing is the process of retrieving information
whereby the main objective is not clearly defined at
the beginning and may change during the interaction
with the system.
❖ E.g. a user might search for documents about ‘car racing’.
Meanwhile, he might find interesting documents about
‘car manufacturers’. While reading about car
manufacturers in Addis, he might turn his attention to a
document providing ‘directions to Addis’, and from this
to documents that cover ‘Tourism in Ethiopia’.
❖ In this context, the user is said to be browsing the
collection rather than searching, since the user may
simply be interested in glancing around.



Logical View of Documents
Documents in a collection are frequently represented by a set of index
terms or keywords.
Such keywords are mostly extracted directly from the text of the
document.
These representative keywords provide a logical view of the document.
[Figure: Docs → Tokenization → Stop words → Stemming → Indexing; the logical view moves from full text to index terms]
Document representation can be viewed as a continuum in which the
logical view of documents might shift from full text to index terms.
Logical view of documents
❖ If full text is used:
➢ Each word in the text is a keyword
➢ This is the most complex form
➢ It is expensive
❖ If the full text is too large, the set of representative keywords
can be reduced through a transformation process called
text operations
➢ Text operations reduce the complexity of the document
representation and allow the logical view to move
from full text to a set of index terms
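
The text operations named above (tokenization, stop-word removal, stemming) can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# A tiny illustrative stop list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full text -> tokens -> non-stop-word tokens -> sorted unique stems."""
    return sorted({crude_stem(t) for t in remove_stop_words(tokenize(text))})
```

For the sentence "The racing cars are racing in the streets", eight words of full text shrink to three index terms, which is the reduction in complexity that moves the logical view from full text toward index terms.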



Structure of an IR System

❖ An information retrieval system serves as a bridge between the world of
authors and the world of readers/users.
❖ That is, writers present a set of ideas in a document using a set of concepts.
Users then query the IR system for relevant documents that satisfy their
information need.
[Figure: User ↔ Black box ↔ Documents]
❖ The black box is the information retrieval system.
❖ To be effective in satisfying users' information needs, the IR
system must ‘interpret’ the contents of the documents in a collection and rank
them according to their degree of relevance to the user query.
❖ Thus the notion of relevance is at the center of IR.
❖ The primary goal of an IR system is to retrieve all the documents that are
relevant to a user query while retrieving as few non-relevant documents as
possible.
Typical IR Task

❖ Given:
➢ A corpus of textual natural-language documents.
➢ A user query in the form of a textual string.
❖ Find:
➢ A ranked set of documents that are relevant to the
query.



Typical IR System Architecture

[Figure: a query string and the document corpus are fed to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]


Overview of the Retrieval process



The Retrieval Process
❖ It is necessary to define the text database before any of
the retrieval processes are initiated.
❖ This is usually done by the manager of the database and
includes specifying the following:
➢ The documents to be used
➢ The operations to be performed on the text
➢ The text model to be used (the text structure and what
elements can be retrieved)
❖ The text operations transform the original documents and
the information needs and generate a logical view of them.
❖ Once the logical view of the documents is defined, the
database manager (using the DB Manager Module) builds
an index of the text.
❖ An index is a critical data structure because it allows fast
searching over large volumes of data.
The Retrieval Process …
❖ Different index structures might be used, but the most
popular one is the inverted file.
❖ The resources (time and storage space) spent on defining
the text database and building the index are amortized by
querying the retrieval system many times.
❖ Once the document database is indexed, the retrieval
process can be initiated.
❖ The user first specifies a user need, which is then parsed
and transformed by the same text operations applied to
the text.
❖ Then, query operations might be applied before the actual
query, which provides a system representation of the user
need, is generated.
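
As an illustration of the inverted file, the sketch below maps each term to the documents that contain it and answers a query by intersecting posting lists. It is a minimal sketch under simplifying assumptions: the only text operation is lowercasing plus whitespace splitting (applied identically to documents and query), and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

The index is built once and then reused for every query, which is why the cost of building it is amortized over many searches: each lookup touches only the posting lists of the query terms instead of scanning every document.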



The Retrieval Process …
❖ The query is then processed to obtain the retrieved documents
✓ Before the retrieved documents are sent to the user, they are
ranked according to their likelihood of relevance
❖ The user then examines the set of ranked documents in search
of useful information. Two choices for the user:
✓ Reformulate query, run on entire collection or
✓ Reformulate query, run on result set
❖ At this point, the user might pinpoint a subset of the documents
seen as definitely of interest and initiate a user feedback cycle
➢ In such a cycle, the system uses the documents selected by the
user to change the query formulation.
➢ Hopefully, this modified query is a better representation of the
real user need
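
One simple way a feedback cycle could change the query formulation is to add frequent terms from the documents the user selected. The sketch below is only an assumption about how this might be done, not the method of any particular system; `expand_query` and its `extra_terms` parameter are hypothetical names.

```python
from collections import Counter


def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent new terms from user-selected documents to the query."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked[:extra_terms]
```

Starting from the query `["car"]` and two selected documents about racing, the expanded query picks up "racing" first because it occurs most often in the selected documents.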


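Detail view of the Retrieval Process

[Figure: the user need enters through the User Interface; Text Operations produce the Logical View; Query Operations build the Query; the DB Manager Module performs Indexing of the Text Database into the Index (inverted file); Searching uses the Query and the Index to obtain Retrieved Docs; Ranking produces the Ranked Docs; User Feedback flows back to Query Operations]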
Issues that arise in IR
❖ Text representation
➢ What makes a “good” representation?
➢ How is a representation generated from text?
➢ What are retrievable objects and how are they organized?
❖ Information needs representation
➢ What is an appropriate query language?
➢ How can interactive query formulation and refinement be
supported?
❖ Comparing representations (to identify relevant
documents)
➢ What weighting scheme and similarity measure should be used?
➢ What is a “good” model of retrieval?
❖ Evaluating effectiveness of retrieval
➢ What are good metrics?
➢ What constitutes a good experimental test bed?



Focus in IR System Design
Our focus during IR system design is on:
❖ Improving the effectiveness of the system
➢ Effectiveness is measured in terms of
precision, recall, …
➢ Techniques: stemming, stop words, weighting schemes,
matching algorithms
❖ Improving the efficiency of the system
➢ The concern here is storage space usage, access time,
searching time, data transfer time, …
➢ Space-time tradeoffs are a concern
➢ Techniques: compression, data/file structures, etc.
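
Precision and recall, the effectiveness measures named above, have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal Python sketch follows; the function name and the convention of returning 0.0 for an empty set are choices made here for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against division by zero when either set is empty.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a system retrieves four documents of which two are relevant, and the collection holds three relevant documents in total, precision is 2/4 and recall is 2/3.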



Subsystems of an IR system
❖ The two subsystems of an IR system:
➢ Searching: an online process of finding relevant
documents in the index list as per the user's query
➢ Indexing: an offline process of organizing
documents using keywords extracted from the
collection
❖ Indexing and searching are unavoidably connected
➢ You cannot search what was not first indexed in some
manner or other
➢ Documents or objects are indexed in order to be
searchable
➢ To index, one needs an indexing language
❖ Knowing searching is knowing indexing


Indexing Subsystem

[Figure: documents → Assign document identifier → text and document IDs → Tokenize → tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → terms with weights → Index]
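
The last stage of the indexing pipeline, term weighting, can be sketched as follows. The slides do not say which weighting scheme is used; tf-idf with a natural logarithm is assumed here as one common choice, and the input is assumed to be term lists that have already been tokenized, filtered, and stemmed. `build_weighted_index` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict


def build_weighted_index(docs):
    """Return {term: {doc_id: tf-idf weight}} for already-processed term lists."""
    n_docs = len(docs)
    term_freqs = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())

    index = defaultdict(dict)
    for doc_id, counts in term_freqs.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)
```

With this weighting, a term that occurs in every document gets weight zero, since it does not help distinguish one document from another.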



Searching Subsystem
[Figure: query → parse query → query tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Query term weighting → query terms; query terms and index terms from the Index → Similarity Measure → relevant document set → ranking → ranked document set]
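
The similarity-measure and ranking stages can be sketched with cosine similarity over weighted term vectors. Cosine similarity is assumed here as one widely used measure; the slides do not name a specific one. Vectors are plain dictionaries from term to weight, and `cosine` and `rank` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs with non-zero score, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored = [(d, s) for d, s in scored if s > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Documents that share no term with the query score zero and are left out, so the result is the ranked document set rather than the whole collection.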
