Plenary Lecture
Advanced Methods for Text Retrieval
Professor Ioana Moisil
Co-author: Lucian Blaga
Department of Computer Science and Automatic Control
Hermann Oberth Faculty of Engineering
Lucian Blaga University of Sibiu
Blvd. Victoriei 10, 550024 Sibiu
ROMANIA
E-mail: ioana.moisil@ulbsibiu.ro
Abstract: Information retrieval (IR) is one of the most challenging
fields of study. Today, information retrieval is defined as the
interdisciplinary science of searching for documents, for information within
documents, and for document metadata in databases and on the World Wide Web.
For many decades IR was concerned with finding the needed information in
large collections of text documents. The explosive growth in the use of
digital multimedia, from images and graphics to audio and video files, over the
Internet and wireless communications, or stored on DVDs and CD-ROMs, has
driven the development of a specific sub-field of IR: multimedia
information retrieval. In this lecture I will focus only on text retrieval
methods. The unit of retrieval is the document, and the documents extracted
from a collection form the text database. When we search the Web, the
documents are Web pages. The retrieval process is simple to state but not at all
trivial: a user issues a query; the retrieval system then finds a
set of documents relevant to the user’s query; the selected documents are
ranked by relevance score. The process can be tuned manually or
automatically. The most widely used format for the query is a list of keywords,
but other formats can be used: Boolean queries, phrase queries, proximity
queries, full documents, and natural language questions. Several
aspects turn retrieval from an apparently simple process into
a very complex and challenging one. First, the tremendous success of the Web
has turned it into the most important information source, overshadowing
traditional and digital libraries. That means we have to retrieve information
from a finite but practically unbounded collection of documents. For example, the
number of pages in Google's index is growing at an amazing rate: it
started with 26 million pages in 1998 and reached a trillion pages last year.
So the first challenge is to use retrieval methods that respond
rapidly. The second challenge is linked to relevance and the need to
reduce information overload. We will refer to these aspects throughout the
presentation.
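The query, match, and rank loop described in the abstract can be sketched in a few lines of Python. This is only a toy illustration, not the method presented in the lecture: the function names `tokenize` and `retrieve`, the sample documents, and the relevance score (a plain count of query-keyword occurrences) are all assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real systems do far more."""
    return text.lower().split()


def retrieve(query, documents, top_k=3):
    """Return (doc_id, score) pairs ranked by a toy relevance score.

    The score is the number of query-keyword occurrences in the document,
    an illustrative assumption rather than a recommended measure.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:  # keep only documents that match at least one keyword
            scored.append((doc_id, score))
    # Highest score first; ties are broken by doc_id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical three-document collection.
docs = {
    "d1": "text retrieval methods for large text collections",
    "d2": "multimedia retrieval of audio and video",
    "d3": "cooking recipes",
}
results = retrieve("text retrieval", docs)
print(results)
```

On a Web-scale collection the linear scan over all documents would be replaced by an index lookup, which is where the rapid-response challenge mentioned above comes in.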
In the first part of this lecture I will critically discuss the most widely used
information retrieval models, from those based on set theory (the standard and
extended Boolean models, the fuzzy retrieval model) to more recent algebraic
models (the vector space model, VSM, and its extensions: TVSM, latent semantic
analysis, term discrimination, the DSIR model). Some probabilistic models will
also be presented.
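As a rough illustration of the vector space model named above, the following Python sketch represents documents and a query as sparse tf-idf vectors and ranks documents by cosine similarity. The weighting scheme (raw term frequency times log(N/df)), the helper names, and the sample documents are assumptions chosen for brevity; the lecture may use different variants.

```python
import math
from collections import Counter


def build_idf(token_lists):
    """Inverse document frequency log(N / df) for every term in the collection."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(tokens, idf):
    """Sparse tf-idf vector as a dict; terms unknown to the collection are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (doc_index, similarity) pairs, most similar first."""
    token_lists = [doc.lower().split() for doc in documents]
    idf = build_idf(token_lists)
    doc_vectors = [tf_idf(tokens, idf) for tokens in token_lists]
    query_vector = tf_idf(query.lower().split(), idf)
    scores = [(i, cosine(query_vector, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical collection; "model" occurs everywhere, so its idf weight is 0.
documents = [
    "boolean model set theory",
    "vector space model cosine",
    "probabilistic model ranking",
]
ranking = rank("vector space", documents)
print(ranking)
```

Unlike a Boolean match, the cosine score grades documents on a continuous scale, which is what makes a ranked result list possible.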
The second part of the lecture will discuss relevance feedback and performance
measures. In the third part, the need for pre-processing text and Web pages will
be emphasised.
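The abstract does not say which performance measures are covered. Assuming the usual set-based ones (precision, recall, and their harmonic mean F1), a minimal Python sketch looks like this; the function name and the document identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant (the ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical judgement: 4 documents returned, 3 relevant overall, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r, f)
```

Precision penalises irrelevant results (information overload), while recall penalises missed relevant documents; the two usually trade off against each other.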
In place of conclusions, we will discuss the impact on text retrieval of two
innovative technologies, the Semantic Web and Web services, and of the Web 2.0
paradigm.
Brief Biography of the Speaker:
Ioana Moisil received the M.Sc. in Mathematics from the University of Bucharest
in 1971, the scientific degree in Statistical, Epidemiological and Operations
Research Methods Applied in Public Health and Medicine from the Université Libre
de Bruxelles, Belgium, in 1991, and the Ph.D. in Mathematics from the Romanian
Academy in 1997. She has worked at the National Institute for Research &
Development in Informatics - I.C.I (1971-1986), the Carol Davila Faculty of
Medicine Bucharest, Department of Biophysics, and the CCSSDM Center of the Ministry
of Health. At present she is a full-time Professor and a Senior Researcher at
the Department of Computer Science and Automatic Control – Faculty of
Engineering at the “Lucian Blaga” University of Sibiu. She is the
author/co-author of fourteen books and over 150 scientific papers. Her
scientific interests include intelligent systems, healthcare telematics, web
technologies, data mining, e-learning, modelling and simulation, uncertainty
management, and human-computer interaction. Professor Moisil participated in
several EU-funded projects as project manager for the national partner (Telenurse
ID ENTITY, MGT, PROPRACTITION, PRO-ACCESS), in Tempus projects, and in nationally
funded projects as research manager and software development coordinator (INFOSOC
– eUNIV, AMTRANS – eCASTOR, INFOSOC – e-Scribe, INFOSOC – DANTE,
e-EDU-Quality, eTransMobility, CNCSIS 2007 – code 33, Studies on multivariate
interpolation, polynomial classifiers and applications, CNCSIS 2007 – code
1502, Aspects concerning the psycho-cognitive abilities of artificial
intelligent agents and applications in ITC based education). Current research
focuses on information retrieval, meta-heuristics, and advanced classification
methods. Ioana Moisil is a member of EARLI (European Association for Research
in Learning and Instruction), the Romanian representative in the IMIA SIG
and EFMI WG5 Nursing Informatics, an honorary member of the Bohemian Medical
Association J.E.Purkyne of Bio-engineering and Medical Informatics, and a member of
the ISCB (International Society for Clinical Biostatistics) Romanian
National Group, of the Romanian Association of Engineers, of the IITM
(International Institute of Tele-Medicine), and of the Romanian Society of
Mathematical Sciences. She is vice-president of the Romanian Medical
Informatics Society, vice-president of the HIT Foundation for Health
Informatics and Telematics, and a member of RoCHI-ACM. Professor Moisil
serves on several international peer-review committees and conference
scientific boards.