Rice University
Department of Computer Science
presents
Eugene Agichtein
Columbia University
Extracting Relations from Large Text Collections
Abstract
A wealth of information is "buried" within unstructured text.
This information can be better exploited in structured or relational
form, which is more suited for sophisticated query processing, for
integration with relational database management systems, and for
data mining. My research on information extraction from text
collections has focused on two fundamental challenges: reducing the
effort needed to adapt an extraction system for new tasks, and
improving the efficiency of information extraction. As part of my
thesis, I have developed the Snowball information extraction system,
which can be adapted to new domains with minimal effort.
Snowball has been applied to diverse tasks such as extracting information
about companies from news stories and mining biological literature
for gene and protein synonyms. As another contribution of my
thesis, I have developed the QXtract system, which learns search
engine queries to retrieve documents that are relevant to a given
information extraction task. By processing only the relevant
documents, and ignoring the rest, QXtract can dramatically improve
the efficiency of the information extraction process. Together,
Snowball and QXtract provide crucial building blocks for portable
and scalable information extraction from large text collections and
the web at large.
Eugene Agichtein is a faculty candidate.
Monday, March 29 at 3:00 p.m. in DH 1070
Reception precedes the talk at 2:30 p.m. in DH 1049