Rice Computer Science Colloquia
Rice University
Department of Computer Science
presents

Eugene Agichtein
Columbia University

Extracting Relations from Large Text Collections

Abstract

A wealth of information is "buried" within unstructured text. This information can be better exploited in structured or relational form, which is better suited to sophisticated query processing, integration with relational database management systems, and data mining. My research on information extraction from text collections has focused on two fundamental challenges: reducing the effort needed to adapt an extraction system to new tasks, and improving the efficiency of information extraction. As part of my thesis, I have developed the Snowball information extraction system, which can be adapted to new domains with minimal effort.

Snowball has been applied to diverse tasks such as extracting information about companies from news stories and mining biological literature for gene and protein synonyms. As another contribution of my thesis, I have developed the QXtract system, which learns search engine queries to retrieve documents that are relevant to a given information extraction task. By processing only the relevant documents and ignoring the rest, QXtract can dramatically improve the efficiency of the information extraction process. Together, Snowball and QXtract provide crucial building blocks for portable and scalable information extraction from large text collections and the web at large.

Eugene Agichtein is a faculty candidate.

Monday, March 29 at 3:00 p.m. in DH 1070
Reception precedes the talk at 2:30 p.m. in DH 1049
