The continued growth of online text data presents exciting
opportunities for automated knowledge discovery. In this talk, I
will present two lines of research developing machine learning
algorithms to convert large text collections into actionable
knowledge. First, I will discuss information extraction (IE),
which infers a relational database from unstructured text. After
giving an overview of several IE tasks, including entity
extraction, coreference resolution, and relation extraction, I
will describe a new learning algorithm, SampleRank, that
efficiently models the complex statistical dependencies inherent
in IE, and present state-of-the-art results on extracting information
from news stories. Second, I will turn to the analysis of informal
texts, specifically Twitter data. What can we infer about society
from this data? I will outline the fundamental challenges in this
line of research and present our work on monitoring flu activity,
alcohol consumption, and anxiety about Hurricane Irene, as well
as recent research inferring the geographical origin of Twitter
messages.