Contextual Representation Learning for Text Data
Text data is being generated at an ever-increasing rate and is prevalent in many domains, such as social media, newspapers, clinical notes and online reviews. It contains rich information, and understanding it is important for Artificial Intelligence (AI) tasks, especially Natural Language Processing (NLP) tasks. The key to understanding text data lies in its representation, as the success of NLP algorithms depends heavily on the quality of the text representations. For that reason, many conventional NLP systems rely on hand-designed preprocessing pipelines and data transformations to provide good representations of text data. Such feature engineering is useful but requires careful design and prior knowledge. It is therefore desirable to learn representations of text data automatically and to make systems depend on as little feature engineering as possible. In this way, downstream NLP applications can be built faster and achieve better performance.
Context information of text data, including spatial context, temporal context and domain context, is a natural source for learning representations of text data, because it not only contains the syntactic and semantic information essential for learning such representations but is also easy and convenient to collect. I will first introduce how to extract semantic and syntactic features from text data based on its spatial context information, such as word-word co-occurrences and document-word co-occurrences, and how to coordinate both kinds of spatial context information. Next, I will demonstrate how to learn time-aware representations based on the temporal context information of text data, for example, temporal representations that capture the semantic evolution of words. Then I will show how to learn domain-specific representations of text data based on its domain context information, for example, by extracting domain-related features from documents given the task domain. Extensive evaluations are presented to demonstrate the effectiveness of the proposed contextual representation learning algorithms.
- Hongning Wang, Committee Chair (Department of Computer Science)
- Aidong Zhang, Advisor (Department of Computer Science)
- Yangfeng Ji (Department of Computer Science)
- Jundong Li (Department of Electrical and Computer Engineering)
- Stefan Bekiranov (Department of Biochemistry and Molecular Genetics)
- Heng Huang (Department of Electrical and Computer Engineering, University of Pittsburgh)