On SA Corpora
Movie Review Data includes:
- The sentiment polarity datasets - 1000 positive and 1000 negative reviews, lightly tokenized (contracted forms are preserved, but punctuation is spaced) and lowercased, one sentence per line. Reviews were labeled positive or negative according to their star rating.
- The sentiment scale datasets - four sets of approximately 1300 “subjective snippets” each from four reviewers. Each set is paired (with a per-line correspondence) with three files containing labels for each snippet: one with three classes, one with four classes, and a more fine-grained one (values in the range [0, 1] with a step size of 0.1 or smaller).
- The subjectivity datasets - 5000 subjective (reviews) and 5000 objective (movie plots) snippets, one per line.
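The per-line correspondence between snippet files and label files can be exploited with a simple line-by-line pairing. The following is a minimal sketch in Python, not the datasets' official loader: the function name `pair_snippets_with_labels` and the toy lines are hypothetical, and in practice the two iterables would be open file handles for a snippet file and its matching label file.

```python
from typing import Iterable, List, Tuple


def pair_snippets_with_labels(
    snippets: Iterable[str], labels: Iterable[str]
) -> List[Tuple[str, float]]:
    """Pair line i of the snippet source with line i of the label source."""
    snippet_list = [line.rstrip("\n") for line in snippets]
    label_list = [float(line.strip()) for line in labels]
    # The pairing is only meaningful if both sources have the same length.
    if len(snippet_list) != len(label_list):
        raise ValueError(
            f"{len(snippet_list)} snippets but {len(label_list)} labels"
        )
    return list(zip(snippet_list, label_list))


# Toy stand-ins for the contents of a snippet file and a fine-grained
# label file (illustrative lines, not taken from the corpus).
toy_snippets = ["a gripping , well-acted drama .\n", "flat and forgettable .\n"]
toy_ratings = ["0.8\n", "0.3\n"]

pairs = pair_snippets_with_labels(toy_snippets, toy_ratings)
for text, rating in pairs:
    print(rating, text)
```

The length check is there because a silently misaligned pair of files would attach every label to the wrong snippet.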
The MPQA Opinion Corpus is a database comprising five subsets of extensively (manually) annotated data. The annotations include:
- Agent - Marks phrases that refer to sources of private states and speech events, or phrases that refer to agents who are targets of an attitude.
- Expressive-subjectivity - Marks expressive-subjective elements, words and phrases that indirectly express a private state. For example, ‘fraud’ and ‘daylight robbery’ can be expressive-subjective elements.
- Direct-subjective - Marks direct mentions of private states and speech events (spoken or written) expressing private states.
- Objective-speech-event annotation - Marks speech events that do not express private states.
- Attitude - Marks the attitudes that compose the expressed private states (attitude is discussed in greater detail in the excerpt “Representing attitude and targets”).
- Target - Marks the targets of the attitudes, i.e., what the attitudes are about or what the attitudes are directed toward.
- Inside - The term ‘inside’ refers to the words inside the scope of a direct private state or speech event phrase.
Customer Reviews Datasets includes two sets of lightly tokenized reviews, covering five and nine products respectively. Each product feature immediately precedes a positive/negative rating tag (e.g. [+3]). Additional metadata indicates whether the feature is absent from the sentence, whether pronoun resolution may be needed, and whether the opinionated sentence is a comparison or a suggestion.
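Because each feature name is followed directly by its rating tag, the annotations can be pulled out with a regular expression. The sketch below is an illustration only. It assumes that features are comma-separated, that ratings are single signed digits, and that a `##` separator divides the annotations from the sentence; check these against the dataset's own readme. The example line and the function name `parse_features` are hypothetical.

```python
import re
from typing import List, Tuple

# A feature name (no commas or brackets) followed by a signed one-digit rating.
FEATURE_TAG = re.compile(r"([^,\[\]]+?)\s*\[([+-]\d)\]")


def parse_features(annotation: str) -> List[Tuple[str, int]]:
    """Return (feature, rating) pairs found in the annotation part of a line."""
    return [
        (name.strip(), int(score))
        for name, score in FEATURE_TAG.findall(annotation)
    ]


# Hypothetical line; the "##" separator is an assumption about the layout.
line = "picture quality[+3], battery life[-1]##the pictures are sharp but the battery drains fast ."
annotation, _, sentence = line.partition("##")
features = parse_features(annotation)
print(features)
print(sentence)
```

Other bracketed metadata tags are ignored by this pattern, since it only matches brackets containing a sign and a digit.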