3404 occurrences of four synonymous Finnish ‘think’ verbs (‘ajatella’: 1492; ‘miettiä’: 812; ‘pohtia’: 713; ‘harkita’: 387) in newspaper and Internet newsgroup discussion texts
data(think)
A data frame with 3404 observations on the following 27 variables:
Lexeme
A factor specifying one of the four ‘think’ verb synonyms
Polarity
A factor specifying whether the ‘think’ verb has negative polarity (Negation) or not (Other)
Voice
A factor specifying whether the ‘think’ verb is in the Passive voice or not (Other)
Mood
A factor specifying whether the ‘think’ verb is in the Indicative or Conditional mood or not (Other)
Person
A factor specifying whether the ‘think’ verb is in the First, Second, or Third person, or not (None)
Number
A factor specifying whether the ‘think’ verb is in the Plural number or not (Other)
Covert
A factor specifying whether the agent/subject of the ‘think’ verb is explicitly expressed as a syntactic argument (Overt), or only as a morphological feature of the ‘think’ verb (Covert)
ClauseEquivalent
A factor specifying whether the ‘think’ verb is used as a non-finite clause equivalent (ClauseEquivalent) or as a finite verb (FiniteVerbChain)
Agent
A factor specifying the occurrence of the Agent/Subject of the ‘think’ verb as either a Human Individual, Human Group, or as absent (None)
Patient
A factor specifying the occurrence of the Patient/Object argument among the semantic or structural subclasses as either a Human Individual or Group (IndividualGroup), Abstraction, Activity, Communication, Event, an ‘että’ (‘that’) clause (etta_CLAUSE), DirectQuote, IndirectQuestion, Infinitive, Participle, or as absent (None)
Manner
A factor specifying the occurrence of the Manner argument as any of its subclasses Generic, Negative (sufficiency), Positive (sufficiency), Frame, Agreement (Concur or Disagree), Joint (Alone or Together), or as absent (None)
Time
A factor specifying the occurrence of the Time argument (as a moment) as either of its subclasses Definite or Indefinite, or as absent (None)
Modality1
A factor specifying the main semantic subclasses of the entire Verb chain as either indicating Possibility, Necessity, or their absence (None)
Modality2
A factor specifying minor semantic subclasses of the entire Verb chain as indicating either a Temporal element (begin, end, continuation, etc.), External (cause), Volition, Accidental nature of the thinking process, or their absence (None)
Source
A factor specifying the occurrence of a Source argument, or its absence (None)
Goal
A factor specifying the occurrence of a Goal argument, or its absence (None)
Quantity
A factor specifying the occurrence of a Quantity argument, or its absence (None)
Location
A factor specifying the occurrence of a Location argument, or its absence (None)
Duration
A factor specifying the occurrence of a Duration argument, or its absence (None)
Frequency
A factor specifying the occurrence of a Frequency argument, or its absence (None)
MetaComment
A factor specifying the occurrence of a MetaComment, or its absence (None)
ReasonPurpose
A factor specifying the occurrence of a Reason or Purpose argument (ReasonPurpose), or their absence (None)
Condition
A factor specifying the occurrence of a Condition argument, or its absence (None)
CoordinatedVerb
A factor specifying the occurrence of a Coordinated Verb (in relation to the ‘think’ verb: CoordinatedVerb), or its absence (None)
Register
A factor specifying whether the ‘think’ verb occurs in the newspaper subcorpus (hs95) or the Internet newsgroup discussion corpus (sfnet)
Section
A factor specifying the subsection in which the ‘think’ verb occurs in either of the two subcorpora
Author
A factor specifying the author of the text in which the ‘think’ verb occurs, if that author is identifiable; authors in the Internet newsgroup discussion subcorpus are anonymized, and an unidentifiable/unknown author is designated as (None)
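As a quick check of the format described above, the following sketch loads the data frame and tabulates the outcome variable against the subcorpus. It assumes the ndl package, which ships this dataset, is installed; the requireNamespace() guard is an addition for illustration, so the snippet does nothing when the package is missing.

```r
# Sketch: inspect the structure of the 'think' data frame.
# Assumption: the 'ndl' package (which ships this dataset) is installed.
if (requireNamespace("ndl", quietly = TRUE)) {
  data(think, package = "ndl")

  # Dimensions: rows are verb occurrences, columns are the variables listed above
  print(dim(think))

  # Frequency of each of the four 'think' lexemes
  print(table(think$Lexeme))

  # Cross-tabulate lexeme by subcorpus (hs95 vs. sfnet)
  print(xtabs(~ Lexeme + Register, data = think))
}
```

The same xtabs() pattern works for any pair of the factors listed above, for example Lexeme against Person or Patient.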
The four most frequent synonyms meaning ‘think, reflect, ponder,
consider’, i.e. ‘ajatella, miettiä, pohtia, harkita’, were extracted
from two months of newspaper text from the 1990s (Helsingin Sanomat
1995) and six months of Internet newsgroup discussion from the early
2000s (SFNET 2002-2003), namely regarding (personal) relationships
(sfnet.keskustelu.ihmissuhteet) and politics
(sfnet.keskustelu.politiikka). The newspaper corpus consisted of
3,304,512 words of body text (i.e. excluding headers and captions as
well as punctuation tokens), and included 1,750 examples of the
studied ‘think’ verbs. The Internet corpus comprised 1,174,693 words of
body text, yielding 1,654 instances of the selected ‘think’
verbs. In terms of distinct identifiable authors, the newspaper
sub-corpus was the product of just over 500 journalists and other
contributors, while the Internet sub-corpus involved well over 1000
discussants. The think dataset contains a selection of 26
contextual features judged as most informative.
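The corpus sizes and hit counts quoted above imply quite different densities of ‘think’ verbs in the two registers. A minimal sketch of that arithmetic, using only the figures from the text (the variable names are illustrative, not part of the package):

```r
# Figures taken from the description above
words <- c(hs95 = 3304512, sfnet = 1174693)  # body-text words per subcorpus
hits  <- c(hs95 = 1750,    sfnet = 1654)     # 'think' verb instances

# The two subcorpora together account for every row of the data frame
stopifnot(sum(hits) == 3404)

# Occurrences per million words of body text
per_million <- hits / words * 1e6
print(round(per_million, 1))
```

On these figures the newsgroup text yields more than two and a half times as many ‘think’ verbs per word as the newspaper text, which is one reason Register is worth including as a predictor.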
For extensive details of the data and its linguistic and statistical
analysis, see Arppe (2008). For the full selection of contextual
features, see the amph (2008) microcorpus.
Arppe, A. 2008. Univariate, bivariate and multivariate methods in corpus-based lexicography -- a study of synonymy. Publications of the Department of General Linguistics, University of Helsinki, No. 44. URN: http://urn.fi/URN:ISBN:978-952-10-5175-3.
Arppe, A. 2009. Linguistic choices vs. probabilities -- how much and what can linguistic theory explain? In: Featherston, Sam & Winkler, Susanne (eds.) The Fruits of Empirical Linguistics. Volume 1: Process. Berlin: de Gruyter, pp. 1-24.
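# NOT RUN {
library(ndl)   # provides ndlClassify() and the 'think' dataset
data(think)
# Fit a naive discriminative learning classifier that predicts the verb
# from a subset of the contextual features
think.ndl <- ndlClassify(Lexeme ~ Person + Number + Agent + Patient + Register,
   data=think)
summary(think.ndl)
plot(think.ndl)
# }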