WALS: The World Atlas of Language Structures (WALS)

Description

The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.

The first version of WALS was published as a book with CD-ROM in 2005 by Oxford University Press. The first online version was published in April 2008. The second online version was published in April 2011. The current dataset is WALS 2013, published on 14 November 2013.

The included dataset wals is a selection from the complete WALS data. It excludes attributes ("features" in WALS parlance) that are by definition duplicates of others (3, 25, 95, 96, 97), attributes that only list languages that are incompatible with other attributes (132, 133, 134, 135, 139, 140, 141, 142), and the "additional" attributes that are marked "B" through "Z". Languages that have no data left after removing those attributes are removed as well. The result is a dataset with 2566 languages and 131 attributes.
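The selection described above can be sketched on a toy table. This is only an illustration under stated assumptions, not the procedure actually used to build the dataset: the language codes, the values, and the column naming scheme (feature number followed by a letter, such as "81A") are all invented for the example.

```r
# Toy stand-in for the full WALS table: rows are languages, columns are features.
# Codes, values and the "number + letter" column names are assumptions.
full <- data.frame(
  "1A"  = c("Small", NA, NA),
  "3A"  = c("Low", NA, NA),     # feature 3 is on the excluded list
  "81A" = c("SOV", NA, "SVO"),
  "81B" = c(NA, "Both", NA),    # "additional" attribute marked B
  "95A" = c(NA, "OV", NA),      # feature 95 is on the excluded list
  row.names = c("aaa", "bbb", "ccc"),
  check.names = FALSE,
  stringsAsFactors = TRUE
)

# Feature numbers excluded according to the description.
excluded <- c(3, 25, 95, 96, 97, 132:135, 139:142)

# Split each column name into its numeric part and its letter suffix.
ids    <- colnames(full)
num    <- as.integer(sub("[A-Z]$", "", ids))
letter <- sub("^[0-9]+", "", ids)

# Step 1: keep only "A" attributes whose number is not excluded.
kept <- full[, !(num %in% excluded) & letter == "A", drop = FALSE]

# Step 2: drop languages that have no data left after step 1.
kept <- kept[rowSums(!is.na(kept)) > 0, , drop = FALSE]

dim(kept)
```

In this toy table, language "bbb" only has values for removed attributes, so it disappears in the second step.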

Usage

data(wals)

Format

A list with two dataframes:

data: the actual WALS data. The object wals$data contains a dataframe with data from 2566 languages on 131 different attributes. The column names identify the WALS features. For details about these features, see http://wals.info/chapter
meta: some metadata for the languages. The object wals$meta contains a dataframe with some limited meta-information about these 2566 languages.

The three-letter WALS-codes are used as rownames in both dataframes. Further, the object wals$meta contains the following variables.

name: a character vector giving a name for each language
genus: a factor with 522 levels with the genera according to M. Dryer
family: a factor with 215 levels with the families according to M. Dryer
longitude: a numeric vector with the longitude of each language
latitude: a numeric vector with the latitude of each language
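Because both dataframes share the same rownames, values from one can be attached to the other by row name. The following is a minimal sketch on a toy list with the shape described above; the helper coverage() is hypothetical (it is not part of any package), and all codes, names and values are invented.

```r
# Toy list mimicking the described format: two dataframes sharing row names.
toy <- list(
  data = data.frame(
    f1 = factor(c("a", NA, "b")),
    f2 = factor(c(NA, NA, "x")),
    row.names = c("aaa", "bbb", "ccc")
  ),
  meta = data.frame(
    name      = c("Language A", "Language B", "Language C"),
    genus     = factor(c("G1", "G1", "G2")),
    family    = factor(c("F1", "F1", "F2")),
    longitude = c(10.5, -60.0, 120.25),
    latitude  = c(50.0, -3.5, 23.75),
    row.names = c("aaa", "bbb", "ccc"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: count the coded (non-missing) attributes per language
# and attach the count to the metadata, matching rows by row name.
coverage <- function(w) {
  n_coded <- rowSums(!is.na(w$data))
  cbind(w$meta[names(n_coded), ], n_coded = n_coded)
}

res <- coverage(toy)
res[order(-res$n_coded), c("name", "family", "n_coded")]
```

With the real dataset loaded, the same call would presumably be coverage(wals), since wals is described as a list with the elements data and meta.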

Details

All details about the meaning of the variables, and much more meta-information, are available at http://wals.info.

References

Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info, Accessed on 2013-11-14.)

Examples

# NOT RUN {
# splitTable() and sim.att() used below come from the qlcMatrix package
library(qlcMatrix)
data(wals)

# plot all locations of the WALS languages, looks like a world map
plot(wals$meta[,4:5])

# turn the large and mostly empty dataframe into sparse matrices
# the recoding is optimized and quick for this reasonably large dataset
# this works well as long as everything fits within the available RAM
system.time(
  W <- splitTable(wals$data)
)

# as an aside: note that the recoding takes only about 30% of the space
as.numeric( object.size(W) / object.size(wals$data) )

# compute similarities (Chuprov's T, similar to Cramer's V) 
# between all pairs of variables using sparse Matrix methods
system.time(sim <- sim.att(wals$data, method = "chuprov"))

# some structure is visible in the clustering
# note: since R 3.1.0 the hclust method "ward" is called "ward.D"
rownames(sim) <- colnames(wals$data)
plot(hclust(as.dist(1-sim), method = "ward.D"), cex = 0.5)

# }
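
The example above mentions that Chuprov's T is similar to Cramer's V. As a rough illustration, both can be computed from the chi-squared statistic of a contingency table using their textbook definitions. This sketch uses invented toy vectors and base R only; it is not necessarily how sim.att computes the measure internally (for instance, handling of missing data is ignored here).

```r
# Two toy categorical variables (invented values).
x <- factor(c("a", "a", "b", "b", "c", "c", "a", "b"))
y <- factor(c("u", "u", "v", "v", "u", "v", "u", "v"))

tab <- table(x, y)

# Pearson chi-squared statistic without continuity correction.
# Warnings about small expected counts are suppressed for this toy table.
chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

n <- sum(tab)    # number of observations
r <- nrow(tab)   # number of categories of x
k <- ncol(tab)   # number of categories of y

# Textbook definitions: the two measures differ only in the normalization.
chuprov_t <- sqrt(chi2 / (n * sqrt((r - 1) * (k - 1))))
cramer_v  <- sqrt(chi2 / (n * min(r - 1, k - 1)))

c(chuprov_t = chuprov_t, cramer_v = cramer_v)
```

The two values coincide when both variables have the same number of categories; otherwise T is the smaller of the two.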