stevedata: Steve’s Toy Data for Teaching About a Variety of Methodological, Social, and Political Topics

{stevedata} is an R package full of toy data sets that you may find useful for various purposes. I’ve created probably over a hundred toy data sets along the way, either to riff on some topic on my blog, to show my students something in one of my many classes, or just to entertain myself by spamming Twitter with my assorted thoughts. I had stuffed a lot of these into {stevemisc}, but I want to keep that package mostly about the functions (and whatever data are useful for showing off the functions). {stevedata} will have all my toy data going forward.

I anticipate two sets of R users may find these data useful. First, instructors may find these data useful for classes on a variety of topics, most prominently quantitative methods and international relations. Many of the toy data sets included in this R package are data I’ve acquired or assembled to teach about topics in quantitative methods or international relations in a reproducible way. Users should see my Github repositories for my classes on introduction to international relations, quantitative methods in political science, and foundations of social science research for public policy to see how I’ve used these data (or development versions of them). Topics here are diverse, including (but not limited to) carbon dioxide emissions over 800,000 years (as an illustration of climate change), coffee prices (as an illustration of the worsening terms of trade), the justifiability of bribe-taking (as an illustration of information-poor and discrete variables that a researcher may be tempted to treat as drawn from a normal distribution), the canonical case of illiteracy rates in the 1930 U.S. Census (as an illustration of an ecological fallacy), and many, many more topics.

Second, my students in these classes (but especially my methods classes) should find this R package useful. I will also be having my methods students (undergraduate and graduate) download this package to work through problem sets in the R programming language. It’d be a benefit to them (and less hassle/headache for myself) to have my students download this package from CRAN rather than work through potential curl issues by installing through Github.

In almost all instances, each data set has an underlying script that generates it. These scripts are in a data-raw directory that is not included in the Github repository or the R package. However, I invite users to reach out with questions about the data if they have them.

Installation

This package is now on CRAN. You can download it as you would any other R package.

install.packages("stevedata")

You can also install the development version of {stevedata} from Github via the {devtools} package. I suppose using the {remotes} package would work as well.

devtools::install_github("svmiller/stevedata")
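If you script your setup, a small guard can skip the install when the package is already available. The following is only a sketch: `ensure_installed()` is a hypothetical helper name, not part of {stevedata}, and its `installer` argument is an assumption added so that you can swap in a different installer.

```r
# Hypothetical helper (not part of {stevedata}): install a package only
# when it is not already available in the current library paths.
ensure_installed <- function(pkg, installer = utils::install.packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE)) # already installed, nothing to do
  }
  installer(pkg)
  invisible(TRUE) # an install was attempted
}

# A package that ships with R is always present, so this installs nothing.
ensure_installed("stats")
```

Calling `ensure_installed("stevedata")` would then fetch the CRAN release only when needed. For the development version, `remotes::install_github("svmiller/stevedata")` is the {remotes} counterpart of the {devtools} call above.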

Usage

The package already has a lot to offer those who might be curious about its contents. You can run this to see which data sets it contains.

data(package = "stevedata")

The ensuing output will look like this.

Object Name            Title/Description
af_crime93             Statewide Crime Data (1993)
aluminum_premiums      LME Aluminum Premiums Data
anes_partytherms       Major Party (Democrat, Republican) Thermometer Index Data (1978-2012)
anes_prochoice         Abortion Attitudes (ANES, 2012)
anes_vote84            Simple Data for a Simple Model of Individual Voter Turnout (ANES, 1984)
Arca                   NYSE Arca Steel Index data, 2017–present
arcticseaice           Arctic Sea Ice Extent Data, 1901-2015
arg_tariff             Simple Mean Tariff Rate for Argentina
asn_stats              Aviation Safety Network Statistics, 1942-2019
CFT15                  Randomization Inference in the Regression Discontinuity Design: An Application to Party Advantages in the U.S. Senate
clemson_temps          Daily Clemson Temperature Data
co2emissions           Carbon Dioxide Emissions Data
coffee_imports         Coffee Imports for Select Importing Countries
coffee_price           The Primary Commodity Price for Coffee (Arabica, Robustas)
CP77                   Education Expenditure Data (Chatterjee and Price, 1977)
Datasaurus             The Datasaurus Dozen
Dee04                  Are There Civics Returns to Education?
DJIA                   Dow Jones Industrial Average, 1885-Present
DST                    Casualties/Fatalities in the U.S. for Drunk-Driving, Suicide, and Terrorism
eight_schools          The Effect of Special Preparation on SAT-V Scores in Eight Randomized Experiments
election_turnout       State-Level Education and Voter Turnout in 2016
eq_passengercars       Export Quality Data for Passenger Cars, 1963-2014
ESS9GB                 British Attitudes Toward Immigration (2018-19)
ESSBE5                 Trust in the Police in Belgium (European Social Survey, Round 5)
eustates               EU Member States (Current as of 2019)
fakeAPI                Hypothetical (Fake) Data on Academic Performance
fakeLogit              Fake Data for a Logistic Regression
fakeTSCS               Fake Data for a Time-Series Cross-Section
fakeTSD                Fake Data for a Time-Series
ghp100k                Gun Homicide Rate per 100,000 People, by Country
gss_abortion           Abortion Opinions in the General Social Survey
gss_spending           Attitudes Toward National Spending in the General Social Survey (2018)
gss_wages              The Gender Pay Gap in the General Social Survey
Guber99                School Expenditures and Test Scores for 50 States, 1994-95
illiteracy30           Illiteracy in the Population 10 Years Old and Over, 1930
LOTI                   Land-Ocean Temperature Index, 1880-2020
LTPT                   Long-Term Price Trends for Computers, TVs, and Related Items
LTWT                   “Let Them Watch TV”
min_wage               History of Federal Minimum Wage Rates Under the Fair Labor Standards Act, 1938-2009
mm_mlda                Minimum Legal Drinking Age Fatalities Data
mm_nhis                Data from the 2009 National Health Interview Survey (NHIS)
mm_randhie             Data from the RAND Health Insurance Experiment (HIE)
mvprod                 Motor Vehicle Production by Country, 1950-2019
nesarc_drinkspd        The Usual Daily Drinking Habits of Americans (NESARC, 2001-2)
Newhouse77             Medical-Care Expenditure: A Cross-National Survey (Newhouse, 1977)
ODGI                   Ozone Depleting Gas Index Data, 1992-2019
Presidents             U.S. Presidents and Their Terms in Office
pwt_sample             Penn World Table (9.1) Macroeconomic Data for Select Countries, 1950-2017
quartets               Anscombe’s (1973) Quartets
recessions             United States Recessions, 1855-present
SBCD                   Systemic Banking Crises Database II
SCP16                  South Carolina County GOP/Democratic Primary Data, 2016
sealevels              Global Average Absolute Sea Level Change, 1880–2015
so2concentrations      Sulfur Dioxide Emissions, 1980-2020
steves_clothes         Steve’s (Professional) Clothes, as of March 20, 2022
sugar_price            IMF Primary Commodity Price Data for Sugar
thatcher_approval      Margaret Thatcher Satisfaction Ratings, 1980-1990
therms                 Thermometer Ratings for Donald Trump and Barack Obama
turnips                Turnip prices in Animal Crossing (New Horizons)
TV16                   The Individual Correlates of the Trump Vote in 2016
ukg_eeri               United Kingdom Effective Exchange Rate Index Data, 1990-2019
uniondensity           Cross-National Rates of Trade Union Density
usa_chn_gdp_forecasts  United States-China GDP and GDP Forecasts, 1960-2050
usa_computers          Percentage of U.S. Households with Computer Access, by Year
usa_migration          U.S. Inbound/Outbound Migration Data, 1990-2017
usa_states             State Abbreviations, Names, and Regions/Divisions
usa_tradegdp           U.S. Trade and GDP, 1790-2018
voteincome             Sample Turnout and Demographic Data from the 2000 Current Population Survey
wvs_ccodes             Syncing World Values Survey Country Codes with CoW Codes
wvs_immig              Attitudes about Immigration in the World Values Survey
wvs_justifbribe        Attitudes about the Justifiability of Bribe-Taking in the World Values Survey
wvs_usa_abortion       Attitudes on the Justifiability of Abortion in the United States (World Values Survey, 1982-2011)
wvs_usa_educat         Education Categories for the United States in the World Values Survey
wvs_usa_regions        Region Categories for the United States in the World Values Survey
yugo_sales             Yugo Sales in the United States, 1985-1992
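If you would rather work with this listing programmatically than read it in the pager, the object returned by `data()` can be reshaped into a data frame. This is a minimal sketch: `list_package_data()` is a hypothetical helper name, and the demo uses the {datasets} package only because it ships with every R installation. Substitute "stevedata" once the package is installed.

```r
# Hypothetical helper: return the data sets in a package as a data frame.
# utils::data(package = ...) returns an object whose $results element is a
# character matrix with the columns "Package", "LibPath", "Item", and "Title".
list_package_data <- function(pkg) {
  res <- utils::data(package = pkg)$results
  data.frame(
    object = res[, "Item"],
    title = res[, "Title"],
    stringsAsFactors = FALSE
  )
}

# Demo on a package that ships with R; use "stevedata" in practice.
toy <- list_package_data("datasets")
head(toy)
```

With {stevedata} installed, a single data set can then be loaded by name, for example with `data("quartets", package = "stevedata")`.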

Here is a simple scraping job to provide more information (by way of the description field in the associated R Documentation file). I include these descriptions as a vignette as well.

Object Name            Description
af_crime93             These data are in Table 9.1 of the 3rd edition of Agresti and Finlay’s Statistical Methods for the Social Sciences. The data are from Statistical Abstract of the United States and most variables were measured in 1993.
aluminum_premiums      A near daily data set on the price of aluminum premiums (USD/MT) for LME in the U.S., Western Europe, East Asia, and Southeast Asia. I like these data as illustrative of some of the shortsightedness of the aluminum tariffs that Donald Trump announced in March 2018. The tariffs had no discernible effect on manufacturing employment or earnings, but they created a supply shock that made aluminum more expensive.
anes_partytherms       A data set on thermometer ratings for the Democratic party, Republican party, “both major parties”, and a major party thermometer index from the American National Election Studies (1978-2012).
anes_prochoice         A simple data set for in-class illustration about how to estimate and interpret interactive relationships. The data here are deliberately minimal for that end.
anes_vote84            This is a simple data set for estimating a simple model on voter turnout from the 1984 American National Election Studies (ANES) time-series.
Arca                   Daily data on the NYSE Arca Steel Index. These data are useful for me in teaching how Trump’s 2018 steel tariffs didn’t do much good for the steel industry.
arcticseaice           This data set from Connelly et al. (2017) measures the Arctic sea ice extent in 10^6 square kilometers. It includes lower bounds and upper bounds on annual averages.
arg_tariff             Simple mean tariff rate for Argentina, starting in 1980. The goal is to keep these data current.
asn_stats              These are yearly counts on air accidents and fatalities, including measures for corporate jet accidents and hijackings. The hijackings are of particular interest to me, at least from a historical terrorism perspective.
CFT15                  This is the replication data for “Randomization Inference in the Regression Discontinuity Design: An Application to Party Advantages in the U.S. Senate”, published in 2015 in Journal of Causal Inference. I use these data to teach about regression discontinuity designs.
clemson_temps          This data set contains daily temperatures (highs and lows) for Clemson, South Carolina from Jan. 1, 1930 to the end of the most recent calendar year. The goal is to update this periodically with new data for as long as I live in this town.
co2emissions           This is a sample data set, cobbled from various sources, about carbon dioxide emissions in the history of the planet from 800,000 BCE to the most recently concluded calendar year. I use this for a data visualization example for a lecture on climate change and international politics. Data communicate yearly averages/estimates.
coffee_imports         A simple time series on coffee imports for select importing countries (i.e. European Union + Japan + Russia + Tunisia + United States).
coffee_price           This is primary commodity price data for coffee (Arabica, Robustas) from 1980 to the present. I manually update these data since FRED’s coverage since 2017 has been spotty.
CP77                   This is a simple data set provided by Chatterjee and Price (1977, p. 108) that serves as a known example of heteroscedasticity.
Datasaurus             An illustrative exercise in never trusting the summary statistics without also visualizing them.
Dee04                  This should be a data set for a (partial?) replication of Dee’s (2004) article on the purported civics returns to education. I use these data for in-class illustration about instrumental variable analyses.
DJIA                   This data set contains the value of the Dow Jones Industrial Average on daily close for all available dates (to the best of my knowledge) from 1885 to the most recently concluded calendar year. Extensions shouldn’t be too difficult with existing packages.
DST                    These are fatalities (and, in the case of terrorism, casualties as well) for drunk-driving, suicide, and acts of terrorism in the U.S. spanning 1970 to 2018. Only one of these is sufficiently important to command public attention despite being the least severe public bad. Do you want to guess which one?
eight_schools          You’ve all seen these before. These are the “eight schools” that everyone gets when being introduced to Bayesian programming. Here are the full data for your consideration, which you can use instead of awkwardly searching where the data are and copy-pasting them as a list. Every damn time, Steve.
election_turnout       A simple data set on education and state-level (+ DC) turnout in the 2016 presidential election. This is inspired by what Pollock (2012) does in his book.
eq_passengercars       Data from the International Monetary Fund for the export quality and unit/trade value of passenger cars for all available countries and years from 1963 to 2014.
ESS9GB                 This is a replication data set originally meant to accompany a blog post and presentation to students at the University of Nottingham in March 2020. However, COVID-19 led to the cancellation of the talk.
ESSBE5                 This is a sample data set cobbled from the fifth round of European Social Survey data for Belgium. It offers a means to do a basic replication of some of Chapter 5 of The SAGE Handbook of Regression Analysis and Causal Inference.
eustates               European Union membership by accession date
fakeAPI                This is a hypothetical universe of schools in a given territorial unit, patterned off the apipop data available in the survey package.
fakeLogit              This is a simple fake data set to illustrate a logistic regression.
fakeTSCS               This is a toy (i.e. “fake”) data set created by the fabricatr package. There are 100 observations for 25 hypothetical countries. The outcome y is a linear function of a baseline for each hypothetical country, plus a yearly growth trend as well as varying growth errors for each country. x1 is supposed to have a linear effect of .5 on y, all things considered. x2 is supposed to have a linear effect of 1 on y for each unit change in x2, all things considered.
fakeTSD                This is a toy (i.e. “fake”) data set created by the fabricatr package. There are 100 observations. The outcome y is a linear function of 20 + (.25 * year) + (.25 * x1) + (1 * x2) + e. This clearly implies some autocorrelation in the data. I.e. it’s a time-series.
ghp100k                This is the yearly rate of gun homicides per 100,000 people in the population, selecting on “Western” countries of interest.
gss_abortion           This is a toy data set derived from the General Social Survey that I intend to use for several purposes. First, the battery of abortion items can serve as toy data to illustrate mixed effects modeling as equivalent to a one-parameter (Rasch) model. Second, I include some covariates to also do some basic regressions. I think abortion opinions are useful learning tools for statistical inference for college students. Third, there’s a time-series component as well for understanding how abortion attitudes have changed over time.
gss_spending           This is a toy data set that collects attitudes toward national spending for various things in the General Social Survey for 2018. I use these data for in-class illustration about ordinal variables and ordinal models.
gss_wages              Wage data from the General Social Survey (1974-2018) to illustrate wage discrepancies by gender (while also considering respondent occupation, age, and education).
Guber99                A data set for a canonical case of a Simpson’s paradox, useful for in-class instruction on the topic.
illiteracy30           This is perhaps the canonical data set for illustrating the ecological fallacy.
LOTI                   These data contain monthly mean temperature anomalies expressed as deviations from the corresponding 1951-1980 means. They are useful for showing how we can measure climate change.
LTPT                   These data are a monthly time-series of changes in the consumer price index relative to a Dec. 1997 starting date for televisions, computers, and related items. I use this as in-class illustration that globalization has made consumer electronics cheaper across the board for Americans.
LTWT                   “Let Them Watch TV”: These data contain price indices for various items for the general urban consumer. Categories include medical services, college tuition, college textbooks, child care, housing, food and beverages, all items (i.e. general CPI), new vehicles, apparel, and televisions. The base period in value was originally the 1982-4 average, but I converted the base period to January 2000. I use these data for in-class discussion about how liberalized trade has made consumer electronics (like TVs) fractions of their past prices. Yet, young adults face mounting costs for college, child-raising, and health care that government policy has failed to address.
min_wage               A data set on the various federal minimum wage rates.
mm_mlda                These are data you can use to replicate the regression discontinuity design analyses throughout Chapter 4 of Mastering ’Metrics. Original analyses come from Carpenter and Dobkin (2009, 2011).
mm_nhis                These are data from the 2009 NHIS survey. People who have read Mastering ’Metrics should recognize these data. They’re featured prominently in that book and the authors’ discussion of random assignment and experiments.
mm_randhie             These are data from the RAND Health Insurance Experiment (HIE). People who have read Mastering ’Metrics should recognize these data. They’re featured prominently in that book and the authors’ discussion of random assignment and experiments.
mvprod                 Data, largely from Organisation Internationale des Constructeurs d’Automobiles (OICA), on motor vehicle production in various countries (and the world totals) from 1950 to 2019 at various intervals. Tallies include production of passenger cars, light commercial vehicles, minibuses, trucks, buses and coaches.
nesarc_drinkspd        This toy data set is loosely modified from Wave I of the NESARC data set. Here, my main interest is the number of drinks consumed on a usual day drinking alcohol in the past 12 months, according to respondents in the nationally representative survey of 43,093 Americans.
Newhouse77             These are the data in Newhouse’s (1977) simple OLS model from 1977. In his case, he’s trying to explain medical care expenditures as a function of GDP per capita for these countries. It’s probably the easiest OLS model I can find in print because Newhouse helpfully provides all the data in one simple table.
ODGI                   The NOAA Earth System Research Laboratory has an “ozone depleting gas index” (ODGI) data set from 1992 to 2018. This dataset summarizes Table 1 and Table 2 from its website. The primary interest here (for my purposes) is the ODGI indices (including the new 2012 measure). The data set includes constituent greenhouse gases/chlorines as well in parts per trillion. The primary use here is for in-class illustration.
Presidents             This should be self-evident. Here are all U.S. presidents who have completed their terms in office (i.e. excluding the current one).
pwt_sample             These are some macroeconomic data for 21 select (rich) countries. I’ve used these data before to discuss issues of grouping and skew in cross-sectional data.
quartets               These are four x-y data sets, combined into a long format, which have the same traditional statistical properties (mean, variance, correlation, regression line, etc.). However, they look quite different.
recessions             Data on U.S. recessions, past to present. Data include information on contraction, expansion, and cycle.
SBCD                   A data set on banking, currency, debt, and debt-restructuring crises from 1970 to 2017.
SCP16                  County-level data on vote share and various background/demographic information for the 2016 South Carolina GOP/Democratic primaries.
sealevels              These data describe how sea level has changed over time, in both relative and absolute terms. Absolute sea level change refers to the height of the ocean surface regardless of whether nearby land is rising or falling.
so2concentrations      This data set contains yearly observations by the Environmental Protection Agency on the concentration of sulfur dioxide in parts per billion, based on 32 sites. I use this for in-class illustration. Note that the national standard is 75 parts per billion. Data are the national trend.
steves_clothes         I cobbled together this data set of the professional clothes (polos, long-sleeve dress shirts, pants) in my closet, largely for illustration on the origins of apparel in the U.S. for an intro lecture on trade.
sugar_price            This is primary commodity price data
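For readers curious how descriptions like those above can be pulled from a package's R Documentation files, here is one possible approach. It is a rough sketch and not the author's scraping script: `get_descriptions()` is a hypothetical helper name, and the demo runs on {stats} only because it ships with R. Passing "stevedata" instead should work the same way once the package is installed.

```r
# Hypothetical helper: extract the \description section of every Rd file
# in an installed package, using tools::Rd_db() to read the parsed Rd files.
get_descriptions <- function(pkg) {
  rd <- tools::Rd_db(pkg)
  vapply(rd, function(page) {
    # Each top-level element of a parsed Rd page carries an "Rd_tag" attribute.
    tags <- vapply(page, function(el) {
      tag <- attr(el, "Rd_tag")
      if (is.null(tag)) "" else tag
    }, character(1))
    desc <- page[tags == "\\description"]
    if (length(desc) == 0) return(NA_character_)
    # Flatten nested markup to plain text and collapse runs of whitespace.
    txt <- paste(unlist(desc), collapse = "")
    trimws(gsub("\\s+", " ", txt))
  }, character(1))
}

# Demo on a package that ships with R; use "stevedata" in practice.
desc <- get_descriptions("stats")
head(desc, 3)
```

The result is a named character vector keyed by Rd file name, which can be bound into a two-column table of object names and descriptions.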

Monthly Downloads: 566
Version: 0.7.0
License: GPL-2
Maintainer: Steven Miller
Last Published: April 2nd, 2022
