pop.predict: Probabilistic Population Projection

Description

The function generates trajectories of probabilistic population projections for all countries for which input data are available, or for any subset of them.

Usage

pop.predict(end.year = 2100, start.year = 1950, present.year = 2020, 
    wpp.year = 2019, countries = NULL, 
    output.dir = file.path(getwd(), "bayesPop.output"),
    annual = FALSE,
    inputs = list(popM=NULL, popF=NULL, mxM=NULL, mxF=NULL, srb=NULL,
        pasfr=NULL, patterns=NULL, 
        migM=NULL, migF=NULL, migMt=NULL, migFt=NULL, mig=NULL,
        mig.fdm = NULL, e0F.file=NULL, e0M.file=NULL, tfr.file=NULL,
        e0F.sim.dir=NULL, e0M.sim.dir=NULL, tfr.sim.dir=NULL,
        migMtraj = NULL, migFtraj = NULL, migtraj = NULL,
        migFDMtraj = NULL, GQpopM = NULL, GQpopF = NULL, 
        average.annual = NULL), 
    nr.traj = 1000, keep.vital.events = FALSE, 
    fixed.mx = FALSE, fixed.pasfr = FALSE,
    lc.for.hiv = TRUE, lc.for.all = TRUE, mig.is.rate = FALSE,
    mig.age.method  = c("auto", "fdmp", "fdmnop", "rc"), mig.rc.fam = NULL,
    my.locations.file = NULL, replace.output = FALSE, verbose = TRUE, ...)

Value

Object of class bayesPop.prediction with the following elements:

base.directory: Full path to the base directory output.dir.
output.directory: Sub-directory relative to base.directory with the projections.
nr.traj: The actual number of trajectories of the projections.
quantiles: Three-dimensional array of projection quantiles (countries x number of quantiles x projection periods). The second dimension corresponds to the following quantiles: $0.025,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.975$.
traj.mean.sd: Three-dimensional array of projection mean and standard deviation (countries x 2 x projection periods). In the second dimension, the first element is the mean and the second is the standard deviation.
quantilesM, quantilesF: Quantiles of male and female projection, respectively. Same structure as quantiles.
traj.mean.sdM, traj.mean.sdF: Same as traj.mean.sd corresponding to male and female projection, respectively.
quantilesMage, quantilesFage: Four-dimensional array of age-specific quantiles of male and female projection, respectively (countries x age groups x number of quantiles x projection periods). The same quantiles are used as in quantiles.
quantilesPropMage, quantilesPropFage: Array of age-specific quantiles of male and female projection, respectively, divided by the total population. The dimensions are the same as in quantilesMage.
estim.years: Vector of time for which historical data was used in the projections.
proj.years: Vector of projection time periods starting with the present period.
wpp.year: The wpp year used.
inputs: List of input data used for the projection.
function.inputs: Content of the inputs argument passed to the function.
countries: Matrix of countries for which projection exists. It contains two columns: code, name.
ages: Vector of age groups.
annual: If TRUE, this object corresponds to a 1x1 prediction, otherwise 5x5.
cache: This component is added by get.pop.prediction and modified and used by pop.map and write.pop.projection.summary. It is an environment for caching and re-using results of expressions.
write.to.cache: Logical determining if cache should be modified.
is.aggregation: Logical determining if this object is a result of pop.predict or pop.aggregate.
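
As a minimal sketch of how the quantile arrays are laid out, the following builds a mock array with the documented dimensions (countries x 9 quantiles x projection periods) and extracts the median, which is the fifth entry of the second dimension. The array contents, its size and the name mock.quantiles are made up for illustration; in practice the array would come from the quantiles element of the returned object.

```r
# Quantile levels in the order documented for the second dimension
q.levels <- c(0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975)

# Mock array: 2 countries x 9 quantiles x 3 projection periods (made-up values)
mock.quantiles <- array(seq_len(2 * 9 * 3), dim = c(2, length(q.levels), 3))

# Position of the median within the second dimension
median.idx <- which(q.levels == 0.5)

# Median projection as a countries x periods matrix
pop.median <- mock.quantiles[, median.idx, ]

# Bounds of the 95% interval are the first and last quantiles
pop.lower95 <- mock.quantiles[, 1, ]
pop.upper95 <- mock.quantiles[, length(q.levels), ]
```

The same indexing would apply to quantilesM and quantilesF, which are documented as having the same structure.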

Arguments

end.year

End year of the projection.

start.year

First year of the historical data.

present.year

Year for which initial population data is to be used.

wpp.year

Year for which WPP data is used. The function loads a package called wpp$x$, where $x$ is the wpp.year, and uses its datasets as defaults if the corresponding inputs element is missing (see below).

countries

Array of country codes or country names for which a projection is generated. If it is NULL, all available countries are used. If it is NA and there is an existing projection in output.dir and replace.output=FALSE, then a projection is performed for all countries that are not included in the existing projection. Names of countries are matched to those in the UNlocations dataset (or in the dataset loaded from my.locations.file if used).

output.dir

Output directory of the projection. If there is an existing projection in output.dir and replace.output=TRUE, everything in the directory will be deleted.

annual

Logical. If TRUE, the simulation is assumed to be 1x1, i.e. one-year age groups and one-year time periods. Note that this is still an experimental feature!

inputs

A list of file names where input data is stored. It contains the following elements. Unless otherwise noted, these are tab-delimited ASCII files. Names of the default datasets from the corresponding wpp package, which are used if the corresponding element is NULL, are shown in brackets.

popM, popF

Initial male/female age-specific population (at time present.year) [popM, popF].

mxM, mxF

Historical data and (optionally) projections of male/female age-specific death rates [mxM, mxF] (see also argument fixed.mx).

srb

Projection of sex ratio at birth. [sexRatio]

pasfr

Historical data and (optionally) projections of percentage age-specific fertility rate [percentASFR] (see also argument fixed.pasfr).

patterns, mig.type

Migration type and base year of the migration. In addition, this dataset gives country-specific information on mortality and fertility age patterns as defined in [vwBaseYear]. patterns and mig.type have the same meaning and can be used interchangeably.

migM, migF, migMt, migFt, mig

Projection and (optionally) historical data of net migration on the same scale as the initial population. There are three ways of defining this quantity, listed here in order of priority: 1. via migM and migF, which should give male and female age-specific migration [migrationM, migrationF]; 2. via migMt and migFt, which should give male and female total net migration; 3. via mig, which should give the total net migration. For 2. and 3., the totals are disaggregated into age-specific migration by applying a schedule defined by the mig.age.method argument. If all of these input items are missing, for wpp.year = 2024 or 2012, the UN age schedules are used. For other WPP revisions, the migration schedules are reconstructed from total migration counts derived from migration using either the age.specific.migration or the migration.totals2age function.

mig.fdm

If mig.age.method is “fdmp” or “fdmnop”, this file is used to disaggregate total in- and out-migration into ages, giving proportions of the migration in-flow and out-flow for each age. It should have columns “country_code”, “age”, “in” and “out”, where the latter two should each sum to 1 for each location. By default the function uses the rc1FDM (annual) or rc5FDM (5-year) datasets. For locations where the unique identifier does not match the country code in these default datasets, Rogers-Castro curves are used, obtained via the function rcastro.schedule.
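
The column layout described above can be sketched as follows. This is an illustration only: the flat schedules are made up, the five-year age labels are assumed to follow the convention described for other inputs, and the file is assumed to be tab-delimited like most inputs elements.

```r
# Five-year age labels (assumed convention: "0-4", ..., "95-99", "100+")
ages <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

# Made-up flat in- and out-migration schedules for one location (code 218)
fdm <- data.frame(
    country_code = 218,
    age = ages,
    `in` = rep(1 / length(ages), length(ages)),
    out = rep(1 / length(ages), length(ages)),
    check.names = FALSE   # keep the column name "in" as is
)

# Each of the two schedules should sum to 1 within a location
in.sums <- tapply(fdm[["in"]], fdm$country_code, sum)
out.sums <- tapply(fdm[["out"]], fdm$country_code, sum)
stopifnot(all(abs(in.sums - 1) < 1e-8), all(abs(out.sums - 1) < 1e-8))

# Write a tab-delimited file whose path could be given as inputs$mig.fdm
fdm.file <- tempfile(fileext = ".txt")
write.table(fdm, file = fdm.file, sep = "\t", row.names = FALSE, quote = FALSE)
```

Checking the sums before writing the file catches schedules that were not normalised per location.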

e0F.file

Comma-delimited CSV file with results of female life expectancy (generated using bayesLife, function convert.e0.trajectories, file “ascii_trajectories.csv”). Required columns are “LocID”, “Year”, “Trajectory”, and “e0”. If this element is not NULL, the argument e0F.sim.dir is ignored. If both e0F.file and e0F.sim.dir are NULL, data from the corresponding wpp package is taken, namely the median projections as one trajectory and the low and high variants (if available) as second and third trajectory. For 5-year simulations, column “Year” should be the middle year of the time period, e.g. 2023, 2028 etc.

e0M.file

Comma-delimited CSV file containing results of male life expectancy (generated using bayesLife, function convert.e0.trajectories, file “ascii_trajectories.csv”). Required columns are “LocID”, “Year”, “Trajectory”, and “e0”. If this element is not NULL, the argument e0M.sim.dir is ignored. As in the female case, if both e0M.file and e0M.sim.dir are NULL, data from the corresponding wpp package is taken.

tfr.file

Comma-delimited CSV file with results of total fertility rate (generated using bayesTFR, function convert.tfr.trajectories, file “ascii_trajectories.csv”). Required columns are “LocID”, “Year”, “Trajectory”, and “TF”. If this element is not NULL, the argument tfr.sim.dir is ignored. If both tfr.file and tfr.sim.dir are NULL, data from the corresponding wpp package is taken (median and the low and high variants as three trajectories). Alternatively, this argument can be the keyword “median_” in which case only the wpp median is taken.
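
For orientation, a file with the required columns can be sketched in base R as below. The TFR values are invented, and the use of period middle years in the "Year" column is an assumption carried over from the description of e0F.file; files produced by convert.tfr.trajectories remain the documented source of such input.

```r
# Made-up TFR trajectories for one location (code 218); values are illustrative
tfr.traj <- expand.grid(
    LocID = 218,
    Year = c(2023, 2028, 2033),   # assumed: middle years of 5-year periods
    Trajectory = 1:2
)
# expand.grid varies Year fastest, so the first three values belong to trajectory 1
tfr.traj$TF <- c(2.3, 2.2, 2.1, 2.4, 2.3, 2.2)

# Write a comma-delimited file with the required column names
tfr.file <- tempfile(fileext = ".csv")
write.csv(tfr.traj, file = tfr.file, row.names = FALSE)

# The path would then be supplied as inputs = list(tfr.file = tfr.file)
```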

e0F.sim.dir

Simulation directory with results of female life expectancy (generated using bayesLife). It is only used if e0F.file is NULL.

e0M.sim.dir

Simulation directory with results of male life expectancy (generated using bayesLife). Alternatively, it can be the string “joint_”, in which case it is assumed that the male life expectancy was projected jointly from the female life expectancy (see joint.male.predict) and thus contained in the e0F.sim.dir directory. The argument is only used if e0M.file is NULL.

tfr.sim.dir

Simulation directory with results of total fertility rate (generated using bayesTFR). It is only used if tfr.file is NULL.

migMtraj, migFtraj, migtraj

Comma-delimited CSV file with male/female age-specific migration trajectories, or total migration trajectories (migtraj). If present, it replaces deterministic projections given by the mig* items. It has a similar format as e.g. e0M.file with columns “LocID”, “Year”, “Trajectory”, “Age” (except for migtraj) and “Migration”. For a five-year simulation, the “Age” column must have values “0-4”, “5-9”, “10-14”, ..., “95-99”, “100+”, and the “Year” column should be the middle year of the time period, e.g. 2023, 2028 etc. In an annual simulation, age is given by a single number between 0 and 100, and “Year” contains all projected years.
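
The age labels and column layout for a five-year age-specific trajectory file can be sketched as follows. The migration values are placeholders (all zero) and the location code and years are chosen only for illustration.

```r
# Age labels required for a five-year simulation: "0-4", ..., "95-99", "100+"
ages <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

# One row per location, period middle year, trajectory and age group
mig.traj <- expand.grid(
    LocID = 218,
    Year = c(2023, 2028),
    Trajectory = 1:2,
    Age = ages,
    stringsAsFactors = FALSE
)
mig.traj$Migration <- 0   # placeholder values, to be replaced by real trajectories

# Write a comma-delimited file; its path could be given as inputs$migMtraj
migM.file <- tempfile(fileext = ".csv")
write.csv(mig.traj, file = migM.file, row.names = FALSE)
```

For migtraj, which holds totals, the "Age" column would be dropped.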

migFDMtraj

Comma-delimited CSV file with trajectories of in- and out-migration schedules used for the FDM migration method, i.e. if mig.age.method is “fdmp” or “fdmnop”. The values have the same meaning as in the mig.fdm input item, except that here multiple trajectories of such schedules can be provided. It should have columns “LocID”, “Age”, “Trajectory”, “Value”, and “Parameter”. For “Age”, the same rules apply as for migMtraj above. The “Parameter” column should have values “in” for in-migration, “out” for out-migration and “v” for values of the variance denominator $v$ used in Equation 22 of Sevcikova et al (2024). For the $v$ parameter, the “Age” column should be left empty.

GQpopM, GQpopF

Age-specific population counts (male and female) that should be excluded from application of the cohort component method (CCM). It can be used for defining group quarters. These counts are removed from the population before the CCM projection and added back afterwards. It is not used when computing vital events on observed data. The datasets should have columns “country_code”, “age” and “gq”. In such a case the “gq” amount is applied to all years. To distinguish the amount that is added back for individual years, replace the “gq” column by columns indicating the individual years, i.e. single years for an annual simulation and time periods (e.g. “2020-2025”, “2025-2030”) for a 5-year simulation. For a five-year simulation, the “age” column should include values “0-4”, “5-9”, “10-14”, ..., “95-99”, “100+”. However, rows with zeros do not need to be included. In an annual simulation, age is given by a single number between 0 and 100.
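
The two layouts described above (one constant “gq” column versus one column per time period) can be sketched as follows. The counts are made up, their units are assumed to match those of the initial population inputs, and the files are assumed to be tab-delimited.

```r
# Constant group-quarters counts applied to all years (made-up numbers);
# age groups with zero counts are simply left out
gq <- data.frame(
    country_code = 218,
    age = c("15-19", "20-24", "80-84"),
    gq = c(1.5, 2.0, 0.8)
)

# Period-specific variant for a 5-year simulation: "gq" is replaced by period columns
gq.by.period <- data.frame(
    country_code = 218,
    age = c("15-19", "20-24", "80-84"),
    `2020-2025` = c(1.5, 2.0, 0.8),
    `2025-2030` = c(1.6, 2.1, 0.9),
    check.names = FALSE   # keep period labels as column names
)

# Write a tab-delimited file whose path could be given as inputs$GQpopM
gq.file <- tempfile(fileext = ".txt")
write.table(gq.by.period, file = gq.file, sep = "\t", row.names = FALSE, quote = FALSE)
```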

average.annual

Character string with values “TFR”, “e0M”, “e0F”. If this is a 5-year simulation but the TFR and/or e0 inputs come from an annual simulation, including the corresponding string here causes the TFR and/or e0 trajectories to be converted into 5-year averages.

nr.traj

Number of trajectories to be generated. If this number is smaller than the number of available trajectories of the probabilistic components (TFR, life expectancy and migration), the trajectories are equidistantly thinned. If all of those components contain fewer trajectories than nr.traj, the value is adjusted to the maximum number of available trajectories among the components. For components that have fewer trajectories than the adjusted number, the available trajectories are re-sampled, so that all components have the same number of trajectories.
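
The rule above can be illustrated with a small base-R sketch. This is not the package's implementation: the helper select.traj and the component counts are hypothetical, and the exact indices chosen by pop.predict when thinning or re-sampling may differ.

```r
# Hypothetical helper: pick trajectory indices for one component
select.traj <- function(available, nr.traj) {
    if (available >= nr.traj) {
        # Equidistant thinning over the available trajectories
        round(seq(1, available, length.out = nr.traj))
    } else {
        # Too few trajectories: re-sample with replacement
        sample(available, nr.traj, replace = TRUE)
    }
}

# Made-up numbers of available trajectories per component
available <- c(tfr = 1000, e0 = 500, mig = 200)
nr.traj <- 2000

# If every component has fewer trajectories than requested, adjust to the maximum
nr.traj.adj <- min(nr.traj, max(available))

set.seed(1)   # only to make the re-sampling reproducible in this sketch
idx <- lapply(available, select.traj, nr.traj = nr.traj.adj)
```

In this setup every component ends up with nr.traj.adj indices: the TFR trajectories are used as they are, while the life expectancy and migration trajectories are re-sampled.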

keep.vital.events

Logical. If TRUE, age- and sex-specific vital events of births and deaths, as well as other objects, are stored alongside the prediction; see Details.

fixed.mx

Logical. If TRUE, it is assumed that the datasets of death rates (mxM and mxF) include data for projection years, which are then used instead of the life expectancy.

fixed.pasfr

Logical. If TRUE, it is assumed that the dataset on percent age-specific fertility rate (percentASFR) includes data for projection years, which are then used instead of computing them on the fly.

lc.for.hiv

Logical controlling if the modified Lee-Carter method should be used for projection of mortality rates for countries with HIV epidemics. If FALSE, the function hiv.mortmod from the HIV.LifeTables package is used.

lc.for.all

Logical controlling if the modified Lee-Carter method should be used for projection of mortality rates for all countries. If FALSE, the corresponding method is determined by the columns “AgeMortProjMethod1” and “AgeMortProjMethod2” of the vwBaseYear dataset.

mig.is.rate

Logical determining if migration data are to be interpreted as net migration rates (TRUE) or counts (FALSE, default). It can also be a vector of two logicals, where the first element refers to observed data and the second element refers to predictions. A value of c(FALSE, TRUE) could for example be used if observed data in inputs$mig are counts, and migration trajectories in inputs$migtraj are rates.

mig.age.method

If migration is given as totals, this argument determines a method to disaggregate into age-specific migration.

The “rc” method uses a simple Rogers-Castro disaggregation, via the function rcastro.schedule. An alternative schedule can be passed via the mig.rc.fam argument.

Values “fdmp” and “fdmnop” trigger the Flow Difference Method (Sevcikova et al, 2024), where “fdmp” weights the flows by population, while “fdmnop” is an unweighted version. They both split the total net migration into total in- and out-migration and then disaggregate these flows separately. These two FDM methods use additional inputs in the inputs$mig.fdm and/or inputs$migFDMtraj components.

The “auto” method (default) uses “rc” if sex-specific migration totals are given, i.e. in inputs$migFt and inputs$migMt. If annual is FALSE and wpp.year is 2015, 2017 or 2019, then the residual method using the function age.specific.migration is used. Otherwise the “fdmp” method is applied.
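
The decision rule of the “auto” method, as described above, can be written out as a small sketch. The helper resolve.mig.age.method is hypothetical and not part of the package, and the label “residual” is used here only to name the age.specific.migration branch; it is not a valid value of mig.age.method.

```r
# Hypothetical helper mirroring the documented "auto" rule
resolve.mig.age.method <- function(has.sex.totals, annual, wpp.year) {
    # Sex-specific totals (inputs$migMt and inputs$migFt) given: Rogers-Castro
    if (has.sex.totals) return("rc")
    # 5-year simulation with WPP 2015, 2017 or 2019: residual method
    if (!annual && wpp.year %in% c(2015, 2017, 2019)) return("residual")
    # Everything else: population-weighted Flow Difference Method
    "fdmp"
}

resolve.mig.age.method(has.sex.totals = FALSE, annual = FALSE, wpp.year = 2019)
resolve.mig.age.method(has.sex.totals = FALSE, annual = TRUE, wpp.year = 2019)
```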

mig.rc.fam

Data frame providing a single family of Rogers-Castro parameters to be used if mig.age.method is set to “rc”. Mandatory columns are “age” and “prop”. Optionally, it can have a column “mig_sign” with values “Inmigration” and “Emigration” (distinguishing schedules to be applied for positive and negative migration, respectively) and a column “sex” with values “Female” and “Male”. The format corresponds to the dataset DemoTools::mig_un_families, subset to a single family. If this argument is NULL and mig.age.method = "rc", the function rcastro.schedule with equal sex ratio is used to distribute total migration into ages.

my.locations.file

Name of a tab-delimited ascii file with a set of all locations for which a projection is generated. Use this argument if you are projecting for a country/region that is not included in the standard UNlocations dataset. It must have the same structure.

replace.output

Logical. If TRUE, everything in the directory output.dir is deleted prior to the prediction.

verbose

Logical controlling the amount of output messages.

...

Additional arguments passed to the underlying function. These can be parallel and nr.nodes for parallel processing and the number of nodes, respectively, as well as further arguments passed for creating a parallel cluster.

Author

Hana Sevcikova, Thomas Buettner, based on code of Nan Li and helpful comments from Patrick Gerland

Details

The population projection is computed using the cohort component method and is based on an algorithm used by the United Nations Population Division (see also Sevcikova et al (2016b) in the References below). For each country, one projection is calculated for each trajectory of male and female life expectancy, TFR and possibly migration. This results in a set of population projection trajectories which forms the posterior distribution of the projection. The trajectories of life expectancy and TFR can be given either in their binary form generated by the packages bayesLife and bayesTFR, respectively (as directories e0M.sim.dir, e0F.sim.dir, tfr.sim.dir of the inputs argument), or as ASCII tables in csv format, see above. The number of trajectories for male and female life expectancy must match, as must the number for male and female migration.

The projection is generated sequentially location by location. Results are stored in a sub-directory of output.dir called prediction. There is one binary file per location, called totpop_country$x$.rda, where $x$ is the country code. It contains six objects: totp, totpf, totpm (trajectories of total population, age-specific female and age-specific male, respectively), totp.hch, totpf.hch, totpm.hch (the UN half-child variant for total population, age-specific female and age-specific male, respectively). Optionally, if keep.vital.events is TRUE, there is an additional file per country, called vital_events_country$x$.rda, containing the following objects: btm, btf (trajectories for births by age of mothers for male and female child, respectively), deathsm, deathsf (trajectories for age-specific male and female deaths, respectively), asfert (trajectories of age-specific fertility), mxm, mxf (trajectories of male and female age-specific mortality rates), migm, migf (if used, these are trajectories of male and female age-specific migration), btm.hch, btf.hch, deathsm.hch, deathsf.hch, asfert.hch, mxm.hch, mxf.hch (the UN half-child variant for age- and sex-specific births, deaths, fertility rates and mortality rates). An object of class bayesPop.prediction is stored in the same directory in a file prediction.rda. It is updated every time a country projection is finished.

See pop.trajectories for extracting trajectories.

To access a previously stored prediction object, use get.pop.prediction.
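
A hedged sketch of both access routes follows. The directory path and the helper totpop.file.name are hypothetical, and the direct file access assumes that output.directory is relative to base.directory, as described under Value. The part that needs the package and an existing projection is guarded so that it is not run by default.

```r
# Hypothetical helper: name of the per-location trajectory file
totpop.file.name <- function(code) paste0("totpop_country", code, ".rda")

if (FALSE) {
library(bayesPop)
sim.dir <- "bayesPop.output"          # hypothetical path to an existing output.dir
pred <- get.pop.prediction(sim.dir)   # restores the bayesPop.prediction object

# Trajectories for one country through the documented accessor
traj <- pop.trajectories(pred, "Ecuador")

# Direct access to the stored binary file for country code 218
e <- new.env()
load(file.path(pred$base.directory, pred$output.directory,
               totpop.file.name(218)), envir = e)
ls(e)   # should contain totp, totpf, totpm and their .hch counterparts
}
```

Loading into a separate environment avoids overwriting objects of the same names in the workspace.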

References

H. Sevcikova, A. E. Raftery (2016a). bayesPop: Probabilistic Population Projections. Journal of Statistical Software, 75(5), 1-29. doi:10.18637/jss.v075.i05

A. E. Raftery, N. Li, H. Sevcikova, P. Gerland, G. K. Heilig (2012). Bayesian probabilistic population projections for all countries. Proceedings of the National Academy of Sciences 109:13915-13921.

P. Gerland, A. E. Raftery, H. Sevcikova, N. Li, D. Gu, T. Spoorenberg, L. Alkema, B. K. Fosdick, J. L. Chunn, N. Lalic, G. Bay, T. Buettner, G. K. Heilig, J. Wilmoth (2014). World Population Stabilization Unlikely This Century. Science 346:234-237.

H. Sevcikova, N. Li, V. Kantorova, P. Gerland and A. E. Raftery (2016b). Age-Specific Mortality and Fertility Rates for Probabilistic Population Projections. In: Dynamic Demographic Analysis, ed. Schoen R. (Springer), pp. 285-310. Earlier version in arXiv:1503.05215.

H. Sevcikova, J. Raymer, A. E. Raftery (2024). Forecasting Net Migration By Age: The Flow-Difference Approach. arXiv:2411.09878.

Examples

if (FALSE) {
sim.dir <- tempfile()
# Countries can be given as a combination of numerical codes and names
pred <- pop.predict(countries=c("Netherlands", 218, "Madagascar"), nr.traj=3, 
           output.dir=sim.dir)
pop.trajectories.plot(pred, "Ecuador", sum.over.ages=TRUE)
unlink(sim.dir, recursive=TRUE)
}