PAFit (version 1.2.10)

joint_estimate: Joint inference of attachment function and node fitnesses

Description

This function jointly estimates the attachment function \(A_k\) and node fitnesses \(\eta_i\). It first performs cross-validation to select the optimal parameters \(r\) and \(s\), then estimates \(A_k\) and \(\eta_i\) using that optimal pair with the full data (Ref. 2).

Usage

joint_estimate(net_object                               , 
              net_stat      = get_statistics(net_object), 
              p             = 0.75                      ,
              stop_cond     = 10^-8                     ,
              mode_reg_A    = 0                         , 
              ...)

Value

Outputs a Full_PAFit_result object, which is a list containing the following fields:

  • cv_data: a CV_Data object which contains the cross-validation data. This is the testing data.

  • cv_result: a CV_Result object which contains the cross-validation result. Normally the user does not need to pay attention to this data.

  • estimate_result: this is a PAFit_result object which contains the estimated attachment function \(A_k\), the estimated fitnesses \(\eta_i\) and their confidence intervals. In particular, the important fields are:

    • ratio: this is the selected value for the hyper-parameter \(r\).

    • shape: this is the selected value for the hyper-parameter \(s\).

    • k and A: a degree vector and the estimated PA function.

    • var_A: the estimated variance of \(A\).

    • var_logA: the estimated variance of \(\log A\).

    • upper_A: the upper value of the interval of two standard deviations around \(A\).

    • lower_A: the lower value of the interval of two standard deviations around \(A\).

    • center_k and theta: when we perform binning, these are the centers of the bins and the estimated PA values for those bins. theta is similar to A but with duplicated values removed.

    • var_bin: the variance of theta. Same as var_A but with duplicated values removed.

    • upper_bin: the upper value of the interval of two standard deviations around theta. Same as upper_A but with duplicated values removed.

    • lower_bin: the lower value of the interval of two standard deviations around theta. Same as lower_A but with duplicated values removed.

    • g: the number of bins used.

    • alpha and ci: alpha is the estimated attachment exponent \(\alpha\) (assuming \(A_k = k^\alpha\)), while ci is its confidence interval.

    • loglinear_fit: this is the fitting result when we estimate \(\alpha\).

    • f: the estimated node fitnesses.

    • var_f: the estimated variance of \(\eta_i\).

    • upper_f: the estimated upper value of the interval of two standard deviations around \(\eta_i\).

    • lower_f: the estimated lower value of the interval of two standard deviations around \(\eta_i\).

    • objective_value: values of the objective function over iterations in the final run with the full data.

    • diverge_zero: a logical value indicating whether the algorithm diverged in the final run with the full data.

  • contribution: a list containing an estimate of the contributions of preferential attachment and fitness mechanisms in the growth process of the network. The calculation adapts a quantification method proposed in Section 3 of Ref. 4, which is for preferential attachment and transitivity, to preferential attachment and fitness.

    • PA_contribution: an array containing the contributions of preferential attachment at each time-step.

    • fit_contribution: an array containing the contributions of the fitness mechanism at each time-step.

    • mean_PA_contrib: the average contribution of preferential attachment through the whole growth process.

    • mean_fit_contrib: the average contribution of the fitness mechanism through the whole growth process.
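
The alpha, ci and loglinear_fit fields come from fitting the form \(A_k = k^\alpha\). As a rough illustration of the idea only, the sketch below recovers an exponent from a degree vector and attachment values with an ordinary log-log regression in base R. The data are simulated, the variable names are hypothetical, and the use of an unweighted lm fit is an assumption: the fitting procedure PAFit actually uses may differ.

```r
# Simulated (hypothetical) data: A_k = k^0.8 with small multiplicative noise.
set.seed(1)
k <- 1:50
A <- k^0.8 * exp(rnorm(length(k), sd = 0.05))

# Under A_k = c * k^alpha, log(A_k) = log(c) + alpha * log(k),
# so alpha is the slope of a regression of log(A) on log(k).
fit <- lm(log(A) ~ log(k))

# Estimated exponent and a 95% confidence interval for it.
alpha_hat <- unname(coef(fit)[2])
alpha_ci  <- confint(fit)[2, ]

print(alpha_hat)
print(alpha_ci)
```

With a real result, the corresponding inputs would be the k and A (or center_k and theta) fields of estimate_result.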

Arguments

net_object

An object of class PAFit_net that contains the network.

net_stat

An object of class PAFit_data which contains summarized statistics needed in estimation. This object is created by the function get_statistics. The default value is get_statistics(net_object).

p

Numeric. This is the ratio of the number of new edges in the learning data to that of the full data. The data is then divided into two parts: learning data and testing data based on p. The learning data is used to learn the node fitnesses and the testing data is then used in cross-validation. Default value is 0.75.
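
To make the role of p concrete, here is a minimal base R sketch of splitting time-steps into learning and testing parts by the share of new edges. The edge counts are made up, and the rule used (the learning part ends at the first time-step whose cumulative share of new edges reaches p) is an assumption for illustration, not necessarily the exact rule the package applies.

```r
# Hypothetical number of new edges observed at each time-step.
new_edges <- c(10, 20, 30, 40, 50, 50)
p <- 0.75

# Cumulative share of new edges up to and including each time-step.
cum_share <- cumsum(new_edges) / sum(new_edges)

# Assumption: the learning data ends at the first time-step whose
# cumulative share reaches p; the remaining steps form the testing data.
last_learning_step <- which(cum_share >= p)[1]
learning_steps <- seq_len(last_learning_step)
testing_steps  <- setdiff(seq_along(new_edges), learning_steps)

print(learning_steps)
print(testing_steps)
```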

stop_cond

Numeric. The iterative algorithm stops when \(|h(ii) - h(ii + 1)| / (|h(ii)| + 1) < \text{stop\_cond}\), where \(h(ii)\) is the value of the objective function at iteration \(ii\). We recommend choosing stop_cond to be at most \(10^{-(d + 2)}\), where \(d\) is the number of digits of \(h\), so that when the algorithm stops, the increase in posterior probability is less than 1% of the current posterior probability. The default is 10^-8, which is good enough for most applications.
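
The stopping rule can be sketched in a few lines of base R. The helper has_converged and the toy objective h below are hypothetical and exist only to show how the relative-change test behaves; they are not part of PAFit.

```r
# Relative-change stopping rule as stated in the description of stop_cond.
has_converged <- function(h_old, h_new, stop_cond = 10^-8) {
  abs(h_old - h_new) / (abs(h_old) + 1) < stop_cond
}

# Hypothetical objective that increases towards -1000 as ii grows.
h <- function(ii) -1000 - 0.5^ii

# Advance until two consecutive objective values are close enough.
ii <- 1
while (!has_converged(h(ii), h(ii + 1))) {
  ii <- ii + 1
}

print(ii)     # iteration at which the rule is first satisfied
print(h(ii))  # objective value at that iteration
```

The "+ 1" in the denominator keeps the test well defined when the objective is close to zero.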

mode_reg_A

Binary. Indicates which regularization term is used for \(A_k\):

  • 0: This is the regularization term used in Ref. 1 and 2. Please refer to Eq. (4) in the tutorial for the definition of the term. It approximately enforces the power-law form \(A_k = k^\alpha\). This is the default value.

  • 1: Unlike the default, this regularization term exactly enforces the functional form \(A_k = k^\alpha\). Please refer to Eq. (6) in the tutorial for the definition of the term. Its main drawback is that it converges significantly more slowly, while its gain over the default is marginal in most cases.

...

Other arguments to pass to the underlying algorithm.

Author

Thong Pham thongphamthe@gmail.com

References

1. Pham, T., Sheridan, P. & Shimodaira, H. (2015). PAFit: A Statistical Method for Measuring Preferential Attachment in Temporal Complex Networks. PLoS ONE 10(9): e0137796. doi:10.1371/journal.pone.0137796.

2. Pham, T., Sheridan, P. & Shimodaira, H. (2016). Joint Estimation of Preferential Attachment and Node Fitness in Growing Complex Networks. Scientific Reports 6, Article number: 32558. doi:10.1038/srep32558.

3. Pham, T., Sheridan, P. & Shimodaira, H. (2020). PAFit: An R Package for the Non-Parametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks. Journal of Statistical Software 92(3). doi:10.18637/jss.v092.i03.

4. Inoue, M., Pham, T. & Shimodaira, H. (2020). Joint Estimation of Non-parametric Transitivity and Preferential Attachment Functions in Scientific Co-authorship Networks. Journal of Informetrics 14(3). doi:10.1016/j.joi.2020.101042.

See Also

See get_statistics for how to create summarized statistics needed in this function.

See Jeong, Newman and only_A_estimate for functions to estimate the attachment function in isolation.

See only_F_estimate for a function to estimate node fitnesses in isolation.

Examples

if (FALSE) {
  
  library("PAFit")
  #### Example 1: a linear preferential attachment kernel, i.e., A_k = k ############
  set.seed(1)
  # size of initial network = 100
  # number of new nodes at each time-step = 100
  # A_k = k; inverse variance of the distribution of node fitnesses = 5
  net        <- generate_BB(N        = 1000 , m             = 50 , 
                            num_seed = 100  , multiple_node = 100,
                            s        = 5)
  net_stats  <- get_statistics(net)
  
  # Joint estimation of attachment function Ak and node fitness
  result     <- joint_estimate(net, net_stats)
  
  summary(result)
  
  # plot the estimated attachment function
  true_A     <- pmax(result$estimate_result$center_k,1) # true function
  plot(result , net_stats, max_A = max(true_A,result$estimate_result$theta))
  lines(result$estimate_result$center_k, true_A, col = "red") # true line
  legend("topleft" , legend = "True function" , col = "red" , lty = 1 , bty = "n")
  
  # plot the estimated node fitnesses and true node fitnesses
  plot(result, net_stats, true = net$fitness, plot = "true_f")
  
  #############################################################################
  #### Example 2: a non-log-linear preferential attachment kernel ############
  set.seed(1)
  # size of initial network = 100
  # number of new nodes at each time-step = 100
  # A_k = alpha* log (max(k,1))^beta + 1, with alpha = 2, and beta = 2
  # inverse variance of the distribution of node fitnesses = 10
  net        <- generate_net(N       = 1000 , m             = 50 , 
                            num_seed = 100  , multiple_node = 100,
                            s        = 10   , mode = 3, alpha = 2, beta = 2)
  net_stats  <- get_statistics(net)
  
  # Joint estimation of attachment function Ak and node fitness
  result     <- joint_estimate(net, net_stats)
  
  summary(result)
  
  # plot the estimated attachment function
  true_A     <- 2 * log(pmax(result$estimate_result$center_k,1))^2 + 1 # true function
  plot(result , net_stats, max_A = max(true_A,result$estimate_result$theta))
  lines(result$estimate_result$center_k, true_A, col = "red") # true line
  legend("topleft" , legend = "True function" , col = "red" , lty = 1 , bty = "n")
  
  # plot the estimated node fitnesses and true node fitnesses
  plot(result, net_stats, true = net$fitness, plot = "true_f")
  #############################################################################
  #### Example 3: another non-log-linear preferential attachment kernel ############
  set.seed(1)
  # size of initial network = 100
  # number of new nodes at each time-step = 100
  # A_k = min(max(k,1),sat_at)^alpha, with alpha = 1, and sat_at = 100
  # inverse variance of the distribution of node fitnesses = 10
  net        <- generate_net(N       = 1000 , m             = 50 , 
                            num_seed = 100  , multiple_node = 100,
                            s        = 10   , mode = 2, alpha = 1, sat_at = 100)
  net_stats  <- get_statistics(net)
  
  # Joint estimation of attachment function Ak and node fitness
  result     <- joint_estimate(net, net_stats)
  
  summary(result)
  
  # plot the estimated attachment function
  true_A     <- pmin(pmax(result$estimate_result$center_k,1),100)^1 # true function
  plot(result , net_stats, max_A = max(true_A,result$estimate_result$theta))
  lines(result$estimate_result$center_k, true_A, col = "red") # true line
  legend("topleft" , legend = "True function" , col = "red" , lty = 1 , bty = "n")
  
  # plot the estimated node fitnesses and true node fitnesses
  plot(result, net_stats, true = net$fitness, plot = "true_f")
  }