iClusterVB (version 0.1.4)

sim_data: Simulated Dataset

Description

The dataset consists of \(N = 240\) individuals and \(R = 4\) data views of different data types: two continuous, one count, and one binary. The true number of clusters was set to \(K = 4\) with balanced cluster proportions, \(\pi_1 = \pi_2 = \pi_3 = \pi_4 = 0.25\). Each data view had \(p_r = 500\) features, \(r = 1, \dots, 4\), of which only 50 (10%) were relevant features that contributed to the clustering; the remaining 450 were noise features that did not. In total, there were \(p = \sum_{r=1}^4 p_r = 2000\) features.
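The design quantities above can be checked with a few lines of R. This is only a sketch of the arithmetic, not the package's own generating code: the variable names are hypothetical, and it assumes that balanced proportions mean exactly 60 individuals per cluster, stored in cluster order.

```r
# Design constants taken from the description
N     <- 240            # individuals
K     <- 4              # true number of clusters
R     <- 4              # data views
p_r   <- rep(500, R)    # features per data view
n_rel <- 50             # relevant features per data view

# Balanced cluster proportions: pi_k = 0.25 for every cluster
pi_k          <- rep(1 / K, K)
cluster_sizes <- N * pi_k

# Assumption: individuals are stored in cluster order (not stated in the source)
true_cluster <- rep(seq_len(K), times = cluster_sizes)

p_total       <- sum(p_r)        # total number of features across views
prop_relevant <- n_rel / p_r[1]  # share of relevant features within a view

print(table(true_cluster))
print(c(p_total = p_total, prop_relevant = prop_relevant))
```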

For data view 1 (continuous), relevant features were generated from the following normal distributions: \(\text{N}(10, 1)\) for Cluster 1, \(\text{N}(5, 1)\) for Cluster 2, \(\text{N}(-5, 1)\) for Cluster 3, and \(\text{N}(-10, 1)\) for Cluster 4, while noise features were generated from \(\text{N}(0, 1)\). For data view 2 (continuous), relevant features were generated from the following normal distributions: \(\text{N}(-10, 1)\) for Cluster 1, \(\text{N}(-5, 1)\) for Cluster 2, \(\text{N}(5, 1)\) for Cluster 3, and \(\text{N}(10, 1)\) for Cluster 4, while noise features were generated from \(\text{N}(0, 1)\). For data view 3 (binary), relevant features were generated from the following Bernoulli distributions: \(\text{Bernoulli}(0.05)\) for Cluster 1, \(\text{Bernoulli}(0.2)\) for Cluster 2, \(\text{Bernoulli}(0.4)\) for Cluster 3, and \(\text{Bernoulli}(0.6)\) for Cluster 4, while noise features were generated from \(\text{Bernoulli}(0.1)\). For data view 4 (count), relevant features were generated from the following Poisson distributions: \(\text{Poisson}(50)\) for Cluster 1, \(\text{Poisson}(35)\) for Cluster 2, \(\text{Poisson}(20)\) for Cluster 3, and \(\text{Poisson}(10)\) for Cluster 4, while noise features were generated from \(\text{Poisson}(2)\).
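A dataset with the same structure can be simulated in base R as sketched below. The cluster-specific parameters come from the description; everything else is an assumption made for illustration: the seed, the helper `simulate_view()`, the object names, the placement of the 50 relevant features in the first columns of each view, and the ordering of individuals by cluster. It is not the code used to build `sim_data`.

```r
set.seed(123)  # arbitrary seed, used only to make this sketch reproducible

N <- 240; K <- 4; p <- 500; n_rel <- 50
true_cluster <- rep(seq_len(K), each = N / K)  # assumed ordering by cluster

# Cluster-specific parameters for the relevant features (from the description)
mu_view1   <- c(10, 5, -5, -10)      # normal means, view 1
mu_view2   <- c(-10, -5, 5, 10)      # normal means, view 2
prob_view3 <- c(0.05, 0.2, 0.4, 0.6) # Bernoulli probabilities, view 3
rate_view4 <- c(50, 35, 20, 10)      # Poisson rates, view 4

# Thin wrappers so every generator has the form draw(n, parameter)
draw_normal <- function(n, mean)   rnorm(n, mean = mean, sd = 1)
draw_bern   <- function(n, prob)   rbinom(n, size = 1, prob = prob)
draw_pois   <- function(n, lambda) rpois(n, lambda = lambda)

# Hypothetical helper: fill an N x p matrix with noise draws, then overwrite
# the first n_rel columns with draws that depend on each row's cluster
simulate_view <- function(draw, cluster_par, noise_par) {
  x <- matrix(draw(N * p, noise_par), nrow = N, ncol = p)
  row_par <- rep(cluster_par[true_cluster], times = n_rel)  # column-major order
  x[, seq_len(n_rel)] <- draw(N * n_rel, row_par)
  x
}

view1 <- simulate_view(draw_normal, mu_view1,   0)    # continuous
view2 <- simulate_view(draw_normal, mu_view2,   0)    # continuous
view3 <- simulate_view(draw_bern,   prob_view3, 0.1)  # binary
view4 <- simulate_view(draw_pois,   rate_view4, 2)    # count

# Per-cluster means of the relevant block should sit near the stated parameters
print(round(tapply(rowMeans(view1[, seq_len(n_rel)]), true_cluster, mean), 2))
print(round(tapply(rowMeans(view4[, seq_len(n_rel)]), true_cluster, mean), 2))
```

Filling the relevant block with a single vectorised call relies on R storing matrices in column-major order, so the per-row parameter vector is repeated once per relevant column.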

Usage

data(sim_data)

Format

A list containing four datasets, one per data view, and other elements of interest.
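Because the element names are not listed here, one way to see what the list holds is to inspect it after loading. The sketch below uses only base R functions and assumes the iClusterVB package is installed; the guard simply skips the inspection otherwise.

```r
# Check for the package first so the snippet does not fail when it is missing
has_pkg <- requireNamespace("iClusterVB", quietly = TRUE)

if (has_pkg) {
  data(sim_data, package = "iClusterVB")

  print(names(sim_data))        # names of the list elements
  str(sim_data, max.level = 1)  # type and dimensions of each element
}
```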