
Laurae

Advanced High Performance Data Science Toolbox for R by Laurae

Want to download it right away? Run:

devtools::install_github("Laurae2/Laurae")

Latest News (DD/MM/YYYY)

24/03/2017: Added Xgboard, an interactive dashboard for visualizing xgboost training, whether you are on a computer, a phone, or a tablet, by setting up a server accessible from a web browser (Google Chrome, Firefox...). Only Accuracy and Timing are supported for now, with more to come soon!

04/03/2017: Added a Deep Forest implementation in R using xgboost, which may provide performance similar to very simple Convolutional Neural Networks (CNNs), and slightly better results than boosted models. You can find the paper here. Supported: Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, Deep Forest. You can use Gradient Boosting to get a sort of "Deep Boosting" model.

Benchmark on MNIST (2,000 training samples, 10,000 testing samples), i7-4600U, 3-fold cross-validation (Cascade Forest and Multi-Grained Scanning both with poor parameters, for speed):

| Model | Features | Accuracy | Training Time | Model Size |
| --- | --- | --- | --- | --- |
| Cascade Forest (xgboost) | 784 | 89.91% (6th iteration) | 637.264s (11 iterations) | Forest: 274,951,008 bytes |
| Boosted Trees (xgboost) | 784 | 90.53% (250th iteration) | 267.884s (300 iterations) | Boost: NA |
| "Deep Forest" (xgboost) => Multi-Grained Scanning => Cascade Forest | Scan: 28x28, Forest: 2404 | 91.46% (5 iterations) | Scan: 449.593s, Forest (8): 1135.937s | Scan: 256,419,396 bytes, Forest: 273,624,912 bytes |
| "Deep Boosting" (xgboost) => Multi-Grained Scanning => Boosted Trees | Scan: 28x28, Boost: 2404 | 92.41% (215 iterations) | Scan: 449.593s, Boost (265): 852.360s | Scan: 256,419,396 bytes, Boost: NA |
| LeNet (MXnet + R w/ Intel MKL) | 28x28 | 94.74% (50 epochs) | 647.638s (50 epochs) | CNN: NA |

10/02/2017: Added Partial Dependence Analysis; currently a skeleton, but I will build more on it. It is fully working for analyzing single observations against a set of features you specify. For the multiple observation version, the statistical analysis of the results is not working yet.

30/01/2017: Added "Lextravagenza", a machine learning model based on xgboost which ignores past gradients/hessians during optimization but allows dynamic trees, outperforming small boosted trees.

09/01/2017: My LightGBM PR for easy installation in R has been merged into the official LightGBM repository. When I get time to work more on it (harvest metric, harvest feature importance, save/load models), I will update this package and get rid of the old LightGBM wrapper. This way, you will be able to use the latest versions of LightGBM, instead of being stuck with the (old) PR 33 of LightGBM.

08/01/2017: I'm starting to work on an automated machine learning model / stacker.

What is Data Science

What can I do with it?

Mostly... in a nutshell:

| What? | Can you do? |
| --- | --- |
| Supervised Learning | Deep Forest implementation (Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, Deep Forest); automated machine learning (feature selection + hyperparameter tuning); xgboost; LightGBM (training from binary, feature importance, prediction); rule-based model on outliers (univariate, bivariate); feature engineering assistant; interactive xgboost feature importance; repeated cross-validation; symbolic loss function derivation; interactive split feature engineering assistant; Laurae's Lextravagenza (dynamic boosted trees); partial dependency analysis on single observations for finding insights |
| Unsupervised Learning | Automated t-SNE |
| Automated Reporting for Machine Learning | Linear regression; unbiased xgboost regression/classification |
| Interactive Analysis | Interactive loss function symbolic derivation; interactive "I'm Feeling Lucky" ggplot; interactive d3js/Plotly; interactive Brewer's Palettes; Xgboard |
| Optimization | Cross-Entropy optimization combined with Elite optimization |
| data.table improvements | Up to 3X memory efficiency without even a minor cost in CPU time |
| Plot massive amounts of data without being slow | tableplots |
| SVMLight I/O (external package) | C++ implementation of SVMLight reading/saving for dgCMatrix (sparse column-compressed format) |

Supervised Learning:

  • Deep Forest Implementation: first implementation in R of Complete-Random Tree Forest, Cascade Forest, Multi-Grained Scanning, and Deep Forest. Read more on this paper.
  • (Soon Deprecated) Use LightGBM in R (the first R wrapper available for LightGBM), tuned for maximum I/O without in-memory dataset moves (which is both a good and a bad thing - 10GB of data takes about 4 minutes to travel through an HDD), and use feature importance with smart and readable plots - I recommend using the official LightGBM R package, to which I contribute
  • Automated Machine Learning from a set of features and hyperparameters (provide algorithm functions, features, and hyperparameters, and a stochastic optimizer does the job for you, with full logging if required)
  • Use a repeated cross-validated xgboost (Extreme Gradient Boosting)
  • Get pretty interactive feature importance tables of xgboost ready-to-use for markdown documents
  • Throw supervised rules using outliers anywhere you feel it appropriate (univariate, bivariate)
  • Create cross-validated and repeated cross-validated folds for supervised learning with more options for creating them (like batch creation - those ones can be fed into my LightGBM R wrapper for extensive analysis of feature behavior)
  • Feature Engineering Assistant (mostly non-linear version) using automated decision trees
  • Dictionary of loss functions and ready to input into xgboost (currently: Absolute Error, Squared Error, Cubic Error, Loglikelihood Error, Poisson Error, Kullback-Leibler Error)
  • Symbolic Derivation for custom loss functions (finding gradients/hessians painlessly)
  • Lextravagenza model (dynamic boosted trees), which is good for small numbers of boosting iterations, bad for high numbers of boosting iterations (good for diversity)
  • Partial dependency analysis for single observation: the way to get insights on why a black box made a specific decision!
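
The loss function dictionary above boils down to supplying xgboost with a gradient/hessian pair. As a generic illustration (plain xgboost API, not a Laurae function - the objective below is my own hand-written example of the kind SymbolicLoss is meant to derive for you):

```r
library(xgboost)

# Custom squared-error objective: xgboost expects a function taking
# (preds, dtrain) and returning the gradient and hessian of the loss.
sq_error_obj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  grad <- preds - labels          # d/dpred of 0.5 * (pred - label)^2
  hess <- rep(1, length(preds))   # second derivative is constant
  list(grad = grad, hess = hess)
}

dtrain <- xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
model <- xgb.train(params = list(max_depth = 3, eta = 0.3),
                   data = dtrain, nrounds = 10, obj = sq_error_obj)
```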

Unsupervised Learning:

  • Auto-tune t-SNE (t-Distributed Stochastic Neighbor Embedding) - it even comes with premade hyperparameters tuned for minimal reproduction loss!
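
For reference, a minimal sketch of what such a t-SNE fit looks like with the underlying Rtsne package (tsne_grid adds the seed/perplexity grid search on top of this kind of call; its exact signature may differ):

```r
library(Rtsne)

# t-SNE rejects duplicate rows by default, so deduplicate first.
X <- unique(as.matrix(iris[, 1:4]))

set.seed(42)  # tsne_grid searches over seeds; here we fix one
fit <- Rtsne(X, dims = 2, perplexity = 30, verbose = FALSE)

head(fit$Y)  # 2-D embedding coordinates
```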

Automated Reporting for Machine Learning:

  • Generate an in-depth automated report for linear regression with interactive elements.
  • Generate an in-depth automated report for xgboost regression/classification with interactive elements, with unbiased feature importance computations

Interactive Analysis:

  • Discover and optimize gradient and hessian functions interactively in real-time
  • Plot up to 1 dependent variable, 2 independent variables, 2 conditioning variables, and 1 weighting variable for Exploratory Data Analysis using ggplot, in real-time
  • Plot up to three variables for Exploratory Data Analysis using d3js via NVD3, in real-time
  • Plot several variables for Exploratory Data Analysis using d3js via Plotly/ggplot, in real-time
  • Discover rule-based (from decision trees) non-linear relationship between variables, with rules ready to be copied and pasted for data.tables
  • Visualize interactively Color Brewer palettes with unlimited colors (unlike the original palettes), with ready to copy&paste color codes as vectors
  • Monitor xgboost training in real time

Optimization:

  • Do feature selection & hyperparameter optimization using Cross-Entropy optimization & Elite optimization
  • Do the same optimization but with any variable (continuous, ordinal, discrete) for any function using fully personalized callbacks (which is both a great thing and a hassle for the user) and a personalized training backend (by default it uses xgboost as the predictor for next steps, you can modify it by another (un)supervised machine learning model!)
  • Symbolic Derivation for custom loss functions (finding gradients/hessians painlessly)
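
As an illustration of the symbolic derivation idea, here is a minimal sketch using the Deriv package (a dependency of SymbolicLoss); the squared-error loss below is my own example, not one from the package dictionary:

```r
library(Deriv)

# Squared-error loss as a plain R function of the prediction.
loss <- function(pred, label) (pred - label)^2

# Symbolically derive gradient and hessian with respect to `pred`.
grad <- Deriv(loss, "pred")   # should simplify to 2 * (pred - label)
hess <- Deriv(grad, "pred")   # should simplify to the constant 2

grad(pred = 3, label = 1)
hess(pred = 3, label = 1)
```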

Improvements & Extras:

  • Improve data.table memory efficiency by up to 3X while keeping a large part of its performance (best of both worlds? isn't that insane?)
  • Improve Cross-Entropy optimization by providing a more powerful frontend (at the expense of the user's required knowledge), converging better on feature selection but more slowly on hyperparameter optimization of black boxes
  • Load sparse data directly as dgCMatrix (sparse matrix)
  • Plot massive amount of data in an easily readable picture
  • Add unlimited colors to the Color Brewer palettes
  • Add the ability to add linear equation coefficient to ggplot facets
  • Add multiplot ggplot
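
The "unlimited colors" trick can be sketched in plain R by interpolating a Brewer palette with colorRampPalette (brewer.pal_extended presumably works along these lines, though its internals may differ):

```r
library(RColorBrewer)

# Original Brewer palettes cap out (e.g. "Blues" at 9 colors)...
base_pal <- brewer.pal(9, "Blues")

# ...but interpolation yields as many intermediate colors as you want.
extended <- colorRampPalette(base_pal)(50)

length(extended)  # 50 hex color codes
```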

Sparsity SVMLight converter benchmark:

  • Benchmark to convert a dgCMatrix with 2,500,000 rows and 8,500 columns (1.1GB in memory) => 5 minutes
  • The other existing converters would likely need hours, if not days, for such a size.
  • Currently not merged on this repository: see https://github.com/Laurae2/sparsity !

Nice pictures:

  • Partial Dependence for single observation analysis (5-variate example):

  • Partial Dependence for multiple observation analysis (univariate example):

  • LightGBM Feature Importance:

  • xgboost Interactive Feature Importance:

  • Automated Reporting with pretty tables:

  • Interactive Symbolic Derivation:

  • Interactive EDA using 3djs/Plotly/ggplot2:

  • Interactive Feature Engineering Assistant:

  • Deep Forest example:

Installing this package? (Quick, improper installation)

Proper version is at the end of this page.

If you already installed this package in the past, or you want to install this package super fast because you want the functions, run in R:

devtools::install_github("Laurae2/Laurae")

Running in a Virtual Machine and/or have no proxy redirection from R? Use the following alternative:

devtools::install_git("git://github.com/Laurae2/Laurae.git")

Need all R dependencies in one shot?:

devtools::install_github("ramnathv/rCharts")
install.packages("https://cran.r-project.org/src/contrib/Archive/tabplot/tabplot_1.1.tar.gz", repos=NULL, type="source")
install.packages(c("data.table", "foreach", "doParallel", "rpart", "rpart.plot", "partykit", "tabplot", "partykit", "ggplot2", "ggthemes", "plotluck", "grid", "gridExtra", "RColorBrewer", "lattice", "car", "CEoptim", "DT", "formattable", "rmarkdown", "shiny", "shinydashboard", "miniUI", "Matrix", "matrixStats", "R.utils", "Rtsne", "recommenderlab", "Rcpp", "RcppArmadillo", "mgcv", "Deriv", "outliers", "MASS", "stringi"))
devtools::install_github("Laurae2/sparsity")

Getting "Failed with error: 'there is no package called 'sparsity''"? Run install_github("Laurae2/sparsity") or install_git("git://github.com/Laurae2/sparsity.git") if you wish to get rid of this error, or if you want to use the super fast column-compressed sparse matrix (dgCMatrix) -> SVMLight converter in R.

What do you need?

If I am not missing stuff (please make a pull request if something is missing that must be added):

| Package | Requires compilation? | Which functions? |
| --- | --- | --- |
| Microsoft/LightGBM | YES (install separately, from PR 33*) | lgbm.train, lgbm.predict, lgbm.cv, lgbm.cv.prep, lgbm.fi, lgbm.metric, lgbm.fi.plot, LauraeML_lgbreg |
| dmlc/xgboost | YES (install separately, from PR 1855**) | xgb.ncv, xgb.opt.depth, report.xgb, LauraeML_gblinear, LauraeML_gblinear_par, Lextravagenza, pred.Lextravagenza, predictor_xgb, CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred |
| Laurae2/sparsity | YES (***) | lgbm.train, lgbm.predict, lgbm.cv, lgbm.cv.prep, xgboard functions |
| data.table | No | read_sparse_csv, lgbm.train, lgbm.predict, lgbm.cv, lgbm.cv.prep, lgbm.fi, lgbm.fi.plot, DTcbind, DTrbind, DTsubsample, DTcolsample, setDF, DTfillNA, DT2mat, report.lm, report.xgb, interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer, LauraeML, LauraeML_gblinear, LauraeML_gblinear_par, partial_dep.obs, partial_dep.obs_all, predictor_xgb, partial_dep.plot, partial_dep.feature, cbindlist, CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred, xgboard functions |
| foreach | No | LauraeML_gblinear_par |
| doParallel | No | LauraeML_gblinear_par |
| rpart | No | FeatureLookup, interactive.eda_tree |
| rpart.plot | No | FeatureLookup, interactive.eda_tree |
| partykit | No | interactive.eda_tree |
| tabplot | No | tableplot_jpg, interactive.eda_ggplot, partial_dep.plot |
| rCharts | No | interactive.eda_3djs |
| plotly | No | interactive.eda_plotly, partial_dep.plot |
| ggplot2 | No | lgbm.fi.plot, report.lm, report.xgb, interactive.eda_ggplot, partial_dep.plot, stat_smooth_func, stat_smooth_func.plotly, grid_arrange_shared_legend |
| ggthemes | No | interactive.eda_plotly |
| GGally | No | partial_dep.plot |
| plotluck | No | interactive.eda_ggplot |
| grid | No | report.lm, report.xgb, interactive.eda_tree |
| gridExtra | No | report.lm, report.xgb |
| RColorBrewer | No | interactive.eda_plotly, interactive.eda_RColorBrewer, brewer.pal_extended |
| lattice | No | report.lm, report.xgb, partial_dep.plot |
| car | No | .ExtraOpt_plot, partial_dep.plot |
| CEoptim | No | ExtraOpt, LauraeML |
| DT | No | xgb.importance.interactive, report.lm, report.xgb |
| formattable | No | report.lm, report.xgb |
| rmarkdown | No | report.lm, report.xgb, interactive.eda_tree |
| shiny | No | interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer |
| shinydashboard | No | interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer |
| miniUI | No | xgboard functions |
| Matrix | No | read_sparse_csv, CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred |
| matrixStats | No | report.lm, report.xgb |
| R.utils | No | rule_single, rule_double, report.lm, report.xgb, xgboard functions |
| Rtsne | No | tsne_grid |
| recommenderlab | No | read_sparse_csv (only when using NAs as sparse) |
| Rcpp | No | sparsity (package) |
| RcppArmadillo | No | report.lm |
| Deriv | No | SymbolicLoss, interactive.SymbolicLoss |
| outliers | No | rule_single, rule_double |
| MASS | No | interactive.eda_plotly |
| stringi | No | lightgbm.cv |
| None so far | No | kfold, nkfold, lgbm.find |

Manual installations:

Installing dependencies?

  • For LightGBM (use PR 33 please), please do NOT use: git clone --recursive https://github.com/Microsoft/LightGBM for the repository. Use my stable version which is aligned with Laurae package via git clone --recursive https://github.com/Laurae2/LightGBM. Then follow the installation steps (https://github.com/Microsoft/LightGBM/wiki/Installation-Guide).
  • For xgboost, refer to my documentation for installing in MinGW: https://github.com/dmlc/xgboost/tree/master/R-package - If you encounter strange issues in Windows (like permission denied, etc.), please read: https://medium.com/@Laurae2/compiling-xgboost-in-windows-for-r-d0cb826786a5. Make sure you are using MinGW.
  • sparsity: You must use Laurae's sparsity package (SVMLight I/O conversion), which can be found here: https://github.com/Laurae2/sparsity/blob/master/README.md - compilation simply requires writing devtools::install_github("Laurae2/sparsity") (and having Rtools in Windows).
  • tabplot: please use install.packages("https://cran.r-project.org/src/contrib/Archive/tabplot/tabplot_1.1.tar.gz", repos=NULL, type="source"). The 1.3 version is "junk" since they added standard deviation, which makes tableplots unreadable when it is too high, even when standard deviation is disabled.

Strange errors on first run

Sometimes you will get strange errors (like a corrupted documentation database) on the first load ever on the package. Restart R to get rid of this issue. It does not show up anymore afterwards.

Printed text is missing after interrupting LightGBM / xgboost

Write in your R console sink() until you get an error.

A lot of functions that worked are giving errors.

Write in your R console sink() until you get an error.
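
If you prefer not to type sink() repeatedly until R errors, the following base-R one-liner pops every open sink in one go:

```r
# sink.number() reports how many output diversions are still open;
# keep closing them until the console output is restored.
while (sink.number() > 0) sink()
```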

What is inside?

| Utility | Function Name(s) |
| --- | --- |
| Supervised Learning | xgboost: xgb.ncv, xgb.opt.depth; LightGBM: lgbm.train, lgbm.predict, lgbm.cv, lgbm.metric, lgbm.fi, lgbm.fi.plot, lgbm.find; Rules: rule_single, rule_double; Base: kfold, nkfold; Helpers: SymbolicLoss, FeatureLookup; AutoML: ExtraOpt, LauraeML; Laurae's Dynamic Trees: Lextravagenza, pred.Lextravagenza; Partial Dependence: partial_dep.obs, partial_dep.obs_all, partial_dep.plot, partial_dep.feature; Deep Forest: CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred |
| Unsupervised Learning | t-SNE: tsne_grid |
| Automated Reporting | report.lm, report.xgb |
| Visualizations | Interactive: interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer; Helpers: tableplot_jpg, brewer.pal_extended, grid_arrange_shared_legend, stat_smooth_func, stat_smooth_func.plotly, xgb.importance.interactive |
| Extreme low-memory manipulation | data.table: setDF, DTcbind, DTrbind, DTsubsample, DTcolsample, DTfillNA, cbindtable; CSV sparse: read_sparse_csv |
Function NameTypeWhat is it for
Laurae_loadDependency loadAttempts to load all Laurae dependencies.
tsne_gridDimensionality Reduction + Grid SearchAllows to grid search a seed and a perplexity interval using t-SNE, while returning the best t-SNE model along with the best iteration found, all in a fully verbose fashion.
read_sparse_csvIterated numeric sparse matrix readingR always imports CSV as dense. This function allows to read very large CSVs in chunks by variables (or a specific subset of variables), outputting a sparse matrix with typically lower RAM usage than a dense matrix if sparsity is high enough, all in a fully verbose fashion. Sparsity can be defined as 0 or NA, while saving as RDS is available in the loading streak.
tableplot_jpgBatch tableplot output to JPEGAllows to create a tableplot which is immediately turned into JPEG in batch per variable, against a label. It allows to preview features in a more understandable fashion than eyeballing numeric values.
xgb.ncvRepeated xgboost Cross-ValidationAllows to run a repeated xgboost cross-validation with fully verbosity of aggregate summaries, computation time, and ETA of computation, with fixed seed and a sink to store xgboost verbose data, and also out-of-fold predictions and external data prediction.
rule_singleOutlying Univariate Continuous Association Rule FinderAllows to use an outlying univariate continuous association rule finder on data and predicts immediately. Intermediate outlying scores can be stored. High verbosity of outputs during computation.
rule_doubleOutlying Bivariate Linear Continuous Association Rule FinderAllows to use an outlying bivariate linear continuous association rule finder on data and predicts immediately. Intermediate outlying scores cannot be stored. If a bivariate combination is ill-conditioned (sum of correlation matrix = 4), that bivariate combination is skipped to avoid a solver matrix inversion crash/freeze/interruption when trying to compute Mahalanobis distance dimensionality reduction. High verbosity of outputs during computation. Potential TO-DO: give the user the possibility to use their own dimensionality reduction function (like a truncated PCA 1-axis).
xgb.opt.depthxgboost Depth OptimizerAllows to optimize xgboost's depth parameter using simple heuristics. The learner function is customizable to fit any other model requiring to work by integer steps. Hence, it is adaptable to work on continuous 1-D features, with a large safety net you define yourself by coercing the integer to your own range.
lgbm.trainLightGBM trainerTrains a LightGBM model. Full verbosity control, with logging to file possible. Allows to predict out of the box during the training on the validation set and a test set.
lgbm.predictLightGBM predictorPredicts from a LightGBM model. Use the model working directory if you lost the model variable (which is not needed to predict - you only need the correct model working directory and the model name).
lgbm.cvLightGBM CV trainerCross-Validates a LightGBM model, returns out of fold predictions, ensembled average test predictions (if provided a test set), and cross-validated feature importance. Full verbosity control, with logging to file possible, with predictions given back as return. Subsampling is optimized to maximum to lower memory usage peaks.
lgbm.cv.prepLightGBM CV preparation helperPrepares the data for using lgbm.cv. All required data files are output, so you can run lgbm.cv with files_exist = TRUE without the need of other data preparation (which can be long sometimes). Supports SVMLight format.
lgbm.fiLightGBM Feaure ImportanceComputes the feature importance (Gain, Frequence) of a LightGBM model with Sum / Relative Ratio / Absolute Ratio scales.
lgbm.fi.plotLightGBM Feaure Importance PlotPretty plots a LightGBM feature importance table from a trained model, or from a cross-validated model. Use the model for auto-plotting. Try to use different scales to see more appropriately differences in feature importance. You can also use the multipresence parameter to cross-validate features.
lgbm.metricLightGBM Training MetricsComputes the training metrics of a logged LightGBM model and finds the best iteration.
lgbm.findLightGBM Path HelperHelps you usign a GUI to find and write the correct path for input to LightGBM functions.
setDFLow memory DT coercion to DF(Already available in data.table) Coerces a data.table to data.frame using the least possible memory. Actually, it uses about 0 extra memory.
DTcbindLow memory DT cbindColumn bind two data.tables using the least possible memory. With extreme settings, it uses only one column extra of memory, and the peak is reached when hitting the largest RAM intensive column (which is not much when you have 1,000+ columns). Compared to cbind, this reduce peak memory usage by 3X, and sometimes by more.
DTrbindLow memory DT rbindRow bind two data.tables using the least possible memory. With extreme settings, it uses only one column extra of memory, and the peak is reached when hitting the largest RAM intensive column (which is not much when you have 1,000+ columns). Compared to rbind, this reduce peak memory usage by 3X, and sometimes by more.
DTsubsampleLow memory DT subsamplingSubsample a data.table using the least possible memory. It should not do lower memory usage than direct subsampling. Sometimes, you can get a slight efficiency of up to 5%.
DTcolsampleLow memory DT column samplingColumn sample a data.table using the least possible memory. Impact is major versus a FROM clause in data.table, but it is more a convenience function for NULLing and COPYing the data.table / modify in-memoory (versus a NULL loop, the performance and memory difference should be non existant).
DTfillNALow memory DT Missing Value fillingFills the missing values of a data.table using the least possible memory. Compared to direct usages (DT[is.na(DT)] <- value), this function consumes up to 3X less (and typically 2X less). You can even create a new data.table or overwrite the original one. Also, this function works on data.frame, and can even overwrite the original data.frame.
DT2matLow memory DT to MatrixConverts a data.table to a matrix using the least possible memory, and way faster than using as.matrix.
kfoldk-fold Cross-ValidationCreates folds for cross-validation.
nkfoldn-repeated k-fold Cross-ValidationCreates folds for repeated cross-validation.
ExtraOptCross-Entropy -based Hybrid OptimizationCombines Cross-Entropy optimization and Elite optimization in order to optimize mixed types of variable (continuous, ordinal, discrete). The frontend is fully featured and requires the usage of callbacks in order to be usable. Example callbacks are provided. A demo trainer, a demo estimator, a demo predictor, and a demo plotter are provided as reference callbacks to customize. The optimization backend is fully customizable, allowing you to switch the optimizer (default is xgboost) to any other (un)supervised machine learning model!
FeatureLookupNon-linear Feature Engineering AssistantAllows to run a cross-validated decision tree using your own specified depth, amount of surrogates, and best potential lookups in order to to create new features based on the resulting decision tree at your own will.
SymbolicLossSymbolic Derivation of Loss FunctionsAttemps to compute the exact 1st and 2nd derivatives of the loss function provided, along of a reference function if you provide one. The functions returned are ready to be used. Graphics are also added to help the user.
xgb.importance.interactiveInteractive xgboost Feature ImportanceAllows to print an interactive xgboost feature importance table, ready to be used in markdown documents and HTML documents to be shared.
report.lmAutomated HTML Reporting for Linear RegressionAutomatically creates a report for linear regression (C++ backend). Allows data normalization, NA cleaning, rank deficiency checking, pretty printed machine learning performance statistics (R, R^2, MAE, MSE, RMSE, MAPE), pretty printed feature multiplicative coefficients, plotting statistics, analysis of variance (ANOVA), adjusted R^2, degrees of freedom computation...
report.xgbAutomated HTML Reporting for Linear RegressionAutomatically creates a report for linear regression (C++ backend). Allows data normalization, NA cleaning, rank deficiency checking, pretty printed machine learning performance statistics (R, R^2, MAE, MSE, RMSE, MAPE, AUC, Logloss, optimistic Kappa, optimistic F1 Score, optimistic MCC, optimistic TPR, optimistic TNR, optimistic FPR, optimistic FNR), pretty printed feature (unbiased/biased) importance, plotting statistics, plotting of machine learning performance statistic evolution vs probability...
interactive.SymbolicLossInteractive Dashboard for Derivation of Loss FunctionsCreates an interactive dashboard which allows you to work on up to 4 loss functions with their gradient and hessian, which are typically used in numerical optimization tasks. Resists to errors (keeps running even when you input errors).
interactive.eda_ggplotInteractive Dashforboard for Exploratory Data Analysis using ggplot2Creates an interactive dashboard which allows to work on the data set you want (from the global environment) by plotting up to 3 variables simultaneously, using a smart detection of variables to choose the best appropriate plot via ggplot and plotluck. Resists to errors (keeps running even when you input errors).
interactive.eda_treeInteractive Dashboard for Non-linear Feature Engineering AssistantCreates an interactive dashboard which allows to run a cross-validated decision tree using the same settings as the Non-Linear Feature Engineering Assistant, but with an interactive interface and printable rules ready to copy and paste into data.tables.
interactive.eda_3djsInteractive Dashboard for Exploratory Data Analysis using d3jsCreates an interactive dashboard which allows to work on the data set you want (from the global environment) by plotting up to 3 variables using 3djs. Not recommended and it is better to use interactive.eda_plotly. Supposed to resist to errors (keeps running even when you input errors), but this is not always true (the window unexpectedly closes sometimes when you input a very very bad setup).
interactive.eda_plotlyInteractive Dashboard for Exploratory Data Analysis using d3js via PlotlyCreates an interactive dashboard which allows to work on the data set you want (from the global environment) by plotting several variables using 3djs via Plotly (can use ggplot2 via Plotly via d3js). This is the recommended way for interactive charts. Not all plots are available, but support for scatter, bar, pie, histogram, histogram2d, box, contour, heatmap, polar, scatter3d, and surface plots is provided. Supposed to resist to errors (keeps running even when you input errors), but this is not always true (the window unexpectedly closes sometimes when you input a very very bad setup). Performs also on-demand supervised/unsupervised clustering for continuous to discrete data.
brewer.pal_extendedColor Brewer Palette ExtendedExtends the original Color Brewer palettes by providing unlimited colors unlike the original palettes.
interactive.eda_RColorBrewerInteractive Dashboard for Finding the Perfect Color Brewer PaletteCreates an interactive dashboard which allows you to search visually for the best Color Brewer palette for your own taste. Not only everything is shown in real-time just by editing a field, but a copy&paste output is ready to be pasted into R for further usage. You are greeted with a pyramid.
LauraeMLAutomated Machine Learning(VERY EXPERIMENTAL) Provides a function for doing automated machine learning (optimize features, optimize hyperparameters) using a stochastic optimizer (Cross-Entropy optimization). It does not use a Bayesian optimizer, therefore sampling is random every each optimization iterations and is much slower (for the benefits of finding which features to keep). Full logging is provided which allows you find out the best features and their loss (ex: loss vs number of features used). Still a lot of TO-DO (best would be "throw all in a single function without more than 5 arguments, get results back"). Functions: LauraeML_gblinear, LauraeML_gblinear_par, LauraeML_lgbreg
LextravagenzaLaurae's Dynamic Boosted Trees(EXPERIMENTAL, working) Trains a dynamic boosted trees whose depth is defined by a range instead of a single value, without any past gradient/hessian memory. It outperforms xgboost for a small amount of boosting iterations, but xgboost is better for longer trainings. However, dynamism comes at a price: you need a validation set (for dynamism) and a testing set (for early stopping). You can use pred.Lextravagenza to predict from it.
grid_arrange_shared_legendMultiplot ggplotAllows to add multiple ggplot2 plots in one page, with a common legend.
stat_smooth_funcggplot equation formula(For non-Plotly routines only) Prints the formula used for linear regression in ggplot plots. Works with facetting.
stat_smooth_func.plotlyggplot equation formula(For Plotly routines only)Prints the formula used for linear regression in ggplot plots. Works with facetting, but you should hover the mouse to check for strange placements (hovering one statistic will reveal the others).
partial_dep.obs | Partial Dependence, Single Observation analysis | Performs a single observation analysis using the provided data in order to check the evolution of the label to predict when the feature values are changed, keeping all other features invariant. This is great if you want to analyze why an observation got XYZ value according to some factors.
partial_dep.obs_all | Partial Dependence, Multiple Observation analysis | Performs a univariate multiple observation analysis using the provided data in order to check the evolution of the label to predict when the feature values are changed, keeping all other features invariant.
partial_dep.plot | Partial Dependence, Plotting | Allows to plot the content of a partial dependence analysis. You can use lattice, ggplot2, car, base, or tableplots. Use Plotly for interactive analysis.
partial_dep.feature | Partial Dependence, Statistical checking | Performs statistical tests to check the validity of the impact of a feature against a specified variable.
cbindlist | data.table rbindlist for columns | Allows to perform a column-wise rbindlist on a list of vectors.
CRTreeForest | Complete-Random Tree Forest | Trains a Complete-Random Tree Forest model, which is used in Cascade Forests from Deep Forests. You can use CRTreeForest_pred to predict from it.
CascadeForest | Cascade Forest | Trains a Cascade Forest model, which is the equivalent of a Multilayer Perceptron / Neural Network. Adding MGScanning before it turns it into a Deep Forest. Performance is very similar to LeNet (untested against other implementations yet), which is a convolutional neural network (CNN). You can use CascadeForest_pred to predict from it.
MGScanning | Multi-Grained Scanning | Trains a Multi-Grained Scanning model which, when used as features for a Cascade Forest, turns it into a Deep Forest. You can use MGScanning_pred to predict from it.
xgboard.run | Xgboard Dashboard (run) | Runs the Xgboard Dashboard using the IP and port you specify and opens a window in a new browser (if asked to). By default, it uses 127.0.0.1:6700. You can use IP 0.0.0.0 for broadcasting in your Intranet.
xgboard.init | Xgboard Dashboard (init) | Initializes an environment for xgboost.
xgboard.time | Xgboard Dashboard (reset) | Resets the time environment for xgboost.
xgboard.dump | Xgboard Dashboard (dump) | Performs dumping of metrics when passed in an evaluation metric.
xgboard.xgb | Xgboard Dashboard (eval_metric) | (Easy) wrapper for the evaluation metric to pass to xgboost.
xgboard.eval.error | Xgboard Dashboard (metric) | Evaluates the best threshold for maximum binary accuracy and returns both accuracy and threshold.
xgboard.eval.logloss | Xgboard Dashboard (metric) | Evaluates the logarithmic loss for binary classification.
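For intuition, the threshold search performed by xgboard.eval.error can be sketched in a few lines of base R. This is an illustrative stand-in, not the package's code; the function name best_acc_threshold is made up for this sketch:

```r
# Sketch: scan every unique predicted probability as a candidate threshold
# and keep the one maximizing binary accuracy, as xgboard.eval.error
# does conceptually for its dashboard metric.
best_acc_threshold <- function(preds, labels) {
  cand <- sort(unique(preds))
  acc <- vapply(cand, function(t) mean((preds >= t) == labels), numeric(1))
  best <- which.max(acc)
  c(threshold = cand[best], accuracy = acc[best])
}

preds  <- c(0.1, 0.4, 0.35, 0.8, 0.9)
labels <- c(0, 0, 1, 1, 1)
best_acc_threshold(preds, labels)
```

A real implementation would sort once and use cumulative sums instead of rescanning for each candidate, but the output is the same.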

TO-DO:

  • Add a super fast matrix to data.table converter
  • Refactor LightGBM code
  • Better handling of LightGBM arguments
  • Better handling of LightGBM files
  • Fuse Laurae2/sparsity's SVMLight converter/reader and Laurae2/Laurae
  • Add Differential Evolution algorithm for feature selection and hyperparameter simultaneous optimization (add another backend via another interface as it typically takes a lot of time for both) (cancelled)
  • (Attempt to) Add automated non-linear feature creation using decision trees (cancelled)
  • Provide more for LauraeML
  • Provide dynamic shrinkage for Lextravagenza for maximally overfitting validation data

To add:

  • xgboost grid search (LauraeML)
  • xgboost unbalanced large dataset learning (cancelled)
  • large sparse matrix loader for categorical data (cancelled)
  • Categorical to Numeric converter: h2o's autoencoder, mxnet's autoencoder, t-SNE, Generalized Low Rank Models, largeVis, FeatureHashing - along with testing performance using xgboost
  • Logloss brute force calibration
  • Prediction Analyzer (analyze any type of model predictions, currently only binary)
  • Automated Feature Creator (create automatically features using linear (^1), quadratic (^2), cubic (^3), quartic (^4), mean, and standard deviation of different random features as inputs) - all with a GUI to interrupt without losing data with full verbosity of search
  • Automated Feature Analyzer (analyze created features and test against randomness of improvement) - all with verbosity
  • Leave-one-out encoding (encodes any categorical using a continuous variable such as the label or a feature)
  • AND MANY MORE...
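The leave-one-out encoding item above is concrete enough to sketch: each row's category is replaced by the mean of the label over the other rows sharing that category, which avoids leaking the row's own label. A hypothetical base-R illustration (loo_encode is an invented name, not the planned implementation):

```r
# Leave-one-out encoding sketch: per-category label sums and counts,
# then subtract the row's own label before averaging.
loo_encode <- function(category, label) {
  s <- tapply(label, category, sum)[as.character(category)]
  n <- tapply(label, category, length)[as.character(category)]
  out <- (s - label) / (n - 1)
  out[n == 1] <- NA  # singleton categories have no "other" observations
  unname(out)
}

loo_encode(c("a", "a", "a", "b", "b", "c"), c(1, 0, 1, 1, 0, 1))
```

In practice such encodings are usually regularized (noise or smoothing) to limit overfitting on rare categories.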

Extra contributors:

  • @fakyras for the base R code for LightGBM.

Installing this package? (Proper installation)

If you need the modeling packages, you will need LightGBM and xgboost compiled. xgboost must then be installed as an R package. Using the drat or CRAN versions is not guaranteed to work with my package.

Linux users can skip the detailed xgboost (https://github.com/dmlc/xgboost/tree/master/R-package) and LightGBM (https://github.com/Microsoft/LightGBM/wiki/Installation-Guide) installation steps below, as installation is straightforward there (compile from source).

Windows users need MinGW (architecture x86_64) and Visual Studio 2015 Community (or any working version, starting from 2013). Prepare at least 10 GB.

xgboost (~1 GB in Windows)

This applies to Windows only. Linux users can compile xgboost "out of the box" with the gcc toolchain and easily install the package in R.

Check first if you have RTools. If not, download a proper version here: https://cran.r-project.org/bin/windows/Rtools/

Check also whether you installed Git Bash or not. If not, install Git Bash (https://git-for-windows.github.io/).

Make sure you installed MinGW (mandatory) for x86_64 architecture.

Run in R: system('gcc -v')

  • If you don't see MinGW, then edit the PATH variable appropriately so MinGW is FIRST.
  • If you see MinGW, open Git Bash and run:
mkdir C:/xgboost
cd C:/xgboost
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git submodule init
git submodule update
alias make='mingw32-make'
cd dmlc-core
make
cd ../rabit
make lib/librabit_empty.a
cd ..
cp make/mingw64_min.mk config.mk
make

This should compile xgboost perfectly out of the box on Windows. If you get an error at the last "make", it means you are not using MinGW or something went wrong in the previous steps.

Now, fire up an R session and run this:

setwd('C:/xgboost/xgboost/R-package')
library(devtools)
install()

If you get a "permission denied" error, go to C:\xgboost\xgboost\R-package, right-click on the “src” folder, select “Properties”:

  • Under the “Security” tab, click “Edit”
  • Grant "Full control" to all group or user names (click on each group, then check Full control for each)
  • Click OK twice
  • Right-click on the “src” folder, select “Properties”
  • Under the “Security” tab, click “Advanced”
  • Check “Replace all child object permission entries with inheritable permission entries from this object” (it is the last box at the bottom left of the opened tab).
  • Click OK twice
  • Run again install() in the R console

And you should now have xgboost compiled on Windows.

Check quickly that xgboost works:

library(xgboost)
set.seed(11111)
n <- 100
ncov <- 4
z <- matrix(replicate(n, rnorm(ncov)), nrow = n)  # random covariates
alpha <- c(-1, 0.5, -0.25, -0.1)
za <- z %*% alpha
p <- exp(za) / (1 + exp(za))                      # logistic probabilities
t <- rbinom(n, 1, p)                              # binary labels
xgb.train(list(objective = "binary:logitraw"), xgb.DMatrix(data = z, label = t), nrounds = 10)

LightGBM installation (~10 GB in Windows)

  • Deprecated soon: I recommend using the official LightGBM R package, which I contribute to; it is a one-liner install in R and you do not even need Visual Studio (only Rtools). LightGBM in Laurae's package will be deprecated soon.

This applies to Windows only. Linux users can compile LightGBM "out of the box" with the gcc toolchain.

LightGBM uses Visual Studio (2013 or higher) to build on Windows. If you do not have Visual Studio, download Visual Studio 2015 Community; it is free. When installing Visual Studio Community, use the default installation method; a minimal installation can cause random errors on the UI. Prepare at least 8 GB of free drive space. Install it with the Visual C++ additions (custom install: select the first box, which has 3 sub-boxes - it should say it will install the Windows SDK; ignore the update failure error at the end).

Once you are done installing Visual Studio 2015 Community, reboot your computer.

Now (or if you skipped the installation step), clone the latest LightGBM repository (CLEARLY UNRECOMMENDED, see below) by running in Git Bash:

cd C:/xgboost
git clone --recursive https://github.com/Microsoft/LightGBM

If you want the stable (RECOMMENDED) version aligned with the Laurae package, use git clone --recursive https://github.com/Laurae2/LightGBM instead. The fully bleeding-edge devel version of LightGBM is almost guaranteed (99%+) not to work with this package: most things run, but it refuses to train on data most of the time, even via the direct command line.

Now the steps:

  • Under C:/xgboost/LightGBM/windows, double click LightGBM.sln to open it in Visual Studio.
  • Accept any warning pop up about project versioning issues (Upgrade VC++ Compiler and Libraries --> OK).
  • Wait one minute for the loading.
  • On the Solution Explorer, click "Solution 'LightGBM' (1 project)"
  • On the bottom right tab (Properties), change the "Active config" to "Release|x64" (default is "Debug_mpi|x64")
  • Compile the solution by pressing Ctrl+Shift+B (or click Build > Build Solution).
  • Should everything be correct, you now have LightGBM compiled under C:\xgboost\LightGBM\windows\x64\Release

If you get an error while building (Windows SDK version blabla), then you will need the correct SDK for your OS. Start Visual Studio from scratch, click "New Project", select "Visual C++" and click "Install Visual C++ 2015 Tools for Windows Desktop". Then, attempt to build LightGBM.

If Visual Studio fails to load the "project", delete the LightGBM folder and clone the repository again in Git Bash. If it still does not compile in Visual Studio, try adjusting the PATH to include the appropriate Windows SDK path, then restart Visual Studio and try compiling again. As a last resort: uninstall Visual Studio (using the installer), reboot, and reinstall using a Custom install, selecting all the Visual C++ components (the first box with 3 sub-boxes, which will also install the SDK). Then you should be able to compile it perfectly.

Once you compiled it (and after you installed everything else you need, like the Laurae package), create a folder named "test" in "C:/" (or any appropriate folder you have), and try to run the following in R (you will get two prompts: the first for the "temporary" directory you created, and the second for the LightGBM executable to select):

# Make sure data.table is loaded, the example below needs it
setwd(choosedir(caption = "Select the temporary folder"))
library(Laurae)
library(data.table)
library(stringi)

DT <- data.table(Split1 = c(rep(0, 50), rep(1, 50)), Split2 = rep(c(rep(0, 25), rep(0.5, 25)), 2))
DT$Split5 <- rep(c(rep(0, 5), rep(0.05, 5), rep(0, 10), rep(0.05, 5)), 4)
# Note: Split3 and Split4 must also exist in DT before running this line
label <- as.numeric((DT$Split2 == 0) & (DT$Split1 == 0) & (DT$Split3 == 0) & (DT$Split4 == 0) | ((DT$Split2 == 0.5) & (DT$Split1 == 1) & (DT$Split3 == 0.25) & (DT$Split4 == 0.1) & (DT$Split5 == 0)) | ((DT$Split1 == 0) & (DT$Split2 == 0.5)))

trained <- lgbm.train(y_train = label,
                      x_train = DT,
                      bias_train = NA,
                      application = "binary",
                      num_iterations = 1,
                      early_stopping_rounds = 1,
                      learning_rate = 1,
                      num_leaves = 16,
                      min_data_in_leaf = 1,
                      min_sum_hessian_in_leaf = 1,
                      tree_learner = "serial",
                      num_threads = 1,
                      lgbm_path = lgbm.find(),
                      workingdir = getwd(),
                      validation = FALSE,
                      files_exist = FALSE,
                      verbose = TRUE,
                      is_training_metric = TRUE,
                      save_binary = TRUE,
                      metric = "binary_logloss")

tabplot

To have "more readable" tableplots for visualizations, you will need to install an old version of the tabplot package. You can do this by running in your R console:

install.packages("https://cran.r-project.org/src/contrib/Archive/tabplot/tabplot_1.1.tar.gz", repos=NULL, type="source")

Other packages

You can install the other packages by running in your R console:

install.packages(c("data.table", "foreach", "doParallel", "rpart", "rpart.plot", "partykit", "tabplot", "ggplot2", "ggthemes", "plotluck", "grid", "gridExtra", "RColorBrewer", "lattice", "car", "CEoptim", "DT", "formattable", "rmarkdown", "shiny", "shinydashboard", "miniUI", "Matrix", "matrixStats", "R.utils", "Rtsne", "recommenderlab", "Rcpp", "RcppArmadillo", "mgcv", "Deriv", "outliers", "MASS", "stringi"))
devtools::install_github("ramnathv/rCharts")
devtools::install_github("Laurae2/sparsity")

Laurae

You can now install the Laurae package and use the fully fledged version of it.

devtools::install_github("Laurae2/Laurae")

Running in a Virtual Machine and/or have no proxy redirection from R? Use the following alternative:

devtools::install_git("git://github.com/Laurae2/Laurae.git")

Getting a package error while running install_github/install_git that is not "could not connect to server"? Make sure you have installed the package named in the error; it is required by devtools.

  • (Soon Deprecated) Use LightGBM in R (first wrapper available in R for LightGBM) tuned for maximum I/O without using in-memory dataset moves (which is both a good and bad thing! - 10GB of data takes 4 mins of travel in a HDD) and use feature importance with smart and readable plots - I recommend using official LightGBM R Package which I contribute to
  • Automated Machine Learning from a set of features and hyperparameters (provide algorithm functions, features, and hyperparameters, and a stochastic optimizer does the job for you, with full logging if required)
  • Use a repeated cross-validated xgboost (Extreme Gradient Boosting)
  • Get pretty interactive feature importance tables of xgboost ready-to-use for markdown documents
  • Throw supervised rules using outliers anywhere you feel it appropriate (univariate, bivariate)
  • Create cross-validated and repeated cross-validated folds for supervised learning with more options for creating them (like batch creation - those ones can be fed into my LightGBM R wrapper for extensive analysis of feature behavior)
  • Feature Engineering Assistant (mostly non-linear version) using automated decision trees
  • Dictionary of loss functions and ready to input into xgboost (currently: Absolute Error, Squared Error, Cubic Error, Loglikelihood Error, Poisson Error, Kullback-Leibler Error)
  • Symbolic Derivation for custom loss functions (finding gradient/hessian painlessly)
  • Lextravagenza model (dynamic boosted trees), which is good for small boosting iterations, bad for high boosting iterations (good for diversity)
  • Partial dependency analysis for single observation: the way to get insights on why a black box made a specific decision!
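The loss-function dictionary above supplies gradient/hessian pairs in the shape xgboost's custom objectives expect. A minimal base-R sketch for squared error (illustrative only; mse_obj is an invented name, and a real xgboost custom objective has signature function(preds, dtrain) and reads labels via getinfo(dtrain, "label") - labels are passed directly here to keep the sketch self-contained):

```r
# For squared error L = (pred - label)^2:
#   gradient = 2 * (pred - label)
#   hessian  = 2 (constant)
mse_obj <- function(preds, labels) {
  list(grad = 2 * (preds - labels),
       hess = rep(2, length(preds)))
}

mse_obj(c(0.2, 0.7), c(0, 1))
```

The Symbolic Derivation helpers automate exactly this step: deriving grad/hess from the loss expression instead of doing it by hand.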

Unsupervised Learning:

  • Auto-tune t-SNE (t-Distributed Stochastic Neighbor Embedding) - it already comes with premade hyperparameters tuned for minimal reproduction loss!

Automated Reporting for Machine Learning:

  • Generate an in-depth automated report for linear regression with interactive elements.
  • Generate an in-depth automated report for xgboost regression/classification with interactive elements, with unbiased feature importance computations

Interactive Analysis:

  • Discover and optimize gradient and hessian functions interactively in real-time
  • Plot up to 1 dependent variable, 2 independent variables, 2 conditioning variables, and 1 weighting variable for Exploratory Data Analysis using ggplot, in real-time
  • Plot up to three variables for Exploratory Data Analysis using 3djs via NVD3, in real-time
  • Plot several variables for Exploratory Data Analysis using 3djs via Plotly/ggplot, in real-time
  • Discover rule-based (from decision trees) non-linear relationship between variables, with rules ready to be copied and pasted for data.tables
  • Visualize interactively Color Brewer palettes with unlimited colors (unlike the original palettes), with ready to copy&paste color codes as vectors

Optimization:

  • Do feature selection & hyperparameter optimization using Cross-Entropy optimization & Elite optimization
  • Do the same optimization but with any variable (continuous, ordinal, discrete) for any function using fully personalized callbacks (which is both a great thing and a hassle for the user) and a personalized training backend (by default it uses xgboost as the predictor for next steps, you can modify it by another (un)supervised machine learning model!)
  • Symbolic Derivation for custom loss functions (finding gradient/hessian painlessly)

Improvements & Extras:

  • Improve data.table memory efficiency by up to 3X while keeping a large part of its performance (best of both worlds? isn't that insane?)
  • Improve Cross-Entropy optimization by providing a more powerful frontend (at the expense of the user's required knowledge) in order to converge better on feature selection, but slower on hyperparameter optimization of black boxes
  • Load sparse data directly as dgCMatrix (sparse matrix)
  • Plot massive amount of data in an easily readable picture
  • Add unlimited colors to the Color Brewer palettes
  • Add the ability to add linear equation coefficient to ggplot facets
  • Add multiplot ggplot

Sparsity SVMLight converter benchmark:

  • Benchmark to convert a dgCMatrix with 2,500,000 rows and 8,500 columns (1.1GB in memory) => 5 minutes
  • Other existing converters would likely need hours, if not days, for such a size.
  • Currently not merged on this repository: see https://github.com/Laurae2/sparsity !
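For reference, the SVMLight text format produced by the converter is "label index:value ..." with 1-based indices and zero entries omitted. A naive base-R writer over a dense matrix (purely illustrative, with an invented name to_svmlight; nowhere near the speed of the compiled dgCMatrix converter):

```r
# Build one SVMLight line per row: label, then index:value pairs
# for the non-zero entries only.
to_svmlight <- function(x, y) {
  vapply(seq_len(nrow(x)), function(i) {
    nz <- which(x[i, ] != 0)
    paste(c(y[i], paste0(nz, ":", x[i, nz])), collapse = " ")
  }, character(1))
}

m <- matrix(c(0, 1.5, 2, 0, 0, 3), nrow = 2, byrow = TRUE)
to_svmlight(m, c(1, 0))
```

The real converter works column-compressed (dgCMatrix slots) instead of scanning rows of a dense matrix, which is where the speed difference comes from.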

Nice pictures:

  • Partial Dependence for single observation analysis (5-variate example):

  • Partial Dependence for multiple observation analysis (univariate example):

  • LightGBM Feature Importance:

  • xgboost Interactive Feature Importance:

  • Automated Reporting with pretty tables:

  • Interactive Symbolic Derivation:

  • Interactive EDA using 3djs/Plotly/ggplot2:

  • Interactive Feature Engineering Assistant:

Installing this package? (Improper installation)

The proper installation steps are in the "Installing this package? (Proper installation)" section.

If you already installed this package in the past, or you want to install this package super fast because you want the functions, run in R:

devtools::install_github("Laurae2/Laurae")

Running in a Virtual Machine and/or have no proxy redirection from R? Use the following alternative:

devtools::install_git("git://github.com/Laurae2/Laurae.git")

Need all R dependencies in one shot?

devtools::install_github("ramnathv/rCharts")
install.packages("https://cran.r-project.org/src/contrib/Archive/tabplot/tabplot_1.1.tar.gz", repos=NULL, type="source")
install.packages(c("data.table", "foreach", "doParallel", "rpart", "rpart.plot", "partykit", "tabplot", "ggplot2", "ggthemes", "plotluck", "grid", "gridExtra", "RColorBrewer", "lattice", "car", "CEoptim", "DT", "formattable", "rmarkdown", "shiny", "shinydashboard", "miniUI", "Matrix", "matrixStats", "R.utils", "Rtsne", "recommenderlab", "Rcpp", "RcppArmadillo", "mgcv", "Deriv", "outliers", "MASS", "stringi"))
devtools::install_github("Laurae2/sparsity")

Getting "Failed with error: 'there is no package called 'sparsity''"? Run install_github("Laurae2/sparsity") or install_git("git://github.com/Laurae2/sparsity.git"), either to get rid of this error or to use the super fast column-compressed sparse matrix (dgCMatrix) -> SVMLight converter in R.

What do you need?

In case I am missing something (please make a pull request if something required is missing):

Package | Requires compilation? | Which functions?
Microsoft/LightGBM | YES (install separately, from PR 33*) | lgbm.train, lgbm.predict, lgbm.cv, lgbm.cv.prep, lgbm.fi, lgbm.metric, lgbm.fi.plot, LauraeML_lgbreg
dmlc/xgboost | YES (install separately, from PR 1855**) | xgb.ncv, xgb.opt.depth, report.xgb, LauraeML_gblinear, LauraeML_gblinear_par, Lextravagenza, pred.Lextravagenza, predictor_xgb, CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred
Laurae2/sparsity | YES (***) | lgbm.train, lgbm.predict, lgbm.cv, lgbm.cv.prep
data.table | No | read_sparse_csv, lgbm.train, lgbm.predict, lgbm.cv, lgbm.cv.prep, lgbm.fi, lgbm.fi.plot, DTcbind, DTrbind, DTsubsample, setDF, DTfillNA, report.lm, report.xgb, interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer, LauraeML, LauraeML_gblinear, LauraeML_gblinear_par, partial_dep.obs, partial_dep.obs_all, predictor_xgb, partial_dep.plot, partial_dep.feature, cbindlist, CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred
foreach | No | LauraeML_gblinear_par
doParallel | No | LauraeML_gblinear_par
rpart | No | FeatureLookup, interactive.eda_tree
rpart.plot | No | FeatureLookup, interactive.eda_tree
partykit | No | interactive.eda_tree
tabplot | No | tableplot_jpg, interactive.eda_ggplot, partial_dep.plot
rCharts | No | interactive.eda_3djs
plotly | No | interactive.eda_plotly, partial_dep.plot
ggplot2 | No | lgbm.fi.plot, report.lm, report.xgb, interactive.eda_ggplot, partial_dep.plot, stat_smooth_func, stat_smooth_func.plotly, grid_arrange_shared_legend
ggthemes | No | interactive.eda_plotly
GGally | No | partial_dep.plot
plotluck | No | interactive.eda_ggplot
grid | No | report.lm, report.xgb, interactive.eda_tree
gridExtra | No | report.lm, report.xgb
RColorBrewer | No | interactive.eda_plotly, interactive.eda_RColorBrewer, brewer.pal_extended
lattice | No | report.lm, report.xgb, partial_dep.plot
car | No | .ExtraOpt_plot, partial_dep.plot
CEoptim | No | ExtraOpt, LauraeML
DT | No | xgb.importance.interactive, report.lm, report.xgb
formattable | No | report.lm, report.xgb
rmarkdown | No | report.lm, report.xgb, interactive.eda_tree
shiny | No | interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer
shinydashboard | No | interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer
Matrix | No | read_sparse_csv, CRTreeForest, CRTreeForest_pred, CascadeForest, CascadeForest_pred, MGScanning, MGScanning_pred
matrixStats | No | report.lm, report.xgb
R.utils | No | rule_single, rule_double, report.lm, report.xgb
Rtsne | No | tsne_grid
recommenderlab | No | read_sparse_csv (only when using NAs as sparse)
Rcpp | No | sparsity (package)
RcppArmadillo | No | report.lm
Deriv | No | SymbolicLoss, interactive.SymbolicLoss
outliers | No | rule_single, rule_double
MASS | No | interactive.eda_plotly
stringi | No | lgbm.cv
None so far | No | kfold, nkfold, lgbm.find

Manual installations:

Installing dependencies?

  • For LightGBM (use PR 33 please), please do NOT use: git clone --recursive https://github.com/Microsoft/LightGBM for the repository. Use my stable version which is aligned with Laurae package via git clone --recursive https://github.com/Laurae2/LightGBM. Then follow the installation steps (https://github.com/Microsoft/LightGBM/wiki/Installation-Guide).
  • For xgboost, refer to my documentation for installing in MinGW: https://github.com/dmlc/xgboost/tree/master/R-package - If you encounter strange issues in Windows (like permission denied, etc.), please read: https://medium.com/@Laurae2/compiling-xgboost-in-windows-for-r-d0cb826786a5. Make sure you are using MinGW.
  • sparsity: You must use Laurae's sparsity package (SVMLight I/O conversion) which can be found here: https://github.com/Laurae2/sparsity/blob/master/README.md - compilation simply requires running devtools::install_github("Laurae2/sparsity") (and having Rtools on Windows).
  • tabplot: please use: install.packages("https://cran.r-project.org/src/contrib/Archive/tabplot/tabplot_1.1.tar.gz", repos=NULL, type="source"). The 1.3 version is "junk" since they added standard deviation, which makes tableplots unreadable when it is too high, even if standard deviation is disabled.

Strange errors on first run

Sometimes you will get strange errors (like a corrupted documentation database) on the very first load of the package. Restart R to get rid of this issue. It does not show up again afterwards.

Printed text is missing after interrupting LightGBM / xgboost

Write in your R console sink() until you get an error.

A lot of functions that worked are giving errors.

Write in your R console sink() until you get an error.
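The sink() advice can be automated: R tracks how many output diversions are active via sink.number(), so you can pop them all in one loop instead of calling sink() repeatedly until it errors:

```r
# Pop every pending output diversion left behind by an interrupted
# LightGBM / xgboost run, restoring printing to the console.
while (sink.number() > 0) sink()
```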

What is inside?

Utility | Function Name(s)
Supervised Learning | xgboost: xgb.ncv, xgb.opt.depth, xgb.importance.interactive; LightGBM: lgbm.train, lgbm.predict, lgbm.cv, lgbm.metric, lgbm.fi, lgbm.fi.plot, lgbm.find; Rules: rule_single, rule_double; Base: kfold, nkfold; Helpers: SymbolicLoss, FeatureLookup, ExtraOpt, LauraeML, Lextravagenza, pred.Lextravagenza
Unsupervised Learning | t-SNE: tsne_grid
Automated Reporting | report.lm, report.xgb
Visualizations | tableplot_jpg, interactive.SymbolicLoss, interactive.eda_ggplot, interactive.eda_tree, interactive.eda_3djs, interactive.eda_plotly, interactive.eda_RColorBrewer
Extreme low-memory manipulation | data.table: setDF, DTcbind, DTrbind, DTsubsample, DTfillNA; CSV sparse: read_sparse_csv
Function Name | Type | What is it for
Laurae_load | Dependency load | Attempts to load all Laurae dependencies.
tsne_grid | Dimensionality Reduction + Grid Search | Allows to grid search a seed and a perplexity interval using t-SNE, while returning the best t-SNE model along with the best iteration found, all in a fully verbose fashion.
read_sparse_csv | Iterated numeric sparse matrix reading | R always imports CSVs as dense. This function allows to read very large CSVs in chunks by variables (or a specific subset of variables), outputting a sparse matrix with typically lower RAM usage than a dense matrix if sparsity is high enough, all in a fully verbose fashion. Sparsity can be defined as 0 or NA, and saving as RDS is available during the loading streak.
tableplot_jpg | Batch tableplot output to JPEG | Allows to create a tableplot which is immediately turned into JPEG in batch per variable, against a label. It allows to preview features in a more understandable fashion than eyeballing numeric values.
xgb.ncv | Repeated xgboost Cross-Validation | Allows to run a repeated xgboost cross-validation with full verbosity of aggregate summaries, computation time, and ETA of computation, with fixed seed and a sink to store xgboost verbose data, and also out-of-fold predictions and external data prediction.
rule_single | Outlying Univariate Continuous Association Rule Finder | Allows to use an outlying univariate continuous association rule finder on data and predicts immediately. Intermediate outlying scores can be stored. High verbosity of outputs during computation.
rule_double | Outlying Bivariate Linear Continuous Association Rule Finder | Allows to use an outlying bivariate linear continuous association rule finder on data and predicts immediately. Intermediate outlying scores cannot be stored. If a bivariate combination is ill-conditioned (sum of correlation matrix = 4), that bivariate combination is skipped to avoid a solver matrix inversion crash/freeze/interruption when trying to compute the Mahalanobis distance dimensionality reduction. High verbosity of outputs during computation. Potential TO-DO: give the user the possibility to use their own dimensionality reduction function (like a truncated PCA 1-axis).
xgb.opt.depth | xgboost Depth Optimizer | Allows to optimize xgboost's depth parameter using simple heuristics. The learner function is customizable to fit any other model requiring to work by integer steps. Hence, it is adaptable to work on continuous 1-D features, with a large safety net you define yourself by coercing the integer to your own range.
lgbm.train | LightGBM trainer | Trains a LightGBM model. Full verbosity control, with logging to file possible. Allows to predict out of the box during the training on the validation set and a test set.
lgbm.predict | LightGBM predictor | Predicts from a LightGBM model. Use the model working directory if you lost the model variable (which is not needed to predict - you only need the correct model working directory and the model name).
lgbm.cv | LightGBM CV trainer | Cross-validates a LightGBM model, returns out-of-fold predictions, ensembled average test predictions (if provided a test set), and cross-validated feature importance. Full verbosity control, with logging to file possible, with predictions given back as return. Subsampling is optimized to the maximum to lower memory usage peaks.
lgbm.cv.prep | LightGBM CV preparation helper | Prepares the data for using lgbm.cv. All required data files are output, so you can run lgbm.cv with files_exist = TRUE without preparing the files again.
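For intuition about the kfold/nkfold helpers listed above, here is a hypothetical base-R sketch of stratified fold creation (stratified_kfold is an invented name; the package functions offer more, such as unstratified and repeated folds):

```r
# Shuffle indices within each class, then deal them round-robin into k
# folds so every fold keeps roughly the class balance of the full label.
stratified_kfold <- function(label, k, seed = 11111) {
  set.seed(seed)
  folds <- integer(length(label))
  for (cl in unique(label)) {
    idx <- sample(which(label == cl))
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  split(seq_along(label), folds)  # list of k index vectors
}

label <- c(rep(0, 6), rep(1, 3))
stratified_kfold(label, k = 3)
```

With 6 zeros and 3 ones, each of the 3 folds gets 2 zeros and 1 one.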

Version

0.0.0.9001

License

What license is it under? A license where I'm not liable for anything bad you do with it.

Maintainer

First Last

Last Published

March 31st, 2017

Functions in Laurae (0.0.0.9001)

brewer.pal_extended

RColorBrewer's Extended Palettes without Warning
DTcbind

data.table column binding (nearly without) copy
DT2mat

data.table to matrix
ExtraOpt

Cross-Entropy -based Hybrid Optimization
bandwidth_rot

MASS' Rule of Thumb Bandwidth Estimation
FastROC

Fast AUROC (AUC, ROC) computation
DTfillNA

data.table NA fill (nearly without) copy (or data.frame)
DTsubsample

data.table subsampling (nearly without) copy
DTcolsample

data.table colsampling (nearly without) copy
DTrbind

data.table row binding (nearly without) copy
get.max_f1

Maximum F1 Score (Precision with Sensitivity harmonic mean)
get.max_acc

Maximum binary accuracy
FeatureLookup

The Non-Linear Feature Engineering Assistant
get.max_precision

Maximum Precision (Positive Predictive Value)
get.max_missrate

Minimum Miss-Rate (False Negative Rate)
get.max_specificity

Maximum Specificity (True Negative Rate)
get.max_mcc

Maximum Matthews Correlation Coefficient
get.max_sensitivity

Maximum Sensitivity (True Positive Rate)
get.max_kappa

Maximum Kappa statistic
get.max_fallout

Minimum Fall-Out (False Positive Rate)
interactive.SymbolicLoss

Interactive Dashboard for Symbolic Gradient/Hessian Loss Behavior Exploration
interactive.eda_plotly

Interactive Dashboard for Exploratory Data Analysis (Plotly)
Laurae_load

Laurae Package Loader
interactive.eda_ggplot

Interactive Dashboard for Exploratory Data Analysis (ggplot)
interactive.eda_RColorBrewer

Interactive Dashboard for Finding the Perfect Color Brewer Palette
GetPartyRules

partykit's Party Rules to data.table
kfold

(Un)Stratified k-fold for any type of label
interactive.eda_d3js

Interactive Dashboard for Exploratory Data Analysis (d3js)
kernel2d_est

MASS' Two-Dimensional Kernel Density Estimation
interactive.eda_tree

Interactive Dashboard for the Non-Linear Feature Engineering Assistant
LauraeML_gblinear

Laurae's Machine Learning (xgboost gblinear helper function)
Laurae-package

Laurae's package for (very) advanced Data Science for R
LauraeML_lgbreg

Laurae's Machine Learning (LightGBM regression helper function)
LauraeML_gblinear_par

Laurae's Machine Learning (xgboost gblinear helper parallel function)
LauraeML_utils.xgb_data

Laurae's Machine Learning Utility: create xgboost dataset
LauraeML_utils.badscore

Laurae's Machine Learning Utility: bad input score
LauraeML_utils.lgb_data

Laurae's Machine Learning Utility: create LightGBM dataset
LauraeML_utils.newlog

Laurae's Machine Learning Utility: new input logger
LauraeML_utils.feat_sel

Laurae's Machine Learning Utility: subset features to select during training
lgbm.cv

LightGBM Cross-Validated Model Training
LauraeML_utils.badlog

Laurae's Machine Learning Utility: bad input logger
lgbm.cv.prep

LightGBM Cross-Validated Model Preparation
Lextravagenza

Laurae's Extravagenza machine learning model
LauraeML

Laurae's Machine Learning (Automated modeling, Automated stacking)
loss_LL_grad

Loglikelihood Error (gradient function)
lgbm.train

LightGBM Model Training
lgbm.predict

LightGBM Prediction
loss_MCE_math

Mean Cubic Error (math function)
loss_LL_hess

Loglikelihood Error (hessian function)
loss_MCE_xgb

Mean Cubic Error (xgboost function)
loss_LL

Loglikelihood Error (computation function)
loss_MAE_grad

Mean Absolute Error (gradient function)
loss_MSE_math

Mean Squared Error (symbolic function)
loss_MSE_hess

Mean Squared Error (hessian function)
plotting.max_acc

Maximum binary accuracy plotting
print_fp

Print appropriately formatted fixed point
plotting.max_f1

Maximum F1 Score (Precision with Sensitivity harmonic mean) plotting
print_hyb

Print appropriately formatted integer or fixed point (hybrid)
loss_LKL_hess

Laurae's Kullback-Leibler Error (hessian function)
loss_LKL_math

Laurae's Kullback-Leibler Error (math function)
loss_MAE_hess

Mean Absolute Error (hessian function)
loss_MAE_math

Mean Absolute Error (symbolic function)
loss_MCE

Mean Cubic Error (computation function)
loss_MSE_grad

Mean Squared Error (gradient function)
loss_Poisson

Laurae's Poisson Error (computation function)
nkfold

(Un)Stratified Repeated k-fold for any type of label
prob.max_f1

Probability F1 Score (Precision with Sensitivity harmonic mean)
lgbm.fi

LightGBM Feature Importance
prob.max_acc

Probability binary accuracy
loss_LKL_xgb

Laurae's Kullback-Leibler Error (xgboost function)
lgbm.fi.plot

LightGBM Feature Importance Plotting
loss_LKL

Laurae's Kullback-Leibler Error (computation function)
loss_MAE_xgb

Mean Absolute Error (xgboost function)
loss_MAE

Mean Absolute Error (computation function)
loss_MSE_xgb

Mean Squared Error (xgboost function)
loss_MSE

Mean Squared Error (computation function)
plotting.max_mcc

Maximum Matthews Correlation Coefficient plotting
plotting.max_missrate

Minimum Miss-Rate (False Negative Rate) plotting
plotting.max_specificity

Maximum Specificity (True Negative Rate) plotting
pred.Lextravagenza

Laurae's Extravagenza machine learning model prediction
prob.max_mcc

Probability Matthews Correlation Coefficient
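Helpers like `prob.max_mcc` work with the Matthews Correlation Coefficient at a probability cutoff; a sketch of MCC at one fixed threshold, the quantity such maximizers search over (illustrative helper, not the package's implementation):

```r
# MCC from the confusion matrix at a fixed probability threshold.
mcc <- function(y, p, threshold = 0.5) {
  pred <- as.numeric(p >= threshold)
  tp <- sum(pred == 1 & y == 1); tn <- sum(pred == 0 & y == 0)
  fp <- sum(pred == 1 & y == 0); fn <- sum(pred == 0 & y == 1)
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) 0 else (tp * tn - fp * fn) / denom
}

m <- mcc(c(0, 0, 1, 1), c(0.1, 0.6, 0.7, 0.9))
m  # MCC at the default 0.5 cutoff
```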
prob.max_missrate

Probability Miss-Rate (False Negative Rate)
xgb.max_f1

xgboost evaluation metric for maximum F1 Score (Precision with Sensitivity harmonic mean)
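Metrics in the `xgb.max_*` family report the best value of a thresholded metric over all cutoffs; for F1 that search can be sketched as follows (illustrative helper, not the package's implementation):

```r
# Maximum F1 over all candidate thresholds (sorted unique predictions).
max_f1 <- function(y, p) {
  ths <- sort(unique(p))
  f1  <- sapply(ths, function(t) {
    tp <- sum(p >= t & y == 1)
    fp <- sum(p >= t & y == 0)
    fn <- sum(p <  t & y == 1)
    if (tp == 0) return(0)
    prec <- tp / (tp + fp); rec <- tp / (tp + fn)
    2 * prec * rec / (prec + rec)  # harmonic mean of precision and recall
  })
  c(f1 = max(f1), threshold = ths[which.max(f1)])
}

res <- max_f1(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))
res  # best F1 and the cutoff achieving it
```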
xgb.max_fallout

xgboost evaluation metric for minimum Fall-Out (False Positive Rate)
lgbm.find

Find LightGBM Path
loss_LL_math

Loglikelihood Error (math function)
lgbm.metric

LightGBM Metric Output
loss_LL_xgb

Loglikelihood Error (xgboost function)
loss_Poisson_hess

Laurae's Poisson Error (hessian function)
plotting.max_fallout

Minimum Fall-Out (False Positive Rate) plotting
loss_Poisson_grad

Laurae's Poisson Error (gradient function)
plotting.max_kappa

Maximum Kappa statistic plotting
prob.max_fallout

Probability Fall-Out (False Positive Rate)
prob.max_kappa

Probability Kappa statistic
rule_double

Outlying bivariate linear continuous association rule finder
timer

Get Current Time in Milliseconds
report.xgb

Extreme Gradient Boosting HTML report
tsne_grid

t-SNE grid search function
xgb.max_sensitivity

xgboost evaluation metric for maximum Sensitivity (True Positive Rate)
xgb.max_specificity

xgboost evaluation metric for maximum Specificity (True Negative Rate)
read_sparse_csv

Read sparse (numeric) CSVs
xgb.ncv

xgboost repeated cross-validation (Repeated k-fold)
prob.max_specificity

Probability Specificity (True Negative Rate)
xgb.opt.depth

xgboost depth automated optimizer
loss_LKL_grad

Laurae's Kullback-Leibler Error (gradient function)
LogLoss

Fast Logarithmic Loss (logloss) computation
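Binary logloss reduces to one vectorized expression once predictions are clipped away from 0 and 1, the standard guard a fast implementation applies; a minimal sketch (the epsilon value is a common convention, not necessarily the package's):

```r
# Binary logarithmic loss with probability clipping.
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(c(0, 1), c(0.1, 0.9))  # -mean(log(0.9), log(0.9))
```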
loss_MCE_grad

Mean Cubic Error (gradient function)
loss_Poisson_math

Laurae's Poisson Error (math function)
loss_MCE_hess

Mean Cubic Error (hessian function)
loss_Poisson_xgb

Laurae's Poisson Error (xgboost function)
print_int

Print appropriately formatted integer
plotting.max_precision

Maximum Precision (Positive Predictive Value) plotting
plotting.max_sensitivity

Maximum Sensitivity (True Positive Rate) plotting
print_multi

Print appropriately formatted hyperparameters and error
prob.max_sensitivity

Probability Sensitivity (True Positive Rate)
prob.max_precision

Probability Precision (Positive Predictive Value)
xgb.importance.interactive

xgboost feature importance interactive table
xgb.max_acc

xgboost evaluation metric for maximum binary accuracy
xgb.max_missrate

xgboost evaluation metric for minimum Miss-Rate (False Negative Rate)
setDF

Convert data.table to data.frame without copy
rule_single

Outlying univariate continuous association rule finder
xgb.max_precision

xgboost evaluation metric for maximum Precision (Positive Predictive Value)
xgb.max_kappa

xgboost evaluation metric for maximum Kappa statistic
xgb.max_mcc

xgboost evaluation metric for maximum Matthews Correlation Coefficient
report.lm

Linear Regression Modeling HTML report
report.xgb.helper

Extreme Gradient Boosting HTML report helper function
SymbolicLoss

Symbolic Gradient/Hessian Loss computation
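Symbolic gradient/hessian derivation can lean on base R's `stats::D`, which differentiates expressions symbolically; a sketch for squared error (how `SymbolicLoss` itself is built may differ):

```r
# Differentiate the loss expression with respect to the prediction p,
# then differentiate again for the hessian.
expr <- quote((p - y)^2)          # loss as an R expression
grad <- D(expr, "p")              # symbolic: 2 * (p - y)
hess <- D(grad, "p")              # symbolic: 2

eval(grad, list(p = 1.5, y = 1))  # 1
eval(hess, list(p = 1.5, y = 1))  # 2
```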
tableplot_jpg

Batch tableplot generator to JPEG
grid_arrange_shared_legend

Multiple ggplot plots per page with a shared legend
partial_dep.feature

Partial Dependency, output analyzer
partial_dep.plot

Partial Dependency, plotting function
predictor_xgb

Partial Dependency, xgboost predictor
stat_smooth_func

ggplot facet function with printed formula (non-Plotly)
stat_smooth_func.plotly

ggplot facet function with printed formula (Plotly)
partial_dep.obs_all

Partial Dependency Observation, Contour (multiple observations)
partial_dep.obs

Partial Dependency Observation, Contour (single observation)
xgboard.time

Xgboard Metric Evaluation Time Reset (Environment)
df_mae

Mean Absolute Error (MAE) (computation function, any size)
df_mse

Mean Squared Error (MSE) (computation function, any size)
df_logloss

Logarithmic Loss (Logloss) (computation function, any size)
df_r

Pearson Coefficient of Correlation (R) (computation function, any size)
timer_func

Get Function Time in Milliseconds
timer_func_print

Get Function Time in Milliseconds (with printing)
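Millisecond timing of a function call needs nothing beyond base R; a sketch in the spirit of the `timer_func` helpers (`Sys.time` resolution is platform-dependent, and the names here are illustrative):

```r
# Current wall-clock time in milliseconds since the epoch.
timer_ms <- function() as.numeric(Sys.time()) * 1000

# Run f(...) and report both its result and the elapsed milliseconds.
time_func_ms <- function(f, ...) {
  t0  <- timer_ms()
  res <- f(...)
  list(result = res, elapsed_ms = timer_ms() - t0)
}

out <- time_func_ms(function(n) mean(seq_len(n)), 1e6)
out$result  # 500000.5
```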
df_acc_bin

Accuracy loss (Acc) (computation function, binary optimization)
df_mce

Mean Cubic Error (MCE) (computation function, any size)
df_mape

Mean Absolute Percentage Error (MAPE) (computation function, single vector)
xgboard.eval.error

Xgboard Metric Evaluation Error (Binary Accuracy)
xgboard.eval.logloss

Xgboard Metric Evaluation Logloss (Binary Logloss)
xgboard.init

Xgboard Metric Evaluation Initialization (Environment)
df_r2

Coefficient of Determination (R^2) (computation function, any size)
MGScanning

Multi-Grained Scanning implementation in R
MGScanning_pred

Multi-Grained Scanning Predictor implementation in R
df_rmse

Root Mean Squared Error (RMSE) (computation function, any size)
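The `df_*` regression metrics are plain vectorized computations over prediction/label vectors; sketches for MAE, RMSE, and MAPE (illustrative names, and MAPE assumes no zero labels):

```r
# Vectorized error metrics: each takes labels y and predictions p
# of any common length.
mae  <- function(y, p) mean(abs(p - y))
rmse <- function(y, p) sqrt(mean((p - y)^2))
mape <- function(y, p) mean(abs((p - y) / y))

y <- c(1, 2, 4); p <- c(1, 3, 2)
mae(y, p)   # mean(0, 1, 2) = 1
rmse(y, p)  # sqrt(5/3)
```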
xgboard.run

Xgboard Server Launcher (Web Interface Creator)
CRTreeForest_pred

Complete-Random Tree Forest Predictor implementation in R
cbindlist

data.table column list binding
CRTreeForest

Complete-Random Tree Forest implementation in R
df_acc

Accuracy loss (Acc) (computation function, any size)
mean2

Fast mean computation
xgboard.dump

Xgboard Dumper
df_spearman

Spearman Coefficient of Correlation (rho) (computation function, any size)
CascadeForest_pred

Cascade Forest Predictor implementation in R
df_auc

Area Under the Curve Loss (AUC) (computation function, any size)
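A fast AUC computation in the style of `df_auc` is commonly done with the rank-based (Mann-Whitney) formulation rather than by tracing the ROC curve; a sketch (whether the package uses exactly this formulation is an assumption):

```r
# AUC via the Mann-Whitney U statistic: average ranks handle ties.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
```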
df_medae

Median Absolute Error (MedAE) (computation function, single vector)
df_medpae

Median Absolute Percentage Error (MedPAE) (computation function, single vector)
xgboard.xgb

Xgboard Metric Evaluation Creator (Wrapper)
CRTree_Forest_pred_internals

Complete-Random Tree Forest Predictor (Deferred predictor) implementation in R
CascadeForest

Cascade Forest implementation in R
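Conceptually, one cascade-forest layer (per the Deep Forest paper referenced above) refits several learners on the original features augmented with the previous layer's class probabilities. A highly simplified sketch with stand-in `fit`/`predict_proba` functions, not the package's `CascadeForest` internals:

```r
# One cascade layer: train n_learners models, then append their
# predicted probabilities to the feature matrix for the next layer.
cascade_layer <- function(X, y, fit, predict_proba, n_learners = 2) {
  models <- lapply(seq_len(n_learners), function(i) fit(X, y))
  probs  <- lapply(models, function(m) predict_proba(m, X))
  list(models = models, augmented = cbind(X, do.call(cbind, probs)))
}

# Toy stand-ins: the "model" is just the positive-class rate.
fit_stub  <- function(X, y) mean(y)
pred_stub <- function(m, X) rep(m, nrow(X))

X <- matrix(runif(12), nrow = 4)
layer <- cascade_layer(X, c(0, 1, 1, 0), fit_stub, pred_stub)
dim(layer$augmented)  # 4 rows, 3 original + 2 appended columns
```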