---
title: "MetaClean - Walkthrough"
author:
- name: "Kelsey Chetnik"
  email: "kelsey.chetnik@gmail.com"
date: "`r BiocStyle::doc_date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Create Peak Integration Quality Classifier for Assessment of Untargeted Metabolomics Features}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#library(MetaClean)
```

# Overview of MetaClean

`MetaClean` is a package for building classifiers to identify low-quality integrations in untargeted metabolomics data. It uses a combination of 11 peak quality metrics and 8 potential machine learning algorithms to build predictive models from user-provided chromatographic data and associated labels. Once a predictive model has been built, it can be used to assign predicted labels and class probabilities to untargeted metabolomics datasets. The package is designed for use with the preprocessing package XCMS and can be easily integrated into existing untargeted metabolomics pipelines.

# How to use MetaClean

`MetaClean` has two main use cases: (1) Training a Classifier Using User-Provided Data and (2) Using Existing Models to Make Predictions. This tutorial walks the user through the steps for each.

!!IMPORTANT!! While any version of XCMS's peak-picking, retention time correction, and grouping functions may be used, this package requires the user to provide two objects produced by the getEIC() and fillPeaks() functions. These functions require objects of the "xcmsSet" class, which has since been replaced by the "XCMSnExp" class. If using the newest functions provided by XCMS, please convert the "XCMSnExp" object to an "xcmsSet" object using as(XCMSnExp_object, "xcmsSet").

## Part 1: Training a Classifier Using User-Provided Data

This section explains how to build a classifier using data provided by the user. It is recommended that users create classifiers specific to the mode, matrix, and instrumentation of the dataset for those methodologies that are frequently used by the lab. For example, if a lab often runs plasma samples in Reverse Phase Negative Mode using the same column and the same instrument, a classifier can be built once using a dataset generated with this method and then saved and reused to make predictions for every additional run that uses this methodology.

Before the classifier can be trained, the user must invest some time in visually assessing and labeling at least two datasets: a development dataset to be used for training and at least one test dataset to be used for evaluating the performance of the classifiers and selecting the best model. It is recommended that the user perform the following steps to prepare the development and test datasets:

- (Optionally) Use MetaClean::rsdFilter() to filter out EICs with RSD greater than a user-provided threshold.
- Randomly select 500 EICs (without replacement) from the dataset of interest to serve as the development dataset (see the sketch after this list).
- Randomly select an additional 500 EICs (without replacement) from the dataset of interest, ensuring no overlap between the two sets of EICs, to serve as the test dataset.
- For each dataset:
    - Run fillPeaks() and getEIC() on the associated "group" object (convert to the "xcmsSet" class first if necessary).
    - Print out the plots for the 500 EICs using the xcmsEIC object produced by getEIC().
    - Have an expert visually assess the quality of each integration and label it as "PASS" or "FAIL".
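The random, non-overlapping selection of development and test EICs can be done with base R. The following is a minimal sketch (not a MetaClean function); it assumes a peak table with an `EICNo` column like the one constructed in the rsdFilter() example below, and `peak_table` is a placeholder name for the user's (optionally RSD-filtered) peak table:

```{r splitEICs}
# A sketch of randomly splitting EICs into non-overlapping development and test sets.
# Assumes a peak table with an EICNo column, as constructed in the rsdFilter() example below.
# set.seed(1234)                                          # make the random draw reproducible
# all_eics  <- peak_table$EICNo                           # every candidate EIC number
# dev_eics  <- sample(all_eics, 500)                      # 500 EICs for the development set
# test_eics <- sample(setdiff(all_eics, dev_eics), 500)   # 500 non-overlapping EICs for the test set
```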
The following sections provide detailed explanations of the steps required to train a classifier using user-provided data:

- (Optional) RSD Filtering*
- Set Up*
- Get EvalObj*
- Calculate Peak Quality Metrics*
- Train Potential Classifiers
- Calculate Evaluation Measures
- Compare Classifiers and Select Best Performing
- Train Final Classifier
- Save Model

NOTE: The * denotes sections that require additional example data not included in the MetaClean package. Users can either provide their own fill and xcmsEIC objects or download and install the data package `MetaCleanData` using the following code:

```{r installData}
## UNCOMMENT THIS SECTION IF YOU WISH TO USE THE MetaCleanData DATA PACKAGE

# # install devtools if not already installed
# install.packages("devtools")

# # install the data package MetaCleanData from github
# devtools::install_github("KelseyChetnik/MetaCleanData")

# # load MetaCleanData library
# library(MetaCleanData)
```

`MetaCleanData` provides example data for the development and test sets.

### (Optional) RSD Filtering

!!!IMPORTANT!!! If the user applies rsdFilter() to the development dataset, all testing datasets should also be filtered before being evaluated on the trained model.

Before generating training data, the user can optionally filter out EICs by RSD % using the rsdFilter() function. The rsdFilter() function takes the following arguments:

- **peakTable**: A peak table (like that returned by xcms::peakTable()) containing columns for EIC, mz, rt, samples, and quality control samples.
- **eicColumn**: Name of the EIC column.
- **rsdColumns**: Vector of peakTable column names that correspond to quality control samples.
- **rsdThreshold**: A value between 0 and 1 to serve as the RSD % threshold; default: 0.3.

The function returns a peak table containing those EICs for which the RSD is below the user-provided threshold.
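For reference, the RSD of an EIC is the standard deviation of its intensities across the quality control samples divided by their mean. The sketch below illustrates the idea for a single EIC using the objects constructed in the example that follows; rsdFilter() performs this calculation internally, and its exact handling of details such as missing values may differ:

```{r rsdConcept}
# Conceptual illustration of the RSD of a single EIC across the quality control columns.
# Uses objects constructed in the rsdFilter() example below; rsdFilter() does this internally.
# qc_intensities <- as.numeric(peak_table_development[1, development_rsd_names])
# sd(qc_intensities) / mean(qc_intensities)   # keep the EIC if this value is below rsdThreshold
```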
The following is an example of how to use the rsdFilter() function:

```{r rsd example}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED GITHUB PACKAGE MetaCleanData

# # load the example input data
# # example development data
# data("group_development")
# data("covar_development")
# 
# # example test data
# data("group_test")
# data("covar_test")
# 
# peak_table_development <- peakTable(group_development)
# peak_table_development <- cbind(EICNo=1:nrow(peak_table_development), peak_table_development)
# development_rsd_names <- as.character(covar_development[covar_development$SampleType=="LQC","FileNames"])
# filtered_peak_table_development <- rsdFilter(peakTable = peak_table_development,
#                                              eicColumn = "EICNo",
#                                              rsdColumns = development_rsd_names,
#                                              rsdThreshold = 0.3)
# 
# peak_table_test <- peakTable(group_test)
# peak_table_test <- cbind(EICNo=1:nrow(peak_table_test), peak_table_test)
# test_rsd_names <- as.character(covar_test[covar_test$SampleType=="LQC","FileNames"])
# filtered_peak_table_test <- rsdFilter(peakTable = peak_table_test,
#                                       eicColumn = "EICNo",
#                                       rsdColumns = test_rsd_names,
#                                       rsdThreshold = 0.3)
```

After RSD filtering, the user can proceed with randomly selecting and labeling EICs for training, following the steps outlined above.

### Set Up

Note: All example data used below was generated WITHOUT RSD filtering.

To train a new classifier using `MetaClean`, the user must provide these three data files for each dataset:

- **xcmsEIC object**: Returned by the XCMS function getEIC(). Contains data for each chromatographic peak of interest.
- **fill object**: Returned by the XCMS function fillPeaks(). An xcmsSet object with filled-in peak groups.
- **eic labels**: A .csv file with the EICs of interest in the first column and the associated labels in the second column.

NOTE: The same group object must be used to produce both the xcmsEIC and fill objects.

The following are examples of the required xcmsEIC and fill objects:

```{r load data}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED GITHUB PACKAGE MetaCleanData

# # load the example input data
# # example development data
# data("eic_labels_development")
# data("fill_development")
# data("xs_development")

# # example test data
# data("fill_test")
# data("xs_test")
```

These examples will be utilized throughout the remainder of the tutorial.

### Get EvalObj

The function `getEvalObj` extracts the relevant data from the xcmsEIC and fill objects provided by the user and stores it in an object of class `evalObj`. This function takes the following arguments:

- **xs**: An xcmsEIC object returned by the getEIC() function from the XCMS package
- **fill**: An xcmsSet object with filled-in peak groups

```{r evalObj}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData

# # call getEvalObj on development data
# eicEval_development <- getEvalObj(xs = xs_development, fill = fill_development)
# 
# # call getEvalObj on test data
# eicEval_test <- getEvalObj(xs = xs_test, fill = fill_test)
```

The `evalObj` object has three slots:

- **eicPts**: A list of 2D matrices containing the retention time and intensity values of each chromatographic peak
- **eicPeakData**: A list of vectors for each sample in the group containing characteristic information about each chromatographic peak
- **eicNos**: A numeric vector of the EIC numbers identifying each feature group
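As a quick sanity check, these slots can be inspected directly. The sketch below assumes that `evalObj` is an S4 class (so slots are accessed with `@`) and uses the development object created above:

```{r inspectEvalObj}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData

# # assuming evalObj is an S4 class, its slots can be accessed with @
# length(eicEval_development@eicNos)   # number of feature groups (EICs)
# head(eicEval_development@eicNos)     # first few EIC numbers
```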
### Calculate Peak Quality Metrics

The function `getPeakQualityMetrics` uses the `evalObj` objects to calculate each of the 11 peak quality metrics. These metrics are: Apex Max-Boundary Ratio, Elution Shift, FWHM2Base, Jaggedness, Modality, Retention-Time Consistency, Symmetry, Gaussian Similarity, Sharpness, Triangle Peak Area Similarity Ratio (TPASR), and Zig-Zag Index. See our paper for a description of each metric. This function takes the following arguments:

- **eicEvalData**: An object of class evalObj containing the required chromatographic peak information
- **eicLabels_df**: A dataframe with EICNos in the first column and Labels in the second column
- **flatness.factor**: A numeric value between 0 and 1 that allows the user to adjust the sensitivity of the function to noise. The function calculates the difference between each adjacent pair of points; any value found to be less than flatness.factor * maximum intensity is set to 0.

```{r PeakQualityMetrics}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData

# # calculate peak quality metrics for the development dataset
# # For 500 peaks and 89 samples, takes ~2.3 mins
# pqm_development <- getPeakQualityMetrics(eicEvalData = eicEval_development, eicLabels_df = eic_labels_development)

# # calculate peak quality metrics for the test dataset
# # For 500 peaks and 100 samples, takes ~2.6 mins
# pqm_test <- getPeakQualityMetrics(eicEvalData = eicEval_test)
```

The `getPeakQualityMetrics` function returns an Mx13 or Mx12 matrix, where M is the number of peaks. There are 13 or 12 columns in total (depending on whether labels are provided): one column for each of the eleven metrics, one column for the EIC numbers, and (optionally) one column for the class label. This matrix serves as the input for training the classifiers and making predictions.

### Train Potential Classifiers

`MetaClean` provides 8 classification algorithms (implemented with the R package `caret`) for building a predictive model. These are: Decision Tree, Naive Bayes, Logistic Regression, Random Forest, SVM_Linear, AdaBoost, Neural Network, and Model-Averaged Neural Network.

The `runCrossValidation` function is a wrapper function that uses cross-validation to train a user-selected subset of the 8 available algorithms. It takes the following arguments:

- **trainData**: A dataframe where the rows correspond to peaks and the columns should include peak quality metrics and class labels only.
- **k**: Number of folds to be used in cross-validation
- **repNum**: Number of cross-validation rounds to perform
- **rand.seed**: State in which to set the random number generator
- **models**: A character string or vector specifying the models to use. Specifies the classification algorithms to be trained from the eight available: DecisionTree, LogisticRegression, NaiveBayes, RandomForest, SVM_Linear, AdaBoost, NeuralNetwork, and ModelAveragedNeuralNetwork. Use c() to select multiple models. "all" specifies the use of all models. Default is "all".
- **metricSet**: The metric set(s) to be run with the selected model(s). Select from the following: M4, M7, and M11. Use c() to select multiple metric sets. "all" specifies the use of all metric sets. Default is "M11".
```{r pqmTables}
## IF YOU HAVE INSTALLED MetaCleanData, LEAVE THIS CODE COMMENTED OUT AND PROCEED WITH THE PEAK QUALITY METRIC TABLES GENERATED IN THE PREVIOUS SECTIONS

# data("pqm_development")
# data("pqm_test")
```

```{r trainClassifiers, echo=FALSE}
# # train classification algorithms
# # For 500 peaks and 89 samples, takes ~17.5 mins for M11
# models <- runCrossValidation(trainData=pqm_development,
#                              k=5,
#                              repNum=10,
#                              rand.seed = 512,
#                              models="all",
#                              metricSet = c("M4", "M7", "M11"))
```

`runCrossValidation` returns a list of lists. The outer list has one entry for every model trained. The inner list has two entries: the trained model and the name of the model trained.

### Calculate Evaluation Measures

Once the potential models have been trained, the next step is to evaluate the performance of each to determine which is the best performing and should be selected as the classifier. To do this, we first generate the seven available evaluation measures: Pass_FScore, Pass_Precision, Pass_Recall, Fail_FScore, Fail_Precision, Fail_Recall, and Accuracy, using the `getEvaluationMeasures` function. This function takes the following arguments:

- **models**: A list of trained models, like that returned by runCrossValidation()
- **k**: Number of folds used in cross-validation
- **repNum**: Number of cross-validation rounds

```{r getEvalMeasures}
# calculate all seven evaluation measures for each model and each round of cross-validation
#evalMeasuresDF <- getEvaluationMeasures(models=models, k=5, repNum=10)
```

`getEvaluationMeasures` returns a dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Each row corresponds to the results of a particular model for a particular round of cross-validation.

### Compare Classifiers and Select Best Performing

The evaluation measures data frame can be used to assess the performance of each of the algorithms. The most convenient way to compare them is with visualizations. `MetaClean` provides a simple wrapper function to easily generate bar plot visualizations of the evaluation measures for each model. This is `getBarPlots`, which plots bar plots comparing each model across each of the evaluation measures. This function takes the following arguments:

- **evalMeasuresDF**: dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Each row corresponds to the results of a particular model for a particular round of cross-validation.
- **emNames**: list of names of the evaluation measures to visualize. Accepts the following: Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Default is "All".

NOTE: When "All" is selected for emNames, the bar plots are returned in the same order as the names listed in the description.

```{r getBarPlots}
# generate bar plots for every evaluation measure
#barPlots <- getBarPlots(evalMeasuresDF, emNames="All")
#
# plot(barPlots$M11$Pass_FScore) # Pass_FScore
# plot(barPlots$M11$Fail_FScore) # Fail_FScore
# plot(barPlots$M11$Accuracy)    # Accuracy
```

`MetaClean` also provides a wrapper function to easily generate CD plots of the evaluation measures for each model. This is `getCDPlots`, which generates CD plots and corresponding rank matrices for each metric set and each user-selected evaluation measure. This function also gives the user the option to compare the best performing model from each metric set. It takes the following arguments:

- **evalMeasuresDF**: dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Each row corresponds to the results of a particular model for a particular round of cross-validation.
- **emNames**: list of names of the evaluation measures to visualize. Accepts the following: Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Default is "All".
- **compareBest**: boolean value. If TRUE, determines the best performing model from each metric set and returns InterMetric CD plots and rank matrices for each evaluation measure. Will not work if only a single metric set is present in the data frame. Default: FALSE.
- **use_abbr**: boolean value. If TRUE, use abbreviated names for the models. Default: TRUE.

NOTE: When "All" is selected for emNames, the CD plots are returned in the same order as the names listed in the description.

```{r getCDPlots}
# generate CD plots and rank matrices for the Pass_FScore, Fail_FScore, and Accuracy evaluation measures
#cdPlots <- getCDPlots(evalMeasuresDF = evalMeasuresDF, emNames = c("Pass_FScore", "Fail_FScore", "Accuracy"), compareBest = T)

#plot(cdPlots$M11$plots$Pass_FScore)
#plot(cdPlots$M11$plots$Fail_FScore)
#plot(cdPlots$M11$plots$Accuracy)

#plot(cdPlots$InterMet$plots$Pass_FScore)
#plot(cdPlots$InterMet$plots$Fail_FScore)
#plot(cdPlots$InterMet$plots$Accuracy)
```

These plots can help the user select the best performing classifier for the data. Of course, the user can also employ their own statistical tests on the evalMeasuresDF itself to determine which classifier performs best.
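Beyond the plotting wrappers, simple summaries can be computed from evalMeasuresDF directly. The following is a minimal sketch using base R (not a MetaClean function), relying only on the column names described above:

```{r summariseEvalMeasures}
# Summarise evalMeasuresDF with base R (illustrative only; not a MetaClean function).
# # mean accuracy across cross-validation rounds, per model
# aggregate(Accuracy ~ Model, data = evalMeasuresDF, FUN = mean)
# # standard deviation of accuracy across rounds, per model
# aggregate(Accuracy ~ Model, data = evalMeasuresDF, FUN = sd)
```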
### Train Final Classifier

Once the best performing model has been selected, the user can train that algorithm on all of the available training data, using the optimized hyperparameters determined during cross-validation, to create the final classifier. The hyperparameters can be found in the "pred" data frame returned with the model by `runCrossValidation`. The user must provide the hyperparameters as a data frame with the hyperparameter names as columns, as shown below.

```{r trainBest}
# example of optimized hyperparameters for the best performing model, AdaBoost M11
# hyperparameters here are nIter = 150 and method = "Adaboost.M1"

# View(models$AdaBoost_M11$pred)

# # best performing model for the example development set (rand.seed = 453, k = 5, repNum = 10) is AdaBoost
# hyperparameters <- models$AdaBoost_M11$pred[,c("nIter", "method")]
# hyperparameters <- unique(hyperparameters)
# 
# metaclean_model <- trainClassifier(trainData = pqm_development,
#                                    model = "AdaBoost",
#                                    metricSet = "M11",
#                                    hyperparameters = hyperparameters)
```

### Save Model

This classifier can then be saved to a directory specified by the user so it can be used as many times as desired.

```{r saveModel}
# uncomment the lines below and add the path where you want to save the trained model
# model_file <- ""
# saveRDS(metaclean_model, file=model_file)
```

## Part 2: Using Existing Models to Make Predictions

The user can load any previously trained model (including the trained model from our publication, available with the package) and use it to make predictions on new data.

### Load Saved Model

To load a prediction model, simply provide the path and use the base function readRDS().
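The following is a minimal sketch mirroring the Save Model chunk above (the path is left as a placeholder for wherever the model was saved):

```{r loadSavedModel}
# uncomment the lines below and add the path to a previously saved model
# model_file <- ""
# metaclean_model <- readRDS(file=model_file)
```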
As an example, users who have downloaded the data package MetaCleanData can use the pre-trained model included with that package:

```{r loadModel}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData

# # load model from MetaCleanData
# data(example_model)
```

### Make Predictions

The user can then make predictions with this model (which relies on the predict() function from `caret` under the hood) as seen below:

```{r ModelPredictions}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData

# mc_predictions <- getPredicitons(model = example_model,
#                                  testData = pqm_test,
#                                  eicColumn = "EICNo")
```
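The returned predictions can then be inspected before being used downstream. The sketch below is illustrative only: the exact structure of the object returned by getPredicitons() (for instance, the name of the predicted-class column, assumed here to be `Pred_Class`) should first be checked with str():

```{r inspectPredictions}
## UNCOMMENT THIS CODE IF YOU HAVE INSTALLED MetaCleanData

# # inspect the structure of the returned predictions
# str(mc_predictions)
# # tabulate the predicted labels, assuming a predicted-class column named "Pred_Class"
# # (hypothetical column name; check the actual names with str() above)
# table(mc_predictions$Pred_Class)
```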