Title: | Detection of Low-Quality Peaks in Untargeted Metabolomics Data |
---|---|
Description: | Utilizes 11 peak quality metrics and 8 diverse machine learning algorithms to build a classifier for the automatic assessment of peak integration quality of peaks from untargeted metabolomics analyses. The 11 peak quality metrics were adapted from those defined in the following references: Zhang, W., & Zhao, P.X. (2014) <doi:10.1186/1471-2105-15-S11-S5> Toghi Eshghi, S., Auger, P., & Mathews, W.R. (2018) <doi:10.1186/s12014-018-9209-x>. |
Authors: | Kelsey Chetnik |
Maintainer: | Kelsey Chetnik <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-10-29 05:49:20 UTC |
Source: | https://github.com/KelseyChetnik/MetaClean |
Calculates the Apex-Max Boundary Ratio of the integrated region of a chromatographic peak. The Apex-Max Boundary Ratio is found by taking the ratio of the intensity of the peak apex over the intensity of the maximum of the two boundary intensities.
calculateApexMaxBoundaryRatio(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The apex-max boundary ratio (double)
# Calculate Apex Max-Boundary Ratio for a peak
data(ex_pts)
data(ex_peakData)
apexMaxBoundary <- calculateApexMaxBoundaryRatio(peakData = ex_peakData, pts = ex_pts)
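As a rough illustration of the ratio described above, the following base-R sketch computes it for a toy peak. It is not MetaClean's implementation: the helper `apex_max_boundary_ratio`, its `rtmin`/`rtmax` arguments, and the toy matrix are hypothetical, and the sketch assumes the boundaries are the first and last points inside the integrated region.

```r
# Hypothetical helper: apex intensity divided by the larger boundary intensity.
apex_max_boundary_ratio <- function(pts, rtmin, rtmax) {
  # Keep only the points inside the integrated region [rtmin, rtmax].
  peak <- pts[pts[, 1] >= rtmin & pts[, 1] <= rtmax, , drop = FALSE]
  intensity <- peak[, 2]
  apex <- max(intensity)
  max_boundary <- max(intensity[1], intensity[length(intensity)])
  apex / max_boundary
}

# Toy peak: retention time in column 1, intensity in column 2.
toy_pts <- cbind(rt = 1:7, intensity = c(5, 10, 40, 100, 50, 20, 10))
ratio <- apex_max_boundary_ratio(toy_pts, rtmin = 2, rtmax = 7)
print(ratio)
```

A well-resolved peak should give a large ratio, because the apex sits far above both boundaries.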
Calculate the Elution Shift of each chromatographic peak in a group of samples. For each sample, the Elution Shift is found by calculating the difference between the peak apex (max intensity) of that chromatographic peak and the median peak apex of all samples and normalizing it by the peak base (which is equal to the average difference between the two peak boundaries). The Elution Shift of the Peak Group is equal to the mean of the Elution Shift of each chromatographic peak.
calculateElutionShift(peakDataList, ptsList)
peakDataList |
A list of vectors containing characteristic information about a chromatographic peak - including the retention time range |
ptsList |
A list of 2D matrices containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The Elution Shift of a Peak Group (double)
# Calculate Elution Shift for each peak
data(ex_ptsList)
data(ex_peakDataList)
elutionShift <- calculateElutionShift(peakDataList = ex_peakDataList, ptsList = ex_ptsList)
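The calculation described above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package code: `elution_shift_sketch` and the toy matrices are hypothetical, the apex is taken as the retention time of the most intense point, the boundaries are the first and last rows of each matrix, and the absolute difference is used.

```r
# Hypothetical helper: mean normalized apex shift across samples.
elution_shift_sketch <- function(pts_list) {
  # Retention time at the most intense point of each sample's peak.
  apex_rt <- vapply(pts_list, function(p) p[which.max(p[, 2]), 1], numeric(1))
  # Peak base: distance between the two boundaries of each peak.
  base_width <- vapply(pts_list, function(p) p[nrow(p), 1] - p[1, 1], numeric(1))
  # Shift from the median apex, normalized by the average peak base.
  shift <- abs(apex_rt - median(apex_rt)) / mean(base_width)
  mean(shift)
}

# Three toy samples; the third elutes two time units late.
s1 <- cbind(c(10, 11, 12, 13, 14), c(1, 5, 9, 5, 1))
s2 <- cbind(c(10, 11, 12, 13, 14), c(1, 4, 8, 4, 1))
s3 <- cbind(c(11, 12, 13, 14, 15), c(1, 3, 5, 9, 2))
es <- elution_shift_sketch(list(s1, s2, s3))
print(es)
```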
Calculate evaluation measures using the predictions generated during cross-validation.
calculateEvaluationMeasures(pred, true)
pred |
factor. A vector of factors that represent predicted classes |
true |
factor. A vector of factors that represent the true classes |
A dataframe with the following columns: Model, CVNum, RepNum, Accuracy, PassFScore, PassRecall, PassPrecision, FailFScore, FailRecall, FailPrecision
# Calculate Evaluation Measures for test data
test_evalMeasures <- calculateEvaluationMeasures(pred = test_predictions_class, true = pqMetrics_test$Class)
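For readers unfamiliar with the measures listed above, this base-R sketch shows how accuracy and the per-class F-score, recall, and precision are conventionally derived from predicted and true labels. The helper `class_measures` and the toy factors are hypothetical and are not part of MetaClean.

```r
# Hypothetical helper: precision, recall and F-score for one class.
class_measures <- function(pred, true, positive) {
  tp <- sum(pred == positive & true == positive)  # true positives
  fp <- sum(pred == positive & true != positive)  # false positives
  fn <- sum(pred != positive & true == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(FScore = 2 * precision * recall / (precision + recall),
    Recall = recall,
    Precision = precision)
}

# Toy labels for six peaks.
true <- factor(c("Pass", "Pass", "Pass", "Fail", "Fail", "Fail"), levels = c("Pass", "Fail"))
pred <- factor(c("Pass", "Pass", "Fail", "Fail", "Fail", "Pass"), levels = c("Pass", "Fail"))

accuracy <- mean(pred == true)
pass <- class_measures(pred, true, "Pass")
fail <- class_measures(pred, true, "Fail")
print(accuracy)
print(rbind(Pass = pass, Fail = fail))
```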
Calculates the FWHM2Base of the integrated region of a chromatographic peak. The FWHM2Base is found by determining the peak width at half of the maximum intensity and normalizing this value by the width of the base of the peak.
calculateFWHM(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The FWHM2Base value (double)
# Calculate FWHM2Base for a peak
data(ex_pts)
data(ex_peakData)
fwhm <- calculateFWHM(peakData = ex_peakData, pts = ex_pts)
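The idea behind FWHM2Base can be illustrated with a short base-R sketch. It is a simplification and not the package's code: `fwhm_to_base` is a hypothetical helper, and the half-maximum crossing points are taken from the sampled points without interpolation.

```r
# Hypothetical helper: width at half maximum divided by the base width.
fwhm_to_base <- function(pts) {
  rt <- pts[, 1]
  intensity <- pts[, 2]
  # Retention times of points at or above half of the maximum intensity.
  above <- rt[intensity >= max(intensity) / 2]
  (max(above) - min(above)) / (max(rt) - min(rt))
}

toy_pts <- cbind(1:9, c(0, 10, 40, 80, 100, 80, 40, 10, 0))
f2b <- fwhm_to_base(toy_pts)
print(f2b)
```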
Calculates the Gaussian Similarity of the integrated region of a chromatographic peak. The Gaussian Similarity is found by calculating the dot product of the standard normalized intensity values of a chromatographic peak and the standard normalized intensity values of a Gaussian curve fitted to the intensities of the original curve.
calculateGaussianSimilarity(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from Zhang et al. For details, see Zhang, W., & Zhao, P. X. (2014). Quality evaluation of extracted ion chromatograms and chromatographic peaks in liquid chromatography/mass spectrometry-based metabolomics data. BMC Bioinformatics, 15(Suppl 11), S5. https://doi.org/10.1186/1471-2105-15-S11-S5
The Gaussian Similarity value (double)
# Calculate Gaussian Similarity for a peak
data(ex_pts)
data(ex_peakData)
gaussianSimilarity <- calculateGaussianSimilarity(peakData = ex_peakData, pts = ex_pts)
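The sketch below illustrates the dot-product comparison in base R. It makes two assumptions that the description does not confirm: the Gaussian is estimated from intensity-weighted moments rather than by a least-squares fit, and "standard normalized" is read as centering each vector and scaling it to unit length. The helper `gaussian_similarity_sketch` is hypothetical.

```r
# Hypothetical helper: similarity between a peak and a moment-based Gaussian.
gaussian_similarity_sketch <- function(pts) {
  rt <- pts[, 1]
  intensity <- pts[, 2]
  # Intensity-weighted mean and standard deviation of the retention time.
  w <- intensity / sum(intensity)
  mu <- sum(w * rt)
  sigma <- sqrt(sum(w * (rt - mu)^2))
  fitted <- max(intensity) * exp(-(rt - mu)^2 / (2 * sigma^2))
  # Center each vector and scale it to unit length before the dot product.
  unit <- function(x) {
    x <- x - mean(x)
    x / sqrt(sum(x^2))
  }
  sum(unit(intensity) * unit(fitted))
}

# A toy peak sampled from an exact Gaussian should score close to 1.
rt <- seq(-3, 3, by = 0.25)
gs <- gaussian_similarity_sketch(cbind(rt, 100 * exp(-rt^2 / 2)))
print(gs)
```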
Calculates the Jaggedness of the integrated region of a chromatographic peak. The Jaggedness is found by determining the fraction of time points at which the intensity of the peak changes direction, excluding the peak apex and any intensity changes below a flatness factor.
calculateJaggedness(peakData, pts, flatness.factor = 0.05)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
flatness.factor |
A numeric value between 0 and 1 that allows the user to adjust the sensitivity of the function to noise. This function calculates the difference between each adjacent pair of points; any value found to be less than flatness.factor * maximum intensity is set to 0. |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The jaggedness of a chromatographic peak (double)
# Calculate Jaggedness for a peak
data(ex_pts)
data(ex_peakData)
jaggedness <- calculateJaggedness(peakData = ex_peakData, pts = ex_pts)
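To make the role of `flatness.factor` concrete, here is a base-R sketch of one way to count direction changes. It follows the description above but is not taken from the package source; `jaggedness_sketch` is a hypothetical helper, and subtracting one change for the apex is an assumption about how the apex is excluded.

```r
# Hypothetical helper: fraction of interior points where the trend reverses.
jaggedness_sketch <- function(intensity, flatness.factor = 0.05) {
  d <- diff(intensity)
  # Treat changes smaller than flatness.factor * max intensity as flat.
  d[abs(d) < flatness.factor * max(intensity)] <- 0
  # A sign flip from rising to falling (or back) shifts the sign by 2.
  direction_changes <- sum(abs(diff(sign(d))) > 1)
  # The apex accounts for one expected change, so it is subtracted.
  max(0, (direction_changes - 1) / (length(d) - 1))
}

smooth_peak <- c(0, 20, 60, 100, 60, 20, 0)
jagged_peak <- c(0, 40, 20, 100, 30, 60, 0)
j_smooth <- jaggedness_sketch(smooth_peak)
j_jagged <- jaggedness_sketch(jagged_peak)
print(c(smooth = j_smooth, jagged = j_jagged))
```

Raising `flatness.factor` zeroes out more of the small differences, so noisy but essentially flat stretches contribute fewer direction changes.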
Calculates the Modality of the integrated region of a chromatographic peak. The Modality is found by taking the ratio of the magnitude of the largest drop in intensity (excluding the apex) and the maximum intensity of the peak.
calculateModality(peakData, pts, flatness.factor = 0.05)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
flatness.factor |
A numeric value between 0 and 1 that allows the user to adjust the sensitivity of the function to noise. This function calculates the difference between each adjacent pair of points; any value found to be less than flatness.factor * maximum intensity is set to 0. |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The modality of the peak (double)
# Calculate Modality for a peak
data(ex_pts)
data(ex_peakData)
modality <- calculateModality(peakData = ex_peakData, pts = ex_pts)
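One reading of the Modality description is sketched below in base R. The interpretation is an assumption: a drop only counts when it is followed by a later rise, so the final descent from the apex of a unimodal peak is ignored. The helper `modality_sketch` is hypothetical and not the package's implementation.

```r
# Hypothetical helper: largest dip before the last rise, relative to the maximum.
modality_sketch <- function(intensity, flatness.factor = 0.05) {
  d <- diff(intensity)
  # Treat changes smaller than flatness.factor * max intensity as flat.
  d[abs(d) < flatness.factor * max(intensity)] <- 0
  first_fall <- which(d < 0)[1]
  last_rise <- rev(which(d > 0))[1]
  # A unimodal peak rises and then falls: no rise ever follows a fall.
  if (is.na(first_fall) || is.na(last_rise) || first_fall > last_rise) {
    return(0)
  }
  # Largest drop between the first fall and the last rise.
  max(-d[first_fall:last_rise]) / max(intensity)
}

unimodal <- c(0, 20, 60, 100, 60, 20, 0)
bimodal <- c(0, 50, 100, 40, 80, 30, 0)
m_uni <- modality_sketch(unimodal)
m_bi <- modality_sketch(bimodal)
print(c(unimodal = m_uni, bimodal = m_bi))
```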
Calculates the Retention Time Consistency of each chromatographic peak in a group of samples. For each sample, the Retention Time Consistency is found by calculating the difference between the time at the center of the sample peak and the mean time of all peak centers normalized by the mean time of all the peak centers.
calculateRetentionTimeConsistency(peakDataList, ptsList)
peakDataList |
A list of vectors containing characteristic information about a chromatographic peak - including the retention time range |
ptsList |
A list of 2D matrices containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The Retention Time Consistency of a Peak Group (double)
# Calculate Retention Time Consistency for each peak
data(ex_ptsList)
data(ex_peakDataList)
rtc <- calculateRetentionTimeConsistency(peakDataList = ex_peakDataList, ptsList = ex_ptsList)
Calculates the Sharpness of the integrated region of a chromatographic peak. The Sharpness is found by determining the sum of the differences between the intensities of each adjacent pair of points on the peak, normalized by the intensity of the peak boundaries.
calculateSharpness(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from Zhang et al. For details, see Zhang, W., & Zhao, P. X. (2014). Quality evaluation of extracted ion chromatograms and chromatographic peaks in liquid chromatography/mass spectrometry-based metabolomics data. BMC Bioinformatics, 15(Suppl 11), S5. https://doi.org/10.1186/1471-2105-15-S11-S5
The Sharpness value (double)
# Calculate Sharpness for a peak
data(ex_pts)
data(ex_peakData)
sharpness <- calculateSharpness(peakData = ex_peakData, pts = ex_pts)
Calculates the Symmetry of the integrated region of a chromatographic peak. The Symmetry is found by calculating the correlation between the left and right halves of the peak.
calculateSymmetry(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from TargetedMSQC. Toghi Eshghi, S., Auger, P., & Mathews, W. R. (2018). Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clinical Proteomics, 15. https://doi.org/10.1186/s12014-018-9209-x
The Symmetry of the peak (double)
# Calculate Symmetry for a peak
data(ex_pts)
data(ex_peakData)
symmetry <- calculateSymmetry(peakData = ex_peakData, pts = ex_pts)
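A minimal base-R sketch of the left/right correlation is shown below. It assumes the peak is split at the midpoint of the sampled points (not at the apex) and that the right half is mirrored before correlating; `symmetry_sketch` is a hypothetical helper, not the package function.

```r
# Hypothetical helper: correlation of the left half with the mirrored right half.
symmetry_sketch <- function(intensity) {
  half <- floor(length(intensity) / 2)
  left <- intensity[seq_len(half)]
  # Reverse the vector so the right half runs from the boundary inward.
  right <- rev(intensity)[seq_len(half)]
  cor(left, right)
}

sym <- symmetry_sketch(c(0, 20, 60, 100, 60, 20, 0))
asym <- symmetry_sketch(c(0, 10, 100, 90, 60, 30, 5))
print(c(symmetric = sym, asymmetric = asym))
```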
Calculates the Triangle Peak Area Similarity Ratio (TPASR) of the integrated region of a chromatographic peak. The TPASR is found by calculating the ratio of the difference between the area of a triangle formed by the apex and the two peak boundaries and the integrated area of the peak over the area of the triangle.
calculateTPASR(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from Zhang et al. For details, see Zhang, W., & Zhao, P. X. (2014). Quality evaluation of extracted ion chromatograms and chromatographic peaks in liquid chromatography/mass spectrometry-based metabolomics data. BMC Bioinformatics, 15(Suppl 11), S5. https://doi.org/10.1186/1471-2105-15-S11-S5
The TPASR value (double)
# Calculate TPASR for a peak
data(ex_pts)
data(ex_peakData)
tpasr <- calculateTPASR(peakData = ex_peakData, pts = ex_pts)
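The TPASR idea can be illustrated in base R as follows. The sketch assumes a zero baseline (boundary intensities at or near zero), integrates the peak with the trapezoidal rule, and takes the absolute value of the area difference. `tpasr_sketch` is a hypothetical helper, not the package's implementation.

```r
# Hypothetical helper: relative difference between triangle and peak areas.
tpasr_sketch <- function(pts) {
  rt <- pts[, 1]
  intensity <- pts[, 2]
  n <- length(intensity)
  # Trapezoidal integration of the peak area.
  peak_area <- sum(diff(rt) * (intensity[-n] + intensity[-1]) / 2)
  # Triangle spanning the peak base with the apex as its height.
  triangle_area <- 0.5 * (max(rt) - min(rt)) * max(intensity)
  abs(triangle_area - peak_area) / triangle_area
}

tpasr_triangular <- tpasr_sketch(cbind(0:4, c(0, 50, 100, 50, 0)))
tpasr_rounded <- tpasr_sketch(cbind(0:4, c(0, 80, 100, 80, 0)))
print(c(triangular = tpasr_triangular, rounded = tpasr_rounded))
```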
Calculates the Zig-Zag Index of the integrated region of a chromatographic peak. The Zig-Zag Index is found by calculating the sum of the slope changes between neighboring points normalized by the average intensity of the peak boundaries.
calculateZigZagIndex(peakData, pts)
peakData |
A vector containing characteristic information about a chromatographic peak - including the retention time range |
pts |
A 2D matrix containing the retention time and intensity values of a chromatographic peak |
This function was repurposed from Zhang et al. For details, see Zhang, W., & Zhao, P. X. (2014). Quality evaluation of extracted ion chromatograms and chromatographic peaks in liquid chromatography/mass spectrometry-based metabolomics data. BMC Bioinformatics, 15(Suppl 11), S5. https://doi.org/10.1186/1471-2105-15-S11-S5
The Zig-Zag Index value (double)
# Calculate ZigZag Index for a peak
data(ex_pts)
data(ex_peakData)
zigZagIndex <- calculateZigZagIndex(peakData = ex_peakData, pts = ex_pts)
A custom class for storing the chromatographic peak data required by the peak metric functions for each group of samples.
eicPts
A list of 2D matrices containing the retention time and intensity values of each chromatographic peak
eicPeakData
A list of vectors for each sample in the group containing characteristic information about each chromatographic peak
eicNos
A numeric vector of the EIC numbers identifying each feature group
An example of the input for the peakData argument for calculate... functions. It represents data from one sample for the peak of interest.
ex_peakData
A list containing the following entries: mz, mzmin, mzmax, rt, rtmin, rtmax, into, intb, maxo, sn, sample, and is_filled.
An example of the input for the peakDataList argument for calculateElutionShift and calculateRetentionTimeConsistency. Each entry in the list represents data from one sample for the peak of interest.
ex_peakDataList
A list of lists. Each nested list contains the following entries: mz, mzmin, mzmax, rt, rtmin, rtmax, into, intb, maxo, sn, sample, and is_filled.
An example of the input for the pts argument for calculate... functions. It represents rt and intensity data from one sample for the peak of interest.
ex_pts
A two-column matrix where the first column represents rt and the second column represents intensity.
An example of the input for the ptsList argument for calculateElutionShift and calculateRetentionTimeConsistency. Each entry in the list is a two-column matrix consisting of rt and intensity for a sample for the peak of interest.
ex_ptsList
A list of two-column matrices (one matrix per sample) where the first column represents rt and the second column represents intensity.
Wrapper function for generating bar plots for each classifier for each of the seven evaluation measures.
getBarPlots(evalMeasuresDF, emNames = "All")
evalMeasuresDF |
A dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. The rows of the dataframe will correspond to the results of a particular model and a particular round of cross-validation. |
emNames |
A list of names of the evaluation measures to visualize. Accepts the following: Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Default is "All". |
A list of up to seven bar plots (one for each evaluation measure).
# Create a list of bar plots for each evaluation measure
getBarPlots(evalMeasuresDF = test_evalMeasures)
Wrapper function for generating critical difference (CD) plots for each classifier for each of the seven evaluation measures. The code for the CD plots was adapted from the now-archived scmamp R package.
getCDPlots(evalMeasuresDF, emNames = "All", compareBest = F, use_abbr = T)
evalMeasuresDF |
A dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. The rows of the dataframe will correspond to the results of a particular model and a particular round of cross-validation. |
emNames |
A list of names of the evaluation measures to visualize. Accepts the following: Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, and Accuracy. Default is "All". |
compareBest |
Boolean. If T, compare the best performing models from each of the metric sets. Else, compare the models within each metric set. Must have at least two metric sets. Default: F. |
use_abbr |
Boolean. If T, use abbreviations for model names in the CD plot (e.g. DecisionTree = DT). Default: T. |
A named list with the following structure: metric_type$plots | rankmatrix$eval_measures, where metric_type is one of the three metric sets (M4, M7, or M11) and eval_measures
# Create a list of CD plots for the selected evaluation measures
getCDPlots(evalMeasuresDF = test_evalMeasures, emNames = c("Pass_FScore", "Fail_FScore"))
This function extracts, formats, and combines the chromatographic peak data from the objects returned by the getEIC() and fillPeaks() functions from the XCMS package.
getEvalObj(xs, fill)
xs |
An xcmsEIC object returned by the getEIC() function from the XCMS package |
fill |
An xcmsSet object with filled in peak groups |
An object of class evalObj
# call getEvalObj on test data
eicEval_test <- getEvalObj(xs = xs_test, fill = fill_test)
Calculate evaluation measures using the predictions generated during cross-validation.
getEvaluationMeasures(models, k, repNum)
models |
list. A list of trained models, like that returned by trainClassifiers() |
k |
integer. Number of folds used in cross-validation |
repNum |
integer. Number of cross-validation rounds |
A dataframe with the following columns: Model, RepNum, Pass_FScore, Pass_Recall, Pass_Precision, Fail_FScore, Fail_Recall, Fail_Precision, Accuracy
# calculate all seven evaluation measures for each model and each round of cross-validation
evalMeasuresDF <- getEvaluationMeasures(models = models, k = 5, repNum = 10)
Wrapper function for calculating each of the 11 peak quality metrics for each feature.
getPeakQualityMetrics(eicEvalData, eicLabels_df, flatness.factor = 0.05)
eicEvalData |
An object of class evalObj containing the required chromatographic peak information |
eicLabels_df |
A dataframe with EICNos in the first column and Labels in the second column |
flatness.factor |
A numeric value between 0 and 1 that allows the user to adjust the sensitivity of the function to noise. This function calculates the difference between each adjacent pair of points; any value found to be less than flatness.factor * maximum intensity is set to 0. |
An Mx13 matrix where M is equal to the number of peaks. There are 13 columns in total, including one column for each of the eleven metrics, one column for EIC numbers, and one column for the class label.
# calculate peak quality metrics for development dataset
pqMetrics_development <- getPeakQualityMetrics(eicEvalData = eicEval_development, eicLabels_df = eicLabels_development)
Wrapper function for retrieving predictions from a trained MetaClean classifier and a test dataset. Returns a data frame with class predictions as well as the associated probabilities for each class prediction.
getPredicitons(model, testData, eicColumn)
model |
The trained MetaClean model object. |
testData |
dataframe. Rows should correspond to peaks, columns should include peak quality metrics and EIC column only. |
eicColumn |
name of the EIC column |
a dataframe with four columns: EIC, Pred_Class, Pred_Prob_Pass, Pred_Prob_Fail
# get predictions for the test data
best_model <- getPredictions(model = mc_model, testData = pqm_test, eicColumn = "EICNo")
Data frame with peak quality metrics and labels for all 500 EICs in the example development dataset.
pqm_development
A data frame with 13 variables (EIC Number, the 11 peak quality metrics, and Class Labels): EICNo, ApexBoundaryRatio_mean, ElutionShift_mean, FWHM2Base_mean, Jaggedness_mean, Modelaity_mean, RetentionTimeCorrelation_mean, Symmetry_mean, GaussianSimilarity_mean, Sharpness_mean, TPASR_mean, ZigZag_mean, and Class.
Data frame with peak quality metrics and labels for all 500 EICs in the example test dataset.
pqm_test
A data frame with 13 variables (EIC Number, the 11 peak quality metrics, and Class Labels): EICNo, ApexBoundaryRatio_mean, ElutionShift_mean, FWHM2Base_mean, Jaggedness_mean, Modelaity_mean, RetentionTimeCorrelation_mean, Symmetry_mean, GaussianSimilarity_mean, Sharpness_mean, TPASR_mean, ZigZag_mean, and Class.
Filters out EICs with RSD
rsdFilter(peakTable, eicColumn, rsdColumns, rsdThreshold = 0.3)
peakTable |
peak table generated by xcms group object |
eicColumn |
name of the EIC column |
rsdColumns |
names of the sample columns to be used to calculate RSD |
rsdThreshold |
RSD percent threshold for filtering; default 0.3 |
peakTable with filtered EICs
rsd_filtered_table <- rsdFilter(peakTable = group_table, eicColumn = eicColumn, rsdColumns = rsdColumns)
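For orientation, the relative standard deviation (RSD) of a row is its standard deviation divided by its mean. The base-R sketch below filters a toy peak table on that quantity. It assumes that rows whose RSD exceeds the threshold are the ones dropped, which the description above does not spell out; the table, column names, and variable names are hypothetical.

```r
# Toy peak table: one row per EIC, one intensity column per sample.
peak_table <- data.frame(
  EICNo = 1:3,
  S1 = c(100, 100, 50),
  S2 = c(110, 300, 52),
  S3 = c(90, 20, 48)
)
sample_cols <- c("S1", "S2", "S3")
rsd_threshold <- 0.3

# RSD per row: standard deviation divided by mean across the sample columns.
rsd <- apply(peak_table[, sample_cols], 1, function(x) sd(x) / mean(x))

# Assumption: keep rows at or below the threshold, drop the rest.
kept <- peak_table[rsd <= rsd_threshold, ]
print(kept)
```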
Wrapper function for running cross-validation on up to 8 classification algorithms using one or more of the three available metric sets.
runCrossValidation( trainData, k, repNum, rand.seed = NULL, models = "all", metricSet = "M11" )
trainData |
dataframe. Rows should correspond to peaks, columns should include peak quality metrics and class labels only. |
k |
integer. Number of folds to be used in cross-validation |
repNum |
integer. Number of cross-validation rounds to perform |
rand.seed |
integer. State in which to set the random number generator |
models |
character string or vector. Specifies the classification algorithms to be trained from the eight available: DecisionTree, LogisiticRegression, NaiveBayes, RandomForest, SVM_Linear, AdaBoost, NeuralNetwork, and ModelAveragedNeuralNetwork. "all" specifies the use of all models. Default is "all". |
metricSet |
The metric set(s) to be run with the selected model(s). Select from the following: M4, M7, and M11. Use c() to select multiple metric sets. "all" specifies the use of all metric sets. Default is "M11". |
a list of up to 8 trained models
# train classification algorithms
models <- trainClassifiers(trainData = pqMetrics_development, k = 5, repNum = 10, rand.seed = 453, models = "DecisionTree")
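The roles of `k`, `repNum`, and `rand.seed` can be illustrated with a base-R sketch of repeated k-fold assignment. This only shows how peaks might be split into folds; it is not how MetaClean partitions data internally, and `make_folds` is a hypothetical helper.

```r
# Hypothetical helper: one shuffled fold assignment per cross-validation round.
make_folds <- function(n, k, repNum, rand.seed = NULL) {
  # Fixing the seed makes the fold assignments reproducible.
  if (!is.null(rand.seed)) set.seed(rand.seed)
  lapply(seq_len(repNum), function(r) {
    # Balanced fold labels 1..k, shuffled across the n rows.
    sample(rep(seq_len(k), length.out = n))
  })
}

folds <- make_folds(n = 20, k = 5, repNum = 3, rand.seed = 453)

# In round 1, rows labelled 1 form the held-out fold and the rest are used for training.
held_out <- which(folds[[1]] == 1)
print(table(folds[[1]]))
```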
For repeated cross-validation, find the mean and standard error of N rounds for each model.
summaryStats(i, evalMeasuresDF, emNames, modelNames)
i |
An integer representing 1:N where N is the total number of cross-validation rounds. |
evalMeasuresDF |
A dataframe with the following columns: Model, RepNum, PosClass.FScore, PosClass.Recall, PosClass.Precision, NegClass.FScore, NegClass.Recall, NegClass.Precision, and Accuracy. The rows of the dataframe will correspond to the results of a particular model and a particular round of cross-validation. |
emNames |
A list of names of the evaluation measures to visualize. Accepts the following: PosClass.FScore, PosClass.Recall, PosClass.Precision, NegClass.FScore, NegClass.Recall, NegClass.Precision, and Accuracy. Default is "All". |
modelNames |
A list of the models trained. |
A dataframe with the following columns: Model, evalMeasure, Mean, and SE (Standard Error).
summaryStatsList <- lapply(1:numModels, summaryStats, evalMeasuresDF=evalMeasuresDF, emNames=emNames, modelNames=modelNames)
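The per-model mean and standard error can be computed in base R along the following lines. The toy data frame and the helper `std_error` are hypothetical, and the sketch covers a single evaluation measure (Accuracy) rather than the full set handled by the package.

```r
# Toy results: three cross-validation rounds for each of two models.
eval_df <- data.frame(
  Model = rep(c("DecisionTree", "RandomForest"), each = 3),
  RepNum = rep(1:3, times = 2),
  Accuracy = c(0.80, 0.82, 0.84, 0.90, 0.90, 0.93)
)

# Standard error of the mean: standard deviation over sqrt(number of rounds).
std_error <- function(x) sd(x) / sqrt(length(x))

# tapply groups by model in alphabetical order, matching sort(unique(...)).
stats <- data.frame(
  Model = sort(unique(eval_df$Model)),
  evalMeasure = "Accuracy",
  Mean = as.numeric(tapply(eval_df$Accuracy, eval_df$Model, mean)),
  SE = as.numeric(tapply(eval_df$Accuracy, eval_df$Model, std_error))
)
print(stats)
```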
Wrapper function for training one of the 8 classification algorithms using one of the three available metric sets.
trainClassifier(trainData, model, metricSet, hyperparameters)
trainData |
dataframe. Rows should correspond to peaks, columns should include peak quality metrics and class labels only. |
model |
Name of the classification algorithm to be trained from the eight available: DecisionTree, LogisiticRegression, NaiveBayes, RandomForest, SVM_Linear, AdaBoost, NeuralNetwork, and ModelAveragedNeuralNetwork. |
metricSet |
The metric set to be run with the selected model. Select from the following: M4, M7, and M11. |
hyperparameters |
dataframe of the tuned hyperparameters returned by runCrossValidation() |
a trained MetaClean model
# train classification algorithms
best_model <- trainClassifier(trainData = pqMetrics_development, model = "AdaBoost", metricSet = "M11", hyperparameters)