ML.Evaluation

class ddf_library.functions.ml.evaluation.BinaryClassificationMetrics(label_col, pred_col, ddf_var, true_label=1)

Bases: ddf_library.bases.ddf_base.DDFSketch

Evaluator for binary classification.

  • True Positive (TP) - label is positive and prediction is also positive
  • True Negative (TN) - label is negative and prediction is also negative
  • False Positive (FP) - label is negative but prediction is positive
  • False Negative (FN) - label is positive but prediction is negative
Metrics:
  • Accuracy
  • Precision (Positive Predictive Value) = tp / (tp + fp)
  • Recall (True Positive Rate) = tp / (tp + fn)
  • F-measure = F1 = 2 * (precision * recall) / (precision + recall)
  • Confusion matrix
Example:
>>> bin_metrics = BinaryClassificationMetrics(label_col='label',
>>>                                           pred_col='pred', ddf_var=ddf1)
>>> print(bin_metrics.get_metrics())
>>> # or using:
>>> print(bin_metrics.confusion_matrix)
>>> print(bin_metrics.accuracy)
>>> print(bin_metrics.recall)
>>> print(bin_metrics.precision)
>>> print(bin_metrics.f1)
Parameters:
  • label_col – Column name of true label values;
  • pred_col – Column name of predicted label values;
  • ddf_var – DDF;
  • true_label – Value of the positive (true) label (default is 1).
get_metrics()
Returns: A pandas DataFrame with metrics.
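The counts and formulas above can be sketched in plain Python. This is a minimal, self-contained illustration of how the metrics follow from the four confusion-matrix cells — not the library's implementation, which operates on a distributed DDF rather than in-memory lists:

```python
def binary_metrics(labels, preds, true_label=1):
    """Compute binary classification metrics from paired label/prediction
    sequences (illustrative sketch only)."""
    tp = tn = fp = fn = 0
    for y, p in zip(labels, preds):
        if y == true_label:
            tp += (p == true_label)
            fn += (p != true_label)
        else:
            fp += (p == true_label)
            tn += (p != true_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': (tp + tn) / len(labels),
            'precision': precision, 'recall': recall, 'f1': f1,
            # rows = actual (positive, negative), cols = predicted
            'confusion_matrix': [[tp, fn], [fp, tn]]}
```

For example, with labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 0, 1]`, the sketch yields TP=2, FN=1, FP=1, TN=1, so accuracy is 0.6 and precision and recall are both 2/3.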
class ddf_library.functions.ml.evaluation.MultilabelMetrics(label_col, pred_col, ddf_var)

Bases: ddf_library.bases.ddf_base.DDFSketch

Evaluator for multilabel classification.

  • True Positive (TP) - label is positive and prediction is also positive
  • True Negative (TN) - label is negative and prediction is also negative
  • False Positive (FP) - label is negative but prediction is positive
  • False Negative (FN) - label is positive but prediction is negative
Metrics:
  • Accuracy
  • Precision (Positive Predictive Value) = tp / (tp + fp)
  • Recall (True Positive Rate) = tp / (tp + fn)
  • F-measure = F1 = 2 * (precision * recall) / (precision + recall)
  • Confusion matrix
  • Precision_recall table
Example:
>>> metrics_multi = MultilabelMetrics(label_col='label',
>>>                                   pred_col='prediction', ddf_var=ddf1)
>>> print(metrics_multi.get_metrics())
>>> # or using:
>>> print(metrics_multi.confusion_matrix)
>>> print(metrics_multi.precision_recall)
>>> print(metrics_multi.accuracy)
>>> print(metrics_multi.recall)
>>> print(metrics_multi.precision)
>>> print(metrics_multi.f1)
Parameters:
  • label_col – Column name of true label values;
  • pred_col – Column name of predicted label values;
  • ddf_var – DDF.
get_metrics()
Returns: A pandas DataFrame with metrics.
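The per-class precision_recall table can be sketched by treating each class as the positive label in turn (one-vs-rest) and applying the binary formulas above. This is an illustrative sketch over in-memory lists, not the library's distributed implementation:

```python
def precision_recall_table(labels, preds):
    """Per-class precision and recall via one-vs-rest counting
    (illustrative sketch of the precision_recall table)."""
    table = {}
    for cls in sorted(set(labels) | set(preds)):
        # Count this class's confusion cells against all other classes.
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        table[cls] = {'precision': tp / (tp + fp) if tp + fp else 0.0,
                      'recall': tp / (tp + fn) if tp + fn else 0.0}
    return table
```

For labels `['a', 'a', 'b', 'b', 'c']` and predictions `['a', 'b', 'b', 'b', 'c']`, class 'a' gets precision 1.0 but recall 0.5 (one of its two instances was predicted as 'b'), while class 'b' gets precision 2/3 and recall 1.0.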
class ddf_library.functions.ml.evaluation.RegressionMetrics(col_features, label_col, pred_col, data)

Bases: ddf_library.bases.ddf_base.DDFSketch

Evaluator for regression models.

  • Mean Squared Error (MSE): Measures the average of the squares of the errors or deviations, that is, the differences between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. In other words, MSE tells you how close a regression line is to a set of points.
  • Root Mean Squared Error (RMSE): A frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed. The RMSE represents the sample standard deviation of the differences between predicted values and observed values.
  • Mean Absolute Error (MAE): A measure of the difference between two continuous variables. Assume X and Y are variables of paired observations that express the same phenomenon. MAE is used to measure how close forecasts or predictions are to the eventual outcomes.
  • Coefficient of Determination (R2): The proportion of the variance in the dependent variable that is predictable from the independent variable(s). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
  • Explained Variance: Measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set.
Example:
>>> reg_metrics = RegressionMetrics(col_features=['x1', 'x2'],
>>>                                 label_col='label', pred_col='pred',
>>>                                 data=data)
>>> print(reg_metrics.get_metrics())
>>> # or using:
>>> print(reg_metrics.r2)
>>> print(reg_metrics.mse)
>>> print(reg_metrics.rmse)
>>> print(reg_metrics.mae)
>>> print(reg_metrics.msr)
Parameters:
  • col_features – Column name of features values;
  • label_col – Column name of true label values;
  • pred_col – Column name of predicted label values;
  • data – DDF.
get_metrics()
Returns: A pandas DataFrame with metrics.
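The regression metrics defined above can be sketched in plain Python. This is an illustrative, in-memory implementation of the standard formulas, not the library's distributed one:

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, R2 and explained variance from paired
    true/predicted values (illustrative sketch only)."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)           # sum of squared errors
    mse = sse / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)  # total variance * n
    r2 = 1.0 - sse / ss_tot if ss_tot else 0.0
    # Explained variance uses the variance of the residuals instead of
    # their raw squares, so it ignores any constant bias in the errors.
    mean_e = sum(errors) / n
    var_e = sum((e - mean_e) ** 2 for e in errors) / n
    explained_variance = 1.0 - var_e / (ss_tot / n) if ss_tot else 0.0
    return {'mse': mse, 'rmse': math.sqrt(mse), 'mae': mae,
            'r2': r2, 'explained_variance': explained_variance}
```

For y_true `[1, 2, 3]` and y_pred `[2, 2, 2]` (a constant model predicting the mean), the residuals are [-1, 0, 1], so MSE = MAE = 2/3 and R^2 = 0.0, matching the constant-model baseline described above.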