ML.Classification
-
class
ddf_library.functions.ml.classification.
GaussianNB
¶ Bases:
ddf_library.bases.ddf_model.ModelDDF
The Naive Bayes algorithm is an intuitive method that uses the probability of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would arrive at if you set out to model a predictive problem probabilistically.
Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we obtain the probability of a data instance belonging to that class.
To make a prediction, we calculate the probability of the instance belonging to each class and select the class value with the highest probability.
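The calculation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the ddf_library implementation: the helper names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, each attribute is assumed to follow a per-class normal distribution, and log-probabilities are summed instead of multiplying raw probabilities to avoid underflow.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # A small constant keeps the variance strictly positive.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior score."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        # Summing logs is equivalent to multiplying the conditional
        # probabilities of each attribute for this class.
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (hypothetical): two well-separated classes.
rows = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [5.0, 6.1], [5.2, 5.9], [4.9, 6.0]]
labels = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(rows, labels)
nb_prediction = predict_gaussian_nb(nb_model, [1.1, 2.0])
```

Here `nb_prediction` is the label of the class whose fitted distributions best explain the query row.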
Example:
>>> cls = GaussianNB().fit(ddf1, feature_col=['col1', 'col2'],
...                        label_col='label')
>>> ddf2 = cls.transform(ddf1, pred_col='prediction')
-
check_fitted_model
()¶
-
fit
(data, feature_col, label_col)¶ Fit the model.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
Returns: trained model
-
fit_transform
(data, feature_col, label_col, pred_col='prediction_GaussianNB')¶ Fit the model and transform.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
- pred_col – Output prediction name (default, ‘prediction_GaussianNB’);
Returns: DDF
-
load_model
(filepath)¶ Load a machine learning model from a binary file in a storage.
Parameters: filepath – The absolute path name;
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')
-
save_model
(filepath, overwrite=True)¶ Save a machine learning model as a binary file in a storage.
Parameters: - filepath – The output absolute path name;
- overwrite – Overwrite if file already exists (default, True);
Returns: self
Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
-
transform
(data, feature_col=None, pred_col='prediction_GaussianNB')¶ Parameters: - data – DDF
- feature_col – Feature column name;
- pred_col – Output prediction name (default, ‘prediction_GaussianNB’);
Returns: DDF
-
-
class
ddf_library.functions.ml.classification.
KNearestNeighbors
(k=3)¶ Bases:
ddf_library.bases.ddf_model.ModelDDF
K-Nearest Neighbors is an algorithm that can be used for both classification and regression problems. In classification, the prediction for each point is computed by a simple majority vote of its K nearest neighbors in the training set. The choice of the parameter K is crucial in this algorithm and depends on the data set. However, values of one or three are the most common.
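A minimal sketch of the majority vote follows. It is not the ddf_library implementation: the function name `knn_predict` is hypothetical, and Euclidean distance is an assumption (the documentation above does not state which metric is used).

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Pair every training row's distance to the query with its label,
    # then sort by distance only (assumed metric: Euclidean).
    neighbors = sorted(
        ((math.dist(row, query), label)
         for row, label in zip(train_rows, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two clusters labelled 'a' and 'b'.
train_rows = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']
knn_prediction = knn_predict(train_rows, train_labels, [0.5, 0.5], k=3)
```

An odd K such as 1 or 3 avoids tied votes when there are only two classes.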
Example:
>>> knn = KNearestNeighbors(k=1).fit(ddf1, feature_col=['col1', 'col2'],
...                                  label_col='label')
>>> ddf2 = knn.transform(ddf1, pred_col='prediction')
Parameters: k – Number of nearest neighbors used in the majority vote;
-
check_fitted_model
()¶
-
fit
(data, feature_col, label_col)¶ Fit the model.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
Returns: trained model
-
fit_transform
(data, feature_col, label_col, pred_col='prediction_kNN')¶ Fit the model and transform.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
- pred_col – Output prediction name (default, ‘prediction_kNN’);
Returns: DDF
-
load_model
(filepath)¶ Load a machine learning model from a binary file in a storage.
Parameters: filepath – The absolute path name;
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')
-
save_model
(filepath, overwrite=True)¶ Save a machine learning model as a binary file in a storage.
Parameters: - filepath – The output absolute path name;
- overwrite – Overwrite if file already exists (default, True);
Returns: self
Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
-
transform
(data, feature_col=None, pred_col='prediction_kNN')¶ Parameters: - data – DDF
- feature_col – Feature column name;
- pred_col – Output prediction name (default, ‘prediction_kNN’);
Returns: DDF
-
-
class
ddf_library.functions.ml.classification.
LogisticRegression
(alpha=0.1, regularization=0.1, max_iter=100, threshold=0.01)¶ Bases:
ddf_library.bases.ddf_model.ModelDDF
Logistic regression is named for the function used at the core of the method, the logistic function. It is the go-to method for binary classification problems (problems with two class values).
The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that maps any real-valued number to a value between 0 and 1, but never exactly to those limits.
This implementation uses gradient ascent, a variant of gradient descent. Gradient ascent is the same as gradient descent, except that it maximizes a function instead of minimizing it.
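The sigmoid function and a gradient ascent loop can be sketched as below. This is an illustrative sketch, not the library's code: the names `stable_sigmoid`, `fit_logistic`, and `predict_proba` are hypothetical, labels are assumed to be 0 or 1, the penalty is assumed to be L2, and the stopping rule (largest weight update below `threshold`) is one plausible reading of "tolerance for stopping criterion".

```python
import math


def stable_sigmoid(z):
    """Map any real number into the open interval (0, 1) without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, alpha=0.1, regularization=0.1,
                 max_iter=100, threshold=0.01):
    """Gradient ascent on the (assumed L2-penalized) log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(max_iter):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            x = [1.0] + list(row)
            error = y - stable_sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        # Move *up* the averaged gradient; the intercept is not penalized.
        step = []
        for j, (g, w) in enumerate(zip(gradient, weights)):
            penalty = regularization * w if j > 0 else 0.0
            step.append(alpha * (g / len(rows) - penalty))
        weights = [w + s for w, s in zip(weights, step)]
        # Stop when the largest update falls below the tolerance.
        if max(abs(s) for s in step) < threshold:
            break
    return weights


def predict_proba(weights, row):
    """Probability of class 1 for a single row."""
    return stable_sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], row)))


# Toy data (hypothetical): larger values belong to class 1.
rows = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
logreg_weights = fit_logistic(rows, labels)
```

The update adds the gradient to the weights, which is what distinguishes ascent from descent; minimizing the negative log-likelihood with gradient descent would produce the same iterates.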
Example:
>>> cls = LogisticRegression().fit(ddf1, feature_col=['col1', 'col2'],
...                                label_col='label')
>>> ddf2 = cls.transform(ddf1, pred_col='prediction')
Parameters: - alpha – Learning rate parameter (default, 0.1);
- regularization – Regularization parameter (default, 0.1);
- max_iter – Maximum number of iterations (default, 100);
- threshold – Tolerance for stopping criterion (default, 0.01);
-
check_fitted_model
()¶
-
fit
(data, feature_col, label_col)¶ Fit the model.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
Returns: trained model
-
fit_transform
(data, feature_col, label_col, pred_col='prediction_LogReg')¶ Fit the model and transform.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
- pred_col – Output prediction name (default, ‘prediction_LogReg’);
Returns: DDF
-
load_model
(filepath)¶ Load a machine learning model from a binary file in a storage.
Parameters: filepath – The absolute path name;
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')
-
save_model
(filepath, overwrite=True)¶ Save a machine learning model as a binary file in a storage.
Parameters: - filepath – The output absolute path name;
- overwrite – Overwrite if file already exists (default, True);
Returns: self
Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
-
transform
(data, feature_col=None, pred_col='prediction_LogReg')¶ Parameters: - data – DDF
- feature_col – Feature column name;
- pred_col – Output prediction name (default, ‘prediction_LogReg’);
Returns: DDF
-
class
ddf_library.functions.ml.classification.
SVM
(coef_lambda=0.1, coef_lr=0.01, threshold=0.001, max_iter=100, penalty='l2')¶ Bases:
ddf_library.bases.ddf_model.ModelDDF
A support vector machine (SVM) is a supervised learning model used for binary classification. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. This algorithm is effective in high-dimensional spaces, and it remains effective when the number of dimensions is greater than the number of samples.
The algorithm reads a data set composed of labels (-1 or 1) and features (numeric fields).
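One common way to train such a linear classifier is subgradient descent on the hinge loss, sketched below. This is an assumption for illustration and not necessarily how ddf_library trains its model: the names `fit_linear_svm` and `predict_svm` are hypothetical, only the 'l2' penalty is shown, and training stops when the cost changes by less than `threshold` between iterations.

```python
def fit_linear_svm(rows, labels, coef_lambda=0.1, coef_lr=0.01,
                   threshold=0.001, max_iter=100):
    """Batch subgradient descent on the L2-penalized hinge loss.

    Labels must be -1 or 1, matching the input format described above.
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    previous_cost = float("inf")
    for _ in range(max_iter):
        grad_w = [coef_lambda * w for w in weights]
        grad_b = 0.0
        cost = 0.5 * coef_lambda * sum(w * w for w in weights)
        for row, y in zip(rows, labels):
            margin = y * (sum(w * x for w, x in zip(weights, row)) + bias)
            # Only examples inside the margin contribute to the hinge loss.
            if margin < 1:
                cost += (1 - margin) / len(rows)
                for j, x in enumerate(row):
                    grad_w[j] -= y * x / len(rows)
                grad_b -= y / len(rows)
        # Stop once the cost changes by less than the tolerance.
        if abs(previous_cost - cost) < threshold:
            break
        previous_cost = cost
        weights = [w - coef_lr * g for w, g in zip(weights, grad_w)]
        bias -= coef_lr * grad_b
    return weights, bias


def predict_svm(weights, bias, row):
    """Return 1 or -1 depending on the side of the separating hyperplane."""
    score = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if score >= 0 else -1


# Toy data (hypothetical): two linearly separable groups.
rows = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
        [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
labels = [-1, -1, -1, 1, 1, 1]
svm_weights, svm_bias = fit_linear_svm(rows, labels)
```

The sign of the score tells which side of the gap a new example falls on, which is why the labels are encoded as -1 and 1 rather than 0 and 1.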
Example:
>>> cls = SVM(max_iter=10).fit(ddf1, feature_col=['col1', 'col2'],
...                            label_col='label')
>>> ddf2 = cls.transform(ddf1, pred_col='prediction')
Parameters: - coef_lambda – Regularization parameter (default, 0.1);
- coef_lr – Learning rate parameter (default, 0.01);
- threshold – Tolerance for stopping criterion (default, 0.001);
- max_iter – Maximum number of iterations (default, 100);
- penalty – Apply ‘l2’ or ‘l1’ penalization (default, ‘l2’)
-
check_fitted_model
()¶
-
fit
(data, feature_col, label_col)¶ Fit the model.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
Returns: trained model
-
fit_transform
(data, feature_col, label_col, pred_col='prediction_SVM')¶ Fit the model and transform.
Parameters: - data – DDF
- feature_col – Feature column name;
- label_col – Label column name;
- pred_col – Output prediction name (default, ‘prediction_SVM’);
Returns: DDF
-
load_model
(filepath)¶ Load a machine learning model from a binary file in a storage.
Parameters: filepath – The absolute path name;
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')
-
save_model
(filepath, overwrite=True)¶ Save a machine learning model as a binary file in a storage.
Parameters: - filepath – The output absolute path name;
- overwrite – Overwrite if file already exists (default, True);
Returns: self
Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
-
transform
(data, feature_col=None, pred_col='prediction_SVM')¶ Parameters: - data – DDF
- feature_col – Feature column name;
- pred_col – Output prediction name (default, ‘prediction_SVM’);
Returns: DDF