ML.Feature
class ddf_library.functions.ml.feature.Binarizer(threshold=0.0)
Bases: ddf_library.bases.ddf_model.ModelDDF
Binarize data (set feature values to 0 or 1) according to a threshold.
Values greater than the threshold map to 1; values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.
Example:
>>> ddf = Binarizer(threshold=5.0).transform(ddf, input_col=['feature'])
Parameters:
- threshold – Feature values less than or equal to this are replaced by 0, values above it by 1. Default = 0.0.
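The thresholding rule can be sketched in plain Python (an illustration of the semantics only, not the DDF implementation; `binarize` is a hypothetical helper name):

```python
# Illustrative binarization: values strictly greater than the threshold
# map to 1, values less than or equal to it map to 0.
def binarize(values, threshold=0.0):
    return [1 if v > threshold else 0 for v in values]

print(binarize([-1.0, 0.0, 0.5, 5.0, 7.2], threshold=5.0))  # [0, 0, 0, 0, 1]
```

Note that the boundary value 5.0 maps to 0, matching the "less than or equal" rule above.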
check_fitted_model()

load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col, output_col=None, remove=False)
Parameters:
- data – DDF;
- input_col – List of columns;
- output_col – Output column names;
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.CountVectorizer(vocab_size=200, min_tf=1.0, min_df=1, binary=True)
Bases: ddf_library.bases.ddf_model.ModelDDF
Converts a collection of text documents to a matrix of token counts.
Example:
>>> cv = CountVectorizer().fit(ddf1, input_col='col_1')
>>> ddf2 = cv.transform(ddf1, output_col='col_2')
Parameters:
- vocab_size – Maximum size of the vocabulary (default, 200);
- min_tf – Filter to ignore rare words in a document: for each document, terms with a frequency/count less than the given threshold are ignored. If this is an integer >= 1, it specifies the number of times the term must appear in the document. Default = 1.0;
- min_df – Minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies that number of documents. Default = 1;
- binary – If True, all nonzero counts are set to 1.
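The count-vectorization idea can be sketched as follows (a pure-Python illustration under simplifying assumptions, not the DDF code; `fit_vocabulary` and `count_vector` are hypothetical names, and min_tf/min_df filtering is omitted):

```python
from collections import Counter

# Build a vocabulary from tokenized documents (most frequent by document
# frequency, capped at vocab_size), then map each document to a vector
# of token counts over that vocabulary.
def fit_vocabulary(docs, vocab_size=200):
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    return [tok for tok, _ in df.most_common(vocab_size)]

def count_vector(doc, vocab, binary=False):
    c = Counter(doc)
    return [min(c[tok], 1) if binary else c[tok] for tok in vocab]

docs = [['a', 'b', 'a'], ['b', 'c']]
vocab = sorted(fit_vocabulary(docs))
print(vocab)                         # ['a', 'b', 'c']
print(count_vector(docs[0], vocab))  # [2, 1, 0]
```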
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Input column name with the tokens.
Returns: a trained model

fit_transform(data, input_col, output_col=None, remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Input column name with the tokens;
- output_col – Output field (default, add suffix '_vectorized');
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col=None, remove=False)
Parameters:
- data – DDF;
- input_col – Input column name with the tokens;
- output_col – Output field (default, add suffix '_vectorized');
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.IndexToString(model)
Bases: ddf_library.bases.ddf_model.ModelDDF
Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.
Example:
>>> ddf2 = IndexToString(model=model).transform(ddf1, input_col='category_indexed')
Parameters:
- model – Model generated by StringIndexer.
check_fitted_model()

load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col, output_col=None)
Parameters:
- data – DDF;
- input_col – Input column name;
- output_col – Output column name.
Returns: DDF
class ddf_library.functions.ml.feature.MaxAbsScaler
Bases: ddf_library.bases.ddf_model.ModelDDF
MaxAbsScaler transforms a data set of feature rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature.
This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
Example:
>>> scaler = MaxAbsScaler().fit(ddf1, input_col='features')
>>> ddf2 = scaler.transform(ddf1, output_col='features_norm')
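The per-column arithmetic can be sketched as (an illustration of the formula only, not the distributed DDF implementation; `max_abs_scale` is a hypothetical name):

```python
# Max-abs scaling: divide each value in a feature column by the column's
# maximum absolute value, mapping it into [-1, 1] without centering.
def max_abs_scale(column):
    m = max(abs(v) for v in column) or 1.0  # guard against an all-zero column
    return [v / m for v in column]

print(max_abs_scale([2.0, -4.0, 1.0]))  # [0.5, -1.0, 0.25]
```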
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Column with the features.
Returns: trained model

fit_transform(data, input_col, output_col=None, remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Column with the features;
- output_col – Output column;
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col=None, remove=False)
Parameters:
- data – DDF;
- input_col – Column with the features;
- output_col – Output column;
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.MinMaxScaler(feature_range=(0, 1))
Bases: ddf_library.bases.ddf_model.ModelDDF
MinMaxScaler transforms a data set of feature rows, rescaling each feature to a specific range (often [0, 1]).
Example:
>>> scaler = MinMaxScaler()
>>> ddf2 = scaler.fit_transform(ddf1, input_col=['col1', 'col2'])
Parameters:
- feature_range – A tuple with the range; default is (0, 1).
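The rescaling formula can be sketched as (illustrative pure Python, not the DDF implementation; `min_max_scale` is a hypothetical name):

```python
# Min-max scaling: x' = lo + (x - min) * (hi - lo) / (max - min),
# mapping a feature column onto the requested feature_range.
def min_max_scale(column, feature_range=(0, 1)):
    lo, hi = feature_range
    mn, mx = min(column), max(column)
    span = (mx - mn) or 1.0  # a constant column maps to the range minimum
    return [lo + (v - mn) * (hi - lo) / span for v in column]

print(min_max_scale([1.0, 2.0, 3.0]))      # [0.0, 0.5, 1.0]
print(min_max_scale([1.0, 3.0], (-1, 1)))  # [-1.0, 1.0]
```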
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Column with the features.
Returns: trained model

fit_transform(data, input_col, output_col=None, remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Column with the features;
- output_col – Output column;
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col=None, remove=False)
Parameters:
- data – DDF;
- input_col – Column with the features;
- output_col – Output column;
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.NGram(n=2)
Bases: ddf_library.bases.ddf_base.DDFSketch
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words. When the input is empty, an empty array is returned. When the input array length is less than n (the number of elements per n-gram), no n-grams are returned.
Example:
>>> ddf = NGram(n=3).transform(ddf_input, 'col_in', 'col_out')
Parameters:
- n – Number of elements per n-gram (an integer). Default = 2.
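The sliding-window behaviour described above can be sketched as (illustrative only; `ngrams` is a hypothetical helper name):

```python
# Slide a window of n tokens over the input array; each n-gram is a
# space-separated string. Inputs shorter than n yield no n-grams.
def ngrams(tokens, n=2):
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(['I', 'saw', 'the', 'red', 'balloon'], n=3))
# ['I saw the', 'saw the red', 'the red balloon']
print(ngrams(['a'], n=2))  # []
```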
transform(data, input_col, output_col=None)
Parameters:
- data – DDF;
- input_col – Input column with the tokens;
- output_col – Output column (default, add suffix '_ngram').
Returns: DDF
class ddf_library.functions.ml.feature.OneHotEncoder
Bases: ddf_library.bases.ddf_model.ModelDDF
Encode categorical integer features as a one-hot numeric array.
Example:
>>> enc = OneHotEncoder()
>>> ddf2 = enc.fit_transform(ddf1, input_col='col_1', output_col='col_2')
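The encoding idea can be sketched as (an illustration assuming a sorted category vocabulary, not the DDF implementation; `one_hot` is a hypothetical name):

```python
# Map each category to a 0/1 vector with a single 1 at the index of
# that category in the sorted vocabulary of distinct values.
def one_hot(values):
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[v] == j else 0 for j in range(len(categories))]
               for v in values]
    return vectors, categories

vectors, cats = one_hot(['red', 'green', 'red'])
print(cats)     # ['green', 'red']
print(vectors)  # [[0, 1], [1, 0], [0, 1]]
```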
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Input column name with the tokens.
Returns: a trained model

fit_transform(data, input_col, output_col='_onehot', remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Input column name with the tokens;
- output_col – Output suffix name. The pattern will be col + order + suffix; the suffix default is '_onehot';
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col='_onehot', remove=False)
Parameters:
- data – DDF;
- input_col – Input column name with the tokens;
- output_col (str) – Output suffix name. The pattern will be col + order + suffix; the suffix default is '_onehot';
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.PCA(n_components)
Bases: ddf_library.bases.ddf_model.ModelDDF
Principal component analysis (PCA) is a statistical method that finds a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is widely used for dimensionality reduction.
Example:
>>> pca = PCA(n_components=2).fit(ddf1, input_col='features')
>>> ddf2 = pca.transform(ddf1, output_col='features_pca')
Parameters:
- n_components – Number of output components.
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Input columns.
Returns: trained model

fit_transform(data, input_col, output_col='_pca', remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Input columns;
- output_col – A list of output feature columns or a suffix name;
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col='_pca', remove=False)
Parameters:
- data – DDF;
- input_col – Input columns;
- output_col – A list of output feature columns or a suffix name;
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.PolynomialExpansion(degree=2, interaction_only=False)
Bases: ddf_library.bases.ddf_model.ModelDDF
Perform feature expansion in a polynomial space. In mathematics, an expansion of a product of sums expresses it as a sum of products, using the fact that multiplication distributes over addition.
For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Parameters:
- degree – The degree of the polynomial features. Default = 2;
- interaction_only – If True, only interaction features are produced: features that are products of at most degree distinct input features. Default = False.
Example:
>>> ddf = PolynomialExpansion(degree=2).transform(ddf, input_col=['x', 'y'])
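The expansion for [a, b] above can be reproduced with `itertools` (an illustrative sketch of the combinatorics, not the DDF implementation; `poly_expand` is a hypothetical name):

```python
from itertools import combinations, combinations_with_replacement
from math import prod

# For each degree d from 0 to `degree`, emit the product of every
# d-combination of the input features (with repetition unless
# interaction_only is set, which restricts to distinct features).
def poly_expand(features, degree=2, interaction_only=False):
    combine = combinations if interaction_only else combinations_with_replacement
    out = []
    for d in range(degree + 1):
        out += [prod(c) for c in combine(features, d)]
    return out

print(poly_expand([2, 3]))  # [1, 2, 3, 4, 6, 9], i.e. [1, a, b, a^2, ab, b^2]
```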
check_fitted_model()

load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col, output_col='_poly', remove=False)
Parameters:
- data – DDF;
- input_col – List of columns;
- output_col – Output suffix name; the pattern is col + order + suffix, and the suffix default is '_poly';
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.RegexTokenizer(pattern='\s+', min_token_length=2, to_lowercase=True)
Bases: ddf_library.bases.ddf_base.DDFSketch
A regex-based tokenizer that extracts tokens by using the provided regex pattern (in Java dialect) to split the text.
Example:
>>> ddf2 = RegexTokenizer(pattern=r"(?u)\w\w+").transform(ddf_input, input_col='col_0')
Parameters:
- pattern – Regex pattern in Java dialect (default, '\s+');
- min_token_length – Minimum token length (default is 2);
- to_lowercase – Convert words to lowercase (default is True).
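The combination of pattern extraction, lowercasing, and minimum-length filtering can be sketched with Python's `re` module (note the real class uses Java-dialect regexes; `regex_tokenize` is a hypothetical helper):

```python
import re

# Extract word tokens matching the pattern, optionally lowercasing the
# text first, then drop tokens shorter than min_token_length.
def regex_tokenize(text, pattern=r"(?u)\w\w+", min_token_length=2, to_lowercase=True):
    tokens = re.findall(pattern, text.lower() if to_lowercase else text)
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("A Tale of Two Cities"))  # ['tale', 'of', 'two', 'cities']
```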
transform(data, input_col, output_col=None)
Parameters:
- data – DDF;
- input_col – Input column with sentences;
- output_col – Output column ('input_col'_token if None).
Returns: DDF
class ddf_library.functions.ml.feature.RemoveStopWords(case_sensitive=True, stops_words_list=None, language=None)
Bases: ddf_library.bases.ddf_model.ModelDDF
Stop-word removal is an operation that removes words which should be excluded from the input, typically because the words appear frequently and do not carry much meaning.
Example:
>>> remover = RemoveStopWords(stops_words_list=['word1', 'word2'])
>>> remover = remover.stopwords_from_ddf(stopwords_ddf, 'col')
>>> ddf2 = remover.transform(ddf_input, input_col='col_0', output_col='col_1')
Parameters:
- case_sensitive – Compare words case-sensitively (default, True);
- stops_words_list – Optional, a list of words to be removed;
- language – Optional, the language of a built-in stop-words list (default, None).
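The filtering semantics, including the case_sensitive flag, can be sketched as (illustrative only, not the DDF implementation; `remove_stop_words` is a hypothetical name):

```python
# Drop tokens that appear in the stop-word list; when case_sensitive is
# False, compare both sides in lowercase.
def remove_stop_words(tokens, stop_words, case_sensitive=True):
    if case_sensitive:
        stops = set(stop_words)
        return [t for t in tokens if t not in stops]
    stops = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stops]

tokens = ['The', 'red', 'balloon']
print(remove_stop_words(tokens, ['the'], case_sensitive=False))  # ['red', 'balloon']
print(remove_stop_words(tokens, ['the'], case_sensitive=True))   # ['The', 'red', 'balloon']
```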
check_fitted_model()

load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
stopwords_from_ddf(data, input_col)
It is also possible to load stop-words from a DDF.
Parameters:
- data – DDF with a column of stop-words;
- input_col – Stop-words column name.

transform(data, input_col, output_col=None)
Parameters:
- data – DDF;
- input_col – Input column with the tokens;
- output_col – Output column ('input_col'_rm_stopwords if None).
Returns: DDF
class ddf_library.functions.ml.feature.StandardScaler(with_mean=True, with_std=True)
Bases: ddf_library.bases.ddf_model.ModelDDF
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples (or zero if with_mean=False), and s is the standard deviation of the training samples (or one if with_std=False).
Example:
>>> scaler = StandardScaler(with_mean=True, with_std=True)
>>> ddf2 = scaler.fit_transform(ddf1, input_col=['col1', 'col2'])
Parameters:
- with_mean – True to use the mean (default is True);
- with_std – True to use the standard deviation of the training samples (default is True).
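The z = (x - u) / s formula can be sketched as follows (an illustration that assumes the population standard deviation; the DDF implementation may differ, and `standard_scale` is a hypothetical name):

```python
from statistics import mean, pstdev

# Standard score: subtract the column mean (if with_mean) and divide by
# the column's population standard deviation (if with_std).
def standard_scale(column, with_mean=True, with_std=True):
    u = mean(column) if with_mean else 0.0
    s = pstdev(column) if with_std else 1.0
    return [(v - u) / s for v in column]

print(standard_scale([1.0, 2.0, 3.0]))  # approximately [-1.224..., 0.0, 1.224...]
```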
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Column with the features.
Returns: trained model

fit_transform(data, input_col, output_col=None, remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Column with the features;
- output_col – Output column;
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col=None, remove=False)
Parameters:
- data – DDF;
- input_col – Column with the features;
- output_col – Output column;
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.StringIndexer
Bases: ddf_library.bases.ddf_model.ModelDDF
StringIndexer indexes a feature by encoding a string column as a column of indices.
Example:
>>> model = StringIndexer().fit(ddf1, input_col='category')
>>> ddf2 = model.transform(ddf1)
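The index-and-invert relationship between StringIndexer and IndexToString can be sketched as (illustrative only; this sketch assigns indices by sorted order, whereas the real implementation may order differently, and `fit_string_indexer` is a hypothetical name):

```python
# Assign each distinct label an integer index; the inverse mapping is
# what IndexToString applies to recover the original strings.
def fit_string_indexer(labels):
    return {lab: i for i, lab in enumerate(sorted(set(labels)))}

labels = ['b', 'a', 'b', 'c']
index = fit_string_indexer(labels)
indexed = [index[lab] for lab in labels]
inverse = {i: lab for lab, i in index.items()}
print(indexed)                        # [1, 0, 1, 2]
print([inverse[i] for i in indexed])  # ['b', 'a', 'b', 'c']
```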
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – List of columns.
Returns: a trained model

fit_transform(data, input_col, output_col=None)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – List of columns;
- output_col – Output indexes column.
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col=None)
Parameters:
- data – DDF;
- input_col – List of columns;
- output_col – List of output indexes columns.
Returns: DDF
class ddf_library.functions.ml.feature.TfidfVectorizer(vocab_size=200, min_tf=1.0, min_df=1)
Bases: ddf_library.bases.ddf_model.ModelDDF
Term frequency-inverse document frequency (TF-IDF) is a numerical statistic transformation intended to reflect how important a word is to a document in a collection or corpus.
Example:
>>> tfidf = TfidfVectorizer().fit(ddf1, input_col='col_0')
>>> ddf2 = tfidf.transform(ddf1, output_col='col_1')
Parameters:
- vocab_size – Maximum size of the vocabulary (default, 200);
- min_tf – Filter to ignore rare words in a document: for each document, terms with a frequency/count less than the given threshold are ignored. If this is an integer >= 1, it specifies the number of times the term must appear in the document. Default = 1.0;
- min_df – Minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies that number of documents. Default = 1.
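The statistic can be sketched with one common TF-IDF formulation (an assumption: tf as the raw term count and idf = log(N / df); the DDF implementation may use a smoothed variant, and `tf_idf` is a hypothetical name):

```python
from collections import Counter
from math import log

# For each document and vocabulary term: weight = tf * log(N / df),
# where tf is the term's count in the document, N the number of
# documents, and df the number of documents containing the term.
def tf_idf(docs, vocab):
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    return [[Counter(d)[t] * log(n / df[t]) for t in vocab] for d in docs]

docs = [['a', 'b'], ['a', 'c']]
weights = tf_idf(docs, vocab=['a', 'b', 'c'])
print(weights[0])  # 'a' appears everywhere, so its weight is 0.0
```

A term occurring in every document gets idf = log(1) = 0, reflecting that it carries no discriminating power.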
check_fitted_model()

fit(data, input_col)
Fit the model.
Parameters:
- data – DDF;
- input_col – Input column name with the tokens.
Returns: trained model

fit_transform(data, input_col, output_col=None, remove=False)
Fit the model and transform.
Parameters:
- data – DDF;
- input_col – Input column name with the tokens;
- output_col – Output field (default, add suffix '_vectorized');
- remove – Remove input columns after execution (default, False).
Returns: DDF
load_model(filepath)
Load a machine learning model from a binary file in storage.
Parameters: filepath – The absolute path name.
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')

save_model(filepath, overwrite=True)
Save a machine learning model as a binary file in storage.
Parameters:
- filepath – The output absolute path name;
- overwrite – Overwrite the file if it already exists (default, True).
Returns: self
Example:
>>> cls = Kmeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')

transform(data, input_col=None, output_col=None, remove=False)
Parameters:
- data – DDF;
- input_col – Input column name with the tokens;
- output_col – Output field (default, add suffix '_vectorized');
- remove – Remove input columns after execution (default, False).
Returns: DDF
class ddf_library.functions.ml.feature.Tokenizer(min_token_length=2, to_lowercase=True)
Bases: ddf_library.bases.ddf_base.DDFSketch
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
Example:
>>> ddf2 = Tokenizer().transform(ddf_input, input_col='features')
Parameters:
- min_token_length – Minimum token length (default is 2);
- to_lowercase – Convert words to lowercase (default is True).
transform(data, input_col, output_col=None)
Parameters:
- data – DDF;
- input_col – Input column with sentences;
- output_col – Output column ('input_col'_tokens if None).
Returns: DDF