ML.Feature

class ddf_library.functions.ml.feature.Binarizer(threshold=0.0)

Bases: ddf_library.bases.ddf_model.ModelDDF

Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Example:
>>> ddf = Binarizer(threshold=5.0).transform(ddf, input_col=['feature'])
Parameters:threshold – Feature values less than or equal to this are replaced by 0, values above it by 1. Default = 0.0;
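The thresholding rule above can be sketched in a few lines of plain Python (a conceptual illustration only, not the DDF API):

```python
# Values greater than the threshold map to 1, the rest to 0.
def binarize(values, threshold=0.0):
    return [1 if v > threshold else 0 for v in values]

binarized = binarize([2.0, 5.0, 7.5], threshold=5.0)
```

Note that a value exactly equal to the threshold maps to 0, matching the description above.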
check_fitted_model()
load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col, output_col=None, remove=False)
Parameters:
  • data – DDF
  • input_col – List of columns;
  • output_col – Output columns names.
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.CountVectorizer(vocab_size=200, min_tf=1.0, min_df=1, binary=True)

Bases: ddf_library.bases.ddf_model.ModelDDF

Converts a collection of text documents to a matrix of token counts.

Example:
>>> cv = CountVectorizer().fit(ddf1, input_col='col_1')
>>> ddf2 = cv.transform(ddf1, output_col='col_2')
Parameters:
  • vocab_size – Maximum size of the vocabulary. (default, 200)
  • min_tf – Filter to ignore rare words in a document. For each document, terms with a frequency/count less than the given threshold are ignored. If this is an integer >= 1, it specifies a count (the number of times the term must appear in the document). Default 1.0;
  • min_df – Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies the number of documents the term must appear in. Default 1;
  • binary – If True, all nonzero counts are set to 1.
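The counting step can be sketched in plain Python (a conceptual illustration only, not the DDF API; the library's vocabulary pruning by min_tf/min_df is omitted here):

```python
from collections import Counter

# The vocabulary keeps the `vocab_size` most frequent terms over all
# documents; `binary=True` clips every nonzero count to 1.
def count_vectorize(docs, vocab_size=200, binary=True):
    totals = Counter(token for doc in docs for token in doc)
    vocab = [term for term, _ in totals.most_common(vocab_size)]
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([min(counts[t], 1) if binary else counts[t] for t in vocab])
    return vocab, rows
```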
check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
Returns:

a trained model

fit_transform(data, input_col, output_col=None, remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
  • output_col – Output field (default, add suffix ‘_vectorized’);
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col=None, remove=False)
Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
  • output_col – Output field (default, add suffix ‘_vectorized’);
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.IndexToString(model)

Bases: ddf_library.bases.ddf_model.ModelDDF

Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.

Example:
>>> ddf2 = IndexToString(model=model)\
>>>     .transform(ddf1, input_col='category_indexed')

Parameters:model – Model generated by StringIndexer;
check_fitted_model()
load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col, output_col=None)
Parameters:
  • data – DDF
  • input_col – Input column name;
  • output_col – Output column name.
Returns:

DDF

class ddf_library.functions.ml.feature.MaxAbsScaler

Bases: ddf_library.bases.ddf_model.ModelDDF

MaxAbsScaler transforms a data set of feature rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value in each feature.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
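The per-feature rescaling can be sketched in plain Python (a conceptual illustration only, not the DDF API):

```python
# Divide each column by its maximum absolute value; an all-zero column
# is left unchanged (divisor falls back to 1.0).
def max_abs_scale(rows):
    cols = list(zip(*rows))
    max_abs = [max(abs(v) for v in col) or 1.0 for col in cols]
    return [[v / m for v, m in zip(row, max_abs)] for row in rows]
```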

Example:
>>> scaler = MaxAbsScaler().fit(ddf1, input_col='features')
>>> ddf2 = scaler.transform(ddf1, output_col='features_norm')
check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – Column with the features;
Returns:

trained model

fit_transform(data, input_col, output_col=None, remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Column with the features;
  • output_col – Output column;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col=None, remove=False)
Parameters:
  • data – DDF
  • input_col – Column with the features;
  • output_col – Output column;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.MinMaxScaler(feature_range=(0, 1))

Bases: ddf_library.bases.ddf_model.ModelDDF

MinMaxScaler transforms a data set of feature rows, rescaling each feature to a specific range (often [0, 1]).

Example:
>>> scaler = MinMaxScaler()
>>> ddf2 = scaler.fit_transform(ddf1, input_col=['col1', 'col2'])
Parameters:feature_range – A tuple with the range, default is (0,1);
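The rescaling formula x_scaled = (x - min) / (max - min), shifted into feature_range, can be sketched in plain Python (a conceptual illustration only, not the DDF API):

```python
# Rescale one feature column into feature_range; a constant column is
# left at the lower bound (divisor falls back to 1.0).
def min_max_scale(values, feature_range=(0, 1)):
    lo, hi = feature_range
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0
    return [lo + (v - vmin) * (hi - lo) / span for v in values]
```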
check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – Column with the features;
Returns:

trained model

fit_transform(data, input_col, output_col=None, remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Column with the features;
  • output_col – Output column;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col=None, remove=False)
Parameters:
  • data – DDF
  • input_col – Column with the features;
  • output_col – Output column;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.NGram(n=2)

Bases: ddf_library.bases.ddf_base.DDFSketch

A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words. When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
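The n-gram construction described above can be sketched in plain Python (a conceptual illustration only, not the DDF API):

```python
# Each n-gram is a space-separated string of n consecutive tokens; inputs
# shorter than n yield an empty list, as documented above.
def ngrams(tokens, n=2):
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```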

Example:
>>> ddf = NGram(n=3).transform(ddf_input, 'col_in', 'col_out')
Parameters:n – Number of elements per n-gram (integer). Default = 2;
transform(data, input_col, output_col=None)
Parameters:
  • data – DDF
  • input_col – Input column with the tokens;
  • output_col – Output column. Default, add suffix ‘_ngram’;
Returns:

DDF

class ddf_library.functions.ml.feature.OneHotEncoder

Bases: ddf_library.bases.ddf_model.ModelDDF

Encode categorical integer features as a one-hot numeric array.

Example:
>>> enc = OneHotEncoder()
>>> ddf2 = enc.fit_transform(ddf1, input_col='col_1', output_col='col_2')
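The encoding can be sketched in plain Python (a conceptual illustration only, not the DDF API; the category ordering here is alphabetical, which is an assumption):

```python
# One binary column per distinct category; exactly one 1 per row.
def one_hot(values):
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]
```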
check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
Returns:

a trained model

fit_transform(data, input_col, output_col='_onehot', remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
  • output_col – Output suffix name. The pattern will be col + order + suffix, suffix default is ‘_onehot’;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col='_onehot', remove=False)
Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
  • output_col (str) – Output suffix name. The pattern will be col + order + suffix, suffix default is ‘_onehot’;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.PCA(n_components)

Bases: ddf_library.bases.ddf_model.ModelDDF

Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

Example:
>>> pca = PCA(n_components=2).fit(ddf1, input_col='features')
>>> ddf2 = pca.transform(ddf1, output_col='features_pca')
Parameters:n_components – Number of output components;
check_fitted_model()
fit(data, input_col)
Parameters:
  • data – DDF
  • input_col – Input columns;
Returns:

trained model

fit_transform(data, input_col, output_col='_pca', remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Input columns;
  • output_col – A list of output feature column or a suffix name.
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col='_pca', remove=False)
Parameters:
  • data – DDF
  • input_col – Input columns;
  • output_col – A list of output feature column or a suffix name.
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.PolynomialExpansion(degree=2, interaction_only=False)

Bases: ddf_library.bases.ddf_model.ModelDDF

Perform feature expansion in a polynomial space. In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition.

For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]
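The expansion for [a, b] above can be sketched in plain Python (a conceptual illustration only, not the DDF API):

```python
from itertools import combinations_with_replacement
from math import prod

# Degree-2 expansion of [a, b] yields [1, a, b, a^2, a*b, b^2].
# With interaction_only=True, repeated features (a^2, b^2) are dropped.
def poly_expand(features, degree=2, interaction_only=False):
    out = [1]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            if interaction_only and len(set(combo)) < len(combo):
                continue
            out.append(prod(features[i] for i in combo))
    return out
```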

Parameters:
  • degree – The degree of the polynomial features. Default = 2.
  • interaction_only

    If true, only interaction features are produced: features that are products of at most degree distinct input features. Default = False

    Example:
    >>> ddf2 = PolynomialExpansion(degree=2)\
    >>>     .transform(ddf, input_col=['x', 'y'])
check_fitted_model()
load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col, output_col='_poly', remove=False)
Parameters:
  • data – DDF
  • input_col – List of columns;
  • output_col – Output suffix name following col plus order and suffix. Suffix default is ‘_poly’;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.RegexTokenizer(pattern='\s+', min_token_length=2, to_lowercase=True)

Bases: ddf_library.bases.ddf_base.DDFSketch

A regex-based tokenizer that extracts tokens by using the provided regex pattern (in Java dialect) to split the text.

Example:
>>> ddf2 = RegexTokenizer(pattern=r"(?u)\w\w+")\
...         .transform(ddf_input, input_col='col_0')
Parameters:
  • pattern – Regex pattern in Java dialect, default r"(?u)\w\w+";
  • min_token_length – Minimum tokens length (default is 2);
  • to_lowercase – To convert words to lowercase (default is True).
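The tokenization can be sketched with Python's `re` module (a conceptual illustration only, not the DDF API; Python's regex dialect is close to, but not identical to, the Java dialect documented above, and this sketch uses the pattern to match tokens rather than to split):

```python
import re

def regex_tokenize(sentence, pattern=r"(?u)\w\w+",
                   min_token_length=2, to_lowercase=True):
    # Optionally lowercase, extract matches, then drop short tokens.
    text = sentence.lower() if to_lowercase else sentence
    tokens = re.findall(pattern, text)
    return [t for t in tokens if len(t) >= min_token_length]
```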
transform(data, input_col, output_col=None)
Parameters:
  • data – DDF
  • input_col – Input column with sentences;
  • output_col – Output column (‘input_col’_token if None);
Returns:

DDF

class ddf_library.functions.ml.feature.RemoveStopWords(case_sensitive=True, stops_words_list=None, language=None)

Bases: ddf_library.bases.ddf_model.ModelDDF

Removing stop-words is an operation that excludes certain words from the input, typically because they appear frequently and don't carry much meaning.

Example:
>>> remover = RemoveStopWords(stops_words_list=['word1', 'word2'])
>>> remover = remover.stopwords_from_ddf(stopwords_ddf, 'col')
>>> ddf2 = remover.transform(ddf_input, input_col='col_0', output_col='col_1')
Parameters:
  • case_sensitive – Whether the word comparison is case sensitive (default is True);
  • stops_words_list – Optional, a list of words to be removed.
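The filtering step can be sketched in plain Python (a conceptual illustration only, not the DDF API):

```python
# Drop every token found in the stop-word list; when case_sensitive is
# False, the comparison is done on lowercased forms.
def remove_stop_words(tokens, stop_words, case_sensitive=True):
    if case_sensitive:
        stops = set(stop_words)
        return [t for t in tokens if t not in stops]
    stops = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stops]
```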
check_fitted_model()
load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
stopwords_from_ddf(data, input_col)

It is also possible to inform stop-words from a DDF.

Parameters:
  • data – DDF with a column of stop-words;
  • input_col – Stop-words column name;
transform(data, input_col, output_col=None)
Parameters:
  • data – DDF
  • input_col – Input columns with the tokens;
  • output_col – Output column (‘input_col’_rm_stopwords if None);
Returns:

DDF

class ddf_library.functions.ml.feature.StandardScaler(with_mean=True, with_std=True)

Bases: ddf_library.bases.ddf_model.ModelDDF

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
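The formula z = (x - u) / s can be sketched in plain Python (a conceptual illustration only, not the DDF API; this sketch uses the population standard deviation, which is an assumption):

```python
from statistics import mean, pstdev

# u and s are disabled by with_mean/with_std, exactly as described above.
def standard_scale(values, with_mean=True, with_std=True):
    u = mean(values) if with_mean else 0.0
    s = pstdev(values) if with_std else 1.0
    return [(v - u) / (s or 1.0) for v in values]
```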

Example:
>>> scaler = StandardScaler(with_mean=True, with_std=True)
>>> ddf2 = scaler.fit_transform(ddf1, input_col=['col1', 'col2'])
Parameters:
  • with_mean – True to use the mean (default is True);
  • with_std – True to use standard deviation of the training samples (default is True);
check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – Column with the features;
Returns:

trained model

fit_transform(data, input_col, output_col=None, remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Column with the features;
  • output_col – Output column;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col=None, remove=False)
Parameters:
  • data – DDF
  • input_col – Column with the features;
  • output_col – Output column;
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.StringIndexer

Bases: ddf_library.bases.ddf_model.ModelDDF

StringIndexer indexes a feature by encoding a string column as a column containing indexes.

Example:
>>> model = StringIndexer().fit(ddf1, input_col='category')
>>> ddf2 = model.transform(ddf1)
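The indexing (and its IndexToString inverse) can be sketched in plain Python (a conceptual illustration only, not the DDF API; assigning indexes in sorted label order is an assumption):

```python
# Map each distinct label to an integer index, then encode the column.
def string_index(values):
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return mapping, [mapping[v] for v in values]

# IndexToString is the symmetric operation: invert the mapping.
mapping, indexed = string_index(['b', 'a', 'b'])
inverse = {i: v for v, i in mapping.items()}
restored = [inverse[i] for i in indexed]
```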

check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – List of columns;
Returns:

a trained model

fit_transform(data, input_col, output_col=None)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – List of columns;
  • output_col – Output indexes column.
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col=None)
Parameters:
  • data – DDF
  • input_col – List of columns;
  • output_col – List of output index columns.
Returns:

DDF

class ddf_library.functions.ml.feature.TfidfVectorizer(vocab_size=200, min_tf=1.0, min_df=1)

Bases: ddf_library.bases.ddf_model.ModelDDF

Term frequency-inverse document frequency (TF-IDF) is a numerical statistic transformation that is intended to reflect how important a word is to a document in a collection or corpus.

Example:
>>> tfidf = TfidfVectorizer().fit(ddf1, input_col='col_0')
>>> ddf2 = tfidf.transform(ddf1, output_col='col_1')
Parameters:
  • vocab_size – Maximum size of the vocabulary. (default, 200)
  • min_tf – Filter to ignore rare words in a document. For each document, terms with a frequency/count less than the given threshold are ignored. If this is an integer >= 1, it specifies a count (the number of times the term must appear in the document). Default 1.0;
  • min_df – Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer >= 1, it specifies the number of documents the term must appear in. Default 1;
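The TF-IDF statistic can be sketched in plain Python (a conceptual illustration only, not the DDF API; the exact term-frequency normalization and IDF smoothing used by the library are assumptions here):

```python
from math import log

# tf = term count / document length; idf = log(n_docs / doc_frequency).
def tf_idf(docs):
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}
    rows = []
    for d in docs:
        rows.append([(d.count(t) / len(d)) * log(n / df[t]) for t in vocab])
    return vocab, rows
```

A term appearing in every document gets an IDF of log(1) = 0, so it contributes nothing, which is the intended behaviour of the statistic.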
check_fitted_model()
fit(data, input_col)

Fit the model.

Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
Returns:

trained model

fit_transform(data, input_col, output_col=None, remove=False)

Fit the model and transform.

Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
  • output_col – Output field (default, add suffix ‘_vectorized’);
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:filepath – The absolute path name;
Returns:self
Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, input_col=None, output_col=None, remove=False)
Parameters:
  • data – DDF
  • input_col – Input column name with the tokens;
  • output_col – Output field (default, add suffix ‘_vectorized’);
  • remove – Remove input columns after execution (default, False).
Returns:

DDF

class ddf_library.functions.ml.feature.Tokenizer(min_token_length=2, to_lowercase=True)

Bases: ddf_library.bases.ddf_base.DDFSketch

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.

Example:
>>> ddf2 = Tokenizer().transform(ddf_input, input_col='features')
Parameters:
  • min_token_length – Minimum tokens length (default is 2);
  • to_lowercase – To convert words to lowercase (default is True).
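The simple tokenization described above can be sketched in plain Python (a conceptual illustration only, not the DDF API; splitting on whitespace is an assumption):

```python
# Split a sentence into terms, optionally lowercase them, and drop
# tokens shorter than min_token_length.
def tokenize(sentence, min_token_length=2, to_lowercase=True):
    tokens = sentence.split()
    if to_lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if len(t) >= min_token_length]
```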
transform(data, input_col, output_col=None)
Parameters:
  • data – DDF
  • input_col – Input column with sentences;
  • output_col – Output column (‘input_col’_tokens if None);
Returns:

DDF