ML.Clustering

class ddf_library.functions.ml.clustering.Kmeans(n_clusters=3, max_iter=100, epsilon=0.01, init_mode='k-means||')

Bases: ddf_library.bases.ddf_model.ModelDDF

K-means clustering is an unsupervised learning method for unlabeled data (i.e., data without defined categories or groups). The algorithm looks for groups in the data, with the number of groups given by K (the n_clusters parameter). It works iteratively, assigning each data point to one of the K groups based on the provided features, so that data points are clustered by feature similarity.
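The iterative assignment described above can be sketched in plain Python. This is a minimal single-machine illustration, not the library's distributed implementation: the names lloyd_kmeans and squared_distance are hypothetical, and stopping once no center moves farther than epsilon is an assumption about how the tolerance is used.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, centers, max_iter=100, epsilon=0.01):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        # Assumed stopping rule: no center moved farther than epsilon.
        shift = max(squared_distance(o, n) ** 0.5 for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= epsilon:
            break
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = lloyd_kmeans(points, centers=[points[0], points[2]])
```

In this toy run the two initial centers are taken from the data; the initialization modes below describe how such starting centers can be chosen.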

Two well-known ways to initialize the set of cluster centers are “random” and “k-means||”:

  • random: Start from a set of randomly chosen initial centers;
  • k-means|| (Bahmani et al., Scalable K-Means++, VLDB 2012): A variant of k-means++ that tries to find dissimilar cluster centers. It starts with one random center and then makes several passes over the data; in each pass, more centers are chosen with probability proportional to their squared distance to the current set of centers. The result is a provable approximation to an optimal clustering.
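The sampling passes of k-means|| can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used by the library: oversample_candidates, its oversampling and rounds parameters, and the helper names are hypothetical. The sketch covers only the oversampling passes; in the paper, the resulting candidates are then weighted and reduced to K centers, a step omitted here.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost_to_nearest(p, centers):
    """Squared distance from a point to its closest center."""
    return min(squared_distance(p, c) for c in centers)


def oversample_candidates(points, oversampling, rounds, rng):
    """Start from one random center, then in each pass keep every point
    independently with probability proportional to its squared distance
    to the current candidate set."""
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        dists = [cost_to_nearest(p, candidates) for p in points]
        total = sum(dists)
        if total == 0:
            break  # every point already coincides with a candidate
        for p, d in zip(points, dists):
            # Points far from all current candidates are more likely to be kept.
            if rng.random() < min(1.0, oversampling * d / total):
                candidates.append(p)
    return candidates


rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
candidates = oversample_candidates(points, oversampling=2, rounds=3, rng=rng)
```

A point that is already a candidate has distance zero to the candidate set, so it is never picked twice.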
Example:
>>> clu = Kmeans(n_clusters=2, init_mode='random')
>>> ddf2 = clu.fit_transform(ddf1, feature_col=['col1', 'col2'])
Parameters:
  • n_clusters – Number of clusters (default, 3);
  • max_iter – Maximum number of iterations (default, 100);
  • epsilon – Tolerance value (default, 0.01);
  • init_mode – ‘random’ or ‘k-means||’ (default, ‘k-means||’).
check_fitted_model()
compute_cost()

Compute the cost of the current iteration.

Returns: float
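The documentation does not say how the cost is defined. A common choice for k-means is the sum of squared distances from each point to its nearest center; the sketch below assumes that definition, and the function names kmeans_cost and squared_distance are hypothetical rather than part of the library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center
    (assumed cost definition, not confirmed by the library docs)."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]
cost = kmeans_cost(points, centers)
```

Under this definition, a lower cost means the points sit closer to their assigned centers.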
fit(data, feature_col)
Parameters:
  • data – DDF
  • feature_col – Feature column names;
Returns:

trained model

fit_transform(data, feature_col, pred_col='prediction_kmeans')

Fit the model and transform.

Parameters:
  • data – DDF
  • feature_col – Feature column names;
  • pred_col – Output prediction column;
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters: filepath – The absolute path name;
Returns: self
Example:
>>> ml_model = Kmeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = Kmeans().fit(dataset, feature_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, feature_col=None, pred_col='prediction_kmeans')
Parameters:
  • data – DDF
  • feature_col – Feature column names (optional);
  • pred_col – Output prediction column;
Returns:

DDF