ML.Regression

class ddf_library.functions.ml.regression.OrdinaryLeastSquares

Bases: ddf_library.bases.ddf_model.ModelDDF

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables and the single output variable. More specifically, it assumes that y can be calculated from a linear combination of the input variables (x).

When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, literature from statistics often refers to the method as multiple linear regression.

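For simple linear regression, the coefficients are estimated as follows, where n is the number of samples and m_x and m_y are the means of x and y:

b1 = (sum(x*y) - n*m_x*m_y) / (sum(x²) - n*m_x²)
b0 = m_y - b1*m_x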
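
The closed-form estimates for simple linear regression can be sketched in plain Python. This is an illustrative single-machine sketch, not the library's distributed implementation; the helper name simple_ols and the sample data are hypothetical.

```python
def simple_ols(x, y):
    """Closed-form slope (b1) and intercept (b0) for one input variable."""
    n = len(x)
    m_x = sum(x) / n  # mean of x
    m_y = sum(y) / n  # mean of y
    # Numerator and denominator of the slope estimate.
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * m_x * m_y
    s_xx = sum(xi * xi for xi in x) - n * m_x ** 2
    b1 = s_xy / s_xx
    b0 = m_y - b1 * m_x
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = simple_ols(x, y)
print(b0, b1)  # 1.0 2.0
```

Because the sample points lie exactly on a line, the estimates recover its intercept and slope.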

Example:
>>> model = OrdinaryLeastSquares().fit(ddf1, feature_col=['col1', 'col2'], label_col='y')
>>> ddf2 = model.transform(ddf1)
check_fitted_model()
fit(data, feature_col, label_col)

Fit the model.

Parameters:
  • data – DDF
  • feature_col – Feature column name;
  • label_col – Label column name;
Returns:

trained model

fit_transform(data, feature_col, label_col, pred_col='pred_LinearReg')

Fit the model and transform.

Parameters:
  • data – DDF
  • feature_col – Feature column name;
  • label_col – Label column name;
  • pred_col – Output prediction column (default, ‘pred_LinearReg’);
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:
  • filepath – The absolute path name;
Returns:

self

Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, feature_col=None, pred_col='pred_LinearReg')
Parameters:
  • data – DDF
  • feature_col – Feature column name;
  • pred_col – Output prediction column (default, ‘pred_LinearReg’);
Returns:

DDF

class ddf_library.functions.ml.regression.GDRegressor(max_iter=100, alpha=1, tol=0.001)

Bases: ddf_library.bases.ddf_model.ModelDDF

Linear model fitted by minimizing a regularized empirical loss with Gradient Descent.

Example:
>>> model = GDRegressor().fit(ddf1, feature_col=['col1', 'col2'], label_col='label')
>>> ddf2 = model.transform(ddf1)
Parameters:
  • max_iter – Maximum number of iterations (default, 100);
  • alpha – Learning rate (default, 1). Sets the step size used by Gradient Descent when updating the hypothesis after each iteration. Up to a point, higher values make the algorithm converge on the optimal solution more quickly. If the value is set too high, the algorithm fails to converge at all and yields successively larger errors on each iteration;
  • tol – Tolerance for the stopping criterion (default, 1e-3).
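
The effect of the learning rate can be seen in a minimal single-machine sketch of batch gradient descent. The sketch assumes an unregularized mean squared error loss and a stopping rule based on the change in cost; neither is confirmed to match what GDRegressor does internally, and the function name gd_linear and the data are hypothetical.

```python
def gd_linear(x, y, alpha, max_iter=100, tol=1e-3):
    """Batch gradient descent for y ~ b0 + b1*x; returns coefficients and cost history."""
    n = len(x)
    b0, b1 = 0.0, 0.0
    costs = []
    for _ in range(max_iter):
        residuals = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
        costs.append(sum(r * r for r in residuals) / (2 * n))
        # Assumed stopping rule: stop once the cost changes by less than tol.
        if len(costs) > 1 and abs(costs[-2] - costs[-1]) < tol:
            break
        # Gradient of the cost with respect to b0 and b1.
        g0 = sum(residuals) / n
        g1 = sum(r * xi for r, xi in zip(residuals, x)) / n
        b0 -= alpha * g0
        b1 -= alpha * g1
    return b0, b1, costs


# Hypothetical data lying on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

_, _, good_costs = gd_linear(x, y, alpha=0.1)  # small step: cost shrinks
_, _, bad_costs = gd_linear(x, y, alpha=1.0)   # step too large: cost grows

print(good_costs[0], good_costs[-1])
print(bad_costs[0], bad_costs[-1])
```

On this data the cost decreases with alpha=0.1 and increases on every iteration with alpha=1.0. If training diverges with the default learning rate, a smaller alpha or rescaled features may help.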
check_fitted_model()
fit(data, feature_col, label_col)

Fit the model.

Parameters:
  • data – DDF
  • feature_col – Feature column name;
  • label_col – Label column name;
Returns:

trained model

fit_transform(data, feature_col, label_col, pred_col='pred_LinearReg')

Fit the model and transform.

Parameters:
  • data – DDF
  • feature_col – Feature column name;
  • label_col – Label column name;
  • pred_col – Output prediction column (default, ‘pred_LinearReg’);
Returns:

DDF

load_model(filepath)

Load a machine learning model from a binary file in storage.

Parameters:
  • filepath – The absolute path name;
Returns:

self

Example:
>>> ml_model = KMeans().load_model('hdfs://localhost:9000/model')
save_model(filepath, overwrite=True)

Save a machine learning model as a binary file in storage.

Parameters:
  • filepath – The output absolute path name;
  • overwrite – Overwrite if file already exists (default, True);
Returns:

self

Example:
>>> cls = KMeans().fit(dataset, input_col=['col1', 'col2'])
>>> cls.save_model('hdfs://localhost:9000/trained_model')
transform(data, feature_col=None, pred_col='pred_LinearReg')
Parameters:
  • data – DDF
  • feature_col – Feature column name;
  • pred_col – Output prediction column (default, ‘pred_LinearReg’);
Returns:

DDF