API Reference

This page gives an overview of all public DDF objects, functions, and methods. All classes and functions exposed in the ddf namespace are public.

Contents

COMPSs Context
ETL
Statistics
ML

ML.Classification

ML.Clustering

ML.Feature

ML.Frequent Pattern Mining

ML.Evaluation

ML.Regression

Geographic Operations
Graph Algorithms

COMPSs Context

COMPSsContext.stop - Stop the DDF environment.

COMPSsContext.start_monitor - Start a web service monitor that reports the current status of the environment.

COMPSsContext.show_tasks - Show all tasks in the current code.

COMPSsContext.set_log - Set the log level.

COMPSsContext.context_status - Generates a DAG (as a dot file) and prints some information about the execution status on screen.

COMPSsContext.parallelize - Import data into a DDF by distributing a DataFrame across partitions.

COMPSsContext.import_compss_data - Import an existing list of Pandas DataFrames into the DDF abstraction.

COMPSsContext.read.csv - Read a csv file.

COMPSsContext.read.json - Read a json file.

COMPSsContext.read.parquet - Read a parquet file.

COMPSsContext.read.shapefile - Read a shapefile using its shp and dbf files.

ETL

DDF.add_column - Merges two DataFrames column-wise.

DDF.cache - Forces the computation of all tasks in the current stack.

DDF.cast - Changes the data type of some columns.

DDF.columns - Returns the column names of the current DDF.

DDF.count_rows - Returns the number of rows in this DDF.

DDF.distinct - Returns a new DDF without duplicate rows.

DDF.drop - Removes some columns from DDF.

DDF.drop_duplicates - Alias for distinct.

DDF.except_all - Returns a new set containing the rows that are in the first frame but not in the second one, while preserving duplicates.

DDF.explode - Returns a new row for each element in the given array.

DDF.export_ddf - Exports DDF data as a list of Pandas DataFrames.

DDF.fillna - Replaces NaN elements with a given value or with the median, mean, or mode.

DDF.filter - Filters elements based on a condition.

DDF.group_by - Returns a GroupedDDF with a set of methods for aggregations on a DDF.

DDF.hash_partition - Hash partitioning is a partitioning technique where data is stored separately in different fragments by a hash function.

DDF.intersect - Returns a new DDF containing the rows present in both DDFs.

DDF.intersect_all - Returns a new DDF containing the rows present in both DDFs, while preserving duplicates.

DDF.join - Joins two DDFs using the given join expression.

DDF.map - Applies a function to each row of this data set.

DDF.num_of_partitions - Returns the number of data partitions (Task parallelism).

DDF.range_partition - Range partitioning is a partitioning technique where ranges of data are stored separately in different fragments.

DDF.repartition - Repartitions distributed data based on a fixed number of partitions or on a distribution list.

DDF.replace - Replaces one or more values with new ones.

DDF.sample - Returns a sampled subset.

DDF.save.csv - Saves a csv file.

DDF.save.json - Saves a json file.

DDF.save.parquet - Saves a parquet file.

DDF.save.pickle - Saves a pickle file.

DDF.schema - Returns a schema table where each row contains a column name and its data type in the current DDF.

DDF.select - Performs a projection of selected columns.

DDF.select_expression - Projects a set of SQL expressions and returns a new DDF.

DDF.show - Collect the current DDF into a single DataFrame.

DDF.sort - Returns a sorted DDF by the specified column(s).

DDF.split - Randomly splits a DDF into two DDFs.

DDF.subtract - Returns a new set containing the rows that are in the first frame but not in the second one.

DDF.take - Returns the first num rows.

DDF.to_df - Returns the DDF contents as a Pandas DataFrame.

DDF.union - Combines two DDFs (concatenate) by column position.

DDF.union_by_name - Combines two DDFs (concatenate) by column name.

DDF.rename - Returns a new DDF by renaming an existing column.
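
The two partitioning techniques described for DDF.hash_partition and DDF.range_partition can be illustrated outside the library. The following is a minimal plain-Python sketch, not the DDF implementation: the helper names hash_partition and range_partition, the use of crc32 as the hash function, and the boundary list are all assumptions made for illustration.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Assign each row to a fragment chosen by hashing its key value."""
    fragments = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used instead of hash() so the result is stable across runs.
        index = crc32(str(row[key]).encode("utf-8")) % num_partitions
        fragments[index].append(row)
    return fragments


def range_partition(rows, key, boundaries):
    """Assign each row to a fragment according to sorted range boundaries."""
    fragments = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right counts how many boundaries are <= the value,
        # which is the index of the range the value falls into.
        fragments[bisect_right(boundaries, row[key])].append(row)
    return fragments


rows = [{"id": i, "age": age} for i, age in enumerate([15, 42, 67, 23, 35, 80])]

by_hash = hash_partition(rows, "id", 3)
# Ages below 30 go to fragment 0, ages in [30, 60) to fragment 1, the rest to 2.
by_range = range_partition(rows, "age", [30, 60])
```

Hash partitioning spreads rows without regard to order, while range partitioning keeps rows with nearby key values in the same fragment, which is what makes it useful before a sort.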

Statistics

DDF.count_rows - Returns the number of rows in this DDF.

DDF.correlation - Calculates the Pearson Correlation Coefficient.

DDF.covariance - Calculates the sample covariance for the given columns.

DDF.cross_tab - Computes a pair-wise frequency table of the given columns.

DDF.describe - Computes basic statistics for numeric and string columns.

DDF.freq_items - Finds frequent items for columns.
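
For reference, the quantities named by DDF.covariance and DDF.correlation can be computed by hand on two small columns. This is a plain-Python sketch of the standard formulas, not the DDF code; the function names sample_covariance and pearson_correlation are hypothetical.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sample covariance: sum of co-deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_correlation(xs, ys):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

cov = sample_covariance(xs, ys)
corr = pearson_correlation(xs, ys)
```

Because ys is an exact multiple of xs, the correlation here is 1 up to floating-point rounding.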

ML

Machine learning algorithms are divided into: classifiers, clustering algorithms, feature extraction operations, frequent pattern mining algorithms, evaluators, and regressors.

ML.Classification

The ml.classification module includes some supervised classification algorithms:

ml.classification.KNearestNeighbors - K-Nearest Neighbors is an algorithm that can be used for both classification and regression problems, although it is more widely used for classification. To classify a point, the algorithm takes a simple majority vote among the K nearest neighbors of that point in the training set. The choice of the parameter K is crucial and depends on the dataset; values of one or three are the most common.

ml.classification.GaussianNB - The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.

ml.classification.LogisticRegression - Logistic regression is named for the function used at the core of the method, the logistic function. It is the go-to method for binary classification problems (problems with two class values).

ml.classification.SVM - A support vector machine (SVM) is a supervised learning model used for binary classification. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new points to one category or the other, making it a non-probabilistic binary linear classifier.
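
The majority vote described for K-Nearest Neighbors can be sketched in a few lines of plain Python. This is an illustration of the technique only, assuming Euclidean distance; knn_predict and the toy data are hypothetical and say nothing about the parameters of ml.classification.KNearestNeighbors.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    nearest = sorted(
        range(len(train_points)), key=lambda i: dist(train_points[i], query)
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
near_second_group = knn_predict(train_points, train_labels, (5.1, 5.0), k=3)
```

An odd K avoids tied votes in a two-class problem, which is one reason small odd values are a common starting point.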

ML.Clustering

The ml.clustering module gathers popular unsupervised clustering algorithms:

ml.clustering.Kmeans - The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.

ml.clustering.DBSCAN - A density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
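
The iterative assignment described for K-means can be sketched as follows. This is a minimal plain-Python version of Lloyd's iteration with fixed starting centroids, written for illustration; the function name kmeans, its arguments, and the sample points are assumptions and do not reflect the interface of ml.clustering.Kmeans.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Lloyd's iteration: assign points to the nearest centroid, then update."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the group of its closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if the group is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

A production implementation would normally choose the initial centroids randomly or with a seeding heuristic and stop once the assignments no longer change.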

ML.Evaluation

The ml.evaluation module includes score functions and performance metrics:

ml.evaluation.BinaryClassificationMetrics - Evaluator for binary classification.

ml.evaluation.MultilabelMetrics - Evaluator for multilabel classification.

ml.evaluation.RegressionMetrics - Evaluator for regression.
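
As an illustration of the kind of score a binary classification evaluator reports, the sketch below computes precision, recall, and F1 from true and predicted labels. Which metrics the DDF evaluators actually expose is not stated here, so binary_metrics and its output keys are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```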

ML.Feature

The ml.feature module covers algorithms for working with features: extracting features from “raw” data; scaling, converting, or modifying features; and selecting a subset from a larger set of features:

ml.feature.VectorAssembler - Vector Assembler is a transformer that combines a given list of columns into a single vector column.

ml.feature.VectorSlicer - Vector Slicer creates a new feature vector from a subarray of the original features.

ml.feature.Binarizer - Binarize data (set feature values to 0 or 1) according to a threshold.

ml.feature.OneHotEncoder - Encode categorical integer features as a one-hot numeric array.

ml.feature.Tokenizer - Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.

ml.feature.RegexTokenizer - A regex-based tokenizer that extracts tokens by using the provided regex pattern (in Java dialect) to split the text.

ml.feature.RemoveStopWords - Remove stop-words is an operation that removes words which should be excluded from the input, typically because they appear frequently and don’t carry as much meaning.

ml.feature.NGram - A feature transformer that converts the input array of strings into an array of n-grams.

ml.feature.CountVectorizer - Converts a collection of text documents to a matrix of token counts.

ml.feature.TfidfVectorizer - Term frequency-inverse document frequency (TF-IDF) is a numerical statistic transformation that is intended to reflect how important a word is to a document in a collection or corpus.

ml.feature.StringIndexer - StringIndexer indexes a feature by encoding a string column as a column containing indexes.

ml.feature.IndexToString - Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.

ml.feature.MaxAbsScaler - MaxAbsScaler transforms a dataset of feature rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature.

ml.feature.MinMaxScaler - MinMaxScaler transforms a dataset of feature rows, rescaling each feature to a specific range (often [0, 1]).

ml.feature.StandardScaler - StandardScaler transforms a dataset of feature rows, rescaling each feature by its standard score.

ml.feature.PCA - Principal component analysis (PCA) is used widely in dimensionality reduction.

ml.feature.PolynomialExpansion - Performs feature expansion in a polynomial space.
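
The three scalers listed above differ only in the formula applied to each feature column. The sketch below applies each formula to a single column in plain Python; it is not the DDF code, the function names are hypothetical, the standard score uses the population standard deviation as an assumption, and constant columns (which would divide by zero) are not handled.

```python
from statistics import mean, pstdev


def max_abs_scale(column):
    """Divide by the maximum absolute value, mapping values into [-1, 1]."""
    largest = max(abs(v) for v in column)
    return [v / largest for v in column]


def min_max_scale(column, low=0.0, high=1.0):
    """Linearly map the column's minimum to `low` and its maximum to `high`."""
    smallest, largest = min(column), max(column)
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in column]


def standard_scale(column):
    """Replace each value by its standard score (value - mean) / std."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


feature = [-4.0, 0.0, 2.0, 6.0]

scaled_max_abs = max_abs_scale(feature)
scaled_min_max = min_max_scale(feature)
scaled_standard = standard_scale(feature)
```

MaxAbsScaler keeps zeros at zero, which matters for sparse data, whereas the other two shift the values and so do not.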

ML.Frequent Pattern Mining

ml.fpm.Apriori - Apriori is an algorithm to find frequent item sets.

ml.feature.AssociationRules - AssociationRules implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent.
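
To make the two entries above concrete, the following plain-Python sketch finds frequent item sets level by level in the Apriori style and then scores one single-consequent rule by its confidence. It is a sequential toy version for illustration, not the parallel DDF implementation; apriori, rule_confidence, and the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for itemsets meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while current:
        frequent.update({s: support(s) for s in current})
        size += 1
        # Join step: merge frequent sets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
            and support(c) >= min_support
        }
    return frequent


def rule_confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(A | B) / support(A)."""
    return frequent[antecedent | consequent] / frequent[antecedent]


transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
frequent = apriori(transactions, min_support=0.6)
confidence = rule_confidence(frequent, frozenset({"bread"}), frozenset({"milk"}))
```

With a minimum support of 0.6, every single item and every pair is frequent in this data, but the three-item set appears in only two of five transactions and is dropped.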

ML.Regression

ml.regression.LinearRegression - Linear regression using the method of least squares (works only for 2-D data) or using Stochastic Gradient Descent.
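
For the least squares case, the fit has a closed form when the data consists of one feature and one target, which is how "2-D data" is interpreted in this sketch (an assumption). The code is plain Python for illustration; fit_line is a hypothetical name and not part of ml.regression.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-deviation of x and y divided by the deviation of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

Stochastic Gradient Descent reaches a similar line iteratively and extends to many features, at the cost of choosing a learning rate and a number of iterations.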

Geographic Operations

DDF.geo_within - Returns the sectors to which each point belongs.

Graph Algorithms

graph.PageRank - Perform PageRank.
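
As background for graph.PageRank, the sketch below runs the classic power iteration on a small directed edge list in plain Python. It is an illustration of the algorithm under common defaults (damping factor 0.85, dangling nodes spread evenly), not the DDF implementation; the pagerank function and its parameters are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration over a directed edge list; returns {node: rank}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [dst for src, dst in edges if src == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # dangling nodes spread evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
```

In this graph node c collects links from both b and d and ends with the highest rank, while d has no incoming links and keeps only the teleportation share.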