API Reference

This page gives an overview of all public DDF objects, functions and methods. All classes and functions exposed in the ddf namespace are public.

Contents

COMPSs Context

COMPSsContext.stop - Stop the DDF environment.

COMPSsContext.start_monitor - Start a web service monitor that reports the current status of the environment.

COMPSsContext.show_tasks - Show all tasks in the current code.

COMPSsContext.set_log - Set the log level.

COMPSsContext.context_status - Generates a DAG (as a dot file) and prints some information about the status of the process on screen.

COMPSsContext.parallelize - Import data into DDF by distributing a DataFrame.

COMPSsContext.import_compss_data - Import an existing list of pandas DataFrames into the DDF abstraction.

COMPSsContext.read.csv - Read a csv file.

COMPSsContext.read.json - Read a json file.

COMPSsContext.read.parquet - Read a parquet file.

COMPSsContext.read.shapefile - Read a shapefile using its shp and dbf files.

ETL

DDF.add_column - Merges two DataFrames column-wise.

DDF.cache - Forces the computation of all tasks in the current stack.

DDF.cast - Changes the data type of some columns.

DDF.columns - Returns the column names of the current DDF.

DDF.count_rows - Returns the number of rows in this DDF.

DDF.distinct - Returns a new DDF with duplicate rows removed.

DDF.drop - Removes some columns from DDF.

DDF.drop_duplicates - Alias for distinct.

DDF.except_all - Returns a new set containing the rows that are in the first frame but not in the second one, while preserving duplicates.

DDF.explode - Returns a new row for each element in the given array.

DDF.export_ddf - Exports DDF data as a list of pandas DataFrames.

DDF.fillna - Replaces NaN elements with a given value or with the median, mean or mode.

DDF.filter - Filters elements based on a condition.

DDF.group_by - Returns a GroupedDDF with a set of methods for aggregations on a DDF.

DDF.hash_partition - Hash partitioning is a partitioning technique where a hash function decides which fragment each piece of data is stored in.

DDF.intersect - Returns a new DDF containing the rows present in both DDFs.

DDF.intersect_all - Returns a new DDF containing the rows present in both DDFs while preserving duplicates.

DDF.join - Joins two DDFs using the given join expression.

DDF.map - Applies a function to each row of this data set.

DDF.num_of_partitions - Returns the number of data partitions (Task parallelism).

DDF.range_partition - Range partitioning is a partitioning technique where ranges of data are stored separately in different fragments.

DDF.repartition - Repartitions distributed data based on a fixed number of partitions or on a distribution list.

DDF.replace - Replaces one or more values with new ones.

DDF.sample - Returns a sampled subset.

DDF.save.csv - Saves a csv file.

DDF.save.json - Saves a json file.

DDF.save.parquet - Saves a parquet file.

DDF.save.pickle - Saves a pickle file.

DDF.schema - Returns a schema table where each row contains a column name and its data type in the current DDF.

DDF.select - Performs a projection of selected columns.

DDF.select_expression - Projects a set of SQL expressions and returns a new DDF.

DDF.show - Collect the current DDF into a single DataFrame.

DDF.sort - Returns a sorted DDF by the specified column(s).

DDF.split - Randomly splits a DDF into two DDFs.

DDF.subtract - Returns a new set containing the rows that are in the first frame but not in the second one.

DDF.take - Returns the first num rows.

DDF.to_df - Returns the DDF contents as a pandas DataFrame.

DDF.union - Combines two DDFs (concatenate) by column position.

DDF.union_by_name - Combines two DDFs (concatenate) by column name.

DDF.rename - Returns a new DDF by renaming an existing column.
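
The two partitioning strategies named under DDF.hash_partition and DDF.range_partition can be illustrated in plain Python. This is only a conceptual sketch and does not use the DDF API: the function names, the row-as-dict layout and the choice of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right


def hash_partition(rows, key, num_partitions):
    """Place each row in the fragment chosen by a hash of row[key]."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        # CRC32 is stable between runs, unlike the built-in hash() on strings.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % num_partitions].append(row)
    return parts


def range_partition(rows, key, boundaries):
    """Place each row in the fragment whose key range contains row[key].

    `boundaries` must be sorted. With boundaries [10, 20] the fragments hold
    keys < 10, keys in [10, 20) and keys >= 20.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts
```

With hash partitioning, rows that share a key value always land in the same fragment; with range partitioning, fragments additionally follow the key order.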

Statistics

DDF.count_rows - Returns the number of rows in this DDF.

DDF.correlation - Calculates the Pearson Correlation Coefficient.

DDF.covariance - Calculates the sample covariance for the given columns.

DDF.cross_tab - Computes a pair-wise frequency table of the given columns.

DDF.describe - Computes basic statistics for numeric and string columns.

DDF.freq_items - Finds frequent items for columns.
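
For reference, the quantities behind DDF.covariance and DDF.correlation can be written out in a few lines of plain Python. This is a sketch of the textbook formulas for two columns given as lists, not the DDF implementation; the function names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sample covariance: sum of co-deviations divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson_correlation(xs, ys):
    """Covariance normalised by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )
```

The correlation is always in [-1, 1]; both inputs need at least two values and must not be constant.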

ML

Machine learning algorithms are divided into: classifiers, clustering algorithms, feature extraction operations, frequent pattern mining algorithms, evaluators and regressors.

ML.Classification

The ml.classification module includes some supervised classification algorithms:

ml.classification.KNearestNeighbors - K-Nearest Neighbors is an algorithm that can be used for both classification and regression problems, although it is more widely used for classification. To classify a point, the algorithm takes a simple majority vote among its K nearest neighbors in the training set. The choice of the parameter K is crucial in this algorithm and depends on the dataset; small values such as one or three are the most common.

ml.classification.GaussianNB - The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.

ml.classification.LogisticRegression - Logistic regression is named for the function used at the core of the method, the logistic function. It is the go-to method for binary classification problems (problems with two class values).

ml.classification.SVM - A support vector machine (SVM) is a supervised learning model used for binary classification. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new points to one category or the other, making it a non-probabilistic binary linear classifier.
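
The majority vote described for K-Nearest Neighbors can be sketched in plain Python as follows. The code is illustrative only and is not the ml.classification.KNearestNeighbors API; `knn_predict`, its arguments and the Euclidean distance are assumptions of the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest neighbors.

    `train` is a list of (features, label) pairs; `query` is a feature tuple.
    """
    # Sort the training points by Euclidean distance to the query, keep k.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With two classes, an odd K avoids tied votes, which is one reason values such as one or three are popular.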

ML.Clustering

The ml.clustering module gathers popular unsupervised clustering algorithms:

ml.clustering.Kmeans - The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.

ml.clustering.DBSCAN - A density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
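
The iterative assignment performed by K-means can be sketched in plain Python. This is a minimal version of the standard assign-then-update loop, not the ml.clustering.Kmeans implementation; the function name, the fixed iteration count and the caller-supplied initial centroids are assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Return (centroids, labels) after a fixed number of K-means iterations."""

    def nearest(point):
        # Index of the centroid closest to `point`.
        return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, [nearest(point) for point in points]
```

A centroid that attracts no points keeps its previous position; real implementations often re-seed it instead.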

ML.Evaluation

The ml.evaluation module includes score functions and performance metrics:

ml.evaluation.BinaryClassificationMetrics - Evaluator for binary classification.

ml.evaluation.MultilabelMetrics - Evaluator for multilabel classification.

ml.evaluation.RegressionMetrics - Evaluator for regression.

ML.Feature

The ml.feature module covers algorithms for working with features: extracting features from “raw” data; scaling, converting or modifying features; and selecting a subset from a larger set of features:

ml.feature.VectorAssembler - Vector Assembler is a transformer that combines a given list of columns into a single vector column.

ml.feature.VectorSlicer - Vector Slicer creates a new feature vector from a subarray of the original features.

ml.feature.Binarizer - Binarize data (set feature values to 0 or 1) according to a threshold.

ml.feature.OneHotEncoder - Encode categorical integer features as a one-hot numeric array.

ml.feature.Tokenizer - Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.

ml.feature.RegexTokenizer - A regex based tokenizer that extracts tokens by using the provided regex pattern (in Java dialect) to split the text.

ml.feature.RemoveStopWords - Remove stop-words is an operation that removes words which should be excluded from the input, typically because they appear frequently and don’t carry as much meaning.

ml.feature.NGram - A feature transformer that converts the input array of strings into an array of n-grams.

ml.feature.CountVectorizer - Converts a collection of text documents to a matrix of token counts.

ml.feature.TfidfVectorizer - Term frequency-inverse document frequency (TF-IDF) is a numerical statistic transformation that is intended to reflect how important a word is to a document in a collection or corpus.

ml.feature.StringIndexer - StringIndexer indexes a feature by encoding a string column as a column containing indexes.

ml.feature.IndexToString - Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.

ml.feature.MaxAbsScaler - MaxAbsScaler transforms a dataset of feature rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature.

ml.feature.MinMaxScaler - MinMaxScaler transforms a dataset of feature rows, rescaling each feature to a specific range (often [0, 1]).

ml.feature.StandardScaler - StandardScaler transforms a dataset of feature rows, rescaling each feature by the standard score.

ml.feature.PCA - Principal component analysis (PCA) is used widely in dimensionality reduction.

ml.feature.PolynomialExpansion - Perform feature expansion in a polynomial space.
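
The three scalers listed above differ only in the formula applied to each feature. The sketch below applies each formula to a single column given as a list of floats. It is plain Python rather than the ml.feature API, the function names are hypothetical, and the use of the population standard deviation in the standard score is an assumption (a library may use the sample version).

```python
import math


def max_abs_scale(values):
    """Divide by the largest absolute value, giving results in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Map the smallest value to `low` and the largest to `high`."""
    smallest, largest = min(values), max(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]


def standard_scale(values):
    """Standard score: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
```

All three functions assume a non-constant column; a constant column would cause a division by zero and needs special handling.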

ML.Frequent Pattern Mining

ml.fpm.Apriori - Apriori is an algorithm to find frequent item sets.

ml.feature.AssociationRules - AssociationRules implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent.
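
As a rough illustration of what finding frequent item sets means, the level-wise search used by Apriori can be sketched in plain Python. This is a small sequential version for teaching purposes, not the parallel ml.fpm.Apriori implementation; the function name and the relative `min_support` threshold are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / total

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while current:
        survivors = {c: support(c) for c in current}
        survivors = {c: s for c, s in survivors.items() if s >= min_support}
        frequent.update(survivors)
        size += 1
        # Join step: merge surviving itemsets into candidates one item larger.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == size}
        # Prune step: keep a candidate only if all of its subsets were frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in survivors for sub in combinations(c, size - 1))
        }
    return frequent
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if every smaller subset of it is already known to be frequent.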

ML.Regression

ml.regression.LinearRegression - Linear regression using the method of least squares (works only for 2-D data) or Stochastic Gradient Descent.
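
For the 2-D case mentioned above (one feature and one target), the least squares fit has a closed form. The sketch below shows that formula in plain Python; it is not the ml.regression.LinearRegression API, and the function name is hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept) of the line minimising the squared error."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # slope = sum of co-deviations / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Stochastic Gradient Descent reaches a similar line iteratively and extends to more than one feature, at the cost of tuning a learning rate.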

Geographic Operations

DDF.geo_within - Returns the sector that each point belongs to.

Graph Algorithms

graph.PageRank - Perform PageRank.
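
For readers unfamiliar with PageRank, the computation can be sketched as a power iteration in plain Python. This is a generic textbook version, not the graph.PageRank implementation; the function name, the adjacency-dict input, the damping factor of 0.85 and the fixed iteration count are assumptions of the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return {node: rank} for a graph given as {source: [targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    count = len(nodes)
    ranks = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        new_ranks = {
            node: (1.0 - damping) / count + damping * dangling / count
            for node in nodes
        }
        # Every node passes an equal share of its rank to each of its targets.
        for source, targets in links.items():
            for target in targets:
                new_ranks[target] += damping * ranks[source] / len(targets)
        ranks = new_ranks
    return ranks
```

The ranks always sum to 1, and a node that receives links from many well-ranked nodes ends up with a higher rank.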