API Reference¶
This page gives an overview of all public DDF objects, functions and methods. All classes and functions exposed in the ddf namespace are public.
Contents¶
COMPSs Context¶
COMPSsContext.stop - Stop the DDF environment.
COMPSsContext.start_monitor - Start a web service monitor that reports the current status of the environment.
COMPSsContext.show_tasks - Show all tasks in the current code.
COMPSsContext.set_log - Set the log level.
COMPSsContext.context_status - Generates a DAG (as a dot file) and prints some information about the status of the process.
COMPSsContext.parallelize - Import data into DDF by distributing a DataFrame.
COMPSsContext.import_compss_data - Import an existing list of Pandas DataFrames into the DDF abstraction.
COMPSsContext.read.csv - Read a csv file.
COMPSsContext.read.json - Read a json file.
COMPSsContext.read.parquet - Read a parquet file.
COMPSsContext.read.shapefile - Read a shapefile using its shp and dbf files.
ETL¶
DDF.add_column - Merges two DataFrames column-wise.
DDF.cache - Forces the computation of all tasks in the current stack.
DDF.cast - Changes the data type of some columns.
DDF.columns - Returns the column names of the current DDF.
DDF.count_rows - Returns the number of rows in this DDF.
DDF.distinct - Returns a new DDF without duplicate rows.
DDF.drop - Removes some columns from DDF.
DDF.drop_duplicates - Alias for distinct.
DDF.except_all - Returns a new set containing the rows that are in the first frame but not in the second one, while preserving duplicates.
DDF.explode - Returns a new row for each element in the given array.
DDF.export_ddf - Exports DDF data as a list of Pandas DataFrames.
DDF.fillna - Replaces NaN elements with a given value or with the median, mean or mode.
DDF.filter - Filters elements based on a condition.
DDF.group_by - Returns a GroupedDDF with a set of methods for aggregations on a DDF.
DDF.hash_partition - Hash partitioning is a partitioning technique where data is stored separately in different fragments by a hash function.
DDF.intersect - Returns a new DDF containing the rows present in both DDFs.
DDF.intersect_all - Returns a new DDF containing the rows present in both DDFs, while preserving duplicates.
DDF.join - Joins two DDFs using the given join expression.
DDF.map - Applies a function to each row of this data set.
DDF.num_of_partitions - Returns the number of data partitions (Task parallelism).
DDF.range_partition - Range partitioning is a partitioning technique where ranges of data are stored separately in different fragments.
DDF.repartition - Repartitions the distributed data based on a fixed number of partitions or on a distribution list.
DDF.replace - Replaces one or more values with new ones.
DDF.sample - Returns a sampled subset.
DDF.save.csv - Saves a csv file.
DDF.save.json - Saves a json file.
DDF.save.parquet - Saves a parquet file.
DDF.save.pickle - Saves a pickle file.
DDF.schema - Returns a schema table in which each row contains the name of a column of the current DDF and its data type.
DDF.select - Performs a projection of selected columns.
DDF.select_expression - Projects a set of SQL expressions and returns a new DDF.
DDF.show - Collect the current DDF into a single DataFrame.
DDF.sort - Returns a sorted DDF by the specified column(s).
DDF.split - Randomly splits a DDF into two DDFs.
DDF.subtract - Returns a new set containing the rows that are in the first frame but not in the second one.
DDF.take - Returns the first num rows.
DDF.to_df - Returns the DDF contents as a Pandas DataFrame.
DDF.union - Combines two DDFs (concatenation) by column position.
DDF.union_by_name - Combines two DDFs (concatenation) by column name.
DDF.rename - Returns a new DDF by renaming an existing column.
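The difference between the hash_partition and range_partition entries above can be sketched in plain Python. This is only an illustration of the two partitioning ideas, not the DDF API: the function signatures, the `boundaries` list and the sample rows are hypothetical.

```python
from bisect import bisect_right


def hash_partition(rows, key, num_partitions):
    """Assign each row to a fragment chosen by hashing its key value."""
    fragments = [[] for _ in range(num_partitions)]
    for row in rows:
        # Rows with equal keys always land in the same fragment.
        fragments[hash(row[key]) % num_partitions].append(row)
    return fragments


def range_partition(rows, key, boundaries):
    """Assign each row to a fragment chosen by the range its key falls in.

    `boundaries` is a sorted list of split points (hypothetical parameter):
    keys below the first boundary go to fragment 0, keys from the first
    boundary up to (but excluding) the second go to fragment 1, and so on.
    """
    fragments = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        fragments[bisect_right(boundaries, row[key])].append(row)
    return fragments


rows = [{"id": i, "age": age} for i, age in enumerate([5, 17, 18, 42, 67, 90])]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "age", [18, 65])
```

Hash partitioning spreads rows by key identity, which suits joins and group-by operations, while range partitioning keeps neighbouring key values together, which suits sorting.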
Statistics¶
DDF.count_rows - Returns the number of rows in this DDF.
DDF.correlation - Calculates the Pearson Correlation Coefficient.
DDF.covariance - Calculates the sample covariance for the given columns.
DDF.cross_tab - Computes a pair-wise frequency table of the given columns.
DDF.describe - Computes basic statistics for numeric and string columns.
DDF.freq_items - Finds frequent items for columns.
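As a reminder of what the correlation and covariance entries above compute, here is a minimal pure-Python sketch of the sample covariance and the Pearson correlation coefficient. It is not the DDF implementation; the function names and the sample data are hypothetical.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sample covariance: sum of co-deviations divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_correlation(xs, ys):
    """Covariance normalised by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys is an exact linear function of xs
cov = sample_covariance(xs, ys)
corr = pearson_correlation(xs, ys)
```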
ML¶
Machine learning algorithms are divided into: classifiers, clustering algorithms, feature extraction operations, frequent pattern mining algorithms, evaluators and regressors.
ML.Classification¶
The ml.classification module includes some supervised classification algorithms:
ml.classification.KNearestNeighbors - K-Nearest Neighbors is an algorithm that can be used for both classification and regression problems, although it is more widely used for classification. To classify a point, the algorithm takes a simple majority vote among the K nearest neighbors of that point in the training set. The choice of the parameter K is crucial in this algorithm and depends on the dataset; values of one or three are the most common.
ml.classification.GaussianNB - The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.
ml.classification.LogisticRegression - Logistic regression is named for the function used at the core of the method, the logistic function. It is the go-to method for binary classification problems (problems with two class values).
ml.classification.SVM - A support vector machine (SVM) is a supervised learning model used for binary classification. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new points to one category or the other, making it a non-probabilistic binary linear classifier.
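The majority vote described for K-Nearest Neighbors can be sketched in a few lines of plain Python. This is an illustration of the voting rule only, assuming Euclidean distance; it is not the DDF classifier, and the function name, parameters and sample data are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Label `query` by a majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    nearest = sorted(
        range(len(train_points)), key=lambda i: dist(train_points[i], query)
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_labels = ["a", "a", "a", "b", "b"]
prediction = knn_predict(train_points, train_labels, (4.5, 5.0), k=3)
```

An odd K avoids ties in a two-class problem, which is one reason small odd values are popular.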
ML.Clustering¶
The ml.clustering module gathers popular unsupervised clustering algorithms:
ml.clustering.Kmeans - The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
ml.clustering.DBSCAN - A density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
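The iterative assignment described for K-means can be sketched as follows. This is a minimal single-machine version under stated assumptions (Euclidean distance, caller-supplied initial centroids, a fixed number of iterations); it is not the DDF implementation, and all names are hypothetical.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Alternate an assignment step and a centroid update step."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```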
ML.Evaluation¶
The ml.evaluation module includes score functions and performance metrics:
ml.evaluation.BinaryClassificationMetrics - Evaluator for binary classification.
ml.evaluation.MultilabelMetrics - Evaluator for multilabel classification.
ml.evaluation.RegressionMetrics - Evaluator for regression.
ML.Feature¶
The ml.feature module covers algorithms for working with features: extracting features from “raw” data; scaling, converting or modifying features; and selecting a subset from a larger set of features:
ml.feature.VectorAssembler - Vector Assembler is a transformer that combines a given list of columns into a single vector column.
ml.feature.VectorSlicer - Vector Slicer creates a new feature vector from a subarray of the original features.
ml.feature.Binarizer - Binarize data (set feature values to 0 or 1) according to a threshold.
ml.feature.OneHotEncoder - Encode categorical integer features as a one-hot numeric array.
ml.feature.Tokenizer - Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
ml.feature.RegexTokenizer - A regex-based tokenizer that extracts tokens by using the provided regex pattern (in Java dialect) to split the text.
ml.feature.RemoveStopWords - Remove stop-words is an operation to remove words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.
ml.feature.NGram - A feature transformer that converts the input array of strings into an array of n-grams.
ml.feature.CountVectorizer - Converts a collection of text documents to a matrix of token counts.
ml.feature.TfidfVectorizer - Term frequency-inverse document frequency (TF-IDF) is a numerical statistic transformation that is intended to reflect how important a word is to a document in a collection or corpus.
ml.feature.StringIndexer - StringIndexer indexes a feature by encoding a string column as a column containing indexes.
ml.feature.IndexToString - Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.
ml.feature.MaxAbsScaler - MaxAbsScaler transforms a dataset of feature rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature.
ml.feature.MinMaxScaler - MinMaxScaler transforms a dataset of feature rows, rescaling each feature to a specific range (often [0, 1]).
ml.feature.StandardScaler - StandardScaler transforms a dataset of feature rows, rescaling each feature by its standard score.
ml.feature.PCA - Principal component analysis (PCA) is used widely in dimensionality reduction.
ml.feature.PolynomialExpansion - Perform feature expansion in a polynomial space.
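The three scalers listed above differ only in the formula applied to each feature. The following plain-Python sketch shows those formulas on a single feature column. It is not the DDF API: the function names and parameters are hypothetical, the standard score here uses the population standard deviation as an assumption, and each function assumes the feature is not constant.

```python
from statistics import mean, pstdev


def max_abs_scale(values):
    """Divide by the largest absolute value, giving results in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Map the smallest value to `low` and the largest to `high`."""
    smallest, largest = min(values), max(values)
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


def standard_scale(values):
    """Standard score: subtract the mean, divide by the standard deviation."""
    # Assumption: population standard deviation (divides by n, not n - 1).
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


feature = [-4.0, 0.0, 2.0, 6.0]
scaled_max_abs = max_abs_scale(feature)
scaled_min_max = min_max_scale(feature)
scaled_standard = standard_scale(feature)
```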
ML.Frequent Pattern Mining¶
ml.fpm.Apriori - Apriori is an algorithm to find frequent item sets.
ml.feature.AssociationRules - AssociationRules implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent.
ML.Regression¶
ml.regression.LinearRegression - Linear regression using the method of least squares (works only for 2-D data) or using Stochastic Gradient Descent.
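For the least-squares case, a minimal sketch of the closed-form fit is shown below. It assumes that "2-D data" means one feature and one target, which is an interpretation rather than something this page states; the function name and sample data are hypothetical and this is not the DDF implementation.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope is the ratio of the co-deviation sum to the x-deviation sum.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
```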
Geographic Operations¶
DDF.geo_within - Returns the sector that each point belongs to.
Graph Algorithms¶
graph.PageRank - Perform PageRank.
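As background for the PageRank entry, here is a minimal power-iteration sketch in plain Python. It is not the DDF implementation: the function name, the `damping` and `iterations` parameters and the adjacency-dict input format are hypothetical, and it assumes every link target also appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of out-links."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from random jumps.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                # Split this node's damped rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

The ranks form a probability distribution, so they sum to one; in this small graph node "b" ends up with the lowest rank because it has a single incoming link carrying half of the rank of "a".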