API Reference¶
This page gives an overview of all public DDF objects, functions and methods. All classes and functions exposed in the ddf namespace are public.
Contents¶
COMPSs Context¶
COMPSsContext.stop
- Stop the DDF environment.
COMPSsContext.start_monitor
- Start a web service monitor that reports the current status of the environment.
COMPSsContext.show_tasks
- Show all tasks in the current code.
COMPSsContext.set_log
- Set the log level.
COMPSsContext.context_status
- Generates a DAG (as a dot file) and prints some information about the status of the process on screen.
COMPSsContext.parallelize
- Import data into DDF by distributing a DataFrame.
COMPSsContext.import_compss_data
- Import an existing list of pandas DataFrames into the DDF abstraction.
COMPSsContext.read.csv
- Read a csv file.
COMPSsContext.read.json
- Read a json file.
COMPSsContext.read.parquet
- Read a parquet file.
COMPSsContext.read.shapefile
- Reads a shapefile using the shp and dbf files.
ETL¶
DDF.add_column
- Merges two DataFrames, column-wise.
DDF.cache
- Forces the computation of all tasks in the current stack.
DDF.cast
- Change the data type of some columns.
DDF.columns
- Returns the column names in the current DDF.
DDF.count_rows
- Returns the number of rows in this DDF.
DDF.distinct
- Returns a new DDF without duplicate rows.
DDF.drop
- Removes some columns from DDF.
DDF.drop_duplicates
- Alias for distinct.
DDF.except_all
- Returns a new set containing the rows that are in the first frame but not in the second one, while preserving duplicates.
DDF.explode
- Returns a new row for each element in the given array.
DDF.export_ddf
- Export DDF data as a list of pandas DataFrames.
DDF.fillna
- Replace NaN elements by value or by median, mean or mode.
DDF.filter
- Filters elements based on a condition.
DDF.group_by
- Returns a GroupedDDF with a set of methods for aggregations on a DDF.
DDF.hash_partition
- Hash partitioning is a partitioning technique where data is stored separately in different fragments by a hash function.
DDF.intersect
- Returns a new DDF containing the rows present in both DDFs.
DDF.intersect_all
- Returns a new DDF containing the rows present in both DDFs, while preserving duplicates.
DDF.join
- Joins two DDFs using the given join expression.
DDF.map
- Applies a function to each row of this data set.
DDF.num_of_partitions
- Returns the number of data partitions (Task parallelism).
DDF.range_partition
- Range partitioning is a partitioning technique where ranges of data are stored separately in different fragments.
DDF.repartition
- Repartitions distributed data based on a fixed number of partitions or on a distribution list.
DDF.replace
- Replaces one or more values with new ones.
DDF.sample
- Returns a sampled subset.
DDF.save.csv
- Saves a csv file.
DDF.save.json
- Saves a json file.
DDF.save.parquet
- Saves a parquet file.
DDF.save.pickle
- Saves a pickle file.
DDF.schema
- Returns a schema table in which each row contains a column name and its data type in the current DDF.
DDF.select
- Performs a projection of selected columns.
DDF.select_expression
- Projects a set of SQL expressions and returns a new DDF.
DDF.show
- Collect the current DDF into a single DataFrame.
DDF.sort
- Returns a sorted DDF by the specified column(s).
DDF.split
- Randomly splits a DDF into two DDFs.
DDF.subtract
- Returns a new set containing the rows that are in the first frame but not in the second one.
DDF.take
- Returns the first num rows.
DDF.to_df
- Returns the DDF contents as a pandas DataFrame.
DDF.union
- Combines two DDFs (concatenate) by column position.
DDF.union_by_name
- Combines two DDFs (concatenate) by column name.
DDF.rename
- Returns a new DDF by renaming an existing column.
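The two partitioning techniques listed above (DDF.hash_partition and DDF.range_partition) can be illustrated with a minimal plain-Python sketch. This is not the DDF implementation and does not use the DDF API: the helper names `hash_partition` and `range_partition`, the `boundaries` parameter and the sample rows are all assumptions made for illustration.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_partitions):
    """Place each row in the fragment selected by hashing its key value."""
    fragments = [[] for _ in range(n_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        index = crc32(str(row[key]).encode("utf-8")) % n_partitions
        fragments[index].append(row)
    return fragments


def range_partition(rows, key, boundaries):
    """Place each row in the fragment whose value range contains its key.

    `boundaries` (hypothetical parameter) must be sorted. Values below
    boundaries[0] go to fragment 0, values in [boundaries[0], boundaries[1])
    go to fragment 1, and so on.
    """
    fragments = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        fragments[bisect_right(boundaries, row[key])].append(row)
    return fragments


cities = ["Rio", "Lima", "Rio", "Quito", "Lima", "Rio"]
rows = [{"id": i, "city": c} for i, c in enumerate(cities)]

by_hash = hash_partition(rows, "city", 3)   # equal cities share a fragment
by_range = range_partition(rows, "id", [2, 4])  # ids 0-1, 2-3, 4-5
```

With hash partitioning, rows with the same key always land in the same fragment, which is useful before joins or aggregations on that key. With range partitioning, fragments are ordered by key, which is useful before sorting.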
Statistics¶
DDF.count_rows
- Returns the number of rows in this DDF.
DDF.correlation
- Calculates the Pearson Correlation Coefficient.
DDF.covariance
- Calculates the sample covariance for the given columns.
DDF.cross_tab
- Computes a pair-wise frequency table of the given columns.
DDF.describe
- Computes basic statistics for numeric and string columns.
DDF.freq_items
- Finds frequent items for columns.
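For reference, the quantities behind DDF.covariance and DDF.correlation can be written out in a few lines of plain Python. This is only a sketch of the standard formulas (sample covariance with an n - 1 denominator, and the Pearson coefficient derived from it); the function names are hypothetical and it says nothing about how DDF computes these values in a distributed setting.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_correlation(xs, ys):
    """Pearson coefficient: covariance divided by the product of the standard deviations."""
    return sample_covariance(xs, ys) / sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys is an exact linear function of xs

cov_xy = sample_covariance(xs, ys)
corr_xy = pearson_correlation(xs, ys)  # close to 1.0 for perfectly linear data
```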
ML¶
Machine learning algorithms are divided into: classifiers, clustering algorithms, feature extraction operations, frequent pattern mining algorithms, evaluators and regressors.
ML.Classification¶
The ml.classification module includes some supervised classification algorithms:
ml.classification.KNearestNeighbors
- K-Nearest Neighbors is an algorithm that can be used for both classification and regression problems, although it is more widely used for classification. To classify a point, the algorithm takes a simple majority vote among its K nearest neighbors in the training set. The choice of the parameter K is crucial in this algorithm and depends on the dataset; values of one or three are the most common.
ml.classification.GaussianNB
- The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.
ml.classification.LogisticRegression
- Logistic regression is named for the function used at the core of the method, the logistic function. It is the go-to method for binary classification problems (problems with two class values).
ml.classification.SVM
- Support vector machines (SVMs) are supervised learning models used for binary classification. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new points to one category or the other, making it a non-probabilistic binary linear classifier.
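The majority vote described for ml.classification.KNearestNeighbors can be sketched in plain Python as follows. This is an illustrative, single-machine version under assumed names (`knn_predict`, the toy training data); it is not the DDF class or its interface, and it uses Euclidean distance as an assumption.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = ["a", "a", "b", "b", "b"]

near_origin = knn_predict(train_points, train_labels, (0.05, 0.1), k=3)
near_one = knn_predict(train_points, train_labels, (1.0, 0.9), k=3)
```

An odd K avoids ties in two-class problems, which is one reason small odd values are a common starting point.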
ML.Clustering¶
The ml.clustering module gathers popular unsupervised clustering algorithms:
ml.clustering.Kmeans
- The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
ml.clustering.DBSCAN
- A density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
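The iterative assignment described for ml.clustering.Kmeans can be sketched as below. It is a minimal plain-Python version with fixed initial centroids so that the result is reproducible; the function names, the fixed iteration count and the sample points are assumptions, and the DDF class may initialize and stop differently.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        clusters[nearest].append(point)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # New centroid is the coordinate-wise mean; an empty cluster keeps its centroid.
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```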
ML.Evaluation¶
The ml.evaluation module includes score functions and performance metrics:
ml.evaluation.BinaryClassificationMetrics
- Evaluator for binary classification.
ml.evaluation.MultilabelMetrics
- Evaluator for multilabel classification.
ml.evaluation.RegressionMetrics
- Evaluator for regression.
ML.Feature¶
The ml.feature module covers algorithms for working with features: extracting features from “raw” data; scaling, converting or modifying features; and selecting a subset from a larger set of features:
ml.feature.VectorAssembler
- Vector Assembler is a transformer that combines a given list of columns into a single vector column.
ml.feature.VectorSlicer
- Vector Slicer creates a new feature vector from a subarray of the original features.
ml.feature.Binarizer
- Binarize data (set feature values to 0 or 1) according to a threshold.
ml.feature.OneHotEncoder
- Encode categorical integer features as a one-hot numeric array.
ml.feature.Tokenizer
- Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
ml.feature.RegexTokenizer
- A regex-based tokenizer that extracts tokens by using the provided regex pattern (in Java dialect) to split the text.
ml.feature.RemoveStopWords
- Remove stop-words is an operation to remove words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.
ml.feature.NGram
- A feature transformer that converts the input array of strings into an array of n-grams.
ml.feature.CountVectorizer
- Converts a collection of text documents to a matrix of token counts.
ml.feature.TfidfVectorizer
- Term frequency-inverse document frequency (TF-IDF) is a numerical statistic transformation that is intended to reflect how important a word is to a document in a collection or corpus.
ml.feature.StringIndexer
- StringIndexer indexes a feature by encoding a string column as a column containing indexes.
ml.feature.IndexToString
- Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.
ml.feature.MaxAbsScaler
- MaxAbsScaler transforms a dataset of feature rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature.
ml.feature.MinMaxScaler
- MinMaxScaler transforms a dataset of feature rows, rescaling each feature to a specific range (often [0, 1]).
ml.feature.StandardScaler
- StandardScaler transforms a dataset of feature rows, rescaling each feature by its standard score.
ml.feature.PCA
- Principal component analysis (PCA) is used widely in dimensionality reduction.
ml.feature.PolynomialExpansion
- Perform feature expansion in a polynomial space.
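The three scalers listed above (MaxAbsScaler, MinMaxScaler and StandardScaler) differ only in the formula applied to each feature column. The sketch below applies each formula to a single column in plain Python. The function names are hypothetical, and the use of the population standard deviation in `standard_scale` is an assumption; the DDF classes may use a different convention.

```python
from statistics import mean, pstdev


def max_abs_scale(values):
    """Divide by the largest absolute value, giving results in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Map the smallest value to `low` and the largest to `high`."""
    smallest, largest = min(values), max(values)
    span = largest - smallest  # a constant column (span == 0) is not handled here
    return [low + (v - smallest) * (high - low) / span for v in values]


def standard_scale(values):
    """Standard score: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    return [(v - mu) / sigma for v in values]


column = [-4.0, 0.0, 2.0, 6.0]

scaled_max_abs = max_abs_scale(column)
scaled_min_max = min_max_scale(column)
scaled_standard = standard_scale(column)
```

MaxAbsScaler keeps zeros at zero, so it does not destroy sparsity; MinMaxScaler and StandardScaler both shift the data.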
ML.Frequent Pattern Mining¶
ml.fpm.Apriori
- Apriori is an algorithm to find frequent item sets.
ml.feature.AssociationRules
- AssociationRules implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent.
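As background for ml.fpm.Apriori, the level-wise search for frequent item sets can be sketched in plain Python. This is a small, sequential illustration of the general Apriori idea (grow candidates one item at a time and prune those with an infrequent subset); the function name, the `min_support` parameter and the sample transactions are assumptions and do not reflect the DDF interface.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {s: support(s) for s in current}
        level = {s: sup for s, sup in level.items() if sup >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=0.5)
```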
ML.Regression¶
ml.regression.LinearRegression
- Linear Regression using the method of least squares (works only for 2-D data) or Stochastic Gradient Descent.
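For the 2-D case mentioned above (one feature and one target), the least-squares fit has a closed form. The sketch below shows that formula in plain Python; the function name and the data are hypothetical, and it is not the DDF LinearRegression class.

```python
def fit_least_squares(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

slope, intercept = fit_least_squares(xs, ys)
```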
Geographic Operations¶
DDF.geo_within
- Returns the sector that each point belongs to.
Graph Algorithms¶
graph.PageRank
- Perform PageRank.
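As a reminder of what PageRank computes, here is a minimal power-iteration sketch in plain Python. The function name, the adjacency-dict input format, the damping factor of 0.85 and the fixed number of iterations are all assumptions for illustration; graph.PageRank may expose different parameters and a different input format.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration over a dict {node: [outgoing neighbors]}.

    Every neighbor must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # A node shares its damped rank equally among its out-links.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # A dangling node spreads its damped rank over all nodes.
                for target in nodes:
                    updated[target] += damping * ranks[node] / n
        ranks = updated
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, node c receives links from both a and b, so it ends with the highest rank; the ranks always sum to 1.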