DDF vs PySpark DataFrame
Some DDF functions have interfaces similar to those of the PySpark DataFrame, which helps new users who want to migrate to COMPSs.
The following tables show some of these correspondences.
ETL
PySpark DataFrame | DDF |
---|---|
parallelize | parallelize |
map | map |
cache | cache |
count | count |
describe | describe |
subtract | subtract |
drop | drop |
dropna | dropna |
dropDuplicates | drop_duplicates |
fillna | fillna |
filter | filter |
groupBy.agg | groupBy.agg |
intersect | intersect |
intersectAll | intersect_all |
join | join |
randomSplit | split |
replace | replace |
sample | sample |
select | select |
show | show |
sort | sort |
take | take |
toDF | toDF |
union | union |
unionByName | union_by_name |
withColumnRenamed | with_column_renamed |
read.text | load_text |
write | save |
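
When porting a script, the ETL correspondences above can be kept as a small lookup table. The sketch below is only an illustration: the helper `to_ddf_name` and the two containers are hypothetical names that are not part of DDF or PySpark, and only the name pairs themselves are taken from the table. It translates names, not arguments, so each call's parameters still need to be checked against the DDF documentation.

```python
# Hypothetical porting aid: the names below are copied from the ETL table;
# the helper itself is NOT part of DDF or PySpark.

# PySpark names whose DDF counterpart is spelled differently.
PYSPARK_TO_DDF = {
    "dropDuplicates": "drop_duplicates",
    "intersectAll": "intersect_all",
    "randomSplit": "split",
    "unionByName": "union_by_name",
    "withColumnRenamed": "with_column_renamed",
    "read.text": "load_text",
    "write": "save",
}

# Names that the table lists with identical spelling on both sides.
SAME_NAME = {
    "parallelize", "map", "cache", "count", "describe", "subtract",
    "drop", "dropna", "fillna", "filter", "groupBy.agg", "intersect",
    "join", "replace", "sample", "select", "show", "sort", "take",
    "toDF", "union",
}


def to_ddf_name(pyspark_name: str) -> str:
    """Return the DDF name the table lists for a PySpark name.

    Raises KeyError for names that do not appear in the table, so an
    unlisted operation is never silently assumed to exist in DDF.
    """
    if pyspark_name in PYSPARK_TO_DDF:
        return PYSPARK_TO_DDF[pyspark_name]
    if pyspark_name in SAME_NAME:
        return pyspark_name
    raise KeyError(f"{pyspark_name!r} is not listed in the ETL table")


if __name__ == "__main__":
    for name in ("dropDuplicates", "filter", "randomSplit"):
        print(name, "->", to_ddf_name(name))
```

Raising an error for unlisted names is a deliberate choice here: the table shows only some of the correspondences, so a missing entry says nothing about whether DDF offers an equivalent.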
Machine Learning
PySpark (ML/MLlib) | DDF |
---|---|
VectorAssembler | VectorAssembler |
VectorSlicer | VectorSlicer |
NGram | NGram |
TF-IDF | TF-IDF |
CountVectorizer | CountVectorizer |
Tokenizer | Tokenizer |
StopWordsRemover | RemoveStopWords |
PCA | PCA |
StringIndexer | StringIndexer |
IndexToString | IndexToString |
StandardScaler | StandardScaler |
MaxAbsScaler | MaxAbsScaler |
MinMaxScaler | MinMaxScaler |
SVMWithSGD | SVM |
LogisticRegressionWithSGD | LogisticRegression |
NaiveBayes | Gaussian Naive Bayes |
LinearRegressionWithSGD | LinearRegression |
K-means | K-Means |
AssociationRules | AssociationRules |