

PYSPARK ML
Silas Liu - Dec. 19, 2022
Python, Big Data, Spark, Classification
Spark is a framework for cluster computing (one of the topics I teach in my Data Science MBA/bootcamp courses). It manages data manipulation and computation across clusters with multiple nodes, which makes it well suited to very large datasets, i.e. big data. Spark can also be configured to keep data entirely in RAM, making it a good fit for real-time analysis.
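As a minimal sketch of getting started, a session might be created as below; the application name, memory setting and file path are illustrative assumptions, not taken from the notebooks compared here.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; executor memory is configured so the
# working set can stay in RAM.
spark = (
    SparkSession.builder
    .appName("pyspark-ml-demo")              # hypothetical app name
    .config("spark.executor.memory", "4g")   # illustrative memory setting
    .getOrCreate()
)

# Load a CSV into a distributed DataFrame (hypothetical path).
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()
```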
PySpark is the Python interface to Spark. It includes PySpark ML, the machine learning library, designed to take full advantage of the parallel processing that Spark offers.
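PySpark ML follows an estimator/transformer pattern similar to scikit-learn's. A minimal pipeline sketch could look like the following, assuming hypothetical column names ("f1", "f2", "label") and pre-split train_df/test_df DataFrames:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble raw columns into a single feature vector, then classify.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)          # train_df: a Spark DataFrame
predictions = model.transform(test_df)  # adds prediction columns
```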
Here we compare an entire workflow of feature engineering, cross-validation, hyperparameter tuning and evaluation, for both a simple model and an ensemble model, over a dataset with 1.4 million rows, run on Spark (via Databricks) and locally with scikit-learn (via Google Colab).
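To give a sense of what this workflow looks like on the Spark side, here is a hedged sketch of tuning an ensemble model with cross-validation; the grid values, fold count and metric are illustrative choices, not necessarily those used in the actual comparison:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Ensemble model over the assembled "features" column.
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Illustrative hyperparameter grid.
grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)

evaluator = BinaryClassificationEvaluator(
    labelCol="label", metricName="areaUnderROC"
)

# Cross-validation runs the fold/grid combinations in parallel on the cluster.
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=4,  # how many models to train concurrently
)
cv_model = cv.fit(train_df)

# CrossValidatorModel.transform applies the best model found.
print(evaluator.evaluate(cv_model.transform(test_df)))
```

The scikit-learn version of the same workflow follows the analogous GridSearchCV pattern, but runs on a single machine rather than a cluster.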
Note: it is advised to read the Spark document (on the left) first, followed by the scikit-learn document (on the right).