Stick around if you're looking for a complete guide to setting up a PySpark environment for data science applications, covering PySpark's functionality as well as the best platforms to explore.

What is PySpark?

PySpark is the Python API for Apache Spark, and it is well worth learning if you want to build more scalable pipelines and analyses. According to Chris Min, a data engineer, PySpark lets you write Spark applications in Python and process data efficiently in a distributed fashion. Python is not just a great language, but an all-in-one ecosystem for exploratory data analysis, ETL for data platforms, and ML pipelines. You can think of PySpark as a complete library for large-scale data processing on a single machine or a cluster of machines. Moreover, it handles all the parallel processing for you, without requiring Python's threading or multiprocessing modules.
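As a rough illustration of that last point, the sketch below sums squares over a range of numbers and lets Spark spread the work across local cores, with no threading or multiprocessing code. It assumes pyspark and Java are already installed (the setup options come later in this article). The names square and sum_of_squares, the app name, and the local[*] master setting are illustrative choices of mine.

```python
def square(x):
    """Plain Python function; Spark ships it to the workers for us."""
    return x * x


def sum_of_squares(n, master="local[*]"):
    """Sum the squares of 0..n-1 using a local Spark session.

    Assumes pyspark is installed. It is imported lazily so that
    defining this function does not require Spark.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)              # local[*] = use all local cores
        .appName("sum-of-squares")   # hypothetical app name
        .getOrCreate()
    )
    try:
        # parallelize() splits the range into partitions that Spark
        # processes in parallel; we write no threading code ourselves.
        rdd = spark.sparkContext.parallelize(range(n))
        return rdd.map(square).sum()
    finally:
        spark.stop()
```

Calling sum_of_squares(10) should give the same answer as the pure-Python sum(x * x for x in range(10)); the difference is that Spark decides how to split the work.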

Spark is the Real Deal For Data Engineering

According to the International Journal of Data Science and Analytics, Spark has emerged as a general-purpose cluster computing framework with language-integrated APIs in Python, Scala, and Java. Its advanced in-memory programming model and its libraries for structured data processing, scalable ML, and graph analysis make it highly useful in the data science industry. As a matter of fact, at a certain point in data processing, scaling with Pandas becomes hard. Being a data engineer involves a lot of large-scale data processing, which is far more manageable once you are well-versed in Spark.

Why Should Data Scientists Learn Spark?

For a data scientist, learning Spark can be a game-changer. For large-scale data processing, Spark performs far better than Pandas while being not so different in use, so switching to it is not a big deal, and you get real benefits in your data engineering work. Spark is a complete collection of libraries that solves a range of problems and executes logic efficiently. It offers a clean and efficient experience, in some ways better than Pandas, especially when dealing with large data sets, thanks to its high-performance analysis and user-friendly structure.

Exploring All The Options for PySpark Setup

I also have a video version of this article. If you are interested, feel free to watch it on my YouTube channel

https://youtu.be/Ql_jfk3UnHE

The following are the options you can consider for setting up the PySpark ecosystem. The list below covers the best platforms to consider:

Setting Up Spark and Python Locally on Ubuntu

Install Java

sudo apt install openjdk-8-jdk

Download the Linux version of Spark from https://spark.apache.org/downloads.html
Set environment variables: open the file with sudo nano /etc/environment and add

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
#Save and exit

To test, run echo $JAVA_HOME and check that it prints the path
Open ~/.bashrc with nano ~/.bashrc and add source /etc/environment at the end of the file
This should set up your Java environment on Ubuntu
Install Spark: after downloading it in step 2, extract it with the following commands

cd Downloads
sudo tar -zxvf spark-3.1.2-bin-hadoop3.2.tgz

Configure the environment for Spark: open ~/.bashrc with nano ~/.bashrc and add the following

export SPARK_HOME=~/Downloads/spark-3.1.2-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin

Save and exit, then reload the file with source ~/.bashrc
To test, run pyspark in a terminal
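Before launching pyspark, it can help to confirm that the variables from the steps above are actually visible. The following is a minimal sketch using only the Python standard library. The function name missing_spark_settings and the list of variables it checks are my own choices, based on the exports above.

```python
import os
import shutil

# Variables set in the steps above
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON")


def missing_spark_settings(env=None, which=shutil.which):
    """Return a list of human-readable problems; empty means all good."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name, ""):
            problems.append(f"{name} is not set")
    # The pyspark launcher lives in $SPARK_HOME/bin, which should be on PATH
    if which("pyspark") is None:
        problems.append("pyspark not found on PATH")
    return problems


if __name__ == "__main__":
    issues = missing_spark_settings()
    print("\n".join(issues) if issues else "Spark environment looks OK")
```

Run it from a fresh terminal so that it sees the same environment pyspark will see.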

Don't have Ubuntu? Use VirtualBox

Set up Ubuntu on your local machine using VirtualBox. VirtualBox lets you build a virtual computer on your own physical computer. You can use it to set up Spark and Python (20-30 mins approx.):

Start by downloading VirtualBox.

Screenshot from the VirtualBox download page

Download the Ubuntu ISO image
In VirtualBox, click New and set up an Ubuntu 64-bit environment
Set the desired CPU cores, memory, and storage
Point to the downloaded Ubuntu image

Setting Up Spark and Python Locally on Mac

Make sure Homebrew is installed and up to date. If it is not, go to this link or run the following in a terminal

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Open a terminal and install Java

brew install java
# check whether Java is installed
brew info java

Install Scala

brew install scala

Install Spark

brew install apache-spark

Install Python

brew install python3

Set up the environment: open the file with nano ~/.bashrc
Add the following environment variables

# the paths below are examples; adjust names and versions to what is installed on your machine
#java
export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/
export JRE_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/
#spark
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.4/libexec
export PATH=/usr/local/Cellar/apache-spark/2.4.4/bin:$PATH
#pyspark
export PYSPARK_PYTHON=/usr/local/bin/python3 # or your path to python
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

This should complete the PySpark setup. To test it, type pyspark in a terminal
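The exports above hard-code a version number (2.4.4), which will not match whatever Homebrew installs today. As a small sketch, the helper below looks for the newest version directory under the Cellar and returns its libexec folder. It assumes the layout used in the exports above (<cellar>/<version>/libexec). The function names are hypothetical, and on Apple Silicon Macs Homebrew lives under /opt/homebrew, so the cellar argument would need changing there.

```python
from pathlib import Path


def version_key(name):
    """Turn '3.1.2' into (3, 1, 2) so versions sort numerically."""
    return tuple(int(part) if part.isdigit() else 0 for part in name.split("."))


def find_spark_home(cellar="/usr/local/Cellar/apache-spark"):
    """Return <cellar>/<newest version>/libexec, or None if not installed."""
    root = Path(cellar)
    if not root.is_dir():
        return None
    # Keep only version directories that actually contain libexec
    versions = [p for p in root.iterdir() if (p / "libexec").is_dir()]
    if not versions:
        return None
    newest = max(versions, key=lambda p: version_key(p.name))
    return newest / "libexec"


if __name__ == "__main__":
    home = find_spark_home()
    print(f"export SPARK_HOME={home}" if home else "apache-spark not found")
```

The printed line can be pasted into the shell profile in place of the hard-coded SPARK_HOME.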

Setting Up Locally with Docker and Jupyter Notebook (My Preferred Method)

What is Docker?

Docker is an open platform for developing, shipping, and running applications. To learn more about Docker, check out this link

Setting up Spark with Docker and Jupyter Notebook is a simple task: a few steps give you a working environment for running PySpark in Jupyter Notebook in no time. Follow the steps below:

Install Docker
Use the pre-existing Docker image jupyter/pyspark-notebook, published by Jupyter
Pull the image

docker pull jupyter/pyspark-notebook

Run the container

docker run -d -p 8888:8888 jupyter/pyspark-notebook:latest

Go to localhost:8888 (Jupyter asks for a token, which is printed in the container logs), create a new notebook, and run a cell with import pyspark
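A bare import pyspark only proves the package is importable. A slightly stronger check, sketched below, starts a session and runs a tiny job. It relies only on standard PySpark calls (SparkSession.builder.getOrCreate, spark.range, and DataFrame.count). The function name spark_smoke_test is mine, and accepting an optional existing session is just a convenience for environments that already provide one.

```python
def spark_smoke_test(spark=None, rows=5):
    """Run a tiny job and return the row count (should equal `rows`)."""
    if spark is None:
        # Imported lazily so the function can be defined anywhere
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("smoke-test").getOrCreate()
    # range() builds a one-column DataFrame with ids 0..rows-1;
    # count() forces Spark to actually execute the job.
    return spark.range(rows).count()
```

In a notebook cell, spark_smoke_test() should return 5 if everything is wired up correctly.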

Databricks Setup

Databricks, a unified analytics platform, provides well-managed Spark clusters in the cloud. It is an easy-to-use environment that encourages users to learn, collaborate, and work in a fully integrated workspace. Any Spark code can be scheduled without hassle, as Databricks supports PySpark natively.

To start, create a Databricks account (this is usually done by Databricks admins) and link it to your preferred cloud provider. For more information on getting started, check out this video

https://www.youtube.com/watch?v=3fqfWYBXj2A

Start by creating a Databricks cluster.
Create a Databricks notebook and test it by running import pyspark

Spark and Python on AWS EC2

Amazon EC2 instances are virtual machines provided by AWS. They come with an operating system pre-installed from an AMI, but the rest of the dependencies need to be installed separately.

Go to the AWS Console and open EC2
Select an Ubuntu AMI
Follow the steps from Option 1 (the local Ubuntu setup above)

I would avoid this approach and use one of the other options, since every dependency has to be installed by hand

PySpark on AWS SageMaker Notebooks

Launched in 2017, Amazon SageMaker is a fully managed, cloud-based machine-learning platform. It decouples your development, training, and deployment environments, letting you scale them separately while helping you optimize your spending and time. It is easy to spin up SageMaker notebooks with a few clicks. An Amazon SageMaker notebook instance is a machine learning (ML) compute instance running the Jupyter Notebook environment. It comes with pre-configured Conda environments such as python2, python3, PyTorch, and TensorFlow.

Log in to your AWS console and go to SageMaker
Click on Notebook, then Notebook Instances, on the left side
Click on Create Notebook Instance, give it a name, and select the desired configuration
Select an instance type. You can start small with ml.t2.medium and spin up a more powerful instance later
Click Create, wait a few minutes, and then click Open JupyterLab to go to the notebook
Create a new notebook and run the following code snippet to start PySpark

import sagemaker_pyspark
from pyspark.sql import SparkSession, DataFrame

# Put the SageMaker Spark JARs on the driver classpath
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .getOrCreate()
)
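The console steps for creating the notebook instance can also be scripted. The sketch below builds the request for boto3's SageMaker create_notebook_instance call. It assumes boto3 is installed, AWS credentials are configured, and you already have an IAM role ARN for SageMaker. The function names, the default instance type, and the volume size are illustrative choices on my part.

```python
def build_notebook_request(name, role_arn, instance_type="ml.t2.medium",
                           volume_gb=5):
    """Build keyword arguments for create_notebook_instance.

    role_arn must be an existing IAM role that SageMaker can assume;
    the defaults are assumptions for illustration only.
    """
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,   # start small, resize later
        "RoleArn": role_arn,
        "VolumeSizeInGB": volume_gb,
    }


def create_notebook(request, region_name=None):
    """Submit the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the request can be built without it
    client = boto3.client("sagemaker", region_name=region_name)
    response = client.create_notebook_instance(**request)
    return response["NotebookInstanceArn"]
```

Building the request separately from sending it makes it easy to review the settings before anything is created in your account.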

If you are interested in learning more about SageMaker, check out my previous video, SageMaker in 11 Minutes

https://youtu.be/95332cm5ROo

AWS EMR Cluster Setup

Amazon EMR, probably one of the best places to run Spark, lets you create Spark clusters very easily. It comes with features such as Amazon S3 connectivity, which makes it fast and convenient, and it integrates with the EC2 Spot market and EMR Managed Scaling.

To be precise, EMR is a managed big data service that helps data scientists run data science applications written in Python, Scala, and PySpark. It provides a convenient Spark cluster setup, giving data scientists a platform to develop and visualize.

Go to the AWS console and search for EMR
Click on Create cluster
In General configuration, give it a name. In Software configuration, select the Spark application

In Hardware configuration, select an instance type (you can start small with m1.medium) and the number of instances in the cluster
In Security, select an EC2 key pair. This is usually created by an administrator. If you do not have one, follow the steps on the right to create a key pair for the cluster to use

Keep the remaining options at their defaults and create the cluster
After that, create an EMR notebook and select the newly created cluster to run your jobs at scale
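The same cluster can also be requested from code with boto3's EMR client and its run_job_flow call. The following is only a sketch under stated assumptions: the release label emr-6.3.0, the m5.xlarge instance type, and the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole) are placeholders that must match your account and region, and the function names are hypothetical.

```python
def build_cluster_config(name, key_name, instance_type="m5.xlarge",
                         instance_count=3, release="emr-6.3.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The defaults are assumptions for illustration; pick values that
    match your account and region.
    """
    return {
        "Name": name,                            # General configuration
        "ReleaseLabel": release,
        "Applications": [{"Name": "Spark"}],     # Software configuration
        "Instances": {                           # Hardware configuration
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,              # Security: EC2 key pair
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR roles; assumed to already exist in the account
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(config, region_name=None):
    """Submit the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the config can be built without it
    emr = boto3.client("emr", region_name=region_name)
    response = emr.run_job_flow(**config)
    return response["JobFlowId"]
```

Each key in the config maps to one of the console sections above, so you can compare the two side by side before creating anything.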

Conclusion:

Spark, a complete analytics engine, helps data scientists with heavy data processing that is difficult to handle with Pandas. Learning PySpark can therefore help data engineers a lot in their work. Now that you know the various platforms that let you set up Spark, both locally and on managed cloud services, you can explore them yourself.

Anuj Syal's Blog