Virtual Artifact

Spark 2

Getting Started

PySpark - Spark SQL Context

Spark aims to make it easy to work with data. One way it achieves this is by letting you treat Spark data as if you were working with a SQL database:
* Spark SQL enables querying DataFrames as database tables
* Temporary views can be per-session or global
* The Catalyst optimizer makes SQL queries efficient
Apr 20, 2020 — 6 min read
Python

PySpark - Using DataFrames

Spark DataFrames: Previously we looked at RDDs, which were the primary data structure in Spark 1. In Spark 2 we rarely use RDDs, reserving them for low-level transformations and fine-grained control over the dataset. Only if the data is unstructured, or is streaming data, do we still have to rely on RDDs.
Apr 15, 2020 — 7 min read
Python

PySpark - Using RDDs

I will keep my data in a folder named 'datasets'. If PySpark is not already loaded up, go ahead and start PySpark and create a new Jupyter notebook. View information about the SparkContext by inputting sc. If we were running a cluster of nodes, the output would be a bit different.
Apr 14, 2020 — 6 min read
Getting Started

Spark 2 setup

Demo will cover:
* Install standalone Spark on your local machine
* Set up the PySpark REPL interface
Requirements:
* This demo will be done with Python 3
* Java v8
* Jupyter notebooks
Download Spark 2 from https://spark.apache.org/downloads.html
1. Choose the most recent 2.x build
2. Choose package:
Apr 13, 2020 — 1 min read
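The install steps above might translate into shell commands like these. This is a sketch only: the archive name, paths, and the choice of Jupyter as the driver front end are assumptions, not taken from the post.

```shell
# Assumed archive name; substitute whichever 2.x build you downloaded
tar -xzf spark-2.4.5-bin-hadoop2.7.tgz
export SPARK_HOME="$PWD/spark-2.4.5-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"

# Optional: have the pyspark command launch inside a Jupyter notebook
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark
```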
Virtual Artifact © 2026