<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Virtual Artifact]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://blog.virtual-artifact.com/</link><image><url>https://blog.virtual-artifact.com/favicon.png</url><title>Virtual Artifact</title><link>https://blog.virtual-artifact.com/</link></image><generator>Ghost 5.5</generator><lastBuildDate>Sat, 09 May 2026 09:48:33 GMT</lastBuildDate><atom:link href="https://blog.virtual-artifact.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Coming soon]]></title><description><![CDATA[<p>This is Virtual Artifact, a brand new site by Froilan Miranda that&apos;s just getting started. Things will be up and running here shortly, but you can <a href="#/portal/">subscribe</a> in the meantime if you&apos;d like to stay up to date and receive emails when new content is published!</p>]]></description><link>https://blog.virtual-artifact.com/coming-soon/</link><guid isPermaLink="false">62e176c9735891445763d359</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 27 Jul 2022 17:32:57 GMT</pubDate><media:content url="https://static.ghost.org/v4.0.0/images/feature-image.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://static.ghost.org/v4.0.0/images/feature-image.jpg" alt="Coming soon"><p>This is Virtual Artifact, a brand new site by Froilan Miranda that&apos;s just getting started. Things will be up and running here shortly, but you can <a href="#/portal/">subscribe</a> in the meantime if you&apos;d like to stay up to date and receive emails when new content is published!</p>]]></content:encoded></item><item><title><![CDATA[AWS API Gateway Introduction]]></title><description><![CDATA[<p>Today APIs are the hot thing, and for good reason. They add simplicity and flexibility to complex data architectures. API Gateway offers the ability to create RESTful APIs and WebSocket APIs within the AWS Cloud Platform. API Gateway also supports serverless and containerized workloads.</p><p>Amazon API Gateway is an</p>]]></description><link>https://blog.virtual-artifact.com/untitled-2/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c7</guid><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Tue, 26 Jan 2021 01:06:29 GMT</pubDate><content:encoded><![CDATA[<p>Today APIs are the hot thing, and for good reason. They add simplicity and flexibility to complex data architectures. API Gateway offers the ability to create RESTful APIs and WebSocket APIs within the AWS Cloud Platform. API Gateway also supports serverless and containerized workloads.</p><p>Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and securing REST, HTTP, and WebSocket APIs at any scale. Developers can create API access to any web service connected to the internet. 
This includes AWS Cloud services and services not within the AWS Cloud.</p><h2 id="api-types">API Types</h2><ul><li>RESTful APIs - Optimized for serverless workloads and HTTP backends<ul><li>HTTP based</li><li>Enables stateless client-server communication</li><li>Standard GET, POST, PUT, PATCH, and DELETE HTTP methods</li></ul></li><li>WebSocket APIs - Real-time, two-way communication between applications<ul><li>Adheres to the WebSocket protocol, which enables stateful, full-duplex communication between client and server.</li><li>Routes incoming messages based on message content.</li></ul></li></ul><h2 id="benefits">Benefits</h2><ul><li>Efficient API development - The ability to run multiple versions of an API that can be quickly tested and released</li><li>Performance at any scale - Using the AWS global infrastructure keeps performance high and scaling easy</li><li>Cost savings at scale - Different tiering plans allow for cost flexibility as requests grow</li><li>Easy monitoring - Easy integration with CloudWatch to monitor API usage, latency, error rates and more.</li><li>Flexible security controls - Leverage IAM and Cognito to finely tune access, as well as OAuth 2 and OIDC support.</li><li>RESTful API options - Easily create and customize your RESTful APIs</li></ul><h2 id="lambdas">Lambdas</h2><p>Using API Gateway with AWS Lambda provides the app-facing part of the AWS serverless infrastructure.</p><p>Streamline a web application by hosting it on AWS Lambda, then expose Lambda functions through API Gateway. Both services are highly available, scalable, and can be monitored through CloudWatch. This greatly simplifies development and administration efforts.</p><h2 id="access-api-gateway">Access API Gateway</h2><p>AWS allows access to API Gateway through several means:</p><ul><li>AWS Management Console</li><li>AWS SDKs</li><li>API Gateway V1 and V2 APIs</li><li>AWS CLI</li><li>AWS Tools for Windows PowerShell</li></ul><h2 id="pricing">Pricing</h2><h3 id="http-and-rest-api">HTTP and REST API</h3><p>API Gateway charges for what you use. The charges are for the number of API calls and the amount of data transferred out. There are no upfront fees or setup charges.</p><p>There are options for Private APIs and data caching that can also affect charges.</p><h3 id="websocket-api">WebSocket API</h3><p>WebSocket APIs incur charges only for messages sent and received and for connection minutes</p><h3 id="free-tier">Free Tier</h3><p>If you are still on the Free Tier, you have 1,000,000 API calls, 1,000,000 messages, and 750,000 connection minutes available.</p><h3 id="after-free-tier-expires">After Free Tier Expires</h3><p>Charges after the free tier vary. Consult the pricing page to learn more about API Gateway pricing. <a href="https://aws.amazon.com/api-gateway/pricing/">Learn More</a></p><h2 id="conclusion">Conclusion</h2><p>AWS API Gateway provides an easy, secure, and scalable way to expose a web application through a RESTful service. Used with Lambda and other AWS services, you can quickly develop, test, and deploy web interfaces.</p><p>In addition, it is a service that offers Free Tier access, which makes it easy to test and try without buying.</p>]]></content:encoded></item><item><title><![CDATA[AWS CodeCommit Introduction - Part 2]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In the previous article we looked at a general overview of the AWS CodeCommit service.</p>
<p>In this article we will look at setting up and accessing a repository in CodeCommit.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>There are several ways to work with CodeCommit:</p>
<ul>
<li>AWS Management Console</li>
<li>Use Git credentials with HTTPs</li>
<li>Federated Access</li>
<li>Temporary credentials</li>
<li>Web</li></ul>]]></description><link>https://blog.virtual-artifact.com/aws-codecommit-introduction-part-2/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c6</guid><category><![CDATA[AWS]]></category><category><![CDATA[Git]]></category><category><![CDATA[Getting Started]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Mon, 18 Jan 2021 23:33:20 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In the previous article we looked at a general overview of the AWS CodeCommit service.</p>
<p>In this article we will look at setting up and accessing a repository in CodeCommit.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>There are several ways to work with CodeCommit:</p>
<ul>
<li>AWS Management Console</li>
<li>Use Git credentials with HTTPs</li>
<li>Federated Access</li>
<li>Temporary credentials</li>
<li>Web Identity Provider</li>
</ul>
<p>This article will use Git credentials and HTTPS</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>You will need the following setup in order to follow along with this walkthrough</p>
<ul>
<li>Git version control on your local machine <a href="https://git-scm.com/downloads">More Info</a></li>
<li>An AWS account with access to IAM credentials <a href="http://aws.amazon.com">More Info</a></li>
</ul>
<h2 id="part1settinguppermissions">Part 1 - Setting Up Permissions</h2>
<p>First we are going to give an existing AWS user the proper policies to access CodeCommit</p>
<ol>
<li>Log in to the AWS Management Console</li>
<li>Type &apos;iam&apos; in the top search bar and select the IAM service from the drop down.</li>
<li>Select &apos;Users&apos; from the left menu</li>
<li>Select the user you wish to add access to CodeCommit</li>
<li>Make sure the Permissions tab is selected and click add permissions</li>
<li>Select &apos;Attach existing policies directly&apos; in the Grant permissions section and type CodeCommit in the section below it.</li>
<li>Click the checkbox next to &apos;AWSCodeCommitFullAccess&apos;</li>
<li>Click &apos;Next: Review&apos; in the lower right corner</li>
</ol>
<p>This will take you back to the Summary page for the user.</p>
<h2 id="part2creategitcredentials">Part 2 - Create Git Credentials</h2>
<ol>
<li>Log in to the AWS Management Console</li>
<li>Type &apos;iam&apos; in the top search bar and select the IAM service from the drop down.</li>
<li>Select &apos;Users&apos; from the left menu</li>
<li>Select the user you wish to add access to CodeCommit</li>
<li>Make sure the &apos;Security credentials&apos; tab is selected. Scroll down to the &apos;HTTPS Git credentials for AWS CodeCommit&apos; section and click Generate credentials.</li>
<li>Download the credentials CSV somewhere safe. This will be needed later to connect to the repository</li>
</ol>
<h2 id="part3createarepository">Part 3 - Create a Repository</h2>
<ol>
<li>Log in to the AWS Management Console</li>
<li>Type &apos;codecommit&apos; in the top search bar and select the CodeCommit service from the drop down.</li>
<li>Click &apos;Create repository&apos;</li>
<li>Give your repository a name and click &apos;Create&apos; in the bottom right corner</li>
</ol>
<p>This will bring you to a &apos;Connection steps&apos; page with a green Success bar at the top.</p>
<p>Scroll down to &apos;Step 3: Clone the repository&apos; and copy the repository location</p>
<h2 id="part4connecttotherepository">Part 4 - Connect to the repository</h2>
<ol>
<li>Open a terminal and move to the directory that you wish to clone the repository to.</li>
<li>Use the repository address from &apos;Part 3&apos; to clone the repo. <code>git clone &lt;repo-url&gt;</code></li>
<li>You will be prompted to enter your git credentials from &apos;Part 2&apos;</li>
<li>Now that the repository is cloned to your local machine you can interact with it as you would any other Git repository</li>
<li>Add some files, push, and confirm in the AWS Management Console</li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AWS CodeCommit Introduction - Part 1]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In today&apos;s development environment, Git-based services are the norm, with GitHub being the industry standard and others like Bitbucket closely following. AWS has also added their solution for Git-based version control, called CodeCommit.</p>
<p>We will look at some of the features of CodeCommit and what makes it special compared to</p>]]></description><link>https://blog.virtual-artifact.com/aws-codecommit-basics/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c5</guid><category><![CDATA[AWS]]></category><category><![CDATA[Git]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Sun, 10 Jan 2021 23:43:44 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In today&apos;s development environment, Git-based services are the norm, with GitHub being the industry standard and others like Bitbucket closely following. AWS has also added their solution for Git-based version control, called CodeCommit.</p>
<p>We will look at some of the features of CodeCommit and what makes it special compared to the rest of the version control players</p>
<h2 id="awscodecommit">AWS CodeCommit</h2>
<p>As stated earlier, CodeCommit is an AWS hosted cloud service that allows for Git-based version control.</p>
<p>What does all this mean?</p>
<h2 id="awscloud">AWS Cloud</h2>
<p>AWS offers an ever expanding catalog of cloud services. These services are focused on creating secure, scalable and dynamically priced solutions for application infrastructure. CodeCommit is a source control service with all of these in mind. By using AWS to host private Git repositories you can shift the weight of some responsibility to a proven cloud services provider. Not having to manage and scale your source control solution yourself allows your business to focus on what is important...developing and delivering code. It supports standard Git operations and works with existing Git-based tools to fit right into your development pipeline.</p>
<h2 id="pricing">Pricing</h2>
<p>The pricing for CodeCommit is pretty straightforward. It is free for the first 5 users and then $1 for every user above 5. This is great for anyone who wants to get in and start trying out this service.</p>
<h2 id="fullymanaged">Fully Managed</h2>
<p>Because AWS fully manages the platform, there is no need to provision servers, update software, or manage configurations. Also, AWS&apos;s large cloud network means that service availability and durability are high. This leaves you with no hardware or software concerns, lowering administrative overhead.</p>
<h2 id="security">Security</h2>
<p>AWS is very much about security and it doesn&apos;t stop with CodeCommit. Data stored in CodeCommit is encrypted at rest and in transit.</p>
<h2 id="collaborationisamust">Collaboration is a Must</h2>
<p>One of the best benefits of Git is the collaborative ability when used in conjunction with online repositories. CodeCommit supports all the capabilities we know and love: pull requests, notifications, comments and more. The development process is the same as the other online repositories we have grown accustomed to.</p>
<h2 id="sizematters">Size Matters</h2>
<p>CodeCommit can scale to meet any development needs. It can handle large numbers of files, branches and revision histories. There is no limit on file size, repository size or file types.</p>
<h2 id="itsbetterontheinside">It&apos;s better on the Inside</h2>
<p>If you are already working with AWS and its vast list of services, you will be able to benefit from how easy it is to integrate them with CodeCommit. From within the AWS family of services CodeCommit can streamline deployment, monitoring, serverless services and more.</p>
<h2 id="makingthemove">Making the Move</h2>
<p>CodeCommit is compatible with most Git-based repositories, making migration easy. CodeCommit uses all the standard Git commands so there is nothing new for you to learn there. If you are familiar with the AWS CLI and APIs, there is support for CodeCommit with these interfaces as well; a small example is sketched below.</p>
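<p>As a quick illustration, here is a minimal sketch of that API access using the AWS SDK for Python (boto3). The repository name is made up for illustration, and this assumes boto3 is installed and your AWS credentials are already configured:</p>
<pre><code># Hypothetical example: create a repository and print its HTTPS clone URL
# (assumes boto3 is installed and AWS credentials are configured)
import boto3

codecommit = boto3.client(&apos;codecommit&apos;)

response = codecommit.create_repository(
    repositoryName=&apos;my-demo-repo&apos;,  # made-up name for illustration
    repositoryDescription=&apos;Repository created from the AWS SDK for Python&apos;)

print(response[&apos;repositoryMetadata&apos;][&apos;cloneUrlHttp&apos;])
</code></pre>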
<h2 id="conclusion">Conclusion</h2>
<p>CodeCommit is a professional and well equipped Git repository hosting service backed by the AWS global infrastructure. It is not as well known as some of the industry staples like GitHub and Bitbucket. But we have seen that it offers the same benefits and a little extra if you are already using AWS managed services. Because of its free tier offers, I recommend taking it for a spin.</p>
<h2 id="andbeyond">And Beyond</h2>
<p>In the next few articles we will look at setting up CodeCommit and integrating it with some of AWS Cloud Services. Hope to see you there.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[PySpark - Spark SQL Context]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Spark aims to make it easy to work with data. One way it achieves this is by letting you work with Spark data as if you were working with a SQL database</p>
<ul>
<li>Spark SQL enables querying of DataFrames as database tables</li>
<li>Temporary per-session and global tables</li>
<li>The Catalyst optimizer makes SQL queries</li></ul>]]></description><link>https://blog.virtual-artifact.com/pyspark-spark-sql-context/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c3</guid><category><![CDATA[Getting Started]]></category><category><![CDATA[Spark 2]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Mon, 20 Apr 2020 15:10:02 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Spark aims to make it easy to work with data. One way it achieves this is by letting you work with Spark data as if you were working with a SQL database</p>
<ul>
<li>Spark SQL enables querying of DataFrames as database tables</li>
<li>Temporary per-session and global tables</li>
<li>The Catalyst optimizer makes SQL queries fast</li>
<li>Tables schemas can be inferred or explicitly specified</li>
</ul>
<h2 id="basicoperations">Basic Operations</h2>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>Import SparkSession</p>
<pre><code>spark = SparkSession.builder\
                    .appName(&apos;Analyzing Students&apos;)\
                    .getOrCreate()
</code></pre>
<p>Create a new Session</p>
<pre><code>from pyspark.sql.types import Row
from datetime import datetime
</code></pre>
<p>Import some libraries</p>
<pre><code>record = sc.parallelize([Row(id = 1,
                             name = &apos;Jill&apos;,
                            active = True,
                            clubs = [&apos;chess&apos;, &apos;hockey&apos;],
                            subjects = {&apos;math&apos;:80, &apos;english&apos;: 56},
                            enrolled = datetime(2014, 8, 1, 14, 1, 5)),
                        Row(id = 2,
                            name = &apos;George&apos;,
                            active = False,
                            clubs = [&apos;chess&apos;,&apos;soccer&apos;],
                           subjects = {&apos;math&apos;: 60, &apos;english&apos;:96},
                           enrolled = datetime(2015, 3, 21, 8, 2, 5))
                        ])
</code></pre>
<p>Use <code>parallelize(...)</code> to create an RDD made of Row objects that contain a mixture of data types</p>
<pre><code>record_df = record.toDF()
record_df.show()
</code></pre>
<p>Create a DataFrame from the RDD</p>
<pre><code>record_df.createOrReplaceTempView(&apos;records&apos;)
</code></pre>
<p>To run SQL against this data we first need to register the DataFrame as a table. The name of the SQL table is <strong>records</strong> and it only exists within this session. Once the session exits the table is also destroyed</p>
<pre><code>all_records_df = sqlContext.sql(&apos;SELECT * FROM records&apos;)

all_records_df.show()
</code></pre>
<p>Use <code>sqlContext.sql(...)</code> to pass SQL statements that query the table and return a DataFrame</p>
<pre><code>sqlContext.sql(&apos;SELECT id, clubs[1], subjects[&quot;english&quot;] FROM records&apos;).show()
</code></pre>
<p>This is a more complex query, returning subsets of collection data from the queried rows</p>
<pre><code>sqlContext.sql(&apos;SELECT ID, NOT active FROM records&apos;).show()
</code></pre>
<p>Logical operators (AND, OR, NOT) can be used</p>
<pre><code>sqlContext.sql(&apos;SELECT * FROM records WHERE subjects[&quot;english&quot;] &gt; 90&apos;).show()
</code></pre>
<p>Comparison operators are also available (&lt;, &gt;, &lt;=, &gt;=)</p>
<pre><code>record_df.createGlobalTempView(&apos;global_records&apos;)
</code></pre>
<p>In order to make a table accessible to all sessions on the cluster we must register the table as <em>Global</em></p>
<pre><code>sqlContext.sql(&apos;SELECT * FROM global_temp.global_records&apos;).show()
</code></pre>
<p>In order to access a global table view, the <code>global_temp</code> namespace must be provided along with the table name</p>
<h2 id="analyzingdatawithsparksql">Analyzing Data with Spark SQL</h2>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>Import SparkSession</p>
<pre><code>spark = SparkSession.builder\
                    .appName(&quot;Analyzing airline data&quot;)\
                    .getOrCreate()
</code></pre>
<p>Create Session</p>
<pre><code>from pyspark.sql.types import Row
from datetime import datetime
</code></pre>
<p>Import some libraries to be used later</p>
<pre><code>airlinesPath = &apos;/Users/froilanmiranda/python-envs/sparktest/datasets/airlines.csv&apos;
flightsPath = &apos;/Users/froilanmiranda/python-envs/sparktest/datasets/flights.csv&apos;
airportsPath = &apos;/Users/froilanmiranda/python-envs/sparktest/datasets/airports.csv&apos;
</code></pre>
<p>Create some variables to represent the paths to the data sets</p>
<pre><code>airlines = spark.read\
                .format(&apos;csv&apos;)\
                .option(&apos;header&apos;, &apos;true&apos;)\
                .load(airlinesPath)
</code></pre>
<p>Create a DataFrame from the csv file</p>
<pre><code>airlines.createOrReplaceTempView(&apos;airlines&apos;)
</code></pre>
<p>Register a table view of the DataFrame that is only accessible from this SparkSession</p>
<pre><code>airlines = spark.sql(&apos;SELECT * FROM airlines&apos;)
airlines.columns
</code></pre>
<p>Explore the data by displaying the columns</p>
<pre><code>airlines.show(5)
</code></pre>
<p>Continue exploring the data by displaying the first few rows</p>
<pre><code>flights = spark.read\
                .format(&apos;csv&apos;)\
                .option(&apos;header&apos;,&apos;true&apos;)\
                .load(flightsPath)
</code></pre>
<p>Read in the next csv as a DataFrame</p>
<pre><code>flights.createOrReplaceTempView(&apos;flights&apos;)

flights.columns
</code></pre>
<p>Register another table view from the DataFrame. Then display the columns to begin exploring the data</p>
<pre><code>flights.show(5)
</code></pre>
<p>Explore this data by printing the first few records to the screen</p>
<pre><code>flights.count(), airlines.count()
</code></pre>
<p>Get a total record count for each set of data</p>
<pre><code>flights_count = spark.sql(&apos;SELECT COUNT(*) FROM flights&apos;)
airlines_count = spark.sql(&apos;SELECT COUNT(*) FROM airlines&apos;)
</code></pre>
<p>We can also use SQL to get the same data</p>
<pre><code>flights_count, airlines_count
</code></pre>
<p>Display the result of the SQL query. Notice the result is a DataFrame</p>
<pre><code>flights_count.collect()[0][0], airlines_count.collect()[0][0]
</code></pre>
<p>We can use matrix notation to extract particular values from the resulting DataFrame</p>
<pre><code>total_distance_df = (spark.sql(&apos;SELECT distance FROM flights&apos;)  # (1)
                        .agg({&apos;distance&apos;:&apos;sum&apos;})  # (2)
                        .withColumnRenamed(&apos;sum(distance)&apos;, &apos;total_distance&apos;))  # (3)
</code></pre>
<p>Mixing of DataFrame and SQL operations are valid since the sqlContext will return a DataFrame</p>
<ol>
<li>Return the &apos;distance&apos; column as a DataFrame</li>
<li>Apply an aggregation on the DataFrame</li>
<li>Create a new column in the DataFrame and assign the aggregate value to the new column</li>
</ol>
<pre><code>total_distance_df.show()
</code></pre>
<p>Display DataFrame values</p>
<pre><code>all_delays_2012 = spark.sql(
    &apos;SELECT date, airlines, flight_number, departure_delay &apos; +
    &apos;FROM flights WHERE departure_delay &gt; 0 and year(date) = 2012&apos;)
</code></pre>
<p>This results in an empty DataFrame; no records match the <code>WHERE</code> criteria</p>
<pre><code>all_delays_2012.show(5)
</code></pre>
<p>Displays empty DataFrame</p>
<pre><code>all_delays_2014 = spark.sql(
    &apos;SELECT date, airlines, flight_number, departure_delay &apos; +
    &apos;FROM flights WHERE departure_delay &gt; 0 and year(date) = 2014&apos;)

all_delays_2014.show(5)
</code></pre>
<p>Change the criteria to capture data that exists in the table view</p>
<pre><code>all_delays_2014.createOrReplaceTempView(&apos;all_delays&apos;)
</code></pre>
<p>Register the resulting DataFrame as a table view</p>
<pre><code>all_delays_2014.orderBy(all_delays_2014.departure_delay.desc()).show(5)
</code></pre>
<p>Sort all the records by the delay time. Notice the values for the delay don&apos;t make sense; earlier we saw other delay times that were greater in value. Why is this? Because the delay column is being treated as a string value. This is why taking your time to observe and explore your data is crucial. We will not use this data, so we can leave it as is; a possible fix is sketched below</p>
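<p>If we did want the ordering here to be numeric, one option (a small sketch, not used in the rest of this walkthrough) is to cast the string column to a float before sorting:</p>
<pre><code>from pyspark.sql.functions import col

# Cast the string column to a float so the ordering is numeric
all_delays_typed = all_delays_2014.withColumn(
    &apos;departure_delay&apos;, col(&apos;departure_delay&apos;).cast(&apos;float&apos;))

all_delays_typed.orderBy(all_delays_typed.departure_delay.desc()).show(5)
</code></pre>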
<pre><code>delay_count = spark.sql(&apos;SELECT COUNT(departure_delay) FROM all_delays&apos;)
</code></pre>
<p>Collect the total count of flights delayed</p>
<pre><code>delay_count.show()
</code></pre>
<p>Display this result</p>
<pre><code>delay_count.collect()[0][0]
</code></pre>
<p>Extract the single piece of data</p>
<pre><code>delay_percent = delay_count.collect()[0][0] / flights_count.collect()[0][0] * 100
delay_percent
</code></pre>
<p>Using all this data we can calculate the percentage of flights that were delayed</p>
<pre><code>delay_per_airline = spark.sql(&apos;SELECT airlines, departure_delay  FROM flights&apos;)\
                        .groupBy(&apos;airlines&apos;)\
                        .agg({&apos;departure_delay&apos;:&apos;avg&apos;})\
                        .withColumnRenamed(&apos;avg(departure_delay)&apos;, &apos;departure_delay&apos;)
</code></pre>
<p>Now let&apos;s get the average delay by airline</p>
<pre><code>delay_per_airline.orderBy(delay_per_airline.departure_delay.desc()).show(5)
</code></pre>
<p>Ordering by departure delay in descending order gives us the airlines with the longest delays</p>
<pre><code>delay_per_airline.createOrReplaceTempView(&apos;delay_per_airline&apos;)
</code></pre>
<p>Register the DataFrame as a table view to perform SQL queries</p>
<pre><code>delay_per_airline = spark.sql(&apos;SELECT * FROM delay_per_airline ORDER BY departure_delay DESC&apos;)
</code></pre>
<p>This will assign ordered data from the SQL table into a DataFrame.</p>
<pre><code>delay_per_airline.show(5)
</code></pre>
<p>Displaying this data we can see it matches the previous operation. This is to show that SQL and DataFrame operations will result in the same outcome</p>
<pre><code>delay_per_airline = spark.sql(&apos;SELECT * FROM delay_per_airline &apos; +
                              &apos;JOIN airlines ON airlines.code = delay_per_airline.airlines &apos; +
                              &apos;ORDER BY departure_delay DESC&apos;)
</code></pre>
<p>Using a SQL join, we are able to combine two registered SQL tables and return a DataFrame</p>
<pre><code>delay_per_airline.show(5)
</code></pre>
<p>Display the first few rows of the resulting DataFrame</p>
<h2 id="inferredandexplicitschemas">Inferred and Explicit Schemas</h2>
<p>Spark will infer data types when creating DataFrames. But sometimes we will need to explicitly set the schema of the DataFrame</p>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<pre><code>spark = SparkSession.builder\
                    .appName(&apos;Inferred and explicit schemas&apos;)\
                    .getOrCreate()
</code></pre>
<pre><code>from pyspark.sql.types import Row
</code></pre>
<p>Import the needed libraries and create the needed entities as per usual</p>
<pre><code>lines = sc.textFile(&apos;/Users/froilanmiranda/python-envs/sparktest/datasets/students.txt&apos;)
</code></pre>
<p>Use the SparkContext that is directly available to read the text file into an RDD</p>
<pre><code>lines.collect()
</code></pre>
<p>This is a comma separated list about a few students. Every line is a string and every string has values separated by commas</p>
<pre><code>parts = lines.map(lambda l: l.split(&apos;,&apos;)) # (1)

parts.collect() # (2)
</code></pre>
<ol>
<li>Use the map function with a lambda to create a list from the string value of each row</li>
<li>Display the result to screen</li>
</ol>
<pre><code>students = parts.map(lambda p: Row(name=p[0], math=int(p[1]), english=int(p[2]), science=int(p[3])))
</code></pre>
<p>Again, use the map function with a lambda to create Row objects from the list</p>
<pre><code>students.collect()
</code></pre>
<p>Display the result</p>
<pre><code>schemaStudents = spark.createDataFrame(students)

schemaStudents.createOrReplaceTempView(&apos;students&apos;)
</code></pre>
<p>Create a DataFrame from the RDD and then register it as a SQL table</p>
<pre><code>schemaStudents.columns
</code></pre>
<p>Show column info of the DataFrame</p>
<pre><code>schemaStudents.schema
</code></pre>
<p>We did not declare a schema for the DataFrame but it was able to use reflection to infer the schema. Notice the data types <code>StructType</code> and <code>StructField</code></p>
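<p>As a small aside, the same schema information can also be printed as an indented tree:</p>
<pre><code>schemaStudents.printSchema()
</code></pre>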
<pre><code>spark.sql(&apos;SELECT * FROM students&apos;).show()
</code></pre>
<p>It is because of the inferred typing that Spark can create a schema for the table view when it is registered</p>
<pre><code>parts.collect()
</code></pre>
<p>Now let&apos;s use the parts RDD to create a DataFrame and explicitly define the schema. We can see it is an RDD of List elements</p>
<pre><code>parts_typed = parts.map(lambda p: Row(name=p[0], math=int(p[1]), english=int(p[2]), science=int(p[3])))
</code></pre>
<p>As we can see above, the values for the grades are strings and we will need them to be numbers. Use the map function and a lambda together to accomplish this</p>
<pre><code>schemaString = &apos;name math english science&apos;
</code></pre>
<p>This is just to map out, for visual reference only, what columns we want the schema to contain</p>
<pre><code>from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType

fields = [StructField(&apos;name&apos;, StringType(), True),
         StructField(&apos;math&apos;, IntegerType(), True),
         StructField(&apos;english&apos;, IntegerType(), True),
         StructField(&apos;science&apos;, IntegerType(), True)]
</code></pre>
<p>Specify the fields for every record. Each column is represented by a <code>StructField</code> and takes values for the column name, the data type, and whether it is nullable.</p>
<pre><code>schema = StructType(fields)
</code></pre>
<p>Create a <code>StructType</code> and pass the <code>StructFields</code> as a parameter to create a schema</p>
<pre><code>schemaStudents = spark.createDataFrame(parts_typed, schema)
</code></pre>
<p>Create a DataFrame using the RDD of List and the configured schema</p>
<pre><code>schemaStudents.columns
</code></pre>
<p>Confirm columns have been properly named</p>
<pre><code>schemaStudents.schema
</code></pre>
<p>Confirm schema is configured correctly</p>
<pre><code>schemaStudents.createOrReplaceTempView(&apos;students_explicit&apos;)
</code></pre>
<p>Register the DataFrame as a SQL table</p>
<pre><code>spark.sql(&apos;SELECT * FROM students_explicit&apos;).show()
</code></pre>
<p>And now with the schema explicitly in place we can query the data with SQL</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[PySpark - Using DataFrames]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h1 id="sparkdataframes">Spark DataFrames</h1>
<p>Previously we looked at RDDs, which were the primary data structure in Spark 1. In Spark 2 we rarely use RDDs, reserving them for low-level transformations and control over the dataset. If the data is unstructured or streaming we still have to rely on RDDs, for everything</p>]]></description><link>https://blog.virtual-artifact.com/pyspark-using-dataframes/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c2</guid><category><![CDATA[Python]]></category><category><![CDATA[Spark 2]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 15 Apr 2020 17:08:17 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="sparkdataframes">Spark DataFrames</h1>
<p>Previously we looked at RDDs, which were the primary data structure in Spark 1. In Spark 2 we rarely use RDDs, reserving them for low-level transformations and control over the dataset. If the data is unstructured or streaming we still have to rely on RDDs; for everything else we will use DataFrames</p>
<h2 id="sparksessionvssparkcontext">SparkSession vs. SparkContext</h2>
<p>Up until now we have been using the SparkContext as the entry point to Spark. Moving forward, the SparkSession will be the entry point we will utilize</p>
<p>SparkSession offers:</p>
<ul>
<li>Ease of Use
<ul>
<li>SparkSession - simplified entry point</li>
<li>No confusion about which context to use</li>
<li>Encapsulates SQLContext and Hive Context</li>
</ul>
</li>
</ul>
<p>To create a <strong>SparkSession</strong> use <code>SparkSession.builder</code></p>
<h2 id="exploringdatawithdataframes">Exploring Data with DataFrames</h2>
<h3 id="sparksessionanddataframes">SparkSession and DataFrames</h3>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>Import the necessary libraries</p>
<pre><code>spark = SparkSession.builder\
                    .appName(&quot;Analyzing London Crime Data&quot;)\
                    .getOrCreate()
</code></pre>
<p>Build a new SparkSession and assign it a name if a session with the same name does not exist. Otherwise, return the existing session with this app name. This will be the entry point to the Spark engine</p>
<pre><code>data = (spark.read  # (1)
            .format(&quot;csv&quot;)  # (2)
            .option(&quot;header&quot;, &quot;true&quot;)  # (3)
            .load(&quot;../datasets/london_crime_by_lsoa.csv&quot;))  # (4)
</code></pre>
<ol>
<li><code>.read</code> returns a DataFrameReader that can be used to read non-streaming data in as a DataFrame.</li>
<li><code>.format(...)</code> sets the file format to be read</li>
<li><code>.options(...)</code> adds input options for the underlying data source</li>
<li><code>.load(...)</code> loads input in as a DataFrame from a data source</li>
</ol>
<pre><code>data.printSchema()
</code></pre>
<p>Remember DataFrames are always structured data. Using <code>.printSchema</code> will print the schema of the tabular data</p>
<pre><code>data.count()
</code></pre>
<p>We can see the number of rows in this DataFrame</p>
<pre><code>data.limit(5).show()
</code></pre>
<p>Examine the data by looking at the first 5 rows with the <code>.show()</code> function</p>
<h3 id="dropandselectcolumns">Drop and Select Columns</h3>
<pre><code>data.dropna()
</code></pre>
<p>Drop rows which have values that are not available (N/A). Note that DataFrames are immutable, so to keep the result you would assign it back (for example <code>data = data.dropna()</code>). As we know, this is a major part of data cleaning</p>
<pre><code>data = data.drop(&apos;lsoa_code&apos;)

data.show(5)
</code></pre>
<p>To drop a column we can use <code>.drop(...)</code> and pass a column name as a string value.</p>
<pre><code>total_boroughs = (data.select(&apos;borough&apos;)  # (1)
                    .distinct())  # (2)
total_boroughs.show()
</code></pre>
<ol>
<li>Select a column from the DataFrame</li>
<li>Select only distinct values</li>
</ol>
<pre><code>total_boroughs.count()
</code></pre>
<p>The number of distinct values for this column from the DataFrame</p>
<h3 id="filterrecords">Filter Records</h3>
<pre><code>hackney_data = data.filter(data[&apos;borough&apos;] == &apos;Hackney&apos;)
hackney_data.show(5)
</code></pre>
<p>Using <code>.filter(...)</code> we can filter records based on their column values</p>
<pre><code># (1)
data_2015_2016 = data.filter(data[&apos;year&apos;].isin([&quot;2015&quot;, &quot;2016&quot;]))

# (2)
data_2015_2016.sample(fraction=0.1).show()
</code></pre>
<ol>
<li>Notice the use of <code>.isin(...)</code> within the filter parameters. This will select the records whose column values match any of the values passed into it</li>
<li><code>.sample(...)</code> returns a sampled subset of this DataFrame. <code>fraction</code> determines the fraction size of the full DataFrame to return</li>
</ol>
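<p>As an aside, <code>.sample(...)</code> also accepts a seed if you want the same random subset on every run (a small sketch):</p>
<pre><code>data_2015_2016.sample(fraction=0.1, seed=42).show()
</code></pre>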
<pre><code>data_2014_onwards = data.filter(data[&apos;year&apos;] &gt;=2014)

data_2014_onwards.sample(fraction=0.1).show()
</code></pre>
<p>Another example of <code>.filter(...)</code>, using the <code>&gt;=</code> comparison operator to select records with a column value greater than or equal to the value passed</p>
<h3 id="aggregationsandgrouping">Aggregations and grouping</h3>
<pre><code>borough_crime_count = data.groupBy(&apos;borough&apos;)\
                            .count()

borough_crime_count.show(5)
</code></pre>
<p>DataFrames support grouping of data, with the <code>.groupBy(...)</code> function. <code>.groupBy(...)</code> can be used on any column</p>
<pre><code>borough_crime_count = data.groupBy(&apos;borough&apos;)\
                            .agg({&quot;value&quot;:&quot;sum&quot;})

borough_crime_count.show(5)
</code></pre>
<p><code>.agg(...)</code> is a function that will compute aggregates and return the result as a DataFrame.</p>
<p>Built-in aggregation functions:</p>
<ul>
<li>avg</li>
<li>max</li>
<li>min</li>
<li>sum</li>
<li>count</li>
</ul>
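<p>Several of these can be combined in a single <code>.agg(...)</code> call. A quick sketch against the same crime DataFrame:</p>
<pre><code>data.groupBy(&apos;borough&apos;)\
    .agg({&apos;value&apos;: &apos;avg&apos;, &apos;year&apos;: &apos;max&apos;})\
    .show(5)
</code></pre>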
<pre><code>borough_conviction_sum = data.groupBy(&apos;borough&apos;)\
                            .agg({&quot;value&quot;:&quot;sum&quot;})\
                            .withColumnRenamed(&apos;sum(value)&apos;,&apos;convictions&apos;)

borough_conviction_sum.show(5)
</code></pre>
<p>Using <code>.withColumnRenamed(&lt;original name&gt;, &lt;new name&gt;)</code> will result in the column name being replaced by a new name</p>
<pre><code>total_borough_convictions = borough_conviction_sum.agg({&apos;convictions&apos;:&apos;sum&apos;})

total_borough_convictions.show()
</code></pre>
<p>By removing the grouping function, the aggregate will act on the whole DataFrame</p>
<pre><code>total_convictions = total_borough_convictions.collect()[0][0]
</code></pre>
<p>Using matrix notation we can grab the value out of the collection and assign it to a variable</p>
<pre><code>import pyspark.sql.functions as func
</code></pre>
<p>Imports some extra functionality from the PySpark library</p>
<pre><code>borough_percentage_contribution = borough_conviction_sum.withColumn(
    &apos;% contribution&apos;,
    func.round(borough_conviction_sum.convictions / total_convictions * 100, 2))

borough_percentage_contribution.printSchema()
</code></pre>
<p>Here we create a new column and use the previous variable to calculate the new column value</p>
<pre><code>borough_percentage_contribution.orderBy(borough_percentage_contribution[2].desc())\
                                .show(10)
</code></pre>
<p>We can use <code>.orderBy(...)</code> and a column index to sort the DataFrame in ascending or descending order</p>
<pre><code>conviction_monthly = data.filter(data[&apos;year&apos;] == 2014)\
                            .groupBy(&apos;month&apos;)\
                            .agg({&apos;value&apos;:&apos;sum&apos;})\
                            .withColumnRenamed(&apos;sum(value)&apos;, &apos;convictions&apos;)
</code></pre>
<p>Here we use a combination of group by, aggregate and column renaming to extract the data</p>
<pre><code>total_conviction_monthly = conviction_monthly.agg({&apos;convictions&apos;:&apos;sum&apos;})\
                                            .collect()[0][0]

total_conviction_monthly = conviction_monthly.withColumn(
                &apos;percent&apos;,
                func.round(conviction_monthly.convictions/total_conviction_monthly * 100, 2))
total_conviction_monthly.columns
</code></pre>
<p>Now we use more transformations to alter the data more and print the resulting DataFrame columns</p>
<pre><code>total_conviction_monthly.orderBy(total_conviction_monthly.percent.desc()).show()
</code></pre>
<p>Finally, we order the resulting DataFrame and display</p>
<h2 id="aggregationsandvisualizations">Aggregations and Visualizations</h2>
<pre><code>crimes_category = data.groupBy(&apos;major_category&apos;)\
                        .agg({&apos;value&apos;:&apos;sum&apos;})\
                        .withColumnRenamed(&apos;sum(value)&apos;,&apos;convictions&apos;)
</code></pre>
<p>Use group by and aggregates to create a DataFrame</p>
<pre><code>crimes_category.orderBy(crimes_category.convictions.desc()).show()
</code></pre>
<p>Order and display the new DataFrame</p>
<pre><code>year_df = data.select(&apos;year&apos;)
</code></pre>
<p>Create a new DataFrame from one column</p>
<pre><code>year_df.agg({&apos;year&apos;:&apos;min&apos;}).show()
</code></pre>
<p>Use the min aggregate to return the minimum value</p>
<pre><code>year_df.agg({&apos;year&apos;:&apos;max&apos;}).show()
</code></pre>
<p>Use the max aggregate to return the maximum value</p>
<pre><code>year_df.describe().show()
</code></pre>
<p><code>.describe()</code> will return:</p>
<ul>
<li>count</li>
<li>mean</li>
<li>standard deviation</li>
<li>min</li>
<li>max</li>
</ul>
<pre><code>data.crosstab(&apos;borough&apos;, &apos;major_category&apos;)\
    .select(&apos;borough_major_category&apos;, &apos;Burglary&apos;, &apos;Drugs&apos;, &apos;Fraud or Forgery&apos;, &apos;Robbery&apos;)\
    .show()
</code></pre>
<p><code>.crosstab(...)</code> computes a pair-wise frequency table of the given columns. Also known as a contingency table.</p>
<pre><code>get_ipython().magic(&apos;matplotlib inline&apos;)
import matplotlib.pyplot as plt
plt.style.use(&apos;ggplot&apos;)
</code></pre>
<p>Matplotlib graphs are displayed inline in this notebook</p>
<pre><code>def describe_year(year):
    yearly_details = data.filter(data.year == year)\
                        .groupBy(&apos;borough&apos;)\
                        .agg({&apos;value&apos;:&apos;sum&apos;})\
                        .withColumnRenamed(&apos;sum(value)&apos;, &apos;convictions&apos;)
    
    borough_list = [x[0] for x in yearly_details.toLocalIterator()]
    convictions_list = [x[1] for x in yearly_details.toLocalIterator()]
    
    plt.figure(figsize=(33,10))
    plt.bar(borough_list, convictions_list)
    
    plt.title(&apos;Crime for the year: &apos; + year, fontsize=30)
    plt.xlabel(&apos;Boroughs&apos;,fontsize=30)
    plt.ylabel(&apos;Convictions&apos;, fontsize=30)
    
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(fontsize=30)
    plt.autoscale()
    plt.show()
</code></pre>
<p>This is a helper function to contain all the necessary steps to create the DataFrame based on the year and create the chart to visualize it. A sample call is shown below</p>
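<p>Calling the helper with a year (as a string, since the CSV columns were read in as strings) draws the chart:</p>
<pre><code>describe_year(&apos;2014&apos;)
</code></pre>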
<h2 id="extractingdataanduserdefinedfunctions">Extracting Data and User Defined Functions</h2>
<p>In this section we will explore using DataFrames to explore and clean data. We will use <em>User Defined Functions</em> to assist us in the process</p>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>First import <em>SparkSession</em></p>
<pre><code>spark = SparkSession.builder\
                .appName(&apos;Analyzing soccer players&apos;)\
                .getOrCreate()
</code></pre>
<p>Create <em>SparkSession</em> instance</p>
<pre><code>players = spark.read\
                .format(&apos;csv&apos;)\
                .option(&apos;header&apos;, &apos;true&apos;)\
                .load(&apos;../datasets/player.csv&apos;)
</code></pre>
<p>Read in the data source into a DataFrame</p>
<pre><code>players.printSchema()
</code></pre>
<p>Look at the schema</p>
<pre><code>players.show(5)
</code></pre>
<p>Check out the first few records</p>
<pre><code>player_attributes = spark.read\
                        .format(&apos;csv&apos;)\
                        .option(&apos;header&apos;, &apos;true&apos;)\
                        .load(&apos;../datasets/Player_Attributes.csv&apos;)
</code></pre>
<p>Read in a second CSV data source as a DataFrame</p>
<pre><code>player_attributes.printSchema()
</code></pre>
<p>Again, look at the schema</p>
<pre><code>players.count(), player_attributes.count()
</code></pre>
<p>Let&apos;s view the total record count for each DataFrame</p>
<pre><code>player_attributes.select(&apos;player_api_id&apos;)\
                .distinct()\
                .count()
</code></pre>
<p>Notice that entities from one DataFrame have a many to one relationship with the records of the other data set</p>
<pre><code>players = players.drop(&apos;id&apos;, &apos;player_fifa_api_id&apos;)
players.columns
</code></pre>
<p>Get rid of unwanted data columns</p>
<pre><code>player_attributes = player_attributes.drop(
    &apos;id&apos;,
    &apos;player_fifa_api_id&apos;,
    &apos;preferred_foot&apos;,
    &apos;attacking_work_rate&apos;,
    &apos;defensive_work_rate&apos;,
    &apos;crossing&apos;,
    &apos;jumping&apos;,
    &apos;sprint_speed&apos;,
    &apos;balance&apos;,
    &apos;aggression&apos;,
    &apos;short_passing&apos;,
    &apos;potential&apos;
)
player_attributes.columns
</code></pre>
<p>Get rid of unwanted data columns</p>
<pre><code>player_attributes = player_attributes.dropna()
players = players.dropna()
</code></pre>
<p>Remove records with unavailable data</p>
<pre><code>players.count(), player_attributes.count()
</code></pre>
<p>Look at the new data count</p>
<p><strong>User defined functions</strong></p>
<pre><code>from pyspark.sql.functions import udf
</code></pre>
<p>Import the <strong>User defined functions</strong> library</p>
<pre><code>year_extract_udf = udf(lambda date: date.split(&apos;-&apos;)[0]) # (1)

player_attributes = player_attributes.withColumn( # (2)
    &apos;year&apos;,
    year_extract_udf(player_attributes.date)
)
</code></pre>
<ol>
<li>Create a UDF with a lambda function that operates on a date value and returns only the year</li>
<li>Create a new column for the year and extract the values from the data column using the UDF</li>
</ol>
<pre><code>player_attributes = player_attributes.drop(&apos;date&apos;)
</code></pre>
<p>Now we can drop the date column, as the year data has been copied to another column</p>
<pre><code>player_attributes.columns
</code></pre>
<p>View the new column list</p>
<h2 id="joiningdataframes">Joining DataFrames</h2>
<p>Spark DataFrames can be joined much like SQL tables can be joined. In this section we will join data to create a new DataFrame.</p>
<pre><code>pa_2016 = player_attributes.filter(player_attributes.year == 2016)
</code></pre>
<p>Create a new DataFrame from a subset of another DataFrame</p>
<pre><code>pa_2016.count()
</code></pre>
<p>View the count</p>
<pre><code>pa_2016.select(pa_2016.player_api_id)\
    .distinct()\
    .count()
</code></pre>
<p>Select only distinct values to make sure the unique ids match the DataFrame we want to join</p>
<pre><code>pa_striker_2016 = pa_2016.groupBy(&apos;player_api_id&apos;)\
                        .agg({
                            &apos;finishing&apos;:&apos;avg&apos;,
                            &apos;shot_power&apos;:&apos;avg&apos;,
                            &apos;acceleration&apos;:&apos;avg&apos;
                        })
</code></pre>
<p>Since one data set has many records associated with an entity, we will group the records by entity id first, then average the values of the columns we are interested in to create a one-to-one relationship</p>
<pre><code>pa_striker_2016.count()
</code></pre>
<p>Check that the two DataFrame counts match</p>
<pre><code>pa_striker_2016.show(5)
</code></pre>
<p>Take a quick look at the new aggregated data</p>
<pre><code>pa_striker_2016 = pa_striker_2016.withColumnRenamed(&apos;avg(finishing)&apos;, &apos;finishing&apos;)\
                                 .withColumnRenamed(&apos;avg(shot_power)&apos;, &apos;shot_power&apos;)\
                                 .withColumnRenamed(&apos;avg(acceleration)&apos;, &apos;acceleration&apos;)
</code></pre>
<p>Rename the columns for readability</p>
<pre><code>weight_finishing = 1
weight_shot_power = 2
weight_acceleration = 1

total_weight = weight_finishing + weight_shot_power + weight_acceleration
</code></pre>
<p>Let&apos;s create a weighted grading system to apply more value to some attributes</p>
<pre><code>strikers = pa_striker_2016.withColumn(&apos;striker_grade&apos;,
                                     (pa_striker_2016.finishing * weight_finishing + \
                                      pa_striker_2016.shot_power * weight_shot_power + \
                                      pa_striker_2016.acceleration * weight_acceleration) / total_weight)
</code></pre>
<p>Create a new column and apply the grading system to calculate each row&apos;s value</p>
<pre><code>strikers = strikers.drop(&apos;finishing&apos;,
                         &apos;acceleration&apos;,
                         &apos;shot_power&apos;)
</code></pre>
<p>Remove unneeded fields</p>
<pre><code>strikers = strikers.filter(strikers.striker_grade &gt; 70)\
                    .sort(strikers.striker_grade.desc())

strikers.show(10)
</code></pre>
<p>Drop lower grades from the dataset</p>
<pre><code>strikers.count(), players.count()
</code></pre>
<p>See how many entities we have left</p>
<pre><code>striker_details = players.join(strikers, players.player_api_id == strikers.player_api_id)
</code></pre>
<p>Now we can join the two DataFrames</p>
<pre><code>striker_details.columns
</code></pre>
<p>View the columns, and take note of the duplicated join fields</p>
<pre><code>striker_details.count()
</code></pre>
<p>Check that the count is in line with before</p>
<pre><code>striker_details = players.join(strikers, [&apos;player_api_id&apos;])
</code></pre>
<p>Alternate way to join</p>
<pre><code>striker_details.show(5)
</code></pre>
<p>View the data</p>
<pre><code>striker_details.columns
</code></pre>
<p>View the columns, and take note of the single join column</p>
<h2 id="savingdataframestocsvandjson">Saving DataFrames to CSV and JSON</h2>
<p>Saving to file is pretty straightforward</p>
<h3 id="csv">CSV</h3>
<pre><code>(striker_details.select(&quot;player_name&quot;, &quot;striker_grade&quot;)  # (1)
                .coalesce(1)  # (2)
                .write  # (3)
                .option(&apos;header&apos;, &apos;true&apos;)  # (4)
                .csv(&apos;striker_grade.csv&apos;))  # (5)
</code></pre>
<ol>
<li>Select the columns to export</li>
<li>how many files to break the data into</li>
<li>Begin the write command</li>
<li>Any options to apply</li>
<li>File format and file name</li>
</ol>
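<p>To sanity-check the export (a small sketch), the resulting folder can be read straight back into a DataFrame:</p>
<pre><code>spark.read.option(&apos;header&apos;, &apos;true&apos;).csv(&apos;striker_grade.csv&apos;).show(5)
</code></pre>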
<h3 id="json">JSON</h3>
<pre><code>striker_details.select(&quot;player_name&quot;, &quot;striker_grade&quot;)\
                .write\
                .json(&apos;striker_grade.json&apos;)
</code></pre>
<h2 id="goingfurtherwithjoins">Going Further with Joins</h2>
<p>Here we will cover other ways to join DataFrames</p>
<pre><code>valuesA = [(&apos;John&apos;, 100000), (&apos;James&apos;, 150000), (&apos;Emily&apos;, 65000), (&apos;Nina&apos;, 200000)]
tableA = spark.createDataFrame(valuesA, [&apos;name&apos;, &apos;salary&apos;])
</code></pre>
<p>Create a DataFrame from a list of tuples</p>
<pre><code>tableA.show()
</code></pre>
<p>View DataFrame</p>
<pre><code>valuesB = [(&apos;James&apos;, 2), (&apos;Emily&apos;,3), (&apos;Darth Vader&apos;, 5), (&apos;Princess Leia&apos;, 6)]

tableB = spark.createDataFrame(valuesB, [&apos;name&apos;, &apos;employee_id&apos;])
</code></pre>
<p>Create a second DataFrame</p>
<pre><code>tableB.show()
</code></pre>
<p>View DataFrame</p>
<pre><code>inner_join = tableA.join(tableB, tableA.name == tableB.name)
inner_join.show()
</code></pre>
<p>This is the behavior that we have seen previously</p>
<pre><code>left_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;left&apos;)
left_join.show()
</code></pre>
<p>Using the <code>how</code> parameter to explicitly declare the type of join</p>
<pre><code>right_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;right&apos;)
right_join.show()
</code></pre>
<p>Right outer join</p>
<pre><code>full_outer_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;full&apos;)
full_outer_join.show()
</code></pre>
<p>Full outer join</p>
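<p>Beyond these, Spark also supports semi and anti joins. For example, a left anti join keeps only the rows of <code>tableA</code> that have no match in <code>tableB</code> (a quick sketch):</p>
<pre><code>anti_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;left_anti&apos;)
anti_join.show()
</code></pre>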
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[PySpark - Using RDDs]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I will keep my data in a folder named &apos;datasets&apos;</p>
<p>If PySpark is not already loaded up, go ahead and start PySpark and create a new Jupyter notebook</p>
<p>View information about the SparkContext by inputting <code>sc</code></p>
<p>If we were running a cluster of nodes the output would be</p>]]></description><link>https://blog.virtual-artifact.com/pyspark-using-rdds/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c1</guid><category><![CDATA[Python]]></category><category><![CDATA[Spark 2]]></category><category><![CDATA[Pandas]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Tue, 14 Apr 2020 23:11:40 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>I will keep my data in a folder named &apos;datasets&apos;</p>
<p>If PySpark is not already loaded up, go ahead and start PySpark and create a new Jupyter notebook</p>
<p>View information about the SparkContext by inputting <code>sc</code></p>
<p>If we were running a cluster of nodes the output would be a bit more interesting. As we are running in standalone mode there is little output</p>
<p>Let&apos;s import a few things</p>
<pre><code>from pyspark.sql.types import Row # (1)
from datetime import datetime # (2)
</code></pre>
<ol>
<li><strong>Row</strong> is a Spark object that represents a single row of a DataFrame; we will see this shortly</li>
<li><strong>datetime</strong> is standard from Python</li>
</ol>
<pre><code>simple_data = sc.parallelize([1, &quot;Alice&quot;, 50])
simple_data
</code></pre>
<p><code>sc.parallelize(...)</code> converts data into an RDD</p>
<pre><code>simple_data.count()
</code></pre>
<p>Returns a number representing the number of entities in the RDD</p>
<pre><code>simple_data.first()
</code></pre>
<p>Access the first element in the RDD</p>
<p><code>.count()</code> and <code>.first()</code> are what are considered <em>Actions</em></p>
<pre><code>simple_data.take(2)
</code></pre>
<p><code>.take(...)</code> will return a subset of the RDD as a list</p>
<pre><code>simple_data.collect()
</code></pre>
<p><code>.collect()</code> will return all values in the RDD as a list</p>
<p>These have been some examples of <em>Actions</em>. Remember that when calling an action it will trigger all the transformations that occurred before it to execute. This can be a costly operation on large datasets so be careful when and where you use them</p>
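<p>A tiny illustration of that laziness, using throwaway data:</p>
<pre><code>doubled = sc.parallelize([1, 2, 3]).map(lambda x: x * 2)  # transformation only; nothing runs yet
doubled.collect()  # the action triggers the computation and returns [2, 4, 6]
</code></pre>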
<p>Up until now we have been using <strong>RDD</strong>s, and this is fine to do in Spark. However, Spark 2 offers the new <strong>DataFrame</strong> and we will use dataframes a lot more than RDDs. The takeaway is that Spark 2 still has access to the underlying RDD construct. Let&apos;s quickly try to create a dataframe from an RDD</p>
<pre><code>df = simple_data.toDF()
</code></pre>
<p><code>toDF()</code> will try to convert an RDD to a DataFrame</p>
<p>Here we get an error. The problem is that the data in the rows of the RDD is not structured: the data types are mixed, and DataFrames require rows with a consistent schema.</p>
<p>This RDD has no schema and contains elements of different types, so it cannot be converted to a DataFrame</p>
<h2 id="convertrddstodataframes">Convert RDDs to DataFrames</h2>
<pre><code>records = sc.parallelize([[1, &quot;Alice&quot;, 50], [2,&quot;Bob&quot;, 100]])
records
</code></pre>
<p>Here we create an RDD with structured data, two rows with matching data schemas</p>
<pre><code>records.collect()
</code></pre>
<p>Again, <code>collect()</code> returns all rows in the RDD</p>
<pre><code>records.count()
</code></pre>
<p>Again, returns the row count of the RDD</p>
<pre><code>records.first()
</code></pre>
<p>Again, returns the first record</p>
<pre><code>records.take(2)
</code></pre>
<pre><code>records.collect()
</code></pre>
<p>Because of the size of this RDD, the previous two methods have the same return values</p>
<pre><code>df = records.toDF()
</code></pre>
<p>This will return a Spark DataFrame for the RDD&apos;s values. Because the RDD&apos;s rows have the same number of columns and those columns have the same data type, Spark can create a DataFrame from this RDD&apos;s values</p>
<pre><code>df
</code></pre>
<p>We can see Spark infers the datatypes</p>
<pre><code>df.show()
</code></pre>
<p><code>.show()</code> allows for a quick view of the dataframe. Here it shows the first 20 rows by default</p>
<p>Take notice that the Column names have been automatically generated and assigned</p>
<p>If we want to specify the column names we must make use of the <strong>Row</strong> object imported earlier. Using a <strong>Row</strong> object to create an RDD will pass column name data</p>
<pre><code>data = sc.parallelize([Row(id=1,
                           name=&quot;Alice&quot;,
                           score=50)])
data
</code></pre>
<p>Now we can inspect the column names within the Row object</p>
<p><code>data.collect()</code></p>
<p><code>data.count()</code></p>
<p>Let&apos;s create a dataframe from this RDD</p>
<pre><code>df = data.toDF()
df.show()
</code></pre>
<p>We can now see the column names applied to the output.</p>
<p>Let&apos;s add some more data</p>
<pre><code>data = sc.parallelize([Row(id=1,
                           name=&quot;Alice&quot;,
                           score=50),
                      Row(id=2,
                           name=&quot;Bob&quot;,
                           score=100),
                      Row(id=3,
                           name=&quot;Charlee&quot;,
                           score=150)])
data
</code></pre>
<p>Now, convert to a dataframe and show</p>
<pre><code>df = data.toDF()
df.show()
</code></pre>
<p>And as before, since the data is structured Spark will infer the datatypes and effortlessly convert to a dataframe</p>
<h2 id="workingwithcomplexdata">Working with complex data</h2>
<pre><code>complex_data = sc.parallelize([Row(
                                col_float=1.44,
                                col_integer=10,
                                col_string=&quot;John&quot;)])
</code></pre>
<p>We create an RDD with one Row object. This row consists of float, integer, and string values</p>
<pre><code>complex_data_df = complex_data.toDF()
complex_data_df.show()
</code></pre>
<p>Convert the complex data to a dataframe</p>
<pre><code>complex_data = sc.parallelize([Row(
                                col_float=1.44,
                                col_integer=10,
                                col_string=&quot;John&quot;,
                                col_boolean=True,
                                col_list=[1,2,3])])
</code></pre>
<p>Now we see a good mixture of datatypes, take note of the list in the last column.</p>
<pre><code>complex_data_df = complex_data.toDF()
complex_data_df.show()
</code></pre>
<p>After converting to a dataframe, we can see from the table displayed by the <code>show()</code> method that the list type has been preserved</p>
<pre><code>complex_data = sc.parallelize([Row(
                                col_list=[1,2,3],
                                col_dic={&quot;k1&quot;: 0},
                                col_row=Row(a=10,b=20,c=30),
                                col_time=datetime(2014, 8, 1, 14, 1 ,5)
                              ),
                              Row(
                                col_list=[1,2,3,4,5],
                                col_dic={&quot;k1&quot;: 0, &quot;k2&quot;:1},
                                col_row=Row(a=40,b=50,c=60),
                                col_time=datetime(2014, 8, 1, 14, 1 ,6)
                              ),
                              Row(
                                col_list=[1,2,3,4,5,6,7],
                                col_dic={&quot;k1&quot;: 0,&quot;k2&quot;: 0,&quot;k3&quot;: 0},
                                col_row=Row(a=70,b=80,c=90),
                                col_time=datetime(2014, 8, 1, 14, 1 ,7)
                              )])
</code></pre>
<p>Here we can see several of the complex structures supported by dataframes in Spark: lists, dictionaries, nested rows, and timestamps</p>
<pre><code>complex_data_df = complex_data.toDF()
complex_data_df.show()
</code></pre>
<h2 id="sqlcontext">SQL Context</h2>
<p>We can use the sqlContext to run SQL queries on the Spark data</p>
<p><code>sqlContext = SQLContext(sc)</code></p>
<p><code>sqlContext</code></p>
<p>This wraps around the SparkContext to add SQL functionality</p>
<pre><code>df = sqlContext.range(5)
df
</code></pre>
<p><code>.range(5)</code> on the sqlContext object will return a single-column dataframe with five rows containing the integer values 0 through 4</p>
<p><code>df.count()</code></p>
<pre><code>data = [(&apos;Alice&apos;,50),
        (&apos;Bob&apos;,80),
        (&apos;Charlee&apos;, 75)]
</code></pre>
<p>Create a list of tuples and assign it to the <code>data</code> variable</p>
<p><code>sqlContext.createDataFrame(data).show()</code></p>
<p>Creates a dataframe from the list and displays the data. Note the column names have been automatically generated</p>
<pre><code>sqlContext.createDataFrame(data, [&apos;Name&apos;, &apos;Score&apos;]).show()
</code></pre>
<p>The same operation, but with the column names specified</p>
<pre><code>complex_data = [
                (1.0,
                10,
                &quot;Alice&quot;,
                True,
                [1,2,3],
                {&quot;k1&quot;:0},
                Row(a=1,b=2,c=3),
                datetime(2014, 8,1,14,1,5)),
    
                (2.0,
                20,
                &quot;Bob&quot;,
                True,
                [1,2,3,4,5],
                {&quot;k1&quot;:0,&quot;k2&quot;:1},
                Row(a=1,b=2,c=3),
                datetime(2014, 8,1,14,1,5)),

                (3.0,
                30,
                &quot;Charlee&quot;,
                False,
                [1,2,3,4,5,6,7],
                {&quot;k1&quot;:0,&quot;k2&quot;:1,&quot;k3&quot;:2},
                Row(a=1,b=2,c=3),
                datetime(2014, 8,1,14,1,5))    
               ]
</code></pre>
<p>List of complex data</p>
<p><code>sqlContext.createDataFrame(complex_data).show()</code></p>
<p>Convert to dataframe and display</p>
<pre><code>complex_data_df = sqlContext.createDataFrame(complex_data,[
        &apos;col_integer&apos;,
        &apos;col_float&apos;,
        &apos;col_string&apos;,
        &apos;col_boolean&apos;,
        &apos;col_list&apos;,
        &apos;col_dictionary&apos;,
        &apos;col_row&apos;,
        &apos;col_date_time&apos;]
)
complex_data_df.show()
</code></pre>
<p>Convert to dataframe with column name and display</p>
<pre><code>data = sc.parallelize([
    Row(1,&apos;Alice&apos;,50),
    Row(2,&apos;Bob&apos;,100),
    Row(3,&apos;Charlee&apos;,150)
])
</code></pre>
<p>Create an RDD with some Row objects, but with no column name specification for the Row object</p>
<pre><code>column_names = Row(&apos;id&apos;,&apos;name&apos;,&apos;score&apos;)
students = data.map(lambda r: column_names(*r))
</code></pre>
<p>We can apply column name to an RDD after it has been created by using the <code>.map(...)</code> function.</p>
<p><code>students</code></p>
<p>This returns a new RDD</p>
<p>Note: The map() operation performs a transformation on every element in the RDD</p>
<p><code>students.collect()</code></p>
<p>We see that the column names have been assigned to all the records</p>
<pre><code>students_df = sqlContext.createDataFrame(students)
students_df
</code></pre>
<p>Use the SQLContext to create a dataframe from the students RDD</p>
<p><code>students_df.show()</code></p>
<p>Notice the dataframe has recognized all the column names properly</p>
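<p>Although this walkthrough sticks to the DataFrame API, the sqlContext can also run SQL text once a dataframe is registered as a temporary view. A minimal sketch (the view name <code>students</code> is just an example):</p>
<pre><code>students_df.createOrReplaceTempView(&apos;students&apos;)

sqlContext.sql(&apos;SELECT name, score FROM students WHERE score &gt; 75&apos;).show()
</code></pre>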
<h2 id="accessingrddsfromdataframes">Accessing RDDs from DataFrames</h2>
<p>Looking back to the complex data we created earlier</p>
<pre><code>complex_data_df.first()
</code></pre>
<p>This data consists of primitive data types as well as complex data types</p>
<pre><code>complex_data_df.take(2)
</code></pre>
<p>Dataframes are in tabular format and can be accessed using matrix notation</p>
<pre><code>cell_string = complex_data_df.collect()[0][2]
cell_string
</code></pre>
<p>another example</p>
<pre><code>cell_list = complex_data_df.collect()[0][4]
cell_list
</code></pre>
<p>Modify the list</p>
<pre><code>cell_list.append(100)
cell_list
</code></pre>
<pre><code>complex_data_df.show()
</code></pre>
<p>Take note that the original data is unaltered. This is because <code>collect()</code> returns a separate copy of the data, so modifying the returned list does not change the dataframe</p>
<pre><code>complex_data_df.rdd\
                .map(lambda x: (x.col_string, x.col_dictionary))\
                .collect()
</code></pre>
<p>Extract specific columns by converting the DataFrame to an RDD</p>
<pre><code>complex_data_df.select(
    &apos;col_string&apos;,
    &apos;col_list&apos;,
    &apos;col_date_time&apos;
).show()
</code></pre>
<p><code>.select(...)</code> will return only the specified column names</p>
<pre><code>complex_data_df.rdd\
                .map(lambda x: (x.col_string + &quot; Boo&quot;))\
                .collect()
</code></pre>
<p>A <code>map()</code> operation which appends &quot; Boo&quot; to every string in the column</p>
<p>Dataframes do not support the <code>.map(...)</code> function directly, which is why we first drop down to the underlying RDD with <code>.rdd</code></p>
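<p>For comparison, roughly the same result can be produced without leaving the DataFrame API by using the built-in column functions (a sketch, assuming <code>pyspark.sql.functions</code> is available):</p>
<pre><code>from pyspark.sql.functions import concat, lit

# append &quot; Boo&quot; to every value of col_string using column expressions
complex_data_df.select(concat(complex_data_df.col_string, lit(&quot; Boo&quot;))).show()
</code></pre>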
<pre><code>complex_data_df.select(
    &apos;col_integer&apos;,
    &apos;col_float&apos;
    )\
    .withColumn(
    &apos;col_sum&apos;,
    complex_data_df.col_integer + complex_data_df.col_float
    )\
    .show()
</code></pre>
<p>To perform a calculation with column values we need to use the <code>.withColumn(...)</code> method, in two steps:</p>
<ol>
<li>Select the column to use for calculation</li>
<li>Create a new column with the resulting values</li>
</ol>
<pre><code>complex_data_df.select(&apos;col_boolean&apos;)\
                .withColumn(
                    &apos;col_opposite&apos;,
                    complex_data_df.col_boolean == False)\
                .show()
</code></pre>
<p>Here is another example of <code>.withColumn(...)</code> that inverts the value of booleans in a column</p>
<pre><code>complex_data_df.withColumnRenamed(&apos;col_dictionary&apos;,&apos;col_map&apos;).show()
</code></pre>
<p>This example renames the column</p>
<pre><code>complex_data_df.select(complex_data_df.col_string.alias(&apos;Name&apos;)).show()
</code></pre>
<p>This will select and rename a column</p>
<h2 id="sparkdataframesandpandas">Spark DataFrames and Pandas</h2>
<p>Pandas and Spark DataFrames are interoperable</p>
<pre><code>import pandas
</code></pre>
<p>Import the pandas library, do not forget to <code>pip install</code> if needed</p>
<pre><code>df_pandas = complex_data_df.toPandas()
df_pandas
</code></pre>
<p>Converting a Spark DataFrame to a Pandas DataFrame is done using <code>toPandas()</code></p>
<p>Remember that Spark DataFrames are built on top of RDDs and stored across multiple nodes. Conversely, Pandas dataframes are stored in the memory of the single machine they are running on.</p>
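<p>Because <code>toPandas()</code> pulls the whole distributed dataset back to the driver, it can be wise to limit or sample a large dataframe first. A small sketch:</p>
<pre><code># only bring a couple of rows back to the driver before converting
small_pandas_df = complex_data_df.limit(2).toPandas()
small_pandas_df
</code></pre>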
<pre><code>df_spark = sqlContext.createDataFrame(df_pandas)
df_spark.show()
</code></pre>
<p>On the flip side the <code>.createDataFrame(...)</code> will convert a Pandas DataFrame to a Spark DataFrame</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Spark 2 setup]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Demo will cover:</p>
<ul>
<li>Install standalone Spark on your local machine</li>
<li>Set up the PySpark REPL interface</li>
</ul>
<p>Req&apos;s</p>
<ul>
<li>This demo will be done with Python 3</li>
<li>Java v8</li>
<li>jupyter notebooks</li>
</ul>
<p>Download Spark 2 from <a href="https://spark.apache.org/downloads.html">https://spark.apache.org/downloads.html</a></p>
<ol>
<li>Choose the most recent 2.x build</li>
<li>Choose package:</li></ol>]]></description><link>https://blog.virtual-artifact.com/spark-2-setup/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c0</guid><category><![CDATA[Getting Started]]></category><category><![CDATA[Spark 2]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Mon, 13 Apr 2020 14:15:15 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Demo will cover:</p>
<ul>
<li>Install standalone Spark on your local machine</li>
<li>Set up the PySpark REPL interface</li>
</ul>
<p>Req&apos;s</p>
<ul>
<li>This demo will be done with Python 3</li>
<li>Java v8</li>
<li>jupyter notebooks</li>
</ul>
<p>Download Spark 2 from <a href="https://spark.apache.org/downloads.html">https://spark.apache.org/downloads.html</a></p>
<ol>
<li>Choose the most recent 2.x build</li>
<li>Choose package: Pre-built for Apache Hadoop 2.7 and later. (Standalone installation does not require you to have Hadoop installed)</li>
<li>Download generated link</li>
<li>Move the downloaded file to a suitable location on your hard drive</li>
<li>Run this command to unpack the download: <code>sudo tar -xvzf &lt;path to spark binary file download&gt;</code></li>
<li>Run <code>ls</code> to confirm the files have unpacked</li>
</ol>
<p>Standalone Spark requires some environment variables to be set.</p>
<p>Open bash profile, <code>nano ~/.bash_profile</code> and set the following:</p>
<pre><code>export SPARK_HOME=&quot;/path/to/spark2/folder&quot;
export PATH=&quot;$SPARK_HOME/bin:$PATH&quot;
</code></pre>
<p>Note: if you don&apos;t have <code>JAVA_HOME</code> set, this will need to be done as well.</p>
<pre><code>export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
</code></pre>
<p>Exit nano and reload the bash profile</p>
<pre><code>source ~/.bash_profile
</code></pre>
<p>If not installed, install PySpark:</p>
<pre><code>pyspark --version

# if not installed
pip install pyspark
</code></pre>
<p>With Pyspark installed we can create a spark shell</p>
<pre><code>pyspark
</code></pre>
<p>Once the shell starts up we can get access to the Spark Context by typing <code>sc</code> and exiting the shell with <code>exit()</code></p>
<pre><code>&gt;&gt;&gt; sc
&lt;SparkContext master=local[*] appName=PySparkShell&gt;
&gt;&gt;&gt; exit()
</code></pre>
<p>Instead of interacting with the PySpark shell directly, we can set up Jupyter notebooks to launch when we start up Spark 2</p>
<p>We will need to declare more environment variables</p>
<p>Open bash profile again, <code>nano ~/.bash_profile</code></p>
<p>add the following</p>
<pre><code>export PYSPARK_SUBMIT_ARGS=&quot;pyspark-shell&quot;
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=&apos;notebook&apos;
</code></pre>
<p>Save and close the file. Then, reload the bash profile<br>
<code>source ~/.bash_profile</code></p>
<p>Now, when you run <code>pyspark</code> a jupyter notebook server will start</p>
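<p>Once the notebook is open, a quick sanity check in the first cell might look like this (a sketch; the imports are the ones used by the DataFrame examples elsewhere in this series):</p>
<pre><code># the SparkContext should already be available in the notebook
sc

# imports used by the DataFrame examples
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
</code></pre>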
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AWS - Creating a simple VPC and EC2 instance]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will step through the process of quickly creating a VPC and attaching an EC2 instance</p>
<h2 id="creatingavpc">Creating a VPC</h2>
<p>Rescource by region will automaticlly have a default vpc, subnets, route tables and more created for you. One default for every Region. We can use the default VPC and</p>]]></description><link>https://blog.virtual-artifact.com/aws-create-a-simple-vpc-and-ec2/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bf</guid><category><![CDATA[AWS]]></category><category><![CDATA[EC2]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Thu, 09 Apr 2020 14:50:57 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will step through the process of quickly creating a VPC and attaching an EC2 instance</p>
<h2 id="creatingavpc">Creating a VPC</h2>
<p>Each region automatically has a default VPC created for you, along with subnets, route tables, and more: one default per region. We could use the default VPC and reconfigure it to meet our needs. However, we will make one from scratch because that is more fun</p>
<ol>
<li>Log into AWS and select the VPC service</li>
<li>Choose your region</li>
<li>Launch VPC wizard</li>
<li><strong>Step 1: Select a VPC Configuration</strong> - VPC with a Single Public Subnet</li>
<li><strong>Step 2: VPC with a Single Public Subnet</strong>
<ol>
<li>IPv4 CIDR block: 10.0.0.0/16</li>
<li>IPv6 CIDR block: No IPv6 CIDR block</li>
<li>VPC name: name of your VPC</li>
<li>Public subnet&apos;s IPv4 CIDR: 10.0.0.0/24</li>
<li>Availability Zone: Select first availability zone</li>
<li>Subnet name: public-subnet-a</li>
<li>The rest of the options can stay as they are</li>
<li>Click create VPC</li>
</ol>
</li>
</ol>
<p>This will take you to a page confirming the VPC was successfully created.</p>
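<p>For reference, a roughly equivalent setup can be sketched with the AWS CLI (the VPC ID and availability zone below are placeholders, and the console wizard also handles the internet gateway and routing for you):</p>
<pre><code># create the VPC with the same CIDR block used in the wizard
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# create the public subnet inside it (substitute the VPC ID returned above)
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.0.0/24 --availability-zone us-east-1a
</code></pre>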
<h2 id="creatinganec2instance">Creating an EC2 Instance</h2>
<p>Now it&apos;s time to create an EC2 instance; we will create an Amazon Linux instance</p>
<ol>
<li>Select EC2 from the AWS web console</li>
<li>Click &apos;Launch Instance&apos;</li>
<li>Step 1: Select Amazon Linux 2</li>
<li>Step 2: Select &apos;t2.micro&apos; instance type, click continue</li>
<li>Step 3:
<ul>
<li><strong>Num. of instances</strong> : 1</li>
<li><strong>Network</strong>: your new vpc</li>
<li><strong>Subnet</strong>: your new subnet</li>
<li><strong>Auto-assign Public IP</strong>: enable</li>
<li>The rest of the settings can stay at default, click next</li>
</ul>
</li>
</ol>
<ul>
<li>Step 4: The wizard automatically configures EBS for us and we can move on, click next</li>
<li>Step 5: click &apos;add another tag&apos; with key/value Name/demo-app. This will help to identify it amongst several instances.</li>
<li>Step 6:
<ul>
<li><strong>Assign a security group</strong>: Create a new security group</li>
<li><strong>Security group name</strong>: demo-ec2-sg</li>
<li><strong>Description</strong>: Security group for awesome demo instances</li>
</ul>
</li>
</ul>
<p>At this point we have one rule for SSH already created. This is good, but we will need to add another rule.</p>
<ol>
<li>Click &apos;Add Rule&apos;</li>
<li>Type: Custom TCP</li>
<li>Port Range: 8000</li>
<li>Source: Anywhere</li>
<li>Click Review and Launch</li>
<li>Click Launch</li>
</ol>
<p><strong>Very Important</strong></p>
<p>This last step is critical. In the pop-up window you will have the option to select an existing key pair or create a new key pair. Select &apos;Create a new key pair&apos; and give it the name demo-app-keys. Then click &apos;Download Key Pair&apos; and save the file somewhere you will not lose or delete it. Finally, click &apos;Launch Instance&apos;</p>
<h2 id="connectinganddeployingtoanec2instance">Connecting and Deploying to an EC2 Instance</h2>
<p>To complete this section you will want Python 3 installed on your development machine</p>
<p>To test out the EC2 instance we are going to create a very simple HTTP server that will deliver a simple HTML page</p>
<h3 id="creatinghtmlandpythonscripts">Creating HTML and Python scripts</h3>
<p>Create the following HTML page</p>
<p><em>index.html</em></p>
<pre><code>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;title&gt;Super Awesome Web Page 5000&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h3&gt;Hello World!&lt;/h3&gt;
&lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>Create the following python script:</p>
<p><em>http_runner.py</em></p>
<pre><code>import http.server
import socketserver

PORT = 8000
Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer((&quot;&quot;, PORT), Handler) as httpd:
    print(&quot;serving at port&quot;, PORT)
    httpd.serve_forever()
</code></pre>
<p>Start the server <code>python3 http_runner.py</code></p>
<p>Confirm by opening a browser and navigating to localhost:8000</p>
<h3 id="configureec2">Configure EC2</h3>
<ol>
<li>Go to the EC2 Dashboard and click &apos;Running instances&apos;</li>
<li>Select the new instance</li>
<li>In the &apos;Description&apos; tab take note of:
<ul>
<li>Public IP/ Private IP</li>
<li>Key pair</li>
<li>Availability zone</li>
</ul>
</li>
<li>Modify key pair file <code>chmod  400 ~/path/to/your.pem</code></li>
<li><code>ssh -i &lt;path to pem file&gt; ec2-user@&lt;instance ip address&gt;</code></li>
<li>Type <code>yes</code> when prompted about the authenticity of the host</li>
<li>You are now logged into the new machine</li>
</ol>
<p>Now that we are logged in to the new instance we should always update the system first</p>
<p><code>sudo yum update</code></p>
<p>Type <code>Y</code> to initiate the operation</p>
<p>Install Python</p>
<p><code>yum list installed | grep -i python3</code></p>
<p><code>sudo yum install python3 -y</code></p>
<p><code>mkdir simple_http_app</code></p>
<p><code>python3 -m venv simple_http_app/</code></p>
<p><code>source ~/simple_http_app/bin/activate</code></p>
<p><code>pip install pip --upgrade</code></p>
<h3 id="filetransfer">File Transfer</h3>
<p>Time to move our local files to the ec2 instance</p>
<ol>
<li>Exit the ec2 console by typing <code>exit</code>, returning you to your local system prompt</li>
<li><code>scp -r -i &lt;pem_file&gt; &lt;local_code&gt; ec2-user@&lt;ec2_ip&gt;:/home/ec2-user/simple_http_app</code> (see the consolidated sketch after this list)</li>
<li>Log back into the instance</li>
<li>Run python script and verify by using the browser to view the html page</li>
</ol>
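<p>A consolidated sketch of the transfer-and-run steps, reusing the file names from earlier in this article and assuming the local files live in a <code>./simple_http_app</code> directory (adjust the paths and the instance IP for your setup):</p>
<pre><code># from your local machine: copy the project to the instance
scp -r -i demo-app-keys.pem ./simple_http_app ec2-user@&lt;ec2_ip&gt;:/home/ec2-user/simple_http_app

# log back in and start the server
ssh -i demo-app-keys.pem ec2-user@&lt;ec2_ip&gt;
cd simple_http_app
source bin/activate
python3 http_runner.py
</code></pre>
<p>With the security group rule from earlier allowing port 8000, the page should then be reachable at <code>http://&lt;instance public ip&gt;:8000</code></p>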
<p>Very Important: This server is not for production use. This is just a simple http service for development and testing</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Pandas and MySql with a hit of AWS RDS]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will look at connecting Python to a MySQL database. With the help of some SQL connection tools, transferring data between Python and MySQL will be simplified. Finally, we will migrate and run the database on a cloud platform</p>
<p><strong>prerequisites</strong></p>
<ul>
<li>database with some data to work on</li>
<li>aws developer</li></ul>]]></description><link>https://blog.virtual-artifact.com/pandas-and-mysql-with-a-hit-of-aws-rds/</link><guid isPermaLink="false">62e185bb08b3cd09cac935be</guid><category><![CDATA[Python]]></category><category><![CDATA[AWS]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[Pandas]]></category><category><![CDATA[RDS]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 01 Apr 2020 19:41:50 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will look at connecting Python to a MySQL database. With the help of some SQL connection tools, transferring data between Python and MySQL will be simplified. Finally, we will migrate and run the database on a cloud platform</p>
<p><strong>prerequisites</strong></p>
<ul>
<li>database with some data to work on</li>
<li>aws developer account</li>
<li>python (demo is python3)</li>
</ul>
<h2 id="createmysqldbonawsrds">Create MySQL Db on AWS RDS</h2>
<ol>
<li>Log in to AWS (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>)</li>
<li>In the <strong>Resources</strong> section, click <strong>DB Instances</strong></li>
<li>In the top right you will find a <strong>Create database</strong> button, click this</li>
<li>The &quot;Create database&quot; page will take you through setting up the database. Here are the settings we will use
<ul>
<li><strong>Choose a database creation method</strong> - Standard</li>
<li><strong>Engine options</strong>
<ul>
<li>MySQL</li>
<li>Version - Use the version closest to your local version</li>
</ul>
</li>
<li><strong>Template</strong> - Free Tier</li>
<li><strong>Settings</strong>
<ul>
<li>DB instance identifier - give the database a name</li>
<li>Master username - create name</li>
<li>Master password - create password</li>
</ul>
</li>
<li><strong>DB instance size</strong> - leave default settings</li>
<li><strong>Storage</strong> - leave default settings</li>
<li><strong>Availability &amp; durability</strong> - Do not create a standby instance</li>
<li><strong>Connectivity</strong> - change to publicly available</li>
<li><strong>Database authentication</strong> - Password authentication</li>
<li><strong>Additional Configurations</strong> - leave defaults</li>
</ul>
</li>
</ol>
<p>It will take a few minutes to get the database up and running</p>
<p>meanwhile...</p>
<h2 id="setupvirtualenvironment">Set up virtual environment</h2>
<ol>
<li>Create a directory for the files <code>mkdir sample-mysql-rds</code> and then <code>cd</code> into the newly created directory.</li>
<li>Initialize the virtual environment <code>python3 -m venv ./</code></li>
<li>Start up the virtual environment <code>source ./bin/activate</code></li>
</ol>
<h2 id="importmodules">import modules</h2>
<p>With the virtual environment set up and running, we can turn our attention to the modules needed for this demo</p>
<ul>
<li>modules
<ul>
<li>pandas</li>
<li>sqlalchemy</li>
<li>PyMySQL</li>
<li>boto3(optional)</li>
</ul>
</li>
</ul>
<h3 id="pandas">Pandas</h3>
<p>pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.</p>
<p>pandas can be installed via pip from PyPI.</p>
<p><code>pip3 install pandas</code></p>
<h3 id="sqlalchemy">SqlAlchemy</h3>
<p>SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. SQLAlchemy provides a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.</p>
<p><code>pip3 install SQLAlchemy</code></p>
<h3 id="pymysql">PyMySQL</h3>
<p>This package contains a pure-Python MySQL client library</p>
<p><code>pip3 install PyMySQL</code></p>
<h3 id="boto3optional">Boto3 (Optional)</h3>
<p>Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.</p>
<p><code>pip3 install boto3</code></p>
<h2 id="examinedatabase">Examine Database</h2>
<p>Now that we are finished with the Python dependencies, let&apos;s take a moment to review the data we plan to import into the Python script. Here, I will be using data about the English Premier League.</p>
<p>You can either use a DBMS IDE like DataGrip or MySQL Workbench, or just interface with your database from the command line. Make sure the data exists and is available. We will connect to the local database first to make sure everything is functioning properly, and then we will deploy the database to AWS RDS</p>
<h2 id="pythonandmysql">Python and MySQl</h2>
<p>Now that we have the data store in place and the right modules available to us, it&apos;s time to put rubber to the road!</p>
<h3 id="importdatafrommysql">import data from mysql</h3>
<p>Create a new Python script</p>
<p><em>app.py</em></p>
<pre><code># (1)
import pandas as pd 
import sqlalchemy
from sqlalchemy import Table, Column, Integer, String, MetaData

# (2)
engine = sqlalchemy.create_engine(&apos;mysql+pymysql://username:password@localhost/demo_epl_1819&apos;)

</code></pre>
<ol>
<li>First import the <em>pandas</em> and <em>sqlalchemy</em> libraries</li>
<li>Using <code>create_engine()</code> creates an Engine object that can be used to bridge python and a relational database. Let&apos;s look at the string parameter that is passed</li>
</ol>
<p><code>&apos;mysql+pymysql://username:password@localhost/demo_epl_1819&apos;</code></p>
<ul>
<li>mysql = database type</li>
<li>pymysql = the DBAPI driver the engine uses to talk to MySQL</li>
<li>username:password = username and password</li>
<li>@localhost/demo_epl_1819 = database url</li>
</ul>
<h4 id="table">table</h4>
<p>Here we will look at accessing a MySQL table and reading the data into a dataframe.</p>
<p><em>app.py</em></p>
<pre><code>df = pd.read_sql_table(&apos;match_results&apos;, engine)
print(df.head())
print(type(df))
</code></pre>
<p>The <code>read_sql_table()</code> method returns all the records of a MySQL table by passing the table name and the Engine object created earlier as arguments. Pandas will read all the SQL table data into a dataframe</p>
<h4 id="query">query</h4>
<p>Queries can also be written for more customized results</p>
<pre><code># create dataframe from sql query result
query_1 = &apos;SELECT  HomeTeam, AwayTeam, FTR FROM match_results;&apos;
df_query = pd.read_sql_query(query_1, engine)
print(df_query.head())
print(type(df))

</code></pre>
<p><code>read_sql_query()</code> takes a sql statement as a string and an Engine object. This will execute a query on the database and return any values as a pandas dataframe</p>
<h3 id="writetocsvandsavetos3optional">Write to csv and save to s3 (Optional)</h3>
<p>This section will lean on <a href="http://creating-a-aws-s3-service-with-python">another article</a> we did, where we created a storage service to write and read objects from S3 buckets.</p>
<ol>
<li>Create a directory named <em>services</em></li>
<li>Within services create a Python file name <em>storage_service.py</em></li>
</ol>
<p><em>storage_service.py</em></p>
<p><em>Note:</em> you will need to have your AWS credentials available for the boto3 service to work. If you need help, reference <a href="http://creating-a-aws-s3-service-with-python">this article</a></p>
<pre><code>import boto3


class StorageService:

    def __init__(self, storage_location):
        self.client = boto3.client(&apos;s3&apos;)
        self.bucket_name = storage_location

    def upload_file(self, file_name, object_name=None):
        if object_name is None:
            object_name = file_name

        response = self.client.upload_file(file_name, self.bucket_name, object_name)

        return response

    def download_object(self, object_name, file_name=None):
        if file_name is None:
            file_name = object_name

        print(file_name + &quot; is the file name&quot;)

        response = self.client.download_file(self.bucket_name, object_name, file_name)

        return response

    def list_all_objects(self):
        objects = self.client.list_objects(Bucket=self.bucket_name)

        if &quot;Contents&quot; in objects:
            response = objects[&quot;Contents&quot;]
        else:
            response = {}

        return response

    def delete_object(self, object_name):
        response = self.client.delete_object(Bucket=self.bucket_name, Key=object_name)

        return response


</code></pre>
<p>We will not go into detail about this code. If you are curious, check out the other demo for more on this service.</p>
<h4 id="passdatatos3">Pass data to S3</h4>
<p><em>app.py</em></p>
<pre><code>from services.storage_service import StorageService # (1)

storage_service = StorageService(&quot;your.first.boto.s3.bucket&quot;) # (2)
df_query.to_csv(&quot;output.csv&quot;, index=False) # (3)
storage_service.upload_file(&quot;output.csv&quot;) # (4)
</code></pre>
<ol>
<li>Import the storage service into the app</li>
<li>Instantiate a StorageService object</li>
<li>Create a csv file from the dataset</li>
<li>Use storage service to upload csv to S3 bucket</li>
</ol>
<h3 id="writetoanothertable">write to another table</h3>
<p>Here is an example of using sqlalchemy to create a table. Then, use pandas with sqlalchemy to write dataframe contents to the new SQL table</p>
<pre><code>meta = MetaData() 				# (1)

results_table = Table( 			# (2)
   &apos;simple_result&apos;, meta,
   Column(&apos;id&apos;, Integer, primary_key=True, autoincrement=True),
   Column(&apos;HomeTeam&apos;, String(25)),
   Column(&apos;AwayTeam&apos;, String(25)),
   Column(&apos;FTR&apos;, String(1))
)
meta.create_all(engine)			# (3)

# (4)
df_query.to_sql(name=&apos;simple_result&apos;, con=engine, index=False, if_exists=&apos;append&apos;)
</code></pre>
<ol>
<li><code>MetaData()</code> is a container object that keeps together many different features of a database (or multiple databases) being described.</li>
<li>Create a Table object that represents the table to be created</li>
<li><code>.create_all()</code> will cause the <code>MetaData()</code> instance to create any tables associated with it</li>
<li>Pandas dataframes have a <code>to_sql()</code> method that writes the records stored in a DataFrame to a SQL database
<ul>
<li>name - SQL table name</li>
<li>con - SQLAlchemy Engine connection</li>
<li>index - whether to write the dataframe index as a column in the table</li>
<li>if_exists - behaviour when the table already exists</li>
</ul>
</li>
</ol>
<h2 id="migratetocloud">Migrate to Cloud</h2>
<ol>
<li>Export data from MySQL to file format of your choice</li>
<li>Go back to the AWS RDS web console and click the MySQL database you created earlier. This will take you to a details page</li>
<li>In the section titled &apos;Connectivity &amp; security&apos; make note of two things here
<ol>
<li>Endpoint value</li>
<li>Port number</li>
</ol>
</li>
<li>Check that the security group is open to all traffic</li>
<li>Use terminal or command line to log into the new database. <code>mysql --port=3306 --host=&lt;&lt;endpoint&gt;&gt; --user=&lt;&lt;username&gt;&gt; --password</code></li>
<li>create a new database</li>
<li>Import the SQL exported earlier. Log out of MySQL and upload the <em>.sql</em> file with the following: <code>mysql -h &lt;&lt;endpoint&gt;&gt; -u &lt;&lt;username&gt;&gt; -p --port=3306 &lt;&lt;database name&gt;&gt; &lt; &lt;&lt;path to sql file&gt;&gt;</code></li>
<li>Check the database loaded correctly</li>
<li>Back in <em>app.py</em> change the database url to point to the RDS DB instance endpoint <code>mysql+pymysql://&lt;&lt;user name&gt;&gt;:&lt;&lt;password&gt;&gt;@&lt;&lt;db endpoint&gt;&gt;:&lt;&lt;port number&gt;&gt;/&lt;&lt;dbname&gt;&gt;</code></li>
<li>Run python script</li>
</ol>
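<p>For step 9, the updated engine creation in <em>app.py</em> might look roughly like this (the credentials, endpoint, and database name are placeholders for your own values):</p>
<pre><code>engine = sqlalchemy.create_engine(
    &apos;mysql+pymysql://&lt;username&gt;:&lt;password&gt;@&lt;db-endpoint&gt;:3306/demo_epl_1819&apos;
)
</code></pre>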
<h2 id="teardown">Tear Down</h2>
<p>Remember to spin down any services you do not wish to incur charges on</p>
<h2 id="conclusion">Conclusion</h2>
<p>Using the SQLAlchemy library with Pandas allowed easy access to our local and remote databases in the form of dataframes. We then looked at converting a dataframe to CSV format and saving that data in S3, leveraging a storage service built in a previous article. Finally, we migrated the MySQL database to AWS RDS and updated our application to connect to the remote database</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Creating a AWS S3 service with Python]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>In a previous module we use the boto3 library to connect our python script with an AWS service. Here, we are going continue down this path by taking a look at some operations we can perform using AWS S3 and python. We will create a storage service with python</p>]]></description><link>https://blog.virtual-artifact.com/creating-a-aws-s3-service-with-python/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bd</guid><category><![CDATA[AWS]]></category><category><![CDATA[Python]]></category><category><![CDATA[S3]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 25 Mar 2020 15:33:44 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>In a previous module we used the boto3 library to connect our Python script with an AWS service. Here, we are going to continue down this path by taking a look at some operations we can perform using AWS S3 and Python. We will create a storage service with Python that interfaces with S3, giving us a chance to use some common operations.</p>
<h2 id="initiatetheenvironment">Initiate the Environment</h2>
<p>We will use a virtual environment to allow us to develop our code in an isolated environment. Developing in a virtual environment allows us to manage our projects independently, making them more portable and less coupled to our development machine.</p>
<ol>
<li>
<p>To create a new environment named sample-env execute: <code>$ python3 -m venv ~/python-envs/sample-boto-s3</code></p>
</li>
<li>
<p>To activate the environment execute: <code>$ source ~/python-envs/sample-boto-s3/bin/activate</code></p>
</li>
<li>
<p>Install the Boto3 package using: <code>$ pip3 install boto3</code></p>
</li>
</ol>
<h2 id="settinguptheproject">Setting up the project</h2>
<p>By setting up access to S3 as a service, we encapsulate the code needed to interact with S3. This in turn makes the service more reusable and portable for other projects.</p>
<ol>
<li>Open up your favorite Python IDE and let&apos;s get to the good stuff...code!</li>
<li>Create a new Python script file called <em>storage_service.py</em> and save it to the <em>sample-boto-s3</em> folder created for the virtual environment previously.</li>
</ol>
<h2 id="createtheserivice">Create the Serivice</h2>
<p>Open <code>storage_service.py</code> and enter the following:</p>
<pre><code>import boto3  # (1)


class StorageService:

    def __init__(self, storage_location):  # (2)
        self.client = boto3.client(&apos;s3&apos;)  # (3)
        self.bucket_name = storage_location  # (4)
</code></pre>
<ol>
<li>In order to leverage the boto3 library we must import it first</li>
<li>Passing the storage location at construction time decouples the storage URL from the class code. When a <em>StorageService</em> object is created the bucket location is passed into the constructor, allowing multiple storage services to exist and represent different storage locations</li>
<li>Using <code>boto3.client(&apos;s3&apos;)</code> returns a S3 client object that can be used to interact with the S3 cloud service</li>
<li>Assign the bucket name to an object variable. This will be used later to configure the S3 client</li>
</ol>
<p>The <code>boto3</code> object becomes available because of the import statement at the top of the file. This object will produce a client object that can interface with the service passed as an argument, in this case <code>&apos;s3&apos;</code></p>
<h3 id="uploadingfiles">Uploading Files</h3>
<p>The first behaviour to add to this service is the ability to upload a local file to the specified S3 bucket.</p>
<pre><code>def upload_file(self, file_name, object_name=None):  # (1)
    if object_name is None:  # (2)
        object_name = file_name

    response = self.client.upload_file(file_name, self.bucket_name, object_name)  # (3)

    return response  # (4)
</code></pre>
<ol>
<li>A method that takes a <code>file_name</code> parameter that represents the local location of the file to be uploaded to S3. <code>object_name</code> is an optional parameter to rename the file on S3</li>
<li>If no value is given for renaming the file, the current file name will be used</li>
<li>The <code>upload_file()</code> method is called on the S3 client and the proper values are passed.
<ul>
<li>Param 1 - path to local file</li>
<li>Param 2 - name of the bucket to upload to</li>
<li>Param 3 - name of object on S3</li>
</ul>
</li>
<li>Return the response sent from the API call (if any)</li>
</ol>
<h4 id="runthecode">Run the code</h4>
<p>Create a file to upload</p>
<p><code>touch file.txt</code></p>
<p>Next, create a python script to instantiate and run the storage service</p>
<p><em>service_runner.py</em></p>
<pre><code>from storage_service import StorageService  # (1)

storage_service = StorageService(&quot;your.first.boto.s3.bucket&quot;)  # (2)

print(storage_service.upload_file(&apos;file.txt&apos;))  # (3)
</code></pre>
<ol>
<li>Import the storage service</li>
<li>Instantiate a new service with a bucket location</li>
<li>Call the service, using the text file created above as the upload</li>
</ol>
<p>Notice nothing is returned from the AWS API in this case, but that is not always true.</p>
<p>Log in to AWS Web Console to confirm the file was uploaded successfully.</p>
<h3 id="downloadingfiles">Downloading Files</h3>
<p>This method will use the name of an S3 object to retrieve it from the cloud and save to the local system</p>
<pre><code>def download_object(self, object_name, file_name=None):  # (1)
    if file_name is None:
        file_name = object_name

    response = self.client.download_file(self.bucket_name, object_name, file_name)  # (2)

    return response
</code></pre>
<ol>
<li>This method takes <code>object_name</code> to retrieve the object from S3 and <code>file_name</code> as an optional parameter to rename the file when saving it locally</li>
<li>The <code>download_file()</code> method will be supplied the following values
<ul>
<li>Param 1 - Name of bucket to access</li>
<li>Param 2 - Name of object to retrieve from S3</li>
<li>Param 3 - What name to use for file on the local system</li>
</ul>
</li>
</ol>
<h4 id="runthecode">Run the Code</h4>
<p><em>service_runner.py</em></p>
<p><code>print(storage_service.download_object(&apos;file.txt&apos;, &apos;file_s3.txt&apos;))</code></p>
<p>There should now be a new file saved locally named &quot;file_s3.txt&quot;</p>
<h3 id="listingfiles">Listing Files</h3>
<p>The ability for the storage service to return a list of all the objects in the bucket could be useful</p>
<pre><code>def list_all_objects(self):
    objects = self.client.list_objects(Bucket=self.bucket_name)  # (1)

    if &quot;Contents&quot; in objects:  # (2)
        response = objects[&quot;Contents&quot;]
    else:
        response = {}

    return response
</code></pre>
<ol>
<li><code>list_objects()</code> only needs a bucket name to be supplied as a parameter. This will return a lengthy JSON object that is stored as a dictionary of &lt;key, value&gt; pairs</li>
<li>The bit we are interested in has a key of &apos;Contents&apos;. This contains an array of S3 objects. The problem is, if it&apos;s an empty bucket then the &apos;Contents&apos; key will be absent from the JSON. So check first whether the key exists and then assign the response. If the bucket is empty, an empty dictionary is returned</li>
</ol>
<h3 id="runthecode">Run the Code</h3>
<p><em>service_runner.py</em></p>
<pre><code>s3Objects = storage_service.list_all_objects()

for file in s3Objects:
    print(file[&quot;Key&quot;])
</code></pre>
<p><code>list_all_objects()</code> returns a list of dictionaries with information about the objects contained in the bucket. The object names are found under the &apos;Key&apos; key in each dictionary</p>
<h3 id="deletingfiles">Deleting Files</h3>
<p>Finally there should be a way to remove objects from the bucket.</p>
<pre><code>def delete_object(self, object_name):
    response = self.client.delete_object(Bucket=self.bucket_name, Key=object_name)  # (1)

    return response
</code></pre>
<p>The <code>delete_object()</code> method takes two arguments:</p>
<ul>
<li>Param 1 - bucket to access</li>
<li>Param 2 - S3 object to delete</li>
</ul>
<h4 id="runthecode">Run the code</h4>
<p><em>service_runner.py</em></p>
<p><code>print(storage_service.delete_object(&apos;file.txt&apos;))</code></p>
<h2 id="conclusion">Conclusion</h2>
<p>Using the boto3 S3 client, we built a simple storage service to manage file transfers to and from a specified bucket. We took care to build it in a way that we can reuse this service and its code throughout any project we wish to add the service to. If we ever need to add or remove functionality the service code can easily implement these changes.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Using AWS-CLI to interact with AWS S3]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>This article discusses some of the common commands of AWS-CLI to communicate with the AWS S3 service. The command line tool is a quick and easy way to manage S3 buckets. It is not a complicate interface, with only a hand full of commands. The following will use these</p>]]></description><link>https://blog.virtual-artifact.com/using-aws-cli-to-interact-with-aws-s3/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bc</guid><category><![CDATA[AWS]]></category><category><![CDATA[S3]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 25 Mar 2020 15:32:59 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>This article discusses some of the common commands of AWS-CLI for communicating with the AWS S3 service. The command line tool is a quick and easy way to manage S3 buckets. It is not a complicated interface, with only a handful of commands. The following will use these commands to create, read, update, and remove objects from S3</p>
<h2 id="anatomy">Anatomy</h2>
<p><code>aws s3 &lt;Command&gt; [&lt;Arg&gt; ...]</code></p>
<h2 id="aboutpaths">About Paths</h2>
<p>Every S3 command consists of at least one path argument. A path can be represented in two ways:</p>
<ol>
<li><code>LocalPath</code> - represents a path on the local file system. This can be relative or absolute</li>
<li><code>S3Uri</code> - represents a S3 bucket, object, or prefix.</li>
</ol>
<p>The S3 directories are referred to as <em>prefixes</em></p>
<h3 id="s3resourcepaths">S3 resource paths</h3>
<p>An S3 URI path is formatted like so:</p>
<p><code>s3://SomeBucket/ObjectKey</code></p>
<p>Let&apos;s break it down piece by piece.</p>
<p><code>s3://</code> : the path to a resource on S3 must begin with this prefix. This denotes that the path argument refers to an S3 resource.</p>
<p><code>SomeBucket/</code> : this refers to the unique bucket name to access</p>
<p><code>ObjectKey</code> : this is the specified key name value for the object within the bucket</p>
<h3 id="ordermatters">Order Matters</h3>
<p>All AWS-CLI S3 commands take one or two URI path arguments. The first argument is the <em>source</em> path. This can be a local resource or an S3 resource. When a second path argument is present, it represents the destination path. This too can be either a local or S3 path. If a command only calls for one path, it is because the command operates on the source resource alone, and there is no need for a destination path</p>
<h2 id="s3operations">S3 Operations</h2>
<p>The AWS-CLI S3 commands can operate on single files or on file directories</p>
<h3 id="singlefileoperations">Single File Operations</h3>
<p>Here are commands that will operate on a single file:</p>
<ul>
<li>cp - copy a resource from source path to destination</li>
<li>mv - move a resource from source path to destination</li>
<li>rm - remove a resource at the source path</li>
</ul>
<p>If the <code>--recursive</code> flag is used the operation may affect more than one file.</p>
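<p>A few example invocations (the bucket, prefix, and file names below are placeholders):</p>
<pre><code># copy a local file into a bucket
aws s3 cp ./report.txt s3://bucketname/reports/report.txt

# move an object between prefixes
aws s3 mv s3://bucketname/reports/report.txt s3://bucketname/archive/report.txt

# remove an object
aws s3 rm s3://bucketname/archive/report.txt

# copy a whole directory by adding --recursive
aws s3 cp ./reports s3://bucketname/reports --recursive
</code></pre>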
<h4 id="slashesmatter">Slashes Matter</h4>
<p>When creating path arguments for the source and the destination, the direction of the slashes matters. When representing a path on the local file system, use the slash separator used by the operating system. When representing an S3 resource, use forward slashes.</p>
<p>When configuring the destination resource path, whether or not it ends with a slash changes the behavior</p>
<p><code>aws s3 cp src/file.txt s3://bucketname/src</code></p>
<p><code>aws s3 cp src/file.txt s3://bucketname/src/</code></p>
<p>The first command copies the file to an object named <code>src</code>; the second copies it under the <code>src/</code> prefix as <code>src/file.txt</code></p>
<h3 id="directoryprefixoperations">Directory &amp; Prefix Operations</h3>
<p>Here are some commands that operate on directories and/or their contents (examples follow the list):</p>
<ul>
<li>sync - Syncs  directories  and S3 prefixes</li>
<li>mb - create/make a bucket</li>
<li>rb - remove a bucket</li>
<li>ls - the directory content</li>
</ul>
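<p>A few example invocations (bucket and directory names are placeholders):</p>
<pre><code># make and then remove a bucket
aws s3 mb s3://my.example.bucket
aws s3 rb s3://my.example.bucket

# list the contents of a bucket or prefix
aws s3 ls s3://bucketname/reports/

# sync a local directory to an S3 prefix
aws s3 sync ./site s3://bucketname/site
</code></pre>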
<h4 id="slashesdontmatter">Slashes Don&apos;t Matter</h4>
<p>Unlike single file operations, a trailing slash does not affect how directory/prefix operations work</p>
<h2 id="filters">Filters</h2>
<p>Most commands allow for filtering using the <code>--exclude &lt;value&gt;</code> and <code>--include &lt;value&gt;</code> parameters. Pattern matching is achieved with the following symbols.</p>
<ul>
<li>*: Matches everything</li>
<li>?: Matches any single character</li>
<li>[sequence]: Matches any character in sequence</li>
<li>[!sequence]: Matches any character not in sequence</li>
</ul>
<p>The <em>exclude</em> and <em>include</em> parameters can be used multiple times in a single command</p>
<p><code>--exclude &quot;*&quot; --include &quot;*.txt&quot;</code></p>
<p>When multiple filter parameters are present the latter will override the former.</p>
<p><code>--include &quot;*.txt&quot; --exclude &quot;*&quot;</code></p>
<p>But reversing the order leads to a different outcome</p>
<p>Filters are applied to the source directory</p>
<p><code>aws s3 sync ./ s3://bucket.on.s3 --exclude &quot;*&quot; --include &quot;*.mov&quot; --include &quot;*.ogg&quot;</code></p>
<p>This command will perform a sync using the current directory as the source and will exclude all files except <em>.ogg</em> and <em>.mov</em> files.</p>
<h2 id="summary">Summary</h2>
<p>We have looked at some of the common ways to interact with S3 using the AWS-CLI. By breaking down the format of path arguments, we are able to connect S3 buckets and objects with the local file system. We explored several ways to configure <code>s3</code> commands and filter results. We are now ready to administer S3 through the command line interface!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Install and Configure Boto3 with AWS]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>AWS (Amazon Web Services) is an ecosystem with an abundance of services to fulfill many of our development needs. There are also some great ways for us to interact with these services. We can simply log in to the web browser console(<a href="https://aws.amazon.com">https://aws.amazon.com</a><br>
). Or we can</p>]]></description><link>https://blog.virtual-artifact.com/install-and-configure-boto3-with-aws/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bb</guid><category><![CDATA[AWS]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Tue, 17 Mar 2020 22:23:41 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>AWS (Amazon Web Services) is an ecosystem with an abundance of services to fulfill many of our development needs. There are also some great ways for us to interact with these services. We can simply log in to the web browser console (<a href="https://aws.amazon.com">https://aws.amazon.com</a>). Or we can use the command line with AWS-CLI to execute commands on services from the terminal.</p>
<p>There are also many APIs that allow code to interact with AWS programmatically, and that is what we are going to do in the following article: create a Python script that uses the Boto3 library to connect to an Amazon web service. We will use the following objectives to reach this goal:</p>
<ol>
<li>Install the Python libraries needed to connect to AWS within a virtual environment</li>
<li>Configure credentials to gain access to AWS</li>
<li>Execute code to interact with AWS</li>
</ol>
<h3 id="prerequisites">Prerequisites</h3>
<ul>
<li>Python (<a href="https://www.python.org/">https://www.python.org/</a>)</li>
<li>Pip (<a href="https://pip.pypa.io/en/stable/">https://pip.pypa.io/en/stable/</a>)</li>
<li>AWS-CLI</li>
</ul>
<h2 id="virtualenvironment">Virtual Environment</h2>
<h3 id="whatisavirtualenvironment">What is a Virtual Environment?</h3>
<p>A Virtual Environment is a self contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.</p>
<h3 id="whyshouldiuseavirtualenvironment">Why should I use a Virtual Environment?</h3>
<p>A Virtual Environment keeps all dependencies for the Python project separate from dependencies of other projects. This has a few advantages:</p>
<p>It makes dependency management for the project easy.<br>
It enables using and testing of different library versions by quickly spinning up a new environment and verifying the compatibility of the code with the different version.</p>
<h3 id="macosxsetup">Mac OS X Setup</h3>
<ol>
<li>Create a folder where the virtual environments will reside <code>$ mkdir ~/python-envs</code></li>
<li>To create a new environment named sample-env execute <code>$ python3 -m venv ~/python-envs/sample-env</code></li>
<li>To activate the environment execute <code>$ source ~/python-envs/sample-env/bin/activate</code></li>
<li>Install the Boto3 package using <code>$ pip3 install boto3</code></li>
<li>To deactivate the environment execute <code>$ deactivate</code></li>
</ol>
<h2 id="configuration">Configuration</h2>
<p>To use Boto3 you will need a credentials file to verify your identity so that you may access your account services. We previously covered this in an article on setting up AWS-CLI. AWS-CLI will create this credentials file for you during setup; head over <a href="https://www.google.com">here</a> if you haven&apos;t set up AWS-CLI.</p>
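<p>For reference, the credentials file (typically <code>~/.aws/credentials</code>) looks roughly like this, with placeholder values:</p>
<pre><code>[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
</code></pre>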
<h2 id="code">Code</h2>
<p>We will create a simple python script to connect to S3 and upload a file</p>
<p>Begin by creating a directory to write your files to. I will be using the one created earlier in this article, <code>~/python-envs/sample-env</code></p>
<p>Now create a new file named &apos;storage_service.py&apos; inside the &apos;sample-env&apos; directory</p>
<p>Add the below lines of code to the new file</p>
<pre><code>import boto3  # 1

client = boto3.client(&apos;s3&apos;)  # 2
client.create_bucket(Bucket=&quot;your.first.boto.s3.bucket&quot;)  # 3

buckets = client.list_buckets()  # 4
for i in buckets[&apos;Buckets&apos;]:  # 5
    print(i[&apos;Name&apos;])
</code></pre>
<ol>
<li>Import the boto3 library into the program</li>
<li>This will return an object that will allow for interaction with S3</li>
<li>Use the newly acquired S3 client to create a bucket</li>
<li>Return a list of the current buckets in use and accessible to the user credentials that were used in the configuration section</li>
<li>Iterate through the results and print the name of each bucket</li>
</ol>
<p>Finally, we will start up the virtual environment</p>
<p>Start up the virtual environment<br>
<code>$ source ~/python-envs/sample-env/bin/activate</code></p>
<p>If you haven&apos;t done so, install the boto3 package into the environment <code>$ pip3 install boto3</code></p>
<p>Run the script<br>
<code>$ python3 storage_service.py</code></p>
<p>If everything works out properly you should see a list of S3 bucket names</p>
<p><code>your.first.boto.s3.bucket</code></p>
<h2 id="conclusion">Conclusion</h2>
<p>We have taken the first steps of opening up our Python applications to an ever expanding set of services through AWS.</p>
<p>We used a Python virtual environment to create an independent development space for our code, and installed the Boto3 library. Using AWS-CLI we were able to set up the credentials we needed to access AWS. Finally, we created a script that executed commands to create and query resources on S3.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AWS - Install and Setup Command Line Interface]]></title><description><![CDATA[<h2 id="overview">Overview</h2><p>Amazon Web Services offer a large number of cloud based services and tools. There are a few different ways that we can interact with these services. These options include a web browser interface (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>) and SDKs (Software Development Kits) for most major programming language. This article</p>]]></description><link>https://blog.virtual-artifact.com/aws-install-and-setup-aws-cli/</link><guid isPermaLink="false">62e185bb08b3cd09cac935b9</guid><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 26 Feb 2020 22:56:22 GMT</pubDate><content:encoded><![CDATA[<h2 id="overview">Overview</h2><p>Amazon Web Services offer a large number of cloud based services and tools. There are a few different ways that we can interact with these services. These options include a web browser interface (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>) and SDKs (Software Development Kits) for most major programming language. This article focuses on using the AWS-CLI (Amazon Web Services Command Line Interface.) &#xA0;As the name suggest, the AWS-CLI allows us to interface using a system command line application. We will cover installation and setup so that we can connect to an existing AWS account and begin interacting with Amazon Web Services through the command line.</p><hr><!--kg-card-begin: markdown--><h2 id="prerequisites">Prerequisites</h2>
<p>Before we can begin there are a few things you will need to have:</p>
<ol>
<li>An AWS account</li>
<li>Familiarity with a system command line application(e.g. terminal, iTerm)</li>
<li>For the Homebrew installation, you will need Homebrew installed (<a href="https://brew.sh/.)">https://brew.sh/.)</a></li>
</ol>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><h2 id="installingawscli">Installing AWS-CLI</h2>
<p>There are 2 versions of AWS-CLI as of the writing of this article. AWS-CLI version 2 is the latest release and this is the version we will use. Let&apos;s look at 2 methods for installing AWS-CLI</p>
<p>Note that this article is running AWS-CLI on Mac OSX Catalina. Follow these links for other system installations:</p>
<ul>
<li>Linux - <a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html">https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html</a></li>
<li>Windows - <a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html">https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html</a></li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="homebrewinstallation">Homebrew Installation</h3>
<p>Homebrew is a fantastic little package manager that makes life much easier when it comes to installing and maintaining applications on OSX.</p>
<ol>
<li>We simply use the install command of brew to tell Homebrew that we want to install awscli.<br>
<code>$ brew install awscli</code></li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="macosinstallation">MacOS Installation</h3>
<p>We can also download and install manually, directly from the command line.</p>
<pre><code>curl &quot;https://awscli.amazonaws.com/AWSCLIV2.pkg&quot; -o &quot;AWSCLIV2.pkg&quot;
sudo installer -pkg AWSCLIV2.pkg -target /
</code></pre>
<ol>
<li>
<p>The <code>curl</code> command retrieves the installer package and saves it to the local drive under the file name set by the <code>-o</code> option.</p>
</li>
<li>
<p>Next, <code>installer</code> installs the downloaded pkg file specified by the <code>-pkg</code> option. The <code>-target /</code> option installs the package to the root volume so it lands in the proper directory.</p>
</li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="confirminstallation">Confirm Installation</h3>
<p>We can quickly confirm that everything was installed properly by running the following commands and checking their output:</p>
<pre><code>$ which aws
/usr/local/bin/aws 
$ aws --version
aws-cli/2.0.0 Python/3.8.1 Darwin/19.3.0 botocore/2.0.0dev4
</code></pre>
<p>Note: Your version of Python may be different depending on what is installed in your dev environment.</p>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><h3 id="configuration">Configuration</h3>
<p>Now that we have the AWS command line tool installed, we will need to configure it with our AWS account information. This will allow us to connect to our AWS account and gain access to services.</p>
<p>This will happen in two parts:</p>
<ol>
<li>Using Amazon&apos;s IAM service to create an access key pair to grant aws-cli access to the AWS account</li>
<li>Using the access key pair to configure aws-cli</li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="part1accesskeys">Part 1 - Access Keys</h4>
<p>In this section we will go through the steps for creating and downloading access credentials that will be used in Part 2 to complete the configuration of AWS-CLI and gain access to the AWS account.</p>
<p>In order to complete this section, you will need to have created an AWS account (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>).</p>
<p>Note: It is recommended that you use an account other than your root account. Creating a separate user with <em>Admin</em> privileges will work well for this tutorial.</p>
<ol>
<li><strong>Sign in</strong> and navigate to the <strong>IAM</strong> service section of the AWS console (<a href="https://console.aws.amazon.com/iam/">https://console.aws.amazon.com/iam/</a>)</li>
<li>Locate the left-side menu and click on the <strong>Users</strong> link under <strong>Access Management</strong></li>
<li>You will see a list of users in the main window. Click on the user that will be used by aws-cli to log in.</li>
<li>This will take you to a summary page, where you will find a tab labeled <strong>Security credentials</strong>. Click this tab.</li>
<li>Under the section labeled <strong>Access keys</strong>, click the <strong>Create access key</strong> button. This will open a new dialog box with the newly created access key.</li>
</ol>
<p>At this point you can choose to download a CSV file of the credentials. Otherwise, you can copy and paste the <strong>Access key ID</strong> and <strong>Secret access key</strong> somewhere for later use.</p>
<p><strong>Warning:</strong> You will not have access to the secret key after you close this dialog box.</p>
<p><strong>Warning:</strong> Keep your access keys confidential. Sharing this information is like sharing your account. Proceed with caution.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="part2awscliconfig">Part 2 - awscli config</h4>
<p>Now that we have the access key needed to configure AWS-CLI, we can return to the command line. The <code>aws configure</code> command is the fastest way to initially configure AWS-CLI.</p>
<pre><code>$ aws configure
AWS Access Key ID [None]: enter-your-access-key-id-here
AWS Secret Access Key [None]: enter-your-secret-access-key-here
Default region name [None]: us-west-2
Default output format [None]: json
</code></pre>
<p><strong>AWS Access Key ID</strong> - the key ID value from the access key that was created in Part 1 - Access Keys</p>
<p><strong>AWS Secret Access Key</strong> - the secret access key that was created in Part 1 - Access Keys</p>
<p><strong>Default region name</strong> - the default region that the AWS-CLI will choose unless otherwise specified. Here I am using the value of us-west-2, but you can use any available region.</p>
<p><strong>Default output format</strong> - This tells AWS-CLI how we want output to be displayed in the command line. There are four possible output formats:</p>
<ul>
<li>json</li>
<li>yaml</li>
<li>text</li>
<li>table</li>
</ul>
<p>Once this is complete, AWS-CLI saves this configuration in a profile named <code>default</code>. These are the values AWS-CLI will use if no other values are explicitly defined (see the override example after the note below). AWS-CLI also creates a <code>.aws</code> folder in the user&apos;s home directory, along with two text files inside it:</p>
<ol>
<li><strong>.aws/credentials</strong> holds the profile access key ID and secret access key</li>
<li><strong>.aws/config</strong> holds the profile default region name and output format</li>
</ol>
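<p>For reference, after running <code>aws configure</code> with the values above, the two files should look roughly like this (the key values shown are placeholders):</p>
<pre><code># ~/.aws/credentials
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# ~/.aws/config
[default]
region = us-west-2
output = json
</code></pre>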
<p>Note: You can always update or change these values by running <code>aws configure</code> again.</p>
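<p>These defaults can also be overridden for a single command. The global <code>--region</code> and <code>--output</code> options take precedence over the profile values, and <code>aws configure get</code> reads a stored value back, for example:</p>
<pre><code>$ aws configure get region
us-west-2
$ aws s3api list-buckets --region us-east-1 --output table
</code></pre>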
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="confirmconfiguration">Confirm Configuration</h4>
<p>To check that everything is working properly, we run a simple command against AWS S3, Amazon&apos;s storage service.</p>
<p>First, if you don&apos;t have any S3 buckets available, you will need to quickly create one (<a href="https://s3.console.aws.amazon.com/s3/">https://s3.console.aws.amazon.com/s3/</a>).</p>
<p>Then, use <code>aws s3 ls</code> to list all the buckets that we have access to.</p>
<pre><code>&#x276F; aws s3 ls
2019-07-04 16:34:50 cloudtrail-logs
2020-01-14 15:18:54 cf-templates
2019-07-04 17:00:49 config-bucket
2020-01-15 15:08:13 query-results-bucket
</code></pre>
<p>Above is an example of the storage buckets that reside in this user&apos;s S3 account. Depending on the S3 buckets you have created, your list will look different.</p>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><h2 id="conclusion">Conclusion</h2>
<p>If you have made it this far, congratulations. You are now able to begin interacting with your AWS services through the command line interface.</p>
<p>We did this by downloading and installing the AWS-CLI application. We then created an access key using IAM through the web browser and used those credentials with <code>aws configure</code> to quickly configure and initialize AWS-CLI. Finally, we ran the <code>aws s3 ls</code> command to list all available S3 buckets and verify that the profile configuration is complete.</p>
<p>In the next few articles, we will look at interacting with individual services to manage their features.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[ITerm 2 Cheat Sheet]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="tabsandwindows">Tabs and Windows</h2>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>New Tab</td>
<td><code>&#x2318;</code> + <code>T</code></td>
</tr>
<tr>
<td>Close Tab or Window</td>
<td><code>&#x2318;</code> + <code>W</code>  (same as many mac apps)</td>
</tr>
<tr>
<td>Go to Tab</td>
<td><code>&#x2318;</code> + <code>Number Key</code>  (ie: <code>&#x2318;2</code> is 2nd tab)</td>
</tr>
<tr>
<td>Go to Split Pane by Direction</td>
<td><code>&#x2318;</code> + <code>Option</code> + <code>Arrow Key</code></td>
</tr>
<tr>
<td>Cycle iTerm Windows</td>
<td><code>&#x2318;</code> + <code>backtick</code>  (true of all</td></tr></tbody></table>]]></description><link>https://blog.virtual-artifact.com/iterm-2-cheat-sheet/</link><guid isPermaLink="false">62e185bb08b3cd09cac935b7</guid><category><![CDATA[cheatsheet]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Thu, 17 Oct 2019 14:03:28 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="tabsandwindows">Tabs and Windows</h2>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>New Tab</td>
<td><code>&#x2318;</code> + <code>T</code></td>
</tr>
<tr>
<td>Close Tab or Window</td>
<td><code>&#x2318;</code> + <code>W</code>  (same as many mac apps)</td>
</tr>
<tr>
<td>Go to Tab</td>
<td><code>&#x2318;</code> + <code>Number Key</code>  (ie: <code>&#x2318;2</code> is 2nd tab)</td>
</tr>
<tr>
<td>Go to Split Pane by Direction</td>
<td><code>&#x2318;</code> + <code>Option</code> + <code>Arrow Key</code></td>
</tr>
<tr>
<td>Cycle iTerm Windows</td>
<td><code>&#x2318;</code> + <code>backtick</code>  (true of all mac apps and works with desktops/mission control)</td>
</tr>
<tr>
<td><strong>Splitting</strong></td>
<td></td>
</tr>
<tr>
<td>Split Window Vertically (same profile)</td>
<td><code>&#x2318;</code> + <code>D</code></td>
</tr>
<tr>
<td>Split Window Horizontally (same profile)</td>
<td><code>&#x2318;</code> + <code>Shift</code> + <code>D</code>  (mnemonic: shift is a wide horizontal key)</td>
</tr>
<tr>
<td><strong>Moving</strong></td>
<td></td>
</tr>
<tr>
<td>Move a pane with the mouse</td>
<td><code>&#x2318;</code> + <code>Alt</code> + <code>Shift</code> and then drag the pane from anywhere</td>
</tr>
<tr>
<td><strong>Fullscreen</strong></td>
<td></td>
</tr>
<tr>
<td>Fullscreen</td>
<td><code>&#x2318;</code>+ <code>Enter</code></td>
</tr>
<tr>
<td>Maximize a pane</td>
<td><code>&#x2318;</code> + <code>Shift</code> + <code>Enter</code>  (use with fullscreen to temp fullscreen a pane!)</td>
</tr>
<tr>
<td>Resize Pane</td>
<td><code>Ctrl</code> + <code>&#x2318;</code> + <code>Arrow</code> (given you haven&apos;t mapped this to something else)</td>
</tr>
<tr>
<td><strong>Less Often Used By Me</strong></td>
<td></td>
</tr>
<tr>
<td>Go to Split Pane by Order of Use</td>
<td><code>&#x2318;</code> + <code>]</code> , <code>&#x2318;</code> + <code>[</code></td>
</tr>
<tr>
<td>Split Window Horizontally (new profile)</td>
<td><code>Option</code> + <code>&#x2318;</code> + <code>H</code></td>
</tr>
<tr>
<td>Split Window Vertically (new profile)</td>
<td><code>Option</code> + <code>&#x2318;</code> + <code>V</code></td>
</tr>
<tr>
<td>Previous Tab</td>
<td><code>&#x2318;</code>+ <code>Left Arrow</code>  (I usually move by tab number)</td>
</tr>
<tr>
<td>Next Tab</td>
<td><code>&#x2318;</code>+ <code>Right Arrow</code></td>
</tr>
<tr>
<td>Go to Window</td>
<td><code>&#x2318;</code> + <code>Option</code> + <code>Number</code></td>
</tr>
</tbody>
</table>
<h1 id="basicmoves">Basic Moves</h1>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Move back one character</td>
<td><code>Ctrl</code> + <code>B</code></td>
</tr>
<tr>
<td>Move forward one character</td>
<td><code>Ctrl</code> + <code>F</code></td>
</tr>
<tr>
<td>Delete current character</td>
<td><code>Ctrl</code> + <code>D</code></td>
</tr>
<tr>
<td>Delete previous word (in shell)</td>
<td><code>Ctrl</code> + <code>W</code></td>
</tr>
</tbody>
</table>
<h1 id="movingfaster">Moving Faster</h1>
<p>A lot of shell shortcuts work in iTerm, and it&apos;s good to learn them because arrow keys, Home/End keys, and their Mac equivalents don&apos;t always work. For example, <code>&#x2318;</code> + <code>Left Arrow</code> is usually the same as <code>Home</code> (go to the beginning of the current line), but that doesn&apos;t work in the shell. Home works in many apps, but it takes you away from the home row.</p>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Move to the start of line</td>
<td><code>Ctrl</code> + <code>A</code> or <code>Home</code></td>
</tr>
<tr>
<td>Move to the end of line</td>
<td><code>Ctrl</code> + <code>E</code> or <code>End</code></td>
</tr>
<tr>
<td>Move forward a word</td>
<td><code>Option</code> + <code>F</code></td>
</tr>
<tr>
<td>Move backward a word</td>
<td><code>Option</code> + <code>B</code></td>
</tr>
<tr>
<td>Set Mark</td>
<td><code>&#x2318;</code> + <code>M</code></td>
</tr>
<tr>
<td>Jump to Mark</td>
<td><code>&#x2318;</code> + <code>J</code></td>
</tr>
<tr>
<td>Moving by word on a line (this is a shell thing but passes through fine)</td>
<td><code>Ctrl</code> + <code>Left/Right Arrow</code></td>
</tr>
<tr>
<td>Cursor Jump with Mouse (shell and vim - might depend on config)</td>
<td><code>Option</code> + <code>Left Click</code></td>
</tr>
</tbody>
</table>
<h1 id="copyandpastewithitermwithoutusingthemouse">Copy and Paste with iTerm without using the mouse</h1>
<p>I don&apos;t use this feature too much.</p>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Enter Copy Mode</td>
<td><code>Shift</code> + <code>&#x2318;</code> + <code>C</code></td>
</tr>
<tr>
<td>Enter Character Selection Mode in Copy Mode</td>
<td><code>Ctrl</code> + <code>V</code></td>
</tr>
<tr>
<td>Move cursor in Copy Mode</td>
<td><code>HJKL</code> vim motions or arrow keys</td>
</tr>
<tr>
<td>Copy text in Copy Mode</td>
<td><code>Ctrl</code> + <code>K</code></td>
</tr>
</tbody>
</table>
<p>Copied text goes into the normal system clipboard, which you can paste as usual.</p>
<h1 id="searchthecommandhistory">Search the Command History</h1>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Search as you type</td>
<td><code>Ctrl</code> + <code>R</code> and type the search term; repeat <code>Ctrl</code> + <code>R</code> to loop through results</td>
</tr>
<tr>
<td>Search the last remembered search term</td>
<td><code>Ctrl</code> + <code>R</code> twice</td>
</tr>
<tr>
<td>End the search at current history entry</td>
<td><code>Ctrl</code> + <code>Y</code></td>
</tr>
<tr>
<td>Cancel the search and restore original line</td>
<td><code>Ctrl</code> + <code>G</code></td>
</tr>
</tbody>
</table>
<h1 id="misc">Misc</h1>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Clear the screen/pane (when <code>Ctrl + L</code> won&apos;t work)</td>
<td><code>&#x2318;</code> + <code>K</code>  (I use this all the time)</td>
</tr>
<tr>
<td>Broadcast command to all panes in window (nice when needed!)</td>
<td><code>&#x2318;</code> + <code>Alt</code> +  <code>I</code> (again to toggle)</td>
</tr>
<tr>
<td>Find Cursor</td>
<td><code>&#x2318;</code> + <code>/</code>  <em>or use a theme or cursor shape that is easy to see</em></td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>