Pandas and MySQL with a Hint of AWS RDS

Introduction

This article looks at connecting Python to a MySQL database. With the help of a few SQL connection tools, transferring data between Python and MySQL becomes much simpler. Finally, we will migrate the database to a cloud platform and run it there.

Prerequisites

  • A MySQL database with some data to work with
  • An AWS developer account
  • Python (this demo uses Python 3)

Create a MySQL DB on AWS RDS

  1. Log in to AWS (https://aws.amazon.com/)
  2. In the Resources section, click DB Instances
  3. In the top right you will find a Create database button, click this
  4. The "Create database" page will take you through setting up the database. Here are the settngs we will use
    • Choose a databese ceation method - Standard
    • Engine options
      • MySQL
      • Version - Use the version closest to your local version
    • Template - Free Tier
    • Settings
      • DB instance identifier - give the database a name
      • Master username - create name
      • Master password - create password
    • DB instance size - leave default settings
    • Storage - leave default settings
    • Availability & durability - Do not create a standby instance
    • Connectivity - change to publicly available
    • Database authentication - Passord authentication
    • Additional Configurations - leave defaults

It will take a few minutes to get the database up and running.

Meanwhile...

Set up virtual environment

  1. Create a directory for the files: mkdir sample-mysql-rds, then cd into the newly created directory.
  2. Initialize the virtual environment: python3 -m venv ./
  3. Activate the virtual environment: source ./bin/activate

Import modules

With the virtual environment set up and running, we can turn our attention to the modules needed for this demo.

  • modules
    • pandas
    • sqlalchemy
    • PyMySQL
    • boto3 (optional)

Pandas

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

pandas can be installed via pip from PyPI.

pip3 install pandas

SQLAlchemy

SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. SQLAlchemy provides a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.

pip3 install SQLAlchemy

PyMySQL

This package contains a pure-Python MySQL client library

pip3 install PyMySQL

Boto3 (Optional)

Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.

pip3 install boto3

Examine Database

Now that we are finished with the Python dependencies, let's take a moment to review the data we plan to import into the Python script. Here, I will be using data about the English Premier League.

You can either use a DBMS IDE like DataGrip or MySQL Workbench, or just interface with your database from the command line. Make sure the data exists and is available. We will connect to the local database first to make sure everything is functioning properly, and then we will deploy the database to AWS RDS.

Python and MySQL

Now that we have the data store in place and the right modules available to us, it's time to put rubber to the road!

Import data from MySQL

Create a new Python script

app.py

# (1)
import pandas as pd 
import sqlalchemy
from sqlalchemy import Table, Column, Integer, String, MetaData

# (2)
engine = sqlalchemy.create_engine('mysql+pymysql://username:password@localhost/demo_epl_1819')

  1. First, import the pandas and sqlalchemy libraries
  2. create_engine() creates an Engine object that bridges Python and a relational database. Let's look at the string parameter that is passed (a sketch of building it from environment variables follows the breakdown below)

'mysql+pymysql://username:password@localhost/demo_epl_1819'

  • mysql = the database dialect
  • pymysql = the DBAPI driver the engine uses to talk to MySQL
  • username:password = the database credentials
  • @localhost/demo_epl_1819 = the host and database name
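
Hardcoding credentials in the script is fine for a local demo, but here is a minimal sketch of building the same connection URL from environment variables instead. The variable names (DB_USER, DB_PASS, DB_HOST, DB_NAME) are placeholders assumed for this sketch, and URL.create() requires SQLAlchemy 1.4 or later.

import os
import sqlalchemy

# read connection details from environment variables (variable names assumed for this sketch)
db_user = os.environ.get('DB_USER', 'username')
db_pass = os.environ.get('DB_PASS', 'password')
db_host = os.environ.get('DB_HOST', 'localhost')
db_name = os.environ.get('DB_NAME', 'demo_epl_1819')

# URL.create() also handles escaping of special characters in the password
url = sqlalchemy.engine.URL.create(
    'mysql+pymysql',
    username=db_user,
    password=db_pass,
    host=db_host,
    database=db_name,
)
engine = sqlalchemy.create_engine(url)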

Table

Here we will look at accessing a MySQL table and reading the data into a dataframe.

app.py

df = pd.read_sql_table('match_results', engine)
print(df.head())
print(type(df))

The read_sql_table() method returns all the records of a MySQL table; it takes the table name and the Engine object created earlier as arguments. Pandas reads all of the SQL table data into a dataframe.
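
If you only need part of the table, read_sql_table() also accepts an optional columns list (and an index_col). A small sketch, using column names assumed from the demo dataset:

# pull only the columns we care about instead of the whole table
df_trimmed = pd.read_sql_table(
    'match_results',
    engine,
    columns=['HomeTeam', 'AwayTeam', 'FTR']
)
print(df_trimmed.head())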

Query

Queries can also be written for more customized results.

# create dataframe from sql query result
query_1 = 'SELECT  HomeTeam, AwayTeam, FTR FROM match_results;'
df_query = pd.read_sql_query(query_1, engine)
print(df_query.head())
print(type(df_query))

read_sql_query() takes a SQL statement as a string and an Engine object. It executes the query against the database and returns the result as a pandas dataframe.
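
When the query needs a user-supplied value, it is safer to pass it as a bind parameter than to build the SQL string by hand. A minimal sketch, assuming the demo data contains a team named 'Arsenal':

from sqlalchemy import text

# bind parameters keep user-supplied values out of the SQL string itself
query_2 = text('SELECT HomeTeam, AwayTeam, FTR FROM match_results WHERE HomeTeam = :team;')
df_home = pd.read_sql_query(query_2, engine, params={'team': 'Arsenal'})
print(df_home.head())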

Write to CSV and save to S3 (Optional)

This section leans on another article we did, where we created a storage service to write and read objects from S3 buckets.

  1. Create a directory named services
  2. Within services, create a Python file named storage_service.py

storage_service.py

*NOTE: you will need to have your AWS credentials available for boto3 to work. If you need help, reference this article.

import boto3


class StorageService:

    def __init__(self, storage_location):
        self.client = boto3.client('s3')
        self.bucket_name = storage_location

    def upload_file(self, file_name, object_name=None):
        if object_name is None:
            object_name = file_name

        response = self.client.upload_file(file_name, self.bucket_name, object_name)

        return response

    def download_object(self, object_name, file_name=None):
        if file_name is None:
            file_name = object_name

        print(file_name + " is the file name")

        response = self.client.download_file(self.bucket_name, object_name, file_name)

        return response

    def list_all_objects(self):
        objects = self.client.list_objects(Bucket=self.bucket_name)

        if "Contents" in objects:
            response = objects["Contents"]
        else:
            response = {}

        return response

    def delete_object(self, object_name):
        response = self.client.delete_object(Bucket=self.bucket_name, Key=object_name)

        return response


We will not go into detail about this code. If you are curious, check out the other demo for more on this service.

Pass data to S3

app.py

from services.storage_service import StorageService # (1)

storage_service = StorageService("your.first.boto.s3.bucket") # (2)
df_query.to_csv("output.csv", index=False) # (3)
storage_service.upload_file("output.csv") # (4)
  1. Import the storage service into the app
  2. Instantiate a StorageService object
  3. Create a CSV file from the dataframe
  4. Use the storage service to upload the CSV to the S3 bucket (a quick sanity check follows below)
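
As a quick sanity check, the storage service's list_all_objects() method can confirm the upload landed in the bucket:

# list the bucket contents to confirm output.csv made it to S3
for s3_object in storage_service.list_all_objects():
    print(s3_object['Key'], s3_object['Size'])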

Write to another table

Here is an example of using SQLAlchemy to create a table and then using pandas with SQLAlchemy to write the dataframe contents to the new SQL table.

meta = MetaData() 				# (1)

results_table = Table( 			# (2)
   'simple_result', meta,
   Column('id', Integer, primary_key=True, autoincrement=True),
   Column('HomeTeam', String(25)),
   Column('AwayTeam', String(25)),
   Column('FTR', String(1))
)
meta.create_all(engine)			# (3)

# (4)
df_query.to_sql(name='simple_result', con=engine, index=False, if_exists='append')
  1. MetaData() is a container object that keeps together many different features of a database (or multiple databases) being described.
  2. Create a Table object that represents the table to be created
  3. create_all() causes the MetaData() instance to create any tables associated with it in the database
  4. Pandas dataframes have a to_sql() method that writes the records stored in a DataFrame to a SQL database (a quick read-back check follows this list)
    • name - the SQL table name
    • con - the SQLAlchemy Engine connection
    • index - whether to write the dataframe index as a column in the table
    • if_exists - how to behave if the table already exists
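
To confirm the write worked, the new table can be read straight back into a dataframe:

# read the newly created table back to verify the insert
df_check = pd.read_sql_table('simple_result', engine)
print(df_check.head())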

Migrate to Cloud

  1. Export the data from MySQL to a file format of your choice
  2. Go back to the AWS RDS web console and click the MySQL database you created earlier. This will take you to a details page
  3. In the section titled 'Connectivity & security', make note of two things
    1. The endpoint value
    2. The port number
  4. Check that the security group is open to all traffic
  5. Use the terminal or command line to log into the new database: mysql --port=3306 --host=<<endpoint>> --user=<<username>> --password
  6. Create a new database
  7. Import the SQL exported earlier. Log out of mysql and upload the .sql file with the following: mysql -h <<endpoint>> -u <<username>> -p --port=3306 <<database name>> < <<path to sql file>>
  8. Check that the database loaded correctly
  9. Back in app.py, change the database URL to point to the RDS DB instance endpoint: mysql+pymysql://<<user name>>:<<password>>@<<db endpoint>>:<<port number>>/<<dbname>> (see the sketch after this list)
  10. Run the Python script
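
As a sketch of step 9, the only change in app.py is the connection string passed to create_engine(); the rest of the script stays the same. The endpoint, port, and credentials below are placeholders for your own RDS values:

# point the engine at the RDS instance instead of localhost (placeholder values)
engine = sqlalchemy.create_engine(
    'mysql+pymysql://<<user name>>:<<password>>@<<db endpoint>>:<<port number>>/<<dbname>>'
)
df = pd.read_sql_table('match_results', engine)
print(df.head())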

Tear Down

Remember to spin down any services you do not wish to incur charges on.

Conclusion

Using the SQLAlchemy library with pandas gave us easy access to our local and remote databases in the form of dataframes. We then looked at converting a dataframe to CSV format and saving that data to S3, leveraging a storage service built in a previous article. Finally, we migrated the MySQL database to AWS RDS and updated our application to connect to the remote database.