Pandas and MySQL with a hint of AWS RDS
Introduction
This article will look at connecting Python to a MySQL database. With the help of some SQL connection tools, transferring data between Python and MySQL will be simplified. Finally, we will migrate and run the database on a cloud platform.
Prerequisites
- A MySQL database with some data to work on
- An AWS developer account
- Python (this demo uses Python 3)
Create a MySQL DB on AWS RDS
- Log in to AWS (https://aws.amazon.com/)
- In the Resources section, click DB Instances
- In the top right you will find a Create database button, click this
- The "Create database" page will take you through setting up the database. Here are the settngs we will use
- Choose a database creation method - Standard
- Engine options
- MySQL
- Version - Use the version closest to your local version
- Template - Free Tier
- Settings
- DB instance identifier - give the database a name
- Master username - create name
- Master password - create password
- DB instance size - leave default settings
- Storage - leave default settings
- Availability & durability - Do not create a standby instance
- Connectivity - change to publicly available
- Database authentication - Password authentication
- Additional Configurations - leave defaults
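If you would rather script this step than click through the console, the same instance can be created with boto3, the AWS SDK for Python that we install later in this demo. This is a minimal sketch, assuming your AWS credentials are already configured; the identifier, username, and password below are placeholders.
import boto3
rds = boto3.client('rds')
# create a small, publicly accessible MySQL instance (instance class should be free tier eligible)
rds.create_db_instance(
    DBInstanceIdentifier='demo-epl-db',    # placeholder instance name
    DBInstanceClass='db.t2.micro',         # small, free-tier-eligible class
    Engine='mysql',
    MasterUsername='your_username',        # placeholder
    MasterUserPassword='your_password',    # placeholder
    AllocatedStorage=20,                   # storage in GiB
    PubliclyAccessible=True
)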
It will take a few minutes to get the database up and running.
Meanwhile...
Set up virtual environment
- Create a directory for the files
mkdir sample-mysql-rds
and then cd into the newly created directory.
- Initialize the virtual environment
python3 -m venv ./
- Start up the virtual environment
source ./bin/activate
Import modules
With the virtual environment set up and running, we can turn our attention to the modules needed for this demo.
- modules
- pandas
- sqlalchemy
- PyMySQL
- boto3 (optional)
Pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
pandas can be installed via pip from PyPI.
pip3 install pandas
SQLAlchemy
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. SQLAlchemy provides a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.
pip3 install SQLAlchemy
PyMySQL
This package contains a pure-Python MySQL client library.
pip3 install PyMySQL
Boto3 (Optional)
Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.
pip3 install boto3
Examine Database
Now that we are finished with the Python dependencies, let's take a moment to review the data we plan to import into the Python script. Here, I will be using data about the English Premier League.
You can either use a DBMS IDE like DataGrip or MySQL Workbench, or just interface with your database through a command line terminal. Make sure the data exists and is available. We will connect to the local database first to make sure everything is functioning properly, and then we will deploy the database to AWS RDS.
Python and MySQL
Now that we have the data store in place and the right modules available to us, it's time to put the rubber to the road!
Import data from MySQL
Create a new Python script
app.py
# (1)
import pandas as pd
import sqlalchemy
from sqlalchemy import Table, Column, Integer, String, MetaData
# (2)
engine = sqlalchemy.create_engine('mysql+pymysql://username:password@localhost/demo_epl_1819')
- First, import the pandas and sqlalchemy libraries
- create_engine() creates an Engine object that can be used to bridge Python and a relational database. Let's look at the string parameter that is passed:
'mysql+pymysql://username:password@localhost/demo_epl_1819'
- mysql = database type
- pymysql = the driver the engine uses to talk to MySQL
- username:password = database credentials
- @localhost/demo_epl_1819 = host and database name
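If your password contains special characters that would break the URL, newer versions of SQLAlchemy (1.4 and later) can build the connection string for you. Here is a minimal sketch, assuming the same local credentials as above.
from sqlalchemy.engine import URL
# build the connection URL programmatically (handles escaping of special characters)
db_url = URL.create(
    drivername='mysql+pymysql',
    username='username',
    password='password',
    host='localhost',
    database='demo_epl_1819'
)
engine = sqlalchemy.create_engine(db_url)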
Table
Here we will look at accessing a MySQL table and reading the data into a dataframe.
app.py
df = pd.read_sql_table('match_results', engine)
print(df.head())
print(type(df))
The read_sql_table() method returns all the records of a MySQL table; pass the table name and the Engine object created earlier as arguments. Pandas will read all of the SQL table's data into a DataFrame.
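read_sql_table() also accepts optional arguments, for example columns to limit what is pulled back. Here is a quick sketch using the same match_results columns that the query in the next section selects.
# pull back only the columns we need instead of the whole table
df_subset = pd.read_sql_table('match_results', engine, columns=['HomeTeam', 'AwayTeam', 'FTR'])
print(df_subset.head())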
Query
Queries can also be created for more customized results.
# create dataframe from sql query result
query_1 = 'SELECT HomeTeam, AwayTeam, FTR FROM match_results;'
df_query = pd.read_sql_query(query_1, engine)
print(df_query.head())
print(type(df_query))
read_sql_query() takes a SQL statement as a string and an Engine object. This will execute the query on the database and return the result as a pandas DataFrame.
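Queries can also be parameterized instead of built by string concatenation. Here is a minimal sketch, assuming a team name such as 'Arsenal' exists in the HomeTeam column of the demo data.
from sqlalchemy import text
# bind parameters keep user-supplied values out of the SQL string itself
query_2 = text('SELECT HomeTeam, AwayTeam, FTR FROM match_results WHERE HomeTeam = :team;')
df_home = pd.read_sql_query(query_2, engine, params={'team': 'Arsenal'})
print(df_home.head())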
Write to CSV and save to S3 (Optional)
This section will lean on another article we did, where we created a storage service to write and read objects from S3 buckets.
- Create a directory named services
- Within services, create a Python file named storage_service.py
storage_service.py
*NOTE: you will need to have your AWS credentials available for the boto3 service to work. If you need help, reference that article.
import boto3


class StorageService:
    def __init__(self, storage_location):
        self.client = boto3.client('s3')
        self.bucket_name = storage_location

    def upload_file(self, file_name, object_name=None):
        if object_name is None:
            object_name = file_name
        response = self.client.upload_file(file_name, self.bucket_name, object_name)
        return response

    def download_object(self, object_name, file_name=None):
        if file_name is None:
            file_name = object_name
        print(file_name + " is the file name")
        response = self.client.download_file(self.bucket_name, object_name, file_name)
        return response

    def list_all_objects(self):
        objects = self.client.list_objects(Bucket=self.bucket_name)
        if "Contents" in objects:
            response = objects["Contents"]
        else:
            response = {}
        return response

    def delete_object(self, object_name):
        response = self.client.delete_object(Bucket=self.bucket_name, Key=object_name)
        return response
We will not go into detail about this code. If you are curious, check out the other demo for more on this service.
Pass data to S3
app.py
from services.storage_service import StorageService # (1)
storage_service = StorageService("your.first.boto.s3.bucket") # (2)
df_query.to_csv("output.csv", index=False) # (3)
storage_service.upload_file("output.csv") # (4)
- Import the storage service into the app
- Instantiate a StorageService object
- Create a CSV file from the dataset
- Use the storage service to upload the CSV to the S3 bucket
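To confirm the upload worked, you can list the bucket contents with the same service. This is a small sketch using the list_all_objects() method defined above.
# list every object currently in the bucket to confirm the upload
for s3_object in storage_service.list_all_objects():
    print(s3_object["Key"])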
Write to another table
Here is an example of using SQLAlchemy to create a table, and then using pandas with SQLAlchemy to write the DataFrame contents to the new SQL table.
meta = MetaData() # (1)
results_table = Table( # (2)
'simple_result', meta,
Column('id', Integer, primary_key=True, autoincrement=True),
Column('HomeTeam', String(25)),
Column('AwayTeam', String(25)),
Column('FTR', String(1))
)
meta.create_all(engine) # (3)
# (4)
df_query.to_sql(name='simple_result', con=engine, index=False, if_exists='append')
- MetaData() is a container object that keeps together many different features of a database (or multiple databases) being described.
- Create a Table object that represents the table to be created.
- .create_all() will cause the MetaData() instance to create any tables associated with it.
- Pandas DataFrames have the method to_sql(), which writes records stored in a DataFrame to a SQL database:
  - name - SQL table name
  - con - SQLAlchemy Engine connection
  - index - whether to write the DataFrame index as a column in the table
  - if_exists - behaviour when the table already exists
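To verify the write, the new table can be read straight back into a DataFrame. Here is a quick sketch reusing read_sql_table() from earlier.
# read the newly written table back to confirm the rows landed
df_check = pd.read_sql_table('simple_result', engine)
print(df_check.head())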
Migrate to Cloud
- Export the data from MySQL to a file format of your choice
- Go back to the AWS RDS web console and click the MySQL database you created earlier. This will take you to a details page
- In the section titled 'Connectivity & security' make note of two things here
- Endpoint value
- Port number
- Check that the security group is open to all traffic
- Use terminal or command line to log into the new database.
mysql --port=3306 --host=<<endpoint>> --user=<<username>> --password
- Create a new database
- Import the SQL exported earlier. Log out of MySQL and upload the .sql file with the following
mysql -h <<endpoint>> -u <<username>> -p --port=3306 <<database name>> < <<path to sql file>>
- Check the database loaded correctly
- Back in app.py, change the database URL to point to the RDS DB instance endpoint (a sketch of the updated engine call follows this list)
mysql+pymysql://<<user name>>:<<password>>@<<db endpoint>>:<<port number>>/<<dbname>>
- Run the Python script
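The updated create_engine() call would look something like this; every << >> value is a placeholder for your own instance details.
# point the engine at the RDS instance instead of localhost
engine = sqlalchemy.create_engine(
    'mysql+pymysql://<<user name>>:<<password>>@<<db endpoint>>:<<port number>>/<<dbname>>'
)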
Tear Down
Remember to spin down any services you do not wish to incur charges on.
Conclusion
Using the SQLAlchemy library with pandas allowed easy access to our local and remote databases in the form of DataFrames. We then looked at converting the DataFrame to CSV format and saving that data in S3, leveraging a storage service used in a previous article. Finally, we migrated the MySQL database to AWS RDS and updated our application to connect to the remote database.