<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Virtual Artifact]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://blog.virtual-artifact.com/</link><image><url>https://blog.virtual-artifact.com/favicon.png</url><title>Virtual Artifact</title><link>https://blog.virtual-artifact.com/</link></image><generator>Ghost 5.5</generator><lastBuildDate>Sat, 09 May 2026 09:48:33 GMT</lastBuildDate><atom:link href="https://blog.virtual-artifact.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Coming soon]]></title><description><![CDATA[<p>This is Virtual Artifact, a brand new site by Froilan Miranda that&apos;s just getting started. Things will be up and running here shortly, but you can <a href="#/portal/">subscribe</a> in the meantime if you&apos;d like to stay up to date and receive emails when new content is published!</p>]]></description><link>https://blog.virtual-artifact.com/coming-soon/</link><guid isPermaLink="false">62e176c9735891445763d359</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 27 Jul 2022 17:32:57 GMT</pubDate><media:content url="https://static.ghost.org/v4.0.0/images/feature-image.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://static.ghost.org/v4.0.0/images/feature-image.jpg" alt="Coming soon"><p>This is Virtual Artifact, a brand new site by Froilan Miranda that&apos;s just getting started. Things will be up and running here shortly, but you can <a href="#/portal/">subscribe</a> in the meantime if you&apos;d like to stay up to date and receive emails when new content is published!</p>]]></content:encoded></item><item><title><![CDATA[AWS API Gateway Introduction]]></title><description><![CDATA[<p>Today APIs are the hot thing, and for good reason. They add simplicity and flexibility to complex data architectures. API Gateway offers the ability to create RESTful APIs and WebSocket APIs within the AWS Cloud Platform. API Gateway also supports serverless and containerized workloads.</p><p>Amazon API Gateway is an</p>]]></description><link>https://blog.virtual-artifact.com/untitled-2/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c7</guid><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Tue, 26 Jan 2021 01:06:29 GMT</pubDate><content:encoded><![CDATA[<p>Today APIs are the hot thing, and for good reason. They add simplicity and flexibility to complex data architectures. API Gateway offers the ability to create RESTful APIs and WebSocket APIs within the AWS Cloud Platform. API Gateway also supports serverless and containerized workloads.</p><p>Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and securing REST, HTTP, and WebSocket APIs at any scale. Developers can create API access to any web service connected to the internet. 
This includes AWS Cloud services and services not within the AWS Cloud.</p><h2 id="api-types">API Types</h2><ul><li>RESTful APIs - Optimized for serverless workloads and HTTP backends<ul><li>HTTP based</li><li>Enables stateless client-server communication</li><li>Standard GET, POST, PUT, PATCH, and DELETE HTTP methods</li></ul></li><li>WebSocket APIs - Real-time, two-way communication between applications<ul><li>Adheres to the WebSocket protocol, which enables stateful, full-duplex communication between client and server.</li><li>Routes incoming messages based on message content.</li></ul></li></ul><h2 id="benefits">Benefits</h2><ul><li>Efficient API development - The ability to run multiple versions of an API that can be quickly tested and released</li><li>Performance at any scale - Using the AWS global infrastructure keeps performance high and scaling easy</li><li>Cost savings at scale - Different tiering plans allow for cost flexibility as requests grow</li><li>Easy monitoring - Easy integration with CloudWatch to monitor API usage, latency, error rates and more.</li><li>Flexible security controls - Leverage IAM and Cognito to finely tune access, as well as OAuth 2 and OIDC support.</li><li>RESTful API options - Easily create and customize your RESTful APIs</li></ul><h2 id="lambdas">Lambdas</h2><p>Using API Gateway with AWS Lambda provides the app-facing part of the AWS serverless infrastructure.</p><p>Streamline a web application by hosting it on AWS Lambda, then expose Lambda functions through API Gateway. Both services are highly available, scalable, and can be monitored through CloudWatch. This greatly simplifies development and administration efforts.</p><h2 id="access-api-gateway">Access API Gateway</h2><p>AWS allows access to API Gateway through several means:</p><ul><li>AWS Management Console</li><li>AWS SDKs</li><li>API Gateway V1 and V2 APIs</li><li>AWS CLI</li><li>AWS Tools for Windows PowerShell</li></ul><h2 id="pricing">Pricing</h2><h3 id="http-and-rest-api">HTTP and REST API</h3><p>API Gateway charges for what you use. The charges are for the number of API calls and the amount of data transferred out. There are no upfront fees or setup charges.</p><p>There are options for Private APIs and data caching that can also affect charges.</p><h3 id="websocket-api">WebSocket API</h3><p>WebSocket APIs incur charges only for messages sent and received and for connection minutes</p><h3 id="free-tier">Free Tier</h3><p>If you are still on the Free Tier, you have 1,000,000 API calls, 1,000,000 messages, and 750,000 connection minutes available.</p><h3 id="after-free-tier-expires">After Free Tier Expires</h3><p>Charges after the free tier vary. Consult the pricing page to learn more about API Gateway pricing. <a href="https://aws.amazon.com/api-gateway/pricing/">Learn More</a></p><h2 id="conclusion">Conclusion</h2><p>AWS API Gateway provides an easy, secure, and scalable way to expose a web application through a RESTful service. Used with Lambda and other AWS services, you can quickly develop, test, and deploy web interfaces.</p><p>In addition, it is a service that offers Free Tier access, which makes it easy to test and try without buying.</p>]]></content:encoded></item><item><title><![CDATA[AWS CodeCommit Introduction - Part 2]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In the previous article we looked at a general overview of the AWS CodeCommit service.</p>
<p>In this article we will look at setting up and accessing a repository in CodeCommit.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>There are several ways to work with CodeCommit:</p>
<ul>
<li>AWS Management Console</li>
<li>Use Git credentials with HTTPs</li>
<li>Federated Access</li>
<li>Temporary credentials</li>
<li>Web</li></ul>]]></description><link>https://blog.virtual-artifact.com/aws-codecommit-introduction-part-2/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c6</guid><category><![CDATA[AWS]]></category><category><![CDATA[Git]]></category><category><![CDATA[Getting Started]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Mon, 18 Jan 2021 23:33:20 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In the previous article we looked at a general overview of the AWS CodeCommit service.</p>
<p>In this article we will look at setting up and accessing a repository in CodeCommit.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>There are several ways to work with CodeCommit:</p>
<ul>
<li>AWS Management Console</li>
<li>Use Git credentials with HTTPs</li>
<li>Federated Access</li>
<li>Temporary credentials</li>
<li>Web Identity Provider</li>
</ul>
<p>This article will use Git credentials and HTTPS</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>You will need the following setup in order to follow along with this walkthrough</p>
<ul>
<li>Git version control on your local machine <a href="https://git-scm.com/downloads">More Info</a></li>
<li>An AWS account with access to IAM credentials <a href="http://aws.amazon.com">More Info</a></li>
</ul>
<h2 id="part1settinguppermissions">Part 1 - Setting Up Permissions</h2>
<p>First we are going to give an existing AWS user the proper policies to access CodeCommit</p>
<ol>
<li>Log in to the AWS Management Console</li>
<li>Type &apos;iam&apos; in the top search bar and select the IAM service from the drop down.</li>
<li>Select &apos;Users&apos; from the left menu</li>
<li>Select the user you wish to add access to CodeCommit</li>
<li>Make sure the Permissions tab is selected and click add permissions</li>
<li>Select &apos;Attach existing policies directly&apos; in the Grant permissions section and type CodeCommit in the section below it.</li>
<li>Click the checkbox next to &apos;AWSCodeCommitFullAccess&apos;</li>
<li>Click &apos;Next: Review&apos; in the lower right corner</li>
</ol>
<p>This will take you back to the Summary page for the user.</p>
<h2 id="part2creategitcredentials">Part 2 - Create Git Credentials</h2>
<ol>
<li>Log in to the AWS Management Console</li>
<li>Type &apos;iam&apos; in the top search bar and select the IAM service from the drop down.</li>
<li>Select &apos;Users&apos; from the left menu</li>
<li>Select the user you wish to add access to CodeCommit</li>
<li>Make sure the &apos;Security credentials&apos; tab is selected. Scroll down to the &apos;HTTPS Git credentials for AWS CodeCommit&apos; section and click Generate credentials.</li>
<li>Download the credentials CSV somewhere safe. This will be needed later to connect to the repository</li>
</ol>
<h2 id="part3createarepository">Part 3 - Create a Repository</h2>
<ol>
<li>Log in to the AWS Management Console</li>
<li>Type &apos;codecommit&apos; in the top search bar and select the CodeCommit service from the drop down.</li>
<li>Click &apos;Create repository&apos;</li>
<li>Give your repository a name and click &apos;Create&apos; in the bottom right corner</li>
</ol>
<p>This will bring you to a &apos;Connection steps&apos; page with a green Success bar at the top.</p>
<p>Scroll down to &apos;Step 3: Clone the repository&apos; and copy the repository location</p>
<h2 id="part4connecttotherepository">Part 4 - Connect to the repository</h2>
<ol>
<li>Open a terminal and move to the directory that you wish to clone the repository to.</li>
<li>Use the repository address from &apos;Part 3&apos; to clone the repo. <code>git clone &lt;repo-url&gt;</code></li>
<li>You will be prompted to enter your git credentials from &apos;Part 2&apos;</li>
<li>Now that the repository is cloned to your local machine you can interact with it as you would any other Git repository</li>
<li>Add some files, push, and confirm in the AWS Management Console</li>
</ol>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AWS CodeCommit Introduction - Part 1]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In today&apos;s development environment, Git-based services are the norm, with GitHub being the industry standard and others like Bitbucket closely following. AWS has also added their solution for Git-based version control, called CodeCommit.</p>
<p>We will look at some of the features of CodeCommit and what makes it special compared to</p>]]></description><link>https://blog.virtual-artifact.com/aws-codecommit-basics/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c5</guid><category><![CDATA[AWS]]></category><category><![CDATA[Git]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Sun, 10 Jan 2021 23:43:44 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In today&apos;s development environment, Git-based services are the norm, with GitHub being the industry standard and others like Bitbucket closely following. AWS has also added their solution for Git-based version control, called CodeCommit.</p>
<p>We will look at some of the features of CodeCommit and what makes it special compared to the rest of the version control players</p>
<h2 id="awscodecommit">AWS CodeCommit</h2>
<p>As stated earlier, CodeCommit is an AWS hosted cloud service that allows for Git-based version control.</p>
<p>What does all this mean?</p>
<h2 id="awscloud">AWS Cloud</h2>
<p>AWS offers an ever expanding catalog of cloud services. These services are focused on creating secure, scalable and dynamically priced solutions for application infrastructure. CodeCommit is a source control service with all of these in mind. By using AWS to host private Git repositories you can shift the weight of some responsibility to a proven cloud services provider. Not having to manage and scale your source control solution yourself allows your business to focus on what is important...developing and delivering code. It supports standard Git operations and works with existing Git-based tools to fit right into your development pipeline.</p>
<h2 id="pricing">Pricing</h2>
<p>The pricing for CodeCommit is pretty straightforward. It is free for the first 5 users and then $1 for every user above 5. This is great for anyone who wants to get in and start trying out this service.</p>
<h2 id="fullymanaged">Fully Managed</h2>
<p>Because AWS fully manages the platform, there is no need to provision servers, update software, or manage configurations. Also, AWS&apos;s large cloud network means that service availability and durability are high. This leaves you with no hardware or software concerns, lowering administrative overhead.</p>
<h2 id="security">Security</h2>
<p>AWS is very much about security and it doesn&apos;t stop with CodeCommit. Data stored in CodeCommit is encrypted at rest and in transit.</p>
<h2 id="collaborationisamust">Collaboration is a Must</h2>
<p>One of the best benefits of Git is the collaborative ability when used in conjunction with online repositories. CodeCommit supports all the capabilities we know and love: pull requests, notifications, comments and more. The development process is the same as the other online repositories we have grown accustomed to.</p>
<h2 id="sizematters">Size Matters</h2>
<p>CodeCommit can scale to meet any development needs. It can handle large numbers of files, branches and revision histories. There is no limit on file size, repository size or file types.</p>
<h2 id="itsbetterontheinside">It&apos;s better on the Inside</h2>
<p>If you are already working with AWS and its vast list of services, you will be able to benefit from how easy it is to integrate them with CodeCommit. From within the AWS family of services CodeCommit can streamline deployment, monitoring, serverless services and more.</p>
<h2 id="makingthemove">Making the Move</h2>
<p>CodeCommit is compatible with most Git-based repositories, making migration easy. CodeCommit uses all the standard Git commands so there is nothing new for you to learn there. If you are familiar with the AWS CLI and APIs, there is support for CodeCommit with these interfaces as well; a small example is sketched below.</p>
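<p>As a quick illustration, here is a minimal sketch of that API access using the AWS SDK for Python (boto3). The repository name is made up for illustration, and this assumes boto3 is installed and your AWS credentials are already configured:</p>
<pre><code># Hypothetical example: create a repository and print its HTTPS clone URL
# (assumes boto3 is installed and AWS credentials are configured)
import boto3

codecommit = boto3.client(&apos;codecommit&apos;)

response = codecommit.create_repository(
    repositoryName=&apos;my-demo-repo&apos;,  # made-up name for illustration
    repositoryDescription=&apos;Repository created from the AWS SDK for Python&apos;)

print(response[&apos;repositoryMetadata&apos;][&apos;cloneUrlHttp&apos;])
</code></pre>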
<h2 id="conclusion">Conclusion</h2>
<p>CodeCommit is a professional and well equipped Git repository hosting service backed by the AWS global infrastructure. It is not as well known as some of the industry staples like GitHub and Bitbucket. But we have seen that it offers the same benefits and a little extra if you are already using AWS managed services. Because of its free tier offers, I recommend taking it for a spin.</p>
<h2 id="andbeyond">And Beyond</h2>
<p>In the next few articles we will look at setting up CodeCommit and integrating it with some of AWS Cloud Services. Hope to see you there.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[PySpark - Spark SQL Context]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Spark aims to make it easy to work with data. One way it achieves this is by letting you work with Spark data as if you were working with a SQL database</p>
<ul>
<li>Spark SQL enables querying of DataFrames as database tables</li>
<li>Temporary per-session and global tables</li>
<li>The Catalyst optimizer makes SQL queries</li></ul>]]></description><link>https://blog.virtual-artifact.com/pyspark-spark-sql-context/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c3</guid><category><![CDATA[Getting Started]]></category><category><![CDATA[Spark 2]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Mon, 20 Apr 2020 15:10:02 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Spark aims to make it easy to work with data. One way it achieves this is by letting you work with Spark data as if you were working with a SQL database</p>
<ul>
<li>Spark SQL enables querying of DataFrames as database tables</li>
<li>Temporary per-session and global tables</li>
<li>The Catalyst optimizer makes SQL queries fast</li>
<li>Tables schemas can be inferred or explicitly specified</li>
</ul>
<h2 id="basicoperations">Basic Operations</h2>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>Import SparkSession</p>
<pre><code>spark = SparkSession.builder\
                    .appName(&apos;Analyzing Students&apos;)\
                    .getOrCreate()
</code></pre>
<p>Create a new Session</p>
<pre><code>from pyspark.sql.types import Row
from datetime import datetime
</code></pre>
<p>Import some libraries</p>
<pre><code>record = sc.parallelize([Row(id = 1,
                             name = &apos;Jill&apos;,
                            active = True,
                            clubs = [&apos;chess&apos;, &apos;hockey&apos;],
                            subjects = {&apos;math&apos;:80, &apos;english&apos;: 56},
                            enrolled = datetime(2014, 8, 1, 14, 1, 5)),
                        Row(id = 2,
                            name = &apos;George&apos;,
                            active = False,
                            clubs = [&apos;chess&apos;,&apos;soccer&apos;],
                           subjects = {&apos;math&apos;: 60, &apos;english&apos;:96},
                           enrolled = datetime(2015, 3, 21, 8, 2, 5))
                        ])
</code></pre>
<p>Use <code>parallelize(...)</code> to create an RDD made of Row objects that contain a mixture of data types</p>
<pre><code>record_df = record.toDF()
record_df.show()
</code></pre>
<p>Create a DataFrame from the RDD</p>
<pre><code>record_df.createOrReplaceTempView(&apos;records&apos;)
</code></pre>
<p>To run SQL against this data we first need to register the DataFrame as a table. The name of the SQL table is <strong>records</strong> and it only exists within this session. Once the session exits the table is also destroyed</p>
<pre><code>all_records_df = sqlContext.sql(&apos;SELECT * FROM records&apos;)

all_records_df.show()
</code></pre>
<p>Use <code>sqlContext.sql(...)</code> to pass SQL statements that query the table and return a DataFrame</p>
<pre><code>sqlContext.sql(&apos;SELECT id, clubs[1], subjects[&quot;english&quot;] FROM records&apos;).show()
</code></pre>
<p>This is a more complex query, returning subsets of collection data from the queried rows</p>
<pre><code>sqlContext.sql(&apos;SELECT ID, NOT active FROM records&apos;).show()
</code></pre>
<p>Logical operators (AND, OR, NOT) can be used</p>
<pre><code>sqlContext.sql(&apos;SELECT * FROM records WHERE subjects[&quot;english&quot;] &gt; 90&apos;).show()
</code></pre>
<p>Comparison operators are also available (&lt;, &gt;, &lt;=, &gt;=)</p>
<pre><code>record_df.createGlobalTempView(&apos;global_records&apos;)
</code></pre>
<p>In order to make a table accessible to all sessions on the cluster we must register the table as <em>Global</em></p>
<pre><code>sqlContext.sql(&apos;SELECT * FROM global_temp.global_records&apos;).show()
</code></pre>
<p>In order to access a global table view, the <code>global_temp</code> namespace must be provided along with the table name</p>
<h2 id="analyzingdatawithsparksql">Analyzing Data with Spark SQL</h2>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>Import SparkSession</p>
<pre><code>spark = SparkSession.builder\
                    .appName(&quot;Analyzing airline data&quot;)\
                    .getOrCreate()
</code></pre>
<p>Create Session</p>
<pre><code>from pyspark.sql.types import Row
from datetime import datetime
</code></pre>
<p>Import some libraries to be used later</p>
<pre><code>airlinesPath = &apos;/Users/froilanmiranda/python-envs/sparktest/datasets/airlines.csv&apos;
flightsPath = &apos;/Users/froilanmiranda/python-envs/sparktest/datasets/flights.csv&apos;
airportsPath = &apos;/Users/froilanmiranda/python-envs/sparktest/datasets/airports.csv&apos;
</code></pre>
<p>Create some variables to represent the paths to the data sets</p>
<pre><code>airlines = spark.read\
                .format(&apos;csv&apos;)\
                .option(&apos;header&apos;, &apos;true&apos;)\
                .load(airlinesPath)
</code></pre>
<p>Create a DataFrame from the csv file</p>
<pre><code>airlines.createOrReplaceTempView(&apos;airlines&apos;)
</code></pre>
<p>Register a table view of the DataFrame that is only accessible from this SparkSession</p>
<pre><code>airlines = spark.sql(&apos;SELECT * FROM airlines&apos;)
airlines.columns
</code></pre>
<p>Explore the data by displaying the columns</p>
<pre><code>airlines.show(5)
</code></pre>
<p>Continue exploring the data by displaying the first few rows</p>
<pre><code>flights = spark.read\
                .format(&apos;csv&apos;)\
                .option(&apos;header&apos;,&apos;true&apos;)\
                .load(flightsPath)
</code></pre>
<p>Read in the next csv as a DataFrame</p>
<pre><code>flights.createOrReplaceTempView(&apos;flights&apos;)

flights.columns
</code></pre>
<p>Register another table view from the DataFrame. Then display the columns to begin exploring the data</p>
<pre><code>flights.show(5)
</code></pre>
<p>Explore this data by printing the first few records to the screen</p>
<pre><code>flights.count(), airlines.count()
</code></pre>
<p>Get a total record count for each set of data</p>
<pre><code>flights_count = spark.sql(&apos;SELECT COUNT(*) FROM flights&apos;)
airlines_count = spark.sql(&apos;SELECT COUNT(*) FROM airlines&apos;)
</code></pre>
<p>We can also use SQL to get the same data</p>
<pre><code>flights_count, airlines_count
</code></pre>
<p>Display the result of the SQL query. Notice the result is a DataFrame</p>
<pre><code>flights_count.collect()[0][0], airlines_count.collect()[0][0]
</code></pre>
<p>We can use matrix notation to extract particular values from the resulting DataFrame</p>
<pre><code>total_distance_df = (spark.sql(&apos;SELECT distance FROM flights&apos;)  # (1)
                        .agg({&apos;distance&apos;:&apos;sum&apos;})  # (2)
                        .withColumnRenamed(&apos;sum(distance)&apos;, &apos;total_distance&apos;))  # (3)
</code></pre>
<p>Mixing of DataFrame and SQL operations are valid since the sqlContext will return a DataFrame</p>
<ol>
<li>Return the &apos;distance&apos; column as a DataFrame</li>
<li>Apply an aggregation on the DataFrame</li>
<li>Create a new column in the DataFrame and assign the aggregate value to the new column</li>
</ol>
<pre><code>total_distance_df.show()
</code></pre>
<p>Display DataFrame values</p>
<pre><code>all_delays_2012 = spark.sql(
    &apos;SELECT date, airlines, flight_number, departure_delay &apos; +
    &apos;FROM flights WHERE departure_delay &gt; 0 and year(date) = 2012&apos;)
</code></pre>
<p>This results in an empty DataFrame; no records match the <code>WHERE</code> criteria</p>
<pre><code>all_delays_2012.show(5)
</code></pre>
<p>Displays empty DataFrame</p>
<pre><code>all_delays_2014 = spark.sql(
    &apos;SELECT date, airlines, flight_number, departure_delay &apos; +
    &apos;FROM flights WHERE departure_delay &gt; 0 and year(date) = 2014&apos;)

all_delays_2014.show(5)
</code></pre>
<p>Change the criteria to capture data that exists in the table view</p>
<pre><code>all_delays_2014.createOrReplaceTempView(&apos;all_delays&apos;)
</code></pre>
<p>Register the resulting DataFrame as a table view</p>
<pre><code>all_delays_2014.orderBy(all_delays_2014.departure_delay.desc()).show(5)
</code></pre>
<p>Sort all the records by the delay time. Notice the values for the delay don&apos;t make sense; earlier we saw other delay times that were greater in value. Why is this? Because the delay column is being treated as a string value. This is why taking your time to observe and explore your data is crucial. We will not use this data, so we can leave it as is; a possible fix is sketched below</p>
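<p>If we did want the ordering here to be numeric, one option (a small sketch, not used in the rest of this walkthrough) is to cast the string column to a float before sorting:</p>
<pre><code>from pyspark.sql.functions import col

# Cast the string column to a float so the ordering is numeric
all_delays_typed = all_delays_2014.withColumn(
    &apos;departure_delay&apos;, col(&apos;departure_delay&apos;).cast(&apos;float&apos;))

all_delays_typed.orderBy(all_delays_typed.departure_delay.desc()).show(5)
</code></pre>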
<pre><code>delay_count = spark.sql(&apos;SELECT COUNT(departure_delay) FROM all_delays&apos;)
</code></pre>
<p>Collect the total count of flights delayed</p>
<pre><code>delay_count.show()
</code></pre>
<p>Display this result</p>
<pre><code>delay_count.collect()[0][0]
</code></pre>
<p>Extract the single piece of data</p>
<pre><code>delay_percent = delay_count.collect()[0][0] / flights_count.collect()[0][0] * 100
delay_percent
</code></pre>
<p>Using all this data we can calculate the percentage of flights that were delayed</p>
<pre><code>delay_per_airline = spark.sql(&apos;SELECT airlines, departure_delay  FROM flights&apos;)\
                        .groupBy(&apos;airlines&apos;)\
                        .agg({&apos;departure_delay&apos;:&apos;avg&apos;})\
                        .withColumnRenamed(&apos;avg(departure_delay)&apos;, &apos;departure_delay&apos;)
</code></pre>
<p>Now let&apos;s get the average delay by airline</p>
<pre><code>delay_per_airline.orderBy(delay_per_airline.departure_delay.desc()).show(5)
</code></pre>
<p>Ordering by departure delay in descending order gives us the airlines with the longest delays</p>
<pre><code>delay_per_airline.createOrReplaceTempView(&apos;delay_per_airline&apos;)
</code></pre>
<p>Register the DataFrame as a table view to perform SQL queries</p>
<pre><code>delay_per_airline = spark.sql(&apos;SELECT * FROM delay_per_airline ORDER BY departure_delay DESC&apos;)
</code></pre>
<p>This will assign ordered data from the SQL table into a DataFrame.</p>
<pre><code>delay_per_airline.show(5)
</code></pre>
<p>Displaying this data we can see it matches the previous operation. This is to show that SQL and DataFrame operations will result in the same outcome</p>
<pre><code>delay_per_airline = spark.sql(&apos;SELECT * FROM delay_per_airline &apos; +
                              &apos;JOIN airlines ON airlines.code = delay_per_airline.airlines &apos; +
                              &apos;ORDER BY departure_delay DESC&apos;)
</code></pre>
<p>Using a SQL join, we are able to combine two registered SQL tables and return a DataFrame</p>
<pre><code>delay_per_airline.show(5)
</code></pre>
<p>Display the first few rows of the resulting DataFrame</p>
<h2 id="inferredandexplicitschemas">Inferred and Explicit Schemas</h2>
<p>Spark will infer data types when creating DataFrames. But sometimes we will need to explicitly set the schema of the DataFrame</p>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<pre><code>spark = SparkSession.builder\
                    .appName(&apos;Inferred and explicit schemas&apos;)\
                    .getOrCreate()
</code></pre>
<pre><code>from pyspark.sql.types import Row
</code></pre>
<p>Import the needed libraries and create the needed entities as per usual</p>
<pre><code>lines = sc.textFile(&apos;/Users/froilanmiranda/python-envs/sparktest/datasets/students.txt&apos;)
</code></pre>
<p>Use the SparkContext that is directly available to read the text file into an RDD</p>
<pre><code>lines.collect()
</code></pre>
<p>This is a comma separated list about a few students. Every line is a string and every string has values separated by commas</p>
<pre><code>parts = lines.map(lambda l: l.split(&apos;,&apos;)) # (1)

parts.collect() # (2)
</code></pre>
<ol>
<li>Use the map function with a lambda to create a list from the string value of each row</li>
<li>Display the result to screen</li>
</ol>
<pre><code>students = parts.map(lambda p: Row(name=p[0], math=int(p[1]), english=int(p[2]), science=int(p[3])))
</code></pre>
<p>Again, use the map function with a lambda to create Row objects from the list</p>
<pre><code>students.collect()
</code></pre>
<p>Display the result</p>
<pre><code>schemaStudents = spark.createDataFrame(students)

schemaStudents.createOrReplaceTempView(&apos;students&apos;)
</code></pre>
<p>Create a DataFrame from the RDD and then register it as a SQL table</p>
<pre><code>schemaStudents.columns
</code></pre>
<p>Show column info of the DataFrame</p>
<pre><code>schemaStudents.schema
</code></pre>
<p>We did not declare a schema for the DataFrame but it was able to use reflection to infer the schema. Notice the data types <code>StructType</code> and <code>StructField</code></p>
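<p>As a small aside, the same schema information can also be printed as an indented tree:</p>
<pre><code>schemaStudents.printSchema()
</code></pre>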
<pre><code>spark.sql(&apos;SELECT * FROM students&apos;).show()
</code></pre>
<p>It is because of the inferred typing that Spark can create a schema for the table view when it is registered</p>
<pre><code>parts.collect()
</code></pre>
<p>Now let&apos;s use the parts RDD to create a DataFrame and explicitly define the schema. We can see it is an RDD of List elements</p>
<pre><code>parts_typed = parts.map(lambda p: Row(name=p[0], math=int(p[1]), english=int(p[2]), science=int(p[3])))
</code></pre>
<p>As we can see above, the values for the grades are strings and we will need them to be numbers. Use the map function and a lambda together to accomplish this</p>
<pre><code>schemaString = &apos;name math english science&apos;
</code></pre>
<p>This is just to map out, for visual reference only, what columns we want the schema to contain</p>
<pre><code>from pyspark.sql.types import StructType, StructField, StringType, LongType, IntegerType

fields = [StructField(&apos;name&apos;, StringType(), True),
         StructField(&apos;math&apos;, IntegerType(), True),
         StructField(&apos;english&apos;, IntegerType(), True),
         StructField(&apos;science&apos;, IntegerType(), True)]
</code></pre>
<p>Specify the fields for every record. Each column is represented by a <code>StructField</code> and takes values for the column name, the data type, and whether it is nullable.</p>
<pre><code>schema = StructType(fields)
</code></pre>
<p>Create a <code>StructType</code> and pass the <code>StructFields</code> as a parameter to create a schema</p>
<pre><code>schemaStudents = spark.createDataFrame(parts_typed, schema)
</code></pre>
<p>Create a DataFrame using the RDD of List and the configured schema</p>
<pre><code>schemaStudents.columns
</code></pre>
<p>Confirm columns have been properly named</p>
<pre><code>schemaStudents.schema
</code></pre>
<p>Confirm schema is configured correctly</p>
<pre><code>schemaStudents.createOrReplaceTempView(&apos;students_explicit&apos;)
</code></pre>
<p>Register the DataFrame as a SQL table</p>
<pre><code>spark.sql(&apos;SELECT * FROM students_explicit&apos;).show()
</code></pre>
<p>And now with the schema explicitly in place we can query the data with SQL</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[PySpark - Using DataFrames]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h1 id="sparkdataframes">Spark DataFrames</h1>
<p>Previously we looked at RDDs, which were the primary data structure in Spark 1. In Spark 2 we rarely use RDDs, reserving them for low-level transformations and control over the dataset. If the data is unstructured or streaming we still have to rely on RDDs, for everything</p>]]></description><link>https://blog.virtual-artifact.com/pyspark-using-dataframes/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c2</guid><category><![CDATA[Python]]></category><category><![CDATA[Spark 2]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 15 Apr 2020 17:08:17 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="sparkdataframes">Spark DataFrames</h1>
<p>Previously we looked at RDDs, which were the primary data structure in Spark 1. In Spark 2 we rarely use RDDs, reserving them for low-level transformations and control over the dataset. If the data is unstructured or streaming we still have to rely on RDDs; for everything else we will use DataFrames</p>
<h2 id="sparksessionvssparkcontext">SparkSession vs. SparkContext</h2>
<p>Up until now we have been using the SparkContext as the entry point to Spark. Moving forward, the SparkSession will be the entry point we will utilize</p>
<p>SparkSession offers:</p>
<ul>
<li>Ease of Use
<ul>
<li>SparkSession - simplified entry point</li>
<li>No confusion about which context to use</li>
<li>Encapsulates SQLContext and Hive Context</li>
</ul>
</li>
</ul>
<p>To create a <strong>SparkSession</strong> use <code>SparkSession.builder</code></p>
<h2 id="exploringdatawithdataframes">Exploring Data with DataFrames</h2>
<h3 id="sparksessionanddataframes">SparkSession and DataFrames</h3>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>Import the necessary libraries</p>
<pre><code>spark = SparkSession.builder\
                    .appName(&quot;Analyzing London Crime Data&quot;)\
                    .getOrCreate()
</code></pre>
<p>Build a new SparkSession and assign it a name if a session with the same name does not exist. Otherwise, return the existing session with this app name. This will be the entry point to the Spark engine</p>
<pre><code>data = (spark.read  # (1)
            .format(&quot;csv&quot;)  # (2)
            .option(&quot;header&quot;, &quot;true&quot;)  # (3)
            .load(&quot;../datasets/london_crime_by_lsoa.csv&quot;))  # (4)
</code></pre>
<ol>
<li><code>.read</code> returns a DataFrameReader that can be used to read non-streaming data in as a DataFrame.</li>
<li><code>.format(...)</code> sets the file format to be read</li>
<li><code>.options(...)</code> adds input options for the underlying data source</li>
<li><code>.load(...)</code> loads input in as a DataFrame from a data source</li>
</ol>
<pre><code>data.printSchema()
</code></pre>
<p>Remember DataFrames are always structured data. Using <code>.printSchema</code> will print the schema of the tabular data</p>
<pre><code>data.count()
</code></pre>
<p>We can see the number of rows in this DataFrame</p>
<pre><code>data.limit(5).show()
</code></pre>
<p>Examine the data by looking at the first 5 rows with the <code>.show()</code> function</p>
<h3 id="dropandselectcolumns">Drop and Select Columns</h3>
<pre><code>data.dropna()
</code></pre>
<p>Drop rows which have values that are not available (N/A). Note that DataFrames are immutable, so to keep the result you would assign it back (for example <code>data = data.dropna()</code>). As we know, this is a major part of data cleaning</p>
<pre><code>data = data.drop(&apos;lsoa_code&apos;)

data.show(5)
</code></pre>
<p>To drop a column we can use <code>.drop(...)</code> and pass a column name as a string value.</p>
<pre><code>total_boroughs = (data.select(&apos;borough&apos;)  # (1)
                    .distinct())  # (2)
total_boroughs.show()
</code></pre>
<ol>
<li>Select a column from the DataFrame</li>
<li>Select only distinct values</li>
</ol>
<pre><code>total_boroughs.count()
</code></pre>
<p>The number of distinct values for this column from the DataFrame</p>
<h3 id="filterrecords">Filter Records</h3>
<pre><code>hackney_data = data.filter(data[&apos;borough&apos;] == &apos;Hackney&apos;)
hackney_data.show(5)
</code></pre>
<p>Using <code>.filter(...)</code> we can filter records based on their column values</p>
<pre><code># (1)
data_2015_2016 = data.filter(data[&apos;year&apos;].isin([&quot;2015&quot;, &quot;2016&quot;]))

# (2)
data_2015_2016.sample(fraction=0.1).show()
</code></pre>
<ol>
<li>Notice the use of <code>.isin(...)</code> within the filter parameters. This will select the records whose column values match any of the values passed into it</li>
<li><code>.sample(...)</code> returns a sampled subset of this DataFrame. <code>fraction</code> determines the fraction size of the full DataFrame to return</li>
</ol>
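<p>As an aside, <code>.sample(...)</code> also accepts a seed if you want the same random subset on every run (a small sketch):</p>
<pre><code>data_2015_2016.sample(fraction=0.1, seed=42).show()
</code></pre>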
<pre><code>data_2014_onwards = data.filter(data[&apos;year&apos;] &gt;=2014)

data_2014_onwards.sample(fraction=0.1).show()
</code></pre>
<p>Another example of <code>.filter(...)</code>, using the <code>&gt;=</code> comparison operator to select records with a column value greater than or equal to the value passed</p>
<h3 id="aggregationsandgrouping">Aggregations and grouping</h3>
<pre><code>borough_crime_count = data.groupBy(&apos;borough&apos;)\
                            .count()

borough_crime_count.show(5)
</code></pre>
<p>DataFrames support grouping of data, with the <code>.groupBy(...)</code> function. <code>.groupBy(...)</code> can be used on any column</p>
<pre><code>borough_crime_count = data.groupBy(&apos;borough&apos;)\
                            .agg({&quot;value&quot;:&quot;sum&quot;})

borough_crime_count.show(5)
</code></pre>
<p><code>.agg(...)</code> is a function that will compute aggregates and return the result as a DataFrame.</p>
<p>Built-in aggregation functions:</p>
<ul>
<li>avg</li>
<li>max</li>
<li>min</li>
<li>sum</li>
<li>count</li>
</ul>
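<p>Several of these can be combined in a single <code>.agg(...)</code> call. A quick sketch against the same crime DataFrame:</p>
<pre><code>data.groupBy(&apos;borough&apos;)\
    .agg({&apos;value&apos;: &apos;avg&apos;, &apos;year&apos;: &apos;max&apos;})\
    .show(5)
</code></pre>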
<pre><code>borough_conviction_sum = data.groupBy(&apos;borough&apos;)\
                            .agg({&quot;value&quot;:&quot;sum&quot;})\
                            .withColumnRenamed(&apos;sum(value)&apos;,&apos;convictions&apos;)

borough_conviction_sum.show(5)
</code></pre>
<p>Using <code>.withColumnRenamed(&lt;original name&gt;, &lt;new name&gt;)</code> will result in the column name being replaced by a new name</p>
<pre><code>total_borough_convictions = borough_conviction_sum.agg({&apos;convictions&apos;:&apos;sum&apos;})

total_borough_convictions.show()
</code></pre>
<p>By removing the grouping function, the aggregate will act on the whole DataFrame</p>
<pre><code>total_convictions = total_borough_convictions.collect()[0][0]
</code></pre>
<p>Using matrix notation we can grab the value out of the collection and assign it to a variable</p>
<pre><code>import pyspark.sql.functions as func
</code></pre>
<p>Imports some extra functionality from the PySpark library</p>
<pre><code>borough_percentage_contribution = borough_conviction_sum.withColumn(
    &apos;% contribution&apos;,
    func.round(borough_conviction_sum.convictions / total_convictions * 100, 2))

borough_percentage_contribution.printSchema()
</code></pre>
<p>Here we create a new column and use the previous variable to calculate the new column value</p>
<pre><code>borough_percentage_contribution.orderBy(borough_percentage_contribution[2].desc())\
                                .show(10)
</code></pre>
<p>We can use <code>.orderBy(...)</code> and a column index to sort the DataFrame in ascending or descending order</p>
<pre><code>conviction_monthly = data.filter(data[&apos;year&apos;] == 2014)\
                            .groupBy(&apos;month&apos;)\
                            .agg({&apos;value&apos;:&apos;sum&apos;})\
                            .withColumnRenamed(&apos;sum(value)&apos;, &apos;convictions&apos;)
</code></pre>
<p>Here we use a combination of group by, aggregate and column renaming to extract the data</p>
<pre><code>total_conviction_monthly = conviction_monthly.agg({&apos;convictions&apos;:&apos;sum&apos;})\
                                            .collect()[0][0]

total_conviction_monthly = conviction_monthly.withColumn(
                &apos;percent&apos;,
                func.round(conviction_monthly.convictions/total_conviction_monthly * 100, 2))
total_conviction_monthly.columns
</code></pre>
<p>Now we use more transformations to alter the data more and print the resulting DataFrame columns</p>
<pre><code>total_conviction_monthly.orderBy(total_conviction_monthly.percent.desc()).show()
</code></pre>
<p>Finally, we order the resulting DataFrame and display</p>
<h2 id="aggregationsandvisualizations">Aggregations and Visualizations</h2>
<pre><code>crimes_category = data.groupBy(&apos;major_category&apos;)\
                        .agg({&apos;value&apos;:&apos;sum&apos;})\
                        .withColumnRenamed(&apos;sum(value)&apos;,&apos;convictions&apos;)
</code></pre>
<p>Use group by and aggregates to create a DataFrame</p>
<pre><code>crimes_category.orderBy(crimes_category.convictions.desc()).show()
</code></pre>
<p>Order and display the new DataFrame</p>
<pre><code>year_df = data.select(&apos;year&apos;)
</code></pre>
<p>Create a new DataFrame from one column</p>
<pre><code>year_df.agg({&apos;year&apos;:&apos;min&apos;}).show()
</code></pre>
<p>Use the min aggregate to return the minimum value</p>
<pre><code>year_df.agg({&apos;year&apos;:&apos;max&apos;}).show()
</code></pre>
<p>Use the max aggregate to return the maximum value</p>
<pre><code>year_df.describe().show()
</code></pre>
<p><code>.describe()</code> will return:</p>
<ul>
<li>count</li>
<li>mean</li>
<li>standard deviation</li>
<li>min</li>
<li>max</li>
</ul>
<pre><code>data.crosstab(&apos;borough&apos;, &apos;major_category&apos;)\
    .select(&apos;borough_major_category&apos;, &apos;Burglary&apos;, &apos;Drugs&apos;, &apos;Fraud or Forgery&apos;, &apos;Robbery&apos;)\
    .show()
</code></pre>
<p><code>.crosstab(...)</code> computes a pair-wise frequency table of the given columns. Also known as a contingency table.</p>
<pre><code>get_ipython().magic(&apos;matplotlib inline&apos;)
import matplotlib.pyplot as plt
plt.style.use(&apos;ggplot&apos;)
</code></pre>
<p>Matplotlib graphs are displayed inline in this notebook</p>
<pre><code>def describe_year(year):
    yearly_details = data.filter(data.year == year)\
                        .groupBy(&apos;borough&apos;)\
                        .agg({&apos;value&apos;:&apos;sum&apos;})\
                        .withColumnRenamed(&apos;sum(value)&apos;, &apos;convictions&apos;)
    
    borough_list = [x[0] for x in yearly_details.toLocalIterator()]
    convictions_list = [x[1] for x in yearly_details.toLocalIterator()]
    
    plt.figure(figsize=(33,10))
    plt.bar(borough_list, convictions_list)
    
    plt.title(&apos;Crime for the year: &apos; + year, fontsize=30)
    plt.xlabel(&apos;Boroughs&apos;,fontsize=30)
    plt.ylabel(&apos;Convictions&apos;, fontsize=30)
    
    plt.xticks(rotation=90, fontsize=30)
    plt.yticks(fontsize=30)
    plt.autoscale()
    plt.show()
</code></pre>
<p>This is a helper function to contain all the necessary steps to create the DataFrame based on the year and create the chart to visualize it. A sample call is shown below</p>
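<p>Calling the helper with a year (as a string, since the CSV columns were read in as strings) draws the chart:</p>
<pre><code>describe_year(&apos;2014&apos;)
</code></pre>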
<h2 id="extractingdataanduserdefinedfunctions">Extracting Data and User Defined Functions</h2>
<p>In this section we will explore using DataFrames to explore and clean data. We will use <em>User Defined Functions</em> to assist us in the process</p>
<pre><code>from pyspark.sql import SparkSession
</code></pre>
<p>First import <em>SparkSession</em></p>
<pre><code>spark = SparkSession.builder\
                .appName(&apos;Analyzing soccer players&apos;)\
                .getOrCreate()
</code></pre>
<p>Create <em>SparkSession</em> instance</p>
<pre><code>players = spark.read\
                .format(&apos;csv&apos;)\
                .option(&apos;header&apos;, &apos;true&apos;)\
                .load(&apos;../datasets/player.csv&apos;)
</code></pre>
<p>Read in the data source into a DataFrame</p>
<pre><code>players.printSchema()
</code></pre>
<p>Look at the schema</p>
<pre><code>players.show(5)
</code></pre>
<p>Check out the first few records</p>
<pre><code>player_attributes = spark.read\
                        .format(&apos;csv&apos;)\
                        .option(&apos;header&apos;, &apos;true&apos;)\
                        .load(&apos;../datasets/Player_Attributes.csv&apos;)
</code></pre>
<p>Read in a second CSV data source as a DataFrame</p>
<pre><code>player_attributes.printSchema()
</code></pre>
<p>Again, look at the schema</p>
<pre><code>players.count(), player_attributes.count()
</code></pre>
<p>Let&apos;s view the total record count for each DataFrame</p>
<pre><code>player_attributes.select(&apos;player_api_id&apos;)\
                .distinct()\
                .count()
</code></pre>
<p>Notice that entities from one DataFrame have a many to one relationship with the records of the other data set</p>
<pre><code>players = players.drop(&apos;id&apos;, &apos;player_fifa_api_id&apos;)
players.columns
</code></pre>
<p>Get rid of unwanted data columns</p>
<pre><code>player_attributes = player_attributes.drop(
    &apos;id&apos;,
    &apos;player_fifa_api_id&apos;,
    &apos;preferred_foot&apos;,
    &apos;attacking_work_rate&apos;,
    &apos;defensive_work_rate&apos;,
    &apos;crossing&apos;,
    &apos;jumping&apos;,
    &apos;sprint_speed&apos;,
    &apos;balance&apos;,
    &apos;aggression&apos;,
    &apos;short_passing&apos;,
    &apos;potential&apos;
)
player_attributes.columns
</code></pre>
<p>Get rid of unwanted data columns</p>
<pre><code>player_attributes = player_attributes.dropna()
players = players.dropna()
</code></pre>
<p>Remove records with unavailable data</p>
<pre><code>players.count(), player_attributes.count()
</code></pre>
<p>Look at the new data count</p>
<p><strong>User defined functions</strong></p>
<pre><code>from pyspark.sql.functions import udf
</code></pre>
<p>Import the <strong>User defined functions</strong> library</p>
<pre><code>year_extract_udf = udf(lambda date: date.split(&apos;-&apos;)[0]) # (1)

player_attributes = player_attributes.withColumn( # (2)
    &apos;year&apos;,
    year_extract_udf(player_attributes.date)
)
</code></pre>
<ol>
<li>Create a UDF with a lambda function that operates on a date value and returns only the year</li>
<li>Create a new column for the year and extract the values from the data column using the UDF</li>
</ol>
<pre><code>player_attributes = player_attributes.drop(&apos;date&apos;)
</code></pre>
<p>Now we can drop the date column, as the year data has been copied to another column</p>
<pre><code>player_attributes.columns
</code></pre>
<p>View the new column list</p>
<h2 id="joiningdataframes">Joining DataFrames</h2>
<p>Spark DataFrames can be joined much like SQL tables can be joined. In this section we will join data to create a new DataFrame.</p>
<pre><code>pa_2016 = player_attributes.filter(player_attributes.year == 2016)
</code></pre>
<p>Create a new DataFrame from a subset of another DataFrame</p>
<pre><code>pa_2016.count()
</code></pre>
<p>View the count</p>
<pre><code>pa_2016.select(pa_2016.player_api_id)\
    .distinct()\
    .count()
</code></pre>
<p>Select only distinct values to make sure the unique ids match the DataFrame we want to join</p>
<pre><code>pa_striker_2016 = pa_2016.groupBy(&apos;player_api_id&apos;)\
                        .agg({
                            &apos;finishing&apos;:&apos;avg&apos;,
                            &apos;shot_power&apos;:&apos;avg&apos;,
                            &apos;acceleration&apos;:&apos;avg&apos;
                        })
</code></pre>
<p>Since one data set has many records associated with an entity, we will group the records by entity id first, then average the values of the columns we are interested in to create a one-to-one relationship</p>
<pre><code>pa_striker_2016.count()
</code></pre>
<p>Check that the two DataFrame counts match</p>
<pre><code>pa_striker_2016.show(5)
</code></pre>
<p>Take a quick look at the new aggregated data</p>
<pre><code>pa_striker_2016 = pa_striker_2016.withColumnRenamed(&apos;avg(finishing)&apos;, &apos;finishing&apos;)\
                                 .withColumnRenamed(&apos;avg(shot_power)&apos;, &apos;shot_power&apos;)\
                                 .withColumnRenamed(&apos;avg(acceleration)&apos;, &apos;acceleration&apos;)
</code></pre>
<p>Rename the columns for readability</p>
<pre><code>weight_finishing = 1
weight_shot_power = 2
weight_acceleration = 1

total_weight = weight_finishing + weight_shot_power + weight_acceleration
</code></pre>
<p>Let&apos;s create a weighted grading system to apply more value to some attributes</p>
<pre><code>strikers = pa_striker_2016.withColumn(&apos;striker_grade&apos;,
                                     (pa_striker_2016.finishing * weight_finishing + \
                                      pa_striker_2016.shot_power * weight_shot_power + \
                                      pa_striker_2016.acceleration * weight_acceleration) / total_weight)
</code></pre>
<p>Create a new column and apply the grading system to calculate each row&apos;s value</p>
<pre><code>strikers = strikers.drop(&apos;finishing&apos;,
                         &apos;acceleration&apos;,
                         &apos;shot_power&apos;)
</code></pre>
<p>Remove unneeded fields</p>
<pre><code>strikers = strikers.filter(strikers.striker_grade &gt; 70)\
                    .sort(strikers.striker_grade.desc())

strikers.show(10)
</code></pre>
<p>Drop lower grades from the dataset</p>
<pre><code>strikers.count(), players.count()
</code></pre>
<p>See how many entities we have left</p>
<pre><code>striker_details = players.join(strikers, players.player_api_id == strikers.player_api_id)
</code></pre>
<p>Now we can join the two DataFrames</p>
<pre><code>striker_details.columns
</code></pre>
<p>View the columns, and take note of the duplicated join fields</p>
<pre><code>striker_details.count()
</code></pre>
<p>Check that the count is in line with before</p>
<pre><code>striker_details = players.join(strikers, [&apos;player_api_id&apos;])
</code></pre>
<p>Alternate way to join</p>
<pre><code>striker_details.show(5)
</code></pre>
<p>View the data</p>
<pre><code>striker_details.columns
</code></pre>
<p>View the columns, and take note of the single join column</p>
<h2 id="savingdataframestocsvandjson">Saving DataFrames to CSV and JSON</h2>
<p>Saving to file is pretty straightforward</p>
<h3 id="csv">CSV</h3>
<pre><code>(striker_details.select(&quot;player_name&quot;, &quot;striker_grade&quot;)  # (1)
                .coalesce(1)  # (2)
                .write  # (3)
                .option(&apos;header&apos;, &apos;true&apos;)  # (4)
                .csv(&apos;striker_grade.csv&apos;))  # (5)
</code></pre>
<ol>
<li>Select the columns to export</li>
<li>how many files to break the data into</li>
<li>Begin the write command</li>
<li>Any options to apply</li>
<li>File format and file name</li>
</ol>
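<p>To sanity-check the export (a small sketch), the resulting folder can be read straight back into a DataFrame:</p>
<pre><code>spark.read.option(&apos;header&apos;, &apos;true&apos;).csv(&apos;striker_grade.csv&apos;).show(5)
</code></pre>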
<h3 id="json">JSON</h3>
<pre><code>striker_details.select(&quot;player_name&quot;, &quot;striker_grade&quot;)\
                .write\
                .json(&apos;striker_grade.json&apos;)
</code></pre>
<h2 id="goingfurtherwithjoins">Going Further with Joins</h2>
<p>Here we will cover other ways to join DataFrames</p>
<pre><code>valuesA = [(&apos;John&apos;, 100000), (&apos;James&apos;, 150000), (&apos;Emily&apos;, 65000), (&apos;Nina&apos;, 200000)]
tableA = spark.createDataFrame(valuesA, [&apos;name&apos;, &apos;salary&apos;])
</code></pre>
<p>Create a DataFrame from a list of tuples</p>
<pre><code>tableA.show()
</code></pre>
<p>View DataFrame</p>
<pre><code>valuesB = [(&apos;James&apos;, 2), (&apos;Emily&apos;,3), (&apos;Darth Vader&apos;, 5), (&apos;Princess Leia&apos;, 6)]

tableB = spark.createDataFrame(valuesB, [&apos;name&apos;, &apos;employee_id&apos;])
</code></pre>
<p>Create a second DataFrame</p>
<pre><code>tableB.show()
</code></pre>
<p>View DataFrame</p>
<pre><code>inner_join = tableA.join(tableB, tableA.name == tableB.name)
inner_join.show()
</code></pre>
<p>This is the behavior that we have seen previously</p>
<pre><code>left_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;left&apos;)
left_join.show()
</code></pre>
<p>Using the <code>how</code> parameter to explicitly declare the type of join</p>
<pre><code>right_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;right&apos;)
right_join.show()
</code></pre>
<p>Right outer join</p>
<pre><code>full_outer_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;full&apos;)
full_outer_join.show()
</code></pre>
<p>Full outer join</p>
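<p>Beyond these, Spark also supports semi and anti joins. For example, a left anti join keeps only the rows of <code>tableA</code> that have no match in <code>tableB</code> (a quick sketch):</p>
<pre><code>anti_join = tableA.join(tableB, tableA.name == tableB.name, how=&apos;left_anti&apos;)
anti_join.show()
</code></pre>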
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[PySpark - Using RDDs]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I will keep my data in a folder named &apos;datasets&apos;</p>
<p>If PySpark is not already loaded up, go ahead and start PySpark and create a new Jupyter notebook</p>
<p>View information about the SparkContext by inputting <code>sc</code></p>
<p>If we were running a cluster of nodes the output would be</p>]]></description><link>https://blog.virtual-artifact.com/pyspark-using-rdds/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c1</guid><category><![CDATA[Python]]></category><category><![CDATA[Spark 2]]></category><category><![CDATA[Pandas]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Tue, 14 Apr 2020 23:11:40 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>I will keep my data in a folder named &apos;datasets&apos;</p>
<p>If PySpark is not already loaded up, go ahead and start PySpark and create a new Jupyter notebook</p>
<p>View information about the SparkContext by inputting <code>sc</code></p>
<p>If we were running a cluster of nodes the output would be a bit more interesting. As we are running in standalone mode there is little output</p>
<p>Let&apos;s import a few things</p>
<pre><code>from pyspark.sql.types import Row # (1)
from datetime import datetime # (2)
</code></pre>
<ol>
<li><strong>Row</strong> is a Spark object that represents a single row of a DataFrame; we will see this shortly</li>
<li><strong>datetime</strong> is standard from Python</li>
</ol>
<pre><code>simple_data = sc.parallelize([1, &quot;Alice&quot;, 50])
simple_data
</code></pre>
<p><code>sc.parallelize(...)</code> converts data into an RDD</p>
<pre><code>simple_data.count()
</code></pre>
<p>Returns a number representing the number of entities in the RDD</p>
<pre><code>simple_data.first()
</code></pre>
<p>Access the first element in the RDD</p>
<p><code>.count()</code> and <code>.first()</code> are what are considered <em>Actions</em></p>
<pre><code>simple_data.take(2)
</code></pre>
<p><code>.take(...)</code> will return a subset of the RDD as a list</p>
<pre><code>simple_data.collect()
</code></pre>
<p><code>.collect()</code> will return all values in the RDD as a list</p>
<p>These have been some examples of <em>Actions</em>. Remember that when calling an action it will trigger all the transformations that occurred before it to execute. This can be a costly operation on large datasets so be careful when and where you use them</p>
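<p>A tiny illustration of that laziness, using throwaway data:</p>
<pre><code>doubled = sc.parallelize([1, 2, 3]).map(lambda x: x * 2)  # transformation only; nothing runs yet
doubled.collect()  # the action triggers the computation and returns [2, 4, 6]
</code></pre>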
<p>Up until now we have been using <strong>RDD</strong>s, and this is fine to do in Spark. However, Spark 2 offers the new <strong>DataFrame</strong> and we will use dataframes a lot more than RDDs. The takeaway is that Spark 2 still has access to the underlying RDD construct. Let&apos;s quickly try to create a dataframe from an RDD</p>
<pre><code>df = simple_data.toDF()
</code></pre>
<p><code>toDF()</code> will try to convert an RDD to a DataFrame</p>
<p>Here we get an error. The problem is that the data in the rows of the RDD is not structured: the data types are mixed, and DataFrames require rows with a consistent schema.</p>
<p>This RDD has no schema and contains elements of different types, so it cannot be converted to a DataFrame</p>
<h2 id="convertrddstodataframes">Convert RDDs to DataFrames</h2>
<pre><code>records = sc.parallelize([[1, &quot;Alice&quot;, 50], [2,&quot;Bob&quot;, 100]])
records
</code></pre>
<p>Here we create an RDD with structured data, two rows with matching data schemas</p>
<pre><code>records.collect()
</code></pre>
<p>Again, <code>collect()</code> returns all rows in the RDD</p>
<pre><code>records.count()
</code></pre>
<p>Again, returns the row count of the RDD</p>
<pre><code>records.first()
</code></pre>
<p>Again, returns the first record</p>
<pre><code>records.take(2)
</code></pre>
<pre><code>records.collect()
</code></pre>
<p>Because of the size of this RDD, the previous two methods have the same return values</p>
<pre><code>df = records.toDF()
</code></pre>
<p>This will return a Spark DataFrame for the RDD&apos;s values. Because the RDD&apos;s rows have the same number of columns and those columns have the same data type, Spark can create a DataFrame from this RDD&apos;s values</p>
<pre><code>df
</code></pre>
<p>We can see Spark infers the datatypes</p>
<pre><code>df.show()
</code></pre>
<p><code>.show()</code> allows for a quick view of the dataframe. Here it shows the first 20 rows by default</p>
<p>Take notice that the Column names have been automatically generated and assigned</p>
<p>If we want to specify the column names we must make use of the <strong>Row</strong> object imported earlier. Using a <strong>Row</strong> object to create an RDD will pass column name data</p>
<pre><code>data = sc.parallelize([Row(id=1,
                           name=&quot;Alice&quot;,
                           score=50)])
data
</code></pre>
<p>Now we can inspect the column names within the Row object</p>
<p><code>data.collect()</code></p>
<p><code>data.count()</code></p>
<p>Let&apos;s create a dataframe from this RDD</p>
<pre><code>df = data.toDF()
df.show()
</code></pre>
<p>We can now see the column names applied to the output.</p>
<p>Let&apos;s add some more data</p>
<pre><code>data = sc.parallelize([Row(id=1,
                           name=&quot;Alice&quot;,
                           score=50),
                      Row(id=2,
                           name=&quot;Bob&quot;,
                           score=100),
                      Row(id=3,
                           name=&quot;Charlee&quot;,
                           score=150)])
data
</code></pre>
<p>Now, convert to a dataframe and show</p>
<pre><code>df = data.toDF()
df.show()
</code></pre>
<p>And as before, since the data is structured Spark will infer the datatypes and effortlessly convert to a dataframe</p>
<h2 id="workingwithcomplexdata">Working with complex data</h2>
<pre><code>complex_data = sc.parallelize([Row(
                                col_float=1.44,
                                col_integer=10,
                                col_string=&quot;John&quot;)])
</code></pre>
<p>We create an RDD with one Row object. This row consists of float, integer, and string values</p>
<pre><code>complex_data_df = complex_data.toDF()
complex_data_df.show()
</code></pre>
<p>Convert the complex data to a dataframe</p>
<pre><code>complex_data = sc.parallelize([Row(
                                col_float=1.44,
                                col_integer=10,
                                col_string=&quot;John&quot;,
                                col_boolean=True,
                                col_list=[1,2,3])])
</code></pre>
<p>Now we see a good mixture of datatypes, take note of the list in the last column.</p>
<pre><code>complex_data_df = complex_data.toDF()
complex_data_df.show()
</code></pre>
<p>After converting to a dataframe, we can see from the table displayed by the <code>show()</code> method that the list type has been preserved</p>
<pre><code>complex_data = sc.parallelize([Row(
                                col_list=[1,2,3],
                                col_dic={&quot;k1&quot;: 0},
                                col_row=Row(a=10,b=20,c=30),
                                col_time=datetime(2014, 8, 1, 14, 1 ,5)
                              ),
                              Row(
                                col_list=[1,2,3,4,5],
                                col_dic={&quot;k1&quot;: 0, &quot;k2&quot;:1},
                                col_row=Row(a=40,b=50,c=60),
                                col_time=datetime(2014, 8, 1, 14, 1 ,6)
                              ),
                              Row(
                                col_list=[1,2,3,4,5,6,7],
                                col_dic={&quot;k1&quot;: 0,&quot;k2&quot;: 0,&quot;k3&quot;: 0},
                                col_row=Row(a=70,b=80,c=90),
                                col_time=datetime(2014, 8, 1, 14, 1 ,7)
                              )])
</code></pre>
<p>Here we can see several of the complex structures supported by dataframes in Spark: lists, dictionaries, nested rows, and timestamps</p>
<pre><code>complex_data_df = complex_data.toDF()
complex_data_df.show()
</code></pre>
<h2 id="sqlcontext">SQL Context</h2>
<p>We can use the sqlContext to run SQL queries on the Spark data</p>
<p><code>sqlContext = SQLContext(sc)</code></p>
<p><code>sqlContext</code></p>
<p>This wraps around the SparkContext to add SQL functionality</p>
<pre><code>df = sqlContext.range(5)
df
</code></pre>
<p><code>.range(5)</code> on the sqlContext object will return a single-column dataframe with five rows containing the integer values 0 through 4</p>
<p><code>df.count()</code></p>
<pre><code>data = [(&apos;Alice&apos;,50),
        (&apos;Bob&apos;,80),
        (&apos;Charlee&apos;, 75)]
</code></pre>
<p>Create a list of tuples and assign it to the <code>data</code> variable</p>
<p><code>sqlContext.createDataFrame(data).show()</code></p>
<p>Creates a dataframe from the list and displays the data. Note the column names have been automatically generated</p>
<pre><code>sqlContext.createDataFrame(data, [&apos;Name&apos;, &apos;Score&apos;]).show()
</code></pre>
<p>The same operation, but with the column names specified</p>
<pre><code>complex_data = [
                (1.0,
                10,
                &quot;Alice&quot;,
                True,
                [1,2,3],
                {&quot;k1&quot;:0},
                Row(a=1,b=2,c=3),
                datetime(2014, 8,1,14,1,5)),
    
                (2.0,
                20,
                &quot;Bob&quot;,
                True,
                [1,2,3,4,5],
                {&quot;k1&quot;:0,&quot;k2&quot;:1},
                Row(a=1,b=2,c=3),
                datetime(2014, 8,1,14,1,5)),

                (3.0,
                30,
                &quot;Charlee&quot;,
                False,
                [1,2,3,4,5,6,7],
                {&quot;k1&quot;:0,&quot;k2&quot;:1,&quot;k3&quot;:2},
                Row(a=1,b=2,c=3),
                datetime(2014, 8,1,14,1,5))    
               ]
</code></pre>
<p>List of complex data</p>
<p><code>sqlContext.createDataFrame(complex_data).show()</code></p>
<p>Convert to dataframe and display</p>
<pre><code>complex_data_df = sqlContext.createDataFrame(complex_data,[
        &apos;col_integer&apos;,
        &apos;col_float&apos;,
        &apos;col_string&apos;,
        &apos;col_boolean&apos;,
        &apos;col_list&apos;,
        &apos;col_dictionary&apos;,
        &apos;col_row&apos;,
        &apos;col_date_time&apos;]
)
complex_data_df.show()
</code></pre>
<p>Convert to dataframe with column name and display</p>
<pre><code>data = sc.parallelize([
    Row(1,&apos;Alice&apos;,50),
    Row(2,&apos;Bob&apos;,100),
    Row(3,&apos;Charlee&apos;,150)
])
</code></pre>
<p>Create an RDD with some Row objects, but with no column name specification for the Row object</p>
<pre><code>column_names = Row(&apos;id&apos;,&apos;name&apos;,&apos;score&apos;)
students = data.map(lambda r: column_names(*r))
</code></pre>
<p>We can apply column name to an RDD after it has been created by using the <code>.map(...)</code> function.</p>
<p><code>students</code></p>
<p>This returns a new RDD</p>
<p>Note: The map() operation performs a transformation on every element in the RDD</p>
<p><code>students.collect()</code></p>
<p>We see that the column names have been assigned to all the records</p>
<pre><code>students_df = sqlContext.createDataFrame(students)
students_df
</code></pre>
<p>Use the SQLContext to create a dataframe from the students RDD</p>
<p><code>students_df.show()</code></p>
<p>Notice the dataframe has recognized all the column names properly</p>
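<p>Although this walkthrough sticks to the DataFrame API, the sqlContext can also run SQL text once a dataframe is registered as a temporary view. A minimal sketch (the view name <code>students</code> is just an example):</p>
<pre><code>students_df.createOrReplaceTempView(&apos;students&apos;)

sqlContext.sql(&apos;SELECT name, score FROM students WHERE score &gt; 75&apos;).show()
</code></pre>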
<h2 id="accessingrddsfromdataframes">Accessing RDDs from DataFrames</h2>
<p>Looking back to the complex data we created earlier</p>
<pre><code>complex_data_df.first()
</code></pre>
<p>This data consists of primitive data types as well as complex data types</p>
<pre><code>complex_data_df.take(2)
</code></pre>
<p>Dataframes are in tabular format and can be accessed using matrix notation</p>
<pre><code>cell_string = complex_data_df.collect()[0][2]
cell_string
</code></pre>
<p>another example</p>
<pre><code>cell_list = complex_data_df.collect()[0][4]
cell_list
</code></pre>
<p>Modify the list</p>
<pre><code>cell_list.append(100)
cell_list
</code></pre>
<pre><code>complex_data_df.show()
</code></pre>
<p>Take note that the original data is unaltered. This is because <code>collect()</code> returns a separate copy of the data, so modifying the returned list does not change the dataframe</p>
<pre><code>complex_data_df.rdd\
                .map(lambda x: (x.col_string, x.col_dictionary))\
                .collect()
</code></pre>
<p>Extract specific columns by converting the DataFrame to an RDD</p>
<pre><code>complex_data_df.select(
    &apos;col_string&apos;,
    &apos;col_list&apos;,
    &apos;col_date_time&apos;
).show()
</code></pre>
<p><code>.select(...)</code> will return only the specified column names</p>
<pre><code>complex_data_df.rdd\
                .map(lambda x: (x.col_string + &quot; Boo&quot;))\
                .collect()
</code></pre>
<p>A <code>map()</code> operation which appends &quot; Boo&quot; to every string in the column</p>
<p>Dataframes do not support the <code>.map(...)</code> function directly, which is why we first drop down to the underlying RDD with <code>.rdd</code></p>
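<p>For comparison, roughly the same result can be produced without leaving the DataFrame API by using the built-in column functions (a sketch, assuming <code>pyspark.sql.functions</code> is available):</p>
<pre><code>from pyspark.sql.functions import concat, lit

# append &quot; Boo&quot; to every value of col_string using column expressions
complex_data_df.select(concat(complex_data_df.col_string, lit(&quot; Boo&quot;))).show()
</code></pre>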
<pre><code>complex_data_df.select(
    &apos;col_integer&apos;,
    &apos;col_float&apos;
    )\
    .withColumn(
    &apos;col_sum&apos;,
    complex_data_df.col_integer + complex_data_df.col_float
    )\
    .show()
</code></pre>
<p>To perform a calculation with column values we need to use the <code>.withColumn(...)</code> method, in two steps:</p>
<ol>
<li>Select the column to use for calculation</li>
<li>Create a new column with the resulting values</li>
</ol>
<pre><code>complex_data_df.select(&apos;col_boolean&apos;)\
                .withColumn(
                    &apos;col_opposite&apos;,
                    complex_data_df.col_boolean == False)\
                .show()
</code></pre>
<p>Here is another example of <code>.withColumn(...)</code> that inverts the value of booleans in a column</p>
<pre><code>complex_data_df.withColumnRenamed(&apos;col_dictionary&apos;,&apos;col_map&apos;).show()
</code></pre>
<p>This example renames the column</p>
<pre><code>complex_data_df.select(complex_data_df.col_string.alias(&apos;Name&apos;)).show()
</code></pre>
<p>This will select and rename a column</p>
<h2 id="sparkdataframesandpandas">Spark DataFrames and Pandas</h2>
<p>Pandas and Spark DataFrames are interoperable</p>
<pre><code>import pandas
</code></pre>
<p>Import the pandas library, do not forget to <code>pip install</code> if needed</p>
<pre><code>df_pandas = complex_data_df.toPandas()
df_pandas
</code></pre>
<p>Converting a Spark DataFrame to a Pandas DataFrame is done using <code>toPandas()</code></p>
<p>Remember that Spark DataFrames are built on top of RDDs and stored across multiple nodes. Conversely, Pandas dataframes are stored in the memory of the single machine they are running on.</p>
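<p>Because <code>toPandas()</code> pulls the whole distributed dataset back to the driver, it can be wise to limit or sample a large dataframe first. A small sketch:</p>
<pre><code># only bring a couple of rows back to the driver before converting
small_pandas_df = complex_data_df.limit(2).toPandas()
small_pandas_df
</code></pre>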
<pre><code>df_spark = sqlContext.createDataFrame(df_pandas)
df_spark.show()
</code></pre>
<p>On the flip side the <code>.createDataFrame(...)</code> will convert a Pandas DataFrame to a Spark DataFrame</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Spark 2 setup]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Demo will cover:</p>
<ul>
<li>Install standalone Spark on your local machine</li>
<li>Set up the PySpark REPL interface</li>
</ul>
<p>Req&apos;s</p>
<ul>
<li>This demo will be done with Python 3</li>
<li>Java v8</li>
<li>jupyter notebooks</li>
</ul>
<p>Download Spark 2 from <a href="https://spark.apache.org/downloads.html">https://spark.apache.org/downloads.html</a></p>
<ol>
<li>Choose the most recent 2.x build</li>
<li>Choose package:</li></ol>]]></description><link>https://blog.virtual-artifact.com/spark-2-setup/</link><guid isPermaLink="false">62e185bb08b3cd09cac935c0</guid><category><![CDATA[Getting Started]]></category><category><![CDATA[Spark 2]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Mon, 13 Apr 2020 14:15:15 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Demo will cover:</p>
<ul>
<li>Install standalone Spark on your local machine</li>
<li>Set up the PySpark REPL interface</li>
</ul>
<p>Req&apos;s</p>
<ul>
<li>This demo will be done with Python 3</li>
<li>Java v8</li>
<li>jupyter notebooks</li>
</ul>
<p>Download Spark 2 from <a href="https://spark.apache.org/downloads.html">https://spark.apache.org/downloads.html</a></p>
<ol>
<li>Choose the most recent 2.x build</li>
<li>Choose package: Pre-built for Apache Hadoop 2.7 and later. (Standalone installation does not require you to have Hadoop installed)</li>
<li>Download generated link</li>
<li>Move the downloaded file to a suitable location on your hard drive</li>
<li>Run this command to unpack the download: <code>sudo tar -xvzf &lt;path to spark binary file download&gt;</code></li>
<li>Run <code>ls</code> to confirm the files have unpacked</li>
</ol>
<p>Standalone Spark requires some environment variables to be set.</p>
<p>Open bash profile, <code>nano ~/.bash_profile</code> and set the following:</p>
<pre><code>export SPARK_HOME=&quot;/path/to/spark2/folder&quot;
export PATH=&quot;$SPARK_HOME/bin:$PATH&quot;
</code></pre>
<p>Note: if you don&apos;t have <code>JAVA_HOME</code> set, this will need to be done as well.</p>
<pre><code>export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
</code></pre>
<p>Exit nano and reload the bash profile</p>
<pre><code>source ~/.bash_profile
</code></pre>
<p>If not installed, install PySpark:</p>
<pre><code>pyspark --version

# if not installed
pip install pyspark
</code></pre>
<p>With Pyspark installed we can create a spark shell</p>
<pre><code>pyspark
</code></pre>
<p>Once the shell starts up we can get access to the Spark Context by typing <code>sc</code> and exiting the shell with <code>exit()</code></p>
<pre><code>&gt;&gt;&gt; sc
&lt;SparkContext master=local[*] appName=PySparkShell&gt;
&gt;&gt;&gt; exit()
</code></pre>
<p>Instead of interacting with the PySpark shell directly, we can set up Jupyter notebooks to launch when we start up Spark 2</p>
<p>We will need to declare more environment variables</p>
<p>Open bash profile again, <code>nano ~/.bash_profile</code></p>
<p>add the following</p>
<pre><code>export PYSPARK_SUBMIT_ARGS=&quot;pyspark-shell&quot;
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=&apos;notebook&apos;
</code></pre>
<p>Save and close the file. Then, reload the bash profile<br>
<code>source ~/.bash_profile</code></p>
<p>Now, when you run <code>pyspark</code> a jupyter notebook server will start</p>
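<p>Once the notebook is open, a quick sanity check in the first cell might look like this (a sketch; the imports are the ones used by the DataFrame examples elsewhere in this series):</p>
<pre><code># the SparkContext should already be available in the notebook
sc

# imports used by the DataFrame examples
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
</code></pre>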
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AWS - Creating a simple VPC and EC2 instance]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will step through the process of quickly creating a VPC and attaching an EC2 instance</p>
<h2 id="creatingavpc">Creating a VPC</h2>
<p>Rescource by region will automaticlly have a default vpc, subnets, route tables and more created for you. One default for every Region. We can use the default VPC and</p>]]></description><link>https://blog.virtual-artifact.com/aws-create-a-simple-vpc-and-ec2/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bf</guid><category><![CDATA[AWS]]></category><category><![CDATA[EC2]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Thu, 09 Apr 2020 14:50:57 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will step through the process of quickly creating a VPC and attaching an EC2 instance</p>
<h2 id="creatingavpc">Creating a VPC</h2>
<p>Each region automatically has a default VPC created for you, along with subnets, route tables, and more: one default per region. We could use the default VPC and reconfigure it to meet our needs. However, we will make one from scratch because that is more fun</p>
<ol>
<li>Log into AWS and select the VPC service</li>
<li>Choose your region</li>
<li>Launch VPC wizard</li>
<li><strong>Step 1: Select a VPC Configuration</strong> - VPC with a Single Public Subnet</li>
<li><strong>Step 2: VPC with a Single Public Subnet</strong>
<ol>
<li>IPv4 CIDR block: 10.0.0.0/16</li>
<li>IPv6 CIDR block: No IPv6 CIDR block</li>
<li>VPC name: name of your VPC</li>
<li>Public subnet&apos;s IPv4 CIDR: 10.0.0.0/24</li>
<li>Availability Zone: Select first availability zone</li>
<li>Subnet name: public-subnet-a</li>
<li>The rest of the options can stay as they are</li>
<li>Click create VPC</li>
</ol>
</li>
</ol>
<p>This will take you to a page confirming the VPC was successfully created.</p>
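<p>For reference, a roughly equivalent setup can be sketched with the AWS CLI (the VPC ID and availability zone below are placeholders, and the console wizard also handles the internet gateway and routing for you):</p>
<pre><code># create the VPC with the same CIDR block used in the wizard
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# create the public subnet inside it (substitute the VPC ID returned above)
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.0.0/24 --availability-zone us-east-1a
</code></pre>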
<h2 id="creatinganec2instance">Creating an EC2 Instance</h2>
<p>Now it&apos;s time to create an EC2 instance; we will create an Amazon Linux instance</p>
<ol>
<li>Select EC2 from the AWS web console</li>
<li>Click &apos;Launch Instance&apos;</li>
<li>Step 1: Select Amazon Linux 2</li>
<li>Step 2: Select &apos;t2.micro&apos; instance type, click continue</li>
<li>Step 3:
<ul>
<li><strong>Num. of instances</strong> : 1</li>
<li><strong>Network</strong>: your new vpc</li>
<li><strong>Subnet</strong>: your new subnet</li>
<li><strong>Auto-assign Public IP</strong>: enable</li>
<li>The rest of the settings can stay at default, click next</li>
</ul>
</li>
</ol>
<ul>
<li>Step 4: The wizard automatically configures EBS for us and we can move on, click next</li>
<li>Step 5: click &apos;add another tag&apos; with key/value Name/demo-app. This will help to identify it amongst several instances.</li>
<li>Step 6:
<ul>
<li><strong>Assign a security group</strong>: Create a new security group</li>
<li><strong>Security group name</strong>: demo-ec2-sg</li>
<li><strong>Description</strong>: Security group for awesome demo instances</li>
</ul>
</li>
</ul>
<p>At this point we have one rule for SSH already created. This is good, but we will need to add another rule.</p>
<ol>
<li>Click &apos;Add Rule&apos;</li>
<li>Type: Custom TCP</li>
<li>Port Range: 8000</li>
<li>Source: Anywhere</li>
<li>Click Review and Launch</li>
<li>Click Launch</li>
</ol>
<p><strong>Very Important</strong></p>
<p>This last step is critical. In the pop-up window you will have the option to select an existing key pair or create a new key pair. Select &apos;Create a new key pair&apos; and give it the name demo-app-keys. Then click &apos;Download Key Pair&apos; and save the file somewhere you will not lose or delete it. Finally, click &apos;Launch Instance&apos;</p>
<h2 id="connectinganddeployingtoanec2instance">Connecting and Deploying to an EC2 Instance</h2>
<p>To complete this section you will want Python 3 installed on your development machine</p>
<p>To test out the EC2 instance we are going to create a very simple HTTP server that will deliver a simple HTML page</p>
<h3 id="creatinghtmlandpythonscripts">Creating HTML and Python scripts</h3>
<p>Create the following HTML page</p>
<p><em>index.html</em></p>
<pre><code>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;title&gt;Super Awesome Web Page 5000&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h3&gt;Hello World!&lt;/h3&gt;
&lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>Create the following python script:</p>
<p><em>http_runner.py</em></p>
<pre><code>import http.server
import socketserver

PORT = 8000
Handler = http.server.SimpleHTTPRequestHandler

with socketserver.TCPServer((&quot;&quot;, PORT), Handler) as httpd:
    print(&quot;serving at port&quot;, PORT)
    httpd.serve_forever()
</code></pre>
<p>Start the server <code>python3 http_runner.py</code></p>
<p>Confirm by opening a browser and navigating to localhost:8000</p>
<h3 id="configureec2">Configure EC2</h3>
<ol>
<li>Go to the EC2 Dashboard and click &apos;Running instances&apos;</li>
<li>Select the new instance</li>
<li>In the &apos;Description&apos; tab take note of:
<ul>
<li>Public IP/ Private IP</li>
<li>Key pair</li>
<li>Availability zone</li>
</ul>
</li>
<li>Modify key pair file <code>chmod  400 ~/path/to/your.pem</code></li>
<li><code>ssh -i &lt;path to pem file&gt; ec2-user@&lt;instance ip address&gt;</code></li>
<li>Type <code>yes</code> when prompted about the authenticity of the host</li>
<li>You are now logged into the new machine</li>
</ol>
<p>Now that we are logged in to the new instance we should always update the system first</p>
<p><code>sudo yum update</code></p>
<p>Type <code>Y</code> to initiate the operation</p>
<p>Install Python</p>
<p><code>yum list installed | grep -i python3</code></p>
<p><code>sudo yum install python3 -y</code></p>
<p><code>mkdir simple_http_app</code></p>
<p><code>python3 -m venv simple_http_app/</code></p>
<p><code>source ~/simple_http_app/bin/activate</code></p>
<p><code>pip install pip --upgrade</code></p>
<h3 id="filetransfer">File Transfer</h3>
<p>Time to move our local files to the ec2 instance</p>
<ol>
<li>Exit the ec2 console by typing <code>exit</code>, returning you to your local system prompt</li>
<li><code>scp -r -i &lt;pem_file&gt; &lt;local_code&gt; ec2-user@&lt;ec2_ip&gt;:/home/ec2-user/simple_http_app</code> (see the consolidated sketch after this list)</li>
<li>Log back into the instance</li>
<li>Run python script and verify by using the browser to view the html page</li>
</ol>
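<p>A consolidated sketch of the transfer-and-run steps, reusing the file names from earlier in this article and assuming the local files live in a <code>./simple_http_app</code> directory (adjust the paths and the instance IP for your setup):</p>
<pre><code># from your local machine: copy the project to the instance
scp -r -i demo-app-keys.pem ./simple_http_app ec2-user@&lt;ec2_ip&gt;:/home/ec2-user/simple_http_app

# log back in and start the server
ssh -i demo-app-keys.pem ec2-user@&lt;ec2_ip&gt;
cd simple_http_app
source bin/activate
python3 http_runner.py
</code></pre>
<p>With the security group rule from earlier allowing port 8000, the page should then be reachable at <code>http://&lt;instance public ip&gt;:8000</code></p>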
<p>Very Important: This server is not for production use. This is just a simple http service for development and testing</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Pandas and MySql with a hit of AWS RDS]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will look at connecting Python to a MySQL database. With the help of some SQL connection tools, transferring data between Python and MySQL will be simplified. Finally, we will migrate and run the database on a cloud platform</p>
<p><strong>prerequisites</strong></p>
<ul>
<li>database with some data to work on</li>
<li>aws developer</li></ul>]]></description><link>https://blog.virtual-artifact.com/pandas-and-mysql-with-a-hit-of-aws-rds/</link><guid isPermaLink="false">62e185bb08b3cd09cac935be</guid><category><![CDATA[Python]]></category><category><![CDATA[AWS]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[Pandas]]></category><category><![CDATA[RDS]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 01 Apr 2020 19:41:50 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>This article will look at connecting Python to a MySQL database. With the help of some SQL connection tools, transferring data between Python and MySQL will be simplified. Finally, we will migrate and run the database on a cloud platform</p>
<p><strong>prerequisites</strong></p>
<ul>
<li>database with some data to work on</li>
<li>aws developer account</li>
<li>python (demo is python3)</li>
</ul>
<h2 id="createmysqldbonawsrds">Create MySQL Db on AWS RDS</h2>
<ol>
<li>Log in to AWS (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>)</li>
<li>In the <strong>Resources</strong> section, click <strong>DB Instances</strong></li>
<li>In the top right you will find a <strong>Create database</strong> button, click this</li>
<li>The &quot;Create database&quot; page will take you through setting up the database. Here are the settings we will use
<ul>
<li><strong>Choose a database creation method</strong> - Standard</li>
<li><strong>Engine options</strong>
<ul>
<li>MySQL</li>
<li>Version - Use the version closest to your local version</li>
</ul>
</li>
<li><strong>Template</strong> - Free Tier</li>
<li><strong>Settings</strong>
<ul>
<li>DB instance identifier - give the database a name</li>
<li>Master username - create name</li>
<li>Master password - create password</li>
</ul>
</li>
<li><strong>DB instance size</strong> - leave default settings</li>
<li><strong>Storage</strong> - leave default settings</li>
<li><strong>Availability &amp; durability</strong> - Do not create a standby instance</li>
<li><strong>Connectivity</strong> - change to publicly available</li>
<li><strong>Database authentication</strong> - Password authentication</li>
<li><strong>Additional Configurations</strong> - leave defaults</li>
</ul>
</li>
</ol>
<p>It will take a few minutes to get the database up and running</p>
<p>meanwhile...</p>
<h2 id="setupvirtualenvironment">Set up virtual environment</h2>
<ol>
<li>Create a directory for the files <code>mkdir sample-mysql-rds</code> and then <code>cd</code> into the newly created directory.</li>
<li>Initialize the virtual environment <code>python3 -m venv ./</code></li>
<li>Start up the virtual environment <code>source ./bin/activate</code></li>
</ol>
<h2 id="importmodules">import modules</h2>
<p>With the virtual environment set up and running, we can turn our attention to the modules needed for this demo</p>
<ul>
<li>modules
<ul>
<li>pandas</li>
<li>sqlalchemy</li>
<li>PyMySQL</li>
<li>boto3(optional)</li>
</ul>
</li>
</ul>
<h3 id="pandas">Pandas</h3>
<p>pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.</p>
<p>pandas can be installed via pip from PyPI.</p>
<p><code>pip3 install pandas</code></p>
<h3 id="sqlalchemy">SqlAlchemy</h3>
<p>SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. SQLAlchemy provides a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.</p>
<p><code>pip3 install SQLAlchemy</code></p>
<h3 id="pymysql">PyMySQL</h3>
<p>This package contains a pure-Python MySQL client library</p>
<p><code>pip3 install PyMySQL</code></p>
<h3 id="boto3optional">Boto3 (Optional)</h3>
<p>Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.</p>
<p><code>pip3 install boto3</code></p>
<h2 id="examinedatabase">Examine Database</h2>
<p>Now that we are finished with the Python dependencies, let&apos;s take a moment to review the data we plan to import into the Python script. Here, I will be using data about the English Premier League.</p>
<p>You can either use a DBMS IDE like DataGrip or MySQL Workbench, or just interface with your database from the command line. Make sure the data exists and is available. We will connect to the local database first to make sure everything is functioning properly, and then we will deploy the database to AWS RDS</p>
<h2 id="pythonandmysql">Python and MySQl</h2>
<p>Now that we have the data store in place and the right modules available to us, it&apos;s time to put rubber to the road!</p>
<h3 id="importdatafrommysql">import data from mysql</h3>
<p>Create a new Python script</p>
<p><em>app.py</em></p>
<pre><code># (1)
import pandas as pd 
import sqlalchemy
from sqlalchemy import Table, Column, Integer, String, MetaData

# (2)
engine = sqlalchemy.create_engine(&apos;mysql+pymysql://username:password@localhost/demo_epl_1819&apos;)

</code></pre>
<ol>
<li>First import the <em>pandas</em> and <em>sqlalchemy</em> libraries</li>
<li>Using <code>create_engine()</code> creates an Engine object that can be used to bridge python and a relational database. Let&apos;s look at the string parameter that is passed</li>
</ol>
<p><code>&apos;mysql+pymysql://username:password@localhost/demo_epl_1819&apos;</code></p>
<ul>
<li>mysql = database type</li>
<li>pymysql = the DBAPI driver the engine uses to talk to MySQL</li>
<li>username:password = username and password</li>
<li>@localhost/demo_epl_1819 = database url</li>
</ul>
<h4 id="table">table</h4>
<p>Here we will look at accessing a MySQL table and reading the data into a dataframe.</p>
<p><em>app.py</em></p>
<pre><code>df = pd.read_sql_table(&apos;match_results&apos;, engine)
print(df.head())
print(type(df))
</code></pre>
<p>The <code>read_sql_table()</code> method returns all the records of a MySQL table by passing the table name and the Engine object created earlier as arguments. Pandas will read all the SQL table data into a dataframe</p>
<h4 id="query">query</h4>
<p>Queries can also be written for more customized results</p>
<pre><code># create dataframe from sql query result
query_1 = &apos;SELECT  HomeTeam, AwayTeam, FTR FROM match_results;&apos;
df_query = pd.read_sql_query(query_1, engine)
print(df_query.head())
print(type(df))

</code></pre>
<p><code>read_sql_query()</code> takes a sql statement as a string and an Engine object. This will execute a query on the database and return any values as a pandas dataframe</p>
<h3 id="writetocsvandsavetos3optional">Write to csv and save to s3 (Optional)</h3>
<p>This section will lean on <a href="http://creating-a-aws-s3-service-with-python">another article</a> we did, where we created a storage service to write and read objects from S3 buckets.</p>
<ol>
<li>Create a directory named <em>services</em></li>
<li>Within services create a Python file name <em>storage_service.py</em></li>
</ol>
<p><em>storage_service.py</em></p>
<p><em>Note:</em> you will need to have your AWS credentials available for the boto3 service to work. If you need help, reference <a href="http://creating-a-aws-s3-service-with-python">this article</a></p>
<pre><code>import boto3


class StorageService:

    def __init__(self, storage_location):
        self.client = boto3.client(&apos;s3&apos;)
        self.bucket_name = storage_location

    def upload_file(self, file_name, object_name=None):
        if object_name is None:
            object_name = file_name

        response = self.client.upload_file(file_name, self.bucket_name, object_name)

        return response

    def download_object(self, object_name, file_name=None):
        if file_name is None:
            file_name = object_name

        print(file_name + &quot; is the file name&quot;)

        response = self.client.download_file(self.bucket_name, object_name, file_name)

        return response

    def list_all_objects(self):
        objects = self.client.list_objects(Bucket=self.bucket_name)

        if &quot;Contents&quot; in objects:
            response = objects[&quot;Contents&quot;]
        else:
            response = {}

        return response

    def delete_object(self, object_name):
        response = self.client.delete_object(Bucket=self.bucket_name, Key=object_name)

        return response


</code></pre>
<p>We will not go into detail about this code. If you are curious, check out the other demo for more on this service.</p>
<h4 id="passdatatos3">Pass data to S3</h4>
<p><em>app.py</em></p>
<pre><code>from services.storage_service import StorageService # (1)

storage_service = StorageService(&quot;your.first.boto.s3.bucket&quot;) # (2)
df_query.to_csv(&quot;output.csv&quot;, index=False) # (3)
storage_service.upload_file(&quot;output.csv&quot;) # (4)
</code></pre>
<ol>
<li>Import the storage service into the app</li>
<li>Instantiate a StorageService object</li>
<li>Create a csv file from the dataset</li>
<li>Use storage service to upload csv to S3 bucket</li>
</ol>
<h3 id="writetoanothertable">write to another table</h3>
<p>Here is an example of using sqlalchemy to create a table. Then, use pandas with sqlalchemy to write dataframe contents to the new SQL table</p>
<pre><code>meta = MetaData() 				# (1)

results_table = Table( 			# (2)
   &apos;simple_result&apos;, meta,
   Column(&apos;id&apos;, Integer, primary_key=True, autoincrement=True),
   Column(&apos;HomeTeam&apos;, String(25)),
   Column(&apos;AwayTeam&apos;, String(25)),
   Column(&apos;FTR&apos;, String(1))
)
meta.create_all(engine)			# (3)

# (4)
df_query.to_sql(name=&apos;simple_result&apos;, con=engine, index=False, if_exists=&apos;append&apos;)
</code></pre>
<ol>
<li><code>MetaData()</code> is a container object that keeps together many different features of a database (or multiple databases) being described.</li>
<li>Create a Table object that represents the table to be created</li>
<li><code>.create_all()</code> will cause the <code>MetaData()</code> instance to create any tables associated with it</li>
<li>Pandas dataframes have a <code>to_sql()</code> method that writes the records stored in a DataFrame to a SQL database
<ul>
<li>name - SQL table name</li>
<li>con - SQLAlchemy Engine connection</li>
<li>index - whether to write the dataframe index as a column in the table</li>
<li>if_exists - behaviour when the table already exists</li>
</ul>
</li>
</ol>
<h2 id="migratetocloud">Migrate to Cloud</h2>
<ol>
<li>Export data from MySQL to file format of your choice</li>
<li>Go back to the AWS RDS web console and click the MySQL database you created earlier. This will take you to a details page</li>
<li>In the section titled &apos;Connectivity &amp; security&apos; make note of two things here
<ol>
<li>Endpoint value</li>
<li>Port number</li>
</ol>
</li>
<li>Check that the security group is open to all traffic</li>
<li>Use terminal or command line to log into the new database. <code>mysql --port=3306 --host=&lt;&lt;endpoint&gt;&gt; --user=&lt;&lt;username&gt;&gt; --password</code></li>
<li>create a new database</li>
<li>Import the SQL exported earlier. Log out of MySQL and upload the <em>.sql</em> file with the following: <code>mysql -h &lt;&lt;endpoint&gt;&gt; -u &lt;&lt;username&gt;&gt; -p --port=3306 &lt;&lt;database name&gt;&gt; &lt; &lt;&lt;path to sql file&gt;&gt;</code></li>
<li>Check the database loaded correctly</li>
<li>Back in <em>app.py</em> change the database url to point to the RDS DB instance endpoint <code>mysql+pymysql://&lt;&lt;user name&gt;&gt;:&lt;&lt;password&gt;&gt;@&lt;&lt;db endpoint&gt;&gt;:&lt;&lt;port number&gt;&gt;/&lt;&lt;dbname&gt;&gt;</code></li>
<li>Run python script</li>
</ol>
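<p>For step 9, the updated engine creation in <em>app.py</em> might look roughly like this (the credentials, endpoint, and database name are placeholders for your own values):</p>
<pre><code>engine = sqlalchemy.create_engine(
    &apos;mysql+pymysql://&lt;username&gt;:&lt;password&gt;@&lt;db-endpoint&gt;:3306/demo_epl_1819&apos;
)
</code></pre>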
<h2 id="teardown">Tear Down</h2>
<p>Remember to spin down any services you do not wish to incur charges on</p>
<h2 id="conclusion">Conclusion</h2>
<p>Using the SQLAlchemy library with Pandas allowed easy access to our local and remote databases in the form of dataframes. We then looked at converting a dataframe to CSV format and saving that data in S3, leveraging a storage service built in a previous article. Finally, we migrated the MySQL database to AWS RDS and updated our application to connect to the remote database</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Creating a AWS S3 service with Python]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>In a previous module we use the boto3 library to connect our python script with an AWS service. Here, we are going continue down this path by taking a look at some operations we can perform using AWS S3 and python. We will create a storage service with python</p>]]></description><link>https://blog.virtual-artifact.com/creating-a-aws-s3-service-with-python/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bd</guid><category><![CDATA[AWS]]></category><category><![CDATA[Python]]></category><category><![CDATA[S3]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 25 Mar 2020 15:33:44 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>In a previous module we used the boto3 library to connect our Python script with an AWS service. Here, we are going to continue down this path by taking a look at some operations we can perform using AWS S3 and Python. We will create a storage service with Python that interfaces with S3, giving us a chance to use some common operations.</p>
<h2 id="initiatetheenvironment">Initiate the Environment</h2>
<p>We will use a virtual environment to allow us to develop our code in an isolated environment. Developing in a virtual environment allows us to manage our projects independently, making them more portable and less coupled to our development machine.</p>
<ol>
<li>
<p>To create a new environment named sample-env execute: <code>$ python3 -m venv ~/python-envs/sample-boto-s3</code></p>
</li>
<li>
<p>To activate the environment execute: <code>$ source ~/python-envs/sample-boto-s3/bin/activate</code></p>
</li>
<li>
<p>Install the Boto3 package using: <code>$ pip3 install boto3</code></p>
</li>
</ol>
<h2 id="settinguptheproject">Setting up the project</h2>
<p>By setting up access to S3 as a service, we encapsulate the code needed to interact with S3. This in turn makes the service more reusable and portable for other projects.</p>
<ol>
<li>Open up your favorite Python IDE and let&apos;s get to the good stuff...code!</li>
<li>Create a new Python script file called <em>storage_service.py</em> and save it to the <em>sample-boto-s3</em> folder created for the virtual environment previously.</li>
</ol>
<h2 id="createtheserivice">Create the Serivice</h2>
<p>Open <code>storage_service.py</code> and enter the following:</p>
<pre><code>import boto3  # (1)


class StorageService:

    def __init__(self, storage_location):  # (2)
        self.client = boto3.client(&apos;s3&apos;)  # (3)
        self.bucket_name = storage_location  # (4)
</code></pre>
<ol>
<li>In order to leverage the boto3 library we must import it first</li>
<li>Passing the storage location at construction time decouples the storage URL from the class code. When a <em>StorageService</em> object is created the bucket location is passed into the constructor, allowing multiple storage services to exist and represent different storage locations</li>
<li>Using <code>boto3.client(&apos;s3&apos;)</code> returns a S3 client object that can be used to interact with the S3 cloud service</li>
<li>Assign the bucket name to an object variable. This will be used later to configure the S3 client</li>
</ol>
<p>The <code>boto3</code> object becomes available because of the import statement at the top of the file. This object will produce a client object that can interface with the service passed as an argument, in this case <code>&apos;s3&apos;</code></p>
<h3 id="uploadingfiles">Uploading Files</h3>
<p>The first behaviour to add to this service is the ability to upload a local file to the specified S3 bucket.</p>
<pre><code>def upload_file(self, file_name, object_name=None):  # (1)
    if object_name is None:  # (2)
        object_name = file_name

    response = self.client.upload_file(file_name, self.bucket_name, object_name)  # (3)

    return response  # (4)
</code></pre>
<ol>
<li>A method that takes a <code>file_name</code> parameter that represents the local location of the file to be uploaded to S3. <code>object_name</code> is an optional parameter to rename the file on S3</li>
<li>If no value is given for renaming the file, the current file name will be used</li>
<li>The <code>upload_file()</code> method is called on the S3 client and the proper values are passed.
<ul>
<li>Param 1 - path to local file</li>
<li>Param 2 - name of the bucket to upload to</li>
<li>Param 3 - name of object on S3</li>
</ul>
</li>
<li>Return the response sent from the API call (if any)</li>
</ol>
<h4 id="runthecode">Run the code</h4>
<p>Create a file to upload</p>
<p><code>touch file.txt</code></p>
<p>Next, create a python script to instantiate and run the storage service</p>
<p><em>service_runner.py</em></p>
<pre><code>from storage_service import StorageService  # (1)

storage_service = StorageService(&quot;your.first.boto.s3.bucket&quot;)  # (2)

print(storage_service.upload_file(&apos;file.txt&apos;))  # (3)
</code></pre>
<ol>
<li>Import the storage service</li>
<li>Instantiate a new service with a bucket location</li>
<li>Call the service, using the text file created above as the upload</li>
</ol>
<p>Notice nothing is returned from the AWS API in this case, but that is not always true.</p>
<p>Log in to AWS Web Console to confirm the file was uploaded successfully.</p>
<h3 id="downloadingfiles">Downloading Files</h3>
<p>This method will use the name of an S3 object to retrieve it from the cloud and save to the local system</p>
<pre><code>def download_object(self, object_name, file_name=None):  # (1)
    if file_name is None:
        file_name = object_name

    response = self.client.download_file(self.bucket_name, object_name, file_name)  # (2)

    return response
</code></pre>
<ol>
<li>This method takes <code>object_name</code> to retrieve the object from S3 and <code>file_name</code> as an optional parameter to rename the file when saving it locally</li>
<li>The <code>download_file()</code> method will be supplied the following values
<ul>
<li>Param 1 - Name of bucket to access</li>
<li>Param 2 - Name of object to retrieve from S3</li>
<li>Param 3 - What name to use for file on the local system</li>
</ul>
</li>
</ol>
<h4 id="runthecode">Run the Code</h4>
<p><em>service_runner.py</em></p>
<p><code>print(storage_service.download_object(&apos;file.txt&apos;, &apos;file_s3.txt&apos;))</code></p>
<p>There should now be a new file saved locally named &quot;file_s3.txt&quot;</p>
<h3 id="listingfiles">Listing Files</h3>
<p>The ability for the storage service to return a list of all the objects in the bucket could be useful</p>
<pre><code>def list_all_objects(self):
    objects = self.client.list_objects(Bucket=self.bucket_name)  # (1)

    if &quot;Contents&quot; in objects:  # (2)
        response = objects[&quot;Contents&quot;]
    else:
        response = {}

    return response
</code></pre>
<ol>
<li><code>list_objects()</code> only needs a bucket name to be supplied as a parameter. This will return a lengthy JSON object that is stored as a dictionary of &lt;key, value&gt; pairs</li>
<li>The bit we are interested in has a key of &apos;Contents&apos;. This contains an array of S3 objects. The problem is, if it&apos;s an empty bucket then the &apos;Contents&apos; key will be absent from the JSON. So check first whether the key exists and then assign the response. If the bucket is empty, an empty dictionary is returned</li>
</ol>
<h3 id="runthecode">Run the Code</h3>
<p><em>service_runner.py</em></p>
<pre><code>s3Objects = storage_service.list_all_objects()

for file in s3Objects:
    print(file[&quot;Key&quot;])
</code></pre>
<p><code>list_all_objects()</code> returns a list of dictionaries with information about the objects contained in the bucket. The object names are found under the &apos;Key&apos; key in each dictionary</p>
<h3 id="deletingfiles">Deleting Files</h3>
<p>Finally there should be a way to remove objects from the bucket.</p>
<pre><code>def delete_object(self, object_name):
    response = self.client.delete_object(Bucket=self.bucket_name, Key=object_name)  # (1)

    return response
</code></pre>
<p>The <code>delete_object()</code> method takes two arguments:</p>
<ul>
<li>Param 1 - bucket to access</li>
<li>Param 2 - S3 object to delete</li>
</ul>
<h4 id="runthecode">Run the code</h4>
<p><em>service_runner.py</em></p>
<p><code>print(storage_service.delete_object(&apos;file.txt&apos;))</code></p>
<h2 id="conclusion">Conclusion</h2>
<p>Using the boto3 S3 client, we built a simple storage service to manage file transfers to and from a specified bucket. We took care to build it in a way that we can reuse this service and its code throughout any project we wish to add the service to. If we ever need to add or remove functionality the service code can easily implement these changes.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Using AWS-CLI to interact with AWS S3]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>This article discusses some of the common commands of AWS-CLI to communicate with the AWS S3 service. The command line tool is a quick and easy way to manage S3 buckets. It is not a complicate interface, with only a hand full of commands. The following will use these</p>]]></description><link>https://blog.virtual-artifact.com/using-aws-cli-to-interact-with-aws-s3/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bc</guid><category><![CDATA[AWS]]></category><category><![CDATA[S3]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 25 Mar 2020 15:32:59 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="overview">Overview</h2>
<p>This article discusses some of the common commands of AWS-CLI for communicating with the AWS S3 service. The command line tool is a quick and easy way to manage S3 buckets. It is not a complicated interface, with only a handful of commands. The following will use these commands to create, read, update, and remove objects from S3</p>
<h2 id="anatomy">Anatomy</h2>
<p><code>aws s3 &lt;Command&gt; [&lt;Arg&gt; ...]</code></p>
<h2 id="aboutpaths">About Paths</h2>
<p>Every S3 command consists of at least one path argument. A path can be represented in two ways:</p>
<ol>
<li><code>LocalPath</code> - represents a path on the local file system. This can be relative or absolute</li>
<li><code>S3Uri</code> - represents a S3 bucket, object, or prefix.</li>
</ol>
<p>The S3 directories are referred to as <em>prefixes</em></p>
<h3 id="s3resourcepaths">S3 resource paths</h3>
<p>An S3 URI path is formatted like so:</p>
<p><code>s3://SomeBucket/ObjectKey</code></p>
<p>Let&apos;s break it down piece by piece.</p>
<p><code>s3://</code> : the path to a resource on S3 must begin with this prefix. This denotes that the path argument refers to an S3 resource.</p>
<p><code>SomeBucket/</code> : this refers to the unique bucket name to access</p>
<p><code>ObjectKey</code> : this is the specified key name value for the object within the bucket</p>
<h3 id="ordermatters">Order Matters</h3>
<p>All AWS-CLI S3 commands take one or two URI path arguments. The first argument is the <em>source</em> path. This can be a local resource or an S3 resource. When a second path argument is present, it represents the destination path. This too can be either a local or S3 path. If a command only calls for one path, it is because the command operates on the source resource alone, and there is no need for a destination path</p>
<h2 id="s3operations">S3 Operations</h2>
<p>The AWS-CLI S3 commands can operate on single files or on file directories</p>
<h3 id="singlefileoperations">Single File Operations</h3>
<p>Here are commands that will operate on a single file:</p>
<ul>
<li>cp - copy a resource from source path to destination</li>
<li>mv - move a resource from source path to destination</li>
<li>rm - remove a resource at the source path</li>
</ul>
<p>If the <code>--recursive</code> flag is used the operation may affect more than one file.</p>
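<p>A few example invocations (the bucket, prefix, and file names below are placeholders):</p>
<pre><code># copy a local file into a bucket
aws s3 cp ./report.txt s3://bucketname/reports/report.txt

# move an object between prefixes
aws s3 mv s3://bucketname/reports/report.txt s3://bucketname/archive/report.txt

# remove an object
aws s3 rm s3://bucketname/archive/report.txt

# copy a whole directory by adding --recursive
aws s3 cp ./reports s3://bucketname/reports --recursive
</code></pre>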
<h4 id="slashesmatter">Slashes Matter</h4>
<p>When creating path arguments for the source and the destination, the direction of the slashes matters. When representing a path on the local file system, use the slash separator used by the operating system. When representing an S3 resource, use forward slashes.</p>
<p>When configuring the destination resource path, whether or not it ends with a slash changes the behavior</p>
<p><code>aws s3 cp src/file.txt s3://bucketname/src</code></p>
<p><code>aws s3 cp src/file.txt s3://bucketname/src/</code></p>
<p>The first command copies the file to an object named <code>src</code>; the second copies it under the <code>src/</code> prefix as <code>src/file.txt</code></p>
<h3 id="directoryprefixoperations">Directory &amp; Prefix Operations</h3>
<p>Here are some commands that operate on directories and/or their contents (examples follow the list):</p>
<ul>
<li>sync - Syncs  directories  and S3 prefixes</li>
<li>mb - create/make a bucket</li>
<li>rb - remove a bucket</li>
<li>ls - the directory content</li>
</ul>
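<p>A few example invocations (bucket and directory names are placeholders):</p>
<pre><code># make and then remove a bucket
aws s3 mb s3://my.example.bucket
aws s3 rb s3://my.example.bucket

# list the contents of a bucket or prefix
aws s3 ls s3://bucketname/reports/

# sync a local directory to an S3 prefix
aws s3 sync ./site s3://bucketname/site
</code></pre>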
<h4 id="slashesdontmatter">Slashes Don&apos;t Matter</h4>
<p>Unlike single file operations, a trailing slash does not affect how directory/prefix operations work</p>
<h2 id="filters">Filters</h2>
<p>Most commands allow for filtering using the <code>--exclude &lt;value&gt;</code> and <code>--include &lt;value&gt;</code> parameters. Pattern matching is achieved with the following symbols.</p>
<ul>
<li>*: Matches everything</li>
<li>?: Matches any single character</li>
<li>[sequence]: Matches any character in sequence</li>
<li>[!sequence]: Matches any character not in sequence</li>
</ul>
<p>The <em>exclude</em> and <em>include</em> parameters can be used multiple times in a single command</p>
<p><code>--exclude &quot;*&quot; --include &quot;*.txt&quot;</code></p>
<p>When multiple filter parameters are present the latter will override the former.</p>
<p><code>--include &quot;*.txt&quot; --exclude &quot;*&quot;</code></p>
<p>But reversing the order leads to a different outcome</p>
<p>Filters are applied to the source directory</p>
<p><code>aws s3 sync ./ s3://bucket.on.s3 --exclude &quot;*&quot; --include &quot;*.mov&quot; --include &quot;*.ogg&quot;</code></p>
<p>This command will perform a sync using the current directory as the source and will exclude all files except <em>.ogg</em> and <em>.mov</em> files.</p>
<h2 id="summary">Summary</h2>
<p>We have looked at some of the common ways to interact with S3 using the AWS-CLI. By breaking down the format of path arguments, we are able to connect S3 buckets and objects with the local file system. We explored several ways to configure <code>s3</code> commands and filter results. We are now ready to administer S3 through the command line interface!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Install and Configure Boto3 with AWS]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>AWS (Amazon Web Services) is an ecosystem with an abundance of services to fulfill many of our development needs. There are also some great ways for us to interact with these services. We can simply log in to the web browser console(<a href="https://aws.amazon.com">https://aws.amazon.com</a><br>
). Or we can</p>]]></description><link>https://blog.virtual-artifact.com/install-and-configure-boto3-with-aws/</link><guid isPermaLink="false">62e185bb08b3cd09cac935bb</guid><category><![CDATA[AWS]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Tue, 17 Mar 2020 22:23:41 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="introduction">Introduction</h2>
<p>AWS (Amazon Web Services) is an ecosystem with an abundance of services to fulfill many of our development needs. There are also some great ways for us to interact with these services. We can simply log in to the web browser console (<a href="https://aws.amazon.com">https://aws.amazon.com</a>). Or we can use the command line with AWS-CLI to execute commands on services from the terminal.</p>
<p>There are also many APIs that allow code to interact with AWS programmatically, and that is what we are going to do in the following article: create a Python script that uses the Boto3 library to connect to an Amazon web service. We will use the following objectives to reach this goal:</p>
<ol>
<li>Install the Python libraries needed to connect to AWS within a virtual environment</li>
<li>Configure credentials to gain access to AWS</li>
<li>Execute code to interact with AWS</li>
</ol>
<h3 id="prerequisites">Prerequisites</h3>
<ul>
<li>Python (<a href="https://www.python.org/">https://www.python.org/</a>)</li>
<li>Pip (<a href="https://pip.pypa.io/en/stable/">https://pip.pypa.io/en/stable/</a>)</li>
<li>AWS-CLI</li>
</ul>
<h2 id="virtualenvironment">Virtual Environment</h2>
<h3 id="whatisavirtualenvironment">What is a Virtual Environment?</h3>
<p>A Virtual Environment is a self contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.</p>
<h3 id="whyshouldiuseavirtualenvironment">Why should I use a Virtual Environment?</h3>
<p>A Virtual Environment keeps all dependencies for the Python project separate from dependencies of other projects. This has a few advantages:</p>
<p>It makes dependency management for the project easy.<br>
It enables using and testing of different library versions by quickly spinning up a new environment and verifying the compatibility of the code with the different version.</p>
<h3 id="macosxsetup">Mac OS X Setup</h3>
<ol>
<li>Create a folder where the virtual environments will reside <code>$ mkdir ~/python-envs</code></li>
<li>To create a new environment named sample-env execute <code>$ python3 -m venv ~/python-envs/sample-env</code></li>
<li>To activate the environment execute <code>$ source ~/python-envs/sample-env/bin/activate</code></li>
<li>Install the Boto3 package using <code>$ pip3 install boto3</code></li>
<li>To deactivate the environment execute <code>$ deactivate</code></li>
</ol>
<h2 id="configuration">Configuration</h2>
<p>To use Boto3 you will need a credentials file to verify your identity so that you may access your account services. We previously covered this in an article on setting up AWS-CLI. AWS-CLI will create this credentials file for you during setup; head over <a href="https://www.google.com">here</a> if you haven&apos;t set up AWS-CLI.</p>
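<p>For reference, the credentials file (typically <code>~/.aws/credentials</code>) looks roughly like this, with placeholder values:</p>
<pre><code>[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
</code></pre>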
<h2 id="code">Code</h2>
<p>We will create a simple python script to connect to S3 and upload a file</p>
<p>Begin by creating a directory to write your files to. I will be using the one created earlier in this article, <code>~/python-envs/sample-env</code></p>
<p>Now create a new file named &apos;storage_service.py&apos; inside the &apos;sample-env&apos; directory</p>
<p>Add the below lines of code to the new file</p>
<pre><code>import boto3  # 1

client = boto3.client(&apos;s3&apos;)  # 2
client.create_bucket(Bucket=&quot;your.first.boto.s3.bucket&quot;)  # 3

buckets = client.list_buckets()  # 4
for i in buckets[&apos;Buckets&apos;]:  # 5
    print(i[&apos;Name&apos;])
</code></pre>
<ol>
<li>Import the boto3 library into the program</li>
<li>This will return an object that will allow for interaction with S3</li>
<li>Use the newly acquired S3 client to create a bucket</li>
<li>Return a list of the current buckets in use and accessible to the user credentials that were used in the configuration section</li>
<li>Iterate through the results and print the name of each bucket</li>
</ol>
<p>Finally, we will start up the virtual environment</p>
<p>Start up the virtual environment<br>
<code>$ source ~/python-envs/sample-env/bin/activate</code></p>
<p>If you haven&apos;t done so, install the boto3 package into the environment <code>$ pip3 install boto3</code></p>
<p>Run the script<br>
<code>$ python3 storage_service.py</code></p>
<p>If everything works out properly you should see a list of S3 bucket names</p>
<p><code>your.first.boto.s3.bucket</code></p>
<h2 id="conclusion">Conclusion</h2>
<p>We have taken the first steps of opening up our Python applications to an ever expanding set of services through AWS.</p>
<p>We used a Python virtual environment to create an independent development space for our code, and installed the Boto3 library. Using AWS-CLI we were able to set up the credentials we needed to access AWS. Finally, we created a script that executed commands to create and query resources on S3.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[AWS - Install and Setup Command Line Interface]]></title><description><![CDATA[<h2 id="overview">Overview</h2><p>Amazon Web Services offer a large number of cloud based services and tools. There are a few different ways that we can interact with these services. These options include a web browser interface (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>) and SDKs (Software Development Kits) for most major programming language. This article</p>]]></description><link>https://blog.virtual-artifact.com/aws-install-and-setup-aws-cli/</link><guid isPermaLink="false">62e185bb08b3cd09cac935b9</guid><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Wed, 26 Feb 2020 22:56:22 GMT</pubDate><content:encoded><![CDATA[<h2 id="overview">Overview</h2><p>Amazon Web Services offer a large number of cloud based services and tools. There are a few different ways that we can interact with these services. These options include a web browser interface (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>) and SDKs (Software Development Kits) for most major programming language. This article focuses on using the AWS-CLI (Amazon Web Services Command Line Interface.) &#xA0;As the name suggest, the AWS-CLI allows us to interface using a system command line application. We will cover installation and setup so that we can connect to an existing AWS account and begin interacting with Amazon Web Services through the command line.</p><hr><!--kg-card-begin: markdown--><h2 id="prerequisites">Prerequisites</h2>
<p>Before we can begin there are a few things you will need to have:</p>
<ol>
<li>An AWS account</li>
<li>Familiarity with a system command line application(e.g. terminal, iTerm)</li>
<li>For the Homebrew installation, you will need Homebrew installed (<a href="https://brew.sh/.)">https://brew.sh/.)</a></li>
</ol>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><h2 id="installingawscli">Installing AWS-CLI</h2>
<p>There are 2 versions of AWS-CLI as of the writing of this article. AWS-CLI version 2 is the latest release and this is the version we will use. Let&apos;s look at 2 methods for installing AWS-CLI</p>
<p>Note that this article is running AWS-CLI on Mac OSX Catalina. Follow these links for other system installations:</p>
<ul>
<li>Linux - <a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html">https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html</a></li>
<li>Windows - <a href="https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html">https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html</a></li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="homebrewinstallation">Homebrew Installation</h3>
<p>Homebrew is a fantastic little package manager that makes life much easier when it comes to installing and maintaining applications on OSX.</p>
<ol>
<li>We simply use the install command of brew to tell Homebrew that we want to install awscli.<br>
<code>$ brew install awscli</code></li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="macosinstallation">MacOS Installation</h3>
<p>We can also download and install manually, directly from the command line.</p>
<pre><code>curl &quot;https://awscli.amazonaws.com/AWSCLIV2.pkg&quot; -o &quot;AWSCLIV2.pkg&quot;
sudo installer -pkg AWSCLIV2.pkg -target /
</code></pre>
<ol>
<li>
<p>The <code>curl</code> command retrieves the installer package and saves it to the local drive under the file name set by the <code>-o</code> option.</p>
</li>
<li>
<p>Next, <code>installer</code> installs the downloaded pkg file specified by the <code>-pkg</code> option. The <code>-target /</code> option installs the package to the root volume so it lands in the proper directory.</p>
</li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="confirminstallation">Confirm Installation</h3>
<p>We can quickly confirm that everything was installed properly by running the following commands and checking their output:</p>
<pre><code>$ which aws
/usr/local/bin/aws 
$ aws --version
aws-cli/2.0.0 Python/3.8.1 Darwin/19.3.0 botocore/2.0.0dev4
</code></pre>
<p>Note: Your version of Python may be different depending on what is installed in your dev environment.</p>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><h3 id="configuration">Configuration</h3>
<p>Now that we have the AWS command line tool installed, we will need to configure it with our AWS account information. This will allow us to connect to our AWS account and gain access to services.</p>
<p>This will happen in two parts:</p>
<ol>
<li>Using Amazon&apos;s IAM service to create an access key pair to grant aws-cli access to the AWS account</li>
<li>Using the access key pair to configure aws-cli</li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="part1accesskeys">Part 1 - Access Keys</h4>
<p>In this section we will go through the steps for creating and downloading access credentials that will be used in Part 2 to complete the configuration of AWS-CLI and gain access to the AWS account.</p>
<p>In order to complete this section, you will need to have created an AWS account (<a href="https://aws.amazon.com/">https://aws.amazon.com/</a>).</p>
<p>Note: It is recommended that you use an account other than your root account. Creating a separate user with <em>Admin</em> privileges will work well for this tutorial.</p>
<ol>
<li><strong>Sign in</strong> and navigate to the <strong>IAM</strong> service section of the AWS console (<a href="https://console.aws.amazon.com/iam/">https://console.aws.amazon.com/iam/</a>)</li>
<li>Locate the left-side menu and click on the <strong>Users</strong> link under <strong>Access Management</strong></li>
<li>You will see a list of users in the main window. Click on the user that will be used by aws-cli to log in.</li>
<li>This will take you to a summary page, where you will find a tab labeled <strong>Security credentials</strong>. Click this tab.</li>
<li>Under the section labeled <strong>Access keys</strong>, click the <strong>Create access key</strong> button. This will open a new dialog box with the newly created access key.</li>
</ol>
<p>At this point you can choose to download a CSV file of the credentials. Otherwise, you can copy and paste the <strong>Access key ID</strong> and <strong>Secret access key</strong> somewhere for later use.</p>
<p><strong>Warning:</strong> You will not have access to the secret key after you close this dialog box.</p>
<p><strong>Warning:</strong> Keep your access keys confidential. Sharing this information is like sharing your account. Proceed with caution.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="part2awscliconfig">Part 2 - awscli config</h4>
<p>Now that we have the access key needed to configure AWS-CLI, we can return to the command line. The <code>aws configure</code> command is the fastest way to initially configure AWS-CLI.</p>
<pre><code>$ aws configure
AWS Access Key ID [None]: enter-your-access-key-id-here
AWS Secret Access Key [None]: enter-your-secret-access-key-here
Default region name [None]: us-west-2
Default output format [None]: json
</code></pre>
<p><strong>AWS Access Key ID</strong> - the key ID value from the access key that was created in Part 1 - Access Keys</p>
<p><strong>AWS Secret Access Key</strong> - the secret access key that was created in Part 1 - Access Keys</p>
<p><strong>Default region name</strong> - the default region that the AWS-CLI will choose unless otherwise specified. Here I am using the value of us-west-2, but you can use any available region.</p>
<p><strong>Default output format</strong> - This tells AWS-CLI how we want output to be displayed in the command line. There are four possible output formats:</p>
<ul>
<li>json</li>
<li>yaml</li>
<li>text</li>
<li>table</li>
</ul>
<p>Once this is complete, AWS-CLI saves this configuration in a profile named <code>default</code>. These are the values AWS-CLI will use if no other values are explicitly defined (see the override example after the note below). AWS-CLI also creates a <code>.aws</code> folder in the user&apos;s home directory, along with two text files inside it:</p>
<ol>
<li><strong>.aws/credentials</strong> holds the profile access key ID and secret access key</li>
<li><strong>.aws/config</strong> holds the profile default region name and output format</li>
</ol>
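<p>For reference, after running <code>aws configure</code> with the values above, the two files should look roughly like this (the key values shown are placeholders):</p>
<pre><code># ~/.aws/credentials
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# ~/.aws/config
[default]
region = us-west-2
output = json
</code></pre>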
<p>Note: You can always update or change these values by running <code>aws configure</code> again.</p>
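<p>These defaults can also be overridden for a single command. The global <code>--region</code> and <code>--output</code> options take precedence over the profile values, and <code>aws configure get</code> reads a stored value back, for example:</p>
<pre><code>$ aws configure get region
us-west-2
$ aws s3api list-buckets --region us-east-1 --output table
</code></pre>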
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h4 id="confirmconfiguration">Confirm Configuration</h4>
<p>To check that everything is working properly, we run a simple command against AWS S3, Amazon&apos;s storage service.</p>
<p>First, if you don&apos;t have any S3 buckets available, you will need to quickly create one (<a href="https://s3.console.aws.amazon.com/s3/">https://s3.console.aws.amazon.com/s3/</a>).</p>
<p>Then, use <code>aws s3 ls</code> to list all the buckets that we have access to.</p>
<pre><code>&#x276F; aws s3 ls
2019-07-04 16:34:50 cloudtrail-logs
2020-01-14 15:18:54 cf-templates
2019-07-04 17:00:49 config-bucket
2020-01-15 15:08:13 query-results-bucket
</code></pre>
<p>Above is an example of the storage buckets that reside in this user&apos;s S3 account. Depending on the S3 buckets you have created, your list will look different.</p>
<!--kg-card-end: markdown--><hr><!--kg-card-begin: markdown--><h2 id="conclusion">Conclusion</h2>
<p>If you have made it this far, congratulations. You are now able to begin interacting with your AWS services through the command line interface.</p>
<p>We did this by downloading and installing the AWS-CLI application. We then created an access key using IAM through the web browser and used those credentials with <code>aws configure</code> to quickly configure and initialize AWS-CLI. Finally, we ran the <code>aws s3 ls</code> command to list all available S3 buckets and verify that the profile configuration is complete.</p>
<p>In the next few articles, we will look at interacting with individual services to manage their features.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[ITerm 2 Cheat Sheet]]></title><description><![CDATA[<!--kg-card-begin: markdown--><h2 id="tabsandwindows">Tabs and Windows</h2>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>New Tab</td>
<td><code>&#x2318;</code> + <code>T</code></td>
</tr>
<tr>
<td>Close Tab or Window</td>
<td><code>&#x2318;</code> + <code>W</code>  (same as many mac apps)</td>
</tr>
<tr>
<td>Go to Tab</td>
<td><code>&#x2318;</code> + <code>Number Key</code>  (ie: <code>&#x2318;2</code> is 2nd tab)</td>
</tr>
<tr>
<td>Go to Split Pane by Direction</td>
<td><code>&#x2318;</code> + <code>Option</code> + <code>Arrow Key</code></td>
</tr>
<tr>
<td>Cycle iTerm Windows</td>
<td><code>&#x2318;</code> + <code>backtick</code>  (true of all</td></tr></tbody></table>]]></description><link>https://blog.virtual-artifact.com/iterm-2-cheat-sheet/</link><guid isPermaLink="false">62e185bb08b3cd09cac935b7</guid><category><![CDATA[cheatsheet]]></category><dc:creator><![CDATA[Froilan Miranda]]></dc:creator><pubDate>Thu, 17 Oct 2019 14:03:28 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="tabsandwindows">Tabs and Windows</h2>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>New Tab</td>
<td><code>&#x2318;</code> + <code>T</code></td>
</tr>
<tr>
<td>Close Tab or Window</td>
<td><code>&#x2318;</code> + <code>W</code>  (same as many mac apps)</td>
</tr>
<tr>
<td>Go to Tab</td>
<td><code>&#x2318;</code> + <code>Number Key</code>  (ie: <code>&#x2318;2</code> is 2nd tab)</td>
</tr>
<tr>
<td>Go to Split Pane by Direction</td>
<td><code>&#x2318;</code> + <code>Option</code> + <code>Arrow Key</code></td>
</tr>
<tr>
<td>Cycle iTerm Windows</td>
<td><code>&#x2318;</code> + <code>backtick</code>  (true of all mac apps and works with desktops/mission control)</td>
</tr>
<tr>
<td><strong>Splitting</strong></td>
<td></td>
</tr>
<tr>
<td>Split Window Vertically (same profile)</td>
<td><code>&#x2318;</code> + <code>D</code></td>
</tr>
<tr>
<td>Split Window Horizontally (same profile)</td>
<td><code>&#x2318;</code> + <code>Shift</code> + <code>D</code>  (mnemonic: shift is a wide horizontal key)</td>
</tr>
<tr>
<td><strong>Moving</strong></td>
<td></td>
</tr>
<tr>
<td>Move a pane with the mouse</td>
<td><code>&#x2318;</code> + <code>Alt</code> + <code>Shift</code> and then drag the pane from anywhere</td>
</tr>
<tr>
<td><strong>Fullscreen</strong></td>
<td></td>
</tr>
<tr>
<td>Fullscreen</td>
<td><code>&#x2318;</code>+ <code>Enter</code></td>
</tr>
<tr>
<td>Maximize a pane</td>
<td><code>&#x2318;</code> + <code>Shift</code> + <code>Enter</code>  (use with fullscreen to temp fullscreen a pane!)</td>
</tr>
<tr>
<td>Resize Pane</td>
<td><code>Ctrl</code> + <code>&#x2318;</code> + <code>Arrow</code> (given you haven&apos;t mapped this to something else)</td>
</tr>
<tr>
<td><strong>Less Often Used By Me</strong></td>
<td></td>
</tr>
<tr>
<td>Go to Split Pane by Order of Use</td>
<td><code>&#x2318;</code> + <code>]</code> , <code>&#x2318;</code> + <code>[</code></td>
</tr>
<tr>
<td>Split Window Horizontally (new profile)</td>
<td><code>Option</code> + <code>&#x2318;</code> + <code>H</code></td>
</tr>
<tr>
<td>Split Window Vertically (new profile)</td>
<td><code>Option</code> + <code>&#x2318;</code> + <code>V</code></td>
</tr>
<tr>
<td>Previous Tab</td>
<td><code>&#x2318;</code>+ <code>Left Arrow</code>  (I usually move by tab number)</td>
</tr>
<tr>
<td>Next Tab</td>
<td><code>&#x2318;</code>+ <code>Right Arrow</code></td>
</tr>
<tr>
<td>Go to Window</td>
<td><code>&#x2318;</code> + <code>Option</code> + <code>Number</code></td>
</tr>
</tbody>
</table>
<h1 id="basicmoves">Basic Moves</h1>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Move back one character</td>
<td><code>Ctrl</code> + <code>B</code></td>
</tr>
<tr>
<td>Move forward one character</td>
<td><code>Ctrl</code> + <code>F</code></td>
</tr>
<tr>
<td>Delete current character</td>
<td><code>Ctrl</code> + <code>D</code></td>
</tr>
<tr>
<td>Delete previous word (in shell)</td>
<td><code>Ctrl</code> + <code>W</code></td>
</tr>
</tbody>
</table>
<h1 id="movingfaster">Moving Faster</h1>
<p>A lot of shell shortcuts work in iTerm, and it&apos;s good to learn them because arrow keys, Home/End keys, and their Mac equivalents don&apos;t always work. For example, <code>&#x2318;</code> + <code>Left Arrow</code> is usually the same as <code>Home</code> (go to the beginning of the current line), but that doesn&apos;t work in the shell. Home works in many apps, but it takes you away from the home row.</p>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Move to the start of line</td>
<td><code>Ctrl</code> + <code>A</code> or <code>Home</code></td>
</tr>
<tr>
<td>Move to the end of line</td>
<td><code>Ctrl</code> + <code>E</code> or <code>End</code></td>
</tr>
<tr>
<td>Move forward a word</td>
<td><code>Option</code> + <code>F</code></td>
</tr>
<tr>
<td>Move backward a word</td>
<td><code>Option</code> + <code>B</code></td>
</tr>
<tr>
<td>Set Mark</td>
<td><code>&#x2318;</code> + <code>M</code></td>
</tr>
<tr>
<td>Jump to Mark</td>
<td><code>&#x2318;</code> + <code>J</code></td>
</tr>
<tr>
<td>Moving by word on a line (this is a shell thing but passes through fine)</td>
<td><code>Ctrl</code> + <code>Left/Right Arrow</code></td>
</tr>
<tr>
<td>Cursor Jump with Mouse (shell and vim - might depend on config)</td>
<td><code>Option</code> + <code>Left Click</code></td>
</tr>
</tbody>
</table>
<h1 id="copyandpastewithitermwithoutusingthemouse">Copy and Paste with iTerm without using the mouse</h1>
<p>I don&apos;t use this feature too much.</p>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Enter Copy Mode</td>
<td><code>Shift</code> + <code>&#x2318;</code> + <code>C</code></td>
</tr>
<tr>
<td>Enter Character Selection Mode in Copy Mode</td>
<td><code>Ctrl</code> + <code>V</code></td>
</tr>
<tr>
<td>Move cursor in Copy Mode</td>
<td><code>HJKL</code> vim motions or arrow keys</td>
</tr>
<tr>
<td>Copy text in Copy Mode</td>
<td><code>Ctrl</code> + <code>K</code></td>
</tr>
</tbody>
</table>
<p>Copied text goes into the normal system clipboard, which you can paste as usual.</p>
<h1 id="searchthecommandhistory">Search the Command History</h1>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Search as you type</td>
<td><code>Ctrl</code> + <code>R</code> and type the search term; repeat <code>Ctrl</code> + <code>R</code> to loop through results</td>
</tr>
<tr>
<td>Search the last remembered search term</td>
<td><code>Ctrl</code> + <code>R</code> twice</td>
</tr>
<tr>
<td>End the search at current history entry</td>
<td><code>Ctrl</code> + <code>Y</code></td>
</tr>
<tr>
<td>Cancel the search and restore original line</td>
<td><code>Ctrl</code> + <code>G</code></td>
</tr>
</tbody>
</table>
<h1 id="misc">Misc</h1>
<table>
<thead>
<tr>
<th><strong>Function</strong></th>
<th><strong>Shortcut</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Clear the screen/pane (when <code>Ctrl + L</code> won&apos;t work)</td>
<td><code>&#x2318;</code> + <code>K</code>  (I use this all the time)</td>
</tr>
<tr>
<td>Broadcast command to all panes in window (nice when needed!)</td>
<td><code>&#x2318;</code> + <code>Alt</code> +  <code>I</code> (again to toggle)</td>
</tr>
<tr>
<td>Find Cursor</td>
<td><code>&#x2318;</code> + <code>/</code>  <em>or use a theme or cursor shape that is easy to see</em></td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>