Python is one of the most popular programming languages. Thanks to its simple syntax, programmers can develop a wide range of software applications quickly and easily. With Python you can build Data Science and Machine Learning projects as well as general-purpose applications, and when combined with the popular SQL Server Data Platform, it gives software developers and data scientists a powerful development setup that can be used to build just about anything.
Structured Query Language (SQL, often pronounced “sequel”) is how we communicate with a relational database. It is an intuitive and powerful language, and mastering it is rapidly becoming a prerequisite even for career paths outside of software development.
SQL has been one of the most common skill sets required by employers for software developers for years, according to research published by Indeed. That fact is not likely to change any time soon, as businesses have woken up to the reality that the data they capture, generate, and store has tremendous business value to improve performance and efficiency and uncover new paths for growth.
Databases in Python
Basics of Relational Databases
Introduction to Databases
Tables, Columns, Rows, and Relationships are part of the relational model.
Connecting to your database
- MS SQL
Autoloading Tables from a database
SQLAlchemy can be used to automatically load tables from a database using something called reflection. Reflection is the process of reading the database and building the metadata based on that information. It’s the opposite of creating a Table by hand and is very useful for working with existing databases.
To perform reflection, you will first need to import and initialize a MetaData object. MetaData objects contain information about tables stored in a database. During reflection, the MetaData object will be populated with information about the reflected table automatically, so we only need to initialize it before reflecting by calling MetaData().
You will also need to import the Table object from the SQLAlchemy package. Then, you use this Table object to read your table from the engine, autoload the columns, and populate the metadata. This can be done with a single call to Table(): using the Table object in this manner is a lot like passing arguments to a function. For example, to autoload the columns with the engine, you have to specify the keyword arguments autoload=True and autoload_with=engine to Table().
Finally, to view information about the object you just created, you will use the repr() function. For any Python object, repr() returns a text representation of that object. For SQLAlchemy Table objects, it will return the information about that table contained in the metadata.
Viewing Table details
Now you can learn more about the columns and structure of your table. It is important to get an understanding of your database by examining the column names. This can be done by using the .columns attribute and accessing the .keys() method. For example, census.columns.keys() would return a list of column names of the census table.
Following this, we can use the metadata container to find out more details about the reflected table, such as the columns and their types. For example, information about the Table objects is stored in the metadata.tables dictionary, so you can get the metadata of your census table with metadata.tables['census']. This is similar to your use of the repr() function on the census table.
from sqlalchemy import create_engine, MetaData, Table
engine = create_engine('sqlite:///census.sqlite')
metadata = MetaData()
# Reflect the census table from the engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Print the column names
print(census.columns.keys())
# Print full metadata of census
print(repr(metadata.tables['census']))
Introduction to SQL: raw SQL vs SQLAlchemy
Selecting data from a Table: raw SQL
We first need to establish a connection to it by using the .connect() method on the engine. This is because the create_engine() function that you have used before returns an instance of an engine, but it does not actually open a connection until an action is called that would require a connection, such as a query.
Using what we just learned about SQL and applying the .execute() method on our connection, we can leverage a raw SQL query to query all the records in our census table. The object returned by the .execute() method is a ResultProxy. On this ResultProxy, we can then use the .fetchall() method to get our results – that is, the ResultSet.
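As a minimal, self-contained sketch of this flow (an in-memory SQLite database with made-up rows stands in for census.sqlite; note that recent SQLAlchemy versions require raw SQL strings to be wrapped in text()):

```python
from sqlalchemy import create_engine, text

# In-memory stand-in for the course's census database (hypothetical rows)
engine = create_engine('sqlite:///:memory:')
connection = engine.connect()          # explicitly open a connection
connection.execute(text(
    'CREATE TABLE census (state TEXT, sex TEXT, age INTEGER)'))
connection.execute(text(
    "INSERT INTO census VALUES ('Illinois', 'M', 0), ('Illinois', 'F', 1)"))

# Raw SQL query: .execute() returns a result proxy...
result_proxy = connection.execute(text('SELECT * FROM census'))
# ...and .fetchall() pulls the actual ResultSet
results = result_proxy.fetchall()
print(results)
```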
Selecting data from a Table with SQLAlchemy
It’s now time to build your first select statement using SQLAlchemy. SQLAlchemy provides a nice “Pythonic” way of interacting with databases. When you used raw SQL in the last exercise, you queried the database directly. When using SQLAlchemy, you will go through a Table object instead, and SQLAlchemy will take care of translating your query to an appropriate SQL statement for you. So rather than dealing with the differences between specific dialects of traditional SQL such as MySQL or PostgreSQL, you can leverage the Pythonic framework of SQLAlchemy to streamline your workflow and more efficiently query your data. For this reason, it is worth learning even if you are already familiar with traditional SQL.
You will also fetch only a few records of the ResultProxy by using .fetchmany() with a size argument specifying the number of records to fetch.
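A sketch of the same query built from the Table object instead of raw SQL (an in-memory stand-in table with made-up rows; written in the select() call form of SQLAlchemy 1.4+, whereas these notes use the older select([...]) list form):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select

# In-memory stand-in for the census database (hypothetical rows)
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)), Column('age', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(census.insert(), [
    {'state': 'Illinois', 'age': 0},
    {'state': 'New Jersey', 'age': 1},
    {'state': 'New York', 'age': 2},
])

# Equivalent of SELECT * FROM census, built from the Table object
stmt = select(census)
result_proxy = connection.execute(stmt)
partial = result_proxy.fetchmany(size=2)   # fetch only the first 2 records
print(partial)
```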
Handling a ResultSet
- ResultProxy: the object returned by the .execute() method. It can be used in a variety of ways to get the data returned by the query.
- ResultSet: the actual data asked for in the query when using a fetch method such as .fetchall() on a ResultProxy.
This separation between the ResultSet and ResultProxy allows us to fetch as much or as little data as we desire.
Applying Filtering, Ordering and Grouping to Queries
Filtering and targeting data
Connecting to a PostgreSQL database
There are three components to the connection string in this exercise: the dialect and driver ('postgresql+psycopg2://'), followed by the username and password ('username:password'), followed by the host and port ('@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:1234/'), and finally, the database name ('census'). You will have to pass this string as an argument to create_engine() in order to connect to the database.
# Import create_engine function
from sqlalchemy import create_engine
# Create an engine to the census database
engine = create_engine('postgresql+psycopg2://username:password@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:1234/census')
# Use the .table_names() method on the engine to print the table names
print(engine.table_names())
# ['census', 'state_fact', 'vrska', 'census1', 'data', 'data1', 'employees3', 'users', 'employees', 'employees_2']
In addition to standard Python comparators, we can also use methods such as in_() to create more powerful where() clauses. You can see a full list of expressions in the SQLAlchemy Documentation.
The in_() method, when used on a column, allows us to include records where the value of that column is among a list of possible values. For example, where(census.columns.age.in_([20, 30, 40])) will return only records for people who are exactly 20, 30, or 40 years old.
Along with in_(), you can also use methods like and_() and any_() to create more powerful where() clauses. You might have noticed that we did not use any of the fetch methods to retrieve a ResultSet like in the previous exercises. Indeed, if you are only interested in manipulating one record at a time, you can iterate over the ResultProxy directly!
SQLAlchemy also allows users to use conjunctions such as and_(), or_(), and not_() to build more complex filtering. For example, we can get a set of records for people in New York who are 21 or 37 years old with the following code:
stmt = stmt.where(
    and_(census.columns.state == 'New York',
         or_(census.columns.age == 21,
             census.columns.age == 37)))
Overview of ordering
Ordering by a single column
To sort the result output by a field, we use the .order_by() method. By default, the .order_by() method sorts from lowest to highest on the supplied column.
# Build a query to select the state column: stmt
stmt = select([census.columns.state])
# Order stmt by the state column
stmt = stmt.order_by(census.columns.state)
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
# Print the first 10 results
print(results[:10])
# [('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',), ('Alabama',)]
Ordering in descending order by a single column
You can also use .order_by() to sort from highest to lowest by wrapping a column in the desc() function. Although you haven’t seen this function in action, it generalizes what you have already learned.
Pass desc() (for “descending”) inside an .order_by() with the name of the column you want to sort by. For instance, stmt.order_by(desc(table.columns.column_name)) sorts column_name in descending order.
Ordering by multiple columns
We can pass multiple arguments to the .order_by() method to order by multiple columns. In fact, we can also sort in ascending or descending order for each individual column.
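A sketch of multi-column ordering, assuming a small in-memory stand-in table with made-up rows (select() call form of SQLAlchemy 1.4+):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, select, desc)

# In-memory stand-in table with made-up rows
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)), Column('age', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(census.insert(), [
    {'state': 'Alaska', 'age': 2},
    {'state': 'Alabama', 'age': 1},
    {'state': 'Alabama', 'age': 3},
])

# Sort by state ascending, then by age descending within each state
stmt = select(census.columns.state, census.columns.age)
stmt = stmt.order_by(census.columns.state, desc(census.columns.age))
results = connection.execute(stmt).fetchall()
print(results)
```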
Counting, summing and grouping data
Counting distinct data
SQLAlchemy’s func module provides access to built-in SQL functions that can make operations like counting and summing faster and more efficient.
We can use func.sum() to get a sum of the pop2008 column of census. If instead you want to count the number of values in pop2008, you can use func.count() in the same way. Furthermore, if you only want to count the distinct values of pop2008, you can combine func.count() with the .distinct() method.
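All three aggregates can be sketched as follows (in-memory stand-in table with made-up values; select() call form of SQLAlchemy 1.4+):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, select, func)

# Stand-in table: three rows, two distinct pop2008 values
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata, Column('pop2008', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(census.insert(),
                   [{'pop2008': 100}, {'pop2008': 100}, {'pop2008': 50}])

# Sum of all values
total = connection.execute(select(func.sum(census.columns.pop2008))).scalar()
# Count of all values
n_values = connection.execute(select(func.count(census.columns.pop2008))).scalar()
# Count of distinct values only
n_distinct = connection.execute(
    select(func.count(census.columns.pop2008.distinct()))).scalar()
print(total, n_values, n_distinct)   # 250 3 2
```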
Count of records by state
Often, we want to get a count for each record with a particular value in another column. The .group_by() method helps answer this type of query. You can pass a column to the .group_by() method and use it with an aggregate function such as sum() or count(). Much like the .order_by() method, .group_by() can take multiple columns as arguments.
# Import func
from sqlalchemy import func
# Build a query to select the state and count of ages by state: stmt
stmt = select([census.columns.state, func.count(census.columns.age)])
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()
# Print results
print(results)
# Print the keys/column names of the results returned
print(results[0].keys())
# [('Alabama', 172), ('Alaska', 172), ('Arizona', 172), ('Arkansas', 172), ('California', 172), ...,
#  ('Wisconsin', 172), ('Wyoming', 172)]
Determining the population sum by state
To avoid confusing query result column names like count_1, we can use the .label() method to provide a name for the resulting column. It is called on the function expression we are using, and its argument is the name we want the output column to have.
We can pair func.sum() with .group_by() to get a sum of the population by State and use the label() method to name the output.
We can also create the func.sum() expression before using it in the select statement. We do it the same way we would inside the select statement and store it in a variable. Then we use that variable in the select statement where the func.sum() would normally be.
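A sketch combining func.sum(), .label(), and .group_by(), with the sum expression stored in a variable first (made-up data; select() call form of SQLAlchemy 1.4+):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, select, func)

# Stand-in table with made-up rows
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)), Column('pop2008', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(census.insert(), [
    {'state': 'Illinois', 'pop2008': 200},
    {'state': 'New York', 'pop2008': 100},
    {'state': 'New York', 'pop2008': 50},
])

# Build the labeled sum expression first, then use it in the select
pop2008_sum = func.sum(census.columns.pop2008).label('population')
stmt = select(census.columns.state, pop2008_sum)
stmt = stmt.group_by(census.columns.state)
results = connection.execute(stmt).fetchall()
print(results)
```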
Use pandas and matplotlib to visualize our data
ResultsSets and pandas dataframes
# Import pandas
import pandas as pd
# Create a DataFrame from the results: df
df = pd.DataFrame(results)
# Set column names from the first result row
df.columns = results[0].keys()
# Print the DataFrame
print(df)
Advanced SQLAlchemy Queries
Calculating values in a query
Connecting to a MySQL database
# Import create_engine function
from sqlalchemy import create_engine
# Create an engine to the census database
# (username, password, and hostname are placeholders)
engine = create_engine('mysql+pymysql://username:password@hostname:3306/census')
# Print the table names
print(engine.table_names())
# ['census', 'state_fact']
Calculating a difference between two columns
Often, you’ll need to perform math operations as part of a query, such as if you wanted to calculate the change in population from 2000 to 2008. For math operations on numbers, the operators in SQLAlchemy work the same way as they do in Python.
You can use these operators to perform addition (+), subtraction (-), multiplication (*), division (/), and modulus (%) operations. Note: They behave differently when used with non-numeric column types.
Let’s now find the top 5 states by population growth between 2000 and 2008.
# Column names of the census table:
# ['state', 'sex', 'age', 'pop2000', 'pop2008']
connection.execute(select([census])).fetchmany(3)
# [('Illinois', 'M', 0, 89600, 95012),
#  ('Illinois', 'M', 1, 88445, 91829),
#  ('Illinois', 'M', 2, 88729, 89547)]
# Build query to return state names by population difference from 2008 to 2000: stmt
stmt = select([census.columns.state, (census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
# Append group by for the state: stmt_grouped
stmt_grouped = stmt.group_by(census.columns.state)
# Append order by for pop_change in descending order: stmt_ordered
stmt_ordered = stmt_grouped.order_by(desc('pop_change'))
# Return only 5 results: stmt_top5
stmt_top5 = stmt_ordered.limit(5)
# Use connection to execute stmt_top5 and fetch all results
results = connection.execute(stmt_top5).fetchall()
# Print the state and population change for each record
for result in results:
    print('{}: {}'.format(result.state, result.pop_change))
Determining the overall percentage of women
It’s possible to combine functions and operators in a single select statement as well. These combinations can be exceptionally handy when we want to calculate percentages or averages, and we can also use the case() expression to operate on data that meets specific criteria while not affecting the query as a whole. The case() expression accepts a list of conditions to match and the column to return if the condition matches, followed by an else_ if none of the conditions match. We can wrap this entire expression in any function or math operation we like.
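A sketch of this pattern, computing the percentage of women in 2000 from made-up rows (note: the case() signature below is the SQLAlchemy 1.4+/2.0 tuple form; the 1.x API used in these notes wraps the (condition, value) pairs in a list):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, Float, select, func, case, cast)

# Stand-in table with made-up rows: 30 women, 70 men
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('sex', String(1)), Column('pop2000', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(census.insert(),
                   [{'sex': 'F', 'pop2000': 30}, {'sex': 'M', 'pop2000': 70}])

# Sum pop2000 only where sex is 'F'; other rows contribute 0
female_pop = func.sum(
    case((census.columns.sex == 'F', census.columns.pop2000), else_=0))
# Cast the overall total to Float to force decimal division
total_pop = cast(func.sum(census.columns.pop2000), Float)
percent_female = connection.execute(
    select(female_pop / total_pop * 100)).scalar()
print(percent_female)   # 30.0
```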
Automatic joins with an established relationship
If you have two tables that already have an established relationship, you can automatically use that relationship by just adding the columns we want from each table to the select statement. Recall that Jason constructed the following query:
stmt = select([census.columns.pop2008, state_fact.columns.abbreviation])
This query joins the census and state_fact tables and selects the pop2008 column from the first and the abbreviation column from the second. It works because the census and state_fact tables have a predefined relationship: the state column of the former corresponds to the name column of the latter.
If you aren’t selecting columns from both tables or the two tables don’t have a defined relationship, you can still use the .join() method on a table to join it with another table and get extra data related to our query.
The .join() method takes as its first argument the table object you want to join in, and as its second argument a condition that indicates how the tables are related.
Finally, you use the .select_from() method on the select statement to wrap the join clause.
For example, the following code joins the census table to the state_fact table such that the state column of the census table corresponded to the name column of the state_fact table.
stmt = stmt.select_from(
    census.join(
        state_fact, census.columns.state == state_fact.columns.name))
More practice with joins
You can reuse the select statement you built in the last exercise; this time, however, let's add a twist: only return a few columns, and use the other table in a group_by() clause.
stmt = select([census.columns.state,
               func.sum(census.columns.pop2008),
               state_fact.columns.census_division_name])
stmt = stmt.select_from(
    census.join(state_fact, census.columns.state == state_fact.columns.name))
stmt = stmt.group_by(state_fact.columns.name)
# [('Texas', 15446707263, 'South Atlantic')]
Working with hierarchical tables
Using alias to handle same table joined queries
Often, you’ll have tables that contain hierarchical data, such as employees and managers who are also employees. For this reason, you may wish to join a table to itself on different columns. The .alias() method, which creates a copy of a table, helps accomplish this task. Because it’s the same table, you only need a where clause to specify the join condition.
Here, you’ll use the .alias() method to build a query to join the employees table against itself to determine to whom everyone reports.
# Columns of the employees table:
# ['id', 'name', 'job', 'mgr', 'hiredate', 'sal', 'comm', 'dept']
# Sample results (manager name, number of reports):
# [('JOHNSON', 6), ('HARDING', 9), ('TAFT', 2), ('HOOVER', 2), ('LINCOLN', 6)]
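A self-contained sketch of such a self-join, using a made-up employees table (the sample output above comes from the course's own data):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select, func)

# Made-up hierarchical table: mgr holds the id of each employee's manager
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(20)),
                  Column('mgr', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(employees.insert(), [
    {'id': 1, 'name': 'JOHNSON', 'mgr': None},
    {'id': 2, 'name': 'GRANT', 'mgr': 1},
    {'id': 3, 'name': 'HARDING', 'mgr': 1},
])

# Alias gives us a second "view" of the same table to act as managers
managers = employees.alias()
stmt = select(managers.columns.name, func.count(employees.columns.id))
# Because it's the same table, a where clause supplies the join condition
stmt = stmt.where(managers.columns.id == employees.columns.mgr)
stmt = stmt.group_by(managers.columns.name)
results = connection.execute(stmt).fetchall()
print(results)   # JOHNSON manages 2 employees
```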
Leveraging functions and group_bys with hierarchical data
It's also common to want to roll up data that lives in a hierarchical table. Rolling up data requires being careful about which alias you use in the group_by clauses and which table you use for the aggregate function.
Dealing with large ResultSets
Working on blocks of records
Sometimes you may need to work with a large ResultProxy when you don't have the memory to load all of the results at once.
To work around that issue, you can get blocks of rows from the ResultProxy by calling the .fetchmany() method inside a loop, passing it the number of records you want per block. When it returns an empty list, there are no more rows left to fetch, and you have processed all the results of the query.
Then you need to use the .close() method to close out the connection to the database.
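Putting the loop together (in-memory stand-in data; the block size of 3 is arbitrary):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, String, select

# Stand-in table with 10 made-up rows
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata, Column('state', String(30)))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(census.insert(),
                   [{'state': 'Alabama'}] * 6 + [{'state': 'Alaska'}] * 4)

results_proxy = connection.execute(select(census))
state_count = {}
more_results = True
while more_results:
    # Fetch 3 rows at a time instead of loading everything into memory
    partial_results = results_proxy.fetchmany(3)
    if partial_results == []:
        more_results = False
    for row in partial_results:
        state_count[row.state] = state_count.get(row.state, 0) + 1
# Close out the result proxy when done
results_proxy.close()
print(state_count)   # {'Alabama': 6, 'Alaska': 4}
```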
Creating and Manipulating your own Databases
Creating databases and tables
Creating tables with SQLAlchemy
Previously, you used the Table object to reflect a table from an existing database, but what if you wanted to create a new table? You’d still use the Table object; however, you’d need to replace the autoload and autoload_with parameters with Column objects.
The Column object takes a name, a SQLAlchemy type with an optional format, and optional keyword arguments for different constraints.
When defining the table, recall how Jason passed in 255 as the maximum length of a String by using Column('name', String(255)).
After defining the table, you can create the table in the database by using the .create_all() method on metadata and supplying the engine as the only parameter. Go for it!
Constraints and data defaults
You’re now going to practice creating a table with some constraints! Often, you’ll need to make sure that a column is unique, nullable, a positive value, or related to a column in another table. This is where constraints come in.
You can also set a default value for the column if no data is passed to it via the default keyword on the column.
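A sketch of a table definition with a uniqueness constraint, nullability, and defaults (the data table and its columns are hypothetical):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, Float, Boolean)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
# Hypothetical table: name must be unique, amount may not be NULL,
# count and valid fall back to defaults when no value is supplied
data = Table('data', metadata,
             Column('name', String(255), unique=True),
             Column('count', Integer(), default=1),
             Column('amount', Float(), nullable=False),
             Column('valid', Boolean(), default=False))
# Create the table in the database
metadata.create_all(engine)
print(repr(metadata.tables['data']))
```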
Inserting data into a table
Inserting a single row
There are several ways to perform an insert with SQLAlchemy; however, we are going to focus on the one that follows the same pattern as the select statement.
It uses an insert statement where you specify the table as an argument, and supply the data you wish to insert into the table via the .values() method as keyword arguments.
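For example (reusing the hypothetical data table from the previous section):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, Float, Boolean, insert, select)

# Hypothetical data table, as defined in the constraints example
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
data = Table('data', metadata,
             Column('name', String(255), unique=True),
             Column('count', Integer(), default=1),
             Column('amount', Float()),
             Column('valid', Boolean(), default=False))
metadata.create_all(engine)
connection = engine.connect()

# Build the insert statement and supply the values as keyword arguments
stmt = insert(data).values(name='Anna', count=1, amount=1000.00, valid=True)
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)   # 1 row was inserted
```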
Inserting multiple records at once
When inserting multiple records at once, you do not use the .values() method. Instead, you’ll want to first build a list of dictionaries that represents the data you want to insert, with keys being the names of the columns.
In the .execute() method, you can pair this list of dictionaries with an insert statement, which will insert all the records in your list of dictionaries.
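A sketch of a bulk insert (same hypothetical data table; note that row counts reported for bulk inserts are driver-dependent):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, Float, Boolean, insert, select)

# Hypothetical data table
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
data = Table('data', metadata,
             Column('name', String(255), unique=True),
             Column('count', Integer(), default=1),
             Column('amount', Float()),
             Column('valid', Boolean(), default=False))
metadata.create_all(engine)
connection = engine.connect()

# One dictionary per record, keyed by column name
values_list = [
    {'name': 'Anna', 'count': 1, 'amount': 1000.00, 'valid': True},
    {'name': 'Taylor', 'count': 1, 'amount': 750.00, 'valid': False},
]
# Pair the bare insert statement with the list of dictionaries
stmt = insert(data)
result_proxy = connection.execute(stmt, values_list)
```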
Loading a CSV into a table
One way to do that would be to read a CSV file line by line, create a dictionary from each line, and then use insert(), like you did in the previous exercise.
But there is a faster way using pandas. You can read a CSV file into a DataFrame using the read_csv() function (this function should be familiar to you, but you can run help(pd.read_csv) in the console to refresh your memory!). Then, you can call the .to_sql() method on the DataFrame to load it into a SQL table in a database. The columns of the DataFrame should match the columns of the SQL table.
.to_sql() has many parameters, but in this exercise we will use the following:
- name is the name of the SQL table (as a string).
- con is the connection to the database that you will use to upload the data.
- if_exists specifies how to behave if the table already exists in the database; possible values are "fail", "replace", and "append".
- index (True or False) specifies whether to write the DataFrame's index as a column.
# Append the data from census_df to the "census" table via connection
census_df.to_sql(name='census', con=connection, if_exists='append', index=False)
Updating data in a database
Updating individual records
The update statement is very similar to an insert statement. For example, you can update all wages in the employees table as follows:
stmt = update(employees).values(wage=100.00)
The update statement also typically uses a where clause to help us determine what data to update. For example, to only update the record for the employee with ID 15, you would append a where clause to the previous statement.
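Sketched out, with a hypothetical employees table holding only an id and a wage:

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, Float, update, select)

# Hypothetical employees table with two made-up rows
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
employees = Table('employees', metadata,
                  Column('id', Integer()), Column('wage', Float()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(employees.insert(),
                   [{'id': 15, 'wage': 80.00}, {'id': 16, 'wage': 90.00}])

# Target only employee 15 by appending a where clause
stmt = update(employees).values(wage=100.00)
stmt = stmt.where(employees.columns.id == 15)
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)   # 1 record updated
```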
Updating multiple records
By using a where clause that selects more records, you can update multiple records at once.
Unlike inserting, updating multiple records works exactly the same way as updating a single record (as long as you are updating them with the same value).
You can also update records with data from a select statement. This is called a correlated update. It works by defining a select statement that returns the value you want to update the record with and assigning that select statement as the value in update.
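A sketch of a correlated update, using two hypothetical tables where flat_census is missing state names that state_fact can supply (the .scalar_subquery() call is the SQLAlchemy 1.4+ spelling for using a select as a value):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, select, update)

# Hypothetical tables: state_fact maps FIPS codes to state names,
# flat_census has the codes but lacks the names
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
state_fact = Table('state_fact', metadata,
                   Column('name', String(30)), Column('fips_state', Integer()))
flat_census = Table('flat_census', metadata,
                    Column('state_name', String(30)), Column('fips_code', Integer()))
metadata.create_all(engine)
connection = engine.connect()
connection.execute(state_fact.insert(), [{'name': 'New York', 'fips_state': 36}])
connection.execute(flat_census.insert(), [{'state_name': None, 'fips_code': 36}])

# Select statement returning the name matching each row's fips_code
fips_stmt = select(state_fact.columns.name).where(
    state_fact.columns.fips_state == flat_census.columns.fips_code)
# Assign that select as the value in the update: a correlated update
update_stmt = update(flat_census).values(
    state_name=fips_stmt.scalar_subquery())
connection.execute(update_stmt)
print(connection.execute(select(flat_census)).fetchall())
```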
Removing data from a database
Deleting all the records from a table
Often, you'll need to empty a table of all of its records so you can reload the data. You can do this with a delete statement that takes just the table as an argument. For example, to delete all records from the extra_employees table:
delete_stmt = delete(extra_employees)
result_proxy = connection.execute(delete_stmt)
Deleting specific records
By using a where() clause, you can target the delete statement to remove only certain records. For example, to delete all rows from the employees table whose id is 3, use the following delete statement:
delete(employees).where(employees.columns.id == 3)
Deleting a table completely
You’re now going to practice dropping individual tables from a database with the .drop() method, as well as all tables in a database with the .drop_all() method!
Do be careful when deleting tables, as it’s not simple or fast to restore large databases!
Remember, you can check to see if a table exists on an engine with the .exists(engine) method.
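A sketch of both drops on two made-up tables (note: the .exists(engine) check mentioned above is the older 1.x API; newer SQLAlchemy versions removed it in favor of inspect(), which is used here instead):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, String, inspect

# Two made-up tables to drop
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
state_fact = Table('state_fact', metadata, Column('name', String(30)))
census = Table('census', metadata, Column('state', String(30)))
metadata.create_all(engine)

# Drop a single table
state_fact.drop(engine)
print(inspect(engine).has_table('state_fact'))   # False
# Drop every remaining table known to the metadata
metadata.drop_all(engine)
print(inspect(engine).get_table_names())         # []
```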
Putting it all together
Census case study
Set up the engine and metadata
In this exercise, your job is to create an engine to the database that will be used in this chapter. Then, you need to initialize its metadata.
Recall how you did this by leveraging create_engine() and MetaData().
# Import create_engine, MetaData
from sqlalchemy import create_engine, MetaData
# Define an engine to connect to chapter5.sqlite: engine
engine = create_engine('sqlite:///chapter5.sqlite')
# Initialize MetaData: metadata
metadata = MetaData()
Create the table in the database
Having set up the engine and initialized the metadata, you will now define the census table object and then create it in the database using the metadata and engine.
To create it in the database, you will have to use the .create_all() method on the metadata with engine as the argument.
When creating columns of type String(), it’s important to spend some time thinking about what their maximum lengths should be.
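Under those guidelines, the case-study table might be defined like this (the String lengths are judgment calls, and an in-memory database stands in for chapter5.sqlite):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer

# Stand-in for the chapter5.sqlite engine
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
# Define the census table: state names fit in 30 characters, sex in 1
census = Table('census', metadata,
               Column('state', String(30)),
               Column('sex', String(1)),
               Column('age', Integer()),
               Column('pop2000', Integer()),
               Column('pop2008', Integer()))
# Create the table in the database
metadata.create_all(engine)
print(census.columns.keys())
```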
“YOU DON’T HAVE TO BE GREAT TO START, BUT YOU HAVE TO START TO BE GREAT.”