Data is power. The more data we have, the better and more robust the products we can build. However, working with large amounts of data has its challenges. We need software tools and packages to gain insights from it, for example by creating a data summary in Python.
A substantial number of data-based solutions and products use tabular data, that is, data stored in a table format with labeled rows and columns. Each row represents an observation (i.e., a data point), and each column represents a feature or attribute of that observation.
As the numbers of rows and columns increase, it becomes more difficult to inspect data manually. Since we almost always work with large datasets, using a software tool to summarize data is a fundamental requirement.
As the leading programming language in the data science ecosystem, Python has libraries for creating data summaries. The most popular and commonly used library for this purpose is pandas.
pandas is a data analysis and manipulation library for Python. In this article, we go over several examples to demonstrate how to use pandas for creating and displaying data summaries.
Generating Data Summary in Python
Let’s start with importing pandas.
import pandas as pd
Consider a sales dataset in CSV format that contains the sales and stock quantities of some products and their product groups. We create a pandas DataFrame for the data in this file and display the first 5 rows as below:
df = pd.read_csv("sales.csv")
df.head()
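Because sales.csv itself is not available here, the following sketch reads a small stand-in dataset from an in-memory string instead. The rows are made up for illustration, and stock_qty is an assumed column name (product_code, product_group, and sales_qty are the names that appear in this article's outputs).

```python
import io

import pandas as pd

# Made-up stand-in for sales.csv; "stock_qty" is an assumed column name.
csv_text = """product_code,product_group,stock_qty,sales_qty
1001,A,600,120
1002,A,300,80
1003,B,900,200
1004,C,150,40
1005,A,800,160
1006,B,400,60
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()  # first 5 rows by default
print(first_rows)
print(df.shape)  # (number of rows, number of columns)
```

With a real file, the only change would be passing the file path to read_csv() instead of the StringIO object.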
We continue summarizing the data by focusing on each column separately. pandas has two main data structures: DataFrame and Series. A DataFrame is a two-dimensional data structure, whereas a Series is one-dimensional. Each column in a DataFrame may be considered a Series.
If a column contains categorical data, as the product group column in our DataFrame does, we can inspect its distinct values with the unique() and nunique() functions.
array(['A', 'C', 'B', 'G', 'D', 'F', 'E'], dtype=object)
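As a minimal sketch of both calls, assuming a small made-up DataFrame in place of the sales data:

```python
import pandas as pd

# Made-up stand-in for the sales data.
df = pd.DataFrame({
    "product_code": [1001, 1002, 1003, 1004, 1005, 1006],
    "product_group": ["A", "A", "B", "C", "A", "B"],
})

# unique() returns the distinct values in order of first appearance.
distinct_values = df["product_group"].unique()

# nunique() returns how many distinct values there are.
distinct_count = df["product_group"].nunique()

print(list(distinct_values))
print(distinct_count)
```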
The nunique() function returns the count of distinct values, whereas the unique() function displays the distinct values. Another commonly used summary function on categorical columns is value_counts(). It shows the distinct values in a column along with the counts of their occurrences. Thus, we get an overview of the distribution of the data.
Name: product_group, dtype: int64
Group A has the most products, followed by Group B with 75 products. The output of the value_counts() function is sorted in descending order by the count of occurrences.
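The same idea on a small made-up DataFrame (the counts here are illustrative and do not match the article's dataset):

```python
import pandas as pd

# Made-up stand-in for the sales data.
df = pd.DataFrame({
    "product_code": [1001, 1002, 1003, 1004, 1005, 1006],
    "product_group": ["A", "A", "B", "C", "A", "B"],
})

# Count how often each distinct value occurs, most frequent first.
group_counts = df["product_group"].value_counts()
print(group_counts)

# normalize=True returns shares of the total instead of raw counts.
group_shares = df["product_group"].value_counts(normalize=True)
print(group_shares)
```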
When working with numeric columns, we need different methods to summarize data. For instance, it does not make sense to check the number of distinct values for the sales quantity column. Instead, we calculate statistical measures such as mean, median, minimum, and maximum.
Let’s first calculate the average value of the sales quantity column.
We simply select the column of interest and apply the mean() function. We can perform this operation on multiple columns as well.
When selecting multiple columns from a DataFrame, make sure to pass them as a list. Otherwise, pandas treats the names as a single key and raises a KeyError.
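A short sketch of the single-column and multi-column cases, assuming a made-up DataFrame in which stock_qty is a hypothetical name for the stock quantity column:

```python
import pandas as pd

# Made-up stand-in for the sales data; "stock_qty" is an assumed column name.
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "stock_qty": [600, 300, 900, 150, 800, 400],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Mean of a single column: selecting one column gives a Series.
avg_sales = df["sales_qty"].mean()

# Mean of several columns: note the inner list of column names.
avg_both = df[["sales_qty", "stock_qty"]].mean()

# Without the inner list, the two names are looked up as one key.
raised_key_error = False
try:
    df["sales_qty", "stock_qty"]
except KeyError:
    raised_key_error = True

print(avg_sales)
print(avg_both)
print(raised_key_error)
```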
Just as easily as we can calculate a single statistic on multiple columns in a single operation, we can calculate multiple statistics at once. One option is to use the apply() function as below:
The functions are written in a list and then passed to apply(). The median is the value in the middle when the values are sorted. Comparing the mean and median values gives us an idea about the skewness of the distribution.
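A minimal sketch of passing a list of statistics to apply(), again on made-up data with an assumed stock_qty column:

```python
import pandas as pd

# Made-up stand-in for the sales data; "stock_qty" is an assumed column name.
df = pd.DataFrame({
    "stock_qty": [600, 300, 900, 150, 800, 400],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Each name in the list becomes one row of the result,
# and each selected column becomes one column.
summary = df[["sales_qty", "stock_qty"]].apply(["mean", "median"])
print(summary)
```

In this toy data the mean of sales_qty is slightly above its median, which hints at a mild right skew.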
We have lots of options to create a data summary in pandas. For instance, we can use a dictionary to calculate separate statistics for different columns. Here is an example:
The keys of the dictionary indicate the column names and the values show the statistics to be calculated for that column.
We can do the same operations with the agg() function instead of apply(). The syntax is the same, so don’t be surprised if you come across tutorials that use the agg() function instead.
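The dictionary form can be sketched as follows. This version uses agg(), and the particular statistics chosen for each column are only illustrative; stock_qty is an assumed column name.

```python
import pandas as pd

# Made-up stand-in for the sales data; "stock_qty" is an assumed column name.
df = pd.DataFrame({
    "stock_qty": [600, 300, 900, 150, 800, 400],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Keys are column names; values are the statistics for that column.
per_column = df.agg({
    "sales_qty": ["mean", "median"],
    "stock_qty": ["min", "max"],
})

# Statistics not requested for a column show up as NaN.
print(per_column)
```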
pandas is a highly useful and practical library in many aspects. For instance, we can calculate various statistics on all numeric columns with just one function: describe():
The statistics in this DataFrame give us a broad overview of the distribution of values. The count is the number of non-missing values in each column. The “25%,” “50%,” and “75%” rows are the first, second, and third quartiles, respectively. The second quartile (i.e., 50%) is also known as the median. Finally, “std” is the standard deviation of the column.
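For reference, here is a small sketch of describe() on made-up data (stock_qty is an assumed column name), showing which rows it returns:

```python
import pandas as pd

# Made-up stand-in for the sales data; "stock_qty" is an assumed column name.
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "stock_qty": [600, 300, 900, 150, 800, 400],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# By default, describe() summarizes only the numeric columns,
# so product_group is left out of the result.
summary = df.describe()
print(summary)

# Individual statistics can be read with .loc[row, column].
median_sales = summary.loc["50%", "sales_qty"]
print(median_sales)
```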
A data summary in Python can be created for a specific part of the DataFrame. We just need to filter the relevant part before applying the functions.
For instance, we describe the data for just Product Group A as below:
We first select the rows whose product group value is A and then use the describe() function. The output is in the same format as in the previous example, but the values are calculated only for Product Group A.
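A sketch of this filter-then-describe pattern, assuming made-up data and an assumed stock_qty column:

```python
import pandas as pd

# Made-up stand-in for the sales data; "stock_qty" is an assumed column name.
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "stock_qty": [600, 300, 900, 150, 800, 400],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# The comparison yields a boolean Series; indexing with it keeps
# only the rows where the condition is True.
is_group_a = df["product_group"] == "A"
summary_a = df[is_group_a].describe()
print(summary_a)
```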
We can apply filters on numeric columns as well. For instance, the following line of code calculates the average sales quantity of products with a stock greater than 500.
pandas allows for creating more complex filters quite efficiently.
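The numeric filter described above, plus one combined condition, might look like the sketch below. The data is made up, stock_qty is an assumed column name, and the combined filter is only an illustration of how conditions can be joined.

```python
import pandas as pd

# Made-up stand-in for the sales data; "stock_qty" is an assumed column name.
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "stock_qty": [600, 300, 900, 150, 800, 400],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Average sales quantity of products with a stock greater than 500.
avg_high_stock = df[df["stock_qty"] > 500]["sales_qty"].mean()

# Conditions are combined with & (and) or | (or);
# each condition needs its own parentheses.
mask = (df["product_group"] == "A") & (df["stock_qty"] > 500)
avg_high_stock_a = df[mask]["sales_qty"].mean()

print(avg_high_stock)
print(avg_high_stock_a)
```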
Summarizing Groups of Data
We can create a data summary separately for different groups in the data. It is quite similar to what we have done in the previous example. The only addition is grouping the data.
We group the rows by the distinct values in a column with the groupby() function. The following code groups the rows by product group.
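A minimal sketch of the grouping step on made-up data:

```python
import pandas as pd

# Made-up stand-in for the sales data.
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# groupby() returns a GroupBy object; no statistic is computed
# until an aggregation is called on it.
grouped = df.groupby("product_group")

print(grouped.ngroups)  # number of groups
print(grouped.size())   # number of rows in each group
```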
Once the groups are formed, we can calculate any statistic and describe or summarize the data. Let’s calculate the average sales quantity for each product group.
Name: sales_qty, dtype: float64
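As a sketch on made-up data (the values are illustrative, not the article's results), the per-group average can be computed like this:

```python
import pandas as pd

# Made-up stand-in for the sales data.
df = pd.DataFrame({
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Group the rows, select the column of interest, then aggregate.
avg_by_group = df.groupby("product_group")["sales_qty"].mean()
print(avg_by_group)
```

The result is a Series indexed by product group, which is why the output ends with the column name and dtype.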
We can also perform multiple aggregations in a single operation. In addition to the average sales quantities, let’s also count the number of products in each group. We use the agg() function, which allows for assigning names for aggregated columns as well.
df.groupby("product_group").agg(
    avg_sales_qty=("sales_qty", "mean"),
    number_of_products=("product_code", "count"),
)
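To see what named aggregation returns, here is a runnable sketch on a made-up stand-in for the sales data:

```python
import pandas as pd

# Made-up stand-in for the sales data.
df = pd.DataFrame({
    "product_code": [1001, 1002, 1003, 1004, 1005, 1006],
    "product_group": ["A", "A", "B", "C", "A", "B"],
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Each keyword is the name of an output column; each tuple is
# (input column, statistic to compute).
group_summary = df.groupby("product_group").agg(
    avg_sales_qty=("sales_qty", "mean"),
    number_of_products=("product_code", "count"),
)
print(group_summary)
```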
Data Distribution With a Matplotlib Histogram
Data visualization is another highly efficient technique for summarizing data. Matplotlib is a popular library in Python for exploring and summarizing data visually.
There are many different types of data visualizations. A histogram is used to check the data distribution of numeric columns. It divides the entire value range into discrete bins and counts the number of values in each bin. As a result, we get an overview of the distribution of the data.
Let’s create a histogram of the sales quantity column.
import matplotlib.pyplot as plt
In the first line, we import the pyplot interface of Matplotlib. The second line creates an empty figure object with the specified size. The third line plots the histogram of the sales quantity column on the figure object. The bins parameter determines the number of bins.
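The three steps described above can be sketched as follows. The figure size and number of bins used in the original code are not shown, so figsize=(10, 6) and bins=4 are assumptions, and the data is a made-up stand-in.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the sales data.
df = pd.DataFrame({
    "sales_qty": [120, 80, 200, 40, 160, 60],
})

# Create an empty figure; the size is an assumed value.
plt.figure(figsize=(10, 6))

# Plot the histogram; bins=4 is an assumed value.
# hist() also returns the count per bin and the bin edges.
counts, bin_edges, _ = plt.hist(df["sales_qty"], bins=4)

plt.xlabel("sales_qty")
plt.ylabel("number of products")

print(counts)
print(bin_edges)
# In a script, call plt.show() to display the figure;
# a notebook renders it automatically.
```

The returned counts always add up to the number of values plotted, which makes them a quick sanity check on the chart.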
Data Summary in Python
It is of crucial importance to understand the data at hand before proceeding to create data-based products. You can start with a data summary in Python. In this article, we have reviewed several examples with the pandas and Matplotlib libraries to summarize data.