Reading, writing, and manipulating tabular data: polars
This part of the course is inspired by the Polars Getting Started Guide.
Polars is a programming library for reading, writing, and manipulating tabular data.
It is a fast and efficient library written in Rust, with so-called Python bindings that allow it to be used from within Python code.
For this and the following chapter, we first create a micromamba environment that contains the necessary software tools.
Create the file envs/pystats.yaml
with the following content:
channels:
- conda-forge
dependencies:
- python =3.11
- polars =1.1
- altair =5.3
- altair_saver =0.5
- vegafusion =1.6
- vegafusion-python-embed =1.6
- vl-convert-python =1.5
- vega_datasets =0.9
- ipykernel =6.29
- scipy =1.14
Exercise
Now create a new micromamba environment named pystats
from this file.
The following steps are inspired by the official getting started guide of Polars.
Step 1: Setup the notebook
First, create a new file notebook.ipynb
under chapters/polars
and open it in VSCode.
In the opened notebook, select the pystats environment as the kernel with the button on the top right.
Step 2: Creating a dataframe from scratch
We can create a dataframe from scratch by using the pl.DataFrame
constructor.
In the first cell of the notebook, we import relevant modules:
import polars as pl
from datetime import date
Explanation
The first line imports the Polars library and provides it as a shorthand alias pl
.
The second line imports the datetime module from the Python standard library.
This module allows us to define dates.
Then, we create a second cell with the following content:
df = pl.DataFrame(
{
"a": [1, 2, 3],
"b": [
date(2019, 1, 1),
date(2021, 1, 2),
date(2023, 1, 3),
],
"c": [4.0, 5.0, 6.0],
"d": ["a", "b", "c"],
}
)
Explanation
The dictionary passed to the pl.DataFrame
constructor contains the column names as keys and lists of values as values.
Step 3: Inspecting the dataframe
By simply ending the cell with the variable name, the dataframe is shown below the cell.
Step 4: Reading and writing data frames
Polars supports reading and writing data frames from and to various file formats.
Let us write the dataframe df
to a CSV file:
df.write_csv("data.csv")
And subsequently read it back:
df2 = pl.read_csv("data.csv")
Step 5: Comparing data frames
We can compare two data frames by using the equals
method:
df.equals(df2)
Step 6: Expressions
Polars offers four central ways to manipulate data frames:
- select: select and manipulate columns, replacing the existing ones
- with_columns: select and manipulate columns, adding new ones
- filter: filter rows based on conditions
- group_by: group rows based on conditions

All of these take one or more expressions as arguments. The semantics of an expression depend on the context in which it is used.
Step 7: Selecting data
Let us select something in the example dataframe df
:
df.select(
pl.col("a") * 2,
pl.col("d").map_elements(lambda s: s + "x", return_dtype=str),
)
Explanation
The first expression multiplies the values in column a
by 2.
The second expression appends an x
to each value in column d
.
Returned is the modified dataframe with just the two columns.
Expressions in the select
or with_columns
context produce so-called Series
, which represent columns of the dataframe.
Both operations can contain multiple expressions, which may yield either single (scalar) values or series that have the same length as the dataframe has rows.
In the mixed case, the scalar values are broadcast (i.e. repeated) to match the number of rows.
Exercise
The sum
method of expressions computes the sum of all values in a column, yielding a single scalar value.
Extend above selection by an additional expression that computes the sum of column c
.
See how the value is broadcast across all rows because the other expressions are row-wise.
Select can also be used to remove columns from the dataframe:
df.select(pl.col("*").exclude("c", "d"))
Exercise
What is the meaning of the individual elements of this statement?
While select
limits the returned dataframe to the queried columns, with_columns
adds new columns or replaces existing ones:
df.with_columns(
(pl.col("a") * 2).alias("a_times_two"),
pl.col("d").map_elements(lambda s: s + "x", return_dtype=str).alias("d_with_x"),
)
Explanation
The alias method of expressions allows us to assign a reasonable name to each resulting column.
If we omitted the alias invocations, the expressions would instead overwrite the original columns a and d.
Step 8: Filtering data
Expressions in the filter
context have to produce a boolean series that has the same length as the dataframe has rows.
The rows for which the series is True
are kept, while the others are removed.
Let us filter the dataframe df
to keep only the rows where the value in column c
is at least 5.0:
df.filter(pl.col("c") >= 5.0)
Step 9: Accessing individual items
The item method of dataframes returns a single value; called without arguments, it requires the dataframe to consist of exactly one row and one column.
In combination with select and filter, this allows us to access individual items in the dataframe.
Let us select the value in the a
column where the d
column contains the value b
:
df.filter(pl.col("d") == "b").select(pl.col("a")).item()
Exercise
The is_between
method of expressions allows filtering for values that fall within a specific range.
Look up its usage in the Polars API docs.
Use it to filter for the row where the column b
is between the beginning of 2020 (date(2020, 1, 1)
) and the beginning of 2022 (date(2022, 1, 1)
).
This should be a single row.
Obtain the value of the column c
from that row.
Step 10: Grouping data
Let us first append some additional data to our dataframe:
extended_df = df.vstack(
pl.DataFrame(
{
"a": [5, 1, 2],
"b": [
date(2021, 1, 1),
date(2019, 1, 2),
date(2023, 1, 3),
],
"c": [None, 1.1, 2.0],
"d": ["a", "a", "c"],
}
)
)
Exercise
- What is the meaning and implication of the None in the c column?
- Display the resulting dataframe in the variable extended_df in your notebook.
Grouping of data can be useful to e.g. calculate summary statistics per group.
Let us group extended_df
by the values in column d
and calculate the mean of column a
per group:
extended_df.group_by("d").agg(
pl.col("c").mean()
)
Exercise
- What happens with the None value in column c?
- Now, group by the year. For this purpose, first add a new column (using with_columns) that holds the year of each row, derived from column b. Polars offers a special subset of expressions that perform operations on date or time objects, accessible via the attribute dt, see the docs. Assign a reasonable name to this column. Then, group by the year and calculate the minimum value of column a per group (use the expression method min() for this).