Substantiate that polars isn't copying data even though python reports a different id

If I have two tables A and B, how can I do a join of B into A in place so that A keeps all its data and is modified by that join without having to make a copy?
And can that join only take specified columns from B into A?
A:
┌─────┬─────┬───────┐
│ one ┆ two ┆ three │
╞═════╪═════╪═══════╡
│ a   ┆ 1   ┆ 3     │
│ b   ┆ 4   ┆ 6     │
│ c   ┆ 7   ┆ 9     │
│ d   ┆ 10  ┆ 12    │
│ e   ┆ 13  ┆ 15    │
│ f   ┆ 16  ┆ 18    │
└─────┴─────┴───────┘
B:
┌─────┬─────┬───────┬──────┐
│ one ┆ two ┆ three ┆ four │
╞═════╪═════╪═══════╪══════╡
│ a   ┆ 1   ┆ 3     ┆ yes  │
│ c   ┆ 7   ┆ 9     ┆ yes  │
│ f   ┆ 16  ┆ 18    ┆ yes  │
└─────┴─────┴───────┴──────┘
I'd like to left join A and B, keeping all data in A plus the four column of B, renamed as result.
With data.table I can do exactly this after reading A and B:
address(A)
# [1] "0x55fc74197910"
A[B, on = .(one, two), result := i.four]
A
#    one two three result
# 1:   a   1     3    yes
# 2:   b   4     6   <NA>
# 3:   c   7     9    yes
# 4:   d  10    12   <NA>
# 5:   e  13    15   <NA>
# 6:   f  16    18    yes
address(A)
# [1] "0x55fc74197910"
With polars in python:
A.join(B, on = ["one", "two"], how = 'left')
# shape: (6, 5)
# ┌─────┬─────┬───────┬─────────────┬──────┐
# │ one ┆ two ┆ three ┆ three_right ┆ four │
# │ --- ┆ --- ┆ ---   ┆ ---         ┆ ---  │
# │ str ┆ i64 ┆ i64   ┆ i64         ┆ str  │
# ╞═════╪═════╪═══════╪═════════════╪══════╡
# │ a   ┆ 1   ┆ 3     ┆ 3           ┆ yes  │
# │ b   ┆ 4   ┆ 6     ┆ null        ┆ null │
# │ c   ┆ 7   ┆ 9     ┆ 9           ┆ yes  │
# │ d   ┆ 10  ┆ 12    ┆ null        ┆ null │
# │ e   ┆ 13  ┆ 15    ┆ null        ┆ null │
# │ f   ┆ 16  ┆ 18    ┆ 18          ┆ yes  │
# └─────┴─────┴───────┴─────────────┴──────┘
A
# shape: (6, 3)
# ┌─────┬─────┬───────┐
# │ one ┆ two ┆ three │
# │ --- ┆ --- ┆ ---   │
# │ str ┆ i64 ┆ i64   │
# ╞═════╪═════╪═══════╡
# │ a   ┆ 1   ┆ 3     │
# │ b   ┆ 4   ┆ 6     │
# │ c   ┆ 7   ┆ 9     │
# │ d   ┆ 10  ┆ 12    │
# │ e   ┆ 13  ┆ 15    │
# │ f   ┆ 16  ┆ 18    │
# └─────┴─────┴───────┘
A is unchanged. If A is assigned again:
id(A)
# 139703375023552
A = A.join(B, on=["one", "two"], how="left")
id(A)
# 139703374967280
its memory address changes.

There is indeed no copy occurring there; if you think of the DataFrame class as a container (like a Python list), you can see the same sort of thing happening here: the container id changes, but the contents of the container are not copied:
# create a list/container with some object data
v1 = [object(), object(), object()]
print(v1)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v2 = v1[:]
print(v2)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v3 = v1[:2]
print(v3)
# [<object at 0x1686b6510>, <object at 0x1686b6490>]
(Each of v1, v2, and v3 will have different ids).
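The same reasoning answers the second part of the question: the join cannot be done in place, but you can restrict B to the columns you want before joining, which mirrors the data.table result. A minimal sketch using the A and B from above (select, rename, and join are all standard polars calls):
A = A.join(
    B.select(["one", "two", "four"]).rename({"four": "result"}),
    on=["one", "two"],
    how="left",
)
A is rebound to a new wrapper object with columns one, two, three, result, but the underlying columns one, two, and three are not copied into it.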

Properly groupby and filter with Polars

I have a df for my work with 3 main columns: cid1, cid2, cid3, and 7 more columns: cid4, cid5, etc.
cid1 and cid2 are int, the other columns are float.
Each combination of cid1 and cid2 is a workset with some rows, where the values of all the other columns differ. I want to filter df so that I keep only the rows with the maximum value in column cid3 for each combination of cid1 and cid2. cid4 and the following columns must be left unchanged.
This code helps me with one part of my task:
df = (df
    .groupby(["cid1", "cid2"])
    .agg([pl.max("cid3").alias("max_cid3")])
)
It returns only 3 columns (cid1, cid2, max_cid3) and drops all rows where cid3 is not maximal.
But I can't find how to keep all the other columns (cid4, etc.) for those rows unchanged.
df = (df
    .groupby(["cid1", "cid2"])
    .agg([pl.max("cid3").alias("max_cid3"), pl.col("cid4")])
)
I tried to add pl.col("cid4") to the list of aggregations, but the resulting column contains lists of cid4 values.
How can I do this properly? Maybe Polars has another way to do it than groupby?
In Pandas I can make it:
import pandas as pd
import numpy as np
df["max_cid3"] = df.groupby(['cid1', 'cid2'])['cid3'].transform(np.max)
And then filter df wherever cid3 == max_cid3.
But I can't find a way to do this in Polars.
Thank you!
In Polars you can use a window function:
df.with_column(
    pl.col("cid3").max().over(["cid1", "cid2"])
      .alias("max_cid3")
)
shape: (5, 6)
┌──────┬──────┬──────┬──────┬──────┬──────────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ max_cid3 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---      │
│ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64      │
╞══════╪══════╪══════╪══════╪══════╪══════════╡
│ 1    ┆ 1    ┆ 1    ┆ 4    ┆ 4    ┆ 1        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ 2    ┆ 2    ┆ 5    ┆ 5    ┆ 9        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ 2    ┆ 9    ┆ 6    ┆ 4    ┆ 9        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1    ┆ 1    ┆ 1    ┆ 7    ┆ 9    ┆ 1        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ 3    ┆ 1    ┆ 8    ┆ 3    ┆ 1        │
└──────┴──────┴──────┴──────┴──────┴──────────┘
You could also put it directly inside .filter()
df.filter(
    pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"])
)
Data used:
df = pl.DataFrame({
    "cid1": [1, 2, 2, 1, 3],
    "cid2": [1, 2, 2, 1, 3],
    "cid3": [1, 2, 9, 1, 1],
    "cid4": [4, 5, 6, 7, 8],
    "cid5": [4, 5, 4, 9, 3],
})
To cross-check against pandas:
>>> df.to_pandas().groupby(["cid1", "cid2"])["cid3"].transform("max")
0    1
1    9
2    9
3    1
4    1
Name: cid3, dtype: int64
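On this sample data, the filter approach from above should then keep only the rows holding each group's maximum (a sketch of the expected output):
shape: (4, 5)
┌──────┬──────┬──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╪══════╡
│ 1    ┆ 1    ┆ 1    ┆ 4    ┆ 4    │
│ 2    ┆ 2    ┆ 9    ┆ 6    ┆ 4    │
│ 1    ┆ 1    ┆ 1    ┆ 7    ┆ 9    │
│ 3    ┆ 3    ┆ 1    ┆ 8    ┆ 3    │
└──────┴──────┴──────┴──────┴──────┘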

how to update the polars dataframe

I want to update a polars DataFrame.
The polars syntax/command which I used for the purpose:
df[0, 'A'] = 'some value'
but the above code gives an error:
ValueError: cannot set with list/tuple as value; use a scalar value
I am using polars 0.13.55.
The above code was previously working in polars 0.13.51.
Minimal code to reproduce the problem:
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2'],
                   "ISP": ["N/A", "N/A"]})
isp_names = {'1.1.1.1': 'ABC', '2.2.2.2': 'XYZ'}
i = 0
for row in df.rows():
    for ip, isp in isp_names.items():
        if row[0] == ip:
            df[i, 'ISP'] = isp  # **This line gives the ValueError**
    i = i + 1
It looks as though you might be trying to update the values of a DataFrame, particularly where values are missing (the "N/A" values).
In addition to the advice of @jvz, I would recommend using a left join for your purposes, rather than a dictionary and a for loop. For loops are very slow and best avoided. By contrast, a left join will be very performant, and is built for exactly these types of situations.
We'll take this in steps.
First, let's expand your example.
df = pl.DataFrame(
    {"IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
     "ISP": ["N/A", "N/A", "PQR", "N/A"]}
)
df
shape: (4, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
Notice that we have three rows with "N/A" values, but one row that already has a valid value, "PQR".
Next, let's convert your dictionary of updated ISP values to a DataFrame, so that we can join the two DataFrames.
isp_df = pl.DataFrame(
    data=[[key, value] for key, value in isp_names.items()],
    columns=["IP", "ISP_updated"],
    orient="row",
)
isp_df
shape: (2, 2)
┌─────────┬─────────────┐
│ IP      ┆ ISP_updated │
│ ---     ┆ ---         │
│ str     ┆ str         │
╞═════════╪═════════════╡
│ 1.1.1.1 ┆ ABC         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ         │
└─────────┴─────────────┘
Now, we simply join the two DataFrames. The how="left" ensures that we keep all rows from df, even if there are no corresponding rows in isp_df.
df.join(isp_df, on="IP", how="left")
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP      ┆ ISP ┆ ISP_updated │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ str ┆ str         │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ null        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ null        │
└─────────┴─────┴─────────────┘
Notice the null values in ISP_updated. These are cases where you had no updated values for a particular IP value.
To complete the process, we use fill_null to copy the values from the ISP column into the ISP_updated column for those cases where isp_df had no updates for a particular IP value.
(
    df
    .join(isp_df, on="IP", how="left")
    .with_column(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
)
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP      ┆ ISP ┆ ISP_updated │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ str ┆ str         │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ PQR         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ N/A         │
└─────────┴─────┴─────────────┘
Now, your ISP_updated column has the updated values for each ISP. If you want, you can drop and rename columns so that your final column is labeled ISP.
(
    df
    .join(isp_df, on="IP", how="left")
    .with_column(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
    .drop("ISP")
    .rename({"ISP_updated": "ISP"})
)
shape: (4, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
As the size of your DataFrames gets large, you will definitely want to avoid using for loops. Using join will be far faster.
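If you need this pattern repeatedly, the join-update steps can be wrapped in a small helper. A sketch (the function name is mine, and it assumes updates uses the same column names as df):
def update_column(df: pl.DataFrame, updates: pl.DataFrame, key: str, col: str) -> pl.DataFrame:
    # left-join the updates, fill the gaps from the original column,
    # then drop/rename so the result keeps the original column name
    updated = f"{col}_updated"
    return (
        df.join(updates.rename({col: updated}), on=key, how="left")
        .with_column(pl.col(updated).fill_null(pl.col(col)))
        .drop(col)
        .rename({updated: col})
    )

isp_updates = pl.DataFrame({"IP": ["1.1.1.1", "2.2.2.2"], "ISP": ["ABC", "XYZ"]})
df = update_column(df, isp_updates, key="IP", col="ISP")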
I am unable to reproduce the error on the latest version (0.13.56), so updating polars may help.
May I also suggest two improvements to the code, where the second improvement avoids the issue you ran into altogether?
First, a more Pythonic version:
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2'],
                   "ISP": ["N/A", "N/A"]})
isp_names = {'1.1.1.1': 'ABC', '2.2.2.2': 'XYZ'}
for i, row in enumerate(df.rows()):
    df[i, 'ISP'] = isp_names[row[0]]
I.e., use enumerate to keep your i aligned with row, and do not loop over isp_names separately; simply look the value up by its key.
Second, Polars has an excellent expression system, meaning you do not have to pre-allocate the ISP column or write a loop:
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2']})
isp_names = {'1.1.1.1': 'ABC', '2.2.2.2': 'XYZ'}
df.with_column(pl.col("IP").apply(isp_names.get).alias("ISP"))
which returns df as:
shape: (2, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────┘
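One caveat with isp_names.get: an IP that is missing from the dictionary becomes null. If you prefer to keep the "N/A" placeholder for unknown IPs, a small variation passes a default (a sketch; the third IP is made up to show the fallback):
df = pl.DataFrame({"IP": ['1.1.1.1', '2.2.2.2', '9.9.9.9']})
df.with_column(
    pl.col("IP").apply(lambda ip: isp_names.get(ip, "N/A")).alias("ISP")
)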

How to fill n random rows after filtering in polars

I've been thinking for a few hours about how to fill n rows after filtering in Polars with some value.
To give you an example, I'd like to do the following operation in Polars.
Given a dataframe with column a that has 1s and 2s, we want to create column b that:
Has True if column a is 1.
Has, for the 2s in column a, the same number of True values as there are 1s in column a (a kind of stratification). The rows that receive True should be chosen at random.
Has False for the rest of the rows.
This is how I can do it in pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [2, 2, 2, 1, 2, 1]
})
df
   a
0  2
1  2
2  2
3  1
4  2
5  1
n = df.shape[0]
n_1 = df['a'].value_counts()[1]
n_2 = n - n_1
df['b'] = False
df.loc[df['a'] == 1, 'b'] = True
idx = df.loc[df['a'] == 2].index[np.random.choice(n_2, n_1, replace=False)]
df.loc[idx, "b"] = True
df
   a      b
0  2  False
1  2  False
2  2   True
3  1   True
4  2   True
5  1   True
Any help appreciated!
In general, I recommend avoiding "index"-type strategies, as they tend to be slow and inefficient. Also, we want to avoid sorting and re-sorting large datasets, particularly if they have lots of columns.
So instead, what we'll do is construct column b separately from the original DataFrame, and then insert the finished column b into the original DataFrame.
Since you are transitioning from Pandas, I'll walk through how we'll do this in Polars, and print the results at each step. (For your final solution, you can combine many of these intermediate steps and eliminate the implicit print statements after each step.)
Data
I'm going to expand your dataset, so that it has more columns than is strictly needed. This will show us how to isolate the columns we need, and how to put the final result back into your DataFrame.
import polars as pl
df = pl.DataFrame({
    "a": [2, 2, 2, 1, 2, 1, 2],
    "c": ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
    "d": [6.0, 5, 4, 3, 2, 1, 0],
})
df
shape: (7, 3)
┌─────┬───────┬─────┐
│ a   ┆ c     ┆ d   │
│ --- ┆ ---   ┆ --- │
│ i64 ┆ str   ┆ f64 │
╞═════╪═══════╪═════╡
│ 2   ┆ one   ┆ 6.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ two   ┆ 5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ three ┆ 4.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ four  ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ five  ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ six   ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ seven ┆ 0.0 │
└─────┴───────┴─────┘
Constructing column b
First, we'll create a new DataFrame using only column a and add row numbers to track the original position of each element. We'll then sort the 1s to the bottom - we'll see why in a moment.
df_a = df.select('a').with_row_count().sort('a', reverse=True)
df_a
shape: (7, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ i64 │
╞════════╪═════╡
│ 0      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3      ┆ 1   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5      ┆ 1   │
└────────┴─────┘
Next we'll count the 1s and 2s using the value_counts method, which creates a new DataFrame with the results.
values = df_a.get_column('a').value_counts().sort('a')
values
shape: (2, 2)
┌─────┬────────┐
│ a   ┆ counts │
│ --- ┆ ---    │
│ i64 ┆ u32    │
╞═════╪════════╡
│ 1   ┆ 2      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 5      │
└─────┴────────┘
So we have two 1s and five 2s. We'll create two variables with this information that we'll use later.
nr_1, nr_2 = values.get_column('counts')
print(f"{nr_1=}", f"{nr_2=}")
nr_1=2 nr_2=5
Now we'll construct the top part of b, which corresponds to the five 2s. We'll need three False and two True values. We'll use the shuffle method to randomly shuffle the values. (You can set the seed= value according to your needs.)
b = (
    pl.repeat(True, nr_1, eager=True)
    .extend_constant(False, nr_2 - nr_1)
    .shuffle(seed=37)
)
b
shape: (5,)
Series: '' [bool]
[
true
false
false
false
true
]
Now let's extend b with the two True values that correspond to the 1s (that we previously sorted to the bottom of our df_a DataFrame.)
b = b.extend_constant(True, nr_1)
b
shape: (7,)
Series: '' [bool]
[
true
false
false
false
true
true
true
]
And we'll add this column to our df_a, to see how the values of a and b align.
df_a = df_a.select([
    pl.all(),
    b.alias("b")
])
df_a
shape: (7, 3)
┌────────┬─────┬───────┐
│ row_nr ┆ a   ┆ b     │
│ ---    ┆ --- ┆ ---   │
│ u32    ┆ i64 ┆ bool  │
╞════════╪═════╪═══════╡
│ 0      ┆ 2   ┆ true  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1      ┆ 2   ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2      ┆ 2   ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4      ┆ 2   ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6      ┆ 2   ┆ true  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3      ┆ 1   ┆ true  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5      ┆ 1   ┆ true  │
└────────┴─────┴───────┘
We see that our two 1s at the bottom of column a both correspond to a True value in b. And we see that two of the 2s in column a correspond to True values, while the remaining values are False.
Adding column b back to our original DataFrame
All that's left to do is restore the original sort order, and insert column b into our original DataFrame.
df_a = df_a.sort("row_nr")
df = df.select([
    pl.all(),
    df_a.get_column("b")
])
df
shape: (7, 4)
┌─────┬───────┬─────┬───────┐
│ a   ┆ c     ┆ d   ┆ b     │
│ --- ┆ ---   ┆ --- ┆ ---   │
│ i64 ┆ str   ┆ f64 ┆ bool  │
╞═════╪═══════╪═════╪═══════╡
│ 2   ┆ one   ┆ 6.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ two   ┆ 5.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ three ┆ 4.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ four  ┆ 3.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ five  ┆ 2.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ six   ┆ 1.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ seven ┆ 0.0 ┆ true  │
└─────┴───────┴─────┴───────┘
Simplifying
If your original DataFrame is not large, you don't need to create a separate df_a -- you can sort (and re-sort) the original DataFrame. (But once your datasets get large, unnecessarily sorting lots of additional columns at each step can slow your computations.)
Also, you can combine many of the intermediate steps, as you see fit.
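For example, a condensed sketch of the whole pipeline, chaining the same calls used above (the seed is arbitrary):
df_a = df.select("a").with_row_count().sort("a", reverse=True)
nr_1, nr_2 = df_a.get_column("a").value_counts().sort("a").get_column("counts")
b = (
    pl.repeat(True, nr_1, eager=True)        # True values handed to some of the 2s
    .extend_constant(False, nr_2 - nr_1)     # the remaining 2s get False
    .shuffle(seed=37)                        # randomize which 2s receive True
    .extend_constant(True, nr_1)             # the 1s (sorted to the bottom) are always True
)
df = df.select([
    pl.all(),
    df_a.select([pl.all(), b.alias("b")]).sort("row_nr").get_column("b"),
])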

What's the equivalent of `pandas.Series.map(json.loads)` in polars?

Based on the polars documentation, one can use json_path_match to extract JSON fields into string series.
But can we do something like pandas.Series.map(json.loads) to convert the whole JSON string at once? One could then further convert the loaded JSON series into another dataframe with sane dtypes.
I know I can do it in pandas first, but I'm looking for a way to do it in polars.
I should first point out that there is a polars.read_json method. For example:
import polars as pl
import io
json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""
pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a   ┆ b    ┆ c          │
│ --- ┆ ---  ┆ ---        │
│ str ┆ f64  ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1   ┆ 10.0 ┆ [1, 2, 3]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20.0 ┆ [3, 4, 5]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘
But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply method, which will apply a callable function to each cell of a Polars Series.
For example, let's say we have the following JSON fields already loaded into a Series in a Polars DataFrame:
import json
import polars as pl
df = pl.DataFrame(
    {
        "json_val": [
            '{"a":"1", "b":10, "c":[1,2,3]}',
            '{"a":"2", "b":20, "c":[3,4,5]}',
            '{"a":"3.1", "b":30.2, "c":[8,8,8]}',
            '{"a":"4", "b":40.0, "c":[9,9,90]}',
        ]
    }
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val                            │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]}      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]}      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘
We can use apply and the json.loads function. In this example, that will yield a Series of type struct:
df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val                 │
│ ---                      │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]}       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]}       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]}     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]}      │
└──────────────────────────┘
(One caution, notice how column b has been truncated to an integer.)
Depending on the structure of your JSON, you may be able to also use the polars.DataFrame.unnest function to split the json_val struct column into separate columns.
df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a   ┆ b   ┆ c          │
│ --- ┆ --- ┆ ---        │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1   ┆ 10  ┆ [1, 2, 3]  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20  ┆ [3, 4, 5]  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30  ┆ [8, 8, 8]  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 40  ┆ [9, 9, 90] │
└─────┴─────┴────────────┘
Does this help get you started?
Edit: handling type-conversion issues
One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8 type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)
The json.loads method has two helpful keyword options parse_float and parse_int that will help in this case. We can use a simple lambda function to tell the json parser to leave integer and float columns as strings.
# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
    return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)
df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val                       │
│ ---                            │
│ struct[3]{'a', 'b', 'c'}       │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]}     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]}     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]}  │
└────────────────────────────────┘
Notice that all the integer and float values are left as strings, and remain so when we use the unnest function (the column headers below show "str").
df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a   ┆ b    ┆ c                │
│ --- ┆ ---  ┆ ---              │
│ str ┆ str  ┆ list [str]       │
╞═════╪══════╪══════════════════╡
│ 1   ┆ 10   ┆ ["1", "2", "3"]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20   ┆ ["3", "4", "5"]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"]  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘
From this point, you can use Polars' cast expression to convert the strings to the specific numeric types that you want. Here's a Stack Overflow question that can help with the cast.
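For example, a sketch of that final cast step (choosing Float64 for both columns, since column a contains "3.1"; pick whatever dtypes suit your data):
(
    df
    .select(pl.col("json_val").apply(f=json_translate))
    .unnest("json_val")
    .with_columns([
        pl.col("a").cast(pl.Float64),
        pl.col("b").cast(pl.Float64),
    ])
)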

how to replace pandas df.rank(axis=1) with polars

Alpha factors sometimes need a cross-sectional rank, like this:
import pandas as pd
df = pd.DataFrame(some_data)
df.rank(axis=1, pct=True)
How can this be implemented efficiently with polars?
A polars DataFrame has the following properties:
columns consist of homogeneous data (e.g. every column is a single type).
rows consist of heterogeneous data (e.g. data types on a row may differ).
For this reason polars does not want the axis=1 API from pandas. It does not make much sense to compute the rank across numeric, string, boolean, or even more complex nested types like structs and lists.
Pandas solves this by giving you a numeric_only keyword argument.
Polars is more opinionated and wants to nudge you toward using the expression API.
Expression
Polars expressions work on columns that have the guarantee that they consist of homogeneous data. Columns have this guarantee, rows in a DataFrame not so much. Luckily we have a data type that has the guarantee that the rows are homogeneous: pl.List data type.
Let's say we have the following data:
grades = pl.DataFrame({
    "student": ["bas", "laura", "tim", "jenny"],
    "arithmetic": [10, 5, 6, 8],
    "biology": [4, 6, 2, 7],
    "geography": [8, 4, 9, 7]
})
print(grades)
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ ---     ┆ ---        ┆ ---     ┆ ---       │
│ str     ┆ i64        ┆ i64     ┆ i64       │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         │
└─────────┴────────────┴─────────┴───────────┘
If we want to compute the rank of all the columns except for student, we can collect those columns into a list data type:
grades.select([
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
])
This would give:
shape: (4, 1)
┌────────────┐
│ all_grades │
│ ---        │
│ list [i64] │
╞════════════╡
│ [10, 4, 8] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6, 4]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [6, 2, 9]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 7, 7]  │
└────────────┘
Running polars expressions on list elements
We can run any polars expression on the elements of a list with the arr.eval expression! These expressions run entirely on polars' query engine and can run in parallel, so they will be super fast.
Polars doesn't provide a keyword argument to compute the percentages of the ranks, but because expressions are so versatile we can create our own percentage-rank expression.
Note that we must select the list's element from the context when we apply expressions over list elements; any col()/first() selection suffices.
# the percentage rank expression
rank_pct = pl.col("").rank(reverse=True) / pl.col("").count()
grades.with_column(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_grades"),
    # compute the rank by calling `arr.eval`
    pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])
This outputs:
shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank                    │
│ ---     ┆ ---        ┆ ---     ┆ ---       ┆ ---                            │
│ str     ┆ i64        ┆ i64     ┆ i64       ┆ list [f32]                     │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         ┆ [0.333333, 1.0, 0.666667]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         ┆ [0.666667, 0.333333, 1.0]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         ┆ [0.666667, 1.0, 0.333333]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘
Note that this solution works for any expression/operation you want to apply row-wise.
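For example, a sketch that demeans each row with the same pattern (the demean expression is my own, built from the same empty col("") selection as rank_pct):
# subtract the row mean from every element of the list
demean = pl.col("") - pl.col("").mean()
grades.with_column(
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
    pl.all().exclude("all_grades"),
    pl.col("all_grades").arr.eval(demean, parallel=True).alias("grades_demeaned")
])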