Alpha factors sometimes need a cross-sectional rank, like this:
import pandas as pd
df = pd.DataFrame(some_data)
df.rank(axis=1, pct=True)
How can I implement this efficiently with polars?
A polars DataFrame has the following properties:
columns consist of homogeneous data (i.e. every column is a single type).
rows consist of heterogeneous data (i.e. data types within a row may differ).
For this reason polars does not offer the axis=1 API from pandas. It does not make much sense to compute a rank across numeric, string, boolean, or even more complex nested types like structs and lists.
Pandas solves this by giving you a numeric_only keyword argument.
Polars is more opinionated and wants to nudge you toward the expression API.
Expression
Polars expressions work on columns, which are guaranteed to consist of homogeneous data. Rows in a DataFrame carry no such guarantee. Luckily there is a data type that does guarantee homogeneous data per row: the pl.List data type.
Let's say we have the following data:
grades = pl.DataFrame({
"student": ["bas", "laura", "tim", "jenny"],
"arithmetic": [10, 5, 6, 8],
"biology": [4, 6, 2, 7],
"geography": [8, 4, 9, 7]
})
print(grades)
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas ┆ 10 ┆ 4 ┆ 8 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ laura ┆ 5 ┆ 6 ┆ 4 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ 6 ┆ 2 ┆ 9 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny ┆ 8 ┆ 7 ┆ 7 │
└─────────┴────────────┴─────────┴───────────┘
If we want to compute the rank of all the columns except for student, we can collect those into a list data type:
grades.select([
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
])
This would give:
shape: (4, 1)
┌────────────┐
│ all_grades │
│ --- │
│ list [i64] │
╞════════════╡
│ [10, 4, 8] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6, 4] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [6, 2, 9] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 7, 7] │
└────────────┘
Running polars expressions on list elements
We can run any polars expression on the elements of a list with the arr.eval expression! These expressions run entirely on polars' query engine and can run in parallel, so they will be super fast.
Polars doesn't provide a keyword argument to compute the percentages of the ranks. But because expressions are so versatile, we can create our own percentage rank expression.
Note that we must select the list's element from the context when we apply expressions over list elements; any col("")/first() selection suffices.
# the percentage rank expression
rank_pct = pl.col("").rank(reverse=True) / pl.col("").count()
grades.with_column(
# create the list of homogeneous data
pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
# select all columns except the intermediate list
pl.all().exclude("all_grades"),
# compute the rank by calling `arr.eval`
pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])
This outputs:
shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list [f32] │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas ┆ 10 ┆ 4 ┆ 8 ┆ [0.333333, 1.0, 0.666667] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura ┆ 5 ┆ 6 ┆ 4 ┆ [0.666667, 0.333333, 1.0] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ 6 ┆ 2 ┆ 9 ┆ [0.666667, 1.0, 0.333333] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny ┆ 8 ┆ 7 ┆ 7 ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘
Note that this solution works for any expression/operation you want to apply row-wise.
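Mapping this back to the cross-sectional pct-rank question at the top, here is a minimal sketch of the same pattern. The date/ticker column names are hypothetical, and rank() / count() gives an ascending percentage rank similar to pandas' rank(axis=1, pct=True):
import polars as pl

factors = pl.DataFrame({
    "date": ["2022-01-03", "2022-01-04"],
    "AAPL": [0.5, 0.1],
    "MSFT": [0.3, 0.9],
    "GOOG": [0.8, 0.4],
})

# per-element pct rank, evaluated inside each list
rank_pct = pl.col("").rank() / pl.col("").count()

factors.with_column(
    # collect all ticker columns into one homogeneous list column
    pl.concat_list(pl.all().exclude("date")).alias("factor_rank")
).select([
    pl.all().exclude("factor_rank"),
    pl.col("factor_rank").arr.eval(rank_pct, parallel=True).alias("factor_rank_pct"),
])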
Related
If I have two tables A and B, how can I do a join of B into A in place so that A keeps all its data and is modified by that join without having to make a copy?
And can that join only take specified columns from B into A?
A:
┌─────┬─────┬───────┐
│ one ┆ two ┆ three │
╞═════╪═════╪═══════╡
│ a ┆ 1 ┆ 3 │
│ b ┆ 4 ┆ 6 │
│ c ┆ 7 ┆ 9 │
│ d ┆ 10 ┆ 12 │
│ e ┆ 13 ┆ 15 │
│ f ┆ 16 ┆ 18 │
└─────┴─────┴───────┘
B:
┌─────┬─────┬───────┬──────┐
│ one ┆ two ┆ three ┆ four │
╞═════╪═════╪═══════╪══════╡
│ a ┆ 1 ┆ 3 ┆ yes │
│ c ┆ 7 ┆ 9 ┆ yes │
│ f ┆ 16 ┆ 18 ┆ yes │
└─────┴─────┴───────┴──────┘
I'd like to left join A and B, keeping all data in A and the four column of B, renamed as result.
With data.table I can do exactly this after reading A and B:
address(A)
# [1] "0x55fc74197910"
A[B, on = .(one, two), result := i.four]
A
# one two three result
# 1: a 1 3 yes
# 2: b 4 6 <NA>
# 3: c 7 9 yes
# 4: d 10 12 <NA>
# 5: e 13 15 <NA>
# 6: f 16 18 yes
address(A)
# [1] "0x55fc74197910"
With polars in python:
A.join(B, on = ["one", "two"], how = 'left')
# shape: (6, 5)
# ┌─────┬─────┬───────┬─────────────┬──────┐
# │ one ┆ two ┆ three ┆ three_right ┆ four │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═══════╪═════════════╪══════╡
# │ a ┆ 1 ┆ 3 ┆ 3 ┆ yes │
# │ b ┆ 4 ┆ 6 ┆ null ┆ null │
# │ c ┆ 7 ┆ 9 ┆ 9 ┆ yes │
# │ d ┆ 10 ┆ 12 ┆ null ┆ null │
# │ e ┆ 13 ┆ 15 ┆ null ┆ null │
# │ f ┆ 16 ┆ 18 ┆ 18 ┆ yes │
# └─────┴─────┴───────┴─────────────┴──────┘
A
# shape: (6, 3)
# ┌─────┬─────┬───────┐
# │ one ┆ two ┆ three │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═══════╡
# │ a ┆ 1 ┆ 3 │
# │ b ┆ 4 ┆ 6 │
# │ c ┆ 7 ┆ 9 │
# │ d ┆ 10 ┆ 12 │
# │ e ┆ 13 ┆ 15 │
# │ f ┆ 16 ┆ 18 │
# └─────┴─────┴───────┘
A is unchanged. If A is assigned again:
id(A)
# 139703375023552
A = A.join(B, on=["one", "two"], how="left")
id(A)
# 139703374967280
its memory address changes.
There is indeed no copy occurring there; if you think of the DataFrame class as a container (like a python list), you can see the same sort of thing happening here - the container id changes, but the contents of the container are not copied:
# create a list/container with some object data
v1 = [object(), object(), object()]
print(v1)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v2 = v1[:]
print(v2)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v3 = v1[:2]
print(v3)
# [<object at 0x1686b6510>, <object at 0x1686b6490>]
(Each of v1, v2, and v3 will have different ids).
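As for the second part of the question (taking only specified columns from B and renaming four to result), a minimal sketch that reuses the A and B frames from above; it rebinds A rather than modifying it in place:
A = A.join(
    # keep only the join keys plus "four", renamed to "result"
    B.select([pl.col("one"), pl.col("two"), pl.col("four").alias("result")]),
    on=["one", "two"],
    how="left",
)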
I have a df with 3 main columns: cid1, cid2, cid3, and 7 more columns: cid4, cid5, etc.
cid1 and cid2 are int, the other columns are float.
Each combination of cid1 and cid2 is a workset with several rows in which the values of all the other columns differ. I want to filter df so that, for each combination of cid1 and cid2, I keep only the rows with the maximum value in column cid3. cid4 and the following columns must be left unchanged.
This code helps me with one part of my task:
df = (df
.groupby(["cid1", "cid2"])
.agg([pl.max("cid3").alias("max_cid3")])
)
It returns only 3 columns (cid1, cid2, max_cid3) and drops all rows where cid3 is not maximal.
But I can't find how to also keep all the other columns (cid4, etc.) for those rows, unchanged.
df = (df
.groupby(["cid1", "cid2"])
.agg([pl.max("cid3").alias("max_cid3"), pl.col("cid4")])
)
I tried adding pl.col("cid4") to the list of aggregations, but the resulting column contains lists of cid4 values.
How can I do this properly? Maybe polars has another way to do it than groupby?
In pandas I can do it like this:
import pandas as pd
import numpy as np
df["max_cid3"] = df.groupby(['cid1', 'cid2'])['cid3'].transform(np.max)
And then filter df where cid3 == max_cid3.
But I can't find a way to do this in polars.
Thank you!
In polars you can use a window function:
df.with_column(
pl.col("cid3").max().over(["cid1", "cid2"])
.alias("max_cid3")
)
shape: (5, 6)
┌──────┬──────┬──────┬──────┬──────┬──────────┐
│ cid1 ┆ cid2 ┆ cid3 ┆ cid4 ┆ cid5 ┆ max_cid3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╪══════╪══════════╡
│ 1 ┆ 1 ┆ 1 ┆ 4 ┆ 4 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 5 ┆ 5 ┆ 9 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 9 ┆ 6 ┆ 4 ┆ 9 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 ┆ 7 ┆ 9 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 1 ┆ 8 ┆ 3 ┆ 1 │
└──────┴──────┴──────┴──────┴──────┴──────────┘
You could also put it directly inside .filter():
df.filter(
pl.col("cid3") == pl.col("cid3").max().over(["cid1", "cid2"])
)
Data used:
df = pl.DataFrame({
"cid1": [1, 2, 2, 1, 3],
"cid2": [1, 2, 2, 1, 3],
"cid3": [1, 2, 9, 1, 1],
"cid4": [4, 5, 6, 7, 8],
"cid5": [4, 5, 4, 9, 3],
})
>>> df.to_pandas().groupby(["cid1", "cid2"])["cid3"].transform("max")
0 1
1 9
2 9
3 1
4 1
Name: cid3, dtype: int64
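If you also want to keep a max_cid3 column, as in the pandas transform approach, you can combine both steps. A minimal sketch (the helper column is dropped at the end):
(df
    .with_column(pl.col("cid3").max().over(["cid1", "cid2"]).alias("max_cid3"))
    .filter(pl.col("cid3") == pl.col("max_cid3"))
    .drop("max_cid3")
)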
I've been thinking for a few hours about how to fill n rows after filtering in polars with some value.
To give you an example, I'd like to do the following operation in polars.
Given a dataframe with a column a that contains 1s and 2s, we want to create a column b that:
Has True where column a is 1.
Has, among the rows where a is 2, the same number of True values as there are 1s in column a (a kind of stratification). The rows that receive True should be chosen at random.
Has the value False for the rest of the rows.
This is how I can do it in pandas:
df = pd.DataFrame({
'a': [2, 2, 2, 1, 2, 1]
})
df
a
0 2
1 2
2 2
3 1
4 2
5 1
n = df.shape[0]
n_1 = df['a'].value_counts()[1]
n_2 = n - n_1
df['b'] = False
df.loc[df['a'] == 1, 'b'] = True
idx = df.loc[df['a'] == 2].index[np.random.choice(n_2, n_1, replace=False)]
df.loc[idx, "b"] = True
df
a b
0 2 False
1 2 False
2 2 True
3 1 True
4 2 True
5 1 True
Any help appreciated!
In general, I recommend avoiding "index"-type strategies, as they tend to be slow and inefficient. Also, we want to avoid sorting and re-sorting large datasets, particularly if they have lots of columns.
So instead, what we'll do is construct column b separately from the original DataFrame, and then insert the finished column b into the original DataFrame.
Since you are transitioning from Pandas, I'll walk through how we'll do this in Polars, and print the results at each step. (For your final solution, you can combine many of these intermediate steps and eliminate the implicit print statements after each step.)
Data
I'm going to expand your dataset, so that it has more columns than is strictly needed. This will show us how to isolate the columns we need, and how to put the final result back into your DataFrame.
import polars as pl
df = pl.DataFrame({
"a": [2, 2, 2, 1, 2, 1, 2],
"c": ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
"d": [6.0, 5, 4, 3, 2, 1, 0],
})
df
shape: (7, 3)
┌─────┬───────┬─────┐
│ a ┆ c ┆ d │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═════╪═══════╪═════╡
│ 2 ┆ one ┆ 6.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ two ┆ 5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ three ┆ 4.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ four ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ five ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ six ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ seven ┆ 0.0 │
└─────┴───────┴─────┘
Constructing column b
First, we'll create a new DataFrame using only column a and add row numbers to track the original position of each element. We'll then sort the 1s to the bottom - we'll see why in a moment.
df_a = df.select('a').with_row_count().sort('a', reverse=True)
df_a
shape: (7, 2)
┌────────┬─────┐
│ row_nr ┆ a │
│ --- ┆ --- │
│ u32 ┆ i64 │
╞════════╪═════╡
│ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 1 │
└────────┴─────┘
Next we'll count the 1s and 2s using the value_counts method, which creates a new DataFrame with the results.
values = df_a.get_column('a').value_counts().sort('a')
values
shape: (2, 2)
┌─────┬────────┐
│ a ┆ counts │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪════════╡
│ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 5 │
└─────┴────────┘
So we have two 1s and five 2s. We'll create two variables with this information that we'll use later.
nr_1, nr_2 = values.get_column('counts')
print(f"{nr_1=}", f"{nr_2=}")
>>> print(f"{nr_1=}", f"{nr_2=}")
nr_1=2 nr_2=5
Now we'll construct the top part of b, which corresponds to the five 2s. We'll need three False and two True values. We'll use the shuffle method to randomly shuffle the values. (You can set the seed= value according to your needs.)
b = (
pl.repeat(True, nr_1, eager=True)
.extend_constant(False, nr_2 - nr_1)
.shuffle(seed=37)
)
b
shape: (5,)
Series: '' [bool]
[
true
false
false
false
true
]
Now let's extend b with the two True values that correspond to the 1s (that we previously sorted to the bottom of our df_a DataFrame.)
b = b.extend_constant(True, nr_1)
b
shape: (7,)
Series: '' [bool]
[
true
false
false
false
true
true
true
]
And we'll add this column to our df_a, to see how the values of a and b align.
df_a = df_a.select([
pl.all(),
b.alias("b")]
)
df_a
shape: (7, 3)
┌────────┬─────┬───────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ bool │
╞════════╪═════╪═══════╡
│ 0 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 2 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 1 ┆ true │
└────────┴─────┴───────┘
We see that our two 1s at the bottom of column a both correspond to a True value in b. And we see that two of the 2s in column a correspond to True values, while the remaining values are False.
Adding column b back to our original DataFrame
All that's left to do is restore the original sort order, and insert column b into our original DataFrame.
df_a = df_a.sort("row_nr")
df = df.select([
pl.all(),
df_a.get_column("b")
])
df
shape: (7, 4)
┌─────┬───────┬─────┬───────┐
│ a ┆ c ┆ d ┆ b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ bool │
╞═════╪═══════╪═════╪═══════╡
│ 2 ┆ one ┆ 6.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ two ┆ 5.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ three ┆ 4.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ four ┆ 3.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ five ┆ 2.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ six ┆ 1.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ seven ┆ 0.0 ┆ true │
└─────┴───────┴─────┴───────┘
Simplifying
If your original DataFrame is not large, you don't need to create a separate df_a -- you can sort (and re-sort) the original DataFrame. (But once your datasets get large, unnecessarily sorting lots of additional columns at each step can slow your computations.)
Also, you can combine many of the intermediate steps, as you see fit.
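For reference, a condensed sketch that chains the steps above into a few statements (same logic and seed, just without the intermediate prints):
# build the shuffled boolean column against a sorted copy of "a" only
df_a = df.select("a").with_row_count().sort("a", reverse=True)
nr_1, nr_2 = df_a.get_column("a").value_counts().sort("a").get_column("counts")

b = (
    pl.repeat(True, nr_1, eager=True)          # True for nr_1 of the 2s
    .extend_constant(False, nr_2 - nr_1)       # False for the remaining 2s
    .shuffle(seed=37)                          # randomize which 2s get True
    .extend_constant(True, nr_1)               # True for all the 1s
    .alias("b")
)

# restore the original row order and append b to the original DataFrame
df = df.select([
    pl.all(),
    df_a.select([pl.all(), b]).sort("row_nr").get_column("b"),
])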
Based on the polars documentation, one can use json_path_match to extract JSON fields into string Series.
But can we do something like pandas.Series.map(json.loads) to convert the whole JSON string at once? One can then further convert the loaded JSON series into another dataframe with sane dtypes.
I know I can do it first in pandas, but I'm looking for a way in polars.
I should first point out that there is a polars.read_json method. For example:
import polars as pl
import io
json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""
pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1 ┆ 10.0 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘
But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply method, which will apply a callable function to each cell of a Polars Series.
For example, let's say we have the following JSON fields already loaded into a Series in a Polars DataFrame:
import json
import polars as pl
df = pl.DataFrame(
{
"json_val": [
'{"a":"1", "b":10, "c":[1,2,3]}',
'{"a":"2", "b":20, "c":[3,4,5]}',
'{"a":"3.1", "b":30.2, "c":[8,8,8]}',
'{"a":"4", "b":40.0, "c":[9,9,90]}',
]
}
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘
We can use apply and the json.loads function. In this example, that will yield a Series of type struct:
df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]} │
└──────────────────────────┘
(One caution, notice how column b has been truncated to an integer.)
Depending on the structure of your JSON, you may be able to also use the polars.DataFrame.unnest function to split the json_val struct column into separate columns.
df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1 ┆ 10 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ [9, 9, 90] │
└─────┴─────┴────────────┘
Does this help get you started?
Edit: handling type-conversion issues
One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8 type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)
The json.loads method has two helpful keyword options parse_float and parse_int that will help in this case. We can use a simple lambda function to tell the json parser to leave integer and float columns as strings.
# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)
df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]} │
└────────────────────────────────┘
Notice that all the integer and float values are left as strings, and remain so when we use the unnest function (the column headers below show "str").
df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list [str] │
╞═════╪══════╪══════════════════╡
│ 1 ┆ 10 ┆ ["1", "2", "3"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ ["3", "4", "5"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘
From this point, you can use Polars' cast expression to convert the strings to the specific numeric types that you want. Here's a Stack Overflow question that can help with the cast.
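For example, a minimal sketch of that follow-up cast (the target dtypes below are an assumption about what you actually want):
(
    df.select(pl.col("json_val").apply(f=json_translate))
    .unnest("json_val")
    .with_columns([
        pl.col("a").cast(pl.Float64),
        pl.col("b").cast(pl.Float64),
        # cast each list element from str to int
        pl.col("c").arr.eval(pl.col("").cast(pl.Int64)),
    ])
)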
I have a pandas DataFrame / polars DataFrame / pyarrow Table with a string key column. You can assume the strings are random. I want to partition that dataframe into N smaller dataframes based on this key column.
With an integer column, I can just use df1 = df[df.key % N == 1], df2 = df[df.key % N == 2], etc.
My best guess at how to do that with a string column is to apply a hash function (e.g. summing the ascii values of the string) to convert it to an integer column, then use the modulus.
Please let me know what's the most efficient way this can be done in either Pandas, Polars or Pyarrow, ideally with pure columnar operations within the API. Doing a df.apply is likely too slow for my use case.
I would try using hash_rows to see how it performs on your dataset and computing platform. (Note that in the calculation, I'm effectively selecting only the key field and running hash_rows on that.)
N = 50
df = df.with_column(
pl.lit(df.select(['key']).hash_rows() % N).alias('hash')
)
I just ran this on a dataset with almost 49 million records on a 32-core system, and it completed within seconds. (The 'key' field in my dataset was last names of people.)
I should also note, there's a partition_by method that may be of help in the partitioning.
I have a small addition to @cbilot's answer. Polars has a hash expression, so computing a partition id is trivial.
If you combine that with partition_by, you can create the partitions at blazing speed:
df = pl.DataFrame({
"strings": ["A", "A", "B", "A"],
"payload": [1, 2, 3, 4]
})
N = 2
(df.with_columns([
(pl.col("strings").hash() % N).alias("partition_id")
]).partition_by("partition_id"))
[shape: (3, 3)
┌─────────┬─────────┬──────────────┐
│ strings ┆ payload ┆ partition_id │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u64 │
╞═════════╪═════════╪══════════════╡
│ A ┆ 1 ┆ 0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2 ┆ 0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 4 ┆ 0 │
└─────────┴─────────┴──────────────┘,
shape: (1, 3)
┌─────────┬─────────┬──────────────┐
│ strings ┆ payload ┆ partition_id │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u64 │
╞═════════╪═════════╪══════════════╡
│ B ┆ 3 ┆ 1 │
└─────────┴─────────┴──────────────┘]
The grouping and the materialization of the partitions will be done in parallel.
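If the goal is to then persist each partition, a minimal follow-up sketch (writing to parquet and the file-name pattern are just an example):
parts = df.with_columns([
    (pl.col("strings").hash() % N).alias("partition_id")
]).partition_by("partition_id")

for part in parts:
    # each element of `parts` is a regular DataFrame for one partition_id
    pid = part.get_column("partition_id")[0]
    part.write_parquet(f"partition_{pid}.parquet")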