How to filter df by value list with Polars?

I have a Polars df read from a CSV and I am trying to filter it by a list of values:
list = [1, 2, 4, 6, 48]
df = (
    pl.read_csv("bm.dat", sep=';', new_columns=["cid1", "cid2", "cid3"])
    .lazy()
    .filter((pl.col("cid1") in list) & (pl.col("cid2") in list))
    .collect()
)
I receive an error:
ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. Hint: use '&' or '|' to chain Expr together, not and/or.
But when I comment out .lazy() and .collect(), I receive the same error.
I also tried a single filter, .filter(pl.col("cid1") in list), and received the error again.
How to filter df by value list with Polars?

Your error relates to using the in operator. In Polars, you want to use the is_in Expression.
For example:
df = pl.DataFrame(
    {
        "cid1": [1, 2, 3],
        "cid2": [4, 5, 6],
        "cid3": [7, 8, 9],
    }
)
list = [1, 2, 4, 6, 48]
(
    df.lazy()
    .filter((pl.col("cid1").is_in(list)) & (pl.col("cid2").is_in(list)))
    .collect()
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ 4 ┆ 7 │
└──────┴──────┴──────┘
But if we attempt to use the in operator instead, we get our error again.
(
    df.lazy()
    .filter((pl.col("cid1") in list) & (pl.col("cid2") in list))
    .collect()
)
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/expr/expr.py", line 155, in __bool__
raise ValueError(
ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. Hint: use '&' or '|' to chain Expr together, not and/or.
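For background, the in operator forces Python to reduce each comparison to a plain True/False, which means calling bool() on a lazy Expr; that is exactly the __bool__ call shown in the traceback. A minimal sketch of the failure mode:

import polars as pl

eq = pl.col("cid1") == 1   # still a lazy Expr, not True/False
# bool(eq)                 # would raise the same ValueError as above
pl.col("cid1").is_in([1, 2, 4, 6, 48])  # the expression to use instead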


How to create polars DataFrame with Vec<Vec<f64>> as a Series

I desire a DataFrame like so:
timestamp | bids | asks | ticker
-------------------------------------------------------------------
1598215600 | [[10, 20], [15, 30]] | [[20, 10], [25, 20]] | "AAPL"
1598222400 | [[11, 25], [16, 35]] | [[22, 15], [28, 25]] | "MSFT"
1598229200 | [[12, 30], [18, 40]] | [[24, 20], [30, 30]] | "GOOG"
The bids Series has a Vec<Vec> structure, which in plain words is a vector of price/amount pairs, each pair itself being a two-element vector.
What is the required Rust code to create this? If possible answer in Rust, but Python works too; I guess I can recreate it.
I'm new to rust so it's possible this is not optimal.
From looking around it seems like ChunkedArray may be the way to go?
use polars::prelude::*;

// Build a list[list[i64]] column: each row holds a list of [price, amount] pairs.
fn build_column(rows: &Vec<[[i64; 2]; 2]>) -> Series {
    ListChunked::from_iter(rows.into_iter().map(|row| {
        ListChunked::from_iter(
            row.into_iter()
                .map(|values| Int64Chunked::from_slice("", values).into_series()),
        )
        .into_series()
    }))
    .into_series()
}
fn main() -> PolarsResult<()> {
    let asks = vec![
        [[20, 10], [25, 20]],
        [[22, 15], [28, 25]],
        [[24, 20], [30, 30]],
    ];
    let bids = vec![
        [[10, 20], [15, 30]],
        [[11, 25], [16, 35]],
        [[12, 30], [18, 40]],
    ];
    let df = df!(
        "timestamp" => [1598215600, 1598222400, 1598229200],
        "asks" => build_column(&asks),
        "bids" => build_column(&bids),
        "ticker" => ["AAPL", "MSFT", "GOOG"]
    );
    println!("{:?}", df);
    Ok(())
}
Ok(shape: (3, 4)
┌────────────┬──────────────────────┬──────────────────────┬────────┐
│ timestamp ┆ asks ┆ bids ┆ ticker │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ list[list[i64]] ┆ list[list[i64]] ┆ str │
╞════════════╪══════════════════════╪══════════════════════╪════════╡
│ 1598215600 ┆ [[20, 10], [25, 20]] ┆ [[10, 20], [15, 30]] ┆ AAPL │
│ 1598222400 ┆ [[22, 15], [28, 25]] ┆ [[11, 25], [16, 35]] ┆ MSFT │
│ 1598229200 ┆ [[24, 20], [30, 30]] ┆ [[12, 30], [18, 40]] ┆ GOOG │
└────────────┴──────────────────────┴──────────────────────┴────────┘)
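Since the question mentions Python would work too: in python-polars the same nested-list frame can be built directly from Python lists, which are inferred as list[list[i64]] columns. A minimal sketch (not part of the original answer):

import polars as pl

df = pl.DataFrame({
    "timestamp": [1598215600, 1598222400, 1598229200],
    "bids": [[[10, 20], [15, 30]], [[11, 25], [16, 35]], [[12, 30], [18, 40]]],
    "asks": [[[20, 10], [25, 20]], [[22, 15], [28, 25]], [[24, 20], [30, 30]]],
    "ticker": ["AAPL", "MSFT", "GOOG"],
})
# "bids" and "asks" come out as list[list[i64]], matching the Rust output above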

(Polars) How to get element from a column with list by index specified in another column

I have a dataframe with 2 columns, where the first column contains lists and the second column contains integer indexes. How do I get elements from the first column by the index specified in the second column? Or, even better, put that element in a third column. So, for example, how from this
a = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1}, {'lst': [4, 5, 6], 'ind': 2}])
┌───────────┬─────┐
│ lst ┆ ind │
│ --- ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 │
└───────────┴─────┘
you can get this
b = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1, 'list[ind]': 2}, {'lst': [4, 5, 6], 'ind': 2, 'list[ind]': 6}])
┌───────────┬─────┬───────────┐
│ lst ┆ ind ┆ list[ind] │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴───────────┘
Thanks.
Edit
As of python polars 0.14.24 this can be done more easily by
df.with_column(pl.col("lst").arr.get(pl.col("ind")).alias("list[ind]"))
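As a self-contained sketch of that one-liner (using the same example frame as the original answer below):

import polars as pl

df = pl.DataFrame({"lst": [[1, 2, 3], [4, 5, 6]], "ind": [1, 2]})
df.with_column(pl.col("lst").arr.get(pl.col("ind")).alias("list[ind]"))
# adds a third column holding lst[ind] per row: 2 and 6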
Original answer
You can use with_row_count() to add a row count column for grouping, then explode() the list so that each list element is on its own row. Then call take() over the row count column using over() to select the element from the subgroup.
df = pl.DataFrame({"lst": [[1, 2, 3], [4, 5, 6]], "ind": [1, 2]})
df = (
    df.with_row_count()
    .with_column(
        pl.col("lst").explode().take(pl.col("ind")).over(pl.col("row_nr")).alias("list[ind]")
    )
    .drop("row_nr")
)
shape: (2, 3)
┌───────────┬─────┬───────────┐
│ lst ┆ ind ┆ list[ind] │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴───────────┘
Here is my approach:
Create a custom function to get the value at the required index.
def get_elem(d):
    sel_idx = d[0]        # the index from 'idx'
    return d[1][sel_idx]  # the element of 'lista' at that index
Here is some test data.
df = pl.DataFrame({'lista':[[1,2,3],[4,5,6]],'idx':[1,2]})
Now let's create a struct of these two columns (it will create a dict) and apply the above function:
df.with_columns([
    pl.struct(['idx','lista']).apply(lambda x: get_elem(list(x.values()))).alias('req_elem')
])
shape: (2, 3)
┌───────────┬─────┬──────────┐
│ lista ┆ idx ┆ req_elem │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴──────────┘
If your number of unique idx elements isn't absolutely massive, you can build a when/then expression to select based on the value of idx using arr.get(idx):
import polars as pl
df = pl.DataFrame([{"lst": [1, 2, 3], "ind": 1}, {"lst": [4, 5, 6], "ind": 2}])
# create when/then expression for each unique index
idxs = df["ind"].unique()
ind, lst = pl.col("ind"), pl.col("lst") # makes expression generator look cleaner
expr = pl.when(ind == idxs[0]).then(lst.arr.get(idxs[0]))
for idx in idxs[1:]:
    expr = expr.when(ind == idx).then(lst.arr.get(idx))
expr = expr.otherwise(None)
df.select(expr)
shape: (2, 1)
┌─────┐
│ lst │
│ --- │
│ i64 │
╞═════╡
│ 2 │
├╌╌╌╌╌┤
│ 6 │
└─────┘
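To keep the original columns and attach the result as a third column instead of selecting it alone, the same expression can be passed through with_column (a sketch, reusing expr from above):

df.with_column(expr.alias("list[ind]"))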

Combine 2 different sized arrays element-wise based on index pairing array

Say we have 2 arrays of unique values:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) # any values are possible,
b = np.array([0, 11, 12, 13, 14, 15, 16, 17, 18, 19]) # sorted values are for demonstration
, where a[0] corresponds to b[0] (0 ↔ 0), a[1] to b[1] (1 ↔ 11), a[2] to b[2] (2 ↔ 12), etc.
Then, due to some circumstances, we randomly lost some of the data and picked up noise elements in both a & b. Now the 'useful data' in a and b is 'eroded' like this:
a = np.array([0, 1, 313, 2, 3, 4, 5, 934, 6, 8, 9, 730, 241, 521])
b = np.array([112, 514, 11, 13, 16, 955, 17, 18, 112])
The noise elements have a negligible probability of coinciding with any of the 'useful data'. So, by searching for them, we can find the surviving useful values and define the 'index pairing array':
cor_tab = np.array([[1,2], [4,3], [8,4], [9,7]])
which, when applied, yields the pairs of 'useful data' that are left:
np.column_stack((a[cor_tab[:,0]], b[cor_tab[:,1]]))
array([[1, 11],
[3, 13],
[6, 16],
[8, 18]])
The question: given the 'eroded' a and b, how to combine them into a numpy array such that:
- values indexed in cor_tab are paired in the same column/row,
- lost values are treated as -1,
- noise as 'don't care', and
- the array looks like this:
[[ -1 112],
[ 0 514],
[ 1 11],
[313 -1],
[ 2 -1],
[ 3 13],
[ 4 -1],
[ 5 -1],
[934 -1],
[ 6 16],
[ -1 955],
[ -1 17],
[ 8 18],
[ 9 -1],
[730 -1],
[241 -1],
[521 112]]
, where 'useful data' is at indices: 2, 5, 9, 12?
Initially I solved this in a dubious way:
import numpy as np
def combine(aa, bb, t):
    c0 = np.empty((0), int)
    c1 = np.empty((0), int)
    # add -1 & 'noise' at the left side:
    if t[0][0] > t[0][1]:
        c0 = np.append(c0, aa[: t[0][0]])
        c1 = np.append(c1, [np.append([-1] * (t[0][0] - t[0][1]), bb[: t[0][1]])])
    else:
        c0 = np.append(c0, [np.append([-1] * (t[0][1] - t[0][0]), aa[: t[0][0]])])
        c1 = np.append(c1, bb[: t[0][1]])
    ind_compenstr = t[0][0] - t[0][1]  # 'index compensator'
    for i, ii in enumerate(t):
        x = ii[0] - ii[1] - ind_compenstr
        # add -1 & 'noise' in the middle:
        if x > 0:
            c0 = np.append(c0, [aa[ii[0]-x:ii[0]]])
            c1 = np.append(c1, [[-1] * x])
        elif x == 0:
            c0 = np.append(c0, [aa[ii[0]-x:ii[0]]])
            c1 = np.append(c1, [bb[ii[1]-x:ii[1]]])
        else:
            x = abs(x)
            c0 = np.append(c0, [[-1] * x])
            c1 = np.append(c1, [bb[ii[1]-x:ii[1]]])
        # add useful elements:
        c0 = np.append(c0, aa[ii[0]])
        c1 = np.append(c1, bb[ii[1]])
        ind_compenstr += x
    # add -1 & 'noise' at the right side:
    l0 = len(aa) - t[-1][0]
    l1 = len(bb) - t[-1][1]
    if l0 > l1:
        c0 = np.append(c0, aa[t[-1][0] + 1:])
        c1 = np.append(c1, [np.append(bb[t[-1][1] + 1:], [-1] * (l0 - l1))])
    else:
        c0 = np.append(c0, [np.append(aa[t[-1][0] + 1:], [-1] * (l1 - l0))])
        c1 = np.append(c1, bb[t[-1][1] + 1:])
    return np.array([c0, c1])
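For reference, a usage sketch for the function above (assuming the eroded a, b and cor_tab from earlier are already defined):

out = combine(a, b, cor_tab)   # shape (2, n): row 0 from a, row 1 from b
print(out.T)                   # transpose to get the column-pair layout shown above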
But below I suggest another solution.
It is difficult to understand what the question wants, but IIUC, we first need to find the column size of the expected array, which contains the combined uncommon values between the two arrays (np.union1d), and then create an array of that size filled with -1 (np.full). Now, using np.searchsorted, the indices of one array's values in the other array can be obtained. Values that are not contained in the other array can be found with np.in1d in invert mode. So we can achieve the goal by indexing as:
union_ = np.union1d(a, b)
# [0 1 2 3 4 5 6 7 8 9]
res = np.full((2, union_.size), -1)
# [[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
# [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]]
arange_row_ids = np.arange(union_.size)
# [0 1 2 3 4 5 6 7 8 9]
col_inds = np.searchsorted(a, b)[np.in1d(b, a, invert=True)]
# np.searchsorted(a, b) ---> [1 3 6 7 7]
# np.in1d(b, a, invert=True) ---> [False False False True False]
# [7]
res[0, np.delete(arange_row_ids, col_inds + np.arange(col_inds.size))] = a
# np.delete(arange_row_ids, col_inds + np.arange(col_inds.size)) ---> [0 1 2 3 4 5 6 8 9]
# [[ 0 1 2 3 4 5 6 -1 8 9]
# [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1]]
col_inds = np.searchsorted(b, a)[np.in1d(a, b, invert=True)]
# np.searchsorted(b, a) ---> [0 0 1 1 2 2 2 4 5]
# np.in1d(a, b, invert=True) ---> [ True False True False True True False False True]
# [0 1 2 2 5]
res[1, np.delete(arange_row_ids, col_inds + np.arange(col_inds.size))] = b
# np.delete(arange_row_ids, col_inds + np.arange(col_inds.size)) ---> [1 3 6 7 8]
# [[ 0 1 2 3 4 5 6 -1 8 9]
# [-1 1 -1 3 -1 -1 6 7 8 -1]]
The question is not clear enough to tell whether this answer is the expected one, but I think it is a helpful basis for further modifications as needed.
Here's a partially vectorized solution:
import numpy as np

# this function is from Divakar's answer at
# https://stackoverflow.com/questions/38619143/convert-python-sequence-to-numpy-array-filling-missing-values,
# which I use as a helper:
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())[::-1]
    out = np.full(mask.shape, -1, dtype=int)
    out[mask] = np.concatenate(v)
    return out

# 2 arrays with eroded useful data and the index pairing array:
a = np.array([0, 1, 313, 2, 3, 4, 5, 934, 6, 8, 9, 730, 241, 521])
b = np.array([112, 514, 11, 13, 16, 955, 17, 18, 112])
cor_tab = np.array([[1, 2], [4, 3], [8, 4], [9, 7]])
# split each array at the corresponding indices in `cor_tab`:
aa = np.split(a, cor_tab[:, 0] + 1)
bb = np.split(b, cor_tab[:, 1] + 1)
# initiate 2 flat empty arrays:
aaa = np.empty((0), int)
bbb = np.empty((0), int)
# loop over the split arrays:
for i, j in zip(aa, bb):
    c = boolean_indexing([i, j])
    aaa = np.append(aaa, c[0])
    bbb = np.append(bbb, c[1])
ccc = np.array([aaa, bbb]).T
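To see why the right-aligned mask (the [::-1] in boolean_indexing) matters: each pair of segments is padded on the left, so the useful values sitting at the segment ends line up. A small illustration with the first pair of segments:

boolean_indexing([np.array([0, 1]), np.array([112, 514, 11])])
# array([[ -1,   0,   1],
#        [112, 514,  11]])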
In case of other types of data, here is another example. Let's take two arrays of letters:
a = np.array(['y', 'w', 'a', 'e', 'i', 'o', 'u', 'y', 'w', 'a', 'e', 'i', 'o', 'u'])
b = np.array(['t', 'h', 'b', 't', 'c', 'n', 's', 'j', 'p', 'z', 'n', 'h', 't', 's', 'm', 'p'])
, and index pairing array:
cor_tab = np.array([[2,0], [3,2], [4,3], [5,5], [6,6], [9,10], [11,12], [13,13]])
np.column_stack((a[cor_tab[:,0]], b[cor_tab[:,1]]))
array([['a', 't'], # useful data
['e', 'b'],
['i', 't'],
['o', 'n'],
['u', 's'],
['a', 'n'],
['i', 't'],
['u', 's']], dtype='<U1')
The only correction required is dtype='<U1' in boolean_indexing(). Result is:
[['y' '-'],
['w' '-'],
['a' 't'],
['-' 'h'],
['e' 'b'],
['i' 't'],
['-' 'c'],
['o' 'n'],
['u' 's'],
['-' 'j'],
['y' 'p'],
['w' 'z'],
['a' 'n'],
['e' 'h'],
['i' 't'],
['o' '-'],
['u' 's'],
['-' 'm'],
['-' 'p']]
It works for floats as well if the dtype in boolean_indexing() is changed to float.

How to perform computations easily between every column in a polars DataFrame and the mean of that column

Environment
macos: monterey
node: v18.1.0
nodejs-polars: 0.5.3
Goal
Subtract every column in a polars DataFrame with the mean of that column.
Pandas solution
In pandas the solution is very concise thanks to DataFrame.sub(other, axis='columns', level=None, fill_value=None). other is scalar, sequence, Series, or DataFrame:
df.sub(df.mean())
df - df.mean()
nodejs-polars solution
In the nodejs-polars function, however, other only seems to accept a Series, according to sub: (other) => wrap("sub", prepareOtherArg(other).inner()).
1. Prepare data
console.log(df)
┌─────────┬─────────┬─────────┬─────────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 │
└─────────┴─────────┴─────────┴─────────┘
console.log(df.mean())
┌─────────┬─────────┬─────────┬─────────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 13478.4 ┆ -16.0 ┆ 371.2 ┆ 176.0 │
└─────────┴─────────┴─────────┴─────────┘
2. First try
df.sub(df.mean())
Error: Failed to determine supertype of Int64 and Struct([Field { name: "A", dtype: Int32 }, Field { name: "B", dtype: Int32 }, Field { name: "C", dtype: Int32 }, Field { name: "D", dtype: Int32 }])
3. Second try
df.sub(pl.Series(df.mean().row(0)))
Program crashes due to memory problems.
4. Third try
After some investigation, I noticed these tests:
test("sub", () => {
const actual = pl.DataFrame({
"foo": [1, 2, 3],
"bar": [4, 5, 6]
}).sub(1);
const expected = pl.DataFrame({
"foo": [0, 1, 2],
"bar": [3, 4, 5]
});
expect(actual).toFrameEqual(expected);
});
test("sub:series", () => {
const actual = pl.DataFrame({
"foo": [1, 2, 3],
"bar": [4, 5, 6]
}).sub(pl.Series([1, 2, 3]));
const expected = pl.DataFrame({
"foo": [0, 0, 0],
"bar": [3, 3, 3]
});
expect(actual).toFrameEqual(expected);
});
nodejs-polars seems to be unable to complete this task gracefully right now. So my current solution is a bit cumbersome: perform operations column by column then concat the results.
pl.concat(df.columns.map((col) => df.select(col).sub(df.select(col).mean(0).toSeries())), {how:'horizontal'})
Is there a better or easier way to do it?
5. New try
I just came up with an easier solution, but it's hard to understand, and I'm still trying to figure out what happens under the hood.
df.select(pl.col('*').sub(pl.col('*').mean()))
You tagged this problem with [python-polars], so I'll provide a solution using Polars with Python. (Perhaps you can translate that to Node-JS.)
Starting with our data:
import polars as pl
df = pl.DataFrame(
{
"A": [13520, 13472, 13456, 13472, 13472],
"B": [-16, -16, -16, -16, -16],
"C": [384, 384, 368, 368, 352],
"D": [208, 176, 160, 160, 176],
}
)
df
shape: (5, 4)
┌───────┬─────┬─────┬─────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╪═════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 │
└───────┴─────┴─────┴─────┘
We can very concisely solve this problem:
df.with_columns([
    (pl.all() - pl.all().mean()).suffix('_centered')
])
shape: (5, 8)
┌───────┬─────┬─────┬─────┬────────────┬────────────┬────────────┬────────────┐
│ A ┆ B ┆ C ┆ D ┆ A_centered ┆ B_centered ┆ C_centered ┆ D_centered │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═════╪═════╪═════╪════════════╪════════════╪════════════╪════════════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 ┆ 41.6 ┆ 0.0 ┆ 12.8 ┆ 32.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 ┆ -6.4 ┆ 0.0 ┆ 12.8 ┆ 0.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 ┆ -22.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 ┆ -6.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 ┆ -6.4 ┆ 0.0 ┆ -19.2 ┆ 0.0 │
└───────┴─────┴─────┴─────┴────────────┴────────────┴────────────┴────────────┘
If you want to overwrite the columns, you can eliminate the suffix expression:
df.with_columns([
    (pl.all() - pl.all().mean())
])
shape: (5, 4)
┌───────┬─────┬───────┬───────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═════╪═══════╪═══════╡
│ 41.6 ┆ 0.0 ┆ 12.8 ┆ 32.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4 ┆ 0.0 ┆ 12.8 ┆ 0.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -22.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4 ┆ 0.0 ┆ -19.2 ┆ 0.0 │
└───────┴─────┴───────┴───────┘
Edit: essentially, the polars.all or the polars.col('*') replicates an entire expression for each column, so that:
pl.col('*') - pl.col('*').mean()
is syntactic sugar for:
[
pl.col('A') - pl.col('A').mean(),
pl.col('B') - pl.col('B').mean(),
pl.col('C') - pl.col('C').mean(),
pl.col('D') - pl.col('D').mean(),
]
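A runnable sketch of the spelled-out equivalent (using the df from the example above), which produces the same overwritten frame as the wildcard version:

df.with_columns([
    (pl.col(name) - pl.col(name).mean()) for name in df.columns
])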

Complex indirect slice indexing : how to do it?

I'm looking to find a vector formulation instead of a loop for the following problem.
import numpy as np
ny = 6; nx = 4; na = 2
aa = np.array(np.arange(ny*nx), dtype=np.int32)
aa.shape = (ny, nx)
print('aa : ', aa)
# ix1 has length nx
ix1 = np.array([0, 2, 1, 3])
For each value of the second index of aa (that is, each column), I want to take a slice of rows from aa that starts at ix1 for that column and has length na.
With a loop:
- 1
bb = np.empty([na, nx], dtype=np.int32)
for xx in np.arange(nx):
    bb[:, xx] = aa[ix1[xx]:ix1[xx]+na, xx]
print('bb : ', bb)
- 2
bb = np.empty([na, nx], dtype=np.int32)
for xx in np.arange(nx):
    bb[:, xx] = aa[slice(ix1[xx], ix1[xx]+na), xx]
print('bb : ', bb)
- 3
bb = np.empty([na, nx], dtype=np.int32)
for xx in np.arange(nx):
    bb[:, xx] = aa[np.s_[ix1[xx]:ix1[xx]+na], xx]
print('bb : ', bb)
Is there a vector form of this?
None of the following works:
print(np.ix_(ix1, ix1+na))
aa[np.ix_(ix1, ix1+na)]
print(np.s_[ix1:ix1+na])
aa[np.s_[ix1:ix1+na]]
print(slice(ix1, ix1+na))
aa[slice(ix1, ix1+na)]
print((slice(ix1, ix1+na), slice(None, None)))
aa[(slice(ix1, ix1+na), slice(None, None))]
Look at the problem cases. np.s_ is just a way of creating a slice object. It doesn't add any functionality:
In [562]: ix1
Out[562]: array([0, 2, 1, 3])
In [563]: slice(ix1,ix1+na)
Out[563]: slice(array([0, 2, 1, 3]), array([2, 4, 3, 5]), None)
In [564]: np.s_[ix1: ix1+na]
Out[564]: slice(array([0, 2, 1, 3]), array([2, 4, 3, 5]), None)
Using either as an index gives the same result (your previous loops showed the equivalence of these slice notations):
In [569]: aa[ix1:ix1+na]
Traceback (most recent call last):
File "<ipython-input-569-f4db64c86100>", line 1, in <module>
aa[ix1:ix1+na]
TypeError: only integer scalar arrays can be converted to a scalar index
While it's possible to create a slice object with array values, it does not work in an actual index.
Think of it as the equivalent of trying to create a range of numbers:
In [572]: np.arange(ix1[0], ix1[0]+na)
Out[572]: array([0, 1])
In [573]: np.arange(ix1, ix1+na)
Traceback (most recent call last):
File "<ipython-input-573-94cfee666466>", line 1, in <module>
np.arange(ix1, ix1+na)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
A range between 0 and 2 is fine, but not a range between the arrays. Slice indices must be scalars, not arrays.
linspace does allow us to create multidimensional ranges:
In [574]: np.linspace(ix1,ix1+na,2,endpoint=False, dtype=int)
Out[574]:
array([[0, 2, 1, 3],
[1, 3, 2, 4]])
As long as the number of values is the same (here 2), the other values are just a matter of scaling or offset.
In [576]: ix1 + np.arange(0,2)[:,None]
Out[576]:
array([[0, 2, 1, 3],
[1, 3, 2, 4]])
That 2d linspace index can be used to index the rows of aa, along with an arange for the columns:
In [579]: aa[Out[574],np.arange(4)]
Out[579]:
array([[ 0, 9, 6, 15],
[ 4, 13, 10, 19]], dtype=int32)
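Putting those two pieces together outside of an IPython session, a self-contained sketch using the names from the question:

rows = ix1 + np.arange(na)[:, None]   # shape (na, nx): row indices per column
bb = aa[rows, np.arange(nx)]          # pair with a broadcast column index
# bb -> [[ 0,  9,  6, 15],
#        [ 4, 13, 10, 19]]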
Basically, the only alternative to looping over multiple indexing operations is to construct a joint indexing array. Here that's easy to do; in the more general case, building that joint index might itself require concatenation.
For reference, here are aa and bb:
In [580]: aa
Out[580]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]], dtype=int32)
In [581]: bb
Out[581]:
array([[ 0, 9, 6, 15],
[ 4, 13, 10, 19]], dtype=int32)