How to create polars DataFrame with Vec<Vec<f64>> as a Series

I desire a DataFrame like so:
timestamp | bids | asks | ticker
-------------------------------------------------------------------
1598215600 | [[10, 20], [15, 30]] | [[20, 10], [25, 20]] | "AAPL"
1598222400 | [[11, 25], [16, 35]] | [[22, 15], [28, 25]] | "MSFT"
1598229200 | [[12, 30], [18, 40]] | [[24, 20], [30, 30]] | "GOOG"
The bids Series has a Vec<Vec<f64>> structure, which in plain words is a vector of pairs (inner vectors), each holding a price and an amount (two values).
What is the required Rust code to create this? If possible answer in Rust, but Python works too; I can probably recreate it from that.

I'm new to Rust, so it's possible this is not optimal.
From looking around, it seems like ChunkedArray may be the way to go?
use polars::prelude::*;

// Build a list[list[i64]] Series from rows of [price, amount] pairs.
fn build_column(rows: &Vec<[[i64; 2]; 2]>) -> Series {
    ListChunked::from_iter(rows.into_iter().map(|row| {
        // Each inner pair becomes an Int64 Series, collected into a list.
        ListChunked::from_iter(
            row.into_iter()
                .map(|values| Int64Chunked::from_slice("", values).into_series()),
        )
        .into_series()
    }))
    .into_series()
}
fn main() -> PolarsResult<()> {
    let asks = vec![
        [[20, 10], [25, 20]],
        [[22, 15], [28, 25]],
        [[24, 20], [30, 30]],
    ];
    let bids = vec![
        [[10, 20], [15, 30]],
        [[11, 25], [16, 35]],
        [[12, 30], [18, 40]],
    ];
    let df = df!(
        "timestamp" => [1598215600, 1598222400, 1598229200],
        "asks" => build_column(&asks),
        "bids" => build_column(&bids),
        "ticker" => ["AAPL", "MSFT", "GOOG"]
    );
    println!("{:?}", df);
    Ok(())
}
Output (the Ok(...) wrapper appears because df! returns a PolarsResult, which we Debug-print):
Ok(shape: (3, 4)
┌────────────┬──────────────────────┬──────────────────────┬────────┐
│ timestamp ┆ asks ┆ bids ┆ ticker │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ list[list[i64]] ┆ list[list[i64]] ┆ str │
╞════════════╪══════════════════════╪══════════════════════╪════════╡
│ 1598215600 ┆ [[20, 10], [25, 20]] ┆ [[10, 20], [15, 30]] ┆ AAPL │
│ 1598222400 ┆ [[22, 15], [28, 25]] ┆ [[11, 25], [16, 35]] ┆ MSFT │
│ 1598229200 ┆ [[24, 20], [30, 30]] ┆ [[12, 30], [18, 40]] ┆ GOOG │
└────────────┴──────────────────────┴──────────────────────┴────────┘)
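Since Python was acceptable too, for reference: in python-polars the same frame can be built directly from nested Python lists, which are inferred as list[list[i64]] columns. A minimal sketch:

import polars as pl

df = pl.DataFrame(
    {
        "timestamp": [1598215600, 1598222400, 1598229200],
        "asks": [
            [[20, 10], [25, 20]],
            [[22, 15], [28, 25]],
            [[24, 20], [30, 30]],
        ],
        "bids": [
            [[10, 20], [15, 30]],
            [[11, 25], [16, 35]],
            [[12, 30], [18, 40]],
        ],
        "ticker": ["AAPL", "MSFT", "GOOG"],
    }
)
print(df)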

Related

How to filter df by value list with Polars?

I have a Polars df from a csv and I try to filter it by value list:

list = [1, 2, 4, 6, 48]
df = (
    pl.read_csv("bm.dat", sep=';', new_columns=["cid1", "cid2", "cid3"])
    .lazy()
    .filter((pl.col("cid1") in list) & (pl.col("cid2") in list))
    .collect()
)
I receive an error:
ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. Hint: use '&' or '|' to chain Expr together, not and/or.
But when I comment out .lazy() and .collect(), I receive this error again.
I also tried only one filter, .filter(pl.col("cid1") in list), and received the error again.
Your error relates to using the in operator. In Polars, you want to use the is_in Expression.
For example:
df = pl.DataFrame(
    {
        "cid1": [1, 2, 3],
        "cid2": [4, 5, 6],
        "cid3": [7, 8, 9],
    }
)
list = [1, 2, 4, 6, 48]
(
    df.lazy()
    .filter((pl.col("cid1").is_in(list)) & (pl.col("cid2").is_in(list)))
    .collect()
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ 4 ┆ 7 │
└──────┴──────┴──────┘
But if we attempt to use the in operator instead, we get our error again.
(
    df.lazy()
    .filter((pl.col("cid1") in list) & (pl.col("cid2") in list))
    .collect()
)
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/expr/expr.py", line 155, in __bool__
    raise ValueError(
ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. Hint: use '&' or '|' to chain Expr together, not and/or.
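For completeness, once in is replaced with is_in, the eager API works as well, so you don't strictly need lazy()/collect(). A sketch reusing the df and list from above:

df.filter(pl.col("cid1").is_in(list) & pl.col("cid2").is_in(list))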

(Polars) How to get element from a column with list by index specified in another column

I have a dataframe with 2 columns, where the first column contains lists and the second column integer indexes. How do I get the element from the list in the first column at the index specified in the second column? Or even better, put that element in a 3rd column. So, for example, how can I go from this
a = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1}, {'lst': [4, 5, 6], 'ind': 2}])
┌───────────┬─────┐
│ lst ┆ ind │
│ --- ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 │
└───────────┴─────┘
to this:
b = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1, 'list[ind]': 2}, {'lst': [4, 5, 6], 'ind': 2, 'list[ind]': 6}])
┌───────────┬─────┬───────────┐
│ lst ┆ ind ┆ list[ind] │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴───────────┘
Thanks.
Edit
As of python polars 0.14.24 this can be done more easily by
df.with_column(pl.col("lst").arr.get(pl.col("ind")).alias("list[ind]"))
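As a self-contained sketch of that one-liner (assuming python-polars 0.14.24 or later):

import polars as pl

df = pl.DataFrame({"lst": [[1, 2, 3], [4, 5, 6]], "ind": [1, 2]})

# arr.get accepts an expression, so each row's index picks from that row's list
df.with_column(pl.col("lst").arr.get(pl.col("ind")).alias("list[ind]"))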
Original answer
You can use with_row_count() to add a row count column for grouping, then explode() the list so each list element is on its own row. Then call take() over the row count column using over() to select the element from each subgroup.
df = pl.DataFrame({"lst": [[1, 2, 3], [4, 5, 6]], "ind": [1, 2]})
df = (
df.with_row_count()
.with_column(
pl.col("lst").explode().take(pl.col("ind")).over(pl.col("row_nr")).alias("list[ind]")
)
.drop("row_nr")
)
shape: (2, 3)
┌───────────┬─────┬───────────┐
│ lst ┆ ind ┆ list[ind] │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴───────────┘
Here is my approach:
Create a custom function to get the value at the required index.
def get_elem(d):
    sel_idx = d[0]
    return d[1][sel_idx]
Here is some test data:
df = pl.DataFrame({'lista':[[1,2,3],[4,5,6]],'idx':[1,2]})
Now let's create a struct from these two columns (it will create a dict) and apply the above function:
df.with_columns([
    pl.struct(['idx', 'lista']).apply(lambda x: get_elem(list(x.values()))).alias('req_elem')
])
shape: (2, 3)
┌───────────┬─────┬──────────┐
│ lista ┆ idx ┆ req_elem │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴──────────┘
If your number of unique idx elements isn't absolutely massive, you can build a when/then expression to select based on the value of idx using arr.get(idx):
import polars as pl

df = pl.DataFrame([{"lst": [1, 2, 3], "ind": 1}, {"lst": [4, 5, 6], "ind": 2}])

# create when/then expression for each unique index
idxs = df["ind"].unique()
ind, lst = pl.col("ind"), pl.col("lst")  # makes expression generator look cleaner
expr = pl.when(ind == idxs[0]).then(lst.arr.get(idxs[0]))
for idx in idxs[1:]:
    expr = expr.when(ind == idx).then(lst.arr.get(idx))
expr = expr.otherwise(None)
df.select(expr)
shape: (2, 1)
┌─────┐
│ lst │
│ --- │
│ i64 │
╞═════╡
│ 2 │
├╌╌╌╌╌┤
│ 6 │
└─────┘
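Note that df.select(expr) returns just the computed column (it inherits the name lst). To keep the original columns and match the desired output, a with_columns variant should work; a sketch reusing the df and expr from above:

df.with_columns([expr.alias("list[ind]")])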

How to perform computations easily between every column in a polars DataFrame and the mean of that column

Environment
macos: monterey
node: v18.1.0
nodejs-polars: 0.5.3
Goal
Subtract the mean of each column from every column in a polars DataFrame.
Pandas solution
In pandas the solution is very concise thanks to DataFrame.sub(other, axis='columns', level=None, fill_value=None), where other can be a scalar, sequence, Series, or DataFrame:
df.sub(df.mean())
df - df.mean()
nodejs-polars solution
In nodejs-polars, however, other only seems to accept a Series, according to sub: (other) => wrap("sub", prepareOtherArg(other).inner()).
1. Prepare data
console.log(df)
┌─────────┬─────────┬─────────┬─────────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 │
└─────────┴─────────┴─────────┴─────────┘
console.log(df.mean())
┌─────────┬─────────┬─────────┬─────────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 13478.4 ┆ -16.0 ┆ 371.2 ┆ 176.0 │
└─────────┴─────────┴─────────┴─────────┘
2. First try
df.sub(df.mean())
Error: Failed to determine supertype of Int64 and Struct([Field { name: "A", dtype: Int32 }, Field { name: "B", dtype: Int32 }, Field { name: "C", dtype: Int32 }, Field { name: "D", dtype: Int32 }])
3. Second try
df.sub(pl.Series(df.mean().row(0)))
Program crashes due to memory problems.
4. Third try
After some investigation, I noticed these tests:
test("sub", () => {
const actual = pl.DataFrame({
"foo": [1, 2, 3],
"bar": [4, 5, 6]
}).sub(1);
const expected = pl.DataFrame({
"foo": [0, 1, 2],
"bar": [3, 4, 5]
});
expect(actual).toFrameEqual(expected);
});
test("sub:series", () => {
const actual = pl.DataFrame({
"foo": [1, 2, 3],
"bar": [4, 5, 6]
}).sub(pl.Series([1, 2, 3]));
const expected = pl.DataFrame({
"foo": [0, 0, 0],
"bar": [3, 3, 3]
});
expect(actual).toFrameEqual(expected);
});
nodejs-polars seems to be unable to complete this task gracefully right now. So my current solution is a bit cumbersome: perform operations column by column then concat the results.
pl.concat(
  df.columns.map((col) => df.select(col).sub(df.select(col).mean(0).toSeries())),
  {how: 'horizontal'}
)
Is there a better or easier way to do it?
5. New try
I just came up with an easier solution, but it's hard to understand, and I'm still trying to figure out what happens under the hood.
df.select(pl.col('*').sub(pl.col('*').mean()))
You tagged this problem with [python-polars], so I'll provide a solution using Polars with Python. (Perhaps you can translate that to Node-JS.)
Starting with our data:
import polars as pl

df = pl.DataFrame(
    {
        "A": [13520, 13472, 13456, 13472, 13472],
        "B": [-16, -16, -16, -16, -16],
        "C": [384, 384, 368, 368, 352],
        "D": [208, 176, 160, 160, 176],
    }
)
df
shape: (5, 4)
┌───────┬─────┬─────┬─────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╪═════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 │
└───────┴─────┴─────┴─────┘
We can very concisely solve this problem:
df.with_columns([
    (pl.all() - pl.all().mean()).suffix('_centered')
])
shape: (5, 8)
┌───────┬─────┬─────┬─────┬────────────┬────────────┬────────────┬────────────┐
│ A ┆ B ┆ C ┆ D ┆ A_centered ┆ B_centered ┆ C_centered ┆ D_centered │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═════╪═════╪═════╪════════════╪════════════╪════════════╪════════════╡
│ 13520 ┆ -16 ┆ 384 ┆ 208 ┆ 41.6 ┆ 0.0 ┆ 12.8 ┆ 32.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 384 ┆ 176 ┆ -6.4 ┆ 0.0 ┆ 12.8 ┆ 0.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13456 ┆ -16 ┆ 368 ┆ 160 ┆ -22.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 368 ┆ 160 ┆ -6.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13472 ┆ -16 ┆ 352 ┆ 176 ┆ -6.4 ┆ 0.0 ┆ -19.2 ┆ 0.0 │
└───────┴─────┴─────┴─────┴────────────┴────────────┴────────────┴────────────┘
If you want to overwrite the columns, you can eliminate the suffix expression:
df.with_columns([
    (pl.all() - pl.all().mean())
])
shape: (5, 4)
┌───────┬─────┬───────┬───────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═════╪═══════╪═══════╡
│ 41.6 ┆ 0.0 ┆ 12.8 ┆ 32.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4 ┆ 0.0 ┆ 12.8 ┆ 0.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -22.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4 ┆ 0.0 ┆ -3.2 ┆ -16.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ -6.4 ┆ 0.0 ┆ -19.2 ┆ 0.0 │
└───────┴─────┴───────┴───────┘
Edit: essentially, polars.all (or polars.col('*')) replicates the entire expression for each column, so that:
pl.col('*') - pl.col('*').mean()
is syntactic sugar for:
[
    pl.col('A') - pl.col('A').mean(),
    pl.col('B') - pl.col('B').mean(),
    pl.col('C') - pl.col('C').mean(),
    pl.col('D') - pl.col('D').mean(),
]
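To check this equivalence empirically, a small sketch using the df above (frame_equal is the DataFrame comparison method in polars of this vintage):

# The wildcard form and the hand-expanded list produce identical frames.
wildcard = df.select(pl.col('*') - pl.col('*').mean())
expanded = df.select(
    [
        pl.col('A') - pl.col('A').mean(),
        pl.col('B') - pl.col('B').mean(),
        pl.col('C') - pl.col('C').mean(),
        pl.col('D') - pl.col('D').mean(),
    ]
)
assert wildcard.frame_equal(expanded)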

Postgresql query dictionary of objects in JSONB field

I have a table in a PostgreSQL 9.5 database with a JSONB field that contains a dictionary in the following form:
{'1': {'id': 1,
       'length': 24,
       'date_started': '2015-08-25'},
 '2': {'id': 2,
       'length': 27,
       'date_started': '2015-09-18'},
 '3': {'id': 3,
       'length': 27,
       'date_started': '2015-10-15'}}
The number of elements in the dictionary (the '1', '2', etc.) may vary between rows.
I would like to be able to get the average of length using a single SQL query. Any suggestions on how to achieve this?
Use jsonb_each: it expands each row's top-level object into a set of (key, value) pairs, so grouping by the original json column averages length over the entries of each document. For the sample document, (24 + 27 + 27) / 3 = 26, which matches the second row below.
[local] #= SELECT json, AVG((v->>'length')::int)
FROM j, jsonb_each(json) js(k, v)
GROUP BY json;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────┐
│ json │ avg │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────┤
│ {"1": {"id": 1, "length": 240, "date_started": "2015-08-25"}, "2": {"id": 2, "length": 27, "date_started": "2015-09-18"}, "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}} │ 98.0000000000000000 │
│ {"1": {"id": 1, "length": 24, "date_started": "2015-08-25"}, "2": {"id": 2, "length": 27, "date_started": "2015-09-18"}, "3": {"id": 3, "length": 27, "date_started": "2015-10-15"}} │ 26.0000000000000000 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┘
(2 rows)
Time: 0,596 ms

Find object having the highest value in JSON in Postgresql

Assume we have a couple of objects in the database with an attribute data, where data consists of: {'gender' => {'male' => 40.0, 'female' => 30.0, 'undefined' => 30.0}}.
I would like to find only those objects whose gender => male value is the highest.
PostgreSQL 9.5
Assuming I understand your question correctly (example input/output would be useful):
WITH jsons(id, j) AS (
  VALUES
    (1, '{"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}}'::json),
    (2, '{"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}}'),
    (3, '{"gender": {"male": 0.0, "female": 30.0, "undefined": 30.0}}')
)
SELECT id, j
FROM jsons
WHERE (j->'gender'->>'male')::float8 = (
    SELECT MAX((j->'gender'->>'male')::float8)
    FROM jsons
);
┌────┬───────────────────────────────────────────────────────────────┐
│ id │ j │
├────┼───────────────────────────────────────────────────────────────┤
│ 1 │ {"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}} │
│ 2 │ {"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}} │
└────┴───────────────────────────────────────────────────────────────┘
(2 rows)