I have a series of events produced by different users over time.
How can I aggregate this series into groups of events that are close to each other? Two events are close (in the same window) if:
b.user = a.user
and b.time >= a.time
and b.time - a.time <= interval '1 month'
This is a recursive condition. For example, the following dataset:
CREATE TABLE pg_temp.Data
("event" int, "user" int, "date" date, "value" int)
;
INSERT INTO pg_temp.Data
("event", "user", "date", "value")
VALUES
(1, 1, '2017-01-01', 5),
(2, 1, '2017-01-07', 3),
(3, 1, '2017-02-09', 2),
(4, 1, '2017-03-12', 4),
(5, 1, '2017-04-03', 7),
(6, 1, '2017-05-01', 6),
(7, 2, '2017-01-05', 9),
(8, 2, '2017-01-12', 1),
(9, 2, '2017-03-24', 6)
;
select * from pg_temp.Data
should be reduced to something like:
[
{
"init": "2017-01-01",
"latest": "2017-01-07",
"events": [
1,
2
],
"user": 1,
"value": 8
},
{
"init": "2017-02-09",
"latest": "2017-02-09",
"events": [
3
],
"user": 1,
"value": 2
},
{
"init": "2017-03-12",
"latest": "2017-05-01",
"events": [
4,
5,
6
],
"user": 1,
"value": 17
},
{
"init": "2017-01-05",
"latest": "2017-01-12",
"events": [
7,
8
],
"user": 2,
"value": 10
},
{
"init": "2017-03-24",
"latest": "2017-03-24",
"events": [
9
],
"user": 2,
"value": 6
}
]
Where init and latest are the window's time range and value is the sum of values in the window.
Note that events 6 and 4 are more than one month apart, but they've been aggregated into the same group because event 5 is in between them.
Use window functions:
SELECT min(date) AS init,
       max(date) AS latest,
       array_agg(event) AS events,
       "user",
       sum(value) AS value
FROM (SELECT event,
             "user",
             date,
             value,
             count(grp_start)
               OVER (PARTITION BY "user" ORDER BY date) session_id
      FROM (SELECT event,
                   "user",
                   date,
                   value,
                   CASE
                     WHEN date
                          > lag(date, 1, timestamp '-infinity')
                              OVER (PARTITION BY "user" ORDER BY date)
                            + INTERVAL '1 month'
                     THEN 1
                   END grp_start
            FROM data
           ) tagged
     ) numbered
GROUP BY "user", session_id
ORDER BY "user", init;
This will result in:
┌────────────┬────────────┬─────────┬──────┬───────┐
│    init    │   latest   │ events  │ user │ value │
├────────────┼────────────┼─────────┼──────┼───────┤
│ 2017-01-01 │ 2017-01-07 │ {1,2}   │    1 │     8 │
│ 2017-02-09 │ 2017-02-09 │ {3}     │    1 │     2 │
│ 2017-03-12 │ 2017-05-01 │ {4,5,6} │    1 │    17 │
│ 2017-01-05 │ 2017-01-12 │ {7,8}   │    2 │    10 │
│ 2017-03-24 │ 2017-03-24 │ {9}     │    2 │     6 │
└────────────┴────────────┴─────────┴──────┴───────┘
(5 rows)
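To see how the windows are detected before the final aggregation, you can run the inner layers of the query on their own. This is just the tagged/numbered part of the query above shown per row; for the sample data it yields session_id 1, 1, 2, 3, 3, 3 for user 1 and 1, 1, 2 for user 2:
-- per-row view of the group markers and the running session number
SELECT event,
       "user",
       date,
       grp_start,
       count(grp_start) OVER (PARTITION BY "user" ORDER BY date) AS session_id
FROM (SELECT event,
             "user",
             date,
             CASE
               WHEN date > lag(date, 1, timestamp '-infinity')
                             OVER (PARTITION BY "user" ORDER BY date)
                           + INTERVAL '1 month'
               THEN 1
             END AS grp_start
      FROM data
     ) tagged
ORDER BY "user", date;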
A word of advice: it is a good idea not to use column names like user that are reserved words. If you forget to put them inside double quotes, surprising things will happen (try it out).
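To see the kind of surprise meant here: outside double quotes, user is the SQL keyword USER, so it no longer refers to your column at all. For example:
-- an unquoted user is equivalent to current_user, so this returns the
-- logged-in role name on every row, not the values of the "user" column
SELECT user FROM data;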
I have a Polars DataFrame read from a CSV and I am trying to filter it by a list of values:
list = [1, 2, 4, 6, 48]
df = (
    pl.read_csv("bm.dat", sep=';', new_columns=["cid1", "cid2", "cid3"])
    .lazy()
    .filter((pl.col("cid1") in list) & (pl.col("cid2") in list))
    .collect()
)
I receive an error:
ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. Hint: use '&' or '|' to chain Expr together, not and/or.
But when I comment out .lazy() and .collect(), I receive the same error.
I also tried just one filter, .filter(pl.col("cid1") in list), and received the error again.
How do I filter a DataFrame by a list of values with Polars?
Your error relates to using the in operator. In Polars, you want to use the is_in Expression.
For example:
df = pl.DataFrame(
    {
        "cid1": [1, 2, 3],
        "cid2": [4, 5, 6],
        "cid3": [7, 8, 9],
    }
)
list = [1, 2, 4, 6, 48]
(
    df.lazy()
    .filter((pl.col("cid1").is_in(list)) & (pl.col("cid2").is_in(list)))
    .collect()
)
shape: (1, 3)
┌──────┬──────┬──────┐
│ cid1 ┆ cid2 ┆ cid3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ 4 ┆ 7 │
└──────┴──────┴──────┘
But if we attempt to use the in operator instead, we get our error again.
(
    df.lazy()
    .filter((pl.col("cid1") in list) & (pl.col("cid2") in list))
    .collect()
)
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/expr/expr.py", line 155, in __bool__
raise ValueError(
ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. Hint: use '&' or '|' to chain Expr together, not and/or.
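The underlying cause: Python's in operator makes the list compare the expression against each element with ==, which in Polars produces another lazy Expr, and Python then has to reduce that result to a plain True/False. That truthiness check is what calls Expr.__bool__ and raises the error; a minimal illustration:
import polars as pl

# `pl.col("cid1") in some_list` ends up asking an Expr for its truth value:
bool(pl.col("cid1") == 1)  # raises the same ValueError as above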
I have a dataframe with 2 columns, where the first column contains lists and the second column contains integer indexes. How do I get the element from the first column at the index specified in the second column? Or, even better, put that element in a third column. So, for example, how can I go from this
a = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1}, {'lst': [4, 5, 6], 'ind': 2}])
┌───────────┬─────┐
│ lst ┆ ind │
│ --- ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 │
└───────────┴─────┘
to this:
b = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1, 'list[ind]': 2}, {'lst': [4, 5, 6], 'ind': 2, 'list[ind]': 6}])
┌───────────┬─────┬───────────┐
│ lst ┆ ind ┆ list[ind] │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴───────────┘
Thanks.
Edit
As of Python Polars 0.14.24 this can be done more easily with:
df.with_column(pl.col("lst").arr.get(pl.col("ind")).alias("list[ind]"))
Original answer
You can use with_row_count() to add a row count column for grouping, then explode() the list so each list element is on its own row. Then call take() over the row count column using over() to select the element from the subgroup.
df = pl.DataFrame({"lst": [[1, 2, 3], [4, 5, 6]], "ind": [1, 2]})
df = (
    df.with_row_count()
    .with_column(
        pl.col("lst").explode().take(pl.col("ind")).over(pl.col("row_nr")).alias("list[ind]")
    )
    .drop("row_nr")
)
shape: (2, 3)
┌───────────┬─────┬───────────┐
│ lst ┆ ind ┆ list[ind] │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴───────────┘
Here is my approach:
Create a custom function to get the value at the required index:
def get_elem(d):
    sel_idx = d[0]
    return d[1][sel_idx]
Here is some test data:
df = pl.DataFrame({'lista':[[1,2,3],[4,5,6]],'idx':[1,2]})
Now let's create a struct from these two columns (it will create a dict) and apply the above function:
df.with_columns([
    pl.struct(['idx','lista']).apply(lambda x: get_elem(list(x.values()))).alias('req_elem')])
shape: (2, 3)
┌───────────┬─────┬──────────┐
│ lista ┆ idx ┆ req_elem │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪══════════╡
│ [1, 2, 3] ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2 ┆ 6 │
└───────────┴─────┴──────────┘
If your number of unique idx elements isn't absolutely massive, you can build a when/then expression to select based on the value of idx using arr.get(idx):
import polars as pl
df = pl.DataFrame([{"lst": [1, 2, 3], "ind": 1}, {"lst": [4, 5, 6], "ind": 2}])
# create when/then expression for each unique index
idxs = df["ind"].unique()
ind, lst = pl.col("ind"), pl.col("lst") # makes expression generator look cleaner
expr = pl.when(ind == idxs[0]).then(lst.arr.get(idxs[0]))
for idx in idxs[1:]:
    expr = expr.when(ind == idx).then(lst.arr.get(idx))
expr = expr.otherwise(None)
df.select(expr)
shape: (2, 1)
┌─────┐
│ lst │
│ --- │
│ i64 │
╞═════╡
│ 2 │
├╌╌╌╌╌┤
│ 6 │
└─────┘
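If you want the result attached as a third column, as in the question, you can reuse the same expr (with_column was the single-expression method in this Polars version):
# add the selected element next to lst and ind
df.with_column(expr.alias("list[ind]"))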
I have a question about using the FLATTEN function in Snowflake. I'm having trouble extracting data from the path data:performance of the following JSON object:
{
"data": {
"metadata": {
"id": "001",
"created_at": "2020-01-01"
},
"performance": {
"2020-01-01": {
"ad_performances": [{
"ad": "XoGKkgcy7V3BDm6m",
"ad_impressions": 1,
"clicks": 0,
"device": "-3",
"total_net_amount": 0
}, {
"ad": "XoGKkgmFlHa3V5xj",
"ad_impressions": 17,
"clicks": 0,
"device": "-4",
"total_net_amount": 0
}, {
"ad": "XoGKkgmFlHa3V5xj",
"ad_impressions": 5,
"clicks": 0,
"device": "-5",
"total_net_amount": 0
}, {
"ad": "XoGKkgcy7V3BDm6m",
"ad_impressions": 19,
"clicks": 0,
"device": "-2",
"total_net_amount": 0
}, {
"ad": "XoGKkgcy7V3BDm6m",
"ad_impressions": 5,
"clicks": 0,
"device": "-1",
"total_net_amount": 0
}]
}
}
}
The desired result is a table with the "date" (2020-01-01), "ad" and "impressions".
I tried to achieve the desired result with:
select
key::date as date
,f.value:performances:ad as performances_array
,f.value:performances:impressions as performances_array
from <table>, lateral flatten (input => CLMN:performances) f;
but I'm not able to extract the data from the "performance" array. Can someone help me out?
Thank you!
Can you try this one?
select f.KEY date,
l.VALUE:"ad" as performances_array,
l.VALUE:"impressions" as performances_array
from mydata, lateral flatten (input => CLMN:data.performance ) f,
lateral flatten (input => f.VALUE ) s,
lateral flatten (input => s.VALUE ) l
;
+------------+--------------------+--------------------+
| DATE | PERFORMANCES_ARRAY | PERFORMANCES_ARRAY |
+------------+--------------------+--------------------+
| 2020-01-01 | "XoGKkgcy7V3BDm6m" | 1 |
| 2020-01-01 | "XoGKkgmFlHa3V5xj" | 17 |
| 2020-01-01 | "XoGKkgmFlHa3V5xj" | |
| 2020-01-01 | "XoGKkgcy7V3BDm6m" | 19 |
| 2020-01-01 | "XoGKkgcy7V3BDm6m" | 5 |
+------------+--------------------+--------------------+
Only 2 LATERAL FLATTENs are required to extract the rows
select
a.key::date as ad_date,
b.value:ad::varchar as ad,
b.value:ad_impressions::int as impressions
from j
, lateral flatten(input => v:data:performance) a
, lateral flatten(input => a.value:ad_performances) b;
+------------+------------------+-------------+
| AD_DATE    | AD               | IMPRESSIONS |
+------------+------------------+-------------+
| 2020-01-01 | XoGKkgcy7V3BDm6m |           1 |
| 2020-01-01 | XoGKkgmFlHa3V5xj |          17 |
| 2020-01-01 | XoGKkgmFlHa3V5xj |           5 |
| 2020-01-01 | XoGKkgcy7V3BDm6m |          19 |
| 2020-01-01 | XoGKkgcy7V3BDm6m |           5 |
+------------+------------------+-------------+
If you want to aggregate the data by ad date and ad,
with r as
(
select
a.key::date as ad_date,
b.value:ad::varchar as ad,
b.value:ad_impressions::int as impressions
from j
, lateral flatten(input => v:data:performance) a
, lateral flatten(input => a.value:ad_performances) b
)
select ad_date, ad, sum(impressions) as impressions
from r
group by ad_date, ad;
+------------+------------------+-------------+
| AD_DATE    | AD               | IMPRESSIONS |
+------------+------------------+-------------+
| 2020-01-01 | XoGKkgcy7V3BDm6m |          25 |
| 2020-01-01 | XoGKkgmFlHa3V5xj |          22 |
+------------+------------------+-------------+
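For reference, the queries above assume a table j with a single VARIANT column v holding the JSON document. A minimal, abbreviated setup could look like this (paste the full JSON from the question to reproduce all five rows):
-- assumed setup: one VARIANT column v with the (abbreviated) JSON payload
create or replace temporary table j as
select parse_json('{
  "data": {
    "metadata": {"id": "001", "created_at": "2020-01-01"},
    "performance": {
      "2020-01-01": {
        "ad_performances": [
          {"ad": "XoGKkgcy7V3BDm6m", "ad_impressions": 1, "clicks": 0, "device": "-3", "total_net_amount": 0}
        ]
      }
    }
  }
}') as v;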
I feel like there is a better way than this:
import pandas as pd
df = pd.DataFrame(
    columns=" index c1 c2 v1 ".split(),
    data=[
        [  0, "A", "X", 3, ],
        [  1, "A", "X", 5, ],
        [  2, "A", "Y", 7, ],
        [  3, "A", "Y", 1, ],
        [  4, "B", "X", 3, ],
        [  5, "B", "X", 1, ],
        [  6, "B", "X", 3, ],
        [  7, "B", "Y", 1, ],
        [  8, "C", "X", 7, ],
        [  9, "C", "Y", 4, ],
        [ 10, "C", "Y", 1, ],
        [ 11, "C", "Y", 6, ],
    ]).set_index("index", drop=True)

def callback(x):
    x['seq'] = range(1, x.shape[0] + 1)
    return x

df = df.groupby(['c1', 'c2']).apply(callback)
print(df)
To achieve this:
c1 c2 v1 seq
0 A X 3 1
1 A X 5 2
2 A Y 7 1
3 A Y 1 2
4 B X 3 1
5 B X 1 2
6 B X 3 3
7 B Y 1 1
8 C X 7 1
9 C Y 4 1
10 C Y 1 2
11 C Y 6 3
Is there a way to do it that avoids the callback?
use cumcount(); see the pandas docs:
In [4]: df.groupby(['c1', 'c2']).cumcount()
Out[4]:
0 0
1 1
2 0
3 1
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
dtype: int64
If you want the ordering to start at 1:
In [5]: df.groupby(['c1', 'c2']).cumcount()+1
Out[5]:
0 1
1 2
2 1
3 2
4 1
5 2
6 3
7 1
8 1
9 1
10 2
11 3
dtype: int64
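To attach this as the seq column the question asks for:
df['seq'] = df.groupby(['c1', 'c2']).cumcount() + 1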
This might be useful
df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
It will produce one row per userID with that user's ItemIDs joined in date order, like item1->item2->item3.
If you have a dataframe similar to the one below and you want to add a seq column built from c1 or c2, i.e. keep a running count of similar values (or count until a flag comes up) in the other column(s), read on.
df = pd.DataFrame(
    columns=" c1 c2 seq".split(),
    data=[
        [ "A",    1, 1 ],
        [ "A1",   0, 2 ],
        [ "A11",  0, 3 ],
        [ "A111", 0, 4 ],
        [ "B",    1, 1 ],
        [ "B1",   0, 2 ],
        [ "B111", 0, 3 ],
        [ "C",    1, 1 ],
        [ "C11",  0, 2 ],
    ])
Then first find the group starters (str.contains() and eq() are used below, but any method that creates a boolean Series, such as lt(), ne(), isna(), etc., can be used) and call cumsum() on the result to create a Series where each group has a unique identifying value. Then use that Series as the grouper in a groupby().cumcount() operation.
In summary, use code similar to the one below.
# build a grouper Series for similar values
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
# or build a grouper Series from flags (1s)
groups = df['c2'].eq(1).cumsum()
# groupby using the above grouper
df['seq'] = df.groupby(groups).cumcount().add(1)
The cleanliness of Jeff's answer is nice, but I prefer to sort explicitly, though generally without overwriting my df for these types of use cases (e.g. Shaina Raza's answer).
So, to create a new column sequenced by 'v1' within each ('c1', 'c2') group:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount()
you can check with:
df.sort_values(by=['c1','c2','seq'])
or, if you want to overwrite the df, then:
df = df.sort_values(by=['c1','c2','seq']).reset_index()
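Note that cumcount() here starts at 0; to get the 1-based seq shown in the question's expected output, add 1:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount() + 1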
Assume we have a couple of objects in the database with an attribute data, where data looks like: {'gender' => {'male' => 40.0, 'female' => 30.0, 'undefined' => 30.0}}.
I would like to find only those objects which have the highest gender => male value.
PostgreSQL 9.5
Assuming I understand your question correctly (example input/output would be useful):
WITH jsons(id, j) AS (
VALUES
(1, '{"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}}'::json),
(2, '{"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}}'),
(3, '{"gender": {"male": 0.0, "female": 30.0, "undefined": 30.0}}')
)
SELECT id, j
FROM jsons
WHERE (j->'gender'->>'male') :: float8 = (
SELECT MAX((j->'gender'->>'male') :: float8)
FROM jsons
)
;
┌────┬───────────────────────────────────────────────────────────────┐
│ id │ j │
├────┼───────────────────────────────────────────────────────────────┤
│ 1 │ {"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}} │
│ 2 │ {"gender": {"male": 40.0, "female": 30.0, "undefined": 30.0}} │
└────┴───────────────────────────────────────────────────────────────┘
(2 rows)
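If one of the top rows is enough (you don't need every tied row), a plain ORDER BY with LIMIT also works on 9.5, using the same jsons CTE as above:
SELECT id, j
FROM jsons
ORDER BY (j->'gender'->>'male')::float8 DESC
LIMIT 1;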