pandas groupby per-group value - pandas

I have this data:
df = pd.DataFrame({
"dim1": [ "aaa", "aaa", "aaa", "aaa", "aaa", "aaa" ],
"dim2": [ "xxx", "xxx", "xxx", "yyy", "yyy", "yyy" ],
"iter": [ 0, 1, 2, 0, 1, 2 ],
"value1": [ 100, 101, 99, 500, 490, 510 ],
"value2": [ 10000, 10100, 9900, 50000, 49000, 51000 ],
})
I then group by dim1/dim2 and, out of all iterations, keep the value1/value2 of the row with the minimum value1:
df = df.groupby(["dim1", "dim2"], group_keys=False) \
.apply(lambda x: x.sort_values("value1").head(1)).drop(columns=["iter"])
which returns:
dim1 dim2 value1 value2
aaa xxx 99 9900
aaa yyy 490 49000
My question: how can I add a new column that contains the min value1 per dim1 group:
dim1 dim2 value1 value2 new_col
aaa xxx 99 9900 99
aaa yyy 490 49000 99
I tried something like this, which didn't work:
df["new_col"] = df.groupby(["dim1"], group_keys=False) \
.apply(lambda x: x.value1.head(1))

IIUC, you can use .groupby + .transform afterwards:
df["new_col"] = df.groupby("dim1")["value1"].transform("min")
print(df)
Prints:
dim1 dim2 value1 value2 new_col
2 aaa xxx 99 9900 99
4 aaa yyy 490 49000 99
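For reference, here is a minimal end-to-end sketch of both steps on the data from the question. Note that idxmin is swapped in here as an alternative to the sort_values/head approach in the question; it is not what the answer above uses, and the name best is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "dim1": ["aaa"] * 6,
    "dim2": ["xxx"] * 3 + ["yyy"] * 3,
    "iter": [0, 1, 2, 0, 1, 2],
    "value1": [100, 101, 99, 500, 490, 510],
    "value2": [10000, 10100, 9900, 50000, 49000, 51000],
})

# Row labels of the minimum value1 within each (dim1, dim2) group
min_rows = df.groupby(["dim1", "dim2"])["value1"].idxmin()
best = df.loc[min_rows].drop(columns="iter")

# transform returns a Series aligned to best's index, so it can be
# assigned directly as a new column (unlike an aggregated result)
best["new_col"] = best.groupby("dim1")["value1"].transform("min")
print(best)
```

The attempt in the question most likely failed because apply(...head(1)) returns one row per dim1 group, whereas transform broadcasts the group minimum back to every row of the group.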

Related

How to groupby consecutive rows only

I have a dataframe like this:
data = {'costs': [150, 400, 300, 500, 350], 'month':[1, 2, 2, 1, 1]}
df = pd.DataFrame(data)
I want to use groupby(['month']).sum(), but the first row should not be
combined with the fourth and fifth rows, so the resulting costs would
look like this:
list(df['costs'])= [150, 700, 850]
Try:
x = (
df.groupby((df.month != df.month.shift(1)).cumsum())
.agg({"costs": "sum", "month": "first"})
.reset_index(drop=True)
)
print(x)
Prints:
costs month
0 150 1
1 700 2
2 850 1
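To see why this works, it may help to look at the intermediate grouping key. The sketch below uses the data from the question; the names starts and run_id are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"costs": [150, 400, 300, 500, 350], "month": [1, 2, 2, 1, 1]})

# True wherever month differs from the previous row.
# The first row is compared against NaN, so it is always True.
starts = df["month"] != df["month"].shift()

# The running total of the True flags gives each consecutive run its own id
run_id = starts.cumsum()
print(run_id.tolist())   # [1, 2, 2, 3, 3]

# Grouping by the run id keeps the two separate month == 1 runs apart
result = df.groupby(run_id)["costs"].sum()
print(result.tolist())   # [150, 700, 850]
```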

How to flatten JSON values into frequency counts in SQL

I have a column with JSON values like so:
{'A': 'true', 'B': 'false', 'C': 'true'}
{'A': 'true', 'C': 'false'}
{'D': 'true'}
{'C': 'true', 'A': 'false'}
I would like to create an SQL query which counts the number of entries with each key-value combination in the json.
Note that the keys and values are unknown in advance.
So the output of the above would be:
2 A=true
1 A=false
1 B=false
2 C=true
1 C=false
1 D=true
How can I do that?
SELECT a1||':'||a2, count(*) from (
SELECT map_entries(cast(json_parse(x) as MAP<VARCHAR, VARCHAR>)) row from
(VALUES ('{"A": "true", "B": "false", "C": "true"}'), ('{"A": "true", "C": "false"}'), ('{"D": "true"}'), ('{"C": "true", "A": "false"}')) as t(x))
as nested_data CROSS JOIN UNNEST(row) as nested_data(a1, a2)
group by 1;
_col0 | _col1
---------+-------
D:true | 1
B:false | 1
C:false | 1
C:true | 2
A:false | 1
A:true | 2
https://prestosql.io/docs/current/functions/map.html

Count previous occurrences in Pandas [duplicate]

I feel like there is a better way than this:
import pandas as pd
df = pd.DataFrame(
columns=" index c1 c2 v1 ".split(),
data= [
[ 0, "A", "X", 3, ],
[ 1, "A", "X", 5, ],
[ 2, "A", "Y", 7, ],
[ 3, "A", "Y", 1, ],
[ 4, "B", "X", 3, ],
[ 5, "B", "X", 1, ],
[ 6, "B", "X", 3, ],
[ 7, "B", "Y", 1, ],
[ 8, "C", "X", 7, ],
[ 9, "C", "Y", 4, ],
[ 10, "C", "Y", 1, ],
[ 11, "C", "Y", 6, ],]).set_index("index", drop=True)
def callback(x):
    # number the rows of each group, starting at 1
    x['seq'] = range(1, x.shape[0] + 1)
    return x
df = df.groupby(['c1', 'c2']).apply(callback)
print(df)
To achieve this:
c1 c2 v1 seq
0 A X 3 1
1 A X 5 2
2 A Y 7 1
3 A Y 1 2
4 B X 3 1
5 B X 1 2
6 B X 3 3
7 B Y 1 1
8 C X 7 1
9 C Y 4 1
10 C Y 1 2
11 C Y 6 3
Is there a way to do it that avoids the callback?
Use cumcount() (see the pandas GroupBy.cumcount documentation):
In [4]: df.groupby(['c1', 'c2']).cumcount()
Out[4]:
0 0
1 1
2 0
3 1
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
dtype: int64
If you want orderings starting at 1
In [5]: df.groupby(['c1', 'c2']).cumcount()+1
Out[5]:
0 1
1 2
2 1
3 2
4 1
5 2
6 3
7 1
8 1
9 1
10 2
11 3
dtype: int64
This might be useful
df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
It will create one '->'-joined sequence of ItemID values per userID, ordered by date.
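As an illustration only: the column names below follow the snippet above, but the values are made up, since the original post does not show its data. ItemID is assumed to hold strings, which str.join requires.

```python
import pandas as pd

# Hypothetical data; not taken from the original post
df = pd.DataFrame({
    "userID": [1, 2, 1, 2, 1],
    "date": pd.to_datetime(
        ["2020-01-03", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "ItemID": ["c", "x", "a", "y", "b"],
})

# Sort so that items are joined in chronological order within each user
df = df.sort_values(["userID", "date"])
grp = df.groupby("userID")["ItemID"].agg(lambda s: "->".join(s)).reset_index()
print(grp)
```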
Suppose you have a dataframe like the one below and want to build the seq column from c1 or c2, i.e. keep a running count within each run of similar values (or until a flag appears) in another column:
df = pd.DataFrame(
columns=" c1 c2 seq".split(),
data= [
[ "A", 1, 1 ],
[ "A1", 0, 2 ],
[ "A11", 0, 3 ],
[ "A111", 0, 4 ],
[ "B", 1, 1 ],
[ "B1", 0, 2 ],
[ "B111", 0, 3 ],
[ "C", 1, 1 ],
[ "C11", 0, 2 ] ])
First find the group starters. str.contains() and eq() are used below, but any method that produces a boolean Series (lt(), ne(), isna(), etc.) works. Calling cumsum() on that boolean Series gives every group a unique identifying value, which can then be used as the grouper for a groupby().cumcount() operation.
In summary, use code similar to the following:
# build a grouper Series for similar values
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
# or build a grouper Series from flags (1s)
groups = df['c2'].eq(1).cumsum()
# groupby using the above grouper
df['seq'] = df.groupby(groups).cumcount().add(1)
The cleanliness of Jeff's answer is nice, but I prefer to sort explicitly, though generally without overwriting my df for these types of use cases (e.g. Shaina Raza's answer).
So, to create a new column sequenced by 'v1' within each ('c1', 'c2') group:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount()
you can check with:
df.sort_values(by=['c1','c2','seq'])
or, if you want to overwrite the df, then:
df = df.sort_values(by=['c1','c2','seq']).reset_index()
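The reason the one-liner can leave df unsorted is index alignment: cumcount() returns a Series that keeps the original row labels, and column assignment matches on those labels. A small sketch with made-up data (not from the question) shows this:

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "c1": ["A", "A", "A", "B"],
    "c2": ["X", "X", "X", "Y"],
    "v1": [5, 3, 7, 1],
})

# The sorted copy is only used to compute the ranking; df keeps its row order
df["seq"] = df.sort_values(by=["c1", "c2", "v1"]).groupby(["c1", "c2"]).cumcount()
print(df)
```

Here the row with v1 == 3 gets seq 0 even though it is second in df, because the numbering follows the sorted order. Add 1 to the result if the sequence should start at 1.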

Pandas time re-sampling categorical data from a column with calculations from another numerical column

I have a data-frame with a categorical column and a numerical column, with the index set to time data:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
the above code gives:
Kind Values
date
2013-03-01 A 1.0
2013-03-02 B 1.5
2013-03-01 A 2.0
2013-03-02 B 3.0
2013-03-01 B 5.0
2013-03-02 B 3.0
My aim is to achieve the below data-frame:
A_count B_count A_Val max B_Val max
date
2013-03-01 2 1 2 5
2013-03-02 0 3 0 3
The result should also have the time as its index. I note that if we use
df = pd.DataFrame(df.resample('D')['Kind'].value_counts())
we get:
Kind
date Kind
2013-03-01 A 2
B 1
2013-03-02 B 3
Use DataFrame.pivot_table, then flatten the resulting MultiIndex columns with a list comprehension:
df = pd.DataFrame({
'date': [
'2013-03-01 ', '2013-03-02 ',
'2013-03-01 ', '2013-03-02',
'2013-03-01 ', '2013-03-02 '
],
'Kind': [
'A', 'B', 'A', 'B', 'B', 'B'
],
'Values': [1, 1.5, 2, 3, 5, 3]
})
df['date'] = pd.to_datetime(df['date'])
# setting the index is optional here, so this step can be omitted
#df = df.set_index('date')
df = df.pivot_table(index='date', columns='Kind', values='Values', aggfunc=['count','max'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
print (df)
A_count B_count A_max B_max
date
2013-03-01 2.0 1.0 2.0 5.0
2013-03-02 NaN 3.0 NaN 3.0
Another solution uses Grouper to resample by day:
df = df.set_index('date')
df = df.groupby([pd.Grouper(freq='D'), 'Kind'])['Values'].agg(['count','max']).unstack()
df.columns = [f'{b}_{a}' for a, b in df.columns]
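Both solutions leave NaN where a Kind has no rows on a given day, while the desired output in the question shows 0. One possible way to get zeros, sketched here on the question's data, is the fill_value parameter of pivot_table. The name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-03-01", "2013-03-02", "2013-03-01",
        "2013-03-02", "2013-03-01", "2013-03-02",
    ]),
    "Kind": ["A", "B", "A", "B", "B", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
})

# fill_value replaces the missing (date, Kind) combinations with 0
out = df.pivot_table(index="date", columns="Kind", values="Values",
                     aggfunc=["count", "max"], fill_value=0)

# Columns are (aggfunc, Kind) pairs; flatten them to Kind_aggfunc
out.columns = [f"{kind}_{agg}" for agg, kind in out.columns]
print(out)
```

Keep in mind that a 0 in a max column then means "no rows", which cannot be told apart from a genuine maximum of 0.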

Lookup smallest value greater than current

I have an objects table and a lookup table. In the objects table, I'm looking to add the smallest value from the lookup table that is greater than the object's number.
I found this similar question but it's about finding a value greater than a constant, rather than changing for each row.
In code:
import pandas as pd
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30}])
lookup = pd.DataFrame([{"number": 3}, {"number": 12}, {"number": 40}])
expected = pd.DataFrame(
[
{"id": 1, "number": 10, "smallest_greater": 12},
{"id": 2, "number": 30, "smallest_greater": 40},
]
)
First compare every lookup['number'] value against every objects['number'] value to build a 2D boolean mask. Then take the cumulative sum along each row and compare it with 1 to mark the first match, and use numpy.argmax to get its position, which selects the value from lookup['number'].
Finally, numpy.where sets the result to NaN for rows that have no match.
import numpy as np
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30},
{"id": 3, "number": 100},{"id": 4, "number": 1}])
print (objects)
id number
0 1 10
1 2 30
2 3 100
3 4 1
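# m1[i, j] is True where lookup value j is strictly greater than object number i
m1 = lookup['number'].values > objects['number'].values[:, None]
# keep only the first True in each row (this assumes lookup is sorted ascending)
m2 = np.cumsum(m1, axis=1) == 1
# rows that have at least one greater lookup value
m3 = np.any(m1, axis=1)
out = lookup['number'].values[m2.argmax(axis=1)]
objects['smallest_greater'] = np.where(m3, out, np.nan)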
print (objects)
id number smallest_greater
0 1 10 12.0
1 2 30 40.0
2 3 100 NaN
3 4 1 3.0
Alternatively, with a plain Python loop:
smallest_greater = []
for i in objects['number']:
    # smallest lookup value strictly greater than i (NaN if there is none)
    smallest_greater.append(lookup.loc[lookup['number'] > i, 'number'].min())
objects['smallest_greater'] = smallest_greater
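For larger tables, a binary search avoids both the full comparison matrix and the Python loop. This is a sketch of a different technique (numpy.searchsorted), not taken from the answers above. It sorts the lookup values first, so it does not depend on their original order.

```python
import numpy as np
import pandas as pd

objects = pd.DataFrame({"id": [1, 2, 3, 4], "number": [10, 30, 100, 1]})
lookup = pd.DataFrame({"number": [3, 12, 40]})

sorted_lookup = np.sort(lookup["number"].to_numpy())

# side="right" gives the position of the first element strictly greater
# than each object number (len(sorted_lookup) if there is none)
pos = np.searchsorted(sorted_lookup, objects["number"].to_numpy(), side="right")
found = pos < len(sorted_lookup)

# Clip the positions so indexing stays in bounds, then mask the misses with NaN
candidates = sorted_lookup[np.minimum(pos, len(sorted_lookup) - 1)]
objects["smallest_greater"] = np.where(found, candidates, np.nan)
print(objects)
```

Use side="left" instead if a lookup value equal to the object's number should also count as a match.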