I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify a change in a variable per person over time.
Here is the code to generate the data frame:
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)
The error happens because you call diff() on the 'job' column, but its dtype is 'object', and diff() works only with numeric types.
Instead, compare each job with the previous one within each id:
df["change"] = (
    df.groupby(["id"])["job"]
    .transform(lambda x: x.ne(x.shift().bfill()))
    .astype(int)
)
Here is the (longer) solution that I worked out:
df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)

df1 = df.set_index(['id', 'date']).sort_index()
# previous job within each id
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
# the first observation per id has no lag; treat it as "no change"
df1['job_lag'] = df1['job_lag'].fillna(df1['job'])

def change(x):
    if x['job'] != x['job_lag']:
        return 1
    else:
        return 0

df1['dummy'] = df1.apply(change, axis=1)
df1
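As a side note, the row-wise apply at the end can be replaced by a single vectorized comparison; a minimal sketch of the same logic:

# same result as the change() function, without apply
df1['dummy'] = (df1['job'] != df1['job_lag']).astype(int)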
I have a dataframe like this:
data = {'costs': [150, 400, 300, 500, 350], 'month':[1, 2, 2, 1, 1]}
df = pd.DataFrame(data)
I want to use groupby(['month']).sum(), but the first row should not be combined with the fourth and fifth rows, so the resulting costs would look like this:
list(df['costs']) = [150, 700, 850]
Try:
x = (
    df.groupby((df.month != df.month.shift(1)).cumsum())
    .agg({"costs": "sum", "month": "first"})
    .reset_index(drop=True)
)
print(x)
Prints:
costs month
0 150 1
1 700 2
2 850 1
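The key is the grouper: comparing month with the previous row and taking the cumulative sum starts a new group number every time the month value changes, so only consecutive equal months are summed together. A quick look at the grouper for the example data:

grouper = (df.month != df.month.shift(1)).cumsum()
print(grouper.tolist())  # [1, 2, 2, 3, 3] -> row 0, rows 1-2, and rows 3-4 form separate groups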
This is my dataframe:
df = pd.DataFrame(
    {
        "name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
        "score": [3, 5, 6, 2, 4, 1],
    }
)
I want to compare the score of bob_x with bob_y and retain the row with the lowest, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
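If your real frame has more columns and you want to keep the entire row with the lowest score rather than only the score, a sketch using idxmin() on the stripped names (same example data) could look like this:

import pandas as pd

df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
                   "score": [3, 5, 6, 2, 4, 1]})

# group key: the part of the name before the first underscore
key = df['name'].str.split('_').str[0]
# index label of the minimum score within each key, then select those rows
print(df.loc[df.groupby(key)['score'].idxmin()])

This variant keeps the original suffixed names (bob_y, jay_y) instead of collapsing them to bob/jay.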
I feel like there is a better way than this:
import pandas as pd

df = pd.DataFrame(
    columns=" index c1 c2 v1 ".split(),
    data=[
        [0, "A", "X", 3],
        [1, "A", "X", 5],
        [2, "A", "Y", 7],
        [3, "A", "Y", 1],
        [4, "B", "X", 3],
        [5, "B", "X", 1],
        [6, "B", "X", 3],
        [7, "B", "Y", 1],
        [8, "C", "X", 7],
        [9, "C", "Y", 4],
        [10, "C", "Y", 1],
        [11, "C", "Y", 6],
    ],
).set_index("index", drop=True)
def callback(x):
    x['seq'] = range(1, x.shape[0] + 1)
    return x

df = df.groupby(['c1', 'c2']).apply(callback)
print(df)
To achieve this:
c1 c2 v1 seq
0 A X 3 1
1 A X 5 2
2 A Y 7 1
3 A Y 1 2
4 B X 3 1
5 B X 1 2
6 B X 3 3
7 B Y 1 1
8 C X 7 1
9 C Y 4 1
10 C Y 1 2
11 C Y 6 3
Is there a way to do it that avoids the callback?
Use cumcount(); see the groupby docs.
In [4]: df.groupby(['c1', 'c2']).cumcount()
Out[4]:
0 0
1 1
2 0
3 1
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
dtype: int64
If you want orderings starting at 1
In [5]: df.groupby(['c1', 'c2']).cumcount()+1
Out[5]:
0 1
1 2
2 1
3 2
4 1
5 2
6 3
7 1
8 1
9 1
10 2
11 3
dtype: int64
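To get the seq column asked for in the question, assign the result back to the frame:

df['seq'] = df.groupby(['c1', 'c2']).cumcount() + 1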
This might be useful:
df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
It will create, for each userID, a sequence of ItemIDs joined with '->'.
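The sample output did not make it into the post; on made-up userID/ItemID data (the values below are purely illustrative) the result would look roughly like this:

import pandas as pd

# hypothetical data, only to show the shape of the result
df = pd.DataFrame({
    'userID': [1, 1, 2, 2, 2],
    'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02', '2021-01-03'],
    'ItemID': ['A', 'B', 'A', 'C', 'B'],
})

df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
#    userID   ItemID
# 0       1     A->B
# 1       2  A->C->B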
If you have a dataframe similar to the one below and you want to add a seq column built from c1 or c2, i.e. keep a running count of similar values (or count until a flag comes up) in other column(s), read on.
df = pd.DataFrame(
    columns=" c1 c2 seq".split(),
    data=[
        ["A", 1, 1],
        ["A1", 0, 2],
        ["A11", 0, 3],
        ["A111", 0, 4],
        ["B", 1, 1],
        ["B1", 0, 2],
        ["B111", 0, 3],
        ["C", 1, 1],
        ["C11", 0, 2],
    ],
)
Then first find the group starters (str.contains() and eq() are used below, but any method that produces a boolean Series, such as lt(), ne(), isna(), etc., works) and call cumsum() on the result to build a Series in which each group gets a unique identifying value. Finally, use that Series as the grouper in a groupby().cumcount() operation.
In summary, use a code similar to the one below.
# build a grouper Series for similar values
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
# or build a grouper Series from flags (1s)
groups = df['c2'].eq(1).cumsum()
# groupby using the above grouper
df['seq'] = df.groupby(groups).cumcount().add(1)
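A quick sanity check on the example frame, showing that the flag-based grouper and cumcount() reproduce the seq column given above:

groups = df['c2'].eq(1).cumsum()
print(groups.tolist())                                # [1, 1, 1, 1, 2, 2, 2, 3, 3]
print(df.groupby(groups).cumcount().add(1).tolist())  # [1, 2, 3, 4, 1, 2, 3, 1, 2]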
The cleanliness of Jeff's answer is nice, but I prefer to sort explicitly... though generally without overwriting my df for this type of use case (e.g. Shaina Raza's answer).
So, to create a new column sequenced by 'v1' within each ('c1', 'c2') group:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount()
you can check with:
df.sort_values(by=['c1','c2','seq'])
or, if you want to overwrite the df, then:
df = df.sort_values(by=['c1','c2','seq']).reset_index()
Right now, I have an array that I'm able to select off a table.
[{"_id": 1, "count": 3}, {"_id": 2, "count": 14}, {"_id": 3, "count": 5}]
From this, I only need the count for a particular _id. For example, I need the count for
_id: 3
I've read the documentation but I haven't been able to figure out the correct way to get the object.
WITH test_array(data) AS ( VALUES
    ('[
        {"_id": 1, "count": 3},
        {"_id": 2, "count": 14},
        {"_id": 3, "count": 5}
    ]'::JSONB)
)
SELECT val->>'count' AS result
FROM
    test_array ta,
    jsonb_array_elements(ta.data) val
WHERE val @> '{"_id": 3}'::JSONB;
Result:
result
--------
5
(1 row)
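Against an actual table the same pattern applies; assuming a table named my_table with a JSONB column named data (both names are placeholders), the query would look roughly like:

SELECT val->>'count' AS result
FROM my_table,
     jsonb_array_elements(my_table.data) AS val
WHERE val @> '{"_id": 3}'::JSONB;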