I have a DataFrame with columns like:
colA colB colC colD colE flag
A X 2018Q1 500 600 1
A X 2018Q2 200 800 1
A X 2018Q3 100 400 1
A X 2018Q4 500 600 1
A X 2019Q1 400 7000 0
A X 2019Q2 1500 6100 0
A X 2018Q3 5600 600 1
A X 2018Q4 500 6007 1
A Y 2016Q1 900 620 1
A Y 2016Q2 750 850 0
A Y 2017Q1 750 850 1
A Y 2017Q2 750 850 1
A Y 2017Q3 750 850 1
A Y 2018Q1 750 850 1
A Y 2018Q2 750 850 1
A Y 2018Q3 750 850 1
A Y 2018Q4 750 850 1
A row at the (colA, colB) level passes a statistical check if, within that group, flag == 1 for 4 continuous quarters of data after sorting, over one stride.
We stride like this: 2018Q1-2018Q4, then 2018Q2-2019Q1, and so on. If there are 4 continuous quarters with flag == 1, we label those rows as 1.
The final output will be like:
colA colB colC colD colE flag check_qtr
A X 2018Q1 500 600 1 1
A X 2018Q2 200 800 1 1
A X 2018Q3 100 400 1 1
A X 2018Q4 500 600 1 1
A X 2019Q1 400 7000 0 0
A X 2019Q2 1500 6100 0 0
A X 2018Q3 5600 600 1 0
A X 2018Q4 500 6007 1 0
A Y 2016Q1 900 620 1 0
A Y 2016Q2 750 850 0 0
A Y 2017Q1 750 850 1 0
A Y 2017Q2 750 850 1 0
A Y 2017Q3 750 850 1 0
A Y 2018Q1 750 850 1 1
A Y 2018Q2 750 850 1 1
A Y 2018Q3 750 850 1 1
A Y 2018Q4 750 850 1 1
How can we do this using pandas and numpy?
Can we implement this using SQL?
Concerning your first question, this can be done with pandas as follows.
First, I'll generate your example DataFrame:
import pandas as pd
df = pd.DataFrame({'colA': ['A']*17,
                   'colB': ['X']*8 + ['Y']*9,
                   'flag': [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]})
df.set_index(['colA','colB'], inplace=True) # Set index as multilevel with colA and colB
This reproduces your example DataFrame. However, to use the following approach, we'll need to go back to a plain integer index:
df.reset_index(inplace=True)
colA colB flag
0 A X 1
1 A X 1
2 A X 1
3 A X 1
4 A X 0
5 A X 0
6 A X 1
7 A X 1
8 A Y 1
9 A Y 0
10 A Y 1
11 A Y 1
12 A Y 1
13 A Y 1
14 A Y 1
15 A Y 1
16 A Y 1
Then, to obtain your result column, you can use groupby (with some prints to show what's going on):
from scipy.ndimage import shift  # scipy.ndimage.interpolation is deprecated
import numpy as np

df['check_qtr'] = pd.Series(0, index=df.index)  # Initialise the result column
for name, group in df.groupby(['colA', 'colB', 'flag']):
    if name[2] == 1:
        print(name)
        # Is each index exactly 1 place after the previous one?
        idx = ((group.index.values - shift(group.index.values, 1, cval=-1)) == 1).astype(int)
        print(idx)
        # Are the next 4 steps all consecutive?
        bools = [idx[x:x+4].sum() == 4 for x in range(len(idx))]
        print(bools)
        for idx in group.index.values[bools]:  # for each index where the window check passed
            df.loc[idx:idx+3, 'check_qtr'] = 1  # set check_qtr from row idx to row idx+3
('A', 'X', 1)
[1 1 1 1 0 1]
[True, False, False, False, False, False]
('A', 'Y', 1)
[0 0 1 1 1 1 1 1]
[False, False, True, True, True, False, False, False]
Note that we use +4 when doing array indexing, because array[x:x+4] gives the 4 values at indexes x to x+3.
We use +3 with loc because loc follows a different logic: it retrieves rows by label, not by position, and the end label is included, so idx to idx+3 covers 4 rows.
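To see that slicing difference on a tiny, made-up example (purely illustrative, not part of the data above):
import numpy as np
import pandas as pd

arr = np.arange(8)
print(arr[2:2 + 4])    # [2 3 4 5]          -> positional slice, end excluded, 4 values

s = pd.Series(np.arange(8))
print(s.loc[2:2 + 3])  # rows labelled 2..5 -> label slice, end included, 4 values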
This gives the following result (note that rows 11 and 12 come out as 1 even though your desired output has 0 there, because this approach only checks that rows sit next to each other in the DataFrame, not that the quarters themselves are consecutive):
colA colB flag check_qtr
0 A X 1 1
1 A X 1 1
2 A X 1 1
3 A X 1 1
4 A X 0 0
5 A X 0 0
6 A X 1 0
7 A X 1 0
8 A Y 1 0
9 A Y 0 0
10 A Y 1 0
11 A Y 1 1
12 A Y 1 1
13 A Y 1 1
14 A Y 1 1
15 A Y 1 1
16 A Y 1 1
This may not be the perfect way to do it, but it should give you some hints about how to use these functions!
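If you also need the quarters themselves to be consecutive (which the loop above does not check, and which is why rows 11 and 12 differ from your desired output), a rough sketch along these lines could work. It assumes the full frame from the question, with colC holding strings like '2018Q1'; the helper name mark_quarters is made up:
import numpy as np
import pandas as pd

def mark_quarters(g, window=4):
    flag = g['flag'].to_numpy()
    qtr = pd.PeriodIndex(g['colC'], freq='Q').asi8  # quarters as consecutive integers
    out = np.zeros(len(g), dtype=int)
    for i in range(len(g) - window + 1):
        # all `window` flags are 1 and the quarters follow each other with no gap
        if flag[i:i + window].all() and (np.diff(qtr[i:i + window]) == 1).all():
            out[i:i + window] = 1
    return pd.Series(out, index=g.index)

df['check_qtr'] = df.groupby(['colA', 'colB'], group_keys=False).apply(mark_quarters)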
I want to help with a data-science task by generating some synthetic data, since we don't have enough labelled data. I want to cut the rows at random positions around the 0s of the y column, without splitting any sequence of 1s.
After cutting, I want to shuffle those slices and build a new DataFrame.
Ideally there would be parameters to adjust the maximum and minimum sequence length to cut, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np

# number of cuts
N = 3
# pick N random index values from the rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# build group ids: membership check plus cumulative sum
arr = df.index.isin(idx).cumsum()
# shuffle the unique group ids
u = np.unique(arr)
np.random.shuffle(u)
# reorder the groups in the DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print(df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
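If you want the number of cuts as a proper parameter (and a reproducible seed), you can wrap the same idea in a small helper. This is only a sketch: the name shuffled_slices is made up, and the minimum/maximum slice-length parameters from the question are left out.
import numpy as np

def shuffled_slices(df, n_cuts=3, seed=None):
    # cut only at rows where y == 0, so no run of 1s is split
    rng = np.random.default_rng(seed)
    cut_rows = rng.choice(df.index[df['y'].eq(0)], size=n_cuts, replace=False)
    groups = df.index.isin(cut_rows).cumsum()   # group id per row
    order = np.unique(groups)
    rng.shuffle(order)                          # random order of the slices
    return df.set_index(groups).loc[order].reset_index(drop=True)

new_df = shuffled_slices(df, n_cuts=3, seed=42)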
I have the dataframe as below.
Cycle Type Count Value
1 1 5 0.014
1 1 40 -0.219
1 1 5 0.001
1 1 100 -0.382
1 1 5 0.001
1 1 25 -0.064
2 1 5 0.003
2 1 110 -0.523
2 1 10 0.011
2 1 5 -0.009
2 1 5 0.012
2 1 156 -0.612
3 1 5 0.002
3 1 45 -0.167
3 1 5 0.003
3 1 10 -0.052
3 1 5 0.001
3 1 80 -0.194
I want to sum the 'Count' of all the positive & negative 'Value' AFTER groupby
The answer would be something like:
1 1 15 (sum of count when Value is positive),
1 1 165 (sum of count when Value is negative),
2 1 20,
2 1 171,
3 1 15,
3 1 135
I think something like grouped.set_index('Count').groupby(['Cycle','Type'])['Value']... will work, but I am unable to figure out how to restrict the sum() to positive and negative values separately.
If I understood correctly, you can try the code below:
df = pd.DataFrame(data)
df_negative = df[df['Value'] < 0]
df_positive = df[df['Value'] > 0]
df_negative = df_negative.groupby(['Cycle', 'Type']).Count.sum().reset_index()
df_positive = df_positive.groupby(['Cycle', 'Type']).Count.sum().reset_index()
df_combine = pd.concat([df_positive, df_negative]).sort_values('Cycle')
df_combine
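A single groupby can do the same thing if you add the sign of Value as an extra grouping key. This is just a variant of the above, assuming the same Cycle/Type/Count/Value frame from the question:
import numpy as np

out = (df.groupby(['Cycle', 'Type', np.sign(df['Value']).rename('sign')])['Count']
         .sum()
         .reset_index())
print(out)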
In [4]: data = pd.read_csv('student_data.csv')
In [5]: data[:10]
Out[5]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
5 1 760 3.00 2
6 1 560 2.98 1
7 0 400 3.08 2
8 1 540 3.39 3
9 0 700 3.92 2
one_hot_data = pd.get_dummies(data['rank'])
# TODO: Drop the previous rank column
data = data.drop('rank', axis=1)
data = data.join(one_hot_data)
# Print the first 10 rows of our data
data[:10]
It always gives an error:
KeyError: 'rank'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-6a749c8f286e> in <module>()
1 # TODO: Make dummy variables for rank
----> 2 one_hot_data = pd.get_dummies(data['rank'])
3
4 # TODO: Drop the previous rank column
5 data = data.drop('rank', axis=1)
If you get:
KeyError: 'rank'
it means there is no column named rank. The problem is most likely trailing whitespace or an encoding issue in the column name. Check with:
print (data.columns.tolist())
['admit', 'gre', 'gpa', 'rank']
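If the header really does carry stray whitespace (I can't see your CSV, so this is just a guess), normalising the column names first usually fixes the KeyError:
data.columns = data.columns.str.strip()   # remove leading/trailing whitespace from headers
one_hot_data = pd.get_dummies(data['rank'])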
Your solution can be simplified with DataFrame.pop - it selects the column and removes it from the original DataFrame:
data = data.join(pd.get_dummies(data.pop('rank')))
# Print the first 10 rows of our data
print(data[:10])
admit gre gpa 1 2 3 4
0 0 380 3.61 0 0 1 0
1 1 660 3.67 0 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 0 0 0 1
4 0 520 2.93 0 0 0 1
5 1 760 3.00 0 1 0 0
6 1 560 2.98 1 0 0 0
7 0 400 3.08 0 1 0 0
8 1 540 3.39 0 0 1 0
9 0 700 3.92 0 1 0 0
I tried your code and it works fine. You may need to rerun the previous cells, which include loading the data.
I have a DataFrame df:
id value
1 100
2 200
3 500
4 600
5 700
6 800
I have another DataFrame df2:
c_id flag
2 Y
3 Y
5 Y
Similarly, df3:
c_id flag
1 N
3 Y
4 Y
I want to merge these 3 DataFrames and create a column in df
such that my df looks like:
id value flag
1 100 N
2 200 Y
3 500 Y
4 600 Y
5 700 Y
6 800 nan
I DON'T WANT to concatenate df2 and df3, e.g.:
final = pd.concat([df2, df3], ignore_index=False)
final.drop_duplicates(inplace=True)
I don't want to use this method; is there any other way?
Using pd.merge between df and the combined df2 + df3:
In [1150]: df.merge(df2.append(df3), left_on=['id'], right_on=['c_id'], how='left')
Out[1150]:
id value c_id flag
0 1 100 1.0 N
1 2 200 2.0 Y
2 3 500 3.0 Y
3 3 500 3.0 Y
4 4 600 4.0 Y
5 5 700 5.0 Y
6 6 800 NaN NaN
Details
In [1151]: df2.append(df3)
Out[1151]:
c_id flag
0 2 Y
1 3 Y
2 5 Y
0 1 N
1 3 Y
2 4 Y
Using map, you could:
In [1140]: df.assign(flag=df.id.map(
df2.set_index('c_id')['flag'].combine_first(
df3.set_index('c_id')['flag']))
)
Out[1140]:
id value flag
0 1 100 N
1 2 200 Y
2 3 500 Y
3 4 600 Y
4 5 700 Y
5 6 800 NaN
Let me explain: set_index and combine_first create a mapping from id to flag.
In [1141]: mapping = df2.set_index('c_id')['flag'].combine_first(
df3.set_index('c_id')['flag'])
In [1142]: mapping
Out[1142]:
c_id
1 N
2 Y
3 Y
4 Y
5 Y
Name: flag, dtype: object
In [1143]: df.assign(flag=df.id.map(mapping))
Out[1143]:
id value flag
0 1 100 N
1 2 200 Y
2 3 500 Y
3 4 600 Y
4 5 700 Y
5 6 800 NaN
Merge df with both df2 and df3 (the key is id in df and c_id in the other two):
df = pd.merge(pd.merge(df, df2, left_on='id', right_on='c_id', how='left'),
              df3, left_on='id', right_on='c_id', how='left')
Combine the two flag columns, filling the nulls of one with the other:
df['flag'] = df['flag_x'].fillna(df['flag_y'])
Delete the helper columns:
del df['flag_x']; del df['flag_y']; del df['c_id_x']; del df['c_id_y']
Or you could just append:
df4 = df2.append(df3)
pd.merge(df, df4, left_on='id', right_on='c_id', how='left')
I'm trying to name the size() column, like this:
x = monthly.copy()
x["size"] = x\
.groupby(["sub_acct_id", "clndr_yr_month"]).transform(np.size)
But what I'm getting is
ValueError: Wrong number of items passed 15, placement implies 1
Why is this not working for my dataframe?
If I simply print the copy:
x = monthly.copy()
print x
this is how the table looks like:
sub_acct_id clndr_yr_month
12716D 201601 219
201602 265
12716G 201601 221
201602 262
12716K 201601 181
201602 149
...
What I'm trying to accomplish is to set the name of the column:
sub_acct_id clndr_yr_month size
12716D 201601 219
201602 265
12716G 201601 221
201602 262
12716K 201601 181
201602 149
...
You need:
x["size"] = x.groupby(["sub_acct_id", "clndr_yr_month"])['sub_acct_id'].transform('size')
Sample:
df = pd.DataFrame({'sub_acct_id': ['x', 'x', 'x','x','y','y','y','z','z']
, 'clndr_yr_month': ['a', 'b', 'c','c','a','b','c','a','b']})
print (df)
clndr_yr_month sub_acct_id
0 a x
1 b x
2 c x
3 c x
4 a y
5 b y
6 c y
7 a z
8 b z
df['size'] = df.groupby(['sub_acct_id', 'clndr_yr_month'])['sub_acct_id'].transform('size')
print (df)
clndr_yr_month sub_acct_id size
0 a x 1
1 b x 1
2 c x 2
3 c x 2
4 a y 1
5 b y 1
6 c y 1
7 a z 1
8 b z 1
Another solution, with an aggregated output:
df = df.groupby(['sub_acct_id', 'clndr_yr_month']).size().reset_index(name='Size')
print (df)
sub_acct_id clndr_yr_month Size
0 x a 1
1 x b 1
2 x c 2
3 y a 1
4 y b 1
5 y c 1
6 z a 1
7 z b 1
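On reasonably recent pandas versions (1.1 or later, if I remember correctly), as_index=False gives you the aggregated count directly as a size column, without the reset_index step:
# same sample df as above; the count column in the result is named 'size'
df_size = df.groupby(['sub_acct_id', 'clndr_yr_month'], as_index=False).size()
print(df_size)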