How to do time continuity checks on a pandas DataFrame using Python or SQL

I have a data-frame which has columns like:
colA colB colC colD colE flag
A X 2018Q1 500 600 1
A X 2018Q2 200 800 1
A X 2018Q3 100 400 1
A X 2018Q4 500 600 1
A X 2019Q1 400 7000 0
A X 2019Q2 1500 6100 0
A X 2018Q3 5600 600 1
A X 2018Q4 500 6007 1
A Y 2016Q1 900 620 1
A Y 2016Q2 750 850 0
A Y 2017Q1 750 850 1
A Y 2017Q2 750 850 1
A Y 2017Q3 750 850 1
A Y 2018Q1 750 850 1
A Y 2018Q2 750 850 1
A Y 2018Q3 750 850 1
A Y 2018Q4 750 850 1
A row at the (colA, colB) level passes the check if, within that group, flag==1 for 4 continuous quarters after sorting, evaluated with a stride of one quarter.
We stride like this: 2018Q1-2018Q4, then 2018Q2-2019Q1, and so on. Wherever there are 4 continuous quarters with flag==1, we label those rows as 1.
The final output will be like:
colA colB colC colD colE flag check_qtr
A X 2018Q1 500 600 1 1
A X 2018Q2 200 800 1 1
A X 2018Q3 100 400 1 1
A X 2018Q4 500 600 1 1
A X 2019Q1 400 7000 0 0
A X 2019Q2 1500 6100 0 0
A X 2018Q3 5600 600 1 0
A X 2018Q4 500 6007 1 0
A Y 2016Q1 900 620 1 0
A Y 2016Q2 750 850 0 0
A Y 2017Q1 750 850 1 0
A Y 2017Q2 750 850 1 0
A Y 2017Q3 750 850 1 0
A Y 2018Q1 750 850 1 1
A Y 2018Q2 750 850 1 1
A Y 2018Q3 750 850 1 1
A Y 2018Q4 750 850 1 1
How can we do this using pandas and numpy?
Can we implement this using SQL?

Concerning your first question, this can be done like this using pandas.
First I'll generate your example dataframe:
import pandas as pd

df = pd.DataFrame({'colA': ['A']*17,
                   'colB': ['X']*8 + ['Y']*9,
                   'flag': [1,1,1,1,0,0,1,1,1,0,1,1,1,1,1,1,1]})
df.set_index(['colA','colB'], inplace=True)  # set a MultiIndex of colA and colB
Resulting in your example dataframe. However, to use the following approach, we'll need to go back to a normal index:
df.reset_index(inplace=True)
colA colB flag
0 A X 1
1 A X 1
2 A X 1
3 A X 1
4 A X 0
5 A X 0
6 A X 1
7 A X 1
8 A Y 1
9 A Y 0
10 A Y 1
11 A Y 1
12 A Y 1
13 A Y 1
14 A Y 1
15 A Y 1
16 A Y 1
Then, to obtain your result column, you can use groupby (with some prints to show what's going on):
from scipy.ndimage import shift  # scipy.ndimage.interpolation is deprecated; import from scipy.ndimage
import numpy as np

df['check_qtr'] = pd.Series(0, index=df.index)  # initialise the result column
for name, group in df.groupby(['colA', 'colB', 'flag']):
    if name[2] == 1:
        print(name)
        # is each row's index exactly 1 after the previous row's index?
        idx = ((group.index.values - shift(group.index.values, 1, cval=-1)) == 1).astype(int)
        print(idx)
        # do the 4 next indexes follow each other?
        bools = [idx[x:x+4].sum() == 4 for x in range(len(idx))]
        print(bools)
        for i in group.index.values[bools]:  # for each index where the 4 next indexes follow each other
            df.loc[i:i+3, 'check_qtr'] = 1  # set check_qtr from row i to row i+3
('A', 'X', 1)
[1 1 1 1 0 1]
[True, False, False, False, False, False]
('A', 'Y', 1)
[0 0 1 1 1 1 1 1]
[False, False, True, True, True, False, False, False]
Note that we use +4 where we do array indexing, because array[x:x+4] gives the 4 values at indexes x to x+3.
We use +3 with loc because loc doesn't follow the same logic: it retrieves rows by label, not by position, so the slice from label idx to label idx+3 is inclusive and yields 4 rows.
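A quick generic illustration of that inclusive/exclusive difference (a toy example of mine, not part of the answer's data):

import pandas as pd

s = pd.Series(range(10, 20))
print(s.iloc[2:2+4])  # positions 2..5 -> 4 values, end is exclusive
print(s.loc[2:2+3])   # labels 2..5 -> 4 values, end is inclusive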
Running the loop above gives you the result you want:
colA colB flag check_qtr
0 A X 1 1
1 A X 1 1
2 A X 1 1
3 A X 1 1
4 A X 0 0
5 A X 0 0
6 A X 1 0
7 A X 1 0
8 A Y 1 0
9 A Y 0 0
10 A Y 1 0
11 A Y 1 1
12 A Y 1 1
13 A Y 1 1
14 A Y 1 1
15 A Y 1 1
16 A Y 1 1
This may not be the perfect way to do it, but it should give you some hints about how to use some of these functions!
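For completeness, here is a more vectorised sketch of the same idea (my own variation, not part of the answer above). It marks every row that falls inside some window of 4 consecutive flag==1 rows; note that at the start of a run longer than 4 it can mark one row more than the loop above (index 10 here):

import pandas as pd

def mark_runs(flag, width=4):
    # True where the window of `width` rows ending here is all 1s
    ends = flag.rolling(width).sum().eq(width)
    # spread each hit backwards over the window it closes
    return ends[::-1].rolling(width, min_periods=1).max()[::-1].astype(int)

df['check_qtr'] = df.groupby(['colA', 'colB'])['flag'].transform(mark_runs)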

Related

Pandas: I want to slice the data and shuffle it to generate some synthetic data

I want to generate some synthetic data for a data-science task, since we don't have enough labelled data. I want to cut the rows at random positions around the 0s of the y column, without cutting through a sequence of 1s.
After cutting, I want to shuffle those slices and generate a new DataFrame.
Ideally there would be parameters that adjust the maximum and minimum slice length, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np
import pandas as pd

# number of cuts
N = 3
# pick N random index values among rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# create group labels with a membership check and a cumulative sum
arr = df.index.isin(idx).cumsum()
# randomize the unique integers - groups
u = np.unique(arr)
np.random.shuffle(u)
# change the order of the groups in the DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print(df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
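If you also want the minimum/maximum slice-length controls you asked about, one simple extension of the same idea is to re-draw the cut points until every slice length falls in range. This is only a sketch; the function and parameter names are mine, not part of the answer above:

import numpy as np
import pandas as pd

def shuffle_slices(df, n_cuts=3, min_len=1, max_len=10, max_tries=100, seed=None):
    rng = np.random.default_rng(seed)
    zero_idx = df.index[df['y'].eq(0)].to_numpy()  # candidate cut positions (y == 0 rows)
    groups = df.index.isin(rng.choice(zero_idx, n_cuts, replace=False)).cumsum()
    for _ in range(max_tries):
        sizes = pd.Series(groups).value_counts()
        if sizes.between(min_len, max_len).all():
            break
        # at least one slice is too short or too long: re-draw the cut points
        groups = df.index.isin(rng.choice(zero_idx, n_cuts, replace=False)).cumsum()
    u = np.unique(groups)
    rng.shuffle(u)  # randomize the slice order
    return df.set_index(groups).loc[u].reset_index(drop=True)

new_df = shuffle_slices(df, n_cuts=3, min_len=1, max_len=8, seed=42)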

Conditional sum after groupby based on value in another column

I have the dataframe as below.
Cycle Type Count Value
1 1 5 0.014
1 1 40 -0.219
1 1 5 0.001
1 1 100 -0.382
1 1 5 0.001
1 1 25 -0.064
2 1 5 0.003
2 1 110 -0.523
2 1 10 0.011
2 1 5 -0.009
2 1 5 0.012
2 1 156 -0.612
3 1 5 0.002
3 1 45 -0.167
3 1 5 0.003
3 1 10 -0.052
3 1 5 0.001
3 1 80 -0.194
I want to sum the 'Count' of all the positive and all the negative 'Value' rows after a groupby.
The answer would be something like:
1 1 15 (sum of count when Value is positive),
1 1 165 (sum of count when Value is negative),
2 1 20,
2 1 171,
3 1 15,
3 1 135
I think something like grouped.set_index('Count').groupby(['Cycle','Type'])['Value']... should work, but I am unable to figure out how to restrict sum() to positive or negative values.
If I understood correctly, you can try the code below:
df = pd.DataFrame(data)
df_negative = df[df['Value'] < 0]
df_positive = df[df['Value'] > 0]
df_negative = df_negative.groupby(['Cycle', 'Type']).Count.sum().reset_index()
df_positive = df_positive.groupby(['Cycle', 'Type']).Count.sum().reset_index()
df_combine = pd.concat([df_positive, df_negative]).sort_values('Cycle')
df_combine
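The same result can also be obtained in a single pass by grouping on the sign of Value as well (my own variation, not part of the answer above):

import numpy as np

out = (df.groupby(['Cycle', 'Type', np.sign(df['Value']).rename('sign')])['Count']
         .sum()
         .reset_index())
print(out)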

What's the problem with this one-hot encoding?

In [4]: data = pd.read_csv('student_data.csv')
In [5]: data[:10]
Out[5]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
5 1 760 3.00 2
6 1 560 2.98 1
7 0 400 3.08 2
8 1 540 3.39 3
9 0 700 3.92 2
one_hot_data = pd.get_dummies(data['rank'])
# TODO: Drop the previous rank column
data = data.drop('rank', axis=1)
data = data.join(one_hot_data)
# Print the first 10 rows of our data
data[:10]
It always gives an error:
KeyError: 'rank'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-6a749c8f286e> in <module>()
1 # TODO: Make dummy variables for rank
----> 2 one_hot_data = pd.get_dummies(data['rank'])
3
4 # TODO: Drop the previous rank column
5 data = data.drop('rank', axis=1)
If you get:
KeyError: 'rank'
it means there is no column named rank. The problem is most likely trailing whitespace in the column name or an encoding issue. Check with:
print(data.columns.tolist())
['admit', 'gre', 'gpa', 'rank']
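If whitespace had been the culprit (here the printed list shows a clean 'rank', so it is not), normalising the headers first is the usual fix, sketched below:

# strip stray whitespace from every column name (only needed if the
# printed list had shown something like 'rank ')
data.columns = data.columns.str.strip()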
Your solution can be simplified with DataFrame.pop; it selects the column and removes it from the original DataFrame:
data = data.join(pd.get_dummies(data.pop('rank')))
# Print the first 10 rows of our data
print(data[:10])
admit gre gpa 1 2 3 4
0 0 380 3.61 0 0 1 0
1 1 660 3.67 0 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 0 0 0 1
4 0 520 2.93 0 0 0 1
5 1 760 3.00 0 1 0 0
6 1 560 2.98 1 0 0 0
7 0 400 3.08 0 1 0 0
8 1 540 3.39 0 0 1 0
9 0 700 3.92 0 1 0 0
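Optionally, the prefix parameter of get_dummies keeps the origin of the dummy columns visible (a small variation on the pop line above):

data = data.join(pd.get_dummies(data.pop('rank'), prefix='rank'))
# columns become rank_1, rank_2, rank_3, rank_4 instead of 1, 2, 3, 4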
I tried your code and it works fine. You may need to rerun the previous cells, including the one that loads the data.

Merging 3 dataframes on a condition

I have a dataframe df:
id value
1 100
2 200
3 500
4 600
5 700
6 800
I have another dataframe df2:
c_id flag
2 Y
3 Y
5 Y
Similarly, df3:
c_id flag
1 N
3 Y
4 Y
I want to merge these 3 dataframes and create a flag column in df, such that df looks like:
id value flag
1 100 N
2 200 Y
3 500 Y
4 600 Y
5 700 Y
6 800 nan
I DON'T WANT to concatenate df2 and df3, e.g.:
final = pd.concat([df2, df3], ignore_index=False)
final.drop_duplicates(inplace=True)
I don't want to use this method; is there any other way?
Using pd.merge between df and the combined df2+df3 (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
In [1150]: df.merge(pd.concat([df2, df3]), left_on=['id'], right_on=['c_id'], how='left')
Out[1150]:
id value c_id flag
0 1 100 1.0 N
1 2 200 2.0 Y
2 3 500 3.0 Y
3 3 500 3.0 Y
4 4 600 4.0 Y
5 5 700 5.0 Y
6 6 800 NaN NaN
Details
In [1151]: pd.concat([df2, df3])
Out[1151]:
c_id flag
0 2 Y
1 3 Y
2 5 Y
0 1 N
1 3 Y
2 4 Y
Using map you could do:
In [1140]: df.assign(flag=df.id.map(
               df2.set_index('c_id')['flag'].combine_first(
                   df3.set_index('c_id')['flag'])))
Out[1140]:
id value flag
0 1 100 N
1 2 200 Y
2 3 500 Y
3 4 600 Y
4 5 700 Y
5 6 800 NaN
Let me explain: set_index and combine_first create a mapping from c_id to flag, with df2's values taking priority where a key appears in both frames:
In [1141]: mapping = df2.set_index('c_id')['flag'].combine_first(
               df3.set_index('c_id')['flag'])
In [1142]: mapping
Out[1142]:
c_id
1 N
2 Y
3 Y
4 Y
5 Y
Name: flag, dtype: object
In [1143]: df.assign(flag=df.id.map(mapping))
Out[1143]:
id value flag
0 1 100 N
1 2 200 Y
2 3 500 Y
3 4 600 Y
4 5 700 Y
5 6 800 NaN
Merge df on both df2 and df3 (the question's key columns differ, so use left_on/right_on):
df = (df.merge(df2, left_on='id', right_on='c_id', how='left')
        .merge(df3, left_on='id', right_on='c_id', how='left'))
Fill the nulls, preferring df2's flag:
df['flag'] = df['flag_x'].fillna(df['flag_y'])
Delete the helper columns:
df = df.drop(columns=['flag_x', 'flag_y', 'c_id_x', 'c_id_y'])
Or you could just concatenate df2 and df3 first:
df4 = pd.concat([df2, df3])
pd.merge(df, df4, how='left', left_on='id', right_on='c_id')

Set column name for size()

I'm trying to rename the size() column, like this:
x = monthly.copy()
x["size"] = x\
.groupby(["sub_acct_id", "clndr_yr_month"]).transform(np.size)
But what I'm getting is
ValueError: Wrong number of items passed 15, placement implies 1
Why is this not working for my dataframe?
If I simply print the copy:
x = monthly.copy()
print(x)
this is what the table looks like:
sub_acct_id clndr_yr_month
12716D 201601 219
201602 265
12716G 201601 221
201602 262
12716K 201601 181
201602 149
...
What I'm trying to accomplish is to set the name of the count column:
sub_acct_id clndr_yr_month size
12716D 201601 219
201602 265
12716G 201601 221
201602 262
12716K 201601 181
201602 149
...
You need to select a single column before calling transform; otherwise transform(np.size) is applied to every remaining column, and pandas tries to assign that 15-column result to the single column size (hence the Wrong number of items passed 15 error):
x["size"] = x.groupby(["sub_acct_id", "clndr_yr_month"])['sub_acct_id'].transform('size')
Sample:
df = pd.DataFrame({'sub_acct_id': ['x','x','x','x','y','y','y','z','z'],
                   'clndr_yr_month': ['a','b','c','c','a','b','c','a','b']})
print(df)
clndr_yr_month sub_acct_id
0 a x
1 b x
2 c x
3 c x
4 a y
5 b y
6 c y
7 a z
8 b z
df['size'] = df.groupby(['sub_acct_id', 'clndr_yr_month'])['sub_acct_id'].transform('size')
print(df)
clndr_yr_month sub_acct_id size
0 a x 1
1 b x 1
2 c x 2
3 c x 2
4 a y 1
5 b y 1
6 c y 1
7 a z 1
8 b z 1
Another solution, with an aggregated output:
df = df.groupby(['sub_acct_id', 'clndr_yr_month']).size().reset_index(name='Size')
print(df)
sub_acct_id clndr_yr_month Size
0 x a 1
1 x b 1
2 x c 2
3 y a 1
4 y b 1
5 y c 1
6 z a 1
7 z b 1
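In recent pandas versions, the as_index=False shortcut produces the same aggregated shape directly (a small variation, not from the answer above; it starts again from the sample df built earlier, before the aggregation):

out = df.groupby(['sub_acct_id', 'clndr_yr_month'], as_index=False).size()
out = out.rename(columns={'size': 'Size'})
print(out)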