I want to generate some synthetic data for a data science task, since we don't have enough labelled data. The idea is to cut the rows at random positions where the y column is 0, without ever cutting inside a sequence of 1s.
After cutting, I want to shuffle those slices and build a new DataFrame from them.
Ideally there would be parameters to adjust the maximum and minimum length of a slice, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT (the cut falls inside a sequence of 1s):
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np

# number of cuts
N = 3
# pick N random index values from rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# build group labels: the membership check plus cumulative sum starts a new group at each cut row
arr = df.index.isin(idx).cumsum()
# shuffle the unique group labels
u = np.unique(arr)
np.random.shuffle(u)
# reorder the groups in the DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print (df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
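The code above doesn't yet expose the minimum/maximum slice length you asked for. One possible way to add those knobs is to resample the cut points until every slice length falls inside the bounds; this is only a sketch, and shuffle_slices, n_cuts, min_len and max_len are illustrative names, not part of the answer above:

import numpy as np
import pandas as pd

def shuffle_slices(df, n_cuts=3, min_len=1, max_len=6, seed=None):
    # cut only at rows where y == 0, so no sequence of 1s is split
    rng = np.random.default_rng(seed)
    candidates = df.index[df['y'].eq(0)].to_numpy()
    for _ in range(1000):  # retry until the length bounds are satisfied
        cuts = rng.choice(candidates, n_cuts, replace=False)
        groups = df.index.isin(cuts).cumsum()  # same grouping trick as above
        sizes = pd.Series(groups).value_counts()
        if sizes.between(min_len, max_len).all():
            break
    else:
        # give up if the bounds are impossible for this data
        raise ValueError('no cut found that satisfies min_len/max_len')
    order = np.unique(groups)
    rng.shuffle(order)  # randomize the order of the slices
    return df.set_index(groups).loc[order].reset_index(drop=True)

new_df = shuffle_slices(df, n_cuts=3, min_len=1, max_len=6)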
I have a DataFrame with columns like:
colA colB colC colD colE flag
A X 2018Q1 500 600 1
A X 2018Q2 200 800 1
A X 2018Q3 100 400 1
A X 2018Q4 500 600 1
A X 2019Q1 400 7000 0
A X 2019Q2 1500 6100 0
A X 2018Q3 5600 600 1
A X 2018Q4 500 6007 1
A Y 2016Q1 900 620 1
A Y 2016Q2 750 850 0
A Y 2017Q1 750 850 1
A Y 2017Q2 750 850 1
A Y 2017Q3 750 850 1
A Y 2018Q1 750 850 1
A Y 2018Q2 750 850 1
A Y 2018Q3 750 850 1
A Y 2018Q4 750 850 1
A row at the (colA, colB) level passes a statistical check if, within that (colA, colB) group, flag == 1 holds for 4 continuous quarters of data after sorting, stepping one quarter at a time.
The stride works like this: 2018Q1-2018Q4, then 2018Q2-2019Q1, and so on. If there are 4 continuous quarters with flag == 1, we label those rows as 1.
The final output will be like:
colA colB colC colD colE flag check_qtr
A X 2018Q1 500 600 1 1
A X 2018Q2 200 800 1 1
A X 2018Q3 100 400 1 1
A X 2018Q4 500 600 1 1
A X 2019Q1 400 7000 0 0
A X 2019Q2 1500 6100 0 0
A X 2018Q3 5600 600 1 0
A X 2018Q4 500 6007 1 0
A Y 2016Q1 900 620 1 0
A Y 2016Q2 750 850 0 0
A Y 2017Q1 750 850 1 0
A Y 2017Q2 750 850 1 0
A Y 2017Q3 750 850 1 0
A Y 2018Q1 750 850 1 1
A Y 2018Q2 750 850 1 1
A Y 2018Q3 750 850 1 1
A Y 2018Q4 750 850 1 1
How can we do this using pandas and numpy?
Can we implement this using SQL?
Concerning your first question, this can be done using pandas.
First I'll generate your example dataframe:
import pandas as pd

df = pd.DataFrame({'colA': ['A']*17,
                   'colB': ['X']*8 + ['Y']*9,
                   'flag': [1,1,1,1,0,0,1,1,1,0,1,1,1,1,1,1,1]})
df.set_index(['colA','colB'], inplace=True)  # set a MultiIndex on colA and colB
Resulting in your example dataframe. However, to use the following approach, we'll need to go back to a normal index:
df.reset_index(inplace=True)
colA colB flag
0 A X 1
1 A X 1
2 A X 1
3 A X 1
4 A X 0
5 A X 0
6 A X 1
7 A X 1
8 A Y 1
9 A Y 0
10 A Y 1
11 A Y 1
12 A Y 1
13 A Y 1
14 A Y 1
15 A Y 1
16 A Y 1
Then to obtain your result column you can use the groupby function (with some prints to show what's going on):
from scipy.ndimage import shift  # scipy.ndimage.interpolation is deprecated
import numpy as np

df['check_qtr'] = pd.Series(0, index=df.index)  # initialise the result column
for name, group in df.groupby(['colA','colB','flag']):
    if name[2] == 1:
        print(name)
        # is each index exactly 1 greater than the previous index in the group?
        idx = ((group.index.values - shift(group.index.values, 1, cval=-1)) == 1).astype(int)
        print(idx)
        # are the 4 values idx[x] to idx[x+3] all 1, i.e. do 4 indexes follow each other?
        bools = [idx[x:x+4].sum() == 4 for x in range(len(idx))]
        print(bools)
        for idx in group.index.values[bools]:  # for each index where that holds
            df.loc[idx:idx+3, 'check_qtr'] = 1  # set check_qtr from row idx to row idx+3
('A', 'X', 1)
[1 1 1 1 0 1]
[True, False, False, False, False, False]
('A', 'Y', 1)
[0 0 1 1 1 1 1 1]
[False, False, True, True, True, False, False, False]
Note that we use +4 for array indexing, because array[x:x+4] gives the 4 values at positions x to x+3.
We use +3 with loc because loc follows a different logic: it retrieves rows by label, not by position, so the slice from idx to idx+3 is inclusive and covers 4 rows.
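A quick way to see the difference on a throwaway Series (a small sketch, not part of the answer itself):
s = pd.Series([10, 20, 30, 40, 50])
print(s.values[1:1+4])  # positional slice: the 4 values at positions 1..4
print(s.loc[1:1+3])     # label slice: also 4 values, labels 1 to 4 inclusive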
Running the loop above gives you the result you want:
colA colB flag check_qtr
0 A X 1 1
1 A X 1 1
2 A X 1 1
3 A X 1 1
4 A X 0 0
5 A X 0 0
6 A X 1 0
7 A X 1 0
8 A Y 1 0
9 A Y 0 0
10 A Y 1 0
11 A Y 1 1
12 A Y 1 1
13 A Y 1 1
14 A Y 1 1
15 A Y 1 1
16 A Y 1 1
This may not be the perfect way to do it, but it should give you some hints about how to use these functions!
I want to create a cumulative variable based on a non-cumulative variable. The cumulative sum should be reset whenever the value of Y equals 1 (but the reset starts from the row below).
I want to do this for each ID in the DataFrame.
Data illustration:
ID X Non_cum Y
A .. 0 0
A .. 20 0
A .. 40 0
B .. 0 0
B .. 100 0
B .. 200 1
B .. 50 0
Expected result:
ID X Non_cum Y Cum
A .. 0 0 0
A .. 20 0 20
A .. 40 0 60
B .. 0 0 0
B .. 100 0 100
B .. 200 1 300
B .. 50 0 50
You can group by ID and build block identifiers from the cumulative sum of the shifted Y, then take the cumulative sum of Non_cum within each block:
groups = df.groupby(['ID'])
# shift Y down one row per ID so the reset starts on the row below the 1
df['Y_block'] = groups['Y'].shift(fill_value=0)
# cumulative sum of the shifted flag gives a block id that increases after each Y == 1
df['Y_block'] = groups['Y_block'].cumsum()
# the cumulative sum of Non_cum restarts whenever the block id changes
df['Cum'] = df.groupby(['ID','Y_block'])['Non_cum'].cumsum()
Output (Cum column):
0 0
1 20
2 60
3 0
4 100
5 300
6 50
Name: Cum, dtype: int64
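For reference, a self-contained version of the steps above on the sample data (column X omitted), keeping the intermediate Y_block column so the grouping is visible:

import pandas as pd

df = pd.DataFrame({
    'ID':      ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Non_cum': [0, 20, 40, 0, 100, 200, 50],
    'Y':       [0, 0, 0, 0, 0, 1, 0],
})

df['Y_block'] = df.groupby('ID')['Y'].shift(fill_value=0)
df['Y_block'] = df.groupby('ID')['Y_block'].cumsum()
df['Cum'] = df.groupby(['ID', 'Y_block'])['Non_cum'].cumsum()
print(df[['ID', 'Non_cum', 'Y', 'Y_block', 'Cum']])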
I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate the percentage of 0s/1s for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide the column Sum using GroupBy.transform, which returns a Series with the same length as the original DataFrame, filled with the aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64
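A self-contained version of the example, in case you want to run it as-is (same data as above):

import pandas as pd

df = pd.DataFrame({
    'Class':   [1, 1, 2, 2, 3, 3],
    'Boolean': [0, 1, 0, 1, 0, 1],
    'Sum':     [10, 20, 15, 25, 52, 48],
})

# per-class totals broadcast back to every row, then an elementwise division
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print(df)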
In [4]: data = pd.read_csv('student_data.csv')
In [5]: data[:10]
Out[5]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
5 1 760 3.00 2
6 1 560 2.98 1
7 0 400 3.08 2
8 1 540 3.39 3
9 0 700 3.92 2
one_hot_data = pd.get_dummies(data['rank'])
# TODO: Drop the previous rank column
data = data.drop('rank', axis=1)
data = data.join(one_hot_data)
# Print the first 10 rows of our data
data[:10]
It always gives an error:
KeyError: 'rank'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-6a749c8f286e> in <module>()
1 # TODO: Make dummy variables for rank
----> 2 one_hot_data = pd.get_dummies(data['rank'])
3
4 # TODO: Drop the previous rank column
5 data = data.drop('rank', axis=1)
If you get:
KeyError: 'rank'
it means there is no column named rank. Most likely the problem is trailing whitespace in the column name or an encoding issue. Check the actual names with:
print (data.columns.tolist())
['admit', 'gre', 'gpa', 'rank']
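If the printed list shows something like 'rank ' instead of 'rank', stripping the whitespace is usually enough (a small sketch, assuming the header really is padded):
data.columns = data.columns.str.strip()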
Your solution can be simplified with DataFrame.pop - it selects a column and removes it from the original DataFrame:
data = data.join(pd.get_dummies(data.pop('rank')))
# Print the first 10 rows of our data
print(data[:10])
admit gre gpa 1 2 3 4
0 0 380 3.61 0 0 1 0
1 1 660 3.67 0 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 0 0 0 1
4 0 520 2.93 0 0 0 1
5 1 760 3.00 0 1 0 0
6 1 560 2.98 1 0 0 0
7 0 400 3.08 0 1 0 0
8 1 540 3.39 0 0 1 0
9 0 700 3.92 0 1 0 0
I tried your code and it works fine. You may need to rerun the previous cells, including the one that loads the data.