I need help with a data science task: generating synthetic data, since we don't have enough labelled data. I want to cut the DataFrame at random positions where the y column is 0, without ever cutting through a sequence of 1s.
After cutting, I want to shuffle the slices and build a new DataFrame.
Ideally there would be parameters to adjust the maximum and minimum length of a slice, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT (it cuts through a sequence of 1s):
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np

# number of cuts
N = 3
# pick N distinct index labels at random among the rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# label the slices: each picked row starts a new group (membership flag + cumulative sum)
arr = df.index.isin(idx).cumsum()
# shuffle the unique group labels
u = np.unique(arr)
np.random.shuffle(u)
# reorder the DataFrame by the shuffled group labels
df = df.set_index(arr).loc[u].reset_index(drop=True)
print (df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
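The same idea can be wrapped in a function so the number of cuts and the random seed become parameters. This is only a sketch: the name `shuffle_slices` and its arguments are made up for illustration, and it places a cut only immediately before a row with y == 0, so it never splits a run of 1s but also never cuts between a 0 and a following 1. Minimum and maximum slice lengths are not handled here.

```python
import numpy as np
import pandas as pd


def shuffle_slices(df, n_cuts, label_col="y", seed=None):
    """Cut df just before randomly chosen rows where label_col == 0, then shuffle the slices.

    Hypothetical helper for illustration; not part of pandas.
    """
    rng = np.random.default_rng(seed)
    # candidate cut positions: rows with label 0 (the cut goes right before the row)
    zero_pos = np.flatnonzero(df[label_col].to_numpy() == 0)
    zero_pos = zero_pos[zero_pos > 0]  # a cut before the first row would create nothing
    n_cuts = min(n_cuts, len(zero_pos))
    cuts = rng.choice(zero_pos, size=n_cuts, replace=False)
    # mark the first row of every new slice and number the slices with a cumulative sum
    marks = np.zeros(len(df), dtype=int)
    marks[cuts] = 1
    groups = marks.cumsum()
    # put the slices back together in random order
    order = rng.permutation(n_cuts + 1)
    pieces = [df.iloc[groups == g] for g in order]
    return pd.concat(pieces, ignore_index=True)


raw = pd.DataFrame({
    "ts": range(13),
    "v1": [100, 120, 80, 5, 2, 100, 200, 1234, 12, 40, 200, 300, 0.5],
    "y": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})
out = shuffle_slices(raw, n_cuts=3, seed=0)
print(out)
```

Passing a seed makes the shuffle reproducible, which helps when generating several synthetic variants of the same data.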
I can't find a similar example that helps me understand how to do this in Python. I have a dataset that looks like this:
ID Capacity
A 50
A 50
A 50
B 30
B 30
B 30
C 100
C 100
C 100
I need to find each ID's capacity as a fraction of the total of the "Capacity" column, counting each ID only once. So, the answer looks like this:
ID Capacity Percent_Capacity
A 50 0.2777
A 50 0.2777
A 50 0.2777
B 30 0.1666
B 30 0.1666
B 30 0.1666
C 100 0.5555
C 100 0.5555
C 100 0.5555
Thank you, I'm still learning Python.
# sum one Capacity value per ID (this relies on Capacity being constant within each ID)
total=df.groupby('ID')['Capacity'].first().sum()
df['percent_capacity'] = df['Capacity']/total
df
ID Capacity percent_capacity
0 A 50 0.277778
1 A 50 0.277778
2 A 50 0.277778
3 B 30 0.166667
4 B 30 0.166667
5 B 30 0.166667
6 C 100 0.555556
7 C 100 0.555556
8 C 100 0.555556
Using drop_duplicates:
df['percent_capacity'] = df['Capacity']/df.drop_duplicates(subset='ID')['Capacity'].sum()
Output:
ID Capacity percent_capacity
0 A 50 0.277778
1 A 50 0.277778
2 A 50 0.277778
3 B 30 0.166667
4 B 30 0.166667
5 B 30 0.166667
6 C 100 0.555556
7 C 100 0.555556
8 C 100 0.555556
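Both answers rely on every ID having a single Capacity value. A minimal sketch that rebuilds the sample data and checks that assumption before computing the share (the explicit check is my addition, not part of either answer):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAABBBCCC"),
    "Capacity": [50, 50, 50, 30, 30, 30, 100, 100, 100],
})

# assumption: Capacity never varies within an ID; fail loudly if it does
if not df.groupby("ID")["Capacity"].nunique().eq(1).all():
    raise ValueError("Capacity differs within at least one ID")

# one Capacity value per ID: 50 + 30 + 100 = 180
total = df.groupby("ID")["Capacity"].first().sum()
df["percent_capacity"] = df["Capacity"] / total
print(df)
```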
I have a data frame as shown below:
ID Unit_ID Price Duration
1 A 200 2
2 B 1000 3
2 C 1000 3
2 D 1000 3
2 F 1000 3
2 G 200 1
3 A 500 2
3 B 200 2
In the above data frame, if several rows share the same ID, Price and Duration, I want to replace Price with Price divided by the number of rows with that combination.
For example, rows 2 to 5 above have the same ID, Price and Duration, so their count is 4 and the new Price is 1000/4 = 250.
Expected Output:
ID Unit_ID Price Duration
1 A 200 2
2 B 250 3
2 C 250 3
2 D 250 3
2 F 250 3
2 G 200 1
3 A 500 2
3 B 200 2
Use GroupBy.transform with 'size' to get a Series of group counts that has the same length and index as the original, then divide Price by it with Series.div:
df['Price'] = df['Price'].div(df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
print (df)
ID Unit_ID Price Duration
0 1 A 200.0 2
1 2 B 250.0 3
2 2 C 250.0 3
3 2 D 250.0 3
4 2 F 250.0 3
5 2 G 200.0 1
6 3 A 500.0 2
7 3 B 200.0 2
Detail:
print (df.groupby(['ID','Price','Duration'])['Price'].transform('size'))
0 1
1 4
2 4
3 4
4 4
5 1
6 1
7 1
Name: Price, dtype: int64
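To try this end to end, here is a small self-contained sketch that rebuilds the sample frame and confirms that the divided prices within each group add back up to the original price. The variable names `counts` and `original` are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 2, 2, 2, 3, 3],
    "Unit_ID": list("ABCDFGAB"),
    "Price": [200, 1000, 1000, 1000, 1000, 200, 500, 200],
    "Duration": [2, 3, 3, 3, 3, 1, 2, 2],
})

keys = ["ID", "Price", "Duration"]
# number of rows sharing each (ID, Price, Duration) combination, aligned to df's rows
counts = df.groupby(keys)["Price"].transform("size")
original = df["Price"].copy()
df["Price"] = df["Price"].div(counts)

# sanity check: each group's split prices sum back to the original price
group_sums = df["Price"].groupby([df["ID"], original, df["Duration"]]).transform("sum")
print(df)
```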
I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Is r a list or a single element?
You can use Series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
Another option is to sort by price and number the rows within each group with cumcount:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Output:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2
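On the question about the argument: with a groupby on a single column, transform calls the function once per group and passes that group's values as a Series, so r behaves like a column of prices for one srch_id. A small sketch on the sample data, also showing one way to get integer positions from rank (the choice of method="first" to break ties by row order is my assumption about what is wanted):

```python
import pandas as pd

df = pd.DataFrame({
    "srch_id": [1, 1, 1, 3, 3, 3],
    "price": [30, 20, 25, 15, 102, 39],
})

# the lambda receives one Series per srch_id; the result is broadcast back to every row
df["min"] = df.groupby("srch_id")["price"].transform(lambda s: s.min())

# rank returns floats by default; method="first" gives distinct ranks, so int is safe
df["price_position"] = (
    df.groupby("srch_id")["price"].rank(method="first").astype(int)
)
print(df)
```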
The pivot table doesn't look the way I want; specifically, the order of the resulting columns is not what I'm after.
I can't figure out how to change it properly.
Example df:
test_df = pd.DataFrame({'name':['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
'month':[1,2,3,1,2,3,1,2,3],
'salary':[100,100,100,110,110,110,120,120,120],
'status':[1,1,2,1,1,3,2,2,1]})
Code to make the pivot:
test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
Actual output:
salary status
month 1 2 3 1 2 3
name
name_1 100 100 100 1 1 2
name_2 110 110 110 1 1 3
name_3 120 120 120 2 2 1
The output I want to see:
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
You would use sort_index, indicating the axis and the level:
piv = test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
piv.sort_index(axis='columns', level='month')
# salary status salary status salary status
#month 1 1 2 2 3 3
#name
#name_1 100 1 100 1 100 2
#name_2 110 1 110 1 110 3
#name_3 120 2 120 2 120 1
Use DataFrame.sort_index with the axis=1, level=1 arguments:
(test_df.pivot_table(index='name', columns=['month'],
values=['salary', 'status'])
.sort_index(axis=1, level=1))
[out]
salary status salary status salary status
month 1 1 2 2 3 3
name
name_1 100 1 100 1 100 2
name_2 110 1 110 1 110 3
name_3 120 2 120 2 120 1
import pandas as pd
df = pd.DataFrame({'name':
['name_1','name_1','name_1','name_2','name_2','name_2','name_3','name_3','name_3'],
'month':[1,2,3,1,2,3,1,2,3],
'salary':[100,100,100,110,110,110,120,120,120],
'status':[1,1,2,1,1,3,2,2,1]})
df = df.pivot_table(index='name', columns=['month'],
values=['salary', 'status']).sort_index(axis='columns', level='month')
print(df)
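To see why sorting on the month level gives the interleaved layout, it can help to look at the column labels directly. A short sketch on the same data; the variable name `piv` follows the first answer:

```python
import pandas as pd

test_df = pd.DataFrame({
    "name": ["name_1"] * 3 + ["name_2"] * 3 + ["name_3"] * 3,
    "month": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "salary": [100, 100, 100, 110, 110, 110, 120, 120, 120],
    "status": [1, 1, 2, 1, 1, 3, 2, 2, 1],
})

piv = test_df.pivot_table(index="name", columns="month",
                          values=["salary", "status"])
# sort the column MultiIndex by month first; the remaining level (salary/status)
# is then sorted within each month
piv = piv.sort_index(axis=1, level="month")
print(list(piv.columns))
```

Each column label is a (value, month) tuple, and after sorting the tuples are ordered by month, with salary before status inside each month.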