Cut continuous data by outliers - pandas

For example, I have this DataFrame:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'b': [2, 2, 4, 3, 1000, 2000, 1, 500, 3]})
I need to cut by outliers and get these intervals: 1-4, 5-6, 7, 8, 9.
Cutting with pd.cut or pd.qcut does not give these results.

You can group consecutive rows by whether they fall above or below the outlier threshold:
m = df['b'].gt(100)                     # True where b is an outlier
df['group'] = m.ne(m.shift()).cumsum()  # start a new group whenever the mask flips
output:
a b group
0 1 2 1
1 2 2 1
2 3 4 1
3 4 3 1
4 5 1000 2
5 6 2000 2
6 7 1 3
7 8 500 4
8 9 3 5
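If you also need the interval labels themselves (1-4, 5-6, 7, 8, 9), one way is to aggregate column a per group. A minimal sketch building directly on the grouping above, with the same threshold of 100:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'b': [2, 2, 4, 3, 1000, 2000, 1, 500, 3]})
m = df['b'].gt(100)
df['group'] = m.ne(m.shift()).cumsum()

# Collapse each consecutive run into the first/last value of 'a'.
intervals = df.groupby('group')['a'].agg(['min', 'max'])
print(intervals)
#        min  max
# group
# 1        1    4
# 2        5    6
# 3        7    7
# 4        8    8
# 5        9    9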


Pandas data manipulation from column to row elements [duplicate]

This question already has answers here: Concatenate strings from several rows using Pandas groupby (8 answers). Closed 9 months ago.
I have a dataset with millions of rows; here is an example of what it looks like and what I intend to output:
import pandas as pd

data = [[1, 100, 8], [1, 100, 4],
        [1, 100, 6], [2, 100, 0],
        [2, 200, 1], [3, 300, 7],
        [4, 400, 2], [5, 100, 6],
        [5, 100, 3], [5, 600, 1]]
df = pd.DataFrame(data, columns=['user', 'time', 'item'])
print(df)
   user  time  item
0     1   100     8
1     1   100     4
2     1   100     6
3     2   100     0
4     2   200     1
5     3   300     7
6     4   400     2
7     5   100     6
8     5   100     3
9     5   600     1
The desired output should group all items consumed by a user at the same time together in the item column, as follows:
user time item
1 100 8,4,6
2 100 0
5 100 6,3
2 200 1
3 300 7
4 400 2
5 600 1
For example, user 1 consumed items 8, 4 and 6 at time 100.
How could this be achieved?
Use df.astype with GroupBy.agg and df.sort_values:
In [489]: out = df.astype(str).groupby(['user', 'time'])['item'].agg(','.join).reset_index().sort_values('time')
In [490]: out
Out[490]:
user time item
0 1 100 8,4,6
1 2 100 0
5 5 100 6,3
2 2 200 1
3 3 300 7
4 4 400 2
6 5 600 1
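Note that df.astype(str) casts every column to string, so the final sort_values('time') compares strings rather than numbers (it happens to work here because all the times have three digits). If user and time should stay numeric, a variant of the same approach that casts only the item column:
out = (df.assign(item=df['item'].astype(str))
         .groupby(['user', 'time'])['item']
         .agg(','.join)
         .reset_index()
         .sort_values('time'))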

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1s and 2s: the first two distinct values map to 1, the next two to 2, the next two to 1, and so on, as in the expected output below.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try:
# Collapse each odd/even pair of Step_ID values to one key, then number
# the runs and alternate the labels between 1 and 2.
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
Output:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
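Since the pairing here depends only on the Step_ID value itself (values 1-2, 3-4, 5-6, ... alternate between groups 1 and 2), a purely arithmetic sketch works as well, assuming the groups really are defined by the values rather than by their run order:
import pandas as pd

data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)

# Map each pair of Step_ID values to alternating labels 1, 2, 1, 2, ...
df1['new_group'] = ((df1['Step_ID'] - 1) // 2) % 2 + 1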

Pandas - Add many new columns based on many aggregate functions

Pandas 1.0.5
import pandas as pd
d = pd.DataFrame({
"card_id": [1, 1, 2, 2, 1, 1, 2, 2],
"day": [1, 1, 1, 1, 2, 2, 2, 2],
"amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
#add columns
d['count'] = d.groupby(['card_id', 'day'])["amount"].transform('count')
d['min'] = d.groupby(['card_id', 'day'])["amount"].transform('min')
d['max'] = d.groupby(['card_id', 'day'])["amount"].transform('max')
I would like to change the three transform lines to one line. I tried this:
d['count', 'min', 'max'] = d.groupby(['card_id', 'day'])["amount"].transform('count', 'min', 'max')
Error: "TypeError: count() takes 1 positional argument but 3 were given"
I also tried this:
d[('count', 'min', 'max')] = d.groupby(['card_id', 'day']).agg(
count = pd.NamedAgg('amount', 'count')
,min = pd.NamedAgg('amount', 'min')
,max = pd.NamedAgg('amount', 'max')
)
Error: "TypeError: incompatible index of inserted column with frame index"
Use merge:
d = pd.DataFrame({
"card_id": [1, 1, 2, 2, 1, 1, 2, 2],
"day": [1, 1, 1, 1, 2, 2, 2, 2],
"amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
df_out = d.groupby(['card_id', 'day']).agg(
count = pd.NamedAgg('amount', 'count')
,min = pd.NamedAgg('amount', 'min')
,max = pd.NamedAgg('amount', 'max')
)
d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
Output:
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
The output of your groupby has a MultiIndex, and that index doesn't match the index of d, hence the error. However, we can join the columns of d to the index of the groupby output by using merge with column names on the left and right_index=True.
You could use the assign function to get the results in one go:
grouping = d.groupby(["card_id", "day"])["amount"]
d.assign(
    count=grouping.transform("count"),
    min=grouping.transform("min"),
    max=grouping.transform("max"),
)
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
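A closely related variant: DataFrame.join can match columns of d against the MultiIndex produced by the groupby, which reads a bit shorter than merge. A sketch under the same setup:
# Aggregate once, then align the (card_id, day) columns of d
# against the MultiIndex of the aggregated frame.
stats = d.groupby(['card_id', 'day'])['amount'].agg(['count', 'min', 'max'])
d = d.join(stats, on=['card_id', 'day'])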

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
And I would like to filter to get the first entry of each day that has sales greater than 0. Additionally, I would like to change the 'hour' column to 9 for those rows.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to do it in a more functional, vectorized style?
# Group by day:
groups = df.groupby(by=['day'])
# Get all indices of first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can group the mask df['sales'].gt(0) by df['day'], get idxmax to locate the first True per day, filter out groups that don't have any value greater than 0 using any, and then assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()],'hour']=9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
Create a mask m holding, for each group of day and whether sales is non-zero, the first sales value in that group.
Then use this mask together with df['sales'] > 0 to change those specific rows to 9 with np.where():
import numpy as np

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
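For completeness, a third variant of the same idea: filter to the positive-sales rows first, keep the first row per day with GroupBy.head(1), and write through the surviving index. A sketch:
# Index labels of the first positive-sales row of each day.
idx = df.loc[df['sales'] > 0].groupby('day').head(1).index
df.loc[idx, 'hour'] = 9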

Multiplying Dataframe by Column Value

I'm currently trying to convert a DataFrame of local-currency values to Canadian dollars by multiplying each value by the relevant FX rate.
However, I keep getting this error:
ValueError: operands could not be broadcast together with shapes (12252,) (1021,)
This is the code I'm working with right now. It works on a handful of rows of data, but raises the ValueError once I use it on the full file (1022 rows incl. headers).
import pandas as pd
Local_File = ('RawData.xlsx')
df = pd.read_excel(Local_File, sheet_name = 'Local')
df2 = df.iloc[:,[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]].multiply(df['FX Spot Rate'],axis='index')
print (df2)
My DataFrame has twelve local-currency value columns plus an 'FX Spot Rate' column, with 1022 rows of data (incl. header).
Appreciate any help! Thank you!
The same row-wise multiply works when the multiplier is a column of the same DataFrame. For example:
df = pd.DataFrame({'A': [1, 2, 3, 3, 1],
                   'B': [1, 2, 3, 3, 1],
                   'C': [9, 7, 4, 3, 9]})
A B C
0 1 1 9
1 2 2 7
2 3 3 4
3 3 3 3
4 1 1 9
df.iloc[:, 1:] = df.iloc[:, 1:].multiply(df['A'], axis="index")
df
A B C
0 1 1 9
1 2 4 14
2 3 9 12
3 3 9 9
4 1 1 9
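Applied to the question's setup, the same pattern might look like the sketch below, assuming the twelve local-currency columns sit at positions 2 through 13 and the rate column is literally named 'FX Spot Rate' (both taken from the question):
import pandas as pd

df = pd.read_excel('RawData.xlsx', sheet_name='Local')

# Select the value columns and multiply row-wise by the FX rate,
# aligning on the index rather than broadcasting by position.
value_cols = df.columns[2:14]
df2 = df[value_cols].multiply(df['FX Spot Rate'], axis='index')
print(df2)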