I am trying to find the percentage contribution made by each product within each date group. Below is what my data looks like.
I expect to find the contribution of each product for a given date.
date, product, quantity
2020-01, prod_a, 100
2020-01, prod_b, 200
2020-01, prod_c, 20
2020-01, prod_d, 50
2020-02, prod_a, 30
2020-02, prod_b, 30
2020-02, prod_c, 40
My expected output would be as below:
date, product, quantity, prct_contributed
2020-01, prod_a, 100, 27%
2020-01, prod_b, 200, 54%
2020-01, prod_c, 20, 5%
2020-01, prod_d, 50, 14%
2020-02, prod_a, 30, 30%
2020-02, prod_b, 30, 30%
2020-02, prod_c, 40, 40%
Use groupby().transform():
df['quantity'] / df.groupby('date')['quantity'].transform('sum')
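For completeness, a minimal runnable sketch (the frame mirrors the question's data; rendering the share as a whole-number percentage string is one possible choice):
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-01', '2020-01', '2020-01', '2020-01', '2020-02', '2020-02', '2020-02'],
    'product': ['prod_a', 'prod_b', 'prod_c', 'prod_d', 'prod_a', 'prod_b', 'prod_c'],
    'quantity': [100, 200, 20, 50, 30, 30, 40],
})

# each row's quantity divided by the total quantity of its date group
share = df['quantity'] / df.groupby('date')['quantity'].transform('sum')

# format as a whole-number percentage, matching the expected output
df['prct_contributed'] = (share * 100).round().astype(int).astype(str) + '%'
print(df)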
I'm trying to get some information out of Log Analytics, and I want to know if I can extract the average of values from different lines.
For example, let's say I have a table that goes like this:
.create table customer (name: string, month: int, salary: long, living: long)
.ingest inline into table customer <|
gabriel, 1, 1000, 500
gabriel, 2, 1000, 800
gabriel, 3, 2500, 800
gabriel, 4, 2500, 800
John, 1, 1500, 1000
John, 2, 1500, 500
John, 3, 1500, 500
John, 4, 1500, 1200
jina, 1, 3000, 1000
jina, 2, 3000, 1000
jina, 3, 3000, 1500
jina, 4, 5000, 2500
Here we have the simplest possible table to explain my inquiry: we're listing the salary and living expenses of each customer per month (namely months 1, 2, 3, and 4).
Then I want to know the average salary and living expenses of gabriel, John, and jina over this period of 4 months.
The actual query I want to apply this to is a tad more complicated, but this is enough to explain my problem.
I think this is what you are looking for:
datatable(name:string, month:int, salary:long, living:long)[
'gabriel', 1, 1000, 500,
'gabriel', 2, 1000, 800,
'gabriel', 3, 2500, 800,
'gabriel', 4, 2500, 800,
'John', 1, 1500, 1000,
'John', 2, 1500, 500,
'John', 3, 1500, 500,
'John', 4, 1500, 1200,
'jina', 1, 3000, 1000,
'jina', 2, 3000, 1000,
'jina', 3, 3000, 1500,
'jina', 4, 5000, 2500]
| summarize Avg_Salary=avg(salary), Avg_Expenses=avg(living) by name
Result:
name Avg_Salary Avg_Expenses
gabriel 1750 725
John 1500 800
jina 3500 1500
I am trying to find the average of the last 5 days by day and product. Below is what my DataFrame looks like:
df=pd.DataFrame({
'day':['day_1','day_2','day_3','day_4','day_5','day_2','day_3','day_4','day_5','day_6','day_1'],
'product':['prod_a','prod_a','prod_a','prod_a','prod_a','prod_b','prod_b','prod_b','prod_b','prod_b','prod_b'],
'sale':[10,15,4,17,12,1,50,70,30,70,10]
})
To find the last-5-day average by day and product, I did the below:
df_average = df.groupby(['day', 'product']).tail(5).groupby(['day', 'product']).mean()
Doing the above only returns the actual value for that day and product; it does not take the last-5-day average.
Expected output:
day, product, sale, last_5_average
day_1, prod_a , 10, 11.6
day_2, prod_a , 15, 12
day_3, prod_a , 4, 11
day_4, prod_a , 17, 14.5
day_5, prod_a , 12, 12
day_1, prod_b , 1, 44.2
day_2, prod_b , 50, 54
day_3, prod_b , 70, 55
day_4, prod_b , 30, 50
day_5, prod_b , 70, 60
day_6, prod_b , 50, 50
I hope this helps!
#original data frame
df=pd.DataFrame({
'day':['day_1','day_2','day_3','day_4','day_5','day_2','day_3','day_4','day_5','day_6','day_1'],
'product':['prod_a','prod_a','prod_a','prod_a','prod_a','prod_b','prod_b','prod_b','prod_b','prod_b','prod_b'],
'sale':[10,15,4,17,12,1,50,70,30,70,10]
})
#sort by product and day
df=df.sort_values(by=['product','day'])
#drop the sorted index
df=df.reset_index(drop=True)
#take the rolling mean of the past 5 records within each product group
#(the first 4 rows of each group get NaN, since fewer than 5 records exist yet)
df['rolling_mean_sale']=df.groupby('product')['sale'].rolling(5).mean().reset_index()['sale']
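If you also want a value for the rows that have fewer than 5 prior records, and an alignment that does not rely on row positions, here is a minimal variant (a sketch; Series.droplevel assumes a reasonably recent pandas):
# min_periods=1 yields a mean even when fewer than 5 records exist so far;
# droplevel(0) removes the 'product' level so the result aligns on df's index
df['rolling_mean_sale'] = (
    df.groupby('product')['sale']
      .rolling(5, min_periods=1)
      .mean()
      .droplevel(0)
)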
I have a data set which contains account_number, date, balance, interest charged, and code. This is accounting data, so transactions are posted and then reversed if there was a mistake by the data provider; things can therefore be posted and reversed multiple times.
Account_Number Date Balance Interest Charged Code
0012 01/01/2017 1,000,000 $50.00 Posted
0012 01/05/2017 1,000,000 $-50.00 Reversed
0012 01/07/2017 1,000,000 $50.00 Posted
0012 01/10/2017 1,000,000 $-50.00 Reversed
0012 01/15/2017 1,000,000 $50.00 Posted
0012 01/17/2017 1,500,000 $25.00 Posted
0012 01/18/2017 1,500,000 $-25.00 Reversed
Looking at the data set above, I am trying to figure out a way to look at every row by account number and balance: if there is an inverse charge, both of those rows should be removed, and a charge should only be kept if there is no corresponding reversal for it (01/15/2017). For example, on 01/01/2017 a charge of $50.00 was posted on a balance of 1,000,000, and on 01/05/2017 the charge was reversed on the same balance, so both of these rows should be thrown out. The same goes for 01/07 and 01/10.
I am not too sure how to code this problem; any ideas or tips would be great!
So the problem with a question like this is that there are many corner cases, and optimizing for them may or may not depend on how the data is already processed. That being said, here is one solution, assuming:
For each account number and balance, the row for each Reversed transaction comes immediately after the corresponding Posted transaction.
import pandas as pd
from datetime import date

df = pd.DataFrame(data=[
    ['0012', date(2017, 1, 1), 1000000, 50, 'Posted'], ['0012', date(2017, 1, 5), 1000000, -50, 'Reversed'],
    ['0012', date(2017, 1, 7), 1000000, 50, 'Posted'], ['0012', date(2017, 1, 10), 1000000, -50, 'Reversed'],
    ['0012', date(2017, 1, 15), 1000000, 50, 'Posted'], ['0012', date(2017, 1, 17), 1500000, 25, 'Posted'],
    ['0012', date(2017, 1, 18), 1500000, -25, 'Reversed'],
], columns=['Account_Number', 'Date', 'Balance', 'Interest Charged', 'Code'])
df
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-01 1000000 50 Posted
1 0012 2017-01-05 1000000 -50 Reversed
2 0012 2017-01-07 1000000 50 Posted
3 0012 2017-01-10 1000000 -50 Reversed
4 0012 2017-01-15 1000000 50 Posted
5 0012 2017-01-17 1500000 25 Posted
6 0012 2017-01-18 1500000 -25 Reversed
def f(df_g):
    idx = df_g[df_g['Code'] == 'Reversed'].index
    return df_g.loc[~df_g.index.isin(idx.union(idx - 1)), ['Date', 'Interest Charged', 'Code']]

df.groupby(['Account_Number', 'Balance']).apply(f).reset_index().loc[:, df.columns]
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-15 1000000 50 Posted
How it works: basically, for each combination of account number and balance, I look at the rows marked Reversed and remove them plus the row just before each one.
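To see the index arithmetic in isolation, here is a tiny illustration with hypothetical positions:
import pandas as pd

idx = pd.Index([1, 3])     # hypothetical positions of the 'Reversed' rows
print(idx.union(idx - 1))  # contains 0, 1, 2, 3: each reversal plus the row before it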
EDIT: To make it slightly more robust (it now pairs up rows based on the absolute amount, balance, and account number):
df = pd.DataFrame(data=[
    ['0012', date(2017, 1, 1), 1000000, 53, 'Posted'], ['0012', date(2017, 1, 7), 1000000, 50, 'Posted'],
    ['0012', date(2017, 1, 5), 1000000, -50, 'Reversed'], ['0012', date(2017, 1, 10), 1000000, -53, 'Reversed'],
    ['0012', date(2017, 1, 15), 1000000, 50, 'Posted'], ['0012', date(2017, 1, 17), 1500000, 25, 'Posted'],
    ['0012', date(2017, 1, 18), 1500000, -25, 'Reversed'],
], columns=['Account_Number', 'Date', 'Balance', 'Interest Charged', 'Code'])
df
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-01 1000000 53 Posted
1 0012 2017-01-07 1000000 50 Posted
2 0012 2017-01-05 1000000 -50 Reversed
3 0012 2017-01-10 1000000 -53 Reversed
4 0012 2017-01-15 1000000 50 Posted
5 0012 2017-01-17 1500000 25 Posted
6 0012 2017-01-18 1500000 -25 Reversed
output_cols = df.columns
df['ABS_VALUE'] = df['Interest Charged'].abs()

def f(df_g):
    df_g = df_g.reset_index()  # added this new line
    idx = df_g[df_g['Code'] == 'Reversed'].index
    return df_g.loc[~df_g.index.isin(idx.union(idx - 1)), ['Date', 'Interest Charged', 'Code']]

df.groupby(['Account_Number', 'Balance', 'ABS_VALUE']).apply(f).reset_index().loc[:, output_cols]
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-15 1000000 50 Posted
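For completeness, a different technique (a sketch, not the approach above) that also copes with the same amount being posted and reversed several times under one balance: number the Posted and the Reversed rows separately with cumcount, so the n-th posting pairs with the n-th reversal, then keep only the rows whose pair number has no counterpart. It assumes the ABS_VALUE column from the edit is present.
# pair the n-th Posted row with the n-th Reversed row per account/balance/amount
df['pair'] = df.groupby(['Account_Number', 'Balance', 'ABS_VALUE', 'Code']).cumcount()

# a group of size 1 means the charge has no matching reversal (or vice versa)
survivors = (df.groupby(['Account_Number', 'Balance', 'ABS_VALUE', 'pair'])
               .filter(lambda g: len(g) == 1)
               .drop(columns='pair'))
print(survivors)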
I am trying to set a MultiIndex on a simple pandas DataFrame. The first index level is the type of shop and the second is the type of fruit. I was expecting to see two groups, Shop1 and Shop2, in the first column, but have ended up with three: Shop1, Shop2, and then Shop1 again. Why is this happening?
Area2 = pd.DataFrame({'01/01/2017': [2000, 2500, 100, 1600],
'01/02/2017': [2000, 2500, 50, 1000],
'01/03/2017': [2000, 500, 50, 1600,],
'01/04/2017': [2500, 2000, 0, 1600],
'Fruit': ['Apples', 'Banana', 'Pears', 'b/berry'],
'Shop': ['Shop1', 'Shop2', 'Shop1', 'Shop1']})
S2 = Area2.set_index(['Shop', 'Fruit'])
Current output
01/01/2017 01/02/2017 01/03/2017 01/04/2017
Shop Fruit
Shop1 Apples 2000 2000 2000 2500
Shop2 Banana 2500 2500 500 2000
Shop1 Pears 100 50 50 0
b/berry 1600 1000 1600 1600
What I was expecting
01/01/2017 01/02/2017 01/03/2017 01/04/2017
Shop Fruit
Shop1 Apples 2000 2000 2000 2500
Pears 100 50 50 0
b/berry 1600 1000 1600 1600
Shop2 Banana 2500 2500 500 2000
I think you need sort_index for sorting MultiIndex:
df = S2.sort_index()
print (df)
01/01/2017 01/02/2017 01/03/2017 01/04/2017
Shop Fruit
Shop1 Apples 2000 2000 2000 2500
Pears 100 50 50 0
b/berry 1600 1000 1600 1600
Shop2 Banana 2500 2500 500 2000
But note that, by default, the display of a MultiIndex does not repeat consecutive identical labels in the first level; that is why your unsorted output showed Shop1 twice while the b/berry row showed a blank.
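You can confirm that this is purely a display convention by inspecting the index directly (a quick check on the sorted frame from above):
# the repeated 'Shop1' labels are still stored in the index; only the
# printed output collapses consecutive duplicates in the first level
print(df.index.get_level_values('Shop').tolist())
# ['Shop1', 'Shop1', 'Shop1', 'Shop2']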
I have dollar and euro currencies.
In the last column I want to calculate the sum of the euro prices only.
SELECT
CUSTOMER,
PRODUCT,
PRICE,
CURRENCY
FROM MORE.PRODUCTS
WHERE CUSTOMER = '1000'
CUSTOMER PRODUCT PRICE CURRENCY
1000 BIKE 100 €
1000 CAR 200 €
1000 BIKE 50 $
1000 CANON 120 €
1000 TRAIN 300 $
For example, I want the SUM of the € values only:
CUSTOMER PRODUCT PRICE CURRENCY TOTAL PRICE
1000 BIKE 100 € 420
1000 CAR 200 € 420
1000 BIKE 50 $ 420
1000 CANON 120 € 420
1000 TRAIN 300 $ 420
What is the best way to do this?
I tried to use a subquery in the SELECT clause, but I was not able to get it to work.
Use an analytic function to calculate the total:
SELECT CUSTOMER,
PRODUCT,
PRICE,
CURRENCY,
SUM( CASE CURRENCY WHEN '€' THEN PRICE END ) OVER () AS TOTAL_PRICE
FROM MORE.PRODUCTS
WHERE CUSTOMER = '1000'
Output:
CUSTOMER PRODUCT PRICE CURRENCY TOTAL_PRICE
-------- ------- ----- -------- -----------
1000 BIKE 100 € 420
1000 CAR 200 € 420
1000 BIKE 50 $ 420
1000 CANON 120 € 420
1000 TRAIN 300 $ 420
SELECT CUSTOMER, SUM(PRICE) AS TOTAL_PRICE
FROM MORE.PRODUCTS
WHERE CUSTOMER = '1000' AND CURRENCY = '€'
GROUP BY CUSTOMER
You have to filter for the euro prices.