Create Dataframe from Matrix Search Calculation - pandas

I have the following DataFrame:
df = pd.DataFrame({'Idenitiy': ['Haus1', 'Haus2', 'Haus1', 'Haus2'],
                   'kind': ['Gas', 'Gas', 'Strom', 'Strom'],
                   '2005': [2, 3, 5, 6],
                   '2006': [2, 3.5, 5.5, 7]})
Now I would like to have the following dataframe as output, containing the product of the entities:
Year Product(Gas) Product(Strom)
2005 6 30
2006 6 38,5
2007 7 38,5
Thank you.

Here's a way to do it:
# multiply column values within each group
from functools import reduce

def mult(f):
    v = [reduce(lambda a, b: a * b, f['2005']), reduce(lambda a, b: a * b, f['2006'])]
    return pd.Series(v, index=['2005', '2006'])

# groupby and multiply column values
df1 = df.groupby('kind')[['2005', '2006']].apply(mult).unstack().reset_index()
df1.columns = ['Year', 'Kind', 'vals']
print(df1)
Year Kind vals
0 2005 Gas 6.0
1 2005 Strom 30.0
2 2006 Gas 7.0
3 2006 Strom 38.5
# reshape the table
df1 = (df1
.pivot_table(index='Year', columns=['Kind'], values='vals'))
# fix column names
df1 = df1.add_prefix('Product_')
df1.columns.name = None
df1 = df1.reset_index()
Year Product_Gas Product_Strom
0 2005 6.0 30.0
1 2006 7.0 38.5
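For what it's worth, a shorter sketch of the same idea lets groupby compute the products with prod() and then transposes so the years become the rows (an alternative route, not the answer above):
# per-kind products via prod(), then transpose so the years become the rows
out = df.groupby('kind')[['2005', '2006']].prod().T.add_prefix('Product_')
out.index.name = 'Year'
out.columns.name = None
out = out.reset_index()
print(out)
This should reproduce the Product_Gas / Product_Strom table above.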

Related

Quickly replace values in a Pandas DataFrame

I have the following dataframe:
df = pd.DataFrame(
{
'A':[1,2],
'B':[3,4]
}, index=['1','2'])
df.loc[:,'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)
# A B Sum
# 1 1 3 4
# 2 2 4 6
# Sum 3 7 10
I want to:
replace 1 by 3*4/10
replace 2 by 3*6/10
replace 3 by 4*7/10
replace 4 by 7*6/10
What is the easiest way to do this? I want the solution to extend to n rows and columns. Been cracking my head over this. TIA!
If I understood you correctly:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [1, 2],
        'B': [3, 4]
    }, index=['1', '2'])
df.loc[:, 'Sum'] = df.sum(axis=1)
df.loc['Sum'] = df.sum(axis=0)
print(df)

# map each original value to its replacement
conditions = [(df == 1), (df == 2), (df == 3), (df == 4)]
values = [(3 * 4) / 10, (3 * 6) / 10, (4 * 7) / 10, (7 * 6) / 10]
df[df.columns] = np.select(conditions, values, df)
Output:
A B Sum
1 1.2 2.8 4.2
2 1.8 4.2 6.0
Sum 2.8 7.0 10.0
Let's try creating it from the original df, before you do the sum and assign:
import numpy as np
v = np.multiply.outer(df.sum(1).values,df.sum().values)/df.sum().sum()
out = pd.DataFrame(v,index=df.index,columns=df.columns)
out
Out[20]:
A B
1 1.2 2.8
2 1.8 4.2
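If you also want the marginal Sum row and column back in the result (as in the printed table from the question), a small extension of the answer above could look like this (just a sketch, re-using the same outer product):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['1', '2'])

# outer product of row sums and column sums, scaled by the grand total
v = np.multiply.outer(df.sum(1).values, df.sum().values) / df.sum().sum()
out = pd.DataFrame(v, index=df.index, columns=df.columns)

# re-attach the marginal sums; they match the original table because the
# row and column totals are preserved by this construction
out.loc[:, 'Sum'] = out.sum(axis=1)
out.loc['Sum'] = out.sum(axis=0)
print(out)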

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and at a loss on how to approach this problem: I have a dataframe where the information I need is mostly grouped in blocks of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filter for each ID, and unstack those rows into a new dataframe. I couldn't get far with the unstack or groupby functions. Is there an easy function or combination that can accomplish this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe has the information in up to 4 rows per group, but not always. The resulting dataframe should have only one row per ID occurrence, with all the information in columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this doesn't really use pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with the non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With unequal number of rows per line
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
col1_1 col2_1 col1_2 col2_2 col1_3 col2_3 col1_4 col2_4
ID
A 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
B 5.0 NaN 7.0 8.0 9.0 10.0 NaN 12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
col1 col2
ID
A 0 1.0 2.0
1 3.0 4.0
B 0 5.0 NaN
1 7.0 8.0
2 9.0 10.0
3 NaN 12.0
where the multi-index is perfect for df.unstack().
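To apply the same cumcount/unstack idea directly to the semicolon-separated file from the question, something along these lines might work (the encoding and the generic column names are assumptions on my part):
import pandas as pd

# read the raw file; there is no header row, so pandas assigns integer column names
df = pd.read_csv(path, sep=';', header=None, encoding='latin-1')
df = df.rename(columns={0: 'ID'})

# number the rows within each ID, then unstack so each group becomes one row
g = df.groupby('ID').cumcount()
out = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
out.columns = [f'{col}_{i + 1}' for col, i in out.columns]
out = out.reset_index()
Passing an explicit encoding to read_csv should also sidestep the non-ASCII problem mentioned in the question.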

Pandas | How to calculate the average value of each cell in multiple dataframes with the same shape?

I have several DataFrames like this:
They are saved in a list: df_list = [df1, df2, df3, df4, df5, ...]
I want to generate a new DataFrame df_average.
In df_average, each cell is equal to the average of the corresponding cells of df1, df2, df3 and df4. For example:
df_average[1,'Q1'] = average(df1[1,'Q1'], df2[1,'Q1'], df3[1,'Q1'], df4[1,'Q1']),
df_average[1,'Q2'] = average(df1[1,'Q2'], df2[1,'Q2'], df3[1,'Q2'], df4[1,'Q2'])
How can I do this efficiently?
The code below averages the values cell by cell. The output has the same shape as the input dataframes:
# Import libraries
import pandas as pd
# Create DataFrame
df1 = pd.DataFrame({
'Q1': [1,2,3],
'Q2': [11,12,13],
'Q3': [10,20,30],
'Q4': [31,32,33],
'Q5': [61,62,63],
})
df2 = df1.copy()*2
df3 = df1.copy()*0.5
df4 = df1.copy()*-1
# Get average
df_average = (df1+df2+df3+df4)/4
df_average
Output:
Q1 Q2 Q3 Q4 Q5
0 0.625 6.875 6.25 19.375 38.125
1 1.250 7.500 12.50 20.000 38.750
2 1.875 8.125 18.75 20.625 39.375
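Since the question already has the frames in a list, the same elementwise arithmetic can be written without naming each frame; a minimal sketch, assuming every frame in df_list shares the same index and columns:
# elementwise sum of all frames, divided by how many there are
df_average = sum(df_list) / len(df_list)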
You can use pd.concat, followed by a groupby on the index using mean for aggregation.
df1 = pd.DataFrame({'Q1':[1,2,3], 'Q2':[1,7,8], 'Q3':[8,9,1], 'Q4':[4,3,7]})
df2 = pd.DataFrame({'Q1':[7,9,10], 'Q2':[9,2,8], 'Q3':[3,4,2], 'Q4':[1,5,6]})
df_average = pd.concat([df1, df2])
df_average = df_average.groupby(df_average.index).agg({'Q1': 'mean',
'Q2': 'mean',
'Q3': 'mean',
'Q4': 'mean'})
print(df_average)
Q1 Q2 Q3 Q4
0 4.0 5.0 5.5 2.5
1 5.5 4.5 6.5 4.0
2 6.5 8.0 1.5 6.5
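The agg dictionary can also be replaced with a plain mean(), which extends to the full df_list from the question and to any set of columns (again assuming all frames share the same index and columns):
df_average = pd.concat(df_list).groupby(level=0).mean()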

Pandas groupby calculate difference

import pandas as pd
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Company',df['Date'].dt.year]).diff()
print(df)
Date Revenue YTD
0 NaT NaN
1 92 days -100.0
2 NaT NaN
3 92 days -22670.0
4 NaT NaN
5 92 days 28627.0
I would like to calculate each company's revenue difference between September and December. I have tried grouping by company and year, but the result is not what I am expecting.
Expected result:
Date Company Revenue YTD
0 2017 A -100
1 2017 B -22670
2 2018 A 28627
IIUC, this should work
(df.assign(Date=df['Date'].dt.year,
           Revenue_Diff=df.groupby(['Company', df['Date'].dt.year])['Revenue YTD'].diff())
   .drop('Revenue YTD', axis=1)
   .dropna()
)
Output:
Date Company Revenue_Diff
1 2017 A -100.0
3 2017 B -22670.0
5 2018 A 28627.0
Try this:
Set it up:
import pandas as pd
import numpy as np
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
Update with np.diff():
my_func = lambda x: np.diff(x)
df = (df.groupby([df.Date.dt.year, df.Company])
        .agg({'Revenue YTD': my_func}))
print(df)
Revenue YTD
Date Company
2017 A -100
B -22670
2018 A 28627
Hope this helps.
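If you want the result in exactly the three-column layout from the question, here is one more sketch (my own variant, assuming each company has one September and one December row per year, as in the sample data):
import pandas as pd

data = [['2017-09-30', 'A', 123], ['2017-12-31', 'A', 23],
        ['2017-09-30', 'B', 74892], ['2017-12-31', 'B', 52222],
        ['2018-09-30', 'A', 37599], ['2018-12-31', 'A', 66226]]
df = pd.DataFrame.from_records(data, columns=['Date', 'Company', 'Revenue YTD'])
df['Date'] = pd.to_datetime(df['Date'])

# last YTD value minus first YTD value within each (year, company) group
out = (df.assign(Year=df['Date'].dt.year)
         .sort_values('Date')
         .groupby(['Year', 'Company'])['Revenue YTD']
         .agg(lambda s: s.iloc[-1] - s.iloc[0])
         .reset_index())
print(out)
This should give one row per year/company pair: 2017/A -100, 2017/B -22670, 2018/A 28627.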

Parse Excel file with multi-index in Pandas

What would be the best way to parse the above Excel file into a Pandas dataframe? The idea is to be able to update the data easily: adding columns, dropping rows. For example, for every origin I would like to keep only Output3. Then, for every year column (2000, ..., 2013), divide it by 2 given a condition (say value > 6000).
Below is what I tried: first parse and drop the unnecessary lines. But it's not satisfactory, as I had to rename the columns manually, so this doesn't look like an optimal solution. Any better idea?
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
cols1 = list(df.columns)
cols1 = [str(x)[:4] for x in cols1]
cols2 = list(df.iloc[0,:])
cols2 = [str(x) for x in cols2]
cols = [x + "_" + y for x,y in zip(cols1,cols2)]
df.columns = cols
df = df.drop(["Unna_nan"], axis =1).rename(columns ={'Time_Origine':'Country','Unna_Output' : 'Series','Unna_Ccy' : 'Unit','2000_nan' : '2000','2001_nan': '2001','2002_nan':'2002','2003_nan' : '2003','2004_nan': '2004','2005_nan' : '2005','2006_nan' : '2006','2007_nan' : '2007','2008_nan' : '2008','2009_nan' : '2009','2010_nan' : '2010','2011_nan': '2011','2012_nan' : '2012','2013_nan':'2013','2014_nan':'2014','2015_nan':'2015','2016_nan':'2016','2017_nan':'2017'})
df.drop(0,inplace=True)
df.drop(df.tail(1).index, inplace=True)
idx = ['Country', 'Series', 'Unit']
df = df.set_index(idx)
df = df.query('Series == "Output3"')
Without having an Excel file like this, I think something like the following might work.
In order to only get the rows from output3, you can use the following:
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
df = df.loc[df['Output'] == 'output3']
Now, to divide every cell by 2 if the cell value is greater than 6000, using pandas apply:
def foo(bar):
    if bar > 6000:
        return bar / 2
    return bar

for col in df.columns:
    try:
        int(col)  # to check if this column is a year
        df[col] = df[col].apply(foo)
    except ValueError:
        pass
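If you'd rather avoid the Python-level loop and apply, a vectorized variant is also possible; a sketch, assuming the year columns are exactly those whose names consist only of digits:
# pick out the year columns by name, then halve only the values above 6000
year_cols = [c for c in df.columns if str(c).isdigit()]
df[year_cols] = df[year_cols].mask(df[year_cols] > 6000, df[year_cols] / 2)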
import numpy as np
import pandas as pd

# read first 2 rows to MultiIndex and remove last one
df = pd.read_excel("Excel1.xlsx", skiprows=2, header=[0,1], skipfooter=1)
print(df)

# create helper DataFrame
cols = df.columns.to_frame().reset_index(drop=True)
cols.columns = ['a','b']
cols['a'] = pd.to_numeric(cols['a'], errors='ignore')
cols['b'] = cols['b'].replace('Unit.1', 'tmp', regex=False)

# create new column by condition
cols['c'] = np.where(cols['b'].str.startswith('Unnamed'), cols['a'], cols['b'])
print(cols)
a b c
0 Time Country Country
1 Time Series Series
2 Time Unit Unit
3 Time tmp tmp
4 2000 Unnamed: 4_level_1 2000
5 2001 Unnamed: 5_level_1 2001
6 2002 Unnamed: 6_level_1 2002
7 2003 Unnamed: 7_level_1 2003
8 2004 Unnamed: 8_level_1 2004
9 2005 Unnamed: 9_level_1 2005
10 2006 Unnamed: 10_level_1 2006
11 2007 Unnamed: 11_level_1 2007
12 2008 Unnamed: 12_level_1 2008
13 2009 Unnamed: 13_level_1 2009
14 2010 Unnamed: 14_level_1 2010
15 2011 Unnamed: 15_level_1 2011
16 2012 Unnamed: 16_level_1 2012
17 2013 Unnamed: 17_level_1 2013
18 2014 Unnamed: 18_level_1 2014
19 2015 Unnamed: 19_level_1 2015
20 2016 Unnamed: 20_level_1 2016
21 2017 Unnamed: 21_level_1 2017
#overwrite columns by column c
df.columns = cols['c'].tolist()
#forward filling missing values
df['Country'] = df['Country'].ffill()
df = df.drop('tmp', axis=1).set_index(['Country','Series','Unit'])
print (df)
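From here, the two operations asked for in the question can be applied on the cleaned frame; a rough sketch (the exact spelling of "Output3" depends on the file, and it assumes all remaining columns are numeric year columns):
# keep only the Output3 rows (Series is an index level after set_index)
df = df.query('Series == "Output3"')

# halve every value greater than 6000
df = df.mask(df > 6000, df / 2)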