pandas dataframe transformation partial sums

I have a pandas dataframe
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
From this I would like to create a dataframe B:
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than brute-force?
Cheers, Mike

This will give you a Series with the partial sums. Note the floor division, and the shift by one so that a 1-based index puts rows 1-5 into the first bin:
df['bin'] = (df.index - 1) // 5
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%d-%d' % (5 * i + 1, 5 * (i + 1)) for i in bin_sums.index]
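For reference, here is a self-contained sketch of the same idea that groups by position instead of mutating the frame (the sample data here is made up):
import numpy as np
import pandas as pd

# made-up stand-in for the frame in the question: 100 rows, 1-based index
df = pd.DataFrame({'A': np.random.rand(100)}, index=range(1, 101))

# label consecutive blocks of five rows by position, then sum each block
bins = np.arange(len(df)) // 5
bin_sums = df.groupby(bins)['A'].sum()
bin_sums.index = ['%d-%d' % (5 * i + 1, 5 * (i + 1)) for i in bin_sums.index]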


index compatibility of dataframe with multiindex result from apply on group

We have to apply an algorithm to columns in a dataframe; the data has to be grouped by a key, and the result shall form a new column in the dataframe. Since this is a common use case, we wonder whether we have chosen a correct approach.
The following code reflects our approach to the problem in a simplified manner.
import numpy as np
import pandas as pd
np.random.seed(42)
N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)
df = pd.DataFrame(data)
This generates a DataFrame as follows.
key x
0 0 0.969585
1 1 0.775133
2 1 0.939499
3 1 0.894827
4 1 0.597900
.. ... ...
95 53 0.036887
96 54 0.609564
97 55 0.502679
98 56 0.051479
99 56 0.278646
We then apply an exemplary method to the DataFrame groups.
def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)

def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index  # preserve the original index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const)  # perform some pandas-incompatible magic
    return pd.Series(y, index=index)  # reconstruct the index
g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))
When assigning the result as a new column with df['y'] = y_per_g, it throws a TypeError:
TypeError: incompatible index of inserted column with frame index
Thus a compatible MultiIndex needs to be introduced first.
df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)
Which yields the intended result.
key x y
index
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
... ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
Now we wonder whether there is a more straightforward way of dealing with the index, and whether we have generally chosen a favorable approach.
Use Series.droplevel to remove the first level of the MultiIndex so that the result has the same index as df; then the assignment works well:
g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print(df)
key x y
0 0 0.969585 6.9
1 1 0.775133 6.0
2 1 0.939499 6.1
3 1 0.894827 6.4
4 1 0.597900 6.6
.. ... ... ...
95 53 0.036887 6.0
96 54 0.609564 6.0
97 55 0.502679 6.5
98 56 0.051479 6.0
99 56 0.278646 6.1
[100 rows x 3 columns]
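Alternatively, a small sketch under the same setup: passing group_keys=False to groupby keeps the original index on the result of apply, so no level has to be dropped afterwards (behavior as of reasonably recent pandas versions):
g = df.groupby('key', group_keys=False)  # do not prepend 'key' to the result index
df['y'] = g.apply(lambda d: pandas_confrom_magic(d, const=5))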

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then an 'on %' column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof, this was a roughie, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice, because I need the total columns first. Sorry about that, but it's the best I could do. If you have any questions, just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x + 1): y.iloc[:, 0] + y.iloc[:, 1]})
                for x, y in df.groupby(np.arange(df.shape[1]) // 2, axis=1)], axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x + 1): (y.iloc[:, 0] / y.iloc[:, 2]) * 100})
                for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)], axis=1)
print(df)
This groups by the first level of the columns (the rooms) and then loops through each group to add the total and percent-on columns. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
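A variant of the same idea, sketched here for comparison, builds each room's block separately and concatenates once instead of mutating df inside the loop (it starts from the freshly constructed df above; note that groupby(..., axis=1) is deprecated in recent pandas):
parts = {room: group[room].assign(total=group[room].sum(axis=1),
                                  pct_on=lambda d: d['on'] / d['total'])
         for room, group in df.groupby(level=0, axis=1)}
result = pd.concat(parts, axis=1)  # the dict keys become the room level again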

pandas MultiIndex assign multiple columns

I have created a dataframe with a MultiIndex like below:
import numpy as np
import pandas as pd
column_index = [np.array(['OPEN', 'OPEN', 'CLOSE', 'CLOSE']),
                np.array(['IBM', 'AAPL', 'IBM', 'AAPL'])]
df = pd.DataFrame(np.transpose(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])),
                  index=['20190101', '20190102', '20190103'],
                  columns=column_index)
The result is like this:
OPEN CLOSE
IBM AAPL IBM AAPL
20190101 1 4 7 10
20190102 2 5 8 11
20190103 3 6 9 12
Now I'd like to create a new set of columns by doing something like:
df['RTN'] = df.CLOSE / df.OPEN
To get:
OPEN CLOSE RTN
IBM AAPL IBM AAPL IBM AAPL
20190101 1 4 7 10 7.0 2.5
20190102 2 5 8 11 4.0 2.2
20190103 3 6 9 12 3.0 2.0
That does not work. The nicest way I've been able to do this is like so:
rtn = df.CLOSE / df.OPEN
rtn = pd.concat([rtn],keys=['RTN'],axis=1)
df = pd.concat([df,rtn],axis=1)
Is there a way to do this as an assignment without the other steps?
One way is to rename the columns prior to the operations. Then it's a simple concat:
u = (df.loc[:, ['CLOSE']].rename(columns={'CLOSE': 'RTN'}, level=0)
     .divide(df.loc[:, ['OPEN']].rename(columns={'OPEN': 'RTN'}, level=0)))
# selecting with a list, df.loc[:, ['CLOSE']], keeps the MultiIndex intact
pd.concat([df, u], axis=1)
Alternatively, you can stack + eval + unstack. It's concise, but perhaps not super performant for large datasets.
df.stack().eval('RTN = CLOSE/OPEN').unstack()
Without eval:
df.stack().assign(RTN=lambda x: x.CLOSE / x.OPEN).unstack()
# or
df = df.stack()
df['RTN'] = df.CLOSE / df.OPEN
df = df.unstack()
Output in all cases:
OPEN CLOSE RTN
IBM AAPL IBM AAPL IBM AAPL
20190101 1 4 7 10 7.0 2.5
20190102 2 5 8 11 4.0 2.2
20190103 3 6 9 12 3.0 2.0
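For completeness, a plain-loop sketch that does work as a direct assignment, at the cost of naming the second level explicitly (the tickers are hard-coded here for illustration):
for ticker in ['IBM', 'AAPL']:
    df[('RTN', ticker)] = df[('CLOSE', ticker)] / df[('OPEN', ticker)]
# the new ('RTN', ...) columns are appended on the right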

Split SparkR dataframe into list of dataframes

I am new to SparkR and am trying to split a SparkR dataframe into a list of dataframes based on columns.
The data has a billion records of sls_d (date), mdse_item_i (item id), co_loc_i (location id), and traffic_ti_8_00, traffic_ti_9_00, traffic_ti_10_00, traffic_ti_11_00 (each holding the traffic count for the given hour).
Data Snapshot:
sls_d co_loc_i mdse_item_i traffic_ti_8_00 traffic_ti_9_00 traffic_ti_10_00 traffic_ti_11_00
1 2016-10-21 1592 4694620 1 113 156 209
2 2016-10-21 1273 4694620 1 64 152 249
3 2016-10-21 1273 15281024 1 64 152 249
4 2016-10-21 1498 4694620 2 54 124 184
5 2016-10-21 1498 15281024 2 54 124 184
Desired Output:
sls_d co_loc_i mdse_item_i traffic_ti_8_00 traffic_ti_9_00 traffic_ti_10_00 traffic_ti_11_00
2016-10-21 4 4694620 3 67 145 283
That is, a list of dataframes. Here is what I tried:
d.2 = split(data.2.2, list(data.2.2$mdse_item_i, data.2.2$co_loc_i, data.2.2$sls_d))
Error in x[ind[[k]]] : Expressions other than filtering predicates
are not supported in the first parameter of extract operator [ or
subset() method.
Is there any way to do this in SparkR apart from converting the Spark dataframe to base R?
Converting the Spark dataframe to base R results in a memory error and defeats the purpose of parallel processing.
Any help is greatly appreciated.
Your question is somewhat unclear; if you mean to split the columns of a Spark dataframe, you should use select. Here is an example using the iris data in SparkR 2.2:
df <- as.DataFrame(iris) # Spark dataframe
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
# separate the length-related & width-related columns into 2 Spark dataframes:
df_length = select(df, 'Sepal_Length', 'Petal_Length')
df_width = select(df, 'Sepal_Width', 'Petal_Width')
head(collect(df_width)) # for demonstration purposes only
# Sepal_Width Petal_Width
# 1 3.5 0.2
# 2 3.0 0.2
# 3 3.2 0.2
# 4 3.1 0.2
# 5 3.6 0.2
# 6 3.9 0.4
Now, you can put these 2 Spark dataframes into an R list, but I'm not sure how useful this will be, since any list operations that might make sense are not usable:
my_list = c(df_length, df_width)
head(collect(my_list[[1]]))
# Sepal_Length Petal_Length
# 1 5.1 1.4
# 2 4.9 1.4
# 3 4.7 1.3
# 4 4.6 1.5
# 5 5.0 1.4
# 6 5.4 1.7

Applying Transformation to Table (Pandas)

I have the following data frame
Index ID Wt Wt.1
0 4999 3.2 1.2
1 5012 1.1 3.4
2 5027 4.4 5.6
and I'm trying to apply a transformation in order to get a dataframe that looks like the following
Index ID Wt
0 4999 3.2
0 4999 1.2
1 5012 1.1
1 5012 3.4
2 5027 4.4
2 5027 5.6
Is there a simple way to do this? I have tried using melt, groupby, and pivot_table but with no luck. This seems like such a simple task, so perhaps I am overthinking it.
One way would be to assign the 'ID' and 'Wt.1' columns to an empty dataframe as the target 'ID' and 'Wt' columns; this has a minor advantage in that you don't get a messy append at the end where you have NaN values and both 'Wt' and 'Wt.1' columns.
In [28]:
temp = pd.DataFrame()
temp[['ID','Wt']] = df[['ID','Wt.1']]
df1 = df[['ID','Wt']].append(temp)
df1
Out[28]:
ID Wt
Index
0 4999 3.2
1 5012 1.1
2 5027 4.4
0 4999 1.2
1 5012 3.4
2 5027 5.6
[6 rows x 2 columns]
You can call df1.reset_index(inplace=True) to correct the index afterwards.
You can probably do it in a few lines, but I will show them step by step:
In [86]:
df2 = df.set_index(['Index', 'ID'])
df3 = df2.stack().reset_index()
df3 = df3.loc[:, ['Index', 'ID', 0]]  # .loc, since .ix is deprecated/removed
df3.columns = ['Index', 'ID', 'Wt']
print(df3)
Index ID Wt
0 0 4999 3.2
1 0 4999 1.2
2 1 5012 1.1
3 1 5012 3.4
4 2 5027 4.4
5 2 5027 5.6
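The same reshape can also be written with melt, sketched here for reference. It assumes 'Index' is a regular column, as in the step-by-step answer above, and uses a temporary value name because melt raises if value_name collides with an existing column:
out = (df.melt(id_vars=['Index', 'ID'], value_vars=['Wt', 'Wt.1'], value_name='Wt_tmp')
         .sort_values('Index', kind='mergesort')  # stable sort keeps Wt before Wt.1
         .drop(columns='variable')
         .rename(columns={'Wt_tmp': 'Wt'}))
print(out)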