How can I transform two data frames into another one? - pandas

I have a df1 that looks like:
Shady Slim Eminem
Date
2011-01-10 HI Yes 1500
2011-01-13 HI No 1500
2011-01-13 BYBY Yes 4000
2011-01-26 OKDO Yes 1000
I have df2 that looks like this:
HI BYBY OKDO INT
Date
2011-01-10 340.99 143.41 614.21 1.0
2011-01-13 344.20 144.55 616.69 1.0
2011-01-13 344.20 144.55 616.69 1.0
2011-01-26 342.38 156.42 616.50 1.0
I want to save Eminem as a Series, and to treat each column in df2 as a Series as well. For each row, I want to multiply Eminem by the df2 value in the column named by Shady, and fill df3 with the results.
I want a df3 that looks like the table below.
I also want the INT column to be the sum of each row in df3.
I want to do this in a vectorized way.
Also, based on the Slim column: if it's Yes then I want to add the Eminem * value, else I want the negation of it.
Here are the values I want:
HI BYBY OKDO INT
Date
2011-01-10 511485 0 0 sum(row 1)
2011-01-13 -516300 578200 0 sum(row 2)
2011-01-13 0 578200 0 sum(row 3)
2011-01-26 0 0 616500 sum(row 4)

Option 1
Use the pd.DataFrame.mul method and provide the axis parameter to specify that the Series you are multiplying by should be aligned along the index.
df2.mul(df1.Eminem, axis=0)
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1500.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 1500.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 4000.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1000.0
Option 2
If, by chance, the Series you are multiplying by is already ordered the way you'd like, you can forgo index alignment and access the values attribute.
df2.mul(df1.Eminem.values, 0)
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1500.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 1500.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 4000.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1000.0
Option 3
If the index proves difficult (here it contains the duplicate date 2011-01-13), you can append a level that makes it unique:
unique_me = lambda d: d.set_index(d.groupby(level=0).cumcount(), append=True)
df2.pipe(unique_me).mul(df1.pipe(unique_me).Eminem, axis=0).reset_index(-1, drop=True)
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1500.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 1500.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 4000.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1000.0
With Slim Factor
# multiply by Eminem positionally, then set INT to the row sum, negated where Slim == 'No'
df2.drop('INT', axis=1, errors='ignore').mul(df1.Eminem.values, 0).assign(
    INT=lambda d: (lambda s: s.mask(df1.Slim.eq('No'), -s))(d.sum(1)))
HI BYBY OKDO SOME COOL INT
Date
2011-01-10 511485.0 215115.0 921315.0 108030.0 184785.0 1940730.0
2011-01-13 516300.0 216825.0 925035.0 110310.0 186810.0 -1955280.0
2011-01-13 1376800.0 578200.0 2466760.0 294160.0 498160.0 5214080.0
2011-01-26 342380.0 156420.0 616500.0 76370.0 125800.0 1317470.0
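None of the options above zero out the columns that Shady does not name, which the desired df3 seems to require. Below is a sketch of one possible reading of that spec, using a one-hot mask and positional alignment to sidestep the duplicate 2011-01-13 label; this is my interpretation, not part of the original answer.
import numpy as np
import pandas as pd

cols = ['HI', 'BYBY', 'OKDO']
sign = np.where(df1.Slim.eq('Yes'), 1, -1)  # Yes -> +1, No -> -1
# 1 where Shady names the column, 0 elsewhere
onehot = pd.get_dummies(df1.Shady).reindex(columns=cols, fill_value=0).astype(int).values
df3 = pd.DataFrame(
    df2[cols].values * (df1.Eminem.values * sign)[:, None] * onehot,
    index=df2.index, columns=cols)
df3['INT'] = df3.sum(1)  # row sums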

Related

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stock data.
I have a list ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the BUY, SELL columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from the given list (e.g. for 3MINDIA.NS.csv, symbol is 3MINDIA)
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
after this I tried to concatenate each trades dataframe that is generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol],
# I should be able to see the trade dates for that particular symbol.
But I get a result_df that has a lot of NaN values, because each trades dataframe has a variable number of rows.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as a higher-level index, and not get all the NaN values in my result_df?
So I found a way to get what I wanted.
First I added this code inside the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this I used concat again on result_df and trades:
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted
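For reference, a minimal sketch of the whole loop; run_strategy is a hypothetical stand-in for whatever produces each per-stock trades frame:
import pandas as pd

files = ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv']
result_df = pd.DataFrame()
for f in files:
    symbol = f.split('.')[0]
    trades = run_strategy(f)  # hypothetical: returns the trades frame shown above
    # prepend the symbol as an outer index level, then stack vertically
    trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
    result_df = pd.concat([result_df, trades], axis=0)

result_df.loc['3MINDIA']  # the trades for one symbol, with no NaN padding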

Not getting top5 values for each month using grouper and groupby in pandas

I'm trying to get the top 5 values of amount for each month, along with the text column. I've tried resampling and a groupby statement.
Dataset:
text amount date
123… 11.00 11-05-17
123abc… 10.00 11-08-17
Xyzzy… 22.00 12-07-17
Xyzzy… 221.00 11-08-17
Xyzzy… 212.00 10-08-17
Xyzzy… 242.00 18-08-17
Code:
df1 = df.groupby(['text', pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
I get groups by text, but the results are neither arranged by month nor sorted by the largest values in descending order.
df1 = df.groupby([pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
This code works fine but does not give the text column.
Assuming that amount is a numeric column:
In [8]: df.groupby(['text', pd.Grouper(key='date', freq='M')]).apply(lambda x: x.nlargest(2, 'amount'))
Out[8]:
text amount date
text date
123abc… 2017-11-30 1 123abc… 10.0 2017-11-08
123… 2017-11-30 0 123… 11.0 2017-11-05
Xyzzy… 2017-08-31 5 Xyzzy… 242.0 2017-08-18
2017-10-31 4 Xyzzy… 212.0 2017-10-08
2017-11-30 3 Xyzzy… 221.0 2017-11-08
2017-12-31 2 Xyzzy… 22.0 2017-12-07
You can use head with sort_values:
df1 = df.sort_values('amount', ascending=False).groupby(['text', pd.Grouper(key='date', freq='M')]).head(2)
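If the goal is the top 5 amounts per month regardless of text, while keeping the text column, one sketch groups by the date alone (assuming the dates are day-month-year, which the 18-08-17 row suggests):
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%d-%m-%y')
top5 = (df.sort_values('amount', ascending=False)
          .groupby(pd.Grouper(key='date', freq='M'))
          .head(5))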

How to index into a data frame using another data frame's indices?

I have a dataframe, num_buys_per_day
Date count
0 2011-01-13 1
1 2011-02-02 1
2 2011-03-03 2
3 2011-06-03 1
4 2011-08-01 1
I have another data frame, commissions_buy, of which I'll show a small subset:
num_orders
2011-01-10 0
2011-01-11 0
2011-01-12 0
2011-01-13 0
2011-01-14 0
2011-01-18 0
I want to apply the following command
commissions_buy.loc[num_buys_per_day.index, :] = num_buys_per_day.values * commission
where commission is a scalar.
Note that all indices in num_buys_per_day exist in commissions_buy.
I get the following error:
TypeError: unsupported operand type(s) for *: 'Timestamp' and 'float'
What is the correct command?
You need to first make the Date column the index; the TypeError occurs because Date is a regular column, so num_buys_per_day.values is an object array that still contains the Timestamps:
num_buys_per_day.set_index('Date', inplace=True)
commissions_buy.loc[num_buys_per_day.index, 'num_orders'] = num_buys_per_day['count'].values * commission
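A minimal end-to-end sketch with the question's data; the commission value here is made up:
import pandas as pd

num_buys_per_day = pd.DataFrame({
    'Date': pd.to_datetime(['2011-01-13', '2011-02-02']),
    'count': [1, 1]})
commissions_buy = pd.DataFrame(
    {'num_orders': 0.0},
    index=pd.to_datetime(['2011-01-12', '2011-01-13', '2011-02-02']))
commission = 9.99  # hypothetical scalar

num_buys_per_day = num_buys_per_day.set_index('Date')
commissions_buy.loc[num_buys_per_day.index, 'num_orders'] = (
    num_buys_per_day['count'].values * commission)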

pandas pivot_table with dates as values

let's say I have the following table of customer data
df = pd.DataFrame.from_dict({"Customer":[0,0,1],
"Date":['01.01.2016', '01.02.2016', '01.01.2016'],
"Type":["First Buy", "Second Buy", "First Buy"],
"Value":[10,20,10]})
which looks like this:
Customer | Date | Type | Value
-----------------------------------------
0 |01.01.2016|First Buy | 10
-----------------------------------------
0 |01.02.2016|Second Buy| 20
-----------------------------------------
1 |01.01.2016|First Buy | 10
I want to pivot the table by the Type column.
However, the pivoting only gives the numeric Value columns as a result.
I'd desire a structure like:
Customer | First Buy Date | First Buy Value | Second Buy Date | Second Buy Value
---------------------------------------------------------------------------------
where the missing values are NaN or NaT.
Is this possible using pivot_table? If not, I can imagine some workarounds, but they are quite lengthy. Any other suggestions?
Use unstack:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns]
print (df1)
Date_First Buy Date_Second Buy Value_First Buy Value_Second Buy
Customer
0 01.01.2016 01.02.2016 10.0 20.0
1 01.01.2016 None 10.0 NaN
If you need another order of columns, use swaplevel and sort_index:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns.swaplevel(0,1)]
df1.sort_index(axis=1, inplace=True)
print (df1)
First Buy_Date First Buy_Value Second Buy_Date Second Buy_Value
Customer
0 01.01.2016 10.0 01.02.2016 20.0
1 01.01.2016 10.0 None NaN
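To answer the literal question: pivot_table can also produce this, provided you pass an aggfunc that keeps non-numeric columns, such as 'first' (a sketch assuming one row per Customer/Type pair):
df1 = df.pivot_table(index='Customer', columns='Type',
                     values=['Date', 'Value'], aggfunc='first')
df1.columns = ['_'.join(cols) for cols in df1.columns]
The default aggfunc ('mean') silently drops the object-typed Date column, which is why the original attempt only returned Value.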

Pandastic way of growing a dataframe

So, I have a year-indexed dataframe that I would like to extend beyond the end year (2013) by some logic: say, grow the last value by n percent for 10 years, though the logic could also be to just add a constant, or a slightly growing number. I will leave that to a function and just stuff the logic there.
I can't think of a neat vectorized way to do that for an arbitrary length of time and arbitrary logic, leaving a longer dataframe with the extra increments added, and I would prefer not to loop it.
The particular calculation matters. In general you would have to compute the values in a loop. Some NumPy ufuncs (such as np.add, np.multiply, np.minimum, np.maximum) have an accumulate method, however, which may be useful depending on the calculation.
For example, to calculate values given a constant growth rate, you could use np.multiply.accumulate (or cumprod):
import numpy as np
import pandas as pd
N = 10
index = pd.date_range(end='2013-12-31', periods=N, freq='D')
df = pd.DataFrame({'val':np.arange(N)}, index=index)
last = df['val'][-1]
# val
# 2013-12-22 0
# 2013-12-23 1
# 2013-12-24 2
# 2013-12-25 3
# 2013-12-26 4
# 2013-12-27 5
# 2013-12-28 6
# 2013-12-29 7
# 2013-12-30 8
# 2013-12-31 9
# expand df
index = pd.date_range(start='2014-1-1', periods=N, freq='D')
df = df.reindex(df.index.union(index))
# compute new values
rate = 1.1
df['val'][-N:] = last*np.multiply.accumulate(np.full(N, fill_value=rate))
yields
val
2013-12-22 0.000000
2013-12-23 1.000000
2013-12-24 2.000000
2013-12-25 3.000000
2013-12-26 4.000000
2013-12-27 5.000000
2013-12-28 6.000000
2013-12-29 7.000000
2013-12-30 8.000000
2013-12-31 9.000000
2014-01-01 9.900000
2014-01-02 10.890000
2014-01-03 11.979000
2014-01-04 13.176900
2014-01-05 14.494590
2014-01-06 15.944049
2014-01-07 17.538454
2014-01-08 19.292299
2014-01-09 21.221529
2014-01-10 23.343682
To increment by a constant value you could simply use np.arange:
step=2
df['val'][-N:] = np.arange(last+step, last+(N+1)*step, step)
or cumsum:
step=2
df['val'][-N:] = last + np.full(N, fill_value=step).cumsum()
Some linear recurrence relations can be expressed using scipy.signal.lfilter. See, for example, Trying to vectorize iterative calculation with numpy and Recursive definitions in Pandas.
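For instance, the constant-growth series above satisfies the first-order recurrence y[n] = rate * y[n-1], which lfilter can evaluate; a sketch reusing N, last, and rate from the example above:
import numpy as np
from scipy.signal import lfilter

# lfilter(b, a, x) computes a[0]*y[n] = b[0]*x[n] - a[1]*y[n-1],
# i.e. y[n] = x[n] + rate*y[n-1] here; seeding x[0] = last*rate
# reproduces last*rate, last*rate**2, ... as in the accumulate example
x = np.zeros(N)
x[0] = last * rate
new_vals = lfilter([1.0], [1.0, -rate], x)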