pandas: merge two dataframes to form a MultiIndex

I'm playing around with pandas to see if I can do some stock calculations better/faster than with other tools. If I have a single stock, it's easy to create a daily calculation:
df['mystuff'] = df['Close']+1
If I download more than one ticker, it gets complicated:
df = df.stack()
df['mystuff'] = df['Close']+1
df = df.unstack()
If I want to use the previous day's "Close", it gets too complex for me. I thought I might go back to fetching a single ticker, do any operation with iloc[i-1] or something similar (I haven't figured it out yet), and then merge the dataframes.
How do I merge two dataframes of single tickers to get a MultiIndex?
So that:
f1 = web.DataReader('AAPL', 'yahoo', start, end)
f2 = web.DataReader('GOOG', 'yahoo', start, end)
is like
f = web.DataReader(['AAPL','GOOG'], 'yahoo', start, end)
Edit:
This is the nearest thing to f I can create. It's not exactly the same so I'm not sure I can use it instead of f.
f_f = pd.concat({'AAPL':f1,'GOOG':f2},axis=1)
Maybe I should experiment with operations working on a multiindex instead of splitting work on simpler dataframes.
Full Code:
import pandas_datareader.data as web
import pandas as pd
from datetime import datetime
start = datetime(2001, 9, 1)
end = datetime(2019, 8, 31)
a = web.DataReader('AAPL', 'yahoo', start, end)
g = web.DataReader('GOOG', 'yahoo', start, end)
# here are shift/diff calculations that I don't know how to do with a multiindex
a_g = web.DataReader(['AAPL','GOOG'], 'yahoo', start, end)
merged = pd.concat({'AAPL':a,'GOOG':g},axis=1)
a_g.to_csv('ag.csv')
merged.to_csv('merged.csv')
import code; code.interact(local=locals())
Side note: I don't know how to compare the two CSV files.

This is not exactly the same, but it returns a DataFrame with MultiIndex columns that you can use just like a_g:
import pandas_datareader.data as web
import pandas as pd
from datetime import datetime
start = datetime(2019, 7, 1)
end = datetime(2019, 8, 31)
out = []
for tick in ["AAPL", "GOOG"]:
    d = web.DataReader(tick, 'yahoo', start, end)
    # label every column as (attribute, ticker), matching the multi-ticker layout
    cols = [(col, tick) for col in d.columns]
    d.columns = pd.MultiIndex.from_tuples(cols, names=['Attributes', 'Symbols'])
    out.append(d)
df = pd.concat(out, axis=1)
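As an alternative sketch, the pd.concat call from the question can be brought into the same column layout by swapping the two column levels. The small frames below are made-up stand-ins for the downloaded data, and the level names are taken from the snippet above:

```python
import pandas as pd

# Made-up stand-ins for the two single-ticker downloads
idx = pd.date_range("2019-07-01", periods=3, freq="D")
a = pd.DataFrame({"Close": [1.0, 2.0, 3.0], "Open": [0.5, 1.5, 2.5]}, index=idx)
g = pd.DataFrame({"Close": [10.0, 20.0, 30.0], "Open": [9.0, 19.0, 29.0]}, index=idx)

# concat with a dict puts the ticker on the outer column level
merged = pd.concat({"AAPL": a, "GOOG": g}, axis=1)
merged.columns = merged.columns.set_names(["Symbols", "Attributes"])

# swap the levels so the attribute is outermost, then sort the columns
merged = merged.swaplevel(axis=1).sort_index(axis=1)
print(merged["Close"])
```

After the swap, merged["Close"] returns one column per ticker, which is how the multi-ticker frame is used elsewhere in this thread.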
Update
If you have MultiIndex columns and want to calculate and add a new column, you can do this:
import pandas_datareader.data as web
import pandas as pd
from datetime import datetime
start = datetime(2019, 7, 1)
end = datetime(2019, 8, 31)
ticks = ['AAPL','GOOG']
df = web.DataReader(ticks, 'yahoo', start, end)
names = list(df.columns.names)
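# previous day's close for every ticker at once
df1 = df["Close"].shift()
# put the result under a new top-level key so it can be joined back
cols = [("New", col) for col in df1.columns]
df1.columns = pd.MultiIndex.from_tuples(cols, names=names)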
df = df.join(df1)
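The same idea can be tried without a download. This is a minimal sketch on made-up numbers; the key "PrevClose" is a hypothetical name chosen for illustration, and the column layout is assumed to match the one shown above:

```python
import pandas as pd

# Made-up frame with the (Attributes, Symbols) column layout
idx = pd.date_range("2019-07-01", periods=4, freq="D")
cols = pd.MultiIndex.from_product(
    [["Close", "Open"], ["AAPL", "GOOG"]], names=["Attributes", "Symbols"]
)
df = pd.DataFrame(
    [[1.0, 10.0, 0.5, 9.0],
     [2.0, 20.0, 1.5, 19.0],
     [3.0, 30.0, 2.5, 29.0],
     [4.0, 40.0, 3.5, 39.0]],
    index=idx, columns=cols,
)

# Shift every ticker's close by one row (previous day's close)
prev_close = df["Close"].shift()
prev_close.columns = pd.MultiIndex.from_product(
    [["PrevClose"], prev_close.columns], names=df.columns.names
)
df = df.join(prev_close)

# Day-over-day change for all tickers in one expression
change = df["Close"] - df["PrevClose"]
print(change)
```

Because both df["Close"] and df["PrevClose"] are plain frames with one column per ticker, arithmetic between them lines up by ticker without any stack/unstack step.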

Related

How can I keep every result from a 'for' loop

I think my code works well.
The problem is that it does not keep every result in the DataFrame R.
When I print R, only the last result appears.
What should I do to keep every result?
I want to add each result as the next column.
import numpy as np
import pandas as pd
DATA = pd.DataFrame()
DATA = pd.read_excel(r'C:\gskim\P4DS/Monthly Stock adjClose2.xlsx')
DATA = DATA.set_index("Date")
DATA1 = np.log(DATA/DATA.shift(1))
DATA2 = DATA1.drop(DATA1.index[0])*100
F = pd.DataFrame(index = DATA2.index)
for i in range(0, 276):
    Q = DATA2.iloc[i].dropna()
    W = sorted(abs(Q), reverse = False)
    W_qcut = pd.qcut(W, 5, labels = ['A', 'B', 'C', 'D', 'E'])
    F = Q.groupby(W_qcut).sum()
    R = pd.DataFrame(F)
print(R)
The first table is the current result; I want to fill in every blank cell of the second table.
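One possible approach, sketched on made-up data: the loop above rebinds F and R on every pass, so only the last result survives. Collecting each pass in a dict and building the frame once keeps all of them. The frame data2 is a hypothetical stand-in for DATA2, and binning each value by its own absolute value is an assumption about the intent:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DATA2: 6 months of returns for 10 stocks
rng = np.random.default_rng(0)
data2 = pd.DataFrame(
    rng.normal(size=(6, 10)),
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
    columns=[f"S{i}" for i in range(10)],
)

labels = ["A", "B", "C", "D", "E"]
results = {}
for date, row in data2.iterrows():
    q = row.dropna()
    # quintile label for each stock, based on the size of its return
    bins = pd.qcut(q.abs(), 5, labels=labels)
    # keep this month's sums instead of overwriting the previous ones
    results[date] = q.groupby(bins, observed=False).sum()

# one column per month, one row per quintile
R = pd.DataFrame(results)
print(R)
```

Use R.T instead if one row per month is preferred.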

how to save sql query result to csv in pandas

I wrote a query against two data frames and want to save the result to a CSV file.
I tried this code and it didn't work:
q1 = "SELECT * FROM df1 join df2 on df1.Date = df2.Date"
df = pd.read_sql(q1,None)
df.to_csv('data.csv',index=False)
You can try the following code:
import pandas as pd
df1 = pd.read_csv("Insert file path")
df2 = pd.read_csv("Insert file path")
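# parse the join key as dates; values that do not match the format become NaT
df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce', format='%Y-%m-%d')
df2['Date'] = pd.to_datetime(df2['Date'], errors='coerce', format='%Y-%m-%d')
# equivalent of the SQL inner join on Date
df = df1.merge(df2, how='inner', on='Date')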
df.to_csv('data.csv',index=False)
This should solve your problem.
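A self-contained sketch of the same idea, with two made-up frames in place of the CSV files (the column names price and volume are hypothetical):

```python
import pandas as pd

# Made-up inputs standing in for the two CSV files
df1 = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
                    "price": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"Date": ["2020-01-02", "2020-01-03", "2020-01-04"],
                    "volume": [20, 30, 40]})

for frame in (df1, df2):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%Y-%m-%d")

# inner join: only dates present in both frames survive
joined = df1.merge(df2, how="inner", on="Date")

# to_csv without a path returns the CSV text instead of writing a file
csv_text = joined.to_csv(index=False)
print(csv_text)
```

Unlike SELECT * on a SQL join, merge with on='Date' keeps a single Date column in the result.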

Pandas: Memory error when using apply to split single column array into columns

I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the below example on larger data?
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
The memory error occurs when creating df3.
The DataFrames in the example:
df
     0
0  NaN
1  NaN
df2
0    [[0.6704675101784022, 0.41730480236712697, 0.5...
1    [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
          0         1         2
0  0.670468  0.417305  0.558690
0  0.140387  0.198101  0.800745
First, I think working with lists in pandas is not a good idea; avoid it if you can.
I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print(df3)
          0         1         2
0  0.903482  0.393081  0.623970
1  0.637877  0.880499  0.299172
Here's an example with a solution to the problem (note that in this example the column holds arrays rather than lists; I cannot avoid this, since my original problem comes with lists or arrays in a column).
import pandas as pd
import numpy as np
import time
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
# convert the rows chunk by chunk to limit peak memory use
chunkSize = int(round(nRows / float(numberOfChunks)))
for start in range(0, nRows, chunkSize):
    df2tmp = df2.iloc[start:start + chunkSize]
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    if start == 0:
        df3 = df3tmp
    else:
        df3 = pd.concat([df3, df3tmp])
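A variation on the chunked conversion, offered as a sketch rather than a measured improvement: preallocate the float16 result once and fill it chunk by chunk, so no intermediate frames need to be concatenated. The sizes are scaled down, and the names s, out and n_chunks are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows, n_cols, n_chunks = 10, 4, 3

# A Series whose cells each hold a 1-D array, as in the example above
s = pd.Series([rng.random(n_cols) for _ in range(n_rows)])

# Allocate the final array once, already in the smaller dtype
out = np.empty((n_rows, n_cols), dtype="float16")
for pos in np.array_split(np.arange(n_rows), n_chunks):
    # stack one chunk of row arrays; assignment casts to float16
    out[pos] = np.vstack(s.iloc[pos].tolist())

df3 = pd.DataFrame(out, index=s.index)
print(df3.dtypes.unique())
```

Only one chunk exists at full float64 precision at any time, which is the same memory argument as in the loop above.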

Data-Visualization Python

Plot 4 different line plots for the 4 companies in the dataframe open_prices. Year should be on the X-axis and stock price on the Y-axis; you will need a (2,2) plot. Set the figure size to (10, 8) and share the X-axis for better visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nsepy import get_history
import datetime as dt
%matplotlib inline
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
infy = get_history(symbol='INFY', start = start, end = end)
infy.index = pd.to_datetime(infy.index)
hdfc = get_history(symbol='HDFC', start = start, end = end)
hdfc.index = pd.to_datetime(hdfc.index)
reliance = get_history(symbol='RELIANCE', start = start, end = end)
reliance.index = pd.to_datetime(reliance.index)
wipro = get_history(symbol='WIPRO', start = start, end = end)
wipro.index = pd.to_datetime(wipro.index)
open_prices = pd.concat([infy['Open'], hdfc['Open'],reliance['Open'],
wipro['Open']], axis = 1)
open_prices.columns = ['Infy', 'Hdfc', 'Reliance', 'Wipro']
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
axes[0, 0].plot(open_prices.index.year,open_prices.INFY)
axes[0, 1].plot(open_prices.index.year,open_prices.HDB)
axes[1, 0].plot(open_prices.index.year,open_prices.TTM)
axes[1, 1].plot(open_prices.index.year,open_prices.WIT)
A blank graph comes up. Please help!
The code below works fine. I changed the following things:
a) axes should be ax; b) the DataFrame column names were incorrect; c) anyone who wants to try this example also needs to install the lxml library.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nsepy import get_history
import datetime as dt
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
infy = get_history(symbol='INFY', start = start, end = end)
infy.index = pd.to_datetime(infy.index)
hdfc = get_history(symbol='HDFC', start = start, end = end)
hdfc.index = pd.to_datetime(hdfc.index)
reliance = get_history(symbol='RELIANCE', start = start, end = end)
reliance.index = pd.to_datetime(reliance.index)
wipro = get_history(symbol='WIPRO', start = start, end = end)
wipro.index = pd.to_datetime(wipro.index)
open_prices = pd.concat([infy['Open'], hdfc['Open'],reliance['Open'],
wipro['Open']], axis = 1)
open_prices.columns = ['Infy', 'Hdfc', 'Reliance', 'Wipro']
print(open_prices.columns)
ax=[]
f, ax = plt.subplots(2, 2, sharey=True)
ax[0,0].plot(open_prices.index.year,open_prices.Infy)
ax[1,0].plot(open_prices.index.year,open_prices.Hdfc)
ax[0,1].plot(open_prices.index.year,open_prices.Reliance)
ax[1,1].plot(open_prices.index.year,open_prices.Wipro)
plt.show()
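The exercise asks for a 10 x 8 figure with a shared X-axis, which the code above does not set. A minimal sketch under those requirements follows; the random-walk prices are made-up stand-ins for the nsepy downloads, and the Agg backend is chosen only so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up prices standing in for the downloaded 'Open' columns
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=500, freq="B")
open_prices = pd.DataFrame(
    100 + rng.normal(size=(len(dates), 4)).cumsum(axis=0),
    index=dates,
    columns=["Infy", "Hdfc", "Reliance", "Wipro"],
)

# 2 x 2 grid, 10 x 8 inches, X-axis shared between the panels
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
for ax, name in zip(axes.flat, open_prices.columns):
    # plot against the full dates; the tick labels then show the years
    ax.plot(open_prices.index, open_prices[name])
    ax.set_title(name)
    ax.set_ylabel("Open price")
fig.autofmt_xdate()
```

Plotting against the full date index rather than index.year keeps one point per trading day instead of stacking a whole year of prices on a single x value.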

vectorization of loop in pandas

I've been trying to vectorize the following with no luck:
Consider two data frames. One is a list of dates:
cols = ['col1', 'col2']
index = pd.date_range('1/1/15','8/31/18')
df = pd.DataFrame(columns = cols )
What I'm doing currently is looping through the dates and, for each one, counting the rows in my main (large) dataframe df_main whose date is less than or equal to it:
for x in range(len(index)):
    temp_arr = []
    active = len(df_main[(df_main.n_date <= index[x])])
    temp_arr = [index[x],active]
    df= df.append(pd.Series(temp_arr,index=cols) ,ignore_index=True)
Is there a way to vectorize the above?
What about something like the following?
#initializing
mycols = ['col1', 'col2']
myindex = pd.date_range('1/1/15','8/31/18')
mydf = pd.DataFrame(columns = mycols )
#create df_main (that has each of myindex's dates minus 10 days)
df_main = pd.DataFrame(data=myindex-pd.Timedelta(days=10), columns=['n_date'])
# wrap a dataframe around a list comprehension (still one comparison per date)
mydf = pd.DataFrame([[x, len(df_main[df_main['n_date'] <= x])] for x in myindex], columns=mycols)
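The list comprehension still scans df_main once per date. A fully vectorized sketch, assuming n_date holds no missing values, sorts the dates once and uses numpy's searchsorted to count how many fall at or before each date. The sample data is made up:

```python
import numpy as np
import pandas as pd

# Made-up data: dates to evaluate and a small df_main
myindex = pd.date_range("2015-01-01", "2015-01-10")
df_main = pd.DataFrame(
    {"n_date": pd.to_datetime(["2014-12-30", "2015-01-03", "2015-01-03", "2015-01-08"])}
)

# Sort once; side="right" counts the entries that are <= each date
sorted_dates = np.sort(df_main["n_date"].to_numpy())
counts = np.searchsorted(sorted_dates, myindex.to_numpy(), side="right")

mydf = pd.DataFrame({"col1": myindex, "col2": counts})
print(mydf)
```

This replaces one boolean mask per date with a single sort and a binary search per date.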