Filling a pandas DataFrame - pandas

I have a list of ticker values
ticker = ["AAPL","MSFT","GOOG"]
and I want to create a DF with "high" values of prices for all the stocks in the ticker list.
Creating an empty DF:
high_df = pd.DataFrame(columns = ticker)
Filling the DF:
import pandas_datareader as web
import datetime
start = datetime.datetime(2010,1,1)
end = datetime.datetime(2010,2,1)
for each_column in high_df.columns:
    high_df[each_column] = web.DataReader(each_column, "yahoo", start, end)["High"]
This works, but it takes a long time if the ticker list is huge. Any suggestions for speeding up the way the DataFrame is filled?

Seems like you just need parallel computing:
import datetime
import pandas_datareader as web
from joblib import Parallel, delayed

def yourfunc(tic):
    start = datetime.datetime(2010, 1, 1)
    end = datetime.datetime(2010, 2, 1)
    result = web.DataReader(tic, "yahoo", start, end)["High"]
    return result

# threading backend, since the work is I/O-bound; verbose=5 prints progress
results = Parallel(n_jobs=-1, verbose=5, backend="threading")(
    map(delayed(yourfunc), ticker))
For the conversion, use pd.concat(results, axis=1, keys=ticker), which lines the returned Series up by date into one DataFrame with a column per ticker.
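The Parallel pattern can be checked end-to-end with a stand-in for the network call (`slow_fetch` below is hypothetical; in the real script it would be the `DataReader` call), using `pd.concat` to line the returned Series up column-wise:

```python
import pandas as pd
from joblib import Parallel, delayed

ticker = ["AAPL", "MSFT", "GOOG"]

def slow_fetch(tic):
    # stand-in for web.DataReader(tic, "yahoo", start, end)["High"]
    return pd.Series([1.0, 2.0, 3.0], name=tic)

# threading backend suits this job: the real work is I/O-bound network waiting
results = Parallel(n_jobs=-1, backend="threading")(
    delayed(slow_fetch)(tic) for tic in ticker)

# one column per ticker, aligned on the shared index
high_df = pd.concat(results, axis=1, keys=ticker)
```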

Related

Create multiple DataFrames using data from an api

I'm using the world bank API to analyze data and I want to create multiple data frames with the same indicators for different countries.
import wbgapi as wb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time as t_lib
#Variables
indicators = ['AG.PRD.LVSK.XD', 'AG.YLD.CREL.KG', 'NY.GDP.MINR.RT.ZS', 'SP.POP.TOTL.FE.ZS']
countries = ['PRT', 'BRL', 'ARG']
time = range(1995, 2021)
#Code
def create_df(country):
    df = wb.data.DataFrame(indicators, country, time, labels=True).reset_index()
    columns = [item for item in df['Series']]
    df = df.T
    df.columns = columns
    df.drop(['Series', 'series'], axis=0, inplace=True)
    df = df.reset_index()
    return df
list_of_dfs = []
for n in range(len(countries)):
    var = create_df(countries[n])
    list_of_dfs.append(var)
What I really wanted is to create a data frame with a different name for each country and to store them in a list or dict, like: [df_1, df_2, df_3...]
EDIT:
I'm trying this now:
a_dictionary = {}
for n in range(len(countries)):
    a_dictionary["key%s" % n] = create_df(countries[n])
It was supposed to work, but I still get the same error in the 2nd loop:
APIResponseError: APIError: JSON decoding error (https://api.worldbank.org/v2/en/sources/2/series/AG.PRD.LVSK.XD;AG.YLD.CREL.KG;NY.GDP.MINR.RT.ZS;SP.POP.TOTL.FE.ZS/country/BRL/time/YR1995;YR1996;YR1997;YR1998;YR1999;YR2000;YR2001;YR2002;YR2003;YR2004;YR2005;YR2006;YR2007;YR2008;YR2009;YR2010;YR2011;YR2012;YR2013;YR2014;YR2015;YR2016;YR2017;YR2018;YR2019;YR2020?per_page=1000&page=1&format=json)
UPDATE:
Thanks to notiv I noticed the problem was using "BRL" instead of "BRA".
I'm also putting here a new approach that works as well by creating a master dataframe and then slicing it by country to create the desired dataframes:
df = wb.data.DataFrame(indicators, countries, time, labels=True).reset_index()
columns = [item for item in df['Series']]
df = df.T
df.columns = columns
df.drop(['Series', 'series'], axis=0, inplace=True)
df = df.reset_index()
a_dictionary = {}
for n in range(len(countries)):
    new_df = df.loc[:, (df == countries[n]).any()]
    new_df['index'] = df['index']
    new_df.set_index('index', inplace=True)
    new_df.drop(['economy', 'Country'], axis=0, inplace=True)
    a_dictionary["eco_df%s" % n] = new_df
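The `(df == countries[n]).any()` selection keeps every column that contains the country code in some row; the mechanics can be seen on a small made-up frame (column names here are hypothetical):

```python
import pandas as pd

# toy frame: each column carries its country code in the first row
df = pd.DataFrame({
    "col_a": ["PRT", 10, 20],
    "col_b": ["BRA", 30, 40],
    "col_c": ["PRT", 50, 60],
})

# boolean Series over columns: True where any cell equals "PRT"
mask = (df == "PRT").any()
prt_df = df.loc[:, mask]  # keeps col_a and col_c only
```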
for loop in range(len(countries)):
    for n in range(len(a_dictionary[f'eco_df{loop}'].columns)):
        sns.set_theme(style="dark")
        g = sns.relplot(data=a_dictionary[f'eco_df{loop}'],
                        x=a_dictionary[f'eco_df{loop}'].index,
                        y=a_dictionary[f'eco_df{loop}'].iloc[:, n],
                        kind="line", palette="crest",
                        height=5, aspect=1.61, legend=False).set(title=countries[loop])
        g.set_axis_labels("Years")
        g.set_xticklabels(rotation=45)
        g.tight_layout()
In the end I used the dataframes to create a chart for each indicator for each country.
Many thanks for the help.

Plotly scatterplot using pandas groupby for traces

I run into this pattern quite often. I want my traces to be the results of a groupby operation.
data = dict(
    time=[1, 1, 1, 2, 2, 2, 3, 3, 3],
    satellite_ID=[3, 24, 9, 3, 24, 9, 3, 24, 9],
    satellite_type=['gps', 'glonass', 'galileo'] * 3,
    snr=[28, 34, 26, 27, 35, 25, 28, 36, 24])
df = pd.DataFrame(data)
The x-axis is time, the y-axis is SNR, and each line+marker trace is a unique satellite ID. There should be 3 traces at time 1, 2, and 3 for each satellite. A nice addition would be to have each satellite_type be a different color and visible on mouse hover.
I think I figured it out from the documentation.
import plotly.express as px
import pandas as pd
data = dict(
    time=[1, 1, 1, 2, 2, 2, 3, 3, 3],
    satellite_ID=[3, 24, 9, 3, 24, 9, 3, 24, 9],
    satellite_type=['gps', 'glonass', 'galileo'] * 3,
    snr=[28, 34, 26, 27, 35, 25, 28, 36, 24])
df = pd.DataFrame(data)
fig = px.line(df, x="time", y="snr", color='satellite_ID',
              hover_data=['satellite_type'])
fig.update_traces(mode="markers+lines")
"color" selects the traces, and additional hover data can be entered using the "hover_data" argument to px.line.
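Under the hood, `color='satellite_ID'` amounts to grouping the frame by that column and drawing one trace per group. The split itself can be checked without any plotting (a sketch of what px.line does with the data, not its actual internals):

```python
import pandas as pd

data = dict(
    time=[1, 1, 1, 2, 2, 2, 3, 3, 3],
    satellite_ID=[3, 24, 9, 3, 24, 9, 3, 24, 9],
    satellite_type=['gps', 'glonass', 'galileo'] * 3,
    snr=[28, 34, 26, 27, 35, 25, 28, 36, 24])
df = pd.DataFrame(data)

# one (x, y) sub-frame per trace, keyed by satellite ID
traces = {sid: g[["time", "snr"]] for sid, g in df.groupby("satellite_ID")}
```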

Dataframe column won't convert from integer string to an actual integer

I have a date string in microsecond resolution. I need it as an integer.
import numpy as np
import pandas as pd
data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])
df["A"].astype(np.int)
Error:
File "pandas\_libs\lib.pyx", line 545, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I get the same problem if I try to cast it to a standard Python int.
Per my answer in your previous question:
import numpy as np
import pandas as pd
data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])
# slow but big enough: Python ints have arbitrary precision
df["A_as_python_int"] = df["A"].apply(int)
# fast but has to be split into two 64-bit integers
df["A_seconds"] = (df["A_as_python_int"] // 1000000).astype(np.int64)
df["A_fractions"] = (df["A_as_python_int"] % 1000000).astype(np.int64)
You could do this:
import pandas as pd
data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])
before = df.A[0]
df.A = [int(x) for x in df.A.tolist()]
after = df.A[0]
before, after
Output:
The data has been cast into an integer. Showing: (before, after)
('20181231235959383171', 20181231235959383171)
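Since the strings are really timestamps with microsecond resolution, a further option (not raised in the answers above) is to skip the integer cast entirely and parse them with `pd.to_datetime`, where `%f` consumes the trailing six digits:

```python
import pandas as pd

data = ["20181231235959383171", "20181231235959383172"]
df = pd.DataFrame(data=data, columns=["A"])

# parse directly: 4+2+2 date digits, 2+2+2 time digits, 6 microsecond digits
df["ts"] = pd.to_datetime(df["A"], format="%Y%m%d%H%M%S%f")
```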

Dask Dataframe: Defining meta for date diff in groupby

I'm trying to find inter-purchase times (i.e., days between orders) for customers. Although my code is working correctly without defining meta, I would like to get it working properly and no longer see the warning asking me to provide meta.
Also, I would appreciate any suggestions on how to use map or map_partitions instead of apply.
So far I've tried:
meta={'days_since_last_order': 'datetime64[ns]'}
meta={'days_since_last_order': 'f8'}
meta={'ORDER_DATE_DT':'datetime64[ns]','days_since_last_order': 'datetime64[ns]'}
meta={'ORDER_DATE_DT':'f8','days_since_last_order': 'f8'}
meta=('days_since_last_order', 'f8')
meta=('days_since_last_order', 'datetime64[ns]')
Here is my code:
import numpy as np
import pandas as pd
import datetime as dt
import dask.dataframe as dd
from dask.distributed import wait, Client
client = Client(processes=True)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
d = (end - start).days + 1
np.random.seed(0)
df = pd.DataFrame()
df['CUSTOMER_ID'] = np.random.randint(1, 4, 10)
df['ORDER_DATE_DT'] = start + pd.to_timedelta(np.random.randint(1, d, 10), unit='d')
print(df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT']))
print(df)
ddf = dd.from_pandas(df, npartitions=2)
# setting ORDER_DATE_DT as index to sort by date
ddf = ddf.set_index('ORDER_DATE_DT')
ddf = client.persist(ddf)
wait(ddf)
ddf = ddf.reset_index()
grp = ddf.groupby('CUSTOMER_ID')[['ORDER_DATE_DT']].apply(
    lambda df: df.assign(days_since_last_order=df.ORDER_DATE_DT.diff(1)),
    # meta=????
)
# for some reason, I'm unable to print grp unless I reset_index()
grp = grp.reset_index()
print(grp.compute())
Here is the printout of df.sort_values(['CUSTOMER_ID','ORDER_DATE_DT'])
Here is the printout of grp.compute()
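For reference, the underlying computation in plain pandas is a per-customer diff, with no meta required; it is also a convenient way to sanity-check the dask results on a small frame (the toy orders below are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 1, 2, 2],
    "ORDER_DATE_DT": pd.to_datetime(
        ["2015-01-05", "2015-02-01", "2015-03-01", "2015-01-10", "2015-04-10"]),
}).sort_values(["CUSTOMER_ID", "ORDER_DATE_DT"])

# diff within each customer: NaT for a first order, a Timedelta afterwards
df["days_since_last_order"] = df.groupby("CUSTOMER_ID")["ORDER_DATE_DT"].diff()
```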

How can I apply a couple of functions to multiple tickers in a list? (Code improvement)

So I'm currently learning how to analyse financial data in python using numpy, pandas, etc... and I'm starting off with a small script that will hopefully rank some chosen equities by the price change between 2 chosen dates.
My first script was:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
#Edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
#Getting the data:
AAPL = web.DataReader('AAPL', 'google', start, end)
GOOGL = web.DataReader('GOOGL', 'google', start, end)
YHOO = web.DataReader('YHOO', 'google', start, end)
MSFT = web.DataReader('MSFT', 'google', start, end)
AMZN = web.DataReader('AMZN', 'google', start, end)
DAI = web.DataReader('DAI', 'google', start, end)
#Calculating the change:
AAPLkey = (AAPL.ix[start]['Close'])/(AAPL.ix[end]['Close'])
GOOGLkey = (GOOGL.ix[start]['Close'])/(GOOGL.ix[end]['Close'])
YHOOkey = (YHOO.ix[start]['Close'])/(YHOO.ix[end]['Close'])
MSFTkey = (MSFT.ix[start]['Close'])/(MSFT.ix[end]['Close'])
AMZNkey = (AMZN.ix[start]['Close'])/(AMZN.ix[end]['Close'])
DAIkey = (DAI.ix[start]['Close'])/(DAI.ix[end]['Close'])
#Formatting the output in a sorted order:
dict1 = {"AAPL" : AAPLkey, "GOOGL" : GOOGLkey, "YHOO" : YHOOkey, "MSFT" : MSFTkey, "AMZN" : AMZNkey, "DAI" : DAIkey}
out = sorted(dict1.items(), key=itemgetter(1), reverse = True)
for tick, change in out:
    print(tick, "\t", change)
I now obviously want to make this far shorter and this is what I've got so far:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
#Edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
for eq in stocks:
    eq = web.DataReader(eq, 'google', start, end)
for legend in eq:
    legend = (eq.ix[start]['Close']) / (eq.ix[end]['Close'])
print(legend)
The calculation works BUT the problem is this only outputs the last value for the item in the list (DAI).
So what's next in order to get the same result as my first code?
You can just move the print statement into the loop.
Like:
for legend in eq:
    legend = (eq.loc[start]['Close']) / (eq.loc[end]['Close'])
    print(legend)
Improved answer:
Get rid of the label loop and print the values inside the first loop:
for eq in stocks:
    df = web.DataReader(eq, 'google', start, end)
    print(df.loc[start]['Close'] / df.loc[end]['Close'])
When you loop over stocks with for eq in stocks, you save each result into eq, so it gets overwritten on every iteration. Store the results in a list instead, as I have done with data.
Then loop over the data list which contains dataframes, and then use proper selection.
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
# edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
# store all the dataframes in a list
data = []
for eq in stocks:
    data.append(web.DataReader(eq, 'google', start, end))
# print required fields from each dataframe
for df in data:
    print(df.ix[start]['Close'] / df.ix[end]['Close'])
Output:
0.624067042032
0.612014075932
0.613225417599
0.572179539021
0.340850298595
1.28323537643
Thanks to the other answers, they both helped a lot. This is my final, improved script:
import numpy as np
import pandas as pd
from pandas_datareader import data as web
from pandas import Series, DataFrame
import datetime
from operator import itemgetter
# edit below for 2 dates you wish to calculate:
start = datetime.datetime(2014, 7, 15)
end = datetime.datetime(2017, 7, 25)
stocks = ('AAPL', 'GOOGL', 'YHOO', 'MSFT', 'AMZN', 'DAI')
dict1 = {}
for eq in stocks:
    df = web.DataReader(eq, 'google', start, end)
    k = (df.loc[start]['Close']) / (df.loc[end]['Close'])
    dict1[eq] = k
out = sorted(dict1.items(), key=itemgetter(1), reverse = True)
for tick, change in out:
    print(tick, "\t", change)
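The final sort-and-print step is independent of the data source, so it can be checked on made-up ratios (the numbers below are invented for illustration):

```python
from operator import itemgetter

# hypothetical close-price ratios keyed by ticker
dict1 = {"AAPL": 0.62, "DAI": 1.28, "AMZN": 0.34}

# largest price change first, exactly as in the script above
out = sorted(dict1.items(), key=itemgetter(1), reverse=True)
for tick, change in out:
    print(tick, "\t", change)
```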