Select rows in dataframes inside a list then append to another dataframe inside another list - pandas

I have daily stock data inside a list of n dataframes (each stock has its own dataframe). I want to select m rows at equal time intervals from each dataframe and append them to dataframes inside another list. Basically, the new list should have m dataframes (the number of days sampled), each of length n (the number of stocks).
I tried nested for loops, but it just didn't work:
cross_section = []
cross_sections_list = []
for m in range(0, len(datalist[0]), 100):
    for n in range(len(datalist)):
        cross_section.append(datalist[n].iloc[m])
    cross_sections_list.append(cross_section)
This code didn't do anything; my machine just got stuck on it. If there is another way, like multiindexing for example, I would love to try it too.
For example
input:
[
Adj Close Ticker
Date
2020-06-01 321.850006 AAPL
2020-06-02 323.339996 AAPL
2020-06-03 325.119995 AAPL
2020-06-04 322.320007 AAPL
2020-06-05 331.500000 AAPL
2020-06-08 333.459991 AAPL
2020-06-09 343.989990 AAPL
2020-06-10 352.839996 AAPL ,
Adj Close Ticker
Date
2020-06-01 182.830002 MSFT
2020-06-02 184.910004 MSFT
2020-06-03 185.360001 MSFT
2020-06-04 182.919998 MSFT
2020-06-05 187.199997 MSFT
2020-06-08 188.360001 MSFT
2020-06-09 189.800003 MSFT
2020-06-10 196.839996 MSFT ]
output:
[
Adj Close Ticker
Date
2020-06-01 321.850006 AAPL
2020-06-01 182.830002 MSFT ,
Adj Close Ticker
Date
2020-06-03 325.119995 AAPL
2020-06-03 185.360001 MSFT ,
Adj Close Ticker
Date
2020-06-05 331.500000 AAPL
2020-06-05 187.199997 MSFT ]
and so on.
Thank you

Not exactly clear what you want, but here is some code that hopefully helps.
list_of_df = []  # list of all the df's
alldf = pd.concat(list_of_df)  # brings all df's into one df
list_of_grouped = [y for x, y in alldf.groupby('Date')]  # separates df's by date and puts them in a list
number_of_days = alldf.groupby('Date').ngroups  # total number of groups (Dates)
list_of_Dates = [x for x, y in alldf.groupby('Date')]  # list of all the dates that were grouped
count_of_stocks = []
for i in range(len(list_of_grouped)):
    count_of_stocks.append(len(list_of_grouped[i]))  # puts the count of each grouped df into a list
zipped = list(zip(list_of_Dates, count_of_stocks))  # pairs each date with its stock count to see how many stocks are in each date

data_global = pd.DataFrame()
for i in datalist:
    data_global = data_global.append(i)  # first merge all dataframes into one
data_by_date = [i for x, i in data_global.groupby('Date')]  # step 2: group the data by date
each_nth_day = []
for i in range(0, len(data_by_date), 21):
    each_nth_day.append(data_by_date[i])  # lastly take each n-th dataframe (21 in this case)
Thanks for your help, user13802115.
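Note: DataFrame.append was removed in pandas 2.0, so on current versions the same pipeline can be written with pd.concat, and taking every n-th daily cross-section becomes a plain list slice (a minimal sketch, assuming datalist holds the per-stock frames as in the question):
import pandas as pd

# Merge the per-stock frames into one long frame (Date index, Ticker column),
# split it into one dataframe per date, then keep every 21st day.
data_global = pd.concat(datalist)
data_by_date = [group for _, group in data_global.groupby('Date')]
each_nth_day = data_by_date[::21]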

Related

Dataframe format change: I want to transform the left dataframe into the right dataframe

I have tried the melt, unstack and stack pandas functions but cannot understand how to implement them.
Dataframe transformation and reformatting:
I suppose the required transformation simply can't be done in a straightforward way using Pandas functions, because the source data structure does not fit well into the concept of a Pandas DataFrame. In addition, it does not contain the information about the required column names, the columns the target DataFrame is expected to have, or the number of necessary duplicates of rows.
See the Python code illustrating the problem by providing dictionaries for the data tables:
import pandas as pd
pd_json_dict_src = {
    'Date': '2023-01-01',
    'Time': '14:00',
    'A': {
        'Intro': [
            {'FIB': '1.00'},
            {'DIB': '2.00'} ] },
    'B': {
        'Intro': [
            {'FIB': '3.00'},
            {'DIB': '4.00'} ] }
}
df_src = pd.DataFrame.from_dict(pd_json_dict_src)
print(df_src)
# -----------------------------------------------
pd_json_dict_tgt = {
    'Date': ['2023-01-01', '2023-01-01'],
    'Time': ['14:00', '14:00'],
    'Area': ['A', 'B'],
    'Tech': ['Intro', 'Intro'],
    'FIB':  ['1.00', '3.00'],
    'DIB':  ['2.00', '4.00']
}
df_tgt = pd.DataFrame.from_dict(pd_json_dict_tgt)
print(df_tgt)
prints
Date ... B
Intro 2023-01-01 ... [{'FIB': '3.00'}, {'DIB': '4.00'}]
[1 rows x 4 columns]
Date Time Area Tech FIB DIB
0 2023-01-01 14:00 A Intro 1.00 2.00
1 2023-01-01 14:00 B Intro 3.00 4.00
I also don't see any easy-to-code, automated way to transform the source data structure defined by the dictionary into the data structure of the target dictionary.
In other words, it seems there is no straightforward, easy-to-code general way of flattening column-wise deeply nested data structures, especially when choosing Python Pandas as the tool for such flattening, at least as long as no other answer here proves me and this statement wrong.
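For completeness, here is a plain-Python sketch (so not a pure-Pandas transformation, and assuming the source dictionary has exactly the layout shown above: two scalar keys plus areas containing techs containing one-key dicts) that flattens the nested source directly into the target shape:
import pandas as pd

pd_json_dict_src = {
    'Date': '2023-01-01',
    'Time': '14:00',
    'A': {'Intro': [{'FIB': '1.00'}, {'DIB': '2.00'}]},
    'B': {'Intro': [{'FIB': '3.00'}, {'DIB': '4.00'}]},
}

rows = []
for area, techs in pd_json_dict_src.items():
    if not isinstance(techs, dict):  # skip the scalar 'Date'/'Time' entries
        continue
    for tech, entries in techs.items():
        row = {'Date': pd_json_dict_src['Date'],
               'Time': pd_json_dict_src['Time'],
               'Area': area,
               'Tech': tech}
        for entry in entries:  # each entry is a one-key dict like {'FIB': '1.00'}
            row.update(entry)
        rows.append(row)

df_tgt = pd.DataFrame(rows)
print(df_tgt)
#          Date   Time Area   Tech   FIB   DIB
# 0  2023-01-01  14:00    A  Intro  1.00  2.00
# 1  2023-01-01  14:00    B  Intro  3.00  4.00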
Create the dataframe:
df = pd.DataFrame(data=[["2023-01-01","14:00",1.0,2.0,3.0,4.0]])
0 1 2 3 4 5
0 2023-01-01 14:00 1.00 2.00 3.00 4.00
Set the index columns:
df = df.set_index([0,1]).rename_axis(["Date", "Time"])
2 3 4 5
Date Time
2023-01-01 14:00 1.00 2.00 3.00 4.00
Create the multi-level columns:
df.columns = pd.MultiIndex.from_tuples([("A", "Intro", "FIB"), ("A", "Intro", "DIB"), ("B", "Intro", "FIB"), ("B", "Intro", "DIB")], names=["Area", "Tech", ""])
Area A B
Tech Intro Intro
FIB DIB FIB DIB
Date Time
2023-01-01 14:00 1.00 2.00 3.00 4.00
Use stack() on levels "Area" and "Tech":
df = df.stack(["Area","Tech"])
DIB FIB
Date Time Area Tech
2023-01-01 14:00 A Intro 2.00 1.00
B Intro 4.00 3.00
Reset index to create columns "Date" and "Time" and repeat their values:
df = df.reset_index()
Date Time Area Tech DIB FIB
0 2023-01-01 14:00 A Intro 2.00 1.00
1 2023-01-01 14:00 B Intro 4.00 3.00
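For convenience, the steps above can be chained into a single pipeline (same data and column tuples as before):
df = (pd.DataFrame(data=[["2023-01-01", "14:00", 1.0, 2.0, 3.0, 4.0]])
        .set_index([0, 1])
        .rename_axis(["Date", "Time"]))
df.columns = pd.MultiIndex.from_tuples(
    [("A", "Intro", "FIB"), ("A", "Intro", "DIB"),
     ("B", "Intro", "FIB"), ("B", "Intro", "DIB")],
    names=["Area", "Tech", ""])
df = df.stack(["Area", "Tech"]).reset_index()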

What is the most efficient way to iterate over a dataframe, run a SQL query for each row, and save each result as a dataframe in pandas

I have a dataframe like this:
import pandas as pd
import sqlalchemy
con = sqlalchemy.create_engine('....')
df = pd.DataFrame({'user_id': [1, 2, 3],
                   'start_date': pd.Series(['2022-05-01 00:00:00', '2022-05-10 00:00:00', '2022-05-20 00:00:00'], dtype='datetime64[ns]'),
                   'end_date': pd.Series(['2022-06-01 00:00:00', '2022-06-10 00:00:00', '2022-06-20 00:00:00'], dtype='datetime64[ns]')})
'''
user_id start_date end_date
1 2022-05-01 00:00:00 2022-06-01 00:00:00
2 2022-05-10 00:00:00 2022-06-10 00:00:00
3 2022-05-20 00:00:00 2022-06-20 00:00:00
'''
I want to get the sales data for each user from the database in the date ranges specified in the df. Below is a code that I am currently using and it is working correctly.
df_stats = pd.DataFrame()
for k, j in df.iterrows():
    sql = '''
    select '{}' as user_id, sum(item_price) as sales, count(return) as return from sales
    where created_at between '{}' and '{}' and user_id={}'''.format(j['user_id'], j['start_date'], j['end_date'], j['user_id'])
    sql_to_df = pd.read_sql(sql, con)
    df_stats = df_stats.append(sql_to_df)
final = df.merge(df_stats, on='user_id')
'''
final:
user_id start_date end_date sales return
1 2022-05-01 00:00:00 2022-06-01 00:00:00 1500 5
2 2022-05-10 00:00:00 2022-06-10 00:00:00 2900 9
3 2022-05-20 00:00:00 2022-06-20 00:00:00 1450 1
'''
But in the articles I read, it is mentioned that using iterrows() is very slow. Is there a way to make this process more efficient?
Note: this previously asked question is similar to mine, but I couldn't find a satisfactory answer there.
You can use .to_records to transform the rows into a list of tuples, then iterate over the list, unpack each tuple, and pass the arguments to your_sql_function:
import pandas as pd
data = {
    "user_id": [1, 2, 3],
    "start_date": pd.Series(["2022-05-01 00:00:00", "2022-05-10 00:00:00", "2022-05-20 00:00:00"], dtype="datetime64[ns]"),
    "end_date": pd.Series(["2022-06-01 00:00:00", "2022-06-10 00:00:00", "2022-06-20 00:00:00"], dtype="datetime64[ns]")
}
df = pd.DataFrame(data)

for user, start, end in df.to_records(index=False):
    your_sql_function(user, start, end)
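If the goal is fewer round trips to the database, another option is to fetch all candidate rows in one set-based query and aggregate in pandas. This is only a sketch: it assumes the sales table and columns from the question, uses SQLAlchemy's text() for bound parameters, and assumes the overall date range is small enough to pull at once.
from sqlalchemy import text

# One query covering the union of all per-user windows (instead of one query per row).
lo, hi = df['start_date'].min(), df['end_date'].max()
sales = pd.read_sql(
    text("select user_id, created_at, item_price, return from sales "
         "where created_at between :lo and :hi"),
    con, params={'lo': lo, 'hi': hi})
# Note: the column name 'return' comes from the question; it is a reserved
# word in many SQL dialects and may need quoting.

# Keep only the rows inside each user's own window, then aggregate per user.
merged = df.merge(sales, on='user_id')
in_window = merged[(merged['created_at'] >= merged['start_date'])
                   & (merged['created_at'] <= merged['end_date'])]
df_stats = (in_window.groupby('user_id')
            .agg(sales=('item_price', 'sum'), **{'return': ('return', 'count')})
            .reset_index())
final = df.merge(df_stats, on='user_id')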

Pandas add row to datetime indexed dataframe

I cannot find a solution for this problem. I would like to add future dates to a datetime indexed Pandas dataframe for model prediction purposes.
Here is where I am right now:
new_datetime = df2.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
And this is where I am stuck. The append examples online always seem to show examples with ignore_index=True, and in my case, I want to use proper datetime indexing.
Suppose you have this df:
date value
0 2020-01-31 00:00:00 1
1 2020-02-01 00:00:00 2
2 2020-02-02 00:00:00 3
then an alternative for adding future days is
df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=6, freq='D', closed='right')}))
which returns
date value
0 2020-01-31 00:00:00 1.0
1 2020-02-01 00:00:00 2.0
2 2020-02-02 00:00:00 3.0
0 2020-02-03 00:00:00 NaN
1 2020-02-04 00:00:00 NaN
2 2020-02-05 00:00:00 NaN
3 2020-02-06 00:00:00 NaN
4 2020-02-07 00:00:00 NaN
where the frequency is D (days) and the period is 6 days.
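On recent pandas versions the same idea needs two small adjustments (a sketch of the equivalent: DataFrame.append was removed in pandas 2.0, and date_range's closed= parameter was replaced by inclusive= in pandas 1.4):
future = pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1],
                                             periods=6, freq='D',
                                             inclusive='right')})
df = pd.concat([df, future])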
I think I was making this more difficult than necessary because I was using a datetime index instead of the typical integer index. By leaving the 'date' field as a regular column instead of an index, adding the rows is straightforward.
One thing I did do was add a reset_index call so I did not end up with wonky duplicate index values:
df = df.append(pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=21, freq='D', closed='right')}))
df = df.reset_index() # resets index
I also needed this, and I solved it by merging the code shared here with the code from this other response (add to a dataframe as I go with datetime index), ending up with the following code, which works for me:
data=raw.copy()
new_datetime = data.index[-1:] # current end of datetime index
increment = '1 days' # string for increment - eventually will be in a for loop to add add'l days
new_datetime = new_datetime+pd.Timedelta(increment)
today_df = pd.DataFrame({'value': 301.124},index=new_datetime)
data = data.append(today_df)
data.tail()
Here, 'value' is the column header of your own dataframe.
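On current pandas, an even shorter variant for a single new row (a sketch, assuming a unique datetime index and the single 'value' column used here) is label-based assignment, which enlarges the frame when the label is new:
data.loc[new_datetime[0]] = 301.124  # adds one row at the new timestamp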

Groupby MultiIndex and apply dot product to each group in pandas

I have dataframe X
>>> X
A B
x1 x2 intercept x1 x2 intercept
Date
2020-12-31 48.021395 2.406670 1 -11.538462 2.406670 1
2021-03-31 33.229490 2.410444 1 -23.636364 2.405720 1
2021-06-30 11.498812 2.419787 1 -32.727273 2.402403 1
2021-09-30 5.746014 2.583867 1 -34.000000 2.479682 1
2021-12-31 4.612371 2.739457 1 -39.130435 2.496616 1
2022-03-31 3.679404 2.766474 1 -40.476190 2.411736 1
2022-06-30 3.248155 2.771958 1 -45.945946 2.303280 1
and series b:
>>> b
x1 -0.006
x2 0.083
intercept 0.017
I need to compute the dot product of each of the groups A and B with b, and put the results in one dataframe. I can go through each group explicitly, like the following:
result = pd.concat(
    [X["A"].dot(b).rename("A"), X["B"].dot(b).rename("B")], axis=1,
)
A B
Date
2020-12-31 -0.071375 0.285984
2021-03-31 0.017690 0.358493
2021-06-30 0.148849 0.412763
2021-09-30 0.196985 0.426814
2021-12-31 0.216701 0.459002
2022-03-31 0.224541 0.460031
2022-06-30 0.227584 0.483848
Is there a way to achieve the same without explicitly looping through the groups? In particular, is it possible to first groupby the first level of MultiIndex, then apply the dot product to each group? For example:
result=X.groupby(level=[0], axis=1).apply(lambda x: x.dot(b))
This gives me a ValueError: matrices are not aligned error, which I think is because the groups in X have two levels of index in their columns whereas b's index is a simple index. So I need to add a level of index to b to match that in X? Like:
result = X.groupby(level=[0], axis=1).apply(
    lambda x: x.dot(pd.concat([b], keys=[x.columns.get_level_values(0)[0]]))
)
With this I get ValueError: cannot reindex from a duplicate axis. I am getting stuck here.
Use DataFrame.droplevel to remove the top level, combined with rename:
f = lambda x: x.droplevel(0, axis=1).dot(b).rename(x.name)
result = X.groupby(level=0, axis=1).apply(f)
print(result)
A B
2020-12-31 -0.071375 0.285984
2021-03-31 0.017690 0.358493
2021-06-30 0.148849 0.412763
2021-09-30 0.196985 0.426814
2021-12-31 0.216701 0.459002
2022-03-31 0.224541 0.460031
2022-06-30 0.227584 0.483848
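A loop-free alternative (a sketch, assuming every top-level group shares the same inner columns, as in X above) is to stack the top column level into the row index, take a single dot product, and unstack it back:
result = X.stack(level=0).dot(b).unstack()  # one column per group (A, B), indexed by Date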

Combine Pandas DataFrames while creating MultiIndex Columns

I have two DataFrames, something like this:
import pandas as pd
dates = pd.Index(['2016-10-03', '2016-10-04', '2016-10-05'], name='Date')
close = pd.DataFrame({'AAPL': [112.52, 113.00, 113.05],
                      'CSCO': [ 31.50,  31.35,  31.59],
                      'MSFT': [ 57.42,  57.24,  57.64]}, index=dates)
volume = pd.DataFrame({'AAPL': [21701800, 29736800, 21453100],
                       'CSCO': [14070500, 18460400, 11808600],
                       'MSFT': [19189500, 20085900, 16726400]}, index=dates)
The output of DataFrame 'close' looks like this:
AAPL CSCO MSFT
Date
2016-10-03 112.52 31.50 57.42
2016-10-04 113.00 31.35 57.24
2016-10-05 113.05 31.59 57.64
And the output of DataFrame 'volume' looks like this:
AAPL CSCO MSFT
Date
2016-10-03 21701800 14070500 19189500
2016-10-04 29736800 18460400 20085900
2016-10-05 21453100 11808600 16726400
I would like to combine these two DataFrames into a single DataFrame with MultiIndex COLUMNS so that it looks like this:
AAPL CSCO MSFT
Close Volume Close Volume Close Volume
Date
2016-10-03 112.52 21701800 31.50 14070500 57.42 19189500
2016-10-04 113.00 29736800 31.35 18460400 57.24 20085900
2016-10-05 113.05 21453100 31.59 11808600 57.64 16726400
Can anyone give me an idea how to do that? I've been playing with pd.concat and pd.merge, but it's not clear to me how to get it to line up on the date index and allow me to provide names for the sub-index ('Close' and 'Volume') on the columns.
You can use the keys kwarg of concat:
In [11]: res = pd.concat([close, volume], axis=1, keys=["close", "volume"])
In [12]: res
Out[12]:
close volume
AAPL CSCO MSFT AAPL CSCO MSFT
Date
2016-10-03 112.52 31.50 57.42 21701800 14070500 19189500
2016-10-04 113.00 31.35 57.24 29736800 18460400 20085900
2016-10-05 113.05 31.59 57.64 21453100 11808600 16726400
With a little rearrangement:
In [13]: res.swaplevel(0, 1, axis=1).sort_index(axis=1)
Out[13]:
AAPL CSCO MSFT
close volume close volume close volume
Date
2016-10-03 112.52 21701800 31.50 14070500 57.42 19189500
2016-10-04 113.00 29736800 31.35 18460400 57.24 20085900
2016-10-05 113.05 21453100 31.59 11808600 57.64 16726400
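If you also want to name the resulting column levels (the question asks about naming the sub-index), concat's names keyword labels the levels of the MultiIndex it creates:
res = pd.concat([close, volume], axis=1, keys=["Close", "Volume"],
                names=["Field", "Ticker"])
res = res.swaplevel(0, 1, axis=1).sort_index(axis=1)  # 'Ticker' on top, 'Field' below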