pandas nested loops out of range of list

I am starting with a list called returnlist:
len(returnlist)
9
returnlist[0]
AAPL AMZN BAC GE GM GOOG GS SNP XOM
Date
2012-01-09 60.247143 178.559998 6.27 18.860001 22.840000 309.218842 94.690002 89.053848 85.500000
2012-01-10 60.462856 179.339996 6.63 18.719999 23.240000 309.556641 98.330002 89.430771 85.720001
2012-01-11 60.364285 178.899994 6.87 18.879999 24.469999 310.957520 99.760002 88.984619 85.080002
2012-01-12 60.198570 175.929993 6.79 18.930000 24.670000 312.785645 101.209999 87.838463 84.739998
2012-01-13 59.972858 178.419998 6.61 18.840000 24.290001 310.475647 98.959999 87.792313 84.879997
I want to get the daily log returns and then use cumsum to get the accumulated returns.
weeklyreturns=[]
for i in range (1,10):
    returns=pd.DataFrame()
    for stock in returnlist[i]:
        if stock not in returnlist[i]:
            returns[stock]=np.log(returnlist[i][stock]).diff()
    weeklyreturns.append(returns)
the error that I am getting is :
----> 4 for stock in returnlist[i]:
      5     if stock not in returnlist[i]:
      6         returns[stock]=np.log(returnlist[i][stock]).diff()
IndexError: list index out of range

Since len(returnlist) == 9, that means the last item of returnlist is returnlist[8].
When you iterate over range(1,10), you will start at returnlist[1] and eventually get to returnlist[9], which doesn't exist.
It seems that what you actually need is to iterate over range(0,9).
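For reference, a corrected version might look like this (a minimal sketch, assuming each element of returnlist is a price DataFrame like the one shown above; the redundant membership check is dropped, since each column is only visited once):

import numpy as np
import pandas as pd

weeklyreturns = []
for i in range(len(returnlist)):           # 0..8 for a list of length 9
    returns = pd.DataFrame()
    for stock in returnlist[i]:            # iterating a DataFrame yields its column names
        returns[stock] = np.log(returnlist[i][stock]).diff()
    weeklyreturns.append(returns)

# accumulated log returns for, e.g., the first DataFrame
cumulative = weeklyreturns[0].cumsum()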

Related

Counting Distinct words AND average time in Pandas

I'm working on analysing some text from a Twitter API using pandas. This will eventually be visualized.
For reference, df.head() of my dataset is:
Count User Time Tweet
0 0 x 2022 ✔️Nécessité de maintien d’une filière 🇪🇺 dynam...
1 1 x 2022 Échanges approfondis à #Dakar avec le Premier ...
2 2 x 2022 ✔️Approvisionnement en #céréales & #engrai...
3 3 x 2022 Aujourd’hui à Tambacounda, à l’Est du Sénégal,...
4 4 x 2022 Working hard since 2019 to reinforce EU #auton...
I'm looking to return the distinct word count along with the average time of the tweets in which each word was used.
Right now, I've been getting the distinct word count of my dataset using df.Tweet.str.split(expand=True).stack().value_counts().
This is useful, returning:
the 1505
de 1500
to 1168
RT 931
of 906
...
africain, 1
langue 1
Félicitations! 1
Length: 18071, dtype: int64
However, I want to also analyse text usage over time.
I'm not super experienced so I'm wondering if there is a way to use a function such as df.groupby() to sort this result by time? Or, is there a way to modify my original function to add a column to my results that includes average time?
I would use str.extractall to get the words, join the Time, then perform a groupby.value_counts to get the count per Time value:
out = (df['Tweet']
       .str.extractall(r'(\S+)')
       .droplevel('match')
       .join(df['Time'])
       .groupby('Time')[0].value_counts()
      )
NB. if you want to exclude non-letters/digits from the words, use r'(\w+)' in place of r'(\S+)'.
Output:
Time 0
2022 à 3
#Dakar 1
#auton... 1
#céréales 1
#engrai... 1
& 1
... 1
...
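If you also need the average Time per word, a variation on the same idea is to group by the extracted word instead and aggregate both the count and the mean. This is only a sketch and assumes Time is numeric (e.g. a year):

word_stats = (df['Tweet']
              .str.extractall(r'(\S+)')
              .droplevel('match')
              .join(df['Time'])
              .groupby(0)['Time']          # group by the extracted word (column 0)
              .agg(['count', 'mean']))     # word count and average Time per word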

Use dataframe with two level index as look up table to fill in values in another dataframe by nearest previous date

I want to use a look up table of events to populate a new index column in a dataset.
I have a data frame that holds a batchcode list indexed by date and tank, where date is the date the batch was allocated. These dates change intermittently whenever fish are moved between tanks.
batchcode
date tank
2016-01-02 LD TRE-20160102-A-1
LA TRE-20160102-B-1
2016-01-09 T8 TRE-20160109-C-1
LB TRE-20160109-C-1
2016-01-25 LA TRE-20160125-D-1
2016-01-27 LD TRE-20160102-A-2
LC TRE-20160102-A-3
2016-02-02 LD TRE-20160102-E-1
LA TRE-20160125-D-2
LB TRE-20160109-C-2
I have a second table that lists daily activities such as feeding, temperature observations etc.
Date Tank Property Value
0 2015-12-06 LC Green Water (g) 50.0
1 2015-12-07 LC Green Water (g) 50.0
2 2015-12-08 LC Green Water (g) 50.0
3 2015-12-09 LC Green Water (g) 50.0
4 2015-12-10 LC Green Water (g) 50.0
I want to add a new batchcode column to this second table, where the value is the batchcode from the first table, i.e. I need to match on Tank and, for each date, find the batchcode set for that date, that is, the most recent previous entry.
What is the best way to solve this? My initial solution ends up running a function for every row of table 2, which seems inefficient. I feel that I should be able to get table 1 to work as a simple indexed lookup:
df2['batchcode'] = df1.find(df2['date', 'tank'], 'batchcode', method='pad') # pseudocode
Should I try to convert the tanks to columns? How do I find the most recent previous date in the index?
Loading code:
bcr_file = f'data/batch_code_records.csv'
bcr_df = pd.read_csv(bcr_file, dtype='str')
bcr_df['date'] = pd.to_datetime(bcr_df['change_date'])
bcr_df.drop(['unique_id', 'species', 'parent_indicator',
             'change_date', 'entered_by', 'fish_class', 'origin_parents_wild_breed',
             'change_reason_1', 'change_reason_2', 'batch_new', 'comments',
             'fish_count', 'source_population_or_wild', 'source_batch_1',
             'source_batch_2', 'batch_previous_1', 'batch_previous_2', 'tank_from_1',
             'tank_from_2'], axis=1, inplace=True)
bcr_df.set_index(['date', 'tank'], inplace=True)
Source data file
unique_id,batchcode,tank,species,parent_indicator,change_date,entered_by,fish_class,origin_parents_wild_breed,change_reason_1,change_reason_2,batch_new,comments,fish_count,source_population_or_wild,source_batch_1,source_batch_2,batch_previous_1,batch_previous_2,tank_from_1,tank_from_2
TRE-20160102-A-1-LD-20160102,TRE-20160102-A-1,LD,TRE,A,20160102,andrew.watkins#plantandfood.co.nz,WILD OLD,WILD,unstocked_source_to_empty_destination,,,"tank_move_comment: 26905 stock to LA, 26905 LA to LD",26905.0,WILD,,,,,,
TRE-20160102-B-1-LA-20160102,TRE-20160102-B-1,LA,TRE,B,20160102,andrew.watkins#plantandfood.co.nz,WILD OLD,WILD,new_from_wild,,,"tank_move_comment: 26905 stock to LA, 26905 LA to LD",26905.0,WILD,,,,,,
etc.
I tried using get_indexer but couldn't get it to work with the 2 level index, and with only date as the index I get non-unique date warnings.
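One way to express this kind of "most recent previous entry" lookup is pandas.merge_asof with a by= key for the tank. Below is only a sketch: daily_df is an assumed name for the activity table shown above, and the column dtypes are assumed to be convertible as shown.

import pandas as pd

# look-up table as plain columns, sorted on the merge key as merge_asof requires
lookup = bcr_df.reset_index().sort_values('date')    # columns: date, tank, batchcode

daily_df['Date'] = pd.to_datetime(daily_df['Date'])
daily_df = daily_df.sort_values('Date')

daily_df = pd.merge_asof(
    daily_df, lookup,
    left_on='Date', right_on='date',
    left_by='Tank', right_by='tank',      # match only within the same tank
    direction='backward',                 # take the most recent previous batchcode entry
)

Rows whose tank has no earlier batchcode entry are left with NaN in batchcode.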

Pandas - Take value n month before

I am working with datetimes. Is there any way to get the value from n months before?
For example, the data look like:
dft = pd.DataFrame(
    np.random.randn(100, 1),
    columns=["A"],
    index=pd.date_range("20130101", periods=100, freq="M"),
)
dft
Then:
For every July of each year, we take the value of December of the previous year and apply it through to June of the next year.
For the other months (from August this year to June next year), we take the value of the previous month.
For example: the value from Jul-2000 to Jun-2001 will be the same and equal to the value of Dec-1999.
What I've been trying to do is:
dft['B'] = np.where(dft.index.month == 7,
                    dft['A'].shift(7, freq='M'),
                    dft['A'].shift(1, freq='M'))
However, the result is simply a copy of column A, and I don't know why. But when I tried a single line of code:
dft['C'] = dft['A'].shift(7, freq='M')
then everything is shifted as expected. I don't know what the issue is here.
The issue is index alignment. The shift you performed acts on the index, but using numpy.where you convert to arrays and lose the index.
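A quick way to see this (a sketch using the dft defined above): shift(..., freq='M') only relabels the index and leaves the values in place, and np.where operates positionally on plain arrays, so both branches hold A's values position for position:

arr = np.where(dft.index.month == 7,
               dft['A'].shift(7, freq='M'),
               dft['A'].shift(1, freq='M'))
print((arr == dft['A'].to_numpy()).all())   # True: positionally it is just A again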
Use pandas' where or mask instead; everything remains a Series and the index is preserved:
dft['B'] = (dft['A'].shift(1, freq='M')
            .mask(dft.index.month == 7, dft['A'].shift(7, freq='M'))
           )
Output:
A B
2013-01-31 -2.202668 NaN
2013-02-28 0.878792 -2.202668
2013-03-31 -0.982540 0.878792
2013-04-30 0.119029 -0.982540
2013-05-31 -0.119644 0.119029
2013-06-30 -1.038124 -0.119644
2013-07-31 0.177794 -1.038124
2013-08-31 0.206593 -2.202668 <- correct
2013-09-30 0.188426 0.206593
2013-10-31 0.764086 0.188426
... ... ...
2020-12-31 1.382249 -1.413214
2021-01-31 -0.303696 1.382249
2021-02-28 -1.622287 -0.303696
2021-03-31 -0.763898 -1.622287
2021-04-30 0.420844 -0.763898
[100 rows x 2 columns]

Pandas group by date and get count while removing duplicates

I have a data frame that looks like this:
maid date hour count
0 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 13 2
1 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14 15 1
2 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13 23 14
3 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-14 0 1
4 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11 14 2
5 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13 7 1
I am trying to get a count of maids for each date in such a way that if a maid is included on day 1, I don't want to include it on any of the subsequent days. For example, 0589b8a3-9d33-4db4-b94a-834cc8f46106 is present on both the 13th and the 14th. I want to include that maid in the count for the 13th but not the 14th, as it is already counted on the 13th.
I have written the following code and it works for small data frames:
import pandas as pd

df = pd.read_csv('/home/ubuntu/uniqueSiteId.csv')
umaids = []
tdf = []
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
df = df[['maid', 'date']]
df = df.drop_duplicates(['maid', 'date'])
dts = df['date'].unique()
for dt in dts:
    if not umaids:
        df1 = df[df['date'] == dt]
        k = df1['maid'].unique()
        umaids.extend(k)
        dff = df1
        fdf = df1.values.tolist()
    elif umaids:
        dfs = df[df['date'] == dt]
        df2 = dfs[~dfs['maid'].isin(umaids)]
        umaids.extend(df2['maid'].unique())
        sdf = df2.values.tolist()
        tdf.append(sdf)
ftdf = [item for t in tdf for item in t]
ndf = fdf + ftdf
ndf = pd.DataFrame(ndf, columns=['maid', 'date'])
print(ndf)
Since I have thousands of data frames and my data frames often have more than a million rows, the above takes a long time to run. Is there a better way to do this?
The expected output is this:
maid date
0 104010f8-5f57-4f7c-8ad9-5fc3ec0f9f39 2021-08-11
1 0589b8a3-9d33-4db4-b94a-834cc8f46106 2021-08-13
2 11947b4a-ccf8-48dc-a6a3-925836b3c520 2021-08-13
3 023f1f5f-37fb-4869-a957-b66b111d808e 2021-08-14
As per the discussion in the comments, the solution is quite simple: sort the dataframe by date and then drop duplicates only by maid. This keeps the first occurrence of each maid, which also happens to be the earliest occurrence in time, since we sorted by date. Then do the groupby as usual.
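A minimal sketch of that approach, reusing df as loaded above (the final groupby for a per-date count is an assumption about the desired output):

first_seen = (df.sort_values('date')
                .drop_duplicates('maid')    # keeps each maid's earliest date only
                [['maid', 'date']]
                .reset_index(drop=True))
print(first_seen)

per_date_count = first_seen.groupby('date')['maid'].count()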

Pandas series group by calculate percentage

I have a data frame, and I have grouped the status column by date using:
y = news_dataframe.groupby(by=[news_dataframe['date'].dt.date,news_dataframe['status']])['status'].count()
and my output is:
date status count
2019-05-29 selected 24
rejected auto 243
waiting 109
no action 1363
2019-05-30 selected 28
rejected auto 188
waiting 132
no action 1249
repeat 3
2019-05-31 selected 13
rejected auto 8
waiting 23
no action 137
repeat 2
source 1
Name: reasonForReject, dtype: int64
Now I want to calculate the percentage of each status group by date. How can I achieve this using pandas dataframe?
Compute two different groupbys and divide one by the other:
y_numerator = news_dataframe.groupby(by=[news_dataframe['date'].dt.date,news_dataframe['status']])['status'].count()
y_denominator = news_dataframe.groupby(by=news_dataframe['date'].dt.date)['status'].count()
y=y_numerator/y_denominator
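Equivalently, a single groupby plus a level-based transform avoids the second pass over the raw frame (a sketch, assuming the index levels of y are named date and status, as produced by the groupby above):

pct = y / y.groupby(level='date').transform('sum') * 100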
I guess that's the shortest:
news_dataframe['date'] = news_dataframe['date'].dt.date
news_dataframe.groupby(['date','status'])['status'].count()/news_dataframe.groupby(['date'])['status'].count()
try this:
# first fill the blank consecutive rows
df = df.ffill()
df.columns = ['date', 'status', 'count']
# getting the total value of count per date
df1 = df.groupby(['date']).sum().reset_index()
# renaming it to total as it is the sum
df1.columns = ['date', 'status', 'total']
# now join the tables to find the total and actual value together
df2 = df.merge(df1, on=['date'])
# calculate the percentage (bracket access, since .count is a DataFrame method)
df2['percentage'] = (df2['count'] / df2['total']) * 100
If you need a one-liner, it's:
df['percentage'] = (df.ffill()['count'] / df.ffill().groupby(['date']).sum().reset_index().rename(columns={'count': 'total'}).merge(df, on=['date'])['total']) * 100