Extract similar dates from multiple pandas data-frame - pandas

My 500 data frames look like this; each one holds daily data covering 2 years.
Date | Column A | Column B
2017-04-04
2017-04-05
2017-04-06
2017-04-07
....
2017-04-02
...
2019-02-01
2019-02-11
2019-02-22
2019-02-27
2019-03-01
2019-04-01
2019-05-01
All the data frames have a similar number of columns, but a different number of rows. All these DataFrames have a few timestamps in common. I want to extract the common timestamps from all my data frames.
The goal is to find the timestamps common to all 500 data frames and create a new set of 500 data frames containing only those timestamps.

If you can hold all 500 in memory at once, it's convenient to store them in a dictionary keyed by file name. You can then compute the intersection of all the dates and save the subsets:
import pandas as pd
from functools import reduce
# your_list_of_files is a placeholder for your list of CSV paths
d = {file: pd.read_csv(file) for file in your_list_of_files}
# intersect the sets of dates from every frame
date_com = reduce(lambda l, r: l & r, [set(df.Date) for df in d.values()])
for file, df in d.items():
    # keep only the rows whose date is common to all frames
    df[df.Date.isin(date_com)].to_csv(f'adjusted_{file}')
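To try the same idea without touching the disk, here is a minimal sketch that uses three small in-memory frames in place of the 500 CSV files. The file names and values are made up for illustration; only the Date column name comes from the question.

```python
import pandas as pd
from functools import reduce

# Hypothetical in-memory frames standing in for the CSV files.
frames = {
    "a.csv": pd.DataFrame({"Date": ["2017-04-04", "2017-04-05", "2017-04-06"],
                           "Column A": [1, 2, 3]}),
    "b.csv": pd.DataFrame({"Date": ["2017-04-05", "2017-04-06", "2017-04-07"],
                           "Column A": [4, 5, 6]}),
    "c.csv": pd.DataFrame({"Date": ["2017-04-06", "2017-04-05", "2017-04-02"],
                           "Column A": [7, 8, 9]}),
}

# Intersect the sets of dates across every frame.
common_dates = reduce(set.intersection,
                      (set(df["Date"]) for df in frames.values()))

# Keep only the rows whose date appears in every frame.
subsets = {name: df[df["Date"].isin(common_dates)]
           for name, df in frames.items()}
```

Each entry of subsets has the same set of dates, although the row order within each frame is preserved as it was.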

Related

How to compare elements of a row string in one dataframe column with elements of a row string of another dataframe column, and remove non-matching?

I have two dataframes with different row counts.
df1 has the problems and count
problems | count
broken, torn | 10
torn | 15
worn-out, broken | 25
df2 has the order_id and problems
order_id | problems
123 | broken
594 | torn
811 | worn-out, broken
I need to remove all rows from df1 that do not match the individual problems in the list in df2. And I want to maintain the count of df1.
The final df1 data frame would look like this:
problems | count
broken | 10
torn | 15
worn-out, broken | 25
I have only done this for columns in the same dataframe before. Not sure how to deal with multiple data frames.
Can someone please help?
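Try this to merge the two dataframes together. Both problems columns are split into frozensets so that the order of the items within a row does not matter:
(pd.merge(df1.assign(problems=df1['problems'].str.split(', ').map(frozenset)),
          df2.assign(problems=df2['problems'].str.split(', ').map(frozenset)), on='problems'))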
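As an alternative sketch of the same order-independent matching, each problems string can be normalised to a sorted, comma-joined key and df1 filtered with isin. The helper name canonical is hypothetical, and this assumes a df1 row should be kept only when its whole combination of problems appears in df2, which is one reading of the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "problems": ["broken, torn", "torn", "worn-out, broken"],
    "count": [10, 15, 25],
})
df2 = pd.DataFrame({
    "order_id": [123, 594, 811],
    "problems": ["broken", "torn", "worn-out, broken"],
})

def canonical(problems: str) -> str:
    """Hypothetical helper: build an order-independent key for a problems string."""
    return ", ".join(sorted(p.strip() for p in problems.split(",")))

# Keys for every combination that occurs in df2.
wanted = set(df2["problems"].map(canonical))

# Keep the df1 rows whose combination also occurs in df2; count is untouched.
filtered = df1[df1["problems"].map(canonical).isin(wanted)]
```

With the sample data this keeps the "torn" and "worn-out, broken" rows and drops "broken, torn", because no order has exactly that combination.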

Datetime reformat weekly column

I split a dataframe of minute data into daily, weekly and monthly frames. I had no problem reformatting the daily dataframe, but I am having a good deal of trouble doing the same with the weekly one. It would be great if someone could help me out with the weekly dataframe. I am also adding the code I used to reformat the daily dataframe, in case it helps.
I am plotting it with Bokeh, and without the datetime format I won't be able to format the axis and hover tools as I would like.
Thanks in advance.
dfDay1 = dfDay.loc['2014-01-01':'2020-09-31']
dfDay1 = dfDay1.reset_index()
dfDay1['date1'] = pd.to_datetime(dfDay1['date'], format=('%Y/%m/%d'))
dfDay1 = dfDay1.set_index('date')
That worked fine for the day format.
To convert the date before the /, use Series.str.split and take str[0]; to convert the date after the /, take str[1]:
df['date1'] = pd.to_datetime(df['week'].str.split('/').str[0])
print (df)
week Open Low High Close Volume \
0 2014-01-07/2014-01-13 58.1500 55.38 58.96 56.0000 324133239
1 2014-01-14/2014-01-20 56.3500 55.96 58.57 56.2500 141255151
2 2014-01-21/2014-01-27 57.8786 51.85 59.31 52.8600 279370121
3 2014-01-28/2014-02-03 53.7700 52.75 63.95 62.4900 447186604
4 2014-02-04/2014-02-10 62.8900 60.45 64.90 63.9100 238316161
.. ... ... ... ... ... ...
347 2020-09-01/2020-09-07 297.4000 271.14 303.90 281.5962 98978386
348 2020-09-08/2020-09-14 275.0000 262.64 281.40 271.0100 109717114
349 2020-09-15/2020-09-21 272.6300 244.13 274.52 248.5800 123816172
350 2020-09-22/2020-09-28 254.3900 245.40 259.98 255.8800 98550687
351 2020-09-29/2020-10-05 258.2530 256.50 268.33 261.3500 81921670
date1
0 2014-01-07
1 2014-01-14
2 2014-01-21
3 2014-01-28
4 2014-02-04
.. ...
347 2020-09-01
348 2020-09-08
349 2020-09-15
350 2020-09-22
351 2020-09-29
[352 rows x 7 columns]
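If both ends of the week are needed, the split can be done once with expand=True. This is a small sketch on two rows taken from the output above; the column names week_start and week_end are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "week": ["2014-01-07/2014-01-13", "2014-01-14/2014-01-20"],
    "Close": [56.00, 56.25],
})

# Split "start/end" into two columns in one pass.
bounds = df["week"].str.split("/", expand=True)
df["week_start"] = pd.to_datetime(bounds[0])
df["week_end"] = pd.to_datetime(bounds[1])

# A datetime index is what plotting libraries usually expect for a time axis.
df = df.set_index("week_start")
```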

Time column interval filter

I have a dataframe with a "Fecha" column. I would like to reduce the dataframe's size by filtering it, keeping only the rows whose time falls on a multiple of 10 minutes and discarding all the others.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column is in DateTime-Format. The second line filters all rows which are a multiple of 10 minutes. To do this you use the modulus operator %.
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1min')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
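The same filter applied to a DataFrame column rather than an index goes through the .dt accessor. A minimal sketch, assuming "Fecha" arrives as strings; the valor column is made up for illustration. The extra check on seconds is optional and only matters if the data has sub-minute timestamps.

```python
import pandas as pd

# Hypothetical frame with one row per minute and "Fecha" stored as text.
fechas = pd.date_range("2020-01-01", periods=30, freq="min")
df = pd.DataFrame({
    "Fecha": fechas.strftime("%Y-%m-%d %H:%M:%S"),
    "valor": range(30),
})

df["Fecha"] = pd.to_datetime(df["Fecha"])

# Keep rows that fall exactly on a 10-minute mark.
on_grid = (df["Fecha"].dt.minute % 10 == 0) & (df["Fecha"].dt.second == 0)
filtered = df[on_grid]
```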

Can I join two dataframes while only retaining rows based on datetimes featured in the second dataframe?

Dataframe A ('df_a') contains location-split temperature values at re-sampled 5-minute intervals:
logtime_round | location | value
2017-05-01 06:05:00 | 0 | 17
2017-05-01 06:05:00 | 1 | 14.5
2017-05-01 06:05:00 | 2 | 14.5
etc...
Dataframe B ('df_b') contains temperature values (re-sampled from hourly to daily):
logtime_round | airtemp
2017-05-01 | 10.33333
2017-05-02 | 10.42083
etc...
I have filtered df_b so that only days with airtemp <= 15.5 are included (logtime_round is datetime64[ns]), and would now like to build a new dataframe from df_a that contains only the days present in df_b (I'm only interested in locations and values when the outdoor air temperature was <= 15.5).
Is this possible?
My first plan was to join the two dataframes and then look to remove any NaN airtemp values to get my desired df, however, the df_b airtemp is only featured for the first row (e.g. for 2017-05-01) with the rest as NaNs. So perhaps the df_b daily airtemp can be duplicated across all rows in the same day?
joindf = df_a.join(df_b)
Thanks!
Use merge_asof (assuming both frames have been sorted by time):
pd.merge_asof(df_a, df_b, on='logtime_round')
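Note that merge_asof matches each df_a row to the most recent df_b row at or before it, so a day that was filtered out of df_b would pick up the previous day's airtemp unless a tolerance is set. An alternative sketch is to truncate each df_a timestamp to its day and do an inner merge on that. The sample rows and the helper column name day below are made up for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({
    "logtime_round": pd.to_datetime([
        "2017-05-01 06:05:00", "2017-05-01 06:05:00",
        "2017-05-02 06:05:00", "2017-05-03 06:05:00",
    ]),
    "location": [0, 1, 0, 0],
    "value": [17.0, 14.5, 14.5, 16.0],
})

# df_b after filtering: assume 2017-05-02 was dropped (hypothetical data).
df_b = pd.DataFrame({
    "logtime_round": pd.to_datetime(["2017-05-01", "2017-05-03"]),
    "airtemp": [10.33333, 10.42083],
})

# Truncate each 5-minute timestamp to midnight of its day.
df_a["day"] = df_a["logtime_round"].dt.normalize()

# The inner merge keeps only df_a rows whose day exists in df_b,
# and repeats the daily airtemp on every row of that day.
joined = df_a.merge(
    df_b.rename(columns={"logtime_round": "day"}),
    on="day",
    how="inner",
)
```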

Pandas add column based on grouped by rolling average

I have successfully added a new summed Volume column using Transform when grouping by Date like so:
df
Name Date Volume
--------------------------
APL 12-01-2017 1102
BSC 12-01-2017 4500
CDF 12-02-2017 5455
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
Name Date Volume vol_all_daily
------------------------------------------
APL 12-01-2017 1102 5602
BSC 12-01-2017 4500 5602
CDF 12-02-2017 5455 5455
However when I want to take the rolling average it doesn't work!
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
This returns a groupby result that gives an error and is too hard to put back into a df column anyway.
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).transform('mean').rolling(window=2).mean()
This just produces a result nearly identical to the vol_all_daily column.
Update:
I wasn't taking just one value per date. The code above still keeps multiple rows per date, so instead I added .first() to the groupby. I'm not sure why groupby isn't taking one row per date.
The behavior of what you have written seems correct (Part 1 below), but perhaps you want to be calling something different (Part 2 below).
Part 1: Why what you have written is behaving correctly:
import pandas as pd
d = {'Name':['APL', 'BSC', 'CDF'],'Date':pd.DatetimeIndex(['2017-12-01', '2017-12-01', '2017-12-02']),'Volume':[1102,4500,5455]}
df = pd.DataFrame(d)
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
print(df)
rolling_vol = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
print('')
print(rolling_vol)
I get as output:
Date Name Volume vol_all_daily
0 2017-12-01 APL 1102 5602
1 2017-12-01 BSC 4500 5602
2 2017-12-02 CDF 5455 5455
Date
2017-12-01 0 NaN
1 5602.0
2017-12-02 2 NaN
Name: vol_all_daily, dtype: float64
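To understand why this result rolling_vol is correct, notice that you called groupby first and rolling only afterwards, so the rolling window restarts inside each date group. The result is indexed by both Date and the original row label, and the first row of every group is NaN, so it is not expected to line up with df.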
Part 2: What I think you wanted to call (just a rolling average):
If you instead run:
# same as above but without groupby
rolling_vol2 = df['vol_all_daily'].rolling(window=2).mean()
print('')
print(rolling_vol2)
You should get:
0 NaN
1 5602.0
2 5528.5
Name: vol_all_daily, dtype: float64
which looks more like the rolling average you seem to want. To explain that, I suggest reading the details of pandas resampling vs rolling.
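If the aim is a moving average over days rather than over rows, one possible sketch follows the asker's update: reduce to one value per date, roll over that daily series, then map the result back onto the original rows. The names daily and daily_ma are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["APL", "BSC", "CDF"],
    "Date": pd.to_datetime(["2017-12-01", "2017-12-01", "2017-12-02"]),
    "Volume": [1102, 4500, 5455],
})

# One total per date, so each day counts once in the window.
daily = df.groupby("Date")["Volume"].sum()

# Two-day moving average over the daily totals.
daily_ma = daily.rolling(window=2).mean()

# Look up each row's date in the daily series to fill the new column.
df["vol_all_ma_2"] = df["Date"].map(daily_ma)
```

Here both rows for 2017-12-01 get NaN, since there is no earlier day to average with, and the 2017-12-02 row gets the mean of the two daily totals.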