Pandas merge two time series dataframes based on time window (cut/bin/merge) - pandas

I have a df with 750k rows and 15 columns, with a pd.Timestamp index called ts.
I process real-time data down to milliseconds in near real time.
Now I would like to apply some statistical data, derived at a coarser time resolution and stored in df_stats, as new columns to the big df. df_stats has a time resolution of 1 minute.
$ df
+----------------+---+---------+
| ts             | A | new_col |
+----------------+---+---------+
| 11:33:11.31234 | 1 | 81      |
| 11:33:11.64257 | 2 | 81      |
| 11:34:10.12345 | 3 | 60      |
+----------------+---+---------+
$ df_stats
+----------------+----------------+
| ts             | new_col_source |
+----------------+----------------+
| 11:33:00.00000 | 81             |
| 11:34:00.00000 | 60             |
+----------------+----------------+
Currently I have the code below, but it is inefficient because it needs to iterate over the complete data.
I am wondering if there couldn't be an easier solution using pd.cut, bin or pd.Grouper? Or something else to merge the time buckets on the two indexes?
# floor both indexes to whole minutes and join on that key
df_stats['ts_timeonly'] = df_stats.index.map(lambda x: x.replace(second=0, microsecond=0))
df['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df = df.merge(df_stats, on='ts_timeonly', how='left', sort=True, suffixes=['', '_hist']).set_index('ts')

Let us try something new: reindex
df_stats = df_stats.set_index('ts').reindex(df['ts'], method='nearest')
df_stats.index = df.index
df = pd.concat([df, df_stats], axis=1)
Or
df = pd.merge_asof(df, df_stats, on='ts', direction='nearest')
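For completeness, here is a minimal vectorized sketch that avoids iteration entirely (the timestamps and values are made up to mirror the question's layout): floor the millisecond index to the minute and map it against df_stats' minute-resolution index.
import pandas as pd

# hypothetical data shaped like the question's df and df_stats
df = pd.DataFrame(
    {'A': [1, 2, 3]},
    index=pd.DatetimeIndex(['2021-01-01 11:33:11.31234',
                            '2021-01-01 11:33:11.64257',
                            '2021-01-01 11:34:10.12345'], name='ts'))
df_stats = pd.DataFrame(
    {'new_col_source': [81, 60]},
    index=pd.DatetimeIndex(['2021-01-01 11:33:00',
                            '2021-01-01 11:34:00'], name='ts'))

# floor each millisecond timestamp to the minute and look it up in df_stats
df['new_col'] = df.index.floor('min').map(df_stats['new_col_source'])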

Related

Join/Add data to MultiIndex dataframe in pandas

I have some measurement data from different dust analyses:
two locations, MC174 and MC042,
two fractions, PM2.5 and PM10,
several analytical results [Cl, Na, K, ...].
I created a multicolumn dataframe like this:
|           MC174           |           MC042           |
|    PM2.5    |    PM10     |    PM2.5    |    PM10     |
| Cl | Na | K | Cl | Na | K | Cl | Na | K | Cl | Na | K |
location = ['MC174', 'MC042']
fraction = ['PM10', 'PM2.5']
value = ['date', 'Cl', 'NO3', 'SO4', 'Na', 'NH4', 'K', 'Mg', 'Ca', 'masse', 'OC_R', 'E_CR', 'OC_T', 'EC_T']
midx = pd.MultiIndex.from_product([location, fraction, value], names=['location', 'fraction', 'value'])
df = pd.DataFrame(columns=midx)
df
and I prepared four DataFrames with matching columns for those four location/fraction combinations.
| date       | Cl  | Na  | K   |
|------------|-----|-----|-----|
| 01-01-2021 | 3.1 | 4.3 | 1.0 |
| ...        | ... | ... | ... |
| 31-12-2021 | 4.9 | 3.8 | 0.8 |
Now I want to fill the large dataframe with the data from the four locations/fractions:
DF1 -> MainDF[MC174][PM10]
DF2 -> MainDF[MC174][PM2.5]
and so on...
My goal is to have one dataframe with the dates of the year in its index, the multilevel column structure I described at the top, and all the data inside it.
I tried:
main_df['MC174']['PM10'].append(data_MC174_PM10)
pd.concat([main_df['MC174']['PM10'], data_MC174_PM10],axis=0)
main_df.loc[:,['MC174'],['PM10']] = data_MC174_PM10
but the dataframe is never filled.
Thanks in advance!
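No answer is shown here, but one common approach (a sketch with hypothetical sample frames, not necessarily the final solution) is to skip the pre-built empty frame and let pd.concat build the column levels from a dict of the four frames:
import numpy as np
import pandas as pd

dates = pd.date_range('2021-01-01', periods=3, freq='D')

def make_frame():
    # stand-in for one measured location/fraction table (date + analytes)
    return pd.DataFrame({'date': dates,
                         'Cl': np.random.rand(3),
                         'Na': np.random.rand(3),
                         'K': np.random.rand(3)})

frames = {('MC174', 'PM10'): make_frame(),
          ('MC174', 'PM2.5'): make_frame(),
          ('MC042', 'PM10'): make_frame(),
          ('MC042', 'PM2.5'): make_frame()}

# the dict keys become the two outer column levels; the original columns form the third
main_df = pd.concat(frames, axis=1, names=['location', 'fraction', 'value'])
print(main_df['MC174']['PM10'].head())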

Missing 1 digit in time column when reading Excel in pandas

I have an Excel file with a format like this:
| No | Date       | Time  | Name | ID     | Serial  | Total |
| 1  | 2021-03-01 | 11.45 | AB   | 124535 | 5215635 | 50    |
I'm trying to convert the Excel file to a pandas DataFrame using the code below:
pd.read_excel(r'path', header=0)
pandas reads the Excel file successfully; however, I noticed a strange result in the Time column.
The DataFrame looks like this:
| No  | Date       | Time | Name | ID     | Serial    | Total |
| 1.0 | 2021-03-01 | 11.4 | AB   | 124535 | 5215635.0 | 50.0  |
The Time column is missing one digit. Is my method of reading the Excel file incorrect?
read_excel is interpreting your dot-separated time as a float, which is quite expected.
I suggest telling read_excel to see this column as a string and convert it to datetime afterwards:
df = pd.read_excel(r'path', header=0, converters={'Time': str})
df['Time'] = pd.to_datetime(df.Time, format="%H.%M")
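The parsed values carry a default date (1900-01-01); if only the clock time matters (an assumption, not part of the original answer), you can keep just the time of day afterwards:
df['Time'] = df['Time'].dt.time  # keep only the time-of-day component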

Plot multiple lines from one DataFrame

I have the following DataFrame in Python Pandas:
df.head(3)
+===+============+=====+=======+
|   | year-month | cat | count |
+===+============+=====+=======+
| 0 | 2016-01    | 1   | 14    |
+---+------------+-----+-------+
| 1 | 2016-02    | 1   | 22    |
+---+------------+-----+-------+
| 2 | 2016-01    | 2   | 10    |
+---+------------+-----+-------+
year-month is a combination of year and month, dating back about 8 years.
cat is an integer from 1 to 10.
count is an integer.
I now want to plot count vs. year-month with matplotlib, one line for each cat. How can this be done?
Easiest is seaborn:
import seaborn as sns
sns.lineplot(x='year-month', y='count', hue='cat', data=df)
Note: it might also help if you convert year-month to datetime type before plotting, e.g.
df['year-month'] = pd.to_datetime(df['year-month'], format='%Y-%m').dt.to_period('M')
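If you prefer to stay with plain matplotlib, here is a sketch (assuming the column names shown above) that draws one line per cat by grouping:
import matplotlib.pyplot as plt
import pandas as pd

# parse the year-month strings so the x-axis is ordered chronologically
df['year-month'] = pd.to_datetime(df['year-month'], format='%Y-%m')

fig, ax = plt.subplots()
for cat, grp in df.sort_values('year-month').groupby('cat'):
    ax.plot(grp['year-month'], grp['count'], label=f'cat {cat}')
ax.set_xlabel('year-month')
ax.set_ylabel('count')
ax.legend()
plt.show()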

Using PySpark window functions with conditions to add rows

I need to be able to add new rows to a PySpark df with values based upon the contents of other rows that share a common id. There will eventually be millions of ids with many rows for each id. I have tried the method below, which works but seems overly complicated.
I start with a df in the format below (but in reality have more columns):
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
+-------+----------+-------+
Currently I am pivoting this df to get it in the following format:
+-----+------+------+------+
| id | varA | varB | varC |
+-----+------+------+------+
| 1 | 30 | 1 | -9 |
+-----+------+------+------+
On this df I can then use the standard withColumn and when functionality to add new columns based on the values in other columns. For example:
df = df.withColumn("varD", when((col("varA") > 16) & (col("varC") != -9), 2).otherwise(1))
Which leads to:
+-----+------+------+------+------+
| id | varA | varB | varC | varD |
+-----+------+------+------+------+
| 1 | 30 | 1 | -9 | 1 |
+-----+------+------+------+------+
I can then pivot this df back to the original format leading to this:
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
| 1 | varD | 1 |
+-------+----------+-------+
This works but seems like it could, with millions of rows, lead to expensive and unnecessary operations. It feels like it should be doable without the need to pivot and unpivot the data. Do I need to do this?
I have read about Window functions and it sounds as if they may be another way to achieve the same result but to be honest I am struggling to get started with them. I can see how they can be used to generate a value, say a sum, for each id, or to find a maximum value but have not found a way to even get started on applying complex conditions that lead to a new row.
Any help to get started with this problem would be gratefully received.
You can use a pandas_udf for adding/deleting rows/columns on grouped data, and implement your processing logic inside the pandas UDF.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

row_schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("variable", StringType(), True),
     StructField("value", IntegerType(), True)]
)

@F.pandas_udf(row_schema, F.PandasUDFType.GROUPED_MAP)
def addRow(pdf):
    # varD is 2 when the group has varA > 16 and varC != -9, otherwise 1
    val = 1
    if (len(pdf.loc[(pdf['variable'] == 'varA') & (pdf['value'] > 16)]) > 0) & \
       (len(pdf.loc[(pdf['variable'] == 'varC') & (pdf['value'] != -9)]) > 0):
        val = 2
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there instead
    return pdf.append(pd.Series([1, 'varD', val],
                                index=['id', 'variable', 'value']),
                      ignore_index=True)

df = spark.createDataFrame([[1, 'varA', 30],
                            [1, 'varB', 1],
                            [1, 'varC', -9]],
                           schema=['id', 'variable', 'value'])

df.groupBy("id").apply(addRow).show()
which results in
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
| 1| varA| 30|
| 1| varB| 1|
| 1| varC| -9|
| 1| varD| 1|
+---+--------+-----+
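Since the question specifically asks about window functions, here is a hedged sketch of an alternative (not the posted answer): compute the per-id condition with conditional aggregates over a window partitioned by id, build the varD rows, and union them back onto the original frame.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id")

# per-id flags: does the id have a varA > 16 row, and a varC != -9 row?
flags = (df
         .withColumn("a_ok", F.max(F.when((F.col("variable") == "varA") &
                                          (F.col("value") > 16), 1).otherwise(0)).over(w))
         .withColumn("c_ok", F.max(F.when((F.col("variable") == "varC") &
                                          (F.col("value") != -9), 1).otherwise(0)).over(w)))

# one varD row per id, valued 2 if both flags hold, else 1
new_rows = (flags.select("id", "a_ok", "c_ok").distinct()
                 .select("id",
                         F.lit("varD").alias("variable"),
                         F.when((F.col("a_ok") == 1) & (F.col("c_ok") == 1), 2)
                          .otherwise(1).alias("value")))

df.unionByName(new_rows).orderBy("id", "variable").show()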

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
| group | event date | count |
|-------|------------|-------|
| x123  | 2016-01-06 | 1     |
|       | 2016-01-08 | 10    |
|       | 2016-02-15 | 9     |
|       | 2016-05-22 | 6     |
|       | 2016-05-29 | 2     |
|       | 2016-05-31 | 6     |
|       | 2016-12-29 | 1     |
| x124  | 2016-01-01 | 1     |
...
and I also know t0, the beginning of time (let's say for x123 it's 2016-01-01), and tN, the end of the experiment (2017-05-25), from another dataframe df_s, then how can I create the dataframe df_new, which should look like this:
| group | obs. weekly | lifetime, week | status |
|-------|-------------|----------------|--------|
| x123  | 2016-01-01  | 1              | 1      |
|       | 2016-01-08  | 0              | 0      |
|       | 2016-01-15  | 0              | 0      |
|       | 2016-01-22  | 1              | 1      |
|       | 2016-01-29  | 2              | 1      |
...
|       | 2017-05-18  | 1              | 1      |
|       | 2017-05-25  | 1              | 1      |
...
| x124  | 2017-05-18  | 1              | 1      |
| x124  | 2017-05-25  | 1              | 1      |
Explanation: take t0 and generate rows until tN, one per week. For each row R, search within that group whether an event date falls into R's week; if it does, count how long (in weeks) it lives there and set status = 1, i.e. alive; otherwise set the lifetime and status columns for this R to 0, i.e. dead.
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd

# 1. Generate a dataframe per group, bounded by `t0` and `tN` from the df_s
#    dataframe, where each dataframe has "group, obs, lifetime, status" columns
#    and (tN - t0) / week rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])

def do_that(R):
    found_event_row = df_e.iloc[[R.group]]
    # check if found_event_row['date'] falls into R['obs'] week
    # if True, then find how long it's there

df_new = df_all.apply(do_that)
I'm not really sure if I get you, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd

df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index)  # convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: yourfunction(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
# do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.
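The 'lifetime, week' rule is not fully specified, but here is a minimal sketch of the scaffolding for question 1 (group x123 only, hypothetical bounds, and assuming df_e has the group filled on every row): generate one row per week between t0 and tN, bucket each event date into the week it falls in, and mark those weeks as alive.
import pandas as pd

t0, tN = pd.Timestamp('2016-01-01'), pd.Timestamp('2017-05-25')  # bounds from df_s
weeks = pd.date_range(t0, tN, freq='7D')                         # one row per week

df_new = pd.DataFrame({'group': 'x123',
                       'obs. weekly': weeks,
                       'lifetime, week': 0,   # fill in according to your definition
                       'status': 0})

# bucket each event date into the week [t0 + k*7d, t0 + (k+1)*7d) it falls in
events = pd.to_datetime(df_e.loc[df_e['group'] == 'x123', 'event date'])
bucket = t0 + ((events - t0) // pd.Timedelta('7D')) * pd.Timedelta('7D')

# a week counts as alive when at least one event landed in it
df_new['status'] = df_new['obs. weekly'].isin(bucket).astype(int)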