Plot multiple lines from one DataFrame - pandas

I have the following DataFrame in Python Pandas:
df.head(3)
+===+===========+======+=======+
| | year-month| cat | count |
+===+===========+======+=======+
| 0 | 2016-01 | 1 | 14 |
+---+-----------+------+-------+
| 1 | 2016-02 | 1 | 22 |
+---+-----------+------+-------+
| 2 | 2016-01 | 2 | 10 |
+---+-----------+------+-------+
year-month is a combination of year and month, dating back about 8 years.
cat is an integer from 1 to 10.
count is an integer.
I now want to plot count vs. year-month with matplotlib, one line for each cat. How can this be done?

Easiest is seaborn:
import seaborn as sns
sns.lineplot(x='year-month', y='count', hue='cat', data=df)
Note: it might also help if you convert year-month to datetime type before plotting, e.g.
df['year-month'] = pd.to_datetime(df['year-month'], format='%Y-%m').dt.to_period('M')

Related

Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (ie subset of the dataframe)
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
Headers
|---------------|
| [South] |
| |
| [South, East] |
| |
The following one liner manages to return column headers for the entire dataframe.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
df=pd.DataFrame({'North':[8,4,1,7],'South':[1,4,1,3],'East':[8,8,1,7],\
'West':[6,4,2,8]})
df1=df.melt(ignore_index=False)
condition1=df1['variable']=='South'
condition2=df1['variable']=='East'
condition3=df1['value']==1
df1=df1.loc[(condition1|condition2)&condition3]
df1=df1.groupby(df1.index)['variable'].apply(list)
df=df.join(df1)

Missing 1 digit in time column read excel in pandas

I have excel that have format like this
| No | Date | Time | Name | ID | Serial | Total |
| 1 |2021-03-01| 11.45 | AB | 124535 | 5215635 | 50 |
Im trying to convert excel to pandas dataframe using below code
pd.read_excel(r'path', header=0)
pandas read the excel successfully however, I found strange result when I see the column time.
the dataframe have below result
| No | Date | Time | Name | ID | Serial | Total |
| 1.0 |2021-03-01| 11.4 | AB | 124535 | 5215635.0 | 50.0 |
Column Time is missing 1 digit. is my method to read excel is not correct?
read_excel is interpreting your dot-separated time as a float, which is quite expected.
I suggest telling read_excel to see this column as a string and convert it to datetime afterwards:
df = pd.read_excel(r'path', header=0, converters={'Time': str})
df['Time'] = pd.to_datetime(df.Time, format="%H.%M")

Pandas merge two time series dataframes based on time window (cut/bin/merge)

Having a 750k rows df with 15 columns and a pd.Timestamp as index called ts.
I process realtime data down to milliseconds in near-realtime.
Now I would like to apply some statistical data derived from a higher time resolution in df_stats as new columns to the big df. The df_stats has a time resolution of 1 minute.
$ df
+----------------+---+---------+
| ts | A | new_col |
+----------------+---+---------+
| 11:33:11.31234 | 1 | 81 |
+----------------+---+---------+
| 11:33:11.64257 | 2 | 81 |
+----------------+---+---------+
| 11:34:10.12345 | 3 | 60 |
+----------------+---+---------+
$ df_stats
+----------------+----------------+
| ts | new_col_source |
+----------------+----------------+
| 11:33:00.00000 | 81 |
+----------------+----------------+
| 11:34:00.00000 | 60 |
+----------------+----------------+
Currently I have the code below, but it is inefficient, because it nees to iterate over the complete data.
I am wondering if there couldnt be an easier solution using pd.cut, bin or pd.Grouper? Or something else to merge the time-buckets on the two indexes?
df_stats['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df = df.merge(df_stats, on='ts_timeonly', how='left', sort=True, suffixes=['', '_hist']).set_index('ts')
Let us try something new reindex
df_stats=df_stats.set_index('ts').reindex(df['ts'], method='nearest')
df_stats.index=df.index
df=pd.concat([df,df_stats],axis=1)
Or
df=pd.merge_asof(df, df_stats, on='ts',direction='nearest')

pandas cumcount in pyspark

Currently attempting to convert a script I made from pandas to pyspark, I have a dataframe that contains data in the form of:
index | letter
------|-------
0 | a
1 | a
2 | b
3 | c
4 | a
5 | a
6 | b
I want to create the following dataframe in which the occurrence count for each instance of a letter is stored, for example the first time we see "a" its occurrence count is 0, second time 1, third time 2:
index | letter | occurrence
------|--------|-----------
0 | a | 0
1 | a | 1
2 | b | 0
3 | c | 0
4 | a | 2
5 | a | 3
6 | b | 1
I can achieve this in pandas using:
df['occurrence'] = df.groupby('letter').cumcount()
How would I go about doing this in pyspark? Cannot find an existing method that is similar.
The feature you're looking for is called window functions
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
df.withColumn("occurence", row_number().over(Window.partitionBy("letter").orderBy("index")))

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
|------|------------|-------|
| group| event date | count |
| x123 | 2016-01-06 | 1 |
| | 2016-01-08 | 10 |
| | 2016-02-15 | 9 |
| | 2016-05-22 | 6 |
| | 2016-05-29 | 2 |
| | 2016-05-31 | 6 |
| | 2016-12-29 | 1 |
| x124 | 2016-01-01 | 1 |
...
and also know the t0 which is the beginning of time (let's say for x123 it's 2016-01-01) and tN which is the end of experiment from another dataframe df_s (2017-05-25), then how can I create the dataframe df_new which should like this
|------|------------|---------------|--------|
| group| obs. weekly| lifetime, week| status |
| x123 | 2016-01-01 | 1 | 1 |
| | 2016-01-08 | 0 | 0 |
| | 2016-01-15 | 0 | 0 |
| | 2016-01-22 | 1 | 1 |
| | 2016-01-29 | 2 | 1 |
...
| | 2017-05-18 | 1 | 1 |
| | 2017-05-25 | 1 | 1 |
...
| x124 | 2017-05-18 | 1 | 1 |
| x124 | 2017-05-25 | 1 | 1 |
Explanation: take t0 and generate rows until tN per week period. For each row R, search with that group if the event date falls within R, if True, then count how long in weeks it lives there, also set status = 1 as alive, otherwise set lifetime, status columns for this R as 0, e.g. dead.
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd
# 1. generate dataframes per group to get the boundary within `t0` and `tN` from df_s dataframe, where each dataframe has "group, obs, lifetime, status" columns X (tN - t0 / week) rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])
def do_that(R):
found_event_row = df_e.iloc[[R.group]]
# check if found_event_row['date'] falls into R['obs'] week
# if True, then found how long it's there
df_new = df_all.apply(do_that)
I'm not really sure if I get you but group one is not related to group two, right? if that's the case I think what you want is something like this:
import pandas as pd
df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index) #convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lamda x: yourfuncion(x))
df_group1 = df_group1.reset_index()
df_group1['status']= df_group1.apply(lambda x: 1 if x['lifetime, week']>0 else 0)
#do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.