Parsing a transposed CSV with datetime values on multiple rows in pandas

I am attempting to import a csv file which includes a number of time-series.
The challenges I am facing are:
a) the csv file is transposed, so the dates cannot be parsed from a column. Simply transposing with a read_csv().T call would normally work, but it does not handle the datetime information properly here.
b) since the datetime values sit on header rows, repeated labels get a numeric suffix appended (i.e. Jan becomes Jan, Jan.1, Jan.2, etc.), which makes recovering the datetime values difficult.
c) the first column headers (which contain no datetime information) sit on the last of the datetime rows (the third row), which further complicates parsing the headers.
Is there an easy way to go from the csv to a 'standard' dataframe structure, with a datetime index parsed from the csv and values in columns?
An example of the csv data structure is here provided:
empty | empty | Jan | Jan | Jan | ... | Dec |
empty | empty | 1 | 1 | 1 | ... | 31 |
head1 | head2 | 00:00 | 01:00 | 02:00 | ... | 23:00 |
---
value1 | value2 | 0.35 | 0.38 | 0.44 | ... | 0.20 |
...

Try:
import pandas as pd

# read the csv with no header and transpose it, so the three header rows become columns
df = pd.read_csv('untitled.txt', header=None).T
# create an index by joining the three header columns into one label
df['idx'] = [' '.join((a, b, c)) for a, b, c in
             zip(df[0].fillna(''),
                 df[1].fillna(''),
                 df[2].fillna(''))]
# drop the now-unnecessary columns
df.drop([0, 1, 2], axis=1, inplace=True)
# output
df.set_index('idx').T.reset_index(drop=True)
Output:
+----+-----------+-----------+---------------+---------------+---------------+----------------+
| | head1 | head2 | Jan 1 00:00 | Jan 1 01:00 | Jan 1 02:00 | Dec 31 23:00 |
|----+-----------+-----------+---------------+---------------+---------------+----------------|
| 0 | value1 | value2 | 0.35 | 0.38 | 0.44 | 0.2 |
+----+-----------+-----------+---------------+---------------+---------------+----------------+
As shown above, the column labels are still plain text (str). Convert them back into timestamps if you need a proper DatetimeIndex.
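For example, a minimal sketch of that conversion (the year 2018 is an arbitrary assumption, since the CSV itself carries no year, and the [2:] slice assumes the first two columns are head1 and head2):
out = df.set_index('idx').T.reset_index(drop=True)
time_cols = out.columns[2:]                                  # skip the head1/head2 labels
stamps = pd.to_datetime(time_cols, format='%b %d %H:%M')     # year defaults to 1900
out.columns = list(out.columns[:2]) + [t.replace(year=2018) for t in stamps]  # assumed year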

Related

Missing 1 digit in time column read excel in pandas

I have an Excel file with the following format:
| No | Date | Time | Name | ID | Serial | Total |
| 1 |2021-03-01| 11.45 | AB | 124535 | 5215635 | 50 |
I'm trying to read the Excel file into a pandas DataFrame using the code below:
pd.read_excel(r'path', header=0)
pandas reads the Excel file successfully; however, I noticed a strange result in the Time column.
The resulting DataFrame looks like this:
| No | Date | Time | Name | ID | Serial | Total |
| 1.0 |2021-03-01| 11.4 | AB | 124535 | 5215635.0 | 50.0 |
The Time column is missing one digit. Is my way of reading the Excel file incorrect?
read_excel is interpreting your dot-separated time as a float, which is quite expected.
I suggest telling read_excel to see this column as a string and convert it to datetime afterwards:
df = pd.read_excel(r'path', header=0, converters={'Time': str})
df['Time'] = pd.to_datetime(df.Time, format="%H.%M")
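As a small follow-up sketch (not part of the original answer): because the format string contains no date, the parsed values all carry the default date 1900-01-01, so if you only want the time of day you can keep the .dt.time component:
df['Time'] = pd.to_datetime(df['Time'], format="%H.%M").dt.time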

Best way to partition by timestamp in a parquet dataframe

I have a dataframe containing minute level values that looks like below:
+---------------------+-------+
| Timestamp | Value |
+---------------------+-------+
| 2018-01-01 00:00:00 | 5 |
| 2018-01-01 00:01:00 | 7 |
| 2018-01-01 00:02:00 | 9 |
| 2018-01-01 00:03:00 | 0 |
| 2018-01-01 00:04:00 | 5 |
| 2018-01-01 00:05:00 | 8 |
| ... | ... |
| ... | ... |
| 2018-12-31 23:58:00 | 8 |
| 2018-12-31 23:59:00 | 7 |
+---------------------+-------+
I'd like to save it as a partitioned parquet file so that I can optimize file read.
Later I'd need to select the data for a given duration, e.g. 2018-01-05 00:00:00 to 2018-01-06 00:00:00. I was thinking I could partition this data on year, month, day, and hour values as below:
from pyspark.sql.functions import col, date_format, to_timestamp

df_final = df.withColumn("TimeStamp", to_timestamp(col('TimeStamp'), 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn("year", date_format(col("TimeStamp"), "yyyy")) \
    .withColumn("month", date_format(col("TimeStamp"), "MM")) \
    .withColumn("day", date_format(col("TimeStamp"), "dd")) \
    .withColumn("hour", date_format(col("TimeStamp"), "HH"))
This creates a folder structure like this in the resultant parquet
└── YYYY
└── MM
└── DD
└── HH
But does this partitioning help in read optimization? Also I see that the resulting parquet file is 10x larger in size than the unpartitioned file, on disk.
What is the best way to partition this file so that I can fetch data for a given duration faster?
Generally, when partitioning at multiple levels, check how many records end up in the last level; if that number is very small, it is better to stop at the previous partition level and use that instead.
Splitting the files across many levels can be over-optimization: it produces a large number of small files and extra disk reads, losing Parquet's columnar benefits.
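A minimal sketch of the write/read pattern under those assumptions (the output path and the SparkSession name spark are placeholders, and partitioning stops at the day level since minute-level data rarely justifies an hour partition):
# write, partitioned by date only (coarser than year/month/day/hour)
df_final.write.mode("overwrite") \
    .partitionBy("year", "month", "day") \
    .parquet("/data/minute_values")   # hypothetical output path

# later reads that filter on the partition columns prune whole directories
subset = (spark.read.parquet("/data/minute_values")
          .where("year = '2018' AND month = '01' AND day IN ('05', '06')"))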

Plot multiple lines from one DataFrame

I have the following DataFrame in Python Pandas:
df.head(3)
+===+===========+======+=======+
| | year-month| cat | count |
+===+===========+======+=======+
| 0 | 2016-01 | 1 | 14 |
+---+-----------+------+-------+
| 1 | 2016-02 | 1 | 22 |
+---+-----------+------+-------+
| 2 | 2016-01 | 2 | 10 |
+---+-----------+------+-------+
year-month is a combination of year and month, dating back about 8 years.
cat is an integer from 1 to 10.
count is an integer.
I now want to plot count vs. year-month with matplotlib, one line for each cat. How can this be done?
Easiest is seaborn:
import seaborn as sns
sns.lineplot(x='year-month', y='count', hue='cat', data=df)
Note: it might also help if you convert year-month to datetime type before plotting, e.g.
df['year-month'] = pd.to_datetime(df['year-month'], format='%Y-%m').dt.to_period('M')
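Since the question asks for matplotlib specifically, a pivot-based sketch with plain pandas/matplotlib (no seaborn) would look roughly like this:
import matplotlib.pyplot as plt

# one column per cat, one row per year-month
pivoted = df.pivot(index='year-month', columns='cat', values='count')
pivoted.plot()   # pandas draws one line per column
plt.xlabel('year-month')
plt.ylabel('count')
plt.show()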

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
|------|------------|-------|
| group| event date | count |
| x123 | 2016-01-06 | 1 |
| | 2016-01-08 | 10 |
| | 2016-02-15 | 9 |
| | 2016-05-22 | 6 |
| | 2016-05-29 | 2 |
| | 2016-05-31 | 6 |
| | 2016-12-29 | 1 |
| x124 | 2016-01-01 | 1 |
...
and I also know t0, the beginning of time (let's say for x123 it's 2016-01-01), and tN, the end of the experiment (2017-05-25), both from another dataframe df_s, then how can I create the dataframe df_new, which should look like this:
|------|------------|---------------|--------|
| group| obs. weekly| lifetime, week| status |
| x123 | 2016-01-01 | 1 | 1 |
| | 2016-01-08 | 0 | 0 |
| | 2016-01-15 | 0 | 0 |
| | 2016-01-22 | 1 | 1 |
| | 2016-01-29 | 2 | 1 |
...
| | 2017-05-18 | 1 | 1 |
| | 2017-05-25 | 1 | 1 |
...
| x124 | 2017-05-18 | 1 | 1 |
| x124 | 2017-05-25 | 1 | 1 |
Explanation: starting from t0, generate one row per week until tN. For each row R, check whether an event date for that group falls within R's week; if it does, count how long (in weeks) it lives there and set status = 1 (alive); otherwise set the lifetime and status columns for this R to 0, i.e. dead.
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd

# 1. generate dataframes per group to get the boundary within `t0` and `tN`
#    from the df_s dataframe, where each dataframe has
#    "group, obs, lifetime, status" columns X ((tN - t0) / week) rows
#    filled with 0 values.
df_all = pd.concat([df_group1, df_group2])

def do_that(R):
    found_event_row = df_e.loc[[R.group]]
    # check whether found_event_row['date'] falls into R['obs'] week;
    # if True, then find how long it stays there
    pass

df_new = df_all.apply(do_that, axis=1)
I'm not really sure if I get you, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd

df_group1 = df_group1.set_index('event date')
# convert the index to datetime so you can 'resample'
df_group1.index = pd.to_datetime(df_group1.index)
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: yourfunction(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
# do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.
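For part 1 of the question (generating the weekly rows between t0 and tN), a minimal sketch for a single group might look like this (t0, tN and the group name are taken from the example above; the rest is an assumption to be adapted per group):
import pandas as pd

t0, tN = pd.Timestamp('2016-01-01'), pd.Timestamp('2017-05-25')
weeks = pd.date_range(t0, tN, freq='7D')   # one row per week, starting at t0
df_grid = pd.DataFrame({'group': 'x123',
                        'obs. weekly': weeks,
                        'lifetime, week': 0,
                        'status': 0})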

Return multiple rows before and after the match row based on time span in Excel/VBA

I have the following kind of data:
+---------------+-------------------------+----
| time | item | line index number
+---------------+-------------------------+----
| 05:00:00 | | 1
| 05:00:01 | MatchingValue | 2
| 05:15:00 | | 3
| 06:00:00 | B | 4
| 06:01:00 | | 5
| 06:45:00 | | 6
| 07:00:00 | MatchingValue | 7
| 07:15:00 | | 8
| 08:00:00 | | 9
| 09:00:00 | | 10
+---------------+-------------------------+
What I am trying to do is extract the rows before and after each matching row with item == "MatchingValue", together with the matching row itself. The returned rows should fall within 15 minutes of the time where item == "MatchingValue".
For example, if I were searching for "MatchingValue" in the 2nd column, I would like to get rows 1, 2, 3 and 6, 7, 8.
I know that rows 2 and 7 can both be returned at once using an array formula (e.g. INDEX and MATCH), but I don't know how to build an array formula for this particular problem.
I appreciate any assistance.
The easiest way is to add a helper column and filter your data in place, or just use a pivot table to get only the data you need.
Formula for the helper column: =OR(B2="MatchingValue", COUNTIFS(B:B,"MatchingValue", A:A,">=" & A2-1/(24*4), A:A,"<=" & A2+1/(24*4))>0)
Of course you could also write an array formula to collect the data into a new range, but given the already complex criteria and the variable number of results, that formula would end up quite complex.