I'm having trouble sorting a CSV file whose second column holds a UTC time in the form 2010-01-01 00:00:00.
The file looks like this:
name utc_time longitude latitude
A 2010-01-01 00:00:34 23 41
B 2011-01-01 10:00:00 26 44
C 2009-01-01 03:00:00 34 46
D 2012-01-01 00:00:00 31 47
E 2010-01-01 04:00:00 44 48
F 2013-01-01 14:00:00 24 41
I want it written out to a CSV file, keeping the same structure but sorted by date:
Output:
name utc_time longitude latitude
C 2009-01-01 03:00:00 34 46
A 2010-01-01 00:00:34 23 41
E 2010-01-01 04:00:00 44 48
B 2011-01-01 10:00:00 26 44
D 2012-01-01 00:00:00 31 47
F 2013-01-01 14:00:00 24 41
I'm currently trying this:
fileEru = pd.read_csv("input.csv")
fileEru = sorted(fileEru, key = lambda row: datetime.strptime(row[1],'%Y-%m-%d %H:%M:%S'), reverse=True)
fileEru.to_csv("output.csv")
But it doesn't work: iterating over a DataFrame yields its column labels, not its rows, so the key function fails, and even if it succeeded the result would be a plain list with no .to_csv method.
Try this instead: parse the timestamp column as datetimes while reading, then sort on it:
(pd.read_csv("input.csv", parse_dates=['utc_time'])
.sort_values('utc_time')
.to_csv("output.csv", index=False))
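parse_dates=['utc_time'] converts the column to real datetimes, so sort_values orders it chronologically, and index=False keeps pandas from writing the row index as an extra column. Note that your reverse=True would have produced newest-first order, while your expected output is oldest-first, so no ascending=False is needed.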
I want my dataframe to be grouped by calendar week, i.e. Monday to Sunday.
timestamp value
# before time
...
# this is a Friday
2021-10-01 13:00:00 2204.0
2021-10-01 13:30:00 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0
2021-10-04 16:30:00 990.0
2021-10-04 17:00:00 1044.0
2021-10-04 17:30:00 26.0
...
# time continues
Here is the result I'm expecting; I hope this is clear enough.
timestamp value weekly_max
# this is a Friday
2021-10-01 13:00:00 2204.0 3262.0 # assume 3262.0 is the maximum value during 2021-09-27 to 2021-10-03
2021-10-01 13:30:00 3262.0 3262.0
...
# this is next Monday
2021-10-04 16:00:00 254.0 1044.0
2021-10-04 16:30:00 990.0 1044.0
2021-10-04 17:00:00 1044.0 1044.0
2021-10-04 17:30:00 26.0 1044.0
...
Get the ISO week number:
df['week'] = df.datetime.dt.isocalendar().week
Get the max for each week:
df_weeklymax = df.groupby('week').agg(max=('value', 'max')).reset_index()
Merge the two tables:
df.merge(df_weeklymax, on='week', how='left')
Example output:
              datetime  value  week  max
0  2021-01-01 00:00:00     20    53   69
1  2021-01-01 13:36:00     69    53   69
2  2021-01-02 03:12:00     69    53   69
3  2021-01-02 16:48:00     57    53   69
4  2021-01-03 06:24:00     39    53   69
5  2021-01-03 20:00:00     56    53   69
6  2021-01-04 09:36:00     73     1   92
7  2021-01-04 23:12:00     76     1   92
8  2021-01-05 12:48:00     92     1   92
9  2021-01-06 02:24:00      4     1   92
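As an alternative sketch (assuming the question's timestamp column is already datetime64), a single groupby/transform avoids the intermediate table and the merge, and also sidesteps week numbers colliding across years (note how the first days of January 2021 land in ISO week 53 above); dt.to_period('W') yields full Monday-to-Sunday periods:
# weekly max per row, computed and broadcast back in one step
df['weekly_max'] = (df.groupby(df['timestamp'].dt.to_period('W'))['value']
                      .transform('max'))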
Say I have a pandas.DataFrame like:
val
2020-01-01 12
2020-04-15 38
2020-05-03 19
How can I create a pandas.DataFrame like:
val
2020-01-01 00:00:00 12
2020-01-01 00:01:00 12
...
2020-01-01 23:58:00 12
2020-01-01 23:59:00 12
2020-04-15 00:00:00 38
2020-04-15 00:01:00 38
...
2020-04-15 23:58:00 38
2020-04-15 23:59:00 38
2020-05-03 00:00:00 19
2020-05-03 00:01:00 19
...
2020-05-03 23:58:00 19
2020-05-03 23:59:00 19
I have tried df.resample('1 min').asfreq() but that gives me all the minutes from the first row to the last row, including all the days that aren't in the original index.
Recreating your sample df:
dates = [ pd.to_datetime('2020-01-01'), pd.to_datetime('2020-04-15'), pd.to_datetime('2020-05-03') ]
val = [12, 38, 19]
df = pd.DataFrame({ 'date' : dates, 'val' : val})
df = df.set_index('date')
I don't generally recommend loops, but this feels like it might be a case where it is more natural to use one. It really depends on how much data you're dealing with. It works, anyway. :)
out = pd.DataFrame()
for row in df.itertuples():
    # one row per minute, from midnight up to (but not including) the next midnight
    bars = pd.date_range(row.Index, row.Index + pd.Timedelta(days=1),
                         freq="min", inclusive="left")
    out = pd.concat([out, pd.DataFrame(data={'val': row.val}, index=bars)])
print(out)
val
2020-01-01 00:00:00 12
2020-01-01 00:01:00 12
2020-01-01 00:02:00 12
2020-01-01 00:03:00 12
2020-01-01 00:04:00 12
... ...
2020-05-03 23:55:00 19
2020-05-03 23:56:00 19
2020-05-03 23:57:00 19
2020-05-03 23:58:00 19
2020-05-03 23:59:00 19
[4320 rows x 1 columns]
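If the loop becomes a bottleneck on larger data, here is a loop-free sketch of the same expansion (assuming, as above, that the dates form a unique DatetimeIndex and val is the only column): repeat each row 1440 times and add minute offsets.
import numpy as np

out = df.loc[df.index.repeat(1440)].copy()          # 1440 minutes per day
out.index = out.index + pd.to_timedelta(
    np.tile(np.arange(1440), len(df)), unit="min")  # offsets 0..1439 for each date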
Let's say I have a DataFrame with date_time index:
date_time a b
2020-11-23 04:00:00 10 5
2020-11-23 05:00:00 11 5
2020-11-23 06:00:00 12 5
2020-11-24 04:30:00 13 6
2020-11-24 05:30:00 14 6
2020-11-24 06:30:00 15 6
2020-11-25 06:00:00 16 7
2020-11-25 07:00:00 17 7
2020-11-25 08:00:00 18 7
"a" column is intraday data (every row - different value). "b" column - DAILY data - same data during the current day.
I need to make some calculations with "b" (daily) column and create "c" column with the result. For example, sum for two last days.
Result:
date_time a b c
2020-11-23 04:00:00 10 5 NaN
2020-11-23 05:00:00 11 5 NaN
2020-11-23 06:00:00 12 5 NaN
2020-11-24 04:30:00 13 6 11
2020-11-24 05:30:00 14 6 11
2020-11-24 06:30:00 15 6 11
2020-11-25 06:00:00 16 7 13
2020-11-25 07:00:00 17 7 13
2020-11-25 08:00:00 18 7 13
I guess I should use something like
df['c'] = df.resample('D').b.rolling(3).sum ...
but I got "NaN" values in "c".
Could you help me? Thanks!
One thing you can do is to drop duplicates on the date and work on that:
# get the dates
df['date'] = df['date_time'].dt.normalize()
df['c'] = (df.drop_duplicates('date')['b'] # drop duplicates on dates
.rolling(2).sum() # rolling sum
)
df['c'] = df['c'].ffill() # fill the missing data
Output:
date_time a b date c
0 2020-11-23 04:00:00 10 5 2020-11-23 NaN
1 2020-11-23 05:00:00 11 5 2020-11-23 NaN
2 2020-11-23 06:00:00 12 5 2020-11-23 NaN
3 2020-11-24 04:30:00 13 6 2020-11-24 11.0
4 2020-11-24 05:30:00 14 6 2020-11-24 11.0
5 2020-11-24 06:30:00 15 6 2020-11-24 11.0
6 2020-11-25 06:00:00 16 7 2020-11-25 13.0
7 2020-11-25 07:00:00 17 7 2020-11-25 13.0
8 2020-11-25 08:00:00 18 7 2020-11-25 13.0
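An equivalent sketch that doesn't rely on index alignment, reusing the date column created above: collapse b to one value per day, take the rolling sum over that daily series, then map it back onto the intraday rows.
daily = df.groupby('date')['b'].first()           # one value of b per day
df['c'] = df['date'].map(daily.rolling(2).sum())  # two-day sum, broadcast back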
I need some help calculating a 7-day mean for every hour.
The time series has an hourly resolution, and I need the 7-day mean for each hour of the day, e.g. for 13:00:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but a plain rolling window covers the previous 7 days of consecutive rows rather than the same hour on each of the last 7 days.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean over 7 rows within each group. Since each group contains one row per day for a given hour, this averages the same hour over the last 7 days, which is consistent with the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
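If you would rather keep the original shape and DatetimeIndex instead of the reset_index layout above, here is a transform-based sketch (assuming x is the value column, as in the question; the column name x_7d_mean is chosen here):
# same-hour 7-day mean, written back alongside the original rows
df['x_7d_mean'] = (df.groupby(df.index.hour)['x']
                     .transform(lambda s: s.rolling(7).mean()))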
I have multiple dataframes, the following are only 2 of them:
print(df1)
Date A B C
2019-10-01 00:00:00 2 3 1
2019-10-01 01:00:00 5 1 6
2019-10-01 02:00:00 8 2 4
2019-10-01 03:00:00 3 6 5
print(df2)
Date A B C
2019-10-01 00:00:00 9 4 2
2019-10-01 01:00:00 3 2 4
2019-10-01 02:00:00 6 5 2
2019-10-01 03:00:00 3 6 5
All of them have the same index and columns. I want to create a DataFrame like this:
Date df1 df2
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
I have to apply this process to 30 dataframes (their indexes and columns are the same), so I want to write a for loop to build this DataFrame. How can I do that?
Reshape each DataFrame in the list with DataFrame.set_index followed by DataFrame.unstack, then concat them, and finally rename the columns with a lambda function:
dfs = [df1, df2]
df = (pd.concat([x.set_index('Date').unstack() for x in dfs], axis=1)
        .rename(columns=lambda x: f'df{x+1}'))
print(df)
df1 df2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
If you want custom column names in the final DataFrame, create a list of names with the same length as dfs and pass it via the keys parameter:
dfs = [df1, df2]
names = ['col1', 'col2']
df = pd.concat([x.set_index('Date').unstack() for x in dfs], keys=names, axis=1)
print(df)
col1 col2
Date
A 2019-10-01 00:00:00 2 9
2019-10-01 01:00:00 5 3
2019-10-01 02:00:00 8 6
2019-10-01 03:00:00 3 3
B 2019-10-01 00:00:00 3 4
2019-10-01 01:00:00 1 2
2019-10-01 02:00:00 2 5
2019-10-01 03:00:00 6 6
C 2019-10-01 00:00:00 1 2
2019-10-01 01:00:00 6 4
2019-10-01 02:00:00 4 2
2019-10-01 03:00:00 5 5
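To scale this to all 30 frames, one sketch is to keep them in a dict so each name travels with its data (df1 and df2 below stand in for the full set); pd.concat accepts the dict directly and uses its keys as the column names:
frames = {'df1': df1, 'df2': df2}   # extend with the remaining DataFrames
df = pd.concat({name: x.set_index('Date').unstack()
                for name, x in frames.items()}, axis=1)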