Regarding time format - pandas

I am having dataframe containing time column to be like this:
time
2001-11-28 13:42:46 -0500
2001-10-10 22:14:00 -0400
I know how to convert them into time period but I fail to understand what does -0500 and -4000 even means.
This data I am using is an open source data for bugs related thing. If any one can help me with this it will be very helpful to me.

-0500 means five hours (05) and zero minutes (00) behind (-) UTC. For example New York City sometimes observes this time offset (specifically in winter, when DST is not in effect).
Read more about how UTC offsets work here: https://en.wikipedia.org/wiki/UTC_offset
And more on how Pandas handles times and time zones here (search for "timezone"): https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

There is timezone offset, you can processing different ways:
#convert to datetimes with different timezones
df['time1'] = pd.to_datetime(df['time'])
#convert to datetimes with utc tiemzone
df['time2'] = pd.to_datetime(df['time'], utc=True)
#remove timezone information
df['time3'] = pd.to_datetime(df['time'].str.rsplit(' -', n=1).str[0])
print (df)
time time1 \
0 2001-11-28 13:42:46 -0500 2001-11-28 13:42:46-05:00
1 2001-10-10 22:14:00 -0400 2001-10-10 22:14:00-04:00
time2 time3
0 2001-11-28 18:42:46+00:00 2001-11-28 13:42:46
1 2001-10-11 02:14:00+00:00 2001-10-10 22:14:00

Related

Calculate time difference between two columns of string type in hive without changing the data type string

I am trying to calculate the time difference between two columns of a row which are of string data type. If the time difference between them is less than 2 hours then select the first column of that row else if the time difference is greater than 2 hours then select the second column of that row. It can be done by converting the columns to datetime format, but I want the result to be in string only. How can I do that? The data looks like this:
col1(string type)
2018-07-16 02:23:00
2018-07-26 12:26:00
2018-07-26 15:32:00
col2(string type)
2018-07-16 02:36:00
2018-07-26 14:29:00
2018-07-27 15:38:00
I think you don't need to convert the columns to datetime format, since the data in your case is already ordered (yyyy-MM-dd hh:mm:ss). You just need to take all the digits and take it into one string (yyyyMMddhhmmss) then you can apply your selection which is bigger or smaller than 2 hours (here 20000 since the hour is followed by mmss). By looking at your example (assuming col2 > col1), this query would work:
SELECT case when regexp_replace(col2,'[^0-9]', '')-regexp_replace(col1,'[^0-9]', '') < 20000 then col1 else col2 end as col3 from your_table;
Use unix_timestamp() to convert string timestamp to seconds.
The difference in hours will be:
hive> select (unix_timestamp('2018-07-16 02:23:00')- unix_timestamp('2018-07-16 02:36:00'))/60/60;
OK
-0.21666666666666667
Important update: this method will work correctly only if time zone is configured as UTC. Because for DST timezones for some marginal cases Hive converts time during timestamp operations. Consider this example for PDT time zone:
hive> select hour('2018-03-11 02:00:00');
OK
3
Note the hour is 3, not 2. This is because 2018-03-11 02:00:00 cannot exist in PDT time zone because exactly at 2018-03-11 02:00:00 time is adjusted and becomes 2018-03-11 03:00:00.
The same happens when converting to unix_timestamp. For PDT time zone unix_timestamp('2018-03-11 03:00:00') and unix_timestamp('2018-03-11 02:00:00') will return the same timestamp:
hive> select unix_timestamp('2018-03-11 03:00:00');
OK
1520762400
hive> select unix_timestamp('2018-03-11 02:00:00');
OK
1520762400
And few links for your reference:
https://community.hortonworks.com/questions/82511/change-default-timezone-for-hive.html
http://boristyukin.com/watch-out-for-timezones-with-sqoop-hive-impala-and-spark-2/
Also have a look at this jira please: Hive should carry out timestamp computations in UTC

Matplotlib Default date format?

I'm using Pandas to read a .csv file that a 'Timestamp' date column in the format:
31/12/2016 00:00
I use the following line to convert it to a datetime64 dtype:
time = pd.to_datetime(df['Timestamp'])
The column has an entry corresponding to every 15mins for almost a year, and I've run into a problem when I want to plot more than 1 months worth.
Pandas seems to change the format from ISO to US upon reading (so YYYY:MM:DD to YYYY:DD:MM), so my plots have 30 day gaps whenever the datetime represents a new day. A plot of the first 5 days looks like:
This is the raw data in the file either side of the jump:
01/01/2017 23:45
02/01/2017 00:00
If I print the values being plotted (after reading) around the 1st jump, I get:
2017-01-01 23:45:00
2017-02-01 00:00:00
So is there a way to get pandas to read the dates properly?
Thanks!
You can specify a format parameter in pd.to_datetime to tell pandas how to parse the date exactly, which I suppose is what you need:
time = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M')
pd.to_datetime('02/01/2017 00:00')
#Timestamp('2017-02-01 00:00:00')
pd.to_datetime('02/01/2017 00:00', format='%d/%m/%Y %H:%M')
#Timestamp('2017-01-02 00:00:00')

pytz - convert a datetime in the future to UTC

I have a file that contains forecasted events for the next two weeks. There is a datetime column which has the date and each 30 minute interval, and a time zone column.
I am using pytz to convert the different time zones (around 30+ unique ones) to UTC before loading them into a database. However, for the forecast file I am receiving an error:
NonExistentTimeError: 2016-10-16 00:00:00
Is there a way to go about this?
date interval time_zone
10/26/2016 22:30 US/Central
10/26/2016 22:30 US/Eastern
10/26/2016 23:00 America/Bogota
10/26/2016 23:00 Asia/Calcutta
Current code:
for tz in df['time_zone'].unique():
df.loc[df['time_zone'] == tz, 'datetime_utc'] = df.loc[df['time_zone'] == tz, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)
Due to changes in daylight saving happening on the 16th October, 2016-10-16 00:00:00 really is a local time that does not exist for Brazil (It should instead read 2016-10-16 01:00:00)

Converting UTC 0 to UTC local in SQL. But for two different time zone

I want historic date convert UTC 0 to UTC local in SQL. Like;
2012-11-23
2013-01-08
2014-02-23
But we have 2 different time zone. We use UTC +2 after last sunday in March and use UTC +3 after last sunday October. I need solution immediately guys. Please help me...
Try this:
SELECT CONVERT_TZ('your date ','+your time zone','+time zone you want');
SELECT CONVERT_TZ('2004-01-01 12:00:00','+02:00' ,'+03:00'); // in your case, from +2 to +3
See this link
If you need a dynamic thing, you'll need the timezone of each history (maybe you can store it in a separeted column), so I think you can do something like this:
SELECT CONVERT_TZ('your_date_column','local_timezone' ,'time_zone_column');

Rails daylight savings time not accounted for when using DateTime.strptime

I've been working on parsing strings and I have a test case that has been causing problems for me. When parsing a date/time string with strptime, Daylight Savings Time is NOT accounted for. This is a bug as far as I can tell. I can't find any docs on this bug. Here is a test case in the Rails console. This is ruby 1.9.3-p215 and Rails 3.2.2.
1.9.3-p125 :049 > dt = DateTime.strptime("2012-04-15 10:00 Central Time (US & Canada)", "%Y-%m-%d %H:%M %Z")
=> Sun, 15 Apr 2012 10:00:00 -0600
1.9.3-p125 :050 > dt = DateTime.strptime("2012-04-15 10:00 Central Time (US & Canada)", "%Y-%m-%d %H:%M %Z").utc
=> Sun, 15 Apr 2012 16:00:00 +0000
1.9.3-p125 :051 > dt = DateTime.strptime("2012-04-15 10:00 Central Time (US & Canada)", "%Y-%m-%d %H:%M %Z").utc.in_time_zone("Central Time (US & Canada)")
=> Sun, 15 Apr 2012 11:00:00 CDT -05:00
As you can see, I have to convert to utc and then back to the timezone to get DST to be properly interpreted, but then the time is shifted one hour as well, so it's not what I parsed out of the string. Does someone have a workaround to this bug or a more robust way of parsing a date + time + timezone reliably into a DateTime object where daylight savings time is properly represented? Thank you.
Edit:
Ok, I found a workaround, although I'm not sure how robust it is.
Here is an example:
ActiveSupport::TimeZone["Central Time (US & Canada)"].parse "2012-04-15 10:00"
This parses the date/time string into the correct timezone. I'm not sure how robust the parse method is for handling this so I'd like to see if there is a better workaround, but this is my method so far.
This is a frustrating problem. The Rails method you're looking for is Time.zone.parse. First use DateTime.strptime to parse the string, then run it through Time.zone.parse to set the zone. Check out the following console output:
> Time.zone
=> (GMT-06:00) Central Time (US & Canada)
> input_string = "10/12/12 00:00"
> input_format = "%m/%d/%y %H:%M"
> date_with_wrong_zone = DateTime.strptime(input_string, input_format)
=> Fri, 12 Oct 2012 00:00:00 +0000
> correct_date = Time.zone.parse(date_with_wrong_zone.strftime('%Y-%m-%d %H:%M:%S'))
=> Fri, 12 Oct 2012 00:00:00 CDT -05:00
Notice that even though Time.zone's offset is -6 (CST), the end result's offset is -5 (CDT).
Ok, here is the best way I've found to handle this so far. I created a utility method in a lib file.
# Returns a DateTime object in the specified timezone
def self.parse_to_date(date_string, num_hours, timezone)
if timezone.is_a? String
timezone = ActiveSupport::TimeZone[timezone]
end
result = nil
#Chronic.time_class = timezone # Trying out chronic time zone support - so far it doesn't work
the_date = Chronic.parse date_string
if the_date
# Format the date into something that TimeZone can definitely parse
date_string = the_date.strftime("%Y-%m-%d")
result = timezone.parse(date_string) + num_hours.to_f.hours
end
result
end
Note that I add hours onto the time manually because Chronic.parse wasn't as robust as I liked in parsing times - it failed when no trailing zeros were added to a time, for example, as in 8:0 instead of 8:00.
I hope this is useful to someone. Parsing date/time/timzone strings into a valid date seems to be a very common thing, but I was unable to find any parsing code that incorporated all three together.

Categories