I have a bunch of data sitting in a Postgres database for a website I am building. The problem is, I don't really know what I should do with the date information. For example, I have a table called events with a date column that stores date information as a string until I can figure out what to do with it.
The reason the dates are in string format is that, unfortunately for the topic of my website, there is no good API, so I had to scrape the data. Here's what some of the data looks like inside the column:
| Date |
|-------------------------------------|
| Friday 07.30.2021 at 08:30 AM ET |
| Wednesday 04.07.2021 at 10:00 PM ET |
| Saturday 03.27.2010 |
| Monday 01.11.2010 |
| Saturday 02.09.2019 at 05:00 PM ET |
| Wednesday 03.31.2010 |
It would have been nice if every row had the time with it, but a lot of entries don't. I have no problem doing some string manipulation on the data to get it into a format where it can be turned into a date; I am just somewhat stumped on what I should do next.
What would you do in this situation if you were restricted to the data seen in the events table? Would you store it as UTC? How would you handle the dates without a time? Would you give up and just display everything as EST dates regardless of where the user lives (lol)?
It would be nice to use these dates to display correctly for anyone anywhere in the world, but it looks like I might be pigeonholed because of the dates that don't have a time associated with them.
Converting your mishmash of free-form dates to a standard timestamp is not as daunting as it seems. Your samples indicate you have a string with up to six separate pieces of information: day name, date (mm.dd.yyyy), the literal 'at', time of day, day part (AM/PM), and some code for the timezone, each separated by spaces. For them to be useful, the first step is splitting the string into its individual parts. For that, use a regular expression to reduce runs of spaces to single spaces, then use string_to_array to create an array of up to 6 elements. This gives:
+--------------------------+--------+----------------------------------+
| Field | Array | Action / |
| | index | Usage |
+--------------------------+--------+----------------------------------+
| day name | 1 | ignore |
| date | 2 | cast to date |
| literal 'at' | 3 | ignore |
| time of day | 4 | interval as hours since midnight |
| AM/PM | 5 | adjustment for time of day |
| some code for timezone | 6 | ??? |
+--------------------------+--------+----------------------------------+
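For example, the first sample row splits like this:
SELECT string_to_array(
         regexp_replace('Friday 07.30.2021 at 08:30 AM ET', '( ){1,}', ' ', 'g')
       , ' ') AS strings;
-- {Friday,07.30.2021,at,08:30,AM,ET}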
Putting it all together we arrive at:
with test_data ( stg ) as
( values ('Friday 07.30.2021 at 08:30 AM ET' )
, ('Wednesday 04.07.2021 at 10:00 PM ET')
, ('Saturday 03.27.2010' )
, ('Monday 01.11.2010' )
, ('Saturday 02.09.2019 at 05:00 PM ET' )
, ('Wednesday 03.31.2010')
)
-- <<< Your query begins here >>>
, stg_array( strings) as
( select string_to_array(regexp_replace(stg, '( ){1,}',' ','g'), ' ' )
from test_data --<<< your actual table >>>
) -- select * from stg_array
, as_columns( dt, tod_interval, adj_interval, tz) as
( select strings[2]::date
, case when array_length(strings,1) >= 4
then strings[4]::interval
else '00:00'::interval
end
, case when array_length(strings,1) >= 5 then
case when strings[5]='PM'
then interval '12 hours'
else interval '0 hours'
end
else interval '0 hours'
end
, case when array_length(strings,1) >= 6
then strings[6]
else current_setting('TIMEZONE')
end
from stg_array
)
select dt + tod_interval + adj_interval dtts, tz
from as_columns;
This gives the corresponding timestamp for date, time, and AM/PM indicator (in the current timezone) for items without a timezone specified. For those containing a timezone code, you will have to convert it to a proper timezone name; note that ET is neither a valid timezone name nor a valid abbreviation, so perhaps use a lookup table. See the example here; it also contains a regexp-based solution. Also, the example is run on db<>fiddle; their server is in the UK, hence the timezone.
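A minimal sketch of such a lookup table (the codes and their mappings here are assumptions; extend as needed):
CREATE TABLE tz_lookup (
    tz_code text PRIMARY KEY  -- code as scraped, e.g. 'ET'
  , tz_name text NOT NULL     -- valid time zone name for AT TIME ZONE
);
INSERT INTO tz_lookup VALUES
  ('ET', 'America/New_York')
, ('CT', 'America/Chicago');
-- hypothetical usage on top of the query above:
-- SELECT (dt + tod_interval + adj_interval) AT TIME ZONE l.tz_name
-- FROM as_columns c JOIN tz_lookup l ON l.tz_code = c.tz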
Related
What is the best Oracle sql query to extract date?
input entry - 2020-10-14T07:26:32.661Z ,
expected output - 2020-10-14
If you want a DATE data type where the time component is midnight then:
SELECT TRUNC(
TO_TIMESTAMP_TZ(
'2020-10-14T07:26:32.661Z',
'YYYY-MM-DD"T"HH24:MI:SS.FF3TZR'
)
) AS truncated_date
FROM DUAL;
Which (depending on your NLS_DATE_FORMAT) outputs:
| TRUNCATED_DATE |
| :------------------ |
| 2020-10-14 00:00:00 |
(Note: a DATE data type has year, month, day, hour, minute and second components. Whatever client program you are using to access the database may choose not to show the time component but it will still be there.)
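For instance, you can make the time component visible by widening the session's date format (the format mask here is just an example):
ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS';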
If you want a YYYY-MM-DD formatted string then:
SELECT TO_CHAR(
TO_TIMESTAMP_TZ(
'2020-10-14T07:26:32.661Z',
'YYYY-MM-DD"T"HH24:MI:SS.FF3TZR'
),
'YYYY-MM-DD'
) AS formatted_date
FROM DUAL;
| FORMATTED_DATE |
| :------------- |
| 2020-10-14 |
db<>fiddle here
The canonical way is probably trunc():
select trunc(input_entry)
This assumes that input_entry is a date or timestamp value.
EDIT:
If your input is just a string, use string operations:
select substr(input_entry, 1, 10)
You can also readily cast this to a date.
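For example, in Oracle (using the sample input from above):
SELECT TO_DATE(SUBSTR('2020-10-14T07:26:32.661Z', 1, 10), 'YYYY-MM-DD') AS d
FROM DUAL;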
Postgres version 9.4.18, PostGIS Version 2.2.
Here are the tables I'm working with (I am unlikely to be able to make significant changes to the table structure):
Table ltg_data (spans 1988 to 2018):
Column | Type | Modifiers
----------+--------------------------+-----------
intensity | integer | not null
time | timestamp with time zone | not null
lon | numeric(9,6) | not null
lat | numeric(8,6) | not null
ltg_geom | geometry(Point,4269) |
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")
Size of ltg_data (~800M rows):
ltg=# select pg_relation_size('ltg_data');
pg_relation_size
------------------
149729288192
Table counties:
Column | Type | Modifiers
-----------+-----------------------------+--------------------------------------------------------
gid | integer | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer |
objectid | integer |
state | character varying(2) |
cwa | character varying(9) |
countyname | character varying(24) |
fips | character varying(5) |
time_zone | character varying(2) |
fe_area | character varying(2) |
lon | double precision |
lat | double precision |
the_geom | geometry(MultiPolygon,4269) |
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)
Desired result:
I want a time series with one row for every day of the year in format 'MM-DD' ignoring the year: 01-01, 01-02, 01-03, ..., 12-31. And the count of rows in table ltg_data for each day of the year. I also eventually want the same thing for every hour of every day of the year ('MM-DD-HH').
A group by statement should accomplish this, but I'm having a hard time joining the "big" table with the days generated with generate_series().
MM-DD | total_count
-------+------------
12-22 | 9
12-23 | 0
12-24 | 0
12-25 | 0
12-26 | 23
12-27 | 0
12-28 | 5
12-29 | 0
12-30 | 0
12-31 | 0
Some of my many attempted queries:
SELECT date_trunc('day', d),
count(a.lat) AS strikes
FROM generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d
LEFT JOIN
(SELECT date_trunc('day', TIME) AS day_of_year,
ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON d = day_of_year
GROUP BY d
ORDER BY d ASC;
But this doesn't ignore the year. I shouldn't be surprised, because the "day" in date_trunc still includes the year, I guess.
2017-12-27 00:00:00-08 | 0
2017-12-28 00:00:00-08 | 0
2017-12-29 00:00:00-08 | 0
2017-12-30 00:00:00-08 | 0
2017-12-31 00:00:00-08 | 0
2018-01-01 00:00:00-08 | 0
2018-01-02 00:00:00-08 | 12
2018-01-03 00:00:00-08 | 0
And this query, in which I'm trying to convert the data from generate_series() to text in 'MM-DD' format to join to the ltg_data table in text format. It says the data types don't match. I've tried extract as well, since that could provide "doy" and "hour", which would work, but I can't seem to match data types in that query either. It's hard to make that generate_series a double precision.
SELECT to_char(d, 'MM-DD') AS DAY,
count(a.lat) AS strikes
FROM
(SELECT generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d)
AS f
LEFT JOIN
(SELECT to_char(TIME, 'MM-DD') AS day_of_year,
ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON f = day_of_year
GROUP BY d
ORDER BY d ASC;
Result:
ERROR: operator does not exist: record = text
LINE 4: ON f = day_of_year group by d order by d asc;
^
HINT: No operator matches the given name and argument type(s). You might
need to add explicit type casts.
Conclusion:
I'm aiming at getting daily and hourly total counts that span many years but group by 'MM-DD' and 'MM-DD-HH' (ignoring year), with the query results showing all days/hours even if they are zero.
Later I'll also try to find averages and percentiles for days and hours, so if you have any advice on that, I'm all ears. But my current problem is focused on just getting a complete result for totals.
Basically, to cut off the year, to_char(time, 'MMDD') like you already tried does the job. You just forgot to also apply it to the timestamps generated with generate_series() before joining. Plus some other minor details.
To simplify the query, and for performance and convenience, I suggest this simple function to calculate an integer from the 'MMDD' pattern of a given date:
CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';
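A quick sanity check:
SELECT f_mmdd(date '2021-07-30');  -- returns 730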
I used to_char(time, 'MMDD') at first, but switched to the above expression that turned out to be fastest in various tests.
db<>fiddle here
It can be used in expression indexes since it's defined IMMUTABLE. And it still allows function inlining because it only uses EXTRACT(xyz FROM date), which is implemented with the IMMUTABLE function date_part(text, date) internally. (Note that date_part(text, timestamptz) is only STABLE.)
Then this kind of query does the job:
SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd -- ignoring the year
FROM generate_series(timestamp '2018-01-01' -- any dummy year
, timestamp '2018-12-31'
, interval '1 day') d
) d
LEFT JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM counties c
JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
) ct USING (mmdd)
ORDER BY 1;
Since time (I would use a different column name) is data type timestamptz, the cast time::date depends on the timezone setting of your current session. ("Days" are defined by the time zone you are in.) To get immutable (but slower) results, use the AT TIME ZONE construct with a time zone name, like:
SELECT f_mmdd((time AT TIME ZONE 'Europe/Vienna')::date) ...
Details:
Ignoring time zones altogether in Rails and PostgreSQL
Format mmdd any way you like for display.
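For example, to render the integer as 'MM-DD' (a sketch; the value is zero-padded first because January dates lose their leading zero as integers):
WITH x(mmdd) AS (VALUES (f_mmdd(date '2018-12-22')))
SELECT substr(lpad(mmdd::text, 4, '0'), 1, 2)
    || '-' || substr(lpad(mmdd::text, 4, '0'), 3, 2) AS "MM-DD"
FROM x;  -- 12-22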
The cast to integer is optional for the purpose of this particular query. But since you plan to do all kinds of queries, you'll end up wanting an index on the expression:
CREATE INDEX ltg_data_mmdd_idx ON ltg_data (f_mmdd((time AT TIME ZONE 'UTC')::date));
(Not needed for this query. The index expression has to pin a time zone to stay IMMUTABLE; pick the one your queries use.)
integer is a bit faster for this purpose.
And you need the (otherwise optional) function wrapper for this since to_char() is only defined STABLE, but we need IMMUTABLE for the index. The updated expression (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int is IMMUTABLE, but the function wrapper is still convenient.
Related:
How do you do date math that ignores the year?
Generating time series between two dates in PostgreSQL
In my code using SQL Server, I am comparing data between two months where I have the exact dates identified. I am trying to find if the value in a certain column changes in a bunch of different scenarios. That part works, but I'd like to make it so that I don't have to go back and change the dates each time I want to get the results I'm looking for. Is this possible?
My thought was to add a WITH clause, but it is giving me an aggregation error. Is there any way I can go about making this date problem simpler? Thanks in advance.
EDIT
Ok I'd like to clarify. In my WITH statement, I have:
select distinct
d.Date
from Database d
Which returns:
+------+-------------+
| | Date |
+------+-------------|
| 1 | 01-06-2017 |
| 2 | 01-13-2017 |
| 3 | 01-20-2017 |
| 4 | 01-27-2017 |
| 5 | 02-03-2017 |
| 6 | 02-10-2017 |
| 7 | 02-17-2017 |
| 8 | 02-24-2017 |
| 9 | ........ |
+------+-------------+
If I select this statement and execute, it will return just the dates from my table as shown above. What I'd like to do is be able to have sql that will pull from these date values and compare the last date value from one month to the last date value of the next month. In essence, it should compare the values from date 8 to values from date 4, but it should be dynamic enough that it can do the same for any two dates without much tinkering.
If I didn't misunderstand your request, it seems you need a numbers table, also known as a tally table, or in this case a calendar table.
Recommended post: https://dba.stackexchange.com/questions/11506/why-are-numbers-tables-invaluable
Basically, you create a table and populate it with the week numbers of the year along with each week's start and end dates, as sketched below. Then join your main query to this table.
+------+-----------+----------+
| week | startDate | endDate |
+------+-----------+----------+
| 1 | 20170101 | 20170107 |
| 2 | 20170108 | 20170114 |
+------+-----------+----------+
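A minimal sketch for populating such a table for one year with a recursive CTE (table and column names follow the example above; adjust the anchor date and week count to your needs):
CREATE TABLE calendarTable (week int, startDate date, endDate date);

WITH w AS (
    SELECT 1 AS week, CAST('20170101' AS date) AS startDate
    UNION ALL
    SELECT week + 1, DATEADD(day, 7, startDate)
    FROM w
    WHERE week < 52
)
INSERT INTO calendarTable (week, startDate, endDate)
SELECT week, startDate, DATEADD(day, 6, startDate)
FROM w
OPTION (MAXRECURSION 60);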
select b.week, max(a.data)
from yourTable a
inner join calendarTable b
    on a.Date between b.startDate and b.endDate
group by b.week;
Dynamic dates to filter with BETWEEN:
select dateadd(m,-1,dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date))) -- 1st date of last month
select dateadd(day,-datepart(day,cast(getdate() as date)),cast(getdate() as date)) -- last date of last month
select dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date)) -- 1st date of current month
select dateadd(day,-datepart(day,dateadd(m,1,cast(getdate() as date))),dateadd(m,1,cast(getdate() as date))) -- last date of the month
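For example, to keep only last month's rows without hard-coding any dates (a sketch, assuming your date column is named [Date]):
select *
from yourTable
where [Date] between dateadd(m,-1,dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date))) -- 1st of last month
                 and dateadd(day,-datepart(day,cast(getdate() as date)),cast(getdate() as date))                   -- last of last month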
To extract the week of a given year we can use:
SELECT EXTRACT(WEEK FROM timestamp '2014-02-16 20:38:40');
However, I am trying to group weeks together in a bit of an odd format. My start of a week would begin on Mondays at 4am and would conclude the following Monday at 3:59:59am.
Ideally, I would like to create a query that provides a start and end date, then groups the total sales for that period by the weeks laid out above.
Example:
SELECT
(some custom week date),
SUM(sales)
FROM salesTable
WHERE
startDate BETWEEN 'DATE 1' AND 'DATE 2'
I am not looking to change the EXTRACT() function, rather create a query that would pull from the following sample table and output the sample results.
If 'DATE 1' in query was '2014-07-01' AND 'DATE 2' was '2014-08-18':
Sample Table:
itemID | timeSold | price
------------------------------------
1 | 2014-08-13 09:13:00 | 12.45
2 | 2014-08-15 12:33:00 | 20.00
3 | 2014-08-05 18:33:00 | 10.00
4 | 2014-07-31 04:00:00 | 30.00
Desired result:
weekBegin | priceTotal
----------------------------------
2014-07-28 04:00:00 | 30.00
2014-08-04 04:00:00 | 10.00
2014-08-11 04:00:00 | 32.45
Produces your desired output:
SELECT date_trunc('week', time_sold - interval '4h')
+ interval '4h' AS week_begin
, sum(price) AS price_total
FROM tbl
WHERE time_sold >= '2014-07-01 0:0'::timestamp
AND time_sold < '2014-08-19 0:0'::timestamp -- start of next day
GROUP BY 1
ORDER BY 1;
db<>fiddle here (extended with a row that actually shows the difference)
Old sqlfiddle
Explanation
date_trunc() is the superior tool here. You are not interested in week numbers, but in actual timestamps.
The "trick" is to subtract 4 hours from selected timestamps before extracting the week - thereby shifting the time frame towards the earlier bound of the ISO week. To produce the desired display, add the same 4 hours back to the truncated timestamps.
But apply the WHERE condition on unmodified timestamps. Also, never use BETWEEN with timestamps, which can have fractional seconds. Use WHERE conditions like the ones presented above. See:
Unexpected results from SQL query with BETWEEN timestamps
This operates on data type timestamp, i.e. with (shifted) "weeks" according to the current time zone. You might want to work with timestamptz instead. See:
Ignoring time zones altogether in Rails and PostgreSQL
Horrible title, but let me explain: I've got this Django model containing a timestamp (date) and the attribute to log, e.g. the number of users consuming some resource (value).
class Viewers(models.Model):
date = models.DateTimeField()
value = models.IntegerField()
Every 10 seconds the table receives a row with the current number of users, something like this:
| date | value |
|------|-------|
| t1 | 15 |
| t2 | 18 |
| t3 | 27 |
| t4 | 25 |
| .. | .. |
| t30 | 38 |
| t31 | 36 |
| .. | .. |
Now I want to generate different statistics from this data, each with a different resolution. E.g. for a chart of the last day I don't need the 10-second resolution, so I want 5-minute steps (built by averaging the values (and maybe also the date) of the rows from t1 to t29, t30 to t59, ...), so that I'll get:
| date | value |
|------|-------|
| t15 | 21 |
| t45 | 32 |
| .. | .. |
The attributes to keep variable are the start & end timestamps and the resolution (like 5 minutes). Is there a way to do this using the Django ORM/queryset API, and if not, how can I achieve it with custom SQL?
I've been trying to solve this problem in the most 'Django' way possible. I've settled on the following. It averages the values in 15-minute time slots between start_date and end_date, where the column name is 'date':
readings = Reading.objects.filter(date__range=(start_date, end_date)) \
.extra(select={'date_slice': "FLOOR (EXTRACT (EPOCH FROM date) / '900' )"}) \
.values('date_slice') \
.annotate(value_avg=Avg('value'))
It returns dictionaries like:
{'value_avg': 1116.4925373134329, 'date_slice': 1546512.0}
{'value_avg': 1001.2028985507246, 'date_slice': 1546513.0}
{'value_avg': 1180.6285714285714, 'date_slice': 1546514.0}
The core of the idea comes from this answer to the same question for PHP/SQL. The code passed to extra is for a Postgres DB.
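For reference, the queryset above boils down to roughly this SQL (a sketch; the table name viewers and the bind parameters are placeholders):
SELECT floor(extract(epoch FROM date) / 900) AS date_slice
     , avg(value) AS value_avg
FROM viewers
WHERE date BETWEEN :start_date AND :end_date
GROUP BY date_slice;

-- to turn a slice back into the slot's start time:
-- SELECT to_timestamp(date_slice * 900);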
from django.db.models import Avg
Viewers.objects.filter(date__range=(start_time, end_time)).aggregate(average=Avg('value'))
That will get you the average of all the values between start_time and end_time, returned as a dictionary in the form of { 'average': <the average> }.
start_time and end_time need to be Python datetime objects. So if you have a timestamp, or something, you'll need to convert it first. You can also use datetime.timedelta to calculate the end_time based on the start_time. For a five minute resolution, something like this:
from datetime import timedelta
end_time = start_time + timedelta(minutes=5)
Have you looked at the range filter?
https://docs.djangoproject.com/en/dev/ref/models/querysets/#range
The example given in the docs seems similar to your situation.
Slightly improving upon the answer by @Richard Corden, in PostgreSQL you can do:
from django.db import models
from django.db.models import Avg, F, Func
from django.db.models.functions import Extract, Floor

def for_interval(self, start=None, end=None, interval=60):
    # (Check start and end values...)
    return (
        self
        .filter(timestamp__range=(start, end))
        # slot start as Unix time, floored to the interval
        .annotate(unix_timestamp=Floor(Extract('timestamp', 'epoch') / interval) * interval)
        # back to a timestamp for grouping and display
        .annotate(time=Func(F('unix_timestamp'), function='TO_TIMESTAMP',
                            output_field=models.DateTimeField()))
        .values('time')
        # alias must not clash with the model's own 'value' field
        .annotate(value_avg=Avg('value'))
        .order_by('time')
    )
I would also recommend storing the floor of the interval rather than its midpoint.
After a long time trying, I made it work as an SQL statement (MySQL syntax):
SELECT FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(date))), SUM(value)
FROM `my_table`
WHERE date BETWEEN SUBTIME(NOW( ), '0:30:00') AND NOW()
GROUP BY UNIX_TIMESTAMP(date) DIV 300
ORDER BY date DESC
with
start_time = SUBTIME(NOW( ), '0:30:00')
end_time = NOW()
period = 300 # in seconds
In the end it was not really hard, and it is indeed independent of the time resolution of the samples in the origin table.