I have a table in Postgres as follows:
| id | start_time | end_time | duration |
|----|--------------------------|--------------------------|----------|
| 1 | 2018-05-11T00:00:20.631Z | 2018-05-11T01:03:14.496Z | 1:02:54 |
| 2 | 2018-05-11T00:00:04.877Z | 2018-05-11T00:00:14.641Z | 0:00:10 |
| 3 | 2018-05-11T01:03:28.063Z | 2018-05-11T01:04:36.410Z | 0:01:08 |
| 4 | 2018-05-11T00:00:20.631Z | 2018-05-11T02:03:14.496Z | 2:02:54 |
start_time and end_time are stored as varchar, in ISO 8601 format ('yyyy-mm-ddThh24:mi:ss.msZ').
duration has been calculated as end_time - start_time. Format is hh:mi:ss.
I need result table output as follows:
| id | start_time | end_time | duration | start | end | duration_minutes |
|----|--------------------------|--------------------------|----------|-----------|-----------|------------------|
| 1 | 2018-05-11T00:00:20.631Z | 2018-05-11T01:03:14.496Z | 1:02:54 | 5/11/2018 | 5/11/2018 | 62 | -- (60+2)
| 2 | 2018-05-11T00:00:04.877Z | 2018-05-11T00:00:14.641Z | 0:00:10 | 5/11/2018 | 5/11/2018 | 0 |
| 3 | 2018-05-11T01:03:28.063Z | 2018-05-11T01:04:36.410Z | 0:01:08 | 5/11/2018 | 5/11/2018 | 1 |
| 4 | 2018-05-11T00:00:20.631Z | 2018-05-11T02:03:14.496Z | 2:02:54 | 5/11/2018 | 5/11/2018 | 122 | -- (2*60 + 2)
start and end need to contain only the mm/dd/yyyy portion of start_time and end_time respectively.
duration_minutes should contain the total duration in minutes (e.g., if duration is 1:02:54, the duration in minutes is 62, which is 60 + 2)
How can I do this using SQL?
Based on varchar input, this query produces exactly your desired result:
SELECT *
     , to_char(start_time::timestamp, 'FMMM/DD/YYYY') AS start
     , to_char(end_time::timestamp, 'FMMM/DD/YYYY') AS "end"  -- "end" is a reserved word, so it must be quoted
     , extract(epoch FROM duration::interval)::int / 60 AS duration_minutes
FROM tbl;
Major points:
Use timestamp and interval instead of varchar to begin with.
Or do not store the functionally dependent column duration at all. It can cheaply be computed on the fly (see the sketch below).
For display / a particular text representation use to_char().
Be explicit and do not rely on locale settings that may change from session to session.
The FM pattern modifier is for (quoting the manual):
fill mode (suppress leading zeroes and padding blanks)
extract(epoch FROM interval_value) produces the number of contained seconds. You want to truncate fractional minutes? Integer division does just that, so cast to int as demonstrated. Related:
Get difference in minutes between times with timezone
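As a minimal sketch of computing duration on the fly instead of storing it (assuming the same table tbl with varchar input as above; the alias duration_calc is made up):
SELECT *
     , (end_time::timestamp - start_time::timestamp) AS duration_calc  -- an interval, computed on the fly
     , extract(epoch FROM end_time::timestamp - start_time::timestamp)::int / 60 AS duration_minutes
FROM tbl;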
The following appears to do what you want:
select v.starttime::timestamp::date as start, v.endtime::timestamp::date as "end",
       extract(epoch from v.endtime::timestamp - v.starttime::timestamp)::int / 60 as duration_minutes
from (values ('2018-05-11T00:00:20.631Z', '2018-05-11T01:03:14.496Z')) v(starttime, endtime);
If you want the dates in a particular format, then use to_char().
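For instance, a minimal sketch using the same FM pattern as in the answer above:
select to_char(v.starttime::timestamp, 'FMMM/DD/YYYY') as start
from (values ('2018-05-11T00:00:20.631Z')) v(starttime);  -- returns '5/11/2018'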
I have a table with the following structure and data in it:
| ID | Date | Result |
|---- |------------ |-------- |
| 1 | 30/04/2020 | + |
| 1 | 01/05/2020 | - |
| 1 | 05/05/2020 | - |
| 2 | 03/05/2020 | - |
| 2 | 04/05/2020 | + |
| 2 | 05/05/2020 | - |
| 2 | 06/05/2020 | - |
| 3 | 01/05/2020 | - |
| 3 | 02/05/2020 | - |
| 3 | 03/05/2020 | - |
| 3 | 04/05/2020 | - |
I'm trying to write an SQL query (I'm using SQL Server) which returns the date of the first two consecutive negative results for a given ID.
For example, for ID no. 1, the first two consecutive negative results are on 01/05 and 05/05.
The first two consecutive negative results for ID No. 2 are on 05/05 and 06/05.
The first two consecutive negative results for ID No. 3 are on 01/05 and 02/05.
So the query should produce the following result:
| ID | FirstNegativeDate |
|---- |------------------- |
| 1 | 01/05 |
| 2 | 05/05 |
| 3 | 01/05 |
Please note that the dates aren't necessarily one day apart. Sometimes, two consecutive negative tests may be several days apart. But they should still be considered as "consecutive negative tests". In other words, two negative tests are not 'consecutive' only if there is a positive test result in between them.
How can this be done in SQL? I've done some reading and it looks like the PARTITION BY clause may be required, but I'm not sure how it works.
This is a gaps-and-islands problem, where you want the start of the first island of '-'s that contains at least two rows.
I would recommend lead() and aggregation:
select id, min(date) as first_negative_date
from (
    select t.*, lead(result) over (partition by id order by date) as lead_result
    from mytable t
) t
where result = '-' and lead_result = '-'
group by id;
Use the LEAD or LAG function over an ID partition ordered by your Date column.
Then simply check where the LEAD/LAG column is equal to Result.
You'll also need to filter down to the first ones.
The sketch below shows what LEAD/LAG would return.
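A minimal sketch of that intermediate step (the table name tests is an assumption; adapt it to your schema):
-- Show each row's Result next to the following row's Result;
-- rows where both are '-' start a pair of consecutive negative tests.
SELECT ID, [Date], Result,
       LEAD(Result) OVER (PARTITION BY ID ORDER BY [Date]) AS NextResult
FROM tests;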
Given a data frame with 4 columns: group, start_date, available_stock, used_stock. I basically have to figure out how long the stock will last, given a group and date. Let's say we have a dataframe with the following data:
+----------+------------+-----------------+------------+
| group | start_date | available stock | used_stock |
+----------+------------+-----------------+------------+
| group 1 | 01/12/2019 | 100 | 80 |
| group 1 | 08/12/2019 | 60 | 10 |
| group 1 | 15/12/2019 | 60 | 10 |
| group 1 | 22/12/2019 | 150 | 200 |
| group 2 | 15/12/2019 | 80 | 90 |
| group 2 | 22/12/2019 | 150 | 30 |
| group 3 | 22/12/2019 | 50 | 50 |
+----------+------------+-----------------+------------+
Steps:
sort each group by start_date, so we get something like the above data set
per group, starting from the smallest date, check if used_stock is greater than or equal to available_stock; if it is, the end date is the same as start_date
if the above condition is false, add the next date's used_stock to the current used_stock value; continue until used_stock is greater than or equal to available_stock, at which point the end date is the same as the start_date of the last added used_stock row
in case no such value is found, the end date is null
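For example, for group 1's first row (01/12/2019, available stock 100): 80 < 100, then 80 + 10 = 90 < 100, then 90 + 10 = 100 >= 100, so the end date is the start_date of that last added row, 15/12/2019.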
After applying the above steps for every row, we should get something like:
+----------+------------+-----------------+------------+------------+
| group | start_date | available stock | used_stock | end_date |
+----------+------------+-----------------+------------+------------+
| group 1 | 01/12/2019 | 100 | 80 | 15/12/2019 |
| group 1 | 08/12/2019 | 60 | 10 | 22/12/2019 |
| group 1 | 15/12/2019 | 60 | 10 | 22/12/2019 |
| group 1 | 22/12/2019 | 150 | 200 | 22/12/2019 |
| group 2 | 15/12/2019 | 80 | 90 | 15/12/2019 |
| group 2 | 22/12/2019 | 150 | 30 | null |
| group 3 | 22/12/2019 | 50 | 50 | 22/12/2019 |
+----------+------------+-----------------+------------+------------+
The above logic was prebuilt in pandas, then tweaked and applied in the Spark application as a grouped-map Pandas UDF. I want to move away from the @pandas_udf approach to a pure Spark data frame based approach, to check whether there will be any performance improvements. Would appreciate any help with this, or any improvements on the given logic that would reduce the overall execution time.
With Spark 2.4+, you can use the Spark SQL builtin function aggregate:
aggregate(array_argument, zero_expression, merge, finish)
and implement the logic in the merge and finish expressions; see below for an example:
from pyspark.sql.functions import collect_list, struct, to_date, expr
from pyspark.sql import Window
w1 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, Window.unboundedFollowing)
# SQL expression to calculate end_date using aggregate function:
end_date_expr = """
    aggregate(
        /* argument */
        data,
        /* zero expression: initialize and specify the aggregator's datatype,
           which is 'struct<end_date:date,total:double>' */
        (date(NULL) as end_date, double(0) as total),
        /* merge: use acc.total to save the accumulated sum of used_stock;
           this works similarly to Python's reduce function */
        (acc, y) ->
            IF(acc.total >= `available stock`
                , (acc.end_date as end_date, acc.total as total)
                , (y.start_date as end_date, acc.total + y.used_stock as total)
            ),
        /* finish: post-processing and retrieving only end_date */
        z -> IF(z.total >= `available stock`, z.end_date, NULL)
    )
"""
df.withColumn('start_date', to_date('start_date', 'dd/MM/yyyy')) \
.withColumn('data', collect_list(struct('start_date','used_stock')).over(w1)) \
.withColumn('end_date', expr(end_date_expr)) \
.select("group", "start_date", "`available stock`", "used_stock", "end_date") \
.show(truncate=False)
+-------+----------+---------------+----------+----------+
|group |start_date|available stock|used_stock|end_date |
+-------+----------+---------------+----------+----------+
|group 1|2019-12-01|100 |80 |2019-12-15|
|group 1|2019-12-08|60 |10 |2019-12-22|
|group 1|2019-12-15|60 |10 |2019-12-22|
|group 1|2019-12-22|150 |200 |2019-12-22|
|group 2|2019-12-15|80 |90 |2019-12-15|
|group 2|2019-12-22|150 |30 |null |
|group 3|2019-12-22|50 |50 |2019-12-22|
+-------+----------+---------------+----------+----------+
Note: this could be less efficient if many of the groups contain a large list of rows (e.g. 1000+), while most of them only need to scan a limited number of rows (e.g. fewer than 20) to find the first row satisfying the condition. In such cases, you might set up two Window specs and do the calculation in two rounds:
from pyspark.sql.functions import collect_list, struct, to_date, col, when, expr
# 1st scan: check up to the N following rows, which should cover the majority of end_dates satisfying the condition
N = 20
w2 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, N)
# 2nd scan covers the full remaining length, but only for rows whose end_date is still NULL
w1 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, Window.unboundedFollowing)
df.withColumn('start_date', to_date('start_date', 'dd/MM/yyyy')) \
.withColumn('data', collect_list(struct('start_date','used_stock')).over(w2)) \
.withColumn('end_date', expr(end_date_expr)) \
.withColumn('data',
when(col('end_date').isNull(), collect_list(struct('start_date','used_stock')).over(w1))) \
.selectExpr(
"group",
"start_date",
"`available stock`",
"used_stock",
"IF(end_date is NULL, {0}, end_date) AS end_date".format(end_date_expr)
).show(truncate=False)
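Presumably, N is a tuning knob here: choose it so the cheap first pass resolves the vast majority of rows, and only the rows whose end_date is still NULL after scanning N following rows pay for the unbounded second pass.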
I have a table T1 in Postgres which is as follows:
| event | Date_Time |
|-------|--------------------------|
| start | 2018-04-30T06:09:30.986Z |
| run | 2018-04-30T10:37:38.044Z |
| end | 2018-04-30T11:39:38.044Z |
Date_Time is in ISO format (stored as varchar), and I need to calculate the difference between each Date_Time and the next so that my output is as follows:
| event | Date_Time | Time_Difference |
|-------|--------------------------|-----------------|
| start | 2018-04-30T06:09:30.986Z | 4:28:08 |
| run | 2018-04-30T10:37:38.044Z | 1:02:00 |
| end | 2018-04-30T11:39:38.044Z | |
(10:37:38 - 06:09:30 = 4:28:08)
How can I do this using SQL?
Unrelated to the question, but: you should never store timestamp (or date or number) values in a varchar.
You first have to convert the varchar value to a timestamp. If the values are indeed formatted correctly, you can simply cast them: Date_Time::timestamp - or maybe to a timestamptz.
As far as I can tell, you want the difference to the next row in your result. This can be achieved with the window function lead():
select event,
       date_time,
       lead(date_time::timestamp) over (order by date_time::timestamp)
         - date_time::timestamp as time_difference  -- next minus current, so the difference is positive
from the_table
order by date_time;
The result of subtracting one timestamp from another is an interval, which you can format if you want (a small sketch follows).
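For instance, formatting such an interval (the HH24:MI:SS pattern is an assumption based on your desired output):
select to_char(interval '4 hours 28 minutes 8 seconds', 'HH24:MI:SS');  -- '04:28:08'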
I currently have two functions that should return the time a device started logging again, i.e. the time whose preceding row is more than 60 seconds away. These functions may work fine, but I have yet to see them finish, as they take forever. Are there any shortcuts to make this faster?
CREATE OR REPLACE FUNCTION findNextTime(startt integer)
  RETURNS integer AS
$nextTime$
DECLARE
    nextTime integer;
BEGIN
    SELECT time INTO nextTime FROM m01 WHERE time < startt ORDER BY time DESC LIMIT 1;
    RETURN nextTime;
END;
$nextTime$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION findStart()
  RETURNS integer AS
$lastTime$
DECLARE
    currentTime integer;
    lastTime integer;
BEGIN
    SELECT time INTO currentTime FROM m01 ORDER BY time DESC LIMIT 1;
    LOOP
        RAISE NOTICE 'Current Time: %', currentTime;
        SELECT findNextTime(currentTime) INTO lastTime;
        EXIT WHEN ((currentTime - lastTime) > 60);
        currentTime := lastTime;
    END LOOP;
    RETURN lastTime;
END;
$lastTime$ LANGUAGE plpgsql;
To clarify, I want to essentially find the last time there was a break of more than 60 seconds between any two rows.
CREATE TABLE IF NOT EXISTS m01 (
    time    integer,
    value   decimal,
    id      smallint,
    driveId smallint
);
Sample Data:
In this case it would return 1520376063, because the next entry (1520375766) is more than 60 seconds apart from it.
| time | value | id | driveid |
|------------|--------------------|------|---------|
| 1520376178 | 516.2 | 5116 | 2 |
| 1520376173 | 507.8 | 5116 | 2 |
| 1520376168 | 499.5 | 5116 | 2 |
| 1520376163 | 491.1 | 5116 | 2 |
| 1520376158 | 482.90000000000003 | 5116 | 2 |
| 1520376153 | 474.5 | 5116 | 2 |
| 1520376148 | 466.20000000000005 | 5116 | 2 |
| 1520376143 | 457.8 | 5116 | 2 |
| 1520376138 | 449.5 | 5116 | 2 |
| 1520376133 | 441.20000000000005 | 5116 | 2 |
| 1520376128 | 432.90000000000003 | 5116 | 2 |
| 1520376123 | 424.6 | 5116 | 2 |
| 1520376118 | 416.20000000000005 | 5116 | 2 |
| 1520376113 | 407.8 | 5116 | 2 |
| 1520376108 | 399.5 | 5116 | 2 |
| 1520376103 | 391.20000000000005 | 5116 | 2 |
| 1520376098 | 382.90000000000003 | 5116 | 2 |
| 1520376093 | 374.5 | 5116 | 2 |
| 1520376088 | 366.20000000000005 | 5116 | 2 |
| 1520376083 | 357.8 | 5116 | 2 |
| 1520376078 | 349.5 | 5116 | 2 |
| 1520376073 | 341.20000000000005 | 5116 | 2 |
| 1520376068 | 332.90000000000003 | 5116 | 2 |
| 1520376063 | 324.5 | 5116 | 2 |
| 1520375766 | 102.5 | 5116 | 2 |
This simple query should replace your two functions. Note the window function lead() in the subquery:
SELECT *
FROM (
    SELECT time, lead(time) OVER (ORDER BY time DESC) AS last_time
    FROM m01
    WHERE time < _startt
) sub
WHERE time > last_time + 60
ORDER BY time DESC
LIMIT 1;
Either way, the crucial part for performance is the right index. Ideally on (time DESC).
Assuming time is defined NOT NULL - which it probably should be, but the table definition in the question does not say so. Else you probably want ORDER BY time DESC NULLS LAST - and a matching index. See:
PostgreSQL sort by datetime asc, null first?
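For instance, a minimal sketch of such an index (the index name is made up):
CREATE INDEX m01_time_desc_idx ON m01 (time DESC);
-- or, if time can be NULL:
-- CREATE INDEX m01_time_desc_idx ON m01 (time DESC NULLS LAST);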
I expect this plpgsql function to perform faster, though, if gaps typically show up early:
CREATE OR REPLACE FUNCTION find_gap_before_time(_startt int)
  RETURNS int AS
$func$
DECLARE
   _current_time int;
   _last_time    int;
BEGIN
   FOR _last_time IN  -- single loop is enough!
      SELECT time
      FROM   m01
      WHERE  time < _startt
      ORDER  BY time DESC  -- NULLS LAST?
   LOOP
      IF _current_time > _last_time + 60 THEN  -- never true for 1st row
         RETURN _current_time;
      END IF;
      _current_time := _last_time;
   END LOOP;

   RETURN NULL;  -- explicit fallback when no gap is found; otherwise control would reach END without RETURN and raise an error
END
$func$ LANGUAGE plpgsql;
Call:
SELECT find_gap_before_time(1520376200);
Result as requested.
Aside: You'd typically save a couple of bytes per row in storage by placing the column value last or first, thereby minimizing alignment padding. Like:
CREATE TABLE m01 (
    time    integer,
    id      smallint,
    driveId smallint,
    value   decimal
);
Detailed explanation:
Calculating and saving space in PostgreSQL
dateposted is a MySQL TIMESTAMP column:
SELECT *
FROM posts
WHERE dateposted > NOW() - 604800
...SHOULD, if I am not mistaken, return rows where dateposted was in the last week. But it returns only posts less than roughly one day old. I was under the impression that TIMESTAMP used seconds?
i.e.: 7 * 3600 * 24 = 604800
Use:
WHERE dateposted BETWEEN DATE_ADD(NOW(), INTERVAL -7 DAY) AND NOW()
That is because NOW() is implicitly converted from a timestamp into a number, and MySQL's conversion rules create a number like YYYYMMDDHHMMSS.uuuuuu.
From the MySQL docs:
mysql> SELECT NOW();
-> '2007-12-15 23:50:26'
mysql> SELECT NOW() + 0;
-> 20071215235026.000000
Internally, perhaps. The way to do this is with the date math functions. So it would be:
SELECT * FROM posts WHERE dateposted > DATE_ADD(NOW(), INTERVAL -7 DAY)
I think there is also a DATE_SUB; I'm just used to using ADD everywhere.
No, you can't implicitly use integer arithmetic with TIMESTAMP, DATETIME, and other date-related data types. You're thinking of the UNIX timestamp format, which is an integer number of seconds since 1/1/1970.
You can convert SQL data types to a UNIX timestamp in MySQL and then use arithmetic:
SELECT * FROM posts WHERE UNIX_TIMESTAMP(dateposted)+604800 > NOW()+0;
NB: adding zero to NOW() makes it return a numeric value instead of a string value.
Update: Okay, I'm totally wrong with the above query. Converting NOW() to numeric output doesn't produce a number that can be compared to UNIX timestamps. It produces a number, but that number doesn't count seconds or anything else. The digits are just YYYYMMDDHHMMSS strung together.
Example:
CREATE TABLE foo (
    id SERIAL PRIMARY KEY,
    dateposted TIMESTAMP
);
INSERT INTO foo (dateposted) VALUES ('2009-12-4'), ('2009-12-11'), ('2009-12-18');
SELECT * FROM foo;
+----+---------------------+
| id | dateposted |
+----+---------------------+
| 1 | 2009-12-04 00:00:00 |
| 2 | 2009-12-11 00:00:00 |
| 3 | 2009-12-18 00:00:00 |
+----+---------------------+
SELECT *, UNIX_TIMESTAMP(dateposted) AS ut, NOW()-604800 AS wk FROM foo
+----+---------------------+------------+-----------------------+
| id | dateposted | ut | wk |
+----+---------------------+------------+-----------------------+
| 1 | 2009-12-04 00:00:00 | 1259913600 | 20091223539359.000000 |
| 2 | 2009-12-11 00:00:00 | 1260518400 | 20091223539359.000000 |
| 3 | 2009-12-18 00:00:00 | 1261123200 | 20091223539359.000000 |
+----+---------------------+------------+-----------------------+
It's clear that the numeric values are not comparable. However, UNIX_TIMESTAMP() can convert numeric values in that format just as it can convert a string representation of a timestamp:
SELECT *, UNIX_TIMESTAMP(dateposted) AS ut, UNIX_TIMESTAMP(NOW())-604800 AS wk FROM foo
+----+---------------------+------------+------------+
| id | dateposted | ut | wk |
+----+---------------------+------------+------------+
| 1 | 2009-12-04 00:00:00 | 1259913600 | 1261089774 |
| 2 | 2009-12-11 00:00:00 | 1260518400 | 1261089774 |
| 3 | 2009-12-18 00:00:00 | 1261123200 | 1261089774 |
+----+---------------------+------------+------------+
Now one can run a query with an expression comparing them:
SELECT * FROM foo WHERE UNIX_TIMESTAMP(dateposted) > UNIX_TIMESTAMP(NOW())-604800
+----+---------------------+
| id | dateposted |
+----+---------------------+
| 3 | 2009-12-18 00:00:00 |
+----+---------------------+
But the answer given by @OMGPonies is still better, because the expression in my query probably can't make use of an index. I'm just offering this as an explanation of how the TIMESTAMP and NOW() features work.
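For completeness, a hedged sketch of an index-friendly variant, keeping the indexed column bare on one side of the comparison:
SELECT * FROM posts WHERE dateposted > NOW() - INTERVAL 7 DAY;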
Try this query:
SELECT * FROM posts WHERE DATE_SUB(CURDATE(),INTERVAL 7 DAY) < dateposted;
I am assuming that you are using MySQL.