SQL query : multiple challenges - sql

Not being an SQL expert, I am struggling with the following:
I inherited a larg-ish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly shortlived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:
phen_ID record_time producer_id consumer_id state ...
000123 10198789 start
10298776 000123 000112 hjhkk
000124 10477886 start
10577876 000124 000123 iuiii
000124 10876555 end
Each phenomenon (phen-ID) has a start event and theoretically an end event, although it might not have been occured yet and thus not recorded. Each phenomenon can then go through several states. Unfortunately, for some states, the ID is recorded in either a product or a consumer field. Also, the number of states is not fixed, and neither is the time between the states.
To beginn with, I need to create an SQL statement that for each phen-ID shows the start time and the time of the last recorded event (could be an end state or one of the intermediate states).
Just considering a single phen-ID, I managed to pull together the following SQL:
WITH myconstants (var1) as (
values ('000123')
)
select min(l.record_time), max(l.record_time) from
(select distinct * from public.phen_table JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
) as l
As the start-state always has the lowest recorded-time for the specific phenomenon, the above statement correctly returns the recorded time range as one row irrespective of what the end state is.
Obviously here I have to supply the phen-ID manually.
How can I make this work that so I get a row of the start times and maxium recorded time for each unique phen-ID? Played around with trying to fit in something like select distinct phen-id ... but was not able to "feed" them automatically into the above. Or am I completely off the mark here?
Addition:
Just to clarify, the ideal output using the table above would like something like this:
ID min-time max-time
000123 10198789 10577876 (min-time is start, max-time is state iuii)
000124 10477886 10876555 (min-time is start, max-time is end state)

union all might be an option:
select phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from (
select phen_id, record_time from phen_table
union all select producer_id, record_time from phen_table
union all select consumer_id, record_time from phen_table
) t
where phen_id is not null
group by phen_id
On the other hand, if you want prioritization, then you can use coalesce():
select coalesce(phen_id, producer_id, consumer_id) as phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from phen_table
group by coalesce(phen_id, producer_id, consumer_id)
The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null, and values differ, then the first query takes in account all non-null values, while the second considers only the "first" non-null value.
Edit
In Postgres, which you finally tagged, the union all solution can be phrased more efficiently with a lateral join:
select x.phen_id,
min(p.record_time) as min_record_time,
max(p.record_time) as max_record_time
from phen_table p
cross join lateral (values (phen_id), (producer_id), (consumer_id)) as x(phen_id)
where x.phen_id is not null
group by x.phen_id

I think you're on the right track. Try this and see if it is what you are looking for:
select
min(l.record_time)
,max(l.record_time)
,coalesce(phen_id, producer_id, consumer_id) as [Phen ID]
from public.phen_table
group by coalesce(phen_id, producer_id, consumer_id)

Related

SQL Query GROUP BY with same values over multiple columns and return SUM of the relative time value

I have pgAdmin 4.16.
The database contains a table called flights. In this table every row represents a flight. When a flight is delayed, the delay codes are used to describe the reason of the delay. Per delay code there is a time delay describing for how long the delay was. A delay can contain up to 3 delay codes with its relative delay time. I can group the delay codes per only 1 set of columns (delay code and delay time), but not over all 3 columns. Here is the script:
SELECT delay_code_1, COUNT(delay_code_1), AVG(delay_time_1), SUM(delay_time_1)
FROM flights
GROUP BY delay_code_1
ORDER BY SUM(delay_time_1) DESC
Here is the flights table:
Here is the desired result:
my sincere thanks
This problem occurs because the table is not in normal form - repeating groups should be broken out into another table. If this is your schema, you might redisign.
But, assuming you cannot change the schema, one solution would be to union three passes at the table, e.g.
SELECT delay_code, SUM(delay_time) as Total_Time
FROM
(
SELECT delay_code_1 as delay_code, delay_time_1 as delay_time
FROM flight
WHERE delay_code_1 is not null
UNION ALL
SELECT delay_code_2 as delay_code, delay_time_2 as delay_time
FROM flight
WHERE delay_code_2 is not null
UNION ALL
SELECT delay_code_3 as delay_code, delay_time_3 as delay_time
FROM flight
WHERE delay_code_3 is not null
)
GROUP BY delay_code
HTH
EDITED - as mentioned by #a_horse_with_no_name, UNION ALL should be used here. Plain UNION de-dupes the results, so Total_Time would be wrong if there were multiple delays of the same code and time.

Not contained in eaggregate function/ the GROUP BY AND max date does not show the latest date

I am stucked here and your help will be appreciated. Based on the code that you see below:
I have a sub query with some conditions. especifically, the following:
AND OwnedByTeamJ='C - O Review'
AND OwnedByTeamJ is null
I want to get the results from the subquery,
Do another select on them because all I want is the latest date listed the table. As you can see in the picture, I want to be able to extract row#3 which has the highest date and its own by the team which is null (I guess! since there is no value there).
Now the problems:
1-at first the code worked and I saw all the records! though, it didsn't pick the latest date
2- sudennly it stopped working and saw this error:
Column 'tt.IncidentID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
select
distinct max(LastModifiedDateTimeJ),
incidentID,
EffecRequestStatus,
OwnedByTeamJ
From (
select
EffecRequestStatus,
IncidentID,
LastModifiedDateTimeJ,
OwnedByTeamJ,
DetailsJ,
Status,
OwnedByTeam
from IncidentView
where
CAST(CreatedDateTime as DATE) >='05-01-2019'
AND JournalTypeName like '%Journal - Note%'
AND OwnedByTeamJ='C - O Review'
AND OwnedByTeamJ is null
group by
EffecRequestStatus,
IncidentID,
LastModifiedDateTimeJ,
OwnedByTeamJ,
DetailsJ,
Status,
OwnedByTeam
) as tt
where
tt. RequestStatus not in ('Submission','P-C submission','C Review')
A sample of data look like the following but my table has more columns than that is listed here:
1) These conditions should not coexist:
AND OwnedByTeamJ='C - O Review'
AND OwnedByTeamJ is null
...because these two things are never true at the same time, and you will get an empty result-set. I guess what you want is
AND (OwnedByTeamJ='C - O Review' or OwnedByTeamJ is null)
2) The error you mention does not seem to agree with the query you've shown us. Did you try to make another group by externally?
3) It doesn't make sense to use distinct together with any aggregation function like max,sum, count etc.
4) As long as you have included EffecRequestStatus within the group by, you will get 3 rows, not 1, because all 3 rows have a different value for EffecRequestStatus. You will have to remove it from the group by, and therefore from the select as well, if you only want to see one row.
Try this, if you want to have a not empty result:
AND (OwnedByTeamJ='C - O Review'
OR OwnedByTeamJ is null)

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level since window functions are applied after the WHERE clause. So you need at least one CTE of sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
#Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs #Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

Subqueries and AVG() on a subtraction

Working on a query to return the average time from when an employee begins his/her shift and then arrives at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME, //avg(first arrival time)
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew,
WHERE l.LOCAL_OFFICE_ID = crew1.LOCAL_OFFICE_ID
You can see the AVG() command is commented out, because I know the time that they arrive at work, and the time they get to the first house, and can find the value using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the the AVG() parentheses? Just trying to learn the best methods to create queries. If you want more info on any of the tables, etc. just ask, but hopefully they're all named so you know what they're returning.
As per my comment, the example you gave would only return one record to the AVG function, and so not do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: Although this looks as though the inline view will calculate a lod of values that are not needed (and so be inefficient), most RDBMS optimise this so only the required records get processes. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)
Don't think of subqueries: they're often quite slow. In effect, they are row by row (RBAR) operations rather than set based
join all the table together
I've used a derived table to calculate the 1st arrival time
Aggregate
Soemthing like
SELECT
l.OFFICE_NAME, crew.EMPLOYEE_NAME,
AVG(os.minARRIVE - cs.START_TIME)
FROM
LOCAL_OFFICE l
JOIN
CREW_WORK_SCHEDULE crew On l.LOCAL_OFFICE_ID = crew1.LOCAL_OFFICE_ID
JOIN
CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
(SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
FROM ORDER_STATUS
GROUP BY USER_ID
) os ON oc.USER_ID = crew.CREW_ID
GROUP B
l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info from ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time"
Edit:
This will give you average minutes
You can add this back to minARRIVE using DATEADD, or change to hh:mm with some %60 (modul0) and /60 (integer divide
AVG(
DATEDIFF(minute, os.minARRIVE, os.minARRIVE)
)

ms-access: runtime error 3354

i'm having a problem running an sql in ms-access. im using this code:
SELECT readings_miu_id, ReadDate, ReadTime, RSSI, Firmware, Active, OriginCol, ColID, Ownage, SiteID, PremID, prem_group1, prem_group2
INTO analyzedCopy2
FROM analyzedCopy AS A
WHERE ReadTime = (SELECT TOP 1 analyzedCopy.ReadTime FROM analyzedCopy WHERE analyzedCopy.readings_miu_id = A.readings_miu_id AND analyzedCopy.ReadDate = A.ReadDate ORDER BY analyzedCopy.readings_miu_id, analyzedCopy.ReadDate, analyzedCopy.ReadTime)
ORDER BY A.readings_miu_id, A.ReadDate ;
and before this i'm filling in the analyzedCopy table from other tables given certain criteria. for one set of criteria this code works just fine but for others it keeps giving me runtime error '3354'. the only diference i can see is that with the criteria that works, the table is around 4145 records long where as with the criteria that doesn't work the table that im using this code on is over 9000 records long. any suggestions?
is there any way to tell it to only pull half of the information and then run the same select string on the other half of the table im pulling from and add those results to the previous results from the first half?
The full text for run-time error '3354' is that it is "At most one record can be returned by this subquery."
I just tried to run this query on the first 4000 records and it failed again with the same error code so it can't be the ammount of records i would think.
See this:
http://allenbrowne.com/subquery-02.html#AtMostOneRecord
What is happening is your subquery is returning two identical records (based on the ORDER BY) and the TOP 1 actually returns two records (yes that's how access does the TOP statement). You need to add fields to the ORDER BY to make it unique - preferable an unique ID (you do have an unique PK don't you?)
As Andomar below stated DISTINCT TOP 1 will work as well.
What does MS-ACCESS return when you run the subquery?
SELECT TOP 1 analyzedCopy.ReadTime
FROM analyzedCopy
WHERE analyzedCopy.readings_miu_id = A.readings_miu_id
AND analyzedCopy.ReadDate = A.ReadDate
ORDER BY analyzedCopy.readings_miu_id, analyzedCopy.ReadDate,
analyzedCopy.ReadTime
If it returns multiple rows, maybe it can be fixed with DISTINCT:
SELECT DISTINCT TOP 1 analyzedCopy.ReadTime
FROM ... rest of query ...
I don't know if this would work or not (and I no longer have a copy of Access to test on), so I apologize up front if I'm way off.
First, just do a select on the primary key of analyzedCopy to get the mid-point ID. Something like:
SELECT TOP 4500 readings_miu_id FROM analyzedCopy ORDER BY readings_miu_id, ReadDate;
Then, when you have the mid-point ID, you can add that to the WHERE statement of your original statement:
SELECT ...
INTO ...
FROM ...
WHERE ... AND (readings_miu_id <= {ID from above}
ORDER BY ...
Then SELECT the other half:
SELECT ...
INTO ...
FROM ...
WHERE ... AND (readings_miu_id > {ID from above}
ORDER BY ...
Again, sorry if I'm way off.