Postgres - aggregate information once and add add multiple properties as columns - sql

I have a table called project and a view called downtime_report_overview. The downtime_report_overview consists of the table downtimeReport (id, startTime, stopTime, downTimeCauseId, employeeId, ...) and the joined downtimeCause.name.
Thanks to Gorden's reply (postgres - select one specfic row of another table and store it as column), I am able to include an active downtime (stopTime = null) via an array aggregate and filter as column to the project query. Since I might need to more properties to the downtime_report_overview (e.g. meta data like username) in the near future I was wondering if is a way where I can extract the correct downtimeReport only once.
In the example below I using the array aggregation 3 times, once id, startTime and causeName. It seems verbose on the one hand and on the other I'm not even certain that it will select the correct downTime row for all 3 columns.
SELECT
COUNT(downtime_report_overview."downtimeReportId") AS "downtimeReportsTotalCount",
FLOOR(date_part('epoch'::text, sum(downtime_report_overview."stopTime" - downtime_report_overview."startTime")))::integer AS "downtimeReportsTotalDurationInSeconds",
(array_agg(downtime_report_overview."downtimeReportId" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportId",
(array_agg(downtime_report_overview."startTime" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportStartTime",
(array_agg(downtime_report_overview."downtimeCauseName" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportCauseName"
...

There are several ways to approach this. Obviously, you can write a separate expression for each column. Or, you can play around with manipulating an entire row as a record.
In this case, perhaps the simplest approach is to separate the aggregation and getting the row of interest. Based on the original question, the code would look like:
SELECT p.*, tt.*
FROM (SELECT p."projectID"
count(t."timeTrackId") as "timeTracksTotalCount",
floor(date_part('epoch'::text, sum(t."stopTime" - t."startTime")))::integer AS "timeTracksTotalDurationInSeconds"
FROM project p LEFT JOIN
time_track t
ON t."fkProjectId" = p."projectId"
GROUP BY p."projectID"
) p LEFT JOIN
(SELECT DISTINCT ON (t."fkProjectId") tt.*
FROM time_track tt
WHERE t."stopTime" is null
ORDER BY t."fkProjectId", t."startTime" desc
) tt
ON tt."fkProjectId" = p."projectId";

Related

SQL Query for multiple columns with one column distinct

I've spent an inordinate amount of time this morning trying to Google what I thought would be a simple thing. I need to set up an SQL query that selects multiple columns, but only returns one instance if one of the columns (let's call it case_number) returns duplicate rows.
select case_number, name, date_entered from ticket order by date_entered
There are rows in the ticket table that have duplicate case_number, so I want to eliminate those duplicate rows from the results and only show one instance of them. If I use "select distinct case_number, name, date_entered" it applies the distinct operator to all three fields, instead of just the case_number field. I need that logic to apply to only the case_number field and not all three. If I use "group by case_number having count (*)>1" then it returns only the duplicates, which I don't want.
Any ideas on what to do here are appreciated, thank you so much!
You can use ROW_NUMBER(). For example
select *
from (
select *,
row_number() over(partition by case_number) as rn
) x
where rn = 1
The query above will pseudo-randomly pick one row for each case_number. If you want a better selection criteria you can add ORDER BY or window frames to the OVER clause.

Get most recent record from Right table with sub query

When I join to the right table I am getting way too many duplicates. I am trying to grab the most recent record from the right table however, it does not matter what I try it does not work.
So Far I have tried:
PROC SQL;
CREATE TABLE fs1.sample AS
SELECT A.*,
B.xx1,
max(B.time_s)
FROM lx1.results a left join (Select Distinct C.id, c.per FROM lx2.results c
Where c.id = a.id
and COMPGED(a.txt1, c.txt1,'i') < 100
and c.dt > a.dt
and c.ksv = 37
and datepart(c.lsg) >= '12DEC2020'd ) b
ON a.id = b.id
group by a.id, a.txt1
QUIT;
Unfortunately, I get an error. I also tried using case when exists, but that takes way too long. Essentially I am trying to grab the most recent record from the right table based on time_s. I also want to make sure the record I grab from the right table somewhat matches a.txt1.
Cheers
When you perform a join, you attach all records from the table that match your join conditions.
If the table is indexed appropriately, a subquery could achieve the goal of obtaining the most recent value, however, if the query uses the wrong index, TOP or equivalent functions may return the wrong result.
There are a number of ways to accomplish the task of retrieving the most recent record but they are contingent on a couple of things.
Firstly, you need to be able to identify what the most recent row is, usually by a column called CreatedDate or something similar against the IDs. (You should know what that business logic is, it may be that the table is chronologically entered [as most tables are] and therefore, SubID might be a thing. We're going to assume it is CreatedDate.)
Secondly, you need to rank the rows in terms of the CreatedDate in a descending order so that the newest matching ID is ranked 1.
Finally, you filter your results by 1 to return the newest result, but you could also filter by <= x if you are interested in the top x newest return results per ID.
To use more mathematical language: We are deriving a value from the CreatedDate and ID values and then using that derivative value to sort and filter the data. In this case we are deriving the RowNumber from the CreatedDate in descending order for each ID.
In order to accomplish this, you can use the Windowed Function ROW_NUMBER(),
ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
This windowed function will return a row value for each ID relative to the CreatedDate in descending order, where the newest created date is equal to 1.
You can then put brackets around the whole query to make it into a table so you will be able to filter the results of that Windowed Function.
SELECT id, txt
(SELECT id, txt
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
FROM SourceTable) A
WHERE RankID = 1
This should achieve your goal of returning the "newest result".
What ever your column is that determines the age of the data relative to the ID, it can be multiple, should be placed within the ORDER BY.
In order to make this query perform faster, you should index your data appropriately, whereby ID is the the first column, and CreatedDate Desc is your next column. This means your system will not have to perform a costly sort every time this runs, but that depends on whether you plan on using this query often and how much overhead it is grabbing.

How to select the row with the lowest value- oracle

I have a table where I save authors and songs, with other columns. The same song can appear multiple times, and it obviously always comes from the same author. I would like to select the author that has the least songs, including the repeated ones, aka the one that is listened to the least.
The final table should show only one author name.
Clearly, one step is to find the count for every author. This can be done with an elementary aggregate query. Then, if you order by count and you can just select the first row, this would solve your problem. One approach is to use ROWNUM in an outer query. This is a very elementary approach, quite efficient, and it works in all versions of Oracle (it doesn't use any advanced features).
select author
from (
select author
from your_table
group by author
order by count(*)
)
where rownum = 1
;
Note that in the subquery we don't need to select the count (since we don't need it in the output). We can still use it in order by in the subquery, which is all we need it for.
The only tricky part here is to remember that you need to order the rows in the subquery, and then apply the ROWNUM filter in the outer query. This is because ORDER BY is the very last thing that is processed in any query - it comes after ROWNUM is assigned to rows in the output. So, moving the WHERE clause into the subquery (and doing everything in a single query, instead of a subquery and an outer query) does not work.
You can use analytical functions as follows:
Select * from
(Select t.*,
Row_number() over (partition by song order by cnt_author) as rn
From
(Select t.*,
Count(*) over (partition by author) as cnt_author
From your_table t) t ) t
Where rn = 1

SQL Query Deduplication / Join Issue

I've been having the worst time trying to write what I feel should be a pretty simple query to deal with duplicate entries.
For context: I've created a data warehouse using Big Query and am using Stitch to pull data from Hubspot. Everything works as expected as in: I have confirmed that I have the right number of records in BigQuery.
The issue comes into how Stitch refreshes data. Instead of updating records based on object id, it appends a new row. According to their documentation, the query below should work, but it doesn't for the simple reason that there exist multiple versions of a given record with the same _sdc_sequence (which I don't think should exist). There are other _sdc (stitch system fields) that I can use to help, but it's also not completely reliable for the same reasons as above.
SELECT DISTINCT o.*
FROM [sample-table:hubspot.companies] o
INNER JOIN (
SELECT
MAX(_sdc_sequence) AS seq,
id
FROM [sample-table:hubspot.companies]
GROUP BY companyid ) oo
ON o.companyid = oo.companyid
AND o._sdc_sequence = oo.seq
The query above returns fewer results than it should. If I run the following query, I get the right number of results, but I need the other fields besides companyid like name, description, revenue, etc.
SELECT o.companyid
FROM [samples_table:hubspot.companies] o
GROUP BY o.companyid
I was trying something like this, but it doesn't work (I'm getting the following error (Expression 'oo.properties.name.value' is not present in the GROUP BY list).
SELECT o.companyid,
oo.properties.name.value,
oo.properties.hubspot_owner_id.value,
oo.properties.description.value
FROM [sample_table:hubspot.companies] o
LEFT JOIN [sample_table:hubspot.companies] oo
ON o.companyid = oo.companyid
GROUP BY o.companyid
I'm my mind, the way that I'm thinking about this is:
Get list of unique records id (companyid)
Do a SQL "vlookup equivalent" of the raw, ungrouped company table that is sorted by insert time to get the first record that matches the id (which will be the most recent since the table is sorted)
I just don't know how to write this...
Try using window functions:
#standardSQL
SELECT c.*
FROM (SELECT c.*,
ROW_NUMBER() OVER (PARTITION BY companyid ORDER BY _sdc_sequence DESC) as seqnum
FROM `sample-table.hubspot.companies` c
) c
WHERE seqnum = 1;
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)]
FROM `sample-table.hubspot.companies` t
GROUP BY companyid

Need (Teradata) SQL Help On Selecting Specific Rows

I have a query that pulls the following table, but what I'm really interested in grabbing are the highlighted rows that my results are generating. I was trying to write a case statement within the query, but I realized that I'm omitting some of the grp_mkt records I'm trying to keep. Logic is essentially is i want records of segments not in grp_mkt AND segments if you are not in grp_mkt. I can probably do a join of the same query to find this but the tables are massive (impression level data) that I'd rather not try to pull the tables again.
It seems that you want to pull segments once, with the prioritization to any market but grp_market. Here is one way:
with t as (
select t.*,
row_number() over (partition by segment order by (case when market = grp_mkt then 1 else 2 end) desc) as seqnum
from <your query/table here> t
)
select t.*
from t
where seqnum = 1;
If you could have multiple segments (non-grpmkt) and you want all of them, then using rank() instead of row_number() would work.