SQL Query Deduplication / Join Issue

I've been having the worst time trying to write what I feel should be a pretty simple query to deal with duplicate entries.
For context: I've created a data warehouse in BigQuery and am using Stitch to pull data from HubSpot. The load itself works as expected: I have confirmed that I have the right number of records in BigQuery.
The issue is how Stitch refreshes data. Instead of updating a record based on its object id, it appends a new row. According to their documentation, the query below should work, but it doesn't, for the simple reason that there are multiple versions of a given record with the same _sdc_sequence (which I don't think should happen). There are other _sdc fields (Stitch system fields) that I could use to help, but they are not completely reliable either, for the same reason.
SELECT DISTINCT o.*
FROM [sample-table:hubspot.companies] o
INNER JOIN (
  SELECT
    MAX(_sdc_sequence) AS seq,
    companyid
  FROM [sample-table:hubspot.companies]
  GROUP BY companyid ) oo
ON o.companyid = oo.companyid
AND o._sdc_sequence = oo.seq
The query above returns fewer results than it should. If I run the following query, I get the right number of results, but I need the other fields besides companyid like name, description, revenue, etc.
SELECT o.companyid
FROM [samples_table:hubspot.companies] o
GROUP BY o.companyid
I was trying something like this, but it doesn't work. I'm getting the following error: Expression 'oo.properties.name.value' is not present in the GROUP BY list.
SELECT o.companyid,
oo.properties.name.value,
oo.properties.hubspot_owner_id.value,
oo.properties.description.value
FROM [sample_table:hubspot.companies] o
LEFT JOIN [sample_table:hubspot.companies] oo
ON o.companyid = oo.companyid
GROUP BY o.companyid
In my mind, the way I'm thinking about this is:
Get list of unique records id (companyid)
Do a SQL "vlookup equivalent" of the raw, ungrouped company table that is sorted by insert time to get the first record that matches the id (which will be the most recent since the table is sorted)
I just don't know how to write this...

Try using window functions:
#standardSQL
SELECT c.*
FROM (SELECT c.*,
ROW_NUMBER() OVER (PARTITION BY companyid ORDER BY _sdc_sequence DESC) as seqnum
FROM `sample-table.hubspot.companies` c
) c
WHERE seqnum = 1;
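As a rough, runnable illustration of the window-function approach above, here is a minimal sketch using Python's built-in sqlite3 module instead of BigQuery. It assumes an SQLite build with window-function support (3.25 or later). The table layout and rows are made up for the example; only companyid and _sdc_sequence come from the question.

```python
import sqlite3

# Hypothetical miniature of the companies table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (companyid INTEGER, name TEXT, _sdc_sequence INTEGER)"
)
conn.executemany(
    "INSERT INTO companies VALUES (?, ?, ?)",
    [
        (1, "Acme (old)", 100),
        (1, "Acme", 200),
        (1, "Acme", 200),  # same _sdc_sequence twice, as described in the question
        (2, "Globex", 150),
    ],
)

# Number the versions of each company from newest to oldest, then keep number 1.
# ROW_NUMBER() hands out distinct numbers even when _sdc_sequence ties,
# so exactly one row per companyid survives.
latest = conn.execute(
    """
    SELECT companyid, name, _sdc_sequence
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (PARTITION BY companyid
                                  ORDER BY _sdc_sequence DESC) AS seqnum
        FROM companies c
    )
    WHERE seqnum = 1
    ORDER BY companyid
    """
).fetchall()

print(latest)  # [(1, 'Acme', 200), (2, 'Globex', 150)]
```

Which of several tied rows is kept is arbitrary; in this sketch the tied rows are identical, so the result is still deterministic.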

Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)]
FROM `sample-table.hubspot.companies` t
GROUP BY companyid

Related

How to select the row with the lowest value- oracle

I have a table where I save authors and songs, with other columns. The same song can appear multiple times, and it obviously always comes from the same author. I would like to select the author that has the least songs, including the repeated ones, aka the one that is listened to the least.
The final table should show only one author name.

Clearly, one step is to find the count for every author. This can be done with an elementary aggregate query. Then, if you order by count and you can just select the first row, this would solve your problem. One approach is to use ROWNUM in an outer query. This is a very elementary approach, quite efficient, and it works in all versions of Oracle (it doesn't use any advanced features).
select author
from (
select author
from your_table
group by author
order by count(*)
)
where rownum = 1
;
Note that in the subquery we don't need to select the count (since we don't need it in the output). We can still use it in order by in the subquery, which is all we need it for.
The only tricky part here is to remember that you need to order the rows in the subquery, and then apply the ROWNUM filter in the outer query. This is because ORDER BY is the very last thing that is processed in any query - it comes after ROWNUM is assigned to rows in the output. So, moving the WHERE clause into the subquery (and doing everything in a single query, instead of a subquery and an outer query) does not work.
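For a quick way to check the "count per author, order, take the first row" logic outside Oracle, here is a minimal sketch in Python with the built-in sqlite3 module. Note the swap: SQLite has no ROWNUM, so the sketch uses LIMIT 1, which SQLite applies after ORDER BY; that is why no wrapping subquery is needed here, unlike in the Oracle query above. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (author TEXT, song TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [
        ("Ann", "A1"), ("Ann", "A1"), ("Ann", "A2"),  # 3 rows, repeats included
        ("Bob", "B1"),                                # 1 row
        ("Cy", "C1"), ("Cy", "C2"),                   # 2 rows
    ],
)

# Count rows per author, sort ascending by that count, keep the first author.
# LIMIT stands in for Oracle's ROWNUM filter (an assumption of this sketch).
least_played = conn.execute(
    """
    SELECT author
    FROM your_table
    GROUP BY author
    ORDER BY COUNT(*)
    LIMIT 1
    """
).fetchone()[0]

print(least_played)  # Bob
```

If two authors tie for the lowest count, either one may be returned; add a second sort key if that matters.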

You can use analytical functions as follows:
Select * from
(Select t.*,
Row_number() over (partition by song order by cnt_author) as rn
From
(Select t.*,
Count(*) over (partition by author) as cnt_author
From your_table t) t ) t
Where rn = 1

Postgres - aggregate information once and add multiple properties as columns

I have a table called project and a view called downtime_report_overview. The downtime_report_overview consists of the table downtimeReport (id, startTime, stopTime, downTimeCauseId, employeeId, ...) and the joined downtimeCause.name.
Thanks to Gorden's reply (postgres - select one specfic row of another table and store it as column), I am able to include an active downtime (stopTime = null) as a column in the project query via an array aggregate with a filter. Since I might need to add more properties to the downtime_report_overview (e.g. metadata such as the username) in the near future, I was wondering if there is a way to extract the correct downtimeReport only once.
In the example below I am using the array aggregation three times, once each for id, startTime and causeName. On the one hand it seems verbose, and on the other I'm not even certain that it will select the same downtime row for all three columns.
SELECT
COUNT(downtime_report_overview."downtimeReportId") AS "downtimeReportsTotalCount",
FLOOR(date_part('epoch'::text, sum(downtime_report_overview."stopTime" - downtime_report_overview."startTime")))::integer AS "downtimeReportsTotalDurationInSeconds",
(array_agg(downtime_report_overview."downtimeReportId" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportId",
(array_agg(downtime_report_overview."startTime" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportStartTime",
(array_agg(downtime_report_overview."downtimeCauseName" ORDER BY downtime_report_overview."startTime" DESC) FILTER (WHERE downtime_report_overview."stopTime" IS null))[1] AS "activeDownTimeReportCauseName"
...

There are several ways to approach this. Obviously, you can write a separate expression for each column. Or, you can play around with manipulating an entire row as a record.
In this case, perhaps the simplest approach is to separate the aggregation and getting the row of interest. Based on the original question, the code would look like:
SELECT p.*, tt.*
FROM (SELECT p."projectId",
             count(t."timeTrackId") AS "timeTracksTotalCount",
             floor(date_part('epoch'::text, sum(t."stopTime" - t."startTime")))::integer AS "timeTracksTotalDurationInSeconds"
      FROM project p LEFT JOIN
           time_track t
           ON t."fkProjectId" = p."projectId"
      GROUP BY p."projectId"
     ) p LEFT JOIN
     (SELECT DISTINCT ON (tt."fkProjectId") tt.*
      FROM time_track tt
      WHERE tt."stopTime" IS NULL
      ORDER BY tt."fkProjectId", tt."startTime" DESC
     ) tt
     ON tt."fkProjectId" = p."projectId";
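The idea of aggregating in one subquery and fetching the single row of interest in another can be tried out with the sketch below, written in Python against the built-in sqlite3 module. SQLite has no DISTINCT ON, so the sketch substitutes ROW_NUMBER() for it (needs SQLite 3.25 or later). Times are plain integers in seconds, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE project (projectId INTEGER PRIMARY KEY);
    CREATE TABLE time_track (
        timeTrackId INTEGER PRIMARY KEY,
        fkProjectId INTEGER,
        startTime   INTEGER,   -- seconds, simplified for the sketch
        stopTime    INTEGER    -- NULL while the entry is still open
    );
    INSERT INTO project VALUES (1), (2);
    INSERT INTO time_track VALUES
        (10, 1, 0,   60),
        (11, 1, 100, 130),
        (12, 1, 200, NULL),
        (13, 2, 0,   45);
    """
)

rows = conn.execute(
    """
    SELECT agg.projectId, agg.totalCount, agg.totalSeconds,
           act.timeTrackId AS activeId, act.startTime AS activeStart
    FROM (
        -- Part 1: per-project totals, computed once.
        SELECT p.projectId                   AS projectId,
               COUNT(t.timeTrackId)          AS totalCount,
               SUM(t.stopTime - t.startTime) AS totalSeconds
        FROM project p
        LEFT JOIN time_track t ON t.fkProjectId = p.projectId
        GROUP BY p.projectId
    ) agg
    LEFT JOIN (
        -- Part 2: the newest open entry per project, selected as a whole row,
        -- so every column taken from it belongs to the same entry.
        SELECT *
        FROM (
            SELECT tt.*,
                   ROW_NUMBER() OVER (PARTITION BY fkProjectId
                                      ORDER BY startTime DESC) AS rn
            FROM time_track tt
            WHERE tt.stopTime IS NULL
        )
        WHERE rn = 1
    ) act ON act.fkProjectId = agg.projectId
    ORDER BY agg.projectId
    """
).fetchall()

print(rows)  # [(1, 3, 90, 12, 200), (2, 1, 45, None, None)]
```

Because the active entry arrives as one joined row, adding another property later means adding one column to the outer SELECT rather than another aggregate expression.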

Oracle subquery in select

I have a table that keeps costs of products. I'd like to get the average cost AND last buying invoice for each product.
My solution was creating a sub-select to get last buying invoice but unfortunately I'm getting
ORA-00904: "B"."CODPROD": invalid identifier
My query is
SELECT (b.cod_aux) product,
-- here goes code to get average cost,
(SELECT round(valorultent, 2)
FROM (SELECT valorultent
FROM pchistest
WHERE codprod = b.codprod
ORDER BY dtultent DESC)
WHERE ROWNUM = 1)
FROM pchistest a, pcembalagem b
WHERE a.codprod = b.codprod
GROUP BY a.codprod, b.cod_aux
ORDER BY b.cod_aux
In short what I'm doing on sub-select is ordering descendantly and getting the first row given the product b.codprod

Your problem is that you can't use your aliased columns deeper than one sub-query. According to the comments, this was changed in 12C, but I haven't had a chance to try it as the data warehouse that I use is still on 11g.
I would use something like this:
SELECT b.cod_aux AS product
      ,ROUND (r.valorultent, 2) AS valorultent
FROM pchistest a
JOIN pcembalagem b ON (a.codprod = b.codprod)
JOIN (SELECT valorultent
            ,codprod
            ,ROW_NUMBER() OVER (PARTITION BY codprod
                                ORDER BY dtultent DESC)
             AS row_no
      FROM pchistest) r ON (r.row_no = 1 AND r.codprod = b.codprod)
GROUP BY a.codprod, b.cod_aux, r.valorultent
ORDER BY b.cod_aux
I avoid sub-queries in SELECT statements. Most of the time, the optimizer wants to run a SELECT for each item in the cursor, OR it does some crazy nested loops. If you do it as a sub-query in the JOIN, Oracle will normally process the rows that you are joining; normally, it is more efficient. Finally, complete your per item functions (in this case, the ROUND) in the final product. This will prevent Oracle from doing it on ALL rows, not just the ones you use. It should do it correctly, but it can get confused on complex queries.
The ROW_NUMBER() OVER (PARTITION BY ..) is where the magic happens. This adds a row number to each group of CODPRODs. This allows you to pluck the top row from each CODPROD, so this allows you to get the newest/oldest/greatest/least/etc from your sub-query. It is also great for filtering duplicates.

What is the most accurate/efficient method to retrieve recently entered text without changing ROWNUM?

Please bear with me as I am new to all this. I was given an answer to a previous query which was really helpful.
Now I am attempting to complete another step of the project.
My question: I have a table, and I am attempting to retrieve the most recent notes entered against a particular line of an order.
An example of what the tables look like is below.
The data in the table below is obtained by joining two separate tables.
I used the following syntax and am able to obtain the most recent notes entered against an example order (867318).
However, in a situation where the order has multiple lines (ORDER_LINE), how can I retrieve the most recent NOTE_PAD.TEXT entered against each ORDER_LINE?
Any help would be greatly appreciated.
SQL Brand: Oracle DB pre 12c
Date format: Is in real date/time
Issue: Most accurate/efficient method to retrieve the most recently entered NOTE_PAD.TEXT without changing ROWNUM each time depending on the ORDER_NUMBER
SQL Syntax:
SELECT *
FROM ( SELECT ORDER_TYPE,
              ORDER_NUMBER,
              ORDER_LINE,
              NOTE_PAD.CREATED_DATE,
              NOTE_PAD.TEXT
       FROM ORDER_AWB INNER JOIN NOTE_PAD ON ( "ORDER_AWB"."NOTES" = "NOTE_PAD"."NOTES" )
       WHERE ( "ORDER_AWB"."ORDER_NUMBER" = '867318' )
       ORDER BY NOTE_PAD.CREATED_DATE DESC )
WHERE ROWNUM <= 1
Example of the table result when the above syntax is used: http://i.stack.imgur.com/e02YS.png
So basically this is how the entire code looks in the DB. I need to retrieve the notes against a particular type of order and its status, hence ORDER_TYPE = 'RO' and ORDER_STATUS = 'OPEN'.

If you want to handle multiple orders at the same time, then use row_number():
SELECT *
FROM (SELECT o.ORDER_TYPE, o.ORDER_NUMBER, o.ORDER_LINE,
             np.CREATED_DATE, np.TEXT,
             ROW_NUMBER() OVER (PARTITION BY o.ORDER_NUMBER
                                ORDER BY np.CREATED_DATE DESC
                               ) as seqnum
      FROM ORDER_AWB o INNER JOIN
           NOTE_PAD np
           ON o."NOTES" = np."NOTES"
     ) o
WHERE SEQNUM = 1;
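Since the question asks for the latest note per order line rather than per order, one possible variation is to partition by both ORDER_NUMBER and ORDER_LINE. The sketch below tries that in Python with the built-in sqlite3 module (SQLite 3.25 or later for window functions). It assumes that each order line carries its own NOTES key, which the question does not confirm, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ORDER_AWB (
        ORDER_TYPE TEXT, ORDER_NUMBER TEXT, ORDER_LINE INTEGER, NOTES INTEGER
    );
    CREATE TABLE NOTE_PAD (NOTES INTEGER, CREATED_DATE TEXT, "TEXT" TEXT);
    INSERT INTO ORDER_AWB VALUES
        ('RO', '867318', 1, 501),
        ('RO', '867318', 2, 502);
    INSERT INTO NOTE_PAD VALUES
        (501, '2016-01-01', 'line 1, first note'),
        (501, '2016-01-05', 'line 1, latest note'),
        (502, '2016-01-03', 'line 2, only note');
    """
)

# Restart the numbering for every (order, line) pair instead of every order,
# so seqnum = 1 marks the newest note of each line.
latest_notes = conn.execute(
    """
    SELECT order_number, order_line, note_text
    FROM (
        SELECT o.ORDER_NUMBER AS order_number,
               o.ORDER_LINE   AS order_line,
               np."TEXT"      AS note_text,
               ROW_NUMBER() OVER (PARTITION BY o.ORDER_NUMBER, o.ORDER_LINE
                                  ORDER BY np.CREATED_DATE DESC) AS seqnum
        FROM ORDER_AWB o
        INNER JOIN NOTE_PAD np ON o.NOTES = np.NOTES
    )
    WHERE seqnum = 1
    ORDER BY order_number, order_line
    """
).fetchall()

print(latest_notes)
```

A filter such as ORDER_TYPE = 'RO' would go in the inner query's WHERE clause, before the rows are numbered.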

Set-based alternative to loop in SQL Server

I know that there are several posts about how BAD it is to try to loop in SQL Server in a stored procedure. But I haven't quite found what I am trying to do. We are using data connectivity that can be linked internally directly into excel.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs who have orders right before an event of type 38,40. But only get them if there is no other order between the event and the order in the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
SELECT odate, custId INTO temp1 FROM orders WHERE odate > '5/1/12'
Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.
SELECT eventdate, temp1.custID INTO temp2
FROM LogEvent INNER JOIN temp1 ON temp1.custID = LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate > eventdate
ORDER BY eventdate DESC
The problem here is that the queries I am trying to run will return all rows for each of the customers from the first query where I only want the latest for each customer. So this is where on the client side I would loop to only get one Event instead of all the old ones. But as all the query has to run inside of Excel I can't really loop client side.
The third step could then use the results from the second query to check whether the event occurred between the most current order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
Select ordernum, shopcart.custID from shopcart right outer join temp2 on
shopcart.custID=temp2.custID where shopcart.odate >= temp2.eventdate and
ordernum is null
Is there a way to simplify this and make it set-based so that it runs in SQL Server, instead of some kind of loop that I perform on the client?

This is a great example of switching to set-based notation.
First, I combined all three of your queries into a single query. In general, having a single query lets the query optimizer do what it does best: determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.
The key is row_number(), which orders the events so that the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (select eventdate, temp1.custID,
             row_number() over (partition by temp1.custID order by eventdate desc) as seqnum
      from LogEvent inner join
           (select odate, custId
            from order
            where odate > '5/1/12'
           ) temp1
           on temp1.custID = LogEvent.custID
      where EventType in (38,40) and temp1.odate > eventdate
     ) temp2 left outer join
     ShopCart
     on shopcart.custID = temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions, even though I think "from order" should generate a syntax error. Even if it doesn't, it is bad practice to name tables and columns with reserved SQL words.

If you are using a newer version of SQL Server, then you can use the ROW_NUMBER function. Here is an example:
;WITH myCTE AS
(
SELECT
eventdate, temp1.custID,
ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking
FROM LogEvent
JOIN temp1
ON temp1.custID=LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
SELECT * into temp2 from myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK, however that will create duplicates for ties, whereas ROW_NUMBER will guarantee no duplicate numbers for your partition.
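To make the RANK versus ROW_NUMBER difference concrete, here is a small sketch in Python with the built-in sqlite3 module (SQLite 3.25 or later) rather than SQL Server. The LogEvent rows are invented, and customer 1 deliberately has two events on the same date so that the tie shows up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LogEvent (custID INTEGER, EventType INTEGER, eventdate TEXT)")
conn.executemany(
    "INSERT INTO LogEvent VALUES (?, ?, ?)",
    [
        (1, 38, "2012-04-01"),
        (1, 40, "2012-04-20"),
        (1, 38, "2012-04-20"),  # ties with the previous row on eventdate
        (2, 40, "2012-03-15"),
    ],
)


def customers_with_top_event(ranking_function):
    """Return the custID of every row ranked first within its customer."""
    # ranking_function is only ever one of the two literals used below.
    sql = f"""
        SELECT custID
        FROM (
            SELECT custID,
                   {ranking_function}() OVER (PARTITION BY custID
                                              ORDER BY eventdate DESC) AS rnk
            FROM LogEvent
            WHERE EventType IN (38, 40)
        )
        WHERE rnk = 1
        ORDER BY custID
    """
    return [row[0] for row in conn.execute(sql)]


by_row_number = customers_with_top_event("ROW_NUMBER")
by_rank = customers_with_top_event("RANK")

print(by_row_number)  # [1, 2]     one row per customer
print(by_rank)        # [1, 1, 2]  customer 1 appears twice because of the tie
```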
