SQL Question regarding fields associated with the MAX([field]) only

I'm trying to gather the entire row information associated with the MAX() of a particular field.
I essentially have several [Flight_Leg] rows for a unique [Shipment_ID], and each one has a unique [Destination_Airport], [Departure_Time], and [Arrival_Time]. Each [Shipment_ID] can have multiple [Flight_Leg] values, and each [Flight_Leg] has its own row of information.
SELECT
    [Shipment_ID],
    MAX([Flight_Leg]) AS "Final Leg",
    [Arrival_Time],
    [Destination_Airport]
FROM
    [Flight_Info]
GROUP BY
    [Shipment_ID],
    [Arrival_Time]
The output is multiple lines rather than one unique line per [Shipment_ID]. I'm just trying to isolate the FINAL flight leg's info for each shipment.

Most databases support window functions. Here's one option using row_number():
select *
from (
select *, row_number() over (partition by shipment_id order by flight_leg desc) rn
from flight_info
) t
where rn = 1
Alternatively, here's a more generic approach that joins the table back to itself:
select fi.*
from flight_info fi
join (select shipment_id, max(flight_leg) max_flight_leg
from flight_info
group by shipment_id) t on fi.shipment_id = t.shipment_id
and fi.flight_leg = t.max_flight_leg
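For illustration, here's a self-contained sketch of the row_number() version with made-up rows (the data values are hypothetical, and the inline VALUES syntax shown is PostgreSQL-style, so adjust for your database):
-- hypothetical sample data
with flight_info (shipment_id, flight_leg, departure_time, arrival_time, destination_airport) as (
    values
        (101, 1, '08:00', '10:15', 'ORD'),
        (101, 2, '11:30', '14:05', 'JFK'),
        (202, 1, '09:00', '11:45', 'LAX')
)
select *
from (
    select *, row_number() over (partition by shipment_id order by flight_leg desc) rn
    from flight_info
) t
where rn = 1;
-- returns one row per shipment_id: leg 2 (JFK) for shipment 101, leg 1 (LAX) for shipment 202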

Related

Combining the filtering of duplicate rows in a partitioned table and the query (BigQuery)

I would like to get each customer_id's last purchased item from the item purchase table. However, the table is a concatenation of distributed tables and may contain duplicate rows. Thus, I am filtering with ROW_NUMBER() = 1 [1], [2], partitioned by the log_key field.
I was wondering if there is a better way (instead of a nested query) to filter out duplicate rows with the same log_key and get the last item each user purchased.
In particular, is it possible to combine the two PARTITION BY operations?
Currently:
WITH
purchase_logs AS (
    SELECT
        basis_dt, reg_datetime, log_key,
        customer_id, customer_info_1, customer_info_2, -- customer info
        item_id, item_info_1, item_info_2 -- item info
    FROM `project.dataset.item_purchase_table`
    WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
    QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key ORDER BY reg_datetime ASC) = 1
)
SELECT *
FROM purchase_logs
QUALIFY ROW_NUMBER() OVER (PARTITION BY log_key, customer_id ORDER BY reg_datetime ASC) = 1
ORDER BY reg_datetime, customer_id
;
The query below isn't quite what I wanted (the coding format isn't consistent, since it doesn't filter on log_key first). However, it ends up combining the two PARTITION BY window operations from the prior logic. (I was hoping for some kind of HAVING-style filter while keeping the convention of filtering on log_key first.)
SELECT
    basis_dt, reg_datetime, log_key,
    customer_id, customer_info_1, customer_info_2, -- customer info
    item_id, item_info_1, item_info_2 -- item info
FROM `project.dataset.item_purchase_table`
WHERE basis_dt BETWEEN '2021-11-01' AND '2021-11-10'
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY reg_datetime ASC) = 1
;

select rows in sql with latest date from 3 tables in each group

I'm creating a PREDICATE system for my application.
Please see the image I already attached.
My question is: how can I select the rows with the latest "Taken On" date for each "QuizESId" column value? I understand how to do that when only one table is involved; I learned it from this:
select rows in sql with latest date for each ID repeated multiple times
Here is what I have already tried
SELECT tt.*
FROM myTable tt
INNER JOIN
(SELECT ID, MAX(Date) AS MaxDateTime
FROM myTable
GROUP BY ID) groupedtt ON tt.ID = groupedtt.ID
AND tt.Date = groupedtt.MaxDateTime
What I am confused about here is how to select from 3 tables. I hope you can guide me; of course, I'd like a query that performs efficiently.
Thanks
This is for SQL Server (you didn't specify exactly what RDBMS you're using):
If you want to get the "latest row for each QuizId", this sounds like you need a CTE (Common Table Expression) with a ROW_NUMBER() value - something like this (updated: you obviously want to "partition" not just by QuizId, but also by UserName):
WITH BaseData AS
(
    SELECT
        mAttempt.Id AS Id,
        mAttempt.QuizModelId AS QuizId,
        mAttempt.StartedAt AS StartsOn,
        mUser.UserName,
        mDetail.Score AS Score,
        RowNum = ROW_NUMBER() OVER (PARTITION BY mAttempt.QuizModelId, mUser.UserName
                                    ORDER BY mAttempt.TakenOn DESC)
    FROM
        UserQuizAttemptModels mAttempt
    INNER JOIN
        AspNetUsers mUser ON mAttempt.UserId = mUser.Id
    INNER JOIN
        QuizAttemptDetailModels mDetail ON mDetail.UserQuizAttemptModelId = mAttempt.Id
)
SELECT *
FROM BaseData
WHERE QuizId = 10053
AND RowNum = 1
The BaseData CTE basically selects the data (as you did), but it also adds a ROW_NUMBER() column. This will "partition" your data into groups, based on the QuizModelId and UserName, and it will number all the rows inside each group, starting at 1, ordered by the second condition - the ORDER BY clause. You said you want to order by the "Taken On" date, but there's no such date visible in your query, so I just guessed it might be on the UserQuizAttemptModels table - change and adapt as needed.
Now you can select from that CTE with your original WHERE condition, and you specify that you want only the first row for each group (for each QuizId/UserName pair) - the one with the most recent "Taken On" date value.
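As a side note: if you'd rather avoid the CTE, SQL Server can also express this "first row per group" filter with TOP (1) WITH TIES ordered by the window function. This is only a sketch reusing the same table and column names assumed above (including the guessed TakenOn column):
SELECT TOP (1) WITH TIES
    mAttempt.Id AS Id,
    mAttempt.QuizModelId AS QuizId,
    mAttempt.StartedAt AS StartsOn,
    mUser.UserName,
    mDetail.Score AS Score
FROM UserQuizAttemptModels mAttempt
INNER JOIN AspNetUsers mUser ON mAttempt.UserId = mUser.Id
INNER JOIN QuizAttemptDetailModels mDetail ON mDetail.UserQuizAttemptModelId = mAttempt.Id
WHERE mAttempt.QuizModelId = 10053   -- optional filter, as in the CTE version
ORDER BY ROW_NUMBER() OVER (PARTITION BY mAttempt.QuizModelId, mUser.UserName
                            ORDER BY mAttempt.TakenOn DESC);
-- WITH TIES keeps every row whose ROW_NUMBER() is 1, i.e. the latest attempt per quiz/user pair.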

BigQuery SQL Query Optimization

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).
The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only so I have to filter out old copies via the ROW_NUMBER() OVER line you'll see below, so that's why it exists. Seems like the standard way to deal with this quirk.)
The wrinkle with the companies table is that every single field, except for two ID ones, is of type RECORD. (See the screenshot at the bottom for an example.) It serves to keep a history of field value changes. Unfortunately, the records don't seem to be in any order, so wrapping a field - properties.first_conversion_event_name, for example - in MIN() or MAX() and grouping by companyid doesn't work.
This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):
WITH companies AS (
SELECT
o.companyid as companyid,
ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)
SELECT
companyid,
name.value as name,
industry.value as industry,
lifecyclestage.value as lifecyclestage
FROM companies
The WITH clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT()) includes. For each field I would have two columns - [field].value and [field].timestamp - and I only want the [field].value one.
Thanks in advance!
Schema Screenshot
Based on the schema you presented, and assuming your query really returns what you expect, the "optimized" version below should return the same result:
#standardSQL
WITH companies AS (
SELECT
o.companyid AS companyid,
STRUCT(o.properties.name.value, o.properties.name.timestamp) AS name,
STRUCT(o.properties.industry.value, o.properties.industry.timestamp) AS industry,
STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) AS lifecyclestage
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) AS seqnum
FROM `project.hubspot.companies` o
) o
WHERE seqnum = 1
)
SELECT
companyid,
name.value AS name,
industry.value AS industry,
lifecyclestage.value AS lifecyclestage
FROM companies
As you can see, I simply removed GROUP BY companyid, because after WHERE seqnum = 1 you already have just one row per companyid, so there is no reason to group a single row. For the very same reason I removed ARRAY_AGG(... ORDER BY ...)[SAFE_OFFSET(0)] - it just aggregated one struct and then extracted that one element out of the array, so there is no need for it.
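If you want to trim it further, BigQuery's QUALIFY clause can replace the seqnum subquery altogether. This is just a sketch under the same schema assumptions (same `project.hubspot.companies` table and fields); I haven't run it against your data:
#standardSQL
SELECT
  o.companyid,
  o.properties.name.value AS name,
  o.properties.industry.value AS industry,
  o.properties.lifecyclestage.value AS lifecyclestage
FROM `project.hubspot.companies` o
WHERE TRUE -- QUALIFY needs a WHERE, GROUP BY, or HAVING clause to be present
QUALIFY ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) = 1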

filtering out duplicate rows using max

I have a table that, for the most part, is individual users. Occasionally there is a joint user. For a joint user, all the fields in the table will be exactly the same as the primary user except for a b-score field. I want to display only one row of data per account, and use the highest b-score to decide which row to use when it is a joint account (so only the highest score is displayed).
I thought it would be a simple
SELECT DISTINCT accountNo, MAX(bscore) FROM table GROUP BY accountNo
but I'm still getting multiple rows for joints
You seem to want the ANSI-standard row_number() function:
select t.*
from (select t.*, row_number() over (partition by accountNo order by bscore desc) as seqnum
from t
) t
where seqnum = 1;
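If two joint holders could ever tie on the top bscore and you want to keep both rows, the same pattern works with rank() instead of row_number() (same placeholder table name t as above):
select t.*
from (select t.*, rank() over (partition by accountNo order by bscore desc) as seqnum
      from t
     ) t
where seqnum = 1;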
This worked for me, though it's maybe not the most efficient: a correlated sub-query. The key part is accountNo = a.accountNo.
SELECT DISTINCT a.accountNo,
       (SELECT MAX(bscore) FROM table WHERE accountNo = a.accountNo) bscore
FROM table a
GROUP BY a.accountNo

Return row data based on Distinct Column value

I'm using SQL Azure with an ASP script and, for the life of me, have had no luck trying to get this to work. The table I'm running a query on has many columns, but I want to query for distinct values on 2 columns (name and email); from there, I want it to return the entire row's values.
What my query looks like now:
SELECT DISTINCT quote_contact, quote_con_email
FROM quote_headers
WHERE quote_contact LIKE '"&query&"%'
But I need it to return the whole row so I can retrieve other data points. Had I been smart a year ago, I would have created a separate table just for the contacts, but that was a year ago and before I implemented the LiveSearch features.
One approach would be to use a CTE (Common Table Expression). With this CTE, you can partition your data by some criteria - i.e. your quote_contact and quote_con_email - and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria - i.e. probably SomeDateTimeStamp.
So try something like this:
;WITH DistinctContacts AS
(
SELECT
quote_contact, quote_con_email, (other columns here),
RowNum = ROW_NUMBER() OVER(PARTITION BY quote_contact, quote_con_email ORDER BY SomeDateTimeStamp DESC)
FROM
dbo.quote_headers
WHERE
quote_contact LIKE '"&query&"%'
)
SELECT
quote_contact, quote_con_email, (other columns here)
FROM
DistinctContacts
WHERE
RowNum = 1
Here, I am selecting only the last entry for each "partition" (i.e. for each pair of name/email) - ordered in a descending fashion by the time stamp.
Does that do what you're looking for?
You need to provide more details.
This is what I could come up with without them:
WITH dist as (
    SELECT DISTINCT quote_contact, quote_con_email,
           RANK() OVER(ORDER BY quote_contact, quote_con_email) rankID
    FROM quote_headers
    WHERE quote_contact LIKE '"&query&"%'
),
data as (
    -- rank over the same filtered rows as dist, so the rankIDs line up for the join
    SELECT *, RANK() OVER(ORDER BY quote_contact, quote_con_email) rankID
    FROM quote_headers
    WHERE quote_contact LIKE '"&query&"%'
)
SELECT * FROM dist d INNER JOIN data src ON d.rankID = src.rankID