How to select DISTINCT results in an SQL JOIN query - sql

This is my query; please check it. It executes successfully, but DISTINCT is not working:
SELECT
DISTINCT(ticket_message.ticket_id),
support_ticket.user_id,
support_ticket.priority,
support_ticket.subject,
support_ticket.status,
ticket_message.message
FROM
support_ticket
LEFT OUTER JOIN ticket_message ON support_ticket.ticket_id = ticket_message.ticket_id
LEFT OUTER JOIN assign_ticket ON ticket_message.ticket_id = assign_ticket.ticket_id

The word distinct is a modifier to the keyword SELECT. So you need to think of it as SELECT DISTINCT and it ALWAYS operates across the entire row. It simply ignores the parentheses seen in the following:
select distinct(ticket_message.ticket_id)
because distinct is NOT a function.
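As a minimal sketch of the point above, the following uses Python's built-in sqlite3 module with a made-up ticket_message table (the sample rows are assumptions, and SQLite merely stands in for whatever database you use). It suggests that wrapping one column in parentheses after DISTINCT changes nothing: duplicates are still judged on the whole row.

```python
import sqlite3


def distinct_demo():
    """Compare SELECT DISTINCT(col), other with SELECT DISTINCT col, other."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: ticket 1 has a duplicated 'second' message.
    conn.executescript("""
        CREATE TABLE ticket_message (ticket_id INTEGER, message TEXT);
        INSERT INTO ticket_message VALUES
            (1, 'first'), (1, 'second'), (1, 'second'), (2, 'hello');
    """)
    with_parens = conn.execute(
        "SELECT DISTINCT(ticket_id), message FROM ticket_message ORDER BY 1, 2"
    ).fetchall()
    without_parens = conn.execute(
        "SELECT DISTINCT ticket_id, message FROM ticket_message ORDER BY 1, 2"
    ).fetchall()
    conn.close()
    return with_parens, without_parens


if __name__ == "__main__":
    a, b = distinct_demo()
    print(a)
    print(a == b)
```

Ticket 1 still appears twice, because its two remaining rows differ in the message column; only the exact duplicate row is removed.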
So. What we appear to have is a support ticket with associated messages. There are usually multiple messages per support ticket, so I suspect what you want is more complex. For example you might want just the most recent message for each support ticket.
To get the most recent message we need a timestamp (or "datetime") column, and we also need to know whether your database supports "window functions". Assuming you have a timestamp column called message_at and your database does support window functions, this would reduce the number of rows:
SELECT
support_ticket.ticket_id
, support_ticket.user_id
, support_ticket.support_section
, support_ticket.priority
, support_ticket.subject
, support_ticket.status
, tm.file
, tm.message
, assign_ticket.section_id
, assign_ticket.section_admin_id
FROM support_ticket
LEFT OUTER JOIN (
SELECT
ticket_id
, file
, message
, ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY message_at DESC) AS row_num
FROM ticket_message
) tm ON support_ticket.ticket_id = tm.ticket_id
AND tm.row_num = 1
LEFT OUTER JOIN assign_ticket ON tm.ticket_id = assign_ticket.ticket_id
ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY message_at DESC) assigns the number 1 to the most recent message, and later we ignore all rows that are > 1 thus removing unwanted repetition in the results.
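To see the row-numbering idea in action, here is a small sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later). The table contents and the cut-down column list are assumptions made for illustration; message_at is the assumed timestamp column from the answer.

```python
import sqlite3


def latest_message_per_ticket():
    """Return one row per ticket, carrying only its most recent message."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: ticket 1 has two messages, ticket 3 has none.
    conn.executescript("""
        CREATE TABLE support_ticket (ticket_id INTEGER, subject TEXT);
        CREATE TABLE ticket_message (
            ticket_id INTEGER, message TEXT, message_at TEXT);
        INSERT INTO support_ticket VALUES
            (1, 'login'), (2, 'billing'), (3, 'empty');
        INSERT INTO ticket_message VALUES
            (1, 'old',  '2024-01-01 09:00:00'),
            (1, 'new',  '2024-01-02 09:00:00'),
            (2, 'only', '2024-01-03 09:00:00');
    """)
    rows = conn.execute("""
        SELECT st.ticket_id, st.subject, tm.message
        FROM support_ticket st
        LEFT OUTER JOIN (
            SELECT ticket_id, message,
                   ROW_NUMBER() OVER (PARTITION BY ticket_id
                                      ORDER BY message_at DESC) AS row_num
            FROM ticket_message
        ) tm ON st.ticket_id = tm.ticket_id
            AND tm.row_num = 1
        ORDER BY st.ticket_id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_message_per_ticket():
        print(row)
```

Because the row_num = 1 condition sits in the ON clause rather than in a WHERE clause, a ticket without any messages is still returned, with NULL in the message column.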
So.
We really need to know much more about your actual data, the database (and version) you are using and your real needs. It is almost certain that select distinct is NOT the right technique for what you are trying to achieve.
I suggest you read these: Provide a Minimal Complete Verifiable Example (MCVE)
and Why should I provide a MCVE

Use this statement:
SELECT DISTINCT
ticket_message.ticket_id
FROM
support_ticket
LEFT OUTER JOIN ticket_message ON
support_ticket.ticket_id = ticket_message.ticket_id
LEFT OUTER JOIN assign_ticket ON
ticket_message.ticket_id = assign_ticket.ticket_id
As soon as you add more columns to your query, DISTINCT takes them into account as well.

Related

Sum fields of an Inner join

How can I add two fields that belong to an inner join?
I have this code:
select
SUM(ACT.NumberOfPlants ) AS NumberOfPlants,
SUM(ACT.NumOfJornales) AS NumberOfJornals
FROM dbo.AGRMastPlanPerformance MPR (NOLOCK)
INNER JOIN GENRegion GR ON (GR.intGENRegionKey = MPR.intGENRegionLink )
INNER JOIN AGRDetPlanPerformance DPR (NOLOCK) ON
(DPR.intAGRMastPlanPerformanceLink =
MPR.intAGRMastPlanPerformanceKey)
INNER JOIN vwGENPredios P (NOLOCK) ON ( DPR.intGENPredioLink =
P.intGENPredioKey )
INNER JOIN AGRSubActivity SA (NOLOCK) ON (SA.intAGRSubActivityKey =
DPR.intAGRSubActivityLink)
LEFT JOIN (SELECT RA.intGENPredioLink, AR.intAGRActividadLink,
AR.intAGRSubActividadLink, SUM(AR.decNoPlantas) AS
intPlantasTrabajads, SUM(AR.decNoPersonas) AS NumOfJornales,
SUM(AR.decNoPlants) AS NumberOfPlants
FROM AGRRecordActivity RA WITH (NOLOCK)
INNER JOIN AGRActividadRealizada AR WITH (NOLOCK) ON
(AR.intAGRRegistroActividadLink = RA.intAGRRegistroActividadKey AND
AR.bitActivo = 1)
INNER JOIN AGRSubActividad SA (NOLOCK) ON (SA.intAGRSubActividadKey
= AR.intAGRSubActividadLink AND SA.bitEnabled = 1)
WHERE RA.bitActive = 1 AND
AR.bitActive = 1 AND
RA.intAGRTractorsCrewsLink IN(2)
GROUP BY RA.intGENPredioLink,
AR.decNoPersons,
AR.decNoPlants,
AR.intAGRAActivityLink,
AR.intAGRSubActividadLink) ACT ON (ACT.intGENPredioLink IN(
DPR.intGENPredioLink) AND
ACT.intAGRAActivityLink IN( DPR.intAGRAActivityLink) AND
ACT.intAGRSubActivityLink IN( DPR.intAGRSubActivityLink))
WHERE
MPR.intAGRMastPlanPerformanceKey IN(4) AND
DPR.intAGRSubActivityLink IN( 1153)
GROUP BY
P.vchRegion,
ACT.NumberOfFloors,
ACT.NumOfJournals
ORDER BY ACT.NumberOfFloors DESC
However, it does not perform the complete sum. It only retrieves all the values of the columns and adds them 1 by 1, instead of doing the complete sum of the whole column.
For example, the query returns these results:
What I expect is the final sums. In NumberOfPlants the result of the sum would be 163,237 and of NumberJornales would be 61.
How can I do this?
First of all the (nolock) hints are probably not accomplishing the benefit you hope for. It's not an automatic "go faster" option, and if such an option existed you can be sure it would be already enabled. It can help in some situations, but the way it works allows the possibility of reading stale data, and the situations where it's likely to make any improvement are the same situations where risk for stale data is the highest.
That out of the way, with that much code in the question we're better served with a general explanation and solution for you to adapt.
The issue here is GROUP BY. When you use a GROUP BY in SQL, you're telling the database you want to see separate results per group for any aggregate functions like SUM() (and COUNT(), AVG(), MAX(), etc).
So if you have this:
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
You will get a separate row per ColumnA group, even though it's not in the SELECT list.
If you don't really care about that, you can do one of two things:
1. Remove the GROUP BY. If there are no grouped columns in the SELECT list, the GROUP BY clause is probably not accomplishing anything important.
2. Nest the query. If option 1 is somehow not possible (say, the original is actually a view) you could do this:
SELECT SUM(SumB)
FROM (
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
) t
Note in both cases any JOIN is irrelevant to the issue.
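The effect of GROUP BY on SUM() can be checked with a tiny sketch like the one below. It uses Python's sqlite3 module and a made-up table t with the ColumnA and ColumnB names from the explanation; the three sample rows are assumptions.

```python
import sqlite3


def sum_demo():
    """Contrast per-group sums with the two ways of getting a grand total."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: two rows in group 'x', one row in group 'y'.
    conn.executescript("""
        CREATE TABLE t (ColumnA TEXT, ColumnB INTEGER);
        INSERT INTO t VALUES ('x', 10), ('x', 5), ('y', 7);
    """)
    # One result row per ColumnA group, even though ColumnA is not selected.
    per_group = conn.execute(
        "SELECT SUM(ColumnB) AS SumB FROM t GROUP BY ColumnA ORDER BY ColumnA"
    ).fetchall()
    # Option 1: drop the GROUP BY to get a single total.
    no_group = conn.execute("SELECT SUM(ColumnB) AS SumB FROM t").fetchall()
    # Option 2: keep the grouped query and sum its results in an outer query.
    nested = conn.execute("""
        SELECT SUM(SumB)
        FROM (SELECT SUM(ColumnB) AS SumB FROM t GROUP BY ColumnA) s
    """).fetchall()
    conn.close()
    return per_group, no_group, nested


if __name__ == "__main__":
    print(sum_demo())
```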

SQL Join from Two Tables showing only maximum date in one table

I have two tables, visits and encounters. Each visit by a student may have several encounters, at different times. I would like a query with visitid, encounterid, and encounterdate showing ONLY the latest encounter for each visit. My results MUST include visits with no encounters.
My tables:
Visits
visit_id
studenti_id
Encounters
encounter_id
visit_id
encounter_datetime
What I have tried
select
Visits.visit_id,
Encounters.encounter_id,
Encounters.encounter_datetime
FRom Visits
LEFT OUTER JOIN Encounters
ON Visits.visit_id = Encounters.visit_id
INNER JOIN (
select Encounters.visit_id, MAX(Encounters.encounter_datetime)as Latest
from Encounters
group by Encounters.visit_id
) as NewEncounters
ON Encounters.visit_id = NewEncounters.visit_id
AND Encounters.encounter_datetime = NewEncounters.Latest
This returns the results I want, HOWEVER, Visits without encounters are not in the results.
I actually don't see a clean way to salvage your direct join approach, but if your database supports ROW_NUMBER, it is an easy option:
WITH cte AS (
SELECT v.visit_id, e.encounter_id, e.encounter_datetime,
ROW_NUMBER() OVER (PARTITION BY v.visit_id ORDER BY e.encounter_datetime DESC) rn
FROM Visits v
LEFT JOIN Encounters e ON v.visit_id = e.visit_id
)
SELECT visit_id, encounter_id, encounter_datetime
FROM cte
WHERE rn = 1;
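A quick way to convince yourself that the CTE keeps visits without encounters is a sketch like this one, run against SQLite (3.25 or later) through Python's sqlite3 module. The sample rows are assumptions, and the Visits table is reduced to its key column.

```python
import sqlite3


def latest_encounter_per_visit():
    """Return the latest encounter per visit, keeping visits with none."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: visit 1 has two encounters, visit 2 has none.
    conn.executescript("""
        CREATE TABLE Visits (visit_id INTEGER);
        CREATE TABLE Encounters (
            encounter_id INTEGER, visit_id INTEGER, encounter_datetime TEXT);
        INSERT INTO Visits VALUES (1), (2);
        INSERT INTO Encounters VALUES
            (11, 1, '2023-05-01 10:00:00'),
            (12, 1, '2023-05-01 14:30:00');
    """)
    rows = conn.execute("""
        WITH cte AS (
            SELECT v.visit_id, e.encounter_id, e.encounter_datetime,
                   ROW_NUMBER() OVER (PARTITION BY v.visit_id
                                      ORDER BY e.encounter_datetime DESC) AS rn
            FROM Visits v
            LEFT JOIN Encounters e ON v.visit_id = e.visit_id
        )
        SELECT visit_id, encounter_id, encounter_datetime
        FROM cte
        WHERE rn = 1
        ORDER BY visit_id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_encounter_per_visit():
        print(row)
```

The row numbering happens after the LEFT JOIN, so a visit with no encounters contributes a single all-NULL encounter row that receives rn = 1 and survives the final filter.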
For the problem of getting the max of several dates, here is an (untested, sorry!) code example that at least shows the line of approach.
Select
Visits.visit_id,
a.encounter_id,
max(a.encounter_datetime) as Max_Datetime
FRom Visits
LEFT OUTER JOIN Encounters a
ON Visits.visit_id = a.visit_id
inner join
Encounters b
on a.visit_id=b.visit_id
and
a.encounter_datetime<=b.encounter_datetime
group by
Visits.visit_id,
a.encounter_id,
a.encounter_datetime;
For visits without encounters you can union a query with a where clause using Is Null.
Maybe your database needs some syntactic fumbling with ; etc.

How do I get all values for Store_ID pulled into my Snowflake Query?

I have a query below and am trying to get all the week_id's, upc_id's and upc_dsc's pulled in even if there is no net_amt or item_qty for them. I'm successfully pulling in all stores, but I also want to show a upc and week id for these stores so that they can see if they have 0 sales for a upc. I tried doing a right join with my date table under the right join of the s table as well as a right join for the upc table, but it messes up my data and doesn't pull in the columns I need. Does anyone know how to fix this?
Thank you
select
a.week_id
, s.district_cd
, s.store_id
, a.upc_id
, a.upc_dsc
, sum(a.net_amt) as current_week_sales
, sum(t.net_amt) as last_week_sales
, sum(a.item_qty) as current_week_units
, sum(t.item_qty) as last_week_units
from (
select
week_id
, district_cd
, str.store_id
, txn.upc_id
, upc_dsc
, dense_rank() over (order by week_id) as rank
, sum(txn.net_amt) as net_amt
, sum(txn.item_qty) as item_qty
from dw_dss.txn_facts txn
left join dw_dss.lu_store_finance_om STR
on str.store_id = txn.store_id
join dw_dss.lu_upc upc
on upc.upc_id = txn.upc_id
join lu_day_merge day
on day.d_date = txn.txn_dte
where district_cd in (72,73)
and txn.upc_id in (27610100000
,27610200000
,27610300000
,27610400000
)
and division_id = 19
and txn_dte between '2021-07-25' and current_date - 1
group by 1,2,3,4,5
) a
left join temp_tables.ab_week_ago t
on t.rank = a.rank and a.store_id = t.store_id and a.upc_id = t.upc_id
right join dw_dss.lu_store_finance_om s
on s.store_id = a.store_id
where s.division_id = 19
and s.district_cd in (72,73)
group by 1,2,3,4,5
As stated in a previous comment, the example is too long to debug, especially since the source tables are not provided.
However, as a general rule, when adding zeroes for missing dimensions, I follow these steps:
1. Construct the main query. This is the query with all the complexity that pulls the data you need: just the available data, without missing dimensions. Test this query to make sure it gives correct results, aggregated correctly by each dimension.
2. Then use this query as a CTE in a WITH statement, and right join to it all dimensions for which you want to add zero values for missing data.
3. Be sure to double-check filtering on the dimensions to ensure that you don't filter out too much in your WHERE conditions. For example, instead of filtering with WHERE on the final query, like in your example:
right join dw_dss.lu_store_finance_om s
on s.store_id = a.store_id
where s.division_id = 19
and s.district_cd in (72,73)
I might rather filter the dimension itself in a subquery:
right join (select store_id from dw_dss.lu_store_finance_om
where division_id = 19 and district_cd in (72,73)) s
on s.store_id = a.store_id
I have a query below and am trying to get all the week_id's, upc_id's and upc_dsc's pulled in even if there is no net_amt or item_qty for them.
You want to generate the rows you need using cross join and then use left join to bring in the data. In your case, you also want aggregation.
You have not explained the tables, and I find your query quite hard to follow. But the idea is:
select c.weekid, s.store_id, u.upc_id,
count(f.dayid) as num_sales,
sum(f.net_amt) as total_amt
from calendar c cross join
stores s cross join
upcs u left join
facts f
using (dayid, store_id, upc_id) -- or whatever the right conditions are
group by c.weekid, s.store_id, u.upc_id;
Obviously, you have additional filters. You would apply these filters in the where clause to the dimension tables (or use a subquery if the logic is more complicated).
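The cross-join-then-left-join idea can be sketched on toy data as follows. This uses Python's sqlite3 module rather than Snowflake, and every table name here (weeks, stores, upcs, facts) and every row is a hypothetical stand-in for the real dimension and fact tables.

```python
import sqlite3


def zero_filled_sales():
    """Build every week/store/UPC combination, then attach sales if any."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical dimensions and facts; only two combinations have sales.
    conn.executescript("""
        CREATE TABLE weeks (week_id INTEGER);
        CREATE TABLE stores (store_id INTEGER);
        CREATE TABLE upcs (upc_id INTEGER);
        CREATE TABLE facts (
            week_id INTEGER, store_id INTEGER, upc_id INTEGER, net_amt INTEGER);
        INSERT INTO weeks VALUES (1), (2);
        INSERT INTO stores VALUES (10), (20);
        INSERT INTO upcs VALUES (100);
        INSERT INTO facts VALUES
            (1, 10, 100, 5), (1, 10, 100, 2), (2, 20, 100, 4);
    """)
    rows = conn.execute("""
        SELECT w.week_id, s.store_id, u.upc_id,
               COUNT(f.net_amt) AS num_sales,
               COALESCE(SUM(f.net_amt), 0) AS total_amt
        FROM weeks w
        CROSS JOIN stores s
        CROSS JOIN upcs u
        LEFT JOIN facts f
               ON f.week_id = w.week_id
              AND f.store_id = s.store_id
              AND f.upc_id = u.upc_id
        GROUP BY w.week_id, s.store_id, u.upc_id
        ORDER BY w.week_id, s.store_id, u.upc_id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in zero_filled_sales():
        print(row)
```

Combinations with no matching fact rows still appear, with a count of 0 and a total that COALESCE turns from NULL into 0.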

Sorting by newest date in joined query

I have a query in MSSQL that needs modification but I am unable to get it working properly. The query now is the following:
SELECT Computer.Id AS ComputerId,
Concat(HardDisk.Id, ' ') disks
FROM Computer
LEFT JOIN HardDisk ON Computer.Id = HardDisk.ComputerId
LEFT JOIN DiskOperationLog ON DiskOperationLog.HardDiskId = HardDisk.Id
I need it to also check in the table DiskOperationLog for an EndTime column and if two DiskOperationLog columns with the same HardDisk.Id exists it only needs to select the DiskOperationLog with the newest date. Is this something you can do? I suspect it can be done using the max(DiskOperationLog.EndTime) but I am unable to get it properly included in my selection.
Any help is highly appreciated!
I need it to also check in the table DiskOperationLog for an EndTime column and if two DiskOperationLog columns with the same HardDisk.Id exists it only needs to select the DiskOperationLog with the newest date.
Your query doesn't seem to use DiskOperationLog -- not for filtering (the query uses LEFT JOIN) and not selecting any columns. Let me assume this is an oversight in the question.
In SQL Server, the simplest method to do what you want uses OUTER APPLY:
SELECT c.Id AS ComputerId, Concat(hd.Id, ' ') disks
FROM Computer c LEFT JOIN
HardDisk hd
ON c.Id = hd.ComputerId OUTER APPLY
(SELECT TOP (1) dol.*
FROM DiskOperationLog dol
WHERE dol.HardDiskId = hd.Id
ORDER BY dol.EndTime DESC
) dol;
APPLY implements a lateral join, which is a lot like a correlated subquery, with the following differences:
The logic is in the FROM clause.
More than one column can be returned.
More than one row can be returned.
You can use a ROW_NUMBER() clause. You would want to partition by HardDisk.Id and order by DiskOperationLog.EndTime descending.
With Qry1 As (
SELECT Computer.Id AS ComputerId,
Concat(HardDisk.Id, ' ') disks,
DiskOperationLog.EndTime,
ROW_NUMBER() OVER(PARTITION BY HardDisk.Id ORDER BY DiskOperationLog.EndTime DESC) As Seq
FROM Computer
LEFT JOIN HardDisk
ON Computer.Id = HardDisk.ComputerId
LEFT JOIN DiskOperationLog
ON DiskOperationLog.HardDiskId = HardDisk.Id
)
SELECT ComputerId,
disks,
EndTime
FROM Qry1
WHERE Seq = 1
BTW, if you're trying to get a list of comma-separated disk numbers in column #2, that is definitely not the way to do it.

Bigquery SQL code to pull earliest contact

I have a copy of our salesforce data in bigquery, I'm trying to join the contact table together with the account table.
I want to return every account in the dataset but I only want the contact that was created first for each account.
I've gone around and around in circles today googling and trying to cobble a query together but all roads either lead to no accounts, a single account or loads of contacts per account (ignoring the earliest requirement).
Here's the latest query, which produces no results. I think I'm nearly there but still struggling. Any help would be most appreciated.
SELECT distinct
c.accountid as Acct_id
,a.id as a_Acct_ID
,c.id as Cont_ID
,a.id AS a_CONT_ID
,c.email
,c.createddate
FROM `sfdcaccounttable` a
INNER JOIN `sfdccontacttable` c
ON c.accountid = a.id
INNER JOIN
(SELECT a2.id, c2.accountid, c2.createddate AS MINCREATEDDATE
FROM `sfdccontacttable` c2
INNER JOIN `sfdcaccounttable` a2 ON a2.id = c2.accountid
GROUP BY 1,2,3
ORDER BY c2.createddate asc LIMIT 1) c3
ON c.id = c3.id
ORDER BY a.id asc
LIMIT 10
The solution shared above is very BigQuery specific: it does have some quirks you need to work around like the memory error you got.
I once answered a similar question here that is more portable and easier to maintain.
Essentially you need to create a smaller table (even better, make it a view) with the ID and its first transaction. It's similar to what you shared but slightly different, as you need to group ONLY in the topmost query.
It looks something like this
select
# contact ids that are first time contacts
b.id as cont_id,
b.accountid
from `sfdccontacttable` as b inner join
( select accountid,
min(createddate) as first_tx_time
FROM `sfdccontacttable`
group by 1) as a on (a.accountid = b.accountid and b.createddate = a.first_tx_time)
group by 1, 2
You need to do it this way because otherwise you can end up with multiple IDs per account (if there are any other dimensions associated with it). This approach is also fairly future-proof: dimensions can be added to the underlying tables without affecting the result, and you can use a where clause in the inner query to define a "valid" contact, and so on. You can then save it as a view and simply reference it in any subquery or join operation.
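As a rough sketch of how the earliest-contact lookup could then be joined back to the accounts, the example below uses Python's sqlite3 module instead of BigQuery. The accounts and contacts tables and their rows are hypothetical stand-ins for the Salesforce tables, and the LEFT JOIN from accounts is an addition to cover the "return every account" requirement.

```python
import sqlite3


def first_contact_per_account():
    """Return every account with its earliest-created contact, if any."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: account A3 has no contacts at all.
    conn.executescript("""
        CREATE TABLE accounts (id TEXT);
        CREATE TABLE contacts (id TEXT, accountid TEXT, createddate TEXT);
        INSERT INTO accounts VALUES ('A1'), ('A2'), ('A3');
        INSERT INTO contacts VALUES
            ('C1', 'A1', '2020-01-05'),
            ('C2', 'A1', '2020-01-01'),
            ('C3', 'A2', '2021-06-01');
    """)
    rows = conn.execute("""
        SELECT a.id AS account_id, c.id AS contact_id, c.createddate
        FROM accounts a
        LEFT JOIN (
            -- contacts whose createddate equals their account's minimum
            SELECT b.id, b.accountid, b.createddate
            FROM contacts b
            INNER JOIN (
                SELECT accountid, MIN(createddate) AS first_created
                FROM contacts
                GROUP BY accountid
            ) m ON m.accountid = b.accountid
               AND b.createddate = m.first_created
        ) c ON c.accountid = a.id
        ORDER BY a.id
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in first_contact_per_account():
        print(row)
```

If two contacts on the same account share the exact minimum createddate, this join-back returns both of them, so a tie-breaker would be needed where that can happen.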
Set up a view/subquery for client_first or client_last as:
SELECT * except(_rank) from (
select rank() over (partition by accountid order by createddate ASC) as _rank,
*
FROM `prj.dataset.sfdccontacttable`
) where _rank=1
Basically, it uses a window function to number the rows and returns the first row: with ASC that's the first client entry, with DESC the last.
You can do the same for accounts as well; the join between the two is then simple, as there will be exactly one record for each entity.
UPDATE
You could also try using ARRAY_AGG, which has a smaller memory footprint.
#standardSQL
SELECT e.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.createddate ASC LIMIT 1
)[OFFSET(0)] e
FROM `dataset.sfdccontacttable` t
GROUP BY t.accountid
)