DISTINCT pulling duplicate column values - sql

The following query is pulling duplicate site_ids, with me using DISTINCT I can't figure out why...
SELECT
DISTINCT site_id,
deal_woot.*,
site.woot_off,
site.name AS site_name
FROM deal_woot
INNER JOIN site ON site.id = site_id
WHERE site_id IN (2, 3, 4, 5, 6)
ORDER BY deal_woot.id DESC LIMIT 5

DISTINCT looks at the entire record, not just the column directly after it. To accomplish what you want, you'll need to use GROUP BY:
Non-working code:
SELECT
site_id,
deal_woot.*,
site.woot_off,
site.name AS site_name
FROM deal_woot
INNER JOIN site ON site.id = site_id
WHERE site_id IN (2, 3, 4, 5, 6)
GROUP BY site_id
Why doesn't it work? If you GROUP BY a column, you should use an aggregate function (such as MIN or MAX) on the rest of the columns -- otherwise, if there are multiple site_woot_offs for a given site_id, it's not clear to SQL which of those values you want to SELECT.
You will probably have to expand deal_woot.* to list each of its fields.
Side-note: If you're using MySQL, I believe it's not technically necessary to specify an aggregate function for the remaining columns. If you don't specify an aggregate function for a column, it chooses a single column value for you (usually the first value in the result set).

Your query is returning DISTINCT rows, it is not just looking at site_id. In other words, if any of the columns are different, a new row is returned from this query.
This makes sense, because if you actually do have differences, what should the server return as values for deal_woot.* ? If you want to do this, you need to specify this - perhaps done by getting distinct site_id's, then getting LIMIT 1 of the other values in a subquery with an appropiate ORDER BY clause.

You are selecting distinct value from one table only. When you join with the other table it will pull all rows that match each of your distinct value from the other table, causing duplicate id's

If you want to select site info and a single row from deal_woot table with the same site_id, you need to use a different query. For example,
SELECT site.id, deal_woot.*, site.woot_off, site.name
FROM site
INNER JOIN
(SELECT site_id, MAX(id) as id FROM deal_woot
WHERE site_id IN (2,3,4,5,6) GROUP BY site_id) X
ON (X.site_id = site.id)
INNER JOIN deal_woot ON (deal_woot.id = X.id)
WHERE site.id IN (2,3,4,5,6);
This query should work regardless of sql dialect/db vendor. For mysql, you can just add group by site_id to your original query, since it lets you use GROUP BY without aggregate functions.
** I assume that deal_woot.id and site.id are primary keys for deal_woot and site tables respectively.

Related

Selecting distinct values from database

I have a table as follows:
ParentActivityID | ActivityID | Timestamp
1 A1 T1
2 A2 T2
1 A1 T1
1 A1 T5
I want to select unique ParentActivityID's along with Timestamp. The time stamp can be the most recent one or the first one as is occurring in the table.
I tried to use DISTINCT but i came to realise that it dosen't work on individual columns. I am new to SQL. Any help in this regard will be highly appreciated.
DISTINCT is a shorthand that works for a single column. When you have multiple columns, use GROUP BY:
SELECT ParentActivityID, Timestamp
FROM MyTable
GROUP BY ParentActivityID, Timestamp
Actually i want only one one ParentActivityID. Your solution will give each pair of ParentActivityID and Timestamp. For e.g , if i have [1, T1], [2,T2], [1,T3], then i wanted the value as [1,T3] and [2,T2].
You need to decide what of the many timestamps to pick. If you want the earliest one, use MIN:
SELECT ParentActivityID, MIN(Timestamp)
FROM MyTable
GROUP BY ParentActivityID
Try this:
SELECT [ParentActivityId],
MIN([Timestamp]) AS [FirstTimestamp],
MAX([Timestamp]) AS [RecentTimestamp]
FROM [Table]
GROUP BY [ParentActivityId]
This will provide you the first timestamp and the most recent timestamp for each ParentActivityId that is present in your table. You can choose the ones you need as per your need.
"Group by" is what you need here. Just do "group by ParentActivityID" and tell that most recent timestamp along all rows with same ParentActivityID is needed for you:
SELECT ParentActivityID, MAX(Timestamp) FROM Table GROUP BY ParentActivityID
"Group by" operator is like taking rows from a table and putting them in a map with a key defined in group by clause (ParentActivityID in this example). You have to define how grouping by will handle rows with duplicate keys. For this you have various aggregate functions which you specify on columns you want to select but which are not part of the key (not listed in group by clause, think of them as a values in a map).
Some databases (like mysql) also allow you to select columns which are not part of the group by clause (not in a key) without applying aggregate function on them. In such case you will get some random value for this column (this is like blindly overwriting value in a map with new value every time). Still, SQL standard together with most databases out there will not allow you to do it. In such case you can use min(), max(), first() or last() aggregate function to work around it.
Use CTE for getting the latest row from your table based on parent id and you can choose the columns from the entire row of the output .
;With cte_parent
As
(SELECT ParentActivityId,ActivityId,TimeStamp
, ROW_NUMBER() OVER(PARTITION BY ParentActivityId ORDER BY TimeStamp desc) RNO
FROM YourTable )
SELECT *
FROM cte_parent
WHERE RNO =1

Oracle SQL Developer(4.0.0.12)

First time posting here, hopes it goes well.
I try to make a query with Oracle SQL Developer, where it returns a customer_ID from a table and the time of the payment from another. I'm pretty sure that the problems lies within my logicflow (It was a long time I used SQL, and it was back in school so I'm a bit rusty in it). I wanted to list the IDs as DISTINCT and ORDER BY the dates ASCENDING, so only the first date would show up.
However the returned table contains the same ID's twice or even more in some cases. I even found the same ID and same DATE a few times while I was scrolling through it.
If you would like to know more please ask!
SELECT DISTINCT
FIRM.customer.CUSTOMER_ID,
FIRM.account_recharge.X__INSDATE FELTOLTES
FROM
FIRM.customer
INNER JOIN FIRM.account
ON FIRM.customer.CUSTOMER_ID = FIRM.account.CUSTOMER
INNER JOIN FIRM.account_recharge
ON FIRM.account.ACCOUNT_ID = FIRM.account_recharge.ACCOUNT
WHERE
FIRM.account_recharge.X__INSDATE BETWEEN TO_DATE('14-01-01', 'YY-MM-DD') AND TO_DATE('14-12-31', 'YY-MM-DD')
ORDER
BY FELTOLTES
Your select works like this because a CUSTOMER_ID indeed has more than one X__INSDATE, therefore the records in the result will be distinct. If you need only the first date then don't use DISTINCT and ORDER BY but try to select for MIN(X__INSDATE) and use GROUP BY CUSTOMER_ID.
SELECT DISTINCT FIRM.customer.CUSTOMER_ID,
FIRM.account_recharge.X__INSDATE FELTOLTES
Distinct is applied to both the columns together, which means you will get a distinct ROW for the set of values from the two columns. So, basically the distinct refers to all the columns in the select list.
It is equivalent to a select without distinct but a group by clause.
It means,
select distinct a, b....
is equivalent to,
select a, b...group by a, b
If you want the desired output, then CONCATENATE the columns. The distict will then work on the single concatenated resultset.

getting unique results without using group by

For some reason, I am not able to use GROUP BY but I can use DISTINCT like this:
SELECT DISTINCT(name), id, email from myTable
However above query lists people with same name also because I am selecting more than one columns whereas I want to select only unique names. Is there someway to get unique names without using GROUP BY ?
Although using GROUP BY is the most direct way, you can do other things if it is prohibited. For example, you can use NOT EXISTS with a subquery, like this:
SELECT name, id, email
FROM myTable t
WHERE NOT EXISTS (SELECT * FROM myTable tt WHERE tt.name=t.name AND tt.id < t.id)
This query uses NOT EXISTS to eliminate rows with the same name and IDs higher than the one selected. Note that since this query must pick a single user per name, it may eliminate some users based on their ID, which is a rather arbitrary criterion.

Sybase: HAVING operates on rows?

I've came across the following SYBASE SQL:
-- Setup first
create table #t (id int, ts int)
go
insert into #t values (1, 2)
insert into #t values (1, 10)
insert into #t values (1, 20)
insert into #t values (1, 30)
insert into #t values (2, 5)
insert into #t values (2, 13)
insert into #t values (2, 25)
go
declare #time int select #time=11
-- This is the SQL I am asking about
select * from (select * from #t where ts <= #time) t group by id having ts = max(ts)
go
The results of this SQL are
id ts
----------- -----------
1 10
2 5
This looks like HAVING condition applied to rows rather than groups. Can someone please point me at a place is Sybase 15.5 documentation where this case is described? All I see is "HAVING operates on groups". The closest I see in the docs is:
The having clause can include columns or expressions that are not in
the select list and not in the group by clause.
(Quote from here).
However, they don't exactly explain what happens when you do that.
My understanding: Yes, fundamentally, HAVING operates on rows. By omitting a GROUP BY, it operates on all result rows within a single "supergroup" rather than on rows-within-groups. Read the section "How group by and having queries with aggregates work" in your originally-linked Sybase docco:-
How group by and having queries with aggregates work
The where clause excludes rows that do not meet its search conditions; its function remains the same for grouped or nongrouped queries.
The group by clause collects the remaining rows into one group for each unique value in the group by expression. Omitting group by creates a single group for the whole table.
Aggregate functions specified in the select list calculate summary values for each group. For scalar aggregates, there is only one value for the table. Vector aggregates calculate values for the distinct groups.
The having clause excludes groups from the results that do not meet its search conditions. Even though the having clause tests only rows, the presence or absence of a group by clause may make it appear to be operating on groups:
When the query includes group by, having excludes result group rows. This is why having seems to operate on groups.
When the query has no group by, having excludes result rows from the (single-group) table. This is why having seems to operate on rows (the results are similar to where clause results).
Secondly, a brief summary appears in the section "How the having, group by, and where clauses interact":-
How the having, group by, and where clauses interact
When you include the having, group by, and where clauses in a query, the sequence in which each clause affects the rows determines the final results:
The where clause excludes rows that do not meet its search conditions.
The group by clause collects the remaining rows into one group for each unique value in the group by expression.
Aggregate functions specified in the select list calculate summary values for each group.
The having clause excludes rows from the final results that do not meet its search conditions.
#SQLGuru's explanation is an illustration of this.
Edit...
On a related point, I was surprised by the behaviour of non-ANSI-conforming queries that utilise TSQL "extended columns". Sybase handles the extended columns (i) after the WHERE clause (ii) by creating extra joins to the original tables and (iii) the WHERE clause is not used in the join. Such queries might return more rows than expected and the HAVING clause then requires additional conditions to filter these out.
See examples b, c and d under "Transact-SQL extensions to group by and having" on the page of your originally-linked docco. I found it useful to install the pubs2 sample database from Sybase to play along with the examples.
I haven't done Sybase since it shared code with MS SQL Server....90's, but my interpretation of what you are doing is this:
First, the list is filtered to <= 11
id ts
1 2
1 10
2 5
Everything else is filtered out.
Next, you are filtering the list to the rows where TS = the Max(TS) for that group.
id ts
1 10
2 5
10 is the Max(TS) for group 1 and 5 is the Max(TS) for group 2. Those two rows are the ones that remain. What result would you expect otherwise?
If you read the documentation here, it seems that Sybase use of columns in the having clause that don't appear in the group by clause is different from MySQL.
The example they give has this explanation:
The Transact-SQL extended column, price (in the select list, but not
an aggregate and not in the group by clause), causes all qualified
rows to display in each qualified group, even though a standard group
by clause produces a single row per group. The group by still affects
the vector aggregate, which computes the average price per group
displayed on each row of each group (they are the same values that
were computed for example a):
So, ts = max(ts) essentially does this:
select *
from (select t.*,
max(ts) over (partition by id) as maxts
from #t
where ts <= #time
) t
where ts = maxts
The subquery is important, because the where clause gets used for the max() calculation and all rows would be returned.
I find this behavior rather confusing and non-standard. I would replace it with more typical constructs. These are about the same level of complexity and seem clearer to a larger audience.

GROUP BY / aggregate function confusion in SQL

I need a bit of help straightening out something, I know it's a very easy easy question but it's something that is slightly confusing me in SQL.
This SQL query throws a 'not a GROUP BY expression' error in Oracle. I understand why, as I know that once I group by an attribute of a tuple, I can no longer access any other attribute.
SELECT *
FROM order_details
GROUP BY order_no
However this one does work
SELECT SUM(order_price)
FROM order_details
GROUP BY order_no
Just to concrete my understanding on this.... Assuming that there are multiple tuples in order_details for each order that is made, once I group the tuples according to order_no, I can still access the order_price attribute for each individual tuple in the group, but only using an aggregate function?
In other words, aggregate functions when used in the SELECT clause are able to drill down into the group to see the 'hidden' attributes, where simply using 'SELECT order_no' will throw an error?
In standard SQL (but not MySQL), when you use GROUP BY, you must list all the result columns that are not aggregates in the GROUP BY clause. So, if order_details has 6 columns, then you must list all 6 columns (by name - you can't use * in the GROUP BY or ORDER BY clauses) in the GROUP BY clause.
You can also do:
SELECT order_no, SUM(order_price)
FROM order_details
GROUP BY order_no;
That will work because all the non-aggregate columns are listed in the GROUP BY clause.
You could do something like:
SELECT order_no, order_price, MAX(order_item)
FROM order_details
GROUP BY order_no, order_price;
This query isn't really meaningful (or most probably isn't meaningful), but it will 'work'. It will list each separate order number and order price combination, and will give the maximum order item (number) associated with that price. If all the items in an order have distinct prices, you'll end up with groups of one row each. OTOH, if there are several items in the order at the same price (say £0.99 each), then it will group those together and return the maximum order item number at that price. (I'm assuming the table has a primary key on (order_no, order_item) where the first item in the order has order_item = 1, the second item is 2, etc.)
The order in which SQL is written is not the same order it is executed.
Normally, you would write SQL like this:
SELECT
FROM
JOIN
WHERE
GROUP BY
HAVING
ORDER BY
Under the hood, SQL is executed like this:
FROM
JOIN
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
Reason why you need to put all the non-aggregate columns in SELECT to the GROUP BY is the top-down behaviour in programming. You cannot call something you have not declared yet.
Read more: https://sqlbolt.com/lesson/select_queries_order_of_execution
SELECT *
FROM order_details
GROUP BY order_no
In the above query you are selecting all the columns because of that its throwing an error not group by something like..
to avoid that you have to mention all the columns whichever in select statement all columns must be in group by clause..
SELECT *
FROM order_details
GROUP BY order_no,order_details,etc
etc it means all the columns from order_details table.
To use group by clause you have to mention all the columns from select statement in to group by clause but not the column from aggregate function.
TO do this instead of group by you can use partition by clause you can use only one port to group as a partition by.
you can also make it as partition by 1
use Common table expression(CTE) to avoid this issue.
multiple CTes also come handy, pasting a case where I have used...maybe helpful
with ranked_cte1 as
( select r.mov_id,DENSE_RANK() over ( order by r.rev_stars desc )as rankked from ratings r ),
ranked_cte2 as ( select * from movie where mov_id=(select mov_id from ranked_cte1 where rankked=7 ) ) select * from ranked_cte2
select * from movie where mov_id=902