Which is more efficient: GROUP BY or JOIN?

Hi, I have this table and I want to know which query is more efficient:
[ID_SOGGETTO]
,[COGNOME]
,[NOME]
,[DENOMINAZIONE]
,[FISICA_GIURID]
,[CODICEFISCALE]
,[PARTITAIVA]
,[ID_COMUNE]
,[DATA_NASCITA]
,[RAE]
,[SAE]
,[TIPO_SOCIETA]
,[NDG]
,[CODICECCIAA]
,[CODICECR]
,[ATTIVITALAVORATIVA]
,[PROFESSIONISTA]
,[CODICEFORNITORESAP]
,[TRASFERITOSAP]
,[ALBOPRECS]
,[ID_USER]
,[ID_USERINC]
,[ID_VERSIONE]
,[DATA_AGGIORNAMENTO]
,[DATA_STORICIZZAZIONE]
I tried this query to select all rows that have the same "Partita iva" (PARTITAIVA) but a different ID_SOGGETTO:
SELECT *
FROM table
WHERE PARTITAIVA IN (SELECT PARTITAIVA
                     FROM table
                     GROUP BY PARTITAIVA
                     HAVING COUNT(DISTINCT ID_SOGGETTO) > 1)
Would it be more efficient with a JOIN?

Often the most efficient way to do what you want uses window functions:
SELECT t.*
FROM (SELECT t.*,
             (DENSE_RANK() OVER (PARTITION BY PARTITAIVA ORDER BY ID_SOGGETTO ASC) +
              DENSE_RANK() OVER (PARTITION BY PARTITAIVA ORDER BY ID_SOGGETTO DESC) - 1
             ) as cnt
      FROM table t
     ) t
WHERE cnt > 1;
The sum of the two DENSE_RANK() values minus one is simply a way to calculate COUNT(DISTINCT ID_SOGGETTO) per PARTITAIVA: the ascending and descending ranks of any row always add up to the number of distinct values plus one.
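To see why the two ranks have to be combined with a "minus one", here is a small sketch that checks the identity against a plain COUNT(DISTINCT). It uses Python's built-in sqlite3 module as a stand-in engine (it needs SQLite 3.25 or later for window functions); the table name soggetti and the sample rows are made up for illustration, and only the two relevant columns are kept.

```python
import sqlite3

# Hypothetical miniature of the table: only the two relevant columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE soggetti (ID_SOGGETTO INTEGER, PARTITAIVA TEXT)")
conn.executemany(
    "INSERT INTO soggetti VALUES (?, ?)",
    [(1, "A"), (1, "A"), (2, "A"), (3, "B"), (3, "B"), (4, "C")],
)

# Ascending rank + descending rank - 1 = distinct values in the partition.
ranked = conn.execute(
    """
    SELECT PARTITAIVA,
           DENSE_RANK() OVER (PARTITION BY PARTITAIVA ORDER BY ID_SOGGETTO ASC)
         + DENSE_RANK() OVER (PARTITION BY PARTITAIVA ORDER BY ID_SOGGETTO DESC)
         - 1 AS cnt
    FROM soggetti
    """
).fetchall()

# Reference values computed with a plain COUNT(DISTINCT ...).
expected = dict(
    conn.execute(
        "SELECT PARTITAIVA, COUNT(DISTINCT ID_SOGGETTO) "
        "FROM soggetti GROUP BY PARTITAIVA"
    ).fetchall()
)

# Every row should carry the distinct count of its own partition.
mismatches = [(piva, cnt) for piva, cnt in ranked if cnt != expected[piva]]
print(mismatches)
```

Without the "- 1" every row would get a value of at least 2, so a filter of cnt > 1 would keep the whole table.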
In other databases, EXISTS would be recommended:
select t.*
from t
where exists (select 1
from t t2
where t2.partitaiva = t.partitaiva and
t2.id_soggetto <> t.id_soggetto
);
However, I am not sure if this would be faster in SparkSQL.
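Whichever form turns out faster on a given engine, it is worth confirming first that the rewrite returns the same rows. A minimal sketch, again using sqlite3 as a stand-in (the table name soggetti and the data are hypothetical), compares the original IN / GROUP BY query with the EXISTS version and then asks the engine for its plan instead of guessing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE soggetti (ID_SOGGETTO INTEGER, PARTITAIVA TEXT)")
conn.executemany(
    "INSERT INTO soggetti VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B"), (3, "B"), (4, "C")],
)

# The query from the question: IN over a grouped subquery.
GROUP_BY_SQL = """
SELECT * FROM soggetti
WHERE PARTITAIVA IN (SELECT PARTITAIVA FROM soggetti
                     GROUP BY PARTITAIVA
                     HAVING COUNT(DISTINCT ID_SOGGETTO) > 1)
ORDER BY ID_SOGGETTO
"""

# The EXISTS form: another row with the same PARTITAIVA but a different ID.
EXISTS_SQL = """
SELECT * FROM soggetti AS t
WHERE EXISTS (SELECT 1 FROM soggetti AS t2
              WHERE t2.PARTITAIVA = t.PARTITAIVA
                AND t2.ID_SOGGETTO <> t.ID_SOGGETTO)
ORDER BY ID_SOGGETTO
"""

group_by_rows = conn.execute(GROUP_BY_SQL).fetchall()
exists_rows = conn.execute(EXISTS_SQL).fetchall()

# The plan text is engine- and version-specific, so it is only printed here.
plan = conn.execute("EXPLAIN QUERY PLAN " + EXISTS_SQL).fetchall()
for step in plan:
    print(step)
```

The plan printed here says nothing about SparkSQL or SQL Server; the point is only that the comparison should be made with the target engine's own EXPLAIN output on realistic data.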

Related

Selecting the latest order

I need to select the data of all my customers with the records displayed in the image. But I need to get the most recent record only, for example I need to get the order # E987 for John and E888 for Adam. As you can see from the example, when I do the select statement, I get all the order records.

You don't mention the specific database, so I'll answer with a generic solution.
You can do:
select *
from (
select t.*,
row_number() over(partition by name order by order_date desc) as rn
from t
) x
where rn = 1

You can use the analytic function ROW_NUMBER:
Select * from
(Select t.*,
Row_number() over (partition by customer_id order by order_date desc) as rn
From your_table t) t
Where rn = 1
Or you can use NOT EXISTS as follows:
Select *
From your_table t
Where not exists
(Select 1 from your_table tt
Where t.customer_id = tt.customer_id
And tt.order_date > t.order_date)

You can do it with a subquery that finds the last order date for each customer:
SELECT t.*
FROM your_table t
JOIN (SELECT tt.customer_id,
MAX(tt.order_date) MaxOrderDate
FROM your_table tt
GROUP BY tt.customer_id) AS tt
ON t.customer_id = tt.customer_id
AND t.order_date = tt.MaxOrderDate
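The three approaches in these answers agree as long as each customer has a single latest order, but they differ when two orders share the latest date. The sketch below illustrates this with sqlite3 as a stand-in engine; the orders table, its columns, and the dates are assumptions (the question's data was only shown as an image), with E987 and E888 taken from the question as the expected latest orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(customer_id INTEGER, name TEXT, order_no TEXT, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "John", "E123", "2020-01-10"),
        (1, "John", "E987", "2020-03-05"),
        (2, "Adam", "E555", "2020-02-01"),
        (2, "Adam", "E888", "2020-04-20"),
    ],
)

ROW_NUMBER_SQL = """
SELECT name, order_no
FROM (SELECT o.*,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY order_date DESC) AS rn
      FROM orders o) x
WHERE rn = 1
ORDER BY name, order_no
"""

NOT_EXISTS_SQL = """
SELECT name, order_no
FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM orders oo
                  WHERE oo.customer_id = o.customer_id
                    AND oo.order_date > o.order_date)
ORDER BY name, order_no
"""

MAX_JOIN_SQL = """
SELECT o.name, o.order_no
FROM orders o
JOIN (SELECT customer_id, MAX(order_date) AS max_order_date
      FROM orders
      GROUP BY customer_id) m
  ON o.customer_id = m.customer_id AND o.order_date = m.max_order_date
ORDER BY o.name, o.order_no
"""

QUERIES = (ROW_NUMBER_SQL, NOT_EXISTS_SQL, MAX_JOIN_SQL)


def run(sql):
    return conn.execute(sql).fetchall()


# With one latest order per customer, all three queries agree.
without_ties = [run(q) for q in QUERIES]

# Add a second order for John on his latest date to create a tie.
conn.execute("INSERT INTO orders VALUES (1, 'John', 'E999', '2020-03-05')")
row_counts_with_tie = [len(run(q)) for q in QUERIES]
print(row_counts_with_tie)
```

With the tie in place, ROW_NUMBER still returns exactly one row per customer (which of the tied rows is arbitrary unless the ORDER BY gets a tiebreaker), while NOT EXISTS and the MAX join return both tied rows.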

How to find Min and Max rows from a table including all columns from postgresql

I have a view select c1,c2,count from table and it will give result below.
I want to fetch the entire row of maximum and minimum count's value and that should return only two rows with max and min count like below.
How to do it?

The quickest way is probably a union:
(
select c1, c2, count
from the_table
order by count
limit 1
)
union all
(
select c1, c2, count
from the_table
order by count desc
limit 1
)
Usually the individual statements in a UNION don't need parentheses, but because we want an ORDER BY and LIMIT on each of them, they are needed here.
Another option would be join against a derived table:
select t1.*
from the_table t1
join (
select min(count) as min_count,
max(count) as max_count
from the_table
) mm on t1.count in (mm.min_count, mm.max_count)
But I doubt that this will be faster.

I would recommend window functions:
select *
from (
select t.*,
row_number() over(order by count) rn_asc,
row_number() over(order by count desc) rn_desc
from mytable t
) t
where 1 in (rn_asc, rn_desc)
order by count
This requires scanning the table only once (as opposed to union all or join).
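As a rough check that the window-function query and the UNION ALL query pick the same two rows, here is a sketch on a tiny table. It runs on sqlite3 rather than PostgreSQL, so two things are assumptions of the sketch: the column is named cnt instead of count, and each UNION ALL branch is wrapped in a derived table because SQLite does not accept parenthesized branches the way PostgreSQL does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (c1 TEXT, c2 TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [("a", "x", 5), ("b", "y", 1), ("c", "z", 9), ("d", "w", 3)],
)

# One pass: number the rows from both ends and keep the first of each.
WINDOW_SQL = """
SELECT c1, c2, cnt
FROM (SELECT t.*,
             ROW_NUMBER() OVER (ORDER BY cnt)      AS rn_asc,
             ROW_NUMBER() OVER (ORDER BY cnt DESC) AS rn_desc
      FROM the_table t) t
WHERE 1 IN (rn_asc, rn_desc)
ORDER BY cnt
"""

# Two passes: smallest row and largest row, glued together.
UNION_SQL = """
SELECT * FROM (SELECT c1, c2, cnt FROM the_table ORDER BY cnt LIMIT 1)
UNION ALL
SELECT * FROM (SELECT c1, c2, cnt FROM the_table ORDER BY cnt DESC LIMIT 1)
"""

window_rows = conn.execute(WINDOW_SQL).fetchall()
union_rows = sorted(conn.execute(UNION_SQL).fetchall(), key=lambda r: r[2])
print(window_rows)
```

Note that if the table holds a single row, the window version returns it once while the UNION ALL version returns it twice.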

Oracle SQL query result into a temporary table for use in a sub query

I want to create a temporary table, derived from a query, to be used in another subquery so as to simplify the row_number() and partition by condition. The query I have written is below, but it returns the error "t.trlr_num: invalid identifier".
with t as
(select distinct
ym.trlr_num,
ym.arrdte,
ri.invnum,
ri.supnum
from rcvinv ri, yms_ymr ym
where ym.trlr_cod='RCV'
and ri.trknum = ym.trlr_num
and ym.wh_id <=50
and ym.trlr_stat in ('C','CI','R','OR')
and ym.arrdte is not null
order by ym.arrdte desc
)
select trlr_number, invnum, supnum
from
(
select
t.trlr_num, t.invnum, t.supnum,
row_number() over (partition by t.trlr_number,t.invnum order by t.arrdte) as rn
from t
)
where rn = 1;
In the query above, I define t as a temporary table to be used in the select statement that follows, but it seems to error out with an invalid identifier.

It seems to be a typo: replace trlr_number with trlr_num and it works:
with t as
(select distinct
ym.trlr_num,
ym.arrdte,
ri.invnum,
ri.supnum
from rcvinv ri, yms_ymr ym
where ym.trlr_cod='RCV'
and ri.trknum = ym.trlr_num
and ym.wh_id <=50
and ym.trlr_stat in ('C','CI','R','OR')
and ym.arrdte is not null
order by ym.arrdte desc
)
select trlr_num, invnum, supnum
from
(
select
t.trlr_num, t.invnum, t.supnum,
row_number() over (partition by t.trlr_num,t.invnum order by t.arrdte) as rn
from t
)
where rn = 1;

You could use multiple subqueries in the WITH clause as separate temporary tables. It would be nice and easy to understand:
WITH t AS
(SELECT DISTINCT ym.trlr_num trlr_num,
ym.arrdte arrdte,
ri.invnum invnum,
ri.supnum supnum
FROM rcvinv ri,
yms_ymr ym
WHERE ym.trlr_cod ='RCV'
AND ri.trknum = ym.trlr_num
AND ym.wh_id <=50
AND ym.trlr_stat IN ('C','CI','R','OR')
AND ym.arrdte IS NOT NULL
),
t1 AS (
SELECT t.trlr_num,
t.arrdte,
t.invnum,
t.supnum,
row_number() OVER (PARTITION BY t.trlr_num, t.invnum ORDER BY t.arrdte) rn
FROM t
)
SELECT trlr_num, arrdte, invnum, supnum
FROM t1
WHERE rn = 1;
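The chained-CTE pattern, and the kind of error the misspelled column produces, can be tried out on a small stand-in. The sketch below uses sqlite3 instead of Oracle and collapses the two source tables into one hypothetical table, arrivals, with made-up rows; SQLite reports "no such column" where Oracle reports "invalid identifier".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE arrivals "
    "(trlr_num TEXT, invnum TEXT, supnum TEXT, arrdte TEXT)"
)
conn.executemany(
    "INSERT INTO arrivals VALUES (?, ?, ?, ?)",
    [
        ("T1", "INV1", "S1", "2021-05-01"),
        ("T1", "INV1", "S1", "2021-05-03"),
        ("T1", "INV2", "S2", "2021-05-02"),
        ("T2", "INV3", "S3", "2021-05-04"),
    ],
)

# Two CTEs: t prepares the rows, t1 numbers them per (trlr_num, invnum).
CHAINED_SQL = """
WITH t AS (
    SELECT DISTINCT trlr_num, arrdte, invnum, supnum
    FROM arrivals
    WHERE arrdte IS NOT NULL
),
t1 AS (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY t.trlr_num, t.invnum
                              ORDER BY t.arrdte) AS rn
    FROM t
)
SELECT trlr_num, arrdte, invnum, supnum
FROM t1
WHERE rn = 1
ORDER BY trlr_num, invnum
"""
first_arrivals = conn.execute(CHAINED_SQL).fetchall()

# Referencing a column name the CTE does not expose fails at parse time.
try:
    conn.execute(
        "WITH t AS (SELECT trlr_num FROM arrivals) "
        "SELECT t.trlr_number FROM t"
    )
    typo_error = None
except sqlite3.OperationalError as exc:
    typo_error = str(exc)
print(typo_error)
```

Ordering the window by arrdte keeps the earliest arrival for each trailer and invoice; ordering by the partition columns themselves would leave the choice of row undefined.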

(SQL Server) Using a row number to sort the list without listing the row number itself

All the columns in the SELECT need to be listed except the ROW_NUMBER() values. Is there any way to leave the row number out of the output?
SELECT *
FROM
(SELECT Station,
ROW_NUMBER() over (
ORDER BY totalseq ASC) AS rownumber1
FROM [SFCKM].[dbo].[T_DB_Subline]
WHERE Track_Point_No = '3d1')a
LEFT JOIN
(SELECT group_no,
trim_line,
MSC,
lot_no,
color,
AON,
format(Commit_time,'MM/dd/yy h:mm:ss tt')AS time,
datediff(DAY,Commit_Time,SYSDATETIME()) AS aging,
ROW_NUMBER() over (
ORDER BY commit_time DESC) AS rownumber
FROM [SFCKM].[dbo].[T_Work_Actual]
WHERE Track_Point_No = '3d1') c ON a.rownumber1 = c.rownumber
ORDER BY a.rownumber1

You could just select the values you are looking for, e.g. Station and aging:
select a.Station, c.aging from
(select Station, ROW_NUMBER() over (order by totalseq asc) AS rownumber1
from [SFCKM].[dbo].[T_DB_Subline] where Track_Point_No = '3d1') a
left join
(select datediff(DAY, Commit_Time, SYSDATETIME()) AS aging,
ROW_NUMBER() over (order by commit_time desc) AS rownumber
from [SFCKM].[dbo].[T_Work_Actual] where Track_Point_No = '3d1') c
on a.rownumber1 = c.rownumber
order by a.rownumber1

Don't use SELECT *: specify only the columns you need. SELECT * is not best practice, just lazy.
There is no way to exclude a single column from SELECT *, as per my answer to "SQL exclude a column using SELECT * [except columnA] FROM tableA?"
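A small sketch of the idea: the row numbers stay inside the derived tables, where they drive the join and the sort, and the outer SELECT simply does not list them. It uses sqlite3 with two simplified, hypothetical tables (subline and work_actual) standing in for T_DB_Subline and T_Work_Actual.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subline (Station TEXT, totalseq INTEGER)")
conn.execute("CREATE TABLE work_actual (lot_no TEXT, commit_time TEXT)")
conn.executemany(
    "INSERT INTO subline VALUES (?, ?)",
    [("S1", 1), ("S2", 2), ("S3", 3)],
)
conn.executemany(
    "INSERT INTO work_actual VALUES (?, ?)",
    [("L10", "2021-01-01"), ("L20", "2021-01-03")],
)

# The row numbers are used for ON and ORDER BY but never selected.
cursor = conn.execute(
    """
    SELECT a.Station AS Station, c.lot_no AS lot_no
    FROM (SELECT Station,
                 ROW_NUMBER() OVER (ORDER BY totalseq ASC) AS rownumber1
          FROM subline) a
    LEFT JOIN
         (SELECT lot_no,
                 ROW_NUMBER() OVER (ORDER BY commit_time DESC) AS rownumber
          FROM work_actual) c
      ON a.rownumber1 = c.rownumber
    ORDER BY a.rownumber1
    """
)
column_names = [d[0] for d in cursor.description]
result = cursor.fetchall()
print(column_names, result)
```

The same shape should carry over to the SQL Server query in the question: list a.Station and the needed c columns explicitly in the outer SELECT instead of using SELECT *.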

SQL: How to find duplicates based on two fields?

I have rows in an Oracle database table which should be unique for a combination of two fields, but the unique constraint is not set up on the table, so I need to find all rows which violate the constraint myself using SQL. Unfortunately my meager SQL skills aren't up to the task.
My table has three columns which are relevant: entity_id, station_id, and obs_year. For each row the combination of station_id and obs_year should be unique, and I want to find out if there are rows which violate this by flushing them out with an SQL query.
I have tried the following SQL (suggested by this previous question) but it doesn't work for me (I get ORA-00918 column ambiguously defined):
SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
Can someone suggest what I'm doing wrong, and/or how to solve this?

SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
FROM mytable t
)
WHERE rn > 1

SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 from mytable t2 Where
t1.station_id = t2.station_id
AND t1.obs_year = t2.obs_year
AND t1.RowId <> t2.RowId)

Change the 3 fields in the initial select to be
SELECT
t1.entity_id, t1.station_id, t1.obs_year

Re-write of your query
SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
I think the ambiguous column error (ORA-00918) occurred because you were selecting columns whose names appear in both the table and the subquery, but you did not specify whether you wanted them from dupes or from mytable (aliased as t1).

Could you not create a new table that includes the unique constraint, and then copy across the data row by row, ignoring failures?

You need to specify the table for the columns in the main select. Also, assuming entity_id is the unique key for mytable and is irrelevant to finding duplicates, you should not be grouping on it in the dupes subquery.
Try:
SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes
ON
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
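The answers in this thread do not all return the same thing, which is easy to miss. A minimal sketch on a small made-up mytable (run with sqlite3 as a stand-in for Oracle) shows the difference: the join returns every row that belongs to a duplicated (station_id, obs_year) pair, the ROW_NUMBER filter returns only the surplus rows after the first in each group, and grouping by entity_id as well finds nothing when entity_id is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (entity_id INTEGER, station_id TEXT, obs_year INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "ST1", 2000), (2, "ST1", 2000),
        (3, "ST1", 2001),
        (4, "ST2", 2000), (5, "ST2", 2000), (6, "ST2", 2000),
    ],
)

# Every row that shares (station_id, obs_year) with at least one other row.
ALL_ROWS_IN_DUPLICATE_GROUPS = """
SELECT t1.entity_id
FROM mytable t1
INNER JOIN (SELECT station_id, obs_year FROM mytable
            GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes
   ON t1.station_id = dupes.station_id AND t1.obs_year = dupes.obs_year
ORDER BY t1.entity_id
"""

# Only the second and later rows of each group (the ones to delete, say).
SURPLUS_ROWS_ONLY = """
SELECT entity_id
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY station_id, obs_year
                                ORDER BY entity_id) AS rn
      FROM mytable t) numbered
WHERE rn > 1
ORDER BY entity_id
"""

# Grouping by the unique key too: every group has exactly one row.
GROUPED_WITH_ENTITY_ID = """
SELECT entity_id FROM mytable
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1
"""


def ids(sql):
    return [row[0] for row in conn.execute(sql)]


all_in_groups = ids(ALL_ROWS_IN_DUPLICATE_GROUPS)
surplus = ids(SURPLUS_ROWS_ONLY)
with_entity_id = ids(GROUPED_WITH_ENTITY_ID)
print(all_in_groups, surplus, with_entity_id)
```

Which result is wanted depends on the goal: inspecting the conflicting rows calls for the first query, cleaning up before adding the unique constraint for the second.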

The following query by Quassnoi is the most efficient for large tables:
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
FROM mytable t
)
WHERE rn > 1
I ran this cost analysis on one of my own tables:
SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 from trn_refil_book b Where
a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
AND a.RowId <> b.RowId)
;
gave a cost of 1322341
SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c
ON
a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;
gave a cost of 1271699
while
SELECT dist_code, book_date, book_no
FROM (
SELECT t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
ORDER BY t.dist_code) AS rn
FROM trn_refil_book t
) p
WHERE p.rn > 1
;
gave a cost of 1021984
The table was not indexed....

SELECT station_id, obs_year, COUNT(*)
FROM mytable
GROUP BY station_id, obs_year
HAVING COUNT(*) > 1
Specify the fields to find duplicates on in both the SELECT and the GROUP BY. Leave out entity_id: if it is unique per row, grouping on it puts every row in its own group and no duplicates are found.
It works by using GROUP BY to collect the rows that match each other on the specified columns.
The HAVING COUNT(*) > 1 says that we are only interested in combinations that occur more than once (and are therefore duplicates).

I thought a lot of the solutions here were cumbersome and tough to understand since I had a 3 column primary key constraint and needed to find the duplicates. So here's an option
SELECT id, name, value, COUNT(*) FROM db_name.table_name
GROUP BY id, name, value
HAVING COUNT(*) > 1

I'm surprised there aren't any answers here that use a CTE (Common Table Expression)
WITH cte as (
SELECT
ROW_NUMBER()
OVER(
PARTITION BY Last_Name, First_Name order by BIRTHDATE)
AS RN,
Employee_number, First_Name, Last_Name, BirthDate,
SUM(1)
OVER(
PARTITION BY Last_Name, First_Name
ORDER BY BIRTHDATE ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS CNT
FROM
employment)
select * from cte where cnt > 1
Not only will this find duplicates (on first and last name only), it will tell you how many there are.
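The SUM(1) with an explicit full-partition frame is one way to get the group size on every row; COUNT(*) OVER with only a PARTITION BY should give the same number with less typing. A short sketch comparing the two, using sqlite3 and a hypothetical three-row employment table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employment "
    "(Employee_number INTEGER, First_Name TEXT, Last_Name TEXT, BirthDate TEXT)"
)
conn.executemany(
    "INSERT INTO employment VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "Lee", "1980-01-01"),
        (2, "Ann", "Lee", "1990-05-05"),
        (3, "Bob", "Kim", "1975-03-03"),
    ],
)

# The form from the answer: SUM(1) over an explicit whole-partition frame.
SUM_FRAME_SQL = """
WITH cte AS (
    SELECT Employee_number,
           ROW_NUMBER() OVER (PARTITION BY Last_Name, First_Name
                              ORDER BY BirthDate) AS rn,
           SUM(1) OVER (PARTITION BY Last_Name, First_Name
                        ORDER BY BirthDate
                        ROWS BETWEEN UNBOUNDED PRECEDING
                                 AND UNBOUNDED FOLLOWING) AS cnt
    FROM employment)
SELECT Employee_number, rn, cnt FROM cte WHERE cnt > 1
ORDER BY Employee_number
"""

# Without ORDER BY the default frame is already the whole partition.
COUNT_SQL = """
WITH cte AS (
    SELECT Employee_number,
           ROW_NUMBER() OVER (PARTITION BY Last_Name, First_Name
                              ORDER BY BirthDate) AS rn,
           COUNT(*) OVER (PARTITION BY Last_Name, First_Name) AS cnt
    FROM employment)
SELECT Employee_number, rn, cnt FROM cte WHERE cnt > 1
ORDER BY Employee_number
"""

sum_rows = conn.execute(SUM_FRAME_SQL).fetchall()
count_rows = conn.execute(COUNT_SQL).fetchall()
print(sum_rows)
```

Each returned row carries its position within the group (rn) and the size of the group (cnt), so the first occurrence can be told apart from the later ones.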
