Query optimization issue - SQL

Using SQL Server 2012.
I am using a query to find deltas in a table.
I have an archive table that holds all the records, with (Licenceno, FileID) as the primary key.
I want to find out how many Licenceno values are in a FileID but are not in any previous FileID.
Code used:
Select count(*) from table where fileid = 123 and Licenceno not in (select Licenceno from table where fileid < 123)
The query works fine, but the problem is that some FileIDs with the same number of records as previous ones have been running for 4 hours and are still going.
Is it a table issue?
The index can't be the issue, as the whole table has a non-clustered index.
It generally happens when I am calculating deltas for the latest Licenceno values.
Or is query planning the issue?
I have not been able to solve this for the past 5 days.

I would rewrite your query to use a NOT EXISTS clause, and also add an appropriate index:
SELECT COUNT(*)
FROM yourTable t1
WHERE
    fileid = 123 AND
    NOT EXISTS (SELECT 1 FROM yourTable t2
                WHERE t2.Licenceno = t1.Licenceno AND t2.fileid < 123);
An index on (Licenceno, fileid) might help here:
CREATE INDEX idx ON yourTable (Licenceno, fileid);
You may also try the same composite index with the columns in the reverse order (under a different name, since index names must be unique per table):
CREATE INDEX idx2 ON yourTable (fileid, Licenceno);

Why not use count(distinct)?
select count(distinct licenceno)
from table
where fileid = 123;
For this query, you want an index on (fileid, licenceno).
You are complicating the logic by thinking sequentially ("have I seen this licenceno already?"). Instead, you just want to count the distinct values.
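That index, written out (table and index names here are placeholders, not from the original post):
-- placeholder names; supports filtering on fileid and counting distinct licenceno
CREATE INDEX idx_file_lic ON yourTable (fileid, licenceno);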
EDIT:
For this problem, you can try two levels of aggregation:
select count(*)
from (select licenceno, min(fileid) as min_fileid
      from t
      where fileid <= 123
      group by licenceno
     ) t
where min_fileid = 123;
How good the performance is relative to other approaches depends on how selective fileid <= 123 is.
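For this aggregated version, an index leading with licenceno (again placeholder names, and my suggestion rather than the answer's) would let the engine stream the GROUP BY off an ordered scan:
-- placeholder names; GROUP BY licenceno can read this index in order
CREATE INDEX idx_lic_file ON t (licenceno, fileid);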

You could also use LAG for this:
SELECT COUNT(*)
FROM (SELECT fileid,
             LAG(fileid) OVER (PARTITION BY Licenceno ORDER BY fileid) AS prevFileID
      FROM yourTable
      WHERE fileid <= 123) D
WHERE fileid = 123
  AND prevFileID IS NULL
... or an aggregation query ...
WITH T
AS (SELECT 1 AS Flag
    FROM yourTable
    WHERE fileid <= 123
    GROUP BY Licenceno
    HAVING MIN(fileid) = 123)
SELECT COUNT(*)
FROM T
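A quick sanity check for either version, on made-up sample data (the values below are invented for illustration):
-- Licenceno 1 first appears in file 123; 2 appeared earlier; 3 never reaches 123
DECLARE @t TABLE (Licenceno int, fileid int);
INSERT INTO @t VALUES (1, 123), (2, 100), (2, 123), (3, 100);

SELECT COUNT(*)
FROM (SELECT fileid,
             LAG(fileid) OVER (PARTITION BY Licenceno ORDER BY fileid) AS prevFileID
      FROM @t
      WHERE fileid <= 123) D
WHERE fileid = 123
  AND prevFileID IS NULL;
-- returns 1: only Licenceno 1 is new in file 123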

Related

BigQuery de-duplication query is not working properly

Can anyone please tell me why the below query is not working properly? It is supposed to delete only the duplicate records and keep one of them (the latest record), but it is deleting all the records instead of keeping one of each set of duplicates. Why is that?
delete from dev_rahul.page_content_insights
where (sha_id, etl_start_utc_dttm) in (
    select (a.sha_id, a.etl_start_utc_dttm)
    from (
        select sha_id,
               etl_start_utc_dttm,
               row_number() over (partition by sha_id
                                  order by etl_start_utc_dttm desc) as rn
        from dev_rahul.page_content_insights
        where snapshot_dt >= '2021-03-25'
    ) a
    where a.rn <> 1
)
The query looks OK, though I don't use that syntax for cleaning up duplicates.
Can I confirm the following:
(sha_id, etl_start_utc_dttm) is your primary key?
You wish to keep, for each sha_id, the latest row based on the etl_start_utc_dttm field?
If so, try this two-query pattern:
create or replace table dev_rahul.rows_not_to_delete as
SELECT col.*
FROM (
    SELECT ARRAY_AGG(pci ORDER BY etl_start_utc_dttm DESC LIMIT 1)[OFFSET(0)] col
    FROM dev_rahul.page_content_insights pci
    WHERE snapshot_dt >= '2021-03-25'
    GROUP BY sha_id
);
delete from dev_rahul.page_content_insights p
where not exists (select 1 from dev_rahul.rows_not_to_delete d
                  where p.sha_id = d.sha_id and p.etl_start_utc_dttm = d.etl_start_utc_dttm
) and snapshot_dt >= '2021-03-25';
You could do this in a single query by putting the first statement into a CTE.
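An untested sketch of that single-statement form (I'm inlining the first query as a subquery in case the engine rejects a leading WITH on DELETE):
-- sketch only: keeper rows computed inline instead of in a helper table
delete from dev_rahul.page_content_insights p
where not exists (
    select 1
    from (
        select ARRAY_AGG(pci ORDER BY etl_start_utc_dttm DESC LIMIT 1)[OFFSET(0)] col
        from dev_rahul.page_content_insights pci
        where snapshot_dt >= '2021-03-25'
        group by sha_id
    )
    where col.sha_id = p.sha_id
      and col.etl_start_utc_dttm = p.etl_start_utc_dttm
) and snapshot_dt >= '2021-03-25';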

How can I improve the native query for a table with 7 million rows?

I have the below view (table) in my database (SQL Server).
I want to retrieve 2 things from this table:
1. The object which has the latest booking date for each product number. This returns the objects {0001, 2, 2019-06-06 10:39:58} and {0003, 2, 2019-06-07 12:39:58}.
2. If no step number has a booking date for a product number, the object with step number = 1. This returns the object {0002, 1, NULL}.
The view has 7,000,000 rows. I must do it by using a native query.
The first query, which retrieves the products with the latest booking date:
SELECT DISTINCT *
FROM TABLE t
WHERE t.BOOKING_DATE = (SELECT max(tbl.BOOKING_DATE) FROM TABLE tbl WHERE t.PRODUCT_NUMBER = tbl.PRODUCT_NUMBER)
The second query, which retrieves the products with booking date NULL and step number = 1:
SELECT DISTINCT *
FROM TABLE t
WHERE (SELECT max(tbl.BOOKING_DATE) FROM TABLE tbl WHERE t.PRODUCT_NUMBER = tbl.PRODUCT_NUMBER) IS NULL AND t.STEP_NUMBER = 1
I tried using a single query, but it takes too long.
For now I use 2 queries to get this information, but going forward I need to improve this. Do you have an alternative? I also cannot use stored procedures or functions inside SQL Server; I must do it with a native query from Java.
Try this,
Declare @p table(pumber int, step int, bookdate datetime)
insert into @p values
 (1,1,'2019-01-01'),(1,2,'2019-01-02'),(1,3,'2019-01-03')
,(2,1,null),(2,2,null),(2,3,null)
,(3,1,null),(3,2,null),(3,3,'2019-01-03')
;With CTE as
(
    select pumber, max(bookdate) bookdate
    from @p p1
    where bookdate is not null
    group by pumber
)
select p.* from @p p
where exists (select 1 from CTE c
              where p.pumber = c.pumber and p.bookdate = c.bookdate)
union all
select p1.* from @p p1
where p1.bookdate is null and step = 1
  and not exists (select 1 from CTE c
                  where p1.pumber = c.pumber)
If performance is the main concern, then whether you use 1 query or 2 does not matter; in the end only the performance matters.
Create NonClustered index ix_Product on Product (ProductNumber,BookingDate,Stepnumber)
Go
If more than 90% of the data falls on one side of the BookingDate is not null / BookingDate is null split, you can create a filtered index on it:
Create NonClustered index ix_Product on Product (ProductNumber,BookingDate,Stepnumber)
where BookingDate is not null
Go
Try row_number() with a proper ordering. Null values are treated as the lowest possible values by SQL Server's ORDER BY.
SELECT TOP(1) WITH TIES *
FROM myTable t
ORDER BY row_number() over(partition by PRODUCT_NUMBER order by BOOKING_DATE DESC, STEP_NUMBER);
Pay attention to the indexes SQL Server advises (the missing-index hints in the execution plan) to get good performance.
Possibly the most efficient method is a correlated subquery:
select t.*
from t
where t.step_number = (select top (1) t2.step_number
                       from t t2
                       where t2.product_number = t.product_number
                       order by t2.booking_date desc, t2.step_number
                      );
In particular, this can take advantage of an index on (product_number, booking_date desc, step_number).
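Declared explicitly, that index might look like this (the index name is hypothetical; DESC on booking_date matches the subquery's ORDER BY):
-- hypothetical name; keyed to match the correlated subquery's seek and sort
CREATE INDEX ix_t_product_booking ON t (product_number, booking_date DESC, step_number);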

Access: query crashes

I have the following query (let's call it Query1) (kindly created here by Erik von Asmuth):
SELECT PARTNERID
,NAME
,FIRST_NAME
,UID
,DATA_R
FROM MY_TABLE
WHERE MY_TABLE.[DATA_R] = (
SELECT MAX(t.[DATA_R])
FROM MY_TABLE AS t
WHERE t.PARTNERID = MY_TABLE.PARTNERID
)
ORDER BY PARTNERID;
MY_TABLE has 20000 records and is a Query (even if the name might suggest the opposite) with the following form:
SELECT [MYTABLE_O].PARTNERID, [MYTABLE_O].NAME, [MYTABLE_O].FIRST_NAME, [MYTABLE_O].[Codice fiscale] AS CF, [MYTABLE_O].Date AS DATA_R
FROM [MYTABLE_O] LEFT JOIN [TO_EXCLUDE] ON [MYTABLE_O].[PARTNERID] = [TO_EXCLUDE].[PARTNERID]
WHERE ((([TO_EXCLUDE].PARTNERID) Is Null));
(I want to exclude some already considered elements that are in Table TO_EXCLUDE).
When I run the query (Query1) MS Access freezes. How can I avoid it/make it more efficient and stable?
I have tried indexing both PARTNERID and DATA_R in MYTABLE_O.
You may have to write the result of the subquery:
SELECT PARTNERID, MAX([DATA_R]) AS MAXDATAR
FROM YourQuery
GROUP BY PARTNERID
to a temp table, and then replace in your query
FROM MY_TABLE AS t
with
FROM TempTable AS t
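Spelled out, the two steps might look like this (a sketch; TempTable is a placeholder name, SELECT ... INTO is Access's make-table syntax, and MY_TABLE, the saved query, is the source):
SELECT PARTNERID, MAX([DATA_R]) AS MAXDATAR
INTO TempTable
FROM MY_TABLE
GROUP BY PARTNERID;

SELECT PARTNERID, NAME, FIRST_NAME, UID, DATA_R
FROM MY_TABLE
WHERE MY_TABLE.[DATA_R] = (
    SELECT t.MAXDATAR
    FROM TempTable AS t
    WHERE t.PARTNERID = MY_TABLE.PARTNERID
)
ORDER BY PARTNERID;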

SQL Server : Update Flag on Max Date on Foreign key

I'm trying to do this update, but for some reason I cannot quite master SQL subqueries.
My table structure is as follows:
id  fk  date      activeFlg
--  --  --------  ---------
1   1   04/10/11  0
2   1   02/05/99  0
3   2   09/10/11  0
4   3   11/28/11  0
5   3   12/25/98  0
Ideally I would like to set the activeFlg to 1 for each distinct foreign key's most recent date. For instance, after running my query, ids 1, 3 and 4 will have their active flag set to 1.
The closest thing I came up with was a query returning all of the max dates for each distinct fk:
SELECT MAX(date)
FROM table
GROUP BY fk
But since I can't even come up with the subquery, there is no way I can proceed :/
Can somebody please give me some insight on this. I'm trying to really learn more about sub queries so an explanation would be greatly appreciated.
Thank you!
You need to select the fk too and then restrict by that, so go from
SELECT fk,MAX(date)
FROM table
GROUP BY fk
to
With Ones2update AS
(
    SELECT fk, MAX([date]) AS [date]
    FROM table
    GROUP BY fk
)
Update t
set activeFlg = 1
from table t
join Ones2update u ON t.fk = u.fk and t.[date] = u.[date]
Also, I would test first, so run this query before updating:
With Ones2update AS
(
    SELECT fk, MAX([date]) AS [date]
    FROM table
    GROUP BY fk
)
select t.fk, t.[date], t.activeFlg
from table t
join Ones2update u ON t.fk = u.fk and t.[date] = u.[date]
to make sure you are getting what you expect and I did not make any typos.
Additional note: I use a join instead of a sub-query -- they are logically the same but I always find joins to be clearer (once I got used to using joins). Depending on the optimizer they can be faster.
This is the general idea. You can flesh out the details.
update t
set activeFlg = 1
from yourTable t
join (
    select fk, max([date]) as maxdate
    from yourTable
    group by fk
) sq on t.fk = sq.fk and t.[date] = sq.maxdate

How can I select adjacent rows to an arbitrary row (in sql or postgresql)?

I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table, (e.g. primary keys that are numerically 5 less than the target row's key and 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key >= (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus be less susceptible to mathematical locating (since some results would be filtered out, e.g. with a where active=1):
select primary_key from table where primary_key >= (34-5) and active=1
order by primary_key limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how, due to the gaps in the primary keys caused by the example where condition (for example because there are many inactive items), I'm no longer getting the closest 5 below and 5 above; instead I'm getting the closest 1 below and the closest 9 above.
There's a lot of ways to do it if you run two queries with a programming language, but here's one way to do it in one SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.
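Since both halves filter on active = 1 and sort by id, a partial index (a PostgreSQL feature; table and index names here are placeholders) might speed this up:
-- placeholder names; the partial index matches the active = 1 filter exactly
CREATE INDEX idx_mytable_active_id ON mytable (id) WHERE active = 1;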
Here's another way to do it with analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause. So instead you need to use subqueries or CTE's. Here's an example that will work with the pagila sample database.
WITH base AS (
SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
c.*
FROM customer c
WHERE c.active = 1
AND c.last_name LIKE 'B%'
)
SELECT base.* FROM base
JOIN (
-- Select the center row, coalesce so it still works if there aren't
-- 5 rows in front or behind
SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
FROM base WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.
For similar queries I use analytic functions without a CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
Such a query runs faster for me than the CTE with a join (the answer from Scott Bailey), but of course it's less elegant.
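One caveat: most engines (including PostgreSQL) will not let you reference the leadId/lagId aliases in the WHERE clause of the same query level, so in practice the window functions go in a derived table. A sketch, with a hypothetical table gm filled in for the elided parts:
-- sketch: compute window aliases in an inner query, then filter on them
SELECT *
FROM (
    SELECT gm.*,
           LEAD(gm.id)    OVER (ORDER BY Cit DESC) AS leadId,
           LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) AS leadId2,
           LAG(gm.id)     OVER (ORDER BY Cit DESC) AS lagId,
           LAG(gm.id, 2)  OVER (ORDER BY Cit DESC) AS lagId2
    FROM gm
) x
WHERE id = 25912
   OR leadId = 25912 OR leadId2 = 25912
   OR lagId = 25912 OR lagId2 = 25912;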
You could do this utilizing row_number() (available as of 8.4). This may not be the correct syntax (not familiar with postgresql), but hopefully the idea will be illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE 25 < r and r < 35
This will generate a first column having sequential numbers. You can use this to identify the single row and the rows above and below it.
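To avoid hard-coding the 25..35 window (the concern raised earlier about not knowing which row_number the center row gets), one option is to join the numbered set to itself; a sketch, assuming placeholder names and a target row with primary_key = 34:
-- find the target's row number, then take the 5 rows on either side
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
    FROM mytable
    WHERE active = 1
)
SELECT n.*
FROM numbered n
JOIN numbered target ON target.primary_key = 34
WHERE n.r BETWEEN target.r - 5 AND target.r + 5
ORDER BY n.r;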
If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
select count(*) from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
select (
select count(*)
from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith' -- this is your main selection criterion
and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.