How can I improve the native query for a table with 7 millions rows?

How can I improve the native query for a table with 7 millions rows? - sql

I have the below view(table) in my database(SQL SERVER).
I want to retrieve 2 things from this table.
The object which has the latest booking date for each Product number.
It will return the objects = {0001, 2, 2019-06-06 10:39:58} and {0003, 2, 2019-06-07 12:39:58}.
If all the step number has no booking date for a Product number, it wil return the object with Step number = 1. It will return the object = {0002, 1, NULL}.
The view has 7.000.000 rows. I must do it by using native query.
The first query that retrieves the product with the latest booking date:
SELECT DISTINCT *
FROM TABLE t
WHERE t.BOOKING_DATE = (SELECT max(tbl.BOOKING_DATE) FROM TABLE tbl WHERE t.PRODUCT_NUMBER = tbl.PRODUCT_NUMBER)
The second query that retrieves the product with booking date NULL and Step number = 1;
SELECT DISTINCT *
FROM TABLE t
WHERE (SELECT max(tbl.BOOKING_DATE) FROM TABLE tbl WHERE t.PRODUCT_NUMBER = tbl.PRODUCT_NUMBER) IS NULL AND t.STEP_NUMBER = 1
I tried using a single query, but it takes too long.
For now I use 2 query for getting this information but for the future I need to improve this. Do you have an alternative? I also can not use stored procedure, function inside SQL SERVER. I must do it with native query from Java.

Try this,
Declare #p table(pumber int,step int,bookdate datetime)
insert into #p values
(1,1,'2019-01-01'),(1,2,'2019-01-02'),(1,3,'2019-01-03')
,(2,1,null),(2,2,null),(2,3,null)
,(3,1,null),(3,2,null),(3,3,'2019-01-03')
;With CTE as
(
select pumber,max(bookdate)bookdate
from #p p1
where bookdate is not null
group by pumber
)
select p.* from #p p
where exists(select 1 from CTE c
where p.pumber=c.pumber and p.bookdate=c.bookdate)
union all
select p1.* from #p p1
where p1.bookdate is null and step=1
and not exists(select 1 from CTE c
where p1.pumber=c.pumber)
If performance is main concern then 1 or 2 query do not matter,finally performance matter.
Create NonClustered index ix_Product on Product (ProductNumber,BookingDate,Stepnumber)
Go
If more than 90% of data are where BookingDate is not null or where BookingDate is null then you can create Filtered Index on it.
Create NonClustered index ix_Product on Product (ProductNumber,BookingDate,Stepnumber)
where BookingDate is not null
Go

Try row_number() with a proper ordering. Null values are treated as the lowest possible values by sql-server ORDER BY.
SELECT TOP(1) WITH TIES *
FROM myTable t
ORDER BY row_number() over(partition by PRODUCT_NUMBER order by BOOKING_DATE DESC, STEP_NUMBER);
Pay attention to sql-server adviced indexes to get good performance.

Possibly the most efficient method is a correlated subquery:
select t.*
from t
where t.step_number = (select top (1) t2.step_number
from t t2
where t2.product_number = t.product_number and
order by t2.booking_date desc, t2.step_number
);
In particular, this can take advantage of an index on (product_number, booking_date desc, step_number).

Related

Query Optimization, Issue

Using SQL Server 2012;
I am using a query to find deltas in a table.
I have an archive table that has all the records with Licenceno PK,FileID
I want to find out how many Licenceno are in a fileId but are not in previous FileID.
Code Used:
Select count(*) from table where fileid = 123 and Licenceno not in (select Licenceno from table where fileid <123)
The code works fine but the problem is some of the fileIds have the same number of records as the previous ones but take 4 hours and are still running..
Is it a table issue?
Index cant be an issue as the whole table has
a non clustered index.
It is happening generally when i am calculating deltas for the latest Licenceno.
or Query planning is the issue?
I am not able to solve this for the past 5 days.

I would rewrite your query to use an exists clause, and also add an appropriate index:
SELECT COUNT(*)(
FROM yourTable t1
WHERE
fileid = 123 AND
NOT EXISTS (SELECT 1 FROM yourTable t2
WHERE t2.Licenseno = t1.Licenseno AND t2.fileid < 123);
An index on (Licenseno, fileid) might help here:
CREATE INDEX idx ON yourTable (Licenseno, fileid);
You may also try the came composite index in the reverse order:
CREATE INDEX idx ON yourTable (fileid, Licenseno);

Why not use count(distinct)?
select count(distinct licenseno)
from table
where fileid = 123;
For this query, you want an index on (fileid, licenseno).
You are complicating the logic by thinking sequentially ("have I seen this licenseno already?"). Instead, you just want to count the distinct values.
EDIT:
For this problem, you can try two levels of aggregation:
select count(*)
from (select licenseno, min(fileid) as min_fileid
from t
where licenseno <= 123
group by licenseno
) t
where min_fileid = 123;
How good the performance is relative to other approaches dependson how selective <= 123 is.

You could also use LAG for this
SELECT COUNT(*)
FROM (SELECT fileid,
LAG(fileid) OVER (PARTITION BY Licenceno ORDER BY fileid) AS prevFileID
FROM TABLE
WHERE fileid <= 123 ) D
WHERE fileid = 123
AND prevFileID IS NULL
... or an aggregation query ...
WITH T
AS (SELECT 1 AS Flag,
FROM TABLE
WHERE fileid <= 123
GROUP BY Licenceno
HAVING MIN(fileid) = 123 )
SELECT COUNT(*)
FROM T

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID,
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
enter image description here

You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id,
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).

Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that that that does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).

A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Order by data as per supplied Id in sql

Query:
SELECT *
FROM [MemberBackup].[dbo].[OriginalBackup]
where ration_card_id in
(
1247881,174772,
808454,2326154
)
Right now the data is ordered by the auto id or whatever clause I'm passing in order by.
But I want the data to come in sequential format as per id's I have passed
Expected Output:
All Data for 1247881
All Data for 174772
All Data for 808454
All Data for 2326154
Note:
Number of Id's to be passed will 300 000

One option would be to create a CTE containing the ration_card_id values and the orders which you are imposing, and the join to this table:
WITH cte AS (
SELECT 1247881 AS ration_card_id, 1 AS position
UNION ALL
SELECT 174772, 2
UNION ALL
SELECT 808454, 3
UNION ALL
SELECT 2326154, 4
)
SELECT t1.*
FROM [MemberBackup].[dbo].[OriginalBackup] t1
INNER JOIN cte t2
ON t1.ration_card_id = t2.ration_card_id
ORDER BY t2.position DESC
Edit:
If you have many IDs, then neither the answer above nor the answer given using a CASE expression will suffice. In this case, your best bet would be to load the list of IDs into a table, containing an auto increment ID column. Then, each number would be labelled with a position as its record is being loaded into your database. After this, you can join as I have done above.

If the desired order does not reflect a sequential ordering of some preexisting data, you will have to specify the ordering yourself. One way to do this is with a case statement:
SELECT *
FROM [MemberBackup].[dbo].[OriginalBackup]
where ration_card_id in
(
1247881,174772,
808454,2326154
)
ORDER BY CASE ration_card_id
WHEN 1247881 THEN 0
WHEN 174772 THEN 1
WHEN 808454 THEN 2
WHEN 2326154 THEN 3
END
Stating the obvious but note that this ordering most likely is not represented by any indexes, and will therefore not be indexed.

Insert your ration_card_id's in #temp table with one identity column.
Re-write your sql query as:
SELECT a.*
FROM [MemberBackup].[dbo].[OriginalBackup] a
JOIN #temps b
on a.ration_card_id = b.ration_card_id
order by b.id

Remove duplicates (1 to many) or write a subquery that solves my problem

Referring to the diagram below the records table has unique Records. Each record is updated, via comments through an Update Table. When I join the two I get lots of duplicates.
How to remove duplicates? Group By does not work for me as I have more than 10 fields in select query and some of them are functions.
Write a sub query which pulls the last updates in the Update table for each record that is updated in a particular month. Joining with this sub query will solve my problem.
Thanks!
Edit
Table structure that is of interest is
create table Records(
recordID int,
90more_fields various
)
create table Updates(
update_id int,
record_id int,
comment text,
byUser varchar(25),
datecreate datetime
)

Here's one way.
SELECT * /*But list columns explicitly*/
FROM Orange o
CROSS APPLY (SELECT TOP 1 *
FROM Blue b
WHERE b.datecreate >= '20110901'
AND b.datecreate < '20111001'
AND o.RecordID = b.Record_ID2
ORDER BY b.datecreate DESC) b

Based on the limited information available...
WITH cteLastUpdate AS (
SELECT Record_ID2, UpdateDateTime,
ROW_NUMBER() OVER(PARTITION BY Record_ID2 ORDER BY UpdateDateTime DESC) AS RowNUM
FROM BlueTable
/* Add WHERE clause if needed to restrict date range */
)
SELECT *
FROM cteLastUpdate lu
INNER JOIN OrangeTable o
ON lu.Record_ID2 = o.RecordID
WHERE lu.RowNum = 1

Last updates per record and month:
SELECT *
FROM UPDATES outerUpd
WHERE exists
(
-- Magic part
SELECT 1
FROM UPDATES innerUpd
WHERE innerUpd.RecordId = outerUpd.RecordId
GROUP BY RecordId
, date_part('year', innerUpd.datecolumn)
, date_part('month', innerUpd.datecolumn)
HAVING max(innerUpd.datecolumn) = outerUpd.datecolumn
)
(Works on PostgreSQL, date_part is different in other RDBMS)

Find the longest sequence of a value in a table

This is an SQL Question, I think it is difficult one - I'm not sure it is possible to achieve in a simple SQL sentence or a stored procedure:
I want to find the number of the longest sequence of the same (known) number in a column in a table:
example:
TABLE:
DATE SALEDITEMS
1/1/09 4
1/2/09 3
1/3/09 3
1/4/09 4
1/5/09 3
calling the sp/sentence for 4 will give 1 calling the sp/sentecne for 3 will give 2
as there was 2 times in a row number 3.
I'm running SQL server 2008.

UPDATE: I generated a million rows of random data, and abandoned the recursive CTE solution, as its query plan didn't make good use of indexes in the optimizer.
But the non-recursive solution I originaly posted turned out to work great, as long as there was an additional non-clustered index on (SALEDITEMS, [DATE]). This makes sense, since the query needs to filter in both directions (both by date and by SALEDITEMS). With this additional index, queries on a million rows return in under 2 seconds on my (not very beefy) desktop mathine. Without this index, the query was dog-slow.
BTW, this is a great example of how SQL Server's cost-based query optimization totally breaks down in some cases. The recursive CTE solution has a cost (on my PC) of 42 and takes at least several minutes to finish. The non-recursive solution has a cost of 15,446 (!!!) and completes in 1.5 seconds. Moral of the story: when comparing SQL Server query plans, don't assume that cost necessarily correlates to query performance!
Anyway, here's the solution I'd recommend (the same non-recursive CTE I posted earlier) :
DECLARE #SALEDITEMS INT = 3;
WITH SalesNoMatch ([DATE], SALEDITEMS, NoMatchDate)
AS
(
SELECT [DATE], SALEDITEMS,
(SELECT MIN([DATE]) FROM Sales s2 WHERE s2.SALEDITEMS <> #SALEDITEMS
AND s2.[DATE] > s1.[DATE]) as NoMatchDate
FROM Sales s1
)
, SalesMatchCount ([DATE], ConsecutiveCount) AS
(
SELECT [DATE], 1+(SELECT COUNT(1) FROM Sales s2 WHERE s2.[DATE] > s1.[DATE] AND s2.[DATE] < NoMatchDate)
FROM SalesNoMatch s1
WHERE s1.SALEDITEMS = #SALEDITEMS
)
SELECT MAX(ConsecutiveCount)
FROM SalesMatchCount;
Here's the DDL I used to test this, including indexes you'll need:
CREATE TABLE [Sales](
[DATE] date NOT NULL,
[SALEDITEMS] int NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX IX_Sales ON Sales ([DATE]);
CREATE UNIQUE NONCLUSTERED INDEX IX_Sales2 ON Sales (SALEDITEMS, [DATE]);
And here's how I created my test data-- 1,000,001 rows with ascending dates with SALEDITEMS randomly set between 1 and 10.
INSERT INTO Sales ([DATE], SALEDITEMS)
VALUES ('1/1/09', 5)
DECLARE #i int = 0;
WHILE (#i < 1000000)
BEGIN
INSERT INTO Sales ([DATE], SALEDITEMS)
SELECT DATEADD (d, 1, (SELECT MAX ([DATE]) FROM Sales)), ABS(CHECKSUM(NEWID())) % 10 + 1
SET #i = #i + 1;
END
Here's the recursive-CTE solution that I abandoned:
DECLARE #SALEDITEMS INT = 3;
-- recursive CTE solution (remember to set MAXRECURSION!)
WITH SalesRowNum ([DATE], SALEDITEMS, RowNum)
AS
(
SELECT [DATE], SALEDITEMS, ROW_NUMBER() OVER (ORDER BY s1.[DATE]) as RowNum
FROM Sales s1
)
, SalesCTE (RowNum, [DATE], ConsecutiveCount)
AS
(
SELECT s1.RowNum, s1.[DATE], 1 AS ConsecutiveCount
FROM SalesRowNum s1
WHERE SALEDITEMS = #SALEDITEMS
UNION ALL
SELECT s1.RowNum, s1.[DATE], ConsecutiveCount + 1 AS ConsecutiveCount
FROM SalesRowNum s1
INNER JOIN SalesCTE s2 ON s1.RowNum = s2.RowNum + 1
WHERE SALEDITEMS = #SALEDITEMS
)
SELECT MAX(ConsecutiveCount)
FROM SalesCTE;

Untested, because you did not provide DDL and sample data:
DECLARE #SALEDITEMS INT;
SET #SALEDITEMS=3;
SELECT MAX(cnt) FROM(
SELECT COUNT(*) FROM YourTable JOIN (
SELECT y1.[Date] AS d1, y2.[Date] AS d2
FROM YourTable AS y1 JOIN YourTable AS y2
ON y1.SALEDITEMS=#SALEDITEMS AND y2.SALEDITEMS=#SALEDITEMS
AND NOT EXISTS(SELECT 1 FROM YourTable AS y
WHERE y.SALEDITEMS<>#SALEDITEMS
AND y1.[Date] < y.[Date] AND y.[Date] < y2.[Date])
) AS t
WHERE [Date] BETWEEN t.d1 AND t.d2
) AS t;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How can I improve the native query for a table with 7 millions rows? - sql

Related

Query Optimization, Issue

Modify my SQL Server query -- returns too many rows sometimes

Order by data as per supplied Id in sql

Remove duplicates (1 to many) or write a subquery that solves my problem

Find the longest sequence of a value in a table

Categories

Resources