Sql query slowness due to rank and columns with null values


EDITED:
I have the following table in my database, with around 10 million records:
Declaration:
create table PropertyOwners (
[Key] int not null primary key,
PropertyKey int not null,
BoughtDate DateTime,
OwnerKey int null,
GroupKey int null
)
go
[Key] is the primary key, and the combination of PropertyKey, BoughtDate, OwnerKey and GroupKey is unique.
With the following index:
CREATE NONCLUSTERED INDEX [IX_PropertyOwners] ON [dbo].[PropertyOwners]
(
[PropertyKey] ASC,
[BoughtDate] DESC,
[IsGroup] ASC
)
INCLUDE ( [OwnerKey], [GroupKey])
go
Description of the case:
For a single BoughtDate, one property can belong to multiple owners or to a single group. A single record can have either an OwnerKey or a GroupKey but not both, so one of them is null in every record. I am trying to retrieve data from the table for a given OwnerKey using the following query. If a property has rows for both owners and a group at the same time, then the rows with an OwnerKey are preferred; that is why I am using "IsGroup" in the Rank function.
declare @ownerKey int = 40000
select PropertyKey, BoughtDate, OwnerKey, GroupKey
from (
select PropertyKey, BoughtDate, OwnerKey, GroupKey,
RANK() over (partition by PropertyKey order by BoughtDate desc, IsGroup) as [Rank]
from PropertyOwners
) as result
where result.[Rank]=1 and result.[OwnerKey]=@ownerKey
It takes 2-3 seconds to get the records whenever I use [Rank]=1 with any of PropertyKey/OwnerKey/GroupKey. But when I get the records for the PropertyKey/OwnerKey/GroupKey without [Rank]=1 in the same query, it executes in milliseconds. See the following query:
declare @ownerKey int = 40000
select PropertyKey, BoughtDate, OwnerKey, GroupKey
from (
select PropertyKey, BoughtDate, OwnerKey, GroupKey,
RANK() over (partition by PropertyKey order by BoughtDate desc, IsGroup) as [Rank]
from PropertyOwners
) as result
where result.[OwnerKey]=@ownerKey
I have also tried using an indexed view to pre-rank the rows, but I can't use it in my query because the Rank function is not supported in indexed views.
Please note that this table is updated once a day and that I am using SQL Server 2008 R2. Any help will be greatly appreciated.

If I understood your query correctly, it basically does as follows: "for a given owner, return all properties for which this owner is the latest one".
This can also be achieved in other ways, without ranking the entire 10M table, such as:
select po.*
from dbo.PropertyOwners po
where po.OwnerKey = @OwnerKey
and not exists (
select 0 from dbo.PropertyOwners lo
where lo.PropertyKey = po.PropertyKey
and lo.BoughtDate > po.BoughtDate
-- Other group-related conditions here, if need be
);
Essentially the same, just a little bit different wording:
select po.*
from dbo.PropertyOwners po
left join dbo.PropertyOwners lo on lo.PropertyKey = po.PropertyKey
and lo.BoughtDate > po.BoughtDate
-- Other group-related conditions here, if need be
where po.OwnerKey = @OwnerKey
and lo.PropertyKey is null;
You will definitely need different indices for these, and I can't be sure they will help. But at least give it a try.
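As a rough check of the idea above, here is a minimal sketch that runs both query shapes against a tiny in-memory SQLite table through Python's sqlite3 module (SQLite 3.25 or later is needed for RANK). The sample rows are invented for illustration, the group-related columns are left out, and SQLite stands in for SQL Server only so the example is self-contained; it says nothing about performance on the real 10M-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PropertyOwners (
    [Key] INTEGER PRIMARY KEY,
    PropertyKey INTEGER NOT NULL,
    BoughtDate TEXT NOT NULL,
    OwnerKey INTEGER
);
-- Hypothetical sample data, not from the original question.
INSERT INTO PropertyOwners VALUES
    (1, 10, '2010-01-01', 40000),
    (2, 10, '2012-01-01', 50000),  -- property 10 was later bought by owner 50000
    (3, 20, '2011-06-01', 40000),  -- owner 40000 is still the latest owner of 20
    (4, 30, '2009-03-01', 60000),
    (5, 30, '2013-03-01', 40000);  -- owner 40000 is the latest owner of 30
""")
owner_key = 40000

# Shape 1: rank every row per property, then filter (as in the question).
ranked_rows = conn.execute("""
    SELECT PropertyKey, BoughtDate, OwnerKey
    FROM (
        SELECT PropertyKey, BoughtDate, OwnerKey,
               RANK() OVER (PARTITION BY PropertyKey
                            ORDER BY BoughtDate DESC) AS rnk
        FROM PropertyOwners
    ) AS result
    WHERE rnk = 1 AND OwnerKey = ?
    ORDER BY PropertyKey
""", (owner_key,)).fetchall()

# Shape 2: start from the owner's rows and discard superseded ones.
not_exists_rows = conn.execute("""
    SELECT po.PropertyKey, po.BoughtDate, po.OwnerKey
    FROM PropertyOwners po
    WHERE po.OwnerKey = ?
      AND NOT EXISTS (
          SELECT 1 FROM PropertyOwners lo
          WHERE lo.PropertyKey = po.PropertyKey
            AND lo.BoughtDate > po.BoughtDate
      )
    ORDER BY po.PropertyKey
""", (owner_key,)).fetchall()

print(ranked_rows)
print(not_exists_rows)
```

On this sample both shapes return properties 20 and 30 for owner 40000, which is the equivalence the rewrite relies on when no group rows are involved.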

Related

Query to determine cumulative changes to records

Given the following table containing the example rows, I’m looking for a query to give me the aggregate results of changes made to the same record. All changes are made against a base record in another table (results table), so the contents of the results table are not cumulative.
Base Records (from which all changes are made)
Edited Columns highlighted
I’m looking for a query that would give me the cumulative changes (in order by date). This would be the resulting rows:
Any help appreciated!
UPDATE:
Let me offer some clarification. The records being edited exist in one table, let's call that [dbo].[Base]. When a person updates a record from [dbo].[Base], his updates go into [dbo].[Updates]. Therefore, a person is always editing from the base table.
At some point, let's say once a day, we need to calculate the sum of changes with the following rule:
For any given record, determine the latest change for each column and take the latest change. If no change was made to a column, take the value from [dbo].[Base]. So, one way of looking at the [dbo].[Updates] table would be to see only the changed columns.
Please let's not discuss the merits of this approach, I realize it's strange. I just need to figure out how to determine the final state of each record.
Thanks!

This is dirty, but you can give this a shot (test here: https://rextester.com/MKSBU15593)
I use a CTE to do an initial CROSS JOIN of the Base and Update tables and then a second to filter it to only the rows where the IDs match. From there I use FIRST_VALUE() for each column, partitioned by the ID value and ordered by a CASE expression (if the Base column value matches the Update column value then 1 else 0) and the Datemodified column to get the most recent version of each column.
It spits out
CREATE TABLE Base
(
ID INT
,FNAME VARCHAR(100)
,LNAME VARCHAR(100)
,ADDRESS VARCHAR(100)
,RATING INT
,[TYPE] VARCHAR(5)
,SUBTYPE VARCHAR(5)
);
INSERT INTO dbo.Base
VALUES
( 100,'John','Doe','123 First',3,'Emp','W2'),
( 200,'Jane','Smith','Wacker Dr.',2,'Emp','W2');
CREATE TABLE Updates
(
ID INT
,DATEMODIFIED DATE
,FNAME VARCHAR(100)
,LNAME VARCHAR(100)
,ADDRESS VARCHAR(100)
,RATING INT
,[TYPE] VARCHAR(5)
,SUBTYPE VARCHAR(5)
);
INSERT INTO dbo.Updates
VALUES
( 100,'1/15/2019','John','Doe','123 First St.',3,'Emp','W2'),
( 200,'1/15/2019','Jane','Smyth','Wacker Dr.',2,'Emp','W2'),
( 100,'1/17/2019','Johnny','Doe','123 First',3,'Emp','W2'),
( 200,'1/19/2019','Jane','Smith','2 Wacker Dr.',2,'Emp','W2'),
( 100,'1/20/2019','Jon','Doe','123 First',3,'Cont','W2');
WITH merged AS
(
SELECT b.ID AS IDOrigin
,'1/1/1900' AS DATEMODIFIEDOrigin
,b.FNAME AS FNAMEOrigin
,b.LNAME AS LNAMEOrigin
,b.ADDRESS AS ADDRESSOrigin
,b.RATING AS RATINGOrigin
,b.[TYPE] AS TYPEOrigin
,b.SUBTYPE AS SUBTYPEOrigin
,u.*
FROM base b
CROSS JOIN
dbo.Updates u
), filtered AS
(
SELECT *
FROM merged
WHERE IDOrigin = ID
)
SELECT distinct
ID
,FNAME = FIRST_VALUE(FNAME) OVER (PARTITION BY ID ORDER BY CASE WHEN FNAME = FNAMEOrigin THEN 1 ELSE 0 end, datemodified desc)
,LNAME = FIRST_VALUE(LNAME) OVER (PARTITION BY ID ORDER BY CASE WHEN LNAME = LNAMEOrigin THEN 1 ELSE 0 end, datemodified desc)
,ADDRESS = FIRST_VALUE(ADDRESS) OVER (PARTITION BY ID ORDER BY CASE WHEN ADDRESS = ADDRESSOrigin THEN 1 ELSE 0 end, datemodified desc)
,RATING = FIRST_VALUE(RATING) OVER (PARTITION BY ID ORDER BY CASE WHEN RATING = RATINGOrigin THEN 1 ELSE 0 end, datemodified desc)
,[TYPE] = FIRST_VALUE([TYPE]) OVER (PARTITION BY ID ORDER BY CASE WHEN [TYPE] = TYPEOrigin THEN 1 ELSE 0 end, datemodified desc)
,SUBTYPE = FIRST_VALUE(SUBTYPE) OVER (PARTITION BY ID ORDER BY CASE WHEN SUBTYPE = SUBTYPEOrigin THEN 1 ELSE 0 end, datemodified desc)
FROM filtered
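
For readers who want to try the FIRST_VALUE idea without a SQL Server instance, the following is a reduced sketch using Python's sqlite3 module (SQLite 3.25 or later). It keeps only three of the columns, replaces the CROSS JOIN plus filter with a plain inner join, and rewrites the dates in ISO format so that text ordering matches date ordering; those simplifications are assumptions of this sketch, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Base (ID INTEGER, FNAME TEXT, LNAME TEXT, ADDRESS TEXT);
INSERT INTO Base VALUES
    (100, 'John', 'Doe',   '123 First'),
    (200, 'Jane', 'Smith', 'Wacker Dr.');

CREATE TABLE Updates (ID INTEGER, DATEMODIFIED TEXT,
                      FNAME TEXT, LNAME TEXT, ADDRESS TEXT);
INSERT INTO Updates VALUES
    (100, '2019-01-15', 'John',   'Doe',   '123 First St.'),
    (200, '2019-01-15', 'Jane',   'Smyth', 'Wacker Dr.'),
    (100, '2019-01-17', 'Johnny', 'Doe',   '123 First'),
    (200, '2019-01-19', 'Jane',   'Smith', '2 Wacker Dr.'),
    (100, '2019-01-20', 'Jon',    'Doe',   '123 First');
""")

# For each column: rows that differ from Base sort first (CASE gives 0),
# and within those the most recent DATEMODIFIED wins.
final_rows = conn.execute("""
    SELECT DISTINCT
        u.ID,
        FIRST_VALUE(u.FNAME) OVER (
            PARTITION BY u.ID
            ORDER BY CASE WHEN u.FNAME = b.FNAME THEN 1 ELSE 0 END,
                     u.DATEMODIFIED DESC) AS FNAME,
        FIRST_VALUE(u.LNAME) OVER (
            PARTITION BY u.ID
            ORDER BY CASE WHEN u.LNAME = b.LNAME THEN 1 ELSE 0 END,
                     u.DATEMODIFIED DESC) AS LNAME,
        FIRST_VALUE(u.ADDRESS) OVER (
            PARTITION BY u.ID
            ORDER BY CASE WHEN u.ADDRESS = b.ADDRESS THEN 1 ELSE 0 END,
                     u.DATEMODIFIED DESC) AS ADDRESS
    FROM Base b
    JOIN Updates u ON u.ID = b.ID
    ORDER BY 1
""").fetchall()

for row in final_rows:
    print(row)
```

Note that under this ordering a column that was changed and later set back to its base value still reports the changed value (for example LNAME 'Smyth' for ID 200), because any row that differs from Base sorts ahead of rows that match it.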

Don't you just want the last record?
select e.*
from edited e
where e.datemodified = (select max(e2.datemodified)
from edited e2
where e2.id = e.id
);

How to insert rows of data in order using MERGE Statement?

I would like the rows to be inserted in the same order as in the source select statement, i.e. ORDER BY TMP.DEF_DATA_SK, but they are inserted somewhat randomly.
With a simple Insert into Select statement this can be done, but I want to do it using MERGE.
SQL is as follows
MERGE
INTO
HCI_STD_STAGING.STAGE.DEF_DATA
TRG
USING
( SELECT TMP.DEF_DATA_SK,
TMP.VAL,
TMP.CD,
TMP.DESCR,
TMP.DEF_TP_SK TYPE_SK ,--
TMP.PRN_SK PARENT, --
PRN.DEF_DATA_SK PRN_SK, PRN.VAL PRN_VAL,
TYP.DEF_TP_SK
,PRN_PRN.VAL DB
,ROW_NUMBER() OVER (ORDER BY TMP.DEF_DATA_SK) AS RowNum
FROM
HCI_STD_STAGING.STAGE._DEF_DATA_TMP TMP
LEFT JOIN HCI_STD_STAGING.STAGE.DEF_TP TYP
ON TMP.DEF_TP_SK = TYP.CD --TYPE
LEFT JOIN HCI_STD_STAGING.STAGE.DEF_DATA PRN
ON TMP.PRN_SK = PRN.VAL -- SCH
INNER JOIN HCI_STD_STAGING.STAGE.DEF_DATA PRN_PRN
ON PRN.PRN_SK = PRN_PRN.DEF_DATA_SK AND TMP.DB = PRN_PRN.VAL --AND TMP.SCH = PRN.VAL
WHERE TMP.DEF_TP_SK = 'Table Object'
GROUP BY
TMP.DEF_DATA_SK,
TMP.VAL,
TMP.CD,
TMP.DESCR,
TMP.DEF_TP_SK ,
TMP.PRN_SK ,
PRN.DEF_DATA_SK , PRN.VAL ,
TYP.DEF_TP_SK
,PRN_PRN.VAL
--order by TMP.DEF_DATA_SK
) SRC
ON SRC.VAL = TRG.VAL
AND SRC.PRN_SK = TRG.PRN_SK
AND SRC.DEF_TP_SK = TRG.DEF_TP_SK
WHEN NOT MATCHED
THEN
INSERT
(
VAL,CD, DESCR, DEF_TP_SK, PRN_SK
)
VALUES ( SRC.VAL, SRC.CD,SRC.DESCR,SRC.DEF_TP_SK,SRC.PRN_SK );

There is no guarantee about the order in which the rows are inserted into the table.
However, if your table has a primary key, the rows will be stored in primary-key order, because creating a primary key also creates, by default, a clustered index on that key. If the primary key is defined as an identity column, the only guarantee is that the identity values will be generated according to the ORDER BY clause.
If you don't want to put a primary key on the table, you can instead create a clustered index on a column of your choice. Clustered indexes come with other considerations that depend on your needs; please take a look at the documentation:
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-2017
That way the rows are stored in the order you expect. To read them back in that order, still add an ORDER BY to the SELECT, since a query without one has no guaranteed result order.

Ambiguous column name SQL

I get the following error when I want to execute a SQL query:
"Msg 209, Level 16, State 1, Line 9
Ambiguous column name 'i_id'."
This is the SQL query I want to execute:
SELECT DISTINCT x.*
FROM items x LEFT JOIN items y
ON y.i_id = x.i_id
AND x.last_seen < y.last_seen
WHERE x.last_seen > '4-4-2017 10:54:11'
AND x.spot = 'spot773'
AND (x.technology = 'Bluetooth LE' OR x.technology = 'EPC Gen2')
AND y.id IS NULL
GROUP BY i_id
This is what my table looks like:
CREATE TABLE [dbo].[items] (
[id] INT IDENTITY (1, 1) NOT NULL,
[i_id] VARCHAR (100) NOT NULL,
[last_seen] DATETIME2 (0) NOT NULL,
[location] VARCHAR (200) NOT NULL,
[code_hex] VARCHAR (100) NOT NULL,
[technology] VARCHAR (100) NOT NULL,
[url] VARCHAR (100) NOT NULL,
[spot] VARCHAR (200) NOT NULL,
PRIMARY KEY CLUSTERED ([id] ASC));
I've tried a couple of things but I'm not an SQL expert:)
Any help would be appreciated
EDIT:
I do get duplicate rows when I remove the GROUP BY line as you can see:

I'm adding another answer in order to show how you'd typically select the latest record per group without getting duplicates. You'd use ROW_NUMBER for this, marking the last record per i_id with row number 1.
SELECT *
FROM
(
SELECT
i.*,
ROW_NUMBER() over (PARTITION BY i_id ORDER BY last_seen DESC) as rn
FROM items i
WHERE last_seen > '2017-04-04 10:54:11'
AND spot = 'spot773'
AND technology IN ('Bluetooth LE', 'EPC Gen2')
) ranked
WHERE rn = 1;
(You'd use RANK or DENSE_RANK instead of ROW_NUMBER if you wanted duplicates.)
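To make the difference between ROW_NUMBER and RANK concrete, here is a small sketch using Python's sqlite3 module (SQLite 3.25 or later). The table is cut down to three columns and the rows are invented, with a deliberate tie on last_seen for one i_id; the helper name latest_per_item is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, i_id TEXT, last_seen TEXT);
-- Hypothetical rows: 'tag-a' has two rows sharing the latest timestamp.
INSERT INTO items (i_id, last_seen) VALUES
    ('tag-a', '2017-04-05 09:00:00'),
    ('tag-a', '2017-04-05 10:00:00'),
    ('tag-a', '2017-04-05 10:00:00'),
    ('tag-b', '2017-04-05 08:00:00');
""")

def latest_per_item(window_function):
    """Return the rows numbered 1 per i_id by the given window function."""
    sql = f"""
        SELECT i_id, last_seen
        FROM (
            SELECT i_id, last_seen,
                   {window_function}() OVER (PARTITION BY i_id
                                             ORDER BY last_seen DESC) AS rn
            FROM items
        ) ranked
        WHERE rn = 1
        ORDER BY i_id
    """
    return conn.execute(sql).fetchall()

row_number_rows = latest_per_item("ROW_NUMBER")  # one row per i_id
rank_rows = latest_per_item("RANK")              # tied rows all get rank 1

print(len(row_number_rows), row_number_rows)
print(len(rank_rows), rank_rows)
```

With ROW_NUMBER the tie is broken arbitrarily and exactly one row per i_id survives; with RANK both tied 'tag-a' rows come back.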

You forgot the table alias in GROUP BY i_id.
Anyway, why are you writing an anti join query where you are trying to get rid of duplicates with both DISTINCT and GROUP BY? Did you have issues with a straight-forward NOT EXISTS query? You are making things way more complicated than they actually are.
SELECT *
FROM items i
WHERE last_seen > '2017-04-04 10:54:11'
AND spot = 'spot773'
AND technology IN ('Bluetooth LE', 'EPC Gen2')
AND NOT EXISTS
(
SELECT *
FROM items other
WHERE i.i_id = other.i_id
AND i.last_seen < other.last_seen
);
(There are other techniques of course to get the last seen record per i_id. This is one; another is to compare with MAX(last_seen); another is to use ROW_NUMBER.)

SQL query to get records even if count is 0 in one table

select id, name, 'First Category' as category, count(id) as totalCalls
from missed_call
where name = 'whatever1'
group by name, category
UNION
select id, name, 'Second Category' as category, count(id) as totalCalls
from missed_call
where name = 'whatever2'
group by name, category
order by name ASC, totalCalls DESC
The previous query will not retrieve the records where totalCalls is 0.
So, how can I get those records and present totalCalls as 0?
UPDATE: I have tried changing count(id) as totalCalls for IFNULL(count(id), 0) as totalCalls but it doesn't solve the problem. Perhaps, because count(id) is actually not null, it just does not exist.

If you are unwilling to expand your database schema you can always pretend there is a table:
select surrogateTable.name,
surrogateTable.Category,
count(id) as totalCalls
from
(
select 'whatever1' Name,
'First Category' Category
union all
select 'whatever2',
'Second Category'
) surrogateTable
left join missed_call
on surrogateTable.Name = missed_call.Name
group by surrogateTable.name, surrogateTable.category
I dropped id in select because you should not select something you are not grouping on - this is probably MySql.
Check this on Sql Fiddle.
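
A quick way to see the zero count appear is the following sketch, which runs the same derived-table-plus-LEFT-JOIN shape in an in-memory SQLite database through Python's sqlite3 module. The missed_call table here has only the two columns the query needs, and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE missed_call (id INTEGER PRIMARY KEY, name TEXT);
-- Hypothetical data: two calls for 'whatever1', none for 'whatever2'.
INSERT INTO missed_call (name) VALUES ('whatever1'), ('whatever1');
""")

counts = conn.execute("""
    SELECT s.name, s.category, COUNT(mc.id) AS totalCalls
    FROM (
        SELECT 'whatever1' AS name, 'First Category' AS category
        UNION ALL
        SELECT 'whatever2', 'Second Category'
    ) s
    LEFT JOIN missed_call mc ON mc.name = s.name
    GROUP BY s.name, s.category
    ORDER BY s.name
""").fetchall()

for row in counts:
    print(row)
```

COUNT(mc.id) counts only non-null ids, so the unmatched 'whatever2' row produced by the outer join is reported with a total of 0 rather than disappearing.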

Your problem is that you only look at missed calls and not at categories, so you cannot notice categories that have no corresponding missed calls.
Here is the skeleton that will do that, supposing you will adapt it to the real structure of the category table.
SELECT ...
FROM Category cat
LEFT JOIN missed_call call ON call.category = cat.id
WHERE (call.name = 'whatever1' OR call.category IS NULL)
GROUP BY call.name, call.category
...
Note especially call.category IS NULL. The column is supposedly not nullable; so this really checks for a Category row without any corresponding calls, an artifact of the outer join.

You should define a table named category to contain a complete list of all category names that are possible, even those that have no calls assigned to them (i.e. zero).
create table category
(
id numeric(10,0) NOT NULL,
name varchar(10) NULL
)
Then you can query the full list of categories from this table and LEFT JOIN the results against what you have from above.
You can then amend your missed_call table to use foreign keys against your new category table, for better efficiency and better schema design:
create table missed_call
(
id numeric(10,0) NOT NULL,
first_category_id numeric(10,0) NULL,
second_category_id numeric(10,0) NULL,
name varchar(12)
)

SQL Server: row present in one query, missing in another

Ok so I think I must be misunderstanding something about SQL queries. This is a pretty wordy question, so thanks for taking the time to read it (my problem is right at the end, everything else is just context).
I am writing an accounting system that works on the double-entry principle: money always moves between accounts, and a transaction is 2 or more TransactionParts rows decrementing one account and incrementing another.
Some TransactionParts rows may be flagged as tax related so that the system can produce a report of total VAT sales/purchases etc, so it is possible that a single Transaction may have two TransactionParts referencing the same Account -- one VAT related, and the other not. To simplify presentation to the user, I have a view to combine multiple rows for the same account and transaction:
create view Accounting.CondensedEntryView as
select p.[Transaction], p.Account, sum(p.Amount) as Amount
from Accounting.TransactionParts p
group by p.[Transaction], p.Account
I then have a view to calculate the running balance column, as follows:
create view Accounting.TransactionBalanceView as
with cte as
(
select ROW_NUMBER() over (order by t.[Date]) AS RowNumber,
t.ID as [Transaction], p.Amount, p.Account
from Accounting.Transactions t
inner join Accounting.CondensedEntryView p on p.[Transaction]=t.ID
)
select b.RowNumber, b.[Transaction], a.Account,
coalesce(sum(a.Amount), 0) as Balance
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
For reasons I haven't yet worked out, a certain transaction (ID=30) doesn't appear on an account statement for the user. I confirmed this by running
select * from Accounting.TransactionBalanceView where [Transaction]=30
This gave me the following result:
RowNumber Transaction Account Balance
-------------------- ----------- ------- ---------------------
72 30 23 143.80
As I said before, there should be at least two TransactionParts for each Transaction, so one of them isn't being presented in my view. I assumed there must be an issue with the way I've written my view, and run a query to see if there's anything else missing:
select [Transaction], count(*)
from Accounting.TransactionBalanceView
group by [Transaction]
having count(*) < 2
This query returns no results -- not even for Transaction 30! Thinking I must be an idiot I run the following query:
select [Transaction]
from Accounting.TransactionBalanceView
where [Transaction]=30
It returns two rows! So select * returns only one row and select [Transaction] returns both. After much head-scratching and re-running the last two queries, I concluded I don't have the faintest idea what's happening. Any ideas?
Thanks a lot if you've stuck with me this far!
Edit:
Here are the execution plans:
select *
select [Transaction]
1000 lines each, hence finding somewhere else to host.
Edit 2:
For completeness, here are the tables I used:
create table Accounting.Accounts
(
ID smallint identity primary key,
[Name] varchar(50) not null
constraint UQ_AccountName unique,
[Type] tinyint not null
constraint FK_AccountType foreign key references Accounting.AccountTypes
);
create table Accounting.Transactions
(
ID int identity primary key,
[Date] date not null default getdate(),
[Description] varchar(50) not null,
Reference varchar(20) not null default '',
Memo varchar(1000) not null
);
create table Accounting.TransactionParts
(
ID int identity primary key,
[Transaction] int not null
constraint FK_TransactionPart foreign key references Accounting.Transactions,
Account smallint not null
constraint FK_TransactionAccount foreign key references Accounting.Accounts,
Amount money not null,
VatRelated bit not null default 0
);

Demonstration of possible explanation.
Create table Script
SELECT *
INTO #T
FROM master.dbo.spt_values
CREATE NONCLUSTERED INDEX [IX_T] ON #T ([name] DESC,[number] DESC);
Query one (Returns 35 results)
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Query Two (Same as before but adding c2.[type] to the select list makes it return 0 results)
;
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY NAME) AS rn
FROM #T
)
SELECT c1.number,c1.[type] ,c2.[type]
FROM cte c1
JOIN cte c2 ON c1.rn=c2.rn AND c1.number <> c2.number
Why?
row_number() for duplicate NAMEs isn't specified so it just chooses whichever one fits in with the best execution plan for the required output columns. In the second query this is the same for both cte invocations, in the first one it chooses a different access path with resultant different row_numbering.
Suggested Solution
You are self joining the CTE on ROW_NUMBER() over (order by t.[Date])
Contrary to what may have been expected, the CTE will likely not be materialised, which would have ensured consistency for the self join. You are therefore assuming a correlation between ROW_NUMBER() on both sides that may well not exist for records where a duplicate [Date] exists in the data.
What if you try ROW_NUMBER() over (order by t.[Date], t.[id]) to ensure that in the event of tied dates the row_numbering is in a guaranteed consistent order. (Or some other column/combination of columns that can differentiate records if id won't do it)
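The tie-breaker suggestion can be sketched as follows with Python's sqlite3 module (SQLite 3.25 or later). The single Transactions table with Account and Amount columns is a simplification invented for this example (the question spreads these across several tables and a view), and SQLite will not reproduce the SQL Server plan-dependent behaviour; the sketch only shows the shape of the running-balance self join once the ordering is made unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, flattened stand-in for the question's tables.
CREATE TABLE Transactions (
    ID INTEGER PRIMARY KEY,
    [Date] TEXT NOT NULL,
    Account INTEGER NOT NULL,
    Amount INTEGER NOT NULL
);
INSERT INTO Transactions VALUES
    (1, '2012-01-01', 23, 100),
    (2, '2012-01-02', 23,  20),
    (3, '2012-01-02', 23,  -5),  -- same date as transaction 2
    (4, '2012-01-03', 23,  30);
""")

balances = conn.execute("""
    WITH cte AS (
        SELECT ROW_NUMBER() OVER (ORDER BY [Date], ID) AS RowNumber,
               ID AS [Transaction], Account, Amount
        FROM Transactions
    )
    SELECT b.RowNumber, b.[Transaction], SUM(a.Amount) AS Balance
    FROM cte a
    JOIN cte b ON a.RowNumber <= b.RowNumber AND a.Account = b.Account
    GROUP BY b.RowNumber, b.[Transaction]
    ORDER BY b.RowNumber
""").fetchall()

for row in balances:
    print(row)
```

Because (Date, ID) is unique, both references to the CTE must number the rows identically regardless of how each one is evaluated, so the two transactions that share a date keep a fixed relative position in the running balance.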

If the purpose of this part of the view is just to make sure that the same row isn't joined to itself
where a.RowNumber <= b.RowNumber
then how does changing this part to
where a.RowNumber <> b.RowNumber
affect the results?

It seems you are reading dirty entries (someone else is deleting or inserting data at the same time).
Try SET TRANSACTION ISOLATION LEVEL READ COMMITTED.
I've tried this code (it seems equivalent to yours)
IF object_id('tempdb..#t') IS NOT NULL DROP TABLE #t
CREATE TABLE #t(i INT, val INT, acc int)
INSERT #t
SELECT 1, 2, 70
UNION ALL SELECT 2, 3, 70
;with cte as
(
select ROW_NUMBER() over (order by t.i) AS RowNumber,
t.val as [Transaction], t.acc Account
from #t t
)
select b.RowNumber, b.[Transaction], a.Account
from cte a, cte b
where a.RowNumber <= b.RowNumber AND a.Account=b.Account
group by b.RowNumber, b.[Transaction], a.Account
and got two rows
RowNumber Transaction Account
1 2 70
2 3 70
