How to improve performance of this query? - sql

With reference to SQL Query how to summarize students record by date?, I was able to get the report I wanted.
I was told that in the real world the students table will have 30 million records. I do have an index on (StudentID, Date). Any suggestions to improve the performance, or is there a better way to build the report?
Right now I have the following query:
;with cte as
(
select id,
studentid,
date,
'#'+subject+';'+grade+';'+convert(varchar(10), date, 101) report
from student
)
-- insert into studentreport
select distinct
studentid,
STUFF(
(SELECT cast(t2.report as varchar(50))
FROM cte t2
where c.StudentId = t2.StudentId
order by t2.date desc
FOR XML PATH (''))
, 1, 0, '') AS report
from cte c;

Without seeing the execution plan, it's not really possible to write an optimized SQL statement, so I'll make suggestions instead.
Don't use a CTE here: in my experience, at least, they often don't handle queries with large memory requirements well. Instead, stage the CTE data in a real table, either a materialized/indexed view or a working table (maybe a large temp table). Then execute the second select (the part after the CTE) to combine your data into an ordered list.
The number of comments on your question indicates that you have a large problem (or problems). You're converting tall and skinny data (think integers, datetime2 types) into ordered lists inside strings. Try instead to store the data in the smallest types available and defer the string manipulation until afterward (or skip it entirely). Alternatively, give serious thought to creating an XML column to replace the 'report' field.
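To illustrate the staging suggestion above (a sketch only; the #report table and index names are made up here, and the string format is copied from the original query):
-- Stage the per-row report strings in a temp table instead of a cte,
-- then index it so the correlated FOR XML PATH lookup can seek.
select studentid,
       date,
       '#' + subject + ';' + grade + ';' + convert(varchar(10), date, 101) as report
into #report
from student;

create clustered index ix_report on #report (studentid, date desc);

select s.studentid,
       stuff((select cast(r.report as varchar(50))
              from #report r
              where r.studentid = s.studentid
              order by r.date desc
              for xml path ('')), 1, 0, '') as report
from (select distinct studentid from #report) s;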
If you can make it work, this is what I would do (including a test case without indexes). Your mileage may vary, but give it a try:
create table #student (id int not null, studentid int not null, date datetime not null, subject varchar(40), grade varchar(40))
insert into #student (id,studentid,date,subject,grade)
select 1, 1, getdate(), 'history', 'A-' union all
select 2, 1, dateadd(d,1,getdate()), 'computer science', 'b' union all
select 3, 1, dateadd(d,2,getdate()), 'art', 'q' union all
--
select 1, 2, getdate() , 'something', 'F' union all
select 2, 2, dateadd(d,1,getdate()), 'genetics', 'e' union all
select 3, 2, dateadd(d,2,getdate()), 'art', 'D+' union all
--
select 1, 3, getdate() , 'memory loss', 'A-' union all
select 2, 3, dateadd(d,1,getdate()), 'creative writing', 'A-' union all
select 3, 3, dateadd(d,2,getdate()), 'history of asia 101', 'A-'
go
select studentid as studentid
,(select s2.date as '#date', s2.subject as '#subject', s2.grade as '#grade'
from #student s2 where s1.studentid = s2.studentid for xml path('report'), type) as 'reports'
from (select distinct studentid from #student) s1;
I don't know how to make the output legible on here, but the resultset is 2 fields. Field 1 is an integer, field 2 is XML with one node per report. This still isn't as ideal as just sending the resultset, but it is at least one result per studentid.

Related

How to do an as-of-join in SQL (Snowflake)?

I am looking to join two time-ordered tables, such that the events in table1 are matched to the "next" event in table2 (within the same user). I am using SQL / Snowflake for this.
For argument's sake table1 is "notification_clicked" events and table2 is "purchases"
This is one way to do it:
WITH partial_result AS (
SELECT
userId, notificationId, notificationTimeStamp, transactionId, transactionTimeStamp
FROM table1 CROSS JOIN table2
WHERE table1.userId = table2.userId
AND notificationTimeStamp <= transactionTimeStamp)
SELECT *
FROM partial_result
QUALIFY ROW_NUMBER() OVER(
PARTITION BY userId, notificationId ORDER BY transactionTimeStamp ASC
) = 1
It is not super readable, but is this "the" way to do this?
If you're doing an AsOf join against small tables, you can use a regular Venn diagram type of join. If you're running it against large tables, a regular join will lead to an intermediate cardinality explosion before the filter.
For large tables, this is the highest performance approach I have to date. Rather than treating an AsOf join like a regular Venn diagram join, we can treat it like a special type of union between two tables with a filter that uses the information from that union. The sample SQL does the following:
Unions the A and B tables so that the Entity and Time come from both tables, while all other columns come from only one table; rows sourced from the other table carry NULL for those values (measures 1 and 2 in this case). It also projects a source column indicating which table each row came from. We'll use this later.
In the unioned table, it uses a LAG function over windows partitioned by the Entity and ordered by the Time. For each row sourced from the A table, it lags back to the most recent Time sourced from the B table, ignoring all intervening A-sourced rows.
with A as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M1" -- Measure (could be many)
from (values
(1, 7, 1, 'M1-1'),
(1, 8, 1, 'M1-2'),
(1, 41, 1, 'M1-3'),
(1, 89, 1, 'M1-4')
)
), B as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M2" -- Different measure (could be many)
from (values
(1, 6, 1, 'M2-1'),
(1, 12, 1, 'M2-2'),
(1, 20, 1, 'M2-3'),
(1, 35, 1, 'M2-4'),
(1, 57, 1, 'M2-5'),
(1, 85, 1, 'M2-6'),
(1, 92, 1, 'M2-7')
)
), UNIONED as -- Unify schemas and union all
(
select 'A' as SOURCE_TABLE -- Project the source table
,E as AB_E -- AB_ means it's unified
,T as AB_T
,M1 as A_M1 -- A_ means it's from A
,NULL::string as B_M2 -- Make columns from B null for A
from A
union all
select 'B' as SOURCE_TABLE
,E as AB_E
,T as AB_T
,NULL::string as A_M1 -- Make columns from A null for B
,M2 as B_M2
from B
)
select AB_E as ENTITY
,AB_T as A_TIME
,lag(iff(SOURCE_TABLE = 'A', null, AB_T)) -- Lag back to
ignore nulls over -- previous B row
(partition by AB_E order by AB_T) as B_TIME
,A_M1 as M1_FROM_A
,lag(B_M2) -- Lag back to the previous non-null row.
ignore nulls -- The A sourced rows will already be NULL.
over (partition by AB_E order by AB_T) as M2_FROM_B
from UNIONED
qualify SOURCE_TABLE = 'A'
;
This will perform orders of magnitude faster for large tables because the highest intermediate cardinality is guaranteed to be the cardinality of A + B.
To simplify this refactor, I wrote a stored procedure that generates the SQL given the paths to table A and B, the entity column in A and B (right now limited to one, but if you have more it will get the SQL started), the order by (time) column in A and B, and finally the list of columns to "drag through" the AsOf join. It's rather lengthy so I posted it on Github and will work later to document and enhance it:
https://github.com/GregPavlik/AsOfJoin/blob/main/StoredProcedure.sql
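As a side note, Snowflake now also offers a native ASOF JOIN with a MATCH_CONDITION clause, which may be worth checking before hand-rolling the union approach. A minimal sketch against the original question's tables (column names assumed from the question):
SELECT t1.userId, t1.notificationId, t1.notificationTimeStamp,
       t2.transactionId, t2.transactionTimeStamp
FROM table1 t1
  ASOF JOIN table2 t2
    -- match each notification to the closest transaction at or after it
    MATCH_CONDITION (t1.notificationTimeStamp <= t2.transactionTimeStamp)
    ON t1.userId = t2.userId;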

TSQL: How do I search for a grade within a list of grades?

How do I search for a grade within a list of grades? Some grades are of string data type, for example 'PK', and 'KK'.
The production list has over a thousand students each in different grade levels, so I'm not sure how to ensure the query would address that.
The logic I'm trying for is something like WHERE CurrentGrade like ('%SchoolGrades%').
(I didn't design the tables I have to work with as I know they are not optimal, but it's what I have to work with, thanks for the help.)
Sample code:
CREATE TABLE #StudentGrades(
StudentID int
, CurrentGrade varchar(255)
, SchoolEarliestGrade varchar(255)
, SchoolID int
, School varchar(255)
, SchoolGrades varchar(255)
)
INSERT INTO #StudentGrades (StudentID, CurrentGrade, SchoolEarliestGrade, SchoolID, School, SchoolGrades)
VALUES
(7777777, 11, 'PK' , 111 ,'Smith Elementary' ,'PK, KK, 01, 02, 03, 04, 05'),
(7777777, 11, '06' , 222 ,'Jones Middle' ,'06, 07, 08'),
(7777777, 11, '09' , 333 ,'Perez High School' ,'09, 10, 11, 12')
SELECT * FROM #StudentGrades
This will give you the rows where the CurrentGrade is in the SchoolGrades.
SELECT *
FROM StudentGrades
WHERE ', ' + SchoolGrades + ', ' LIKE '%, ' + CurrentGrade + ', %'
Edit: This is the best solution with help from the comments. Thanks, all.
Based on the fact that the grades are in a string field, you can use LIKE:
select * from StudentGrades
where schoolGrades like '%11%';
or
select * from StudentGrades
where schoolGrades like '%KK%';
select * from StudentGrades
where schoolGrades like '%KK%' OR schoolGrades like '%PK%';
One could use a recursive CTE to unpivot the comma-separated values of SchoolGrades, then derive the needed values, and finally use a simple WHERE clause in the select against the CTE. Not sure of the performance, as the recursive loop is a record-by-record approach; however, it may be faster than a full table scan with 2 ORs.
Working example:
http://rextester.com/NAB12900
StrCTE gets the values to normalize the data for us,
SubCTE provides the needed grades on individual rows with the data normalized,
and the last query simply limits the result to rows where the current grade matches stringValue.
WITH StrCTE AS
(
SELECT 1 start, CHARINDEX(',' , schoolGrades) stop, StudentID, CurrentGrade, SchoolEarliestGrade, SchoolID, School, SchoolGrades
FROM #StudentGrades A
UNION ALL
SELECT stop + 1, CHARINDEX(',' ,schoolgrades , stop + 1), StudentID, CurrentGrade, SchoolEarliestGrade, SchoolID, School, SchoolGrades
FROM StrCTE A
WHERE stop > 0
),
SUBCTE AS (SELECT StudentID, CurrentGrade, SchoolEarliestGrade, SchoolID, School, SchoolGrades, ltrim(SUBSTRING(schoolgrades , start, CASE WHEN stop > 0 THEN stop-start ELSE 4000 END)) AS stringValue
FROM StrCTE)
SELECT *
FROM SUBCTE
WHERE currentgrade = stringValue
IMO this StrCTE query gives you the ability to normalize the data, allowing other standard SQL queries to function. Maybe create StrCTE as a materialized view on which your analysis is done, since the materialized view can have indexes, which reduce the performance loss from the recursive loop.
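If the server is new enough, a simpler alternative to the recursive CTE is STRING_SPLIT (a sketch only; it assumes SQL Server 2016+ running at compatibility level 130 or higher):
-- Split the comma-separated SchoolGrades list into rows, then keep the rows
-- whose trimmed value equals CurrentGrade.
SELECT sg.*
FROM #StudentGrades sg
CROSS APPLY STRING_SPLIT(sg.SchoolGrades, ',') AS s
WHERE LTRIM(s.value) = sg.CurrentGrade;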

Multiple joins to get the same lookup column for different values

We have a rather large SQL query which is performing rather poorly. One of the problems (from analysing the query plan) is the number of joins we have.
Essentially we have values in our data that we need to look up in another table to get the value to display to the user. The problem is that we have to do a join on the same table 4 times, because there are 4 different columns that all need the same lookup.
Hopefully this diagram might make it clearer
Raw_Event_data
event_id, datetime_id, lookup_1, lookup_2, lookup_3, lookup_4
1, 2013-01-01_12:00, 1, 5, 3, 9
2, 2013-01-01_12:00, 121, 5, 8, 19
3, 2013-01-01_12:00, 11, 2, 3, 32
4, 2013-01-01_12:00, 15, 2, 1, 0
Lookup_table
lookup_id, lookup_desc
1, desc1
2, desc2
3, desc3
...
Our query then looks something like this
Select
raw.event_id,
raw.datetime_id,
lookup1.lookup_desc,
lookup2.lookup_desc,
lookup3.lookup_desc,
lookup4.lookup_desc
FROM
Raw_Event_data raw, Lookup_table lookup1,Lookup_table lookup2,Lookup_table lookup3,Lookup_table lookup4
WHERE raw.event_id = 1 AND
raw.lookup_1 *= lookup1.lookup_id AND
raw.lookup_2 *= lookup2.lookup_id AND
raw.lookup_3 *= lookup3.lookup_id AND
raw.lookup_4 *= lookup4.lookup_id
So I get as an output
1, 2013-01-01_12:00, desc1, desc5, desc3, desc9
As I said the query works, but the joins are killing the performance.
That is a simplistic example; in reality there will be 12 joins like the above, and we won't be selecting a specific event but rather a range of events.
The question is: is there a better way of doing those joins?
correlated subqueries might be the way to go:
SELECT r.event_id
, r.datetime_id
, (select lookup1.lookup_desc from lookup_table lookup1 where lookup1.lookup_id = r.lookup_1) as desc_1
, (select lookup2.lookup_desc from lookup_table lookup2 where lookup2.lookup_id = r.lookup_2) as desc_2
, (select lookup3.lookup_desc from lookup_table lookup3 where lookup3.lookup_id = r.lookup_3) as desc_3
, (select lookup4.lookup_desc from lookup_table lookup4 where lookup4.lookup_id = r.lookup_4) as desc_4
FROM Raw_Event_data r
WHERE r.event_id = 1
;
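For comparison, here is a sketch of the original query rewritten with ANSI LEFT JOINs (the legacy *= outer-join syntax is deprecated in modern SQL Server); table and column names are taken from the question:
SELECT r.event_id,
       r.datetime_id,
       l1.lookup_desc,
       l2.lookup_desc,
       l3.lookup_desc,
       l4.lookup_desc
FROM Raw_Event_data r
LEFT JOIN Lookup_table l1 ON l1.lookup_id = r.lookup_1
LEFT JOIN Lookup_table l2 ON l2.lookup_id = r.lookup_2
LEFT JOIN Lookup_table l3 ON l3.lookup_id = r.lookup_3
LEFT JOIN Lookup_table l4 ON l4.lookup_id = r.lookup_4
WHERE r.event_id = 1;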
My first attempt would be to handle the indexing myself, if the DBAs refused to add the indexes.
declare #start_range bigint, #end_range bigint
select
#start_range = 5
,#end_range = 500
create local temporary table raw_event_subset
( --going to assume some schema based on your comments...obviously you will change these to whatever the base schema is.
event_id bigint
,datetime_id timestamp
,lookup_1 smallint
,lookup_2 smallint
--etc
) on commit preserve rows
create HG index HG_temp_raw_event_subset_event_id on raw_event_subset (event_id)
create LF index LF_temp_raw_event_subset_lookup_1 on raw_event_subset (lookup_1)
create LF index LF_temp_raw_event_subset_lookup_2 on raw_event_subset (lookup_2)
--etc
insert into raw_event_subset
select
event_id
,datetime_id
,lookup_1
,lookup_2
--,etc
from raw_event_data
where event_id >= #start_range --event_id *must* have an HG index on it for this to be worthwhile.
and event_id <= #end_range
--then run your normal query, except replace raw_event_data with raw_event_subset
select
event_id
,datetime_id
,l1.lookup_desc
,l2.lookup_desc
--etc
from raw_event_subset r
left join lookup_table l1
on l1.lookup_id = r.lookup_1
left join lookup_table l2
on l2.lookup_id = r.lookup_2
--etc
drop table raw_event_subset
hope this helps...

create view for items not in list

I have two tables. Table 1 is a master list of equipment with equipment_id and equipment_description. So, let's say for this table I have ten equipment_id's. 1,2,3....10 each with some description attached.
Table 2 logs when the equipment has been inspected:
equipment_id|inspection_date
1 | '1-22-2012'
2 | '1-22-2012'
4 | '1-22-2012'
2 | '1-23-2012'
3 | '1-23-2012'
I've created a view, v_dates which pulls out of table 2 all of the distinct inspection dates - not sure if I needed it but did it anyway.
I would like to create another view which shows all equipment that was NOT inspected for each date in the v_dates. So it would show:
3 | '1-22-2012'
5 | '1-22-2012'
and so on.
Rookie here and just not sure how to join these tables correctly. Can't get it to work and would appreciate any help.
Untested, but I think this should give the desired result:
SELECT e.equipment_id, d.inspection_date
FROM equipment e -- master list of equipment (Table 1); the question doesn't name it, so "equipment" is assumed
CROSS JOIN ( SELECT DISTINCT inspection_date FROM inspections ) d
LEFT JOIN inspections i
  ON i.equipment_id = e.equipment_id
 AND i.inspection_date = d.inspection_date
WHERE i.equipment_id IS NULL
ORDER BY d.inspection_date, e.equipment_id
As mentioned in the comments, a table with inspection dates would really help.
The following appears to work based on my test data using SQL SERVER 2005. I am using a CROSS JOIN of distinct dates along with a LEFT JOIN to throw out EQUIPMENT_ID records that exist for those dates.
Sorry, I am having problems getting my code formatting correct with tabs and spaces...
IF OBJECT_ID('tempdb..#EQUIPMENT') IS NOT NULL
DROP TABLE #EQUIPMENT
CREATE TABLE #EQUIPMENT
( EQUIPMENT_ID smallint,
EQUIPMENT_DESC varchar(32)
)
INSERT INTO #EQUIPMENT
( EQUIPMENT_ID, EQUIPMENT_DESC )
SELECT 1, 'AAA'
UNION SELECT 2, 'BBB'
UNION SELECT 3, 'CCC'
UNION SELECT 4, 'DDD'
UNION SELECT 5, 'EEE'
UNION SELECT 6, 'FFF'
UNION SELECT 7, 'GGG'
UNION SELECT 8, 'HHH'
UNION SELECT 9, 'III'
UNION SELECT 10, 'JJJ'
IF OBJECT_ID('tempdb..#INSPECTION') IS NOT NULL
DROP TABLE #INSPECTION
CREATE TABLE #INSPECTION
( EQUIPMENT_ID smallint,
INSPECTION_DATE smalldatetime
)
INSERT INTO #INSPECTION
( EQUIPMENT_ID, INSPECTION_DATE )
SELECT 1, '1-22-2012'
UNION SELECT 1, '1-27-2012'
UNION SELECT 3, '1-27-2012'
UNION SELECT 5, '1-29-2012'
UNION SELECT 7, '1-22-2012'
UNION SELECT 7, '1-27-2012'
UNION SELECT 7, '1-29-2012'
SELECT E.EQUIPMENT_ID, D.INSPECTION_DATE
FROM #EQUIPMENT E
CROSS JOIN ( SELECT DISTINCT INSPECTION_DATE
FROM #INSPECTION
) D
LEFT JOIN #INSPECTION I2
ON E.EQUIPMENT_ID = I2.EQUIPMENT_ID
AND D.INSPECTION_DATE = I2.INSPECTION_DATE
WHERE I2.EQUIPMENT_ID IS NULL
ORDER BY E.EQUIPMENT_ID, D.INSPECTION_DATE
As per my comment to the question, you really need a table of valid inspection dates. It makes the sql much more sensible, and besides it's the only way to do it if you want to see all items listed for dates when inspections were supposed to be done, but no inspections were done.
So, assuming the two tables:
create table inspections (equipment_id int, inspection_date date);
create table inspection_dates (id int, inspection_date date);
Then a join to get all the equipment that does not have an inspection on a date when an inspection should have taken place would be:
select i.equipment_id, id.inspection_date
from inspection_dates id,
(select distinct equipment_id from inspections) i
where not exists (select * from inspections i2
where i2.inspection_date = id.inspection_date
and i2.equipment_id = i.equipment_id);
You want the combos that do not exist. Thus the not exists predicate.
Note again, that presumably you would have a table for all the unique equipment_ids, but not knowing that I had to construct it myself in place.
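For instance, if the master list from Table 1 is available as its own table, the same NOT EXISTS shape works against it directly (a sketch; "equipment" is an assumed table name since the question doesn't give one):
select e.equipment_id, id.inspection_date
from inspection_dates id
cross join equipment e -- full master list of equipment (assumed name)
where not exists (select * from inspections i2
                  where i2.inspection_date = id.inspection_date
                  and i2.equipment_id = e.equipment_id);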

SQL Query to eliminate similar entries

I am working on a problem in SQL Server 2008
I have a table with six columns:
PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFive SmallINT
dSix SmallINT
The table contains around a million records. It's probably worth noting that the value in column n+1 > the value in column n, e.g. 97, 98, 99, 120, 135. I am trying to eliminate all rows which have 5 DIGITS in common (ignoring the PK), i.e.:
76, 89, 99, 102, 155, 122
11, 89, 99, 102, 155, 122
89, 99, 102, 155, 122, 130
In this case the algorithm should start at the first row and delete the second and third rows because they contain 5 matching digits. The first row persists.
I have tried to brute force the solution but finding all the duplicates for only the first record takes upwards of 25 seconds meaning processing the whole table would take... way too long (this should be a repeatable process).
I am fairly new to SQL but this is what I have come up with (I have come up with a few solutions but none were adequate... this is the latest attempt):
(I won't include all the code but I will explain the method, I can paste more if it helps)
Save the digits of record n into variables. SELECT all records which have one digit in common with record n FROM largeTable.
Insert all selected digits into #oneMatch and include [matchingOne] with the digit that matched.
Select all records which have one digit in common with record n FROM the temp table WHERE 'digit in common' != [matching]. INSERT all selected digits into #twoMatch and include [matchingOne] AND [matchingTwo]...
Repeat until inserting into #fiveMatch. Delete #fiveMatch from largeTable and move to record n+1
I am having a problem implementing this solution. How can I assign the matching variable depending on the WHERE clause?
-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE ( [dOne] IN (#dOne, #dTwo, #dThree, #dFour, #dFive, #dSix) **mOne = dOne?
OR [dTwo] IN (#dOne, #dTwo, #dThree, #dFour, #dFive, #dSix) **mOne = dTwo?
OR [dThree] IN (#dOne, #dTwo, #dThree, #dFour, #dFive, #dSix) **mOne = dThree?
...
OR [dSix] IN (#dOne, #dTwo, #dThree, #dFour, #dFive, #dSix) **mOne = dSix?
)
I am able to 'fake' the above using six queries but that is too inefficient...
Sorry for the long description. Any help would be greatly appreciated (new solution or implementation of my attempt above) as this problem has been nagging at me for a while...
Unless I'm missing something, this should produce the correct result.
declare #T table
(
PK INT identity primary key,
dOne SmallINT,
dTwo SmallINT,
dThree SmallINT,
dFour SmallINT,
dFive SmallINT,
dSix SmallINT
)
insert into #T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)
;with q1(PK, d1, d2, d3, d4, d5) as
(
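-- q1: expand each row into its six possible 5-column subsets (drop one column at a time)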
select PK, dTwo, dThree, dFour, dFive, dSix
from #T
union all
select PK, dOne, dThree, dFour, dFive, dSix
from #T
union all
select PK, dOne, dTwo, dFour, dFive, dSix
from #T
union all
select PK, dOne, dTwo, dThree, dFive, dSix
from #T
union all
select PK, dOne, dTwo, dThree, dFour, dSix
from #T
union all
select PK, dOne, dTwo, dThree, dFour, dFive
from #T
),
q2 as
(
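-- q2: number the rows that share an identical 5-column subset; rn = 1 marks the lowest PK in each group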
select PK,
row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
from q1
),
q3 as
(
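-- q3: keep only rows that are the lowest PK for all six of their subsets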
select PK
from q2
where rn = 1
group by PK
having count(*) = 6
)
select T.*
from #T as T
inner join q3 as Q
on T.PK = Q.PK
I can't make any promises on performance, but you can try this. The first thing that I do is put the data into a more normalized structure.
CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO
INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO
SELECT DISTINCT
T2.my_id
FROM
(SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
(
SELECT
COUNT(*)
FROM
dbo.Test_Sets_Normalized T3
INNER JOIN dbo.Test_Sets_Normalized T4 ON
T4.my_id = T2.my_id AND
T4.c = T3.c
WHERE
T3.my_id = T1.my_id) >= 5
That should get you the IDs that you need. Once you've confirmed that it does what you want, you can JOIN back to the original table and delete by IDs.
There's probably an improvement possible somewhere that doesn't require the DISTINCT. I'll give it a little more thought.
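A minimal sketch of that final delete step, assuming the matching IDs from the query above were captured in a temp table named #ids_to_delete (a hypothetical name):
-- Remove the rows whose IDs were flagged as near-duplicates.
DELETE ts
FROM dbo.Test_Sets ts
INNER JOIN #ids_to_delete d ON d.my_id = ts.my_id;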
Edit - the following approach might be better than N squared performance, depending on the optimizer. If all 5 columns are indexed it should only need 6 index seeks per row, which is still N * logN. It does seem a little dopey though.
You could code-generate the WHERE condition based on all the permutations of 5 matches, so the records to delete would be given by:
SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
SELECT PK From SillyTable Duplicate
WHERE ( (
(Duplicate.dOne=ToDelete.dOne)
AND (Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
) OR (
(Duplicate.dOne=ToDelete.dTwo)
AND (Duplicate.dTwo=ToDelete.dThree)
AND (Duplicate.dThree=ToDelete.dFour)
AND (Duplicate.dFour=ToDelete.dFive)
AND (Duplicate.dFive=ToDelete.dSix)
) OR (
(Duplicate.dTwo=ToDelete.dOne)
AND (Duplicate.dThree=ToDelete.dTwo)
AND (Duplicate.dFour=ToDelete.dThree)
AND (Duplicate.dFive=ToDelete.dFour)
AND (Duplicate.dSix=ToDelete.dFive)
) OR (
(Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
AND (Duplicate.dSix=ToDelete.dSix)
) ...
This goes on to cover all 36 combinations (there is one non-matched column on each side of the join, out of 6 possible columns, so 6*6 gives you all the possibilities). I would code-generate this because it's a lot of typing, and what if you want 4 out of 6 matches tomorrow? But you could hand-code it, I guess.