SQL Query taking WAY too long

I have a query that's taking way too long.
There is no index on any column, and I'm pretty sure the way the ORs behave in it is making this too hard on the server.
This is a view, and a SELECT * against it takes 4 minutes to complete.
On review, the query I'm running against this view is what takes the most time:
SELECT * FROM Penny_Assoc_PCB WHERE PRODUCT_ID=68 ORDER BY RECORD_DT, ASSOCIATION_TYPE
/***** Here is the execution plan *******/
https://www.brentozar.com/pastetheplan/?id=Bki03eIHK
SELECT dbo.synfact_record.RECORD_ID
,dbo.synfact_record.PART_ID
,dbo.synfact_record.RECORD_DT
,dbo.synfact_association.ASSOCIATION_PART_A
,dbo.synfact_association.ASSOCIATION_PART_B
,dbo.synfact_association.ASSOCIATION_TYPE
,dbo.synfact_association.ASSOCIATION_ID
,dbo.synfact_record.PRODUCT_ID
FROM dbo.synfact_association
INNER JOIN dbo.synfact_record ON dbo.synfact_association.RECORD_ID = dbo.synfact_record.RECORD_ID
WHERE (
dbo.synfact_record.PART_ID IN (
SELECT PART_ID
FROM dbo.synfact_record AS synfact_record_1
WHERE (RECORD_STATUS = 1)
AND (RECORD_TYPE = 0)
)
)
AND dbo.synfact_record.PRODUCT_ID IN (8, 9, 10, 15, 27, 31, 34, 56, 60, 61, 62, 66, 67, 68)
AND (dbo.synfact_record.RECORD_ID > 499)
AND (dbo.synfact_record.RECORD_STATUS = 1)
GROUP BY dbo.synfact_record.RECORD_ID
,dbo.synfact_record.PART_ID
,dbo.synfact_record.RECORD_DT
,dbo.synfact_association.ASSOCIATION_PART_A
,dbo.synfact_association.ASSOCIATION_PART_B
,dbo.synfact_association.ASSOCIATION_TYPE
,dbo.synfact_association.ASSOCIATION_ID
,dbo.synfact_record.PRODUCT_ID
,dbo.synfact_record.RECORD_STATUS

You can substantially simplify your query.
I have removed the GROUP BY, which was acting as a giant DISTINCT with no aggregation. If you get duplicates, I suggest you put more thought into your join. Perhaps you need a better join condition, or a top-1-per-group (see the ROW_NUMBER() sketch after the rewritten query).
SELECT r.RECORD_ID,
r.PART_ID,
r.RECORD_DT,
a.ASSOCIATION_PART_A,
a.ASSOCIATION_PART_B,
a.ASSOCIATION_TYPE,
a.ASSOCIATION_ID,
r.PRODUCT_ID
FROM
dbo.synfact_association AS a
INNER JOIN
dbo.synfact_record AS r ON a.RECORD_ID = r.RECORD_ID
WHERE
(r.PART_ID IN (
SELECT PART_ID
FROM dbo.synfact_record AS r1
WHERE (r1.RECORD_STATUS = 1)
AND (r1.RECORD_TYPE = 0)
)
)
AND r.PRODUCT_ID IN
(8,9,10,15,27,31,34,56,60,61,62,66,67,68)
AND (r.RECORD_ID > 499)
AND (r.RECORD_STATUS = 1);
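For the top-1-per-group option, here is a minimal sketch. It assumes you want one association row per RECORD_ID, keeping the lowest ASSOCIATION_ID; adjust the PARTITION BY / ORDER BY to whatever actually defines the row you want to keep.
WITH ranked AS (
    SELECT r.RECORD_ID, r.PART_ID, r.RECORD_DT,
           a.ASSOCIATION_PART_A, a.ASSOCIATION_PART_B,
           a.ASSOCIATION_TYPE, a.ASSOCIATION_ID, r.PRODUCT_ID,
           ROW_NUMBER() OVER (PARTITION BY r.RECORD_ID
                              ORDER BY a.ASSOCIATION_ID) AS rn
    FROM dbo.synfact_association AS a
    INNER JOIN dbo.synfact_record AS r ON a.RECORD_ID = r.RECORD_ID
    WHERE r.PART_ID IN (SELECT r1.PART_ID
                        FROM dbo.synfact_record AS r1
                        WHERE r1.RECORD_STATUS = 1 AND r1.RECORD_TYPE = 0)
      AND r.PRODUCT_ID IN (8,9,10,15,27,31,34,56,60,61,62,66,67,68)
      AND r.RECORD_ID > 499
      AND r.RECORD_STATUS = 1
)
SELECT RECORD_ID, PART_ID, RECORD_DT, ASSOCIATION_PART_A, ASSOCIATION_PART_B,
       ASSOCIATION_TYPE, ASSOCIATION_ID, PRODUCT_ID
FROM ranked
WHERE rn = 1;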
Based on this query alone, I would recommend the following indexes:
CREATE CLUSTERED INDEX IX_synfact_association_RECORD_ID
ON synfact_association (RECORD_ID)
-- for non clustered add: INCLUDE (ASSOCIATION_PART_A, ASSOCIATION_PART_B, ASSOCIATION_TYPE)
CREATE CLUSTERED INDEX IX_synfact_record_RECORD_ID
ON synfact_record (RECORD_STATUS, RECORD_ID)
-- for non clustered add: INCLUDE (PART_ID, RECORD_DT, PRODUCT_ID)
In this second index it may be worth swapping RECORD_ID and PART_ID.
CREATE NONCLUSTERED INDEX IX_synfact_record_RECORD_TYPE
ON synfact_record (RECORD_STATUS, RECORD_TYPE, PART_ID)
This last index supports the IN subquery.
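If the IN subquery is still the expensive part of the plan, an EXISTS rewrite is a semantically equivalent alternative worth testing; a sketch against the same tables as above:
SELECT r.RECORD_ID, r.PART_ID, r.RECORD_DT,
       a.ASSOCIATION_PART_A, a.ASSOCIATION_PART_B,
       a.ASSOCIATION_TYPE, a.ASSOCIATION_ID, r.PRODUCT_ID
FROM dbo.synfact_association AS a
INNER JOIN dbo.synfact_record AS r ON a.RECORD_ID = r.RECORD_ID
WHERE EXISTS (SELECT 1
              FROM dbo.synfact_record AS r1
              WHERE r1.PART_ID = r.PART_ID
                AND r1.RECORD_STATUS = 1
                AND r1.RECORD_TYPE = 0)
  AND r.PRODUCT_ID IN (8,9,10,15,27,31,34,56,60,61,62,66,67,68)
  AND r.RECORD_ID > 499
  AND r.RECORD_STATUS = 1;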

How to do an as-of-join in SQL (Snowflake)?

I am looking to join two time-ordered tables, such that the events in table1 are matched to the "next" event in table2 (within the same user). I am using SQL / Snowflake for this.
For argument's sake table1 is "notification_clicked" events and table2 is "purchases"
This is one way to do it:
WITH partial_result AS (
SELECT
table1.userId, notificationId, notificationTimeStamp, transactionId, transactionTimeStamp
FROM table1 CROSS JOIN table2
WHERE table1.userId = table2.userId
AND notificationTimeStamp <= transactionTimeStamp)
SELECT *
FROM partial_result
QUALIFY ROW_NUMBER() OVER(
PARTITION BY userId, notificationId ORDER BY transactionTimeStamp ASC
) = 1
It is not super readable, but is this "the" way to do this?
If you're doing an AsOf join against small tables, you can use a regular Venn diagram type of join. If you're running it against large tables, a regular join will lead to an intermediate cardinality explosion before the filter.
For large tables, this is the highest performance approach I have to date. Rather than treating an AsOf join like a regular Venn diagram join, we can treat it like a special type of union between two tables with a filter that uses the information from that union. The sample SQL does the following:
Unions the A and B tables so that the Entity and Time come from both tables and all other columns come from only one table. Rows from the other table specify NULL for these values (measures 1 and 2 in this case). It also projects a source column for the table. We'll use this later.
In the unioned table, it uses a LAG function on windows partitioned by the Entity and ordered by the Time. For each row with a source indicator from the A table, it lags back to the first Time with source in the B table, ignoring all values in the A table.
with A as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M1" -- Measure (could be many)
from (values
(1, 7, 1, 'M1-1'),
(1, 8, 1, 'M1-2'),
(1, 41, 1, 'M1-3'),
(1, 89, 1, 'M1-4')
)
), B as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M2" -- Different measure (could be many)
from (values
(1, 6, 1, 'M2-1'),
(1, 12, 1, 'M2-2'),
(1, 20, 1, 'M2-3'),
(1, 35, 1, 'M2-4'),
(1, 57, 1, 'M2-5'),
(1, 85, 1, 'M2-6'),
(1, 92, 1, 'M2-7')
)
), UNIONED as -- Unify schemas and union all
(
select 'A' as SOURCE_TABLE -- Project the source table
,E as AB_E -- AB_ means it's unified
,T as AB_T
,M1 as A_M1 -- A_ means it's from A
,NULL::string as B_M2 -- Make columns from B null for A
from A
union all
select 'B' as SOURCE_TABLE
,E as AB_E
,T as AB_T
,NULL::string as A_M1 -- Make columns from A null for B
,M2 as B_M2
from B
)
select AB_E as ENTITY
,AB_T as A_TIME
,lag(iff(SOURCE_TABLE = 'A', null, AB_T)) -- Lag back to
ignore nulls over -- previous B row
(partition by AB_E order by AB_T) as B_TIME
,A_M1 as M1_FROM_A
,lag(B_M2) -- Lag back to the previous non-null row.
ignore nulls -- The A sourced rows will already be NULL.
over (partition by AB_E order by AB_T) as M2_FROM_B
from UNIONED
qualify SOURCE_TABLE = 'A'
;
This will perform orders of magnitude faster for large tables because the highest intermediate cardinality is guaranteed to be the cardinality of A + B.
To simplify this refactor, I wrote a stored procedure that generates the SQL given the paths to tables A and B, the entity column in A and B (right now limited to one, but if you have more it will get the SQL started), the order-by (time) column in A and B, and finally the list of columns to "drag through" the AsOf join. It's rather lengthy, so I posted it on GitHub and will document and enhance it later:
https://github.com/GregPavlik/AsOfJoin/blob/main/StoredProcedure.sql
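As an aside, newer Snowflake releases also provide a native ASOF JOIN that expresses this directly. A minimal sketch, assuming the question's table1/table2 column names and that the feature is available on your account:
SELECT t1.userId,
       t1.notificationId,
       t1.notificationTimeStamp,
       t2.transactionId,
       t2.transactionTimeStamp
FROM table1 t1
ASOF JOIN table2 t2
    MATCH_CONDITION (t1.notificationTimeStamp <= t2.transactionTimeStamp)
    ON t1.userId = t2.userId;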

SQL Offset total row count slow with IN Clause

I am using the SQL code below, based on another answer. However, when I include the massive IN clause, getting the total count takes too long. If I remove the total count, the query takes less than 1 second. Is there a more efficient way to get the total row count? The answers I saw were based on queries from around 2013.
DECLARE
@PageSize INT = 10,
@PageNum INT = 1;
WITH TempResult AS(
SELECT ID, Name
FROM Table
WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
), TempCount AS (
SELECT COUNT(*) AS MaxRows FROM TempResult
)
SELECT *
FROM TempResult,
TempCount -- <----- this is what is slow. Removing this and the query is super fast
ORDER BY TempResult.Name
OFFSET (@PageNum-1)*@PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
Step one for performance-related questions is going to be to analyze your table/index structure and to review the query plans. You haven't provided that information, so I'm going to make up my own and go from there.
I'm going to assume that you have a heap, with ~10M rows (12,872,738 for me):
DECLARE @MaxRowCount bigint = 10000000,
@Offset bigint = 0;
DROP TABLE IF EXISTS #ExampleTable;
CREATE TABLE #ExampleTable
(
ID bigint NOT NULL,
Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);
WHILE @Offset < @MaxRowCount
BEGIN
INSERT INTO #ExampleTable
( ID, Name )
SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )),
ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
FROM master.dbo.spt_values SV
CROSS APPLY master.dbo.spt_values SV2;
SET @Offset = @Offset + ROWCOUNT_BIG();
END;
If I run the query provided over #ExampleTable, it takes about 4 seconds. The resulting query plan isn't great by any means, but it is hardly awful; running with live query stats shows that the cardinality estimates were at most off by one, which is fine.
Let's try a massive number of items in our IN list (5000 items, from 1 to 5000). Compiling the plan alone took 4 seconds.
I can get my number up to 15000 items before the query processor stops being able to handle it, with no change in query plan (it does take a total of 6 seconds to compile). Running both queries takes about 5 seconds a pop on my machine.
This is probably fine for analytical workloads or for data warehousing, but for OLTP like queries we've definitely exceeded our ideal time limit.
Let's look at some alternatives. We can probably do some of these in combination.
We could cache off the IN list in a temp table or table variable.
We could use a window function to calculate the count.
We could cache off our CTE in a temp table or table variable.
If on a sufficiently high SQL Server version, we could use batch mode.
We could change the indices on your table to make this faster (a sketch follows this list).
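For the index option, here is a minimal sketch against the #ExampleTable above (the same idea applies to a real table: a seekable index on the filter column, and an index that delivers the ORDER BY Name order so the OFFSET/FETCH does not need a sort):
CREATE CLUSTERED INDEX CIX_ExampleTable_ID
    ON #ExampleTable (ID);

CREATE NONCLUSTERED INDEX IX_ExampleTable_Name
    ON #ExampleTable (Name)
    INCLUDE (ID);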
Workflow considerations
If this is for an OLTP workflow, then we need something that is fast regardless of how many users we have. As such, we want to minimize recompiles and we want index seeks wherever possible. If this is analytic or warehousing, then recompiles and scans are probably fine.
If we want OLTP, then the caching options are probably off the table. Temp tables will always force recompiles, and table variables in queries that rely on a good estimate require you to force a recompile. The alternative would be to have some other part of your application maintain a persistent table that has paginated counts or filters (or both), and then have this query join against that (see the sketch after these workflow notes).
If the same user would look at many pages, then caching off part of it is probably still worth it even in OLTP, but make sure you measure the impact of many concurrent users.
Regardless of workflow, updating indices is probably okay (unless your workflows are going to really mess with your index maintenance).
Regardless of workflow, batch mode will be your friend.
Regardless of workflow, window functions (especially with indices and/or batch mode) will probably be better.
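A minimal sketch of the persistent-count idea mentioned above; the dbo.FilterCounts table and its FilterKey value are hypothetical names, and some background process would be responsible for keeping the counts current:
CREATE TABLE dbo.FilterCounts
(
    FilterKey varchar(100) NOT NULL PRIMARY KEY,
    MaxRows   bigint       NOT NULL,
    UpdatedAt datetime2    NOT NULL DEFAULT SYSUTCDATETIME()
);

-- The paging query then reads the precomputed count instead of recounting
-- (@PageNum/@PageSize as declared in the question).
SELECT t.ID, t.Name, fc.MaxRows
FROM #ExampleTable AS t
INNER JOIN dbo.FilterCounts AS fc
    ON fc.FilterKey = 'my-id-list' -- hypothetical key identifying this particular filter
WHERE t.ID IN ( 1, 2, 3 /* ... */ )
ORDER BY t.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;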
Batch mode and the default cardinality estimator
We pretty consistently get poor cardinality estimates (and resulting plans) with the legacy cardinality estimator and row-mode executions. Forcing the default cardinality estimator helps with the first, and batch-mode helps with the second.
If you can't switch your whole database to the new cardinality estimator, you can enable it for this specific query with the hint OPTION( USE HINT( 'FORCE_DEFAULT_CARDINALITY_ESTIMATION' ) ). For batch mode, add a join to a CCI that doesn't need to return any data: LEFT OUTER JOIN dbo.EmptyCciForRowstoreBatchmode ON 1 = 0. This lets SQL Server pick batch-mode optimizations. These recommendations assume a sufficiently new SQL Server version.
What the CCI is doesn't matter; we like to keep an empty one around for consistency, which looks like this:
CREATE TABLE dbo.EmptyCciForRowstoreBatchmode
(
__zzDoNotUse int NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);
The best plan I could get without modifying the table was to use both of them. With the same data as before, this runs in <1s.
WITH TempResult AS
(
SELECT ID,
Name,
COUNT( * ) OVER ( ) MaxRows
FROM #ExampleTable
WHERE ID IN ( <<really long LIST>> )
)
SELECT TempResult.ID,
TempResult.Name,
TempResult.MaxRows
FROM TempResult
LEFT OUTER JOIN dbo.EmptyCciForRowstoreBatchmode ON 1 = 0
ORDER BY TempResult.Name OFFSET ( @PageNum - 1 ) * @PageSize ROWS FETCH NEXT @PageSize ROWS ONLY
OPTION( USE HINT( 'FORCE_DEFAULT_CARDINALITY_ESTIMATION' ) );
As far as I know there are 3 ways to achieve this, besides using the #temp table approach already mentioned. In my test cases below, I've used a SQL Server 2016 Developer instance with 6 CPUs/16 GB RAM, and a simple table containing ~25M rows.
Method 1: CROSS JOIN
DECLARE
@PageSize INT = 10
, @PageNum INT = 1;
WITH TempResult AS (SELECT
id
, shortDesc
FROM dbo.TestName
WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
SELECT
*, MaxRows
FROM TempResult
CROSS JOIN (SELECT COUNT(1) AS MaxRows FROM TempResult) AS TheCount
ORDER BY TempResult.shortDesc OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT #PageSize ROWS ONLY;
Test result 1:
Method 2: COUNT(*) OVER()
DECLARE
@PageSize INT = 10
, @PageNum INT = 1;
WITH TempResult AS (SELECT
id
, shortDesc
FROM dbo.TestName
WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT
*, MaxRows = COUNT(*) OVER()
FROM TempResult
ORDER BY TempResult.shortDesc OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT #PageSize ROWS ONLY;
Test result 2:
Method 3: 2nd CTE
Test result 3 (T-SQL used was the same as in the question):
Conclusion
The fastest method depends on your data structure (and total number of rows) in combination with your server sizing/load. In my case COUNT(*) OVER() proved to be the fastest method, but you have to test what works best for your scenario. And don't rule out the #temp table approach just yet ;-)
You can try to count the rows while filtering the table using ROW_NUMBER():
DECLARE
@PageSize INT = 10,
@PageNum INT = 1;
;WITH
TempResult AS (
SELECT ID, Name, ROW_NUMBER() OVER (ORDER BY ID) N
FROM Table
Where ID in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
),
TempCount AS (
SELECT TOP 1 N AS MaxRows
FROM TempResult
ORDER BY ID DESC
)
SELECT *
FROM
TempResult,
TempCount
ORDER BY
TempResult.Name
OFFSET (@PageNum-1)*@PageSize ROWS
FETCH NEXT #PageSize ROWS ONLY
You could try phrasing this as:
WITH TempResult AS(
SELECT ID, Name, COUNT(*) OVER () as maxrows
FROM Table
Where ID in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
However, I doubt that you will see much performance improvement. The entire table needs to be scanned to get the total count. That is probably where the performance issue is.
This might be a shot in the dark, but you can try using a temp table instead of a CTE.
Though performance and the preference of one over the other varies from use case to use case, a temp table can sometimes prove better, since it lets you leverage indexes and dedicated statistics.
SELECT ID, Name
INTO #TempResult
FROM Table
WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
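To actually get the index and statistics benefit mentioned above, you can index the temp table on the paging sort column; a sketch (variables as declared in the question):
CREATE CLUSTERED INDEX CIX_TempResult_Name
    ON #TempResult (Name);

SELECT *, COUNT(*) OVER () AS MaxRows
FROM #TempResult
ORDER BY Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;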
The IN statement is a notorious hurdle for the SQL Server query engine. When it gets "massive" (your words) it slows down even simple queries. In my experience, IN statements with more than 5000 items nearly always unacceptably slow down any query.
It nearly always works better to convert the items of a large IN statement into a temp table or table variable first and then join with this table, as below. I tested this and found it's significantly faster, even with the preparation of the temp table. I think that the IN statement, even though the inner query performs well enough with it, has a detrimental effect on the combined query.
DECLARE @ids TABLE (ID int primary key );
-- This must be done in chunks of 1000
INSERT @ids (ID) VALUES
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),...
...
;WITH TempResult AS
(
SELECT tbl.ID, tbl.Name
FROM Table tbl
JOIN @ids ids ON ids.ID = tbl.ID
),
TempCount AS
(
SELECT COUNT(*) AS MaxRows FROM TempResult
)
SELECT *
FROM TempResult,
TempCount
ORDER BY TempResult.Name
OFFSET (@PageNum-1)*@PageSize ROWS
FETCH NEXT #PageSize ROWS ONLY
CTEs are very nice, but having many consecutive CTEs (two is not many, I think, but in general) has caused me performance horror many times. The simplest method, I think, is to calculate the number of rows once and assign it to a variable:
DECLARE
@PageSize INT = 10,
@PageNum INT = 1,
@MaxRows bigint = (SELECT COUNT(1) FROM Table WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
WITH TempResult AS(
SELECT ID, Name
FROM Table
WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, @MaxRows AS MaxRows -- the count now comes from the variable instead of the slow TempCount CTE
FROM TempResult
ORDER BY TempResult.Name
OFFSET (@PageNum-1)*@PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
I can't test this at the moment, but glancing through it, it struck me that specifying a multiply (cross join), as in:
FROM TempResult,
TempCount -- <----- this is what is slow. Removing this and the query is super fast
may be the issue
How does it perform when written simply as:
DECLARE
@PageSize INT = 10,
@PageNum INT = 1;
WITH TempResult AS(
SELECT ID, Name
FROM Table
Where ID in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, (SELECT COUNT(*) FROM TempResult) AS MaxRows
FROM TempResult
ORDER BY TempResult.Name
OFFSET (@PageNum-1)*@PageSize ROWS
FETCH NEXT #PageSize ROWS ONLY

Firebird select from table distinct one field

The question I asked yesterday was simplified, but I realize I have to tell the whole story.
I have to extract data from 4 different tables in a Firebird 2.5 database, and the following query works:
SELECT
PRODUZIONE_T.CODPRODUZIONE,
PRODUZIONE_T.NUMEROCOMMESSA as numeroco,
ANGCLIENTIFORNITORI.RAGIONESOCIALE1,
PRODUZIONE_T.DATACONSEGNA,
PRODUZIONE_T.REVISIONE,
ANGUTENTI.NOMINATIVO,
ORDINI_T.DATA
FROM PRODUZIONE_T
LEFT OUTER JOIN ORDINI_T ON PRODUZIONE_T.CODORDINE=ORDINI_T.CODORDINE
INNER JOIN ANGCLIENTIFORNITORI ON ANGCLIENTIFORNITORI.CODCLIFOR=ORDINI_T.CODCLIFOR
LEFT OUTER JOIN ANGUTENTI ON ANGUTENTI.IDUTENTE = PRODUZIONE_T.RESPONSABILEUC
ORDER BY right(numeroco,2) DESC, left(numeroco,3) desc
rows 1 to 500;
However, the query returns duplicate rows (or more) because of the REVISIONE column.
How do I select only the rows of a single NUMEROCOMMESSA with the maximum REVISIONE value?
This should work:
select t.COD, t.N_ORDER, t.S_DATE, t.REVISION
FROM TAB1 t
JOIN
(
select N_ORDER, MAX(REVISION) as REVISION
FROM TAB1
Group By N_ORDER
) m on m.N_ORDER = t.N_ORDER and m.REVISION = t.REVISION
Here you go - http://sqlfiddle.com/#!6/ce7cf/4
Sample data (as you set it in your original question):
create table TAB1 (
cod integer primary key,
n_order varchar(10) not null,
s_date date not null,
revision integer not null );
alter table tab1 add constraint UQ1 unique (n_order,revision);
insert into TAB1 values ( 1, '001/18', '2018-02-01', 0 );
insert into TAB1 values ( 2, '002/18', '2018-01-31', 0 );
insert into TAB1 values ( 3, '002/18', '2018-01-30', 1 );
The query:
select *
from tab1 d
join ( select n_ORDER, MAX(REVISION) as REVISION
FROM TAB1
Group By n_ORDER ) m
on m.n_ORDER = d.n_ORDER and m.REVISION = d.REVISION
Suggestions:
Google and read the classic book: "Understanding SQL" by Martin Gruber
Read Firebird SQL reference: https://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25.html
Here is yet one more solution using Windowed Functions introduced in Firebird 3 - http://sqlfiddle.com/#!6/ce7cf/13
I do not have Firebird 3 at hand, so I cannot check whether there is some unexpected incompatibility; try it at home :-D
SELECT * FROM
(
SELECT
TAB1.*,
ROW_NUMBER() OVER (
PARTITION BY n_order
ORDER BY revision DESC
) AS rank
FROM TAB1
) d
WHERE rank = 1
Read documentation
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/
https://www.firebirdsql.org/file/documentation/release_notes/html/en/3_0/rnfb30-dml-windowfuncs.html
Which of the three solutions (including Gordon's) would be faster depends upon the specific database - the real data, the existing indexes, the selectivity of indexes.
While window functions allow a join-less query, I am not sure it would be faster on real data: it may just ignore indexes on the (order, revision) pair and do a full scan before the rank = 1 condition is applied, while the first solution would most probably use indexes to get the maximums without actually reading every row in the table.
The Firebird-support mailing list suggested a way to avoid the self-join and use only a single query: the trick is combining window functions and a CTE (common table expression): http://sqlfiddle.com/#!18/ce7cf/2
WITH TMP AS (
SELECT
*,
MAX(revision) OVER (
PARTITION BY n_order
) as max_REV
FROM TAB1
)
SELECT * FROM TMP
WHERE revision = max_REV
If you want the max revision number in Firebird:
select t.*
from tab1 t
where t.revision = (select max(t2.revision) from tab1 t2 where t2.n_order = t.n_order);
For performance, you want an index on tab1(n_order, revision). With such an index, performance should be competitive with any other approach.
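On the real table that could look like the sketch below (the sample table above already gets an equivalent index from its UQ1 unique constraint):
CREATE INDEX IDX_TAB1_ORDER_REVISION ON TAB1 (N_ORDER, REVISION);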

Multiple joins to get the same lookup column for different values

We have a rather large SQL query that performs rather poorly. One of the problems (from analysing the query plan) is the number of joins we have.
Essentially we have values in our data that we need to look up in another table to get the value to display to the user. The problem is that we have to join on the same table 4 times, because there are 4 different columns that all need the same lookup.
Hopefully this diagram might make it clearer
Raw_Event_data
event_id, datetime_id, lookup_1, lookup_2, lookup_3, lookup_4
1, 2013-01-01_12:00, 1, 5, 3, 9
2, 2013-01-01_12:00, 121, 5, 8, 19
3, 2013-01-01_12:00, 11, 2, 3, 32
4, 2013-01-01_12:00, 15, 2, 1, 0
Lookup_table
lookup_id, lookup_desc
1, desc1
2, desc2
3, desc3
...
Our query then looks something like this
Select
raw.event_id,
raw.datetime_id,
lookup1.lookup_desc,
lookup2.lookup_desc,
lookup3.lookup_desc,
lookup4.lookup_desc
FROM
Raw_Event_data raw, Lookup_table lookup1,Lookup_table lookup2,Lookup_table lookup3,Lookup_table lookup4
WHERE raw.event_id = 1 AND
raw.lookup_1 *= lookup1.lookup_id AND
raw.lookup_2 *= lookup2.lookup_id AND
raw.lookup_3 *= lookup3.lookup_id AND
raw.lookup_4 *= lookup4.lookup_id
So I get as an output
1, 2013-01-01_12:00, desc1, desc5, desc3, desc9
As I said the query works, but the joins are killing the performance.
That is a simplistic example; in reality there will be 12 joins like the above, and we won't be selecting a specific event but rather a range of events.
The question is, is there a better way of doing those joins.
correlated subqueries might be the way to go:
SELECT r.event_id
, r.datetime_id
, (select lookup1.lookup_desc from lookup_table lookup1 where lookup1.lookup_id = r.lookup_1) as desc_1
, (select lookup2.lookup_desc from lookup_table lookup2 where lookup2.lookup_id = r.lookup_2) as desc_2
, (select lookup3.lookup_desc from lookup_table lookup3 where lookup3.lookup_id = r.lookup_3) as desc_3
, (select lookup4.lookup_desc from lookup_table lookup4 where lookup4.lookup_id = r.lookup_4) as desc_4
FROM Raw_Event_data r
WHERE r.event_id = 1
;
My first attempt would be to handle the indexing myself, if the DBAs refused to add indexes.
declare @start_range bigint, @end_range bigint
select
@start_range = 5
,@end_range = 500
create local temporary table raw_event_subset
( --going to assume some schema based on your comments...obviously you will change these to whatever the base schema is.
event_id bigint
,datetime_id timestamp
,lookup_1 smallint
,lookup_2 smallint
--etc
) on commit preserve rows
create HG index HG_temp_raw_event_subset_event_id on raw_event_subset (event_id)
create LF index LF_temp_raw_event_subset_lookup_1 on raw_event_subset (lookup_1)
create LF index LF_temp_raw_event_subset_lookup_2 on raw_event_subset (lookup_2)
--etc
insert into raw_event_subset
select
event_id
,datetime_id
,lookup_1
,lookup_2
--,etc
from raw_event_data
where event_id >= @start_range --event_id *must* have an HG index on it for this to be worthwhile.
and event_id <= @end_range
--then run your normal query, except replace raw_event_data with raw_event_subset
select
event_id
,datetime_id
,l1.lookup_desc
,l2.lookup_desc
--etc
from raw_event_subset r
left join lookup_table l1
on l1.lookup_id = r.lookup_1
left join lookup_table l2
on l2.lookup_id = r.lookup_2
--etc
drop table raw_event_subset
hope this helps...

Merge duplicate records into 1 record within the same table and table fields

I have a database table that contains a list of demographic records; some participants might have multiple/duplicate records, e.g.
NOTE:
Gender:
119 = Male
118 = Female
Race:
255 = white
253 = Asian
UrbanRural:
331 = Urban
332 = Rural
participantid, gender, race, urbanrural, moduletypeid, hibernateid, and more fields
1, 119, 0, 331, 1, 1, .....
1, 119, 255, 0, 2, 2, .....
1, 0, 255, 331, 3, 3, .....
1, 119, 253, 331, 0, 4, .....
The output should keep the first hibernateid, and the duplicate records should be merged into that first hibernateid record. If you can do this using a function that checks whether records are duplicates, that would be great; after merging the records, it should delete the unused duplicate records. Your answer gives me a great idea to resolve this problem. Thanks
Output should be:
participantid, gender, race, urbanrural, moduletypeid, hibernateid, and more fields
1, 119, 255, 331, 1, 1, .....
Help me guys, Thanks
You can do something like this in Postgres 9.1+:
WITH duplicates AS (
SELECT desired_unique_key, count(*) AS count_of_same_key, min(st.id) AS keep_id, max(st.id) as delete_id
FROM source_table st
GROUP BY desired_unique_key
HAVING count(*) > 1
),
deleted_dupes AS (
DELETE FROM source_table st
WHERE st.id IN (SELECT delete_id FROM duplicates)
)
UPDATE source_table st
SET field = WHATEVER
FROM duplicates d
WHERE st.id = d.keep_id
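A hedged adaptation of that pattern to the question's columns; the table name demographics is assumed, and MAX() is used only because 0 appears to mean "missing" for these coded fields:
WITH keep AS (
    SELECT participantid,
           MIN(hibernateid) AS keep_id,
           MAX(gender)      AS gender,
           MAX(race)        AS race,
           MAX(urbanrural)  AS urbanrural
    FROM demographics
    GROUP BY participantid
    HAVING COUNT(*) > 1
),
merged AS (
    -- fill the surviving row (lowest hibernateid) with the merged values
    UPDATE demographics d
    SET gender = k.gender, race = k.race, urbanrural = k.urbanrural
    FROM keep k
    WHERE d.hibernateid = k.keep_id
)
-- remove the other duplicate rows for those participants
DELETE FROM demographics d
USING keep k
WHERE d.participantid = k.participantid
  AND d.hibernateid <> k.keep_id;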
Try something like:
select participantid, min(gender), min(race), min(urbanrural),
min(case moduletypeid when 0 then null else moduletypeid end), min(hibernateid), ...
from yourtable
group by participantid
It's not clear to me why moduletypeid would be returned as 1 in your example - I have assumed that 0 in this field is a special case, to be treated as null (hence the case clause).
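For what it's worth, NULLIF expresses the same special-casing of 0 a bit more compactly:
select participantid, min(gender), min(race), min(urbanrural),
       min(nullif(moduletypeid, 0)), min(hibernateid), ...
from yourtable
group by participantid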
I'm doing something like this (Postgres); I haven't tested it yet, though.
SELECT dup.id AS dup_id, orig.id AS orig_id
INTO TEMP specialty_duplicates
FROM medical_specialty dup,
(SELECT DISTINCT ON (name) * FROM medical_specialty ORDER BY name, id) orig
WHERE orig.name = dup.name AND dup.id <> orig.id;
UPDATE doctor_medical_specialty
SET medical_specialty=orig_id
FROM specialty_duplicates
WHERE medical_specialty = dup_id;
DELETE
FROM medical_specialty
WHERE id IN (SELECT dup_id FROM specialty_duplicates);
ALTER TABLE medical_specialty
ADD UNIQUE (name);
The schema is that medical_specialty has id and name, and doctor_medical_specialty references it by id.
The benefit over a CTE (IIUC) is you can merge references in multiple referring tables.
I'm using a temporary table rather than a view so that both deleting and updating are consistent with the same snapshot in time
So you want a query to find/remove duplicates, is that right?
If so, try this:
SELECT T1.* FROM table_name T1, table_name T2
WHERE T1.dupe_field = T2.dupe_field
AND T1.other_dupe_field = T2.other_dupe_field
AND T1.primary_key > T2.primary_key;
Change the table and field names to suit your own table structure.
Confirm with this SELECT query that it is selecting the dupes you want to remove, and then change it to a DELETE in order to remove the dupes.
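For example, the DELETE form could look like this (SQL Server/MySQL style; Postgres would use DELETE ... USING instead):
DELETE T1
FROM table_name T1
INNER JOIN table_name T2
    ON T1.dupe_field = T2.dupe_field
    AND T1.other_dupe_field = T2.other_dupe_field
WHERE T1.primary_key > T2.primary_key;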