Stored procedure performance issue when grabbing data from many secondary tables - sql

Let me explain the scenario of my stored procedure: I have to populate a table whose columns come from several source tables. Below is one example of the several tables that I am using in my query.
Data
PrimaryID1   PrimaryID2   KeyID   Data
--------------------------------------
001          0011         1       abc1
001          0011         2       abc2
002          0021         1       xyz1
Since the granular data is at the 'PrimaryID1' and 'PrimaryID2' level, I am using correlated queries keyed on 'KeyID' to fill the different fields in the destination table that I am populating. I am also dealing with a huge amount of data in each of the 7 to 8 tables. Initially I had all the correlated queries in one single SQL statement, but that didn't work out (obviously!). Then I separated each table's set of queries into its own CTE and inserted from the final query, but I later learned that CTEs do nothing for query performance, so I switched to temp tables: I populated each set of data into a separate temp table, created non-clustered indexes on each of the fields I use for joining, and finally joined them in the final query, but this didn't work out either.
Let me explain the query I have written here. I am taking data from two tables: one is the PrimaryTable and the other is a SecondaryTable. The granular data for PrimaryTable is at the PrimaryID1/PrimaryID2 level, but in SecondaryTable the granularity goes further down to KeyID and its respective data column, Data. There are 8 secondary tables, each different from the others, so I end up with a set of such queries for every secondary table and a host of columns from each. This is how the query I have developed looks. In total I have around 280-odd columns in my query coming from the 8 different secondary tables.
Query:
SELECT
PP.PrimaryID1,
PP.PrimaryID2,
(SELECT PPA.Data1
FROM SecondaryTable1 PPA
WHERE PPA.PrimaryID1 = PP.PrimaryID1 AND
PPA.PrimaryID2 = PP.PrimaryID2 AND PPA.KeyID = 1) AS DataField1,
(SELECT PPA.Data2
FROM SecondaryTable1 PPA
WHERE PPA.PrimaryID1 = PP.PrimaryID1 AND
PPA.PrimaryID2 = PP.PrimaryID2 AND PPA.KeyID = 2) AS DataField2,
(SELECT PPA.Data3
FROM SecondaryTable1 PPA
WHERE PPA.PrimaryID1 = PP.PrimaryID1 AND
PPA.PrimaryID2 = PP.PrimaryID2 AND PPA.KeyID = 3) AS DataField3
FROM
PrimaryTable PP
I am dealing with 8 such secondary tables, with record counts ranging from 100 thousand up to 28 million. I hope this was helpful.

For your sample, you should be able to use a simple JOIN. Joins are generally more efficient than nested subqueries (although, knowing nothing about your indexing, it's hard to tell for sure whether this solves your performance problem).
SELECT
PP.PrimaryID1,
PP.PrimaryID2,
PPA1.Data1 AS DataField1,
PPA2.Data2 AS DataField2,
PPA3.Data3 AS DataField3
FROM PrimaryTable PP
JOIN SecondaryTable1 PPA1
ON PPA1.PrimaryID1 = PP.PrimaryID1 AND PPA1.PrimaryID2 = PP.PrimaryID2
AND PPA1.KeyID = 1
JOIN SecondaryTable1 PPA2
ON PPA2.PrimaryID1 = PP.PrimaryID1 AND PPA2.PrimaryID2 = PP.PrimaryID2
AND PPA2.KeyID = 2
JOIN SecondaryTable1 PPA3
ON PPA3.PrimaryID1 = PP.PrimaryID1 AND PPA3.PrimaryID2 = PP.PrimaryID2
AND PPA3.KeyID = 3
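If some KeyIDs can be missing for a given PrimaryID1/PrimaryID2 pair, the inner joins above will drop those rows, so LEFT JOINs may be needed to keep the behaviour of the original scalar subqueries. Another option worth measuring (a sketch only, not tested against your data) is conditional aggregation, which scans each secondary table once instead of probing it once per column:

SELECT
    PP.PrimaryID1,
    PP.PrimaryID2,
    PPA.DataField1,
    PPA.DataField2,
    PPA.DataField3
FROM PrimaryTable PP
LEFT JOIN (
    -- collapse SecondaryTable1 to one row per PrimaryID1/PrimaryID2,
    -- turning each KeyID into its own column
    SELECT PrimaryID1,
           PrimaryID2,
           MAX(CASE WHEN KeyID = 1 THEN Data1 END) AS DataField1,
           MAX(CASE WHEN KeyID = 2 THEN Data2 END) AS DataField2,
           MAX(CASE WHEN KeyID = 3 THEN Data3 END) AS DataField3
    FROM SecondaryTable1
    GROUP BY PrimaryID1, PrimaryID2
) PPA ON PPA.PrimaryID1 = PP.PrimaryID1
     AND PPA.PrimaryID2 = PP.PrimaryID2

With around 280 columns across 8 secondary tables, this shape gives the optimizer one grouped scan per secondary table plus 8 joins rather than hundreds of separate probes, which may or may not be cheaper depending on your indexes.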

Related

Get the "most" optimal row in a JOIN

Problem
I have a situation with two tables in which I would like the entries from table 2 (let's call it table_2) to be matched up with the entries in table 1 (table_1) such that no duplicate rows of table_2 are used in the match-up.
Discussion
Specifically, in this case there are datetime stamps in each table (the field is utcdatetime). For each row in table_1, I want to find the row in table_2 which has the closest utcdatetime to the table_1 utcdatetime, such that table_2.utcdatetime is older than the table_1 utcdatetime and within 30 minutes of it. Here is the catch: I do not want any repeats. If a row in table_2 gets gobbled up in a match on an earlier row in table_1, then I do not want it considered for a match later.
This has currently been implemented in a Python routine, but it is slow to iterate over all of the rows in table_1 because it is large. I thought I was there with a single SQL statement, but I found that my current SQL produces duplicate table_2 rows in the output data.
I would recommend using a nested select to get whatever results you're looking for.
For instance:
select *
from person p
where p.name_first = 'SCCJS'
and not exists (select 'x' from person p2
                where p2.person_id != p.person_id
                and p2.name_first = 'SCCJS' and p2.name_last = 'SC')
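For the closest-prior-timestamp matching itself, here is a hedged sketch using window functions (assuming SQL Server date functions and an id surrogate key on each table, neither of which is stated in the question). It picks the closest preceding table_2 row within 30 minutes for each table_1 row, then keeps each table_2 row only once; table_1 rows that lose that tie-break would need a second pass if they must still be matched:

WITH candidates AS (
    -- every allowed pairing, ranked by closeness per table_1 row
    SELECT t1.id AS t1_id,
           t2.id AS t2_id,
           ROW_NUMBER() OVER (
               PARTITION BY t1.id
               ORDER BY DATEDIFF(SECOND, t2.utcdatetime, t1.utcdatetime)
           ) AS rn_per_t1
    FROM table_1 t1
    JOIN table_2 t2
      ON t2.utcdatetime <= t1.utcdatetime
     AND t2.utcdatetime >= DATEADD(MINUTE, -30, t1.utcdatetime)
)
SELECT t1_id, t2_id
FROM (
    -- keep the best candidate per table_1 row, then allow each
    -- table_2 row to be claimed by only one table_1 row
    SELECT t1_id, t2_id,
           ROW_NUMBER() OVER (PARTITION BY t2_id ORDER BY t1_id) AS rn_per_t2
    FROM candidates
    WHERE rn_per_t1 = 1
) d
WHERE rn_per_t2 = 1;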

Many-to-one row updates in SQL?

So I'm trying to update data from a temporary table into a main table.
Let's call these tables temp, and services.
The pseudocode would be something like this...
Sort temp by inserted_on
When service_id and location match in both tables:
If temp.column1 is not null, replace services.column1
If temp.column2 is not null, replace services.column2
etc...
I've got this bit working, although when I have multiple source rows in the temp table that match the condition, not all fields are being updated.
For example, I might have two rows with identical service_id and location, in one row column1 is null and column2 has a value, and in the next row the opposite is true. I need to update these one by one in the order they came in, and overwrite old data if necessary.
I also need to join the temp table inside the UPDATE to retrieve the keys I'm matching on.
I've tried the below code, but it only seems to be updating certain rows, and I can't quite figure out what the logic is behind it.
I'm not worried about the order, I'm just trying to figure out why it's leaving some blanks when there is data ready to fill the gaps.
UPDATE sloc
SET
sloc.ata = COALESCE(tmp.ata, sloc.ata),
sloc.atd = COALESCE(tmp.atd, sloc.atd),
sloc.atp = COALESCE(tmp.atp, sloc.atp),
sloc.eta = COALESCE(tmp.eta, sloc.eta),
sloc.etd = COALESCE(tmp.etd, sloc.etd),
sloc.etp = COALESCE(tmp.etp, sloc.etp),
sloc.plat = COALESCE(tmp.plat, sloc.plat),
sloc.plats_up = COALESCE(tmp.plats_up, sloc.plats_up),
sloc.cis_plats_up = COALESCE(tmp.cis_plats_up, sloc.cis_plats_up)
FROM
services_locations sloc
INNER JOIN services svc ON svc.id = sloc.sid
INNER JOIN ref_tiploc tloc ON tloc.id = sloc.tpl_id
INNER JOIN trainstatus_tmp tmp ON svc.rid = tmp.rid AND tloc.tpl = tmp.tpl
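One hedged sketch of how the "latest non-null value wins" rule could be made explicit, so the outcome no longer depends on which of several matching tmp rows the UPDATE ... FROM happens to apply (this assumes SQL Server and that inserted_on is a column of trainstatus_tmp, as the pseudocode above implies; only two columns are shown, the rest follow the same pattern):

UPDATE sloc
SET
    sloc.ata = COALESCE(
        (SELECT TOP (1) t.ata
         FROM trainstatus_tmp t
         WHERE t.rid = svc.rid AND t.tpl = tloc.tpl AND t.ata IS NOT NULL
         ORDER BY t.inserted_on DESC),
        sloc.ata),
    sloc.atd = COALESCE(
        (SELECT TOP (1) t.atd
         FROM trainstatus_tmp t
         WHERE t.rid = svc.rid AND t.tpl = tloc.tpl AND t.atd IS NOT NULL
         ORDER BY t.inserted_on DESC),
        sloc.atd)
    -- ...repeat the same pattern for atp, eta, etd, etp, plat, plats_up, cis_plats_up...
FROM
    services_locations sloc
    INNER JOIN services svc ON svc.id = sloc.sid
    INNER JOIN ref_tiploc tloc ON tloc.id = sloc.tpl_id
WHERE EXISTS (SELECT 1
              FROM trainstatus_tmp t
              WHERE t.rid = svc.rid AND t.tpl = tloc.tpl);

Collapsing trainstatus_tmp to one row per (rid, tpl) first, keeping the latest non-null value of each column, and then joining as in the original statement would achieve the same result with fewer scans of the temp table.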

Merging/Joining Multiple Tables SQL

I am working with a database that has 5 tables, each with a different number of observations. I include a description of the columns in each table below. As you can see, Tables 1, 2 and 5 have SecurID in common, Tables 3 and 4 have Factor in common, and lastly Tables 3 and 5 have BID in common. I need to perform an analysis of Table 1 vs Table 2's exposure and return by date. To do this I need multiple merges/joins: I need to join Tables 3 and 4, then join them with Table 5, and lastly with Tables 1 and 2. What I tried was multiple joins like:
SELECT *
FROM Table3
INNER JOIN Table4 ON Table3.Factor = Table4.Factor
LEFT JOIN Table5 ON Table3.BID = Table5.BID
LEFT JOIN Table1 ON Table5.SecurID = Table1.SecurID
LEFT JOIN Table2 ON Table5.SecurID = Table2.SecurID
My problem is that when I run this query I get a crazy amount of extra observations. Are multiple joins the most efficient way to combine all these tables? I'm very new to SQL, but each table has an index, which I believe is a faster way to retrieve the data than a plain SELECT scan.
Table 1 (32,800 Observ.): SecurID, HoldingDate, Weight
Table 2 (2200 Observ.): SecurID, HoldingDate, Weight
Table 3 (808400 Observ.): BID, Factor, Exposure, Date
Table 4 (8000 Observ.): Factor, Return, FactorGrpName
Table 5 (1600 Observ.): SecurID, SecurName, BID
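A hedged diagnostic (not from the original post): extra observations after a join almost always mean a join key repeats on both sides, so every match gets multiplied. Counting rows per key on each table, one key at a time, shows which join is fanning out before deciding what to aggregate or deduplicate. For example, for Factor on Table3:

-- how many Table3 rows share each Factor value?
-- repeat the same idea for BID on Table3/Table5 and SecurID on Table1/Table2/Table5
SELECT Factor, COUNT(*) AS rows_per_factor
FROM Table3
GROUP BY Factor
ORDER BY rows_per_factor DESC;

If Table3 has many rows per Factor (one per BID and Date) and Table4 also repeats Factor, that single join already multiplies the 808,400 rows; aggregating one side down to the grain the analysis actually needs before joining keeps the row count under control.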

How to design the Tables / Query for (m:n relation?)

I am sorry if the term m:n is not correct; if you know a better term I will correct it. I have the following situation, and this is my original data:
gameID
participID
result
the data itself looks like this:
gameID  participID  result
1       5           10
1       4           -10
2       5           150
2       2           -100
2       1           -50
When I extract this table it will easily have some 100 million rows and around 1 million participIDs or more.
I will need:
show me all results of all games from participant x, where participant y was present
Luckily this is only needed for a very limited number of participants, but those are subject to change, so I need a complete table and can reduce it in a second step.
My idea is the following; it just looks very unoptimized:
1) Get the list of games where the "point of view participant" is included:
insert into consolidatedtable (gameid, participid, result)
select gameID, participID, sum(result)
from mastertable
where participID = x and result <> 0
group by gameID, participID
2) Get all games where the other participant is included:
insert into consolidatedtable (gameid, participid, result)
select gameID, participID, sum(result)
from mastertable
where gameID in (select gameID from consolidatedtable)
and participID = y and result <> 0
group by gameID, participID
3) Delete all games from the consolidated table where fewer than 2 participants remain:
delete from consolidatedtable
where gameID in (select gameID
                 from consolidatedtable
                 group by gameID
                 having count(distinct participID) < 2)
The whole thing looks like a children's solution to me:
I need a consolidated table for each player
I insert way too many games into this table and delete them later on
the whole thing needs to be run participant by participant over the whole master table; it would not work if I did this for several participants at the same time
Any better ideas? There must be, this one is just so bad. The master table will be PostgreSQL on the DW server, the consolidated view will be MySQL (but the number crunching will be done in PostgreSQL).
My problems:
1) How do I build the consolidated table(s) (do I need more than one?) without having to run a single query for each player over the whole master table? (I need the data for players x, y, z, no matter who else is playing.) This is the consolidation task for the DW server; it should create the (condensed) table for the webserver.
2) How can I then query the data at the webserver fast? (So the table design from (1) should take this into consideration.) We are not talking about a lot of players that I need this info for, maybe 100, so I could either partition by player ID or just create single tables.
Data warehouse: PostgreSQL 9.2 (48 GB, SSD)
Webserver: MySQL 5.5 (4 GB RAM, SSD)
Master table: gameID BIGINT, participID, result INT, foreign key on participID (to the participants table)
The DW server will hold the master table; the DW server should also prepare the consolidated/extracted tables (processing power and SSD space are not an issue).
The webserver should hold the consolidated tables (only for the ~100 players where I need the info) and query this data in a very efficient manner (so an efficient query at the webserver outweighs extra workload on the DW server).
I think this is important, sorry that I didn't include it at the beginning: the data at the DW server updates daily, but I do not need to query the whole "master table" completely every day. The setup allows me to consolidate only newer values, e.g. yesterday's consolidation went up to ID 500, the current ID is 550, so today I only consolidate 501-550.
Here is another idea that might work, depending on your database (and my understanding of the question):
SELECT *
FROM table a
WHERE participID = 'x'
AND EXISTS (
SELECT 1 FROM table b
WHERE b.participID = 'y'
AND b.gameID=a.gameID
);
Assuming you have indexes on the two columns (participID and gameID), the performance should be good.
I'd compare it to this and see which runs faster:
SELECT *
FROM table a
JOIN (
SELECT gameID
FROM table
WHERE participID = 'y'
GROUP BY gameID
) b
ON a.gameID=b.gameID
WHERE a.participID = 'x';
Sounds like you just want a self join:
For all participants:
SELECT x.gameID, x.participID, x.results, y.participID, y.results
FROM table as x
JOIN table as y
ON x.gameID = y.gameID
WHERE x.participID <> y.participID
The downside of that is you'd get each participant on each side of each game.
For 2 specific participants:
SELECT x.gameID, x.results, y.results
FROM (SELECT gameID, participID, results
      FROM table
      WHERE participID = 'x'
      and results <> 0) as x
JOIN (SELECT gameID, participID, results
      FROM table
      WHERE participID = 'y'
      and results <> 0) as y
ON x.gameID = y.gameID
You might not need to select participID in your query, depending on what you're doing with the results.
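For the consolidation step itself (the point above that running it participant by participant over the whole master table does not scale), a hedged sketch: drive the extract off a small helper table of the roughly 100 players of interest, so the master table is scanned once rather than once per player. The table interesting_players(participID) is an assumption for illustration, not part of the original schema:

insert into consolidatedtable (gameid, participid, result)
select m.gameID, m.participID, sum(m.result)
from mastertable m
where m.result <> 0
  and m.gameID in (select gameID
                   from mastertable
                   where participID in (select participID from interesting_players))
group by m.gameID, m.participID;

This keeps every participant's rows for any game in which at least one player of interest took part, which is a superset the webserver can then filter with the EXISTS or self-join queries above; if both x and y are always drawn from the interesting list, the select could be restricted to those participIDs to make the consolidated table smaller still.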

Update a single table based on data from multiple tables SQL Server 2005,2008

I need to update table one using data from table two. Table one and two are not related by any common column(s). Table three is related to table two.
Ex: table one (reg_det table)
reg_det_id | reg_id | results
-----------+--------+--------
101        | 11     | 344

Table two (temp table)
venue                     | results
--------------------------+--------
Anaheim convention center | 355

Table three (regmaster-tbl)
reg_id | venue
-------+---------------------------
11     | Anaheim convention center
I need to update the results column in table one using data from table two, but table one and table two are not related. Table two and table three, and table one and table three, are related as you can see above. Can anyone please suggest any ideas? I need the results value in table one to become 355, and this data is coming from table two; the two are unrelated but can be related through table three. Sorry if it is confusing!
Fairly straightforward:
UPDATE t1
SET t1.results = t2.results
FROM [table one] t1
INNER JOIN [table three] t3
  ON t1.reg_id = t3.reg_id
INNER JOIN [table two] t2
  ON t2.venue = t3.venue
Almost a question instead of an answer. :)
Couldn't you use an implied inner join?
UPDATE rd
SET rd.results = tt.results
FROM reg_det rd, regmaster rm, temptable tt
WHERE rm.reg_id = rd.reg_id
AND rm.venue = tt.venue;
I find it easier to read, and this syntax works in a SELECT statement, with the same meaning as an explicit inner join.
Try this:
UPDATE rd
SET rd.results = t.results
FROM reg_det rd
JOIN regmaster rm ON rm.reg_id = rd.reg_id
JOIN temptable t ON t.venue = rm.venue
WHERE t.results = 355
I added a WHERE clause because otherwise it will update all reg_det records that have matches in regmaster and temptable.