UNION ALL Slower than N queries - sql

This question follows up on an earlier one, where I wanted to select the MAX value of multiple fields while retrieving each row.
The accepted answer with UNION ALL worked like a charm but I now have some scaling issues.
To give some context, I have more than 3 million rows in my matches table and the filters used in the WHERE condition can reduce this dataset to about 5000-6000 rows. I'm using PostgreSQL.
The query takes something like 14-16 seconds to process. The strange thing is that if I run one query at a time, it will take 150ms.
So if my math is correct, the total duration of this query should be about 150ms * 20 (the number of fields to select a max value for) = 3 seconds, not 16.
Why does the entire query take so much time?
Here are some questions I have about that:
Is it just better to run 20 queries and aggregate the final result?
Can I speed up my query by using an index?
Is it possible to apply the WHERE filters and the JOIN only once instead of repeating them in every subquery?
PS: here is the Node.js code I use if you want to read the query in a more readable way than the 500 lines of the pastebin:
const fields = [
'match_players.kills',
'match_players.deaths',
'match_players.assists',
'match_players.gold',
'matches.game_duration',
'match_players.minions',
'match_players.kda',
'match_players.damage_taken',
'match_players.damage_dealt_champions',
'match_players.damage_dealt_objectives',
'match_players.kp',
'match_players.vision_score',
'match_players.critical_strike',
'match_players.time_spent_living',
'match_players.heal',
'match_players.turret_kills',
'match_players.killing_spree',
'match_players.double_kills',
'match_players.triple_kills',
'match_players.quadra_kills',
'match_players.penta_kills',
]
const query = fields
.map((field) => {
return `
(SELECT
'${field}' AS what,
${field} AS amount,
match_players.win as result,
matches.id,
matches.date,
matches.gamemode,
match_players.champion_id
FROM
match_players
INNER JOIN
matches
ON
matches.id = match_players.match_id
WHERE
match_players.summoner_puuid = :puuid
AND match_players.remake = 0
AND matches.gamemode NOT IN (800, 810, 820, 830, 840, 850, 2000, 2010, 2020)
ORDER BY
${field} DESC, matches.id
LIMIT
1)
`
})
.join('UNION ALL ')
const { rows } = await Database.rawQuery(query, { puuid })
Thanks a lot for your time.

If your database engine and API support common table expressions (WITH keyword), then you could first perform the query that makes the join and the filtering, and then use the result set for performing the UNION ALL:
const query = `
WITH base as (
SELECT
${fields.join()},
match_players.win as result,
matches.id,
matches.date,
matches.gamemode,
match_players.champion_id
FROM
match_players
INNER JOIN
matches
ON
matches.id = match_players.match_id
WHERE
match_players.summoner_puuid = :puuid
AND match_players.remake = 0
AND matches.gamemode NOT IN (800, 810, 820, 830, 840, 850, 2000, 2010, 2020)
)
` + fields.map((field) => `
(SELECT
'${field}' AS what,
${field.split(".").pop()} AS amount,
result,
id,
date,
gamemode,
champion_id
FROM
base
ORDER BY
2 DESC, id
LIMIT
1)
`).join(' UNION ALL ');
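As a rough sketch, the string-building above can be wrapped in a function so the generated SQL is easy to inspect before it is sent to the database. The name buildMaxQuery is made up for this example and the field list is shortened; the code only builds the string and runs nothing against PostgreSQL. It orders by the column name rather than by position, which should be equivalent here.

```javascript
// Hypothetical helper: builds the CTE-based query for a list of "table.column" fields.
function buildMaxQuery(fields) {
  // The join and the WHERE filters appear once, inside the CTE.
  const base = `
WITH base AS (
  SELECT
    ${fields.join(', ')},
    match_players.win AS result,
    matches.id,
    matches.date,
    matches.gamemode,
    match_players.champion_id
  FROM match_players
  INNER JOIN matches ON matches.id = match_players.match_id
  WHERE match_players.summoner_puuid = :puuid
    AND match_players.remake = 0
    AND matches.gamemode NOT IN (800, 810, 820, 830, 840, 850, 2000, 2010, 2020)
)`;

  // One small SELECT per field, each reading only from the CTE.
  const selects = fields.map((field) => {
    const column = field.split('.').pop(); // "match_players.kills" -> "kills"
    return `
(SELECT '${field}' AS what, ${column} AS amount, result, id, date, gamemode, champion_id
 FROM base
 ORDER BY ${column} DESC, id
 LIMIT 1)`;
  });

  return base + selects.join('\nUNION ALL');
}

const sql = buildMaxQuery(['match_players.kills', 'match_players.deaths', 'matches.game_duration']);
console.log(sql);
```

Because base is referenced once per field, PostgreSQL evaluates it a single time and reuses the result (by default it only folds a CTE into the outer query when the CTE is referenced once, and only from version 12 on), so the join and filters are not repeated for each field.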

Related

Postgres query optimization on joins and where in clauses

I am trying to build a backend that sends users notifications from time to time. To do that I need to pull data from several Postgres tables. I wrote this query, but it takes 12-14 seconds to return the data.
When run without the WHERE ... IN clauses, I get the data in about 700ms.
SELECT DISTINCT ON (t."playerId") t."gzpId", t."pubCode", t."playerId" as token, t."provider",
COALESCE(p."preferenceValue",'en') as lang,
s."segmentId"
FROM "userPlayerIdMap" t LEFT JOIN
"userPreferences" p
ON t."gzpId" = p."gzpId" LEFT JOIN
"segment" s
ON t."gzpId" = s."gzpId"
WHERE t."pubCode" IN ('hyrmas','ayqioa','rj49as99') and
t."provider" IN ('FCM','ONE_SIGNAL') and
s."segmentId" IN (0,1,2,3,4,5,6) and
p."preferenceValue" IN ('en','hi')
ORDER BY t."playerId" desc;
Rows in "userPlayerIdMap" = 650000
Rows in "userPreferences" = 1456466
Rows in "segment" = 5674186
I have already added indexes on the required columns.
Would really appreciate some help.
Use subqueries:
SELECT t."gzpId", t."pubCode", t."playerId" as token, t."provider",
COALESCE((SELECT p."preferenceValue"
FROM "userPreferences" p
WHERE t."gzpId" = p."gzpId" AND
p."preferenceValue" IN ('en', 'hi')
LIMIT 1
), 'en'
) as lang,
(SELECT s."segmentId"
FROM "segment" s
WHERE t."gzpId" = s."gzpId" AND
s."segmentId" IN (0, 1, 2, 3, 4, 5, 6)
LIMIT 1
) as segmentId
FROM "userPlayerIdMap" t
WHERE t."pubCode" IN ('hyrmas', 'ayqioa', 'rj49as99') and
t."provider" IN ('FCM', 'ONE_SIGNAL')
-- ORDER BY t."playerId" desc;
I'm not sure the ORDER BY is necessary. If it was only being used for the DISTINCT ON, then it is not necessary in this version of the logic.
At the very least (with the ORDER BY) this will reduce the number of rows that need to be sorted. If you don't need the ORDER BY, then there is no sort -- a significant performance gain.
Then, you want indexes on:
userPreferences(gzpId, preferenceValue)
segment(gzpId, segmentId)
The index on userPlayerIdMap is trickier. I don't think that Postgres can use the index for both IN lists without a scan. You want the more selective column first, so use one of:
userPlayerIdMap(provider, pubCode, gzpId)
userPlayerIdMap(pubCode, provider, gzpId)
I threw in gzpId so Postgres can use the index to look up the values for the subqueries.
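Written out as DDL, the index suggestions above might look like the following sketch. It is kept as JavaScript strings, in the same style as the Node.js code in the first question. The identifiers are double-quoted on the assumption that the tables and columns were created with the mixed-case names used in the question's query, and the choice of pubCode as the leading column is a guess; only one of the two userPlayerIdMap orderings would be created in practice.

```javascript
// Quote an identifier so PostgreSQL keeps its mixed case.
const q = (name) => `"${name}"`;

// Build a CREATE INDEX statement; the index name is left for PostgreSQL to generate.
const createIndex = (table, columns) =>
  `CREATE INDEX ON ${q(table)} (${columns.map((c) => q(c)).join(', ')});`;

const statements = [
  createIndex('userPreferences', ['gzpId', 'preferenceValue']),
  createIndex('segment', ['gzpId', 'segmentId']),
  // Assumption: pubCode is the more selective column; swap the first two otherwise.
  createIndex('userPlayerIdMap', ['pubCode', 'provider', 'gzpId']),
];

console.log(statements.join('\n'));
```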

SQL - query timing out when pulling records for most recent date with a subquery

I'm trying to pull values for the most recent date (COMPUTE_DAY) in a very large dataset. This seems to be a frequently asked question, and the most common solution is a subquery on the same table. Unfortunately, my query times out every time I try that. The table is partitioned on two columns, REGION and COMPUTE_DAY, with primary keys REGION, COMPUTE_DAY and PLAN_UUID. Are there any ways I can make this query more efficient?
SELECT /*+ use_hash(ipp,ipp2) */
ipp.COMPUTE_DAY,
ipp.ITEM,
ipp.MANUFACTURER,
ipp.ORDER_DATE,
ipp.CARTON,
sum(ipp.TARGET_INVENTORY) as 1,
sum(ipp.CURRENT_INVENTORY) as 2,
sum(ipp.DEMAND) as 3,
sum(ipp.ORDERS) as 4,
sum(ipp.SHIPMENTS) as 5,
sum(ipp.QUANTITY) as 6
FROM
table ipp
WHERE
ipp.REGION = 1
AND ipp.COMPUTE_DAY = (select max(ipp2.COMPUTE_DAY) from O_IP_PLANS ipp2 where ipp2.REGION_ID = 1 AND ipp2.COMPUTE_DAY BETWEEN TO_DATE('{RUN_DATE_YYYY/MM/DD}','YYYY/MM/DD')-7 AND TO_DATE('{RUN_DATE_YYYY/MM/DD}','YYYY/MM/DD') AND ipp2.PLAN_UUID = ipp.PLAN_UUID)
AND ipp.GROUP_ID = 121
AND ipp.IOG = 1
AND ipp.INTENT = 'YES'
GROUP BY ipp.COMPUTE_DAY,
ipp.ITEM,
ipp.MANUFACTURER,
ipp.ORDER_DATE,
ipp.CARTON;

SQL GROUP BY function returning incorrect SUM amount

I've been working on this problem, researching what I could be doing wrong but I can't seem to find an answer or fault in the code that I've written. I'm currently extracting data from a MS SQL Server database, with a WHERE clause successfully filtering the results to what I want. I get roughly 4 rows per employee, and want to add together a value column. The moment I add the GROUP BY clause against the employee ID, and put a SUM against the value, I'm getting a number that is completely wrong. I suspect the SQL code is ignoring my WHERE clause.
Below is a small selection of data:
hr_empl_code hr_doll_paid
1 20.5
1 51.25
1 102.49
1 560
I expect that a GROUP BY with SUM would give me the value of 734.24. The value I'm given is 211461.12. While troubleshooting, I added a COUNT(*) column to my query to work out how many rows it's running against, and it gives a result of 1152, which further reinforces my belief that it's ignoring my WHERE clause.
My SQL code is as below. Most of it has been generated by the front-end application that I'm running it from, so there is some additional code in there that I believe does assist the query.
SELECT DISTINCT
T000.hr_empl_code,
SUM(T175.hr_doll_paid)
FROM
hrtempnm T000,
qmvempms T001,
hrtmspay T166,
hrtpaytp T175,
hrtptype T177
WHERE 1 = 1
AND T000.hr_empl_code = T001.hr_empl_code
AND T001.hr_empl_code = T166.hr_empl_code
AND T001.hr_empl_code = T175.hr_empl_code
AND T001.hr_ploy_ment = T166.hr_ploy_ment
AND T001.hr_ploy_ment = T175.hr_ploy_ment
AND T175.hr_paym_code = T177.hr_paym_code
AND T166.hr_pyrl_code = 'f' AND T166.hr_paid_dati = 20180404
AND (T175.hr_paym_type = 'd' OR T175.hr_paym_type = 't')
GROUP BY T000.hr_empl_code
ORDER BY hr_empl_code
I'm really lost as to where it could be going wrong. I have stripped out the additional WHERE ... AND conditions and brought it down to just T166.hr_empl_code = T175.hr_empl_code, but it doesn't make a difference.
By no means am I any expert in SQL Server and queries, but I have decent grasp on the technology. Any help would be very appreciated!
GROUP BY is not wrong; how you are using it is. Aggregate the payments first, then join:
SELECT
T000.hr_empl_code,
T.totpaid
FROM
hrtempnm T000
inner join (SELECT
hr_empl_code,
SUM(hr_doll_paid) as totPaid
FROM
hrtpaytp T175
where hr_paym_type = 'd' OR hr_paym_type = 't'
GROUP BY hr_empl_code
) T on t.hr_empl_code = T000.hr_empl_code
where exists
(select * from qmvempms T001,
hrtmspay T166,
hrtpaytp T175,
hrtptype T177
WHERE T000.hr_empl_code = T001.hr_empl_code
AND T001.hr_empl_code = T166.hr_empl_code
AND T001.hr_empl_code = T175.hr_empl_code
AND T001.hr_ploy_ment = T166.hr_ploy_ment
AND T001.hr_ploy_ment = T175.hr_ploy_ment
AND T175.hr_paym_code = T177.hr_paym_code
AND T166.hr_pyrl_code = 'f' AND T166.hr_paid_dati = 20180404
)
ORDER BY hr_empl_code
Note: It would be clearer if you used explicit JOINs instead of old-style joins in the WHERE clause.
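The figures in the question fit join fan-out rather than an ignored WHERE clause: 1152 rows / 4 payment rows = 288, and 734.24 * 288 = 211461.12, exactly the reported sum. The sketch below reproduces that arithmetic in JavaScript. The 288 matching rows are an assumption inferred from those numbers; which joined table (or combination of tables) supplies them cannot be told from the question.

```javascript
// The four payment rows shown in the question for employee 1.
const payments = [20.5, 51.25, 102.49, 560].map((paid) => ({ empl: 1, paid }));

// Assumption: the other joined tables contribute 288 matching rows for this employee.
const otherRows = Array.from({ length: 288 }, () => ({ empl: 1 }));

// A join on empl pairs every payment row with every matching row.
const joined = payments.flatMap((p) =>
  otherRows.filter((o) => o.empl === p.empl).map(() => p)
);

const sum = (rows) => rows.reduce((total, row) => total + row.paid, 0);

console.log(joined.length);            // rows seen by SUM after the join
console.log(sum(joined).toFixed(2));   // inflated total
console.log(sum(payments).toFixed(2)); // total when aggregating before joining
```

Aggregating hrtpaytp in a derived table before joining, as the answer does, keeps each payment row from being counted once per matching row.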

LINQ to SQL Every Nth Row From Table

Does anybody know how to write a LINQ to SQL statement that returns every nth row from a table? I need to get the title of the item at the top of each page in a paged data grid, for fast user scanning. So if I wanted the first record, then every 3rd one after that, from the following names:
Amy, Eric, Jason, Joe, John, Josh, Maribel, Paul, Steve, Tom
I'd get Amy, Joe, Maribel, and Tom.
I suspect this can be done... LINQ to SQL statements already invoke the ROW_NUMBER() SQL function in conjunction with sorting and paging. I just don't know how to get back every nth item. The SQL Statement would be something like WHERE ROW_NUMBER MOD 3 = 0, but I don't know the LINQ statement to use to get the right SQL.
Sometimes, TSQL is the way to go. I would use ExecuteQuery<T> here:
var data = db.ExecuteQuery<SomeObjectType>(@"
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS [__row]
FROM [YourTable]) x WHERE (x.__row % 25) = 1");
You could also swap out the n:
var data = db.ExecuteQuery<SomeObjectType>(@"
DECLARE @n int = 2
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS [__row]
FROM [YourTable]) x WHERE (x.__row % @n) = 1", n);
Once upon a time, there was no such thing as Row_Number, and yet such queries were possible. Behold!
var query =
from c in db.Customers
let i = (
from c2 in db.Customers
where c2.ID < c.ID
select c2).Count()
where i%3 == 0
select c;
This generates the following SQL:
SELECT [t2].[ID], [t2]. --(more fields)
FROM (
SELECT [t0].[ID], [t0]. --(more fields)
(
SELECT COUNT(*)
FROM [dbo].[Customer] AS [t1]
WHERE [t1].[ID] < [t0].[ID]
) AS [value]
FROM [dbo].[Customer] AS [t0]
) AS [t2]
WHERE ([t2].[value] % @p0) = @p1
Here's an option that works, but it might be worth checking that it doesn't have any performance issues in practice:
var nth = 3;
var ids = Table
.Select(x => x.Id)
.ToArray()
.Where((x, n) => n % nth == 0)
.ToArray();
var nthRecords = Table
.Where(x => ids.Contains(x.Id));
Just googling around a bit I haven't found (or experienced) an option for Linq to SQL to directly support this.
The only option I can offer is to write a stored procedure with the appropriate SQL query written out and then call the sproc via LINQ to SQL. Not the best solution, especially if you have any kind of complex filtering going on.
There really doesn't seem to be an easy way to do this:
How do I add ROW_NUMBER to a LINQ query or Entity?
How to find the ROW_NUMBER() of a row with Linq to SQL
But there's always:
peopleToFilter.AsEnumerable().Where((x,i) => i % AmountToSkipBy == 0)
NOTE: This still doesn't execute on the database side of things!
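The index arithmetic is easy to get off by one, so here is a small in-memory sketch using the names from the question. It is written in JavaScript rather than C# only to stay in one language with the Node.js code earlier on this page, and it says nothing about what SQL a LINQ provider would generate. It contrasts a zero-based index (as in the Where overload above) with a one-based row number (as ROW_NUMBER() produces).

```javascript
const names = ['Amy', 'Eric', 'Jason', 'Joe', 'John', 'Josh', 'Maribel', 'Paul', 'Steve', 'Tom'];
const n = 3;

// Zero-based index, as in Where((x, i) => ...): keep indexes 0, 3, 6, 9.
const byIndex = names.filter((_, i) => i % n === 0);

// One-based row number, as ROW_NUMBER() produces: the same rows need "% n = 1".
const byRowNumber = names.filter((_, i) => (i + 1) % n === 1);

// "% n = 0" on a one-based row number picks rows 3, 6, 9 instead.
const modZero = names.filter((_, i) => (i + 1) % n === 0);

console.log(byIndex, byRowNumber, modZero);
```

This is why the ExecuteQuery answers test for a remainder of 1, while the WHERE ROW_NUMBER MOD 3 = 0 idea from the question would skip the first row.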
This will do the trick, but it isn't the most efficient query in the world:
var count = query.Count();
var pageSize = 10;
var pageTops = query.Take(1);
for(int i = pageSize; i < count; i += pageSize)
{
pageTops = pageTops.Concat(query.Skip(i - (i % pageSize)).Take(1));
}
return pageTops;
It dynamically constructs a query to pull the (nth, 2*nth, 3*nth, etc.) value from the given query. If you use this technique, you'll probably want to cap it at maybe ten or twenty names, similar to how a Google results page shows 1-10 and Next, to avoid building an expression so large that the database refuses to parse it.
If you need better performance, you'll probably have to use a stored procedure or a view to represent your query, and include the row number as part of the stored proc results or the view's fields.

MAX Subquery in SQL Anywhere Returning Error

In sqlanywhere 12 I wrote the following query which returns two rows of data:
SELECT "eDatabase"."Vendor"."VEN_CompanyName", "eDatabase"."OrderingInfo"."ORD_Timestamp"
FROM "eDatabase"."OrderingInfo"
JOIN "eDatabase"."Vendor"
ON "eDatabase"."OrderingInfo"."ORD_VEN_FK" = "eDatabase"."Vendor"."VEN_PK"
WHERE ORD_INV_FK='7853' AND ORD_DefaultSupplier = 1
Which returns:
'**United Natural Foods IN','2018-02-07 15:05:15.513'
'Flora ','2018-02-07 14:40:07.491'
I would like to only return the row with the maximum timestamp in the column "ORD_Timestamp". After simply trying to select by MAX("eDatabase"."OrderingInfo"."ORD_Timestamp") I found a number of posts describing how that method doesn't work and to use a subquery to obtain the results.
I'm having difficulty creating the subquery in a way that works and with the following query I'm getting a syntax error on my last "ON":
SELECT "eDatabase"."Vendor"."VEN_CompanyName", "eDatabase"."OrderingInfo"."ORD_Timestamp"
FROM ( "eDatabase"."OrderingInfo"
JOIN
"eDatabase"."OrderingInfo"
ON "eDatabase"."Vendor"."VEN_PK" = "eDatabase"."OrderingInfo"."ORD_VEN_FK" )
INNER JOIN
(SELECT "eDatabase"."Vendor"."VEN_CompanyName", MAX("eDatabase"."OrderingInfo"."ORD_Timestamp")
FROM "eDatabase"."OrderingInfo")
ON "eDatabase"."Vendor"."VEN_PK" = "eDatabase"."OrderingInfo"."ORD_VEN_FK"
WHERE ORD_INV_FK='7853' AND ORD_DefaultSupplier = 1
Does anyone know how I can adjust this to make the query correctly select only the max ORD_Timestamp row?
Try this:
SELECT TOP 1 "eDatabase"."Vendor"."VEN_CompanyName", "eDatabase"."OrderingInfo"."ORD_Timestamp"
FROM "eDatabase"."OrderingInfo"
JOIN "eDatabase"."Vendor"
ON "eDatabase"."OrderingInfo"."ORD_VEN_FK" = "eDatabase"."Vendor"."VEN_PK"
WHERE ORD_INV_FK='7853' AND ORD_DefaultSupplier = 1
order by "ORD_Timestamp" desc
This orders the rows with the largest timestamp on top and returns only the top row.
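The same sort-then-take-first idea can be sketched in memory with the two rows from the question. This JavaScript snippet only illustrates the logic; the property names company and timestamp are made up for the example, and it relies on the timestamps sharing one fixed format so that comparing them as strings matches chronological order.

```javascript
// The two rows returned by the original query in the question.
const rows = [
  { company: '**United Natural Foods IN', timestamp: '2018-02-07 15:05:15.513' },
  { company: 'Flora ', timestamp: '2018-02-07 14:40:07.491' },
];

// ORDER BY ORD_Timestamp DESC: sort a copy, newest first.
const sorted = [...rows].sort((a, b) =>
  a.timestamp < b.timestamp ? 1 : a.timestamp > b.timestamp ? -1 : 0
);

// TOP 1: keep only the first row after sorting.
const latest = sorted[0];
console.log(latest.company, latest.timestamp);
```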