I am new to SQL and trying to query a large database so speed is an issue. I have been using a query (with line 1) of the form shown below which has been working fine, but when I modify it (to switch line 1 for line 2) to use a constant to make a cut rather than a value derived within the query itself then the query is significantly slower (running time of 1 is ~1sec and 2 is a few minutes). I would have actually expected it to be much quicker.
Can someone explain why this is happening or suggest how I might rewrite this query better?
Thanks
Query
with local_sample as
( SELECT b.mass, ...various other columns selected...
FROM table1 TAB, table2 b
WHERE ...a few clauses... )
SELECT min(prog.num), LTAB.mass, ...various other columns...
from local_sample LTAB, table2 prog
WHERE ...a few clauses...
[**1**] and prog.mass > LTAB.mass/2.0
[**2**] and prog.mass > 31.62
group by ...columns...
Information in the question is kind of scarce, so at a guess, it's an implicit conversion issue. My hunch is LTAB.mass is the same data type as prog.mass, so no conversion is necessary, but whatever that data type is doesn't play nicely with decimal.
Numbers in sql come in many flavors, and most of the time we don't have to think about it because the conversions are very fast and happen in the background. Occasionally though, you'll come across number types formats that don't play well with others (float for instance) and it can become a performance pain point.
So here is a way to test if that's the issue run the below query (assuming Microsoft SQL Server is your RDBMS):
select SQL_VARIANT_PROPERTY(Mass,'BaseType') AS 'Base Type'
From table2
If my hunch is correct it will return float as the base type. If that's the case and the implicit conversion is the issue then the below should work in a similar manner to query 1:
with local_sample as
( SELECT b.mass, ...various other columns selected...
FROM table1 TAB, table2 b
WHERE ...a few clauses... )
Declare #Mass float = 31.62
SELECT min(prog.num), LTAB.mass, ...various other columns...
from local_sample LTAB, table2 prog
WHERE ...a few clauses...
and prog.mass > #Mass
group by ...columns...
Anyway let me know if that doesn't work for you.
Related
I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
#exercises = BODYPARTS.map do |bp|
Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
#exercises = Exercise.all
BODYPARTS.map do |bp|
#exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale:
Queries: 0.072902 0.020728 0.093630 ( 0.088008)
Select: 0.000962 0.000225 0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072 0.000008 0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do (thanks PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
.order('bodypart, random()')
Postgres' DISTINCT ON is very handy and performance is typically great, too - for many distinct bodyparts with few rows each. But for only few distinct values of bodypart with many rows each (big table - and your use case) there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM unnest(enum_range(null::bodypart)) b(bodypart)
CROSS JOIN LATERAL (
SELECT *
FROM exercises
WHERE bodypart = b.bodypart
-- ORDER BY ??? -- for a deterministic pick
LIMIT 1 -- arbitrary pick!
) e;
Assuming that bodypart is the name of the enum as well as the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row for each bodypart, a simple index on exercises(bodypart) does the job. But you can have a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause and almost the same performance.
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?
I have a table DB_Budget which has 3 columns Business_Unit_Code, Ledger_Period, Budget_GBP. For sake of simplicity, I have left out the other columns.
Data types are -
This table is present in my development and production environment.
While doing some quality checks I ran the below query:
select
Business_Unit_Code,
Ledger_Period,
Budget_GBP
from [SomeLinkedServer].[Database].dbo.DB_BUDGET
where business_unit_code = 'AV' and ledger_period = '200808'
and budget_gbp >= 32269
except
select
Business_Unit_Code,
Ledger_Period,
Budget_GBP
from [Database].dbo.DB_BUDGET
where business_unit_code = 'AV' and ledger_period = '200808'
and budget_gbp >= 32269
I got this -
If I remove the except, this is what I get -
Clearly, data is same in both tables! Why would EXCEPT give me one row?
Things get interesting. I wrap Budget_GBP around LTRIM(RTRIM( ... construct.
And things matched!
I did a bit of googling and found that LTRIM(RTRIM( basically rounds off the float to 32269.2. That might be the reason why they match.
So, to summarize, my first question is why the EXCEPT gives a row in result when the records are matching?
My second question might be simple. As you can see, I am restricted to use the clause budget_gbp >= 32269 in WHERE clause. Reason is when I provide the exact value(which I am copying from SSMS), I get no results. Please let me know what I am doing wrong here.
EDIT - Is there any way the data validation might work? There are 100s of table in the database and it is next to impossible for me to scavenge for float columsn and wrap them around cast. Using EXCEPT is one ways of validating the data in development environment.
Cast the values to fixed point types or strings and re-run your code. The first query would look like:
select Business_Unit_Code, Ledger_Period, cast(Budget_GBP as decimal(18, 2)) as Budget_GBP
With numbers, what you see is not always what you get. So you see 32669.19, but it might really be 32.669.189999.
you could define an accuracy and check if the value is within the specified accuracy of each other.
declare #accuracy float = 0.001
select *
from DB_BUDGET1 t1
full join DB_BUDGET2 t2
on t1.Business_Unit_Code=t2.Business_Unit_Code
and t1.Ledger_Period =t2.Ledger_Period
and t1.Budget_GBP between t2.Budget_GBP-#accuracy and t2.Budget_GBP+#accuracy
where t1.id is null
or t2.id is null
http://www.sqlfiddle.com/#!3/41da0/2/0
I was always bothered by how should I approach those, which solution is better. I guess the sample code should explain it better.
Lets imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, but with each of the calculation being based on a previous one, In other words something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
The other one involves constructing a multilevel query like this:
SELECT
t2.*,
(t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
SELECT
t1.*,
t1.Value / t1.NewValue1 AS SomeOtherValue
FROM
(
SELECT
*,
Value + 10 AS NewValue1
FROM
MyTable
WHERE
Name LIKE "A%"
) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tired a number of different combination of calculations in both variants. They always produced the same execution plan, so it could be assumed that there is no difference in the performance aspect. From the code usability perspective the first approach i obviously better as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the variable is defined before evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.
I am working on a massive join at work and have very limited resources in terms of being able to add indexes and such as well as what I can do in the query itself due to the environment (i.e. I can only select data, no variables or table creations allowed). I have read somewhere that a subquery will automatically index the result, is this true? Also for my major join tables (3) each has ~140K rows. I have to join 2 extra tables to ensure filtering is correct. I have the query listed below which I currently have criteria on the JOIN clause. Another question is if I move my criteria to a where clause either in or out of the subquery will it benefit?
SELECT *
FROM (SELECT NULL AS A1,
DFS_ROHEADER.TECHID,
DFS_ROHEADER.RONUMBER,
DFS_ROHEADER.CUSTOMERNUMBER,
DFS_CUSTOMER.BNAME,
DFS_ROHEADER.UNITNUMBER,
DFS_ROHEADER.MILEAGE,
DFS_ROHEADER.OPENEDDATE,
DFS_ROHEADER.CLOSEDDATE,
DFS_ROHEADER.STATUS,
DFS_ROHEADER.PONUMBER,
DFS_TECH.REGION,
DFS_TECH.RSM,
DFS_ROPART.PARTID,
CONVERT(NVARCHAR(max), DFS_RODETAIL.STORY) AS STORY
FROM DFS_ROHEADER
LEFT JOIN DFS_CUSTOMER
ON DFS_ROHEADER.CUSTOMERNUMBER = DFS_CUSTOMER.CUST_NO
LEFT JOIN DFS_TECH
ON DFS_ROHEADER.TECHID = DFS_TECH.TECHID
INNER JOIN DFS_RODETAIL
ON DFS_ROHEADER.RONUMBER = DFS_RODETAIL.RONUMBER
INNER JOIN DFS_ROPART
ON DFS_RODETAIL.RONUMBER = DFS_ROPART.RONUMBER
AND DFS_RODETAIL.LINENUMBER = DFS_ROPART.LINENUMBER
AND DFS_ROHEADER.RONUMBER LIKE '%$FF_RONumber%'
AND DFS_ROHEADER.UNITNUMBER LIKE '%$FF_UnitNumber%'
AND DFS_ROHEADER.PONUMBER LIKE '%$FF_PONumber%'
AND ( DFS_CUSTOMER.BNAME LIKE '%$FF_Customer%'
OR DFS_CUSTOMER.BNAME IS NULL )
AND DFS_ROHEADER.TECHID LIKE '%$FF_TechID%'
AND DFS_ROHEADER.CLOSEDDATE BETWEEN
FF_ClosedBegin AND FF_ClosedEnd
AND ( DFS_TECH.REGION LIKE '%$FilterRegion%'
OR DFS_TECH.REGION IS NULL )
AND ( DFS_TECH.RSM LIKE '%$FF_RSM%'
OR DFS_TECH.RSM IS NULL )
AND DFS_RODETAIL.STORY LIKE '%$FF_Story%'
AND DFS_ROPART.PARTID LIKE '%$FF_PartID%'
WHERE DFS_ROHEADER.DELETED_BY < 0
AND DFS_RODETAIL.DELETED_BY < 0
AND DFS_ROPART.DELETED_BY < 0) T
ORDER BY T.RONUMBER
This query works; however, it can take forever to run, and can timeout. I have other queries that also run in the environment and I will take whatever you can give me in terms of suggestions and apply it to those. I am using SQLServer 2000, Thanks for the help.
EDIT:
Execution Plan:
https://dl.dropboxusercontent.com/u/99733863/ExecutionPlan.sqlplan
UPDATE:
I have come to the conclusion the environment I'm working in is the cause of the problem. My query works as intended and is not slow at all (1 sec. for 18,000 rows). As stated in the comments I have to fill grids with limited flexibility and I believe that these grids fill by first filling a temporary grid with the SQL statement and then copying row by row into the desired grid. There is a good chance that this is the cause of my issues. Thanks for the help.
I have come to the conclusion the environment I'm working in is the cause of the problem. My query works as intended and is not slow at all (1 sec. for 18,000 rows). As stated in the comments I have to fill grids with limited flexibility and I believe that these grids fill by first filling a temporary grid with the SQL statement and then copying row by row into the desired grid. There is a good chance that this is the cause of my issues. Thanks for the help everyone.
My 2 cents here.. In general LIKE is not very well optimized. In your case you also seem to be using LIKE with '%value%'. In that case the query optimizer has to scan the entire index. At a minimum I would see if there is a way to avoid using this.
I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we got the worst case CROSS JOIN condition ever :
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow !
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on a using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache :
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache :
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes , ensure your db is being vacuumed, increase the shared buffers for your db install, work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?