I have a question about testing query performance for views in redshift.
I have two tables: table_a and table_b:
- table_a and table_b have different sort keys defined.
- table_a has 6 fields in its sort key.
- table_b has 4 fields in its sort key.
- both tables share some column names/types, but table_a's columns are a superset of table_b's.
I created a view v_combined that combines data from table_a and table_b based on the date queried. For example, if the query is for dates before date XYZ, the view sources table_a; otherwise it sources table_b.
create view v_combined as
select a as x, b as y, c as z, to_timestamp(time_field::TEXT, 'YYYYMMDD')::timestamp as date
from table_a
where date < 'MY_DATE'
union all
select * from table_b
where date > 'MY_DATE'
I ran a comparison between the view and the corresponding table:
(1) select count(*) from v_combined where date < 'MY_DATE'
(2) select count(*) from table_a where date < 'MY_DATE'
(3) select count(*) from v_combined where date > 'MY_DATE'
(4) select count(*) from table_b where date > 'MY_DATE'
(5) select * from v_combined where date < 'MY_DATE' limit 10000
(6) select * from table_a where date < 'MY_DATE' limit 10000
(7) select * from v_combined where date > 'MY_DATE' limit 10000
(8) select * from table_b where date > 'MY_DATE' limit 10000
(1) and (2) have similar execution time as expected.
(3) and (4) have similar execution time as expected.
(5) seems to have longer execution time than (6).
(7) seems to have longer execution time than (8).
What is the best way to test the performance of a view in redshift?
I'd say that the best way to test the performance of a view is to run test queries exactly like you did!
The performance of this particular View will always be poor because it is doing a UNION ALL.
In (5), it needs to get ALL rows from both tables before applying the LIMIT, whereas (6) only needs to access table_a and can stop as soon as it hits the limit.
If you need good performance from queries like this, you could consider creating a combined table (rather than a view). Run a daily (or hourly?) script to recreate the table from the combined data. That way, queries will run much faster.
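A minimal sketch of that approach, reusing the view definition above. The table name combined_table, the sort key choice, and the assumption that table_b's columns line up as x, y, z, date are all illustrative, not from the original question:
CREATE TABLE combined_table
  SORTKEY (date)   -- pick a sort key that matches the common date filter
AS
SELECT a AS x, b AS y, c AS z,
       to_timestamp(time_field::TEXT, 'YYYYMMDD')::timestamp AS date
FROM table_a
WHERE to_timestamp(time_field::TEXT, 'YYYYMMDD')::timestamp < 'MY_DATE'
UNION ALL
SELECT x, y, z, date   -- assumed column names in table_b
FROM table_b
WHERE date > 'MY_DATE';
The scheduled script would drop (or truncate) and rebuild this table, so readers query one sorted table instead of a UNION ALL view.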
Related
Say I want to match records in table_a, which have a startdate and an enddate, to individual days, and see whether on a given day (for instance March 13) one or more records in table_a match. I'd like to solve this by generating a row per day, with the date as the leading column and any matching data from table_a joined in via a left join.
I've worked with data warehouses that have date dimensions that make this job easy. But unfortunately I need to run this particular query on an OLTP database that doesn't have such a table.
How can I generate a row-per-day table in SQL Server? How can I do this inside my query, without temp tables, functions/procedures etc?
An alternative is a recursive query to generate the date series. Based on your pseudo-code:
with dates_table as (
select <your-start-date> dt
union all
select dateadd(day, 1, dt) from dates_table where dt < <your-end-date>
)
select d.dt, a.<whatever>
from dates_table d
left outer join table_a a on <join / date matching here>
-- where etc etc
option (maxrecursion 0)
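For example, a concrete version covering March 2021 (the dates are hypothetical; the startdate/enddate range join comes from the question):
with dates_table as (
    select cast('2021-03-01' as date) as dt
    union all
    select dateadd(day, 1, dt) from dates_table where dt < '2021-03-31'
)
select d.dt, a.*
from dates_table d
left outer join table_a a
    on d.dt >= a.startdate and d.dt <= a.enddate
option (maxrecursion 0);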
I found a bit of a hacky way to do this. I'll assume two years of dates is sufficient for your dates table.
Now, find a table in your database that has at least 800 records (365 x 2, plus leap days, rounded up to roughly 800 to keep the arithmetic simple). The question talks about selecting data from table_a, so we'll assume this other table is table_b. Then, create this Common Table Expression at the top of the query:
with dates_table as (
select top 800 -- or more/less if your timespan isn't ~2years
[date] = dateadd(day, ROW_NUMBER() over (order by <random column>) - 1, <your-start-date>)
from table_b
)
select d.[date]
, a.<whatever>
from dates_table d
left outer join table_a a on <join / date matching here>
-- where etc, etc, etc
Some notes:
This works by just getting the numbers 0 - 799 and adding them to an arbitrary date.
If you need more or fewer dates than two years, increase or decrease the number fed into the top clause. Ensure that table_b has sufficient rows: select count(*) from table_b.
<random column> can be any column on table_b; the ordering doesn't matter. We're only interested in the numbers 1-800 (minus 1, for a range of 0-799), but ROW_NUMBER requires an order by argument.
<your-start-date> is the first date you want in the dates table, and is included in the output.
In the where of the joined query, you can filter out any excess days that we overshot by taking 800 rows instead of 730 (+ leap days), for example by adding year(d.[date]) IN (2020, 2021).
If table_a itself has more than 800 records, it could be used as the basis for dates_table instead of some other table. A concrete sketch of the whole pattern follows below.
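Here is a filled-in version of the same idea, assuming a start date of 2020-01-01 (hypothetical) and using ORDER BY (SELECT NULL) so no real column needs to be named; the startdate/enddate join again comes from the question:
with dates_table as (
    select top 800
        [date] = dateadd(day, ROW_NUMBER() over (order by (select null)) - 1, '2020-01-01')
    from table_b
)
select d.[date], a.*
from dates_table d
left outer join table_a a
    on d.[date] >= a.startdate and d.[date] <= a.enddate
where year(d.[date]) in (2020, 2021);   -- trim the days we overshot past two years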
I have the following SQL right now; I am interested in fetching only 100 records. Say my table has 100k records. I want to fetch these records as efficiently as possible.
current SQL:
Select a, b, c from table where SomeCostlyFunction(a,b,c) > T1
LIMIT 100;
I want to introduce another check based on SomeCostlyFunction(a,b,c):
option#1
Select a, b, c from table
where SomeCostlyFunction(a,b,c) > T1 AND SomeCostlyFunction(a,b,c) < T2
LIMIT 100;
option#2
Select a, b, c from
(Select a, b, c, SomeCostlyFunction(a,b,c) as func_val from table) t
where func_val > T1 AND func_val < T2
LIMIT 100;
Doubts:
In option#1, SomeCostlyFunction() will be called twice per row instead of once, compared to the baseline.
But in option#2, because of the nested select, will SomeCostlyFunction() be called 100k times? I am only interested in getting 100 records.
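One way to answer this empirically, assuming a PostgreSQL-like engine (the dialect isn't stated, but LIMIT suggests one), is to look at the actual per-node row counts with EXPLAIN ANALYZE. If the scan feeding the LIMIT reports far fewer than 100k rows, the function was not evaluated for the whole table:
-- Sketch only: substitute your real table name and thresholds for T1/T2.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a, b, c
FROM (SELECT a, b, c, SomeCostlyFunction(a, b, c) AS func_val FROM my_table) t
WHERE func_val > T1 AND func_val < T2
LIMIT 100;
-- In the output, "rows=" on the inner scan node shows how many rows (and hence
-- roughly how many function evaluations) were needed before the LIMIT was satisfied.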
I have the following SQL statement:
SELECT *
FROM (
SELECT eu_dupcheck AS dupcheck
, eu_date AS threshold
FROM WF_EU_EVENT_UNPROCESSED
WHERE eu_dupcheck IS NOT NULL
UNION
SELECT he_dupcheck AS dupcheck
, he_date AS threshold
FROM WF_HE_HISTORY_EVENT
WHERE he_dupcheck IS NOT NULL
)
WHERE threshold > sysdate - 30
The second table is partitioned by date but the first isn't. I need to know if the partition of the second table will be hit in this query, or will it do a full table scan?
I would be surprised if Oracle were smart enough to avoid a full table scan. Remember that UNION processes the data by removing duplicates. So, Oracle would have to recognize that:
The where clause is appropriate for the partitioning (this is actually easy).
That partitioning does not affect the duplicate removal (this is a bit harder, but true because the date is in the select).
Oracle has a smart optimizer, so perhaps it can recognize this situation (and it would probably avoid the full table scan for a UNION ALL). However, you are safer by moving the condition to the subqueries:
SELECT *
FROM ((SELECT eu_dupcheck AS dupcheck, eu_date AS threshold
FROM WF_EU_EVENT_UNPROCESSED
WHERE eu_dupcheck IS NOT NULL AND eu_date > sysdate - 30
) UNION
(SELECT he_dupcheck AS dupcheck, he_date AS threshold
FROM WF_HE_HISTORY_EVENT
WHERE he_dupcheck IS NOT NULL AND he_date > sysdate - 30
)
) eh;
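To see directly whether the partition is hit, you can run EXPLAIN PLAN on the rewritten query and check the Pstart/Pstop columns in the plan output; this is standard Oracle tooling, not something specific to the original question:
EXPLAIN PLAN FOR
SELECT *
FROM ((SELECT eu_dupcheck AS dupcheck, eu_date AS threshold
       FROM WF_EU_EVENT_UNPROCESSED
       WHERE eu_dupcheck IS NOT NULL AND eu_date > sysdate - 30
      ) UNION
      (SELECT he_dupcheck AS dupcheck, he_date AS threshold
       FROM WF_HE_HISTORY_EVENT
       WHERE he_dupcheck IS NOT NULL AND he_date > sysdate - 30
      )
     ) eh;

-- Pstart/Pstop against WF_HE_HISTORY_EVENT show which partitions are scanned.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);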
I'm writing a small Objective-C library that works with an embedded SQLite database.
The SQLite version I'm using is 3.7.13 (checked with SELECT sqlite_version())
My query is:
SELECT ROUND(AVG(difference), 5) as distance
FROM (
SELECT (
SELECT A.timestamp - B.timestamp
FROM ExampleTable as B
WHERE B.timestamp = (
SELECT MAX(timestamp)
FROM ExampleTable as C
WHERE C.timestamp < A.timestamp
)
) as difference
FROM ExampleTable as A
ORDER BY timestamp)
Basically it outputs the average timestamp difference between rows ordered by timestamp.
I tried the query on a sample database with 35k rows and it runs in around 100ms. So far so good.
I then tried the query on another sample database with 100k rows and it hangs at sqlite3_step() taking up 100% of CPU usage.
Since I cannot step into sqlite3_step() with the debugger, is there another way I can get a grasp of where the function is hanging, or a debug log of what the issue is?
I also tried running other queries from my library on the 100k rows database and there is no issue, but it's also true that these are simple queries with no subquery. Maybe this is the issue?
Thanks
UPDATE
This is the output of EXPLAIN QUERY PLAN as requested:
"1","0","0","SCAN TABLE ExampleTable AS A"
"1","0","0","EXECUTE CORRELATED SCALAR SUBQUERY 2"
"2","0","0","SCAN TABLE ExampleTable AS B"
"2","0","0","EXECUTE CORRELATED SCALAR SUBQUERY 3"
"3","0","0","SEARCH TABLE ExampleTable AS C"
"1","0","0","USE TEMP B-TREE FOR ORDER BY"
"0","0","0","SCAN SUBQUERY 1"
Looking up rows by their timestamp value can be optimized with an index on this column:
CREATE INDEX whatever ON ExampleTable(timestamp);
And this query is inefficient: ORDER BY does not affect values that are averaged, and the timestamp values in B and C are always identical, so you can drop one of them:
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
SELECT timestamp -
(SELECT MAX(timestamp)
FROM ExampleTable AS B
WHERE timestamp < A.timestamp)
AS difference
FROM ExampleTable AS A)
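Assuming the index above has been created, you can confirm that SQLite actually uses it by re-running EXPLAIN QUERY PLAN on the simplified query; the inner subquery should show a SEARCH using the index rather than a SCAN (the exact wording varies by SQLite version):
EXPLAIN QUERY PLAN
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
    SELECT timestamp -
           (SELECT MAX(timestamp)
            FROM ExampleTable AS B
            WHERE timestamp < A.timestamp)
           AS difference
    FROM ExampleTable AS A);
-- Expected: a SEARCH on ExampleTable AS B using the timestamp index.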
I eventually went with this solution:
CREATE TABLE tmp AS SELECT timestamp FROM ExampleTable ORDER BY timestamp
SELECT ROUND(AVG(difference), 5)
FROM (
SELECT (
SELECT A.timestamp - B.timestamp
FROM tmp as B
WHERE B.rowid = A.rowid-1
) as difference
FROM tmp as A
ORDER BY timestamp)
DROP TABLE ExampleTable
Actually I went further: I only use this strategy for a high number of rows (> 40k), since the other strategy (single query) works better for "small" tables.
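As an aside, on newer SQLite versions (3.25+, so not the 3.7.13 mentioned above) the same consecutive-row difference can be computed with a window function and no temporary table; a minimal sketch:
SELECT ROUND(AVG(difference), 5) AS distance
FROM (
    SELECT timestamp - LAG(timestamp) OVER (ORDER BY timestamp) AS difference
    FROM ExampleTable);
-- AVG ignores the NULL produced by LAG for the first row.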
I have to structure these queries so they are correct SQL. The queries need to be for a SQL Server database; I have a database StoresDB and a table items_table.
I need to retrieve:
- the total number of items in this table
- the number of items where the price is higher than or equal to £10 (the price column is named amount)
- the list of items in the computer category (the category column is named comp_id), sorted by decreasing amount
For the above requests I have attempted the below:
SELECT COUNT(*) FROM items_table
Select * from items_table where amount >= 10
Select * from items_table where comp_id = 'electronics' desc
I am very new to SQL and not sure if I have attempted this correctly.
It may be good to know a few things when writing this sort of query:
a) SELECT COUNT(*) FROM items_table
This query is written correctly.
b) SELECT COUNT(*) FROM items_table WHERE amount >= 10
The query is OK, but consider creating indexes that cover the WHERE clause; in this case it is good to have a non-clustered index on the amount column.
c) SELECT * FROM items_table WHERE comp_id = 'electronics' ORDER BY amount DESC
With this last query there is the issue of returning all columns with SELECT *, which is considered bad practice in production; put only the columns that are really needed in the SELECT list, not all of them. You can also create a non-clustered index on the comp_id column, with the columns from the SELECT list as included columns.
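A minimal sketch of such an index, assuming (hypothetically) that only an item name column and amount are needed in the result:
-- Non-clustered index that seeks on comp_id and covers the selected columns,
-- so the query above avoids extra lookups. item_name is an illustrative column.
CREATE NONCLUSTERED INDEX IX_items_table_comp_id
ON items_table (comp_id)
INCLUDE (item_name, amount);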
a) Looks correct.
b) You are being asked for a count but are querying a list.
SELECT COUNT(*) FROM items_table WHERE amount >= 10
c) This one looks good, but you are missing an ORDER BY clause.
SELECT * FROM items_table WHERE comp_id = 'electronics' ORDER BY amount DESC