How to write a subquery to optimise performance? - sql

I have the following query that shows total sales for the selected dimensions. Table a does not contain product_name, which is why I've joined it with table b on product_id.
However, table b is too big, and I'd like to optimize the query so it scans less data.
SELECT a.date,
       a.hour,
       a.category_id,
       a.product_id,
       b.product_name,
       sum(a.sales) AS sales
FROM a
LEFT JOIN b
    ON a.product_id = b.product_id
WHERE date(a.date) >= date('2021-01-01')
  AND date(B.date) = date('2021-01-01')
GROUP BY 1, 2, 3, 4, 5
What would be your suggestions here?

There are two ways to decrease the amount of data Athena needs to scan for a given query:
Make sure the table is partitioned, and make sure the query makes use of the partitioning.
Store the data as Parquet or ORC.
These two can be used separately or in combination. Best results are achieved with the combination, but sometimes that's not convenient or possible.
Your question doesn't say if the tables are partitioned, but from the query it looks to me like they are not – unless date is a partition key.
date would be an excellent partition key, and if it is, your query is already pretty good. AND date(B.date) = date('2021-01-01') will limit the scan of table b to a single partition. However, if date is not a partition key, Athena will have to scan the whole table to find rows that match the criteria.
This is where a file format like Parquet and ORC can help; these store the data for each column separately, and also store metadata like the min and max values for each column. If the files for the b table were sorted by date, or at least created over time in such a way that they were mostly sorted by date, Athena would be able to look at the metadata and skip files that can't contain the sought date because it's outside of the range given by the min/max values for that file. Athena would also only have to read the parts of the files for the b table that contained the date column, because that is the only one used in the query.
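If you do convert the data, a CTAS statement is the usual way to do it in Athena. Here is a minimal sketch, assuming date could serve as a partition key; the target table name and S3 location are made up:
-- Hypothetical CTAS: rewrite b as Parquet, partitioned by date.
CREATE TABLE b_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/b_parquet/',
    partitioned_by = ARRAY['date']
) AS
SELECT product_id,
       product_name,
       date           -- partition columns must come last in the SELECT list
FROM b;
After that, a predicate like b.date = date('2021-01-01') only reads the files under that single partition.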
If you amend your question with a little more information about the table schemas and how the data is stored I can answer in more detail how to optimise. With the available information I can only give general guidance as above.

1. Make sure table b has indexes on date and product_id, as Stu's comment suggests.
2. Run an Explain Plan (from the console) on your SQL to see whether the optimizer filters b before joining to a. If it already does so, you're done - step 3 won't help.
3. Replace your FROM a LEFT JOIN b with FROM a LEFT JOIN (SELECT product_id, product_name FROM b WHERE date(date) = date('2021-01-01')) b, as sketched below.
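With that third change, the full query would look something like the sketch below; the outer filter on b.date moves into the derived table, so only the matching slice of b is joined:
SELECT a.date,
       a.hour,
       a.category_id,
       a.product_id,
       b.product_name,
       sum(a.sales) AS sales
FROM a
LEFT JOIN (
    SELECT product_id,
           product_name
    FROM b
    WHERE date(date) = date('2021-01-01')
) b
    ON a.product_id = b.product_id
WHERE date(a.date) >= date('2021-01-01')
GROUP BY 1, 2, 3, 4, 5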


Creating a view from JOIN two massive tables

Context
I have a big table, say table_A, with roughly 20 billion rows and 600 columns. I don't own this table but I can read from it.
For a fraction of these columns I produce a few extra columns (50), which I store in a separate table, say table_B, which is therefore roughly 20 bn x 50 in size.
Now I need to expose the join of table_A and table_B to users, which I tried as
CREATE VIEW table_AB
AS
SELECT *
FROM table_A AS ta
LEFT JOIN table_B AS tb ON (ta.tec_key = tb.tec_key)
The problem is that even a simple query like SELECT * FROM table_AB LIMIT 2 will fail because of memory issues: apparently Impala attempts to do the full join in memory first, which would result in a table of 0.5 petabytes. Hence the failure.
Question
What is the best way to create such a view?
How can one instruct SQL that filtering operations on table_AB are to be executed before the join?
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
I have also tried with [...] SELECT STRAIGHT_JOIN * [...] but it did not help.
What is the best way to create such a view?
Since both tables are huge, there will be memory problems. Here are some points I would recommend (see the sketch after this list):
Assuming tables A and B share the same tec_key values, do an inner join.
Keep the (smaller) table B as the driver: create the view as select ... from b join a on .... Impala stores the driver table in memory, so it will require less memory.
Select only the columns you need; do not select everything.
Put a filter in the view.
Partition table B if you can, on a date/year/region/anything that distributes the data evenly.
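A rough sketch of what such a view could look like (every column except tec_key is a placeholder, and the filter column is only an assumption):
-- Hypothetical view following the points above: the smaller table_B drives the join,
-- an inner join is used, only the needed columns are selected, and a filter is baked in.
CREATE VIEW table_AB AS
SELECT tb.tec_key,
       tb.derived_col_1,   -- placeholder column from table_B
       ta.base_col_1       -- placeholder column from table_A
FROM table_B AS tb
JOIN table_A AS ta
  ON ta.tec_key = tb.tec_key
WHERE ta.region = 'Asia';   -- example filter, ideally on a partition column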
How can one instruct SQL that e.g. filtering operations are to be performed on table_AB are to be executed before the join?
You cannot guarantee whether a filter is applied before or after the join. The only way to ensure a filter will improve performance is to have a partition on the filter column. Otherwise, you can try filtering first and then joining, to see whether it improves performance, like this
select ... from b
join ( select ... from a where region='Asia') a on ... -- won't improve much
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
Completely agree on this. Multiple smaller tables are way better than one giant table with 600 columns. So, create a few staging tables with only the required fields and then enrich that data. It's a difficult data set, but no one will change 20 bn rows every day, so some sort of incremental load is also possible to implement.

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table, around 10 TB, which is partitioned by a datetime column (let's say time) so that data can be purged based on this column every day. So my understanding is that each partition holds about a day's worth of data. From the Storage tab (SSMS GUI) I see the # of partitions is 1995.
There is no clustered index on this table, as it's mostly intended for write operations. Just a design choice by the vendor.
SELECT a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
    VALUES ($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
The query currently submitted is:
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing the inequality predicates with an equality that looks only in that day's specific partition might help. Users can control the dates they send via the app, but how will they manage if we want them to use a partition # as in the first query?
Question
How do I find a way, in the above query, to derive the partition number for that day rather than hard-coding it (for 06/01 I had to give partition # 1896)? Is there a better way to have the script find the partition # so that not all partitions are scanned, and insert the correct partition # in the WHERE clause?
Thank you
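One possible sketch, not a definitive answer: since the query above already uses $PARTITION.a_func, that same function can compute the partition number from the requested date instead of the hard-coded 1896 (assuming a_func really is the partition function on the time column):
-- Hypothetical: derive the partition number for the requested day.
DECLARE @day datetime = convert(datetime, '20210601', 112);

SELECT a.*
FROM dbo.VLTB AS a
WHERE $PARTITION.a_func(a.time) = $PARTITION.a_func(@day);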

To create index on a varchar column or not - Access

I have read a couple of articles about when to create an index on a column, but all of those related to MySQL, SQL Server or Oracle. I now have a fair idea about whether I should create an index on my column or not, but I would like a learned opinion on it before I actually try it.
I have an MS Access database which has around 15 tables. All tables have a column called [Locations], and this column is used in almost all WHERE clauses and most of the JOIN conditions. This column has 5 distinct values as of now, i.e. 5 locations: A, B, C, D, E.
So my question is: though this column is part of most WHERE clauses and JOINs, the limited variety of values (just 5) is making me hesitate to create an index on it.
Please advise.
It is important to bear in mind that an Access database is a "peer-to-peer" (as opposed to "client-server") database, so table scans can be particularly detrimental to performance, especially if the back-end database file is on a network share. Therefore it is always a good idea to ensure that there are indexes on all fields that participate in WHERE clauses or the ON conditions of JOINs.
Example: I have a sample table with one million rows and a field named [Category] that contains the value 'A' or 'B'. Without an index on the [Category] field the query
SELECT COUNT(*) AS n FROM [TestData] WHERE [Category] = 'B'
had to do a table scan and that generated about 48 MB of total network traffic. Simply adding an index on [Category] reduced the total network traffic to 0.27 MB for the exact same query.
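For reference, the index itself can be created with Access DDL; a minimal sketch (the index name is arbitrary):
CREATE INDEX idxCategory ON [TestData] ([Category]);
The same statement, pointed at the [Locations] field of each of your tables, applies to the situation in the question.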

Comparing two partition's data in hive

I have 9 million records in each of my partitions in Hive, and I have two partitions. The table has 20 columns. Now I want to compare the datasets between the partitions based on an id column. Which is the best way to do it, considering that a self join with 9 million records will create performance issues?
You can try the SMB (sort-merge-bucket) join - it's mostly like merging two sorted lists. However, in this case you will need to create two more tables.
Another option would be to write a UDF to do the same - that would be a project by itself. The first option is easier.
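A rough sketch of what the SMB route could look like (table names, bucket count and column list are made up; both sides must be bucketed and sorted on the join key, which is why the extra tables are needed):
-- Hypothetical: one bucketed, sorted copy per partition.
SET hive.enforce.bucketing=true;   -- needed on older Hive versions
SET hive.enforce.sorting=true;

CREATE TABLE part_x_sorted (id BIGINT, col1 STRING /* ...remaining columns... */)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

INSERT OVERWRITE TABLE part_x_sorted
SELECT id, col1 /* , ... */ FROM my_table WHERE partition_col = 'x';

-- same again for partition 'y' into part_y_sorted, then enable the SMB join:
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT a.*, b.*
FROM part_x_sorted a
JOIN part_y_sorted b ON a.id = b.id;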
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually do the full cartesian product.
select a.foo, b.foo
from my_table a
full outer join my_table b
on a.id <=> b.id
where a.partition = 'x' and b.partition = 'y'
To do a full comparison of 2 tables (or comparing 2 partitions of the same table), my experience has shown me that using some checksum mechanism is a more effective and reliable solution than Joining the tables (which gives performance problems as you mentioned, and also gives some difficulties when keys are repeated for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same and using the "--source-where" and "--destination-where" to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.
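If you would rather stay in HiveQL, a rough sketch of the checksum idea looks like this (my_table, partition_col and the hashed column list are placeholders; extend the hash to all 20 columns):
-- Aggregate a checksum per partition; if the counts and checksums match, the
-- partitions are almost certainly identical, otherwise drill down further
-- (e.g. group by ranges of id) to locate the differing rows.
SELECT partition_col,
       count(*)                                  AS row_cnt,
       sum(cast(hash(id, col1, col2) AS BIGINT)) AS partition_checksum
FROM my_table
WHERE partition_col IN ('x', 'y')
GROUP BY partition_col;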

SQL Server 2005 query plan optimizer choking on date partitioned tables

We have TABLE A, partitioned by date, which does not contain data from today; it only contains data from the prior day back through year to date.
We have TABLE B, also partitioned by date, which does contain data from today as well as data from the prior day back through year to date. On top of TABLE B there is a view, View_B, which joins against View_C and View_D and left outer joins TABLE E. View_C and View_D each select from a single table and do not have any other tables joined in. So View_B looks something like
SELECT b.Foo, c.cItem, d.dItem, e.eItem
FROM TABLE_B b
JOIN View_C c ON c.cItem = b.cItem
JOIN View_D d ON b.dItem = d.dItem
LEFT OUTER JOIN TABLE_E e ON b.eItem = e.eItem
View_AB joins TABLE A and View_B on extract date as well as one other constraint. So it looks something like:
SELECT a.Col_1, b.Col_2, ...
FROM TABLE_A a LEFT OUTER JOIN View_B b
on a.ExtractDate = b.ExtractDate and a.Foo=b.Foo
-- no where clause
When searching for data from anything other than the prior day, the query optimizer does what would be expected: it uses a hash match join to complete the outer join and reads about 116 pages' worth of data from TABLE B. If run for the prior day, however, the optimizer freaks out, uses a nested loops join, scans the table 7000+ times and reads 8,000,000+ pages in the join.
We can fake/force it to use a different query plan by using join hints; however, any constraints in the view that look at TABLE B then cause the optimizer to throw an error that the query can't be completed because of the join hints.
Editing to add that the pages/scans are the same numbers as are hit in one scan when the query is run for a day where the optimizer correctly chooses a hash join instead of a nested loops join.
As mentioned in the comments, we have severely reduced the impact by creating a covering index on TABLE_B to cover the join in View_B, but the IO is still higher than it would be if the optimizer chose the correct plan, especially since the index is essentially redundant for all but prior-day searches.
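For illustration only, because the actual key and included columns are guesses based on the joins shown above, such a covering index might look something like
-- Hypothetical covering index: keyed on the join columns, with the selected
-- column included so the join never has to touch the base table pages.
CREATE NONCLUSTERED INDEX IX_TABLE_B_JoinCover
ON TABLE_B (ExtractDate, cItem, dItem, eItem)
INCLUDE (Foo);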
The sqlplan is at http://pastebin.com/m53789da9, sorry that it's not the nicely formatted version.
If you can post the .sqlplan for each of the queries it would help for sure, but my hunch is that you are getting a parallel plan when querying for dates prior to the current day and the nested loop is possibly a constant loop over the partitions included in the table which would then spawn a worker thread for each partition (for more information on this, see the SQLCAT post on parallel plans with partitioned tables in Sql2005). Can't verify if this is the case or not without seeing the plans however.
In case anyone ever runs into this, the issue appears to be only tangentially related to the partitioning scheme. Even though we run a statistics update nightly, it appears that SQL Server
Didn't create a statistic on the ExtractDate
Even when the extract date statistic was explicitly created, didn't pick up that the prior day had data.
We resolved it by running CREATE STATISTICS TABLE_A_ExtractDate_Stats ON TABLE_A (ExtractDate) WITH FULLSCAN. Now searching for the prior day, as well as for a random sampling of other days, generates the correct plan.
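For completeness, a sketch of that statement together with a refresh the nightly job could run (whether FULLSCAN is needed on every refresh is an assumption):
-- Create a full-scan statistic on the join column.
CREATE STATISTICS TABLE_A_ExtractDate_Stats
ON TABLE_A (ExtractDate)
WITH FULLSCAN;

-- Refresh it later so newly loaded ExtractDate values are picked up.
UPDATE STATISTICS TABLE_A TABLE_A_ExtractDate_Stats WITH FULLSCAN;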