HIVE: Optimizing join of non-partitioned table to partitioned table - hive

I'm trying to join some customer applications (~560k records) to another servicing table (~4.6 Billion records) which contains multiple snapshots of customers. Keep in mind that the latter is partitioned by servicing date.
The aim is to join the servicing data as of the time of application, where the application date is equal to servicing date for each customer.
The way I've done my join relies on loading all partitions of the servicing table, but this is obviously a very expensive operation and is causing my query to take a very long time to run, so I would appreciate any help optimizing this join.
Below is the code I've tried:
SELECT *
FROM applications apps
JOIN
    -- partitioned table
    (SELECT * FROM servicing WHERE serv_date > 0) serv
  ON apps.customer_id = serv.customer_id
 AND apps.app_date = serv.serv_date
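One direction that might help, shown here only as a sketch: if the application dates fall within a bounded window, putting constant bounds on the partition column lets Hive prune the servicing partitions instead of reading them all. The variable values below are placeholders, and the assumption that serv_date is a numeric yyyymmdd partition key is inferred from the serv_date > 0 filter above.
-- Placeholders for the earliest and latest application dates (adjust to the real values/format).
SET hivevar:min_app_date=20180101;
SET hivevar:max_app_date=20181231;

SELECT *
FROM applications apps
JOIN servicing serv
  ON apps.customer_id = serv.customer_id
 AND apps.app_date = serv.serv_date
-- Constant bounds on the partition column let Hive prune partitions outside the window.
WHERE serv.serv_date BETWEEN ${hivevar:min_app_date} AND ${hivevar:max_app_date};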

Related

How to improve SQL query in Spark when updating table? ('NOT IN' in subquery)

I have a DataFrame in Spark which is registered as a table called A and has 1 billion records and 10 columns. The first column (ID) is the primary key.
I also have another DataFrame which is registered as a table called B and has 10,000 records and 10 columns (the same columns as table A; the first column (ID) is the primary key).
The records in table B are 'update records'. So I need to update all 10,000 records in table A with the records in table B.
I tried first with this SQL query:
select * from A where ID not in (select ID from B) and then to union that with table B. The approach is OK, but the first query (select * from A where ID not in (select ID from B)) is extremely slow (hours on a moderate cluster).
Then I tried to speed up first query with LEFT JOIN:
select A.* from A left join B on (A.ID = B.ID ) where B.ID is null
That approach seems fine logically, but it takes WAY too much memory for the Spark containers (Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead).
What would be a better/faster/less memory consumption approach?
I would go with the left join too, rather than NOT IN.
A couple of suggestions to reduce the memory requirement and improve performance:
Check whether the large table is uniformly distributed by the join key (ID). If not, some tasks will be heavily burdened and others lightly loaded, which will cause serious slowness. Do a groupBy on ID and a count to measure this.
If the join key is naturally skewed, then add more columns to the join condition while keeping the result the same. More columns may increase the chance of shuffling the data uniformly. This is a little hard to achieve.
Memory demand depends on the number of parallel tasks running and the volume of data per task being executed in an executor. Reducing either or both will reduce memory pressure and obviously run slower, but that is better than crashing. I would reduce the volume of data per task by creating more partitions on the data. Say you have 10 partitions for 1B rows; then make it 200 to reduce the volume per task. Use repartition on table A. Don't create too many partitions, because that will cause inefficiency; 10K partitions may be a bad idea.
There are some parameters to be tweaked, which are explained here.
The small table with 10K rows should be automatically broadcast because it is small. If not, you can increase the broadcast threshold and apply a broadcast hint.
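For what it's worth, here is a minimal Spark SQL sketch of the broadcast-hint suggestion, assuming A and B are registered as tables exactly as in the question and a Spark version that supports join hints; LEFT ANTI JOIN is just another way of writing the left join plus the B.ID is null filter:
-- LEFT ANTI JOIN keeps only the rows of A that have no match in B,
-- and the BROADCAST hint ships the 10K-row table B to every executor
-- so the billion-row table A does not need to be shuffled for the join.
SELECT /*+ BROADCAST(B) */ A.*
FROM A
LEFT ANTI JOIN B
  ON A.ID = B.ID
UNION ALL
-- then append the update records themselves
SELECT * FROM B;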

How to improve query performance if runs against view spanning over more than 50 tables?

I have a bit of a situation with SSRS reports that I have built and that are currently under development.
Some background on the DB: we have CRM 2015 on-premises, which has a SQL DB in the back end. My SSRS reports are based on filtered views, which have matching names in the CRM front end. So I have to pick and choose the fields from the filtered views and then put the SQL logic in.
Most reports are based on the new Admission and Service Activities views, which have a 1-N relationship. Both of these views are growing rapidly day by day.
If I just run Select * from ServiceActivitesFilteredView it takes more than 15 minutes to return around 500,000 rows, which is growing by 2,000 a day. This view is built on more than 50 tables, most of which, as far as I checked, are connected in the back end with LEFT OUTER JOINs.
And if I just run Select * from AdmissionFilteredView it takes around 7 minutes (and growing day by day, I would say) and returns around 215,000 rows.
So when I have to make any report that includes both of the above filtered views, it becomes a nightmare. There are two situations, though:
If I put enough parameters in SSRS and drill down to the client level (the most granular level), which returns one or only a few rows, the SSRS report works fine.
But when a report needs data at the Office or Area level, which may cover a few hundred clients, it starts taking more than 20 minutes to return the results, depending on the office (see the code below; no more than 100 rows are returned).
I have created an ODS for a few reports where it was OK to have one-day- or one-week-old data. But a few reports need live data, and those are getting poorer in performance day by day. I tried SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; and also the with(nolock) hint in the stored procedures where I use these views. Just FYI, we are not ready at this point to move to a data warehouse.
Here is the stored procedure that forced me to ask this question on this forum. Basically, our company has a policy that a supervisor will visit a client who has agreed to have an air conditioner installed, on the day the installers install the AC. What I am trying to do here is get the list of clients who missed the initial visit from a supervisor when our company installed their air conditioner, along with the next booked or rescheduled service date for each of those clients, so that the supervisor can go along on that visit and finish the initial visit as required by the policy.
Select
    data.ServiceproviderName, data.new_clientidname, data.new_subprogramname,
    data.createdon, data.new_addresscity, data.new_workgroupidname,
    nextdate.NextVisitdate, data.new_sitename
from
(
    Select distinct
        fa.new_sitename,
        fa.new_clientidname,
        fs.new_subprogramname,
        fa.new_servicename,
        fa.createdon,
        fa.new_admissionid,
        fa.ServiceproviderName,
        fa.new_addresscity,
        fa.new_workgroupidname
    from AdmissionFilteredView fa with(nolock)
    left join ServiceAppoinmentFilteredView fs with(nolock)
        on fa.new_admissionid = fs.regardingobjectid
    where
        fa.new_sitename IN (SELECT value FROM dbo.udf_Split(@Office, ','))
        and cast(fa.createdon as date) BETWEEN cast(@Start as date) AND cast(@End as date)
        and fa.new_admissionstatusname IN ('Admitted')
        and fa.new_servicename like 'AC Repair%'
        and fs.new_visittypename <> 'Initial'
    group by
        fa.new_sitename, fa.new_clientidname, fa.new_admissionid, fa.new_servicename,
        fa.createdon, fs.new_subprogramname, fa.ServiceproviderName,
        fa.new_addresscity, fa.new_workgroupidname
) data
left join
(
    Select distinct
        new_clientidname, min(fs.scheduledstart) as NextVisitdate
    from AdmissionFilteredView fa with(nolock)
    left join ServiceAppoinmentFilteredView fs with(nolock)
        on fa.new_admissionid = fs.regardingobjectid
    where
        fa.new_sitename IN (SELECT value FROM dbo.udf_Split(@Office, ','))
        and cast(fa.createdon as date) BETWEEN cast(@Start as date) AND cast(@End as date)
        and fa.new_admissionstatusname IN ('Admitted')
        and fa.new_servicename like 'AC Repair%'
        and fs.new_visittypename <> 'Initial'
        and fs.statuscodename IN ('Booked','Rescheduled')
    group by
        new_clientidname
) nextdate
    on data.new_clientidname = nextdate.new_clientidname
This takes roughly 25 minutes in SSMS and 35 minutes in SSRS in SSDT, and it doesn't even run from CRM; it hits a SQL time-out error. I can't create an ODS since this report needs live data.
The only thing I can think of is to find the actual tables from which these two views are created and rewrite this stored procedure against those tables, or to create two tables from these two views and write code to keep them up to date. I am not sure whether that is even possible, via something like change data capture, an incremental load, or updating these two tables every time there is a new entry in the views or in the tables behind them.
Please help, considering the bigger picture and not just this stored procedure.
Thanks in advance.
You can use the snapshot option in SSRS, so the report will not keep loading at the client end.
At the database end, have you tried creating indexes on your tables?
I agree with the comments regarding your split function. You could store that overhead in a table variable, and then just reference the variable in your query:
DECLARE @Start DATETIME, @End DATETIME;

DECLARE @office VARCHAR(123) = 'Office A,Office B,Office C';

DECLARE @officeList TABLE (
    Office VARCHAR(100)
);

INSERT INTO @officeList
SELECT Value FROM dbo.udf_Split(@office, ',');

DECLARE @local_StartDate DATE = cast(@Start as date),
        @local_EndDate   DATE = cast(@End as date);

-- from your query
where
    fa.new_sitename IN (SELECT Office FROM @officeList)
    and cast(fa.createdon as date) BETWEEN @local_StartDate AND @local_EndDate
Create a snapshot of the database, also called a reporting database. You can do an hourly or weekly snapshot depending on the frequency of the reports.
Run reports as background tasks: you can create jobs that run at night and build the reports by querying the reporting database, so a report is waiting for you when you come in the morning. Or you can create a background task that sends you an email when the report is ready, so you do not have to wait 15 minutes for it to be generated.
Use non-clustered indexes / filtered indexes on the columns you are returning and using in WHERE clauses (a rough sketch follows below).
You can also create a new table, insert the results of your reporting query into it at night, and then just do a SELECT * from that new table to get the report; this will be very fast because the data is already built overnight.
If you cannot improve your query and the data is growing every day, then a snapshot/reporting database is your best bet.
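As a rough illustration of the index suggestion above (a sketch only: the CRM filtered views themselves cannot be indexed, so the index would go on the underlying base table, and the table and column choices here are assumptions based on the query in the question):
-- Hypothetical base-table name; the filter and key columns mirror the WHERE clause above.
CREATE NONCLUSTERED INDEX IX_Admission_Admitted_CreatedOn
    ON dbo.AdmissionBase (createdon)
    INCLUDE (new_sitename, new_servicename, new_clientidname, new_admissionid)
    WHERE new_admissionstatusname = 'Admitted';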
How often do you need to run the report, and how current does the data need to be? Would the users be OK with near-real-time data vs. real-time data? Perhaps you could pre-execute the heavy queries with a SQL job and store the results in staging tables, and then report off a staging table or a combination of staging tables. Perhaps some of the 50 operational tables could be warehoused into dimensional tables or staging tables designed for reporting. Also, what version of SQL Server are you using? It will help us figure out what might be available in your bag of tricks.
The first issue I noticed was that you have 'select distinct' right at the top. This type of statement locks up tables and negates the use of with(nolock) and the left join.
You need to rewrite your queries so they do NOT use any 'select distinct' clauses.

Why is my query slow for different dates when they are smaller datasets?

I wrote a TSQL query for an ad-hoc report that reads off a very large table (500 million records) that is indexed (clustered) on Date/Time.
The query runs terribly slowly on certain date ranges versus others where it's lightning fast, and I'm trying to figure out why.
I took two date ranges, one for (04-03-2014 to 04-04-2014) and the other for (05-03-2014 to 05-04-2014), basically one month apart. The first range is fast, returning in a mere 10 seconds or so, whereas the other hangs forever.
Looking at the data sets to see if one is significantly larger than the other, I analyzed the two tables in my query as a form of unit testing each segment. TableA is the first table I'm selecting from, with the big data. TableB is the table joined later in the query via a LEFT JOIN:
TableA (04-03) = 239,806 Records (1 Second Query Time)
TableB (04-03) = 6,569 Records (0 Second Query Time)
TableA (05-03) = 203,535 Records (8 Second Query Time)
TableB (05-03) = 3,388 Records (0 Second Query Time)
As you can see, TableA for the 04 month is faster and has more records than TableA for the 05 month, which has fewer records but slower times.
Now for the query itself (I'm still working on rewriting it); here is some pseudo code:
CTE Query
SELECT PRODUCTS (TableA - 100K+ Records)
LEFT JOIN PRODUCT TABLE (1K Records)
FILTERED BY [Time], LIKE Statement off LEFT JOIN
SELECT FROM ( --SUBQUERY
SELECT FROM CTE Query
LEFT JOIN SALES (TableB - 1K+ Records)
JOIN ON [User-ID]
)
PIVOT SUBQUERY (18 Columns in Pivot)
Product is indexed (clustered) on [Time], which is used in the query.
Sales (TableB) is joined on [User-ID], which has a non-clustered index on Sales.
The bottleneck looks to be when I join Sales within the subquery.
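To make that shape concrete, here is a minimal sketch of the structure described above; every table and column name below is hypothetical, and the pivot is cut down to two columns instead of eighteen:
DECLARE @StartDate DATETIME = '2014-05-03',
        @EndDate   DATETIME = '2014-05-04';

WITH ProductCte AS (
    -- "TableA": the large table, clustered on [Time]
    SELECT a.[User-ID], p.ProductName
    FROM dbo.ProductDetail AS a
    LEFT JOIN dbo.Product AS p
        ON p.ProductID = a.ProductID
       AND p.ProductName LIKE 'Widget%'
    WHERE a.[Time] >= @StartDate
      AND a.[Time] <  @EndDate
)
SELECT pvt.[User-ID], pvt.[Widget A], pvt.[Widget B]
FROM (
    -- "TableB": Sales, joined on the non-clustered [User-ID] index
    SELECT c.[User-ID], c.ProductName, s.SaleAmount
    FROM ProductCte AS c
    LEFT JOIN dbo.Sales AS s
        ON s.[User-ID] = c.[User-ID]
) AS src
PIVOT (
    SUM(SaleAmount) FOR ProductName IN ([Widget A], [Widget B])
) AS pvt;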
Optimizations
I looked at the fragmented indexes to see if that was the cause. I noticed the Product table has an 85% fragmented non-clustered index that could be the culprit. I rebuilt that last night and saw no change. The Sales table also had a smaller fragmented index that was rebuilt too.
I then rebuilt the clustered index, which had a low fragmentation percentage on disk. After rebuilding it I had to restart SQL Server for an unrelated task, and afterwards the query ran at the same speed on the bad date range as on all the other ranges. I will assume the fix is attributable to the index rebuild, as that makes the most sense given that the same query had been faster on other date ranges even where the record sets were larger.
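For reference, a sketch of the kind of maintenance this describes (the object names are placeholders for the actual Product and Sales tables):
ALTER INDEX ALL ON dbo.Product REBUILD;        -- rebuilds the fragmented clustered/non-clustered indexes
ALTER INDEX ALL ON dbo.Sales REBUILD;
UPDATE STATISTICS dbo.Product WITH FULLSCAN;   -- refresh column statistics as well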

What's the most efficient query?

I have a table named Projects that has the following relationships:
has many Contributions
has many Payments
In my result set, I need the following aggregate values:
Number of unique contributors (DonorID on the Contribution table)
Total contributed (SUM of Amount on Contribution table)
Total paid (SUM of PaymentAmount on Payment table)
Because there are so many aggregate functions and multiple joins, it gets messy to use standard aggregate functions and a GROUP BY clause. I also need the ability to sort and filter on these fields. So I've come up with two options:
Using subqueries:
SELECT Project.ID AS PROJECT_ID,
(SELECT SUM(PaymentAmount) FROM Payment WHERE ProjectID = PROJECT_ID) AS TotalPaidBack,
(SELECT COUNT(DISTINCT DonorID) FROM Contribution WHERE RecipientID = PROJECT_ID) AS ContributorCount,
(SELECT SUM(Amount) FROM Contribution WHERE RecipientID = PROJECT_ID) AS TotalReceived
FROM Project;
Using a temporary table:
DROP TABLE IF EXISTS Project_Temp;
CREATE TEMPORARY TABLE Project_Temp (project_id INT NOT NULL, total_payments INT, total_donors INT, total_received INT, PRIMARY KEY(project_id)) ENGINE=MEMORY;
INSERT INTO Project_Temp (project_id,total_payments)
SELECT `Project`.ID, IFNULL(SUM(PaymentAmount),0) FROM `Project` LEFT JOIN `Payment` ON ProjectID = `Project`.ID GROUP BY 1;
INSERT INTO Project_Temp (project_id,total_donors,total_received)
SELECT `Project`.ID, IFNULL(COUNT(DISTINCT DonorID),0), IFNULL(SUM(Amount),0) FROM `Project` LEFT JOIN `Contribution` ON RecipientID = `Project`.ID GROUP BY 1
ON DUPLICATE KEY UPDATE total_donors = VALUES(total_donors), total_received = VALUES(total_received);
SELECT * FROM Project_Temp;
Tests for both are pretty comparable, in the 0.7 - 0.8 seconds range with 1,000 rows. But I'm really concerned about scalability, and I don't want to have to re-engineer everything as my tables grow. What's the best approach?
Knowing the timing for each 1K rows is good, but the real question is how they'll be used.
Are you planning to send all these back to a UI? Google doles out results 25 per page; maybe you should, too.
Are you planning to do calculations in the middle tier? Maybe you can do those calculations on the database and save yourself bringing all those bytes across the wire.
My point is that you may never need to work with 1,000 or one million rows if you think carefully about what you do with them.
You can use EXPLAIN (EXPLAIN PLAN on some databases) to see what the difference between the two queries is.
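In MySQL that would look something like this for the first option (correlating on Project.ID directly):
EXPLAIN
SELECT p.ID AS PROJECT_ID,
       (SELECT SUM(pay.PaymentAmount)    FROM Payment pay    WHERE pay.ProjectID  = p.ID) AS TotalPaidBack,
       (SELECT COUNT(DISTINCT c.DonorID) FROM Contribution c WHERE c.RecipientID = p.ID) AS ContributorCount,
       (SELECT SUM(c.Amount)             FROM Contribution c WHERE c.RecipientID = p.ID) AS TotalReceived
FROM Project p;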
I would go with the first approach. You are allowing the RDBMS to do its job, rather than trying to do its job for it.
By creating a temp table, you will always build the full table for each query. If you only want data for one project, you still end up creating the full table (unless you restrict each INSERT statement accordingly). Sure, you can code it, but it's already becoming a fair amount of code and complexity for a small performance gain.
With a SELECT, the DB can fetch the appropriate amount of data, optimizing the whole query based on context. If other users have queried the same data, it may even be cached (the query, and possibly the data, depending on your DB). If performance is truly a concern, you might consider using indexed/materialized views, or generating a table with an INSERT/UPDATE/DELETE trigger. Scaling out, you can use server clusters and partitioned views - something that I believe will be difficult if you are creating temporary tables.
EDIT: the above was written without any specific RDBMS in mind, although the OP added that MySQL is the target DB.
There is a third option which is derived tables:
Select Project.ID AS PROJECT_ID
     , Payments.Total AS TotalPaidBack
     , Coalesce(ContributionStats.ContributorCount, 0) As ContributorCount
     , ContributionStats.Total As TotalReceived
From Project
Left Join (
        Select C1.RecipientID, Sum(C1.Amount) As Total, Count(Distinct C1.DonorID) As ContributorCount
        From Contribution As C1
        Group By C1.RecipientID
    ) As ContributionStats
    On ContributionStats.RecipientID = Project.ID
Left Join (
        Select P1.ProjectID, Sum(P1.PaymentAmount) As Total
        From Payment As P1
        Group By P1.ProjectID
    ) As Payments
    On Payments.ProjectID = Project.ID
I'm not sure if it will perform better, but you might give it a shot.
A few thoughts:
The derived table idea would be good on other platforms, but MySQL has the same issue with derived tables that it does with views: they aren't indexed. That means that MySQL will execute the full content of the derived table before applying the WHERE clause, which doesn't scale at all.
Option 1 is good for being compact, but syntax might get tricky when you want to start putting the derived expressions in the WHERE clause.
The suggestion of materialized views is a good one, but MySQL unfortunately doesn't support them. I like the idea of using triggers. You could translate that temporary table into a real table that persists, and then use INSERT/UPDATE/DELETE triggers on the Payments and Contribution tables to update the Project Stats table.
Finally, if you don't want to mess with triggers, and if you aren't too concerned with freshness, you can always have the separate stats table and update it offline, having a cron job that runs every few minutes that does the work that you specified in Query #2 above, except on the real table. Depending on the nuances of your application, this slight delay in updating the stats may or may not be acceptable to your users.
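For illustration, here is a rough MySQL sketch of the persistent stats table plus one trigger; the table, trigger names, and column types are assumptions, only the Contribution INSERT path is shown, and the Payment and UPDATE/DELETE paths would follow the same pattern:
CREATE TABLE ProjectStats (
    project_id     INT NOT NULL PRIMARY KEY,
    total_received DECIMAL(12,2) NOT NULL DEFAULT 0,
    donor_count    INT NOT NULL DEFAULT 0,
    total_paid     DECIMAL(12,2) NOT NULL DEFAULT 0
);

-- Running totals are easy to maintain incrementally; the distinct-donor count is not,
-- so it is left to a periodic refresh (or a recount scoped to the affected project).
CREATE TRIGGER trg_contribution_ai
AFTER INSERT ON Contribution
FOR EACH ROW
    INSERT INTO ProjectStats (project_id, total_received)
    VALUES (NEW.RecipientID, NEW.Amount)
    ON DUPLICATE KEY UPDATE total_received = total_received + NEW.Amount;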

SQL Server 2005 query plan optimizer choking on date partitioned tables

We have TABLE_A, partitioned by date, which does not contain data from today; it only contains data from the prior day back through year-to-date.
We have TABLE_B, also partitioned by date, which does contain data from today as well as data from the prior day back through year-to-date. On top of TABLE_B there is a view, View_B, which joins against View_C and View_D and left outer joins TABLE_E. View_C and View_D are each selects from one table and do not have any other tables joined in. So View_B looks something like:
SELECT b.Foo, c.cItem, d.dItem, e.eItem
FROM TABLE_B b
JOIN View_C c ON c.cItem = b.cItem
JOIN View_D d ON b.dItem = d.dItem
LEFT OUTER JOIN TABLE_E e ON b.eItem = e.eItem
View_AB joins TABLE A and View_B on extract date as well as one other constraint. So it looks something like:
SELECT a.Col_1, b.Col_2, ...
FROM TABLE_A a LEFT OUTER JOIN View_B b
on a.ExtractDate = b.ExtractDate and a.Foo=b.Foo
-- no where clause
When searching for data from anything other than the prior day, the query optimizer does what would be expected and uses a hash match join to complete the outer join, reading about 116 pages worth of data from TABLE_B. If the query is run for the prior day, however, the optimizer freaks out and uses a nested loops join, scanning the table 7,000+ times and reading 8,000,000+ pages in the join.
We can force it to use a different query plan with join hints; however, any constraints in the view that look at TABLE_B then cause the optimizer to throw an error that the query can't be completed due to the join hints.
Editing to add that the pages/scans match the numbers hit in a single scan when the query is run for a date where the optimizer correctly chooses a hash join instead of a nested loops join.
As mentioned in the comments, we have severely reduced the impact by creating a covering index on TABLE_B to cover the join in View_B, but the IO is still higher than it would be if the optimizer chose the correct plan, especially since the index is essentially redundant for all but prior-day searches.
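For reference, a rough sketch of what such a covering index might look like; the column list is an assumption based on the View_B definition above (ExtractDate and Foo for the outer join, with the b-side join keys included):
CREATE NONCLUSTERED INDEX IX_TABLE_B_ExtractDate_Foo
    ON TABLE_B (ExtractDate, Foo)
    INCLUDE (cItem, dItem, eItem);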
The sqlplan is at http://pastebin.com/m53789da9, sorry that it's not the nicely formatted version.
If you can post the .sqlplan for each of the queries it would help for sure, but my hunch is that you are getting a parallel plan when querying for dates prior to the current day, and that the nested loop is possibly a constant loop over the partitions included in the table, which would then spawn a worker thread for each partition (for more information on this, see the SQLCAT post on parallel plans with partitioned tables in SQL 2005). I can't verify whether this is the case without seeing the plans, however.
In case anyone ever runs into this, the issue appears to be only tangentially related to the partitioning scheme. Even though we run a statistics update nightly, it appears that SQL Server:
didn't create a statistic on ExtractDate, and
even when the ExtractDate statistic was explicitly created, didn't pick up that the prior day had data.
We resolved it by running CREATE STATISTICS TABLE_A_ExtractDate_Stats ON TABLE_A (ExtractDate) WITH FULLSCAN. Now searching for the prior day, and a random sampling of other days, appears to generate the correct plan.