We have separate databases in DB2 for each customer, but with the same table structure in each of them. For a .NET application I need to scan all the databases and show the matching entries to the user. I was wondering whether it would be faster to do a UNION ALL across all the databases, or to run each query in parallel and then combine the results in my .NET application.
Select EmpName, EmpSal, EmpDate
from A.Emptable
where EmpDate > '2015-01-01'
UNION ALL
Select EmpName, EmpSal, EmpDate
from B.Emptable
where EmpDate > '2015-01-01'
UNION ALL
Select EmpName, EmpSal, EmpDate
from C.Emptable
where EmpDate > '2015-01-01'
VS.
Creating a .NET method GetEmpData to call each query and combine their results:
var response = await Task.WhenAll(GetEmpData("A", "2015-01-01"), GetEmpData("B", "2015-01-01"), GetEmpData("C", "2015-01-01"));
var result = response[0].Concat(response[1]).Concat(response[2]).ToList();
Thanks.
If your federation is optimally configured in all respects (especially pushdown and other performance aspects), then I expect the UNION ALL to be more maintainable in the long term. As for relative performance, you will have to measure it, because so many factors influence it.
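For example, a minimal sketch (assuming the three customer databases are already reachable from one connection as schemas or federated nicknames A, B and C; the view name is illustrative) of wrapping the UNION ALL in a view, so the list of customer databases lives in one place instead of in application code:

CREATE VIEW all_emptable AS
    SELECT EmpName, EmpSal, EmpDate FROM A.Emptable
    UNION ALL
    SELECT EmpName, EmpSal, EmpDate FROM B.Emptable
    UNION ALL
    SELECT EmpName, EmpSal, EmpDate FROM C.Emptable;

-- The application then issues a single query; adding a new customer database
-- only means changing the view, not the .NET code.
SELECT EmpName, EmpSal, EmpDate
FROM all_emptable
WHERE EmpDate > '2015-01-01';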
I have a complex query which I am simplifying here for greater understanding.
Query: The actual query has a GROUP BY, an ORDER BY, a WHERE clause and multiple joins with other tables.
SELECT FIRSTNAME, LASTNAME FROM CUSTOMERS ;
Requirement: I need an optimized approach to get the total record count of a big query along with pagination.
My approach 1: Execute two queries to find count first and then paginated rows
SELECT COUNT(1) FROM CUSTOMERS ;
SELECT FIRSTNAME, LASTNAME, ROWNUMER FROM (
    SELECT FIRSTNAME, LASTNAME, ROW_NUMBER() OVER(ORDER BY CUSTOMERCID) AS ROWNUMER FROM CUSTOMERS
) WHERE ROWNUMER BETWEEN 10 AND 20;
My approach 2: Find total count as well as get the required paginated rows in a single query.
SELECT FIRSTNAME, LASTNAME, ROWNUMER, COUNTROWS FROM (
SELECT FIRSTNAME, LASTNAME, ROW_NUMBER() OVER(ORDER BY CUSTOMERCID) AS ROWNUMER
FROM CUSTOMERS
) AS CUST, (SELECT COUNT(1) AS COUNTROWS FROM CUSTOMERS ) AS COUNTNUM
WHERE ROWNUMER BETWEEN 10 AND 20;
My approach 3: Create a VIEW of second approach.
Please suggest which approach I should opt for. As per my research, the 3rd approach should be more optimized than the others, since database views are more optimized.
There's nothing about a view that automatically makes it "more optimized" than the query contained within it. The query optimizer decomposes the original SQL and often rewrites it into a much different-looking statement before execution.
After performing RUNSTATS to ensure your tables and indexes have accurate statistics, DB2's built-in EXPLAIN tools such as the db2expln utility, the Design Advisor (db2advis), and the Visual Explain tool in IBM Data Studio offer the best chance at understanding exactly why a particular query option is better or worse than another.
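As a hedged sketch of how to compare the candidate queries (this assumes the explain tables already exist, e.g. created with CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C', NULL, CURRENT SCHEMA)):

-- Populate the explain tables for a candidate statement, then compare the
-- access plans with db2exfmt or Visual Explain.
EXPLAIN PLAN FOR
SELECT FIRSTNAME, LASTNAME, ROWNUMER FROM (
    SELECT FIRSTNAME, LASTNAME, ROW_NUMBER() OVER(ORDER BY CUSTOMERCID) AS ROWNUMER FROM CUSTOMERS
) WHERE ROWNUMER BETWEEN 10 AND 20;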
Best performance for pagination comes when the fewest columns possible do the pagination work, and you then join back on the key columns to get the rest of the data. Two columns control pagination here: customercid and ROWNUMER. customercid is (I'm assuming) unique, so it is the primary key and already indexed, and it is also the column inside the ROW_NUMBER() function, so this is the most efficient pagination.
create view dancustomercid as
SELECT CustomerCID, ROW_NUMBER() OVER(ORDER BY CUSTOMERCID) AS ROWNUMER FROM CUSTOMERS
Then join to the output of the view. Notice there is no ORDER BY in the outer query to slow things down, just a join on the key column customercid:
SELECT b.FIRSTNAME, b.LASTNAME, a.ROWNUMER
FROM dancustomercid a
JOIN CUSTOMERS b ON a.customercid = b.customercid
WHERE a.ROWNUMER BETWEEN 11 AND 20;
I'm not particularly familiar with SQL, but my team asked me to take a look at this series of SQL statements and see if it is possible to reduce it down to just 1 or 2. I looked at it, and I don't believe so, but I don't quite have the experience or knowledge of the tricks of SQL.
So all of the statements have pretty much the same format
select
id, count(*)
from
table
where
x_date between to_date(start_date) and to_date(end_date)
group by
id
where x_date is the only thing that changes. (Start_date and end_date are just what I typed here to make it a bit more readable). There are 10 statements total, 7 of which are exactly this format.
Of the 3 different ones, one of them looks like this:
select
id, count(*)
from
table
where
x_date between to_date(start_date) and to_date(end_date)
and userid not like 'AUTOPEND'
group by
id
and the other 2 look like this:
select
id, count(*)
from
table
where
x_date between to_date(start_date) and to_date(end_date)
group by
id, x_code
Where x_code differs between them.
They want to use this data for statistical analysis, but they insist on manually using a query and typing it in. The way I see it is that I can't really combine these statements because they are all grouping by the same field (except the last 2), so it all gets combined in the results, making the results useless for the analysis.
Am I thinking about it right, or is there some way to do like they asked? Can I make a 1 or 2 sql statements output more than 1 table each?
Oh, I almost forgot. I believe this is Oracle PL/SQL using SQL Developer.
You are trying to get multiple aggregates with different grouping sets in a single query. This is exactly what the extended grouping operations ROLLUP, CUBE and GROUPING SETS are for. There are a few ways to solve your specific problem, but the extended grouping functions are the right tool for the job. Going forward, they will be more maintainable and faster.
http://www.oracle-base.com/articles/misc/rollup-cube-grouping-functions-and-grouping-sets.php
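For instance, a hedged sketch of how the per-id counts and the per-(id, x_code) counts could come back in a single pass (the table and column names are the placeholders from the question, and the date bounds are left as bind variables):

SELECT id,
       x_code,
       COUNT(*) AS cnt
FROM   some_table
WHERE  x_date BETWEEN TO_DATE(:start_date, 'YYYY-MM-DD')
                  AND TO_DATE(:end_date, 'YYYY-MM-DD')
GROUP  BY GROUPING SETS ((id), (id, x_code));

Rows from the (id) grouping set will have a NULL x_code; the GROUPING() function can distinguish a real NULL from a grouping NULL if that matters for the analysis.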
Since the first and second examples are grouping by the same thing, you can use a CASE expression nested in an aggregate function:
SELECT id, COUNT(*),
       SUM( CASE WHEN userid not like 'AUTOPEND' THEN 1 ELSE 0 END ) AS NotAUTOPEND
FROM table
WHERE x_date between to_date(start_date) and to_date(end_date)
GROUP BY id
This question already has answers here:
SQL Server UNION - What is the default ORDER BY Behaviour
Can I be sure that the result set of the following script will always be sorted like this: O-R-D-E-R?
SELECT 'O'
UNION ALL
SELECT 'R'
UNION ALL
SELECT 'D'
UNION ALL
SELECT 'E'
UNION ALL
SELECT 'R'
Can it be proved to sometimes be in a different order?
There is no inherent order; you have to use ORDER BY. For your example you can easily do this by adding a SortOrder column to each SELECT. This will then keep the records in the order you want:
SELECT 'O', 1 SortOrder
UNION ALL
SELECT 'R', 2
UNION ALL
SELECT 'D', 3
UNION ALL
SELECT 'E', 4
UNION ALL
SELECT 'R', 5
ORDER BY SortOrder
You cannot guarantee the order unless you specifically provide an order by with the query.
No, it is not guaranteed. SQL tables are inherently unordered. You need to use ORDER BY to get things in a desired order.
The issue is not whether it works once when you try it out. The issue is whether you can trust this behavior. And you cannot. SQL Server does not even guarantee the ordering for this:
select *
from (select t.*
from t
order by col1
) t
It says here:
When ORDER BY is used in the definition of a view, inline function, derived table, or subquery, the clause is used only to determine the rows returned by the TOP clause. The ORDER BY clause does not guarantee ordered results when these constructs are queried, unless ORDER BY is also specified in the query itself.
A fundamental principle of the SQL language is that tables are not ordered. So, although your query might work in many databases, you should use the version suggested by BlueFeet to guarantee the ordering of results.
Try removing all of the ALLs, for example. Or even just one of them. Now consider that the type of optimization that has to happen there (and many other types) will also be possible when the SELECT queries are actual queries against tables, and are optimized separately. Without an ORDER BY, ordering within each query will be arbitrary, and you can't guarantee that the queries themselves will be processed in any order.
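For instance, drop the ALLs: now SQL Server has to remove the duplicate 'R', and the plan it picks to do that may well return D-E-O-R instead of O-R-D-E-R, even though the text of the query barely changed.

SELECT 'O'
UNION
SELECT 'R'
UNION
SELECT 'D'
UNION
SELECT 'E'
UNION
SELECT 'R'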
Saying UNION ALL with no ORDER BY is like saying "Just throw all the marbles on the floor." Maybe every time you throw all the marbles on the floor, they end up being organized by color. That doesn't mean the next time you throw them on the floor they'll behave the same way. The same is true for ordering in SQL Server - if you don't say ORDER BY then SQL Server assumes you don't care about order. You may see by coincidence a certain order being returned all the time, but many things can affect the arbitrary order that has been selected next time. Data changes, statistics changes, recompile, plan flush, upgrade, service pack, hotfix, trace flag... ad nauseam.
I will put this in large letters to make it clear:
You cannot guarantee an order without ORDER BY
Some further reading:
Bad habits to kick : relying on undocumented behavior
Also, please read this post by Conor Cunningham, a pretty smart guy on the SQL team.
No. You get the records in whatever way SQL Server fetches them for you. You can apply an order on a unioned result set by 1-based index thusly:
SELECT 1, 'O'
UNION ALL
SELECT 2, 'R'
UNION ALL
SELECT 3, 'D'
UNION ALL
SELECT 4, 'E'
UNION ALL
SELECT 5, 'R'
ORDER BY 1
Here is the query:
SELECT DISTINCT patientid.acct
FROM patientid,
doc_table
WHERE patientid.acct = doc_table.acct
AND patientType IN ( 'I', '1', 'T' )
AND fid IN ( '023' )
AND ( dischargedate IS NULL
OR dischargedate = ''
OR dischargedate < '19000101' )
AND dbcreate_date > DATEADD(hh, -24, GETDATE());
We will be adding indexes for patientType, fid, dischargedate and dbcreate_date. But I was wondering if the query itself could be written in a different way to make it more efficient.
Thank you,
Without more info it's hard to be prescriptive, but in general:
1 - Are you sure you need the DISTINCT? That can be an expensive operation on larger data sets. Run the query without it to see if the results differ. If they do, you may want to change the structure to eliminate duplicates.
2 - Put your most restrictive condition in your WHERE clause first, then in descending order of restriction. For example, if your patienttype filter drops the results to 50%, but the fid filter drops it to 20%, use the fid filter first.
3 - For indexes, make sure your JOIN keys are indexed: patientid.acct and doc_table.acct. This will probably have the biggest impact on performance.
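A hedged sketch of those indexes (the index names are illustrative):

CREATE INDEX ix_patientid_acct ON patientid (acct);
CREATE INDEX ix_doc_table_acct ON doc_table (acct);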
As a side note, use explicit JOIN syntax; it's much easier to read and maintain, especially in longer queries. A revised version would be:
SELECT DISTINCT patientid.acct
FROM patientid
INNER JOIN doc_table
ON patientid.acct = doc_table.acct
WHERE patientType IN ( 'I', '1', 'T' )
...
Use WHERE EXISTS instead of a JOIN; that avoids the duplicates and lets you get rid of the DISTINCT.
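A hedged sketch of that rewrite (assuming the filter columns all live on patientid; move any that actually belong to doc_table into the EXISTS subquery):

SELECT p.acct
FROM patientid p
WHERE p.patientType IN ( 'I', '1', 'T' )
  AND p.fid IN ( '023' )
  AND ( p.dischargedate IS NULL
        OR p.dischargedate = ''
        OR p.dischargedate < '19000101' )
  AND p.dbcreate_date > DATEADD(hh, -24, GETDATE())
  AND EXISTS ( SELECT 1 FROM doc_table d WHERE d.acct = p.acct );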
One thing you should consider: clean up the values in the dischargedate column and settle on a single value that means "not discharged" (instead of NULL, '', or anything before '19000101'), so you can get rid of the ORs: a) they cost time, and b) a database shouldn't be a mess.
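A hedged sketch of that cleanup, assuming dischargedate lives on patientid and NULL becomes the single convention (verify the business rules and take a backup before running anything like this):

UPDATE patientid
SET dischargedate = NULL
WHERE dischargedate = ''
   OR dischargedate < '19000101';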
So I am querying some extremely large tables. The reason they are so large is because PeopleSoft inserts new records every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse.
This necessitates queries that have nested selects in them, to get the most recent/current row. They are both effective dated and within each date (cast to a day) they can have an effective sequence. Thus, in order to get the current record for user_id=123, I have to do this:
select * from sometable st
where st.user_id = 123
and st.effective_date = (select max(sti.effective_date)
from sometable sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
There are a phenomenal number of indexes on these tables, and I can't find anything else that would speed up my queries.
My trouble is that I often times want to get data about an individual from these tables for maybe 50 user_ids, but when I join my tables having only a few records in them with a few of these PeopleSoft tables, things just go to crap.
The PeopleSoft tables are on a remote database that I access through a database link. My queries tend to look like this:
select st.* from local_table lt, sometable@remotedb st
where lt.user_id in ('123', '456', '789')
and lt.user_id = st.user_id
and st.effective_date = (select max(sti.effective_date)
from sometable@remotedb sti where sti.user_id = st.user_id)
and st.effective_sequence = (select max(sti.effective_sequence)
from sometable@remotedb sti where sti.user_id = st.user_id
and sti.effective_date = st.effective_date)
Things get even worse when I have to join several PeopleSoft tables with my local table. Performance is just unacceptable.
What are some things I can do to improve performance? I've tried query hints to ensure that my local table is joined to its partner in PeopleSoft first, so it doesn't attempt to join all its tables together before narrowing it down to the correct user_id. I've tried the LEADING hint and toyed around with hints that tried to push the processing to the remote database, but the explain plan was obscured and just said 'REMOTE' for several of the operations and I had no idea what was going on.
Assuming I don't have the power to change PeopleSoft and the location of my tables, are hints my best choice? If I was joining a local table with four remote tables, and the local table joined with two of them, how would I format the hint so that my local table (which is very small -- in fact, I can just do an inline view to have my local table only be the user_ids I'm interested in) is joined first with each of the remote ones?
EDIT: The application needs real-time data so unfortunately a materialized view or other method of caching data will not suffice.
Does refactoring your query to something like this help at all?
SELECT *
FROM (SELECT st.*,
             MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
             MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
      FROM local_table lt JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
      WHERE lt.user_id IN ('123', '456', '789'))
WHERE effective_date = max_dt
  AND effective_sequence = max_seq;
I agree with @Mark Baker that performance joining over DB links really can suck, and you're likely to be limited in what you can accomplish with this approach.
One approach would be to stick PL/SQL functions around everything.
As an example
create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));
create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/
create type typ_tab_remote as table of typ_remote;
.
/
insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');
insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');
create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
CURSOR c_1 is
SELECT user_id, eff_date, eff_seq, value
FROM
(select user_id, eff_date, eff_seq, value,
rank() over (partition by user_id order by eff_date desc, eff_seq desc) rnk
from remote
where user_id in (i_user_id_1,i_user_id_2))
WHERE rnk = 1;
begin
for c_rec in c_1 loop
pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
end loop;
return;
end;
/
select * from table(show_remote(1,null));
select * from table(show_remote(1,2));
Rather than having the user_ids passed individually as parameters, you could load them into a local table (e.g. a global temporary table). The PL/SQL would then loop through that table, doing the remote select for each row in the local table. No single query would have both local and remote tables; effectively you would be writing your own join code.
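A hedged sketch of that local driving table (the name is illustrative); the pipelined function, or a plain PL/SQL loop, would then read its user_ids from this table instead of taking them as parameters:

CREATE GLOBAL TEMPORARY TABLE gtt_user_ids
(
  user_id NUMBER
) ON COMMIT PRESERVE ROWS;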
One option is to first materialize the remote part of the query using a common table expression, so you can be sure only relevant data is fetched from the remote db. Another improvement would be to merge the two subqueries against the remote db into one analytic-function-based subquery. Such a query can be used in your current query as well. I can make other suggestions only after playing with the db.
See below:
with remote_query as
(
  select /*+ materialize */ st.* from sometable@remotedb st
  where st.user_id in ('123', '456', '789')
  and st.rowid in ( select first_value(rowid) over (order by effective_date desc,
                                                    effective_sequence desc) from sometable@remotedb st1
                    where st.user_id = st1.user_id )
)
select lt.*, rt.*
FROM local_table lt, remote_query rt
where lt.user_id = rt.user_id
You haven't mentioned the requirements for the freshness of the data, but one option would be to create materialized views (you'll be restricted to REFRESH COMPLETE since you can't create snapshot logs in the source system) that have data only for the current versioned row of the transaction tables. These materialized view tables will reside in your local system and additional indexing can be added to them to improve query performance.
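A hedged sketch of such a materialized view (the names are illustrative; REFRESH COMPLETE because, as noted, snapshot logs can't be created on the source side):

CREATE MATERIALIZED VIEW mv_sometable_current
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT user_id, effective_date, effective_sequence  -- plus the other columns you need
FROM (
  SELECT st.*,
         ROW_NUMBER() OVER (PARTITION BY st.user_id
                            ORDER BY st.effective_date DESC,
                                     st.effective_sequence DESC) AS rn
  FROM sometable@remotedb st
)
WHERE rn = 1;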
The performance issue is going to be the access across the link. With part of the query against local tables, it's all being executed locally, so there's no access to the remote indexes and all the remote data gets pulled back to be tested locally.
You could use materialized views in a local database, refreshed from the PeopleSoft database on a periodic (nightly) basis for the historic data, and only access the remote PeopleSoft database for today's changes (adding an effective_date = today condition to your WHERE clause), then merge the two queries.
Another option might be to use an INSERT INTO X SELECT FROM just for the remote data to pull it into a temporary local table or materialized view, then a second query to join that with your local data... similar to josephj1989's suggestion
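A hedged sketch of that pull, assuming a pre-created local staging table (local_sometable_stage here) with the same shape as the remote table:

INSERT INTO local_sometable_stage
SELECT st.*
FROM sometable@remotedb st
WHERE st.user_id IN ('123', '456', '789');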
Alternatively (though there may be licensing issues) try RAC Clustering your local db with the remote peoplesoft db.
Instead of using the subqueries, you can try this. I don't know if Oracle will perform better with this or not, since I don't use Oracle much.
SELECT
ST1.col1,
ST1.col2,
...
FROM
Some_Table ST1
LEFT OUTER JOIN Some_Table ST2 ON
ST2.user_id = ST1.user_id AND
(
ST2.effective_date > ST1.effective_date OR
(
ST2.effective_date = ST1.effective_date AND
ST2.effective_sequence > ST1.effective_sequence
)
)
WHERE
ST2.user_id IS NULL
Another possible solution would be:
SELECT
ST1.col1,
ST1.col2,
...
FROM
Some_Table ST1
WHERE
NOT EXISTS
(
SELECT 1
FROM
Some_Table ST2
WHERE
ST2.user_id = ST1.user_id AND
(
ST2.effective_date > ST1.effective_date OR
(
ST2.effective_date = ST1.effective_date AND
ST2.effective_sequence > ST1.effective_sequence
)
)
)
Would it be an option to create a database that you use for non-warehousing type stuff and that you could update on a nightly basis? If so, you could create a nightly process that moves over only the most recent records. That would get rid of the MAX work you are doing for everyday queries and significantly reduce the number of records.
Also, it depends on whether you can tolerate a one-day lag between the most recent data and what is available.
I'm not super familiar with Oracle so there may be a way to get improvements by making changes to your query also...
Can you ETL the rows with the desired user_id's into your own table, creating only the needed indexes to support your queries and perform your queries on it?
Is the PeopleSoft table a delivered one, or is it custom? Are you sure it's a physical table, and not a poorly-written view on the PS side? If it's a delivered record you're going against (example looks much like PS_JOB or a view that references it), maybe you could indicate this. PS_JOB is a beast with tons of indexes delivered, and most sites add even more.
If you know the indexes on the table, you can use Oracle hints to specify a preferred index to use; that sometimes helps.
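A hedged sketch of an index hint (the index name here is made up; substitute one that actually exists on the table):

SELECT /*+ INDEX(st some_user_id_index) */ *
FROM sometable st
WHERE st.user_id = 123;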
Have you done an explain plan to see if you can determine where the problem is? Maybe there's a cartesian join, full table scan, etc.?
It looks to me like you are dealing with a type 2 dimension in the data warehouse. There are several ways to implement a type 2 dimension, mostly involving columns like ValidFrom, ValidTo, Version, Status. Not all of them are always present; it would be interesting if you could post the schema of your table. Here is an example of how it may look (John Smith moved from Indiana to Ohio on 2010-06-24):
UserKey  UserBusinessKey  State    ValidFrom   ValidTo     Version  Status
7234     John_Smith_17    Indiana  2005-03-20  2010-06-23  1        expired
9116     John_Smith_17    Ohio     2010-06-24  3000-01-01  2        current
To obtain the latest version of a row, it is common to use
WHERE Status = 'current'
or
WHERE ValidTo = '3000-01-01'
Note that this one has some constant far in the future.
or
WHERE ValidTo > CURRENT_DATE
It seems that your example uses ValidFrom (effective_date), so you are forced to find max() in order to locate the latest row. Take a look at the schema -- are there Status or ValidTo equivalents in your tables?