SQL queries with views and subqueries


select nid, avg, std from sView1
where sid = 4891
and nid in (select distinct nid from tblref where rid = 799)
and oid in (select distinct oid from tblref where rid = 799)
and anscount > 3
This is a query I'm currently trying to run. Run like this, it takes about 3-4 seconds. However, if I replace the "4891" value with the subquery (select distinct sid from tblref where rid = 799), the procedure just hangs, even though the subquery returns only one sid.
The query is supposed to return a dataset with averages (avg) and standard deviations (std) over a resultset which is calculated through nested views in sView1. This dataset is then run through another view to get some top-level averages and stdevs.
The averages may need to include more than 1 sid (sid identifies a dataset).
It's difficult describing it more without revealing codebase and codestructure that shouldn't be revealed ;)
Can anyone suggest why the query hangs when trying to use the subquery? (The code is rebuilt from originally using nested cursors, since I have been told that cursors are the work of the devil, and nested cursors may make me sterile)

Try this. Exists returns as soon as it finds a matching condition, select distinct will require going through the dataset and optionally sorting it to remove the duplicates.
SELECT nid,avg,std from sView1 AS SV
WHERE EXISTS (SELECT * FROM TblRef AS TR WHERE sv.sid = Tr.sid AND Sv.nid = tr.nid AND sv.oid = tr.oid AND tr.rid = 799)
AND ansCount>3
Also, it is pretty difficult to provide a meaningful answer without access to query plans and table structures. So DDL and sample data will definitely help.
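A minimal sketch comparing the two query shapes, using Python's sqlite3 as a stand-in engine. The stats table, the one-line definition of sView1 and every sample row are assumptions invented for illustration; only the shape of the two queries comes from the posts above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: the real sView1 is a stack of nested views.
    CREATE TABLE stats (sid INT, nid INT, oid INT, avg REAL, std REAL, anscount INT);
    CREATE VIEW sView1 AS SELECT * FROM stats;
    CREATE TABLE tblref (rid INT, sid INT, nid INT, oid INT);

    INSERT INTO stats VALUES
        (4891, 1, 10, 2.5, 0.4, 5),
        (4891, 2, 10, 3.1, 0.9, 2),
        (4891, 3, 11, 1.7, 0.2, 6),
        (5000, 1, 10, 4.0, 0.5, 9);
    INSERT INTO tblref VALUES
        (799, 4891, 1, 10),
        (799, 4891, 2, 10),
        (800, 5000, 1, 10);
""")

# Three independent IN subqueries, as in the question.
in_rows = conn.execute("""
    SELECT nid, avg, std FROM sView1
    WHERE sid IN (SELECT sid FROM tblref WHERE rid = 799)
      AND nid IN (SELECT nid FROM tblref WHERE rid = 799)
      AND oid IN (SELECT oid FROM tblref WHERE rid = 799)
      AND anscount > 3
    ORDER BY nid
""").fetchall()

# One correlated EXISTS, as in the answer.
exists_rows = conn.execute("""
    SELECT sv.nid, sv.avg, sv.std FROM sView1 AS sv
    WHERE EXISTS (SELECT 1 FROM tblref AS tr
                  WHERE tr.sid = sv.sid AND tr.nid = sv.nid
                    AND tr.oid = sv.oid AND tr.rid = 799)
      AND sv.anscount > 3
    ORDER BY sv.nid
""").fetchall()

print(in_rows)
print(exists_rows)
```

Note that the two forms are not strictly equivalent: the IN version checks sid, nid and oid independently, while the EXISTS version requires a single tblref row to match all three at once. They agree on this sample data, but it is worth checking against real data before swapping one for the other.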

Related

Left join or Select in select (SQL - Speed of query)

I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId not in
(SELECT CompanyId
FROM Company
WHERE (IsPublic = 0) and CompanyId NOT IN
(SELECT ShoppingLike.WhichId
FROM Company
INNER JOIN
ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
WHERE (ShoppingLike.IsWaiting = 0) AND
(ShoppingLike.ShoppingScoreTypeId = 2) AND
(ShoppingLike.UserId = 75)
)
)
It has 3 selects. I want to know how I could write it without 3 selects, and which is faster for 1 million records: "select in select" or "left join"?

My experience is with Oracle. There is never a single correct answer to optimising tricky queries; it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser is thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of its subquery WHERE clause in a NOT(...). On the face of it, this will prevent that outer full scan of Company (or its index on CompanyId). Try it, check the output is the same and get timings, then remove it temporarily before trying the below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL) then ANTI-JOIN options are often discarded.
The innermost subquery is not correlated (it does not reference anything from its outer query), so it can be extracted and tuned separately. As a matter of style I'd swap the table names round the INNER JOIN, as you want ShoppingLike scanned first because it has all the filters against it. It won't make any difference, but it reads more easily and makes it possible to use a hint to scan tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN when sometimes the very similar NOT EXISTS gives the optimiser more/alternative options.
All of the above is just trial and error unless you start reading the explain plan. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT. With 1M+ rows, the time invested will be worth it.
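The point about NOT NULL columns matters for correctness as well as for the plan. A small sketch, using Python's sqlite3 as a stand-in engine with invented sample rows (the tables are cut down to the columns needed), shows how a single NULL in the subquery result changes what NOT IN returns, while NOT EXISTS is unaffected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Company (CompanyId INTEGER, IsPublic INTEGER);
    -- WhichId is deliberately nullable in this sketch.
    CREATE TABLE ShoppingLike (UserId INTEGER, WhichId INTEGER);
    INSERT INTO Company VALUES (1, 0), (2, 0), (3, 0);
    INSERT INTO ShoppingLike VALUES (75, 1), (75, NULL);
""")

# NOT IN: comparing against a NULL yields UNKNOWN, so no row passes the filter.
not_in = conn.execute("""
    SELECT CompanyId FROM Company
    WHERE CompanyId NOT IN (SELECT WhichId FROM ShoppingLike WHERE UserId = 75)
    ORDER BY CompanyId
""").fetchall()

# NOT EXISTS: the NULL row simply never matches, so companies 2 and 3 survive.
not_exists = conn.execute("""
    SELECT c.CompanyId FROM Company AS c
    WHERE NOT EXISTS (SELECT 1 FROM ShoppingLike AS s
                      WHERE s.UserId = 75 AND s.WhichId = c.CompanyId)
    ORDER BY c.CompanyId
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```

This is standard three-valued logic rather than anything engine-specific, which is one reason optimisers are cautious about turning NOT IN into an anti-join when the columns are nullable.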

In an EXISTS can my JOIN ON use a value from the original select

I have an order system. Users can be attached to different orders as different types of user. They can download documents associated with an order. Documents are only given to certain types of users on the order. I'm having trouble writing the query to check a user's permission to view a document and select the info about the document.
I have the following tables and (applicable) fields:
Docs: DocNo, FileNo
DocAccess: DocNo, UserTypeWithAccess
FileUsers: FileNo, UserType, UserNo
I have the following query:
SELECT Docs.*
FROM Docs
WHERE DocNo = 1000
AND EXISTS (
SELECT * FROM DocAccess
LEFT JOIN FileUsers
ON FileUsers.UserType = DocAccess.UserTypeWithAccess
AND FileUsers.FileNo = Docs.FileNo /* Errors here */
WHERE DocAccess.UserNo = 2000 )
The trouble is that, in the EXISTS select, it does not recognize Docs (at Docs.FileNo) as a valid table. If I move the second ON condition to the WHERE clause it works, but I would rather limit the initial join than filter the rows out after the fact.
I can get around this a couple ways, but this seems like it would be best. Anything I'm missing here? Or is it simply not allowed?

I think this is a limitation of your database engine. In most databases, docs would be in scope for the entire subquery, including both the where and on clauses.
However, you do not need to worry about where you put the particular clause. SQL is a descriptive language, not a procedural language. The purpose of SQL is to describe the output. The SQL engine, parser, and compiler should be choosing the most optimal execution path. Not always true. But, move the condition to the where clause and don't worry about it.
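As a rough illustration, the sketch below runs the permission check both ways in SQLite (via Python's sqlite3), which does accept the outer reference inside ON. The sample rows are invented, and two details are assumptions rather than part of the question: the user filter is applied to FileUsers.UserNo (the column list puts UserNo there), and DocAccess is tied to the document with a.DocNo = d.DocNo. An inner join is used so that the two placements mean the same thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Docs (DocNo INTEGER, FileNo INTEGER);
    CREATE TABLE DocAccess (DocNo INTEGER, UserTypeWithAccess TEXT);
    CREATE TABLE FileUsers (FileNo INTEGER, UserType TEXT, UserNo INTEGER);
    INSERT INTO Docs VALUES (1000, 1), (1001, 2);
    INSERT INTO DocAccess VALUES (1000, 'buyer'), (1001, 'agent');
    INSERT INTO FileUsers VALUES (1, 'buyer', 2000), (2, 'buyer', 2000);
""")

# Correlation to the outer Docs row placed in the ON clause.
in_on = conn.execute("""
    SELECT d.DocNo FROM Docs AS d
    WHERE EXISTS (
        SELECT 1 FROM DocAccess AS a
        INNER JOIN FileUsers AS u
            ON u.UserType = a.UserTypeWithAccess
           AND u.FileNo = d.FileNo
        WHERE a.DocNo = d.DocNo AND u.UserNo = 2000)
    ORDER BY d.DocNo
""").fetchall()

# Same correlation moved to the WHERE clause.
in_where = conn.execute("""
    SELECT d.DocNo FROM Docs AS d
    WHERE EXISTS (
        SELECT 1 FROM DocAccess AS a
        INNER JOIN FileUsers AS u
            ON u.UserType = a.UserTypeWithAccess
        WHERE a.DocNo = d.DocNo
          AND u.FileNo = d.FileNo
          AND u.UserNo = 2000)
    ORDER BY d.DocNo
""").fetchall()

print(in_on, in_where)
```

With an inner join the two placements return the same rows. With a LEFT JOIN they would not be interchangeable in general, since a condition in ON keeps unmatched left rows while the same condition in WHERE removes them.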

It is not clear to me why you need to join with FileUsers at all in your subquery.
What is the purpose and idea of the query (in plain English)?
In any case, if you do need to join with FileUsers, then I suggest using an inner join and moving the second filter to the WHERE condition. I don't think you can use it in the JOIN condition of a subquery; at least, I've never seen it used this way before. I believe you can only correlate through the WHERE clause.

You have to use aliases to get this working:
SELECT
doc.*
FROM
Docs doc
WHERE
doc.DocNo = 1000
AND EXISTS (
SELECT
*
FROM
DocAccess acc
LEFT OUTER JOIN
FileUsers usr
ON
usr.UserType = acc.UserTypeWithAccess
AND usr.FileNo = doc.FileNo
WHERE
acc.UserNo = 2000
)
This also makes it more clear which table each field belongs to (think about using the same table twice or more in the same query with different aliases).
If you would only like to limit the output to one row you can use TOP 1:
SELECT TOP 1
doc.*
FROM
Docs doc
INNER JOIN
FileUsers usr
ON
usr.FileNo = doc.FileNo
INNER JOIN
DocAccess acc
ON
acc.UserTypeWithAccess = usr.UserType
WHERE
doc.DocNo = 1000
AND acc.UserNo = 2000
Of course the second query works a bit differently from the first one (both JOINs are INNER). Depending on your data model you might even leave the TOP 1 out of that query.

Why is an IN statement with a list of items faster than an IN statement with a subquery?

I'm having the following situation:
I've got a quite complex view from which I've to select a couple of records.
SELECT * FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
WHERE VW_Test.id IN (1000,1001,1002,1003,1004,[etc])
This returns a result practically instantly (currently with 25 items in that IN statement). However when I use the following query it slows down really fast.
SELECT * FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
WHERE VW_Test.id IN (SELECT id FROM TBL_Test)
With 25 records in the TBL_Test this query takes about 5 seconds. I've got an index on that id in the TBL_Test.
Anyone got an idea why this happens and how to get performance up?
EDIT: I forgot to mention that this subquery
SELECT id FROM TBL_Test
returns a result instantly as well.

Well, when using a subquery the database engine will first have to generate the results for the subquery before it can do anything else, which takes time. If you have a predefined list, this will not need to happen and the engine can simply use those values 'as is'. At least, this is how I understand it.
How to improve performance: do away with the subquery. I don't think you even need the IN clause in this case. The INNER JOIN should suffice.
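To make the redundancy concrete, here is a small sketch in Python's sqlite3. The base table and the trivial definition of VW_Test are assumptions standing in for the "quite complex view" in the question, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the complex view described in the question.
    CREATE TABLE base (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE VIEW VW_Test AS SELECT id, upper(payload) AS payload FROM base;
    CREATE TABLE TBL_Test (id INTEGER PRIMARY KEY);
    INSERT INTO base VALUES (1000, 'a'), (1001, 'b'), (1002, 'c'), (2000, 'd');
    INSERT INTO TBL_Test VALUES (1000), (1001), (1002);
""")

# Join plus IN (subquery), as in the slow query.
with_in = conn.execute("""
    SELECT VW_Test.id, VW_Test.payload
    FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
    WHERE VW_Test.id IN (SELECT id FROM TBL_Test)
    ORDER BY VW_Test.id
""").fetchall()

# Join only: the join condition already restricts ids to those in TBL_Test.
join_only = conn.execute("""
    SELECT VW_Test.id, VW_Test.payload
    FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
    ORDER BY VW_Test.id
""").fetchall()

print(with_in == join_only)
```

Because the inner join already keeps only ids present in TBL_Test, the IN filter on the same column cannot remove anything further, so dropping it leaves the result unchanged.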

How to perform multiple SQL tasks when using SQL within code (in this case vbscript)

I am hitting a brick wall with something I'm trying to do.
I'm trying to perform a complex query and return the results to a vbscript (vbs) record set.
In order to speed up the query I create temporary tables and then use those tables in the main query (creates a speed boost of around 1200% on just using sub queries)
The problem is that the outlying code seems to ignore the main query, only 'seeing' the result of the very first command (i.e. it will return a 'records affected' figure).
For example, given a query like this..
delete from temp
select * into temp from sometable where somefield = somefilter
select sum(someotherfield) from yetanothertable where account in (select * from temp)
The outlying code only seems to 'see' the returned result of 'delete from temp'. I can't access the data that the third command is returning.
(Obviously the SQL query above is pseudo/fake. The real query is large and its content is not relevant to the question being asked. I need to solve this problem because, without being able to use a temporary table, the query goes from taking 3 seconds to 6 minutes!)
edit: I know I could get around this by making multiple calls to ADODB.Connection's execute (make the call to empty the temp tables, make the call to create them again, finally make the call to get the data) but I'd rather find an elegant solution/way to avoid this way of doing it.
edit 2: Below is the actual SQL code I've ended up with. Just adding it for the curiosity of people who have replied. It doesn't use the nocount as I'd already settled on a solution which works for me. It is also probably badly written. It evolved over time from something more basic. I could probably improve it myself but as it works and returns data extremely quickly I have stuck with it. (for now)
Here's the SQL.
Here's the Code where it's called. My chosen solution is to run the first query into a third temp table, then run a select * on that table from the code, followed by a delete from, also from the code...
I make no claims about being a 'good' SQL scripter (self-taught, mostly out of necessity), and the database is not very well designed (a mix of old and new tables; the old tables are not relational and contain numerical values and date values stored as strings).
Here is the original (slow) query...
select
name,
program_name,
sum(handle) + sum(refund) as [Total Sales],
sum(refund) as Refunds,
sum(handle) as [Net Sales],
sum(credit - refund) as Payout,
cast(sum(comm) as money) as commission
from
(select accountnumber,program_name,
cast(credit_amount as money) as credit,cast(refund_amt as money) as refund,handle, handle * (
(select commission from amtotecommissions
where _date = a._date
and pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
and program_name = a.program_name) / 100) as comm
from amtoteaccountactivity a where _date = '#yy/#mm/#dd' and transaction_type = 'Bet'
and accountnumber not in ('5067788','5096272') /*just to speed the query up a bit. I know these accounts aren't included*/
) a,
ews_db.dbo.amtotetrack t
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers where country = 'US')
or a.accountnumber in ('5122483','5092147'))
and t.our_code = a.program_name collate database_default
and t.tracktype = 2
group by name,program_name

I suspect that with the right SQL and indexes you should be able to get equal performance with a single SELECT, however there isn't enough information in the original question to be able to give guidance on that.
I think you'll be best off doing this as a stored procedure and calling that.
CREATE PROCEDURE get_Count
@somefilter int
AS
delete from temp;
select * into temp from sometable where somefield = @somefilter;
select sum(someotherfield) from yetanothertable
where account in (select * from temp);
However, replacing the IN as you're using it with a JOIN will probably fix the performance issue. Use EXPLAIN SELECT to see what's going on and optimise from there. For example the following
select sum(transactions.value) from transactions
inner join user on transactions.user=user.id where user.name='Some User'
is much quicker than
select sum(transactions.value) from transactions
where user in (SELECT id from user where user.name='Some User')
because the number of rows scanned in the second example will be the entire table, whereas in the first the indexes can be used.
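Whether the IN form really scans the whole table depends on the engine and its version, so it is worth looking at the plan rather than assuming. The sketch below uses Python's sqlite3 and SQLite's EXPLAIN QUERY PLAN as a stand-in for the EXPLAIN mentioned above; the two indexes and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, user INTEGER, value REAL);
    -- Assumed indexes; the original answer does not say which ones exist.
    CREATE INDEX idx_user_name ON user(name);
    CREATE INDEX idx_transactions_user ON transactions(user);
    INSERT INTO user VALUES (1, 'Some User'), (2, 'Other User');
    INSERT INTO transactions VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

join_sql = """
    SELECT sum(transactions.value) FROM transactions
    INNER JOIN user ON transactions.user = user.id
    WHERE user.name = 'Some User'
"""
in_sql = """
    SELECT sum(transactions.value) FROM transactions
    WHERE user IN (SELECT id FROM user WHERE user.name = 'Some User')
"""

join_total = conn.execute(join_sql).fetchone()[0]
in_total = conn.execute(in_sql).fetchone()[0]
print(join_total, in_total)

# The last column of each plan row is a human-readable description of the step.
join_plan = conn.execute("EXPLAIN QUERY PLAN " + join_sql).fetchall()
in_plan = conn.execute("EXPLAIN QUERY PLAN " + in_sql).fetchall()
for label, plan in (("JOIN", join_plan), ("IN", in_plan)):
    for row in plan:
        print(label, row[-1])
```

Both queries return the same total here; the plan output is what tells you whether either form falls back to a full scan on your own engine and data.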
Rev1
Taking the slow SQL posted, it appears that there are full table scans going on where the SQL states WHERE .. IN, e.g.
where (a.accountnumber in (select accountno from ews_db.dbo.get_all_customers))
The above will pull in lots of records which may not be required. This together with the other nested table selects are not allowing the optimiser to pull in only the records that match, as would be the case when using JOIN at the outer level.
When building this type of complex query I generally start with the inner detail, because we need to have the inner detail so we can perform joins and aggregate operations.
What I mean by this is if you have a typical DB with customers that have orders that create transactions that contain items then I would start with the items and pull in the rest of the detail with joins.
By way of example only I suggest building the query more like the following:
select name,
program_name,
SUM(handle) + SUM(refund) AS [Total Sales],
SUM(refund) AS Refunds,
SUM(handle) AS [Net Sales],
SUM(credit - refund) AS Payout,
CAST(SUM(comm) AS money) AS commission
FROM ews_db.dbo.get_all_customers AS cu
INNER JOIN amtoteaccountactivity AS a ON a.accountnumber = cu.accountno
INNER JOIN ews_db.dbo.amtotetrack AS t ON t.our_code = a.program_name
INNER JOIN amtotecommissions AS co ON co.program_name = a.program_name
WHERE cu.country = 'US'
AND t.tracktype = 2
AND a.transaction_type = 'Bet'
AND a._date = '#yy/#mm/#dd'
AND a.program_name = co.program_name
AND co.pool_type = (case when a.pool_type in ('WP','WS','PS','WPS') then 'WN' else a.pool_type end)
GROUP BY name,program_name,co.commission
NOTE: The above is not functional and is for illustration purposes. I'd need to have the database online to build the real query. I'm hoping you'll get the general idea and build from there.
My top tip for complex queries that don't work is simply to completely start again throwing away what you've already got. Sometimes I will do this three or four times when building a really tricky query.
Always build these queries gradually starting from the most detail and working outwards. Inspect the results at each stage because it helps visualise what the data are.

If you could come to a common data structure for all the selects you could UNION ALL them together with perhaps selecting a constant in each union so you know where the data was coming from - kinda like
select '1',col1,col2,'' from table1
UNION ALL
select '2',col1,col2,col3 from table2
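A runnable version of that idea, sketched in Python's sqlite3. The tables, column aliases and rows are placeholders; the constant first column tags which select each row came from, and the empty string pads the column that table1 lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 TEXT, col2 INTEGER);
    CREATE TABLE table2 (col1 TEXT, col2 INTEGER, col3 TEXT);
    INSERT INTO table1 VALUES ('a', 1);
    INSERT INTO table2 VALUES ('b', 2, 'x');
""")

rows = conn.execute("""
    SELECT '1' AS source, col1, col2, '' AS col3 FROM table1
    UNION ALL
    SELECT '2', col1, col2, col3 FROM table2
    ORDER BY source
""").fetchall()

# Route each row by its source tag on the client side.
for source, col1, col2, col3 in rows:
    print(source, col1, col2, col3)
```

The client then gets one result set and can branch on the first column, which avoids the problem of only the first statement's result being visible.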

I just solved my original problem (that I came up against again today on a different query) in a slightly hacky way...
Conn.Execute(split(query,";")(0))
set rs = Conn.Execute(split(query,";")(1))
Works perfectly!
Edit : I just noticed that the first comment on my original question also provided a quick fix (set nocount on). I forgot about that. Well there is this and that. I had tried to get the query working without the temporary table but I couldn't get anywhere near the same performance as with it.
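The same split-and-run idea, sketched in Python's sqlite3 rather than VBScript and ADODB, so the driver behaviour is only an analogy. The table temp_accounts, the schema and the rows are all invented: every statement except the last is executed for its side effects, and only the last one is read as a result set.

```python
import sqlite3

query = """
    DELETE FROM temp_accounts;
    INSERT INTO temp_accounts
        SELECT account FROM sometable WHERE somefield = 'keep';
    SELECT SUM(someotherfield) FROM yetanothertable
        WHERE account IN (SELECT account FROM temp_accounts)
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp_accounts (account INTEGER);
    CREATE TABLE sometable (account INTEGER, somefield TEXT);
    CREATE TABLE yetanothertable (account INTEGER, someotherfield INTEGER);
    INSERT INTO sometable VALUES (1, 'keep'), (2, 'skip'), (3, 'keep');
    INSERT INTO yetanothertable VALUES (1, 10), (2, 20), (3, 30), (3, 5);
""")

# Naive split on ';' (breaks if a string literal contains a semicolon).
statements = [s.strip() for s in query.split(";") if s.strip()]

for setup in statements[:-1]:
    conn.execute(setup)  # side effects only, results ignored

total = conn.execute(statements[-1]).fetchone()[0]
print(total)
```

Splitting on the semicolon is as hacky here as in the original, so it only suits queries you control; anything with semicolons inside string literals needs a proper statement parser or separate query strings.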

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g. a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?

What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.

Explain analyze and time query to see if there is a problem.
Also, you could try expressing the query as a union:
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
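Before benchmarking, it helps to confirm that the union form returns the same rows as the OR join. A small check, sketched with Python's sqlite3 standing in for PostgreSQL; the column types and sample rows are invented, and only id and created_at are selected to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feed_items (
        id INTEGER PRIMARY KEY,
        actor_id INTEGER, actor_type TEXT,
        subject_id INTEGER, subject_type TEXT,
        created_at TEXT);
    CREATE TABLE followings (follower_id INTEGER, feeder_id INTEGER, feeder_type TEXT);
    INSERT INTO feed_items VALUES
        (1, 7, 'User', 3, 'Group', '2024-01-01'),
        (2, 8, 'User', 3, 'Group', '2024-01-02'),
        (3, 8, 'User', 4, 'Group', '2024-01-03'),
        (4, 9, 'User', 5, 'Group', '2024-01-04');
    -- User 42 follows group 3 and user 8, so item 2 matches twice.
    INSERT INTO followings VALUES (42, 3, 'Group'), (42, 8, 'User'), (43, 9, 'User');
""")

or_rows = conn.execute("""
    SELECT DISTINCT feed_items.id, feed_items.created_at FROM feed_items
    INNER JOIN followings ON (
        (followings.feeder_id = feed_items.subject_id
         AND followings.feeder_type = feed_items.subject_type)
        OR
        (followings.feeder_id = feed_items.actor_id
         AND followings.feeder_type = feed_items.actor_type))
    WHERE followings.follower_id = 42
    ORDER BY feed_items.created_at DESC LIMIT 30
""").fetchall()

union_rows = conn.execute("""
    SELECT x.id, x.created_at FROM (
        SELECT feed_items.id, feed_items.created_at FROM feed_items
        INNER JOIN followings
            ON followings.feeder_id = feed_items.subject_id
           AND followings.feeder_type = feed_items.subject_type
        WHERE followings.follower_id = 42
        UNION
        SELECT feed_items.id, feed_items.created_at FROM feed_items
        INNER JOIN followings
            ON followings.feeder_id = feed_items.actor_id
           AND followings.feeder_type = feed_items.actor_type
        WHERE followings.follower_id = 42
    ) AS x
    ORDER BY x.created_at DESC LIMIT 30
""").fetchall()

print(or_rows)
print(union_rows)
```

UNION (as opposed to UNION ALL) removes the duplicate produced when an item matches through both its actor and its subject, which is the job DISTINCT does in the OR version. Each branch of the union has a plain equality join, so each can use an ordinary index; whether that is faster in practice is exactly what explain analyze should settle.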

To find out if there is a performance problem, measure it. PostgreSQL can explain the query for you.
I don't think that the query needs simplifying. If you identify a performance problem, then you may need to revise your indexes.
