Access Query MAX() Slows Query

I have the below Access query and it works fine. However, it takes about 8-10 seconds to finish on a table that has about 700 records right now. The FROM is another query that has very little query time. I have narrowed the problem down to the MAX() function, because when I remove that function it runs with very little query time. What can I do to speed this up? I assume that as more data comes into the database, the query will only get slower.
SELECT FirstName, LastName, TeamID, MAX(total) AS totalMax
FROM attendanceViewAll
WHERE TeamID IN(5,9,13)
GROUP BY FirstName, LastName, TeamID
Here is the subquery; basically it selects a bunch of data from a table, and it runs in less than a second. The result of this query is everything ordered by date and agentID. I then use the above query to find the MAX(total) so I can group the agents for a summary. I use the below query for other reports as well.
SELECT
a1.TeamID,
a1.FirstName,
a1.LastName,
a1.incurredDate,
a1.points,
a1.OneFallOff,
a1.TwoFallOff,
(select sum(a2.actualPoints)
 from attendanceView as a2
 where a2.agentID = a1.agentID
   and a2.incurredDate <= a1.incurredDate) as total,
a1.comment, a1.linked, a1.FallOffDate
FROM attendanceView as a1;

Your [attendanceViewAll] query is using a correlated subquery to produce a running total (ref: your previous question here). Now you are asking for the MAX() of that running total, which is the same thing as the SUM() of the [TwoFallOff] values. That is, for
incurredDate TwoFallOff total
------------ ---------- -----
2014-01-10 2 2
2014-01-11 3 5
2014-01-12 1 6
MAX(total) is the same value as SUM(TwoFallOff). The big difference is that to get each value for [total] you need to run the correlated subquery, whereas to get each value for [TwoFallOff] you don't.
In other words, I suspect that your current query is slow because the MAX() is forcing the correlated subquery in [attendanceViewAll] to be executed many times. You may get faster response if you have your current query refer directly back to [attendanceView] and SUM() the [TwoFallOff] values from there.
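A sketch of that rewrite, assuming [attendanceView] exposes the same name and team columns as [attendanceViewAll]:
SELECT FirstName, LastName, TeamID, SUM(TwoFallOff) AS totalMax
FROM attendanceView
WHERE TeamID IN (5,9,13)
GROUP BY FirstName, LastName, TeamID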

What you need is a multiple-column index, and with it the query should be almost instantaneous.
Use the interface as this link describes if you need help with that. However, your index should lead with the criteria columns and then the fields used in the GROUP BY, so I would have an index on
TeamID, FirstName, LastName
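If you would rather do it in DDL than in the table designer, a sketch (the base-table name [attendance] is an assumption here, since Access indexes live on tables, not on saved queries like [attendanceView]):
CREATE INDEX idxTeamAgentName
ON attendance (TeamID, FirstName, LastName);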


SQL Aggregate Function over partitions

I'm relatively new to SQL but have learned some cool stuff. Now I'm getting results that don't make sense. I've got a query with several subqueries and what-not, but I have a windowed function that isn't working the way I expect.
The part that isn't working is this (simplified from the 300 line query):
SELECT AVG(table.sales_amount)
OVER (PARTITION BY table.month, table.sales_rep, table.department)
FROM table
The problem is that when I pull the data non-aggregated I get a different value (107) than the above returns (95).
I've used windowed functions for COUNT and SUM and they work fine, but AVG is acting strangely. Am I missing something about how this works with AVG?
The subquery that table is a stand-in for looks like:
sales_rep, month, department, sales_amount
1, 2017-1, abc, 125.20
1, 2017-2, abc, 120.00
2, 2017-1, def, 100.00
...etc
I'm working out of SQL Server Management Studio.
SOLVED: I finally figured it out. The results I was joining this subquery to had the same sales rep multiple times in a month (selling objects A & B), which caused whoever sold both to be counted twice. Whoops, my bad.
The results that you get should be the same values as in:
SELECT AVG(table.sales_amount)
FROM table
GROUP BY table.month, table.sales_rep, table.department;
Of course, the rows will be different. You need to match up the three key columns.
Based on your sample data, it looks like the partitioning keys uniquely define each row. Perhaps you really intend:
SELECT AVG(table.sales_amount) OVER () as overall_average
FROM table;
EDIT:
For the departmental average:
SELECT AVG(table.sales_amount) OVER (partition by table.department) as department_average
FROM table;
After some brute-forcing of potential errors I finally figured out the issue. I was joining that subquery to another one which had multiple instances of a sales_rep in a given month (selling objects A & B), which caused the averages of those who sold both objects to be counted twice instead of once.
So sales rep 1, who sold objects A & B, counted for 66% of the dept average instead of 50%, and sales rep 2 for only 33%.
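A minimal repro of that effect in T-SQL, with made-up numbers: after the join, rep 1's row appears twice, so the windowed AVG weights it twice:
SELECT sales_rep,
       AVG(sales_amount) OVER () AS dept_average
FROM (VALUES (1, 120.00),   -- rep 1, object A
             (1, 120.00),   -- rep 1 again, duplicated by the join (object B)
             (2,  60.00))   -- rep 2
     AS t (sales_rep, sales_amount);
-- returns 100.00 (rep 1 counted twice) instead of the intended 90.00
-- you would get from one row per rep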

Max vs Count Huge Performance difference on a query

I have two similar queries where the only difference is that one is doing a SUM of a column and the other is doing a COUNT(DISTINCT) of another column.
The first one runs in seconds (17s) and the other one never stops (1 hour and counting). I've looked at the plan for the COUNT query and it has huge costs. I don't understand why.
They are hitting the exact same views.
Why is this happening and what can I do?
The one that is running fine:
select a11.SOURCEPP SOURCEPP,
a12.DUMMY DUMMY,
a11.SIM_NAME SIM_NAME,
a13.THEORETICAL THEORETICAL,
sum(a11.REVENUE) WJXBFS1
from CLIENT_SOURCE_DATA a11
join DUMMY_V a12
on (a11.SOURCEPP = a12.SOURCEPP)
join SIM_INFO a13
on (a11.SIM_NAME = a13.SIM_NAME)
where (a13.THEORETICAL in (0)
and a11.SIM_NAME in ('ETS40'))
group by a11.SOURCEPP,
a12.DUMMY,
a11.SIM_NAME,
a13.THEORETICAL
the one that doesn't run:
select a12.SOURCEPP SOURCEPP,
a12.SIM_NAME SIM_NAME,
a13.THEORETICAL THEORETICAL,
count(distinct a12.CLIENTID) WJXBFS1
from CLIENT_SOURCE_DATA a12
join SIM_INFO a13
on (a12.SIM_NAME = a13.SIM_NAME)
where (a13.THEORETICAL in (0)
and a12.SIM_NAME in ('ETS40'))
group by a12.SOURCEPP,
a12.SIM_NAME,
a13.THEORETICAL
DISTINCT is very slow when there are many distinct values, because the database needs to sort/hash and store all values (or sets) in memory or a temporary tablespace. It also makes parallel execution much more difficult to apply.
If there is a way to rewrite the query without using DISTINCT, you should definitely do it.
As answered above, DISTINCT has to do a table scan and then hash, aggregate, and sort the data into sets. This increases the amount of time it takes across the board (CPU, disk access, and the time it takes to return the data). I would recommend trying a subquery instead, if possible. This limits the distinct aggregation to only the data you care about, instead of having the engine perform it on all of the data. Here's an article on how this works in practice, with an example.
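A sketch of that rewrite against the second query (untested, and it assumes SIM_INFO has one row per SIM_NAME): deduplicate CLIENTID in a subquery over just the filtered rows, then count:
select a12.SOURCEPP SOURCEPP,
       a12.SIM_NAME SIM_NAME,
       a13.THEORETICAL THEORETICAL,
       a12.WJXBFS1 WJXBFS1
from (select SOURCEPP,
             SIM_NAME,
             count(*) as WJXBFS1
      from (select distinct SOURCEPP, SIM_NAME, CLIENTID
            from CLIENT_SOURCE_DATA
            where SIM_NAME in ('ETS40')) dedup
      group by SOURCEPP, SIM_NAME) a12
join SIM_INFO a13
  on (a12.SIM_NAME = a13.SIM_NAME)
where a13.THEORETICAL in (0)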

What do OrientDB's functions do when applied to the results of another function?

I am getting very strange behavior on 2.0-M2. Consider the following against the GratefulDeadConcerts database:
Query 1
SELECT name, in('written_by') AS wrote FROM V WHERE type='artist'
This query returns a list of artists and the songs each has written; a majority of the rows have at least one song.
Query 2
Now try:
SELECT name, count(in('written_by')) AS num_wrote FROM V WHERE type='artist'
On my system (OSX Yosemite; Orient 2.0-M2), I see just one row:
name num_wrote
---------------------------
Willie_Cobb 224
This seems wrong, but I tried to understand it better. Perhaps the count() causes the in() to look at all written_by edges...
Query 3
SELECT name, in('written_by') FROM V WHERE type='artist' GROUP BY name
Produces results similar to the first query.
Query 4
Now try count()
SELECT name, count(in('written_by')) FROM V WHERE type='artist' GROUP BY name
Wrong path -- So try LET variables...
Query 5
SELECT name, $wblist, $wbcount FROM V
LET $wblist = in('written_by'),
$wbcount = count($wblist)
WHERE type='artist'
Produces seemingly meaningless results:
You can see that the $wblist and $wbcount columns are inconsistent with one another, and the $wbcount values don't show any obvious progression like a cumulative result.
Note that the strange behavior is not limited to count(). For example, first() does similarly odd things.
count(), as in an RDBMS, collapses all the records into a single value. For your purpose, .size() seems the right method to call:
in('written_by').size()
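Applied to the query from the question, that would look something like this (a sketch along the lines of Query 2):
SELECT name, in('written_by').size() AS num_wrote FROM V WHERE type='artist'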

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK. But when running on a larger data set, I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows the frequency of title edits by different contributors. The WHERE condition is just there to limit response size: if you remove the "B" we get results; if we add it, we get the "use EACH" recommendation.
select title,
       count(case when contributor_id <> LeadContributor then 1 else null end) as different,
       count(case when contributor_id = LeadContributor then 1 else null end) as same,
       count(*) as total
from
(
  select title, contributor_id,
         lead(contributor_id) over (partition by title order by timestamp) as LeadContributor
  from [publicdata:samples.wikipedia]
  where regexp_match(title, r'^[A,B]') = true
)
group by title
Thanks
I guess your particular use case is different to the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER possible: Surrounding the OVER() query with another one allows you to change the GROUP BY to GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH. Let's look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, as these operations are not parallelizable. As the response for r'^[A]' reveals, that can be a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY. However, there is no parallelizable OVER EACH BY we can use here.
The workaround I would apply here is exactly what you are doing: Do the OVER() with just a fraction of the data.
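For instance, a sketch in legacy BigQuery SQL (untested): run the OVER() on each letter separately, then union the slices with a comma in the FROM clause and aggregate with GROUP EACH BY. Splitting by first letter is safe here because each OVER() partition is a single title:
select title,
       count(case when contributor_id <> LeadContributor then 1 else null end) as different,
       count(case when contributor_id = LeadContributor then 1 else null end) as same,
       count(*) as total
from
(
  select title, contributor_id,
         lead(contributor_id) over (partition by title order by timestamp) as LeadContributor
  from [publicdata:samples.wikipedia]
  where regexp_match(title, r'^[A]') = true
),
(
  select title, contributor_id,
         lead(contributor_id) over (partition by title order by timestamp) as LeadContributor
  from [publicdata:samples.wikipedia]
  where regexp_match(title, r'^[B]') = true
)
group each by title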
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)

SQL Server aggregate performance

I am wondering whether SQL Server knows to 'cache', if you like, aggregates within a query when they are used again.
For example,
Select Sum(Field),
Sum(Field) / 12
From Table
Would SQL Server know that it has already calculated the Sum function on the first field and then just divide it by 12 for the second? Or would it run the Sum function again then divide it by 12?
Thanks
It calculates once
Select
Sum(Price),
Sum(Price) / 12
From
MyTable
The plan gives:
|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
|--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
|--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))
This table has 1.35 million rows
Expr1011 = SUM
Expr1003 = some internal handling to do with the "no rows" case etc., but it is basically Expr1011
Expr1004 = Expr1011 / 12
According to the execution plan, it doesn't re-sum the column.
Good question. I think the answer is no, it doesn't cache it.
I ran a test query with around 3000 COUNTs in it, and it was much slower than one with only a few. I still want to test whether the query would be just as slow when selecting plain columns.
Edit: OK, I just tried selecting a large number of columns versus just one, and the number of columns (when talking about thousands being returned) does affect the speed.
Overall, unless you are using that aggregate number a ton of times in your query, you should be fine. If push comes to shove, you could always save the outcome to a variable and do the math after the fact.
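A sketch of the variable approach in T-SQL, reusing the table from the accepted answer:
Declare @Total Decimal(18, 2);

Select @Total = Sum(Price)
From MyTable;

-- reuse the stored aggregate instead of recomputing it
Select @Total      As Total,
       @Total / 12 As Monthly;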