Append data first then group, or group first then append? - SQL

I have two tables with exactly the same format. Since each table has a date column (the date used when the table was created), grouping first or appending first will not make any difference to the result.
I used two queries to test. Query 1 groups each table first and then appends:
SELECT * FROM
(
    SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT
    FROM Table1
    GROUP BY TXN, CONT, ReportingDate
    UNION ALL
    SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT
    FROM Table2
    GROUP BY TXN, CONT, ReportingDate
) TEST
Query 2 appends first and then groups:
SELECT TXN, CONT, ReportingDate, SUM(AMT)
FROM
(
    SELECT TXN, CONT, AMT, ReportingDate
    FROM Table1
    UNION ALL
    SELECT TXN, CONT, AMT, ReportingDate
    FROM Table2
) TEST
GROUP BY TXN, CONT, ReportingDate
(22596 row(s) affected)
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 2582 ms.
(22596 row(s) affected)
SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 2337 ms.
The statistics do not show much difference, and the timings vary a little every time I run the queries.
[Execution plan screenshots omitted]
Which one will be faster? I only list one result here; I ran the two queries 10 times, and in 7 of those runs query 1 was faster.
The ReportingDate values are completely different between the two tables, so query 1 will not produce any duplicate groups. For example, every ReportingDate in Table1 is 10/28/2015 and every ReportingDate in Table2 is 10/29/2015.
Thanks

Typically, when deciding which version of a SQL statement I want to use, I consider the following:
Will they both return the same results? As Gordon mentioned in the comments, conceptually the first query would return a row that appears in both tables as two separate rows, whereas the second would group them together and you would see the sum of both (see the sketch after this list).
Performance difference. Not much difference here, but the second one does seem to be faster, which makes sense: the DBMS can gather all the rows and sum once, rather than gather some rows, sum, gather some more rows, and sum again.
Readability/maintainability. In your opinion, when someone is debugging this later on, would they rather test the inner statements with or without a grouping statement? Really your call on this one.
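To make the first point concrete, here is a small self-contained sketch (the table variables and sample values are made up for illustration; in the question itself the dates never overlap, so this case does not actually occur there):
DECLARE @Table1 TABLE (TXN varchar(10), CONT varchar(10), ReportingDate date, AMT decimal(10,2));
DECLARE @Table2 TABLE (TXN varchar(10), CONT varchar(10), ReportingDate date, AMT decimal(10,2));
INSERT INTO @Table1 VALUES ('SALE', 'C1', '2015-10-28', 100);
INSERT INTO @Table2 VALUES ('SALE', 'C1', '2015-10-28', 50);

-- Group first, then append: the shared key comes back as two rows (TOT = 100 and TOT = 50)
SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT FROM @Table1 GROUP BY TXN, CONT, ReportingDate
UNION ALL
SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT FROM @Table2 GROUP BY TXN, CONT, ReportingDate;

-- Append first, then group: the shared key comes back as one row (TOT = 150)
SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT
FROM (SELECT TXN, CONT, AMT, ReportingDate FROM @Table1
      UNION ALL
      SELECT TXN, CONT, AMT, ReportingDate FROM @Table2) TEST
GROUP BY TXN, CONT, ReportingDate;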

Related

BigQuery runs indefinitely

I have a query like this:
WITH A AS (
SELECT id FROM db1.X AS d
WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
SELECT id
FROM db2.Y as t
WHERE
t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
AND t.end >= TIMESTAMP(current_date())
)
SELECT * FROM A as d JOIN B as t on d.id = t.id;
db1.X has 1.6 Billion rows.
db2.Y has 15K rows.
db1.X is a materialized view on a bigger table.
db2.Y is a table with source as a google sheet.
Issue
The query keeps running indefinitely.
I had to cancel it when it reached about an hour, but one query that I left running went on for 6 hours and then timed out without any further error.
The query used to run fine until 2nd Jan. I reran it on 9th Jan and it never finished. Both tables are auto-populated, so it is possible they crossed some threshold in that time, but I could not find any such threshold value. (Three other queries on the same tables met a similar fate.)
What I've tried
Removed the join and used a WHERE ... IN instead (see the sketch after this list). Still never-ending.
No operation works on A, but everything works on B. For example, SELECT count(*) FROM B; works, while the same on A keeps running. (A does work when the definition of B is removed.)
The above behaviour is reproduced even when not using subqueries.
A has 10.6 million rows, B has 31 rows (much less than the actual tables, but still the same result).
The actual query had no subqueries and used only multiple date comparisons in the join, so I switched to subqueries that filter the data before the join (this is the query above). It also runs indefinitely.
JOIN EACH: this never got past syntax errors. Replacing JOIN with JOIN EACH in the above query complains about the "AS"; removing that, it complains that I should use dataset.tablename; fixing that, it complains Expected end of input but got ".".
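The WHERE ... IN variant from the first item was along these lines (a reconstructed sketch of the shape, not the exact statement that was run):
WITH A AS (
  SELECT id FROM db1.X AS d
  WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
  SELECT id FROM db2.Y AS t
  WHERE t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
    AND t.end >= TIMESTAMP(current_date())
)
SELECT * FROM A WHERE id IN (SELECT id FROM B);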
It turns out that the table size is the problem.
I created a smaller table and ran exactly the same queries, and that works.
This was also expected because the query just stopped running one day. The only variable was the amount of data in source tables.
In my case, I needed the data every week, so I created a scheduled query to update the smaller table with only one month's worth of data (sketched below).
The smaller versions of the tables have:
db1.X: 40 million rows
db2.Y: 400 rows
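A minimal sketch of what that scheduled query can look like, assuming a destination table name of db1.X_small (the name and the exact one-month window are illustrative; in BigQuery this statement is simply run on a schedule, or the SELECT is configured with a destination table in the scheduled-query settings):
CREATE OR REPLACE TABLE `db1.X_small` AS
SELECT *
FROM `db1.X`
WHERE DATE(date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH);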
Not sure what's going on exactly in terms of issues due to size, but apart from some code-clarity points your query should run as expected. Am I correct in reading from your query that table A should return results within the last 7 days, whereas table B should return results outside of the last 7 days? Some things you might try to make debugging easier:
Use BETWEEN and dates. E.g. WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
Use backticks (`) around the table names in your FROM clause to prevent table-name errors like the one you mentioned (Expected end of input but got ".").
Limit your CTE instead of the outer query. A LIMIT in the outer query has no effect on the data that is computed, only on the output. E.g., to limit the source data from table A, use:
WITH A AS (
SELECT id FROM `db1.X`
WHERE DATE(date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
LIMIT 10
)
...

Oracle SQL query performance change when changing date predicate by one day

I have a very complex SQL query with many joins that runs in a few seconds with one date in the predicate, but when I change the date by one day, I end up cancelling the query after 15 minutes. PHA is the PO.PO_HEADERS_ALL table in Oracle, and the CREATION_DATE column is defined as type DATE.
--this finishes in a few seconds with 444 records
and pha.creation_date > to_date('23-JAN-2021','DD-MON-YYYY')
-- this never finishes
and pha.creation_date > to_date('22-JAN-2021','DD-MON-YYYY')
If I use TRUNC(pha.creation_date) in the predicate, then both queries finish in a few seconds, with the expected results.
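That is, this variant of the slow predicate finishes quickly (same literal, only the column wrapped in TRUNC):
and TRUNC(pha.creation_date) > to_date('22-JAN-2021','DD-MON-YYYY')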
Just wondering if someone can explain why one day would cause a difference? The explain plans in TOAD did not really look any different between the two queries.

Is it possible to create more rows than the original data using GROUP BY and COUNT(*)?

I have the following query:
SELECT "domain", site_domain, pageurl, count (*) FROM impressions WHERE imp_date > '20150718' AND insertion_order_id = '227363'
GROUP BY 1,2,3
It was an incorrectly conceived query, this I understand, but it took over 30 minutes to run, while just pulling the data without the count and GROUP BY took only 20 seconds.
My question is: is it possible that more rows are created than there are in the original data set?
Thanks!
The only time that an aggregation query will return more rows than in the original data set is when two conditions are true:
There are no matching rows in the query.
There is no group by.
In this case, the aggregation query returns one row; without the aggregation you would have no rows.
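A quick way to see this, using the table from the question (the WHERE 1 = 0 filter is just a stand-in for "no matching rows"):
-- No GROUP BY: exactly one row comes back even though nothing matches (the count is 0)
SELECT count(*) FROM impressions WHERE 1 = 0;

-- With a GROUP BY: no matching rows means no groups, so zero rows come back
SELECT "domain", count(*) FROM impressions WHERE 1 = 0 GROUP BY 1;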
Otherwise, GROUP BY combines rows together, so the result cannot be larger than the original data.
When you are comparing the time spent returning a result set, you need to distinguish between time to first row and time to last row. When you execute a simple SELECT, you are inclined to measure the time to the first row returned. However, a GROUP BY needs to process all the data before returning any rows (under most circumstances). Hence, it would be fairer to compare against the time to the last row returned by the simple query; one way to do that is sketched below.
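For example, forcing the plain SELECT to be fully consumed by wrapping it in a count gives a rough measure of time to last row (a sketch; the wrapper may let the engine skip some column retrieval, so treat it as an approximation):
SELECT count(*)
FROM (
    SELECT "domain", site_domain, pageurl
    FROM impressions
    WHERE imp_date > '20150718' AND insertion_order_id = '227363'
) t;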

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO8601 format but with a space, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data that I generated. I originally used several fields and tried to get hourly summaries for a month (group by hour, where hour is defined with an AS alias in the SELECT part of the query). When that failed I tried switching to daily summaries. When that failed I reduced the columns involved. That also failed when using count(distinct xxx, 1000000), but it worked when I only did one day's worth. (It also works if I remove the 1000000 parameter, but since that does work with the one-day query it seems the query planner is not separating things as I would expect.)
The one checked for count (distinct) has cardinality 16,000, and the group by columns have cardinality 2 and 20 for a total of just 1200 expected rows. Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB in the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not in the final response but in the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This is a feature that is currently experimental and, as such, is not yet documented.
For your query, since you reference fields named in the select in the group by, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
    select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, a, b, c, d
    from [myproject.mytable]
)
group EACH by day, b, c, d
order by day, b, c, d asc

How Does Dateadd Impact the Performance of a SQL Query?

Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n,
       (select avg(col1)
        from table1
        where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
                            and dateadd(minute, 15*(n+1), @ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the date by concatenating varchars rather than using the DATEADD function?
Keeping the data in datetime format and using DATEADD is most likely to be quicker.
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not me!) demonstrates DATEADD over string conversions. I've seen another one, many years ago, that showed the same.
Be careful with BETWEEN and dates; take a look at How Does Between Work With Dates In SQL Server?
I once optimized a query to run from over 24 hours to 36 seconds. Just don't use date functions or conversions on the column; see here: Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code
To see which query performs better, execute both queries and look at the execution plans. You can also use STATISTICS IO and STATISTICS TIME to get how many reads there were and how long it took to execute the queries.
I would NOT go with concatenating varchars.
DATEADD will definitely perform better than string concatenation followed by casting to DATETIME.
As always, your best bet would be to profile the two options and determine the best result, as no DB is specified.
Most likely there will be no difference one way or the other.
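To illustrate what the two approaches being compared look like, here is a small sketch (the string version is only one possible way to write it, with @n standing in for the numbers-table value, and it assumes @ArbitraryDate falls on midnight):
DECLARE @ArbitraryDate datetime = '2021-01-01';
DECLARE @n int = 3;

-- Stay in datetime and let DATEADD do the arithmetic
SELECT dateadd(minute, 15 * @n, @ArbitraryDate) AS via_dateadd;

-- Build a string and cast it back: more work, and more fragile
SELECT cast(convert(varchar(10), @ArbitraryDate, 120) + ' 00:45:00' as datetime) AS via_string;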
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.
As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
In any case, check (or post!) the execution plan to confirm.
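As a sketch of the sargability distinction (predicate fragments only; the column and variable names are taken from the query in the question):
-- Sargable: the column is compared as-is, so an index on table1.timestamp can be used for a seek
where timestamp between dateadd(minute, 15*n, @ArbitraryDate)
                    and dateadd(minute, 15*(n+1), @ArbitraryDate)

-- Not sargable: wrapping the column in a function generally forces a scan
where dateadd(minute, -15*n, timestamp) between @ArbitraryDate
                                            and dateadd(minute, 15, @ArbitraryDate)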
Why would you ever use a correlated subquery to begin with? That's going to slow you down far more than DATEADD. Correlated subqueries are like cursors: they work row by row.
Will something like this work?
select n.n, t.avgcol1
from numbers n
left outer join
(
    -- bring the numbers table into the derived table so n is available to the date arithmetic
    select nums.n, avg(t1.col1) as avgcol1
    from numbers nums
    join table1 t1
      on t1.timestamp between dateadd(minute, 15 * nums.n, @ArbitraryDate)
                          and dateadd(minute, 15 * (nums.n + 1), @ArbitraryDate)
    where nums.n < 1200
    group by nums.n
) t
on n.n = t.n
where n.n < 1200