Is it possible to create more rows than the original data using GROUP BY and COUNT(*)? - sql

I have the following query:
SELECT "domain", site_domain, pageurl, count (*) FROM impressions WHERE imp_date > '20150718' AND insertion_order_id = '227363'
GROUP BY 1,2,3
I understand it was an incorrectly conceived query, but it took over 30 minutes to run, while just pulling the data without the count and GROUP BY took only 20 seconds.
My question is: is it possible that more rows are created than exist in the original data set?
Thanks!

The only time that an aggregation query will return more rows than the original data set is when two conditions are both true:
There are no matching rows in the query.
There is no GROUP BY.
In this case, the aggregation query returns one row; without the aggregation you would have no rows.
Otherwise, GROUP BY combines rows together, so the result cannot be larger than the original data.
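A minimal sketch of that edge case against the impressions table from the question (the filter value is deliberately chosen so that no rows match):
-- Without GROUP BY the aggregate still returns exactly one row, with count = 0
SELECT COUNT(*) FROM impressions WHERE imp_date > '99999999'
-- The plain SELECT returns zero rows
SELECT imp_date FROM impressions WHERE imp_date > '99999999'
-- With GROUP BY the aggregate also returns zero rows, so it can never exceed the source
SELECT imp_date, COUNT(*) FROM impressions WHERE imp_date > '99999999' GROUP BY imp_date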
When you are comparing time spent for returning a result set, you need to distinguish between time to first row and time to last row. When you execute a simple SELECT, you are inclined to measure the time to the first row returned. However, a group by needs to process all the data before returning any rows (under most circumstances). Hence, it would be better to compare the time to the last row returned by the simple query.

Related

Get the number of records from 2 columns where the time is overlapping

I am new to MS ACCESS and am having trouble trying to get the number of records from overlapping time ranges. This is an example of my data.
[image: example of raw data]
What I am trying to do is get the column number_of_records. For example, if there are 4 records added at 5.11, the number_of_records should become 8, as 4 records were added at 5.10.
[image: example of raw data with no_of_records column]
There is a mistake in my image above. I forgot to mention that for example, if the time hits 6:00, the number of records should not add on to the previous records and should start afresh.
Do any of you have any suggestions?
Consider the correlated count subquery:
SELECT t.time_column_1, t.time_column_2,
       (SELECT Count(*)
        FROM myTable sub
        WHERE sub.time_column_1 <= t.time_column_1
          AND sub.time_column_2 = t.time_column_2) AS number_of_records
FROM myTable t
ORDER BY t.time_column_2, t.time_column_1
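If the count should start afresh at each full hour, as the question later adds, one hedged variation, assuming time_column_1 holds a full Date/Time value, is to only count rows that fall in the same hour:
SELECT t.time_column_1, t.time_column_2,
       (SELECT Count(*)
        FROM myTable sub
        WHERE sub.time_column_1 <= t.time_column_1
          AND sub.time_column_2 = t.time_column_2
          AND Format(sub.time_column_1, "yyyy-mm-dd hh") = Format(t.time_column_1, "yyyy-mm-dd hh")) AS number_of_records
FROM myTable t
ORDER BY t.time_column_2, t.time_column_1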

Get the minimum of 3 columns in Vertica

In Vertica, how can I get a column that is the min of 3 existing columns? In the case of all nulls it should return zero.
I've tried the min() function, but realized that it only returns the min of a single column across rows.
I thought about a case statement but realized it would be super long to capture every combination of results and would be very resource intensive.
I appreciate any suggestions. Thank you!
Use LEAST to get the minimum value of multiple columns per row.
SELECT LEAST(COALESCE(open_hrs_diff, 0),
             COALESCE(click_hrs_diff, 0),
             COALESCE(login_hrs_diff, 0))
FROM tablename
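Note that COALESCE turns each NULL into 0 before LEAST compares the values, so a row where all three columns are NULL returns 0, as requested.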

Why is SQL Server returning a different order when using 'month' in 'where'?

I run a procedure call that calculates sums into table rows. At first I thought the procedure was not working as expected, so I wasted half a day trying to fix something that actually works fine.
Later, I actually took a look at the SELECT that gets the data on screen and was surprised by this:
YEAR(M.date) = 2016
--and MONTH(M.date) = 2
and
YEAR(M.date) = 2016
and MONTH(M.date) = 2
So the second example returns a different sort order than the first.
The thing is, I do the calculations on the whole year and display the data based on year + month parameters.
Can someone explain why this is happening and how to avoid this?
In my procedure that calls the SELECT for on-screen data, I have it implemented like so:
and (@month = 0 or (month(M.date) = @month))
and year(M.date) = @year
So the month parameter is optional, for when the user wants to see the data for the whole year, and the year parameter is mandatory.
You are ordering by the date column. However, the date column is not unique -- multiple rows have the same date. The ORDER BY returns these in arbitrary order. In fact, you might get a different ordering for the same query running at different times.
To "fix" this, you need to include another column (or columns) that is unique for each row. In your case, that would appear to be the id column:
order by date, id
Another way to think about this is that in SQL the sorts are not stable. That is, they do not preserve the original ordering of the data. This is easy to remember, because there is no "original ordering" for a table or result set. Remember, tables represent unordered sets.
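A hedged sketch of the corrected query (the table name MyTable is an assumption; the date and id columns come from the question and answer):
SELECT M.*
FROM MyTable M           -- table name is an assumption
WHERE YEAR(M.date) = 2016
  AND MONTH(M.date) = 2
ORDER BY M.date, M.id    -- id breaks ties between rows that share the same date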

Append data first then group, or group first then append

I have two tables with the exact same format. Since each table has the date column (the date used to create the table), grouping first or appending first will not make any difference to the result.
I use two queries to test:
-- Query 1: group each table first, then append
SELECT *
FROM (SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT
      FROM Table1
      GROUP BY TXN, CONT, ReportingDate
      UNION ALL
      SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT
      FROM Table2
      GROUP BY TXN, CONT, ReportingDate) TEST

-- Query 2: append first, then group
SELECT TXN, CONT, ReportingDate, SUM(AMT)
FROM (SELECT TXN, CONT, AMT, ReportingDate
      FROM Table1
      UNION ALL
      SELECT TXN, CONT, AMT, ReportingDate
      FROM Table2) test
GROUP BY TXN, CONT, ReportingDate
(22596 row(s) affected)
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 2582 ms.
(22596 row(s) affected)
SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 2337 ms.
The statistics do not show a lot of difference. The timings change a little every time I run the queries.
[image: the execution plan]
Which one will be faster? I just list one result here; I ran these two queries 10 times, and in 7 of them query 1 was faster.
The reportingdate column will be totally different in the two tables, so there will be no duplicate results for query 1. For example, the reportingdate in table 1 is a column of 10/28/2015 values, and the reportingdate in table 2 is a column of 10/29/2015 values.
Thanks
Typically, when deciding which version of a SQL statement I want to use, I consider the following:
Will they both return the same results? As mentioned by Gordon in the comment, conceptually the first would return a row that is duplicated in both tables as two separate rows, whereas the second would group them together and you would see the sum of both of them (a small sketch after this list illustrates the difference).
Performance difference. Not much performance difference here, but the second one does seem to be faster (which makes sense as the DBMS is able to get all the rows and then sum once rather than get some rows, sum, then get some more rows, and sum)
Readability/maintainability. In your opinion, when someone is debugging this later on, would they rather test the inner statements with or without a grouping statement? Really your call on this one.
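A small sketch of the first consideration, using throwaway temp tables in place of Table1 and Table2 (the sample values are made up):
-- The same (TXN, CONT, ReportingDate) key appears in both tables
CREATE TABLE #T1 (TXN varchar(10), CONT varchar(10), ReportingDate date, AMT int)
CREATE TABLE #T2 (TXN varchar(10), CONT varchar(10), ReportingDate date, AMT int)
INSERT INTO #T1 VALUES ('X', 'A', '2015-10-28', 100)
INSERT INTO #T2 VALUES ('X', 'A', '2015-10-28', 50)

-- Group first, then append: two rows for that key (TOT = 100 and TOT = 50)
SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT FROM #T1 GROUP BY TXN, CONT, ReportingDate
UNION ALL
SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT FROM #T2 GROUP BY TXN, CONT, ReportingDate

-- Append first, then group: one row for that key (TOT = 150)
SELECT TXN, CONT, ReportingDate, SUM(AMT) AS TOT
FROM (SELECT TXN, CONT, AMT, ReportingDate FROM #T1
      UNION ALL
      SELECT TXN, CONT, AMT, ReportingDate FROM #T2) u
GROUP BY TXN, CONT, ReportingDate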

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO 8601 format but with a space instead of the 'T' separator, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01):
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data I generated. I originally used several fields and tried to get hourly summaries for a month (group by hour, where hour is defined using an AS alias in the select part of the query). When that failed I tried switching to daily. When that failed I reduced the columns involved. That also failed when using count(distinct xxx, 1000000), but it worked when I just did one day's worth. (It also works if I remove the 1000000 parameter, but since that does work with the one-day query, it seems the query planner is not separating things as I would expect.)
The column checked with count(distinct) has cardinality 16,000, and the group by columns have cardinalities 2 and 20, for a total of just 1,200 expected rows. Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB in the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not in the final response, but in the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This is a feature that is currently experimental and, as such, is not yet documented.
For your query, since you reference fields named in the select in the group by, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
  select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, a, b, c, d
  from [myproject.mytable]
)
group EACH by day, b, c, d
order by day, b, c, d asc