SQL DW count(*) with selected columns doesn't aggregate

This is rather strange, and I have had this query work on numerous databases, but here I am stumped.
I know that my Synapse table has duplicates:
SELECT nmiandnmisuffixkey, ReadingDate, IntervalNumber
FROM [dbo].[factMeterDataDetail]
where nmiandnmisuffixkey = 'XXXXXXXXXX'
and readingdate = '2020-10-08'
and IntervalNumber = 12
produces
+--------------------+-------------+----------------+
| nmiandnmisuffixkey | ReadingDate | IntervalNumber |
+--------------------+-------------+----------------+
| XXXXXXXXXX         | 2020-10-08  | 12             |
| XXXXXXXXXX         | 2020-10-08  | 12             |
+--------------------+-------------+----------------+
but when I try to run the following
SELECT nmiandnmisuffixkey, ReadingDate, IntervalNumber, count(*) as cnt
FROM [dbo].[factMeterDataDetail]
where nmiandnmisuffixkey = 'XXXXXXXXXX'
and readingdate = '2020-10-08'
and IntervalNumber = 12
group by nmiandnmisuffixkey, ReadingDate, IntervalNumber
I get the following:
+--------------------+-------------+----------------+-----+
| nmiandnmisuffixkey | ReadingDate | IntervalNumber | cnt |
+--------------------+-------------+----------------+-----+
| XXXXXXXXXX         | 2020-10-08  | 12             | 1   |
| XXXXXXXXXX         | 2020-10-08  | 12             | 1   |
+--------------------+-------------+----------------+-----+
why does the count not aggregate up?

Some possibilities (a sketch covering both follows below):
The date may carry a differing time/millisecond component, so try removing the time part and running the GROUP BY query again.
The string column (the key) can have white space at the beginning or end; use LTRIM/RTRIM and run the GROUP BY query again. Client tools will display such values as identical, just like your output, but trimming the spaces allows a true comparison.
In the OP's case it was case #2: using LTRIM/RTRIM resolved the aggregation issue.
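A hedged sketch covering both checks against the OP's table (it assumes ReadingDate may be a datetime and the key may carry stray spaces; adjust if the column types differ):
-- Case 1: strip any time component; Case 2: trim the key before comparing/grouping
SELECT LTRIM(RTRIM(nmiandnmisuffixkey)) AS nmikey,
       CAST(ReadingDate AS date) AS ReadingDay,
       IntervalNumber,
       COUNT(*) AS cnt
FROM [dbo].[factMeterDataDetail]
WHERE LTRIM(RTRIM(nmiandnmisuffixkey)) = 'XXXXXXXXXX'
  AND CAST(ReadingDate AS date) = '2020-10-08'
  AND IntervalNumber = 12
GROUP BY LTRIM(RTRIM(nmiandnmisuffixkey)), CAST(ReadingDate AS date), IntervalNumber;
If the two rows now collapse into one with cnt = 2, one of the causes above is confirmed.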

This is your query:
select nmiandnmisuffixkey, ReadingDate, IntervalNumber, count(*) as cnt
from [dbo].[factMeterDataDetail]
where nmiandnmisuffixkey = 'XXXXXXXXXX' and
readingdate = '2020-10-08' and
IntervalNumber = 12
group by nmiandnmisuffixkey, ReadingDate, IntervalNumber
The query is filtering on specific values for each of the columns used in the group by. And yet, you are getting multiple rows when aggregating on them.
So, your question is really: when does an equality comparison not match the notion of "equality" used for aggregation?
I'm sure this is not a comprehensive list.
One possibility is that IntervalNumber is really a string. The = converts the values to a number, so '012' and '12' are the same for equality, but not for aggregation (a repro sketch follows below).
In other words, type conversion can cause this discrepancy.
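A minimal repro sketch of that discrepancy, using a hypothetical temp table (names are illustrative, not from the OP's schema):
-- '012' and '12' compare equal to the number 12, but group separately as strings
CREATE TABLE #t (v varchar(10));
INSERT INTO #t VALUES ('12'), ('012');

SELECT COUNT(*) AS matches FROM #t WHERE v = 12;           -- one row: 2
SELECT v, COUNT(*) AS cnt FROM #t WHERE v = 12 GROUP BY v; -- two rows, cnt = 1 each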
This might occur with strings and collations. Normally, I would expect a collation conflict error. But you might check if the string columns have an explicit collation different from the database default (which would be used for the string constant).
I don't think there is an equivalent difference for your date comparison.
I should also note a workaround for this use-case:
select max(nmiandnmisuffixkey), max(ReadingDate), max(IntervalNumber), count(*) as cnt
from [dbo].[factMeterDataDetail]
where nmiandnmisuffixkey = 'XXXXXXXXXX' and
readingdate = '2020-10-08' and
IntervalNumber = 12;
That is, just use an aggregation query with no group by. It is guaranteed to return one row.

Your table definitely has an ID, and when you do COUNT(*) it includes the ID. First, put the desired fields in a temp table, then GROUP BY.
In this way:
SELECT nmiandnmisuffixkey, ReadingDate, IntervalNumber
INTO #tmp
FROM [dbo].[factMeterDataDetail]
WHERE nmiandnmisuffixkey = 'XXXXXXXXXX'
  AND readingdate = '2020-10-08'
  AND IntervalNumber = 12

SELECT nmiandnmisuffixkey, ReadingDate, IntervalNumber, COUNT(*) AS cnt
FROM #tmp
GROUP BY nmiandnmisuffixkey, ReadingDate, IntervalNumber

Related

SQL: Calculate number of days since last success

The following table represents the results of a given test.
Every result for the same test is either a pass (error_id = 0) or a fail (error_id <> 0).
I need help writing a query that returns the number of runs since the last good run (error_id = 0), and the date.
| Date       | test_id | error_id |
|------------|---------|----------|
| 2019-12-20 | 123     | 23       |
| 2019-12-19 | 123     | 23       |
| 2019-12-17 | 123     | 22       |
| 2019-12-18 | 123     | 0        |
| 2019-12-16 | 123     | 11       |
| 2019-12-15 | 123     | 11       |
| 2019-12-13 | 123     | 11       |
| 2019-12-12 | 123     | 0        |
So the result for this example should be:
| 2019-12-18 | 123 | 4 |
as the test 123 was PASS on 2019-12-18 and this happened 4 runs ago.
I have a query to determine whether a given run is an error or not, but I am having trouble applying an appropriate window function to it to get the wanted result:
select test_id, Date, error_id, (CASE WHEN error_id <> 0 THEN 1 ELSE 0 END) as is_error
from testresults
You can generate a row number, in reverse order from the sorting of the query itself:
SELECT test_date, test_id, error_code,
(row_number() OVER (ORDER BY test_date asc) - 1) as runs_since_last_pass
FROM tests
WHERE test_date >= (SELECT MAX(test_date) FROM tests WHERE error_code=0)
ORDER BY test_date DESC
LIMIT 1;
Note that this will run into issues if test_date is not unique. Better use a timestamp (precise to the millisecond) instead of a date.
Here's a DBFiddle: https://www.db-fiddle.com/f/8gSHVcXMztuRiFcL8zLeEx/0
If there's more than one test_id, you'll want to add a PARTITION BY clause to the row number function, and the subquery would become a bit more complex. It may be more efficient to come up with a way to do this by a JOIN instead of a subquery, but it would be more cognitively complex.
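A hedged sketch of that multi-test_id variant (assuming Postgres, which the LIMIT syntax suggests; DISTINCT ON stands in for the LIMIT so that one row is kept per test_id):
SELECT DISTINCT ON (test_id)
       test_id, test_date,
       (row_number() OVER (PARTITION BY test_id ORDER BY test_date ASC) - 1) AS runs_since_last_pass
FROM tests t
WHERE test_date >= (SELECT MAX(p.test_date) FROM tests p
                    WHERE p.error_code = 0 AND p.test_id = t.test_id)
ORDER BY test_id, test_date DESC;
The correlated subquery is the "bit more complex" part: it finds each test_id's own last passing date.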
I think you just want aggregation and some filtering:
select test_id, count(*),
       max(date) filter (where error_id = 0) as last_success_date
from testresults
where date >= (select max(t2.date) from testresults t2 where t2.error_id = 0)
group by test_id;
You have to use the Maximum date of the good runs for every test_id in your query. You can try this query:
select tr2.date_error, tr.test_id, count(tr.error_id)
from testresults tr
inner join (select max(date_error) as date_error, test_id
            from testresults
            where error_id = 0
            group by test_id) tr2
  on tr.test_id = tr2.test_id and tr.date_error >= tr2.date_error
group by tr2.date_error, tr.test_id
This should do the trick:
select count(*) from table t,
(select max(date) date from table where error_id = 0) good
where t.date >= good.date
Basically you are counting the rows that have a date >= the date of the last success.
Please note: if you need the number of days, it is a completely different query:
select now()::date - max(test_date) last_valid from tests
where error_code = 0;

Greatest N Per Group with JOIN and multiple order columns

I have two tables:
Table0:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-18 | 100 |
| aa | 1 | 12-10 | 101 |
| bb | 2 | 12-10 | 102 |
| cc | 1 | 12-09 | 100 |
| cc | 2 | 12-12 | 103 |
| cc | 2 | 12-01 | 109 |
| cc | 1 | 12-07 | 101 |
| dd | 1 | 12-08 | 100 |
and
Table1:
| ID |
|----|
| aa |
| cc |
| cc |
| dd |
| dd |
I'm trying to output results where:
ID must exist in both tables.
TYPE must be the maximum for each ID.
TIME must be the minimum value for the maximum TYPE for each ID.
SITE should be the value from the same row as the minimum TIME value.
Given my sample data, my results should look like this:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-10 | 101 |
| cc | 2 | 12-01 | 109 |
| dd | 1 | 12-08 | 100 |
I've tried these statements:
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MASTY, MIN("TIME") AS MASTM
FROM TABLE0
GROUP BY "ID") AS MAS,
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MSD.MASTY =MA."TYPE"
...which generates a syntax error
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MAB
FROM TABLE0
GROUP BY "ID") AS MAS,
((SELECT "ID", MIN("TIME") AS MACTM, MIN("TYPE") AS MACTY
FROM TABLE0
WHERE "TYPE" = 1
GROUP BY "ID")
UNION
(SELECT "ID", MIN("TIME"), MAX("TYPE")
FROM TABLE0
WHERE "TYPE" = 2
GROUP BY "ID")) AS MACU
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MACU."ID" = QTS."ID"
AND MA."TIME" = MACU.MACTM
AND MA."TYPE" = MACU.MACTB
... which is getting the wrong results.
Answering your direct question "how to avoid...":
You get this error when you specify a column in the SELECT part of a statement that isn't present in the GROUP BY section and isn't wrapped in an aggregating function like MAX, MIN, or AVG.
With your data, I cannot write:
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id
because I haven't said what to do with SITE; it must either be a key of the group (in which case I'll get every unique combination of ID, site and the min time in each) or it must be aggregated (e.g. max site per ID).
These are ok:
SELECT
ID, max(site), min(time)
FROM
table
GROUP BY
id
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id,site
I cannot simply not specify what to do with it: what should the database return in such a case? (If you're still struggling, tell me in the comments what you think the db should do, and I'll better understand your thinking so I can tell you why it can't do that.) The programmer of the database cannot make this decision for you; you must make it.
Usually people ask this when they want to identify:
The min time per ID, and get all the other row data as well. eg "What is the full earliest record data for each id?"
In this case you have to write a query that identifies the min time per id and then join that subquery back to the main data table on id=id and time=mintime. The db runs the subquery, builds a list of min time per id, then that effectively becomes a filter of the main data table
SELECT * FROM
(
SELECT
ID, min(time) as mintime
FROM
table
GROUP BY
id
) findmin
INNER JOIN table t ON t.id = findmin.id and t.time = findmin.mintime
What you cannot do is start putting the other data you want into the query that does the grouping, because you either have to group by the columns you add (which makes the groups more fine-grained, not what you want) or you have to aggregate them (and then the values don't necessarily come from the same row as the other aggregated columns: min time from row 1, min site from row 3 is not what you want).
Looking at your actual problem:
The ID value must exist in both tables.
The TYPE value must be the largest for each ID.
The TIME value must be the smallest within that largest-TYPE group.
Leaving out a solution that involves HAVING or analytics for now, so you can get to grips with the theory here:
You need to find the max type per id, then join that back to the table to get the other relevant data for that id/maxtype (time is needed), and then on this new filtered data set take the id and min time:
SELECT t.id,min(t.time) FROM
(
SELECT
ID, max(type) as maxtype
FROM
table
GROUP BY
id
) findmax
INNER JOIN table t ON t.id = findmax.id and t.type = findmax.maxtype
GROUP BY t.id
If you can't see why, let me know
demo: db<>fiddle
SELECT DISTINCT ON (t0.id)
t0.id,
type,
time,
first_value(site) OVER (PARTITION BY t0.id ORDER BY time) as site
FROM table0 t0
JOIN table1 t1 ON t0.id = t1.id
ORDER BY t0.id, type DESC, time
ID must exist in both tables
This can be achieved by joining both tables against their ids. The result of inner joins are rows that exist in both tables.
SITE should be the value from the same row as the minimum TIME value.
This is the same as "Give me the first value of each group of ids ordered by time". This can be done by using the first_value() window function. Window functions can group your data set (PARTITION BY), so you are getting groups of ids which can be ordered separately. first_value() gives the first value of these ordered groups.
TYPE must be the maximum for each ID.
To get the maximum type per id you'll first have to ORDER BY id, type DESC. You are getting the maximum type as first row per id...
TIME must be the minimum value for the maximum TYPE for each ID.
... Then you can order this result by time additionally, to ensure this condition.
Now you have an ordered data set: For each id, the row with the maximum type and its minimum time is the first one.
DISTINCT ON gives you exactly the first row of each group. In this case the group you defined is (id). The result is your expected one.
I would write this using distinct on and in/exists:
select distinct on (t0.id) t0.*
from table0 t0
where exists (select 1 from table1 t1 where t1.id = t0.id)
order by t0.id, type desc, time asc;

MSSQL Sum of values referenced by another table

I'm attempting to create a report on the total money spent per day.
In the database are the two tables below; rows are matched using a "UID" created at insert time.
I've created this query, but it results in duplicate dates:
Select LEFT(f.timestamp, 10) timestamp, sum(s.Total) Total
FROM dbo.purchasing AS f
Join (SELECT uid,SUM(CONVERT(DECIMAL(18,2), (CONVERT(DECIMAL(18,4), qty) * price))) Total
FROM dbo.purchasingitems
GROUP BY uid)
AS s ON f.uid = s.uid
GROUP BY TIMESTAMP
purchasing:
+--+---------+------------+--------+---+
|ID| UID | timestamp | contact|...|
+--+---------+------------+--------+---+
| 1|abr92nas9| 01/01/2018 | ROB |...|
| 2|nsa93m187| 02/02/2018 | ROB |...|
+--+---------+------------+--------+---+
purchasingitems:
+--+---------+-----+--------+---+
|ID| UID | QTY | Price |...|
+--+---------+-----+--------+---+
| 1|abr92nas9| 20 | 0.2435 |...|
| 2|abr92nas9| 5 | 0.5 |...|
| 3|nsa93m187| 1 | 100 |...|
| 4|nsa93m187| 4 | 15.5 |...|
+--+---------+-----+--------+---+
You need to group by the expression:
SELECT LEFT(f.timestamp, 10) as timestamp, sum(s.Total) as Total
FROM dbo.purchasing f JOIN
(SELECT uid, SUM(CONVERT(DECIMAL(18,2), (CONVERT(DECIMAL(18,4), qty) * price))) as Total
FROM dbo.purchasingitems
GROUP BY uid
) s
ON f.uid = s.uid
GROUP BY LEFT(f.timestamp, 10);
Notes:
You should not be storing date/time values as strings (unless you have a really good reason). If timestamp is a date, you should use cast(timestamp as date).
You should not be using string functions on date/times.
timestamp is a keyword in SQL Server (although not reserved), so it is not a good choice for a column name.
Your problem is that you think GROUP BY timestamp refers to the expression in the SELECT. SQL Server does not support column aliases in the GROUP BY, so it can only refer to the column of that name.
I don't see a reason to convert to decimal for the multiplication. You might have a good reason.
You probably want order by as well, to ensure that the result set is in a sensible order. A sketch combining these notes follows below.
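Putting these notes together, a sketch of how the query might look (assuming timestamp can be cast to date):
SELECT CAST(f.timestamp AS date) AS purchase_date, SUM(s.Total) AS Total
FROM dbo.purchasing f JOIN
     (SELECT uid, SUM(CONVERT(DECIMAL(18,2), CONVERT(DECIMAL(18,4), qty) * price)) AS Total
      FROM dbo.purchasingitems
      GROUP BY uid
     ) s
     ON f.uid = s.uid
GROUP BY CAST(f.timestamp AS date)
ORDER BY purchase_date;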
The data you posted does NOT produce duplicates.
There is no reason for the subquery:
Select LEFT(f.timestamp, 10) timestamp,
SUM(CONVERT(DECIMAL(18,2), (CONVERT(DECIMAL(18,4), s.qty) * s.price))) Total
FROM dbo.purchasing AS f
join dbo.purchasingitems s
ON f.uid = s.uid
GROUP BY f.TIMESTAMP

Count and name content from a SQL Server table

I have a table which is structured like this:
+-----+-------------+-------------------------+
| id | name | timestamp |
+-----+-------------+-------------------------+
| 1 | someName | 2016-04-20 09:41:41.213 |
| 2 | someName | 2016-04-20 09:42:41.213 |
| 3 | anotherName | 2016-04-20 09:43:41.213 |
| ... | ... | ... |
+-----+-------------+-------------------------+
Now, I am trying to create a query which selects all timestamps since time x and counts the number of times the same name occurs in the result.
As an example, if we apply this query to the table above, with 2016-04-20 09:40:41.213 as the date from which it should count, the result should look like this:
+-------------+-------+
| name | count |
+-------------+-------+
| someName | 2 |
| anotherName | 1 |
+-------------+-------+
What I have accomplished so far is the following query, which gives me the names, but not their count:
WITH screenshots AS
(
SELECT * FROM SavedScreenshotsLog
WHERE timestamp > '2016-04-20 09:40:41.213'
)
SELECT s.name
FROM SavedScreenshotsLog s
INNER JOIN screenshots sc ON sc.name = s.name AND sc.timestamp = s.timestamp
ORDER BY s.name
I have browsed through stackoverflow but was not able to find a solution which fits my needs and as I am not very experienced with SQL, I am out of ideas.
You mention one table in your question, and then show a query with two tables. That makes it hard to follow the question.
What you are asking for is a simple aggregation:
SELECT name, COUNT(*)
FROM SavedScreenshotsLog
WHERE timestamp > '2016-04-20 09:40:41.213'
GROUP BY name
ORDER BY COUNT(*) DESC;
EDIT:
If you want "0" values, you can use conditional aggregation:
SELECT name,
SUM(CASE WHEN timestamp > '2016-04-20 09:40:41.213' THEN 1 ELSE 0 END) as cnt
FROM SavedScreenshotsLog
GROUP BY name
ORDER BY cnt DESC;
Note that this will run slower because there is no filter on the dates prior to aggregation.
CREATE TABLE #TEST (name varchar(100), dt datetime)
INSERT INTO #TEST VALUES ('someName','2016-04-20 09:41:41.213')
INSERT INTO #TEST VALUES ('someName','2016-04-20 09:41:41.213')
INSERT INTO #TEST VALUES ('anotherName','2016-04-20 09:43:41.213')
declare #YourDatetime datetime = '2016-04-20 09:41:41.213'
SELECT name, count(dt)
FROM #TEST
WHERE dt >= #YourDatetime
GROUP BY name
I've posted this answer because the query above can generate errors when converting the string in the WHERE clause into a datetime; it depends on the format of the datetime.
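A hedged sketch of one way around that risk: the ISO 8601 format with the T separator is parsed the same way by SQL Server regardless of the language and DATEFORMAT settings:
-- unambiguous under any SET LANGUAGE / SET DATEFORMAT
declare @YourDatetime datetime = '2016-04-20T09:41:41.213'
SELECT name, count(dt)
FROM #TEST
WHERE dt >= @YourDatetime
GROUP BY name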

Trending sum over time

I have a table (in Postgres 9.1) that looks something like this:
CREATE TABLE actions (
user_id INTEGER,
date DATE,
action VARCHAR(255),
count INTEGER
)
For example:
 user_id | date       | action    | count
---------+------------+-----------+-------
 1       | 2013-01-01 | Email     | 1
 1       | 2013-01-02 | Call      | 3
 1       | 2013-01-03 | Email     | 3
 1       | 2013-01-04 | Call      | 2
 1       | 2013-01-04 | Voicemail | 2
 1       | 2013-01-04 | Email     | 2
 2       | 2013-01-04 | Email     | 2
I would like to be able to view a user's total actions over time for a specific set of actions; for example, Calls + Emails:
 user_id | date       | count
---------+------------+-------
 1       | 2013-01-01 | 1
 1       | 2013-01-02 | 4
 1       | 2013-01-03 | 7
 1       | 2013-01-04 | 11
 2       | 2013-01-04 | 2
The monstrosity that I've created so far looks like this:
SELECT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM
actions
WHERE
action IN ('Call', 'Email')
GROUP BY
user_id, date, count;
Which works for single actions, but seems to break for multiple actions when they happen on the same day, for example instead of the expected 11 on 2013-01-04, we get 9:
 date       | user_id | count
------------+---------+-------
 2013-01-01 | 1       | 1
 2013-01-02 | 1       | 4
 2013-01-03 | 1       | 7
 2013-01-04 | 1       | 9   <-- should be 11?
 2013-01-04 | 2       | 2
Is it possible to tweak my query to resolve this issue? I tried removing the grouping on count, but Postgres doesn't seem to like that:
column "actions.count" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 2: date, user_id, SUM(count) OVER (PARTITION BY user...
^
This query produces the result you are looking for:
SELECT DISTINCT
date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email');
The default window is already what you want, according to the official docs, and the DISTINCT eliminates duplicate rows when both Emails and Calls happen on the same day.
See SQL Fiddle.
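For reference, the default frame that answer relies on can be spelled out explicitly. This is the same query; writing the frame out makes it visible that peers (rows sharing a date) are summed together:
SELECT DISTINCT
    date, user_id,
    SUM(count) OVER (PARTITION BY user_id ORDER BY date
                     RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count
FROM actions
WHERE
    action IN ('Call', 'Email');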
The table has a column named "count", and the expression in the SELECT clause is aliased as "count", so it is ambiguous.
Read documentation: http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-GROUPBY
In case of ambiguity, a GROUP BY name will be interpreted as an
input-column name rather than an output column name.
That means that your query does not group by the "count" evaluated in the SELECT clause; rather, it groups by the "count" values taken from the table.
This query gives expected results, see SQL Fiddle
SELECT date, user_id, count
from (
Select date, user_id,
SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM actions
WHERE
action IN ('Call', 'Email')
) alias
GROUP BY
user_id, date, count;
Asserts
It is unclear whether you want to sort by user_id or date.
It is also unclear whether you want to include dates in the result list for which there is no row in the base table. In that case, refer to this closely related answer:
PostgreSQL: running count of rows for a query 'by minute'
Repair names
First off, I am using this test table instead of your problematic table:
CREATE TEMP TABLE actions (
user_id integer,
thedate date,
action text,
ct integer
);
Your use of reserved words and function names as identifiers (column names) is part of the problem.
Repair query
Combine aggregate and window functions
Since aggregate functions are applied first, your original query lumps the two rows found for user_id = 1 and thedate = '2013-01-04' into one. You have to multiply by count(*) to get the actual running count.
You can do this without a subquery, since you can combine aggregate functions and window functions. Aggregate functions are applied first. You can even have a window function over the result of aggregate functions.
SELECT thedate
, user_id
, sum(ct * count(*)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
GROUP BY user_id, thedate, ct
ORDER BY user_id, thedate;
Or simplify to:
...
, sum(sum(ct)) OVER (PARTITION BY user_id
ORDER BY thedate) AS running_ct
...
This should also be the fastest of the solutions presented.
Here, the inner sum() is an aggregate function, while the outer sum() is a window function - over the result of the aggregate function.
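Spelled out in full, the simplified variant reads as follows; note that ct can drop out of the GROUP BY once it is aggregated:
SELECT thedate
     , user_id
     , sum(sum(ct)) OVER (PARTITION BY user_id
                          ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
GROUP BY user_id, thedate
ORDER BY user_id, thedate;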
Or use DISTINCT
Another way would be to use DISTINCT or DISTINCT ON, since those are applied after window functions:
DISTINCT - this is possible, since running_ct is guaranteed to be the same in this case anyway, since all peers are summed at once for the default frame definition of window functions.
SELECT DISTINCT
thedate
, user_id
, sum(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM actions
WHERE action IN ('Call', 'Email')
ORDER BY thedate, user_id;
Or simplify with DISTINCT ON:
SELECT DISTINCT ON (thedate, user_id)
...
->SQLfiddle demonstrating all variants.