SQL - Count Duplicate Values Within Date Range - sql

I have a simple, but large, database that I need to write a SQL statement for. The statements needs to do the following:
Get the 15 most popular values for a field.
From those 15, get the count that value has appeared within a particular time period.
My table contains both a Date and a Value field. I am able to extract the 15 most popular values, or get the count for a particular value in a given time period. I do not know how to put the two together.
This is my current SQL:
SELECT
Count( Value ) AS Total,
Value AS Value
FROM
Database
GROUP BY
Value
ORDER BY
Total DESC
LIMIT 15
That will get my most popular 15. But from that, I want to display the COUNT() that each Value is between two dates.
Would this require a HAVING clause?

I simplified the previous solution (which would also do a job) a little bit:
SELECT
Value,
Count(*) as TotalInPeriod
FROM Database
WHERE Value in (SELECT Value FROM Database GROUP BY Value
ORDER BY count(*) DESC LIMIT 15)
AND date_field BETWEEN your_start_date and your_end_date
GROUP BY Value

Try something like this. Make an inner query that finds the top 15 values overall, and join it to the main set to limit it to those values.
SELECT
Count( Value) as TotalInPeriod,
Value as Value
FROM
Database a
JOIN (SELECT
Count( Value ) AS Total,
Value AS Value
FROM
Database
GROUP BY
Value
ORDER BY
Total DESC
LIMIT 15) as topValues
ON
a.Value = topValues.Value
WHERE
a.date_field BETWEEN your_start_date and your_end_date
GROUP BY
a.Value

Related

Access query finding the duplicates without using DISTINCT

i have this query
SELECT PersonalInfo.id, PersonalInfo.[k-commission], Abs(Not IsNull([PersonalInfo]![k-commission].[Value])) AS CommissionAbsent
FROM PersonalInfo;
and the PersonalInfo.k-commission is a multi value field. the CommissionAbsent shows duplicate values for each k-commission value. when i use DISTINCT i get an error saying that the keyword cannot be used with a multi value field.
now i want to remove the duplicates and show only one result for each. i tried using a WHERE but i dont know how.
edit: i have a lot more columnes and in the example i only showed the few i need.
You can use GROUP BY and COUNT to solve your problem, here is an example for it
SELECT clmn1, clmn2, COUNT(*) as count
FROM table
GROUP BY clmn1, clmn2
HAVING COUNT(*) > 1;
the query groups the rows in the table by the clmn1 and clmn2 columns, and counts the number of occurrences of each group. The HAVING clause is then used to filter the groups and only return the groups that have a count greater than 1, which indicates duplicates.
If you want to select all, then you can do like this
SELECT *
FROM table
WHERE (clmn1, clmn2) IN (SELECT clmn1, clmn2
FROM table
GROUP BY clmn1, clmn2
HAVING COUNT(*) > 1)
SELECT PersonalInfo.id, PersonalInfo.[k-commission], Abs(Not IsNull([PersonalInfo]![k-commission].[Value])) AS CommissionAbsent
FROM PersonalInfo
GROUP BY PersonalInfo.id, PersonalInfo.[k-commission], Abs(Not IsNull([PersonalInfo]![k-commission].[Value]))
HAVING COUNT(*) > 1

SQL Server : verify that two columns are in same sort order

I have a table with an ID and a date column. It's possible (likely) that when a new record is created, it gets the next larger ID and the current datetime. So if I were to sort by date or I were to sort by ID, the resulting data set would be in the same order.
How do I write a SQL query to verify this?
It's also possible that an older record is modified and the date is updated. In that case, the records would not be in the same sort order. I don't think this happens.
I'm trying to move the data to another location, and if I know that there are no modified records, that makes it a lot simpler.
I'm pretty sure I only need to query those two columns: ID, RecordDate. Other links indicate I should be able to use LAG, but I'm getting an error that it isn't a built-in function name.
In other words, both https://dba.stackexchange.com/questions/42985/running-total-to-the-previous-row and Is there a way to access the "previous row" value in a SELECT statement? should help, but I'm still not able to make that work for what I want.
If you cannot use window functions, you can use a correlated subquery and EXISTS.
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.id < t1.id
AND t2.recorddate > t1.recorddate);
It'll select all records where another record with a lower ID and a greater timestamp exists. If the result is empty you know that no such record exists and the data is like you want it to be.
Maybe you want to restrict it a bit more by using t2.recorddate >= t1.recorddate instead of t2.recorddate > t1.recorddate. I'm not sure how you want it.
Use this:
SELECT ID, RecordDate FROM tablename t
WHERE
(SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
<>
(SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate);
It counts for each row, how many rows have id less than the row's id and
how many rows have RecordDate less than the row's RecordDate.
If these counters are not equal then it outputs this row.
The result is all the rows that would not be in the same position after sorting by ID and RecordDate
One method uses window functions:
select count(*)
from (select t.*,
row_number() over (order by id) as seqnum_id,
row_number() over (order by date, id) as seqnum_date
from t
) t
where seqnum_id <> seqnum_date;
When the count is zero, then the two columns have the same ordering. Note that the second order by includes id. Two rows could have the same date. This makes the sort stable, so the comparison is valid even when date has duplicates.
the above solutions are all good but if both dates and ids are in increment then this should also work
select modifiedid=t2.id from
yourtable t1 join yourtable t2
on t1.id=t2.id+1 and t1.recordDate<t2.recordDate

Use of HAVING without GROUP BY not working as expected

I am starting to learn SQL Server, in the documentation found in msdn states like this
HAVING is typically used with a GROUP BY clause. When GROUP BY is not used, there is an implicit single, aggregated group.
This made me to think that we can use having without a groupBy clause, but when I am trying to make a query I am not able to use it.
I have a table like this
CREATE TABLE [dbo].[_abc]
(
[wage] [int] NULL
) ON [PRIMARY]
GO
INSERT INTO [dbo].[_abc] (wage)
VALUES (4), (8), (15), (30), (50)
GO
Now when I run this query, I get an error
select *
from [dbo].[_abc]
having sum(wage) > 5
Error:
The documentation is correct; i.e. you could run this statement:
select sum(wage) sum_of_all_wages
, count(1) count_of_all_records
from [dbo].[_abc]
having sum(wage) > 5
The reason your statement doesn't work is because of the select *, which means select every columns' value. When there is no group by, all records are aggregated; i.e. you only get 1 record in your result set which has to represent every record. As such, you can only* include values provided by applying aggregate functions to your columns; not the columns themselves.
* of course, you can also provide constants, so select 'x' constant, count(1) cnt from myTable would work.
There aren't many use cases I can think of where you'd want to use having without a group by, but certainly it can be done as shown above.
NB: If you wanted all rows where the wage was greater than 5, you'd use the where clause instead:
select *
from [dbo].[_abc]
where wage > 5
Equally, if you want the sum of all wages greater than 5 you can do this
select sum(wage) sum_of_wage_over_5
from [dbo].[_abc]
where wage > 5
Or if you wanted to compare the sum of wages over 5 with those under:
select case when wage > 5 then 1 else 0 end wage_over_five
, sum(wage) sum_of_wage
from [dbo].[_abc]
group by case when wage > 5 then 1 else 0 end
See runnable examples here.
Update based on comments:
Do you need having to use aggregate functions?
No. You can run select sum(wage) from [dbo].[_abc]. When an aggregate function is used without a group by clause, it's as if you're grouping by a constant; i.e. select sum(wage) from [dbo].[_abc] group by 1.
The documentation merely means that whilst normally you'd have a having statement with a group by statement, it's OK to exclude the group by / in such cases the having statement, like the select statement, will treat your query as if you'd specified group by 1
What's the point?
It's hard to think of many good use cases, since you're only getting one row back and the having statement is a filter on that.
One use case could be that you write code to monitor your licenses for some software; if you have less users than per-user-licenses all's good / you don't want to see the result since you don't care. If you have more users you want to know about it. E.g.
declare #totalUserLicenses int = 100
select count(1) NumberOfActiveUsers
, #totalUserLicenses NumberOfLicenses
, count(1) - #totalUserLicenses NumberOfAdditionalLicensesToPurchase
from [dbo].[Users]
where enabled = 1
having count(1) > #totalUserLicenses
Isn't the select irrelevant to the having clause?
Yes and no. Having is a filter on your aggregated data. Select says what columns/information to bring back. As such you have to ask "what would the result look like?" i.e. Given we've had to effectively apply group by 1 to make use of the having statement, how should SQL interpret select *? Since your table only has one column this would translate to select wage; but we have 5 rows, so 5 different values of wage, and only 1 row in the result to show this.
I guess you could say "I want to return all rows if their sum is greater than 5; otherwise I don't want to return any rows". Were that your requirement it could be achieved a variety of ways; one of which would be:
select *
from [dbo].[_abc]
where exists
(
select 1
from [dbo].[_abc]
having sum(wage) > 5
)
However, we have to write the code to meet the requirement, rather than expect the code to understand our intent.
Another way to think about having is as being a where statement applied to a subquery. I.e. your original statement effectively reads:
select wage
from
(
select sum(wage) sum_of_wage
from [dbo].[_abc]
group by 1
) singleRowResult
where sum_of_wage > 5
That won't run because wage is not available to the outer query; only sum_of_wage is returned.
HAVING without GROUP BY clause is perfectly valid but here is what you need to understand:
The result will contain zero or one row
The implicit GROUP BY will return exactly one row even if the WHERE condition matched zero rows
HAVING will keep or eliminate that single row based on the condition
Any column in the SELECT clause needs to be wrapped inside an aggregate function
You can also specify an expression as long as it is not functionally dependent on the columns
Which means you can do this:
SELECT SUM(wage)
FROM employees
HAVING SUM(wage) > 100
-- One row containing the sum if the sum is greater than 5
-- Zero rows otherwise
Or even this:
SELECT 1
FROM employees
HAVING SUM(wage) > 100
-- One row containing "1" if the sum is greater than 5
-- Zero rows otherwise
This construct is often used when you're interested in checking if a match for the aggregate was found:
SELECT *
FROM departments
WHERE EXISTS (
SELECT 1
FROM employees
WHERE employees.department = departments.department
HAVING SUM(wage) > 100
)
-- all departments whose employees earn more than 100 in total
In SQL you cannot return aggregate functioned columns directly. You need to group the non aggregate fields
As shown below example
USE AdventureWorks2012 ;
GO
SELECT SalesOrderID, SUM(LineTotal) AS SubTotal
FROM Sales.SalesOrderDetail
GROUP BY SalesOrderID
HAVING SUM(LineTotal) > 100000.00
ORDER BY SalesOrderID ;
In your case you don't have identity column for your table it should come as below
Alter _abc
Add Id_new Int Identity(1, 1)
Go

Condition inside a count -SQL

I am trying to write a condition inside a count statement where it should only count the entries which do not have an ENDDATE. i am looking for writing the condition inside the count as this is a very small part of a large SQl Query
sample query,
select product, count(*) as quantity
from table
where end_date is null
group by age
This query lists quantity for each product which do not have an end date
One method uses conditional aggregation:
select sum(case when end_date is null then 1 else 0 end) as NumNull
. . .
Another method is just to subtract two counts:
select ( count(*) - count(end_date) ) as NumNull
count(end_date) counts the number that are not NULL, so subtracting this from the full count gets the number that are NULL.
Uhmmmm.
It sounds like you are looking for conditional aggregation.
So, if you have a current statement that's sort of working (and we're just guessing because we don't see anything you have attempted so far...)
SELECT COUNT(1)
FROM mytable t
And you want another another expression that returns a count of rows that meet some set of conditions...
and when you say "do not have an ENDDATE", you are refderring to rows that have an ENDDATE value of NULL (and again, we're just guessing that the table has a column named ENDDATE. Every row will have an ENDDATE column.)
We'll use a ANSI standards compliant CASE expression, because this would work in most databases (SQL Server, Oracle, MySQL, Postgres... and we don't have clue what database you are using.
SELECT COUNT(1)
, COUNT(CASE WHEN t.ENDDATE IS NULL THEN 1 ELSE NULL END) AS cnt_null_enddate
FROM mytable t

sql get max based on field

I need to get the ID based from what ever the max amount is. Below is giving me an error
select ID from Prog
where Amount = MAX(Amount)
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
The end result is that I need to get the just the ID as I need to pass it something else that is expecting it.
You need to order by Amount and select 1 record instead...
SELECT ID
FROM Prog
ORDER BY Amount DESC
LIMIT 1;
This takes all the rows in Prog, orders them in descending order by Amount (in other words, the first sorted row has the highest Amount), then limits the query to select only one row (the one with the highest Amount).
Also, subqueries are bad for performance. This code runs on a table with 200k records in half the time as the subquery versions.
Just pass a subquery with the max value to the where clause :
select ID from Prog
where Amount = (SELECT MAX(Amount) from Prog)
If you're using SQL Server that should do it :
SELECT TOP 1 ID
FROM Prog
ORDER BY Amount DESC
This should be something like:
select P.ID from Prog P
where P.Amount = (select max(Amount) from Prog)
EDIT:
If you really want only 1 row, you should do:
select max(P.ID) from Prog P
where P.Amount = (select max(Amount) from Prog);
However, if you have multiple rows that would match amount and you only want 1 row, you should have some kind of logic behind how you pick your one row. Not just relying on this max trick, or limit 1 type logic.
Also, I don't write limit 1, because this is not ANSI sql -- it works in mysql but OP doesn't say what he wants. Every db is different -- see here: Is there an ANSI SQL alternative to the MYSQL LIMIT keyword? Don't get used to one db's extensions unless you only want to work in 1 db for the rest of your life.
select min(ID) from Prog
where Amount in
(
select max(amount)
from prog
)
The min statement ensures that you get only one result.