Without using DISTINCT, how to group data without altering value? - sql

I have a feeling this is a dumb question with a simple answer, but here goes.
How can I group the following data without using DISTINCT? @table has 5 rows, which show data for hours 5-9. I just don't like DISTINCT.
Since I need to display all hours of the day up to hour 9 (including 0-4), I'm joining it with the table DimTime. DimTime has all hours, but in 15-minute intervals, so it looks like this:
Hour Minute
0 0
0 15
0 30
0 45
1 0
1 15
1 30
1 45
So here's my script:
declare @table table
(
Hour int,
Value int
)
insert into @table select 5, 25
insert into @table select 6, 34
insert into @table select 7, 54
insert into @table select 8, 65
insert into @table select 9, 11
select d.hour, t.hour, sum(value)
from @table t
left join dimtime d on d.hour = t.hour
group by d.hour, t.hour
If I use GROUP BY, then I need to have an aggregate function. So if I use SUM, it'll multiply all values by 4. If I remove the aggregate function, I'll get a syntax error.
Also, I cannot use a CTE since the contents of @table come from a CTE (I just didn't include it here).
Here's the result that I need to display:
Hour Value
0 null
1 null
2 null
3 null
4 null
5 25
6 34
7 54
8 65
9 11

Simply add a condition WHERE minute = 0 to return only one row per hour.

If you really wish to avoid the sorting operation on dimtime that DISTINCT would cause, see the explanation below.
Display all hours (0-9) from dimtime and sum the value given in @table for a particular hour:
SELECT
d.hour, SUM(t.value)
FROM
dimtime d
LEFT JOIN @table t
ON d.hour = t.hour
WHERE d.minute = 0 -- retrieves one row for every hour from dimtime
GROUP BY d.hour
ORDER BY d.hour -- not needed, but will give you resultset sorted by hour
Assuming dimtime has a row with minute = 0 for every hour, you can simply limit the rows retrieved for the join to those. It works equally well with any single value from the list 0, 15, 30, 45.
SUM() will work properly by summing all the values for a given hour in @table. If there are no rows for a particular hour, it returns NULL, which matches the result you asked for.
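To check the behaviour without a SQL Server instance, here is a minimal sketch that runs the same join in SQLite through Python's sqlite3 module. The table name hourly is an assumption standing in for the table variable, and dimtime is filled only for hours 0-9. Hours with no matching row come back as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimtime (hour INTEGER, minute INTEGER);
    CREATE TABLE hourly (hour INTEGER, value INTEGER);  -- stand-in for the table variable
""")
# dimtime holds four 15-minute rows per hour, here only for hours 0-9
conn.executemany(
    "INSERT INTO dimtime VALUES (?, ?)",
    [(h, m) for h in range(10) for m in (0, 15, 30, 45)],
)
conn.executemany(
    "INSERT INTO hourly VALUES (?, ?)",
    [(5, 25), (6, 34), (7, 54), (8, 65), (9, 11)],
)

# Keeping only minute = 0 leaves one dimtime row per hour, so the join
# no longer repeats each value four times.
rows = conn.execute("""
    SELECT d.hour, SUM(t.value)
    FROM dimtime d
    LEFT JOIN hourly t ON d.hour = t.hour
    WHERE d.minute = 0
    GROUP BY d.hour
    ORDER BY d.hour
""").fetchall()

for row in rows:
    print(row)
```

Starting from dimtime rather than from the data table is what keeps hours 0-4 in the result.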

You should have a better reason for not using a programming function than "I just don't like it".
You can have a CTE that uses another CTE.
@dnoeth provided an excellent answer, but here's another option:
SELECT
d.hour,
t.value
FROM
@table t
INNER JOIN (SELECT DISTINCT hour FROM dimTime) d ON d.hour = t.hour

Try
SELECT *
FROM DimTime D
LEFT JOIN myTable T
ON D.Hour = T.Hour
WHERE D.Minute = 0
Output
| Hour | Minute | Hour | Value |
|------|--------|--------|--------|
| 0 | 0 | (null) | (null) |
| 1 | 0 | (null) | (null) |
| 2 | 0 | (null) | (null) |
| 3 | 0 | (null) | (null) |
| 4 | 0 | (null) | (null) |
| 5 | 0 | 5 | 25 |
| 6 | 0 | 6 | 34 |
| 7 | 0 | 7 | 54 |
| 8 | 0 | 8 | 65 |
| 9 | 0 | 9 | 11 |

"If I use GROUP BY, then I need to have an aggregate function"
Only if you include expressions in your SELECT that are not part of your group key. You could certainly do
select d.hour, t.hour, value
from @table t
inner join dimtime d
on d.hour = t.hour
group by d.hour, t.hour, value
or
select d.hour, t.hour, MIN(value)
from @table t
inner join dimtime d
on d.hour = t.hour
group by d.hour, t.hour
Note that the first query gives you the exact same results as DISTINCT (and may even be compiled to the same query plan) so I'm not sure what your aversion is to DISTINCT.
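A quick way to convince yourself of the equivalence is to run both forms side by side. This is a sketch in SQLite via Python's sqlite3, with hourly as an assumed stand-in for the table variable. It only compares results, not query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimtime (hour INTEGER, minute INTEGER);
    CREATE TABLE hourly (hour INTEGER, value INTEGER);  -- stand-in for the table variable
""")
conn.executemany(
    "INSERT INTO dimtime VALUES (?, ?)",
    [(h, m) for h in range(10) for m in (0, 15, 30, 45)],
)
conn.executemany(
    "INSERT INTO hourly VALUES (?, ?)",
    [(5, 25), (6, 34), (7, 54), (8, 65), (9, 11)],
)

# GROUP BY over every selected column, no aggregate function needed
grouped = conn.execute("""
    SELECT d.hour, t.value
    FROM hourly t
    INNER JOIN dimtime d ON d.hour = t.hour
    GROUP BY d.hour, t.value
    ORDER BY d.hour
""").fetchall()

# The same query written with DISTINCT
distinct = conn.execute("""
    SELECT DISTINCT d.hour, t.value
    FROM hourly t
    INNER JOIN dimtime d ON d.hour = t.hour
    ORDER BY d.hour
""").fetchall()

print(grouped == distinct)
```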

Related

T-SQL, how to get IDs that visit X amount of location within X amount of time?

T-SQL question: I've been trying to find the best solution for this one.
Say we have this theoretical table:
-----------------------------------
ID | DATETIME | Location
11 | 1/27 3:30pm | a
11 | 1/27 3:31pm | b
11 | 1/27 3:32pm | c
22 | 2/14 1:10pm | g
22 | 2/14 1:12pm | i
22 | 2/15 5:48pm | w
55 | 3/18 8:48pm | d
55 | 3/18 9:48pm | e
---------------------------
I want to create a query that returns IDs that have been in 2 or more different locations within 5 minutes. In this table, IDs 11 and 22 each visit 2 or more different locations within 5 minutes, so the query should return 11 and 22. How do I write a query that returns the IDs that have been to X locations within X minutes?
I suggest using cross apply:
select t.*, ca.num_visit
from table1 as t
cross apply (
select count(distinct c.Location) num_visit -- count different locations, not rows
from table1 as c
where c.id = t.id
and c.DATETIME >= t.DATETIME -- include the current row's own location
and c.DATETIME <= dateadd(minute,5,t.DATETIME)
) ca
where num_visit >= 2
If you assume that the locations are different on each row for a given id, you can use lead()/lag():
select id, datetime
from (select t.*,
lead(datetime) over (partition by id order by datetime) as next_datetime
from t
) t
where next_datetime < dateadd(minute, 5, datetime);
This is not a general solution to the problem. But it does solve the particular example you have in your question.
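For the general "X locations within X minutes" form, here is a hedged sketch that parameterises both numbers. It uses SQLite through Python's sqlite3 rather than T-SQL, so datetime(..., '+5 minutes') takes the place of dateadd. The table name visits, the column visited_at, the function name ids_with_visits and the year 2024 are all assumptions, since the sample data gives no year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INTEGER, visited_at TEXT, location TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    (11, "2024-01-27 15:30:00", "a"),
    (11, "2024-01-27 15:31:00", "b"),
    (11, "2024-01-27 15:32:00", "c"),
    (22, "2024-02-14 13:10:00", "g"),
    (22, "2024-02-14 13:12:00", "i"),
    (22, "2024-02-15 17:48:00", "w"),
    (55, "2024-03-18 20:48:00", "d"),
    (55, "2024-03-18 21:48:00", "e"),
])


def ids_with_visits(conn, min_locations, window_minutes):
    """Return ids that reach min_locations distinct locations inside a
    window of window_minutes starting at any one of their own visits."""
    query = """
        SELECT DISTINCT a.id
        FROM visits a
        JOIN visits b
          ON b.id = a.id
         AND b.visited_at >= a.visited_at
         AND b.visited_at <= datetime(a.visited_at, ?)
        GROUP BY a.id, a.visited_at
        HAVING COUNT(DISTINCT b.location) >= ?
        ORDER BY a.id
    """
    modifier = f"+{window_minutes} minutes"
    return [row[0] for row in conn.execute(query, (modifier, min_locations))]


print(ids_with_visits(conn, 2, 5))
print(ids_with_visits(conn, 3, 5))
```

Each visit opens its own window, and COUNT(DISTINCT location) makes repeated visits to the same place count once. The self-join is quadratic per id, so on a large table an index on (id, visited_at) would likely matter.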

How to calculate moving average in SQL?

I've a table with 2 columns in SQL
+------+--------+
| WEEK | OUTPUT |
+------+--------+
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
| 6 | 50 |
+------+--------+
How do I sum the output for the current week and the 2 weeks before it (e.g. for week 3, sum the output of weeks 3, 2 and 1)? I've seen many tutorials on moving averages, but they all use dates. In my case I want to use an integer column. Is that possible?
Thanks!
I think you want something like this :
SELECT *,
(SELECT Sum(output)
FROM table1 b
WHERE b.week IN( a.week, a.week - 1, a.week - 2 )) AS SUM
FROM table1 a
Alternatively, the IN clause can be rewritten as BETWEEN a.week - 2 AND a.week.
You can use a self-join. The idea is to put your table beside itself with a join condition that pairs each row with the rows that should contribute to it:
SELECT * FROM [output] o1
INNER JOIN [output] o2 ON o1.Week between o2.Week and o2.Week + 2
This select produces the following output:
o1.Week o1.Output o2.Week o2.Output
--------------------------------------------
1 10 1 10
2 20 1 10
2 20 2 20
3 30 1 10
3 30 2 20
3 30 3 30
4 40 2 20
4 40 3 30
4 40 4 40
and so on. Note that weeks 1 and 2 have fewer than two previous weeks available, so their sums cover fewer rows.
Now you should just group the data by o1.Week and get the SUM:
SELECT o1.Week, SUM(o2.Output)
FROM [output] o1
INNER JOIN [output] o2 ON o1.Week between o2.Week and o2.Week + 2
GROUP BY o1.Week
If the week values are continuous (no gaps), you can simply use a window function:
SELECT [Week], [Output],
SUM([Output]) OVER (ORDER BY [Week] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM dbo.SomeTable
RANGE would be more accurate for your calculation, but SQL Server does not support RANGE with a numeric offset. Other database engines may support it:
SELECT [Week], [Output],
SUM([Output]) OVER (ORDER BY [Week] RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM dbo.SomeTable
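The difference between counting rows and comparing week values only shows up when a week is missing. The sketch below uses SQLite via Python's sqlite3 (window functions need SQLite 3.25 or later) on made-up data where week 4 is absent. The table name weekly is an assumption. It compares a ROWS frame with the value-based self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly (week INTEGER, output INTEGER)")
# Hypothetical data: week 4 is deliberately missing
conn.executemany(
    "INSERT INTO weekly VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (5, 50), (6, 60)],
)

# ROWS counts physical rows: the current row plus the two rows before it
by_rows = conn.execute("""
    SELECT week,
           SUM(output) OVER (ORDER BY week
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM weekly
    ORDER BY week
""").fetchall()

# The self-join compares week numbers, so a missing week contributes nothing
by_value = conn.execute("""
    SELECT o1.week, SUM(o2.output)
    FROM weekly o1
    INNER JOIN weekly o2 ON o2.week BETWEEN o1.week - 2 AND o1.week
    GROUP BY o1.week
    ORDER BY o1.week
""").fetchall()

print(by_rows)
print(by_value)
```

For week 5 the ROWS frame reaches back to week 2, while the value-based join only sees weeks 3 and 5. With no gaps in the data both queries return the same sums.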
Try this:
SELECT t1.week, SUM(t2.output) / 3.0 AS moving_avg
FROM yourtable t1
INNER JOIN yourtable t2
ON t2.week BETWEEN t1.week - 2 AND t1.week -- current week and the two before it
GROUP BY t1.week
You haven't said which SQL Server version you are using. If it is SQL Server 2012 or later, a simple example is:
declare @table table(wk int,outpt int )
insert into @table values (1,10)
,(2,20)
,(3,30)
,(4,40)
,(5,50)
,(6,60)
-- order by wk so the running sum follows the weeks deterministically
select *,SUM(outpt) over(partition by id order by wk rows between unbounded preceding and current row ) dd
from (
select * , 1 id
from @table
where wk < 5
) a

How to calculate the value of a previous row from the count of another column

I want to add a column that holds a running total: each row's value is its own count plus the sum from the preceding row. Below is my query. I tried using ROLLUP but it does not serve the purpose.
select to_char(register_date,'YYYY-MM') as "registered_in_month"
,count(*) as Total_count
from CMSS.USERS_PROFILE a
where a.pcms_db != '*'
group by (to_char(register_date,'YYYY-MM'))
order by to_char(register_date,'YYYY-MM')
This is what I get:
registered_in_month TOTAL_COUNT
-------------------------------------
2005-01 1
2005-02 3
2005-04 8
2005-06 4
But what I would like to display is below, including the months that have a count of 0:
registered_in_month TOTAL_COUNT SUM
------------------------------------------
2005-01 1 1
2005-02 3 4
2005-03 0 4
2005-04 8 12
2005-05 0 12
2005-06 4 16
To include missing months in your result, you first need a complete list of months. To build it, find the earliest and latest month and then use a hierarchical query to generate the complete list.
with x(min_date, max_date) as (
select min(trunc(register_date,'month')),
max(trunc(register_date,'month'))
from users_profile
)
select add_months(min_date,level-1)
from x
connect by add_months(min_date,level-1) <= max_date;
Once you have all the months, you can outer join the list to your table. To get the cumulative sum, add up the counts using SUM as an analytic function.
with x(min_date, max_date) as (
select min(trunc(register_date,'month')),
max(trunc(register_date,'month'))
from users_profile
),
y(all_months) as (
select add_months(min_date,level-1)
from x
connect by add_months(min_date,level-1) <= max_date
)
select to_char(a.all_months,'yyyy-mm') registered_in_month,
count(b.register_date) total_count,
sum(count(b.register_date)) over (order by a.all_months) "sum"
from y a left outer join users_profile b
on a.all_months = trunc(b.register_date,'month')
group by a.all_months
order by a.all_months;
Output:
| REGISTERED_IN_MONTH | TOTAL_COUNT | SUM |
|---------------------|-------------|-----|
| 2005-01 | 1 | 1 |
| 2005-02 | 3 | 4 |
| 2005-03 | 0 | 4 |
| 2005-04 | 8 | 12 |
| 2005-05 | 0 | 12 |
| 2005-06 | 4 | 16 |
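CONNECT BY and add_months are Oracle-specific. As a rough sketch of the same three steps (month list, outer join, running SUM) on an engine without them, here is a version in SQLite via Python's sqlite3, where a recursive CTE generates the months. The day-of-month values in the sample rows are invented, and only the per-month counts come from the question. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_profile (register_date TEXT)")
# Registrations per month as in the question; the day (15th) is made up
registrations = {"2005-01": 1, "2005-02": 3, "2005-04": 8, "2005-06": 4}
conn.executemany(
    "INSERT INTO users_profile VALUES (?)",
    [(f"{month}-15",) for month, n in registrations.items() for _ in range(n)],
)

result = conn.execute("""
    WITH RECURSIVE bounds AS (
        SELECT MIN(strftime('%Y-%m-01', register_date)) AS min_month,
               MAX(strftime('%Y-%m-01', register_date)) AS max_month
        FROM users_profile
    ),
    months(month_start) AS (      -- every month from the first to the last
        SELECT min_month FROM bounds
        UNION ALL
        SELECT date(month_start, '+1 month')
        FROM months, bounds
        WHERE month_start < max_month
    ),
    counts AS (                   -- outer join keeps the empty months
        SELECT m.month_start, COUNT(u.register_date) AS total_count
        FROM months m
        LEFT JOIN users_profile u
               ON strftime('%Y-%m-01', u.register_date) = m.month_start
        GROUP BY m.month_start
    )
    SELECT substr(month_start, 1, 7) AS registered_in_month,
           total_count,
           SUM(total_count) OVER (ORDER BY month_start) AS running_sum
    FROM counts
    ORDER BY month_start
""").fetchall()

for row in result:
    print(row)
```

COUNT(u.register_date) rather than COUNT(*) is what turns the unmatched months into 0 instead of 1.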

How to fill missing dates by groups in a table in sql

I want to know how to fill in missing dates with the value zero, based on the start/end dates of each group, so that each group has a consecutive time series. I assume this needs loops, so I have two questions:
How do I loop over each group?
How do I use the start/end dates of each group to dynamically fill in missing dates?
My input and expected output are listed as below.
Input: I have a table A like
date value grp_no
8/06/12 1 1
8/08/12 1 1
8/09/12 0 1
8/07/12 2 2
8/08/12 1 2
8/12/12 3 2
Also I have a table B which can be used to left join with A to fill in missing dates.
date
...
8/05/12
8/06/12
8/07/12
8/08/12
8/09/12
8/10/12
8/11/12
8/12/12
8/13/12
...
How can I use A and B to generate the following output in sql?
Output:
date value grp_no
8/06/12 1 1
8/07/12 0 1
8/08/12 1 1
8/09/12 0 1
8/07/12 2 2
8/08/12 1 2
8/09/12 0 2
8/10/12 0 2
8/11/12 0 2
8/12/12 3 2
Please send me your code and suggestion. Thank you so much in advance!!!
You can do it like this, without loops:
SELECT p.date, COALESCE(a.value, 0) value, p.grp_no
FROM
(
SELECT grp_no, date
FROM
(
SELECT grp_no, MIN(date) min_date, MAX(date) max_date
FROM tableA
GROUP BY grp_no
) q CROSS JOIN tableb b
WHERE b.date BETWEEN q.min_date AND q.max_date
) p LEFT JOIN TableA a
ON p.grp_no = a.grp_no
AND p.date = a.date
The innermost subquery grabs the min and max dates per group. The cross join with TableB then produces all dates within the min-max range of each group. Finally, the outer select left joins to TableA and fills the value column with 0 for dates that are missing in TableA.
Output:
| DATE | VALUE | GRP_NO |
|------------|-------|--------|
| 2012-08-06 | 1 | 1 |
| 2012-08-07 | 0 | 1 |
| 2012-08-08 | 1 | 1 |
| 2012-08-09 | 0 | 1 |
| 2012-08-07 | 2 | 2 |
| 2012-08-08 | 1 | 2 |
| 2012-08-09 | 0 | 2 |
| 2012-08-10 | 0 | 2 |
| 2012-08-11 | 0 | 2 |
| 2012-08-12 | 3 | 2 |
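If you want to try the approach without a database server, the following sketch runs essentially the same query in SQLite through Python's sqlite3. The table names table_a and table_b are stand-ins for A and B, dates are stored as ISO strings so that MIN, MAX and BETWEEN compare correctly, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (date TEXT, value INTEGER, grp_no INTEGER);
    CREATE TABLE table_b (date TEXT);
""")
conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", [
    ("2012-08-06", 1, 1), ("2012-08-08", 1, 1), ("2012-08-09", 0, 1),
    ("2012-08-07", 2, 2), ("2012-08-08", 1, 2), ("2012-08-12", 3, 2),
])
# Calendar table covering 2012-08-05 through 2012-08-13
calendar = [(date(2012, 8, 5) + timedelta(days=i)).isoformat() for i in range(9)]
conn.executemany("INSERT INTO table_b VALUES (?)", [(d,) for d in calendar])

filled = conn.execute("""
    SELECT p.date, COALESCE(a.value, 0) AS value, p.grp_no
    FROM (
        SELECT q.grp_no, b.date
        FROM (
            SELECT grp_no, MIN(date) AS min_date, MAX(date) AS max_date
            FROM table_a
            GROUP BY grp_no
        ) q
        CROSS JOIN table_b b
        WHERE b.date BETWEEN q.min_date AND q.max_date
    ) p
    LEFT JOIN table_a a
           ON p.grp_no = a.grp_no
          AND p.date = a.date
    ORDER BY p.grp_no, p.date
""").fetchall()

for row in filled:
    print(row)
```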
I just needed a query that returns all the dates in the period I wanted, without the joins. I thought I'd share it for anyone who wants to use it in their own query. Just change the 365 to whatever timeframe you want.
DECLARE @s DATE = GETDATE()-365, @e DATE = GETDATE();
SELECT TOP (DATEDIFF(DAY, @s, @e)+1)
DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY number)-1, @s)
FROM [master].dbo.spt_values
WHERE [type] = N'P' ORDER BY number
The following query does a union with tableA and tableB. It then uses group by to merge the rows from tableA and tableB so that all of the dates from tableB are in the result. If a date is not in tableA, then the row has 0 for value and grp_no. Otherwise, the row has the actual values for value and grp_no.
select
dat,
sum(val),
sum(grp)
from
(
select
date as dat,
value as val,
grp_no as grp
from
tableA
union
select
date,
0,
0
from
tableB
where
date >= date '2012-08-06' and
date <= date '2012-08-13'
)
group by
dat
order by
dat
I find this query to be easier for me to understand. It also runs faster. It takes 16 seconds whereas a similar right join query takes 32 seconds.
This solution only works with numerical data.
This solution assumes a fixed date range. With some extra work this query can be adapted to limit the date range to what is found in tableA.

Statistical Mode with postgres

I have a table that has this schema:
create table mytable (creation_date timestamp,
value int,
category int);
I want the most frequent value for each hour per category, on weekdays only. I have made some progress and now have a query like this:
select category,foo.h as h,value, count(value) from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value;
As a result I get something like this:
category | h | value | count
---------+----------+---------+-------
1 | 00:00:00 | 2 | 1
1 | 01:00:00 | 2 | 1
1 | 02:00:00 | 2 | 6
1 | 03:00:00 | 2 | 31
1 | 03:00:00 | 3 | 11
1 | 04:00:00 | 2 | 21
1 | 04:00:00 | 3 | 9
1 | 13:00:00 | 1 | 14
1 | 14:00:00 | 1 | 10
1 | 14:00:00 | 2 | 7
1 | 15:00:00 | 1 | 52
For example, at 04:00 I have two values, 2 and 3, with counts of 21 and 9 respectively. I only need the value with the highest count, which would be the statistical mode.
BTW I have more than 2M records
This can be simpler:
SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
category
, extract(hour FROM creation_date)::int AS h
, count(*)::int AS max_ct
, value
FROM mytable
WHERE extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP BY 1,2,4
ORDER BY 1,2,3 DESC;
Basically these are the steps:
Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract hour from timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer - we don't need bigint.
Order by category, h and the highest count (DESC).
Only pick the first row (highest count) per (category, h), together with its value.
I am able to do this in one query level, because DISTINCT is applied after the aggregate function.
The result will hold no rows for any (category, h) with no entries at all. If you need to fill in the blanks, LEFT JOIN to this:
SELECT c.category, h.h
FROM cat_tbl c
CROSS JOIN (SELECT generate_series(0, 23) AS h) h
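DISTINCT ON is PostgreSQL-specific. For engines without it, a common substitute is ROW_NUMBER() over the grouped counts. The sketch below shows that substitute in SQLite via Python's sqlite3 (3.25 or later for window functions), not the query above. The sample rows are invented, and breaking ties by the smaller value is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (creation_date TEXT, value INTEGER, category INTEGER)"
)
# Hypothetical rows; 2024-01-01 is a Monday, 2024-01-06/07 are a weekend
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    ("2024-01-01 04:05:00", 2, 1),
    ("2024-01-01 04:15:00", 2, 1),
    ("2024-01-01 04:25:00", 3, 1),
    ("2024-01-02 04:10:00", 2, 1),
    ("2024-01-06 04:00:00", 3, 1),   # Saturday, must be ignored
    ("2024-01-06 04:30:00", 3, 1),   # Saturday, must be ignored
    ("2024-01-07 04:30:00", 3, 1),   # Sunday, must be ignored
    ("2024-01-03 14:00:00", 1, 1),
    ("2024-01-03 14:20:00", 7, 2),
    ("2024-01-04 14:40:00", 7, 2),
    ("2024-01-05 14:50:00", 9, 2),
])

modes = conn.execute("""
    WITH counts AS (
        SELECT category,
               CAST(strftime('%H', creation_date) AS INTEGER) AS h,
               value,
               COUNT(*) AS ct
        FROM mytable
        -- strftime('%w') is 0 for Sunday and 6 for Saturday
        WHERE CAST(strftime('%w', creation_date) AS INTEGER) BETWEEN 1 AND 5
        GROUP BY category, h, value
    ),
    ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY category, h
                                  ORDER BY ct DESC, value) AS rn
        FROM counts
    )
    SELECT category, h, value, ct
    FROM ranked
    WHERE rn = 1
    ORDER BY category, h
""").fetchall()

for row in modes:
    print(row)
```

Without the weekday filter, value 3 would win the 04:00 slot for category 1 in this sample, which is a handy way to confirm that the filter is applied before counting.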
Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.
Assuming you called the temporary table "summary_table", the following query should do it.
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s2.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value)));
If you don't want to create a table, you could use a WITH clause to attach your query to this one.
with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value)
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s2.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value)));