How to scale creating extra columns in an SQL statement instead of creating extra rows using GROUP BY?

The schema of the database is mentioned here: http://sqlzoo.net/wiki/Guest_House
The task: for each day of the week beginning 2016-11-14, show how many guests are checking out that day, by floor number.
The final table should be like this one (the naming of the columns is of little importance of course):
+------------+-----+-----+-----+
| i | 1st | 2nd | 3rd |
+------------+-----+-----+-----+
| 2016-11-14 | 5 | 3 | 4 |
| 2016-11-15 | 6 | 4 | 1 |
| 2016-11-16 | 2 | 2 | 4 |
| 2016-11-17 | 5 | 3 | 6 |
| 2016-11-18 | 2 | 3 | 2 |
| 2016-11-19 | 5 | 5 | 1 |
| 2016-11-20 | 2 | 2 | 2 |
+------------+-----+-----+-----+
My attempt to implement this in SQL for the middle column (2nd) resulted in this script:
select ADDDATE(booking_date, nights) as checkout,
       count(distinct guest_id) as '2nd'
from booking
where CAST(room_no as char) like '2%'
  and ADDDATE(booking_date, nights) >= '2016-11-14'
group by checkout
order by checkout
LIMIT 7
The issue is that this script produces only one column at a time.
A more scalable version is the following, but it returns one row per day and floor instead of one column per floor:
select ADDDATE(booking_date, nights) as checkout,
       SUBSTR(CAST(room_no as char), 1, 1) as floor,
       count(distinct guest_id) as 'guest count'
from booking
where ADDDATE(booking_date, nights) >= '2016-11-14'
group by checkout, floor
order by checkout, floor
LIMIT 21
The output of this approach is not ideal for presentation:
checkout floor guest count
2016-11-14 1 5
2016-11-14 2 3
2016-11-14 3 4
2016-11-15 1 6
2016-11-15 2 4
2016-11-15 3 1
2016-11-16 1 2
2016-11-16 2 2
2016-11-16 3 4
2016-11-17 1 5
2016-11-17 2 3
2016-11-17 3 6
2016-11-18 1 2
2016-11-18 2 3
2016-11-18 3 2
2016-11-19 1 5
2016-11-19 2 5
2016-11-19 3 1
2016-11-20 1 2
2016-11-20 2 2
2016-11-20 3 2

Doubling the number of columns versus doubling the number of rows makes little difference for scaling, and the limit on the number of columns is much smaller than the limit on rows.
Go for more rows, then use pivoting to display them as columns. (See the [pivot-table] tag.)
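One common way to do that pivot is conditional aggregation: keep a single GROUP BY on the checkout date and let a CASE expression inside each aggregate decide which floor a guest counts toward. The sketch below runs the idea through Python's built-in sqlite3 module, so SQLite's date() stands in for MySQL's ADDDATE. The sample bookings are made up for illustration, and the assumption that the first digit of room_no is the floor is carried over from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE booking ("
    "booking_date TEXT, nights INTEGER, room_no INTEGER, guest_id INTEGER)"
)
# Hypothetical sample rows (not taken from the Guest House data set).
conn.executemany(
    "INSERT INTO booking VALUES (?, ?, ?, ?)",
    [
        ("2016-11-12", 2, 101, 1),  # checks out 2016-11-14, floor 1
        ("2016-11-13", 1, 102, 2),  # checks out 2016-11-14, floor 1
        ("2016-11-13", 1, 201, 3),  # checks out 2016-11-14, floor 2
        ("2016-11-14", 1, 301, 4),  # checks out 2016-11-15, floor 3
        ("2016-11-13", 2, 202, 5),  # checks out 2016-11-15, floor 2
    ],
)

# One CASE per floor: guests on other floors become NULL and are
# ignored by COUNT(DISTINCT ...).
PIVOT_SQL = """
SELECT date(booking_date, '+' || nights || ' days') AS checkout,
       COUNT(DISTINCT CASE WHEN substr(CAST(room_no AS TEXT), 1, 1) = '1'
                           THEN guest_id END) AS "1st",
       COUNT(DISTINCT CASE WHEN substr(CAST(room_no AS TEXT), 1, 1) = '2'
                           THEN guest_id END) AS "2nd",
       COUNT(DISTINCT CASE WHEN substr(CAST(room_no AS TEXT), 1, 1) = '3'
                           THEN guest_id END) AS "3rd"
FROM booking
WHERE date(booking_date, '+' || nights || ' days')
      BETWEEN '2016-11-14' AND '2016-11-20'
GROUP BY checkout
ORDER BY checkout
"""

pivot = conn.execute(PIVOT_SQL).fetchall()
for row in pivot:
    print(row)
```

Each extra floor still needs its own CASE column, which is the scaling limit mentioned above; the row-per-floor query remains the form that grows without changing the statement.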

Related

How to group data within a range of contiguous timestamps

I have a table made up of rows of data collected through a non-deterministic polling process. Each row has a start and end timestamp denoting the time period in which the data was collected. In some cases the data was collected contiguously, in which case the end timestamp of one row has the same value as the start timestamp of the next row. In other cases there is a break in time between one row and the next.
For example, in the table below, rows number 1,2,3 and 4 are all part of one time series of data. Similarly for rows 5, 6, 7 and 8 and again for rows 9 and 10. In between are time periods for which I do not have data.
Row Start_Timestamp End_Timestamp Data_Item
--- --------------- -------------- ---------
1 2019-08-12_22:07:53 2019-08-12_22:09:57 100
2 2019-08-12_22:09:57 2019-08-12_22:12:01 203
3 2019-08-12_22:12:01 2019-08-12_22:13:03 487
4 2019-08-12_22:13:03 2019-08-12_22:16:19 113
5 2019-08-12_22:24:34 2019-08-12_22:26:37 632
6 2019-08-12_22:26:37 2019-08-12_22:27:40 532
7 2019-08-12_22:27:40 2019-08-12_22:28:42 543
8 2019-08-12_22:28:42 2019-08-12_22:31:57 142
9 2019-08-13_19:56:06 2019-08-13_19:57:08 351
10 2019-08-13_19:57:08 2019-08-13_19:58:10 982
I would like to group these contiguous time series, ideally as follows:
Row Series Start_Timestamp End_Timestamp Data_Item
--- ------ --------------- -------------- -----------
1 1 2019-08-12_22:07:53 2019-08-12_22:09:57 100
2 1 2019-08-12_22:09:57 2019-08-12_22:12:01 203
3 1 2019-08-12_22:12:01 2019-08-12_22:13:03 487
4 1 2019-08-12_22:13:03 2019-08-12_22:16:19 113
5 2 2019-08-12_22:24:34 2019-08-12_22:26:37 632
6 2 2019-08-12_22:26:37 2019-08-12_22:27:40 532
7 2 2019-08-12_22:27:40 2019-08-12_22:28:42 543
8 2 2019-08-12_22:28:42 2019-08-12_22:31:57 142
9 3 2019-08-13_19:56:06 2019-08-13_19:57:08 351
10 3 2019-08-13_19:57:08 2019-08-13_19:58:10 982
I am new to SQL and have been struggling with this problem. I appreciate any insights or advice on how I might achieve this.

This is a simplified gaps-and-islands problem. Assuming that your RDBMS supports window functions, you can approach this with a window sum. When the Start_Timestamp of a record differs from the End_Timestamp of the previous record, a new group starts:
select
    t.Row,
    sum(case when Start_Timestamp = lag_End_Timestamp then 0 else 1 end)
        over(order by End_Timestamp) series,
    t.Start_Timestamp,
    t.End_Timestamp,
    t.Data_Item
from (
    select
        t.*,
        lag(End_Timestamp) over (order by End_Timestamp) lag_End_Timestamp
    from mytable t
) t
Demo on DB Fiddle:
Row | series | Start_Timestamp | End_Timestamp | Data_Item
--: | -----: | :------------------ | :------------------ | --------:
1 | 1 | 2019-08-12 22:07:53 | 2019-08-12 22:09:57 | 100
2 | 1 | 2019-08-12 22:09:57 | 2019-08-12 22:12:01 | 203
3 | 1 | 2019-08-12 22:12:01 | 2019-08-12 22:13:03 | 487
4 | 1 | 2019-08-12 22:13:03 | 2019-08-12 22:16:19 | 113
5 | 2 | 2019-08-12 22:24:34 | 2019-08-12 22:26:37 | 632
6 | 2 | 2019-08-12 22:26:37 | 2019-08-12 22:27:40 | 532
7 | 2 | 2019-08-12 22:27:40 | 2019-08-12 22:28:42 | 543
8 | 2 | 2019-08-12 22:28:42 | 2019-08-12 22:31:57 | 142
9 | 3 | 2019-08-13 19:56:06 | 2019-08-13 19:57:08 | 351
10 | 3 | 2019-08-13 19:57:08 | 2019-08-13 19:58:10 | 982
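To see why the lag plus running sum produces the series numbers, it can help to trace the same logic procedurally. The following is a plain-Python sketch, not the SQL itself: prev_end plays the role of lag(End_Timestamp), and the counter plays the role of the window sum. The function name assign_series is made up for this illustration, and the rows are a subset of the question's data.

```python
# (Row, Start_Timestamp, End_Timestamp, Data_Item), subset of the question's table
rows = [
    (1, "2019-08-12 22:07:53", "2019-08-12 22:09:57", 100),
    (2, "2019-08-12 22:09:57", "2019-08-12 22:12:01", 203),
    (5, "2019-08-12 22:24:34", "2019-08-12 22:26:37", 632),
    (6, "2019-08-12 22:26:37", "2019-08-12 22:27:40", 532),
    (9, "2019-08-13 19:56:06", "2019-08-13 19:57:08", 351),
]


def assign_series(rows):
    """Label each row with the number of the contiguous series it belongs to."""
    # Same ordering as "over (order by End_Timestamp)".
    ordered = sorted(rows, key=lambda r: r[2])
    series = 0
    prev_end = None  # lag(End_Timestamp); no previous row for the first record
    labelled = []
    for row_id, start, end, item in ordered:
        # The CASE expression: add 1 whenever the row does not continue
        # the previous one, otherwise add 0.
        if start != prev_end:
            series += 1
        labelled.append((row_id, series, start, end, item))
        prev_end = end
    return labelled


labelled = assign_series(rows)
for row in labelled:
    print(row)
```

The first row always starts series 1 because it has no predecessor, which matches the SQL: comparing against a NULL lag value falls through to the else branch.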

How to get an hourly average number of unique persons using Hive?

I have this data in a table my_table:
camera_id person_id datetime
1 1 2017-03-02 18:06:20
1 1 2017-03-02 18:05:10
1 1 2017-04-01 18:04:09
2 1 2017-03-02 19:06:50
2 2 2017-03-02 19:07:22
2 2 2017-03-02 19:09:15
2 3 2017-05-03 19:07:05
2 4 2017-05-03 19:19:08
2 5 2017-05-03 19:20:18
I need to compute the hourly average number of UNIQUE persons detected by each camera.
For example let's take camera 2 and a time window from 19:00 to 20:00. The camera determined 2 unique visits on 2017-03-02 and 3 unique visits on 2017-05-03. So, the answer is (2+3)/2 = 2.5
Expected result:
camera_id HOUR HOURLY_AVG_COUNT
1 18 1
2 19 2.5

select  camera_id
       ,hour(datetime) as hour
       ,count(distinct person_id,date(datetime),hour(datetime)) /
        count(distinct date(datetime),hour(datetime)) as hourly_avg_count
from    my_table
group by camera_id
        ,hour(datetime)
order by camera_id
;
+-----------+------+------------------+
| camera_id | hour | hourly_avg_count |
+-----------+------+------------------+
| 1 | 18 | 1 |
| 2 | 19 | 2.5 |
+-----------+------+------------------+
P.S.
date(datetime),hour(datetime) can also be replaced by one of the following:
substr(cast(datetime as string),1,13)
date_format(datetime,'yyyy-MM-dd HH')
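The arithmetic behind the query can be checked by hand: the numerator counts distinct (person, date, hour) combinations, and the denominator counts distinct (date, hour) combinations, both within one camera and hour of day. Below is a small plain-Python sketch of that calculation on the question's data; the function name hourly_avg_unique is made up for this illustration.

```python
from collections import defaultdict

# (camera_id, person_id, datetime) rows from the question
observations = [
    (1, 1, "2017-03-02 18:06:20"),
    (1, 1, "2017-03-02 18:05:10"),
    (1, 1, "2017-04-01 18:04:09"),
    (2, 1, "2017-03-02 19:06:50"),
    (2, 2, "2017-03-02 19:07:22"),
    (2, 2, "2017-03-02 19:09:15"),
    (2, 3, "2017-05-03 19:07:05"),
    (2, 4, "2017-05-03 19:19:08"),
    (2, 5, "2017-05-03 19:20:18"),
]


def hourly_avg_unique(observations):
    """Average number of distinct persons per (camera, hour of day)."""
    # (camera_id, hour) -> date -> set of person ids seen in that hour
    persons = defaultdict(lambda: defaultdict(set))
    for camera_id, person_id, ts in observations:
        date_part, time_part = ts.split(" ")
        hour = int(time_part[:2])
        persons[(camera_id, hour)][date_part].add(person_id)
    return {
        key: sum(len(ids) for ids in by_date.values()) / len(by_date)
        for key, by_date in persons.items()
    }


hourly_avg = hourly_avg_unique(observations)
print(hourly_avg)
```

For camera 2 at hour 19 this gives (2 + 3) / 2, the same 2.5 worked out in the question.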

Enumerating records by date

Say we have 5 records for items sold on particular dates like this
Date of Purchase Qty
2016-11-29 19:33:50.000 5
2017-01-03 20:09:49.000 4
2017-02-23 16:21:21.000 11
2016-11-29 14:33:51.000 2
2016-12-02 16:24:29.000 4
I'd like to enumerate each record by the date in order with an extra column like this:
Date of Purchase Qty Order
2016-11-29 19:33:50.000 5 1
2017-01-03 20:09:49.000 4 3
2017-02-23 16:21:21.000 11 4
2016-11-29 14:33:51.000 2 1
2016-12-02 16:24:29.000 4 2
Notice how both dates on 2016-11-29 have the same order number because I only want to order the records by the date and not by the datetime. How would I create this extra column in just plain SQL?

Use dense_rank() and order by the date part of DateOfPurchase:
select *
, [Order] = dense_rank() over (order by convert(date,DateOfPurchase))
from t
rextester demo: http://rextester.com/FAAQL92440
returns:
+---------------------+-----+-------+
| DateOfPurchase | Qty | Order |
+---------------------+-----+-------+
| 2016-11-29 19:33:50 | 5 | 1 |
| 2016-11-29 14:33:51 | 2 | 1 |
| 2016-12-02 16:24:29 | 4 | 2 |
| 2017-01-03 20:09:49 | 4 | 3 |
| 2017-02-23 16:21:21 | 11 | 4 |
+---------------------+-----+-------+
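What dense_rank() computes here can be described without SQL: collect the distinct dates, sort them, and number them consecutively, so that rows sharing a date share a number and no numbers are skipped. The following plain-Python sketch illustrates that on the question's rows; dense_rank_by_date is a made-up helper name, and the date is taken as the first ten characters of the timestamp string.

```python
# (Date of Purchase, Qty) rows from the question, in their original order
purchases = [
    ("2016-11-29 19:33:50.000", 5),
    ("2017-01-03 20:09:49.000", 4),
    ("2017-02-23 16:21:21.000", 11),
    ("2016-11-29 14:33:51.000", 2),
    ("2016-12-02 16:24:29.000", 4),
]


def dense_rank_by_date(purchases):
    """Append a dense rank based on the date part of each timestamp."""
    # Distinct dates in ascending order; ISO dates sort correctly as strings.
    dates = sorted({ts[:10] for ts, _ in purchases})
    rank = {d: i for i, d in enumerate(dates, start=1)}
    return [(ts, qty, rank[ts[:10]]) for ts, qty in purchases]


ranked = dense_rank_by_date(purchases)
for row in ranked:
    print(row)
```

Unlike rank(), which would leave a gap after the two rows dated 2016-11-29, the dense variant continues with 2 for the next date.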

In SQL, group user actions by first-time or recurring

Imagine a sequence of actions. Each action is of a certain type.
Grouping by a given time-frame (e.g. day), how many of these actions happened for the first time, and how many were recurring?
Example Input:
+-----------+-------------+-------------------+
| user_id | action_type | timestamp |
+-----------+-------------+-------------------+
| 5 | play | 2014-02-02 00:55 |
| 2 | sleep | 2014-02-02 00:52 |
| 5 | play | 2014-02-02 00:42 |
| 5 | eat | 2014-02-02 00:31 |
| 3 | eat | 2014-02-02 00:19 |
| 2 | eat | 2014-02-01 23:52 |
| 3 | play | 2014-02-01 23:50 |
| 2 | play | 2014-02-01 23:48 |
+-----------+-------------+-------------------+
Example Output
+------------+------------+-------------+
| first_time | recurring | day |
+------------+------------+-------------+
| 4 | 1 | 2014-02-02 |
| 3 | 0 | 2014-02-01 |
+------------+------------+-------------+
Explanation
On 2014-02-02, users 2, 3, and 5 performed various actions. There were 4 instances where a user performed an action for the first time; in one case, user 5 repeated the action 'play'.

I added a column 'Total Actions' because as I said, I believe there is a misinterpretation of facts in output example. You can remove it easily.
Tested on SQLFiddle.com with SQL Server 2008.
select
    COUNT(q.repetitions) 'first time',
    SUM(case when q.repetitions>1 then q.repetitions-1 else 0 end) as 'recurring',
    day
from (
    select COUNT(i.action_type) as 'repetitions', convert(date,i.time_stamp) as 'day'
    from input i
    group by i.user_id, i.action_type, convert(date,i.time_stamp)
) q
group by q.day
order by day desc
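The query can be checked against the example input with a small script. This sketch runs an equivalent statement through Python's built-in sqlite3 module, so SQLite's date() replaces SQL Server's convert(date, ...), and the aliases are written without spaces. Note what it measures: an action counts as "first time" once per user, action type, and day, not once over the user's whole history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input (user_id INTEGER, action_type TEXT, time_stamp TEXT)"
)
# Example input from the question.
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [
        (5, "play", "2014-02-02 00:55"),
        (2, "sleep", "2014-02-02 00:52"),
        (5, "play", "2014-02-02 00:42"),
        (5, "eat", "2014-02-02 00:31"),
        (3, "eat", "2014-02-02 00:19"),
        (2, "eat", "2014-02-01 23:52"),
        (3, "play", "2014-02-01 23:50"),
        (2, "play", "2014-02-01 23:48"),
    ],
)

# Inner query: one row per (user, action, day) with its repetition count.
# Outer query: each such row is one first-time action; every repetition
# beyond the first is counted as recurring.
SQL = """
SELECT COUNT(q.repetitions) AS first_time,
       SUM(CASE WHEN q.repetitions > 1 THEN q.repetitions - 1 ELSE 0 END)
           AS recurring,
       q.day
FROM (
    SELECT COUNT(*) AS repetitions, date(time_stamp) AS day
    FROM input
    GROUP BY user_id, action_type, date(time_stamp)
) AS q
GROUP BY q.day
ORDER BY q.day DESC
"""

summary = conn.execute(SQL).fetchall()
for row in summary:
    print(row)
```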

Query to use GROUP BY multiple columns

I have a table full of patient/responsible party/insurance carrier combinations (e.g. patient Jim Doe's responsible party is parent John Doe, who has insurance carrier Aetna Insurance). Each of these combinations has a contract with multiple payments. For this particular table, I need to write a query to find any patient/RP/carrier combo that has multiple contract dates in the same month. Is there any way to do this?
Example table:
ContPat | ContResp | ContIns | ContDue
------------------------------------------------------
53 | 13 | 27 | 2012-01-01 00:00:00.000
53 | 13 | 27 | 2012-02-01 00:00:00.000
53 | 15 | 27 | 2012-03-01 00:00:00.000
12 | 15 | 3 | 2011-05-01 00:00:00.000
12 | 15 | 3 | 2011-05-01 00:00:00.000
12 | 15 | 3 | 2011-06-01 00:00:00.000
12 | 15 | 3 | 2011-07-01 00:00:00.000
12 | 15 | 3 | 2011-08-01 00:00:00.000
12 | 15 | 3 | 2011-09-01 00:00:00.000
In this example, I would like to generate a list of all the duplicate months for any Patient/RP/Carrier combinations. The 12/15/3 combination would be the only row returned here, but I'm working with thousands of combinations.
Not sure if this is possible using a GROUP BY or similar functions. Thanks in advance for any advice!

If all you care about is multiple entries in the same calendar month:
SELECT
    ContPat,
    ContResp,
    ContIns,
    MONTH(ContDue) as Mo,
    YEAR(ContDue) as Yr,
    COUNT(*) as 'Records'
FROM
    MyTable
GROUP BY
    ContPat,
    ContResp,
    ContIns,
    MONTH(ContDue),
    YEAR(ContDue)
HAVING
    COUNT(*) > 1
This will show you any Patient/Responsible Party/Insurer/Calendar month combination with more than one record for that month.
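As a quick check of the GROUP BY plus HAVING approach, the sketch below loads the example table into an in-memory SQLite database through Python's sqlite3 module. Because SQLite has no MONTH() or YEAR() functions, the year and month are combined into one strftime('%Y-%m', ...) key here; that substitution is an adaptation for this sketch, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    "ContPat INTEGER, ContResp INTEGER, ContIns INTEGER, ContDue TEXT)"
)
# Example table from the question.
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?)",
    [
        (53, 13, 27, "2012-01-01 00:00:00.000"),
        (53, 13, 27, "2012-02-01 00:00:00.000"),
        (53, 15, 27, "2012-03-01 00:00:00.000"),
        (12, 15, 3, "2011-05-01 00:00:00.000"),
        (12, 15, 3, "2011-05-01 00:00:00.000"),
        (12, 15, 3, "2011-06-01 00:00:00.000"),
        (12, 15, 3, "2011-07-01 00:00:00.000"),
        (12, 15, 3, "2011-08-01 00:00:00.000"),
        (12, 15, 3, "2011-09-01 00:00:00.000"),
    ],
)

# Group by the combination plus calendar month, then keep only the
# groups that contain more than one contract date.
SQL = """
SELECT ContPat, ContResp, ContIns,
       strftime('%Y-%m', ContDue) AS YrMo,
       COUNT(*) AS Records
FROM MyTable
GROUP BY ContPat, ContResp, ContIns, strftime('%Y-%m', ContDue)
HAVING COUNT(*) > 1
"""

duplicates = conn.execute(SQL).fetchall()
for row in duplicates:
    print(row)
```

Only the 12/15/3 combination for May 2011 comes back, which is the single row the question expects.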
