I have a table in an SQLite database where I store data about call logs. As an example, assume that my table looks like this:
| Calls_count | Calls_duration | Time_slice | Time_stamp |
| 10 | 500 | 21 | 1399369269 |
| 2 | 88 | 22 | 1399383668 |
Here
Calls_count is the number of calls made since the last observation
Calls_duration is the duration of calls in ms since the last observation
Time_slice represents a portion of the week. Every day is divided into 4 portions of 6 hours each such that
Day| 06:00-11:59 | 12:00-17:59 | 18:00-23:59 | 00:00-05:59 |
Mon| 11 | 12 | 13 | 14 |
Tue| 21 | 22 | 23 | 24 |
Wed| 31 | 32 | 33 | 34 |
Thu| 41 | 42 | 43 | 44 |
Fri| 51 | 52 | 53 | 54 |
Sat| 61 | 62 | 63 | 64 |
Sun| 71 | 72 | 73 | 74 |
And Time_stamp is the Unix epoch at which the observation was made, i.e. when the record was inserted into the database
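For what it's worth, the slice encoding described above can be sketched in a few lines of Python. This is only an illustration: it assumes the timestamps are interpreted as UTC and that the 00:00-05:59 portion belongs to the previous day's row, as the grid suggests. The function name time_slice is made up for this example.

```python
from datetime import datetime, timedelta, timezone


def time_slice(epoch_seconds: int) -> int:
    """Return the two-digit slice code (day digit, then portion digit)."""
    # Shift back by 6 hours so that each "day" starts at 06:00 (assumption:
    # timestamps are read as UTC).
    shifted = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc) - timedelta(hours=6)
    day = shifted.isoweekday()        # Monday = 1 ... Sunday = 7
    portion = shifted.hour // 6 + 1   # 1..4, six hours each
    return day * 10 + portion


# The two sample rows from the table above
print(time_slice(1399369269), time_slice(1399383668))
```

Under those assumptions the two sample timestamps map to the slices stored next to them in the example table.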
Now I want to create a query so that, if I specify the time stamps for the start and the end of a week, the result is 168 rows of data giving the sum of calls grouped by hour, i.e. 24 rows for each day of the week. This is an hourly breakdown of calls in a week.
SUM_CALLS | Time_Slice | Hour_of_Week |
10 | 11 | 1 |
0 | 11 | 2 |
....
7 | 74 | 167 |
4 | 74 | 168 |
In the above example of intended result,
1st row is Monday 06:00-06:59
2nd row is Monday 07:00-07:59
Last row is Sunday 05:00-05:59
Since version 3.8.3, SQLite supports common table expressions, and this is a possible solution:
WITH RECURSIVE
hours(x,y) AS (SELECT CAST(STRFTIME('%s',STRFTIME('%Y-%m-%d %H:00:00', '2014-05-05 00:00:00')) AS INTEGER),
CAST(STRFTIME('%s',STRFTIME('%Y-%m-%d %H:59:59', '2014-05-05 00:00:00')) AS INTEGER)
UNION ALL
SELECT x+3600,y+3600 FROM hours LIMIT 168)
SELECT
COALESCE(SUM(Calls_count),0) AS SUM_CALLS,
CASE CAST(STRFTIME('%w',x,'unixepoch') AS INTEGER)
WHEN 0 THEN 7 ELSE STRFTIME('%w',x,'unixepoch') END
||
CASE
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '06:00:00' AND '11:59:59' THEN 1
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '12:00:00' AND '17:59:59' THEN 2
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '18:00:00' AND '23:59:59' THEN 3
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '00:00:00' AND '05:59:59' THEN 4
END AS Time_Slice,
((x-(SELECT MIN(x) FROM hours))/3600)+1 AS Hour_of_Week
FROM hours LEFT JOIN call_logs
ON call_logs.time_stamp >= hours.x AND call_logs.time_stamp <= hours.y
GROUP BY Hour_of_Week
ORDER BY Hour_of_Week
;
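If you want to try the recursive-CTE idea from application code, here is a rough sketch using Python's built-in sqlite3 module. It is not the query above verbatim: it assumes UTC, starts the week at Monday 06:00 as the question asks, uses a counter column instead of LIMIT, and uses a half-open hour interval. WEEK_START and hourly_breakdown are names invented for this example, and the bundled SQLite must be 3.8.3 or newer.

```python
import sqlite3

# Monday 2014-05-05 06:00:00 UTC (assumed start of the week)
WEEK_START = 1399248000 + 6 * 3600

QUERY = """
WITH RECURSIVE hours(x, n) AS (
    SELECT :start, 1
    UNION ALL
    SELECT x + 3600, n + 1 FROM hours WHERE n < 168
)
SELECT COALESCE(SUM(c.Calls_count), 0) AS sum_calls,
       hours.n AS hour_of_week
FROM hours
LEFT JOIN call_logs AS c
       ON c.Time_stamp >= hours.x AND c.Time_stamp < hours.x + 3600
GROUP BY hours.n
ORDER BY hours.n
"""


def hourly_breakdown(conn, week_start):
    """Return 168 (sum_calls, hour_of_week) rows for the week starting at week_start."""
    return conn.execute(QUERY, {"start": week_start}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_logs (Calls_count INTEGER, Calls_duration INTEGER,"
    " Time_slice INTEGER, Time_stamp INTEGER)"
)
conn.executemany(
    "INSERT INTO call_logs VALUES (?, ?, ?, ?)",
    [(10, 500, 21, 1399369269), (2, 88, 22, 1399383668)],
)
rows = hourly_breakdown(conn, WEEK_START)
print(len(rows), [r for r in rows if r[0] > 0])
```

Hours without any observation still show up with a sum of 0 because of the LEFT JOIN plus COALESCE.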
This version was tested with SQLite 3.7.13, which has no CTE support:
DROP VIEW IF EXISTS digit;
CREATE TEMPORARY VIEW digit AS SELECT 0 AS d UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION
SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
;
DROP VIEW IF EXISTS hours;
CREATE TEMPORARY VIEW hours AS SELECT STRFTIME('%s','2014-05-05 00:00:00') + s AS x,
STRFTIME('%s','2014-05-05 00:00:00') + s+3599 AS y
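FROM (SELECT (a.d || b.d || c.d) * 3600 AS s FROM digit a, digit b, digit c
WHERE CAST(a.d || b.d || c.d AS INTEGER) < 168) -- filter instead of LIMIT, since row order is not guaranteed without ORDER BY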
;
SELECT
COALESCE(SUM(Calls_count),0) AS SUM_CALLS,
CASE CAST(STRFTIME('%w',x,'unixepoch') AS INTEGER)
WHEN 0 THEN 7 ELSE STRFTIME('%w',x,'unixepoch') END
||
CASE
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '06:00:00' AND '11:59:59' THEN 1
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '12:00:00' AND '17:59:59' THEN 2
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '18:00:00' AND '23:59:59' THEN 3
WHEN STRFTIME('%H:%M:%S',x,'unixepoch') BETWEEN '00:00:00' AND '05:59:59' THEN 4
END AS Time_Slice,
((x-(SELECT MIN(x) FROM hours))/3600)+1 AS Hour_of_Week
FROM hours LEFT JOIN call_logs
ON call_logs.time_stamp >= hours.x AND call_logs.time_stamp <= hours.y
GROUP BY Hour_of_Week
ORDER BY Hour_of_Week
;
How might I calculate cumulative percentages in SQL (Postgres/Vertica)?
For instance, the question is "As of each date, of all patients who had been diagnosed by that date, what percent had been treated by that date?"
The table below shows dates of diagnosis and treatment, with binary flags that can be summed:
ID | diagnosed | date_diag | treated | date_treat
---|-----------|-----------|---------|-----------
1  | 1         | Jan 1     | 0       | null
2  | 1         | Jan 15    | 1       | Feb 20
3  | 1         | Jan 29    | 1       | Feb 1
4  | 1         | Feb 08    | 1       | Mar 4
5  | 1         | Feb 12    | 0       | null
6  | 1         | Feb 18    | 1       | Feb 24
7  | 1         | Mar 15    | 1       | May 5
8  | 1         | Apr 14    | 1       | Apr 20
I'd like to get a table of cumulative treated-vs-diagnosed ratio that might look like this.
date   | ytd_diag | ytd_treat | ytd_percent
-------|----------|-----------|------------
Jan 01 | 1        | 0         | 0.00
Jan 15 | 2        | 0         | 0.00
Jan 29 | 3        | 0         | 0.00
Feb 08 | 4        | 1         | 0.25
Feb 12 | 5        | 1         | 0.20
Feb 18 | 6        | 1         | 0.17
Mar 15 | 7        | 4         | 0.57
Apr 14 | 8        | 4         | 0.50
I can calculate cumulative counts of diagnosed or treated (e.g. below), using window functions but I can't figure out a SQL query to get the number of people who'd already been treated as of each diagnosis date.
SELECT
date_diag ,
SUM(COUNT(*)) OVER ( ORDER BY date_diag ) as freq
FROM patients
WHERE diagnosed = 1
GROUP BY date_diag
ORDER BY date_diag;
You can use conditional aggregation with SUM() window function:
WITH cte AS (
SELECT kind,
date,
SUM((kind = 1)::int) OVER (ORDER BY date) ytd_diag,
SUM((kind = 2)::int) OVER (ORDER BY date) ytd_treat
FROM (
SELECT 1 kind, date_diag date, diagnosed status FROM patients
UNION ALL
SELECT 2, date_treat, treated FROM patients WHERE date_treat IS NOT NULL
) t
)
SELECT date, ytd_diag, ytd_treat,
ROUND(1.0 * ytd_treat / ytd_diag, 2) ytd_percent
FROM cte
WHERE kind = 1;
You can solve this with window functions. The first thing you want to do is to derive a table from your patients table that has a running tally of both the diagnosed and treated columns. The rows should be tallied in ascending order of the diagnosis date.
Here's how you do that. First I'll create a sample patients table and data (I'll only include the columns necessary):
create temporary table patients (
date_diag date,
diagnosed int default 0,
treated int default 0
);
insert into patients (date_diag, diagnosed, treated) values
('2021-01-01', 1, 0),
('2021-01-11', 1, 1),
('2021-01-16', 1, 0),
('2021-01-30', 1, 1),
('2021-02-04', 1, 1),
('2021-01-14', 1, 1);
Then here's how to create the derived table of all the tallied results.
select
date_diag,
diagnosed,
treated,
sum(treated) over(order by date_diag ASC ) as treated_cmtv,
count(diagnosed) over(order by date_diag ASC) as diagnosed_cmtv
from patients
/*
date_diag | diagnosed | treated | treated_cmtv | diagnosed_cmtv
------------+-----------+---------+--------------+----------------
2021-01-01 | 1 | 0 | 0 | 1
2021-01-11 | 1 | 1 | 1 | 2
2021-01-14 | 1 | 1 | 2 | 3
2021-01-16 | 1 | 0 | 2 | 4
2021-01-30 | 1 | 1 | 3 | 5
2021-02-04 | 1 | 1 | 4 | 6
*/
Now that you have this table, you can easily calculate the percentage by defining this derived table in a subquery and then selecting the necessary columns for the calculation. Like so:
select
p.date_diag,
p.diagnosed,
p.diagnosed_cmtv,
p.treated_cmtv,
p.treated,
TRUNC(p.treated_cmtv::numeric / p.diagnosed_cmtv * 1.0, 2) as percent
from (
-- same table as above
select
date_diag,
diagnosed,
treated,
sum(treated) over(order by date_diag ASC ) as treated_cmtv,
count(diagnosed) over(order by date_diag ASC) as diagnosed_cmtv
from patients
) as p;
/*
date_diag | diagnosed | diagnosed_cmtv | treated_cmtv | treated | percent
------------+-----------+----------------+--------------+---------+---------
2021-01-01 | 1 | 1 | 0 | 0 | 0.00
2021-01-11 | 1 | 2 | 1 | 1 | 0.50
2021-01-14 | 1 | 3 | 2 | 1 | 0.66
2021-01-16 | 1 | 4 | 2 | 0 | 0.50
2021-01-30 | 1 | 5 | 3 | 1 | 0.60
2021-02-04 | 1 | 6 | 4 | 1 | 0.66
*/
I think that gives you what you are asking for.
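To make the running-tally idea concrete outside the database, here is a small Python sketch over the same sample rows. It is only meant to mirror what the two window sums do; running_percent is a name made up for this example, and integer arithmetic is used to imitate TRUNC(..., 2).

```python
from itertools import accumulate

# (date_diag, diagnosed, treated) rows copied from the sample insert above
patients = [
    ("2021-01-01", 1, 0),
    ("2021-01-11", 1, 1),
    ("2021-01-16", 1, 0),
    ("2021-01-30", 1, 1),
    ("2021-02-04", 1, 1),
    ("2021-01-14", 1, 1),
]


def running_percent(rows):
    """Return (date_diag, diagnosed_cmtv, treated_cmtv, percent) per row."""
    ordered = sorted(rows)  # ISO dates sort chronologically as strings
    treated_cmtv = accumulate(t for _, _, t in ordered)
    diagnosed_cmtv = accumulate(1 for _ in ordered)  # like count(diagnosed) over (...)
    return [
        # integer floor division truncates to two decimals, like TRUNC(x, 2)
        (d, diag, treat, (treat * 100 // diag) / 100)
        for (d, _, _), diag, treat in zip(ordered, diagnosed_cmtv, treated_cmtv)
    ]


result = running_percent(patients)
for row in result:
    print(row)
```

The dates in the sample are all distinct, so the question of how tied dates should be tallied does not come up here.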
An alternative approach to the other answers is to use a correlated subquery in the select list:
SELECT
p.date_diag,
(SELECT COUNT(*)
FROM patients p2
WHERE p2.date_treat <= p.date_diag) ytd_treated
FROM
patients p
WHERE diagnosed = 1
GROUP BY p.date_diag
ORDER BY p.date_diag
This will give you that column of 0,0,0,1,1,1,4,4, which you can divide by the cumulative diagnosed count to get your percentage:
SELECT
(select ...) / SUM(COUNT(*)) OVER(...)
Note that you might need some more clauses in your inner WHERE, such as requiring a treated date greater than or equal to Jan 1st of the year of the diag date, if you're running it against a dataset with more than one year's data.
Also bear in mind that treated as an integer will (should) nearly always be less than diagnosed, so if you do an integer divide you'll get zero. Cast one of the operands to float or, if you're doing your percentage out of a hundred, multiply by 100.0.
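Here is a hedged, runnable sketch of the correlated-subquery approach using Python's sqlite3 module rather than Postgres or Vertica. The year 2021 is an assumption (the question only gives month and day), the table is reduced to the columns needed, and the division is done in Python as a float to sidestep the integer-divide pitfall mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, date_diag TEXT, date_treat TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [
        (1, "2021-01-01", None),
        (2, "2021-01-15", "2021-02-20"),
        (3, "2021-01-29", "2021-02-01"),
        (4, "2021-02-08", "2021-03-04"),
        (5, "2021-02-12", None),
        (6, "2021-02-18", "2021-02-24"),
        (7, "2021-03-15", "2021-05-05"),
        (8, "2021-04-14", "2021-04-20"),
    ],
)

QUERY = """
SELECT p.date_diag,
       (SELECT COUNT(*) FROM patients d WHERE d.date_diag  <= p.date_diag) AS ytd_diag,
       (SELECT COUNT(*) FROM patients t WHERE t.date_treat <= p.date_diag) AS ytd_treat
FROM patients p
ORDER BY p.date_diag
"""

# NULL treatment dates never satisfy "<=", so untreated patients are not counted.
rows = [
    (date, diag, treat, round(treat / diag, 2))  # float division, not integer
    for date, diag, treat in conn.execute(QUERY)
]
for row in rows:
    print(row)
```

With one row per diagnosis date this lines up with the table of desired results in the question; several diagnoses on the same date would need a GROUP BY or DISTINCT on top.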
I'm given data of pitchers, the pitch type, and the pitch speed.
day | inning | pitcher | pitch_type | pitch_speed
----|--------|---------|------------|------------
1   | 1      | AE1     | fastball   | 97
1   | 1      | AE1     | fastball   | 94
1   | 1      | AE1     | slider     | 83
1   | 2      | AE1     | fastball   | 96
1   | 2      | AE1     | slider     | 86
1   | 2      | AE1     | fastball   | 97
Is there a way of querying the data to get the average pitch speed for a specific pitch type?
I.e. a way to return fastball_speed = 96 and slider_speed = 84.5 (the averages).
What about this?
select pitch_type, avg(pitch_speed) from your_table group by pitch_type
BTW, when specifying sample data, please use a CTE to make things easier for solvers:
#standardSql
with t as (
select 1 as day, 1 as inning, 'AE1' as pitcher, 'fastball' as pitch_type, 97 as pitch_speed union all
select 1 as day, 1 as inning, 'AE1' as pitcher, 'fastball' as pitch_type, 94 as pitch_speed union all
select 1 as day, 1 as inning, 'AE1' as pitcher, 'slider' as pitch_type, 83 as pitch_speed union all
select 1 as day, 2 as inning, 'AE1' as pitcher, 'fastball' as pitch_type, 96 as pitch_speed union all
select 1 as day, 2 as inning, 'AE1' as pitcher, 'slider' as pitch_type, 86 as pitch_speed union all
select 1 as day, 2 as inning, 'AE1' as pitcher, 'fastball' as pitch_type, 97 as pitch_speed
)
select pitch_type, avg(pitch_speed) from t group by pitch_type
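If you just want to check the GROUP BY locally, the same aggregation can be run against an in-memory SQLite database from Python. This is a sketch, not BigQuery: the table name pitches is made up, and SQLite's AVG returns a float.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pitches (day INTEGER, inning INTEGER, pitcher TEXT,"
    " pitch_type TEXT, pitch_speed INTEGER)"
)
conn.executemany(
    "INSERT INTO pitches VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "AE1", "fastball", 97),
        (1, 1, "AE1", "fastball", 94),
        (1, 1, "AE1", "slider", 83),
        (1, 2, "AE1", "fastball", 96),
        (1, 2, "AE1", "slider", 86),
        (1, 2, "AE1", "fastball", 97),
    ],
)

# One row per pitch type with its average speed
averages = dict(
    conn.execute(
        "SELECT pitch_type, AVG(pitch_speed) FROM pitches GROUP BY pitch_type"
    )
)
print(averages)
```

Add pitcher to both the select list and the GROUP BY if you need the average per pitcher and pitch type.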
The table I am trying to create should look like this
ID Timeframe Value
1 60 15
1 60 30
1 90 45
2 60 15
2 60 30
2 90 45
3 60 15
3 60 30
3 90 45
So for each ID the values of 60,60,90 and 15,30,45 should be repeated.
Could anyone help me with a code? :)
You are looking for a cross join. The basic idea is something like this:
select i.id, tv.timeframe, tv.value
from (values (1), (2), (3)) i(id) cross join
(values (60, 15), (60, 30), (90, 45)) tv(timeframe, value)
order by i.id, tv.value;
Not all databases support the values() table constructor. In those databases, you would need to use the appropriate syntax.
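As one example of "the appropriate syntax", SQLite accepts VALUES inside a CTE that names its columns. The sketch below runs the cross join through Python's sqlite3 module; the CTE names ids and tv are just placeholders chosen for this example.

```python
import sqlite3

QUERY = """
WITH ids(id) AS (VALUES (1), (2), (3)),
     tv(timeframe, value) AS (VALUES (60, 15), (60, 30), (90, 45))
SELECT ids.id, tv.timeframe, tv.value
FROM ids CROSS JOIN tv
ORDER BY ids.id, tv.value
"""

conn = sqlite3.connect(":memory:")
rows = conn.execute(QUERY).fetchall()  # 3 ids x 3 (timeframe, value) pairs
for row in rows:
    print(row)
```

Every id is paired with every (timeframe, value) row, which is exactly the repetition asked for.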
So you have this table: ...
id
1
2
3
and you have this table: ...
timeframe value
60 15
60 30
90 45
Then try this:
WITH
-- the ID table...
id(id) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
)
,
-- the values table:
vals(timeframe,value) AS (
SELECT 60,15
UNION ALL SELECT 60,30
UNION ALL SELECT 90,45
)
SELECT
id
, timeframe
, value
FROM id CROSS JOIN vals
ORDER BY id, timeframe, value; -- value added so rows with the same timeframe come out in a fixed order
-- out id | timeframe | value
-- out ----+-----------+-------
-- out 1 | 60 | 15
-- out 1 | 60 | 30
-- out 1 | 90 | 45
-- out 2 | 60 | 15
-- out 2 | 60 | 30
-- out 2 | 90 | 45
-- out 3 | 60 | 15
-- out 3 | 60 | 30
-- out 3 | 90 | 45
-- out (9 rows)
I have a table in vertica :
id Timestamp Mask1 Mask2
-------------------------------------------
1 11:30 50 100
1 11:35 52 101
2 12:00 53 102
3 09:00 50 100
3 22:10 52 105
. . . .
. . . .
Which I want to transform into :
id rows 09:00 11:30 11:35 12:00 22:10 .......
--------------------------------------------------------------
1 Mask1 Null 50 52 Null Null .......
Mask2 Null 100 101 Null Null .......
2 Mask1 Null Null Null 53 Null .......
Mask2 Null Null Null 102 Null .......
3 Mask1 50 Null Null Null 52 .......
Mask2 100 Null Null Null 105 .......
The dots (...) indicate that I have many records.
Timestamp is for a whole day and is of format hours:minutes:seconds starting from 00:00:00 to 24:00:00 for a day (I have just used hours:minutes for the question).
I have shown just two Mask columns, Mask1 and Mask2; in reality I have about 200 Mask columns to work with.
I have shown 5 records, but in reality I have about a million records.
What I have tried so far:
Dumping each records based on id in a csv file.
Applying transpose in python pandas.
Joining the transposed tables.
The possible generic solution may be pivoting in vertica (or UDTF), but I am fairly new to this database.
I have been struggling with this logic for a couple of days. Can anyone please help me? Thanks a lot.
Below is the solution as I would code it for just the time values that you have in your data examples.
If you really want to be able to display all 86400 values from '00:00:00' through '23:59:59', though, you won't be able to. Vertica's maximum number of columns is 1600.
You could, however, play with the Vertica function TIME_SLICE(timestamp::TIMESTAMP,1,'MINUTE')::TIME
(TIME_SLICE takes a timestamp as input and returns a timestamp, so you have to cast (::) back and forth), to reduce the number of distinct time values, and therefore output columns, to 1440 ...
In any case, I would start with SELECT DISTINCT timestamp FROM input ORDER BY 1; and then, in the final query, generate one line per timestamp found (hoping there won't be more than 1598 ...), like the ones actually used for your data:
, SUM(CASE timestamp WHEN '09:00' THEN val END) AS "09:00"
, SUM(CASE timestamp WHEN '11:30' THEN val END) AS "11:30"
, SUM(CASE timestamp WHEN '11:35' THEN val END) AS "11:35"
, SUM(CASE timestamp WHEN '12:00' THEN val END) AS "12:00"
, SUM(CASE timestamp WHEN '22:10' THEN val END) AS "22:10"
SQL in general has no variable number of output columns from any given query. If the number of final columns varies depending on the data, you will have to generate your final query from the data, and then run it.
Welcome to SQL and relational databases ..
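Generating the final query from the data can be done in any scripting language. Below is a rough Python sketch that only builds the SQL text from a list of distinct time values; it does not talk to Vertica. build_pivot_sql, COLUMN_TEMPLATE and TIME_PATTERN are names invented here, and the vertical/val/rows names are taken from the script in this answer. The values are pasted into the SQL, so they are checked against a strict HH:MM[:SS] pattern first.

```python
import re

# Only plain HH:MM or HH:MM:SS strings are accepted, since they end up in the SQL text.
TIME_PATTERN = re.compile(r"\d{2}:\d{2}(:\d{2})?")
COLUMN_TEMPLATE = '  , SUM(CASE timestamp WHEN \'{ts}\' THEN val END) AS "{ts}"'


def build_pivot_sql(timestamps):
    """Build the horizontal-pivot SELECT with one output column per distinct time."""
    columns = []
    for ts in sorted(set(timestamps)):
        if not TIME_PATTERN.fullmatch(ts):
            raise ValueError(f"unexpected time value: {ts!r}")
        columns.append(COLUMN_TEMPLATE.format(ts=ts))
    lines = ["SELECT", "  id", "  , rows", *columns,
             "FROM vertical", "GROUP BY id, rows", "ORDER BY id, rows;"]
    return "\n".join(lines)


sql = build_pivot_sql(["11:30", "11:35", "12:00", "09:00", "22:10", "11:30"])
print(sql)
```

You would feed it the result of the SELECT DISTINCT query and then run the generated statement with the WITH clause in front of it.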
Here's the complete script for your data. I pivot vertically first, along the "Mask-n" column names, and then I re-pivot horizontally, along the timestamps.
\pset null Null
-- ^ this is a vsql command to display nulls with the "Null" string
WITH
-- your input, not in final query
input(id,Timestamp,Mask1,Mask2) AS (
SELECT 1 , TIME '11:30' , 50 , 100
UNION ALL SELECT 1 , TIME '11:35' , 52 , 101
UNION ALL SELECT 2 , TIME '12:00' , 53 , 102
UNION ALL SELECT 3 , TIME '09:00' , 50 , 100
UNION ALL SELECT 3 , TIME '22:10' , 52 , 105
)
,
-- real WITH clause starts here
-- need an index for your 200 masks
i(i) AS (
SELECT MICROSECOND(ts) FROM (
SELECT TIMESTAMPADD(MICROSECOND, 1,TIMESTAMP '2000-01-01') AS tm
UNION ALL SELECT TIMESTAMPADD(MICROSECOND,200,TIMESTAMP '2000-01-01') AS tm
)x
TIMESERIES ts AS '1 MICROSECOND' OVER(ORDER BY tm)
)
,
-- verticalised masks
vertical AS (
SELECT
id
, i
, CASE i
WHEN 1 THEN 'Mask001'
WHEN 2 THEN 'Mask002'
WHEN 200 THEN 'Mask200'
END AS rows
, timestamp
, CASE i
WHEN 1 THEN Mask1
WHEN 2 THEN Mask2
WHEN 200 THEN 0 -- no mask200 present
END AS val
FROM input CROSS JOIN i
WHERE i <=2 -- only 2 masks present currently
)
-- test the vertical CTE ...
-- SELECT * FROM vertical order by id,rows,timestamp;
-- out id | i | rows | timestamp | val
-- out ----+---+---------+-----------+-----
-- out 1 | 1 | Mask001 | 11:30:00 | 50
-- out 1 | 1 | Mask001 | 11:35:00 | 52
-- out 1 | 2 | Mask002 | 11:30:00 | 100
-- out 1 | 2 | Mask002 | 11:35:00 | 101
-- out 2 | 1 | Mask001 | 12:00:00 | 53
-- out 2 | 2 | Mask002 | 12:00:00 | 102
-- out 3 | 1 | Mask001 | 09:00:00 | 50
-- out 3 | 1 | Mask001 | 22:10:00 | 52
-- out 3 | 2 | Mask002 | 09:00:00 | 100
-- out 3 | 2 | Mask002 | 22:10:00 | 105
SELECT
id
, rows
, SUM(CASE timestamp WHEN '09:00' THEN val END) AS "09:00"
, SUM(CASE timestamp WHEN '11:30' THEN val END) AS "11:30"
, SUM(CASE timestamp WHEN '11:35' THEN val END) AS "11:35"
, SUM(CASE timestamp WHEN '12:00' THEN val END) AS "12:00"
, SUM(CASE timestamp WHEN '22:10' THEN val END) AS "22:10"
FROM vertical
GROUP BY
id
, rows
ORDER BY
id
, rows
;
-- out Null display is "Null".
-- out id | rows | 09:00 | 11:30 | 11:35 | 12:00 | 22:10
-- out ----+---------+-------+-------+-------+-------+-------
-- out 1 | Mask001 | Null | 50 | 52 | Null | Null
-- out 1 | Mask002 | Null | 100 | 101 | Null | Null
-- out 2 | Mask001 | Null | Null | Null | 53 | Null
-- out 2 | Mask002 | Null | Null | Null | 102 | Null
-- out 3 | Mask001 | 50 | Null | Null | Null | 52
-- out 3 | Mask002 | 100 | Null | Null | Null | 105
-- out (6 rows)
-- out
-- out Time: First fetch (6 rows): 28.143 ms. All rows formatted: 28.205 ms
You can use union all to unpivot the data and then conditional aggregation:
select id, which,
max(case when timestamp >= '09:00' and timestamp < '09:30' then mask end) as "09:00",
max(case when timestamp >= '09:30' and timestamp < '10:00' then mask end) as "09:30",
max(case when timestamp >= '10:00' and timestamp < '10:30' then mask end) as "10:00",
. . .
from ((select id, timestamp,
'Mask1' as which, Mask1 as mask
from t
) union all
(select id, timestamp, 'Mask2' as which, Mask2 as mask
from t
)
) t
group by t.id, t.which;
Note: This includes the id on each row. I strongly recommend doing that, but you could use:
select (case when which = 'Mask1' then id end) as id
If you really wanted to.
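For a quick local check of the unpivot-then-aggregate pattern, here is a sketch on SQLite via Python. It assumes the times are stored as 'HH:MM' text in a column called ts (a made-up name) so that string comparison orders them correctly, and it only builds two half-hour buckets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, ts TEXT, Mask1 INTEGER, Mask2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "11:30", 50, 100),
        (1, "11:35", 52, 101),
        (2, "12:00", 53, 102),
        (3, "09:00", 50, 100),
        (3, "22:10", 52, 105),
    ],
)

QUERY = """
SELECT id, which,
       MAX(CASE WHEN ts >= '09:00' AND ts < '09:30' THEN mask END) AS "09:00",
       MAX(CASE WHEN ts >= '11:30' AND ts < '12:00' THEN mask END) AS "11:30"
FROM (SELECT id, ts, 'Mask1' AS which, Mask1 AS mask FROM t
      UNION ALL
      SELECT id, ts, 'Mask2' AS which, Mask2 AS mask FROM t) AS u
GROUP BY id, which
ORDER BY id, which
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Note that 11:30 and 11:35 fall into the same bucket for id 1, so MAX keeps only the larger value; use narrower buckets if every reading has to survive.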
We have an absence system where people are putting in their total time off instead of splitting it into different records. So my data looks like this
EMP_ID | HOURS | DATE
---------|-----------|------------
1 | 24 | 2013-10-10
2 | 8 | 2013-11-06
3 | 48 | 2013-11-13
4 | 51 | 2013-12-10
I need a query (this can ultimately be a stored proc) that will bring data back like so
EMP_ID | HOURS | DATE
-------------|-----------|-------------
1 | 24 | 2013-10-10
2 | 8 | 2013-11-06
3 | 24 | 2013-11-13
3 | 24 | 2013-11-14
4 | 24 | 2013-12-10
4 | 24 | 2013-12-11
4 | 3 | 2013-12-12
Notice how the day just increases by 1. Anyone who entered more than 24 hours will get their record split. Any residual is tacked on at the end as a partial day.
Lastly (I can work this on my own), I need to be careful not to cross over the year's end.
If this is SQL Server you can use a recursive CTE.
with cte as
(
select emp_id,
hours as tHours,
case when hours >= 24 then 24 else hours end as hours,
date
from YourTable
UNION ALL
select c.emp_id,
c.thours - (case when c.thours >= 24 then 24 else c.hours end),
case when (c.thours - (case when c.thours >= 24 then 24 else c.hours end)) >= 24 then 24 else c.thours - 24 end,
dateadd(day, 1, c.date)
from cte c
join YourTable m on c.thours - 24 > 0 and m.emp_id = c.emp_id
)
select emp_id,
hours,
date
from cte
order by emp_id, date
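If it helps to see the splitting rule on its own, here is the same logic as a small Python function. It is only a restatement of what the recursive CTE does, not SQL Server code; split_absence and its chunk parameter are names made up for this sketch, and it does nothing about the year-end concern.

```python
from datetime import date, timedelta


def split_absence(emp_id, hours, start, chunk=24):
    """Split a total number of hours into per-day rows of at most `chunk` hours."""
    rows = []
    day = start
    remaining = hours
    while remaining > 0:
        part = min(chunk, remaining)   # a full day, or the residual at the end
        rows.append((emp_id, part, day))
        remaining -= part
        day += timedelta(days=1)       # the next chunk lands on the next day
    return rows


for row in split_absence(4, 51, date(2013, 12, 10)):
    print(row)
```

An entry of zero or negative hours yields no rows here, which may or may not be what you want.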