Sorry if this has been asked before. Let's imagine I have a table of temperature measurements inside a set of mechanical components:
ComponentID  Timestamp           Value
-----------  ------------------  -----
A            1st Jan 2020 00:00  20 C
A            1st Jan 2020 00:10  25 C
B            1st Jan 2018 00:00  19C
...and so on. The table is fairly big: I have thousands of components with 10-minute measurements over a couple of years. What I need is a tally of the temperatures for each component in each year, in, say, 5-degree bins, so a table looking like this:
ComponentID  Year  [-20;-15)  [-15;-10)  [-10;-5)  ...
-----------  ----  ---------  ---------  --------  ---
A            2018          5         20       300  ...
A            2019          0         41       150  ...
B            2018         60         10         1  ...
...so for each component in each year, I count the number of measurements where the temperature was in the [-20;-15) range, the number in the [-15;-10) range, and so on. I have a query doing this, but it's awfully slow. Is there an 'optimal' way of doing this kind of aggregation?
I'd say you should first pre-process your data to make it simpler to aggregate, then aggregate it with another query like this (MySQL syntax):
SELECT cats.ComponentID, cats.Year,
SUM(tm5) `[-5;0)`,
SUM(t00) `[0;5)`,
SUM(tp5) `[5;10)`,
SUM(tp10) `[10;15)`,
SUM(tp15) `[15;20)`,
SUM(tp20) `[20;25)`,
SUM(tp25) `[25;30)`
FROM (
SELECT
ComponentID,
YEAR(`Timestamp`) `Year`,
(`Value` BETWEEN -5 AND -0.0001 ) tm5,
(`Value` BETWEEN 0 AND 4.9999 ) t00,
(`Value` BETWEEN 5 AND 9.9999 ) tp5,
(`Value` BETWEEN 10 AND 14.9999) tp10,
(`Value` BETWEEN 15 AND 19.9999) tp15,
(`Value` BETWEEN 20 AND 24.9999) tp20,
(`Value` BETWEEN 25 AND 29.9999) tp25
FROM
measurements
) cats
GROUP BY cats.ComponentID, cats.Year
ORDER BY cats.ComponentID, cats.Year
The inner query could be materialized into a temporary table if it's too much of a strain on memory.
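For instance, a minimal sketch of that idea (MySQL, same measurements table assumed, showing only two of the bucket flags):

-- Hedged sketch: materialize the categorized rows first, then aggregate them.
CREATE TEMPORARY TABLE cats AS
SELECT
    ComponentID,
    YEAR(`Timestamp`) AS `Year`,
    (`Value` BETWEEN -5 AND -0.0001) AS tm5,
    (`Value` BETWEEN  0 AND  4.9999) AS t00
    -- ...remaining bucket flags exactly as in the inner query above...
FROM measurements;

SELECT ComponentID, `Year`,
       SUM(tm5) AS `[-5;0)`,
       SUM(t00) AS `[0;5)`
       -- ...remaining SUM()s...
FROM cats
GROUP BY ComponentID, `Year`
ORDER BY ComponentID, `Year`;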
I've ignored the fact that your temperatures are expressed as strings including the unit; you should of course convert them to numbers at some point, but that wasn't the point of the question.
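Purely as a hedged illustration of that conversion (MySQL), assuming the stored values look like '20 C' or '19C':

-- Hypothetical one-off cleanup: strip spaces and a trailing 'C', then cast into a numeric column.
ALTER TABLE measurements ADD COLUMN ValueNum DECIMAL(6,2);

UPDATE measurements
SET ValueNum = CAST(TRIM(TRAILING 'C' FROM REPLACE(`Value`, ' ', '')) AS DECIMAL(6,2));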
Input (table named measurements):
id ComponentID Timestamp Value
------ ----------- ------------------- --------
3 B 2018-01-01 00:00:00 19
4 A 2019-03-05 05:10:00 16
5 A 2019-12-01 00:00:00 18
1 A 2020-01-01 00:00:00 20
2 A 2020-01-01 00:10:00 25
Result:
ComponentID Year [-5;0) [0;5) [5;10) [10;15) [15;20) [20;25) [25;30)
----------- ------ ------ ------ ------ ------- ------- ------- ---------
A 2019 0 0 0 0 2 0 0
A 2020 0 0 0 0 0 1 1
B 2018 0 0 0 0 1 0 0
I would suggest:
SELECT ComponentID, YEAR(`Timestamp`) as Year,
       SUM(Value >= -20 AND Value < -15) as `[-20;-15)`,
       SUM(Value >= -15 AND Value < -10) as `[-15;-10)`,
       SUM(Value >= -10 AND Value < -5)  as `[-10;-05)`,
       SUM(Value >= -5  AND Value < 0)   as `[-05;00)`,
       SUM(Value >= 0   AND Value < 5)   as `[00;05)`,
       . . .
FROM measurements m
GROUP BY m.ComponentID, Year;
Note the use of inequalities to capture the exact ranges that you want.
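If you'd rather not hard-code every range, a hedged alternative (MySQL syntax, assuming Value is already numeric) is to derive the 5-degree bucket with FLOOR() and return one row per bucket instead of one column per bucket:

-- One row per (component, year, 5-degree bucket); pivot later if you need the wide layout.
SELECT ComponentID,
       YEAR(`Timestamp`)      AS `Year`,
       FLOOR(`Value` / 5) * 5 AS bucket_low,   -- e.g. -20 stands for the [-20;-15) bin
       COUNT(*)               AS n
FROM measurements
GROUP BY ComponentID, `Year`, bucket_low
ORDER BY ComponentID, `Year`, bucket_low;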
Related
I'm asking for your help after several unsuccessful attempts.
I am learning PL/SQL and I am using Oracle SQL Developer v20.
I have this situation. My data set looks like this:
id_file size_byte created_at
_________ _________ ____________________________
1 45323 17-FEB-22 17:21:13,726874000
2 41232 17-FEB-22 17:21:13,740587004
3 1234456 20-FEB-22 17:25:13,368874058
4 233545488 20-FEB-22 17:21:18,400049000
5 233545488 21-FEB-22 18:11:18,058746868
So my desired output would be something like this for year 2022:
TOT_records AVG_file_created_for_day TOT_size_files AVG_size_files_created_each_day
___________ ________________________ ______________ _______________________________
9.999.999 10.000 999.999.999 5 MB (default is byte)
ID is of type NUMBER, SIZE_BYTE is of type NUMBER, and CREATED_AT is TIMESTAMP(6).
My table is partitioned by year; PARTITION_DATE is of type DATE.
There's some ambiguity in things like "average file size per day"... That could be:
the sum of all file sizes / the total number of days, or
the average file size per day, then take the average of those daily averages.
Anyway, here's some stuff to get you going (I'm assuming the latter above).
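For comparison, a hedged sketch of the former interpretation (total bytes divided by the number of distinct days), run against the same table t created below:

-- Interpretation 1: sum of all file sizes / number of distinct days, in MB.
select sum(bytes) / count(distinct trunc(created_at)) / 1024 / 1024 as avg_size_per_day_mb
from   t;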
SQL> create table t as
2 select
3 rownum id_file,
4 dbms_random.value(1000,20000000) bytes,
5 date '2021-01-01' + dbms_random.value(1,700) created_at
6 from dual
7 connect by level <= 5000;
Table created.
SQL>
SQL> select * from t
2 where rownum <= 20;
ID_FILE BYTES CREATED_A
---------- ---------- ---------
1 19305636.7 02-SEP-22
2 6305773.83 10-OCT-21
3 11939117.8 04-NOV-21
4 11039507.9 01-SEP-21
5 15555516.8 02-NOV-22
6 2809048.47 13-SEP-22
7 2070381.41 18-DEC-21
8 11116786.1 11-MAR-22
9 17519679.8 21-DEC-21
10 6728222.84 02-APR-22
11 7569442.31 07-AUG-22
12 16949454.2 06-JUL-21
13 8019443.02 03-JUN-21
14 13147674.9 31-AUG-21
15 14590702.5 16-JUL-22
16 13028609.7 11-MAY-21
17 5466477.07 06-APR-22
18 4469902.12 08-MAY-21
19 14511096 31-MAY-22
20 5245726.03 12-JUL-21
20 rows selected.
SQL> select
2 count(*) total_records,
3 avg(daily_size_avg)/1024/1024 avg_size_files_per_day_mb,
4 sum(bytes)/1024/1024/1024 tot_bytes_gb,
5 avg(files_per_day) avg_files_per_day
6 from
7 (
8 select
9 bytes,
10 avg(bytes) over ( partition by trunc(created_at) ) daily_size_avg,
11 count(*) over ( partition by trunc(created_at) ) files_per_day
12 from t
13 );
TOTAL_RECORDS AVG_SIZE_FILES_PER_DAY_MB TOT_BYTES_GB AVG_FILES_PER_DAY
------------- ------------------------- ------------ -----------------
5000 9.5313187 46.5396421 8.092
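Since the question asked specifically for year 2022, a hedged variation of the same query that filters before aggregating (same table t assumed):

select
    count(*) total_records,
    avg(daily_size_avg)/1024/1024 avg_size_files_per_day_mb,
    sum(bytes)/1024/1024/1024 tot_bytes_gb,
    avg(files_per_day) avg_files_per_day
from
(
    select
        bytes,
        avg(bytes) over ( partition by trunc(created_at) ) daily_size_avg,
        count(*)   over ( partition by trunc(created_at) ) files_per_day
    from t
    where created_at >= date '2022-01-01'
      and created_at <  date '2023-01-01'
);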
I have a table like the one below:
id  number  date
--  ------  ----------
1   23      2020-01-01
2   12      2020-03-02
3   23      2020-09-02
4   11      2019-03-04
5   12      2019-03-23
6   23      2019-04-12
I want to know how many times each number appears per year, such as:
number  2019  2020
------  ----  ----
23      1     2
12      1     1
11      1     0
I'm kind of stuck. I tried a left join and a plain single select, but I still cannot figure out how to do it. Please help, thank you!
SELECT C.NUMBER,
       SUM
       (
           CASE
               WHEN C.DATE BETWEEN '20190101' AND '20191231'
               THEN 1 ELSE NULL
           END
       ) AS A_2019,
       SUM
       (
           CASE
               WHEN C.DATE BETWEEN '20200101' AND '20201231'
               THEN 1 ELSE NULL
           END
       ) AS A_2020
FROM I_have_a_table_like_below AS C
GROUP BY C.NUMBER
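If your database supports extracting the year directly (e.g. YEAR() in MySQL or SQL Server; the question doesn't say which engine is used), a hedged equivalent using conditional aggregation:

SELECT C.NUMBER,
       SUM(CASE WHEN YEAR(C.DATE) = 2019 THEN 1 ELSE 0 END) AS A_2019,
       SUM(CASE WHEN YEAR(C.DATE) = 2020 THEN 1 ELSE 0 END) AS A_2020
FROM I_have_a_table_like_below AS C
GROUP BY C.NUMBER;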
I am trying to calculate the churn rate from data that has customer_id, group, and date. The aggregation is going to be by id, group, and date. The churn formula is (customers in previous cohort - customers in last cohort) / customers in previous cohort, where
"customers in previous cohort" refers to the cohort from the 28 days before the last one, and
"customers in last cohort" refers to the cohort from the last 28 days.
I am not sure how to aggregate by date range to calculate the churn.
Here is sample data that I copied from SQL Group by Date Range:
Date Group Customer_id
2014-03-01 A 1
2014-04-02 A 2
2014-04-03 A 3
2014-05-04 A 3
2014-05-05 A 6
2015-08-06 A 1
2015-08-07 A 2
2014-08-29 XXXX 2
2014-08-09 XXXX 3
2014-08-10 BB 4
2014-08-11 CCC 3
2015-08-12 CCC 2
2015-03-13 CCC 3
2014-04-14 CCC 5
2014-04-19 CCC 4
2014-08-16 CCC 5
2014-08-17 CCC 3
2014-08-18 XXXX 2
2015-01-10 XXXX 3
2015-01-20 XXXX 4
2014-08-21 XXXX 5
2014-08-22 XXXX 2
2014-01-23 XXXX 3
2014-08-24 XXXX 2
2014-02-25 XXXX 3
2014-08-26 XXXX 2
2014-06-27 XXXX 4
2014-08-28 XXXX 1
2014-08-29 XXXX 1
2015-08-30 XXXX 2
2015-09-31 XXXX 3
The goal is to calculate the churn rate every 28 days between 2014 and 2015 with the formula given above. So it comes down to aggregating the data in rolling 28-day windows and calculating the churn with the formula.
Here is what I tried to aggregate the data by date range:
SELECT COUNT(distinct customer_id) AS count_ids, Group,
DATE_SUB(CAST(Date AS DATE), INTERVAL 56 DAY) AS Date_min,
DATE_SUB(CURRENT_DATE, INTERVAL 28 DAY) AS Date_max
FROM churn_agg
GROUP BY count_ids, Group, Date_min, Date_max
I hope someone can help me with the aggregation and the churn calculation. I simply want to subtract each aggregated count_ids from the next aggregated count_ids 28 days later, i.e. successive differences of the same column value (count_ids). I am not sure whether I need a rolling window or a simple aggregation to find the churn.
As corrected by #jarlh, it's not 2015-09-31 but 2015-09-30
You can use this to create a 28-day calendar:
create table daysby28 (i int, _Date date);
insert into daysby28 (i, _Date)
SELECT i, cast('01-01-2014' as date) + i*INTERVAL '28 day'
from generate_series(0,50) i
order by 1;
After you create the churn_agg table with the script #jarlh sent in his fiddle, this query gets you what you want:
with cte as
(
select count(Customer) as TotalCustomer, Cohort, CohortDateStart From
(
select distinct a.Customer_id as Customer, b.i as Cohort, b._Date as CohortDateStart
from churn_agg a left join daysby28 b on a._Date >= b._Date and a._Date < b._Date + INTERVAL '28 day'
) a
group by Cohort, CohortDateStart
)
select a.CohortDateStart,
1.0*(b.TotalCustomer - a.TotalCustomer)/(1.0*b.TotalCustomer) as Churn from cte a
left join cte b on a.cohort > b.cohort
and not exists(select 1 from cte c where c.cohort > b.cohort and c.cohort < a.cohort)
order by 1
The fiddle putting it all together is here.
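As a hedged alternative under the same assumptions (PostgreSQL, the daysby28 calendar above and #jarlh's churn_agg table with a _Date column), the previous-cohort lookup can also be done with LAG() instead of the self-join:

-- Sketch: one row per cohort, churn computed against the previous non-empty cohort.
with cohorts as (
    select b.i                           as cohort,
           b._Date                       as cohort_date_start,
           count(distinct a.Customer_id) as total_customers
    from churn_agg a
    join daysby28 b
      on a._Date >= b._Date
     and a._Date <  b._Date + INTERVAL '28 day'
    group by b.i, b._Date
)
select cohort_date_start,
       1.0 * (lag(total_customers) over (order by cohort) - total_customers)
           / lag(total_customers) over (order by cohort) as churn
from cohorts
order by cohort_date_start;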
I have the receiving and sending data for a whole year, and I want to build a monthly report based on that data with a first-in-first-out (FIFO) rule: the first receiving will be sent out first...
DECLARE @ReceivingTbl AS TABLE(Id INT,ProId int, RecQty INT,ReceivingDate DateTime)
INSERT INTO @ReceivingTbl
VALUES (1,1001,210,'2019-03-12'),
(2,1001,315,'2019-06-15'),
(3,2001,500,'2019-04-01'),
(4,2001,10,'2019-06-15'),
(5,1001,105,'2019-07-10')
DECLARE @SendTbl AS TABLE(Id INT,ProId int, SentQty INT,SendMonth int)
INSERT INTO @SendTbl
VALUES (1,1001,50,3),
(2,1001,100,4),
(3,1001,80,5),
(4,1001,80,6),
(5,2001,200,6)
SELECT * FROM @ReceivingTbl ORDER BY ProId,ReceivingDate
SELECT * FROM @SendTbl ORDER BY ProId,SendMonth
Id ProId RecQty ReceivingDate
1 1001 210 2019-03-12
2 1001 315 2019-06-15
5 1001 105 2019-07-10
3 2001 500 2019-04-01
4 2001 10 2019-06-15
Id ProId SentQty SendMonth
1 1001 50 3
2 1001 100 4
3 1001 80 5
4 1001 80 6
5 2001 200 6
--- And below is what I want:
Id ProId RecQty ReceivingDate ... Mar Apr May Jun
1 1001 210 2019-03-12 ... 50 100 60 0
2 1001 315 2019-06-15 ... 0 0 20 80
5 1001 105 2019-07-10 ... 0 0 0 0
3 2001 500 2019-04-01 ... 0 0 0 200
4 2001 10 2019-06-15 ... 0 0 0 0
Thanks!
Your question is not clear to me.
If you want to use a purely FIFO approach, and therefore ignore any dates the table contains, you necessarily need to order by Id, which your example provides and which looks like it reflects the insertion order.
The first line inserted should then also be the first line returned by the select (FIFO); to do that you have to use:
ORDER BY Id ASC
which will place the lowest Id values first (1, 2, 3, ...).
To me, though, this doesn't make much sense, so pay attention to the meaning of the data you actually have: leverage dates like ReceivingDate and order by that, maybe even filtering by the month of the date. Below is an example for January data:
WHERE MONTH(ReceivingDate) = 1
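Put together, a hedged sketch of that suggestion (T-SQL, reusing the @ReceivingTbl variable from the question; note the sample data has no January rows, so you would use 3 for March, and so on):

SELECT Id, ProId, RecQty, ReceivingDate
FROM @ReceivingTbl
WHERE MONTH(ReceivingDate) = 1   -- January receipts only
ORDER BY Id ASC                  -- lowest Id first, i.e. first inserted comes out first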
Problem
I have one table of generated dates (s) which I want to join with another table (d), which is a list of dates on which a specific occurrence has happened.
table s
Wednesday 23rd August 2017
Thursday 24th August 2017
Friday 25th August 2017
Saturday 26th August 2017
table d
day_created                 count
Thursday 24th August 2017   45
Saturday 26th August 2017   32
I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
I want something that looks like:
day_created                 count
Wednesday 23rd August 2017  0
Thursday 24th August 2017   45
Friday 25th August 2017     0
Saturday 26th August 2017   32
I've tried joining with a left join as follows:
SELECT day_created, COUNT(d.day_created) as total_per_day
FROM
(SELECT date_trunc('day', task_1.created_at) as day_created
FROM task_1
)
d
LEFT JOIN (
SELECT (generate_series('2017-05-01', current_date, '1 day'::INTERVAL)) as standard_date
)
s
ON d.day_created=s.standard_date
GROUP BY d.day_created
ORDER BY day_created DESC;
I don't get an error, but the join isn't working (i.e. it doesn't return dates where the count is zero). What it returns is the dates from table d with their counts, but not the dates in between where there are 0 occurrences.
I've been going round in circles and have understood that I need to make table s (I think!) the left table, but as a newbie I'm getting confused by the syntax.
This is all in PostgreSQL 9.5.8.
Basically, you had the LEFT JOIN backwards. This should work, with some other simplifications and performance optimizations:
SELECT s.standard_date, COUNT(d.day_created) AS total_per_day
FROM generate_series('2017-05-01', current_date, interval '1 day') s(standard_date)
LEFT JOIN task_1 d ON d.day_created >= s.standard_date
AND d.day_created < s.standard_date + interval '1 day'
GROUP BY 1
ORDER BY 1;
This counts rows in d, as you commented; it does not sum values.
Be aware that generate_series() still returns timestamp with time zone, even if you pass date values to it. You may want to cast to date or format with to_char() for display in the outer SELECT. (But rather group and order by the original timestamp value, not the formatted string.)
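For example, a hedged variation of the query above that keeps the grouping on the original timestamp but casts to date for display (same column names assumed as in the query above):

SELECT s.standard_date::date AS day_created, COUNT(d.day_created) AS total_per_day
FROM   generate_series('2017-05-01', current_date, interval '1 day') s(standard_date)
LEFT   JOIN task_1 d ON d.day_created >= s.standard_date
                    AND d.day_created <  s.standard_date + interval '1 day'
GROUP  BY s.standard_date
ORDER  BY s.standard_date;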
There may be corner cases depending on the current time zone setting and on the actual, undisclosed table definition.
Related:
How to avoid a subquery in FILTER clause?
I have one table of generated dates (s)
In real databases, we don't store a generated series. We just generate them when needed.
which I want to join with another table (d) which is a list of dates where a specific occurrence has happened. [...] I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
Nah, you can do it.
CREATE TABLE d(day_created, count) AS VALUES
('24 August 2017'::date, 45),
('26 August 2017'::date, 32);
SELECT day_created, coalesce(count,0)
FROM (
SELECT d::date
FROM generate_series(
'2017-08-01'::timestamp without time zone,
'2017-09-01'::timestamp without time zone,
'1 day'
) AS gs(d)
) AS gs(day_created)
LEFT OUTER JOIN d USING(day_created)
ORDER BY day_created;
day_created | coalesce
-------------+----------
2017-08-01 | 0
2017-08-02 | 0
2017-08-03 | 0
2017-08-04 | 0
2017-08-05 | 0
2017-08-06 | 0
2017-08-07 | 0
2017-08-08 | 0
2017-08-09 | 0
2017-08-10 | 0
2017-08-11 | 0
2017-08-12 | 0
2017-08-13 | 0
2017-08-14 | 0
2017-08-15 | 0
2017-08-16 | 0
2017-08-17 | 0
2017-08-18 | 0
2017-08-19 | 0
2017-08-20 | 0
2017-08-21 | 0
2017-08-22 | 0
2017-08-23 | 0
2017-08-24 | 45
2017-08-25 | 0
2017-08-26 | 32
2017-08-27 | 0
2017-08-28 | 0
2017-08-29 | 0
2017-08-30 | 0
2017-08-31 | 0
2017-09-01 | 0
(32 rows)