Multiple SELECTS and a CTE table - sql

I have this statement, which returns values for the dates that exist in the table; the CTE then just fills in the half-hourly intervals.
with cte (reading_date) as (
select date '2020-11-17' from dual
union all
select reading_date + interval '30' minute
from cte
where reading_date + interval '30' minute < date '2020-11-19'
)
select c.reading_date, d.reading_value
from cte c
left join dcm_reading d on d.reading_date = c.reading_date
order by c.reading_date
However, later on I needed to use a SELECT within a SELECT, like this:
SELECT serial_number,
register,
reading_date,
reading_value,
ABS(A_plus)
FROM
(
SELECT
serial_number,
register,
TO_DATE(reading_date, 'DD-MON-YYYY HH24:MI:SS') AS reading_date,
reading_value,
LAG(reading_value,1, 0) OVER(ORDER BY reading_date) AS previous_read,
LAG(reading_value, 1, 0) OVER (ORDER BY reading_date) - reading_value AS A_plus,
reading_id
FROM DCM_READING
WHERE device_id = 'KXTE4501'
AND device_type = 'E'
AND serial_number = 'A171804699'
AND reading_date BETWEEN TO_DATE('17-NOV-2019' || ' 000000', 'DD-MON-YYYY HH24MISS') AND TO_DATE('19-NOV-2019' || ' 235959', 'DD-MON-YYYY HH24MISS')
ORDER BY reading_date)
ORDER BY serial_number, reading_date;
For extra information:
I am selecting data from a table that exists, and using the LAG function to work out the difference in reading_value from the previous record. However, later on I needed to insert dummy data where there are missing half-hour reads. The CTE brings back a list of all half-hour intervals between the two dates I am querying on.
Ultimately I want a result that has all the reading_dates at half-hour intervals, the reading_value (if there is one), and the difference between the reading_values that do exist. For the half-hourly reads that have no data in DCM_READING I just want to return NULL.
Is it possible to use a CTE table with multiple selects?

Not sure what you would like to achieve, but you can have multiple CTEs or even nest them:
with
cte_1 as
(
select username
from dba_users
where oracle_maintained = 'N'
),
cte_2 as
(
select owner, round(sum(bytes)/1024/1024) as megabytes
from dba_segments
group by owner
),
cte_3 as
(
select username, megabytes
from cte_1
join cte_2 on cte_1.username = cte_2.owner
)
select *
from cte_3
order by username;
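Applied to your case, you should be able to keep your half-hour calendar as one CTE and put the filtered LAG query into a second CTE, then left join them. Here is a minimal sketch (untested; it assumes reading_date in DCM_READING falls exactly on half-hour boundaries, reuses the filters from your inner query, and the CTE names calendar and readings are just illustrative):
with calendar (reading_date) as (
  select date '2020-11-17' from dual
  union all
  select reading_date + interval '30' minute
  from calendar
  where reading_date + interval '30' minute < date '2020-11-19'
),
readings as (
  select reading_date,
         reading_value,
         lag(reading_value, 1, 0) over (order by reading_date) - reading_value as a_plus
  from dcm_reading
  where device_id = 'KXTE4501'
    and device_type = 'E'
    and serial_number = 'A171804699'
)
select c.reading_date,
       r.reading_value,           -- NULL where no reading exists for the half hour
       abs(r.a_plus) as a_plus    -- NULL as well for the missing intervals
from calendar c
left join readings r on r.reading_date = c.reading_date
order by c.reading_date;
The LAG difference is computed only over the rows that actually exist in DCM_READING, so calendar intervals with no matching reading simply return NULL for both reading_value and a_plus.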

Related

BigQuery: iterating groups within a window of 28days before a start_date column using _TABLE_SUFFIX

I got a table like this:
group_id   start_date   end_date
19335      20220613     20220714
19527      20220620     20220719
19339      20220614     20220720
19436      20220616     20220715
20095      20220711     20220809
I am trying to retrieve data from another table that is partitioned, where the data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_ids within the period [start_date, end_date]. What I need is to retrieve a column/metric for those users for the 28 days prior to the start_date of each group_id.
My idea is to:
Retrieve distinct user_id per group_id within the period [start_date, end_date]
Retrieve previous 28d metric data prior to the start date of each group_id
A code snippet showing how to retrieve the data for a single group_id is the following:
WITH users_per_group AS (
SELECT
users_metadata.user_id,
users_metadata.group_id,
FROM
`my_table_users_*` users_metadata
WHERE
_TABLE_SUFFIX BETWEEN '20220314' --start_date
AND '20220413' --end_date
AND experiment_id = 16709
GROUP BY
1,
2
)
SELECT
_TABLE_SUFFIX AS date,
user_id,
SUM(
COALESCE(metric, 0)
) AS metric,
FROM
users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 28 DAY
)
) -- 28 days before it starts
AND FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 1 DAY
)
) -- 1 day before it starts
GROUP BY
1,
2
ORDER BY
date ASC
Also, I want to avoid retrieving all data (for all dates) from that metric table, as it is huge and it would take a very long time.
Is there an easy way to retrieve the metric data of each user across groups, considering the 28 days prior to the start_date of each group_id?
I can think of two approaches:
1. Join all the tables and then perform your query.
2. Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e. you need to calculate each user's search range before you do anything.
E.g.:
WITH users_per_group AS (
SELECT
user_id, group_id
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
FROM TableName
)
Once you have this kind of table then you can use any of the mentioned approaches.
Since I don't have your data and don't know about your table names I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user,grp,start_date,end_date
with maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
),
--then calculate search from-to date for every user and group
user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from --change interval as per your need
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
from maintable
)
select visit_from,_TABLE_SUFFIX date,count(visitId) total_visits from
user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d",ug.search_from) and format_date("%Y%m%d",ug.search_to)
group by 1,2
Approach 2
declare queries array<string> default [];
create temp table maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
);
create temp table user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
from maintable
);
-- for each user, create a separate query here
FOR record IN (SELECT * from user_per_grp)
DO
set queries = queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2',record.visit_from,record.search_from,record.search_to,record.visit_from)];
--replace your query here.
END FOR;
--aggregating all the queries and executing it
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data(~750 KB) than the 1st approach(~17 MB). But that might not be the same for your dataset as the date range may overlap for 2 users and that will lead to reading the same table twice.

BigQuery partition pruning while using analytical function as part of a view, on a DATETIME field with DAY granularity

I am trying to use analytical functions (e.g. FIRST_VALUE) while still benefiting from partition pruning, on a table partitioned on a DATETIME field with DAY granularity.
Example Data
Let's assume a table with the following columns:
name    type
dt      DATETIME
value   STRING
The table is partitioned on dt with the DAY granularity.
An example table can be created using the following SQL:
CREATE TABLE `project.dataset.example_partioned_table`
PARTITION BY DATE(dt)
AS
SELECT dt, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
The generated table will be over 1 GB (around 3.4 MB per day).
The Problem
Now I want to get the first value in each partition. (Later I actually want to have a further breakdown)
As I want to use a view, the view itself wouldn't know the final date range. In the example query I am using a temporary table t_view in place of the view.
WITH t_view AS (
SELECT
dt,
value,
FIRST_VALUE(value) OVER(
PARTITION BY DATE(dt)
ORDER BY dt
) AS first_val
FROM `project.dataset.example_partioned_table`
)
SELECT *,
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')
The query will contain something like some value: 2020-01-01 00:00:00 for first_val (i.e. first value for the date).
However, as it stands, it is scanning the whole table (over 1 GB), when it should just scan the partition.
Other observations
If I don't include first_val (the analytical function) in the result, then the partition pruning works as intended.
Including first_val causes it to scan everything.
If I don't wrap dt with DATE, then the partition pruning also works, but would of course not provide the correct value.
I also tried DATETIME_TRUNC(dt, DAY), with the same lack of partition pruning as DATE(dt).
Adding the date WHERE clause inside the temporary table also works, but I wouldn't know the date range inside the view.
How can I restrict the analytical function to the partition of the row?
Failed workaround using GROUP BY
Related, I also tried a workaround using GROUP BY (the date), with the same result...
WITH t_view_1 AS (
SELECT
dt,
DATE(dt) AS _date,
value,
FROM `project.dataset.example_partioned_table`
),
t_view_2 AS (
SELECT
_date,
MIN(value) AS first_val
FROM t_view_1
GROUP BY _date
),
t_view AS (
SELECT
t_view_1._date,
t_view_1.dt,
t_view_2.first_val
FROM t_view_1
JOIN t_view_2
ON t_view_2._date = t_view_1._date
)
SELECT *,
FROM t_view
WHERE _date = '2020-01-01'
As before, it is scanning the whole table rather than only processing the partition with the selected date.
Potentially working workaround with partition on DATE field
If the table is instead partitioned on a DATE field (_date), e.g.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
Then the partition pruning works with the following adjusted example query:
WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `elife-data-pipeline.de_proto.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')
i.e. the query scans around 4 MB rather than 1 GB
However, now I would need to add and populate that additional _date field. (Inconvenient with an external data source)
Having two fields with redundant information can also be confusing.
Additionally there is now no partition pruning at all on dt (queries need to make sure to use _date instead).
BigQuery functions can sometimes lead the query optimizer to make inefficient choices; however, we're constantly trying to improve the query optimizer.
So the best possible workaround in your scenario would be adding an extra date column and using it to partition the table, i.e.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
);
WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `mock.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')

greenplum string_agg conversion into hivesql supported

We are migrating a Greenplum SQL query to HiveQL, and it uses string_agg as in the statement below. How do we migrate it? Kindly help us. The sample Greenplum code that needs to be migrated to Hive is below.
select string_agg(Display_String, ';' order by data_day )
from
(
select data_day,
sum(revenue)/1000000.00 as revenue,
data_day||' '||trim(to_char(sum(revenue),'9,999,999,999')) as Display_String
from(
select case when data_date = current_date then 'D:'
when data_date = current_date - 1 then ' D-01:'
when data_date = current_date - 2 then ' D-02:'
when data_date = current_date - 7 then ' D-07:'
when data_date = current_date - 28 then ' D-28:'
end data_day, revenue/1000000.00 revenue
from test.testable
where data_date between current_date - 28 and current_date
  and hour <= (select hour
               from (select row_number() over (order by hour desc) iRowsID, hour
                     from test.testable
                     where data_date = current_date and type = 'UVC') tbl1
               where irowsid = 2)
  and type in ('UVC')
order by 1 desc) a
group by 1)aa;
There is nothing like this in Hive. However, you can use collect_list together with PARTITION BY/ORDER BY to calculate it.
select concat_ws(';', max(concat_str))
FROM (
  SELECT collect_list(Display_String) over (order by data_day) AS concat_str
  FROM
  (your above SQL) s
) concat_qry
Explanation:
collect_list concatenates the values and, while doing so, the ORDER BY orders the data on the day column.
The outermost MAX() picks up the maximum (i.e. the complete) collected list for the concatenated string.
Please note this is a very slow operation. Test performance as well before implementing it.
Here is a sample SQL and result to help you.
select
id, concat_ws(';', max(concat_str))
from
( select
s.id, collect_list(s.c) over (partition by s.id order by s.c ) concat_str
from
( select 1 id,'ax' c union
select 1,'b'
union select 2,'f'
union select 2,'g'
union all select 1,'b'
union all select 1,'b' )s
) gs
group by id

How to fill missing dates between empty records?

I am trying to fill in dates between empty records, but without success. I tried the multiple-selects method and tried to join, but it seems like I am missing the point. I would like to generate records for the missing dates so I can build a chart from this block of code. At first I would like to have the dates filled in "manually"; later I will reorganise this code and swap that method for an argument.
Can someone help me with that expression?
SELECT
LOG_LAST AS "data",
SUM(run_cnt) AS "Number of runs"
FROM
dual l
LEFT OUTER JOIN "LOG_STAT" stat ON
stat."LOG_LAST" = l."CLASS"
WHERE
new_class = '$arg[klasa]'
--SELECT to_date(TRUNC (SYSDATE - ROWNUM), 'DD-MM-YYYY'),
--0
--FROM dual CONNECT BY ROWNUM < 366
GROUP BY
LOG_LAST
ORDER BY
LOG_LAST
//Edit:
LOG_LAST is just a column with a date (for example: 25.04.2018 15:44:21), run_cnt is a column with a simple number, LOG_STAT is a table that contains LOG_LAST and run_cnt, and new_class is a column with the name of the record. I would like to list records even when they do not exist. For example: I have records with dates 24-09-2018, 23-09-2018, 20-09-2018 and 18-09-2018, and I would like to list rows even without names and run_cnt, generating the missing dates over some period.
try to fill with isnull:
SELECT
case when trim(LOG_LAST) is null then '01-01-2018'
else isnull(LOG_LAST,'01-01-2018')end AS data,
SUM(isnull(run_cnt,0)) AS "Number of runs"
FROM
dual l
LEFT OUTER JOIN "LOG_STAT" stat ON
stat."LOG_LAST" = l."CLASS"
WHERE
new_class = '$arg[klasa]'
--SELECT to_date(TRUNC (SYSDATE - ROWNUM), 'DD-MM-YYYY'),
--0
--FROM dual CONNECT BY ROWNUM < 366
GROUP BY
LOG_LAST
ORDER BY
LOG_LAST
What you want is more or less:
select d.day, sum(ls.run_cnt)
from all_dates d
left join log_stat ls on trunc(ls.log_last) = d.day
where ls.new_class = :klasa
group by d.day
order by d.day;
The all_dates table in the above query is supposed to contain all dates, beginning with the minimum log_last date for the klasa and ending with its maximum. You get these dates with a recursive query.
with ls as
(
select trunc(log_last) as day, sum(run_cnt) as total
from log_stat
where new_class = :klasa
group by trunc(log_last)
)
, all_dates(day) as
(
select min(day) from ls
union all
select day + 1 from all_dates where day < (select max(day) from ls)
)
select d.day, ls.total
from all_dates d
left join ls on ls.day = d.day
order by d.day;
This is called data densification; see the Oracle documentation on Data Densification for Reporting. An example of data densification:
with ls as
(
select trunc(created) as day,object_type new_class, sum(1) as total
from user_objects
group by trunc(created),object_type
)
, all_dates(day) as
(
select min(day) from ls
union all
select day + 1 from all_dates where day < (select max(day) from ls)
)
select d.day, nvl(ls.total,0),new_class
from all_dates d
left join ls partition by (ls.new_class) on ls.day = d.day
order by d.day;

map/sample timeseries data to another timeserie db2

I am trying to combine the results of two SQL queries (DB2 on IBM Bluemix):
The first query creates a timeserie from startdate to enddate:
with dummy(minute) as (
select TIMESTAMP('2017-01-01')
from SYSIBM.SYSDUMMY1 union all
select minute + 1 MINUTES
from dummy
where minute <= TIMESTAMP('2018-01-01')
)
select to_char(minute, 'DD.MM.YYYY HH24:MI') AS minute
from dummy;
The second query selects data from a table which has a timestamp. This data should be joined to the generated time series above. The standalone query looks like this:
SELECT DISTINCT
to_char(date_trunc('minute', TIMESTAMP), 'DD.MM.YYYY HH24:MI') AS minute,
VALUE AS running_ct
FROM TEST
WHERE ID = 'abc'
AND NAME = 'sensor'
ORDER BY minute ASC;
What I expect to get is one result set containing two columns:
the first column with the timestamps from start date to end date, and
the second with the values matched to the first column by their own timestamps (missing timestamps = null).
How could I do that?
A better solution, especially if your detail table is large, is to generate a range. This allows the optimizer to use indices to fulfill the bucketing, instead of calling a function on every row (which is expensive).
So something like this:
WITH dummy(temporaer, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-01'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2018-01-31'))
SELECT Dummy.temporaer, AVG(Test.value) AS TEXT
FROM Dummy
LEFT OUTER JOIN Test
ON Test.timestamp >= Dummy.temporaer
AND Test.timestamp < Dummy.rangeEnd
AND Test.id = 'abc'
AND Test.name = 'text'
GROUP BY Dummy.temporaer
ORDER BY Dummy.temporaer ASC;
Note that the end of the range is now exclusive, not inclusive like you had it before: you were including the very first minute of '2018-01-31', which is probably not what you wanted. Of course, excluding just the last day of a month also strikes me as a little strange - you most likely really want < TIMESTAMP('2018-02-01').
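For illustration only, here is the same query with the adjusted, exclusive upper bound (a sketch assuming you do want all of January 2018 included; table and column names are taken from the query above):
WITH dummy(temporaer, rangeEnd) AS (
    SELECT a, a + 1 MINUTE
    FROM (VALUES(TIMESTAMP('2017-12-01'))) D(a)
    UNION ALL
    SELECT rangeEnd, rangeEnd + 1 MINUTE
    FROM dummy
    WHERE rangeEnd < TIMESTAMP('2018-02-01'))  -- exclusive bound: the last minute bucket starts at 2018-01-31 23:59
SELECT Dummy.temporaer, AVG(Test.value) AS TEXT
FROM Dummy
LEFT OUTER JOIN Test
    ON Test.timestamp >= Dummy.temporaer
   AND Test.timestamp < Dummy.rangeEnd
   AND Test.id = 'abc'
   AND Test.name = 'text'
GROUP BY Dummy.temporaer
ORDER BY Dummy.temporaer ASC;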
found a working solution:
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer <= TIMESTAMP('2018-01-31'))
select temporaer, avg(VALUE) as text from dummy
LEFT OUTER JOIN TEST ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='text'
group by temporaer
ORDER BY temporaer ASC;
cheers