Map/sample time series data onto another time series (DB2 SQL)

I am trying to combine the results of two SQL queries (DB2 on IBM Bluemix).
The first query generates a time series from a start date to an end date:
with dummy(minute) as (
select TIMESTAMP('2017-01-01')
from SYSIBM.SYSDUMMY1 union all
select minute + 1 MINUTES
from dummy
where minute <= TIMESTAMP('2018-01-01')
)
select to_char(minute, 'DD.MM.YYYY HH24:MI') AS minute
from dummy;
The second query selects data from a table that has a timestamp column. This data should be joined to the time series generated above. The standalone query looks like this:
SELECT DISTINCT
to_char(date_trunc('minute', TIMESTAMP), 'DD.MM.YYYY HH24:MI') AS minute,
VALUE AS running_ct
FROM TEST
WHERE ID = 'abc'
AND NAME = 'sensor'
ORDER BY minute ASC;
What I expect to get is a single query whose result contains two columns:
the first column holds the timestamps from start date to end date, and
the second holds the values, aligned to the first column by their own
timestamps (minutes with no data are null).
How could I do that?

A better solution, especially if your detail table is large, is to generate a range. This allows the optimizer to use indices to fulfill the bucketing, instead of calling a function on every row (which is expensive).
So something like this:
WITH dummy(temporaer, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-01'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2018-01-31'))
SELECT Dummy.temporaer, AVG(Test.value) AS TEXT
FROM Dummy
LEFT OUTER JOIN Test
ON Test.timestamp >= Dummy.temporaer
AND Test.timestamp < Dummy.rangeEnd
AND Test.id = 'abc'
AND Test.name = 'text'
GROUP BY Dummy.temporaer
ORDER BY Dummy.temporaer ASC;
Note that the end of the range is now exclusive, not inclusive like you had it before: you were including the very first minute of '2018-01-31', which is probably not what you wanted. Of course, excluding just the last day of a month also strikes me as a little strange - you most likely really want < TIMESTAMP('2018-02-01').
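To make the half-open range join concrete, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. This is not DB2: the table layout, the sample rows and the three-minute span are assumptions made for illustration, and SQLite's datetime() stands in for DB2's + 1 MINUTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, name TEXT, ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [
        ("abc", "text", "2017-12-01 00:00:10", 1.0),
        ("abc", "text", "2017-12-01 00:00:50", 3.0),
        ("abc", "text", "2017-12-01 00:02:00", 5.0),
        ("xyz", "text", "2017-12-01 00:01:30", 9.0),  # different id, filtered out
    ],
)

# Each bucket is the half-open interval [range_start, range_end).
query = """
WITH RECURSIVE buckets(range_start, range_end) AS (
    SELECT '2017-12-01 00:00:00',
           datetime('2017-12-01 00:00:00', '+1 minute')
    UNION ALL
    SELECT range_end, datetime(range_end, '+1 minute')
    FROM buckets
    WHERE range_end < '2017-12-01 00:03:00'
)
SELECT b.range_start, AVG(t.value)
FROM buckets b
LEFT JOIN test t
       ON t.ts >= b.range_start
      AND t.ts <  b.range_end
      AND t.id = 'abc'
      AND t.name = 'text'
GROUP BY b.range_start
ORDER BY b.range_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The filters on id and name sit in the ON clause rather than in a WHERE clause, so minutes without matching rows still appear with a null average.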

found a working solution:
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer <= TIMESTAMP('2018-01-31'))
select temporaer, avg(VALUE) as text from dummy
LEFT OUTER JOIN TEST ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='text'
group by temporaer
ORDER BY temporaer ASC;
cheers

BigQuery: iterating groups within a window of 28days before a start_date column using _TABLE_SUFFIX

I have a table like this:
group_id  start_date  end_date
--------  ----------  --------
19335     20220613    20220714
19527     20220620    20220719
19339     20220614    20220720
19436     20220616    20220715
20095     20220711    20220809
I am trying to retrieve data from another table that is partitioned, so the data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_id values within the period [start_date, end_date]. What I need is to retrieve a column/metric for those users over the 28 days prior to the start_date of each group_id.
My idea is to:
Retrieve distinct user_id per group_id within the period [start_date, end_date]
Retrieve previous 28d metric data prior to the start date of each group_id
A snippet code on how to retrieve data from a single group_id is the following:
WITH users_per_group AS (
SELECT
users_metadata.user_id,
users_metadata.group_id,
FROM
`my_table_users_*` users_metadata
WHERE
_TABLE_SUFFIX BETWEEN '20220314' --start_date
AND '20220413' --end_date
AND experiment_id = 16709
GROUP BY
1,
2
)
SELECT
_TABLE_SUFFIX AS date,
user_id,
SUM(
COALESCE(metric, 0)
) AS metric,
FROM
users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 28 DAY
)
) -- 28 days before it starts
AND FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 1 DAY
)
) -- 1 day before it starts
GROUP BY
1,
2
ORDER BY
date ASC
Also, I want to avoid retrieving all data (across all dates) for that metric, as the table is huge and retrieving it would take a very long time.
Is there an easy way to retrieve the metric data of each user across groups, considering the 28 days before the start date of each group_id?
I can think of 2 approaches.
Join all the tables and then perform your query.
Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e. you need to calculate each user's search range before you do anything else.
EG:
WITH users_per_group AS (
SELECT
user_id, group_id
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
FROM TableName
)
Once you have this kind of table then you can use any of the mentioned approaches.
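As a rough illustration of the search range itself, here is a small Python sketch that derives search_from and search_to for the 28-day window asked about in the question. The helper name lookback_window is hypothetical, and the two group rows are copied from the question's sample table; the BigQuery expressions above do the same arithmetic with DATE_SUB.

```python
from datetime import datetime, timedelta


def lookback_window(start_date, days=28):
    """Return (search_from, search_to) as YYYYMMDD strings.

    The window covers the `days` days immediately before start_date,
    so it ends one day before start_date.
    """
    start = datetime.strptime(start_date, "%Y%m%d").date()
    search_from = start - timedelta(days=days)
    search_to = start - timedelta(days=1)
    return search_from.strftime("%Y%m%d"), search_to.strftime("%Y%m%d")


# Two rows taken from the question's sample table (group_id -> start_date).
groups = {19335: "20220613", 20095: "20220711"}
windows = {gid: lookback_window(start) for gid, start in groups.items()}
print(windows)
```

Because the suffixes are zero-padded YYYYMMDD strings, comparing them as strings in a BETWEEN gives the same ordering as comparing the dates.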
Since I don't have your data and don't know about your table names I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user,grp,start_date,end_date
with maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
),
--then calculate search from-to date for every user and group
user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from --change interval as per your need
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
from maintable
)
select visit_from,_TABLE_SUFFIX date,count(visitId) total_visits from
user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d",ug.search_from) and format_date("%Y%m%d",ug.search_to)
group by 1,2
Approach 2
declare queries array<string> default [];
create temp table maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
);
create temp table user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY)search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY)search_to
from maintable
);
-- for each user create a separate query here
FOR record IN (SELECT * from user_per_grp)
DO
set queries = queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2',record.visit_from,record.search_from,record.search_to,record.visit_from)];
--replace your query here.
END FOR;
--aggregating all the queries and executing it
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data (~750 KB) than the 1st approach (~17 MB). That might not hold for your dataset, though: if the date ranges of two users overlap, the same table will be read twice.

Want to subtract two dates from two different tables, getting syntax error

select to_date(to_char(MIN (logical_date), 'YYYYMMDD'), 'YYYYMMDD')from table_1
- to_date(to_char(MIN (due_date) ,'YYYYMMDD'),'YYYYMMDD') FROM table_2
You could subtract the results of two subqueries, each of which gets the minimum date from one of the tables; with the overall query run against dual (the built-in single-row table that's quite useful for this sort of thing):
-- CTEs for your sample data
with table_1 (logical_date) as (select date '2019-05-01' from dual),
table_2 (due_date) as (select date '2019-05-15' from dual)
-- actual query
select (select to_date(to_char(min(logical_date), 'YYYYMMDD'), 'YYYYMMDD') from table_1)
- (select to_date(to_char(min(due_date) ,'YYYYMMDD'),'YYYYMMDD') from table_2)
as diff
from dual;
DIFF
----------
-14
But you don't need to convert to and from strings, you can just do:
select (select min(logical_date) from table_1) - (select min(due_date) from table_2) as diff
from dual;
unless your dates have non-midnight time components, in which case you'll get a fractional number of days in your result; to get whole days only either round/trunc/floor/ceil the result, or use trunc() to set both time components to midnight before you subtract - which you do depends on how you want to handle those fractional days.
If you're expecting that difference to be -15, then subtract one from the result. If you expect a positive value then reverse the order of the subqueries, and add one instead.
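The arithmetic can be checked outside the database as well. The following Python sketch is only an analogue of the Oracle date subtraction, using the two sample dates from the CTEs above; the time-of-day values in the second half are made up to show the fractional case.

```python
from datetime import date, datetime

# Same sample dates as in the CTEs above.
logical_date = date(2019, 5, 1)
due_date = date(2019, 5, 15)
diff_days = (logical_date - due_date).days
print(diff_days)

# With non-midnight time components the difference is fractional.
a = datetime(2019, 5, 1, 18, 0)
b = datetime(2019, 5, 15, 6, 0)
fractional = (a - b).total_seconds() / 86400
# Dropping the time parts first (like trunc() in Oracle) gives whole days.
whole = (a.date() - b.date()).days
print(fractional, whole)
```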

Postgres inner query performance

I have a table which I need to select from everything with this rule:
id = 4524522000143 and validPoint = true
and date > (max(date)- interval '12 month')
-- the max date, is the max date for this id
Explaining the rule: I have to get all records and count them; they must be no more than one year older than the newest record.
This is my actual query:
WITH points as (
select 1 as ct from base_faturamento_mensal
where id = 4524522000143 and validPoint = true
group by id,date
having date > (max(date)- interval '12 month')
) select sum(ct) from points
Is there a more efficient way for this?
Well your query is using the trick with including an unaggregated column within HAVING clause but I don't find it particularly bad. It seems fine, but without the EXPLAIN ANALYZE <query> output I can't say much more.
One thing to do is you can get rid of the CTE and use count(*) within the same query instead of returning 1 and then running a sum on it afterwards.
select count(*) as ct
from base_faturamento_mensal
where id = 4524522000143
and validPoint = true
group by id, date
having date > max(date) - interval '12 months'
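One thing to watch: because date is a grouping column, max(date) inside HAVING is evaluated per (id, date) group and therefore always equals that group's own date, so the condition cannot compare against the newest date for the id. A possible alternative is to compute the maximum once in a subquery. The sketch below shows that idea on an in-memory SQLite database from Python; the sample rows are invented, validPoint is stored as 0/1, and SQLite's date(..., '-12 months') stands in for the PostgreSQL interval arithmetic. It also assumes the maximum should be taken over valid rows only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base_faturamento_mensal (id INTEGER, validPoint INTEGER, date TEXT)"
)
conn.executemany(
    "INSERT INTO base_faturamento_mensal VALUES (?, ?, ?)",
    [
        (4524522000143, 1, "2020-06-01"),  # newest valid row for this id
        (4524522000143, 1, "2020-01-01"),  # inside the 12-month window
        (4524522000143, 1, "2019-07-01"),  # inside the 12-month window
        (4524522000143, 1, "2019-06-01"),  # exactly 12 months back: excluded by ">"
        (4524522000143, 1, "2018-01-01"),  # too old
        (4524522000143, 0, "2020-05-01"),  # not a valid point
        (999, 1, "2021-01-01"),            # different id
    ],
)

query = """
SELECT COUNT(*)
FROM base_faturamento_mensal
WHERE id = :id
  AND validPoint = 1
  AND date > (
        SELECT date(MAX(date), '-12 months')
        FROM base_faturamento_mensal
        WHERE id = :id
          AND validPoint = 1
      )
"""
(count,) = conn.execute(query, {"id": 4524522000143}).fetchone()
print(count)
```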

Combining dates fields from separate tables

I have two tables that are almost exactly the same. The only difference is that one is an archive table (call that B) that holds any records removed from the other table (call that A).
I need to get ALL records from a given date range, so I need to join the two tables (and actually join them to a third to get a piece of information that's not on those tables, but that doesn't affect my problem).
I want to group by the hour the record comes from (i.e. trunc(<date_field>, 'hh')).
However, since I need to get records from each hour from the two tables it seems I would need to generate a single date field to group on, otherwise the group wouldn't make sense; each record will only have a date from one field so if I group by either table's date field it would inherently omit records from the other, and if I group by both I'll get no data back as no record appears in both tables.
SO, what I want to do is add two "dates" and have it work like it would in Excel (i.e. the dates get treated as their numeric equivalent, get added, and the resultant date is returned, which by the way is at least one case where adding dates is valid, despite this thread's opinion otherwise)
This makes even more sense as I'll be replacing the null date value with 0 so it should functionally be like adding a number to a date (12/31/14 + 1 = 1/1/15).
I just haven't been able to get it to work. I've tried several iterations to get the calculation to work the latest being:
SELECT DISTINCT Avg(NVL(to_number(to_char(trunc(fcr.actual_start_date, 'hh')))*86400, 0) + NVL(to_Number(to_char(trunc(acr.actual_start_date, 'hh')))*86400, 0)) Start_Num, SUM(AA.SESSIONCPU) TotalCPU, Count(1) Cnt
, SUM((NVL(to_number(to_char(trunc(fcr.actual_completion_date, 'hh')))*86400, 0) + NVL(to_Number(to_char(trunc(acr.actual_completion_date, 'hh')))*86400, 0)
- NVL(to_number(to_char(trunc(fcr.actual_start_date, 'hh')))*86400, 0) - NVL(to_Number(to_char(trunc(acr.actual_start_date, 'hh')))*86400, 0))) TotRun
FROM PSTAT.A$_A AA
LEFT OUTER JOIN APPL.FND_CR FCR On FCR.O_SES_ID = AA.SEsID
LEFT OUTER Join XX.E_FND_CR ACR on ACR.O_SES_ID = aa.sesid
WHERE (trunc(fcr.actual_start_date) >= to_date('28-Dec-2014', 'DD-MON-YYYY')
Or trunc(acr.actual_start_date) >= to_date('28-Dec-2014', 'DD-MON-YYYY'))
AND rownum <= 1048500
and (acr.status_code = 'C' or fcr.status_Code = 'C')
AND aa.sessioncpu is not null
GROUP BY to_number(NVL(trunc(fcr.actual_start_date, 'hh'), 0))*86400 + to_Number(NVL(trunc(acr.actual_start_date, 0), 'hh'))*86400
ORDER BY 2, 1;
My explicit problem with the code above is that Toad keeps ignoring the casts and says it is expecting a date value when it gets a number (the 0 gets highlighted). So if someone could:
A) Tell me why Toad would ignore the casts (it should be seeing a number and so should have absolutely no expectation of a date)
B) Provide any suggestions on how to get the addition to work, or failing that suggest an alternative route to combine the three tables so that I'm able to group by the start date values
As always, any help is much appreciated.
Adding dates throws ORA-00975 (date + date not allowed), and casting them to a number throws ORA-01722 (invalid number).
So what can be done here to operate on dates the Excel way? My idea is to subtract the first day of the calendar, to_date(1, 'J'), from each date you want to operate on.
Example with test dates:
with test_data as (
select sysdate dt from dual union all
select to_date(1, 'J') from dual union all
select null from dual )
select nvl(trunc(dt, 'hh') - to_date(1, 'J'), 0) num_val, dt,
       to_char(dt, 'J') tc1, to_char(dt, 'yyyy-mm-dd hh24:mi:ss') tc2
from test_data
NUM_VAL     DT          TC1      TC2
----------  ----------  -------  -------------------
2457105,96  2015-03-24  2457106  2015-03-24 23:12:14
         0  4712-01-01  0000001  4712-01-01 00:00:00
         0
@David, your suggestion seems to have worked like a charm. For those who come along afterwards, my updated code follows:
SELECT trunc(cr.actual_start_date, 'hh') Start_Date, SUM(AA.SESSIONCPU) TotalCPU,
Count(1) Cnt, SUM((cr.Actual_Completion_Date - cr.Actual_Start_Date)*86400) TotalRun
FROM (SELECT Actual_Start_Date, Actual_Completion_Date, Oracle_Session_ID, Status_Code
FROM APPL.FND_CR
UNION ALL
SELECT Actual_Start_Date, Actual_Completion_Date, Oracle_Session_ID, Status_Code
FROM XX.E_FND_CR) cr
RIGHT OUTER JOIN PSTAT.A$_A AA ON cr.Oracle_Session_ID = AA.SessionID
WHERE trunc(cr.actual_start_date) >= to_date('28-Dec-2014', 'DD-MON-YYYY')
AND rownum <= 1048500
and cr.status_code = 'C'
GROUP BY trunc(cr.actual_start_date, 'hh')
ORDER BY 1;
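The core of the updated query is to stack the live and archive tables with UNION ALL so there is a single date column to group on. Here is a minimal sketch of that pattern on an in-memory SQLite database from Python. The table names live and archive, their columns and the sample rows are hypothetical, and strftime() is used where Oracle would use trunc(date, 'hh').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE live (session_id INTEGER, start_ts TEXT)")
conn.execute("CREATE TABLE archive (session_id INTEGER, start_ts TEXT)")
conn.executemany(
    "INSERT INTO live VALUES (?, ?)",
    [(1, "2014-12-28 10:05:00"), (2, "2014-12-28 11:30:00")],
)
conn.executemany(
    "INSERT INTO archive VALUES (?, ?)",
    [(3, "2014-12-28 10:45:00"), (4, "2014-12-28 10:59:59")],
)

# Stack both tables first, then group on the single combined timestamp.
query = """
SELECT strftime('%Y-%m-%d %H:00', cr.start_ts) AS start_hour,
       COUNT(*) AS cnt
FROM (
    SELECT session_id, start_ts FROM live
    UNION ALL
    SELECT session_id, start_ts FROM archive
) cr
GROUP BY start_hour
ORDER BY start_hour
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

UNION ALL is used rather than UNION because a record lives in only one of the two tables, so there are no duplicates to remove.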

Query for dates which are not present in a table

Consider a table ABC which has a column of date type.
How can we get all the dates of a range (between start date and end date) which are not present in the table.
This can be done in PL/SQL, but I am looking for a plain SQL query for it.
You need to generate the arbitrary list of dates that you want to check for:
http://hashfactor.wordpress.com/2009/04/08/sql-generating-series-of-numbers-in-oracle/
e.g.:
-- generate 1..20
SELECT ROWNUM N FROM dual
CONNECT BY LEVEL <= 20
Then left join with your table, or use a where not exists subquery (which will likely be faster), to fetch those generated dates that have no matching record.
Assuming that your table's dates do not include a time element (ie. they are effectively recorded as at midnight), try:
select check_date
from (select :start_date + level - 1 check_date
from dual
connect by level <= 1 + :end_date - :start_date) d
where not exists
(select null from mytable where mydate = check_date)
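For readers who want to try the generate-then-filter pattern without an Oracle instance, here is a sketch of the same NOT EXISTS approach on an in-memory SQLite database from Python. A recursive CTE replaces CONNECT BY LEVEL, and the sample dates and the start/end parameters are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (mydate TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [("2024-01-01",), ("2024-01-02",), ("2024-01-04",)],
)

# Generate every date in [start_date, end_date], then keep those
# that have no matching row in mytable.
query = """
WITH RECURSIVE d(check_date) AS (
    SELECT :start_date
    UNION ALL
    SELECT date(check_date, '+1 day')
    FROM d
    WHERE check_date < :end_date
)
SELECT check_date
FROM d
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE mydate = check_date)
ORDER BY check_date
"""
params = {"start_date": "2024-01-01", "end_date": "2024-01-05"}
missing = [row[0] for row in conn.execute(query, params)]
print(missing)
```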
Given a date column in order to do this you need to generate a list of all possible dates between the start and end date and then remove those dates that already exist. As Mark has already suggested the obvious way to generate the list of all dates is to use a hierarchical query. You can also do this without knowing the dates in advance though.
with the_dates as (
select date_col
from my_table
)
, date_range as (
select max(date_col) as maxdate, min(date_col) as mindate
from the_dates
)
select mindate + level
from date_range
connect by level <= maxdate - mindate
minus
select date_col
from the_dates
;
The point of the second layer of the CTE is to have a "table" that has all the information you need but is only one row so that the hierarchical query will work correctly.