Two-dimensional comparison in SQL

DB schema
CREATE TABLE newsletter_status
(
cryptid varchar(255) NOT NULL,
status varchar(25),
regDat timestamp,
confirmDat timestamp,
updateDat timestamp,
deleteDat timestamp
);
There are rows with the same cryptid; I need to squash them into one row, so that cryptid becomes effectively unique. The complexity comes from the fact that I need to compare dates across rows as well as across columns. How can I implement that?
The rule I need to use is:
status should be taken from the row with the latest timestamp (among all 4 dates)
for every date column I need to select the latest date
Example:
002bc5 | new | 2010.01.15 | 2001.01.15 | NULL | 2020.01.10
002bc5 | confirmed | NULL | 2020.01.30 | 2020.01.15 | 2020.01.15
002bc5 | deactivated | NULL | NULL | NULL | 2020.12.03
needs to be squashed into:
002bc5 | deactivated | 2010.01.15 | 2020.01.30 | 2020.01.15 | 2020.12.03
The status deactivated is taken because the timestamp 2020.12.03 is the latest
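Before writing any SQL, the rule can be pinned down with a small sketch in plain Python (illustrative only; the `squash` helper and tuple layout are mine, mirroring the table columns, with `None` standing for NULL):

```python
from collections import defaultdict
from datetime import date

# Rows: (cryptid, status, regDat, confirmDat, updateDat, deleteDat); None = NULL.
rows = [
    ("002bc5", "new",         date(2010, 1, 15), date(2001, 1, 15), None,              date(2020, 1, 10)),
    ("002bc5", "confirmed",   None,              date(2020, 1, 30), date(2020, 1, 15), date(2020, 1, 15)),
    ("002bc5", "deactivated", None,              None,              None,              date(2020, 12, 3)),
]

def squash(rows):
    """Collapse rows per cryptid: column-wise max dates, status from the latest row."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[0]].append(r)
    out = []
    for cryptid, grp in groups.items():
        # Per date column: latest non-NULL value across the group.
        cols = []
        for i in range(2, 6):
            vals = [r[i] for r in grp if r[i] is not None]
            cols.append(max(vals) if vals else None)
        # Status: from the row whose greatest non-NULL date is the latest overall.
        latest = max(grp, key=lambda r: max(d for d in r[2:] if d is not None))
        out.append((cryptid, latest[1], *cols))
    return out

print(squash(rows))
```

Running this on the example above reproduces the expected squashed row.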

To get the status, you need to sort the rowset by the dates in descending order. Oracle has agg_func(<arg>) keep (dense_rank first ...); in other databases it can be replaced with row_number() and a filter. Because analytic functions sometimes perform poorly in HANA, I suggest a little trick with the only aggregate function in HANA I know of that supports ordering inside - STRING_AGG. As long as you don't have thousands of status rows per cryptid (i.e. the concatenated status string stays within the 4000-character varchar limit), it will work. This is the query:
select cryptid,
       max(regDat) as regDat,
       max(confirmDat) as confirmDat,
       max(updateDat) as updateDat,
       max(deleteDat) as deleteDat,
       substr_before(
           string_agg(status, '|'
                      order by greatest(ifnull(regDat, date '1000-01-01'),
                                        ifnull(confirmDat, date '1000-01-01'),
                                        ifnull(updateDat, date '1000-01-01'),
                                        ifnull(deleteDat, date '1000-01-01')
                                       ) desc),
           '|'
       ) as status
from newsletter_status
group by cryptid;

You can use aggregation:
select cryptid,
       coalesce(max(case when status = 'deactivated' then status end),
                max(case when status = 'confirmed' then status end),
                max(case when status = 'new' then status end)
       ) as status,
       max(regDat),
       max(confirmDat),
       max(updateDat),
       max(deleteDat)
from newsletter_status
group by cryptid;
The coalesce()s are a trick to get the statuses in priority order.
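The priority trick can be checked quickly in SQLite through Python (the sample rows here are invented for the demo, and the date columns are left out since only the status matters):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE newsletter_status (cryptid TEXT, status TEXT);
INSERT INTO newsletter_status VALUES
 ('002bc5','new'),('002bc5','confirmed'),('002bc5','deactivated'),
 ('0a11ff','new'),('0a11ff','confirmed');
""")

# coalesce() returns the first non-NULL branch, so 'deactivated' wins over
# 'confirmed', which wins over 'new'.
rows = con.execute("""
SELECT cryptid,
       COALESCE(MAX(CASE WHEN status = 'deactivated' THEN status END),
                MAX(CASE WHEN status = 'confirmed'  THEN status END),
                MAX(CASE WHEN status = 'new'        THEN status END)) AS status
FROM newsletter_status
GROUP BY cryptid
ORDER BY cryptid
""").fetchall()
print(rows)
```

Note that this encodes a fixed status hierarchy, not "latest by date"; the EDIT below handles the date-based variant.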
EDIT:
If you just want the row with the latest timestamp:
select cryptid,
       max(case when seqnum = 1 then status end) as status_on_max_date,
       max(regDat),
       max(confirmDat),
       max(updateDat),
       max(deleteDat)
from (select ns.*,
             row_number() over (partition by cryptid
                                order by greatest(coalesce(regDat, '2000-01-01'),
                                                  coalesce(confirmDat, '2000-01-01'),
                                                  coalesce(updateDat, '2000-01-01'),
                                                  coalesce(deleteDat, '2000-01-01')
                                                 ) desc
                               ) as seqnum
      from newsletter_status ns
     ) ns
group by cryptid;
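This pattern can be sanity-checked in SQLite via Python's sqlite3 as a stand-in database. SQLite lacks greatest(), but its scalar max() does the same job (it yields NULL if any argument is NULL, hence the coalesce), and the desc in the window's order by is what makes seqnum = 1 the latest row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE newsletter_status (
  cryptid TEXT NOT NULL, status TEXT,
  regDat TEXT, confirmDat TEXT, updateDat TEXT, deleteDat TEXT
);
INSERT INTO newsletter_status VALUES
 ('002bc5','new','2010-01-15','2001-01-15',NULL,'2020-01-10'),
 ('002bc5','confirmed',NULL,'2020-01-30','2020-01-15','2020-01-15'),
 ('002bc5','deactivated',NULL,NULL,NULL,'2020-12-03');
""")

# seqnum = 1 marks the row with the latest of the four dates;
# the outer MAX() aggregates each date column independently.
row = con.execute("""
SELECT cryptid,
       MAX(CASE WHEN seqnum = 1 THEN status END) AS status,
       MAX(regDat), MAX(confirmDat), MAX(updateDat), MAX(deleteDat)
FROM (SELECT ns.*,
             ROW_NUMBER() OVER (
               PARTITION BY cryptid
               ORDER BY MAX(COALESCE(regDat, '0001-01-01'),
                            COALESCE(confirmDat, '0001-01-01'),
                            COALESCE(updateDat, '0001-01-01'),
                            COALESCE(deleteDat, '0001-01-01')) DESC
             ) AS seqnum
      FROM newsletter_status ns) ns
GROUP BY cryptid
""").fetchone()
print(row)
```

ISO date strings are used so that plain string comparison orders chronologically.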

I would start by ranking the rows of each cryptid by the greatest value of the date columns. Then we can use that information to identify the latest status per cryptid, and aggregate:
select cryptid,
       max(case when rn = 1 then status end) as status,
       max(regDat) as regDat,
       max(confirmDat) as confirmDat,
       max(updateDat) as updateDat,
       max(deleteDat) as deleteDat
from (
    select ns.*,
           row_number() over(
               partition by cryptid
               order by greatest(
                   coalesce(regDat, '0001-01-01'),
                   coalesce(confirmDat, '0001-01-01'),
                   coalesce(updateDat, '0001-01-01'),
                   coalesce(deleteDat, '0001-01-01')
               ) desc
           ) rn
    from newsletter_status ns
) ns
group by cryptid

Related

SQL query to allow for latest datasets per items

I have this table in an SQL Server database:
and I would like a query that gives me the values of cw1, cw2, cw3 for a restricted date condition.
I would like the query to give me the "latest" values of cw1, cw2, cw3, falling back to the previous values of cw1, cw2, cw3 when they are null for the last plan_date. This would be with a date condition.
So if the condition is plan_date between "02.01.2020" and "04.01.2020" then the result should be
1 04.01.2020 null, 9, 4
2 03.01.2020 30 , 15, 2
where, for example, the "30" is from the last previous date for item_nr 2.
You can get the last value using first_value(). Unfortunately, that is a window function, but select distinct solves that:
select distinct item_nr,
first_value(cw1) over (partition by item_nr
order by (case when cw1 is not null then 1 else 2 end), plan_date desc
) as imputed_cw1,
first_value(cw2) over (partition by item_nr
order by (case when cw2 is not null then 1 else 2 end), plan_date desc
) as imputed_cw2,
first_value(cw3) over (partition by item_nr
order by (case when cw3 is not null then 1 else 2 end), plan_date desc
) as imputed_cw3
from t;
You can add a where clause after the from.
The first_value() window function returns the first value from each partition. The partition is ordered to put the non-NULL values first, and then order by time descending. So, the most recent non-NULL value is first.
The only downside is that it is a window function, so the select distinct is needed to get the most recent value for each item_nr.
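A runnable sketch of this imputation in SQLite through Python (the sample table is invented, since the question's data isn't shown in full, and only cw1 is modelled; cw2/cw3 follow the same pattern):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (item_nr INT, plan_date TEXT, cw1 INT);
INSERT INTO t VALUES
 (1, '2020-01-02', 30),
 (1, '2020-01-04', NULL),   -- latest date, but NULL -> fall back to 30
 (2, '2020-01-03', 15);
""")

# Non-NULL rows sort first, then latest date first, so FIRST_VALUE
# picks the most recent non-NULL cw1 per item.
rows = con.execute("""
SELECT DISTINCT item_nr,
       FIRST_VALUE(cw1) OVER (
         PARTITION BY item_nr
         ORDER BY (CASE WHEN cw1 IS NOT NULL THEN 1 ELSE 2 END),
                  plan_date DESC
       ) AS imputed_cw1
FROM t
ORDER BY item_nr
""").fetchall()
print(rows)
```

Item 1 gets 30 imputed from the earlier date; item 2 keeps its own 15.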

SQL Grouping based on validity

I have the following table, let's call it tbl_costcenters, with the following dummy entries:
ID PosName CostcenterCode ValidFrom ValidUntil
1 test1 111 1.1.2019 1.6.2019
2 test1 111 1.6.2019 1.9.2019
3 test1 222 1.9.2019 1.6.2020
and I would like the following result:
PosName ValidFrom ValidUntil CostcenterCode
test1 1.1.2019 1.9.2019 111
test1 1.9.2019 1.6.2020 222
This is very simplified; the real table contains many more columns. I need to group the rows based on the CostcenterCode and get a validity range that spans the first two entries of my example, returning the ValidFrom from record ID 1 and the ValidUntil from record ID 2.
Sorry, I did not really know what to search for. I think the answer is easy for somebody who is strong in SQL.
The answer should work for both SQL Server and Oracle.
Thank you for your help.
This seems like simple aggregation:
select PosName,
min(ValidFrom) as ValidFrom,
(case when max(ValidUntil) > min(ValidFrom) then max(ValidUntil) end) as ValidUntil,
CostcenterCode
from tbl_costcenters t
group by PosName, CostcenterCode;
I suspect that you want to group together records whose dates overlap, while keeping those that don't overlap separate (although this is not shown in your sample data).
If so, we could use some gaps-and-island techniques here. One option uses window functions to build groups of adjacent records:
select posName,
       min(validFrom) as validFrom,
       max(validUntil) as validUntil,
       costcenterCode
from (
    select t.*,
           sum(case when validFrom <= lagValidUntil then 0 else 1 end)
             over(partition by posName, costcenterCode order by validFrom) grp
    from (
        select t.*,
               lag(validUntil)
                 over(partition by posName, costcenterCode order by validFrom) lagValidUntil
        from tbl_costcenters t
    ) t
) t
group by posName, costcenterCode, grp
order by posName, min(validFrom)
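A sketch of this gaps-and-islands pattern against the question's sample rows, run in SQLite through Python; ISO date strings are assumed so that plain string comparison orders correctly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl_costcenters (
  id INT, posName TEXT, costcenterCode INT, validFrom TEXT, validUntil TEXT
);
INSERT INTO tbl_costcenters VALUES
 (1, 'test1', 111, '2019-01-01', '2019-06-01'),
 (2, 'test1', 111, '2019-06-01', '2019-09-01'),
 (3, 'test1', 222, '2019-09-01', '2020-06-01');
""")

# lag() fetches the previous validUntil; the running sum() increments
# whenever a gap opens, so each contiguous island gets its own grp.
rows = con.execute("""
SELECT posName, MIN(validFrom), MAX(validUntil), costcenterCode
FROM (
  SELECT t.*,
         SUM(CASE WHEN validFrom <= lagValidUntil THEN 0 ELSE 1 END)
           OVER (PARTITION BY posName, costcenterCode ORDER BY validFrom) AS grp
  FROM (
    SELECT t.*,
           LAG(validUntil) OVER (PARTITION BY posName, costcenterCode
                                 ORDER BY validFrom) AS lagValidUntil
    FROM tbl_costcenters t
  ) t
) t
GROUP BY posName, costcenterCode, grp
ORDER BY posName, MIN(validFrom)
""").fetchall()
print(rows)
```

The two adjacent code-111 rows collapse into one validity range; the code-222 row stays separate.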
The definitive solution for me was:
select posname, min(validfrom),
case
when
max(case when validuntil is null then 1 ELSE 0 END) = 0
then max(validuntil)
end
from tbl_costcenters pos
group by posname, costcentercode;
Thank you all.

Extreme values within each group of dataset

I have an SQLScript query written in AMDP which creates two new columns source_contract and target_contract.
RETURN SELECT client as client,
pob_id as pob_id,
dateto as change_to,
datefrom as change_from,
cast( cast( substring( cast( datefrom as char( 8 ) ), 1,4 ) as NUMBER ( 4 ) ) as INT )
as change_year,
cast( CONCAT( '0' , ( substring( cast( datefrom as char( 8 ) ), 5,2 ) ) ) as VARCHAR (3))
as change_period,
LAG( contract_id, 1, '00000000000000' ) OVER ( PARTITION BY pob_id ORDER BY pob_id, datefrom )
as source_contract,
contract_id as target_contract
from farr_d_pob_his
ORDER BY pob_id
Original data:
POB Valid To Valid From Contract
257147 05.04.2018 05.04.2018 10002718
257147 29.05.2018 06.04.2018 10002719
257147 31.12.9999 30.05.2018 10002239
Data from AMDP view:
I want to ignore any intermediate rows (the dates decide the order). Any suggestions or ideas?
I thought of using GROUP BY to get the max and min dates and UNION-ing those entries in a separate consumption view, but with GROUP BY we can't fetch the other columns. The other possibility is ORDER BY date, but that is not available in CDS.
You already have the optimal solution with sub-selects.
Pseudo code:
SELECT *
FROM OriginalData
WHERE (POB, ValidFrom)
IN (SELECT POB, MIN(ValidFrom)
FROM OriginalData
GROUP BY POB)
OR (POB, ValidTo)
IN (SELECT POB, MAX(ValidTo)
FROM OriginalData
GROUP BY POB);
GROUP BY won't work as it "mixes up" the minimums in different columns.
A nice touch might be extracting the sub-selects into views of their own, eg. EarliestContractPerPob and LatestContractPerPob.
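The pseudo code relies on row-value IN, which e.g. SQLite (3.15+), Postgres, and Oracle accept; SQL Server would need joins instead. A quick check with the question's rows (column names shortened for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE OriginalData (pob INT, validTo TEXT, validFrom TEXT, contract TEXT);
INSERT INTO OriginalData VALUES
 (257147, '2018-04-05', '2018-04-05', '10002718'),
 (257147, '2018-05-29', '2018-04-06', '10002719'),
 (257147, '9999-12-31', '2018-05-30', '10002239');
""")

# Keep the row holding each POB's earliest validFrom OR its latest validTo;
# intermediate rows match neither subquery and are dropped.
rows = con.execute("""
SELECT * FROM OriginalData
WHERE (pob, validFrom) IN (SELECT pob, MIN(validFrom) FROM OriginalData GROUP BY pob)
   OR (pob, validTo)   IN (SELECT pob, MAX(validTo)   FROM OriginalData GROUP BY pob)
ORDER BY validFrom
""").fetchall()
print(rows)
```

Only the first and last contracts per POB survive, as required.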
Here is a proof-of-concept solution for your task.
Suppose we have a dataset pre-selected by material type (MTART), based on table mara, which is quite similar to yours:
------------------------------------------------
| MATNR | ERSDA | VPSTA |MTART|
------------------------------------------------
| 17000000007|18.06.2018|KEDBXCZ |ZSHD |
| 17000000008|21.06.2018|K |ZSHD |
| 17000000011|21.06.2018|K |ZSHD |
| 17000000023|22.06.2018|KEDCBGXZLV|ZSHD |
| 17000000103|09.01.2019|K |ZSHD |
| 17000000104|09.01.2019|K |ZSHD |
| 17000000105|09.01.2019|K |ZSHD |
| 17000000113|06.02.2019|V |ZSHD |
------------------------------------------------
Here are the materials; we want to keep only the first and the last material (MATNR) by creation date (ERSDA) and find the maintenance type (VPSTA) for each.
------------------------------------------------
| MATNR | ERSDA | VPSTA |MTART|
------------------------------------------------
| 17000000007|18.06.2018|KEDBXCZ |ZSHD |
| 17000000113|06.02.2019|V |ZSHD |
------------------------------------------------
In your case you would similarly search, within each POB (analogous to mtart), for the source and target contracts contract_id (analogous to the first and last vpsta), based on the datefrom criterion (analogous to ersda).
One can achieve that using UNION and two selects with sub-queries:
SELECT ersda AS date, matnr AS max, mtart AS type, vpsta AS maint
FROM mara AS m
WHERE ersda = ( SELECT MAX( ersda ) FROM mara WHERE mtart = m~mtart )
UNION SELECT ersda AS date, matnr AS max, mtart AS type, vpsta AS maint
FROM mara AS m2
WHERE ersda = ( SELECT MIN( ersda ) FROM mara WHERE mtart = m2~mtart )
ORDER BY type, date
INTO TABLE @DATA(lt_result).
Here you can see that the first select fetches the max ersda dates and the second select fetches the min ones.
The resulting set, ordered by type and date, is roughly what you are looking for (F = first, L = last):
Your SELECT should look somewhat like this:
SELECT datefrom as change_from, contract_id AS contract, pob_id AS pob
FROM farr_d_pob_his AS farr
WHERE datefrom = ( SELECT MAX( datefrom ) FROM farr_d_pob_his WHERE pob_id = farr~pob_id )
UNION SELECT datefrom as change_from, contract_id AS contract, pob_id AS pob
FROM farr_d_pob_his AS farr2
WHERE datefrom = ( SELECT MIN( datefrom ) FROM farr_d_pob_his WHERE pob_id = farr2~pob_id )
ORDER BY pob, change_from
INTO TABLE @DATA(lt_result).
Note that this will only work if the datefrom dates are unique; otherwise the query cannot know which last/first contract you want. Also, if there is only one contract within a POB, there will be only one record.
A couple of words about implementation. In your sample I see that you use an AMDP class, but you later mention that ORDER BY is not supported in CDS. True, it is not supported in CDS, and neither are sub-queries, but both are supported in AMDP.
You should differentiate between two types of AMDP functions: functions for AMDP methods and functions for CDS table functions. The former handle SELECTs with sorting and sub-queries perfectly well. You can see samples in the CL_DEMO_AMDP_VS_OPEN_SQL demo class, which demonstrates AMDP features including sub-queries. You can put your code in an AMDP function and call it from your CDS table function implementation.

Getting rows with the highest SELECT COUNT from groups within a resultset

I have a SQLite Database that contains parsed Apache log lines.
A simplified version of the DB's only table (accesses) looks like this:
|referrer|datestamp|
+--------+---------+
|xy.de | 20170414|
|ab.at | 20170414|
|xy.de | 20170414|
|xy.de | 20170414|
|12.com | 20170413|
|12.com | 20170413|
|xy.de | 20170413|
|12.com | 20170413|
|12.com | 20170412|
|xy.de | 20170412|
|12.com | 20170412|
|12.com | 20170412|
|ab.at | 20170412|
|ab.at | 20170412|
|12.com | 20170412|
+--------+---------+
I am trying to retrieve the top referrer for each day by performing a sub query that does a SELECT COUNT on the referrer. Afterwards I select the entries from that subquery that have the highest count:
SELECT datestamp, referrer, COUNT(*)
FROM accesses WHERE datestamp BETWEEN '20170414' AND '20170414'
GROUP BY referrer
HAVING COUNT(*) = (select MAX(anz)
FROM (SELECT COUNT(*) anz
FROM accesses
WHERE datestamp BETWEEN '20170414' AND '20170414'
GROUP BY referrer
)
);
The above approach works as long as I perform the query for a single date, but it falls apart as soon as I query for date ranges.
How can I achieve grouping by date? I am also only interested in the referrer with the highest count.
If you want all the days combined with a single best referrer, then:
SELECT referrer, COUNT(*) as anz
FROM accesses
WHERE datestamp BETWEEN '20170414' AND '20170414'
GROUP BY referrer
ORDER BY COUNT(*) DESC
LIMIT 1;
I think you might want this information broken out by day. If so, a correlated subquery helps -- and a CTE as well:
WITH dr as (
SELECT a.datestamp, a.referrer, COUNT(*) as cnt
FROM accesses a
WHERE datestamp BETWEEN '20170414' AND '20170414'
GROUP BY a.referrer, a.datestamp
)
SELECT dr.*
FROM dr
WHERE dr.cnt = (SELECT MAX(dr2.cnt)
FROM dr dr2
WHERE dr2.datestamp = dr.datestamp
);
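The CTE approach can be run as-is in SQLite (shown here via Python, loading the question's table contents and widening the date range to cover all three days):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accesses (referrer TEXT, datestamp TEXT);
INSERT INTO accesses VALUES
 ('xy.de','20170414'),('ab.at','20170414'),('xy.de','20170414'),('xy.de','20170414'),
 ('12.com','20170413'),('12.com','20170413'),('xy.de','20170413'),('12.com','20170413'),
 ('12.com','20170412'),('xy.de','20170412'),('12.com','20170412'),('12.com','20170412'),
 ('ab.at','20170412'),('ab.at','20170412'),('12.com','20170412');
""")

# dr counts hits per (day, referrer); the correlated subquery keeps,
# per day, only the referrer(s) with the highest count.
rows = con.execute("""
WITH dr AS (
  SELECT datestamp, referrer, COUNT(*) AS cnt
  FROM accesses
  WHERE datestamp BETWEEN '20170412' AND '20170414'
  GROUP BY datestamp, referrer
)
SELECT dr.* FROM dr
WHERE dr.cnt = (SELECT MAX(dr2.cnt)
                FROM dr dr2
                WHERE dr2.datestamp = dr.datestamp)
ORDER BY dr.datestamp
""").fetchall()
print(rows)
```

If two referrers tie on a day, both rows come back; add a tiebreaker if exactly one per day is required.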
Just group by a date range. As an example,
SELECT referrer,
case when datestamp Between '20170101' AND '20170131' then 1
when datestamp Between '20170201' AND '20170228' then 2
when datestamp Between '20170301' AND '20170331' then 3
else 4 end as DateRange,
COUNT(*) as anz
FROM accesses
GROUP BY referrer,
case when datestamp Between '20170101' AND '20170131' then 1
when datestamp Between '20170201' AND '20170228' then 2
when datestamp Between '20170301' AND '20170331' then 3
else 4 end
ORDER BY referrer, COUNT(*) DESC
LIMIT 1;
You can put any legal SQL expression in a group by clause. This causes the query processor to create individual buckets and aggregate the raw data into them according to the value of the group by expression.
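As a smaller illustration of grouping by an expression, here is month-bucketing of some invented datestamps in SQLite (substr of the yyyymmdd string as the bucket key):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accesses (referrer TEXT, datestamp TEXT);
INSERT INTO accesses VALUES
 ('xy.de','20170414'),('xy.de','20170413'),('12.com','20170413'),('ab.at','20170312');
""")

# The same expression appears in SELECT and GROUP BY: each distinct
# substr value becomes one aggregation bucket.
rows = con.execute("""
SELECT substr(datestamp, 1, 6) AS month, COUNT(*) AS anz
FROM accesses
GROUP BY substr(datestamp, 1, 6)
ORDER BY month
""").fetchall()
print(rows)
```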

How to find the minimum value in a postgres sql column which contains jsonb data?

I have a table t in postgres database. It has a column data which contains jsonb data in the following format (for each record)-
{
"20161214": {"4": ["3-14", "5-16", "642"], "9": ["3-10", "5-10", "664"] },
"20161217": {"3": ["3-14", "5-16", "643"], "7": ["3-10", "5-10", "661"] }
}
where 20161214 is the date, "4" is the month, and 642 is the amount.
I need to find the minimum amount for each record of the table and the month that amount belongs to.
What I have tried:
Using the jsonb_each function and separating key/value pairs, then using the min function. But I still can't get the month it belongs to.
How can this be achieved?
select j2.date
,j2.month
,j2.amount
from t
left join lateral
(select j1.date
,j2.month
,(j2.value->>2)::numeric as amount
from jsonb_each (t.data) j1 (date,value)
left join lateral jsonb_each (j1.value) j2 (month,value)
on true
order by amount
limit 1
) j2
on true
+----------+-------+--------+
| date | month | amount |
+----------+-------+--------+
| 20161214 | 4 | 642 |
+----------+-------+--------+
Alternatively (without joins):
select
min(case when amount = min_amount then month end) as month,
min_amount as amout
from (
select
key as month,
(select min((value->>2)::int) from jsonb_each(value)) as amount,
min((select min((value->>2)::int) from jsonb_each(value))) over(partition by rnum) as min_amount,
rnum
from (
select
(jsonb_each(data)).*,
row_number() over() as rnum
from t
) t
) t
group by
rnum, min_amount;
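For readers outside Postgres: the lateral unnesting has a close analogue in SQLite's json1 extension, where a table-valued json_each call may reference earlier FROM items directly (shown via Python; assumes json1 is available, which it is in most builds). The data here is the question's single sample record:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (data TEXT)")  # JSON stored as text
con.execute("""INSERT INTO t VALUES ('{
 "20161214": {"4": ["3-14","5-16","642"], "9": ["3-10","5-10","664"]},
 "20161217": {"3": ["3-14","5-16","643"], "7": ["3-10","5-10","661"]}}')""")

# j1 unnests date -> months object, j2 unnests month -> array;
# $[2] is the third array element (the amount, stored as a string).
row = con.execute("""
SELECT j1.key AS date, j2.key AS month,
       CAST(json_extract(j2.value, '$[2]') AS INT) AS amount
FROM t, json_each(t.data) AS j1, json_each(j1.value) AS j2
ORDER BY amount
LIMIT 1
""").fetchone()
print(row)
```

With several records, a per-record minimum would need GROUP BY t.rowid (or a window function) instead of the global ORDER BY ... LIMIT 1.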