SQL/REGEX puzzle/challenge: How to convert ASCII art ranges with multiple characters to relational data?

The motivation here was to easily and accurately generate data samples for the nested ranges challenge.
A table contains a single column of text type.
The text contains one or more lines, where each line contains one or more sections, each section being a run of a repeated letter.
The goal is to write a query that returns a tuple for each section with its start point, end point, and value.
Data sample
create table t (txt varchar (1000));
insert into t (txt) values
(
'
AAAAAAAAAAAAAAAAAAAAAAAAAAAA BBBB CCCCCCCCCCCCCCCCCCCCCCCCC
DDDE FFFFFFFF GGGGGGGGG HHHHHHHH IIIIIII
JJ KKKLLL MM NN OOOOO
P QQ
'
)
;
Requested results
* Only the last 3 columns (section start/end/val) are required; the rest are for debugging purposes.
line_ind section_ind section_length section_start section_end section_val
1 1 28 1 28 A
1 2 4 31 34 B
1 3 25 39 63 C
2 1 3 1 3 D
2 2 1 4 4 E
2 3 8 7 14 F
2 4 9 19 27 G
2 5 8 43 50 H
2 6 7 55 61 I
3 1 2 1 2 J
3 2 3 9 11 K
3 3 3 12 14 L
3 4 2 22 23 M
3 5 2 25 26 N
3 6 5 57 61 O
4 1 1 13 13 P
4 2 2 60 61 Q

Teradata
Currently regexp_split_to_table doesn't seem to support zero-length expressions (I've created incident RECGZJKZV). To work around this limitation I'm using regexp_replace to insert a space between adjacent runs of different letters, e.g. KKKLLL becomes KKK LLL.
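As a quick, hedged illustration of the same idea (insert a space wherever the character changes): PostgreSQL, which also supports backreferences and lookahead, can express it as below. This is only a sketch of the technique, not the Teradata pattern used in the query that follows, and it also appends a space at the very end of the string.
-- assumption: PostgreSQL; shows the space-pushing step on a single value
select regexp_replace('KKKLLL', '(.)(?!\1)', '\1 ', 'g') as spaced;
-- result: 'KKK LLL ' (adjacent runs of different letters are now separated)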
with l
as
(
select line_ind
,line
from table
(
regexp_split_to_table (-1,t.txt,'\r','')
returns (minus_one int,line_ind int,line varchar(1000))
)
as l
)
select l.line_ind
,r.section_ind
,char_length (r.section) as section_length
,regexp_instr (l.line,'(\S)\1*',1,r.section_ind,0) as section_start
,regexp_instr (l.line,'(\S)\1*',1,r.section_ind,1) - 1 as section_end
,substr (r.section,1,1) as section_val
from table
(
regexp_split_to_table (l.line_ind,regexp_replace (l.line,'(?<=(?P<c>.))(?!(?P=c))',' '),'\s+','')
returns (line_ind int,section_ind int,section varchar(1000))
)
as r
,l
where l.line_ind = r.line_ind
order by l.line_ind
,r.section_ind
;
Oracle
select regexp_instr (txt,'(\S)\1*',1,level,0) - instr (txt,chr(10),regexp_instr (txt,'(\S)\1*',1,level,0) - length (txt) - 1,1) as section_start
,regexp_instr (txt,'(\S)\1*',1,level,1) - 1 - instr (txt,chr(10),regexp_instr (txt,'(\S)\1*',1,level,0) - length (txt) - 1,1) as section_end
,regexp_substr (txt,'(\S)\1*',1,level,'',1) as section_val
from t
connect by level <= regexp_count (txt,'(\S)\1*')
;
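Two details of the query above may deserve a note: the pattern '(\S)\1*' matches a run of one repeated non-space character, with regexp_instr's return option 0/1 giving the run's start and the position just past its end, and the negative third argument to instr makes Oracle search backwards, which converts an absolute position in txt into a position relative to its own line. A small hedged illustration with made-up literals:
-- occurrence 3 of '(\S)\1*' in this string is the run 'LLL'
select regexp_substr('JJ  KKKLLL', '(\S)\1*', 1, 3)         as run,        -- 'LLL'
       regexp_instr ('JJ  KKKLLL', '(\S)\1*', 1, 3, 0)      as run_start,  -- 8
       regexp_instr ('JJ  KKKLLL', '(\S)\1*', 1, 3, 1) - 1  as run_end     -- 10
from dual;
-- a negative position makes instr scan backwards: starting from absolute
-- position 6 ('e'), this finds the newline at position 4, so 6 - 4 = 2 is
-- the position of 'e' within its line
select instr('abc' || chr(10) || 'def', chr(10), 6 - 7 - 1, 1) as prev_newline  -- 4
from dual;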

Oracle
This will work even if you have multiple input rows:
WITH lines ( txt, id, line, pos, line_no ) AS(
SELECT txt,
id,
REGEXP_SUBSTR( txt, '.*', 1, 1 ),
REGEXP_INSTR( txt, '.*', 1, 1, 1 ),
1
FROM t
UNION ALL
SELECT txt,
id,
REGEXP_SUBSTR( txt, '.*', pos + 1, 1 ),
REGEXP_INSTR( txt, '.*', pos + 1, 1, 1 ),
line_no + 1
FROM lines
WHERE pos > 0
),
words ( id, line, line_no, section_start, section_end, section_value ) AS (
SELECT id,
line,
line_no,
REGEXP_INSTR( line, '(\S)\1*', 1, 1, 0 ),
REGEXP_INSTR( line, '(\S)\1*', 1, 1, 1 ) - 1,
REGEXP_SUBSTR( line, '(\S)\1*', 1, 1, NULL, 1 )
FROM lines
WHERE pos > 0
AND line IS NOT NULL
UNION ALL
SELECT id,
line,
line_no,
REGEXP_INSTR( line, '(\S)\1*', section_end + 1, 1, 0 ),
REGEXP_INSTR( line, '(\S)\1*', section_end + 1, 1, 1 ) - 1,
REGEXP_SUBSTR( line, '(\S)\1*', section_end + 1, 1, NULL, 1 )
FROM words
WHERE section_end > 0
)
SELECT id,
line_no,
section_start,
section_end,
section_value
FROM words
WHERE section_end > 0
ORDER BY id, line_no, section_start
So, for the input data (with an added id column to be able to easily differentiate the pieces of text):
create table t (id NUMBER(5,0), txt varchar (1000));
insert into t (id, txt) values
(
1,
'
AAAAAAAAAAAAAAAAAAAAAAAAAAAA BBBB CCCCCCCCCCCCCCCCCCCCCCCCC
DDDE FFFFFFFF GGGGGGGGG HHHHHHHH IIIIIII
JJ KKKLLL MM NN OOOOO
P QQ
'
);
insert into t (id, txt) values ( 2, 'RRRSTT UUU V WXYZ' );
This outputs:
ID | LINE_NO | SECTION_START | SECTION_END | SECTION_VALUE
-: | ------: | ------------: | ----------: | :------------
1 | 2 | 1 | 28 | A
1 | 2 | 31 | 34 | B
1 | 2 | 39 | 63 | C
1 | 3 | 1 | 3 | D
1 | 3 | 4 | 4 | E
1 | 3 | 7 | 14 | F
1 | 3 | 19 | 27 | G
1 | 3 | 43 | 50 | H
1 | 3 | 55 | 61 | I
1 | 4 | 1 | 2 | J
1 | 4 | 9 | 11 | K
1 | 4 | 12 | 14 | L
1 | 4 | 22 | 23 | M
1 | 4 | 25 | 26 | N
1 | 4 | 57 | 61 | O
1 | 5 | 13 | 13 | P
1 | 5 | 60 | 61 | Q
2 | 1 | 1 | 3 | R
2 | 1 | 4 | 4 | S
2 | 1 | 5 | 6 | T
2 | 1 | 8 | 10 | U
2 | 1 | 15 | 15 | V
2 | 1 | 17 | 17 | W
2 | 1 | 18 | 18 | X
2 | 1 | 19 | 19 | Y
2 | 1 | 20 | 20 | Z
db<>fiddle here
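A note on the line-splitting step above: without the 'n' match parameter, '.' does not match a newline in Oracle, so REGEXP_SUBSTR( txt, '.*', pos + 1, 1 ) returns one line at a time, and REGEXP_INSTR with return option 1 yields the position just past that line, which is what drives the recursion. A minimal hedged sketch:
-- '.*' stops at the newline; return option 1 points just past the match
select regexp_substr('AB' || chr(10) || 'CD', '.*', 1, 1)    as first_line,  -- 'AB'
       regexp_instr ('AB' || chr(10) || 'CD', '.*', 1, 1, 1) as next_pos     -- 3
from dual;
-- the recursive member then continues from pos + 1 (position 4 here, the start of 'CD')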

Related

How to get columns when using buckets (width_bucket)

I would like to know which rows were moved to which bucket.
SELECT
width_bucket(s.score, sl.mins, sl.maxs, 9) as buckets,
COUNT(*)
FROM scores s
CROSS JOIN scores_limits sl
GROUP BY 1
ORDER BY 1;
What my query actually returns:
buckets | count
---------+-------
1 | 182
2 | 37
3 | 46
4 | 15
5 | 29
7 | 18
8 | 22
10 | 11
| 20
What I would like to be able to query:
SELECT buckets FROM buckets_table [...] WHERE scores.id = 1;
How can I get, for example, the column 'id' of table scores?
I believe you can include the id in an array with array_agg. If I recreate your case with
create table test (id serial, score int);
insert into test(score) values (10),(9),(5),(4),(10),(2),(5),(7),(8),(10);
The data is
id | score
----+-------
1 | 10
2 | 9
3 | 5
4 | 4
5 | 10
6 | 2
7 | 5
8 | 7
9 | 8
10 | 10
(10 rows)
Using the following and aggregating the id with array_agg
SELECT
width_bucket(score, 0, 10, 11) as buckets,
COUNT(*) nr_ids,
array_agg(id) agg_ids
FROM test s
GROUP BY 1
ORDER BY 1;
You get
buckets | nr_ids | agg_ids
---------+--------+----------
3 | 1 | {6}
5 | 1 | {4}
6 | 2 | {3,7}
8 | 1 | {8}
9 | 1 | {9}
10 | 1 | {2}
12 | 3 | {1,5,10}
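If the goal is the reverse lookup sketched in the question (given an id, find its bucket, rather than listing the ids per bucket), width_bucket can simply be evaluated per row; a small sketch against the same test table and bucket parameters:
SELECT id,
       score,
       width_bucket(score, 0, 10, 11) AS bucket
FROM test
WHERE id = 1;
-- id 1 has score 10, which equals the upper bound, so it lands in the overflow
-- bucket 12, consistent with {1,5,10} appearing under bucket 12 in the output above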

Recursive join with SUM

I have data in the following format:
FromStateID ToStateID Seconds
1 2 10
2 3 20
3 4 15
4 5 5
I need the following output
FromStateID ToStateID Seconds
1 2 10
2 3 20
3 4 15
4 5 5
1 3 10+20
1 4 10+20+15
1 5 10+20+15+5
2 4 20+15
2 5 20+15+5
3 5 15+5
This output shows the total time taken from FromStateId to ToStateId for every combination, in chronological order.
Please help.
I think this is a recursive CTE that follows the links:
with cte as (
select FromStateID, ToStateID, Seconds
from t
union all
select cte.FromStateId, t.ToStateId, cte.Seconds + t.Seconds
from cte join
t
on cte.toStateId = t.FromStateId
)
select *
from cte;
Here is a db<>fiddle.
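Since the question asks for the combinations in chronological order, an ORDER BY can be appended to the final select; one reasonable interpretation, as a small sketch:
select *
from cte
order by FromStateID, ToStateID;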
Gordon Linoff's answer is the better solution. Below is another option to achieve the same result.
You can achieve this using CROSS JOIN and CROSS APPLY:
DECLARE #table table(FromStateId int, ToStateId int, seconds int)
insert into #table
values
(1 ,2 ,10),
(2 ,3 ,20),
(3 ,4 ,15),
(4 ,5 ,5 );
;with cte_fromToCombination as
(select f.fromStateId, t.tostateId
from
(select distinct fromStateId from #table) as f
cross join
(select distinct toStateId from #table) as t
)
select c.FromStateId, c.ToStateId, t.sumseconds as Total_seconds
from cte_fromToCombination as c
CROSS APPLY
(SELECT sum(t.seconds)
from
#table as t
WHERE t.ToStateId <= c.ToStateId
AND t.FromStateId >= c.FromStateId
) as t(sumseconds)
where c.tostateId > c.fromStateId
order by FromStateId,ToStateId
+-------------+-----------+---------------+
| FromStateId | ToStateId | Total_seconds |
+-------------+-----------+---------------+
| 1           | 2         | 10            |
| 1           | 3         | 30            |
| 1           | 4         | 45            |
| 1           | 5         | 50            |
| 2           | 3         | 20            |
| 2           | 4         | 35            |
| 2           | 5         | 40            |
| 3           | 4         | 15            |
| 3           | 5         | 20            |
| 4           | 5         | 5             |
+-------------+-----------+---------------+

expand oracle table fields linearly

I have the following table in oracle:
ID field_1 field_2
1 1-5 1-5
1 20-30 55-65
2 1-8 10-17
2 66-72 80-86
I need to convert this table to the following format where field_1 and field_2 must be matched linearly:
ID field_1 field_2
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
1 20 55
1 21 56
1 22 57
1 23 58
1 24 59
1 25 60
1 26 61
1 27 62
1 28 63
1 29 64
1 30 65
2 1 10
2 2 11
2 3 12
2 4 13
2 5 14
2 6 15
2 7 16
2 8 17
2 66 80
2 67 81
2 68 82
2 69 83
2 70 84
2 71 85
2 72 86
What is the easiest and fastest way to accomplish this, knowing that the original table contains thousands of records?
The lateral clause, used below, is available since Oracle 12.1. For older versions, a connect by hierarchical query is probably still the fastest option, but it needs to be written with a bit more care (and it will be slower than using connect by in a lateral join).
Of course, the big assumption is that the inputs are always in the form number-dash-number, and that the difference between the upper and the lower bound is the same in the two columns, for each row. I am not even attempting to check for that.
select t.id, l.field_1, l.field_2
from mytable t,
lateral (select to_number(substr(field_1, 1, instr(field_1, '-') - 1))
+ level - 1 as field_1,
to_number(substr(field_2, 1, instr(field_2, '-') - 1))
+ level - 1 as field_2
from dual
connect by level <=
to_number(substr(field_1, instr(field_1, '-') + 1))
- to_number(substr(field_1, 1, instr(field_1, '-') - 1)) + 1
) l
;
One option uses a recursive query. Starting with 11gR2, Oracle supports standard recursive common table expressions, so you can do:
with cte(id, field_1, field_2, max_field_1, max_field_2) as (
select
id,
to_number(regexp_substr(field_1, '^\d+')),
to_number(regexp_substr(field_2, '^\d+')),
to_number(regexp_substr(field_1, '\d+$')),
to_number(regexp_substr(field_2, '\d+$'))
from mytable
union all
select
id,
field_1 + 1,
field_2 + 1,
max_field_1,
max_field_2
from cte
where field_1 < max_field_1
)
select id, field_1, field_2 from cte order by id, field_1
This assumes that intervals on the same row always have the same length, as shown in your sample data. If that's not the case, you would need to explain how you want to handle that.
Demo on DB Fiddle:
ID | FIELD_1 | FIELD_2
-: | ------: | ------:
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
1 | 4 | 4
1 | 5 | 5
1 | 20 | 55
1 | 21 | 56
1 | 22 | 57
1 | 23 | 58
1 | 24 | 59
1 | 25 | 60
1 | 26 | 61
1 | 27 | 62
1 | 28 | 63
1 | 29 | 64
1 | 30 | 65
2 | 1 | 10
2 | 2 | 11
2 | 3 | 12
2 | 4 | 13
2 | 5 | 14
2 | 6 | 15
2 | 7 | 16
2 | 8 | 17
2 | 66 | 80
2 | 67 | 81
2 | 68 | 82
2 | 69 | 83
2 | 70 | 84
2 | 71 | 85
2 | 72 | 86
You can use a cross join with generated values as follows:
SELECT ID,
to_number(regexp_substr(field_1, '[0-9]+',1,1)) + column_value - 1 AS FIELD_1,
to_number(regexp_substr(field_2, '[0-9]+',1,1)) + column_value - 1 AS FIELD_2
FROM your_table
cross join
table(cast(multiset(select level from dual
connect by level <=
to_number(regexp_substr(field_1, '[0-9]+',1,2))
- to_number(regexp_substr(field_1, '[0-9]+',1,1))
+ 1
) as sys.odcinumberlist))
ORDER BY 1,2
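For reference, the row generator used in the cross join can be tried on its own; a small sketch with an illustrative fixed bound of 5 in place of the per-row count derived from field_1:
select column_value
from table(cast(multiset(
              select level from dual connect by level <= 5
          ) as sys.odcinumberlist));
-- returns one row per value 1..5; in the query above this generator is cross
-- joined with your_table so each source row is expanded that many times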

Oracle: SQL request with a GROUP BY and a percentage on two different tables

I'm currently stuck on a complex query (with a join):
I have this table "DATA":
order | product
----------------
1 | A
1 | B
2 | A
2 | D
3 | A
3 | C
4 | A
4 | B
5 | Y
5 | Z
6 | W
6 | A
And this table "DICO":
order | couple | first | second
-------------------------------
1 | A-B | A | B
2 | A-D | A | D
3 | A-C | A | C
4 | A-B | A | B
5 | Y-Z | Y | Z
6 | W-A | W | A
I would like to obtain, one line per couple:
order | count | total1stElem | %1stElem | total2ndElem | %2ndElem
------------------------------------------------------------------
A-B | 2 | 5 | 40% | 2 | 100%
A-D | 1 | 5 | 20% | 1 | 100%
A-C | 1 | 5 | 20% | 1 | 100%
Y-Z | 1 | 1 | 100% | 1 | 100%
W-A | 1 | 1 | 100% | 5 | 20%
I'm totally stuck on the join part of my query. Can somebody help me?
Without any joins - just using UNPIVOT and PIVOT:
Oracle Setup:
CREATE TABLE DICO ( "order", couple, first, second ) AS
SELECT 1, 'A-B', 'A', 'B' FROM DUAL UNION ALL
SELECT 2, 'A-D', 'A', 'D' FROM DUAL UNION ALL
SELECT 3, 'A-C', 'A', 'C' FROM DUAL UNION ALL
SELECT 4, 'A-B', 'A', 'B' FROM DUAL UNION ALL
SELECT 5, 'Y-Z', 'Y', 'Z' FROM DUAL UNION ALL
SELECT 6, 'W-A', 'W', 'A' FROM DUAL;
Query:
SELECT "order",
"count",
"1stElem_TOTAL" AS Total1stElem,
100*"count"/"1stElem_TOTAL" AS "%1stElem",
"2ndElem_TOTAL" AS Total2ndElem,
100*"count"/"2ndElem_TOTAL" AS "%2ndElem"
FROM (
SELECT couple AS "order",
key,
COUNT(*) OVER ( PARTITION BY COUPLE )/2 AS "count",
COUNT(*) OVER ( PARTITION BY VALUE ) AS num_value
FROM DICO
UNPIVOT ( Value FOR Key IN ( first AS 1, second AS 2 ) )
)
PIVOT ( MAX( NUM_VALUE ) AS Total FOR key IN ( 1 AS "1stElem", 2 AS "2ndElem" ) );
Results:
order count TOTAL1STELEM %1stElem TOTAL2NDELEM %2ndElem
----- ----- ------------ -------- ------------ --------
A-D 1 5 20 1 100
A-B 2 5 40 2 100
A-C 1 5 20 1 100
Y-Z 1 1 100 1 100
W-A 1 1 100 5 20
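To see why the per-couple count is divided by 2, it can help to look at the unpivoted intermediate rows on their own; a small sketch against the setup above:
SELECT couple, key, value
FROM   DICO
UNPIVOT ( Value FOR Key IN ( first AS 1, second AS 2 ) )
ORDER BY couple, key;
-- each DICO row becomes two rows (one per element of the couple), so
-- COUNT(*) OVER ( PARTITION BY COUPLE ) is twice the number of orders for that
-- couple, while COUNT(*) OVER ( PARTITION BY VALUE ) counts how many rows a
-- letter appears in, in either position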

Postgres width_bucket() not assigning values to buckets correctly

In PostgreSQL 9.5.3 I can't get width_bucket() to work as expected; it appears to be assigning values to the wrong buckets.
Dataset:
1
2
4
32
43
82
104
143
232
295
422
477
Expected output (bucket ranges and zero-count rows added to help analysis):
bucket | bucketmin | bucketmax | Expect | Actual
--------+-----------+-----------+--------|--------
1 | 1 | 48.6 | 5 | 5
2 | 48.6 | 96.2 | 1 | 2
3 | 96.2 | 143.8 | 2 | 1
4 | 143.8 | 191.4 | 0 | 0
5 | 191.4 | 239 | 1 | 1
6 | 239 | 286.6 | 0 | 1
7 | 286.6 | 334.2 | 1 | 0
8 | 334.2 | 381.8 | 0 | 1
9 | 381.8 | 429.4 | 1 | 0
10 | 429.4 | 477 | 1 | 1
Actual output:
wb | count
----+-------
1 | 5
2 | 2
3 | 1
5 | 1
6 | 1
8 | 1
10 | 1
Code to generate actual output:
create temp table metrics (val int);
insert into metrics (val) values(1),(2),(4),(32),(43),(82),(104),(143),(232),(295),(422),(477);
with metric_stats as (
select
cast(min(val) as float) as minV,
cast(max(val) as float) as maxV
from metrics m
),
hist as (
select
width_bucket(val, s.minV, s.maxV, 9) wb,
count(*)
from metrics m, metric_stats s
group by 1 order by 1
)
select * from hist;
Your calculations appear to be off. The following query:
with metric_stats as (
select cast(min(val) as float) as minV,
cast(max(val) as float) as maxV
from metrics m
)
select g.n,
s.minV + ((s.maxV - s.minV) / 9) * (g.n - 1) as bucket_start,
s.minV + ((s.maxV - s.minV) / 9) * g.n as bucket_end
from generate_series(1, 9) g(n) cross join
metric_stats s
order by g.n
Yields the following bins (n, bucket_start, bucket_end):
1 1 53.8888888888889
2 53.8888888888889 106.777777777778
3 106.777777777778 159.666666666667
4 159.666666666667 212.555555555556
5 212.555555555556 265.444444444444
6 265.444444444444 318.333333333333
7 318.333333333333 371.222222222222
8 371.222222222222 424.111111111111
9 424.111111111111 477
I think you intend for the "9" to be a "10", if you want 10 buckets.
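As a quick sanity check of the boundaries, width_bucket can be evaluated directly against a few of the sample values; note that in PostgreSQL an operand equal to the upper bound falls into the overflow bucket count + 1, so with 10 buckets the maximum value 477 reports as bucket 11:
select width_bucket(1,   1, 477, 10) as b_1,    -- 1  (first bucket, width 47.6)
       width_bucket(43,  1, 477, 10) as b_43,   -- 1
       width_bucket(82,  1, 477, 10) as b_82,   -- 2
       width_bucket(477, 1, 477, 10) as b_477;  -- 11 (equal to the upper bound)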