Get group maxima from combined strings

Get group maxima from combined strings - sql

I have a table with a column code containing multiple pieces of data like this:
001/2017/TT/000001
001/2017/TT/000002
001/2017/TN/000003
001/2017/TN/000001
001/2017/TN/000002
001/2016/TT/000001
001/2016/TT/000002
001/2016/TT/000001
002/2016/TT/000002
There are 4 items in 001/2016/TT/000001: 001, 2016, TT and 000001.
How can I extract the max for every group formed by the first 3 items? The result I want is this:
001/2017/TT/000003
001/2017/TN/000002
001/2016/TT/000002
002/2016/TT/000002
Edit
The subfield separator is /, and the length of subfields can vary.
I use PostgreSQL 9.3.

Obviously, you should normalize the table and split the combined string into 4 columns with proper data type. The function split_part() is the tool of choice if the separator '/' is constant in your string and the length of can vary.
CREATE TABLE tbl_better AS
SELECT split_part(code, '/', 1)::int AS col_1 -- better names?
, split_part(code, '/', 2)::int AS col_2
, split_part(code, '/', 3) AS col_3 -- text?
, split_part(code, '/', 4)::int AS col_4
FROM tbl_bad
ORDER BY 1,2,3,4 -- optionally cluster data.
Then the task is trivial:
SELECT col_1, col_2, col_3, max(col_4) AS max_nr
FROM tbl_better
GROUP BY 1, 2, 3;
Related:
Split comma separated column data into additional columns
Of course, you can do it on the fly, too. For varying subfield length you could use substring() with a regular expression like this:
SELECT max(substring(code, '([^/]*)$')) AS max_nr
FROM tbl_bad
GROUP BY substring(code, '^(.*)/');
Related (with basic explanation for regexp pattern):
Filter strings with regex before casting to numeric
Or to get only the complete string as result:
SELECT DISTINCT ON (substring(code, '^(.*)/'))
code
FROM tbl_bad
ORDER BY substring(code, '^(.*)/'), code DESC;
About DISTINCT ON:
Select first row in each GROUP BY group?
Be aware that data items cast to a suitable type may behave differently from their string representation. The max of 900001 and 1000001 is 900001 for text and 1000001 for integer ...

Use the LEFT and RIGHT functions.
SELECT MAX(RIGHT(code,6)) AS MAX_CODE
FROM yourtable
GROUP BY LEFT(code,12)

check this out, possible helpfull
select
distinct on (tab[4],tab[2]) tab[4],tab[3],tab[2],tab[1]
from
(
select
string_to_array(exe.x,'/') as tab,
exe.x
from
(
select
unnest
(
array
['001/2017/TT/000001',
'001/2017/TT/000002',
'001/2017/TN/000003',
'001/2017/TN/000001',
'001/2017/TN/000002',
'001/2016/TT/000001',
'001/2016/TT/000002',
'001/2016/TT/000001',
'002/2016/TT/000002']
) as x
) exe
) exe2
order by tab[4] desc,tab[2] desc,tab[3] desc;

Related

ORACLE TO_CHAR SPECIFY OUTPUT DATA TYPE

I have column with data such as '123456789012'
I want to divide each of each 3 chars from the data with a '/' in between so that the output will be like: "123/456/789/012"
I tried "SELECT TO_CHAR(DATA, '999/999/999/999') FROM TABLE 1" but it does not print out the output as what I wanted. Previously I did "SELECT TO_CHAR(DATA, '$999,999,999,999.99') FROM TABLE 1 and it printed out as "$123,456,789,012.00" so I thought I could do the same for other case as well, but I guess that's not the case.
There is also a case where I also want to put '#' in front of the data so the output will be something like this: #12345678901234. Can I use TO_CHAR for this problem too?
Is these possible? Because when I go through the documentation of oracle about TO_CHAR, it stated a few format that can be use for TO_CHAR function and the format that I want is not listed there.
Thank you in advance. :D

Here is one option with varchar2 datatype:
with test as (
select '123456789012' a from dual
)
select listagg(substr(a,(level-1)*3+1,3),'/') within group (order by rownum) num
from test
connect by level <=length(a)
or
with test as (
select '123456789012.23' a from dual
)
select '$'||listagg(substr((regexp_substr(a,'[0-9]{1,}')),(level-1)*3+1,3),',') within group (order by rownum)||regexp_substr(a,'[.][0-9]{1,}') num
from test
connect by level <=length(a)
output:
1st query
123/456/789/012
2nd query
$123,456,789,012.23

If you wants groups of three then you can use the group separator G, and specify the character to use:
SELECT TO_CHAR(DATA, 'FM999G999G999G999', 'NLS_NUMERIC_CHARACTERS=./') FROM TABLE_1
123/456/789/012
If you want a leading # then you can use the currency indicator L, and again specify the character to use:
SELECT TO_CHAR(DATA, 'FML999999999999', 'NLS_CURRENCY=#') FROM TABLE_1
#123456789012
Or combine both:
SELECT TO_CHAR(DATA, 'FML999G999G999G999', 'NLS_CURRENCY=# NLS_NUMERIC_CHARACTERS=./') FROM TABLE_1
#123/456/789/012
db<>fiddle
The data type is always a string; only the format changes.

Replace Digits Using SubString and lpad/rpad

is there any way to replace digits using substring in Hive.
scenario:
1. I need to check 0's from right to left and as soon as i find 0 before any digit then i need to replace all the trailing 0's by 9.
2. In output at-least 3 digit should be there before 9.
3. In Input if 2 or less digits are available in input then I need to skip some 0's and make sure that at-least 3 digits are there before 9.
4. If more than 3 digits are available before trailing 0's then only need to replace 0.No need to replace digits.
see the below table
input output
123000 123999
120000 120999
123400 123499
101010 101019
I have tried using below query, and it is working as expected.(Hive Join with CTE)
with mytable as (
select '123000' as input
union all
select '120000' as input
union all
select '123400' as input
union all
select '101010' as input
)
select input,lpad(concat(splitted[0], translate(splitted[1],'0','9')),6,0) as output
from (
select input, split(regexp_replace(input,'(\\d{3,}?)(0+)$','$1|$2'),'\\|') splitted from mytable )s;
but In my actual query which is more than 500+ lines,it is very difficult to adjust this logic (with CTE) for the sigle column. so wondering if is there any way to achieve the same using only lpad/rpad and substring/length and can achieve by adding the functions without using CTE queries.
so let say if length of digits before trailing 0's is less than 6 then can skip the substring
from (input,1,6) and will replace the remaining 0's and if length of digits before trailing 0's is 6
or more then 6 then just keep digits as it is and replace remaining trailing 0's by 9.
Kindly Suggest.
My Actual Query Looks like.
with mytable as
(
select lpad(input,13,9) as output from mytable where code='00'
union
select output from mytable where code='01'
)
select t1.*,m1.output from table1 t1 , mytable m1 where
(t1.card='00' and substr(t1.low,1,13)<=m1.low and m1.output <= substr(t1.output,1,13) and m1.card='00' )
or
(t1.card='01' and substr(t1.low,1,16)<=m1.low and m1.output <= substr(t1.output,1,16) and m1.card='01' )
I want to Replace above logic for 2nd output where code=01 in union query.

Instead of UNION you can do the same using single query:
Instead of this
select lpad(output,13,9) as output from mytable where code='00'
union
select output from mytable where code='01'
use this
select distinct
case code when '00' then lpad(output,13,9)
when '01' then output
end output
from mytable
If you need to filter codes, add where code in ('00','01')

It Worked Now.I modified My query as below.
with mytable as
(
select lpad(input,13,9) as output from mytable where code='00'
union
select lpad(concat(split(regexp_replace('(\\d{6,}?)(0+)$','$1|$2'),'\\|') [0],
translate(split(regexp_replace(input,'(\\d{6,}?)(0+)$','$1|$2'),'\\|')[1],'0','9')),16,0 )
output from mytable where code='01'
)
select t1.*,m1.output from table1 t1 , mytable m1 where
(t1.card='00' and substr(t1.low,1,13)<=m1.low
and m1.output <= substr(t1.output,1,13) and m1.card='00' )
or
(t1.card='01' and substr(t1.low,1,16)<=m1.low
and output <= substr(output,1,16) and m1.card='01')

How can I count repeated values in the string in BigQuery?

Example:
I have the following string:
201904,BLANK,201902,BLANK,BLANK,201811,201810,201809
How can I count the number of repeated values "BLANK" that goes one by one?
In the described example the answer is 2, but what is the query?
Thanks for your help in advance!

Below is for BigQuery Standard SQL (with quick simplified example)
Corrected Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(ARRAY_LENGTH(SPLIT(list))) - 1
FROM UNNEST(REGEXP_EXTRACT_ALL(value || ',', r'(?:BLANK,){1,}')) list
) max_repeated_count
FROM `project.dataset.table`
The idea here is
extract all instances of consecutive BLANK
split each such instances to array of elements of BLANK
and finally get max length of those arrays as a result
Just something came as quick approach
Refactored Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(LENGTH(element) - 1)
FROM UNNEST(REGEXP_EXTRACT_ALL(REPLACE(value || ',', 'BLANK', ''), r',+')) element
) max_repeated_count
FROM `project.dataset.table`
Both with output
Row value max_repeated_count
1 201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK 3
2 201904,BLANK,201902,BLANK,BLANK,BLANK,201811 3
3 201904,BLANK,201902,BLANK,201811,201902,BLANK,201811 1
Refactored version is slightly different (but main idea the same)
it removes all BLANKS (assuming BLANK cannot be part of other element - if it can - code can easily be adjusted)
then extract all consecutive entries of commas into array
calculates max length of such sequences of commas

Maybe I misunderstood, but can't you simply split by the value you're looking for and subtract 2 (1 for the first element and 1 for counting elements after splitting):
declare t DEFAULT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809';
SELECT
t as theString,
split(t,'BLANK') as theSplittedString,
array_length(split(t,'BLANK'))-2 as theAmount
n>0 - amount of repetition,
0 - no repetition,
-1 - element not found

SQL Summing digits of a number

i'm using presto. I have an ID field which is numeric. I want a column that adds up the digits within the id. So if ID=1234, I want a column that outputs 10 i.e 1+2+3+4.
I could use substring to extract each digit and sum it but is there a function I can use or simpler way?

You can combine regexp_extract_all from #akuhn's answer with lambda support recently added to Presto. That way you don't need to unnest. The code would be really self explanatory if not the need for cast to and from varchar:
presto> select
reduce(
regexp_extract_all(cast(x as varchar), '\d'), -- split into digits array
0, -- initial reduction element
(s, x) -> s + cast(x as integer), -- reduction function
s -> s -- finalization
) sum_of_digits
from (values 1234) t(x);
sum_of_digits
---------------
10
(1 row)

If I'm reading your question correctly you want to avoid having to hardcode a substring grab for each numeral in the ID, like substring (ID,1,1) + substring (ID,2,1) + ...substring (ID,n,1). Which is inelegant and only works if all your ID values are the same length anyway.
What you can do instead is use a recursive CTE. Doing it this way works for ID fields with variable value lengths too.
Disclaimer: This does still technically use substring, but it does not do the clumsy hardcode grab
WITH recur (ID, place, ID_sum)
AS
(
SELECT ID, 1 , CAST(substring(CAST(ID as varchar),1,1) as int)
FROM SO_rbase
UNION ALL
SELECT ID, place + 1, ID_sum + substring(CAST(ID as varchar),place+1,1)
FROM recur
WHERE len(ID) >= place + 1
)
SELECT ID, max(ID_SUM) as ID_sum
FROM recur
GROUP BY ID

First use REGEXP_EXTRACT_ALL to split the string. Then use CROSS JOIN UNNEST GROUP BY to group the extracted digits by their number and sum over them.
Here,
WITH my_table AS (SELECT * FROM (VALUES ('12345'), ('42'), ('789')) AS a (num))
SELECT
num,
SUM(CAST(digit AS BIGINT))
FROM
my_table
CROSS JOIN
UNNEST(REGEXP_EXTRACT_ALL(num,'\d')) AS b (digit)
GROUP BY
num
;

pgsql parse string to get a string after certain position

I have a table column that has data like
NA_PTR_51000_LAT_CO-BOGOTA_S_A
NA_PTR_51000_LAT_COL_M_A
NA_PTR_51000_LAT_COL_S_A
NA_PTR_51000_LAT_COL_S_B
NA_PTR_51000_LAT_MX-MC_L_A
NA_PTR_51000_LAT_MX-MTY_M_A
I want to parse each column value so that I get the values in column_B. Thank you.
COLUMN_A COLUMN_B
NA_PTR_51000_LAT_CO-BOGOTA_S_A CO-BOGOTA
NA_PTR_51000_LAT_COL_M_A COL
NA_PTR_51000_LAT_COL_S_A COL
NA_PTR_51000_LAT_COL_S_B COL
NA_PTR_51000_LAT_MX-MC_L_A MX-MC
NA_PTR_51000_LAT_MX-MTY_M_A MX-MTY

I'm not sure of the Postgresql and I can't get SQL fiddle to accept the schema build...
substring and length may vary...
Select Column_A, substr(columN_A,18,length(columN_A)-17-4) from tableName
Ok how about this then:
http://sqlfiddle.com/#!15/ad0dd/56/0
Select column_A, b
from (
Select Column_A, b, row_number() OVER (ORDER BY column_A) AS k
FROM (
SELECT Column_A
, regexp_split_to_table(Column_A, '_') b
FROM test
) I
) X
Where k%7=5
Inside out:
Inner most select simply splits the data into multiple rows on _
middle select adds a row number so that we can use the use the mod operator to find all occurances of a 5th remainder.
This ASSUMES that the section of data you're after is always the 5th segment AND that there are always 7 segments...

Use regexp_matches() with a search pattern like 'NA_PTR_51000_LAT_(.+)_'
This should return everything after NA_PTR_51000_LAT_ before the next underscore, which would match the pattern you are looking for.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Get group maxima from combined strings - sql

Use the LEFT and RIGHT functions. SELECT MAX(RIGHT(code,6)) AS MAX_CODE FROM yourtable GROUP BY LEFT(code,12)

Related

ORACLE TO_CHAR SPECIFY OUTPUT DATA TYPE

Replace Digits Using SubString and lpad/rpad

How can I count repeated values in the string in BigQuery?

SQL Summing digits of a number

pgsql parse string to get a string after certain position

Categories

Resources