Bigquery merge and de-duplicate column of comma separated numbers - google-bigquery

I have a column with comma separated numbers. I want to merge and de-duplicate the values so each value only appears once.
numbers
123, 983
983, 482
The desired output from above would be 123, 983, 482. Whats the best way to do this in bigquery?

You can consider below.
WITH sample_table AS (
SELECT '123, 983' numbers UNION ALL
SELECT '983, 483'
)
SELECT STRING_AGG(DISTINCT number, ', ') merged
FROM sample_table, UNNEST(SPLIT(numbers, ', ')) number
Query results
STRING_AGG()

Related

create new columns from xml value in hive

I have a column desc_txt in my table and its contents are quite similar to that of xml like shown below-
desc_txt
-----------
<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>
Requirement is to have a new table/view created from this table having additional columns like Criticality, Country, City along with the column values like High, India, Indore, respectively.
How can this be achieved in Hive/Impala?
This can be done in two steps. I assumed you have only four columns to pull.
Load the data as is in a table. Put everything in a column.
Then use this below SQL to split the data multiple columns. I assumed 4 columns, you can increase as per your requirement.
with t as (
SELECT rtrim(ltrim(
regexp_replace( replace( trim(
regexp_replace(
regexp_replace("<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>","</?[^>]*>",",")
,',,',',') ), ' ,', ',' ), '(,){2,}', ','),','),',')
str)
select split_part(str, ',', 1) as first_col,
split_part(str, ',', 2) as second_col,
split_part(str, ',', 3) as third_col,
split_part(str, ',', 4) as fourth_col
from t
The query is tricky - first it replaces all tags with comma in them, then it replaces multiple commas with single comma, then it removes comma from start and end of the string. split function then splits whole string based on comma and create individual columns.
HTH...

Split string into words using Postgres

I am looking for some help in separating scientific names in my data. I want to take only the genus names and group them, but they are both connected in the same column. I saw the SQL Sever had a CHARINDEX command, but PostgreSQL does not. Does there need to be a function created for this? If so, how would it look?
I want to change 'Mallotus philippensis' to just 'Mallotus' or to just 'philippensis'
I am currently using Postgres 11, 12.
Use SPLIT_PART:
WITH yourTable AS (
SELECT 'Mallotus philippensis'::text AS genus
)
SELECT
SPLIT_PART(genus, ' ', 1) AS genus,
SPLIT_PART(genus, ' ', 2) AS species
FROM yourTable;
Demo
Probably string_to_array will be slightly more efficient than split_part here because string splitting will be done only once for each row.
SELECT
val_arr[1] AS genus,
val_arr[2] AS species
FROM (
SELECT string_to_array(val, ' ') as val_arr
FROM (
VALUES
('aaa bbb'),
('cc dddd'),
('e fffff')
) t (val)
) tt;

USING SQL . extract numbers comma separated from string 'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'

'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'
'HEADER|N156|E1|N7|E122|N4|E5'
'HEADER|E0|E1|E2|E3|E4|E5'
'HEADER|N0|N1|N2|N3|N4|N5'
'HEADER|N125'
How to extract the numbers in comma-separated format from this stringS?
Expected result:
1000,1001,1002,1003,1004,1005
How to extract the numbers with N or E as suffix/prefix ie.
N1000
Expected result:
1000,1002,1004,1005
below regex does not return the result needed. But I want some thing like this
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?(\d+)', '\1,'), ',?\.*$', '') from dual
the problem here is
when i want numbers with E OR N
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?N(\d+)', '\1,'), ',?\.*$', '') from dual
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?E(\d+)', '\1,'), ',?\.*$', '') from dual
they give good results for this scenerio
but when i input 'HEADER|N1000|E1001' it gives wrong answer plzzz verify and correct it
Update
Based on the changes to the question, the original answer is not valid. Instead, the solution is considerably more complex, using a hierarchical query to extract all the numbers from the string and then LISTAGG to put back together a list of numbers extracted from each string. To extract all numbers we use this query:
WITH cte AS (
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, '[NE]\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, '[NE]\d+', 1, level) IS NOT NULL
)
SELECT data, LISTAGG(SUBSTR(num, 2), ',') WITHIN GROUP (ORDER BY l) AS "All numbers"
FROM cte
GROUP BY data
Output (for the new sample data):
DATA All numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5 0,1,2,3,4,5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1000,1001,1002,1003,1004,1005
HEADER|N125 125
HEADER|N156|E1|N7|E122|N4|E5 156,1,7,122,4,5
To select only numbers beginning with E, we modify the query to replace the [EN] in the REGEXP_SUBSTR expressions with just E i.e.
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, 'E\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, 'E\d+', 1, level) IS NOT NULL
Output:
DATA E-numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1001,1003
HEADER|N125
HEADER|N156|E1|N7|E122|N4|E5 1,122,5
A similar change can be made to extract numbers commencing with N.
Demo on dbfiddle
Original Answer
One way to achieve your desired result is to replace a string of characters leading up to a number with that number and a comma, and then replace any characters from the last ,| to the end of string from the result:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1001,1002,1003,1004,1005
To only output the numbers beginning with N, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?N(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1002,1004,1005
To only output the numbers beginning with E, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?E(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1001,1003
Demo on dbfiddle
I don't know what DBMS you are using, but here's one way to do it in Postgres:
WITH cte AS (
SELECT CAST('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|' AS VARCHAR(1000)) AS myValue
)
SELECT SUBSTRING(MyVal FROM 2)
FROM (
SELECT REGEXP_SPLIT_TO_TABLE(myValue,'\|') MyVal
FROM cte
) src
WHERE SUBSTRING(MyVal FROM 1 FOR 1) = 'N'
;
SQL Fiddle
As Far as I have understood the question , you want to extract substrings starting with N from the string, You can try following (And then you can merge the output seperated by commas if needed)
select REPLACE(value, 'N', '')
from STRING_SPLIT('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '|')
where value like 'N%'
OutPut :
1000
1002
1004
1005

Get group maxima from combined strings

I have a table with a column code containing multiple pieces of data like this:
001/2017/TT/000001
001/2017/TT/000002
001/2017/TN/000003
001/2017/TN/000001
001/2017/TN/000002
001/2016/TT/000001
001/2016/TT/000002
001/2016/TT/000001
002/2016/TT/000002
There are 4 items in 001/2016/TT/000001: 001, 2016, TT and 000001.
How can I extract the max for every group formed by the first 3 items? The result I want is this:
001/2017/TT/000003
001/2017/TN/000002
001/2016/TT/000002
002/2016/TT/000002
Edit
The subfield separator is /, and the length of subfields can vary.
I use PostgreSQL 9.3.
Obviously, you should normalize the table and split the combined string into 4 columns with proper data type. The function split_part() is the tool of choice if the separator '/' is constant in your string and the length of can vary.
CREATE TABLE tbl_better AS
SELECT split_part(code, '/', 1)::int AS col_1 -- better names?
, split_part(code, '/', 2)::int AS col_2
, split_part(code, '/', 3) AS col_3 -- text?
, split_part(code, '/', 4)::int AS col_4
FROM tbl_bad
ORDER BY 1,2,3,4 -- optionally cluster data.
Then the task is trivial:
SELECT col_1, col_2, col_3, max(col_4) AS max_nr
FROM tbl_better
GROUP BY 1, 2, 3;
Related:
Split comma separated column data into additional columns
Of course, you can do it on the fly, too. For varying subfield length you could use substring() with a regular expression like this:
SELECT max(substring(code, '([^/]*)$')) AS max_nr
FROM tbl_bad
GROUP BY substring(code, '^(.*)/');
Related (with basic explanation for regexp pattern):
Filter strings with regex before casting to numeric
Or to get only the complete string as result:
SELECT DISTINCT ON (substring(code, '^(.*)/'))
code
FROM tbl_bad
ORDER BY substring(code, '^(.*)/'), code DESC;
About DISTINCT ON:
Select first row in each GROUP BY group?
Be aware that data items cast to a suitable type may behave differently from their string representation. The max of 900001 and 1000001 is 900001 for text and 1000001 for integer ...
Use the LEFT and RIGHT functions.
SELECT MAX(RIGHT(code,6)) AS MAX_CODE
FROM yourtable
GROUP BY LEFT(code,12)
check this out, possible helpfull
select
distinct on (tab[4],tab[2]) tab[4],tab[3],tab[2],tab[1]
from
(
select
string_to_array(exe.x,'/') as tab,
exe.x
from
(
select
unnest
(
array
['001/2017/TT/000001',
'001/2017/TT/000002',
'001/2017/TN/000003',
'001/2017/TN/000001',
'001/2017/TN/000002',
'001/2016/TT/000001',
'001/2016/TT/000002',
'001/2016/TT/000001',
'002/2016/TT/000002']
) as x
) exe
) exe2
order by tab[4] desc,tab[2] desc,tab[3] desc;

How to count number commas present in a CSV file

i have one csv file (ABC.txt) which has data as below :
1234,"djjdjd",45566,84774,45666,"djdjd"
i want to count number of commas present in a row of this CSV file.
how can i get this .
REGEXP_COUNT() can do this easily:
with tbl(data_string) as
(
select '1234,"djjdjd",45566,84774,45666,"djdjd"' from dual
)
select regexp_count(data_string, ',') from tbl;
For a pure (oracle) sql solution, try
SELECT NVL(LENGTH(REGEXP_REPLACE(csv_row, '[^,]', '')), 0) FROM csv_table
assuming that you have the data from your csv file stored in the database.
The query replaces all charcters but commas in the original string, so determining the length becomes equivalent to counting.
Note that you might want to treat commas inside csv fields differently.
Alternative ( see Alex K.'s comment )
SELECT LENGTH(csv_row) - NVL(LENGTH(REPLACE(csv_row, ',', '')), 0) FROM csv_table