How to remove duplicate values in a cell of a Hive table - sql

I have a column in my Hive SQL table where the values in each cell are separated by commas (,). Some values in these strings are duplicated, and I want to remove them. Here is an example of my data:
test, test1, test,test1
rest,rest1,rest1,rest
chest,nest,lest,gest
The result should have the duplicates removed:
test,test1
rest,rest1
chest,nest,lest,gest
Could anyone help me with this issue?
Thank you

Solution for Hive: split to get an array, explode it, use collect_set to get an array without duplicates, then concatenate the array back together using concat_ws.
Demo (Hive):
with your_table as (
    select stack(3,
        1, 'test, test1, test,test1',
        2, 'rest,rest1,rest1,rest',
        3, 'chest,nest,lest,gest'
    ) as (id, colname)
)
select t.id, t.colname,
       concat_ws(',', collect_set(trim(e.elem))) as result
from your_table t
lateral view outer explode(split(colname, ',')) e as elem
group by t.id, t.colname
trim() is used to remove the spaces present in your data example.
Result:
t.id t.colname result
1 test, test1, test,test1 test,test1
2 rest,rest1,rest1,rest rest,rest1
3 chest,nest,lest,gest chest,nest,lest,gest
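For comparison, the same logic (split, trim, keep distinct values, re-join) can be sketched in plain Python; note that, unlike this sketch, collect_set() does not guarantee the original element order:

```python
def dedupe_csv(value):
    # Split on commas, trim whitespace (like trim() in the query),
    # and keep the first occurrence of each element.
    seen = []
    for elem in value.split(','):
        elem = elem.strip()
        if elem not in seen:
            seen.append(elem)
    return ','.join(seen)

print(dedupe_csv('test, test1, test,test1'))  # test,test1
```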

Related

How to remove the elements with value as zero in hive array

I have an array column in Hive that contains 7 numbers.
For example: [32,4,0,43,23,0,1]
I want my output to be [32,4,43,23,1] (with all the zero elements removed).
Could someone help me accomplish this?
Explode array, filter, collect again.
Demo:
with mydata as (
    select array(32, 4, 0, 43, 23, 0, 1) as initial_array
)
select initial_array, collect_set(element) as result_array
from (
    select initial_array, e.element
    from mydata
    lateral view outer explode(initial_array) e as element
) s
where element != 0
group by initial_array
Result:
initial_array result_array
[32,4,0,43,23,0,1] [32,4,43,23,1]
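The explode/filter/collect pipeline amounts to a simple filter. A plain Python sketch for illustration (note that collect_set() would additionally drop duplicate elements, which this sketch keeps):

```python
def remove_zeros(arr):
    # Equivalent of explode -> filter element != 0 -> collect:
    return [x for x in arr if x != 0]

print(remove_zeros([32, 4, 0, 43, 23, 0, 1]))  # [32, 4, 43, 23, 1]
```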

SQL Array with Null

I'm trying to group BigQuery columns using an array like so:
with test as (
select 1 as A, 2 as B
union all
select 3, null
)
select *,
[A,B] as grouped_columns
from test
However, this won't work, since there is a null value in column B row 2.
In fact this won't work either:
select [1, null] as test_array
The BigQuery documentation, though, says NULLs should be allowed:
In BigQuery, an array is an ordered list consisting of zero or more
values of the same data type. You can construct arrays of simple data
types, such as INT64, and complex data types, such as STRUCTs. The
current exception to this is the ARRAY data type: arrays of arrays are
not supported. Arrays can include NULL values.
There don't seem to be any attributes or SAFE prefix that can be used with ARRAY() to handle nulls.
So what is the best approach for this?
Per the documentation for the ARRAY type:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
BigQuery translates NULL ARRAY into empty ARRAY in the query result, although inside the query NULL and empty ARRAYs are two distinct values.
So, for your example, you can use the "trick" below:
with test as (
select 1 as A, 2 as B union all
select 3, null
)
select *,
array(select cast(el as int64) el
from unnest(split(translate(format('%t', t), '()', ''), ', ')) el
where el != 'NULL'
) as grouped_columns
from test t
The above gives the desired output, with the NULL elements removed.
Note: the above approach does not require explicitly referencing all the involved columns!
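The effect of the trick, dropping NULL elements before the array is built, can be sketched in plain Python for illustration:

```python
def grouped_columns(row):
    # Build the array from a row's values, skipping NULLs (None),
    # mirroring the WHERE el != 'NULL' filter in the query above.
    return [value for value in row if value is not None]

print(grouped_columns((1, 2)))     # [1, 2]
print(grouped_columns((3, None)))  # [3]
```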
My current solution, and I'm not a fan of it, is to use a combination of IFNULL(), UNNEST() and ARRAY() like so:
select
    *,
    array(
        select *
        from unnest([
            ifnull(A, ''),
            ifnull(B, '')
        ]) as grouping
        where grouping <> ''
    ) as grouped_columns
from test
Alternatively, you can replace the NULL value with some non-NULL value using the IFNULL() function, as given below:
with test as (
select 1 as A, 2 as B
union all
select 3, IFNULL(null, 0)
)
select *,
[A,B] as grouped_columns
from test

Select distinct values of comma-separated values column excluding subsets in PostgreSQL

Assume a table foo with a column bar that carries comma-separated values:
('a,b',
'a,b,c',
'a,b,c,d',
'd,e')
How can I select the largest combination and exclude all the subsets included in that combination (the largest one)?
Example on the above data-set. The result should be:
('a,b,c,d', 'd,e'); the first two entries ('a,b', 'a,b,c') are excluded, as they are subsets of ('a,b,c,d').
Take into consideration that all the values in each comma-separated string are sorted alphabetically.
I tried the below query, but the results seem a little far away from what I need:
select distinct a.bar
from foo a
inner join foo b
    on a.bar like '%' || b.bar || '%'
    and a.bar != b.bar
You can use string_to_array() to split the strings into arrays. With the contains operator, @>, you can check whether one array contains another. (See "9.18. Array Functions and Operators".)
Use that in a NOT EXISTS clause. fi.ctid <> fo.ctid is there to make sure the physical addresses of the compared pair of rows are not equal, as of course the array of any row would contain the array of that same row.
SELECT fo.bar
FROM foo fo
WHERE NOT EXISTS (SELECT *
                  FROM foo fi
                  WHERE fi.ctid <> fo.ctid
                    AND string_to_array(fi.bar, ',') @> string_to_array(fo.bar, ','));
SQL Fiddle
But: Don't use comma-separated strings in a relational database. You've got something way better. It's called "table".
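The subset test itself is simple to state. A plain Python sketch of the logic for illustration (this uses a strict-subset check, so two different rows with identical sets would both be kept, unlike the ctid-based SQL above):

```python
def largest_combinations(rows):
    # Keep a row only if its value set is not a strict subset of any
    # other row's set (mirrors the NOT EXISTS / @> check above).
    sets = [set(row.split(',')) for row in rows]
    return [rows[i] for i, s in enumerate(sets)
            if not any(i != j and s < other
                       for j, other in enumerate(sets))]

print(largest_combinations(['a,b', 'a,b,c', 'a,b,c,d', 'd,e']))
# ['a,b,c,d', 'd,e']
```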
First process the string into sets of characters, and then cross join the character-sets with itself, excluding rows where the character-sets on both sides are the same.
Next, aggregate and use BOOL_OR in a HAVING clause to filter out any character-set that is a subset of any other character-set.
With a sample table declared in the CTE, the query becomes:
WITH foo(bar) AS (SELECT '("a,b" , "a,b,c" , "a,b,c,d" , "d,e")'::TEXT)
SELECT bar, string_to_array(elems[1], ',') AS not_subset
FROM foo
CROSS JOIN regexp_matches(bar, '[\w|,]+', 'g') elems
CROSS JOIN regexp_matches(bar, '[\w|,]+', 'g') elems2
WHERE elems2[1] != elems[1]
  -- the regex also matches the ',' between sets, which needs to be ignored;
  -- alternatively, the regex could be refined
  AND elems2[1] != ','
  AND elems[1] != ','
GROUP BY 1, 2
HAVING NOT BOOL_OR(string_to_array(elems[1], ',') <@ string_to_array(elems2[1], ','))
produces the output
bar not_subset
'("a,b" , "a,b,c" , "a,b,c,d" , "d,e")' {'d','e'}
'("a,b" , "a,b,c" , "a,b,c,d" , "d,e")' {'a','b','c','d'}
Example in SQL Fiddle

Get group maxima from combined strings

I have a table with a column code containing multiple pieces of data like this:
001/2017/TT/000001
001/2017/TT/000002
001/2017/TN/000003
001/2017/TN/000001
001/2017/TN/000002
001/2016/TT/000001
001/2016/TT/000002
001/2016/TT/000001
002/2016/TT/000002
There are 4 items in 001/2016/TT/000001: 001, 2016, TT and 000001.
How can I extract the max for every group formed by the first 3 items? The result I want is this:
001/2017/TT/000003
001/2017/TN/000002
001/2016/TT/000002
002/2016/TT/000002
Edit
The subfield separator is /, and the length of subfields can vary.
I use PostgreSQL 9.3.
Obviously, you should normalize the table and split the combined string into 4 columns with proper data types. The function split_part() is the tool of choice if the separator '/' is constant in your strings and the length of the subfields can vary.
CREATE TABLE tbl_better AS
SELECT split_part(code, '/', 1)::int AS col_1 -- better names?
, split_part(code, '/', 2)::int AS col_2
, split_part(code, '/', 3) AS col_3 -- text?
, split_part(code, '/', 4)::int AS col_4
FROM tbl_bad
ORDER BY 1,2,3,4 -- optionally cluster data.
Then the task is trivial:
SELECT col_1, col_2, col_3, max(col_4) AS max_nr
FROM tbl_better
GROUP BY 1, 2, 3;
Related:
Split comma separated column data into additional columns
Of course, you can do it on the fly, too. For varying subfield length you could use substring() with a regular expression like this:
SELECT max(substring(code, '([^/]*)$')) AS max_nr
FROM tbl_bad
GROUP BY substring(code, '^(.*)/');
Related (with basic explanation for regexp pattern):
Filter strings with regex before casting to numeric
Or to get only the complete string as result:
SELECT DISTINCT ON (substring(code, '^(.*)/'))
code
FROM tbl_bad
ORDER BY substring(code, '^(.*)/'), code DESC;
About DISTINCT ON:
Select first row in each GROUP BY group?
Be aware that data items cast to a suitable type may behave differently from their string representation. The max of 900001 and 1000001 is 900001 for text and 1000001 for integer ...
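The on-the-fly grouping can be sketched in plain Python for illustration; like the regexp variant, it compares the last field as text, so the type caveat above applies:

```python
def group_maxima(codes):
    # Group by everything before the last '/' and keep the maximum
    # last field per group (string comparison, like DISTINCT ON above).
    best = {}
    for code in codes:
        prefix, _, last = code.rpartition('/')
        if prefix not in best or last > best[prefix]:
            best[prefix] = last
    return [prefix + '/' + nr for prefix, nr in best.items()]

for code in sorted(group_maxima(['001/2016/TT/000001',
                                 '001/2016/TT/000002',
                                 '002/2016/TT/000002'])):
    print(code)
```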
Use the LEFT and RIGHT functions.
SELECT MAX(RIGHT(code,6)) AS MAX_CODE
FROM yourtable
GROUP BY LEFT(code,12)
Check this out, possibly helpful:
select distinct on (tab[4], tab[2]) tab[4], tab[3], tab[2], tab[1]
from (
    select string_to_array(exe.x, '/') as tab, exe.x
    from (
        select unnest(array[
            '001/2017/TT/000001',
            '001/2017/TT/000002',
            '001/2017/TN/000003',
            '001/2017/TN/000001',
            '001/2017/TN/000002',
            '001/2016/TT/000001',
            '001/2016/TT/000002',
            '001/2016/TT/000001',
            '002/2016/TT/000002'
        ]) as x
    ) exe
) exe2
order by tab[4] desc, tab[2] desc, tab[3] desc;

pgsql parse string to get a string after a certain position

I have a table column that has data like
NA_PTR_51000_LAT_CO-BOGOTA_S_A
NA_PTR_51000_LAT_COL_M_A
NA_PTR_51000_LAT_COL_S_A
NA_PTR_51000_LAT_COL_S_B
NA_PTR_51000_LAT_MX-MC_L_A
NA_PTR_51000_LAT_MX-MTY_M_A
I want to parse each column value so that I get the values in column_B. Thank you.
COLUMN_A COLUMN_B
NA_PTR_51000_LAT_CO-BOGOTA_S_A CO-BOGOTA
NA_PTR_51000_LAT_COL_M_A COL
NA_PTR_51000_LAT_COL_S_A COL
NA_PTR_51000_LAT_COL_S_B COL
NA_PTR_51000_LAT_MX-MC_L_A MX-MC
NA_PTR_51000_LAT_MX-MTY_M_A MX-MTY
I'm not sure of the PostgreSQL syntax, and I can't get SQL Fiddle to accept the schema build...
The substring position and length may vary...
Select column_A, substr(column_A, 18, length(column_A) - 17 - 4) from tableName
Ok how about this then:
http://sqlfiddle.com/#!15/ad0dd/56/0
Select column_A, b
from (
Select Column_A, b, row_number() OVER (ORDER BY column_A) AS k
FROM (
SELECT Column_A
, regexp_split_to_table(Column_A, '_') b
FROM test
) I
) X
Where k%7=5
Inside out:
The innermost select simply splits the data into multiple rows on _.
The middle select adds a row number so that we can use the mod operator to find every occurrence with a remainder of 5.
This ASSUMES that the section of data you're after is always the 5th segment AND that there are always 7 segments...
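Under those same assumptions (the 5th of 7 '_'-separated segments), the extraction can be sketched in plain Python for illustration:

```python
def fifth_segment(value):
    # Split on '_' and take the 5th segment (index 4), mirroring
    # the k % 7 = 5 filter in the query above.
    parts = value.split('_')
    return parts[4]

print(fifth_segment('NA_PTR_51000_LAT_CO-BOGOTA_S_A'))  # CO-BOGOTA
```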
Use regexp_matches() with a search pattern like 'NA_PTR_51000_LAT_([^_]+)'.
This returns everything after NA_PTR_51000_LAT_ and before the next underscore, which matches the pattern you are looking for. (A greedy '(.+)_' would run past the next underscore, so the character class [^_]+ is used instead.)
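The same pattern, sketched with Python's re module for illustration (the fixed prefix and the non-greedy character class are the assumptions here):

```python
import re

def column_b(value):
    # Capture everything between the fixed prefix and the next '_';
    # [^_]+ stops at the first underscore after the prefix.
    match = re.match(r'NA_PTR_51000_LAT_([^_]+)', value)
    return match.group(1) if match else None

print(column_b('NA_PTR_51000_LAT_MX-MC_L_A'))  # MX-MC
```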