How can I merge 2 partially overlapping strings using Apache Hive?

How can I merge 2 partially overlapping strings using Apache Hive? - sql

I have a field which holds a short list of ids of a fixed length.
e.g. aab:aac:ada:afg
The field is intended to hold at most 5 ids, growing gradually. I update it by adding from a similarly constructed field that may partially overlap with my existing set, e.g. ada:afg:fda:kfc.
The field expans when joined to an "update" table, as in the following example.
Here, id_list is the aforementioned list I want to "merge", and table_update is a table with new values I want to "merge" into table1.
insert overwrite table table1
select
id,
field1,
field2,
case
when (some condition) then a.id_list
else merge(a.id_list, b.id_list)
end as id_list
from table1 a
left join
table_update b
on a.id = b.id;
I'd like to produce a combined field with the following value:
aab:aac:ada:afg:fda.
The challenge is that I don't know whether or how much overlap the strings have until execution, and I cannot run any external code, or create UDFs.
Any suggestions how I could approach this?

Split to get arrays, explode them, select existing union all new, aggregate using collect_set, it will produce unique array, concatenate array into string using concat_ws(). Not tested:
select concat_ws(':',collect_set(id))
from
(
select explode(split('aab:aac:ada:afg',':')) as id --existing
union all
select explode(split('ada:afg:fda:kfc',':')) as id --new
);
You can use UNION instead UNION ALL to get distinct values before aggregating into array. Or you can join new and existing and concatenate strings into one, then do the same:
select concat_ws(':',collect_set(id))
from
(
select explode(split(concat('aab:aac:ada:afg',':','ada:afg:fda:kfc'),':')) as id --existing+new
);
Most probably you will need to use lateral view with explode in the real query. See this answer about lateral view usage
Update:
insert overwrite table table1
select concat_ws(':',collect_set(a.idl)) as id_list,
id,
field1,
field2
from
(
select
id,
field1,
field2,
split(
case
when (some condition) then a.id_list
when b.id_list is null then a.id_list
else concat(a.id_list,':',b.id_list)
end,':') as id_list_array
from table1 a
left join table_update b on a.id = b.id
)s
LATERAL VIEW OUTER explode(id_list_array ) a AS idl
group by
id,
field1,
field2
;

Related

Query not reading the quoted string values stored in the table

I have stored some quoted values in a separate table and based on the value in this table. I am trying to filter the rows in another table
by using the values in this table in a subquery. But it is not reading the values for the subquery and returns a blank table in output.
The value is in column override and resolves to 'HCC11','HCC12'.
When I just copy the value from the column and paste it in place of the subquery it is fetching the data correctly. I am not able to understand the issue here. I have tried using the trim() function here but still its not working
Note-: I have attached the pic for your reference:
select *
from table1
where column1 in (select override from table 2 )

Storing comma separated values in a single column is a really poor database to begin with enclosing them in quotes makes things even wors. The proper solution to your problem is a better design.
However, if you are forced to work with that bad design, you can convert them to a proper list of values using
select *
from table1
where column1 in (select trim(both '''' from w.word)
from table2 t2
cross join unnest(string_to_array(t2.override, ',')) as w(word)
This assumes that table1.column1 only contains a single value without any quotes and that the override values never contain a comma in the real value (e.g. the above would break on a value like 'A,B', 'C')

You have the override column value as 'HCC11','HCC12' which can not match with single value 'HCC11'. You should better use the LIKE operator as follows:
select * from table1 t1
where exists
(select 1 from table2 t2
where t2.override like concat('%''', t1.column1, '''%'));

According to your image, the value of table1.column1 has to be 'HCC11','HCC12' (one string) to get the match from subquery.
If the table1 has 2 rows with values HCC11 and HCC12 then you might use the exists keyword in your subquery.
Something like
select *
from table1 t1
where exists
(select 1
from table2 t2
where instr( t2.override, concat("'",t1.column1,"'") ) >=1
);

You can do this like -
1.
select * from table1
where column1 in
(select regexp_replace(unnest(string_to_array(override, ',')),'''', '', 'g') from table2)
Or
2.
select * from table1
where '''' || column1 || '''' in
(select unnest(string_to_array(override, ',')) from table2)
Although, I would just recommend not storing your data like this, since you want to query using it.

union unusual behavior

Trying to union two tables with the same field into one master table but for some reason im getting a weird result.
select count(*)
from staging.sandoval_parcels
where parcel_id = 0;
returns 0
select count(*)
from staging.bernalillo_parcels
where parcel_id = 0;
returns 0
but when i merge the tables using
CREATE TABLE staging.master_parcels
AS
SELECT * FROM bernalillo_parcels
UNION ALL
SELECT * FROM sandoval_parcels
;
then
select count(*)
from staging.master_parcels
where parcel_id = 0;
returns 85553
both tables have the same fields and the fields are the same data type,also, no of the values for any field are missing, thus no nulls, why am i getting ids = 0 when either of the table have parcel_ids = 0?

The order of the fields matter, replace the * for the explicit name, other wise the second query field will be inserted on the first query position. But not necessarily on the same field you want.
CREATE TABLE staging.master_parcels
AS
SELECT parcel_id, field1 ... FROM bernalillo_parcels
UNION ALL
SELECT parcel_id, field1 ... FROM sandoval_parcels
;

Union will merge tables even if the column order is not the same. If all of the columns match and are in the same order, it will union distinct values and not create duplicates if the rows are the same for each table. Having the order and data type be the same is important.

I am trying to return a certain values in each row which depend on whether different values in that row are already in a different table

I'm still a n00b at SQL and am running into a snag. What I have is an initial selection of certain IDs into a temp table based upon certain conditions:
SELECT DISTINCT ID
INTO #TEMPTABLE
FROM ICC
WHERE ICC_Code = 1 AND ICC_State = 'CA'
Later in the query I SELECT a different and much longer listing of IDs along with other data from other tables. That SELECT is about 20 columns wide and is my result set. What I would like to be able to do is add an extra column to that result set with each value of that column either TRUE or FALSE. If the ID in the row is in #TEMPTABLE the value of the additional column should read TRUE. If not, FALSE. This way the added column will ready TRUE or FALSE on each row, depending on if the ID in each row is in #TEMPTABLE.
The second SELECT would be something like:
SELECT ID,
ColumnA,
ColumnB,
...
NEWCOLUMN
FROM ...
NEWCOLUMN's value for each row would depend on whether the ID in that row returned is in #TEMPTABLE.
Does anyone have any advice here?
Thank you,
Matt

If you left join to the #TEMPTABLE you'll get a NULL where the ID's don't exist
SELECT ID,
ColumnA,
ColumnB,
...
T.ID IS NOT NULL AS NEWCOLUMN -- Gives 1 or 0 or True/false as a bit
FROM ... X
LEFT JOIN #TEMPTABLE T
ON T.ID = X.ID -- DEFINE how the two rows can be related unquiley

You need to LEFT JOIN your results query to #TEMPTABLE ON ID, this will give you the ID if there is one and NULL if there isn't, if you want 1 or 0 this would do it (For SQL Server) ISNULL(#TEMPTABLE.ID,0)<>0.
A few notes on coding for performance:
By definition an ID column is unique so the DISTINCT is redundant and causes unnecisary processing (unless it is an ID from another table)
Why would you store this to a temporary table rather than just using it in the query directly?

You could use a union and a subquery.
Select . . . . , 'TRUE'
From . . .
Where ID in
(Select id FROM #temptable)
UNION
SELECT . . . , 'FALSE'
FROM . . .
WHERE ID NOT in
(Select id FROM #temptable)
So the top part, SELECT ... FROM ... WHERE ID IN (Subquery), does a SELECT if the ID is in your temptable.
The bottom part does a SELECT if the ID is not in the temptable.
The UNION operator joins the two results nicely, since both SELECT statements will return the same number of columns.

To expand on what someone else was saying with Union, just do something like so
SELECT id, TRUE AS myColumn FROM `table1`
UNION
SELECT id, FALSE AS myColumn FROM `table2`

Display multiple queries with different row types as one result

In PostgreSQL 8.3 on Ubuntu, I do have 3 tables, say T1, T2, T3, of different schemas.
Each of them contains (a few) records related to the object of the ID I know.
Using 'psql', I frequently do the 3 operations:
SELECT field-set1 FROM T1 WHERE ID='abc';
SELECT field-set2 FROM T2 WHERE ID='abc';
SELECT field-set3 FROM T3 WHERE ID='abc';
and just watch the results; for me it is enough to see.
Is it possible to have a procedure/function/macro etc, with one parameter 'id',
just running the three SELECTS one after another,
displaying results on the screen ?
field-set1, field-set2 and field-set 3 are completely different.
There is no reasonable way to JOIN the tables T1, T2, T3; these are unrelated data.
I do not want JOIN.
I want to see the three resulting sets on the screen.
Any hint?

Quick and dirty method
If the row types (data types of all columns in sequence) don't match, UNION will fail.
However, in PostgreSQL you can cast a whole row to its text representation:
SELECT t1:text AS whole_row_in_text_representation FROM t1 WHERE id = 'abc'
UNION ALL
SELECT t2::text FROM t2 WHERE id = 'abc'
UNION ALL
SELECT t3::text FROM t3 WHERE id = 'abc';
Only one ; at the end, and the one is optional with a single statement.
A more refined alternative
But also needs a lot more code. Pick the table with the most columns first, cast every individual column to text and give it a generic name. Add NULL values for the other tables with fewer columns. You can even insert headers between the tables:
SELECT '-t1-'::text AS c1, '---'::text AS c2, '---'::text AS c1 -- table t1
UNION ALL
SELECT '-col1-'::text, '-col2-'::text, '-col3-'::text -- 3 columns
UNION ALL
SELECT col1::text, col2::text, col3::text FROM t1 WHERE id = 'abc'
UNION ALL
SELECT '-t2-'::text, '---'::text, '---'::text -- table t2
UNION ALL
SELECT '-col_a-'::text, '-col_b-'::text, NULL::text -- 2 columns, 1 NULL
UNION ALL
SELECT col_a::text, col_b::text, NULL::text FROM t2 WHERE id = 'abc'
...

put a union all in between and name all columns equal
SELECT field-set1 as fieldset FROM T1 WHERE ID='abc';
union all
SELECT field-set2 as fieldset FROM T2 WHERE ID='abc';
union all
SELECT field-set3 as fieldset FROM T3 WHERE ID='abc';
and execute it at once.

PostgreSQL: Select a single-row x amount of times

A single row in a table has a column with an integer value >= 1 and must be selected however many times the column says. So if the column had '2', I'd like the select query to return the single-row 2 times.
How can this be accomplished?

Don't know why you would want to do such a thing, but...
CREATE TABLE testy (a int,b text);
INSERT INTO testy VALUES (3,'test');
SELECT testy.*,generate_series(1,a) from testy; --returns 3 rows

You could make a table that is just full of numbers, like this:
CREATE TABLE numbers
(
num INT NOT NULL
, CONSTRAINT numbers_pk PRIMARY KEY (num)
);
and populate it with as many numbers as you need, starting from one:
INSERT INTO numbers VALUES(1);
INSERT INTO numbers VALUES(2);
INSERT INTO numbers VALUES(3);
...
Then, if you had the table "mydata" that han to repeat based on the column "repeat_count" you would query it like so:
SELECT mydata.*
FROM mydata
JOIN numbers
ON numbers.num <= mydata.repeat_count
WHERE ...
If course you need to know the maximum repeat count up front, and have your numbers table go that high.
No idea why you would want to do this thought. Care to share?

You can do it with a recursive query, check out the examples in
the postgresql docs.
something like
WITH RECURSIVE t(cnt, id, field2, field3) AS (
SELECT 1, id, field2, field3
FROM foo
UNION ALL
SELECT t.cnt+1, t.id, t.field2, t.field3
FROM t, foo f
WHERE t.id = f.id and t.cnt < f.repeat_cnt
)
SELECT id, field2, field3 FROM t;

The simplest way is making a simple select, like this:
SELECT generate_series(1,{xTimes}), a.field1, a.field2 FROM my_table a;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How can I merge 2 partially overlapping strings using Apache Hive? - sql

Related

Query not reading the quoted string values stored in the table

union unusual behavior

I am trying to return a certain values in each row which depend on whether different values in that row are already in a different table

Display multiple queries with different row types as one result

PostgreSQL: Select a single-row x amount of times

Categories

Resources