Casting a string to int, e.g. the string "res" - sql

I have a column in a table which is of type array<string>. The table has been partitioned daily since 2018-01-01. At some stage, the values in the array go from strings to integers. The data looks like this:
| yyyy_mm_dd | h_id | p_id | con |
|------------|-------|------|---------------|
| 2018-10-01 | 52988 | 1 | ["res", "av"] |
| 2018-10-02 | 52988 | 1 | ["1","2"] |
| 2018-10-03 | 52988 | 1 | ["1","2"] |
There is a mapping between the strings and integers: "res" maps to 1, "av" maps to 2, etc. I've written a query to perform some logic. Here is a snippet (subquery) of it:
SELECT
    t.yyyy_mm_dd,
    t.h_id,
    t.p_id,
    CAST(e.con AS INT) AS api
FROM
    my_table t
LATERAL VIEW EXPLODE(con) e AS con
My problem is that this doesn't work for the earlier dates, when strings were used instead of integers. Is there any way to select con and remap the strings to integers so the data is consistent across all partitions?
Expected output:
| yyyy_mm_dd | h_id | p_id | con |
|------------|-------|------|---------------|
| 2018-10-01 | 52988 | 1 | ["1","2"] |
| 2018-10-02 | 52988 | 1 | ["1","2"] |
| 2018-10-03 | 52988 | 1 | ["1","2"] |
Once the selected values are all integers (within a string array), the CAST(e.con AS INT) will work.
Edit: To clarify, I will put the solution as a subquery before I use lateral view explode. This way I am exploding on a table where all partitions have integers in con. I hope this makes sense.

CAST(e.api as INT) returns NULL if the cast is not possible. collect_list will collect an array including duplicates and without NULLs. If you need an array without duplicate elements, use collect_set().
SELECT
    t.yyyy_mm_dd,
    t.h_id,
    t.p_id,
    collect_list( --array of integers
        --cast the CASE result as string if you need an array of strings
        CASE WHEN e.api = 'res' THEN 1
             WHEN e.api = 'av'  THEN 2
             --add more cases
             ELSE CAST(e.api as INT)
        END
    ) as con
FROM
    my_table t
LATERAL VIEW EXPLODE(con) e AS api
GROUP BY t.yyyy_mm_dd, t.h_id, t.p_id
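If, as described in the edit above, you prefer to keep the original LATERAL VIEW EXPLODE on top, the remapping query can be wrapped as a subquery first. A sketch (untested) reusing the table and column names from the question:
SELECT
    s.yyyy_mm_dd,
    s.h_id,
    s.p_id,
    e2.api -- already an INT after the remapping, no further CAST needed
FROM (
    SELECT
        t.yyyy_mm_dd,
        t.h_id,
        t.p_id,
        collect_list(
            CASE WHEN e.api = 'res' THEN 1
                 WHEN e.api = 'av'  THEN 2
                 ELSE CAST(e.api AS INT)
            END
        ) AS con
    FROM my_table t
    LATERAL VIEW EXPLODE(con) e AS api
    GROUP BY t.yyyy_mm_dd, t.h_id, t.p_id
) s
LATERAL VIEW EXPLODE(s.con) e2 AS api;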

Oracle SQL query comparing multiple rows with same identifier

I'm honestly not sure how to title this - so apologies if it is unclear.
I have two tables I need to compare. One table contains tree names and nodes that belong to that tree. Each Tree_name/Tree_node combo will have its own line. For example:
Table: treenode
| TREE_NAME | TREE_NODE |
|-----------|-----------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | C |
| 3 | D |
| 3 | E |
| 3 | F |
I have another table that contains names of queries and what tree_nodes they use. Example:
Table: queryrecord
| QUERY | TREE_NODE |
|---------|-----------|
| Alpha | A |
| Alpha | B |
| Alpha | D |
| BRAVO | A |
| BRAVO | B |
| BRAVO | D |
| CHARLIE | A |
| CHARLIE | B |
| CHARLIE | F |
I need to create a SQL query where I input the QUERY name, and it returns any TREE_NAME that includes all the nodes associated with the query. So if I input 'ALPHA', it would return TREE_NAME 1 & 2. If I ask it for CHARLIE, it would return nothing.
I only have read access, and don’t believe I can create temp tables, so I’m not sure if this is possible. Any advice would be amazing. Thank you!
You can use group by and having as follows:
SELECT t.tree_name
FROM tree_node t
JOIN query_record q
    ON t.tree_node = q.tree_node
WHERE q.query = 'ALPHA'
GROUP BY t.tree_name
HAVING COUNT(DISTINCT t.tree_node)
     = (SELECT COUNT(DISTINCT q.tree_node) FROM query_record q WHERE q.query = 'ALPHA');
Using an IN condition (a semi-join, which saves time over a join):
with prep (tree_node) as (select tree_node from queryrecord where query = :q)
select tree_name
from treenode
where tree_node in (select tree_node from prep)
group by tree_name
having count(*) = (select count(*) from prep)
;
:q in the prep subquery (in the with clause) is the bind variable to which you will assign the various QUERY values at runtime.
EDIT
I don't generally set up the test case on online engines; but in a comment below this answer, the OP said the query didn't work for him. So, I set up the example on SQLFiddle, here:
http://sqlfiddle.com/#!4/b575e/2
A couple of notes: for some reason, SQLFiddle thinks table names should be at most eight characters, so I had to change the second table name to queryrec (instead of queryrecord). I changed the name in the query, too, of course. And, second, I don't know how I can give bind values on SQLFiddle; I hard-coded the name 'Alpha'. (Note also that in the OP's sample data, this query value is not capitalized, while the other two are; of course, text values in SQL are case sensitive, so one should pay attention when testing.)
You can do this with a join and aggregation. The trick is to count the number of nodes in query_record before joining:
select qr.query, t.tree_name
from (select qr.*,
             count(*) over (partition by query) as num_tree_node
      from query_record qr
     ) qr join
     tree_node t
     on t.tree_node = qr.tree_node
where qr.query = 'ALPHA'
group by qr.query, t.tree_name, qr.num_tree_node
having count(*) = qr.num_tree_node;
Here is a db<>fiddle.

array clustering with unique identifier for file datasets

I have a dataset in S3 with a bigint array column, and I want to filter rows efficiently based on the array values. I know we can use a GIN index on a SQL table, but I need a solution that works on the S3 dataset. I am planning to use a cluster id for each combination of elements in the array (their cardinality is not huge, max 2500) and then store it as a new column that a filter can later be applied to.
Example,
Table A
+------+------+-----------+
| Col1 | Col2 | Col3 |
+------+------+-----------+
| 1 | 101 | [123,234] |
| 2 | 102 | [123] |
| 3 | 103 | [234,345] |
+------+------+-----------+
I am trying to add a new column, like:
Table B (column Col3 will be removed from the actual schema)
+------+------+-----------+-----------+
| Col1 | Col2 | Col3 | Cid |
+------+------+-----------+-----------+
| 1 | 101 | [123,234] | 1 |
| 2 | 102 | [123] | 2 |
| 3 | 103 | [234,345] | 3 |
+------+------+-----------+-----------+
and there will be another table mapping Col3 to Cid, like:
Table C
+-----------+-----+
| Col3 | Cid |
+-----------+-----+
| [123,234] | 1 |
| [123] | 2 |
| [234,345] | 3 |
+-----------+-----+
A new entry will be added to Table C if a new combination is created, and B will be updated if any array element gets added or removed. The goal is to be able to filter records from Table A efficiently based on values in the array column. Queries like
123 = Any(Col3) can be served as Cid = 2, and queries like [123, 345] = Any(Col3) can be served as Cid in (2,3).
Is there any better way to solve this problem?
Also, I am thinking of creating the required combinations at runtime to limit the number of combinations. Is it a good idea to create the minimum number of combinations?
In Postgres, you can create the mapping table and use a join to calculate the values:
create table array_dim as
select col3 as arr, row_number() over (order by min(col1)) as array_id
from t
group by col3;
You can then add the new column:
select a.*, ad.array_id
from a join
array_dim ad
on a.col3 = ad.arr
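Once array_id is stored on the wide table, a filter such as 123 = Any(Col3) can be answered by resolving the matching ids against the small mapping table first. A sketch (assuming the wide table a already carries the array_id column added above):
select *
from a
where array_id in (
    select array_id
    from array_dim
    where 123 = any(arr) -- the membership test only scans the small mapping table
);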

How to select properties of arrays of indeterminate size (SQL Server)

I have a table of arrays, and would like to return a list of items.
| table |
|---------------------------------------|
| [{"item": 1},{"item": 2},{"item": 3}] |
| [{"item": 4},{"item": 5},{"item": 6}] |
| [{"item": 7},{"item": 8},{"item": 9}] |
Example Query
The following method will return a wide table of items, but it doesn't scale well.
select
json_value([table], '$[0].item'),
json_value([table], '$[1].item'),
json_value([table], '$[2].item')
from someTable;
How can I select properties of arrays of indeterminate size (like 100)?
Desired Output
| items |
|-------|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
You need to use OPENJSON to achieve this. Also, as I note in the code comments, you should choose a different name than Table for a column. TABLE is a reserved keyword, and it's confusing, as a column is not a table.
WITH VTE AS(
    SELECT *
    FROM (VALUES(N'[{"item": 1},{"item": 2},{"item": 3}]'),
                (N'[{"item": 4},{"item": 5},{"item": 6}]'),
                (N'[{"item": 7},{"item": 8},{"item": 9}]'))V([Table])) --Table isn't a good choice of a name for a column.
                                                                       --TABLE is a reserved keyword.
SELECT T.item
FROM VTE V
CROSS APPLY OPENJSON(V.[Table]) --Table isn't a good choice of a name for a column. TABLE is a reserved keyword.
            WITH(item int) T;
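Applied to the real table, the same pattern would look roughly like this (a sketch assuming the table is the someTable from the question and the JSON column is named [table]):
SELECT T.item
FROM someTable S
CROSS APPLY OPENJSON(S.[table])
            WITH(item int) T;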

Find max value from column that has a json object with key-value pairs

I have a table that has a column containing a JSON string of key-value pairs of items, and I want to return only the key-value pair with the largest value.
I can do this by first UNNESTing the JSON object, then taking the largest value by ORDER BY item, value (DESC), and using array_agg to get the largest one. The problem is that this means creating multiple tables and is slow. I am hoping that in one operation I'll be able to extract the largest key-value pair.
This:
| id | items |
| -- | ---------------------------------- |
| 1 | {Item1=7.3, Item2=1.3, Item3=9.8} |
| 2 | {Item2=4.4, Item3=5.2, Item1=0.1} |
| 3 | {Item5=6.6, Item2=1.4, Item4=1.5} |
| 4 | {Item6=0.9, Item7=11.2, Item4=8.1} |
Should become:
| id | item | value |
| -- | ----- | ----- |
| 1 | Item3 | 9.8 |
| 2 | Item3 | 5.2 |
| 3 | Item5 | 6.6 |
| 4 | Item7 | 11.2 |
I don't actually need the value, as long as the item is the one with the largest value in the JSON object, so the following would be fine as well:
| id | item |
| -- | ----- |
| 1 | Item3 |
| 2 | Item3 |
| 3 | Item5 |
| 4 | Item7 |
Presto's UNNEST performance was improved in Presto 316. However, you don't need UNNEST in this case.
You can convert your JSON to an array of key/value pairs using a JSON CAST and map_entries, then reduce the array to pick the key with the highest value. Since the key/value pairs are represented as anonymous row elements, it's very convenient to use positional access to row elements with the subscript operator (available since Presto 314).
Use a query like:
SELECT
    id,
    reduce(
        -- convert JSON to an array of key/value pairs
        map_entries(CAST(data AS map(varchar, double))),
        -- initial state for reduce (must be same type as the key/value pairs)
        (CAST(NULL AS varchar), -1e0), -- assuming your values cannot be negative
        -- reduction function
        (state, element) -> if(state[2] > element[2], state, element),
        -- reduce output function
        state -> state[1]
    ) AS top
FROM (VALUES
    (1, JSON '{"Item1":7.3, "Item2":1.3, "Item3":9.8}'),
    (4, JSON '{"Item6":0.9, "Item7":11.2, "Item4":8.1}'),
    (5, JSON '{}'),
    (6, NULL)
) t(id, data);
Output
id | top
----+-------
1 | Item3
4 | Item7
5 | NULL
6 | NULL
(4 rows)
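If you also need the value, one option (a sketch, untested) is to have the reduce output function return the whole pair and unpack it with the subscript operator in an outer query:
SELECT id, top[1] AS item, top[2] AS value
FROM (
    SELECT
        id,
        reduce(
            map_entries(CAST(data AS map(varchar, double))),
            (CAST(NULL AS varchar), -1e0),
            (state, element) -> if(state[2] > element[2], state, element),
            state -> state -- keep the whole (key, value) pair instead of just the key
        ) AS top
    FROM (VALUES
        (1, JSON '{"Item1":7.3, "Item2":1.3, "Item3":9.8}'),
        (4, JSON '{"Item6":0.9, "Item7":11.2, "Item4":8.1}')
    ) t(id, data)
) x;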
Store the values one per row in a child table.
CREATE TABLE child (
id INT NOT NULL,
item VARCHAR(6) NOT NULL,
value DECIMAL(9,1),
PRIMARY KEY (id, item)
);
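For the sample data in the question, the child rows would look like this (hypothetical INSERTs, shown only to illustrate the normalized shape):
INSERT INTO child (id, item, value) VALUES
    (1, 'Item1', 7.3), (1, 'Item2', 1.3), (1, 'Item3', 9.8),
    (2, 'Item2', 4.4), (2, 'Item3', 5.2), (2, 'Item1', 0.1);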
You don't have to do a join to find the largest per group, just use a window function:
WITH cte AS (
SELECT id, item, ROW_NUMBER() OVER (PARTITION BY id ORDER BY value DESC) AS rownum
FROM child
)
SELECT * FROM cte WHERE rownum = 1;
Solving this with JSON is a bad idea. It makes your table denormalized, it makes the queries harder to design, and I predict it will make the query performance worse.

How to explode map datatype in Hive OR how to give multiple aliases in Hive

Suppose I query :
select explode(map_column_name) as exploded from table_name
I get this error:
The number of aliases in the AS clause does not match the number of
columns output by the UDTF, expected 2 aliases but got 1
and I googled the error and learned that to give more than one alias, we use the stack function.
How do I use the stack function along with the explode function so that I can explode the map datatype and also give 2 aliases at a time?
Kindly bear with me as I am a beginner and learning Hive.
With default column names
select explode(map) from table_name
With aliases
select explode(map) as (mykey,myval) from table_name
Demo
With default column names
select explode (map('A',1,'B',2,'C',3))
;
+-----+-------+
| key | value |
+-----+-------+
| A | 1 |
| B | 2 |
| C | 3 |
+-----+-------+
With aliases
select explode (map('A',1,'B',2,'C',3)) as (mykey,myvalue)
;
+-------+---------+
| mykey | myvalue |
+-------+---------+
| A | 1 |
| B | 2 |
| C | 3 |
+-------+---------+
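To keep the other columns of the table alongside the exploded key/value pairs, the same aliases work with LATERAL VIEW. A sketch using the table_name and map_column_name placeholders from the question:
select t.*, e.mykey, e.myvalue
from table_name t
lateral view explode(map_column_name) e as mykey, myvalue;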