How to convert arrays from two different table columns to parallel rows? - sql

I'm working with Hive and I have a table in the following format (I show only one row, but the table has many):
segments | rates      | sessID
---------|------------|-------
'1,2,3'  | '10,20,30' | 555
That is, two columns hold comma-separated strings representing arrays of the same length, and the third column holds an integer. I want to flatten the arrays so that the first member of the first array appears in the same row as the first member of the second array, and so on:
Something like:
segment | rate | sessId
--------|------|-------
1       | 10   | 555
2       | 20   | 555
3       | 30   | 555
I've tried the following query (for simplicity I've hardcoded the values):
SELECT explode(segments), explode (rates), sessID FROM
(SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) data ;
However, this does not produce the required result; it returns an error:
FAILED: SemanticException 1:26 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'rates'
When I try to flatten just one column it does work:
The query:
SELECT explode(segments) FROM (
SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) data ;
the result:
1
2
3
How can I get the result I want?

I don't have access to Hive to test this, but the approach should basically work.
POSEXPLODE() can be used to get two columns, the position within an array and the item itself. Then you can use that position to look up the corresponding item from the other array...
SELECT
    yourData.sessID,
    segment.item AS segment,
    SPLIT(yourData.rates, ',')[segment.pos] AS rate
FROM
    yourData
LATERAL VIEW
    POSEXPLODE(SPLIT(yourData.segments, ',')) segment AS pos, item;
Note that POSEXPLODE() returns positions starting from 0, which matches Hive's 0-based array indexing, so [segment.pos] lines up correctly. If it returned 1-based positions you would use [segment.pos - 1] instead.
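For a self-contained check, here is the same idea applied to the hard-coded values from the question (a sketch, untested against a live Hive instance; the subquery already produces arrays, so no extra SPLIT is needed for the lookup):
SELECT
    data.sessID,
    seg.item            AS segment,
    data.rates[seg.pos] AS rate
FROM (
    SELECT SPLIT('1,2,3', ',')    AS segments,
           SPLIT('10,20,30', ',') AS rates,
           555                    AS sessID
) data
LATERAL VIEW POSEXPLODE(data.segments) seg AS pos, item;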

Please give this a try.
select sessID, tf1.val as segments, tf2.val as rates
from (SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) t
lateral view posexplode(segments) tf1 as pos, val
lateral view posexplode(rates) tf2 as pos, val
where tf1.pos = tf2.pos;
+--------+----------+-------+
| sessid | segments | rates |
+--------+----------+-------+
| 555    | 1        | 10    |
| 555    | 2        | 20    |
| 555    | 3        | 30    |
+--------+----------+-------+

Related

SQL: How to find string occurrences, sort them randomly by key and assign them to new attributes?

I have the following sample data:
key | source_string
---------------------
1355 | hb;am;dr;cc;
3245 | am;xa;sp;cc;
9831 | dr;hb;am;ei;
What I need to do:
Find strings from a fixed list ('hb','am','dr','ac') in the source_string
Create 3 new attributes and assign the found strings to them randomly but fixed (i.e., no difference when the query is re-executed)
If possible no subqueries and all in one SQL SELECT statement
The solution should look like this:
key | source_string | t_1 | t_2 | t_3
---------------------------------------
1355 | hb;am;dr;cc; | hb | dr |
3245 | am;xa;sp;cc; | am | |
9831 | dr;hb;am;ei; | hb | dr | am
My thought process:
I wanted to return the strings that occurred per row -> 1355: hb,am,dr,cc, (no idea how)
Rank them based on the key to have it randomly (maybe with rank() and mod())
Assign the strings based on their rank to the new attributes. At key 1355, four attributes match but only three can be assigned, so the one left over has to be ignored.
(Everything in Postgres)
In my current solution I created a rule for every case, which results in a huge query which is not desirable.
One simple method is to split the string, re-aggregate the matches into an array, and use that array for the result columns:
select t.*,
       ar[1] as t_1, ar[2] as t_2, ar[3] as t_3
from t cross join lateral
     (select array_agg(el order by random()) as ar
      from regexp_split_to_table(t.source_string, ';') el
      where el in ('hb', 'am', 'dr', 'ac')
     ) s;
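One caveat: the question asks for an assignment that is random but stable across re-executions, and order by random() reshuffles on every run. A sketch of a deterministic variant, assuming key uniquely identifies a row, orders by a hash of the key and the element instead:
select t.*,
       ar[1] as t_1, ar[2] as t_2, ar[3] as t_3
from t cross join lateral
     (select array_agg(el order by md5(t.key::text || el)) as ar
      from regexp_split_to_table(t.source_string, ';') el
      where el in ('hb', 'am', 'dr', 'ac')
     ) s;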

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, it's a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header, and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can hold only 40 attributes, which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible; I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and use that view to join repeatedly on those two columns. However, I have some problems:
I don't know how many sections will occur in the file for the same object.
The file can contain other objects as well, so joining on the first two columns would be problematic if you have something like:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I had to basically write a recursive descent parser for the specific file format, and it simply took too long because I had to get it into the database afterwards and it was too much for Entity Framework. It was taking hours just to convert one file, since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,,1,x,y,z,t,u,v                  |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,,2,q,r,s,l,m,n                  |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | *,record,lmn,opq,rst             |
| 7  | ,,         | ,1,t,u,v                 | ,,1,t,u,v                        |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN it to your actual table with a LIKE to get all the rows that use each of these values as their starting substring. Strip out the starting characters to get the rest of the string, and use the STUFF...FOR XML trick to build the desired Line.
How you get the ID column depends on what you want. For instance, in your second example, I don't know what ID you want for the ,4,Data,... line: do you want 5 because that's the next number in the results, or 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5, it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
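To make the shape of that approach concrete, here is a sketch (untested; FileData with columns ID and Line is an assumed staging table, and the prefix list is hand-picked as described above):
;WITH prefixes(header) AS (
    SELECT '3,Formula,' UNION ALL
    SELECT '*,record,'  UNION ALL
    SELECT ',,1,'       UNION ALL
    SELECT ',,2,'       UNION ALL
    SELECT ',4,Data,'
)
SELECT p.header,
       -- concatenate the remainder of every matching line, comma-separated
       p.header + STUFF(
           (SELECT ',' + REPLACE(SUBSTRING(f.Line, LEN(p.header) + 1, 8000), ',...', '')
            FROM FileData AS f
            WHERE f.Line LIKE p.header + '%'
            ORDER BY f.ID
            FOR XML PATH('')), 1, 1, '') AS Line
FROM prefixes AS p;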
Here is a sample, though it differs somewhat from what you need. That is because I use the text up to the second comma as the group header, so the ,,1 and ,,2 rows are treated as the same group; if you can use a parent id to indicate a group, that would be better.
DECLARE @testdata TABLE (ID int, Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'

;WITH t AS (
    -- header = everything up to and including the second comma; data = the rest, minus the ',...' tail
    SELECT *,
           REPLACE(SUBSTRING(t.Line, LEN(c.header) + 1, LEN(t.Line)), ',...', '') AS data
    FROM @testdata AS t
    CROSS APPLY (VALUES (LEFT(t.Line, CHARINDEX(',', t.Line, CHARINDEX(',', t.Line) + 1)))) c(header)
)
SELECT MIN(ID) AS ID,
       t.header,
       c.x,
       t.header + STUFF(c.x, 1, 1, '') AS value
FROM t
OUTER APPLY (SELECT ',' + tb.data
             FROM t AS tb
             WHERE tb.header = t.header
             FOR XML PATH('')) c(x)
GROUP BY t.header, c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+

Flattening a relation with an array to emit one row per array entry

Given a table defined as such:
CREATE TABLE test_values(name TEXT, "values" INTEGER[]);  -- "values" needs double quotes: VALUES is a reserved word
...and the following values:
| name | values |
+-------+---------+
| hello | {1,2,3} |
| world | {4,5,6} |
I'm trying to find a query which will return:
| name | value |
+-------+-------+
| hello | 1 |
| hello | 2 |
| hello | 3 |
| world | 4 |
| world | 5 |
| world | 6 |
I've reviewed the upstream documentation on accessing arrays, and tried to think about what a solution using the unnest() function would look like, but have been coming up empty.
An ideal solution would be easy to use even in cases where there were a significant number of columns other than the array being expanded and no primary key. Handling a case with more than one array is not important.
We can put the set-returning function unnest() into the SELECT list like Raphaël suggests. This used to exhibit corner case problems before Postgres 10. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Since Postgres 9.3 we can also use a LATERAL join for this. It is the cleaner, standard-compliant way to put set-returning functions into the FROM list, not into the SELECT list:
SELECT name, value
FROM test_values, unnest("values") AS value;  -- implicit CROSS JOIN LATERAL
One subtle difference: this form drops rows with empty or NULL arrays from the result, since unnest() returns no row for them. To keep such rows in the result, with a NULL value, use a LEFT JOIN instead:
SELECT t.name, v.value
FROM test_values t
LEFT JOIN unnest(t."values") v(value) ON true;
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
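To see the difference concretely, assume the test_values table from the question with an extra row whose array is NULL:
INSERT INTO test_values VALUES ('empty', NULL);

-- comma join: the 'empty' row disappears from the result
SELECT name, value
FROM test_values, unnest("values") AS value;

-- LEFT JOIN ... ON true: the 'empty' row is kept, with a NULL value
SELECT t.name, v.value
FROM test_values t
LEFT JOIN unnest(t."values") v(value) ON true;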
Well, you give the data, the doc, so... let's mix it ;)
select
    name,
    unnest("values") as value
from test_values;

Postgres/Postgis - Querying all values in clipped raster

I'm currently using Postgres 9.1
My goal is to clip a PostGIS raster with a polygon. Then I would like either a Postgres array or a delimited set of values for each of the raster pixels contained within that polygon. Here is the query I have so far:
SELECT p.gid, (ST_ValueCount(ST_Clip(r.rast, p.geom))).value AS fcstval
FROM temp_raster AS r, temp_shapefile AS p
WHERE (r.start_time = 1384516800)
GROUP BY gid, fcstval;
This will output:
gid | fcstval
-----+---------
1 | 0
1 | 2
2 | 0
2 | 2
3 | 5
4 | 0
4 | 1
4 | 2
4 | 3
4 | 5
This data is good, but I would like a set of values for each gid like this:
gid | fcstval
-----+----------
1 | 0,2
2 | 0,2
3 | 5
4 | 0,1,2,3,5
The trouble I'm having is trying to then get these values aggregated into either an array or delimited string. Here's my attempt at an array:
SELECT p.gid, array_agg((ST_ValueCount(ST_Clip(r.rast, p.geom))).value) AS fval
FROM temp_raster AS r, temp_shapefile AS p
WHERE (r.start_time = 1384516800)
GROUP BY gid;
This doesn't work, and provides the error:
ERROR: set-valued function called in context that cannot accept a set
My guess is this is because I can't call array_agg in this way. I'm having a bit of difficulty figuring out how to do this otherwise. A subquery perhaps? I haven't been able to come up with anything yet though to get this working.
Thanks for any help!
Okay, I think I figured it out for arrays. If I want a string, I can just convert the array to a string. However, if anyone has suggestions for cleaning this up I would appreciate it, as this doesn't seem like the simplest formulation:
SELECT p.gid, array_agg((subquery).tempval) AS fcstval
FROM pih_raster AS r, hwy_pih_vertex_buf AS p,
     (SELECT p.gid AS tempgid, (ST_ValueCount(ST_Clip(r.rast, p.geom))).value AS tempval
      FROM pih_raster AS r, hwy_pih_vertex_buf AS p
      WHERE (r.start_time <= 1384624800)
      GROUP BY tempgid, tempval
     ) AS subquery
WHERE (r.start_time <= 1384624800) AND ((subquery).tempgid = p.gid)
GROUP BY p.gid;
Here's the output:
gid | fcstval
-----+-------------
1 | {0,2}
2 | {0,2}
3 | {5}
4 | {0,1,2,3,5}
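For what it's worth, a simpler formulation along the same lines (a sketch, not tested against this data) aggregates directly over the subquery and avoids scanning the raster tables twice:
SELECT tempgid AS gid, array_agg(tempval) AS fcstval
FROM (
    SELECT p.gid AS tempgid, (ST_ValueCount(ST_Clip(r.rast, p.geom))).value AS tempval
    FROM pih_raster AS r, hwy_pih_vertex_buf AS p
    WHERE r.start_time <= 1384624800
    GROUP BY tempgid, tempval
) AS subquery
GROUP BY tempgid;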

How to properly group SQL results set?

SQL noob, please bear with me!!
I am storing a 3-tuple in a database (x,y, {signal1, signal2,..}).
I have a database with tables coordinates (x,y) and another table called signals (signal, coordinate_id, group) which stores the individual signal values. There can be several signals at the same coordinate.
The group is just an arbitrary integer which marks entries in the signals table as belonging to the same set (provided they belong to the same coordinate), so that any signals with the same coordinate_id and group together form a tuple as shown above.
For example,
Coordinates table
-----------------
| id | x | y |
| 1  | 1 | 2 |
| 2  | 2 | 5 |

Signals table
-------------
| id | signal | coordinate_id | group |
| 1  | 45     | 1             | 1     |
| 2  | 95     | 1             | 1     |
| 3  | 33     | 1             | 1     |
| 4  | 65     | 1             | 2     |
| 5  | 57     | 1             | 2     |
| 6  | 63     | 2             | 1     |
This would produce the tuples (1,2,{45,95,33}), (1,2,{65,57}), (2,5,{63}), and so on.
I would like to retrieve the sets of {signal1, signal2,...} for each coordinate. The signals belonging to a set have the same coordinate_id and group, but I do not necessarily know the group value. I only know that if the group value is the same for a particular coordinate_id, then all those with that group form one set.
I tried looking into SQL GROUP BY, but I realized that it is for use with aggregate functions.
Can someone point out how to do this properly in SQL, or give tips for improving my database structure?
SQLite supports the GROUP_CONCAT() aggregate function similar to MySQL. It rolls up a set of values in the group and concatenates them together comma-separated.
SELECT c.x, c.y, GROUP_CONCAT(s.signal) AS signal_list
FROM Signals s
JOIN Coordinates c ON s.coordinate_id = c.id
GROUP BY s.coordinate_id, s."group";
SQLite also permits the mismatch between columns in the select-list and columns in the group-by clause, even though this isn't strictly permitted by ANSI SQL and most implementations.
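Against the sample data above, this should yield one row per (coordinate_id, group) set, matching the tuples listed in the question (order within each list is not guaranteed):
x | y | signal_list
--|---|------------
1 | 2 | 45,95,33
1 | 2 | 65,57
2 | 5 | 63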
Personally, I would design the database as three tables:
x_y(x, y, id)
coords_groups(pos, group, id)
signals(group, signal)
with signals.group -> coords_groups.id and coords_groups.pos -> x_y.id, since you are trying to represent a sort of 4-dimensional array.
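A rough DDL sketch of that layout (my reading of the proposal; group is quoted because it is a keyword):
CREATE TABLE x_y (
    id INTEGER PRIMARY KEY,
    x  INTEGER,
    y  INTEGER
);

CREATE TABLE coords_groups (
    id      INTEGER PRIMARY KEY,
    pos     INTEGER REFERENCES x_y(id),  -- which coordinate this group belongs to
    "group" INTEGER                      -- arbitrary group marker
);

CREATE TABLE signals (
    "group" INTEGER REFERENCES coords_groups(id),  -- which group this signal belongs to
    signal  INTEGER
);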
Then, to get from a pair of coordinates (X, Y) to an ArrayList of Lists of Signal, you can use this:
SELECT temp."group", signals.signal
FROM (
SELECT cg."group", cg.id
FROM x_y JOIN coords_groups AS cg ON x_y.id = cg.pos
WHERE x_y.x=X AND x_y.y=Y )
AS temp JOIN signals ON temp.id=signals."group"
ORDER BY temp."group" ASC
(X and Y are the coordinate parameters in the innermost WHERE)
inside this sort of pseudo-code:
getSignalsGroups(X, Y)
    ArrayList<List<Signal>> a
    List<Signal> temp
    query = sqlLiteExecute(THE_SQL_SNIPPET, X, Y)
    row = query.fetch()               // fetch the first row to seed the group counter
    actualGroup = row.group
    temp.add(row.signal)
    for (row : query)                 // for each remaining row, add the signal to the list
        if (row.group != actualGroup) // ...or start a new list when a new group begins
            a.add(temp)
            actualGroup = row.group
            temp = new List
        temp.add(row.signal)
    a.add(temp)                       // don't forget the last group's list
    return a