Convert multi-dimensional array to records - SQL

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
 foo | bar
-----+-----
   1 | a
   2 | b
   3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM (
  select generate_subscripts(arr,1) as subscript, arr
  from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;

This works:
select key as foo, value as bar
from json_each_text(
  json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
 foo | bar
-----+-----
   1 | a
   2 | b
   3 | c
See json_object and json_each_text in the PostgreSQL documentation.
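Note that json_each_text returns both columns as text. If foo should actually be an integer (an assumption about the desired types), a cast does it:
select key::int as foo, value as bar
from json_each_text(
  json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);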

Not sure what exactly you mean by "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before executing it (so before it begins to read the string).
But I would like to propose converting the string into a normal relational representation of the matrix:
select i, j, arr[i][j] a_i_j from (
  select i, generate_subscripts(arr,2) as j, arr from (
    select generate_subscripts(arr,1) as i, arr
    from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
  ) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be rather usable in further data processing, I think.
Of course, such a query can only handle arrays with a predefined number of dimensions, but the array sizes along each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
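As an example of such further processing, here is a minimal sketch that pivots the relational form back into fixed columns with conditional aggregation, assuming the i/j/a_i_j query above is wrapped as a subquery (named matrix here, hypothetically) and that j is known to run 1..3:
select i,
       max(case when j = 1 then a_i_j end) as col1,
       max(case when j = 2 then a_i_j end) as col2,
       max(case when j = 3 then a_i_j end) as col3
from matrix  -- hypothetical: the i/j/a_i_j query above as a subquery or view
group by i
order by i;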
ADDITION: Yes, using WITH RECURSIVE one can build a similar query capable of handling arrays of arbitrary dimension. Nonetheless, there is no way to overcome the limitation coming from the relational data model: the exact set of columns must be defined at query parse time and cannot be deferred to execution time. So we are forced to store all indices in one column, using another array.
Here is a query that extracts all elements from an arbitrary multi-dimensional array along with their zero-based indices (stored in another, one-dimensional array):
with recursive extract_index(k, idx, elem, arr, n) as (
  select (row_number() over()) - 1 k, idx, elem, arr, n from (
    select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
    from (select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr) input
  ) plain_indexed
  union all
  select k / array_length(arr, n)::bigint k,
         array_prepend(k % array_length(arr, n), idx) idx, -- index along the current dimension n
         elem, arr, n - 1 n
  from extract_index
  where n != 1
)
select array_prepend(k, idx) idx, elem from extract_index where n = 1
Which gives:
  idx   | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what real practical use one could make of it :)


Weighted + ordered tag search using Postgres

After an AI file analysis across tens of thousands of audio files I end up with this kind of data structure in a Postgres DB:
id | name | tag_1 | tag_2 | tag_3 | tag_4 | tag_5
1 | first song | rock | pop | 80s | female singer | classic rock
2 | second song | pop | rock | jazz | electronic | new wave
3 | third song | rock | funk | rnb | 80s | rnb
Tag positions are really important: the further to the left a tag is, the more prominent it is in the song. The number of tags is also finite (50 tags), and the AI always returns 5 of them for every song; no null values expected.
On the other hand, this is what I have to query:
{"rock" => 15, "pop" => 10, "soul" => 3}
The key is a tag name and the value an arbitrary weight. The number of entries can be anywhere from 1 to 50.
According to the example dataset, in this case it should return [1, 3, 2]
I'm also open to restructuring the data if this would be easier to achieve using raw concatenated strings, but... is this doable using Postgres (tsvectors?) or do I really have to use something like Elasticsearch for this?
After a lot of trials and errors this is what I ended up with, using only Postgres:
Turn the whole data set into integers, so it becomes something like this (I also added columns to match the real data set more closely):
id | bpm | tag_1 | tag_2 | tag_3 | tag_4 | tag_5
1 | 114 | 1 | 2 | 3 | 4 | 5
2 | 102 | 2 | 1 | 6 | 7 | 8
3 | 110 | 1 | 9 | 10 | 3 | 12
Store requests in an array as strings (note that I sanitized those requests before with some kind of "request builder"):
requests = [
  "bpm BETWEEN 110 AND 124
   AND tag_1 = 1
   AND tag_2 = 2
   AND tag_3 = 3
   AND tag_4 = 4
   AND tag_5 = 5",
  "bpm BETWEEN 110 AND 124
   AND tag_1 = 1
   AND tag_2 = 2
   AND tag_3 = 3
   AND tag_4 = 4
   AND tag_5 IN (1, 3, 5)",
  "bpm BETWEEN 110 AND 124
   AND tag_1 = 1
   AND tag_2 = 2
   AND tag_3 = 3
   AND tag_4 IN (1, 3, 5)
   AND tag_5 IN (1, 3, 5)",
  ....
]
Simply loop in the requests array, from the most precise to the most approximate one:
# Ruby / ActiveRecord example
track_ids = []
requests.each do |request|
  track_ids += Track.where(request)
                    .where.not(id: track_ids) # skip tracks already matched by a stricter request
                    .pluck(:id)
  break if track_ids.length > 200
end
... and done! All my songs are ordered by similarity, the closest matches at the top; the further down the list, the more approximate they get. Since everything is integers, it's pretty fast (fast enough on a 100K-row dataset), and the output looks like pure magic. Bonus point: it is still easily tweakable and maintainable by the whole team.
I understand this is rough, so I'm open to any more efficient way to do the same thing, even if something else is needed in the stack (ES?), but so far it's a simple solution that just works.
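For comparison, here is a sketch of a set-based alternative in plain SQL, assuming the integer-encoded tracks table above and a hypothetical request_weights(tag_id, weight) table holding the request (e.g. (1,15), (2,10), (9,3)): score each track by summing the weight of every matching tag, discounted by position, then sort.
-- request_weights is an assumed helper table built from the request hash
SELECT t.id,
       COALESCE(w1.weight, 0) * 5  -- tag_1 is the most prominent position
     + COALESCE(w2.weight, 0) * 4
     + COALESCE(w3.weight, 0) * 3
     + COALESCE(w4.weight, 0) * 2
     + COALESCE(w5.weight, 0)     AS score
FROM tracks t
LEFT JOIN request_weights w1 ON w1.tag_id = t.tag_1
LEFT JOIN request_weights w2 ON w2.tag_id = t.tag_2
LEFT JOIN request_weights w3 ON w3.tag_id = t.tag_3
LEFT JOIN request_weights w4 ON w4.tag_id = t.tag_4
LEFT JOIN request_weights w5 ON w5.tag_id = t.tag_5
WHERE t.bpm BETWEEN 110 AND 124
ORDER BY score DESC
LIMIT 200;
This trades the exact "most precise request first" ordering for a single pass over the table; whether the ranking matches depends on how the position discounts are chosen.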

Recursive query to produce edges of a path?

I have a table paths:
CREATE TABLE paths (
  id_travel INT,
  point     INT,
  visited   INT
);
Sample rows:
id_travel | point | visited
-----------+-------+---------
10 | 35 | 0
10 | 16 | 1
10 | 93 | 2
5 | 15 | 0
5 | 26 | 1
5 | 193 | 2
5 | 31 | 3
And another table distances:
CREATE TABLE distances (
  id_port1 INT,
  id_port2 INT,
  distance INT CHECK (distance > 0),
  PRIMARY KEY (id_port1, id_port2)
);
I need to make a view:
id_travel | point1 | point2 | distance
-----------+--------+--------+---------
10 | 35 | 16 | 1568
10 | 16 | 93 | 987
5 | 15 | 26 | 251
5 | 26 | 193 | 87
5 | 193 | 31 | 356
I don't know how to build dist_trips with a recursive query here:
CREATE VIEW dist_view AS
WITH RECURSIVE dist_trips (id_travel, point1, point2) AS (
  SELECT ????
)
SELECT dt.id_travel, dt.point1, dt.point2, d.distance
FROM dist_trips dt
NATURAL JOIN distances d;
dist_trips is the recursive part, which should return three columns: id_travel, point1, and point2, derived from table paths.
You don't need recursion. Can be plain joins:
SELECT id_travel, p1.point AS point1, p2.point AS point2, d.distance
FROM   paths p1
JOIN   paths p2 USING (id_travel)
LEFT   JOIN distances d ON d.id_port1 = p1.point
                       AND d.id_port2 = p2.point
WHERE  p2.visited = p1.visited + 1
ORDER  BY id_travel, p1.visited;
Your paths seem to have gapless, ascending visited numbers, so just join each point with the next one.
I threw in a LEFT JOIN to keep all edges of each path in the result, even if the distances table has no matching entry. Probably prudent.
Your NATURAL JOIN didn't go anywhere. Generally, NATURAL is rarely useful and breaks easily. The manual warns:
USING is reasonably safe from column changes in the joined relations
since only the listed columns are combined. NATURAL is considerably
more risky since any schema changes to either relation that cause a
new matching column name to be present will cause the join to combine
that new column as well.
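To get the view the question asked for, here is a minimal sketch wrapping the query above (the ORDER BY is dropped, since ordering belongs in queries against the view):
CREATE VIEW dist_view AS
SELECT id_travel, p1.point AS point1, p2.point AS point2, d.distance
FROM   paths p1
JOIN   paths p2 USING (id_travel)
LEFT   JOIN distances d ON d.id_port1 = p1.point
                       AND d.id_port2 = p2.point
WHERE  p2.visited = p1.visited + 1;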

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
hardware_id | model_name | data_v
------------+------------+---------
a           | 1          | 0.595150
b           | 1          | 0.285757
c           | 1          | 0.278061
d           | 1          | 0.578061
a           | 2          | 0.246565
b           | 2          | 0.942299
c           | 2          | 0.658126
a           | 3          | 0.160283
b           | 3          | 0.180021
c           | 3          | 0.093628
d           | 3          | 0.033813
What I'm trying to get would be a DataFrame with all elements except the rows that contain a hardware_id of d, since they do not occur at least once per model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import numpy as np
import pandas as pd
import dask.dataframe as dd

models = ['1','1','1','2','2','2','3','3','3','3']
frame_1 = dd.from_pandas(  # build via pandas; dd.DataFrame has no dict constructor
    pd.DataFrame({'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
                  'model_name': models,
                  'data_v': np.random.rand(len(models))}),
    npartitions=2)
model_splits = []
for i in range(1, 4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])
aggregate_list = []
while len(model_splits) > 0:
    data = model_splits.pop()
    for other_models in model_splits:
        data = data[data.hardware_id.isin(list(other_models.hardware_id.compute()))]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
{'hardware_id':['a','b','c','a','b','c','a','b','c','d'],
'model_name': models,
'data_v': np.random.rand(len(models))
}
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First we do a groupby aggregation to get the unique model_names per hardware_id. This returns an array, but we want it as a tuple so it can be compared in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Then we check whether that tuple exactly matches our unique model names, using isin. If it doesn't, we know the condition should be False (exactly what we get).
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
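An alternative sketch that avoids building tuples, assuming the order in which models appear per hardware ID doesn't matter: compare each hardware_id's distinct model count against the global distinct count.
# valid ids are those that have seen every distinct model_name
n_models = df.model_name.nunique()
counts = df.groupby("hardware_id").model_name.nunique()
relevant_ids = counts[counts == n_models].index
result = df[df.hardware_id.isin(relevant_ids)]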
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
unique_model_names = ddf.model_name.unique()

agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])

relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute())  # can't pass a dask Series to `isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant computation.
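For example (a sketch; persist keeps the intermediate collections in memory instead of recomputing them on each downstream use):
agged = agged.persist()
relevant_ids = relevant_ids.persist()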

Manipulate map using different ways?

I am hoping to find an elegant way of sorting a map by value first and then by the key.
For example:
B | 50
A | 50
C | 50
E | 10
D | 100
F | 99
I have the following code:
// Turn the map into a list first
List<Map.Entry<String, Integer>> sortingList = new LinkedList<>(processMap.entrySet());
// Create a comparator that compares the values of the map
Comparator<Map.Entry<String, Integer>> c = Comparator.comparingInt(entry -> entry.getValue());
// Sort the list in descending order
sortingList.sort(c.reversed());
I don't need the result to be a map again, so this is sufficient; however, my result is:
D | 100
F | 99
B | 50
A | 50
C | 50
E | 10
I would like to sort not just by value, but also by the key, so the result becomes:
D | 100
F | 99
A | 50
B | 50
C | 50
E | 10
I have researched some possible solutions, but the problem is that my values need to be in descending order while my keys have to be ascending...
Hoping someone can help me.
Try this:
Comparator<Map.Entry<String, Integer>> c = Comparator.comparing(Map.Entry<String, Integer>::getValue)
.reversed()
.thenComparing(Map.Entry::getKey);
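For completeness, a quick sketch of that comparator in use (Map.of is Java 9+; the map literal is just the example data from the question):
Map<String, Integer> processMap = Map.of(
        "B", 50, "A", 50, "C", 50, "E", 10, "D", 100, "F", 99);
List<Map.Entry<String, Integer>> sortingList = new ArrayList<>(processMap.entrySet());
sortingList.sort(c);
sortingList.forEach(e -> System.out.println(e.getKey() + " | " + e.getValue()));
// Output order: D | 100, F | 99, A | 50, B | 50, C | 50, E | 10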

Group and split records in postgres into several new column series

I have data of the form
-----------------------------|
6031566779420 | 25 | 163698 |
6031566779420 | 50 | 98862 |
6031566779420 | 75 | 70326 |
6031566779420 | 95 | 51156 |
6031566779420 | 100 | 43788 |
6036994077620 | 25 | 41002 |
6036994077620 | 50 | 21666 |
6036994077620 | 75 | 14604 |
6036994077620 | 95 | 11184 |
6036994077620 | 100 | 10506 |
------------------------------
and would like to create a dynamic number of new columns by treating each series of (25, 50, 75, 95, 100) and its corresponding values as a new series. What I'm looking for as target output is:
--------------------------
| 25 | 163698 | 41002 |
| 50 | 98862 | 21666 |
| 75 | 70326 | 14604 |
| 95 | 51156 | 11184 |
| 100 | 43788 | 10506 |
--------------------------
I'm not sure what the SQL / Postgres operation I want is called, nor how to achieve it. In this case the data has 2 new columns, but I'm trying to formulate a solution that has as many new columns as there are groups of data in the output of the original query.
[Edit]
Thanks for the references to array_agg; that looks like it would be helpful! I should've mentioned this earlier, but I'm using Redshift, which reports this version of Postgres:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1007
and it does not seem to support this function yet.
ERROR: function array_agg(numeric) does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
Query failed
PostgreSQL said: function array_agg(numeric) does not exist
Hint: No function matches the given name and argument types. You may need to add explicit type casts.
Is crosstab the type of transformation I should be looking at? Or something else? Thanks again.
I've used array_agg() here
select idx,array_agg(val)
from t
group by idx
This will produce a result like the one below:
idx array_agg
--- --------------
25 {163698,41002}
50 {98862,21666}
75 {70326,14604}
95 {11184,51156}
100 {43788,10506}
As you can see, the second column is an array of the two values that correspond to each idx.
The following select queries will give you the result with two separate columns:
Method 1:
SELECT idx
      ,col[1] col1 -- first value in the array
      ,col[2] col2 -- second value in the array
FROM (
  SELECT idx
        ,array_agg(val) col
  FROM t
  GROUP BY idx
) s
Method 2:
SELECT idx
      ,(array_agg(val))[1] col1 -- first value in the array
      ,(array_agg(val))[2] col2 -- second value in the array
FROM t
GROUP BY idx
Result:
idx col1 col2
--- ------ -----
25 163698 41002
50 98862 21666
75 70326 14604
95 11184 51156
100 43788 10506
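Note that array_agg without ORDER BY aggregates in an unspecified order, which is why the 95 row comes back as {11184,51156} rather than {51156,11184} above. Assuming the first (series) column of the source table is named id, you can pin the order like this:
SELECT idx
      ,(array_agg(val ORDER BY id))[1] col1
      ,(array_agg(val ORDER BY id))[2] col2
FROM t
GROUP BY idx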
You can use the array_agg function. Assuming your columns are named A, B, C:
SELECT B, array_agg(C)
FROM table_name
GROUP BY B
This will get you the output in array form, which is as close as you can get to variable columns in a simple query. If you really need variable columns, consider defining a PL/pgSQL function to convert the array into columns.
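Since the question mentions crosstab: on stock PostgreSQL, the tablefunc extension's crosstab() does exactly this kind of pivot, though it is not available on Redshift and still requires a fixed output column list. A sketch, assuming the source table t has columns named id, idx, and val (the id name is an assumption):
-- requires: CREATE EXTENSION tablefunc;  (not supported on Redshift)
SELECT *
FROM crosstab(
  'SELECT idx, id, val FROM t ORDER BY idx, id'
) AS ct(idx int, col1 int, col2 int);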