SQL: How to find string occurrences, sort them randomly by key and assign them to new attributes? - sql

I have the following sample data:
key | source_string
---------------------
1355 | hb;am;dr;cc;
3245 | am;xa;sp;cc;
9831 | dr;hb;am;ei;
What I need to do:
1. Find strings from a fixed list ('hb','am','dr','ac') in the source_string.
2. Create 3 new attributes and assign the found strings to them randomly but fixed (no difference after query re-execution).
3. If possible, no subqueries and everything in one SQL SELECT statement.
The solution should look like this:
key | source_string | t_1 | t_2 | t_3
---------------------------------------
1355 | hb;am;dr;cc; | hb | dr |
3245 | am;xa;sp;cc; | am | |
9831 | dr;hb;am;ei; | hb | dr | am
My thought process:
1. Return the strings that occurred per row -> 1355: hb,am,dr,cc (no idea how).
2. Rank them based on the key to get the random assignment (maybe with rank() and mod()).
3. Assign the strings to the new attributes based on their rank. At key 1355 four attributes match, but only 3 need to be assigned, so the one left over has to be ignored.
(Everything in Postgres)
In my current solution I created a rule for every case, which results in a huge query which is not desirable.

One simple method is to split the string, reaggregate the matches into an array, and use that array for the result columns:
select t.*,
       ar[1], ar[2], ar[3]
from t cross join lateral
     (select array_agg(el order by random()) as ar
      from regexp_split_to_table(t.source_string, ';') el
      where el in ('hb', 'am', 'dr', 'ac')
     ) s;
Here is a db<>fiddle.
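Note that order by random() produces a different assignment on every execution, while the question asks for a result that stays the same across re-runs. One way to get a "random but fixed" order is to sort by a hash of the key and the element instead (a sketch, not part of the original answer):

```sql
-- Deterministic pseudo-random assignment: md5(key || element) is an
-- arbitrary but stable sort key, so re-running the query gives the
-- same t_1 / t_2 / t_3 every time.
select t.*,
       ar[1] as t_1, ar[2] as t_2, ar[3] as t_3
from t cross join lateral
     (select array_agg(el order by md5(t.key::text || el)) as ar
      from regexp_split_to_table(t.source_string, ';') el
      where el in ('hb', 'am', 'dr', 'ac')
     ) s;
```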

Related

How can I return the best matched row first in sort order from a set returned by querying a single search term against multiple columns in Postgres?

Background
I have a Postgres 11 table like so:
CREATE TABLE some_schema.foo_table (
    id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    bar_text TEXT,
    foo_text TEXT,
    foobar_text TEXT
);
It has some data like this:
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('eddie', '123456', 'something0987');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Snake', '12345-54321', 'that_##$%_snake');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Sally', '12345', '24-7avocado');
id | bar_text | foo_text | foobar_text
----+----------+-------------+-----------------
1 | eddie | 123456 | something0987
2 | Snake | 12345-54321 | that_##$%_snake
3 | Sally | 12345 | 24-7avocado
The problem
I need to query each one of these columns and compare the values to a given term (passed in as an argument from app logic), and make sure the best-matched row (considering comparison with all the columns, not just one) is returned first in the sort order.
There is no way to know in advance which of the columns is likely to be a better match for the given term.
If I compare the given term to each value using the similarity() function, I can see at a glance which row has the best match in any of the three columns and can see that's the one I would want ranked first in the sort order.
SELECT
f.id,
f.foo_text,
f.bar_text,
f.foobar_text,
similarity('12345', foo_text) AS foo_similarity,
similarity('12345', bar_text) AS bar_similarity,
similarity('12345', foobar_text) AS foobar_similarity
FROM some_schema.foo_table f
WHERE
(
f.foo_text ILIKE '%12345%'
OR
f.bar_text ILIKE '%12345%'
OR
f.foobar_text ILIKE '%12345%'
)
;
id | foo_text | bar_text | foobar_text | foo_similarity | bar_similarity | foobar_similarity
----+-------------+----------+-----------------+----------------+----------------+-------------------
2 | 12345-54321 | Snake | that_##$%_snake | 0.5 | 0 | 0
3 | 12345 | Sally | 24-7avocado | 1 | 0 | 0
1 | 123456 | eddie | something0987 | 0.625 | 0 | 0
(3 rows)
Clearly in this case, id #3 (Sally) is the best match (exact, as it happens); this is the row I'd like to return first.
However, since I don't know ahead of time that foo_text is going to be the column with the best match, I don't know how to define the ORDER BY clause.
I figured this would be a common enough problem, but I haven't found any hints in a fair bit of searching SO and DDG.
How can I always rank the best-matched row first in the returned set, without knowing which column will provide the best match to the search term?
Use greatest() in the ORDER BY clause:
ORDER BY greatest(similarity('12345', foo_text), similarity('12345', bar_text), similarity('12345', foobar_text)) DESC
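Applied to the query from the question, that could look like this (a sketch; similarity() requires the pg_trgm extension the question is already using):

```sql
SELECT f.id, f.foo_text, f.bar_text, f.foobar_text
FROM some_schema.foo_table f
WHERE f.foo_text ILIKE '%12345%'
   OR f.bar_text ILIKE '%12345%'
   OR f.foobar_text ILIKE '%12345%'
ORDER BY greatest(similarity('12345', f.foo_text),
                  similarity('12345', f.bar_text),
                  similarity('12345', f.foobar_text)) DESC;
```

With the sample data, Sally's exact match (similarity 1 in foo_text) sorts first, regardless of which column produced the best score.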

SQL Query to replace multiple values in template from many-to-many table

I want to translate a template in an SQL query. Let's assume there are the following four tables: state, stateproperty, state_stateproperty and translation:
state_stateproperty
| state_id | stateproperties_id |
|----------|--------------------|
| 1        | 2                  |
| 1        | 3                  |
stateproperty
| id | key          | value |
|----|--------------|-------|
| 2  | ${firstName} | John  |
| 3  | ${lastName}  | Doe   |
state
| id | template |
|----|----------|
| 1  | template |
translation
| language | messageId | value                           |
|----------|-----------|---------------------------------|
| en       | template  | ${lastName}, ${firstName} alarm |
The aim is to get a new entity named translatedstate that includes the translated template of the state. In this example the translated template would be: "Doe, John alarm". How can you join a many-to-many table in native SQL and translate the state's template with the values of its related state properties?
To be honest, I would create a little function that loops through your state properties and cumulatively replaces each wildcard found in the template with its text.
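Such a function could be sketched like this (the function name translated_template is my own; it assumes the four tables exactly as shown in the question):

```sql
CREATE OR REPLACE FUNCTION translated_template(_state_id integer, _language text)
RETURNS text AS
$$
DECLARE
    _template text;
    _prop     record;
BEGIN
    -- fetch the translated template for this state
    SELECT tr.value
    INTO _template
    FROM state s
    JOIN translation tr ON tr.messageid = s.template
    WHERE s.id = _state_id
      AND tr.language = _language;

    -- cumulatively replace every ${...} placeholder with its value;
    -- stateproperty.key already contains the full '${...}' string
    FOR _prop IN
        SELECT sp.key, sp.value
        FROM state_stateproperty ssp
        JOIN stateproperty sp ON sp.id = ssp.stateproperties_id
        WHERE ssp.state_id = _state_id
    LOOP
        _template := replace(_template, _prop.key, _prop.value);
    END LOOP;

    RETURN _template;
END
$$ LANGUAGE plpgsql;

-- SELECT translated_template(1, 'en');  -- 'Doe, John alarm'
```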
But I had some fun solving it in a query. I am not sure it covers all special cases, but it works for your example:
demo:db<>fiddle
SELECT
    string_agg(                                       -- 8
        regexp_replace(split_key, '^.*\}', value),    -- 7
        '' ORDER BY row_number
    )
FROM (
    SELECT
        s.id,
        sp.value,
        substring(key, 3) AS s_key,                   -- 5
        split_table.*
    FROM translation t
    JOIN statechange sc ON t.messageid = sc.completemessagetemplateid   -- 1
    JOIN state s ON s.id = sc.state_id
    JOIN state_stateproperty ssp ON s.id = ssp.state_id
    JOIN stateproperty sp ON ssp.stateproperties_id = sp.id
    JOIN translation stnme ON s.nameid = stnme.messageid
    CROSS JOIN
        regexp_split_to_table(                        -- 3
            -- 2
            replace(t.messagetranslation, '${state}', stnme.messagetranslation),
            '\$\{'
        ) WITH ORDINALITY AS split_table(split_key, row_number)   -- 4
    WHERE t.language = 'en'
) s
WHERE position(s_key in split_key) = 1 AND split_key != ''        -- 6
GROUP BY id                                                       -- 8
1. Simply join the tables together (for next time, you could simplify your example a bit so that we don't have to create all these tables; I am sure you know how to join).
2. Hard-replace the ${state} variable with the state's nameid.
3. This splits the template string every time a ${ is found, so it creates a new row beginning with a certain wildcard. Note that ${firstName} becomes firstName} because the delimiter is deleted by the split.
4. Adds a row count as a criterion for how the rows are ordered when they are aggregated later in (8). WITH ORDINALITY only works as part of the FROM clause, so the whole function has been added there with a join.
5. Because of (3), I strip the ${ part from the keys as well, so they can be parsed and compared more easily later (in 6).
6. Because (3) creates too many rows (cross join), I keep only those where the key is the first wildcard of the split string. All others are wrong.
7. Now I replace the wildcard with its value.
8. Because we have only one wildcard per row, we need to merge them back into one string (grouped by state id). To achieve the right order, we use the row number from (4).

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, it's a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header, and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes, which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible; I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and use the view as a way of joining repeatedly on those two columns. However, I have some problems:
- I don't know how many sections will occur in the file for the same object.
- The file can contain other objects as well, so joining on the first two columns would be problematic if you have something like:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I had to basically write a recursive descent parser for the specific file format, and it simply took too long because I had to get the data into the database afterwards and that was too much for Entity Framework. It was taking hours just to convert one file since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,1,x,y,z,t,u,v                   |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,2,q,r,s,l,m,n                   |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | ,lmn,opq,rst                     |
| 7  | ,,         | ,1,t,u,v                 | ,1,t,u,v                         |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want. For instance, in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5, it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
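A rough T-SQL sketch of that approach (the prefix list is hand-coded as described; #lines stands in for your loaded file table, and the trailing ,... runs are dropped from the sample for brevity):

```sql
CREATE TABLE #lines (ID int, Line varchar(8000));
INSERT INTO #lines VALUES
    (1, '3,Formula,1,2,3,4'), (2, '*,record,abc,efg,hij'),
    (3, ',,1,x,y,z'), (4, ',,2,q,r,s'),
    (5, '3,Formula,5,6,7,8'), (6, '*,record,lmn,opq,rst'),
    (7, ',,1,t,u,v'), (8, ',,2,l,m,n'),
    (9, ',4,Data,1,2,3,4');

-- LEFT-ish join of hand-selected prefixes to the lines that start with
-- them, group-concatenating the remainders with STUFF..FOR XML.
-- The ID carried by each prefix is the "first occurrence" choice (e.g. 9).
SELECT p.ID,
       p.prefix + STUFF((
           SELECT ',' + STUFF(l.Line, 1, LEN(p.prefix), '')  -- strip the prefix
           FROM #lines l
           WHERE l.Line LIKE p.prefix + '%'
           ORDER BY l.ID
           FOR XML PATH('')
       ), 1, 1, '') AS Line                                  -- drop leading comma
FROM (VALUES (1, '3,Formula,'), (2, '*,record,'),
             (3, ',,1,'), (4, ',,2,'), (9, ',4,Data,')
     ) p(ID, prefix);
```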
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
Here is a sample, but it differs somewhat from what you need.
That is because I use the value up to the second comma as the group header, so ,,1 and ,,2 are treated as the same group; if you can use a parent id to indicate a group, that would be better.
DECLARE @testdata TABLE(ID int, Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'
;WITH t AS(
SELECT *,REPLACE(SUBSTRING(t.Line,LEN(c.header)+1,LEN(t.Line)),',...','') AS data
FROM @testdata AS t
CROSS APPLY(VALUES(LEFT(t.Line,CHARINDEX(',',t.Line, CHARINDEX(',',t.Line)+1 )))) c(header)
)
SELECT MIN(ID) AS ID,t.header,c.x,t.header+STUFF(c.x,1,1,'') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('') ) c(x)
GROUP BY t.header,c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+

Flattening a relation with an array to emit one row per array entry

Given a table defined as such:
CREATE TABLE test_values(name TEXT, "values" INTEGER[]);  -- "values" is a reserved word and must be quoted
...and the following values:
| name | values |
+-------+---------+
| hello | {1,2,3} |
| world | {4,5,6} |
I'm trying to find a query which will return:
| name | value |
+-------+-------+
| hello | 1 |
| hello | 2 |
| hello | 3 |
| world | 4 |
| world | 5 |
| world | 6 |
I've reviewed the upstream documentation on accessing arrays, and tried to think about what a solution using the unnest() function would look like, but have been coming up empty.
An ideal solution would be easy to use even in cases where there were a significant number of columns other than the array being expanded and no primary key. Handling a case with more than one array is not important.
We can put the set-returning function unnest() into the SELECT list like Raphaël suggests. This used to exhibit corner case problems before Postgres 10. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Since Postgres 9.3 we can also use a LATERAL join for this. It is the cleaner, standard-compliant way to put set-returning functions into the FROM list, not into the SELECT list:
SELECT name, value
FROM tbl, unnest("values") value; -- implicit CROSS JOIN LATERAL
One subtle difference: this drops rows with empty / NULL values from the result since unnest() returns no row, while the same is converted to a NULL value in the FROM list and returned anyway. The 100 % equivalent query is:
SELECT t.name, v.value
FROM tbl t
LEFT JOIN unnest(t."values") v(value) ON true;
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
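A quick self-contained demonstration of that difference, using the question's table plus one extra row with a NULL array (the 'empty' row is my addition):

```sql
CREATE TABLE test_values(name text, "values" integer[]);
INSERT INTO test_values VALUES
    ('hello', '{1,2,3}'),
    ('world', '{4,5,6}'),
    ('empty', NULL);   -- added to show the difference

-- drops the 'empty' row entirely: unnest() returns no rows for NULL
SELECT t.name, v.value
FROM test_values t, unnest(t."values") v(value);

-- keeps the 'empty' row as ('empty', NULL)
SELECT t.name, v.value
FROM test_values t
LEFT JOIN unnest(t."values") v(value) ON true;
```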
Well, you give the data, the doc, so... let's mix it ;)
select
    name,
    unnest("values") as value
from test_values
see SqlFiddle

Crosstab splitting results due to presence of unrelated field

I'm using postgres 9.1 with tablefunc:crosstab
I have a table with the following structure:
CREATE TABLE marketdata.instrument_data (
    dt         date NOT NULL,
    instrument text NOT NULL,
    field      text NOT NULL,
    value      numeric,
    CONSTRAINT instrument_data_pk PRIMARY KEY (dt, instrument, field)
);
This is populated by a script that fetches data daily. So it might look like so:
| dt | instrument | field | value |
|------------+-------------------+-----------+-------|
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1 |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT | 2 |
I then use the following crosstab query to pivot the table:
select dt, instrument, vol, oi
FROM crosstab($$
select dt, instrument, field, value
from marketdata.instrument_data
where field = 'PX_VOLUME' or field = 'OPEN_INT'
$$::text, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$::text
) vol(dt date, instrument text, vol numeric, oi numeric);
Running this I get the result:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | 2 |
The problem:
When running this with lot of real data in the table, I noticed that for some fields the function was splitting the result over two rows:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | |
| 2014-05-23 | SGX.MiniJGB.2014U | | 2 |
I checked that the dt and instrument fields were identical and produced a work-around by grouping the output of the crosstab.
Analysis
I've discovered that it's the presence of one other entry in the input table that causes the output to be split over 2 rows. If I have the input as follows:
| dt | instrument | field | value |
|------------+-------------------+-----------+-------|
| 2014-04-23 | EUX.Bund.2014M | PX_VOLUME | 0 |
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1 |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT | 2 |
I get:
| dt | instrument | vol | oi |
|------------+-------------------+-----+----|
| 2014-04-23 | EUX.Bund.2014M | 0 | |
| 2014-05-23 | SGX.MiniJGB.2014U | 1 | |
| 2014-05-23 | SGX.MiniJGB.2014U | | 2 |
Where it gets really weird...
If I recreate the above input table manually then the output is as we would expect, combined into a single row.
If I run:
update marketdata.instrument_data
set instrument = instrument
where instrument = 'EUX.Bund.2014M'
Then again, the output is as we would expect, which is surprising as all I've done is set the instrument field to itself.
So I can only conclude that there is some hidden character/encoding issue in that Bund entry that is breaking crosstab.
Are there any suggestions as to how I can determine what it is about that entry that breaks crosstab?
Edit:
I ran the following on the raw table to try and see any hidden characters:
select instrument, encode(instrument::bytea, 'escape')
from marketdata.bloomberg_future_data_temp
where instrument = 'EUX.Bund.2014M';
And got:
| instrument | encode |
|----------------+----------------|
| EUX.Bund.2014M | EUX.Bund.2014M |
Two problems.
1. ORDER BY is required.
The manual:
In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered, that is, values with the same row_name are brought together and correctly ordered within the row.
With the one-parameter form of crosstab(), ORDER BY 1,2 would be necessary.
2. One column with distinct values per group.
The manual:
crosstab(text source_sql, text category_sql)
source_sql is a SQL statement that produces the source set of data.
...
This statement must return one row_name column, one category column,
and one value column. It may also have one or more "extra" columns.
The row_name column must be first. The category and value columns must
be the last two columns, in that order. Any columns between row_name
and category are treated as "extra". The "extra" columns are expected
to be the same for all rows with the same row_name value.
Bold emphasis mine. One column. It seems like you want to form groups over two columns, which does not work as you desire.
Related answer:
Pivot on Multiple Columns using Tablefunc
The solution depends on what you actually want to achieve. It's not in your question; you silently assumed the function would do what you hope for.
Solution
I guess you want to group on both leading columns: (dt, instrument). You could play tricks with concatenation or arrays, but that would be slow and/or unreliable. I suggest a cleaner and faster approach with the window function rank() or dense_rank() to produce a single-column unique value per desired group. This is very cheap, because ordering rows is the main cost and the order of the frame is identical to the required order anyway. You can remove the added column in the outer query if desired:
SELECT dt, instrument, vol, oi
FROM crosstab(
$$SELECT dense_rank() OVER (ORDER BY dt, instrument) AS rnk
, dt, instrument, field, value
FROM marketdata.instrument_data
WHERE field IN ('PX_VOLUME', 'OPEN_INT')
ORDER BY 1$$
, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$
) vol(rnk int, dt date, instrument text, vol numeric, oi numeric);
More details:
PostgreSQL Crosstab Query
You could run a query that replaces irregular characters with an asterisk:
select regexp_replace(instrument, '[^a-zA-Z0-9]', '*', 'g')
from marketdata.instrument_data
where instrument = 'EUX.Bund.2014M'
Perhaps the instrument = instrument assignment discards trailing whitespace. That would also explain why where instrument = 'EUX.Bund.2014M' matches two values that crosstab sees as different.
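One way to test that hypothesis directly is to compare character and byte lengths of the stored values against the visible string (a quick diagnostic, not part of the original answers):

```sql
SELECT DISTINCT instrument,
       length(instrument)       AS chars,  -- > 14 reveals extra characters such as trailing whitespace
       octet_length(instrument) AS bytes   -- > chars reveals multibyte (hidden) characters
FROM marketdata.instrument_data
WHERE btrim(instrument) = 'EUX.Bund.2014M';
```

'EUX.Bund.2014M' is 14 characters, so any row reporting more than 14 chars, or more bytes than chars, carries the invisible difference that makes crosstab treat the rows as distinct.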