How to split string into rows by number of characters in BigQuery? - sql

If I have a table, for example mydataset.itempf, containing:
id | item
1 | ABCDEFGHIJKL
2 | ZXDFKDLFKFGF
And I would like the "item" field to be split by 4 characters into different rows like:
id | item
1 | ABCD
1 | EFGH
1 | IJKL
2 | ZXDF
2 | KDLF
2 | KFGF
How can I write this in BigQuery? Please help.

Consider the approach below:
select id, item
from your_table,
unnest(regexp_extract_all(item, r'.{1,4}')) item
If applied to the sample data in your question, the output matches the expected result above.

Use SUBSTRING with a character count; that should make it easier to see which items are longer than others.
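For example, a rough sketch of that idea in BigQuery, using SUBSTR with generated start offsets (your_table stands in for the source table):
select id, substr(item, pos, 4) as item
from your_table,
-- generate_array(1, length(item), 4) yields the start offsets 1, 5, 9, ...
unnest(generate_array(1, length(item), 4)) as pos
order by id, pos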

Related

How can I return the best matched row first in sort order from a set returned by querying a single search term against multiple columns in Postgres?

Background
I have a Postgres 11 table like so:
CREATE TABLE some_schema.foo_table (
  id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  bar_text TEXT,
  foo_text TEXT,
  foobar_text TEXT
);
It has some data like this:
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('eddie', '123456', 'something0987');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Snake', '12345-54321', 'that_##$%_snake');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Sally', '12345', '24-7avocado');
id | bar_text | foo_text | foobar_text
----+----------+-------------+-----------------
1 | eddie | 123456 | something0987
2 | Snake | 12345-54321 | that_##$%_snake
3 | Sally | 12345 | 24-7avocado
The problem
I need to query each one of these columns and compare the values to a given term (passed in as an argument from app logic), and make sure the best-matched row (considering comparison with all the columns, not just one) is returned first in the sort order.
There is no way to know in advance which of the columns is likely to be a better match for the given term.
If I compare the given term to each value using the similarity() function, I can see at a glance which row has the best match in any of the three columns and can see that's the one I would want ranked first in the sort order.
SELECT
  f.id,
  f.foo_text,
  f.bar_text,
  f.foobar_text,
  similarity('12345', foo_text) AS foo_similarity,
  similarity('12345', bar_text) AS bar_similarity,
  similarity('12345', foobar_text) AS foobar_similarity
FROM some_schema.foo_table f
WHERE
  (
    f.foo_text ILIKE '%12345%'
    OR
    f.bar_text ILIKE '%12345%'
    OR
    f.foobar_text ILIKE '%12345%'
  )
;
id | foo_text | bar_text | foobar_text | foo_similarity | bar_similarity | foobar_similarity
----+-------------+----------+-----------------+----------------+----------------+-------------------
2 | 12345-54321 | Snake | that_##$%_snake | 0.5 | 0 | 0
3 | 12345 | Sally | 24-7avocado | 1 | 0 | 0
1 | 123456 | eddie | something0987 | 0.625 | 0 | 0
(3 rows)
Clearly in this case, id #3 (Sally) is the best match (exact, as it happens); this is the row I'd like to return first.
However, since I don't know ahead of time that foo_text is going to be the column with the best match, I don't know how to define the ORDER BY clause.
I figured this would be a common enough problem, but I haven't found any hints in a fair bit of searching SO and DDG.
How can I always rank the best-matched row first in the returned set, without knowing which column will provide the best match to the search term?
Use greatest() in the ORDER BY clause:
greatest(similarity('12345', foo_text), similarity('12345', bar_text), similarity('12345', foobar_text)) desc
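Plugged into the query from the question, that would look something like this (a sketch; the WHERE clause and the search term are the ones from the question):
SELECT f.id, f.foo_text, f.bar_text, f.foobar_text
FROM some_schema.foo_table f
WHERE f.foo_text ILIKE '%12345%'
   OR f.bar_text ILIKE '%12345%'
   OR f.foobar_text ILIKE '%12345%'
ORDER BY greatest(similarity('12345', f.foo_text),
                  similarity('12345', f.bar_text),
                  similarity('12345', f.foobar_text)) DESC;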

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, it's a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes, which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible, I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and use that view as a way of joining repeatedly on those two columns. However, I have some problems:
I don't know how many sections will occur in the file for the same object.
The file can contain other objects as well, so joining on the first two columns would be problematic if you have something like:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I basically had to write a recursive descent parser to parse the specific file format, and it simply took too long because I had to get it into the database afterwards and it was too much for Entity Framework. It was taking hours just to convert one file since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this:
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,1,x,y,z,t,u,v                   |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,2,q,r,s,l,m,n                   |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | ,lmn,opq,rst                     |
| 7  | ,,         | ,1,t,u,v                 | ,1,t,u,v                         |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want; for instance, in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5 it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
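A rough sketch of that prefix-table approach (your_table(ID, Line) stands in for wherever the raw lines live, the ',...' tails are assumed to be literal text, and the ID handling discussed above is left out; note it shares the cross-object caveat from the question):
DECLARE @prefixes TABLE (prefix varchar(100));
INSERT INTO @prefixes VALUES
    ('3,Formula,'), ('*,record,'), (',,1,'), (',,2,'), (',4,Data,');

-- for each prefix, gather every line starting with it, strip the prefix and the
-- ',...' tail, and concatenate the remainders back together with FOR XML PATH
SELECT p.prefix +
       STUFF((SELECT ',' + REPLACE(STUFF(t.Line, 1, LEN(p.prefix), ''), ',...', '')
              FROM your_table AS t
              WHERE t.Line LIKE p.prefix + '%'
              ORDER BY t.ID
              FOR XML PATH('')), 1, 1, '') AS Line
FROM @prefixes AS p;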
Here is a sample, though it differs a bit from what you need.
That is because I use the value up to the second comma as the group header, so ,,1 and ,,2 are treated as the same group; it would be better if you could use a parent id to indicate a group.
DECLARE @testdata TABLE(ID int,Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'
;WITH t AS(
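-- header = everything up to and including the second comma;
-- data = the rest of the line, with the ',...' tail removed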
SELECT *,REPLACE(SUBSTRING(t.Line,LEN(c.header)+1,LEN(t.Line)),',...','') AS data
FROM @testdata AS t
CROSS APPLY(VALUES(LEFT(t.Line,CHARINDEX(',',t.Line, CHARINDEX(',',t.Line)+1 )))) c(header)
)
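-- for each header, concatenate the data of every row sharing that header
-- (FOR XML PATH) and prepend the header; STUFF drops the leading comma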
SELECT MIN(ID) AS ID,t.header,c.x,t.header+STUFF(c.x,1,1,'') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('') ) c(x)
GROUP BY t.header,c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+

sqlite string replace/delete

I have a column named tags in my database table which contains comma-separated strings, and it has records like this:
index | tags
-------------
1 | a,b,c
2 | b
3 | c
4 | z
5 | b,a,c
6 | p,f,w
7 | a,c,b
(For simplicity I am denoting strings with single characters.)
Now I want to replace/delete a particular string.
Delete - say I want to delete b from all rows. If the tags column becomes empty after this operation, that row/record should be deleted (index 2 in this case). My records should look like this after the operation:
index | tags
-------------
1 | a,c
3 | c
4 | z
5 | a,c
6 | p,f,w
7 | a,c
Replace - say I want to replace all a with k in the original records:
index | tags
-------------
1 | k,b,c
2 | b
3 | c
4 | z
5 | b,k,c
6 | p,f,w
7 | k,c,b
Question - I am thinking of using the replace function somehow, but I am not sure how to meet the above requirement with it. Can I do this in a single SQL command? If not, please suggest the best way to do this (maybe multiple SQL commands).
I use MSSQL and am not sure about SQLite, but you can use the REPLACE function like this:
To remove b:
UPDATE Your_Table
SET tags = REPLACE(REPLACE(tags, ',b', ''), 'b,', '')
DELETE FROM Your_Table
WHERE tags = 'b'
To replace a with k:
UPDATE Your_Table
SET tags = REPLACE(tags, 'a', 'k')
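If the real tags are longer than one character, a common trick (a sketch in SQLite syntax, same table and column names as above) is to pad the list with commas so only whole tags match:
-- the surrounding commas stop 'a' from matching inside longer tags such as 'abc'
UPDATE Your_Table
SET tags = TRIM(REPLACE(',' || tags || ',', ',a,', ',k,'), ',');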

PostgreSQL: Distribute rows evenly and according to frequency

I have trouble with a complex ordering problem. I have the following example data:
table "categories"
id | frequency
1 | 0
2 | 4
3 | 0
table "entries"
id | category_id | type
1 | 1 | a
2 | 1 | a
3 | 1 | a
4 | 2 | b
5 | 2 | c
6 | 3 | d
I want to put the entries rows in an order so that category_id and type are distributed evenly.
More precisely, I want to order entries in a way that:
category_ids that refer to a category that has frequency=0 are
distributed evenly - so that a row is followed by a different category_id
whenever possible. e.g. category_ids of rows: 1,2,1,3,1,2.
Rows with category_ids of categories with frequency<>0 should
be inserted from ca. the beginning with a minimum of frequency rows between them
(the gaps should vary). In my example these are rows with category_id=2.
So the result could start with row id #1, then #4, then a minimum of 4 rows of other
categories, then #5.
In the end result, rows with the same type should not be next to each other.
Example result:
id | category_id | type
1 | 1 | a
4 | 2 | b
2 | 1 | a
6 | 3 | d
.. some other row ..
.. some other row ..
.. some other row ..
5 | 2 | c
entries are like a stream of things the user gets (one at a time).
The whole ordering should give users some variation. It's just there so they are not
presented with similar entries all the time, so it doesn't have to be perfect.
The query also does not have to give the same result on each call - using
random() is totally fine.
frequencies are there to give entries of certain categories a higher
priority so that they are not distributed across the whole range, but are placed more
at the beginning of the result list. Even if there are a lot of these entries, they
should not completely crowd out the frequency=0 entries at the beginning, though.
I'm not sure how to start this. I think I can use window functions and
ntile() to distribute rows by category_id and type.
But I have no idea how to insert the non-0-category-entries afterwards.
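A minimal sketch of that window-function idea, covering only the even distribution of category_ids (the frequency<>0 placement and the type rule would still need to be layered on top):
-- number the rows within each category, then order by that rank first so the
-- categories alternate round-robin; random() varies the result between calls
SELECT e.*
FROM entries e
ORDER BY row_number() OVER (PARTITION BY e.category_id ORDER BY random()),
         random();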

How to sort an SQL result using a predefined series of rows

I have a table like this one:
--------------------------------
id | name
--------------------------------
1 | aa
2 | aa
3 | aa
4 | aa
5 | bb
6 | bb
... one million more ...
and I would like to obtain an arbitrary number of rows in a predefined sequence, with the other rows ordered by their name. E.g. in another table I have a short sequence with 3 ids:
sequ_no | id | pos
-----------------------
1 | 3 | 0
1 | 1 | 1
1 | 2 | 2
2 | 65535 | 0
2 | 45 | 1
... one million more ...
Sequence 1 defines the following series of ids: [3, 1, 2]. How do I obtain the three rows of the first table in this order, with the rest of the rows ordered by their name ascending?
How would this be done in PostgreSQL, and how in MySQL? And what would a solution look like in HQL (Hibernate Query Language)?
An idea I have is to first query and sort the rows which are defined in the sequence and then concatenate the other rows which are not in the sequence, but this involves two queries. Can it be done with one?
Update: The final result for the sample sequence [3, 1, 2] (as defined above) should look like this:
id | name
----------------------------------
3 | aa
1 | aa
2 | aa
4 | aa
5 | bb
6 | bb
... one million more ...
I need this query to create pagination through a product table where part of the sequence of products is a predefined sequence and the rest of the products will be ordered by a clause I don't know yet.
I'm not sure I understand the exact requirement, but won't this work:
SELECT ids.id, ids.name
FROM ids_table ids LEFT OUTER JOIN sequences_table seq
ON ids.id = seq.id
ORDER BY seq.sequ_no, seq.pos, ids.name, ids.id
One way: assign a position (e.g. 0) to each id that doesn't have a position yet, UNION the result with the second table, join the result with the first table, and ORDER BY seq_no, pos, name.
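A sketch of that idea in PostgreSQL (the table names items and sequ are assumptions; instead of 0, a large sentinel position is used here so that the rows outside the sequence sort after it):
SELECT t.id, t.name
FROM items t
JOIN (
    SELECT id, pos FROM sequ WHERE sequ_no = 1
    UNION ALL
    SELECT id, 1000000 AS pos               -- default position for ids not in sequence 1
    FROM items
    WHERE id NOT IN (SELECT id FROM sequ WHERE sequ_no = 1)
) p ON p.id = t.id
ORDER BY p.pos, t.name, t.id;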