Select any(array) in any(array) - sql

I have a bigint[] colunm:
person
------
id | name | other_information
------------------------------
1 | Zé | {1,2,3}
2 | João | {1,3}
3 | Maria | {3,5}
I need select persons with 2 or 5 in other_information. How?

select *
from person
where 2 = ANY(other_information)
or 5 = ANY(other_information)
However, using arrays is not recommended, as it results in a denormalized schema.
From the PostgreSQL docs:
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.

Related

How to split string into rows by number of characters in Bigquery?

if I have a table for example:
mydataset.itempf containing:
id | item
1 | ABCDEFGHIJKL
2 | ZXDFKDLFKFGF
And I would like the "item" field to be split by 4 characters into different rows like:
id | item
1 | ABCD
1 | EFGH
1 | IJKL
2 | ZXDF
2 | KDLF
2 | KFGF
How can I write this in bigquery? Please help.
Consider below approach
select id, item
from your_table,
unnest(regexp_extract_all(item, r'.{1,4}')) item
if applied to sample data in your question - output is
Use the Substring with the Count Method, That Should make it easier to see which ones are longer than others.

Sort SQL results and include missing keys

I have a Postgres table like this (greatly simplified):
id | object_id (foreign id) | key (text) | value (text)
1 | 1 | A | 0foo
2 | 1 | B | 1bar
3 | 1 | C | 2baz
4 | 1 | D | 3ham
5 | 2 | C | 4sam
6 | 3 | F | 5pam
…
(billions of rows)
I select object_ids according to some query (not relevant here), and then sort them according to the value of a specified key.
def sort_query_result(query, sort_by, limit, offset):
return query\
.with_entities(Table.object_id)\
.filter(Table.key == sort_by)\
.order_by(desc(Table.value))\
.limit(limit).offset(offset).subquery()
For example, assume a query matches object_ids 1 and 2 above. When sort_by=C, I want the result to be returned in the order [2, 1], because 4sam > 2baz.
This works well but there's one big problem:
Object ids that are returned by query but do not have any row for the sort_by key, are not returned at all.
For example, for a query that matches object_ids 1 and 2, sort_query_results(query, sort_by='D') == [1]. The object_id 2 is dropped because it has no D, which is undesirable.
Instead, I'd like to return all object_ids from the query. Those without the sort key should be sorted at the end, in any order: sort_query_results(query, sort_by='D') == [1, 2].
What's the best way to achieve that?
Note: I do not have the freedom to change the DB schema or business logic. But I can change the query code. I use SQLAlchemy ORM from Python, but could execute raw Postgres commands if necessary. Thank you.

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, its a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know its probably possible, I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns in a view, and use the view as a way of joining in a repeated fashion on those two columns. However, I have some problems:
I don't know how many sections will occur in the file for the same
object
The file can contain other objects as well so joining on the first two columns would be problematic if you have something like
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I had to basically write a recursive decent parser to parse the specific file format and it simply took to long because I had to get it in the database afterwards and it was too much for entity framework. It was taking hours just to convert one file since these files are excessively large.
Either way, #Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+------------------------------------------+-----------------------+
| ID | header | x | value
|
+----+------------+------------------------------------------+-----------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 |3,Formula,1,2,3,4,5,6,7,8 |
| 2 | ,, | ,1,x,y,z,t,u,v | ,1,x,y,z,t,u,v |
| 3 | ,, | ,2,q,r,s,l,m,n | ,2,q,r,s,l,m,n |
| 4 | *,record, | ,abc,efg,hij,lmn,opq,rst |*,record,abc,efg,hij,lmn,opq,rst |
| 5 | ,4, | ,Data,1,2,3,4 |,4,Data,1,2,3,4 |
| 6 | *,record, | ,lmn,opq,rst | ,lmn,opq,rst |
| 7 | ,, | ,1,t,u,v | ,1,t,u,v |
+----+------------+------------------------------------------+-----------------------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want, for instance in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrance of that sub-string? Code accordingly. If you want 5 it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
Here is a sample, but has some different with you need.
It is because I use the value the second comma as group header, so the ,,1 and ,,2 will be treated as same group, if you can use a parent id to indicated a group will be better
DECLARE #testdata TABLE(ID int,Line varchar(8000))
INSERT INTO #testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'
;WITH t AS(
SELECT *,REPLACE(SUBSTRING(t.Line,LEN(c.header)+1,LEN(t.Line)),',...','') AS data
FROM #testdata AS t
CROSS APPLY(VALUES(LEFT(t.Line,CHARINDEX(',',t.Line, CHARINDEX(',',t.Line)+1 )))) c(header)
)
SELECT MIN(ID) AS ID,t.header,c.x,t.header+STUFF(c.x,1,1,'') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('') ) c(x)
GROUP BY t.header,c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+

Searching a "vertical" table in SQLite

Tables are usually laid out in a "horizontal" fashion:
+-----+----+----+--------+
|recID|FirstName|LastName|
+-----+----+----+--------+
| 1 | Jim | Jones |
+-----+----+----+--------+
| 2 | Adam | Smith |
+-----+----+----+--------+
Here, however, is a table with the same data in a "vertical" layout:
+-----+-----+----+-----+-------+
|rowID|recID| Property | Value |
+-----+-----+----+-----+-------+
| 1 | 1 |FirstName | Jim | \
+-----+-----+----+-----+-------+ These two rows constitute a single logical record
| 2 | 1 |LastName | Jones | /
+-----+-----+----+-----+-------+
| 3 | 2 |FirstName | Adam | \
+-----+-----+----+-----+-------+ These two rows are another single logical record
| 4 | 2 |LastName | Smith | /
+-----+-----+----+-----+-------+
Question: In SQLite, how can I search the vertical table efficiently and in such a way that recIDs are not duplicated in the result set? That is, if multiple matches are found with the same recID, only one (any one) is returned?
Example (incorrect):
SELECT rowID from items WHERE "Value" LIKE "J%"
returns of course two rows with the same recID:
1 (Jim)
2 (Jones)
What is the optimal solution here? I can imagine storing intermediate results in a temp table, but hoping for a more efficient way.
(I need to search through all properties, so the SELECT cannot be restricted with e.g. "Property" = "FirstName". The database is maintained by a third-party product; I suppose the design makes sense because the number of property fields is variable.)
To avoid duplicate rows in the result returned by a SELECT, use DISTINCT:
SELECT DISTINCT recID
FROM items
WHERE "Value" LIKE 'J%'
However, this works only for the values that are actually returned, and only for entire result rows.
In the general case, to return one result record for each group of table records, use GROUP BY to create such groups.
For any column that does not appear in the GROUP BY clause, you then have to choose which rowID in the group to return; here we use MIN:
SELECT MIN(rowID)
FROM items
WHERE "Value" LIKE 'J%'
GROUP BY recID
To make this query more efficient, create an index on the recID column.

Advance Query with Join

I'm trying to convert a product table that contains all the detail of the product into separate tables in SQL. I've got everything done except for duplicated descriptor details.
The problem I am having all the products have size/color/style/other that many other products contain. I want to only have one size or color descriptor for all the items and reuse the "ID" for all the product which I believe is a Parent key to the Product ID which is a ...Foreign Key. The only problem is that every descriptor would have multiple Foreign Keys assigned to it. So I was thinking on the fly just have it skip figuring out a Foreign Parent key for each descriptor and just check to see if that descriptor exist and if it does use its Key for the descriptor.
Data Table
PI Colo Sz OTHER
1 | Blue | 5 | Vintage
2 | Blue | 6 | Vintage
3 | Blac | 5 | Simple
4 | Blac | 6 | Simple
===================================
Its destination table is this
===================================
DI Description
1 | Blue
2 | Blac
3 | 5
4 | 6
6 | Vintage
7 | Simple
=============================
Select Data.Table
Unique.Data.Table.Colo
Unique.Data.Table.Sz
Unique.Data.Table.Other
=======================================
Then the dual part of the questions after we create all the descriptors how to do a new query and assign the product ID to the descriptors.
PI| DI
1 | 1
1 | 3
1 | 4
2 | 1
2 | 3
2 | 4
By figuring out how to do this I should be able to duplicate this pattern for all 300 + columns in the product. Some of these fields are 60+ characters large so its going to save a ton of space.
Do I use a Array?
Okay, if I understand you correctly, you want all unique attributes converted from columns into rows in a single table (detailstable) that has an id and a description field:
Assuming the schema:
datatable
------------------
PI [PK]
Colo
Sz
OTHER
detailstable
------------------
DI [PK]
Description
You can first get all of the unique attributes into its own table with:
INSERT INTO detailstable (Description)
SELECT
a.description
FROM
(
SELECT DISTINCT Colo AS description
FROM datatable
UNION
SELECT DISTINCT Sz AS description
FROM datatable
UNION
SELECT DISTINCT OTHER AS description
FROM datatable
) a
Then to link up the datatable to the detailstable, I'm assuming you have a cross-reference table defined like:
datadetails
------------------
PI [PK]
DI [PK]
You can then do:
INSERT INTO datadetails (PI, DI)
SELECT
a.PI
b.DI
FROM
datatable a
INNER JOIN
detailstable b ON b.Description IN (a.Colo, a.Sz, a.OTHER)
I reckon you want to split description table for different categories, like - colorDescription, sizeDescription etc.
If that is not practical then I would recommend having an extra column showing an category attribute:
DI Description Category
1 | Blue | Color
2 | Blac | Color
3 | 5 | Size
4 | 6 | Size
6 | Vintage | Other
7 | Simple | Other
And then have primary key in this table as combination of ID and Category column.
This will have less chances for injecting any data errors. It will be also easy to track that down.