UNNEST list of integers in a BigQuery query returns ```TypeError: Object of type int64 is not JSON serializable``` - sql

I am querying a join of tables in BigQuery for a specific list of ids of type INT64, and I cannot figure out what to do, as I keep getting the following error:
TypeError: Object of type int64 is not JSON serializable
My query looks like this:
client = bigquery.Client(credentials=credentials)
query = """
SELECT t1.*,
t2.*,
t3.*,
t4.* FROM `<project>.<dataset>.<tabel1>` as t1
join `<project>.<dataset>.<tabel2>` as t2
on t1.label = t2.id
join `<project>.<dataset>.<tabel3>` as t3
on t3.A = t2.A
join `<project>.<dataset>.<tabel4>` as t4
on t4.obj= t2.obj and t4.A = t3.A
where t1.id in unnest(@list)
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
bigquery.ArrayQueryParameter("list", "STRING", list),
])
choices= client.query(query, job_config=job_config).to_dataframe()
where my list is of the type:
list = [3651056, 3651049, 3640195, 3629411, 3627024,3624939]
Now, this method works perfectly whenever the list is of the type:
list = ['3651056', '3651049', '3640195', '3629411', '3627024', '3624939']
I have tried casting the column I want to match against before querying, but that implies casting it across the entire table, which contains over 4 billion rows. Not efficient at all.
I would be grateful for any insights on how to solve this.
EDIT:
There is one option, namely to first cast my list to strings and then:
client = bigquery.Client(credentials=credentials)
query = """
SELECT t1.*,
t2.*,
t3.*,
t4.* FROM `<project>.<dataset>.<tabel1>` as t1
join `<project>.<dataset>.<tabel2>` as t2
on t1.label = t2.id
join `<project>.<dataset>.<tabel3>` as t3
on t3.A = t2.A
join `<project>.<dataset>.<tabel4>` as t4
on t4.obj= t2.obj and t4.A = t3.A
where cast(t1.id as STRING) in unnest(@list)
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
bigquery.ArrayQueryParameter("list", "STRING", list),
])
choices= client.query(query, job_config=job_config).to_dataframe()
But is there a more direct way to do this?

As @PratikPatil mentioned in the comments:
During a comparison, both sides need to be of the same data type; otherwise BigQuery raises an error.
Casting is still the best choice for the issue you have with this query. But check whether either side of the data can be stored natively as INT64 (if you are not expecting any strings at all); that avoids the extra cast.
Posting the answer as community wiki, as this is the best practice, for the benefit of community members who encounter this use case in the future.
Feel free to edit this answer for additional information.
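To answer the "more direct way" question: assuming t1.id really is INT64, the column can stay untouched if the array parameter is passed as INT64 instead of STRING. The error itself comes from numpy.int64 scalars (e.g. values taken from a pandas Series), which the client's JSON encoder cannot serialize; plain Python ints are fine. A minimal sketch, with the BigQuery parameter setup shown as a hedged comment using the question's own names:

```python
import json

# numpy.int64 scalars are not JSON serializable, and the BigQuery client
# JSON-encodes query parameters. Converting to builtin ints avoids the error.
ids = [3651056, 3651049, 3640195, 3629411, 3627024, 3624939]
ids = [int(x) for x in ids]  # force plain Python ints (needed when values are numpy.int64)

# Hypothetical parameter setup matching the question (not run here):
# job_config = bigquery.QueryJobConfig(query_parameters=[
#     bigquery.ArrayQueryParameter("list", "INT64", ids),
# ])
# ...and in the SQL:  where t1.id in unnest(@list)

print(json.dumps(ids))
```

This keeps the 4-billion-row column uncast; only the small in-memory list is converted.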

Related

How to cast only the part of a table using a single SQL command in PostgreSQL

In a PostgreSQL table I have several pieces of information stored as text. A type column describes the context, and thus what kind of information is stored. The application is designed to fetch the ids of the rows with a single command.
I got into trouble when I tried to compare the information (a bigint stored as a string) with an external value (e.g. '9' > '11'). When I tried to cast the column, the database returned an error (not all values in the column are castable, e.g. datetimes or plain text). Even when I cast only the result of a sub-query, I get a cast error.
I get the table with the castable rows by this command:
SELECT information.id as id, item.information::bigint as item
FROM information
INNER JOIN item
ON information.id = item.informationid
WHERE information.type = 'task'
The resulting rows contain only text that is castable. But when I wrap it in another command, it results in an error.
SELECT x.id FROM (
SELECT information.id as id, item.information::bigint as item
FROM information
INNER JOIN item
ON information.id = item.informationid
WHERE information.type = 'task'
) AS x
WHERE x.item > '0'::bigint
According to the error, the database tried to cast all rows in the table.
Technically, this happens because the optimizer thinks WHERE x.item > '0'::bigint is a much more efficient filter than information.type = 'task'. So during the table scan, the x.item > '0'::bigint condition is chosen as the predicate. This reasoning is not wrong, but it drops you into this seemingly illogical trouble.
The suggestion by Gordon to use CASE WHEN inf.type = 'task' THEN i.information::bigint END avoids this, but it may defeat the purpose of writing the query as a sub-query and requires the same condition to be written twice.
A funny trick I tried is to use OUTER APPLY:
SELECT x.* FROM (SELECT 1 AS dummy) dummy
OUTER APPLY (
SELECT information.id as id, item.information::bigint AS item
FROM information
INNER JOIN item
ON information.id = item.informationid
WHERE information.type = 'task'
) x
WHERE x.item > '0'::bigint
Sorry that I only verified the SQL Server version of this. I understand PostgreSQL has no OUTER APPLY, but the equivalent should be:
SELECT x.* FROM (SELECT 1 AS dummy) dummy
LEFT JOIN LATERAL (
SELECT information.id as id, item.information::bigint AS item
FROM information
INNER JOIN item
ON information.id = item.informationid
WHERE information.type = 'task'
) x ON true
WHERE x.item > '0'::bigint
(reference is this question)
Finally, a tidier but less flexible method is to add an optimizer hint that disables this optimization, forcing the query to run as written.
This is unfortunate. Try using a case expression:
SELECT inf.id as id,
(CASE WHEN inf.type = 'task' THEN i.information::bigint END) as item
FROM information inf JOIN
item i
ON inf.id = i.informationid
WHERE inf.type = 'task';
There is no guarantee that the WHERE filter is applied before the SELECT. However, CASE does guarantee the order of evaluation, so it is safe.
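The guarantee CASE gives can be mimicked outside SQL with a guarded conversion. A hypothetical Python sketch (function and sample values invented) of "cast only when the type column says the value is numeric":

```python
def safe_bigint(value, row_type):
    # Mirror of: CASE WHEN type = 'task' THEN information::bigint END
    # The conversion is only attempted when the guard passes, so non-numeric
    # rows (dates, free text) can never raise a cast error.
    return int(value) if row_type == "task" else None

rows = [("9", "task"), ("2021-01-01", "note"), ("11", "task")]
items = [safe_bigint(v, t) for v, t in rows]
# Keep only guarded values greater than 0 (the WHERE x.item > '0'::bigint step).
result = [x for x in items if x is not None and x > 0]
print(result)  # [9, 11]
```

The point is the ordering: the guard is evaluated before the conversion, exactly what CASE guarantees and a separate WHERE clause does not.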

Update twice nested repeated record

I'm struggling with this query (a dummy version; there are many more fields in it):
UPDATE
table1 as base
SET
lines =
ARRAY(
SELECT AS STRUCT
b.line_id,
s.purch_id,
ARRAY(
SELECT AS STRUCT
wh.warehouse_id,
s.is_proposed,
FROM table1 as t, UNNEST(lines) as lb, UNNEST(lb.warehouses) as wh
INNER JOIN
(SELECT
l.line_id,
wh.is_proposed
FROM table2, UNNEST(lines) as l, UNNEST(l.warehouses) as wh) as s
ON lb.line_id = s.line_id AND wh.warehouse_id = s.warehouse_id)
FROM table1, UNNEST(lines) as b
INNER JOIN UNNEST(supply.lines) as s
ON b.line_id = s.line_id)
FROM
table2 as supply
WHERE
base.date = supply.date
AND
base.sales_id = supply.sales_id
table1 and table2 have the same nesting:
lines : repeated record
lines.warehouses : repeated record within lines
(so { ..., lines: [ { ..., warehouses: [ ... ] } ] })
Plus, table1 is a subset of table2 with a subset of its fields; table1 has them NULL from the start (I refresh the information when data becomes available, because the information arrives asynchronously).
I first tried this as a first step (which succeeds):
UPDATE
table1 as base
SET
lines =
ARRAY(
SELECT AS STRUCT
b.line_id,
s.purch_id,
b.warehouses
FROM table1, UNNEST(lines) as b
INNER JOIN UNNEST(supply.lines) as s
ON b.line_id = s.line_id)
FROM
table2 as supply
WHERE
base.date = supply.date
AND
base.sales_id = supply.sales_id
But I actually need to update lines.warehouses too, so while I'm happy this works, it is not enough.
The full query is valid and when I try the last ARRAY part in a terminal, the query is fast and the output has no duplicate.
Still, the complete UPDATE does not finish (I killed it after 20 minutes).
The tables are not that large, 20k rows on each side (220k rows completely flattened).
So am I doing something wrong?
Is there a better way ?
Thanks
I finally solved the issue; it was way simpler than I thought.
I think I misunderstood how the whole query nesting worked.
So I just linked all the available data, from the first matched row down to the last array, since data filtered at the top level is propagated to the lower levels.
UPDATE
table1 as base
SET
lines =
ARRAY(
SELECT AS STRUCT
b.line_id,
s.purch_id,
ARRAY(
SELECT AS STRUCT
wh.warehouse_id,
sh.is_proposed,
FROM UNNEST(b.warehouses) as wh -- take only upper level data
INNER JOIN UNNEST(s.warehouses) as sh -- idem
ON wh.warehouse_id = sh.warehouse_id) -- no need to 'redo' the joining on already filtered ones
FROM UNNEST(base.lines) as b
INNER JOIN UNNEST(supply.lines) as s
ON b.line_id = s.line_id)
FROM
table2 as supply
WHERE
base.date = supply.date
AND
base.sales_id = supply.sales_id
The query succeeds in less than 1 minute.
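For readers who find the correlated UNNEST hard to follow: conceptually, the final query performs the in-memory merge below. This is a hypothetical Python sketch (field values invented) of the two nested ARRAY(SELECT AS STRUCT ...) joins:

```python
def merge_lines(base_lines, supply_lines):
    # Outer ARRAY(): join base.lines to supply.lines on line_id.
    supply_by_id = {s["line_id"]: s for s in supply_lines}
    merged = []
    for b in base_lines:
        s = supply_by_id.get(b["line_id"])
        if s is None:
            continue  # INNER JOIN semantics: unmatched lines are dropped
        # Inner ARRAY(): join only the already-matched pair's warehouses,
        # i.e. UNNEST(b.warehouses) joined to UNNEST(s.warehouses).
        sw = {w["warehouse_id"]: w for w in s["warehouses"]}
        warehouses = [
            {"warehouse_id": w["warehouse_id"],
             "is_proposed": sw[w["warehouse_id"]]["is_proposed"]}
            for w in b["warehouses"] if w["warehouse_id"] in sw
        ]
        merged.append({"line_id": b["line_id"], "purch_id": s["purch_id"],
                       "warehouses": warehouses})
    return merged

base = [{"line_id": 1, "warehouses": [{"warehouse_id": 10}]}]
supply = [{"line_id": 1, "purch_id": "P1",
           "warehouses": [{"warehouse_id": 10, "is_proposed": True}]}]
print(merge_lines(base, supply))
```

The slow first attempt re-scanned table1 and table2 inside the inner ARRAY(); the fix above corresponds to only ever touching the already-matched b and s rows.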

Grails Self Referencing Criteria

In the project I'm working on, part of the database looks like the following diagram.
The domain classes have a definition similar to the following:
class File{
String name
}
class Document{
File file
}
class LogEntry{
Document document
Date date
}
First I need to get only the latest LogEntry for all Documents; in SQL I do the following (SQL_1):
SELECT t1.* FROM log_entry AS t1
LEFT OUTER JOIN log_entry t2
on t1.document_id = t2.document_id AND t1.date < t2.date
WHERE t2.date IS NULL
Then in my service I have a function like this:
List<LogEntry> logs(){
LogEntry.withSession {Session session ->
def query = session.createSQLQuery(
"""SELECT t1.* FROM log_entry AS t1
LEFT OUTER JOIN log_entry t2
on t1.document_id = t2.document_id AND t1.date < t2.date
WHERE t2.date IS NULL"""
)
def results = query.with {
addEntity(LogEntry)
list()
}
return results
}
}
The SQL query does solve my problem, at least in a way. I also need to paginate, filter, and sort my results, as well as join the tables LogEntry, Document and File. Although it is doable in SQL, it might get complicated quite quickly.
In another project I've used criteria queries similar to the following:
Criteria criteria = LogEntry.createCriteria()
criteria.list(params){ //Max, Offset, Sort, Order
fetchMode 'document', FetchMode.JOIN //THE JOIN PART
fetchMode 'document.file', FetchMode.JOIN //THE JOIN PART
createAlias("document","_document") //Alias could be an option but I would need to add transients, since it seems to need an association path, and even then I am not so sure
if(params.filter){ //Filters
if(params.filter.name){
eq('name', filter.name)
}
}
}
In these kinds of criteria I've been able to add custom filters, etc. But I have no idea how to translate my query (SQL_1) into a criteria. Is there a way to accomplish this with criteria builders, or should I stick to SQL?

Improved way for multi-table SQL (MySQL) query?

Hoping you can help. I have three tables and would like to create a conditional query that makes a subset based on a row's presence in one table, excludes those rows from the results, and then queries a final, third table. I thought this would be simple enough, but I'm not well practiced in SQL, and after researching/testing left joins, correlated sub-queries etc. for 6 hours, I still can't hit the correct result set. So here's the setup:
T1
arn_mkt_stn
A00001_177_JOHN_FM
A00001_177_BILL_FM
A00001_174_DAVE_FM
A00002_177_JOHN_FM
A00006_177_BILL_FM
A00010_177_JOHN_FM - note: the name's relationship to the 3-digit prefix (e.g. _177) and the FM part is always consistent: '_177_JOHN_FM'; only the A000XX part changes
T2
arn_mkt
A00001_105
A00001_177
A00001_188
A00001_246
A00002_177
A00003_177
A00004_026
A00004_135
A00004_177
A00006_177
A00010_177
Example: So if _177_JOHN_FM is a substring of arn_mkt_stn rows in T1, exclude it when getting arn_mkts with a substring of 177 from T2 - in this case, the desired result set would be:
A00003_177
A00004_177
A00006_177
Similarly, _177_BILL_FM would return:
A00002_177
A00003_177
A00004_177
A00010_177
Then I would like to use this result set to pull records from a third table based on the 'A00003' etc
T3
arn
A00001
A00002
A00003
A00004
A00005
A00006
...
I've tried a number of methods [where here $stn_code = JOHN_FM and $stn_mkt = 177]
SELECT * FROM T2, T1 WHERE arn != SUBSTRING(T1.arn_mkt_stn, 1, 6)
AND SUBSTRING(T1.arn_mkt_stn, 12, 7) = '$stn_code'
AND SUBSTRING(arn_mkt, 8, 3) = '$stn_mkt'
(then use this result to query T3...)
Also a left join and a subquery, but I'm clearly missing something!
Any pointers gratefully received, thanks,
Rich.
[EDIT: Thanks for helping out, sgeddes. I'll expand on my logic above... first, the desired result set is always in connection with one name only per query; e.g. from T1, let's use JOHN_FM. In T1, JOHN_FM is currently associated with the 'arn's (within arn_mkt_stn) A00001, A00002 & A00010. The next step is to find in T2 all the 'arn's (within arn_mkt) that have JOHN_FM's 3-digit prefix (177), then exclude those that are in T1. Note: A00006 remains because it is not connected to JOHN_FM in T1. The same query for BILL_FM gives slightly different results, excluding A00001 & A00006, as it has this association in T1. Thanks, R]
You can use a LEFT JOIN to remove the records from T2 that match those in T1. However, I'm not sure I'm understanding your logic.
You say A00001_177_JOHN_FM should return:
A00003_177
A00004_177
A00006_177
However, wouldn't A00006_177_BILL_FM exclude A00006_177 from the above results?
This query should be close (wasn't completely sure which fields you needed returned) to what you're looking for if I'm understanding you correctly:
SELECT T2.arn_mkt, T3.arn
FROM T2
LEFT JOIN T1 ON
T1.arn_mkt_stn LIKE CONCAT(T2.arn_mkt,'%')
INNER JOIN T3 ON
T2.arn_mkt LIKE CONCAT(T3.arn,'%')
WHERE T1.arn_mkt_stn IS NULL
Sample Fiddle Demo
--EDIT--
Reviewing the comments, this should be what you're looking for:
SELECT *
FROM T2
LEFT JOIN T1 ON
T1.arn_mkt_stn LIKE CONCAT(LEFT(T2.arn_mkt,LOCATE('_',T2.arn_mkt)),'%') AND T1.arn_mkt_stn LIKE '%JOHN_FM'
INNER JOIN T3 ON
T2.arn_mkt LIKE CONCAT(T3.arn,'%')
WHERE T1.arn_mkt_stn IS NULL
And here is the updated Fiddle: http://sqlfiddle.com/#!2/3c293/13
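The anti-join pattern above can be checked end to end against the question's sample data. A sketch using SQLite rather than MySQL (so || replaces CONCAT, and an ESCAPE clause is added because _ is a LIKE wildcard):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE T1(arn_mkt_stn TEXT);
CREATE TABLE T2(arn_mkt TEXT);
CREATE TABLE T3(arn TEXT);
INSERT INTO T1 VALUES ('A00001_177_JOHN_FM'),('A00001_177_BILL_FM'),
  ('A00001_174_DAVE_FM'),('A00002_177_JOHN_FM'),('A00006_177_BILL_FM'),
  ('A00010_177_JOHN_FM');
INSERT INTO T2 VALUES ('A00001_105'),('A00001_177'),('A00001_188'),
  ('A00001_246'),('A00002_177'),('A00003_177'),('A00004_026'),
  ('A00004_135'),('A00004_177'),('A00006_177'),('A00010_177');
INSERT INTO T3 VALUES ('A00001'),('A00002'),('A00003'),('A00004'),
  ('A00005'),('A00006');
""")

# Anti-join: exclude every market whose arn is tied to JOHN_FM in T1
# (LEFT JOIN ... IS NULL), keep only the _177 markets, and require the
# arn to exist in T3.
rows = con.execute("""
    SELECT T2.arn_mkt
    FROM T2
    LEFT JOIN T1
      ON T1.arn_mkt_stn LIKE T2.arn_mkt || '%'
     AND T1.arn_mkt_stn LIKE '%JOHN~_FM' ESCAPE '~'
    INNER JOIN T3 ON T2.arn_mkt LIKE T3.arn || '%'
    WHERE T1.arn_mkt_stn IS NULL
      AND T2.arn_mkt LIKE '%~_177' ESCAPE '~'
    ORDER BY T2.arn_mkt
""").fetchall()
print([r[0] for r in rows])  # ['A00003_177', 'A00004_177', 'A00006_177']
```

This reproduces the expected JOHN_FM result set from the question, including A00006_177 (kept because its only T1 association is BILL_FM).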

What is the difference between these two queries?

I am writing my join query in the following way:
UPDATE t1
SET t1.Borr_Add_Req = t2.YesNoResponse
FROM UPLOAD_TEMP t1
INNER JOIN GB_RequiredFields t2 ON t1.State = t2.StateCode
AND t1.County_Id = t2.CountyId
AND t1.Group_code = t2.Doc_type_group_code
However, it can also be written this way:
UPDATE t1
SET t1.Borr_Add_Req = t2.YesNoResponse
FROM UPLOAD_TEMP t1
INNER JOIN GB_RequiredFields t2 ON t1.State = t2.StateCode
WHERE t1.County_Id = t2.CountyId
AND t1.Group_code = t2.Doc_type_group_code
Is there any difference between the two, and which is the preferred way to write it?
That's an age-old argument - whether to specify additional WHERE arguments in the JOIN clause or as a separate WHERE.
I prefer the approach of defining only those arguments that really make up the JOIN inside the JOIN clause, and everything else later on in the WHERE clause. Seems cleaner to me.
But I think in the end, functionally, it's the same - it's just a matter of personal preference, really.
Both queries will have the same result, and SQL Server should handle both in the same way. So there is no difference at all; it is just a matter of how you want to write it.
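One caveat worth adding: the two spellings are equivalent only for INNER joins. For an OUTER join, a predicate in the ON clause filters before the join preserves unmatched rows, while the same predicate in WHERE filters afterwards. A quick sketch with SQLite and invented tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a(id INTEGER, state TEXT);
CREATE TABLE b(id INTEGER, state TEXT, val TEXT);
INSERT INTO a VALUES (1,'TX'),(2,'NY');
INSERT INTO b VALUES (1,'TX','yes'),(2,'CA','no');
""")

# INNER JOIN: predicate in ON vs WHERE yields identical rows.
inner_on = con.execute(
    "SELECT a.id, b.val FROM a JOIN b ON a.id = b.id AND a.state = b.state").fetchall()
inner_where = con.execute(
    "SELECT a.id, b.val FROM a JOIN b ON a.id = b.id WHERE a.state = b.state").fetchall()
assert inner_on == inner_where

# LEFT JOIN: the placement changes the result.
left_on = con.execute(
    "SELECT a.id, b.val FROM a LEFT JOIN b ON a.id = b.id AND a.state = b.state").fetchall()
left_where = con.execute(
    "SELECT a.id, b.val FROM a LEFT JOIN b ON a.id = b.id WHERE a.state = b.state").fetchall()
print(left_on)     # [(1, 'yes'), (2, None)]  -- unmatched row preserved
print(left_where)  # [(1, 'yes')]             -- unmatched row filtered out
```

So the "personal preference" answer holds for the INNER JOIN in the question, but would not hold if the query were ever changed to a LEFT JOIN.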
You can even do it the following way:
UPDATE t1
SET t1.Borr_Add_Req = t2.YesNoResponse
FROM UPLOAD_TEMP t1, GB_RequiredFields t2
WHERE
t1.State = t2.StateCode
AND t1.County_Id = t2.CountyId
AND t1.Group_code = t2.Doc_type_group_code