BigQuery Using arrays in parameterized queries - google-bigquery

I need to run parameterized queries using arrays with the Python client library for the BigQuery API.
from google.cloud import bigquery

client = bigquery.Client()

id_pull = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
query = "SELECT column1 FROM `table1` WHERE id = @get_id;"
query_params = [
    bigquery.ArrayQueryParameter('get_id', 'INT64', id_pull)
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
query_job = client.query(query, location='US', job_config=job_config)  # API request: starts the query
results = query_job.result()  # waits for the job to complete
I followed the instructions from the documentation; however, this error appears after execution:
raise self._exception
google.api_core.exceptions.BadRequest: 400 No matching signature for operator = for argument types: INT64, ARRAY<INT64>. Supported signatures: ANY = ANY at [1:67]
Does someone know what the problem is and how to fix it?

I think the issue is in your WHERE clause
Instead of
WHERE id = @get_id
it should be something like
WHERE id IN UNNEST(@get_id)
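Putting it together, a minimal sketch of the corrected call (same table, parameter, and client setup as in the question):
from google.cloud import bigquery

client = bigquery.Client()

id_pull = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
query = "SELECT column1 FROM `table1` WHERE id IN UNNEST(@get_id);"
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [
    bigquery.ArrayQueryParameter('get_id', 'INT64', id_pull)
]
query_job = client.query(query, location='US', job_config=job_config)
for row in query_job.result():  # waits for the job to complete
    print(row)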

Related

pd.read_sql_query string list parameter collapsed to first element

I'm encountering an odd problem with the pandas.read_sql_query() method: when I pass a parameter that is a list formatted as a string, only the first element of the list is read and everything else is dropped. The original list is a list of numbers that is re-formatted as a string, so in theory all elements of the list should be preserved. For example, the dataframe returned by this code:
l = [1, 2, 3]
l_as_str = ", ".join(map(str, l))
QUERY = """SELECT * FROM table WHERE id in (%(l_as_str)s)"""
df = pd.read_sql_query(QUERY, params={"l_as_str": l_as_str}, con=engine)
...only consists of items where id = 1, the first element of l. If I switch the order of elements (e.g. l = [2, 1, 3]), then it only returns items where id = 2. In other words, the l_as_str appears to get collapsed to just the first element. Any idea what the issue is?
The parameter is bound as a single value: the driver sends l_as_str as one string literal, so the query effectively becomes WHERE id in ('1, 2, 3'). Many databases coerce that string to a number by reading its leading digits, which is why only the first element matches. Just simplify with an f-string instead; no need to pass params:
l = [1, 2, 3]
l_as_str = ", ".join(map(str, l))
QUERY = f"""SELECT * FROM table WHERE id in ({l_as_str})"""
df = pd.read_sql_query(QUERY, con=engine)
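If you do want to keep real parameter binding (safer when the ids come from user input), a common workaround is to expand the list into one placeholder per element. A minimal sketch, assuming the same engine as above and a DB-API driver that uses pyformat placeholders (such as psycopg2 or pymysql):
import pandas as pd

l = [1, 2, 3]
# one named placeholder per element: %(id0)s, %(id1)s, %(id2)s
placeholders = ", ".join(f"%(id{i})s" for i in range(len(l)))
QUERY = f"SELECT * FROM table WHERE id in ({placeholders})"
params = {f"id{i}": v for i, v in enumerate(l)}  # {'id0': 1, 'id1': 2, 'id2': 3}
df = pd.read_sql_query(QUERY, params=params, con=engine)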

Given a dataframe with N elements, how can I make m smaller dataframes such that the size of each is some fraction of N?

I have a dataset (call it Data) with ~25000 instances that I want to split into a train set, development set, and test set. I want it to be such that,
train set = 0.7*Data
development set = 0.1*Data
test set = 0.2*Data
When making the split, I want the instances to be randomly sampled and NOT REPEATED between the 3 sets. This is why I can't use something like,
train_set = Data.sample(frac=0.7)
dev_set = Data.sample(frac=0.1)
test_set = Data.sample(frac=0.2)
where instances from Data may be repeated in the sets. Is there a built-in function that I am missing, or could you help me write a function for doing this?
I will use an array to demonstrate an example of what I am looking for.
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
splits = [0.7, 0.1, 0.2]
def splitFunction(data, array_of_splits):
    # I need your help here
splits = splitFunction(A, splits)
# output
[[1, 3, 8, 9, 6, 7, 2], [4], [5, 0]]
Thank you in advance!
from random import shuffle

def splitFunction(data, array_of_splits):
    data_copy = data[:]  # copy so the original list is not modified
    shuffle(data_copy)  # randomize the order in place
    splits = []
    start_index = 0
    for frac in array_of_splits:
        # slice bounds must be ints, so round each fractional size
        end_index = start_index + round(frac * len(data_copy))
        splits.append(data_copy[start_index:end_index])
        start_index = end_index
    return splits
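For example (the exact output varies because of the shuffle):
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train, dev, test = splitFunction(A, [0.7, 0.1, 0.2])
print(train, dev, test)  # e.g. [1, 3, 8, 9, 6, 7, 2] [4] [5, 0]
The same idea carries over to a DataFrame: shuffle once with Data.sample(frac=1) and slice the result with .iloc using the same bounds, so no instance can appear in two sets.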

Syntax error in spatial query?

I wrote a function to find all pois around a track.
controller:
def index
  @track = Track.friendly.find(params[:track_id])
  @tracks = Track.where(way_id: @track.id)
  @way = Way.find(1)
  @poi_start = Poi.find(@way.point_start)
  @pois = @track.pois.sleepsAndtowns
  @pois = @way.poi_around_track_from(@poi_start, 50000, @pois)
end
way.rb
def poi_around_track_from(poi, dist, pois)
  around_sql = <<-SQL
    SELECT
      ST_DWithin(
        ST_LineSubstring(
          way.path,
          ST_LineLocatePoint(way.path, pta.lonlat::geometry) + #{dist} / ST_Length(way.path::geography),
          ST_LineLocatePoint(way.path, pta.lonlat::geometry) + 100000 / ST_Length(way.path::geography)
        ),
        ptb.lonlat,
        2000) is true as pois
    FROM ways way, pois pta, pois ptb
    WHERE way.id = #{self.id}
      and pta.id = #{poi.id}
      and ptb.id = #{pois.ids}
  SQL
  Poi.find_by_sql(around_sql).pois
end
This function returns:
syntax error at or near "["
LINE 13: and ptb.id = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
What's wrong, and how can I fix it?
Since you are building the query with raw SQL (not ActiveRecord), you will have to use a standard IN clause in the WHERE.
It looks like pois.ids is returning an array, so you will have to turn it into a string in the format below:
[1, 2] #=> (1, 2)
Change
WHERE way.id = #{self.id}
and pta.id = #{poi.id}
and ptb.id = #{pois.ids}
to
WHERE way.id = #{self.id}
and pta.id = #{poi.id}
and ptb.id IN (#{pois.ids.join(',')})
You can convert pois.ids to a string as @semeera207 wrote, or go another way and compare ptb.id to pois.ids as an array:
WHERE way.id = #{self.id}
and pta.id = #{poi.id}
and array[ptb.id] && array[#{pois.ids.join(',')}]
To make it faster, create a GIN index:
CREATE INDEX ON pois USING gin((array[id]));

BigQuery Python Client Library - Named Parameters Error

I'm trying to write a simple query using the Python client library with named parameters, but I keep encountering errors.
I keep getting "Undeclared query parameters" when I try to run the code. Did I miss anything?
My Code:
import datetime
import os
from google.cloud import bigquery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = <path>
client = bigquery.Client(project='project_id')
query = """
SELECT * from `<project_id>.<dataset_id>.*`
WHERE CAST(REGEXP_EXTRACT(_TABLE_SUFFIX, r"^(\d{8})$") AS INT64) = @date
limit 10;
"""
query_params = [
    bigquery.ScalarQueryParameter(
        'date',
        'INT64',
        int(datetime.date.today().strftime('%Y%m%d'))
    )
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
query_job = client.query(
    query,
    location='US')
for row in query_job:
    print(row)
assert query_job.state == 'DONE'
It looks like you forgot to pass your job_config in the arguments of your client.query() method. You should have:
query_job = client.query(
    query,
    location='US',
    job_config=job_config)
Official docs here.
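For reference, a minimal sketch of the corrected tail of the script (same query, query_params, and client as in the question); without the job_config, BigQuery receives no parameter declarations, so @date is exactly the "undeclared query parameter" the error complains about:
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
query_job = client.query(
    query,
    location='US',
    job_config=job_config)  # the job_config carries the @date declaration
for row in query_job:  # iterating waits for the query to finish
    print(row)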

pandas series multi indexing error

I'm trying to slice into a multi-indexed data frame. I'm confused about the conditions that generate IndexingError: Too many indexers. I'm also skeptical because I've found some bug reports that may be about this issue.
Specifically, this generates the error:
import pandas as pd

idx1 = [str(elem) for elem in [5, 6, 7, 8]]
idx2 = [str(elem) for elem in [10, 20, 30]]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
columns = ['m1', 'm2', 'm3']
df = pd.DataFrame(index=index, columns=columns)
df['m1'].loc[:, 10]
The code above is trying to index into an index whose labels are str, using an int, it seems to me. The error threw me off, as I don't understand why it says Too many indexers.
The below code works:
idx1 = [5, 6, 7, 8]
idx2 = [10, 20, 30]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
columns = ['m1', 'm2', 'm3']
df = pd.DataFrame(index=index, columns=columns)
df.loc[5, 10] = [1, 2, 3]
df.loc[6, 10] = [4, 5, 6]
df.loc[7, 10] = [7, 8, 9]
type(df['m1'])
df['m1'].loc[:, 10]
There are some references to the same error: https://github.com/pandas-dev/pandas/issues/13597 which is marked closed and https://github.com/pandas-dev/pandas/issues/14885 which is open.
Is it OK to slice a multi-indexed series as in the lines above, assuming I get the dtype right? See also "Too many indexers" with DataFrame.loc.
My pandas version is 0.20.3.
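For what it's worth, a minimal sketch supporting the dtype explanation above: with the string-labeled index from the first snippet, a string key slices cleanly where the int key fails (observed on recent pandas; 0.20.3 may behave differently):
import pandas as pd

idx1 = [str(elem) for elem in [5, 6, 7, 8]]
idx2 = [str(elem) for elem in [10, 20, 30]]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
df = pd.DataFrame(index=index, columns=['m1', 'm2', 'm3'])
print(df['m1'].loc[:, '10'])  # string key matches the str level values, so this slices fine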