I am trying to write a query using the Google BigQuery Python API, setting the project ID and dataset name as query parameters. I have looked into the parameterized queries implementation on Google's github.io docs, but when executing the query I get the following error:
google.api_core.exceptions.BadRequest: 400 Invalid table name: @project:@dataset.AIRPORTS
I am confused about whether we can substitute the project and dataset names with parameters.
Below is my code:
from google.cloud import bigquery
client = bigquery.Client.from_service_account_json('service_account.json')
project = client.project
datasets = list(client.list_datasets())
dataset = datasets[0]
dataset_id = dataset.dataset_id
QUERY = (
    'SELECT * '
    'FROM `{}.{}.AIRPORTS`'.format(project, dataset_id)
)
query = (
    'SELECT * '
    'FROM `@project.@dataset.AIRPORTS`'
)
TIMEOUT = 30
param1 = bigquery.ScalarQueryParameter('project', 'STRING', project)
param2 = bigquery.ScalarQueryParameter('dataset', 'STRING', dataset_id)
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [param1, param2]
query_job = client.query(
    query, job_config=job_config)
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
print(rows)
You can only use parameters in place of expressions, such as column_name = @param_value in a WHERE clause. A table name is not an expression, so you cannot use parameters in place of the project or dataset names. Note also that you need to use standard SQL in order to use parameters.
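In other words, keep building the table reference with format() (as your QUERY variable already does) and reserve query parameters for values only. Here is a hedged sketch combining the two; the iata_code column and the 'SFO' value are purely illustrative.

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('service_account.json')
project = client.project
dataset_id = list(client.list_datasets())[0].dataset_id

# Table names cannot be parameterized, so format the table reference into the SQL...
query = 'SELECT * FROM `{}.{}.AIRPORTS` WHERE iata_code = @code'.format(project, dataset_id)

# ...and use parameters only for values (standard SQL).
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [
    bigquery.ScalarQueryParameter('code', 'STRING', 'SFO'),  # illustrative column/value
]

rows = list(client.query(query, job_config=job_config).result(timeout=30))
print(rows)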
I am using the boto3 library with executeStatement to get data from an RDS cluster using the Data API.
The query works fine if I select 1 or 2 columns, but as soon as I select another column, it returns an error with (BadRequestException) permission denied for relation table_name.
I have checked using pgAdmin that the permissions are intact for this user to query the whole DB.
function included in call:
def execute_query(self, sql_query, sql_parameters=[]):
    """
    Aurora DataAPI execute query. Generally used for select statements.
    :param sql_query: Query
    :param sql_parameters: parameters in sql query
    :return: DataApi response
    """
    client = self.api_access()
    response = client.execute_statement(
        resourceArn=RESOURCE_ARN,
        secretArn=SECRET_ARN,
        database='db_name',
        sql=sql_query,
        includeResultMetadata=True,
        parameters=sql_parameters)
    return response
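For reference, a hedged sketch of how sql_parameters could be passed to this helper: the Data API expects a list of name/value dicts with typed value fields (stringValue, longValue, and so on), and the SQL references them as :name. The parameter name and filter below are purely illustrative.

params = [
    {'name': 'event_name', 'value': {'stringValue': 'signup'}},  # illustrative parameter
]
query = '''
SELECT id
FROM schema_name.table_name
WHERE event = :event_name
limit 1
'''
result = conn.execute_query(query, sql_parameters=params)
print(result)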
function call: No errors
query = '''
SELECT id
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
function call: fails with above error
query = '''
SELECT id,name,event
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
Is there a horizontal limit on what we can get from the Data API using boto3? I know there is a 1 MB limit, but according to the documentation it should still return something if the limit is exceeded.
Backend is Postgres RDS
UPDATE:
I can select the same column 10 times and it's not a problem:
query = '''
SELECT id,event,event,event,event,event
FROM schema_name.table_name
limit 1
'''
print(query)
result = conn.execute_query(query)
print(result)
So this means there are some columns that I cannot select.
I didn't know there was column-level security on some tables. If column-level security is applied in Postgres for the user you are using, then obviously those columns cannot be selected.
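As a follow-up, one way to confirm which columns the Data API role can actually read is to query information_schema.column_privileges through the same helper. This is just a hedged sketch, reusing the schema/table placeholders from the queries above.

check_query = '''
SELECT column_name, privilege_type
FROM information_schema.column_privileges
WHERE table_schema = 'schema_name'
  AND table_name = 'table_name'
  AND grantee = current_user
'''
result = conn.execute_query(check_query)
print(result)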
I am using SQLAlchemy with create_engine to connect Python to Snowflake for fetching data; a snippet of how I am doing it is below. Before you suggest using the Snowflake connector (snowflake.connector): I have tried that and it supports a query tag, but I need to run the queries through a thread pool and couldn't find a way to add a query tag that way.
I have also tried ALTER SESSION SET QUERY_TAG, but since the queries run in parallel it doesn't apply the query tag.
Code:
from multiprocessing.pool import ThreadPool

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

vendor_class_query = 'select * from table'
query_list1 = [vendor_class_query]
pool = ThreadPool(8)

def query(x):
    engine = create_engine(
        'snowflake://{user}:{password}@{account}/{database_name}/{schema_name}'
        '?warehouse={warehouse}&role={role}&paramstyle={paramstyle}'.format(
            user=---------,
            password=----------,
            account=----------,
            database_name=----------,
            schema_name=----------,
            warehouse=----------,
            role=----------,
            paramstyle='pyformat'
        ),
        poolclass=NullPool
    )
    try:
        connection = engine.connect()
        for df in pd.read_sql_query(x, engine, chunksize=1000000000):
            df.columns = map(str.upper, df.columns)
            return df
    finally:
        connection.close()
        engine.dispose()
    return df

results1 = pool.map(query, query_list1)
vendor_class = results1[0]
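A hedged sketch of one way the query tag might be scoped per thread: run ALTER SESSION SET QUERY_TAG on the specific connection each worker opens, then pass that connection (rather than the engine) to pandas, so the tag only applies to that thread's Snowflake session. Here connection_url stands in for the same Snowflake URL string built in query() above, and the tag value is illustrative.

from sqlalchemy import text

def tagged_query(x, tag='vendor_class'):
    # connection_url is assumed to be the same Snowflake URL as in query() above
    engine = create_engine(connection_url, poolclass=NullPool)
    connection = engine.connect()
    try:
        # Each connection is its own Snowflake session, so the tag is not
        # shared with the other worker threads in the pool.
        connection.execute(text("ALTER SESSION SET QUERY_TAG = '{}'".format(tag)))
        for df in pd.read_sql_query(x, connection, chunksize=1000000000):
            df.columns = map(str.upper, df.columns)
            return df
    finally:
        connection.close()
        engine.dispose()

results1 = pool.map(tagged_query, query_list1)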
This is a continuation of my previous post about making an API that takes a URL parameter, passes it to BigQuery, and returns True if the luid record has data in the orderid column: How to check whether data exists in specific column on BigQuery with Flask?
I changed the SQL and it seems to work well on the GCP console, but as you can see, it returns False ({'f0_': 0}) if you pass a correct parameter from the browser. Do I need to fix this SQL?
[URL:https://test-989898.df.r.appspot.com?luid=U77777]
The output of return str(row)
↓
Row((True,), {'f0_': 0})
The output of the same SQL with the same luid on the console
↓
row | f0_
1 | true
SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE luid = "U77777" AND orderid != '' limit 1000)
I also tried the approach from the article below. Can a user-input parameter not be used in BigQuery?
https://cloud.google.com/bigquery/docs/parameterized-queries
@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    query = """SELECT EXISTS(SELECT 1
               FROM `test-266110.conversion_log.conversion_log_2020*` as p
               WHERE @luid = p.luid AND orderid != '' limit 1000)"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("luid", "STRING", luid),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    query_res = query_job.result()
    for row in query_res:
        return str(row)
↓
Row((True,), {'f0_': 0})
I've been stuck on this problem for a while and I'm open to any ideas. Does anyone have a good solution?
from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)

@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    query = """SELECT EXISTS(SELECT 1
               FROM `test-266110.conversion_log.conversion_log_2020*` as p
               WHERE @luid = p.luid AND orderid != '' limit 1000)"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("luid", "STRING", luid),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    query_res = query_job.result()
    # first_row = next(iter(query_job.result()))
    for row in query_res:
        return str(row)
    # return jsonify({luid: query_res.total_rows})
    """
    if query_res == :
        return jsonify({luid: str(True)})
    else:
        return jsonify({luid: str(False)})
    """

if __name__ == "__main__":
    app.run()
↓
Row((True,), {'f0_': 0})
You seem to have solved most of the bits; it's just a question of getting them working together. Here's a quick sample that should help with the BigQuery side, and it shows a different way of writing your query pattern using a public dataset table.
from google.cloud import bigquery

client = bigquery.Client()

# assume you get this from your flask app's param. this is the "luid" you're checking.
value = "treason"

# rewriting the sql to demonstrate a similar thing with a public dataset table
sql = "SELECT COUNTIF(word=@luid AND corpus='sonnets') > 0 as word_is_sonnet FROM `bigquery-public-data.samples.shakespeare`"

config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("luid", "STRING", value),
    ]
)

job = client.query(sql, job_config=config)

# this is a bit odd, but in this case we know we're dealing with a single row
# coming from the iterable based on the query structure.
first_row = next(iter(job.result()))
print(first_row.get("word_is_sonnet"))
That said, I'd make sure you understand how BigQuery works and charges for queries. You seem to be doing point lookups across a range of tables (the wildcard table in your original query), which means you're potentially doing a lot of table scanning to satisfy this request.
I just wanted to call that out so you're not surprised by either the performance or the cost if the intent is to issue many requests like this.
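To fold that back into the original route, here is a minimal sketch, assuming the original project and table names and adding an alias (has_order) on the EXISTS expression so the value can be read by name instead of f0_:

from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)

@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    sql = """SELECT EXISTS(SELECT 1
             FROM `test-266110.conversion_log.conversion_log_2020*` AS p
             WHERE p.luid = @luid AND orderid != '') AS has_order"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("luid", "STRING", luid)]
    )
    # EXISTS yields exactly one row, so take the first row of the result.
    first_row = next(iter(client.query(sql, job_config=job_config).result()))
    return jsonify({luid: bool(first_row.get("has_order"))})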
I am trying to add a date variable to my query in GBQ.
So I have a variable x (e.g. 2016-04-20) which I want to use in the query like this:
#Query the necessary data
customer_data_query = """
SELECT FirstName, LastName, Organisation, CustomerRegisterDate FROM `bigquery-bi.ofo.Customers`
where CustomerRegisterDate > @max_last_date LIMIT 5 """
print(customer_data_query)

# Creating a connection to Google BigQuery
client = bigquery.Client.from_service_account_json('./credentials/cred_ofo.json')
print("Connection to Google BigQuery is established")

query_params = [
    bigquery.ScalarQueryParameter("max_last_date", "STRING", max_last_date),
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params

customer_data = client.query(
    customer_data_query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
    job_config=job_config,
).to_dataframe()  # API request - starts the query
Any tips on how I can do this?
I tried it in the code above but it did not work.
There were 2 solutions:
1. Use format():
"""SELECT FirstName, LastName, Organisation, CustomerRegisterDate FROM `bigquery-bi.ofo.Customers` where CustomerRegisterDate > {} LIMIT 5""".format(max_date)
2. Define query parameters and pass them via job_config, as described in the BigQuery parameterized query documentation.
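A hedged sketch of option 2, assuming CustomerRegisterDate is a DATE column (hence the DATE-typed parameter; adjust the type if the column is a TIMESTAMP or STRING):

import datetime

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('./credentials/cred_ofo.json')
max_last_date = datetime.date(2016, 4, 20)

customer_data_query = """
SELECT FirstName, LastName, Organisation, CustomerRegisterDate
FROM `bigquery-bi.ofo.Customers`
WHERE CustomerRegisterDate > @max_last_date
LIMIT 5 """

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("max_last_date", "DATE", max_last_date),
    ]
)

customer_data = client.query(
    customer_data_query, location="US", job_config=job_config
).to_dataframe()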
We need to monitor table sizes in different environments, using the Google metadata API to get the information for a given project/environment.
We need to create a view which will provide:
1. What are all the datasets
2. What tables are in each dataset
3. Table sizes
4. Dataset size
BigQuery has such views built in for you already: INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, tables, and views.
For example, the query below returns metadata for all datasets in the default project:
SELECT * FROM INFORMATION_SCHEMA.SCHEMATA
or, for my_project:
SELECT * FROM my_project.INFORMATION_SCHEMA.SCHEMATA
There are other such views for tables as well.
In addition, there are meta tables that can be used to get more info about the tables in a given dataset: __TABLES__SUMMARY and __TABLES__
SELECT * FROM `project.dataset.__TABLES__`
For example:
SELECT table_id,
  DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
  DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date,
  row_count,
  size_bytes,
  CASE
    WHEN type = 1 THEN 'table'
    WHEN type = 2 THEN 'view'
    WHEN type = 3 THEN 'external'
    ELSE '?'
  END AS type,
  TIMESTAMP_MILLIS(creation_time) AS creation_time,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
  dataset_id,
  project_id
FROM `project.dataset.__TABLES__`
To automate the query so it checks every dataset in the project instead of adding them manually with UNION ALL, you can follow the advice given by @ZinkyZinky here and create a query that generates the UNION ALL calls for every dataset.__TABLES__. I have not managed to make this solution fully automatic in BigQuery because I haven't found a way to execute a command generated as a string (which is what string_agg creates). However, I have developed the solution in Python, feeding the generated string into the next query. You can find the code below; it also creates a new table and stores the results there:
from google.cloud import bigquery

client = bigquery.Client()
project_id = "wave27-sellbytel-bobeda"

# Construct a full Dataset object to send to the API.
dataset_id = "project_info"
dataset = bigquery.Dataset(".".join([project_id, dataset_id]))
dataset.location = "US"

# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.
dataset = client.create_dataset(dataset)  # API request
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

schema = [
    bigquery.SchemaField("dataset_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("table_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("size_bytes", "INTEGER", mode="REQUIRED"),
]

table_id = "table_info"
table = bigquery.Table(".".join([project_id, dataset_id, table_id]), schema=schema)
table = client.create_table(table)  # API request
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)

job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table(table_id)
job_config.destination = table_ref

# QUERIES
# 1. Creating the UNION ALL list with the table information of each dataset
query = (
    r"SELECT string_agg(concat('SELECT * from `', schema_name, '.__TABLES__` '), 'union all \n') "
    r"from INFORMATION_SCHEMA.SCHEMATA"
)
query_job = client.query(query, location="US")  # API request - starts the query

select_tables_from_all_datasets = ""
for row in query_job:
    select_tables_from_all_datasets += row[0]

# 2. Using the previously generated list to create a table.
query = (
    "WITH ALL__TABLES__ AS ({})"
    "SELECT dataset_id, table_id, size_bytes FROM ALL__TABLES__;".format(select_tables_from_all_datasets)
)
# job_config configures the table in which the results will be stored.
query_job = client.query(query, location="US", job_config=job_config)

for row in query_job:
    print(row)
print('Query results loaded to table {}'.format(table_ref.path))
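To also cover the dataset-size requirement (point 4 of the question), a hedged follow-up: once table_info has been populated by the script above, per-dataset sizes can be obtained by summing size_bytes, for example:

# Aggregate per-dataset sizes from the table created above.
size_query = """
SELECT dataset_id, SUM(size_bytes) AS dataset_size_bytes, COUNT(*) AS table_count
FROM `{}.{}.{}`
GROUP BY dataset_id
ORDER BY dataset_size_bytes DESC
""".format(project_id, dataset_id, table_id)

for row in client.query(size_query, location="US"):
    print(row.dataset_id, row.dataset_size_bytes, row.table_count)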