Simple query from Dataflow to BigQuerySource throws exception - google-bigquery

I'm trying to write a simple Dataflow job that uses the query parameter of the BigQuerySource class.
In the simplest terms: I can read a BigQuery table with the BigQuerySource class and then filter against it, but I cannot query/filter directly against the BigQuery table using BigQuerySource.
Here's some code. Filtering in-line, within my Dataflow pipeline, works fine:
import argparse
import apache_beam as beam
parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True)
known_args, pipeline_args = parser.parse_known_args(None)
p = beam.Pipeline(argv=pipeline_args)
source = 'bigquery-public-data:samples.shakespeare'
rows = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(source))
f = rows | 'filter' >> beam.Map(lambda row: 1 if (row['word_count'] > 1) else 0)
f | 'write' >> beam.io.WriteToText(known_args.output)
p.run()
Replacing that middle stanza with a single line query gives an error.
f = p | 'read' >> beam.io.Read(beam.io.BigQuerySource('SELECT 1 FROM ' \
+ 'bigquery-public-data:samples.shakespeare where word_count > 1'))
The error returned looks like a syntax error.
(a29eabc394a38f62): Workflow failed. Causes:
(a29eabc394a38cfa): S04:read+write/Write/WriteImpl/WriteBundles+write/Write/WriteImpl/Pair+write/Write/WriteImpl/WindowInto(WindowIntoFn)+write/Write/WriteImpl/GroupByKey/Reify+write/Write/WriteImpl/GroupByKey/Write failed.,
(fb6d0643d7f13886): BigQuery execution failed.,
(fb6d0643d7f13b03): Error: Message: Encountered " "-" "- "" at line 1, column 59. Was expecting: <EOF>
Do I need to escape the - characters in the BigQuery project name?

In BigQuery Legacy SQL, you should escape the whole table reference with [ and ].
For Standard SQL, you should use back-ticks for the same reason.
See also Escaping reserved keywords and invalid identifiers
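For example, a minimal sketch of both variants (note the query keyword argument and, for the second form, the use_standard_sql flag on BigQuerySource):
# Legacy SQL: bracket-escape the whole table reference
rows = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
    query='SELECT 1 FROM [bigquery-public-data:samples.shakespeare] '
          'WHERE word_count > 1'))
# Standard SQL: back-ticks, with a dot between project and dataset
rows = p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
    query='SELECT 1 FROM `bigquery-public-data.samples.shakespeare` '
          'WHERE word_count > 1',
    use_standard_sql=True))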

Related

Apache Calcite Fails to Parse Its Own Google BigQuery Output

I'm new to Apache Calcite and am running into a strange "gap" - given a simple select query:
select * from orders
I parse it using:
SqlParser.Config sqlParserConfig = SqlParser
.configBuilder()
.setConformance(SqlConformanceEnum.LENIENT)
.build();
SqlParser parser = SqlParser.create(sqlQuery, sqlParserConfig);
SqlNode parseAst = parser.parseQuery();
CalciteCatalogReader catalogReader = buildCatalogReader(schema, typeFactory);
SqlValidator.Config validatorConf = SqlValidator.Config.DEFAULT.withSqlConformance(SqlConformanceEnum.BIG_QUERY);
validatorConf = validatorConf.withIdentifierExpansion(true);
SqlValidator validator = SqlValidatorUtil.newValidator(SqlStdOperatorTable.instance(),
catalogReader, typeFactory,
validatorConf);
validator.validate(parseAst).toSqlString(BigQuerySqlDialect.DEFAULT).toString();
Which results in:
SELECT ORDERS.o_orderkey, ORDERS.o_custkey, ORDERS.o_orderstatus, ORDERS.o_totalprice, ORDERS.`o_order date`, ORDERS.o_orderpriority, ORDERS.o_clerk, ORDERS.o_shippriority, ORDERS.o_comment
FROM ORDERS AS ORDERS
(reasonable; note the quoted identifier for `o_order date`, which is necessary because of the whitespace in the column name)
If I then take that query string and pass it back through (setting conformance to SqlConformanceEnum.BIG_QUERY) the parse fails with:
org.apache.calcite.sql.parser.SqlParseException: Lexical error at line 1, column 95. Encountered: "`" (96), after : ""
This is puzzling on its face, so I tried again with a parser config:
SqlParser.Config sqlParserConfig = SqlParser
.configBuilder()
.setConformance(getConformance(SqlConformanceEnum.BIG_QUERY))
.setQuoting(Quoting.BACK_TICK_BACKSLASH)
.build();
to force handling of backtick-quoted identifiers, and I end up with:
SELECT ORDERS.o_orderkey AS O_ORDERKEY, ORDERS.o_custkey AS O_CUSTKEY, ORDERS.o_orderstatus AS O_ORDERSTATUS, ORDERS.o_totalprice AS O_TOTALPRICE, ORDERS.`o_order date`, ORDERS.o_orderpriority AS O_ORDERPRIORITY, ORDERS.o_clerk AS O_CLERK, ORDERS.o_shippriority AS O_SHIPPRIORITY, ORDERS.o_comment AS O_COMMENT
FROM ORDERS AS ORDERS
which is usable... but
why is the "default" conformance configuration for BigQuery using DOUBLE_QUOTE quoting when BigQuery uses backticks?
why does round-tripping the parse lead to a different query than the input? (Running the aliased query back through a second time gets itself back, so it does stabilize, but the initial inconsistency is odd.)

'where' operator: Failed to resolve table or column or scalar expression named

For a query in Microsoft Defender Advanced Hunting I want to use data from an external table (here KQL_Test_Data.csv), but when I try to run it I get the error message:
'where' operator: Failed to resolve table or column or scalar expression named 'IOC'
and when I highlight the whole query, as suggested in 'where' operator: failed to resolve scalar expression named 'timeOffsetMin', I get this error message:
No tabular expression statement found
This is the code I used:
let IOC = externaldata(column:string)
[
h#"https://raw.githubusercontent.com/Kornuptiko/TEMP/main/KQL_Test_Data.csv"
]
with(format="csv");
DeviceNetworkEvents
| where Timestamp > ago(30d)
| where RemoteIP in (IOC);
Assuming microsoft365-defender supports externaldata:
Your file is not a valid CSV, and KQL is strict about this.
As a workaround, we can read the file as txt and then parse it.
let IOC = externaldata(column:string)
[
h#"https://raw.githubusercontent.com/Kornuptiko/TEMP/main/KQL_Test_Data.csv"
]
with(format="txt")
| parse column with * '"' ip '"' *
| project ip;
DeviceNetworkEvents
| where Timestamp > ago(30d)
| where RemoteIP in (IOC);

I've performed a JOIN using bigrquery and the dbGetQuery function. Now I'd like to query the temporary table I've created but can't connect

I'm afraid that if a bunch of folks start running my actual code I'll be billed for the queries, so my example code is for a fake database.
I've successfully established my connection to BigQuery:
con <- dbConnect(
bigrquery::bigquery(),
project = 'myproject',
dataset = 'dataset',
billing = 'myproject'
)
Then performed a LEFT JOIN using the coalesce function:
dbGetQuery(con,
"SELECT
`myproject.dataset.table_1`.Pokemon,
coalesce(`myproject.dataset.table_1`.Type_1,`myproject.dataset.table_2`.Type_1) AS Type_1,
coalesce(`myproject.dataset.table_1`.Type_2,`myproject.dataset.table_2`.Type_2) AS Type_2,
`myproject.dataset.table_1`.Total,
`myproject.dataset.table_1`.HP,
`myproject.dataset.table_1`.Attack,
`myproject.dataset.table_1`.Special_Attack,
`myproject.dataset.table_1`.Defense,
`myproject.dataset.table_1`.Special_Defense,
`myproject.dataset.table_1`.Speed,
FROM `myproject.dataset.table_1`
LEFT JOIN `myproject.dataset.table_2`
ON `myproject.dataset.table_1`.Pokemon = `myproject.dataset.table_2`.Pokemon
ORDER BY `myproject.dataset.table_1`.ID;")
The JOIN produced the table I intended, and now I'd like to query that table, but... where is it? How do I connect to it? Can I save it locally so that I can start working on my analysis in R? Even if I go to BigQuery, select the Project History tab, select the query I just ran in RStudio, and copy the Job ID for the temporary table, I still get the following error:
Error: Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Run `rlang::last_error()` to see where the error occurred.
And if I follow up:
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlang_error>
Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Backtrace:
1. DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
2. DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
3. DBI:::.local(conn, statement, ...)
5. bigrquery::dbSendQuery(conn, statement, ...)
6. bigrquery:::BigQueryResult(conn, statement, ...)
7. bigrquery::bq_job_wait(job, quiet = conn@quiet)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Job 'poke-340100.job_y0IBocmd6Cpy-irYtNdLJ-mWS7I0.US' failed
x Syntax error: Unexpected string literal 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae' at [2:6] [invalidQuery]
Backtrace:
x
1. +-DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
2. \-DBI::dbGetQuery(con, "SELECT *\nFROM 'poke-340100:US.bquxjob_7c3a7664_17ed44bb4ae'\nWHERE Type_1 IS NULL;")
3. \-DBI:::.local(conn, statement, ...)
4. +-DBI::dbSendQuery(conn, statement, ...)
5. \-bigrquery::dbSendQuery(conn, statement, ...)
6. \-bigrquery:::BigQueryResult(conn, statement, ...)
7. \-bigrquery::bq_job_wait(job, quiet = conn@quiet)
Can someone please explain? Is it just that I can't query a temporary table with the bigrquery package?
From looking at the documentation here and here, the problem might just be that you did not assign the results anywhere.
local_df = dbGetQuery(...
should take the results from your database query and copy them into local R memory. Take care, as there is no check for the size of the results, so it is easy to run out of memory when doing this.
You have tagged the question with dbplyr, but it looks like you are just using the DBI package. If you want to be writing R and have it translated to SQL, then you can do this using dbplyr. It would look something like this:
con <- dbConnect(...) # your connection details here
remote_tbl1 = tbl(con, from = "table_1")
remote_tbl2 = tbl(con, from = "table_2")
new_remote_tbl = remote_tbl1 %>%
left_join(remote_tbl2, by = "Pokemon", suffix = c("",".y")) %>%
mutate(Type_1 = coalesce(Type_1, Type_1.y),
Type_2 = coalesce(Type_2, Type_2.y)) %>%
select(ID, Pokemon, Type_1, Type_2, ...) %>% # list your return columns
arrange(ID)
When you use this approach, new_remote_tbl can be thought of as a new table in the database which you can query and manipulate further. (It is not actually a table - no data was saved to disk - but you can query it and interact with it as if it were, and the database will produce it for you on demand.)
There are some limitations of working with a remote table (the biggest is you are limited to commands that dbplyr can translate into SQL). When you want to copy the current remote table into local R memory, use collect:
local_df = remote_df %>%
collect()

Passing table name and list of values as argument to psycopg2 query

Context
I would like to pass a table name along with query parameters in a psycopg2 query in a python3 function.
If I understand correctly, I should not format the query string using the Python .format() method prior to executing the query, but let psycopg2 do that.
Issue
I can't manage to pass both the table name and the parameters as arguments to my query.
Code sample
Here is a code sample:
import psycopg2
from psycopg2 import sql
connection_string = "host={} port={} dbname={} user={} password={}".format(*PARAMS.values())
conn = psycopg2.connect(connection_string)
curs = conn.cursor()
table = 'my_customers'
cities = ["Paris", "London", "Madrid"]
data = (table, tuple(cities))
query = sql.SQL("SELECT * FROM {} WHERE city = ANY (%s);")
curs.execute(query, data)
rows = curs.fetchall()
Error(s)
But I get the following error message:
TypeError: not all arguments converted during string formatting
I also tried to replace the data definition by:
data = (sql.Identifier(table), tuple(cities))
But then this error pops:
ProgrammingError: can't adapt type 'Identifier'
If I put ANY {} instead of ANY (%s) in the query string, in both previous cases this error shows:
SyntaxError: syntax error at or near "{"
LINE 1: ...* FROM {} WHERE c...
^
Initially, I didn't use the sql module and was trying to pass the data as the second argument to the curs.execute() method, but then the table name was single-quoted in the command, which caused trouble. So I gave the sql module a try, hoping it's not a deprecated habit.
If possible, I would like to keep the curly braces {} for parameters substitution instead of %s, except if it's a bad idea.
Environment
Ubuntu 18.04 64 bit 5.0.0-37-generic x86_64 GNU/Linux
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
psycopg2.__version__
'2.8.4 (dt dec pq3 ext lo64)'
You want something like:
table = 'my_customers'
cities = ["Paris", "London", "Madrid"]
query = sql.SQL("SELECT * FROM {} WHERE city = ANY (%s)").format(sql.Identifier(table))
curs.execute(query, (cities,))
rows = curs.fetchall()
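Two notes on why this works: the table name is interpolated with sql.SQL(...).format() as an Identifier, so it is quoted as an identifier rather than a string literal, while cities stays an ordinary query parameter; psycopg2 adapts a Python list to a PostgreSQL array, which is why ANY (%s) matches it, and execute() expects a sequence of parameters, hence the (cities,) wrapper. As a further sketch (reusing curs and cities from above; the schema name here is a made-up example), Identifier also accepts several strings to build a dotted, individually quoted name in psycopg2 >= 2.8:
query = sql.SQL("SELECT * FROM {} WHERE city = ANY (%s)").format(
    sql.Identifier("public", "my_customers"))  # renders as "public"."my_customers"
curs.execute(query, (cities,))
rows = curs.fetchall()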

Response too large to return while using limit

After trying to export the result via the UI (see BigQuery export table to csv file), I'm now trying to use the gcloud command line to do it.
Here is a snippet:
import subprocess

def main(query, file_, max_=1000):
    page = 0
    start = 0
    while True:
        with open(file_, "ab") as fh:
            try:
                # Build the command in a separate variable so the original
                # query string is not overwritten on later iterations.
                cmd = ("bq query --start_row=%d --max_rows_per_request=%d "
                       "'%s'" % (start, max_, query))
                query_result = subprocess.check_output(cmd, shell=True)
                fh.write(query_result)
            except subprocess.CalledProcessError:
                break
        page += 1
        start = page * max_ + 1
But it failed: running the first query gives me a "response too large to return" error, and using --allow_large_results gives me a "BigQuery error in query operation: allow_large_results requires destination_table." error.
So my question is pretty simple: how do I paginate through a large table to export the result?
how do I paginate through a large table to export the result?
It looks like your result is greater than 128 MB, which is the limit above which you have to write your result into a destination table. After this is done, you can export the result to GCS as outlined in BigQuery export table to csv file.
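A minimal sketch of that approach, staying with subprocess as in the question (the dataset, table, and bucket names below are placeholders):
import subprocess

# Placeholders: substitute your own dataset, table, and GCS bucket.
dest_table = "mydataset.query_result"
gcs_path = "gs://my-bucket/query_result_*.csv"
query = "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]"

# Run the query with a destination table, which is what
# --allow_large_results requires (legacy SQL syntax here).
subprocess.check_call(
    ["bq", "query", "--use_legacy_sql=true", "--allow_large_results", "--replace",
     "--destination_table=%s" % dest_table, query])

# Export the destination table to GCS; the * in the URI lets BigQuery
# shard the export into multiple files for large tables.
subprocess.check_call(["bq", "extract", dest_table, gcs_path])
The CSV shards can then be downloaded from the bucket.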