I'm using BigQuery with a dataset called '87891428' containing daily tables. I'm trying to query a date range using the TABLE_DATE_RANGE function:
SELECT avg(foo)
FROM (
TABLE_DATE_RANGE(87891428.a_abc_,
TIMESTAMP('2014-09-30'),
TIMESTAMP('2014-10-19'))
)
But this leads to a very explicit error message:
Error: Encountered "" at line 3, column 21. Was expecting one of:
I have the feeling that TABLE_DATE_RANGE doesn't like a dataset name starting with a number, because when I copy a few tables into a new dataset called 'test' the query runs properly. Has anyone already encountered this issue, and if so, what is the best workaround (as far as I know you can't rename a dataset)?
The fix for this is to use brackets around the dataset name and table prefix:
SELECT avg(foo)
FROM (
TABLE_DATE_RANGE([87891428.a_abc_],
TIMESTAMP('2014-09-30'),
TIMESTAMP('2014-10-19'))
)
I'm new to Airflow and I'm currently stuck on an issue with the BigQuery operator.
I'm trying to execute a simple query on a table from a given dataset and copy the result to a new table in the same dataset. I'm using the BigQuery operator to do so, since according to the docs the 'destination_dataset_table' parameter is supposed to do exactly what I'm looking for (source: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/bigquery_operator/index.html#airflow.contrib.operators.bigquery_operator.BigQueryOperator).
But instead of copying the data, all I get is a new empty table with the schema of the one I'm querying from.
Here's my code
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'end_date': datetime(2019, 1, 3),
    'retries': 10,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    dag_id='my_dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1)
)

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    sql="SELECT some_columns,x,y,z FROM dataset_d.table_t WHERE some_columns=some_value",
    destination_dataset_table='dataset_d.table_u',
    bigquery_conn_id='something',
)
I don't get any warnings or errors; the code runs and the tasks are marked as success. It does create the table I wanted, with the columns I specified, but it is totally empty.
Any idea what I'm doing wrong?
EDIT: I tried the same code on a much smaller table (from 10 GB down to a few KB), performing a query with a much smaller result (from 500 MB down to a few KB), and it did work this time. Do you think the size of the table or of the query result matters? Is it limited? Or does performing too large a query cause a lag of some sort?
EDIT2: After a few more tests I can confirm that this issue is not related to the size of the query or the table. It seems to have something to do with the date format. In my code the WHERE condition is actually checking whether a date_column = 'YYYY-MM-DD'. When I replace this condition with an int or string comparison it works perfectly. Do you know if BigQuery uses a particular date format or requires a particular syntax?
EDIT3: Finally getting somewhere: when I cast my date_column as a date (CAST(date_column AS DATE)) to force its type to DATE, I get an error saying that my field is actually an int32 (argument type mismatch). But I'm SURE that this field is a date, which implies that either BigQuery stores it as an int while displaying it as a date, or that the BigQuery operator does some kind of hidden type conversion while loading tables. Any idea how to fix this?
I had a similar issue when transferring data from data sources other than BigQuery.
I suggest formatting the date_column as follows: to_char(date_column, 'YYYY-MM-DD') as date
In general, I have seen that BigQuery's schema auto-detection is often problematic. The safest way is to always specify the schema before executing the corresponding query, or to use operators that support schema definition; see the sketch below.
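A minimal sketch of that approach, assuming standard SQL is available and guessing at the column types (the real schema isn't shown in the question): create the destination table explicitly once, so that any date field is pinned to DATE instead of being left to auto-detection, and then let the operator write into it.

-- Hypothetical DDL: the column names come from the question's SELECT list,
-- but every type here is an assumption; declare date fields as DATE explicitly.
CREATE TABLE IF NOT EXISTS `dataset_d.table_u` (
  some_columns STRING,
  x INT64,
  y FLOAT64,
  z DATE
);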
I am running into something that appears to be a global BigQuery issue that started maybe only a few days ago. It was definitely working on Jan 7th 2019. I narrowed down the issue to a simple SELECT * FROM TABLE which throws a Cannot read field 'records' of type INT64 as UINT64. The records field is declared as INTEGER in the schema and the table is a result of an aggregate query.
I am getting the same error both programmatically as well as in BigQuery UI.
If I explicitly list STRING fields, the query works. As soon as I reference records which is INTEGER, the query fails.
Job id is dulcet-outlook-94110:US.bquxjob_5883645e_16858aba0ae.
Alternatively, anyone can reproduce this using public data by saving the following query into a temp table and then doing a simple SELECT * from temp.
SELECT state, count(*) cnt FROM [bigquery-public-data:samples.natality]
group by state
This gives a slightly different but essentially the same error: Type mismatch for column 'cnt' in table temp. Expected type 'uint64', actual type 'int64' in file :mdb=cloud-dataengine.
(EDIT: Make sure to use "Allow Large Results"; otherwise the query will work fine and the error won't reproduce.)
Thank you for raising this. This is indeed a bug in BigQuery; a fix has now been completely rolled out.
For the broken tables, although no data is lost, their state is inconsistent with the schema. So please try to regenerate them if you can, as for now their schemas won't fix themselves automatically. We are working on ways to fix the schema of the existing affected tables, but it might take some time.
If you still have any problems, feel free to report them to the public issue tracker wpfwannabe created above.
Background:
I have two datasets on BigQuery.
Dataset 1 is named '12345678' with tables having the names 'ga_sessions_yyyymmdd'. For example, the table names are like ga_sessions_20140721, ga_sessions_20150413 etc.
Dataset 2 is named 'DestinationTables'. The table names are in the format yyyymmdd. For example, 20140721, 20150413 etc.
Problem:
Using TABLE_DATE_RANGE(), I ran the following query on Dataset 1:
SELECT
[fullVisitorId] AS [fullVisitorId]
FROM TABLE_DATE_RANGE([12345678.ga_sessions_],TIMESTAMP('2014-07-21'),TIMESTAMP('2014-07-25'));
This query successfully runs.
I now run a similar query on Dataset 2:
SELECT
[fullVisitorId] AS [fullVisitorId]
FROM TABLE_DATE_RANGE([DestinationTables.],TIMESTAMP('2014-07-21'),TIMESTAMP('2014-07-25'));
However, this errors out with the message:
Error: Can't parse table: DestinationTables
Why is this happening? Any insight on this would be greatly appreciated.
Thanks in advance!
The syntax for identifying a dataset and a table prefix is correct in your first example:
[12345678.ga_sessions_]
And as explained in the docs for this function, it will expand to cover tables (in dataset 12345678) of the format:
ga_sessions_yyyymmdd
However, in your second example, the identifier stops with just a dot where it should continue to identify a table prefix. I think the issue is that you have no prefix and so the naked dot on the end of the string is confusing the interpreter.
You may need to change your tables to have some kind of prefix, even if it's just an underscore, so that you can properly specify the prefix when calling TABLE_DATE_RANGE, as in the sketch below.
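For example (a sketch only, assuming the tables in DestinationTables were copied or renamed to carry a hypothetical s_ prefix, e.g. s_20140721, s_20150413):

SELECT
[fullVisitorId] AS [fullVisitorId]
FROM TABLE_DATE_RANGE([DestinationTables.s_],TIMESTAMP('2014-07-21'),TIMESTAMP('2014-07-25'));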
I am using Spark with Scala and trying to get data from a database using JdbcRDD.
val rdd = new JdbcRDD(sparkContext,
driverFactory,
testQuery,
rangeMinValue.get,
rangeMaxValue.get,
partitionCount,
rowMapper)
.persist(StorageLevel.MEMORY_AND_DISK)
Within the query there are no ? placeholders to set (since the query is quite long I am not including it here). So I get an error saying:
java.sql.SQLException: Parameter index out of range (1 > number of parameters, which is 0).
I have no idea what the problem is. Can someone suggest any kind of solution?
I got the same problem.
I used this:
SELECT * FROM tbl WHERE ... AND ? = ?
and then called it with lower bound 1, upper bound 1 and 1 partition.
It will always run only one partition.
Your problem is that Spark expects your query string to contain a couple of ? parameters; see the sketch after the quoted explanation below.
From Spark user list:
In order for Spark to split the JDBC query in parallel, it expects an
upper and lower bound for your input data, as well as a number of
partitions so that it can split the query across multiple tasks.
For example, depending on your data distribution, you could set an
upper and lower bound on your timestamp range, and spark should be
able to create new sub-queries to split up the data.
Another option is to load up the whole table using the HadoopInputFormat
class of your database as a NewHadoopRDD.
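As a rough sketch of what that looks like in practice (id is a hypothetical numeric key here; any evenly distributed numeric or timestamp column can play the same role), the SQL string handed to JdbcRDD should contain two ? placeholders that Spark fills with each partition's lower and upper bound:

-- Hypothetical bounded query for JdbcRDD: Spark substitutes each
-- partition's lower and upper bound into the two ? placeholders.
SELECT *
FROM tbl
WHERE id >= ? AND id <= ?

With rangeMinValue and rangeMaxValue set to the smallest and largest id, and partitionCount partitions, each task then reads a disjoint slice of the table.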
I'm relatively new to SQL and have been attempting to run a script that returns the number of days that have passed between two points in time. I understand how this should look based on your website, but for some reason when I input the values, my database returns the following error:
ProgrammingError: ERROR: column "day" does not exist
The code I'm using is:
select datediff(day, '2014-01-01', '2014-02-01')
I assume I'm missing something very simple (this is a hugely basic query, I'm sure), but I would appreciate any assistance. I've variously tried pointing it towards the specific table I want to draw from, but it keeps stumbling on this error.
datediff() is not a PostgreSQL function (it exists in SQL Server and some other databases), which is why PostgreSQL tries to interpret day as a column and complains that it does not exist. If you are doing this in PostgreSQL, then use:
select DATE_PART('day', '2014-02-01'::timestamp - '2014-01-01'::timestamp)
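Alternatively, a small sketch with the same dates as the question: subtracting two date values in PostgreSQL yields the number of days directly as an integer.

-- date - date returns an integer number of days (31 here)
SELECT '2014-02-01'::date - '2014-01-01'::date AS days_between;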