Error: Schema changed for Timestamp field (additional) - google-bigquery

I am getting an error message when I query a specific table in my dataset that has a nullable timestamp field. In the BigQuery web tool, I run a simple query, e.g.:
SELECT * FROM [reztrack.201401] LIMIT 100
The result I get is: Error: Schema changed for Timestamp field date
Example Job ID: esiteisthebomb:job_6WKi7ZhSi8D_Ewr8b5rKV-a5Eac
This is the exact same issue that was noted here: Error: Schema changed for Timestamp field.
I also logged this under https://code.google.com/p/google-bigquery/issues/detail?id=307, but I was unsure since it said we should be logging everything on Stack Overflow.
Any information on how to fix this for this or other tables would be greatly appreciated.
Note: The original answer says to contact Google support, but Google support for BigQuery was moved to Stack Overflow, so I assume that means opening this as a new question in the hope that the engineers will respond.

BigQuery recently improved the representation of its internal timestamp format (there had previously been a lot of cases where timestamps broke in strange ways, and this change should fix that). Your table was still using the old timestamp format, and you tickled a bug in the old format that shows up when schemas change (in this case, the field went from REQUIRED to OPTIONAL).
We have an automated process that coalesces tables to make their storage more efficient. I scheduled this to run over your table, and have verified that it has rewritten your table using the new timestamp format.
You should now be able to query this field of your table without further problems.

Related

Is the column of type JSON deprecated?

In the BigQuery console, when creating a table, there used to be a JSON option among the column types, though weirdly enough it was never present in the docs. We used this column type in our production tables and discovered later on that you can't select it in queries (BigQuery throws an error), and the JSON functions didn't work with it either. So we simply stopped using this column in queries, but it still exists in our tables.
However, in the past couple of days, all queries against this table have been failing with the error 400 Json is not enabled for current project. and this column type is no longer present in the BigQuery console. It seems it was removed or deprecated? I checked the release notes, but the latest release was well before the error occurred. This broke our production environment, and we couldn't even export the data because exporting gave the same error. Instead we had to use a new table without this column, which meant we lost all our history.
Has anyone faced the same problem with other column types before? Is it normal for a type to be deprecated without users being notified beforehand? This is making me question the reliability of BigQuery.
Please reach out to Google Cloud support and we will help you fix your issue with that problematic table. You may also want to try fixing it yourself using the ALTER TABLE DROP COLUMN statement, which is currently in public preview [1]. This will drop the erroneous column (only the data in that column will be lost); the rest of the data will remain usable.
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_table_drop_column_statement
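For reference, a minimal sketch of that statement; the project, dataset, table, and column names here are hypothetical:

ALTER TABLE `my_project.my_dataset.my_table`
DROP COLUMN json_payload;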
I ran into the same error message a few days ago and was surprised to read about this policy change, which is not backed by a mitigation process. My attempt to follow Vlad Grachev's suggestion to drop this column did not prevail, as the console does not allow querying this table (same "Json is not enabled for current project." error).
My only remediation at this point is to:
build a new table where the JSON column is switched to type STRING (a sketch of this step follows after these steps)
create a pipeline that transforms the objects to strings
migrate the data through the pipeline to the new table
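A minimal sketch of the first step, with hypothetical dataset, table, and column names (payload standing in for the former JSON column):

CREATE TABLE `my_project.my_dataset.events_v2` (
  event_id STRING,
  created_at TIMESTAMP,
  payload STRING    -- formerly the JSON-typed column, now stored as plain STRING
);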
In BigQuery, JSON data can be stored in a column of type RECORD. Are you referring to the same thing by JSON column type?
BigQuery uses the RECORD (or STRUCT) type to represent nested structures. A column of RECORD type is in fact a large column containing multiple child columns. For more information, refer to the link below:
Json Data in BigQuery
If you are not referring to the RECORD data type, the JSON column type might have been a test feature that is not covered by the usual deprecation scheme.
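To illustrate the RECORD type, a small sketch with hypothetical names: child columns of a RECORD column (here device) are addressed with dot notation in standard SQL:

SELECT
  event_id,
  device.os,      -- child column of the RECORD column "device"
  device.model    -- another child column
FROM `my_project.my_dataset.events`
LIMIT 10;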

Airflow - Bigquery operator not working as intended

I'm new to Airflow and I'm currently stuck on an issue with the BigQuery operator.
I'm trying to execute a simple query on a table from a given dataset and copy the result to a new table in the same dataset. I'm using the BigQuery operator to do so, since according to the docs the 'destination_dataset_table' parameter is supposed to do exactly what I'm looking for (source: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/bigquery_operator/index.html#airflow.contrib.operators.bigquery_operator.BigQueryOperator).
But instead of copying the data, all I get is a new empty table with the schema of the one I'm querying from.
Here's my code:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'end_date': datetime(2019, 1, 3),
    'retries': 10,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    dag_id='my_dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    sql="SELECT some_columns,x,y,z FROM dataset_d.table_t WHERE some_columns=some_value",
    destination_dataset_table='dataset_d.table_u',
    bigquery_conn_id='something',
)
I don't get any warnings or errors; the code runs and the tasks are marked as success. It does create the table I wanted, with the columns I specified, but it is totally empty.
Any idea what I'm doing wrong?
EDIT: I tried the same code on a much smaller table (from 10 GB down to a few KB), performing a query with a much smaller result (from 500 MB down to a few KB), and it did work this time. Do you think the size of the table or of the query result matters? Is it limited? Or does performing too large a query cause a lag of some sort?
EDIT 2: After a few more tests I can confirm that this issue is not related to the size of the query or the table. It seems to have something to do with the date format. In my code the WHERE condition is actually checking whether date_column = 'YYYY-MM-DD'. When I replace this condition with an int or string comparison, it works perfectly. Do you know if BigQuery uses a particular date format or requires a particular syntax?
EDIT 3: Finally getting somewhere: when I cast my date_column as a date (CAST(date_column AS DATE)) to force its type to DATE, I get an error saying that my field is actually an int-32 (argument type mismatch). But I'm SURE that this field is a date, so that implies either that BigQuery stores it as an int while displaying it as a date, or that the BigQuery operator does some kind of hidden type conversion while loading tables. Any idea how to fix this?
I had a similar issue when transferring data from data sources other than BigQuery.
I suggest casting the date_column as follows: to_char(date_column, 'YYYY-MM-DD') as date
In general, I have seen that BigQuery's schema auto-detection is often problematic. The safest way is to always specify the schema before executing the corresponding query, or to use operators that support schema definition.
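As a point of comparison, a minimal sketch of an explicit date filter in BigQuery standard SQL, assuming the hypothetical table and columns from the question and that date_column really is a DATE:

SELECT some_columns, x, y, z
FROM `dataset_d.table_t`
WHERE date_column = DATE '2019-01-01'

If the column actually arrives as an INT64 (one possible reading of the int-32 error in EDIT 3), and if that integer happens to encode days since the Unix epoch, it would need an explicit conversion such as DATE_FROM_UNIX_DATE(date_column) before the comparison; that encoding is an assumption, not something the question confirms.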

How to translate internal BQ column userid (INTEGER) to e-mail

I'm extracting data from Scheduled Queries using this command
bq ls --transfer_config --transfer_location=us --format=csv
One column in the result is called userid (data type INTEGER), and it refers to the user who created the scheduled query, hence quite important information.
I'd like to transform this information into a more readable format, i.e. an e-mail address, but I haven't been able to find out how.
PS: I wonder why this internal value is used here. In other BigQuery system data, user names are always presented in readable e-mail format. (Maybe it's because Scheduled Queries is still in beta.)
userid has been deprecated according to the docs.
Please don't rely on that field. There is no replacement field in place right now. AFAIK you can only obtain this information from the audit logs.
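If you export Cloud Audit Logs to a BigQuery dataset, a heavily hedged sketch of pulling caller e-mails out of them could look like the following; the project, dataset, and table names are hypothetical, and the exact field paths depend on your export:

SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS caller_email
FROM `my_project.my_audit_dataset.cloudaudit_googleapis_com_activity_20190101`
LIMIT 100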

Doing a SELECT * from TABLE gives "Cannot read field 'records' of type INT64 as UINT64"

I am running into something that appears to be a global BigQuery issue that started maybe only a few days ago; it was definitely working on Jan 7th, 2019. I narrowed the issue down to a simple SELECT * FROM TABLE, which throws Cannot read field 'records' of type INT64 as UINT64. The records field is declared as INTEGER in the schema, and the table is the result of an aggregate query.
I am getting the same error both programmatically as well as in BigQuery UI.
If I explicitly list STRING fields, the query works. As soon as I reference records, which is INTEGER, the query fails.
Job id is dulcet-outlook-94110:US.bquxjob_5883645e_16858aba0ae.
Alternatively, anyone can reproduce this using public data by saving the following query into a temp table and then doing a simple SELECT * from temp.
SELECT state, COUNT(*) cnt
FROM [bigquery-public-data:samples.natality]
GROUP BY state
This gives a slightly different but essentially the same error: Type mismatch for column 'cnt' in table temp. Expected type 'uint64', actual type 'int64' in file :mdb=cloud-dataengine.
(EDIT: Make sure to use "Allow Large Results"; otherwise it will work fine.)
Thank you for raising this. This is indeed a bug in BigQuery; a fix has now been fully rolled out.
For the broken tables, although no data is lost, they are in a state that is inconsistent with their schema. So please try to regenerate them if you can, as their schemas won't automatically fix themselves for now. We are working on ways to fix the schemas of the existing affected tables, but it might take some time.
If you still have any problem, feel free to report it on the public issue tracker entry wpfwannabe created above.
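For the natality reproduction above, regenerating simply means re-running the producing query and saving the result into a fresh destination table with "Allow Large Results", so the data gets rewritten under a consistent schema; the destination table name is up to you and purely hypothetical here (e.g. mydataset.natality_by_state_v2):

SELECT state, COUNT(*) cnt
FROM [bigquery-public-data:samples.natality]
GROUP BY state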

Compare table schemas before starting a job

We are currently working on a project where we need to check whether the database schema has changed every time we start a Spoon job, since our origin is a third-party database over which we have little to no control.
The most obvious solution to us would be to create a script that calls a tool like apgdiff and then compares the schema to a previously generated schema file. If there were any change, we would then send a notification.
The question is basically: is this the best way to achieve this?
Any help would be appreciated.
Thanks for your time.
P.S.: I'm not sure whether Stack Overflow is the best place for this kind of question, so if not, please feel free to suggest a more appropriate forum.
Solution I:
Assuming it is a PostgreSQL database you are referring to, and you have sufficient privileges on INFORMATION_SCHEMA, I would suggest querying the database like so:
select column_name, data_type, character_maximum_length
from INFORMATION_SCHEMA.COLUMNS where table_name = '<name of table>';
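If you plan to diff this output against a stored file, it may help to make the row order deterministic; that is my own suggestion rather than part of the original answer, and a minimal variation would be:

select column_name, data_type, character_maximum_length
from information_schema.columns
where table_name = '<name of table>'
order by ordinal_position;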
Store the expected result in a persistent way, like you mentioned, and then just compare the results in a sub-transformation. The persistent schema could be a CSV file that stores the definition like so:
app_id character varying 255
platform character varying 255
etl_tstamp timestamp without time zone (null)
collector_tstamp timestamp without time zone (null)
dvce_tstamp timestamp without time zone (null)
event character varying 128
event_id character 36
Then simply compare the two files: (1) the file that holds the expected schema definition and (2) the file you just generated, fresh from the database. You could use the File Compare step to do so.
I hope this helps a bit.
EDIT:
Solution II:
Another solution you could apply: you can also use the Table Compare step (contributed by www.kjube.de) to compare two tables from different sources.
What's nice about this step is that you can specify two different connections for the two tables you are comparing.