Airflow - Bigquery operator not working as intended - sql

I'm new to Airflow and I'm currently stuck on an issue with the BigQuery operator.
I'm trying to execute a simple query on a table from a given dataset and copy the result to a new table in the same dataset. I'm using the BigQuery operator to do so, since according to the docs the 'destination_dataset_table' parameter is supposed to do exactly what I'm looking for (source: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/bigquery_operator/index.html#airflow.contrib.operators.bigquery_operator.BigQueryOperator).
But instead of copying the data, all I get is a new empty table with the schema of the one I'm querying from.
Here's my code:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'end_date': datetime(2019, 1, 3),
    'retries': 10,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    dag_id='my_dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    sql="SELECT some_columns,x,y,z FROM dataset_d.table_t WHERE some_columns=some_value",
    destination_dataset_table='dataset_d.table_u',
    bigquery_conn_id='something',
)
I don't get any warnings or errors, the code runs and the tasks are marked as successful. It does create the table I wanted, with the columns I specified, but it is completely empty.
Any idea what I'm doing wrong?
EDIT: I tried the same code on a much smaller table (from 10 GB down to a few KB), with a query producing a much smaller result (from 500 MB down to a few KB), and it did work this time. Do you think the size of the table or of the query result matters? Is there a limit? Or does running too large a query cause a lag of some sort?
EDIT2: After a few more tests I can confirm that this issue is not related to the size of the query or of the table. It seems to have something to do with the date format. In my code the WHERE condition actually checks whether date_column = 'YYYY-MM-DD'. When I replace this condition with an int or string comparison it works perfectly. Do you know whether BigQuery uses a particular date format or requires a particular syntax?
EDIT3: Finally getting somewhere: when I cast my date_column as a date (CAST(date_column AS DATE)) to force its type to DATE, I get an error saying that my field is actually an int32 (argument type mismatch). But I'm SURE that this field is a date, so that implies that either BigQuery stores it as an int while displaying it as a date, or that the BigQuery operator does some kind of hidden type conversion while loading tables. Any idea how to fix this?

I had a similar issue when transferring data from data sources other than BigQuery.
I suggest formatting the date_column as follows: to_char(date_column, 'YYYY-MM-DD') as date
In general, I have seen that BigQuery's schema auto-detection is often problematic. The safest way is to always specify the schema before executing the corresponding query, or to use operators that support schema definitions.
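As a rough illustration only (reusing the dag object and imports from the question; the operator arguments below are existing BigQueryOperator parameters, but the date handling in the SQL is an assumption about how date_column is actually stored), the task could switch to standard SQL and make the date comparison explicit:

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    # Standard SQL has proper DATE literals and explicit conversion functions.
    use_legacy_sql=False,
    sql="""
        SELECT some_columns, x, y, z
        FROM dataset_d.table_t
        -- Assumption: date_column really is a DATE. If it turns out to be an
        -- INT64 (e.g. days since the epoch), convert it first, for instance
        -- with DATE_FROM_UNIX_DATE(date_column).
        WHERE date_column = DATE '2019-01-01'
    """,
    destination_dataset_table='dataset_d.table_u',
    # Overwrite the destination on reruns instead of requiring an empty table.
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='something',
)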


Cannot do a where clause with jOOQ DSL in Kotlin/Java

I am trying to run a query of the following form with jooq in Kotlin:
val create = DSL.using(SQLDialect.POSTGRES)
val query: Query = create.select().from(DSL.table(tableName))
.where(DSL.field("timestamp").between("1970-01-01T00:00:00Z").and("2021-11-05T00:00:00Z"))
.orderBy(DSL.field("id").desc())
The code above gives me:
syntax error at or near "and"
Also, looking at this query in the debugger, the query.sql renders as:
select * from data_table where timestamp between ? and ? order by id desc
I am not sure whether the ? indicates that it could not render the values to SQL, or whether they are some sort of placeholders.
Also, the code works without the where chain.
Additionally, on the Postgres command line I can run the following and the query executes:
select * from data_table where timestamp between '1970-01-01T00:00:00Z' and '2021-11-05T00:00:00Z' order by id
Querying the datatypes on the schema, the timestamp column type is rendered as timestamp without time zone.
Before, I had declared the variables as:
val lowFilter = "1970-01-01T00:00:00Z"
val highFilter = "2021-11-05T00:00:00Z"
and this did not work, so it seems passing raw strings does not work either. I am very new to this, so I am pretty sure I am messing up the usage here.
EDIT
Following @nulldroid's suggestion, I did something like:
.where(DSL.field("starttime").between(DSL.timestamp("1970-01-01T00:00:00Z")).and(DSL.timestamp("2021-11-05T00:00:00Z")))
and this resulted in:
Type class org.jooq.impl.Val is not supported in dialect POSTGRES
Not using the code generator:
I'm going to assume you have a good reason not to use the code generator for this particular query, the main reason usually being that your schema is dynamic.
So, the correct way to write your query is this:
create.select()
    .from(DSL.table(tableName))
    // Attach a DataType to your timestamp field, to let jOOQ know about this
    .where(DSL.field("timestamp", SQLDataType.OFFSETDATETIME)
        // Use bind values of a temporal type
        .between(OffsetDateTime.parse("1970-01-01T00:00:00Z"))
        .and(OffsetDateTime.parse("2021-11-05T00:00:00Z")))
    .orderBy(DSL.field("id").desc())
Notice how I'm using actual temporal data types, not strings, to compare dates and declare fields.
I'm assuming, from your question's UTC timestamps, that you're using TIMESTAMPTZ. Otherwise, if you're using TIMESTAMP, just replace OffsetDateTime with LocalDateTime...
Using the code generator
If using the code generator, which is always recommended if your schema isn't dynamic, you'd write almost the same thing as above, but in a type-safe way:
create.select()
    .from(MY_TABLE)
    // Attach a DataType to your timestamp field, to let jOOQ know about this
    .where(MY_TABLE.TIMESTAMP
        // Use bind values of a temporal type
        .between(OffsetDateTime.parse("1970-01-01T00:00:00Z"))
        .and(OffsetDateTime.parse("2021-11-05T00:00:00Z")))
    .orderBy(MY_TABLE.ID.desc())

Unable to load partitions in Athena with case sensitivity ON

I have data in S3 that is partitioned in a YYYY/MM/DD/HH/ structure (not year=YYYY/month=MM/day=DD/hour=HH).
I set up a Glue crawler for this, which creates a table in Athena, but when I query the data in Athena it gives an error because one field has a duplicate name (URL and url, which the SerDe converts to lowercase, causing a name conflict).
To fix this, I manually create another table (using the above table definition from SHOW CREATE TABLE), adding 'case.insensitive'= FALSE to the SERDEPROPERTIES:
WITH SERDEPROPERTIES ('paths'='deviceType,emailId,inactiveDuration,pageData,platform,timeStamp,totalTime,userId','case.insensitive'= FALSE)
I changed the S3 directory structure to the Hive-compatible naming year=/month=/day=/hour=, created the table with 'case.insensitive'= FALSE, and then ran the MSCK REPAIR TABLE command for the new table, which loads all the partitions.
(Complete CREATE TABLE QUERY)
But upon querying, I can only see one data column (platform) plus the partition columns; the rest of the columns are not parsed. Yet I actually copied the Glue-generated CREATE TABLE query, only adding the 'case.insensitive'= FALSE property.
How can I fix this?
I think you have multiple separate issues here: one with the crawler, one with the serde, and one with duplicate keys:
Glue Crawler
If Glue Crawlers delivered on what they promise, they would be a fairly good solution for most situations and would save us from writing the same code over and over again. Unfortunately, if you stray outside of the (undocumented) use cases Glue Crawlers were designed for, you often end up with various issues, from the strange to the completely broken (see for example this question, this question, this question, this question, this question, or this question).
I recommend that you skip Glue Crawler and instead write the table DDL by hand (you have a good template in what the crawler created, it just isn't good enough). Then write a Lambda function (or shell script) that you run on a schedule to add new partitions.
Since your partitioning is only on time, this is a fairly simple script: it just needs to run every once in a while and add the partition for the next period.
It looks like your data is from Kinesis Data Firehose, which produces a partitioned structure at hour granularity. Unless you have lots of data coming in every hour, I recommend you create a table that is partitioned only on date, and run the Lambda function or script once per day to add the next day's partition.
A benefit of not using Glue Crawler is that you don't have to have a one-to-one correspondence between path components and partition keys. You can have a single partition key typed as date, and add partitions like this: ALTER TABLE foo ADD PARTITION (dt = '2020-05-13') LOCATION 's3://some-bucket/data/2020/05/13/'. This is convenient because it's much easier to do range queries on a full date than when the components are separate.
If you really need hourly granularity you can either have two partition keys, one which is the date and one the hour, or just the one with the full timestamp, e.g. ALTER TABLE foo ADD PARTITION (ts = '2020-05-13 10:00:00') LOCATION 's3://some-bucket/data/2020/05/13/10/'. Then run the Lambda function or script every hour, adding the next hour's partition.
Having too granular a partitioning scheme doesn't help performance, and can instead hurt it (although the performance hit comes mostly from the small files and the many directories).
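A minimal sketch of such a partition-adding Lambda (assuming boto3, a single date-typed partition key named dt, and placeholder bucket, database, and table names) could look like this:

import datetime

import boto3

athena = boto3.client("athena")

def add_tomorrows_partition(event, context):
    # Add tomorrow's partition ahead of time so data can land in it as it arrives.
    tomorrow = datetime.date.today() + datetime.timedelta(days=1)
    location = f"s3://some-bucket/data/{tomorrow:%Y/%m/%d}/"
    athena.start_query_execution(
        QueryString=(
            f"ALTER TABLE foo ADD IF NOT EXISTS "
            f"PARTITION (dt = '{tomorrow:%Y-%m-%d}') LOCATION '{location}'"
        ),
        QueryExecutionContext={"Database": "default"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://some-bucket/athena-query-results/"},
    )

For hourly granularity the same script would simply compute the next hour instead of the next day.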
SerDe config
As for the reason why you're only seeing the value of the platform column, it's because it's the only case where the column name and property have the same casing.
It's a bit surprising that the DDL you link to doesn't work, but I can confirm that it really doesn't. I tried creating a table from that DDL, but without the pagedata column (I also skipped the partitioning, but that shouldn't make a difference for the test), and indeed only the platform column had any value when I queried the table.
However, when I removed the case.insensitive serde property it worked as expected, which got me thinking that it might not work the way you think it does. I tried setting it to TRUE instead of FALSE, which made the table work as expected again. I think we can conclude from this that the Athena documentation is just wrong when it says "By default, Athena requires that all keys in your JSON dataset use lowercase". In fact, what happens is that Athena lowercases the column names, but it also lowercases the property names when reading the JSON.
With further experimentation it turned out the paths property was redundant too. This is a table that worked for me:
CREATE EXTERNAL TABLE `json_case_test` (
`devicetype` string,
`timestamp` string,
`totaltime` string,
`inactiveduration` int,
`emailid` string,
`userid` string,
`platform` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
I'd say that case.insensitive seems to cause more problems than it solves.
Duplicate keys
When I added the pagedata column (as struct<url:string>) and added "pageData":{"URL":"URL","url":"url"} to the data, I got the error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"
And I got the error regardless of whether the pagedata column was involved in the query or not (e.g. SELECT userid FROM json_case_test also errored). I tried the case.insensitive serde property with both TRUE and FALSE, but it had no effect.
Next, I took a look at the source documentation for the serde, which first of all is worded much better, and secondly contains the key piece of information: that you also need to provide mappings for the columns when you turn off case insensitivity.
With the following serde properties I was able to get the duplicate key issue to go away:
WITH SERDEPROPERTIES (
  "case.insensitive" = "false",
  "mapping.pagedata" = "pageData",
  "mapping.pagedata.url" = "pagedata.url",
  "mapping.pagedata.url2" = "pagedata.URL"
)
You would have to provide mappings for all the columns except for platform, too.
Alternative: use JSON functions
You mentioned in a comment to this answer that the schema of the pageData property is not constant. This is another case where Glue Crawlers unfortunately don't really work. If you're unlucky you'll end up with a flapping schema that includes some properties some days (see for example this question).
What I realised when I saw your comment is that there is another solution to your problem: set up the table manually (as described above) and use string as the type for the pagedata column. Then you can use functions like JSON_EXTRACT_SCALAR to extract the properties you want during query time.
This solution trades increased complexity of the queries for way fewer headaches trying to keep up with an evolving schema.
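As a hedged sketch of what that looks like in practice (reusing the json_case_test table and property names from above, assuming pagedata has been declared as a string column; the boto3 call, database, and output location are placeholders), properties are extracted per query instead of being frozen into the table schema:

import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    # JSON_EXTRACT_SCALAR pulls a single property out of the raw JSON string.
    QueryString="""
        SELECT userid,
               JSON_EXTRACT_SCALAR(pagedata, '$.url') AS url
        FROM json_case_test
    """,
    QueryExecutionContext={"Database": "default"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://some-bucket/athena-query-results/"},
)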

Doing a SELECT * from TABLE gives "Cannot read field 'records' of type INT64 as UINT64"

I am running into something that appears to be a global BigQuery issue that started maybe only a few days ago. It was definitely working on Jan 7th 2019. I narrowed the issue down to a simple SELECT * FROM TABLE which throws Cannot read field 'records' of type INT64 as UINT64. The records field is declared as INTEGER in the schema, and the table is the result of an aggregate query.
I am getting the same error both programmatically as well as in BigQuery UI.
If I explicitly list STRING fields, the query works. As soon as I reference records which is INTEGER, the query fails.
Job id is dulcet-outlook-94110:US.bquxjob_5883645e_16858aba0ae.
Alternatively, anyone can reproduce this using public data by saving the result of the following query into a temp table and then doing a simple SELECT * FROM temp.
SELECT state, count(*) cnt FROM [bigquery-public-data:samples.natality]
group by state
This gives a slightly different but essentially the same error: Type mismatch for column 'cnt' in table temp. Expected type 'uint64', actual type 'int64' in file :mdb=cloud-dataengine.
(EDIT: Make sure to use "Allow Large Results"; otherwise it will work fine.)
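For reference, here is roughly how the reproduction looks from Python (a sketch using the google-cloud-bigquery client; the project and dataset names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# "Allow Large Results" requires legacy SQL plus an explicit destination table.
job_config = bigquery.QueryJobConfig(
    use_legacy_sql=True,
    allow_large_results=True,
    destination=bigquery.TableReference.from_string("my-project.my_dataset.temp"),
)
client.query(
    "SELECT state, count(*) cnt FROM [bigquery-public-data:samples.natality] GROUP BY state",
    job_config=job_config,
).result()

# Reading the aggregated table back in full is what triggers the error.
client.query(
    "SELECT * FROM [my-project:my_dataset.temp]",
    job_config=bigquery.QueryJobConfig(use_legacy_sql=True),
).result()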
Thank you for raising this. This is indeed a bug in BigQuery; a fix has now been fully rolled out.
For the broken tables, although no data is lost, their state is inconsistent with the schema. So please try to regenerate them if you can, since for now their schemas won't fix themselves automatically. We are working on ways to fix the schemas of the existing affected tables, but it might take some time.
If you still have any problems, feel free to report them on the public issue tracker wpfwannabe created above.

Creating a UDF in BigQuery

I would like to create a UDF named maxDate in BigQuery that does the following:
maxDate('table_name') returns the result from running the query below:
select max(table_id) from fact.__TABLES__ where table_id < 'table_name';
I'm quite new to JS and not too sure how to start. This looks like a simple thing to write. Could anyone point me in the right direction? I've read the documentation, and I'm unsure how to write this.
Scalar UDFs do not exist yet in BigQuery.
See more about BigQuery User-Defined Functions to understand what they are today.
To simplify: think of today's UDF as a virtual table that you can query. This virtual table is in turn powered by a real table whose rows are processed one by one; JavaScript code is applied to each input row and generates, in its place, zero, one, or many rows (depending on the logic implemented in JS).

Error: Schema changed for Timestamp field (additional)

I am getting an error message when I query a specific table in my dataset that has a nullable timestamp field. In the BigQuery web tool, I run a simple query, e.g.:
SELECT * FROM [reztrack.201401] LIMIT 100
The result I get is: Error: Schema changed for Timestamp field date
Example Job ID: esiteisthebomb:job_6WKi7ZhSi8D_Ewr8b5rKV-a5Eac
This is the exact same issue that was noted here: Error: Schema changed for Timestamp field.
I also logged this under https://code.google.com/p/google-bigquery/issues/detail?id=307, but I was unsure since it said we should be logging everything on Stack Overflow.
Any information on how to fix this for this or other tables would be greatly appreciated.
Note: the original answer says to contact Google support, but Google support for BigQuery was moved to Stack Overflow. Therefore I assume that means opening this as a new question in the hope that the engineers will respond.
BigQuery recently improved the representation of its internal timestamp format (there had previously been a lot of cases where timestamps broke in strange ways, and this change should fix that). Your table was still using the old timestamp format, and you hit a bug in the old format that is triggered when schemas change (in this case, the field went from REQUIRED to OPTIONAL).
We have an automated process that coalesces tables to make their storage more efficient. I scheduled this to run over your table, and have verified that it has rewritten your table using the new timestamp format.
You should now be able to query this field of your table without further problems.