How to query date-partitioned Google BigQuery table using AWS Glue BigQuery Connector?

I have linked Firebase events to BigQuery and my goal is to pull the events into S3 from BigQuery using AWS Glue.
When you link Firebase to BigQuery, it creates a default dataset and date-partitioned event tables, one per day, something like this:
analytics_456985675.events_20230101
analytics_456985675.events_20230102
I'm used to querying the events in BigQuery using
SELECT
...
FROM analytics_456985675.events_*
WHERE date >= [date]
However, when configuring the Glue ETL job, it refuses to accept this format for the table (analytics_456985675.events_*) and I get an error message; it seems the Glue job will only work when I specify a single table.
How can I create a Glue ETL job that pulls data from BigQuery incrementally if I have to specify a single partition table?
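One possible approach: the Glue BigQuery connector is built on the open-source Spark BigQuery connector, which can push an arbitrary query down to BigQuery instead of reading a single table, so the wildcard query can be issued from inside the Glue job. A hedged PySpark sketch, assuming the connector in use exposes the standard query, viewsEnabled and materializationDataset options and that GCP credentials are already configured through the Glue connection (project, bucket and date values below are placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Push the wildcard query down to BigQuery. The query/viewsEnabled/
# materializationDataset options come from the open-source Spark BigQuery
# connector and may differ in other connector versions.
events = (
    spark.read.format("bigquery")
    .option("parentProject", "my-gcp-project")          # placeholder project
    .option("viewsEnabled", "true")
    .option("materializationDataset", "analytics_456985675")
    .option(
        "query",
        "SELECT * FROM `analytics_456985675.events_*` "
        "WHERE _TABLE_SUFFIX >= '20230101'",             # placeholder date
    )
    .load()
)

# Write the pulled events to S3 (placeholder bucket/prefix).
events.write.mode("append").parquet("s3://my-bucket/firebase-events/")

Because the Storage Read API does not read wildcard tables directly, pushing the filter into _TABLE_SUFFIX via a query is the usual way to get incremental pulls across the daily shards.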

Related

How can we update existing partition data in aws glue table without running crawler?

When we update data in an existing partition by manually uploading to the S3 bucket, the data shows up in the existing partition of the Athena/Glue table.
But when the data is updated through the API, the object is uploaded to the existing partition in the S3 bucket, yet in the Glue table the data is stored in a different partition corresponding to the current date [last modified] (August 2, 2022, 17:52:15 (UTC+05:30)), while the partition date in my S3 bucket is different (s3://aiq-grey-s3-sink-created-at-partition/topics/core.Test.s3/2022/07/19/), i.e. 2022/07/19.
So when I check the same object in the Glue table, I want it partitioned by that date, 2022/07/19, but it shows up under the current-date partition unless I run the crawler.
When I run the crawler it writes the data to the correct partition, but I don't want to run the crawler every single time.
How can I update data in an existing partition of a Glue table by using the API? Am I missing some configuration needed to achieve the required result? Please suggest if anybody has an idea on this.
Here are two solutions I proposed:
Use boto3 to run an Athena query that alters the partition: ALTER TABLE ADD PARTITION
import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString='ALTER TABLE table ADD PARTITION ... LOCATION ... ',  # compose the query as you need
    QueryExecutionContext={
        'Database': database
    },
    ResultConfiguration={
        'OutputLocation': output,
    }
)
Use boto3 to create the partition via the Glue Data Catalog: glue.Client.create_partition(**kwargs)
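For the second option, a hedged boto3 sketch; the database, table, partition values and formats below are placeholders, and in practice you would copy the StorageDescriptor settings from the table definition returned by glue.get_table so the new partition matches the table's SerDe:

import boto3

glue = boto3.client('glue')

# Placeholder names and formats; 'Values' must match the table's partition
# keys, and the SerDe/formats should mirror the parent table's
# StorageDescriptor.
glue.create_partition(
    DatabaseName='my_database',
    TableName='my_table',
    PartitionInput={
        'Values': ['2022', '07', '19'],
        'StorageDescriptor': {
            'Location': 's3://aiq-grey-s3-sink-created-at-partition/topics/core.Test.s3/2022/07/19/',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'
            },
        },
    },
)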

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which reads messages from Pub/Sub Lite and streams the data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
the BigQuery UI tells me "This query will process 1.9 GB when run", but when actually running the query I don't get any results. My pipeline has been running for a whole week now and I am getting the same result for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions while at the same time telling me there is actually data available?
EDIT 1:
BigQuery reports bytes to process but returns no rows because it is also counting the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
Check more details in this Stack Overflow thread and also in the documentation available here:
"When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column."
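One way to confirm this is to inspect the table's streaming buffer statistics; a hedged sketch using the google-cloud-bigquery Python client (the table id is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my-dataset.my-table")  # placeholder table id

# Rows still in the streaming buffer have not been committed to a partition
# yet, so they can be missing from partition-filtered query results.
buf = table.streaming_buffer
if buf:
    print(f"~{buf.estimated_rows} rows (~{buf.estimated_bytes} bytes) still buffered")
    print(f"oldest buffered entry: {buf.oldest_entry_time}")
else:
    print("streaming buffer is empty; all rows have been committed")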
If you are having problems writing the data from Pub/Sub to BigQuery, I recommend using one of the Dataflow templates available in GCP: there is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job as sketched below:
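A hedged sketch of launching the public PubSub_Subscription_to_BigQuery template programmatically via the Dataflow templates API; all project, subscription, table and bucket names below are placeholders:

from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

# Launch the Google-provided Pub/Sub subscription -> BigQuery template.
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",                      # placeholder project
    location="us-central1",                      # placeholder region
    gcsPath="gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery",
    body={
        "jobName": "pubsub-to-bq-test",
        "parameters": {
            "inputSubscription": "projects/my-project/subscriptions/test-topic-sub",
            "outputTableSpec": "my-project:my_dataset.output_table",
        },
        "environment": {"tempLocation": "gs://my-temp-bucket/tmp"},
    },
)
response = request.execute()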
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template's code on GitHub and adjust it to your needs.

Table replication in Bigquery

I need to replicate tables from Prod to Test within BigQuery. Apart from BQ export/import, please let me know if there are any replication utilities/tools within BigQuery.
Thanks.
To copy a table in BigQuery you can use several methods:
bq tool
Transfer the entire dataset using the transfer tool
Copy using the BigQuery UI
You can also query the table and write its results to a new table
You can try these options:
BigQuery Data Transfer Service:
https://cloud.google.com/bigquery-transfer/docs/working-with-transfers
Copy Tables:
bq cp source-project:dataset.table target-project:dataset.table
CREATE TABLE AS SELECT (CTAS):
CREATE TABLE `target-project.dataset.table` AS SELECT * FROM `source-project.dataset.table`
BigQuery API Client Libraries:
https://cloud.google.com/bigquery/docs/reference/libraries
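Beyond the options above, a hedged sketch using the google-cloud-bigquery Python client; the project, dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers; replace with your own projects/datasets/tables.
source = "source-project.dataset.table"
dest = "target-project.dataset.table"

copy_job = client.copy_table(source, dest)  # returns a CopyJob
copy_job.result()                           # block until the copy completes
print(f"Copied {source} to {dest}")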

Export DynamoDB Tables with dynamically generated names to S3

I am storing time series data in DynamoDB tables that are generated daily (Example). The naming convention of the tables is "timeseries_2019-12-20", where 2019-12-20 takes the current days date. I want to send the previous days table to an S3 bucket in a CSV format. What is the recommended method for this? I was looking at AWS Glue but not seeing how to have it find the new table name each day. Maybe a lambda function with a cloudwatch event would be better? The DynamoDB tables are not large in size, a few hundred stored numbers.
You can achieve this by following the steps below, assuming you are using boto3 (Python) in Lambda:
Calculate yesterday's date from today's date.
Pass this date, prefixed with the matching table name, as DynamoDBTargets to the Glue create/update crawler boto3 API call [1] and start the crawler.
Once the crawler finishes creating the table in the Glue Data Catalog, you can import it into a Glue ETL job and convert it to CSV.
Create a Lambda trigger for the DynamoDB table so that the Glue crawler is triggered, or schedule the crawler to run at some point every day. A minimal sketch of the Lambda side follows below.
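A hedged boto3 sketch of the Lambda logic; the crawler name "dynamodb-to-s3-crawler" is a placeholder and is assumed to have been created beforehand:

import boto3
from datetime import datetime, timedelta, timezone

glue = boto3.client("glue")

CRAWLER_NAME = "dynamodb-to-s3-crawler"  # placeholder crawler name

def lambda_handler(event, context):
    # Yesterday's table, e.g. "timeseries_2019-12-20"
    yesterday = datetime.now(timezone.utc) - timedelta(days=1)
    table_name = f"timeseries_{yesterday:%Y-%m-%d}"

    # Point the crawler at yesterday's DynamoDB table, then start it.
    glue.update_crawler(
        Name=CRAWLER_NAME,
        Targets={"DynamoDBTargets": [{"Path": table_name}]},
    )
    glue.start_crawler(Name=CRAWLER_NAME)
    return {"crawled_table": table_name}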

Move data from hive tables in Google Dataproc to BigQuery

We are doing the data transformations using Google Dataproc and all our data resides in Dataproc Hive tables. How do I transfer/move this data to BigQuery?
Transfer to BigQuery from Hive seems to follow a standard pattern:
dump your Hive tables into Avro files
load those files into BigQuery
See an example here: Migrate hive table to Google BigQuery
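The load step can also be scripted with the BigQuery Python client; a hedged sketch, assuming the Avro files exported from Hive are already in a Cloud Storage bucket (bucket, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder URIs and table id; the Avro files are assumed to have been
# exported from Hive (e.g. INSERT OVERWRITE DIRECTORY ... STORED AS AVRO)
# and copied to Cloud Storage.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-bucket/hive-export/my_table/*.avro",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish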
As mentioned above, take care about type compatibility between Hive/Avro/BigQuery.
And for the first run, I guess it would not hurt to do some validation by checking that the tables in both Hive and BigQuery contain the same data: https://github.com/bolcom/hive_compared_bq