I have set up a data transfer that automatically sends data from an S3 bucket to GCS. This transfer runs automatically every morning.
I created a table in BigQuery that reads the data from GCS; so far no problem.
Now my concern is that even though the files are updated daily in GCS, the BigQuery table that is supposed to consume the GCS parquet files doesn't seem to update every day.
What is the procedure for the table to consume the latest data in the GCS bucket?
For example, I created my table on 17 April.
My data transfer sent some files on the 19th,
but when I run select max(created_at) from mytable
it doesn't return the latest data.
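For reference, a minimal sketch (using the BigQuery Python client) of how such a table could be defined as an external table over the parquet files in GCS, so that any new files matching the wildcard are visible at query time; the project, dataset, and bucket names are placeholders, and it is only an assumption that the table was meant to be external rather than a one-off load:
from google.cloud import bigquery

client = bigquery.Client()

# External (federated) table: queries read the GCS files directly,
# so files added by the daily transfer are picked up automatically.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-transfer-bucket/daily/*.parquet"]  # placeholder path

table = bigquery.Table("my_project.my_dataset.mytable")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
If, on the other hand, the table was populated by a one-time load job, it will keep showing only the data present at load time until another load or transfer into it is scheduled.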
I have a table in BigQuery with 100 columns. Now I want to append more rows to it via Transfer but the new CSV has only 99 columns. How should I proceed with this?
I tried creating a schema and adding that column as NULLABLE, but it didn't work.
I am presuming your CSV file is stored in a GCS bucket and you are trying to use the BQ Data Transfer Service to load the data periodically on a schedule.
You cannot directly load/append the data into the BQ table due to the schema mismatch.
As an alternative, create a staging table named staging_table_csv with 99 columns and schedule a data transfer to load the CSV into this table in overwrite mode.
Then write a query to append the contents of this staging table staging_table_csv to the target BQ table.
The query might look like this:
#standardSQL
INSERT INTO `project.dataset.target_table`
SELECT
*,
<DEFAULT_VALUE> AS COL100
FROM
`project.dataset.staging_table_csv`
Now schedule this query to run after the staging table is loaded.
Make sure to keep a buffer between the staging table load and the target table load; you can perform trials to find a suitable buffer.
For example, if the transfer is scheduled at 12:00, schedule the target table load query at 12:05 or 12:10.
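If you prefer to set this up in code, a rough sketch using the Data Transfer Service Python client might look like the following; the project ID, schedule time, and the NULL default for COL100 are placeholder assumptions for illustration:
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# The append query from above; NULL as the default value and COL100 being
# the last column of the target table are assumptions for this sketch.
append_sql = """
INSERT INTO `project.dataset.target_table`
SELECT *, NULL AS COL100
FROM `project.dataset.staging_table_csv`
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="append staging_table_csv to target_table",
    data_source_id="scheduled_query",
    params={"query": append_sql},
    schedule="every day 12:05",  # a few minutes after the 12:00 CSV transfer
)

client.create_transfer_config(
    parent=client.common_project_path("your-project-id"),  # placeholder project
    transfer_config=transfer_config,
)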
Note: Creating an extra staging table incurs storage costs, but since it is overwritten on each load, no historical data cost accumulates.
We have one S3 bucket called Customers/.
Inside it we have multiple folders, with further subfolders inside them.
And finally we have parquet files of data.
Now I want to read any parquet file (not specific to one file) and load its data into Oracle.
For now my script works for one S3 path: it reads one parquet file, e.g. customer_info.parquet, and loads the data into an Oracle database table called customer.customer_info.
I need help writing a generic script that can read any parquet file and load its data into the corresponding database table.
For example:
S3 location: s3/Customers/new_customrers/new_customer_info.parquet
Oracle database: Customer
Oracle table: new_customers
S3 location: s3/Customers/old_customrers/old_customer_info.parquet
Oracle database: Customer
Oracle table: old_customers
S3 location: s3/Customers/current_customrers/current_customer_info.parquet
Oracle database: Customer
Oracle table: current_customers
Is there any way to make this copy process generic? The database will be the same; only the Oracle tables change according to the parquet file.
My current script is a PySpark script that reads one S3 file into a Spark dataframe and writes that dataframe to one Oracle table.
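A rough sketch of such a generic loader, assuming the path-to-table mapping is supplied explicitly; the JDBC URL, credentials, and bucket paths below are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-parquet-to-oracle").getOrCreate()

# Placeholder connection details; the Oracle JDBC driver must be on the classpath.
ORACLE_URL = "jdbc:oracle:thin:@//dbhost:1521/CUSTOMER"
ORACLE_PROPS = {"user": "app_user", "password": "secret",
                "driver": "oracle.jdbc.OracleDriver"}

def load_parquet_to_oracle(s3_path, table_name):
    """Read any parquet file/folder from S3 and append it to the given Oracle table."""
    df = spark.read.parquet(s3_path)
    df.write.jdbc(url=ORACLE_URL, table=table_name,
                  mode="append", properties=ORACLE_PROPS)

# Drive all loads from one explicit mapping (paths and tables as in the example above).
mappings = {
    "s3a://Customers/new_customrers/": "CUSTOMER.NEW_CUSTOMERS",
    "s3a://Customers/old_customrers/": "CUSTOMER.OLD_CUSTOMERS",
    "s3a://Customers/current_customrers/": "CUSTOMER.CURRENT_CUSTOMERS",
}
for path, table in mappings.items():
    load_parquet_to_oracle(path, table)
Instead of a hard-coded mapping, the table name could also be derived from the S3 folder name if a consistent naming convention is adopted.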
On GCP a new CSV is added every day at 2 am to a Cloud Storage folder (the name is something like mydata_yyyymmdd.csv).
I am trying to schedule a load of this file into a BigQuery table at 2:30 am every day.
I succeeded in creating a Dataflow pipeline that constantly watches my Cloud Storage folder and updates my BigQuery table whenever a new file is added, but I don't find this optimal because:
my Dataflow job runs all day long while I only need it once a day (increasing costs);
this solution doesn't create a new BigQuery table every day, it just appends every CSV to my BigQuery table.
Can you help me with the tools I should use in GCP to achieve this?
Thanks a lot for your help, it is very much appreciated
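For illustration, the daily load being described could look roughly like this with the BigQuery Python client, run once a day by whatever scheduler is chosen; the bucket, project, and dataset names are placeholders, and schema autodetection is only an assumption:
from datetime import date
from google.cloud import bigquery

client = bigquery.Client()

day = date.today().strftime("%Y%m%d")
uri = "gs://my-bucket/mydata_{}.csv".format(day)              # placeholder bucket
destination = "my_project.my_dataset.mydata_{}".format(day)   # one table per day

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # assumption: no explicit schema is provided
)
client.load_table_from_uri(uri, destination, job_config=job_config).result()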
Currently we use the ORC file format to store the incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
Since the ORC files are read-only and we want to update the file contents every 20 minutes, we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by about 2 GB every day, and it has become a highly costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there another file format that also supports appends/inserts and can be queried by Athena?
From this article it seems Avro is such a file format, but I am not sure:
can Athena be used to query it?
are there any other issues?
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be instantly available for querying via Athena.
If your table is partitioned, you can copy the new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
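Putting the two steps together, a small sketch with boto3 (bucket name, database, dates, and output location are placeholders) might look like this:
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

day = "20200101"  # placeholder partition value

# 1. Copy the new file into the partition's S3 path
s3.upload_file("new_data.orc", "your-bucket",
               "path_to_table/date={}/new_data.orc".format(day))

# 2. Make sure the partition is registered in Athena's metastore
ddl = ("ALTER TABLE tablename ADD IF NOT EXISTS "
       "PARTITION (date = '{0}') "
       "LOCATION 's3://your-bucket/path_to_table/date={0}/'".format(day))
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "dataset"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)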
I have a folder of CSV files separated by date in Google Cloud Storage. How can I upload it directly to BigQuery as a partitioned table?
You can do the following:
Create a partitioned table (for example: T).
Run multiple load jobs to load each day's data into the corresponding partition. For example, you can load data for May 15th, 2016 by specifying the destination table of the load as 'T$20160515'.
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
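A minimal sketch of the load-job step with the BigQuery Python client, using the "$" partition decorator; the bucket, project, dataset names, and the list of dates are placeholders:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition="WRITE_TRUNCATE",  # restate the whole partition on reload
)

for day in ("20160515", "20160516"):  # placeholder dates
    uri = "gs://my-bucket/csv/{}/*.csv".format(day)
    destination = "my_project.my_dataset.T${}".format(day)  # one partition per day
    client.load_table_from_uri(uri, destination, job_config=job_config).result()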