I’m working on a project to pull data from a CSV file that resides in an Amazon S3 bucket into BigQuery using a data transfer. The CSV file gets dropped into S3 every day, replacing the previous day’s file under the same name.
I set up a data transfer to run every day, pulling this data into a table in BigQuery.
However, when the transfer runs, instead of overwriting the previous day’s data in the table, it simply appends it, thereby duplicating the data. Does anybody have any idea how to tackle this issue?
I am pulling data from a MySQL DB using PySpark and trying to upload the same data to S3, also with PySpark.
While doing so, it takes around 5-7 minutes to upload a chunk of 100K records.
This process will take months for the data pull, as there are around 3,108,700,000 records in the source.
Is there any better way by which the S3 upload process can be improved?
NOTE: The data pull for a single fetch of 100K records takes only 20-30 seconds; it's just the S3 upload causing the issue.
Here is how I am writing the DF to S3.
df = (
    spark.read.format("jdbc")
    .option('url', jdbcURL)
    .option('driver', driver)
    .option('user', user_name)
    .option('password', password)
    .option('query', data_query)
    .load()
)
output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)
Repartition is a good move, as writing a few large files to S3 is better than writing many small files.
Persist will slow you down, as you're materializing the whole DataFrame before it goes to S3, so you effectively pay to write the data twice.
S3 is made for large, slow, inexpensive storage. It's not made to move data quickly. If you want to migrate the database, AWS has tools for that (e.g. AWS Database Migration Service) and it's worth looking into them, even if it's only so you can then move the files into S3.
S3 spreads write throughput across internal partitions, and it determines those partitions from the object key (file path). With variation only at the tail of the key (/heres/some/variation/at/the/tail1, /heres/some/variation/at/the/tail2) everything lands in the same partition, and that partition becomes your bottleneck. To get spread across more partitions, vary the file path at its head (/head1/variation/isfaster/, /head2/variation/isfaster/).
Try to remove the persist (or at least consider cache() as a cheaper alternative).
Keep the repartition.
Vary the head of the file path to get assigned more partitions.
Consider a redesign that pushes the data into S3 with a REST API multipart upload. (A rough sketch of the first three points follows below.)
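To make that concrete, here is a minimal sketch of those first three suggestions, reusing the names from the question (spark, jdbcURL, driver, user_name, password, data_query); the bucket name my-bucket, the id column used for hashing, and the prefix/partition counts are hypothetical placeholders, not something from the original post.

from pyspark.sql import functions as F

df = (
    spark.read.format("jdbc")
    .option('url', jdbcURL)
    .option('driver', driver)
    .option('user', user_name)
    .option('password', password)
    .option('query', data_query)
    .load()
)

# No persist(): write straight through. If the loop below re-runs the JDBC
# query too often for your liking, cache() is the cheaper middle ground.
num_prefixes = 8  # hypothetical
df = df.withColumn("prefix", F.abs(F.hash(F.col("id"))) % num_prefixes)

for p in range(num_prefixes):
    (
        df.filter(F.col("prefix") == p)
          .drop("prefix")
          .repartition(4)  # a few large files per prefix rather than many small ones
          .write.mode("overwrite")
          .parquet(f"s3a://my-bucket/{p:02d}/mysql-extract/")  # variation at the head of the key
    )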
When we run a "COPY INTO ... FROM <AWS S3 location>" command, do the data files physically get copied from S3 to EC2 VM storage (SSD/RAM)? Or does the data still reside on S3 and get converted to Snowflake format?
And if I run COPY INTO and then suspend the warehouse, would I lose data on resumption?
Please let me know if you need any other information.
The data is loaded into Snowflake tables from an external location like S3. The files remain on S3; if there is a requirement to remove them after the copy operation, you can use the PURGE=TRUE parameter along with the COPY INTO command.
The files as such stay in the S3 location; the values from them are copied into the tables in Snowflake.
Warehouse operations that are already running are not affected even if the warehouse is shut down; they are allowed to complete. So there is no data loss in that event.
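For illustration, here is a minimal sketch of such a COPY INTO with PURGE=TRUE issued through the Snowflake Python connector; the connection details, my_table and my_s3_stage are hypothetical placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    user="my_user",          # hypothetical credentials
    password="my_password",
    account="my_account",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
try:
    cur = conn.cursor()
    # Load the staged S3 files into the table, then delete them from S3 (PURGE=TRUE)
    cur.execute(
        "COPY INTO my_table "
        "FROM @my_s3_stage "
        "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1) "
        "PURGE = TRUE"
    )
    print(cur.fetchall())  # one status row per loaded file
finally:
    conn.close()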
When we run a "COPY INTO ... FROM <AWS S3 location>" command, Snowflake copies the data files from your S3 location into Snowflake-managed S3 storage. That Snowflake-managed location is only accessible by querying the table into which you have loaded the data.
When you suspend a warehouse, Snowflake immediately shuts down all idle compute resources for the warehouse, but allows any compute resources that are executing statements to continue until the statements complete, at which time the resources are shut down and the status of the warehouse changes to “Suspended”. Compute resources waiting to shut down are considered to be in “quiesce” mode.
More details: https://docs.snowflake.com/en/user-guide/warehouses-tasks.html#suspending-a-warehouse
Details on the loading mechanism you are using are in docs: https://docs.snowflake.com/en/user-guide/data-load-s3.html#bulk-loading-from-amazon-s3
I'm looking to load CSV data from Google Cloud Storage to a BigQuery table (see docs) as a batch (see docs) using wildcards, and was wondering whether:
The data in the table will only be available once all CSVs have been loaded (i.e. the files get collated in some way before being loaded into BigQuery)
The data in the table will be updated incrementally with each CSV that's loaded (i.e. each CSV is loaded separately, as a separate job)
For some context, I'm trying to work out if it will be possible for a user to view incomplete table data if they access the table before the job to load the batch of CSVs has finished.
A similar question has been asked here before, but I don't have enough reputation to comment :'(
Thanks for the help!
The data are only viewable when the job is completed, therefore after all the files have been ingested.
Indeed, when you define a load job, you can specify a WRITE_TRUNCATE write disposition. That means all the current data will be replaced by the new data; if the job fails, the current data stay unchanged. That behaviour wouldn't be possible with an incremental load.
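As an illustration, here is a minimal sketch of such a wildcard batch load with WRITE_TRUNCATE using the google-cloud-bigquery Python client; the bucket path and table name are hypothetical placeholders.

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # Replace the table's contents atomically; on failure the old data stays.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/part-*.csv",   # wildcard over all the CSV shards
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # the new data only becomes visible once the whole job succeeds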
If you do want incremental loading, you can use a less efficient process: read the files with Dataflow and stream the content into BigQuery. This time, users will be able to see the incremental data by querying it (though not with the preview feature in the console, because streamed data stays for a while in BigQuery's streaming buffer, up to about 90 minutes).
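As a simplified stand-in for that Dataflow pipeline (just the streaming-insert call it would end up making, not a full pipeline), here is a sketch with the same Python client; the table name and row contents are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical rows, e.g. parsed from one CSV record at a time
rows = [
    {"user_id": 1, "value": 42.0},
    {"user_id": 2, "value": 7.5},
]

# Streamed rows are queryable almost immediately, but they sit in the
# streaming buffer for a while, so they may not show up in the console preview.
errors = client.insert_rows_json("my-project.my_dataset.my_table", rows)
if errors:
    raise RuntimeError(f"streaming insert failed: {errors}")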
I am using the unload operator from Airflow to dump data from Redshift. I need to know the record count; for that I have used the MANIFEST option, and it is working well. But the problem is that I would then need to load this manifest file back into Redshift and read the files. I am expecting some function/code that reads the count of the records while unloading from Redshift and loading into S3.
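One sketch of avoiding that round trip, assuming the UNLOAD is run with MANIFEST VERBOSE (which adds per-file metadata, including record counts, to the manifest) and that the manifest lands at a hypothetical bucket/key: read it back with boto3 and sum the counts.

import json

import boto3

s3 = boto3.client("s3")

# Hypothetical location of the manifest written by UNLOAD ... MANIFEST VERBOSE
obj = s3.get_object(Bucket="my-bucket", Key="my/unload/path/manifest")
manifest = json.loads(obj["Body"].read())

# Assumed verbose-manifest layout: each entry carries meta.record_count
total = sum(entry["meta"]["record_count"] for entry in manifest["entries"])
print(f"unloaded {total} records across {len(manifest['entries'])} files")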
Is there any way to take a daily DynamoDB backup to an S3 bucket where the backup contains only the records added to DynamoDB that day?
In other words, I want to take a daily DynamoDB backup to S3, but the daily backup should contain only the records added to DynamoDB today.
Please help if there is any way.
thanks,
If you enable Streams on your DynamoDB table, you get visibility into every item change in the table. You can write a Lambda function that processes the stream events and dumps items added to the table into an S3 bucket.
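A minimal sketch of such a Lambda handler, assuming the function is subscribed to the table's stream, the stream is configured to include new images, and the backup bucket name is a hypothetical placeholder; it keeps only newly inserted items and files them under a per-day prefix.

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # hypothetical

def handler(event, context):
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    for record in event["Records"]:
        # Only keep items newly added to the table
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]  # DynamoDB-typed attribute map
        key = f"daily-backup/{today}/{record['eventID']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(new_image))
    return {"processed": len(event["Records"])}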