When using the BigQuery Data Transfer Service to move data from S3 into BigQuery, I get intermittent success (I've actually only seen it work correctly one time).
The success:
6:00:48 PM Summary: succeeded 1 jobs, failed 0 jobs.
6:00:14 PM Job bqts_5f*** (table test_json_data) completed successfully. Number of records: 516356, with errors: 0.
5:59:13 PM Job bqts_5f*** (table test_json_data) started.
5:59:12 PM Processing files from Amazon S3 matching: "s3://bucket-name/*.json"
5:59:12 PM Moving data from Amazon S3 to Google Cloud complete: Moved 2661 object(s).
5:58:50 PM Starting transfer from Amazon S3 for files with prefix: "s3://bucket-name/"
5:58:49 PM Starting transfer from Amazon S3 for files modified before 2020-07-27T16:48:49-07:00 (exclusive).
5:58:49 PM Transfer load date: 20200727
5:58:48 PM Dispatched run to data source with id 138***3616
The usual case, though, is just 0 successes and 0 failures, like the following:
8:33:13 PM Summary: succeeded 0 jobs, failed 0 jobs.
8:32:38 PM Processing files from Amazon S3 matching: "s3://bucket-name/*.json"
8:32:38 PM Moving data from Amazon S3 to Google Cloud complete: Moved 3468 object(s).
8:32:14 PM Starting transfer from Amazon S3 for files with prefix: "s3://bucket-name/"
8:32:14 PM Starting transfer from Amazon S3 for files modified between 2020-07-27T16:48:49-07:00 and 2020-07-27T19:22:14-07:00 (exclusive).
8:32:13 PM Transfer load date: 20200728
8:32:13 PM Dispatched run to data source with id 13***0415
What might be going on such that the second log above doesn't have the Job bqts... run? Is there somewhere I can get more details about these data transfer jobs? I had a different job that ran into a JSON error, so I don't believe it was that.
Thanks!
I was a bit confused by the logging, since it finds and moves the objects.
I believe I misread the docs. I had previously thought that an Amazon S3 URI of s3://bucket-name/*.json would crawl the bucket for JSON files, but even though the log messages above seem to indicate as much, it only loads files into BigQuery that are at the top level (for the s3://bucket-name/*.json URI).
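To see which objects a pattern such as s3://bucket-name/*.json would pick up under the behavior described above, the keys can be split into top-level and nested ones. This is only a minimal sketch of that observed behavior, not the transfer service's actual matching logic; the function name and the sample keys are hypothetical.

```python
from fnmatch import fnmatchcase


def split_by_top_level_match(keys, pattern="*.json"):
    """Split S3 object keys into (loaded, skipped) lists.

    Assumption (taken from the behavior observed above, not from the
    service's documentation): the wildcard does not descend into
    "sub-folders", so a key containing "/" is never loaded.
    """
    loaded, skipped = [], []
    for key in keys:
        is_top_level = "/" not in key
        if is_top_level and fnmatchcase(key, pattern):
            loaded.append(key)
        else:
            skipped.append(key)
    return loaded, skipped


# Hypothetical keys, for illustration only.
sample_keys = ["a.json", "2020/07/27/b.json", "notes.txt"]
loaded, skipped = split_by_top_level_match(sample_keys)
print("would load:", loaded)
print("would skip:", skipped)
```

If most keys end up in the skipped list, that would be consistent with a run that reports moving thousands of objects but starts no load job.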
This is for Flink on KDA (Kinesis Data Analytics), so it is limited to version 1.13.
The desired flow of the application:
1. Ingest from a Kinesis/Kafka source.
2. Sink to S3.
3. After the S3 sink, get the object key/filename and publish it to Kafka (to inform other processes that a file is ready to be examined).
The first 2 steps are simple enough. The 3rd is the tricky one.
From my understanding the S3 object key will be made of the following:
bucket partition (let's assume the default here of processing date)
filename. The filename is made up of + + + . It then goes through a series of changes - in-progress, pending and final.
The file will be in a finished state when a checkpoint occurs and all pending files are moved to finished.
What I would like is to use this information as a trigger for publishing to Kafka.
On checkpoint, give me a list of all the files (object keys) that have moved to a finished state. These can then be put on a Kafka topic.
Is this possible?
Thanks in advance
I have a Flink batch job that reads a very large Parquet file from S3 and then sinks JSON into a Kafka topic.
The problem is: how can I make the file-reading process stateful? I mean, whenever the job is interrupted or crashes, it should resume from the previous reading state. I don't want to send duplicate items to Kafka when the job restarts.
Here is my example code
val env = ExecutionEnvironment.getExecutionEnvironment
val input = Parquet.input[User](new Path(s"s3a://path"))
env.createInput(input)
.filter(r => Option(r.token).getOrElse("").nonEmpty)
A Glue job configured with a maximum capacity of 10 nodes, 1 job in parallel and no retries on failure is giving the error "Failed to delete key: target_folder/_temporary". According to the stack trace, the issue is that S3 starts throttling the Glue requests because of the number of requests: "AmazonS3Exception: Please reduce your request rate."
Note: The issue is not with IAM, as the IAM role that the Glue job is using has permissions to delete objects in S3.
I found a suggestion for this issue on GitHub that proposes reducing the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20
"I've had success reducing the number of workers."
However, I don't think that 10 is too many workers, and I would actually like to increase the worker count to 20 to speed up the ETL.
Has anyone who faced this issue had any success? How would I go about solving it?
Shortened stacktrace:
py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...
Part of Glue ETL python script (just in case):
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name", transformation_ctx="datasource0")
... relationalizing, renaming and etc. Transforming from DynamicDataframe to PySpark dataframe and back.
partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(frame=partition_ready, connection_type="s3", connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]}, format="parquet", transformation_ctx="datasink")
job.commit()
Solved (kind of), thank you to user ayazabbas
I accepted the answer that pointed me in the right direction. One of the things I was searching for was how to reduce many small files into bigger chunks, and repartition does exactly that. Instead of repartition(x) I used coalesce(x), where x is 4*worker count of the Glue job, so that the Glue service could allocate one data chunk to each available vCPU. It might make sense to have x be at least 2*4*worker_count to account for slower and faster transformation parts, if they exist.
Another thing I did was reduce the number of columns by which I was partitioning the data before writing it to S3, from 5 to 4.
The current drawback is that I haven't figured out how to find, within the Glue script, the worker count that the Glue service allocates for the job, so the number is hardcoded according to the job configuration (the Glue service sometimes allocates more nodes than what is configured).
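The arithmetic for x described above can be sketched as a small helper. This is a hedged illustration only: the function and parameter names are hypothetical, the value 4 is the per-worker factor used in this post (not something queried from Glue), and the worker count is still passed in by hand.

```python
def target_partition_count(worker_count, vcpus_per_worker=4, headroom=1):
    """Return x for coalesce(x) as headroom * vcpus_per_worker * worker_count.

    vcpus_per_worker=4 is the factor used in this post (an assumption here).
    headroom=2 gives the 2*4*worker_count variant mentioned above.
    """
    if worker_count < 1 or vcpus_per_worker < 1 or headroom < 1:
        raise ValueError("all arguments must be positive integers")
    return headroom * vcpus_per_worker * worker_count


x = target_partition_count(worker_count=10)
x_with_headroom = target_partition_count(worker_count=10, headroom=2)
print(x, x_with_headroom)

# In the Glue script, the result would then be applied before the write,
# for example: partition_ready = partition_ready.coalesce(x)
```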
I had this same issue. I worked around it by running repartition(x) on the dynamic frame before writing to S3. This forces x files per partition, and the max parallelism during the write process will be x, reducing the S3 request rate.
I set x to 1 as I wanted 1 parquet file per partition, so I'm not sure what the safe upper limit of parallelism is before the request rate gets too high.
I couldn't figure out a nicer way to solve this issue. It's annoying because you have so much idle capacity during the write process.
Hope that helps.
I am trying to set up a transfer with the following configuration:
Source: Google Ads (formerly AdWords)
Destination dataset: app_google_ads
Schedule (UTC): every day 08:24
Notification Cloud Pub/Sub topic: None
Email notifications: None
Data source details
Customer ID: xxx-xxx-xxxx
Exclude removed/disabled items: None
I got no errors during the transfer, but my dataset is empty. Why?
12:02:00 PM Summary: succeeded 72 jobs, failed 0 jobs.
12:01:04 PM Job 77454333956:adwords_5cdace41-0000-2184-a73e-001a11435098 (table p_VideoConversionStats_2495318378$20190502) completed successfully
12:00:04 PM Job 77454333956:adwords_5cdace37-0000-2184-a73e-001a11435098 (table p_HourlyAccountStats_2495318378$20190502) completed successfully
12:00:04 PM Job 77454333956:adwords_5cd88a2b-0000-2117-b857-089e082679e4 (table p_HourlyCampaignStats_2495318378$20190502) completed successfully
12:00:04 PM Job 77454333956:adwords_5cd0ba27-0000-2c7c-aed0-f40304362f4a (table p_AudienceBasicStats_2495318378$20190502) completed successfully
12:00:04 PM Job 77454333956:adwords_5cd907f8-0000-2e16-a735-089e082678cc (table p_KeywordStats_2495318378$20190502) completed successfully
12:00:04 PM Job 77454333956:adwords_5cd88a32-0000-2117-b857-089e082679e4 (table p_ShoppingProductConversionStats_2495318378$20190502) completed successfully
12:00:04 PM Job 77454333956:adwords_5cce5c09-0000-28bd-86d3-f4030437b908 (table p_AdBasicStats_2495318378$20190502) completed successfully
etc
I had AdBlock enabled in my browser, which prevented me from seeing the Google Ads tables in my dataset. I turned it off and it works!
I have a Dataflow pipeline, running locally. The objective is to read a JSON file using TextIO, build sessions and load the result into BigQuery. Given the structure, I have to create a temp directory in GCS and then load the data into BigQuery from there. Previously I had a data schema error that prevented me from loading the data, see here. That issue is resolved.
So now when I run the pipeline locally, it ends by dumping a temporary newline-delimited JSON file into GCS. The SDK then gives me the following:
Starting BigQuery load job beam_job_xxxx_00001-1: try 1/3
INFO [main] (BigQueryIO.java:2191) - BigQuery load job failed: beam_job_xxxx_00001-1
...
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:187)
at pedesys.Dataflow.main(Dataflow.java:148)
Caused by: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.load(BigQueryIO.java:2198)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.processElement(BigQueryIO.java:2146)
The errors are not very descriptive and the data is still not loaded into BigQuery. What is puzzling is that if I go to the BigQuery UI and manually load the same temporary file that the Dataflow pipeline dumped into GCS, into the same table, it works beautifully.
The relevant code parts are as follows:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
.setTempLocation("gs://test/temp");
Pipeline p = Pipeline.create(options);
...
...
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write
.named("loadJob")
.to("myproject:db.table")
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
The SDK is swallowing the error/exception and not reporting it to users. It's most likely a schema problem. To get the actual error that is happening you need to fetch the job details by either:
CLI - bq show -j beam_job_<xxxx>_00001-1
Browser/Web: use "try it" at the bottom of the page here.
@jkff has raised an issue here to improve the error reporting.
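As a rough sketch of the CLI route above, the job details can be requested as JSON and the error messages pulled out of the job's status. The helper names and the sample error text are hypothetical; the status.errorResult and status.errors fields are those of the BigQuery job resource, and the job ID placeholder is kept as in the command above.

```python
import json


def bq_show_job_command(job_id):
    """Build the argv for `bq show -j`, asking for JSON output."""
    return ["bq", "--format=prettyjson", "show", "-j", job_id]


def extract_job_errors(job_json):
    """Return the error messages found in a BigQuery job resource (JSON text)."""
    status = json.loads(job_json).get("status", {})
    errors = list(status.get("errors", []))
    final = status.get("errorResult")
    if final and final not in errors:
        errors.append(final)
    return [error.get("message", "") for error in errors]


# Hypothetical job output, for illustration only.
sample_output = json.dumps({
    "status": {
        "state": "DONE",
        "errorResult": {"reason": "invalid", "message": "hypothetical schema error"},
        "errors": [{"reason": "invalid", "message": "hypothetical schema error"}],
    }
})
print(bq_show_job_command("beam_job_<xxxx>_00001-1"))
print(extract_job_errors(sample_output))

# With the bq CLI installed and authenticated, the real output could be
# obtained with:
#   subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```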