AWS Glue script missing a Parquet file - amazon-s3

An AWS Glue script written in PySpark usually works great and creates Parquet files, but occasionally one Parquet file is missing. How can I ensure complete output / mitigate missing data?
The pertinent code is:
FinalDF.write.partitionBy("Year", "Month").mode('append').parquet(TARGET)
I can see the S3 folder with lots of Parquet files, following the naming convention
part-<sequential number>-<guid>
which makes it obvious that one Parquet file is missing, e.g.
part-00001-c7b1b83c-8a28-49a7-bce8-0c31be30ac30.c000.snappy.parquet
so there is part-00001 through part-00032, **except** part-00013 is missing.
I can also see a log file in CloudWatch which states:
WARN [Executor task launch worker for task 587] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(587)): No Output found for attempt_2022 ....
I downloaded the source files and they process fine; I cannot reproduce the issue.
Any ideas on how to avoid this or troubleshoot further? Many thanks.
I Googled, searched existing posts, and searched the AWS docs with no luck. I tried to reproduce in a dev environment but cannot reproduce the problem. I double-checked the backup/DR folder: it has the same data, and the same file is missing there.
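One cheap mitigation while the root cause is unclear: after each write, list the output prefix and check the part numbers form a contiguous range, so a dropped task output is caught immediately instead of discovered later. A minimal sketch (the helper name is mine; pass it the object names listed from the TARGET prefix):

```python
import re

def missing_parts(filenames):
    """Given names like part-00001-<guid>.c000.snappy.parquet, return the
    sequential part numbers absent from the contiguous range observed."""
    nums = set()
    for name in filenames:
        m = re.match(r"part-(\d{5})-", name)
        if m:
            nums.add(int(m.group(1)))
    if not nums:
        return []
    return [n for n in range(min(nums), max(nums) + 1) if n not in nums]
```

If this returns a non-empty list right after the job, you can fail the job (or retry the write) instead of shipping incomplete data. Note that part numbers are only guaranteed contiguous within a single write, so run the check per partition written in that run.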

Related

Gitlab-CI: AWS S3 deploy is failing

I am trying to create a deployment pipeline with GitLab CI on a React project. The build works fine, and I use artifacts to store the dist folder from my yarn build command. This works fine as well.
The issue is regarding my deployment with command: aws s3 sync dist/'bucket-name'.
Expected: "Done in x seconds"
Actual:
error Command failed with exit code 2.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
Running after_script 00:01
Uploading artifacts for failed job 00:01
ERROR: Job failed: exit code 1
The files seem to have been uploaded correctly to the S3 bucket, however I do not know why I get an error on the deployment job.
When I run the aws s3 sync dist/'bucket-name' locally everything works correctly.
Check out the AWS CLI Return Codes:
2 -- The meaning of this return code depends on the command being run.
The primary meaning is that the command entered on the command line failed to be parsed. Parsing failures can be caused by, but are not limited to, missing any required subcommands or arguments or using any unknown commands or arguments. Note that this return code meaning is applicable to all CLI commands.
The other meaning is only applicable to s3 commands. It can mean that at least one or more files marked for transfer were skipped during the transfer process, while all other files marked for transfer were successfully transferred. Files that are skipped during the transfer process include: files that do not exist; files that are character special devices, block special devices, FIFOs, or sockets; and files that the user cannot read.
The second paragraph might explain what's happening.
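To make that distinction visible in a CI script rather than just a bare "exit code 2", you can map the documented return codes to messages before deciding whether to fail the job. A hedged sketch (the function name is mine; meanings summarized from the return-code documentation quoted above):

```python
# Interpret `aws s3 sync` exit codes per the AWS CLI return-code docs,
# so a partial transfer (skipped files) is surfaced explicitly in CI logs.
S3_SYNC_EXIT_MEANINGS = {
    0: "all files transferred successfully",
    1: "one or more transfers failed",
    2: "command could not be parsed, or some files were skipped "
       "(missing, unreadable, or special files)",
}

def describe_sync_exit(code):
    return S3_SYNC_EXIT_MEANINGS.get(code, "unknown exit code %d" % code)
```

In the pipeline you would run the sync, capture its exit status, log the description, and choose whether code 2 (files skipped but the rest transferred) should fail the job or merely warn.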
There is no yarn build command. See https://classic.yarnpkg.com/en/docs/cli/run
As Anton mentioned, the second paragraph of his answer was the problem. The solution was removing special characters from a couple of SVG filenames. I suspect uploading the dist folder as an artifact (zip) might have changed some of the file names, which confused S3. Removing ® and + from the filenames resolved the issue.

How to fix 'Not found: Files /bigstore/project/testing/filename.json' error when loading into Bigquery

I'm trying to load multiple JSON files (4000) into a table in BigQuery using the following command: bq load --source_format=NEWLINE_DELIMITED_JSON --replace=true kx-test.store_requests gs://kx-gam-test/store/requests/*, and I am getting the following error:
Error encountered during job execution:
Not found: Files /bigstore/kx-gam-test/store/requests/7fb27d63-5581-43a1-821d-fcf47b3412fd.json.gz
Failure details:
- Not found: Files /bigstore/kx-gam-test/store/requests/93b54246-2284-4b85-8620-76657f4a338b.json.gz
- Not found: Files /bigstore/kx-gam-test/store/requests/fd24a53d-2c49-4f66-bf54-a7ccf14a1cfe.json.gz
- Not found: Files /bigstore/kx-gam-test/store/requests/35a27032-930c-456a-846d-67481a21e52d.json.gz
I am not sure why it is not working. Is it possibly due to the number of files I am trying to load? And what is this bigstore folder prefixed in front of my GCS bucket?
I would like to highlight that there are some folders inside kx-gam-test/store/requests, and I want to load the gzipped JSON files inside all of these folders.
According to the documentation:
BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash.
Also, here is some additional info to consider when loading data from Cloud Storage.
A few things you can check:
- Make sure that you have the necessary permissions.
- Make sure that the files actually exist in GCS.
- Do you have any process which deletes files after loading? Check the audit logs for traces of whether a file might have been deleted while BigQuery was actually reading/loading it.
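For the second check: the /bigstore/ prefix in the error appears to be how BigQuery refers internally to Cloud Storage objects, so each reported path maps back to a gs:// URI you can verify with gsutil stat. A small sketch of that mapping (the helper name is mine, and the /bigstore/ convention is inferred from the error messages above):

```python
def bigstore_to_gs(path):
    """Map a /bigstore/<bucket>/<object> path from a BigQuery error
    message back to a gs:// URI that can be checked with gsutil stat."""
    prefix = "/bigstore/"
    if not path.startswith(prefix):
        raise ValueError("not a bigstore path: %s" % path)
    return "gs://" + path[len(prefix):]
```

Running the resulting URIs through gsutil stat (or a storage-client existence check) tells you quickly whether the objects were really gone at load time or only transiently invisible to the load job.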

EMR S3 : FileDeletedInMetadataNotFoundException: File is marked as deleted in the metadata

I am trying to run a Hadoop job which creates, copies, and deletes files on S3, and reads these files from S3 when required.
My job intermittently fails with the following exception, for which I am looking for a permanent fix or workaround:
Caused by: com.amazon.ws.emr.hadoop.fs.consistency.exception.FileDeletedInMetadataNotFoundException: File '' is marked as deleted in the metadata
When I run the command emrfs diff externally after the job fails, the output shows MANIFEST_ONLY files in red.
Then I run the command emrfs sync, which removes these files, and my job then runs with no error.
I do not want to debug after the job fails, as that is not acceptable for me, and I do not want any manual intervention to make sure that my jobs run seamlessly.
My job itself creates, copies, and deletes files and then is not able to read them, which is confusing, and I haven't been able to find any solution in the documentation.
Would appreciate all the relevant suggestions.
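Pending a root-cause fix (EMRFS consistent-view metadata drifting out of sync with S3), the manual emrfs sync workaround described above can at least be automated as a step that runs before each job instead of by hand after a failure. A minimal sketch (the wrapper name is mine; the emrfs CLI is the one used in the question):

```python
import subprocess

def emrfs_sync(s3_path, dry_run=False):
    """Re-sync EMRFS consistent-view metadata for s3_path before
    (re)running the job, mirroring the manual `emrfs sync` workaround."""
    cmd = ["emrfs", "sync", s3_path]
    if dry_run:
        return cmd  # let callers/tests inspect the command without running it
    subprocess.run(cmd, check=True)
    return cmd
```

Wiring this in as an EMR step (or at the top of the job driver) removes the manual intervention, though it treats the symptom rather than the metadata inconsistency itself.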

Parquet Warning Filling up Logs in Hive MapReduce on Amazon EMR

I am running a custom UDAF on a table stored as Parquet on Hive on Tez. Our Hive jobs run on YARN, all set up in Amazon EMR. However, because the Parquet data we have was generated with an older version of Parquet (1.5), I am getting a warning that is filling up the YARN logs and causing the disk to run out of space before the job finishes. This is the warning:
PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version
It also prints a stack trace. I have been trying to silence the warning logs to no avail. I have managed to turn off just about every type of log except this warning. I have tried modifying just about every Log4j settings file using the AWS config as outlined here.
Things I have tried so far:
I set the following settings in tez-site.xml (written here in JSON format because that's what AWS requires for configuration; it is in proper XML format on the actual instance):
"tez.am.log.level": "OFF",
"tez.task.log.level": "OFF",
"tez.am.launch.cluster-default.cmd-opts": "-Dhadoop.metrics.log.level=OFF -Dtez.root.logger=OFF,CLA",
"tez.task-specific.log.level": "OFF;org.apache.parquet=OFF"
I have the following settings in mapred-site.xml. These settings effectively turned off all logging in my YARN logs except for the warning in question.
"mapreduce.map.log.level": "OFF",
"mapreduce.reduce.log.level": "OFF",
"yarn.app.mapreduce.am.log.level": "OFF"
I have these settings in just about every other log4j.properties file I found in the list shown in the previous AWS link:
"log4j.logger.org.apache.parquet.CorruptStatistics": "OFF",
"log4j.logger.org.apache.parquet": "OFF",
"log4j.rootLogger": "OFF, console"
Honestly, at this point I just want to find some way to turn off the logs and get the job running. I've read about similar issues, such as this link, where they fixed it by changing Log4j settings, but that's for Spark, and it just doesn't seem to work on Hive/Tez on Amazon. Any help is appreciated.
OK, so I ended up fixing this by modifying the Java logging.properties file for EVERY single data node and the master node in EMR. In my case the file was located at /etc/alternatives/jre/lib/logging.properties. (This warning comes from java.util.logging rather than Log4j, which is why none of the Log4j settings above had any effect.)
I added a shell command to the bootstrap action file to automatically add the following two lines to the end of the properties file:
org.apache.parquet.level=SEVERE
org.apache.parquet.CorruptStatistics.level = SEVERE
Just wanted to update in case anyone else faces the same issue, as this is really not set up properly by Amazon and required a lot of trial and error.
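The bootstrap action itself only needs to append those two lines on every node. A hedged sketch of that append step (path as in the answer above; the function name is mine), made idempotent so repeated bootstraps don't duplicate the lines:

```python
# Append java.util.logging overrides that silence the parquet-mr
# CorruptStatistics warning, skipping lines that are already present.
OVERRIDES = [
    "org.apache.parquet.level=SEVERE",
    "org.apache.parquet.CorruptStatistics.level = SEVERE",
]

def silence_parquet_warnings(properties_path):
    with open(properties_path, "r+") as f:
        existing = f.read().splitlines()
        for line in OVERRIDES:
            if line not in existing:
                f.write("\n" + line)  # file pointer is at EOF after read()
```

Run it against /etc/alternatives/jre/lib/logging.properties from the bootstrap script on each node (it needs root to write there).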

Map Reduce job on Amazon: argument for custom jar

This is one of my first tries with MapReduce on AWS in its Management Console.
I have uploaded to AWS S3 my runnable jar developed on Hadoop 0.18, and it works on my local machine.
As described in the documentation, I have passed the S3 paths for input and output as arguments to the jar: all right, but the problem is the third argument, which is another path (as a string) to a file that I need to load while the job is executing. That file resides in an S3 bucket too, but it seems that my jar doesn't recognize the path, and I get a FileNotFoundException when it tries to load it. That is strange, because this path looks exactly like the other two...
Anyone have any idea?
Thank you
Luca
This is a problem with AWS; please check Lesson 2 at http://meghsoft.com/blog/. See if you can use FileSystem.get(uri, conf) to obtain a file system supporting your path.
Hope this helps.
Sonal
Sonal,
thank you for your suggestion.
In the end the solution was using the DistributedCache.
By loading the file before running the job, I can access everything I need inside the Map class by overriding the configure method and taking the file from the DistributedCache (already loaded with the file).
Thank you,
Luca
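For anyone doing the same thing from Python via Hadoop Streaming rather than the old Java mapred API: the equivalent of Luca's DistributedCache fix is to ship the side file with the -files option, which localizes it into each task's working directory, and then open it by its bare local name in the mapper. A sketch under those assumptions (file name and tab-separated format are hypothetical):

```python
def load_side_data(path="lookup.txt"):
    """Read a tab-separated side file that -files has localized into the
    task's working directory, returning it as a key -> value dict."""
    with open(path) as f:
        return dict(
            line.rstrip("\n").split("\t", 1)
            for line in f
            if line.strip()
        )
```

The mapper calls this once at startup, exactly as the Java version would read the cached file in configure(), instead of trying to open the S3 path directly from inside a task.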