Parquet Warning Filling up Logs in Hive MapReduce on Amazon EMR - hive

I am running a custom UDAF on a table stored as Parquet, using Hive on Tez. Our Hive jobs run on YARN, all set up on Amazon EMR. Because the Parquet data we have was generated with an older version of Parquet (1.5), I am getting a warning that is filling up the YARN logs and causing the disk to run out of space before the job finishes. This is the warning:
PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version
It also prints a stack trace. I have been trying to silence the warning to no avail; I have managed to turn off just about every type of log except this one. I have tried modifying just about every Log4j settings file using the AWS config as outlined here.
Things I have tried so far:
I set the following settings in tez-site.xml (written here in JSON format because that is what the AWS configuration requires; it is in proper XML format on the actual instance):
"tez.am.log.level": "OFF",
"tez.task.log.level": "OFF",
"tez.am.launch.cluster-default.cmd-opts": "-Dhadoop.metrics.log.level=OFF -Dtez.root.logger=OFF,CLA",
"tez.task-specific.log.level": "OFF;org.apache.parquet=OFF"
I have the following settings in mapred-site.xml. These settings effectively turned off all logging that appears in my YARN logs except for the warning in question.
"mapreduce.map.log.level": "OFF",
"mapreduce.reduce.log.level": "OFF",
"yarn.app.mapreduce.am.log.level": "OFF"
I have these settings in just about every other log4j.properties file I found in the list shown in the previous AWS link.
"log4j.logger.org.apache.parquet.CorruptStatistics": "OFF",
"log4j.logger.org.apache.parquet": "OFF",
"log4j.rootLogger": "OFF, console"
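For reference, on EMR these end up grouped by classification in the configuration JSON, roughly like this (a sketch; the classification names, e.g. hadoop-log4j, are just examples of the ones I touched, not an exhaustive list):
[
  {
    "Classification": "tez-site",
    "Properties": {
      "tez.am.log.level": "OFF",
      "tez.task.log.level": "OFF"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.map.log.level": "OFF",
      "mapreduce.reduce.log.level": "OFF",
      "yarn.app.mapreduce.am.log.level": "OFF"
    }
  },
  {
    "Classification": "hadoop-log4j",
    "Properties": {
      "log4j.logger.org.apache.parquet": "OFF"
    }
  }
]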
Honestly, at this point I just want to find some way to turn off the logs and get the job running somehow. I've read about similar issues, such as this link, where they fixed it by changing log4j settings, but that's for Spark, and it just doesn't seem to be working on Hive/Tez on Amazon. Any help is appreciated.

OK, so I ended up fixing this by modifying the Java logging.properties file for EVERY single data node and the master node in EMR (the warning is emitted through java.util.logging rather than Log4j, which is why none of the Log4j settings silenced it). In my case the file was located at /etc/alternatives/jre/lib/logging.properties
I added a shell command to the bootstrap action file to automatically add the following two lines to the end of the properties file:
org.apache.parquet.level=SEVERE
org.apache.parquet.CorruptStatistics.level = SEVERE
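The bootstrap command can be as simple as appending those lines with something like this (a sketch; the logging.properties path may differ depending on the AMI/JVM):
echo "org.apache.parquet.level=SEVERE" | sudo tee -a /etc/alternatives/jre/lib/logging.properties
echo "org.apache.parquet.CorruptStatistics.level=SEVERE" | sudo tee -a /etc/alternatives/jre/lib/logging.properties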
Just wanted to post an update in case anyone else faces the same issue, as this is really not set up properly by Amazon and required a lot of trial and error.

Related

AWS Glue script missing a Parquet file

An AWS Glue script written in PySpark usually works great and creates Parquet files, but occasionally I am missing a Parquet file. How can I prevent or mitigate the missing data?
The pertinent code is:
FinalDF.write.partitionBy("Year", "Month").mode('append').parquet(TARGET)
I can see the S3 folder with lots of parquet files and can find a series following the naming convention
part-<sequential number>-<guid>
which makes it obvious that one parquet file is missing, e.g.
part-00001-c7b1b83c-8a28-49a7-bce8-0c31be30ac30.c000.snappy.parquet
so there is part-00001 through part-00032, except part-00013 is missing.
I can also see a log entry in CloudWatch which states:
WARN [Executor task launch worker for task 587] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(587)): No Output found for attempt_2022 ....
I downloaded the source files and they process fine; I cannot reproduce the issue.
Any ideas on how to avoid / troubleshoot further? Many thanks.
I googled, searched existing posts, and searched the AWS docs with no luck. I tried to reproduce the problem in a dev environment and cannot. I double-checked the backup/DR folder; it has the same data, and the same file is missing there too.

Apache Flink to use S3 for backend state and checkpoints

Background
I was planning to use S3 to store Flink's checkpoints using the FsStateBackend, but I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: I am using Flink 1.10.0.
I have found the solution to the above issue, so I am listing the required steps here.
Steps
We need to add some configuration to the flink-conf.yaml file, as listed below.
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
s3.access-key: XXXXXXXXXXXXXXXXXXX #your-access-key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #your-secret-key
s3.endpoint: http://127.0.0.1:9000 #your-endpoint-hostname (I have used Minio)
After completing the first step, we need to copy the respective JAR files (flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) from the opt directory to the plugins directory of your Flink installation.
E.g.:
1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar (recommended for StreamingFileSink)
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar (recommended for checkpointing)
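A shell sketch of those two copies, assuming the default Flink 1.10.0 directory layout used above:
mkdir -p /flink-1.10.0/plugins/s3-fs-hadoop /flink-1.10.0/plugins/s3-fs-presto
cp /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar /flink-1.10.0/plugins/s3-fs-hadoop/
cp /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar /flink-1.10.0/plugins/s3-fs-presto/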
Add this in the checkpointing code:
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
After completing all the above steps, restart Flink if it is already running.
Note:
If you are using both (flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink, then use s3p:// specifically for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop, instead of s3://.
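For example, with both plugins installed, the checkpoint settings from the first step would use the presto scheme (a sketch):
state.checkpoints.dir: s3p://s3-bucket/checkpoints/
state.backend.fs.checkpointdir: s3p://s3-bucket/checkpoints/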
For more details click here.

Gitlab-CI: AWS S3 deploy is failing

I am trying to create a deployment pipeline for Gitlab-CI on a react project. The build is working fine and I use artifacts to store the dist folder from my yarn build command. This is working fine as well.
The issue is with my deployment command: aws s3 sync dist/'bucket-name'.
Expected: "Done in x seconds"
Actual:
error Command failed with exit code 2.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
Running after_script 00:01
Uploading artifacts for failed job 00:01
ERROR: Job failed: exit code 1
The files seem to have been uploaded correctly to the S3 bucket, however I do not know why I get an error on the deployment job.
When I run the aws s3 sync dist/'bucket-name' locally everything works correctly.
Check out AWS CLI Return Codes
2 -- The meaning of this return code depends on the command being run.
The primary meaning is that the command entered on the command line failed to be parsed. Parsing failures can be caused by, but are not limited to, missing any required subcommands or arguments or using any unknown commands or arguments. Note that this return code meaning is applicable to all CLI commands.
The other meaning is only applicable to s3 commands. It can mean at least one or more files marked for transfer were skipped during the transfer process. However, all other files marked for transfer were successfully transferred. Files that are skipped during the transfer process include: files that do not exist, files that are character special devices, block special device, FIFO's, or sockets, and files that the user cannot read from.
The second paragraph might explain what's happening.
There is no yarn build command. See https://classic.yarnpkg.com/en/docs/cli/run
As Anton mentioned, the second paragraph of his answer was the problem. The solution was removing special characters from a couple of SVGs. I suspect uploading the dist folder as an artifact (zip) might have changed some of the file names altogether, which was confusing to S3. By removing ® and + from the filenames, the issue was resolved.
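If you run into the same thing, one quick way to spot file names with characters outside a conservative safe set before syncing is something like this (a sketch, assuming a POSIX shell and that the build output is in dist/):
# list paths in dist/ containing characters other than letters, digits, ., _, / and -
find dist -type f | LC_ALL=C grep '[^A-Za-z0-9._/-]'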

EMR S3 : FileDeletedInMetadataNotFoundException: File is marked as deleted in the metadata

I am trying to run a Hadoop job which creates, copies, and deletes files on S3, and reads those files from S3 when required.
My job intermittently fails with the following exception, for which I am looking for a permanent fix or workaround:
Caused by: com.amazon.ws.emr.hadoop.fs.consistency.exception.FileDeletedInMetadataNotFoundException: File '' is marked as deleted in the metadata
When I run the command emrfs diff externally after the job fails, the output shows MANIFEST_ONLY files in red.
Then I run the command emrfs sync, which removes these files, and then my job runs with no error.
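For reference, both commands take the S3 location the job works against; the bucket and prefix below are placeholders:
emrfs diff s3://your-bucket/your-prefix
emrfs sync s3://your-bucket/your-prefix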
I do not want to debug after the job fails, as that is not acceptable for me, and I do not want any manual intervention to be needed for my jobs to run seamlessly.
The job itself creates, copies, and deletes these files and then is not able to read them, which is confusing, and I haven't been able to find a solution in the documentation.
I would appreciate any relevant suggestions.

use dfs does not work in later versions of drill on the drill web page

When using the web page displayed by Drill at localhost:8047/query (by default), running the following commands fails:
use dfs.mydfs;
and then:
show files;
Then I receive this error:
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: SHOW FILES is supported in workspace type schema only. Schema [] is not a workspace schema. [Error Id: 872e6708-0aaa-480e-af32-9aaf6f84de2b on 172.28.128.1:31010]
Whereas if I use the terminal to enter the same commands, they work correctly.
I've also found that this affects 1.6 and above, and that this behaviour is not seen on 1.5 and below.
This command works in both the web and command line/terminal versions:
show files in dfs.workspace;
I have configured multiple types of dfs and have tried both OS X and Windows 10 and found the issue to be the same.
I tried looking through the Drill JIRA to see if this was registered as a bug, and I looked briefly through the release notes as well.