How to fix AWS S3 error: socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed

We have a long running Pentaho job in Rundeck that has recently started to experience failures.
The Rundeck step definition is:
echo "Using AMI $(aws ec2 describe-images --owners 660354881677 --filters 'Name=name,Values="ds-pentaho-v1.15.0"' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text)"
/usr/local/bin/pentaho-cli run-ec2 --environment prd \
--ami-id $(aws ec2 describe-images --owners 660354881677 --filters 'Name=name,Values="ds-pentaho-v1.15.0"' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text) \
--arg BOOL_ARG_1=true \
--instance-type "m5.2xlarge" \
--on-demand \
--max-fleet-retries "2" \
--emr-host "some_emr_host.net" \
--dir "/some/job/path/" \
--job "some_job_name"
Specifically, the job has gone from completing consistently to hanging indefinitely in a long-running status. Digging through the logs, we find this:
2022/12/02 06:21:31 - tfo: task_read_table_2 - linenr 13900000
2022/12/02 06:21:11 - tfo: task_read_table_1 - linenr 900000 <- Last Line output for table 1 here
2022/12/02 06:21:59 - tfo: task_read_table_2 - linenr 15150000
2022/12/02 06:22:00 - tfo: task_read_table_2 - linenr 15200000
2022/12/02 06:22:01 - tfo: task_read_table_2 - linenr 15250000
com.amazonaws.services.s3.model.AmazonS3Exception: Your socket
connection to the server was not read from or written to within the
timeout period. Idle connections will be closed. (Service: Amazon S3;
Status Code: 400; Error Code: RequestTimeout;
2022/12/02 06:22:02 - tfo: task_read_table_2 - linenr 15300000
2022/12/02 06:22:03 - tfo: task_read_table_2 - linenr 15350000
2022/12/02 06:24:52 - in: task_read_table_2 - Finished reading query, closing connection.
2022/12/02 06:24:54 - tfo: task_read_table_3 - linenr 50000
...
2022/12/02 06:35:53 - tfo: task_read_table_3 - linenr 37500000
2022/12/02 06:35:54 Finished reading query, closing connection.
<The job hangs here!>
The AWS S3 exception above occurs anywhere from 10 minutes to 2 hours into the run.
The job is on a parallel step that pulls data from multiple PostgreSQL tables and loads it into S3 with a text output task. It seems like S3 hangs up after the read on task_read_table_1: the data is not written to S3. The job continues to pull records from the other source tables until completion, but it never writes those table outputs to S3 either. From there, the job just hangs. The site engineers are not sure what is going on here. I think this may be an issue with how either AWS or Rundeck is set up. Note: we use Terraform to manage timeouts, and those are currently set to 24 hours, so the job run should be well within that limit.
The number of records between successful and unsuccessful runs appears to be the same. There do not appear to be many recent, reliable search results on the internet; most are 5-10 years old and do not seem relevant.
I do not think this is a problem with the Pentaho job itself, because it has completed without fail in the past and the overall record counts of what is being pulled/loaded are stable.
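The exception text itself says the socket connection was not read from or written to within the timeout period, which suggests the upload connection to S3 went idle mid-transfer. One check I can run from the same EC2 instance the job launches is a plain AWS CLI upload, to see whether a sustained transfer to S3 completes cleanly on its own (a rough sketch; the bucket path below is a placeholder):
# Rough diagnostic sketch, not a fix: confirm the instance can sustain an
# upload to S3 without the connection going idle. Bucket/key are placeholders.
dd if=/dev/zero of=/tmp/s3-timeout-test.bin bs=1M count=1024
time aws s3 cp /tmp/s3-timeout-test.bin s3://<your-bucket>/tmp/s3-timeout-test.bin \
    --debug 2> /tmp/s3-cp-debug.log
# Look for retries, throttling, or long pauses in the debug output.
grep -iE 'retry|throttl|timeout' /tmp/s3-cp-debug.log | tail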
Does anyone know what is potentially causing this issue or how it can be diagnosed?
Note: This is my first engagement working with AWS, Rundeck, Terraform, and Pentaho. I am more of an ETL developer than a site engineer. Any help is appreciated.

Related

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline on a DRMAA cluster with a total of more than 2000 jobs. When some of the jobs have failed, I would like to receive an easily readable summary report at the end, listing only the failed jobs instead of the whole job summary given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure if there is another way than parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
cat .snakemake/log/[TIME].snakemake.log | grep -B 3 -A 3 error
Of course you should replace the TIME placeholder with whichever run you want to check.
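If you only want the names of the failed rules, a small variation on the same idea might work; this is a sketch that assumes the log contains "Error in rule <name>:" lines, which is what Snakemake prints for failed jobs in the versions I have used:
# List just the failed rule names, with a count per rule.
grep '^Error in rule' .snakemake/log/[TIME].snakemake.log \
    | awk '{print $4}' | tr -d ':' | sort | uniq -c | sort -rn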

Webalizer database error

Webalizer has stopped generating statistics.
When I try to check the database, I see:
# webalizer --db-info
Stone Steps Webalizer v3.10.2.5 (Linux 4.6.4-grsec-zfs+)
Using database /home/www/1/statystyka/webalizer.db
Reading history file... /home/www/1/statystyka/webalizer.hist
Cannot find the last URL (ID: 752154) of an active visit (ID: 3)
Saving history information...
When I do it on another site, I see:
# webalizer --db-info
Stone Steps Webalizer v3.10.2.5 (Linux 4.6.4-grsec-zfs+)
Using database /home/www/2/statystyka/webalizer.db
Reading history file... /home/www/2/statystyka/webalizer.hist
Creating output in /home/www/2/statystyka
Database : /home/www/2/statystyka/webalizer.db
Created by : 3.10.2.5
Last updated by : 3.10.2.5
First day : 2017/12/01
Log time : 2017/12/27 01:18:15
Active visits : 2
Active downloads: 0
Incremental : yes
Batch : no
Maintenance time is 0.00 seconds
Total run time is 0.00 seconds
Saving history information...
I tried to run webalizer --end-month but it failed.
How can I fix this problem?
My fix:
Restore the Webalizer database from a backup taken before the damage and rebuild it from the log:
/usr/local/sbin/webalizer -c /etc/webalizer/webalizer-1.conf -q -T /var/log/apache/1.access.log
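If several rotated log files cover the period since the backup, replaying them oldest-first with the same options should rebuild the missing data incrementally. A sketch; the .3/.2/.1 rotation naming is an assumption about your logrotate setup:
# Replay rotated logs oldest-first so Webalizer rebuilds the database step by step.
for log in /var/log/apache/1.access.log.3 \
           /var/log/apache/1.access.log.2 \
           /var/log/apache/1.access.log.1 \
           /var/log/apache/1.access.log; do
    /usr/local/sbin/webalizer -c /etc/webalizer/webalizer-1.conf -q -T "$log"
done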

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-files hdfs:///apps/local/count.pl \
-input /foo/data/bz2 \
-output /user/me/myoutput \
-mapper "cut -f4,8 -d," \
-reducer count.pl \
-combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but it is always only one node. Therefore I'm getting poor data locality, as data is transferred over the network to that node, and probably poor CPU utilization too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at #jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map jobs running on a single node:
yarn jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.
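For reference, two stock commands that show the scheduler's view of the nodes and the actual block placement (the HDFS path is the input directory from the job above):
# List the NodeManagers the ResourceManager knows about; if only one node shows
# up as RUNNING here, every container will land on it regardless of locality.
yarn node -list
# Show which datanodes hold the blocks of each input file.
hdfs fsck /foo/data/bz2 -files -blocks -locations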

ElasticSearch: Accidentally starting 2 instances in same server

So I accidentally started 2 ElasticSearch instances on the same machine, one on port 9200 and the other on port 9201. This means there are 2 cluster nodes, each with the same name, and each has half of the total shards for each index.
If I kill one of the instances, I now end up with 1 instance having 1/2 the shards.
How do I fix this problem? I want to have just 1 instance with all the shards in it (like it used to be).
SO... there is a clean way to resolve this, although I must say the ElasticSearch documentation is very, very confusing (all these buzzwords like cluster and zen discovery boggle my mind!).
1)
Now, say you have 2 instances, one on port 9200 and the other on 9201, and you want ALL the shards to be on 9200.
Run this command to disable allocation in the 9201 instance. You can change persistent to transient if you want this change to not be permanent. I'd keep it persistent so this doesn't ever happen again.
curl -XPUT localhost:9201/_cluster/settings -d '{
  "persistent" : {
    "cluster.routing.allocation.disable_allocation" : true
  }
}'
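To double-check that the setting took, you should be able to read the cluster settings back (depending on your ElasticSearch version):
# Read back the persistent/transient cluster settings to confirm
# disable_allocation is now in place.
curl 'localhost:9201/_cluster/settings?pretty'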
2) Now, run the command to MOVE all the shards in the 9201 instance to 9200.
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "move" : {
      "index" : "<NAME OF INDEX HERE>", "shard" : <SHARD NUMBER HERE>,
      "from_node" : "<ID OF 9201 node>", "to_node" : "<ID of 9200 node>"
    }
  } ]
}'
You need to run this command for every shard in the 9201 instance (the one you want to get rid of).
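To find the node IDs and the shards currently sitting on the 9201 node (the placeholders in the reroute command above), the cluster state API has everything in one dump:
# The "nodes" section maps node IDs to names/addresses, and the routing table
# shows which node each shard is currently assigned to.
curl 'localhost:9200/_cluster/state?pretty'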
If you have ElasticSearch head, that shard will be purple and will have "RELOCATING" status. If you have lots of data, say > 1 GB, it will take a while for the shard to move, perhaps up to an hour or even more, so be patient. Don't shut down the instance/node until everything is done moving.
That's it!

Unable to resolve ERROR 2017: Internal error creating job configuration on EMR when running PIG

I have been trying to run a very simple task with Pig on Amazon EMR. When I run the commands in the interactive shell, everything works fine, but when I run the same thing as a batch job, I get
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal
error creating job configuration.
and the script fails.
Here's my 7 line script. It's just computing averages over tuples of Google bigrams. mc is match count and vc is volume count.
bigrams = LOAD 's3n://<<bucket-name>>/gb-bigrams/*' AS (bigram:chararray, year:int, mc:int, vc:int);
grouped_bigrams = group bigrams by bigram;
answer1 = foreach grouped_bigrams generate group, ((DOUBLE) SUM(bigrams.mc))/COUNT(bigrams) AS avg_mc;
sort_answer1 = ORDER answer1 BY avg_mc desc;
answer2 = LIMIT sort_answer1 5;
STORE answer1 INTO 's3n://<bucket-name>/output/bigram/20130409/answer1';
STORE answer2 INTO 's3n://<bucket-name>/output/bigram/20130409/answer2';
I was guessing the error has something to do with STORE and the S3 path, so I have tried various combinations like using $OUTPUT, backslashes, etc., but I keep getting the same error.
Any help would be highly appreciated.
Have you tried using the S3 Block File System instead of the native file system?
e.g.
s3://<<bucket-name>>/gb-bigrams/*
s3://<bucket-name>/output/bigram/20130409/answer1
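In other words, only the URI scheme in the script's LOAD/STORE lines would change, e.g.:
bigrams = LOAD 's3://<<bucket-name>>/gb-bigrams/*' AS (bigram:chararray, year:int, mc:int, vc:int);
...
STORE answer1 INTO 's3://<bucket-name>/output/bigram/20130409/answer1';
STORE answer2 INTO 's3://<bucket-name>/output/bigram/20130409/answer2';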