Amazon Elastic MapReduce for analyzing S3 logs - amazon-s3

I am using EMR to analyze nginx web logs. I need to process the logs into rows and columns so they are easy to query, so I created two tables, rawlog and processedlog, as follows:
create table rawlog(line string)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://istreamanalytics/logs/';
CREATE EXTERNAL TABLE processedlog (
day string,
hour int,
playSessionId string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
and added a Ruby script to Hive to do the transformation. The script is as follows:
#!/usr/bin/env ruby
# Turn raw nginx access-log lines into "day<TAB>hour<TAB>playSessionId".
mon = {"Jan" => '01', "Feb" => '02', "Mar" => '03', "Apr" => '04', "May" => '05', "Jun" => '06',
       "Jul" => '07', "Aug" => '08', "Sep" => '09', "Oct" => '10', "Nov" => '11', "Dec" => '12'}
STDIN.each_line do |line|
  # Matches e.g.: [08/Jun/2012:09:30:12 +0000] "GET /api?playSessionId=abc123...
  if line =~ /(\d+)\/(\w+)\/(\d+):(\d+):\d+:\d+ \+\d+\] "GET \/api\?playSessionId=([^&\s]+)/
    d   = "#{$3}-#{mon[$2]}-#{$1}"   # YYYY-MM-DD
    h   = $4                         # hour of day
    pid = $5                         # playSessionId, up to the next & or whitespace
    puts "#{d}\t#{h}\t#{pid}"
  end
end
Now when I run the job using the following command in Hive:
from rawlog insert overwrite table processedlog select transform (line) using 'ruby /mnt/var/lib/hive_081/downloaded_resources/hive_transformer.rb' as (day String, hour INT, playSessionId String);
I am getting the following error:
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201206061145_0015, Tracking URL = http://domU-12-31-39-0F-86-07.compute-1.internal:9100/jobdetails.jsp?jobid=job_201206061145_0015
Kill Command = /home/hadoop/.versions/0.20.205/libexec/../bin/hadoop job -Dmapred.job.tracker=10.193.133.241:9001 -kill job_201206061145_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-06-08 09:47:49,644 Stage-1 map = 0%, reduce = 0%
2012-06-08 09:48:50,267 Stage-1 map = 0%, reduce = 0%
2012-06-08 09:48:52,278 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206061145_0015 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201206061145_0015_m_000002 (and more) from job job_201206061145_0015
Exception in thread "Thread-41" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL:
http://10.254.139.143:9103/tasklog?taskid=attempt_201206061145_0015_m_000000_2&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 3 more
Counters:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Can someone tell me what's wrong?

EMR is a very generic tool for dealing with logs.
Why not use a more tailored technology? E.g.:
Sumo Logic
Splunk Storm
Loggly
At least with Sumo you could make that kind of processing much easier.

The only suggestion I would make is to make sure the script works properly before moving to EMR (see the quick local smoke test sketched below). Using EMR to test the script should be the very last step in the process. Beyond that it is usually a basic config problem.
Some basic googling found:
http://entxtech.blogspot.com/2010/10/how-to-unit-test-apache-hive-scripts.html
http://jairam.me/2011/09/08/hive-on-amazon-emr/
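For what it's worth, a minimal local smoke test could look like the following. This is only a sketch: the sample log line is made up, and it assumes the transform script is saved locally as hive_transformer.rb.
# Feed one fabricated nginx access-log line through the transform script and eyeball the output.
echo '10.0.0.1 - - [08/Jun/2012:09:30:12 +0000] "GET /api?playSessionId=abc123&x=1 HTTP/1.1" 200 45' \
  | ruby hive_transformer.rb
# Expected output (tab-separated): 2012-06-08  09  abc123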

More details on the error can be found in the log files, or in your case at: http://10.254.139.143:9103/tasklog?taskid=attempt_201206061145_0015_m_000000_2&start=-8193
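If that tracking URL is unreachable, the same per-attempt logs can usually be read straight off the task node. A sketch, assuming a default Hadoop userlogs layout; the exact directory varies by EMR AMI version and is an assumption here.
# stderr of the failed attempt is where a Ruby error or stack trace would show up.
cat /mnt/var/log/hadoop/userlogs/attempt_201206061145_0015_m_000000_2/stderr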

Related

Hive MapReduce query runs only once after running HiveServer

A Hive MapReduce query runs successfully only once after starting HiveServer; afterwards it gives an error even though it is the same query.
HiveServer info:
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 4.0.0-alpha-2)
Driver: Hive JDBC (version 4.0.0-alpha-2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 4.0.0-alpha-2 by Apache Hive
0: jdbc:hive2://localhost:10000> insert into table abc values(3983);
1 row affected (10.589 seconds)
0: jdbc:hive2://localhost:10000> insert into table abc values(3973);
Error: Error while compiling statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
0: jdbc:hive2://localhost:10000> insert into table abc values(3173);
Error: Error while compiling statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
First (successful) MapReduce query:
Query ID = ndv_20230203183337_f004b3f8-5f56-4a47-822c-d756ce652cef
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-02-03 18:33:45,085 Stage-1 map = 100%, reduce = 0%
2023-02-03 18:33:46,117 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1222265821_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory file:/Users/ndv/Downloads/work/hivestuff/warehouse/abc/.hive-staging_hive_2023-02-03_18-33-37_374_1488321338071048426-1/-ext-10000
Loading data to table default.abc
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
Second run of the same query, which failed:
Query ID = ndv_20230203183352_2fb0fdba-3e66-4e7e-8624-43a93cbdfc9b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-02-03 18:33:54,841 Stage-1 map = 0%, reduce = 0%
Ended Job = job_local2047992482_0002 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
In the HiveServer web UI it shows that the query failed in Stage-1 only:
Stage-1:MAPRED Failure, ReturnVal 2
Stage 1 statistics:
Status: Failure, ReturnVal 2
Error FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce job progress:
0% map, 0% reduce
"Map Progress (%)": 0
"Reduce Progress (%)": 0
"Cleanup Progress (%)": 100
"Setup Progress (%)": 100
"Complete": true
"Successful": false

Returning a map object in Cypher

I need to create edges between a set of nodes, but an edge may already exist. I need to know which edges were actually created so I can increment the edge counter on the two connected nodes.
I want to know the edge count for every node without querying the graph each time.
Example:
MERGE (u:user {id:999049043279872})
MERGE (g1:group {id:346709075951616})
MERGE (g2:group {id:346709075951617})
MERGE (g1)-[m1:member]->(u)
MERGE (g2)-[m2:member]->(u)
Sometimes the user is already a member of the group, in which case I don't want to increment the counter.
I tried using the result statistics, but they only return the total number of created relationships. I also thought about using a map and filling its contents with ON CREATE SET after each MERGE:
WITH {g1:0, g2:0} as res
MERGE (u:user {id:999049043279872})
MERGE (g1:group {id:346709075951616})
MERGE (g2:group {id:346709075951617})
MERGE (g1)-[m1:member]->(u)
ON CREATE SET res.g1 = 1
MERGE (g2)-[m2:member]->(u)
ON CREATE SET res.g2 = 1
RETURN res
But it does not work; the server crashes immediately after executing the query.
Exception:
------ FAST MEMORY TEST ------
17235:M 28 Feb 2022 16:56:50.016 # main thread terminated
17235:M 28 Feb 2022 16:56:50.017 # Bio thread for job type #0 terminated
17235:M 28 Feb 2022 16:56:50.017 # Bio thread for job type #1 terminated
17235:M 28 Feb 2022 16:56:50.018 # Bio thread for job type #2 terminated
Fast memory test PASSED, however your memory can still be broken.
Please run a memory test for several hours if possible.
------ DUMPING CODE AROUND EIP ------
Symbol: (null) (base: (nil))
Module: /lib/x86_64-linux-gnu/libc.so.6 (base 0x7fbfe3dcc000)
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin
$ objdump --adjust-vma=(nil) -D -b binary -m i386:x86-64 /tmp/dump.bin
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
http://github.com/redis/redis/issues
Suspect RAM error? Use redis-server --test-memory to verify it.
Segmentation fault
Any ideas?
Thanks in advance
Neo4j already stores a counter inside each node to track the number of relationships and provide fast count access. When you want the number of members in a group, you can simply do:
MATCH (g:group)
return size((g)<-[:member]-())
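As a usage sketch of that approach (assuming a Neo4j server is actually behind this, as the answer does; the credentials are placeholders), the per-group member counts could be pulled like so:
# Ask Neo4j for the member count per group; for a simple pattern like this the
# planner can answer from the node's degree data instead of expanding relationships.
echo 'MATCH (g:group) RETURN g.id AS group, size((g)<-[:member]-()) AS members;' \
  | cypher-shell -u neo4j -p secret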

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a Snakemake pipeline on a DRMAA cluster with more than 2000 jobs in total. When some of the jobs fail, I would like to receive an easily readable summary report at the end that lists only the failed jobs, instead of the full job summary given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure if there is another way than parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
cat .snakemake/log/[TIME].snakemake.log | grep -B 3 -A 3 error
Of course, you should replace the TIME placeholder with the timestamp of whichever run you want to check.
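A slightly more targeted variant of the same idea, sketched below; it leans on the "Error in rule" wording that Snakemake prints for failed jobs and on the log naming shown above:
# Grab the most recent snakemake log and show only the failed-rule blocks.
LOG=$(ls -t .snakemake/log/*.snakemake.log | head -n 1)
grep -A 10 'Error in rule' "$LOG"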

Pig local mode, group, or join = java.lang.OutOfMemoryError: Java heap space

Using Apache Pig version 0.10.1.21 (reported),
CentOS release 6.3 (Final), jdk1.6.0_31 (The Hortonworks Sandbox v1.2 on Virtualbox, with 3.5 GB RAM)
$ cat data.txt
11,11,22
33,34,35
47,0,21
33,6,51
56,6,11
11,25,67
$ cat GrpTest.pig
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
pig -x local GrpTest.pig
[Thread-12] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[Thread-13] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@19a9bea3
[Thread-13] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
[Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
The java.lang.OutOfMemoryError: Java heap space error occurs each time I use GROUP or JOIN in a Pig script executed in local mode. There is no error when the script is executed in MapReduce mode on HDFS.
Question 1: How can there be an OutOfMemory error when the data sample is minuscule and local mode is supposed to use fewer resources than HDFS mode?
Question 2: Is there a way to successfully run a small Pig script with GROUP or JOIN in local mode?
Solution: force Pig to allocate less memory for the Java property io.sort.mb.
I set it to 10 MB here and the error disappears. I am not sure what the best value would be, but at least this allows practicing Pig syntax in local mode:
$ cat GrpTest.pig
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
The reason is you have less memory allocated to Java locally than you do on your Hadoop cluster machines. This is actually a pretty common error in Hadoop. It mostly occurs when you create a really long relation in Pig at any point, and happens because Pig always wants to load an entire relation into memory and doesn't want to lazy load it in any way.
When you do something like GROUP BY where the tuple you're grouping by is non-sparse over many records, you frequently wind up creating single long relations at least temporarily since you're basically taking a whole bunch of individual relations and cramming them all into one single long relation. Either change your code so you don't wind up creating single very long relations at any point (i.e. group by something more sparse), or increase the memory available to Java.
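For the "increase the memory available to Java" route, here is a sketch of the usual knob; the value is an assumption, and it relies on the stock bin/pig launcher reading PIG_HEAPSIZE (in MB):
# Give the local-mode JVM a bigger heap instead of shrinking io.sort.mb.
export PIG_HEAPSIZE=1024     # MB
pig -x local GrpTest.pig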

Unable to resolve ERROR 2017: Internal error creating job configuration on EMR when running PIG

I have been trying to run a very simple task with Pig on Amazon EMR. When I run the commands in the interactive shell, everything works fine. But when I run the same thing as a batch job, I get
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal
error creating job configuration.
and running the script fails.
Here's my 7-line script. It just computes averages over tuples of Google bigrams; mc is match count and vc is volume count.
bigrams = LOAD 's3n://<<bucket-name>>/gb-bigrams/*' AS (bigram:chararray, year:int, mc:int, vc:int);
grouped_bigrams = group bigrams by bigram;
answer1 = foreach grouped_bigrams generate group, ((DOUBLE) SUM(bigrams.mc))/COUNT(bigrams) AS avg_mc;
sort_answer1 = ORDER answer1 BY avg_mc desc;
answer2 = LIMIT sort_answer1 5;
STORE answer1 INTO 's3n://<bucket-name>/output/bigram/20130409/answer1';
STORE answer2 INTO 's3n://<bucket-name>/output/bigram/20130409/answer2';
I was guessing the error has something to do with STORE and the S3 path, so I have tried various combinations like using $OUTPUT, backslashes, etc., but I keep getting the same error.
Any help would be highly appreciated.
Have you tried using the S3 Block File System instead of the native file system?
e.g.
s3://<<bucket-name>>/gb-bigrams/*
s3://<bucket-name>/output/bigram/20130409/answer1
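If you want to flip between the two schemes without editing the script, the paths can be passed in as parameters, as sketched below. The script name, bucket, and parameter names are made up; inside the script the targets would then be referenced as '$INPUT/*' and '$OUTPUT/answer1'.
# Run the same script with s3:// (block file system) paths supplied at launch time.
pig -param INPUT=s3://mybucket/gb-bigrams \
    -param OUTPUT=s3://mybucket/output/bigram/20130409 \
    bigrams.pig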