Apache Pig: OutOfMemoryError with FOREACH aggregation

I'm getting an OutOfMemoryError when running a FOREACH operation in Apache Pig.
16/06/24 15:14:17 INFO util.SpillableMemoryManager: first memory
handler call- Usage threshold init = 164102144(160256K) used =
556137816(543103K) committed = 698875904(682496K) max =
698875904(682496K)
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 4095"... Killed
My Pig script:
A = LOAD 'PageCountTest/' USING PigStorage(' ') AS (Project:chararray,
Title:chararray, count:int , size:int);
B = GROUP A BY (Project,Title);
C = FOREACH B GENERATE group, SUM(A.count) AS COUNT;
D = ORDER C BY COUNT DESC;
STORE C INTO '/user/hadoop/wikistats';
Sample Data:
aa.b Main_Page 1 14335
aa.d India 1 4075
aa.d Main_Page 1 13190
aa.d Special:RecentChanges 1 200
aa.d Talk:Main_Page/ 1 14147
aa.d w/w/index.php 9 137502
aa Main_Page 6 9872
aa Special:Statistics 1 324
Can anyone please help?

I suspect the memory problem arises during the ORDER BY, since it is the heaviest operation in the script. The easiest fix is to raise the maximum heap size (-Xmx) of the task JVMs when launching your Pig job.
pig -Dmapred.child.java.opts=-Xmx2096M yourjob.pig
You can also set the heap size of the Pig client (in MB) through an environment variable before launching Pig.
export PIG_HEAPSIZE=2096
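For context, the figures in the SpillableMemoryManager log line from the question can be converted to more readable units. This is a small Python sketch that uses only the numbers printed in that log; the variable names are illustrative.

```python
# Values copied from the SpillableMemoryManager log line in the question.
used_bytes = 556137816   # "used"
max_bytes = 698875904    # "max"

# Convert the maximum heap to MiB and compute how full the heap was.
max_mib = max_bytes / (1024 * 1024)
used_fraction = used_bytes / max_bytes

print(f"max heap: {max_mib:.1f} MiB, used: {used_fraction:.1%}")
```

The maximum heap in that run was well under 1 GB, so a setting of around 2096 MB would roughly triple it.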


Unable to load large pandas dataframe to pyspark

I've been trying to join two large pandas DataFrames in PySpark with the following code. I vary the number of executor cores allocated to the application in order to measure the scalability of PySpark (strong scaling).
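import gc
import math
import time
import pandas as pd
from numpy.random import default_rng
from pyspark.sql import SparkSession
r = 1000000000 # 1Bn rows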
it = 10
w = 256
unique = 0.9
TOTAL_MEM = 240
TOTAL_NODES = 14
max_val = r * unique
rng = default_rng()
frame_data = rng.integers(0, max_val, size=(r, 2))
frame_data1 = rng.integers(0, max_val, size=(r, 2))
print(f"data generated", flush=True)
df_l = pd.DataFrame(frame_data).add_prefix("col")
df_r = pd.DataFrame(frame_data1).add_prefix("col")
print(f"data loaded", flush=True)
procs = int(math.ceil(w / TOTAL_NODES))
mem = int(TOTAL_MEM*0.9)
print(f"world sz {w} procs per worker {procs} mem {mem} iter {it}", flush=True)
spark = SparkSession\
    .builder\
    .appName(f'join {r} {w}')\
    .master('spark://node:7077')\
    .config('spark.executor.memory', f'{int(mem*0.6)}g')\
    .config('spark.executor.pyspark.memory', f'{int(mem*0.4)}g')\
    .config('spark.cores.max', w)\
    .config('spark.driver.memory', '100g')\
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true')\
    .getOrCreate()
sdf0 = spark.createDataFrame(df_l).repartition(w).cache()
sdf1 = spark.createDataFrame(df_r).repartition(w).cache()
print(f"data loaded to spark", flush=True)
try:
    for i in range(it):
        t1 = time.time()
        out = sdf0.join(sdf1, on='col0', how='inner')
        count = out.count()
        t2 = time.time()
        print(f"timings {r} {w} {i} {(t2 - t1) * 1000:.0f} ms, {count}", flush=True)
        del out
        del count
        gc.collect()
finally:
    spark.stop()
Cluster:
I am using a standalone Spark cluster of 15 nodes, each with 48 cores and 240 GB of RAM. The master and the driver code run on node1, while the other 14 nodes run workers that are allocated the maximum memory.
In the Spark session, I reserve 90% of each node's memory for the executor and split it 60% to the JVM and 40% to PySpark.
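The split described above can be worked through with the constants from the script. This is only a sketch of the arithmetic, with `jvm_mem` and `pyspark_mem` as illustrative names for the two values passed to `spark.executor.memory` and `spark.executor.pyspark.memory`.

```python
import math

TOTAL_MEM = 240    # GB of RAM per node, from the cluster description
TOTAL_NODES = 14   # worker nodes
w = 256            # total executor cores requested (spark.cores.max)

procs = int(math.ceil(w / TOTAL_NODES))  # cores per worker
mem = int(TOTAL_MEM * 0.9)               # 90% of node memory, in GB
jvm_mem = int(mem * 0.6)                 # value used for spark.executor.memory
pyspark_mem = int(mem * 0.4)             # value used for spark.executor.pyspark.memory

print(procs, mem, jvm_mem, pyspark_mem)
```

The first two values agree with the "procs per worker 19 mem 216" line in the driver output.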
Issue:
When I run the above program, I can see executors being assigned to the app, but it makes no progress, even after 60 minutes. For a smaller row count (10M), it worked without a problem.
Driver output
world sz 256 procs per worker 19 mem 216 iter 8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 14:52:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/N/u/d/dnperera/.conda/envs/cylonflow/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:425: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Negative initial size: -589934400
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warn(msg)
Any help on this is much appreciated.
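One observation about the "Negative initial size: -589934400" message, offered as an assumption rather than a confirmed diagnosis: the number is exactly what a size of 8,000,000,192 bytes becomes when it is stored in a signed 32-bit integer. That is 192 bytes more than one int64 column of 1 billion rows. Where the extra 192 bytes would come from is not known here, and `to_int32` is a hypothetical helper written for this sketch.

```python
def to_int32(n):
    """Interpret the low 32 bits of n as a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n

rows = 1_000_000_000
column_bytes = rows * 8                  # one int64 column, 8 bytes per value
wrapped = to_int32(column_bytes + 192)   # 192 is inferred from the log value

print(column_bytes, wrapped)
```

If this reading is right, a single buffer of roughly 8 GB exceeded the 2 GiB limit of a 32-bit size somewhere in the Arrow conversion path, which would also explain why 10M rows worked.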

TEZ mapper resource request

We recently migrated from MapReduce to Tez for executing Hive queries on EMR. We are seeing cases where the exact same Hive query launches a very different number of mappers. See the Map 3 vertex below: on the first run it requested 305 mappers, and on another run it requested 4534. (Please ignore the KILLED status; I killed the query manually.) Why does this happen? How can we make the number depend on the underlying data size instead?
Run 1
----------------------------------------------------------------------------------------------
VERTICES            MODE       STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1               container  KILLED         5          0        0        5       0       0
Map 3               container  KILLED       305          0        0      305       0       0
Map 5               container  KILLED        16          0        0       16       0       0
Map 6               container  KILLED         1          0        0        1       0       0
Reducer 2           container  KILLED       333          0        0      333       0       0
Reducer 4           container  KILLED       796          0        0      796       0       0
----------------------------------------------------------------------------------------------
VERTICES: 00/06 [>>--------------------------] 0% ELAPSED TIME: 14.16 s
----------------------------------------------------------------------------------------------
Run 2
----------------------------------------------------------------------------------------------
VERTICES            MODE       STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........    container  SUCCEEDED      5          5        0        0       0       0
Map 3               container  KILLED      4534          0        0     4534       0       0
Map 5 ..........    container  SUCCEEDED    325        325        0        0       0       0
Map 6 ..........    container  SUCCEEDED      1          1        0        0       0       0
Reducer 2           container  KILLED       333          0        0      333       0       0
Reducer 4           container  KILLED       796          0        0      796       0       0
----------------------------------------------------------------------------------------------
VERTICES: 03/06 [=>>-------------------------] 5% ELAPSED TIME: 527.16 s
----------------------------------------------------------------------------------------------
This article explains the process in which Tez allocates resources. https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
If Tez grouping is enabled for the splits, then a generic grouping logic is run on these splits to group them into larger splits. The idea is to strike a balance between how parallel the processing is and how much work is being done in each parallel process.
First, Tez tries to find out the resource availability in the cluster for these tasks. For that, YARN provides a headroom value (and in future other attributes may be used). Let's say this value is T.
Next, Tez divides T by the resource per task (say M) to find out how many tasks can run in parallel at once (i.e. in a single wave). W = T/M.
Next, W is multiplied by a wave factor (from configuration - tez.grouping.split-waves) to determine the number of tasks to be used. Let's say this value is N.
If there are a total of X splits (input shards) and N tasks then this would group X/N splits per task. Tez then estimates the size of data per task based on the number of splits per task.
If this value is between tez.grouping.max-size & tez.grouping.min-size then N is accepted as the number of tasks. If not, then N is adjusted to bring the data per task in line with the max/min depending on which threshold was crossed.
For experimental purposes tez.grouping.split-count can be set in configuration to specify the desired number of groups. If this config is specified then the above logic is ignored and Tez tries to group splits into the specified number of groups. This is best effort.
After this the grouping algorithm is executed. It groups splits by node locality, then rack locality, while respecting the group size limits.
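The steps quoted above can be sketched in a few lines of Python. This is a simplified model of the description, not Tez's actual implementation: `estimate_task_count` is a hypothetical function, and the headroom, wave factor and size limits below are illustrative values rather than confirmed defaults.

```python
def estimate_task_count(headroom_mb, task_mb, wave_factor,
                        total_input_bytes, min_size, max_size):
    """Model of the quoted logic: N = (T / M) * waves, then clamp by data per task."""
    w = headroom_mb / task_mb            # W = T / M, tasks per wave
    n = max(1, int(w * wave_factor))     # N, the tentative task count
    bytes_per_task = total_input_bytes / n
    if bytes_per_task > max_size:
        # Too much data per task: raise N until each task is at most max_size.
        n = -(-total_input_bytes // max_size)   # ceiling division
    elif bytes_per_task < min_size:
        # Too little data per task: lower N until each task is at least min_size.
        n = max(1, total_input_bytes // min_size)
    return n

GIB = 1024 ** 3
MIB = 1024 ** 2
total = 100 * GIB   # same input size for every run

# Only the YARN headroom changes between these three calls.
busy_cluster = estimate_task_count(10 * 1024, 1024, 1.5, total, 50 * MIB, GIB)
normal_cluster = estimate_task_count(100 * 1024, 1024, 1.5, total, 50 * MIB, GIB)
idle_cluster = estimate_task_count(10000 * 1024, 1024, 1.5, total, 50 * MIB, GIB)

print(busy_cluster, normal_cluster, idle_cluster)
```

In this model the same input yields very different task counts depending on how much headroom YARN reports when the query starts, which is consistent with the 305 versus 4534 mappers in the question. Narrowing the gap between tez.grouping.min-size and tez.grouping.max-size should make the count depend more on data size and less on headroom.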

Using the COV function in Pig

For some reason, I am not able to get a grasp of the proper syntax for this function.
I have a file called testing.txt:
1
2
3
4
5
6
7
8
I have a Pig script:
testing = load '/testing.txt' using PigStorage(',') as (var1:double);
t = foreach testing generate var1, var1 as var2;
grp = group t all;
result = foreach grp generate AVG(t.var1) as average, COV(t.var1,t.var2) as variance;
dump result;
This should give me the mean and variance.
I tried this as well:
testing = load '/testing.txt' using PigStorage(',') as (var1:double);
grp = group testing all;
result = foreach grp generate AVG(testing.var1) as average, COV(testing.var1,testing.var1) as variance;
dump result;
Both these scripts give me the same error:
ERROR 2078: Caught error from UDF: org.apache.pig.builtin.COV$Intermed [Caught exception in COV.Intermed]
I looked in the Java code and couldn't find anything out of the ordinary.
I was wondering how to use the COV function in Pig.
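For reference, the values the script is expected to produce can be checked outside Pig. The sketch below is Python, not Pig, and relies on the fact that the covariance of a variable with itself is its variance. Which normalization Pig's COV uses (dividing by n or by n - 1) is not stated here, so both are shown; `population_cov` is a hypothetical helper.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8]   # contents of testing.txt

def population_cov(xs, ys):
    """Covariance with division by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

mean = statistics.mean(values)
cov_with_itself = population_cov(values, values)     # equals the population variance
population_variance = statistics.pvariance(values)   # divides by n
sample_variance = statistics.variance(values)        # divides by n - 1

print(mean, cov_with_itself, population_variance, sample_variance)
```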

pig latin - not showing the right record numbers

I have written a pig script for wordcount which works fine. I could see the results from pig script in my output directory in hdfs. But towards the end of my console, I see the following:
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1695568121_0002 1 1 0 0 0 0 0 0 0 0 words_sorted SAMPLER
job_local2103470491_0003 1 1 0 0 0 0 0 0 0 0 words_sorted ORDER_BY /output/result_pig,
job_local696057848_0001 1 1 0 0 0 0 0 0 0 0 book,words,words_agg,words_grouped GROUP_BY,COMBINER
Input(s):
Successfully read 0 records from: "/data/pg5000.txt"
Output(s):
Successfully stored 0 records in: "/output/result_pig"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local696057848_0001 -> job_local1695568121_0002,
job_local1695568121_0002 -> job_local2103470491_0003,
job_local2103470491_0003
2014-07-01 14:10:35,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
As you can see, the job succeeded, but the Input(s) and Output(s) sections both report 0 records read/stored, and all the counter values are 0.
Why are these values zero? They should not be.
I am using Hadoop 2.2 and Pig 0.12.
Here is the script:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;
words_grouped = group words by word;
words_agg = foreach words_grouped generate group as word, COUNT(words);
words_sorted = ORDER words_agg BY $1 DESC;
STORE words_sorted into '/output/result_pig' using PigStorage(':','-schema');
NOTE: my data is present in /data/pg5000.txt and not in default directory which is /usr/name/data/pg5000.txt
EDIT: here is the output of printing my file to console
hadoop fs -cat /data/pg5000.txt | head -10
The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
cat: Unable to write to output stream.
Change the following line
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
to
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
I am assuming a comma as the delimiter here; use whichever character separates the fields in your file. This should solve the issue.
Also note:
If no argument is provided, PigStorage will assume tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (eg: 'a', '|'), known escape character (eg: '\t', '\r') is a valid delimiter.
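To see what the choice of delimiter does to a line of this file, here is a rough Python model of loading one line into a single-field schema such as (lines:chararray). It assumes that fields beyond the declared schema are dropped, so only the first field is kept; `load_first_field` is a hypothetical helper and not part of Pig.

```python
def load_first_field(line, delimiter="\t"):
    """Split a line on the delimiter and keep only the first field."""
    return line.split(delimiter)[0]

# First line of the file, as printed in the question.
line = "The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete"

tab_field = load_first_field(line)          # default tab delimiter
comma_field = load_first_field(line, ",")   # comma delimiter

print(repr(tab_field))
print(repr(comma_field))
```

With the tab delimiter the whole line survives, because the text contains no tabs. With a comma, everything after the first comma is cut off under this assumption, so the delimiter should be a character that does not occur in the text when whole lines are wanted.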

Convert data in a specific format in Apache Pig.

I want to convert data into a specific format in Apache Pig so that I can use a reporting tool on top of it.
For example:
10:00,abc
10:00,cde
10:01,abc
10:01,abc
10:02,def
10:03,efg
The output should be in the following format:
      abc cde def efg
10:00   1   1   0   0
10:01   2   0   0   0
10:02   0   0   1   0
The main problem here is that a value can occur multiple times in a row, depending on the different values available in the sample csv file, up to a total of 120.
Any suggestions to tackle this are more than welcome.
Thanks
Gagan
Try something like the following:
A = load 'data' using PigStorage(',') as (key:chararray,value:chararray);
B = foreach A generate key,(value=='abc'?1:0) as abc,(value=='cde'?1:0) as cde,(value=='efg'?1:0) as efg;
C = group B by key;
D = foreach C generate group as key, SUM(B.abc) as abc, SUM(B.cde) as cde, SUM(B.efg) as efg;
That should give you the number of occurrences of each value for each key.
EDIT: I just noticed the limit of 120 in the question. If the counts must not go above 120, add the following:
E = foreach D generate key,(abc>120?'OVER 120':(chararray)abc) as abc,(cde>120?'OVER 120':(chararray)cde) as cde,(efg>120?'OVER 120':(chararray)efg) as efg;
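The counting logic can also be written without listing each value by hand, which matters when there are up to 120 distinct values. The sketch below is Python rather than Pig and only shows the intended result on the sample rows from the question; the variable names are illustrative.

```python
from collections import Counter

# Sample rows from the question: (time, value).
rows = [
    ("10:00", "abc"), ("10:00", "cde"),
    ("10:01", "abc"), ("10:01", "abc"),
    ("10:02", "def"), ("10:03", "efg"),
]

counts = Counter(rows)                     # occurrences of each (time, value) pair
times = sorted({t for t, _ in rows})       # row labels
values = sorted({v for _, v in rows})      # column labels, discovered from the data

# One row per time, one count per distinct value (0 when the pair never occurs).
table = {t: [counts[(t, v)] for v in values] for t in times}

print("      " + " ".join(values))
for t in times:
    print(t + " " + " ".join(f"{c:>3}" for c in table[t]))
```

Because the column list is derived from the data, new values appear as new columns without any change to the code.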