Oozie taking longer to run compared to Hive CLI - hive

There are two problems I'm currently encountering:
The Hive SQL in my Oozie job takes about 2 hours to complete, whereas running the same SQL from the Hive/Beeline CLI takes around 6 to 7 minutes.
The same SQL was run in Oozie and in the Hive/Beeline CLI, but the two executions appear to use different memory allocations, even though the same hive-site.xml was used for both.
I get an error when running through Oozie but no error when running the same SQL in the CLI. Somehow the HybridHashTableContainer memory allocation is different.
From the logs below, when running in Oozie the memory allocated was 5000000 bytes, but when I ran it in the Hive/Beeline CLI the allocation was 10000000 bytes, and the hash partition allocation succeeded with the higher memory.
Any help would be greatly appreciated.
Logs
|tez.HashTableLoader|: Memory manager allocates 5000000 bytes for the loading hashtable.
|persistence.HashMapWrapper|: Key count from statistics is 1; setting map size to 2
|persistence.HybridHashTableContainer|: Available memory is not enough to create a HybridHashTableContainer!
|persistence.HybridHashTableContainer|: Total available memory: 5000000
|persistence.HybridHashTableContainer|: Estimated small table size: 105
|persistence.HybridHashTableContainer|: Number of hash partitions to be created: 16
|persistence.HybridHashTableContainer|: Total available memory is: 5000000
|persistence.HybridHashTableContainer|: Write buffer size: 524288
|persistence.HybridHashTableContainer|: Using a bloom-1 filter 2 keys of size 8 bytes
|persistence.HybridHashTableContainer|: Each new partition will require memory: 65636
|persistence.HybridHashTableContainer|: Hash partition 0 is created in memory. Total memory usage so far: 65644
|persistence.HybridHashTableContainer|: Hash partition 1 is created in memory. Total memory usage so far: 131280
|persistence.HybridHashTableContainer|: Hash partition 2 is created in memory. Total memory usage so far: 196916
|persistence.HybridHashTableContainer|: Hash partition 3 is created in memory. Total memory usage so far: 262552
|persistence.HybridHashTableContainer|: Hash partition 4 is created in memory. Total memory usage so far: 328188
|persistence.HybridHashTableContainer|: Hash partition 5 is created in memory. Total memory usage so far: 393824
|persistence.HybridHashTableContainer|: Hash partition 6 is created in memory. Total memory usage so far: 459460
|persistence.HybridHashTableContainer|: Hash partition 7 is created in memory. Total memory usage so far: 525096
|persistence.HybridHashTableContainer|: Hash partition 8 is created in memory. Total memory usage so far: 590732
|persistence.HybridHashTableContainer|: Hash partition 9 is created in memory. Total memory usage so far: 656368
|persistence.HybridHashTableContainer|: Hash partition 10 is created in memory. Total memory usage so far: 722004
|persistence.HybridHashTableContainer|: Hash partition 11 is created in memory. Total memory usage so far: 787640
|persistence.HybridHashTableContainer|: Hash partition 12 is created in memory. Total memory usage so far: 853276
|persistence.HybridHashTableContainer|: Hash partition 13 is created in memory. Total memory usage so far: 918912
|persistence.HybridHashTableContainer|: Hash partition 14 is created in memory. Total memory usage so far: 984548
|persistence.HybridHashTableContainer|: Hash partition 15 is created in memory. Total memory usage so far: 1050184
|persistence.HybridHashTableContainer|: There is not enough memory to allocate 16 hash partitions.
|persistence.HybridHashTableContainer|: Number of partitions created: 16
|persistence.HybridHashTableContainer|: Number of partitions spilled directly to disk on creation: 0
hive-site.xml
....
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <name>tez.lib.uris</name>
  <value>maprfs:///apps/tez/tez-0.8,maprfs:///apps/tez/tez-0.8/lib</value>
</property>
<property>
  <name>hive.tez.container.size</name>
  <value>6144</value>
</property>
...
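One way to narrow this down (not from the original post, just a sketch) is to diff the memory- and Tez-related properties between the hive-site.xml that the CLI picks up and the copy shipped with the Oozie workflow; the file paths below are placeholders, not known locations:

import xml.etree.ElementTree as ET

# Placeholder paths: point these at the hive-site.xml actually used by the
# Hive CLI and at the one bundled with the Oozie workflow application.
CLI_CONF = "/etc/hive/conf/hive-site.xml"
OOZIE_CONF = "./oozie-app/hive-site.xml"

def load_props(path):
    """Return {name: value} for every <property> in a hive-site.xml."""
    root = ET.parse(path).getroot()
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

cli, oozie = load_props(CLI_CONF), load_props(OOZIE_CONF)
keys = {k for k in set(cli) | set(oozie)
        if k and ("tez" in k or "memory" in k or "join" in k)}
for k in sorted(keys):
    if cli.get(k) != oozie.get(k):
        print(f"{k}: CLI={cli.get(k)}  Oozie={oozie.get(k)}")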

Related

Redis MEMORY USAGE & INFO MEMORY

MEMORY USAGE key gives the memory in bytes that a key is taking (https://redis.io/commands/memory-usage).
If I sum up the values returned by the command for all of the keys in Redis, should it add up to one of the memory stats returned by INFO MEMORY?
If yes, which one would it be?
used_memory_rss
used_memory_rss_human
used_memory_dataset
No, even if you sum up that output from MEMORY USAGE, you will not get to the sums reported by INFO MEMORY.
MEMORY USAGE attempts to estimate the memory usage associated with a given key - the data but also its overheads.
used_memory_rss is the amount of memory allocated, inclusive of server overheads and fragmentation.
used_memory_dataset attempts to account for the data itself, without overheads.
So, roughly: used_memory_dataset < sum of MEMORY USAGE < used_memory_rss
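For what it's worth, here is a minimal Python sketch of that comparison; it assumes the redis-py client and a local Redis instance, neither of which is part of the answer:

import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)  # hypothetical local instance

# Sum MEMORY USAGE over every key (SCAN avoids blocking the server like KEYS * would).
total = sum(r.memory_usage(key) or 0 for key in r.scan_iter(count=1000))

info = r.info("memory")
print("sum of MEMORY USAGE :", total)
print("used_memory_dataset :", info["used_memory_dataset"])
print("used_memory_rss     :", info["used_memory_rss"])
# Roughly expect: used_memory_dataset < sum of MEMORY USAGE < used_memory_rss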

Memory required by Primary Index

According to Aerospike's website, Primary Indexes will occupy space in RAM given by:
64 bytes × (replication factor) × (number of records)
I was confused whether this is the space that will be occupied on each replica, or the total space occupied by the primary index, i.e. the sum of the space required on each replica.
Every record that you store in Aerospike has two components: the data part and a primary index entry in RAM (which takes 64 bytes) that is used to locate the data, wherever you have stored it. (You get to choose where you want to store the data part; it can be in the process RAM, an SSD drive, or other exotic options.) Aerospike is a distributed database, so it typically has more than one node over which you store your records and is easily horizontally scalable. To avoid losing data upon losing a node, you will typically ask Aerospike to store two copies (r = 2) of every record, each on a different node. So when you look at the RAM usage across the entire cluster of nodes for just the primary index, you need n x r x 64 bytes of RAM. This is all the RAM required just to store the primary index for master records and replica records over all the nodes in the cluster.
So if you had 100 records and 2 copies on a 5-node cluster, the RAM needed just for the PI will be 100 x 2 x 64 bytes spread over 5 servers, i.e. each server will need about (100 x 2 x 64)/5 bytes of RAM for PI storage. (In reality, RAM for the PI is allocated in minimum 1 GB chunks in the Enterprise Edition.)
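A quick back-of-the-envelope check of that example in Python (the numbers are the illustrative ones from the answer, not real sizing figures):

PI_BYTES_PER_RECORD = 64      # primary index entry size in RAM
records = 100                  # example values from the answer
replication_factor = 2
nodes = 5

total_pi_bytes = records * replication_factor * PI_BYTES_PER_RECORD
per_node_bytes = total_pi_bytes / nodes

print(total_pi_bytes)   # 12800 bytes across the whole cluster
print(per_node_bytes)   # 2560 bytes per node (before the 1 GB allocation granularity)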

Cassandra Compacted partition maximum byte size is higher than total space used for the table

I am working on Cassandra version 2.1.13.1218 and cqlsh version 5.0.1.
For a given table, when I run cfstats command, Compacted partition maximum bytes is greater than Space used (total). For example:
Compacted partition maximum bytes: 4.64 MB
and
Space used (total): 2.28 MB.
I would expect the total space used by a table to always be higher, since all partitions, large and small, are part of the table's total space. How can the compacted partition maximum byte size be higher than the total space used by the table?
The command is: ./nodetool cfstats keyspace.columnfamilyname -H
Can someone help me understand this, and what is the difference between Space used (live) and Space used (total)?
The Space used indicates how much space is used by the table on disk. This depends on the OS and the compression ratio.
Compacted partition maximum bytes, on the other hand, is just the maximum partition size encountered (after compaction). This is driven by the data model/schema and the logical record size: for instance, 100 KB records times 40 records, each going into the same partition, will give you a 4 MB partition.
When that partition sits on disk it may be compressed further, so you may end up with 2 MB on disk. Can you share the rest of the stats too (e.g. compression info, min and average partition size, number of keys)?
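To make the arithmetic concrete, here is a small Python sketch using the answer's illustrative numbers; the 0.5 compression ratio is a made-up assumption, not a measured value:

record_bytes = 100 * 1024        # ~100 KB per logical record (answer's example)
records_per_partition = 40
compression_ratio = 0.5          # hypothetical SSTable compression ratio

uncompressed_partition = record_bytes * records_per_partition   # ~4 MB
on_disk = uncompressed_partition * compression_ratio            # ~2 MB

print(uncompressed_partition / 1024 / 1024)  # scale of "Compacted partition maximum bytes"
print(on_disk / 1024 / 1024)                 # scale of "Space used" after compression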

OptaPlanner: what is the allowable size limit?

My problem has a size of 80000, but I get stuck when I exceed this limit.
Is there a limit on the problem size in OptaPlanner?
What is that limit?
I get a Java heap exception when I exceed this limit (80000).
Some ideas to look into:
1) Give the JVM more memory: -Xmx2G
2) Use a more efficient data structure. 80k instances will easily fit into a small amount of memory. My bet is that you have some sort of cross matrix between 2 collections. For example, a distance matrix for 20k VRP locations needs (20k)² = 400m integers (each of which is at least 4 bytes), so it requires almost 2GB of RAM to keep in memory in its most efficient form (an array; see the sketch after this list). Use a profiler such as JProfiler or VisualVM to find out which data structures are taking so much memory.
3) Read the chapter about "planning clone". Sometimes splitting a Job up into a Job and a JobAssignment can save memory, because only the JobAssignment needs to be cloned, while in the other case everything that references Job needs to be planning cloned too.
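As a rough illustration of idea 2, here is the distance-matrix arithmetic in Python; the 4 bytes per cell assumes 32-bit ints, as in the answer:

locations = 20_000               # VRP locations from the example
cells = locations ** 2           # full distance matrix: 400 million entries
bytes_per_cell = 4               # one 32-bit int per distance

matrix_bytes = cells * bytes_per_cell
print(matrix_bytes / 1e9)        # ~1.6 GB, i.e. "almost 2 GB" as in the answer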

Sequential scan in postgres is taking surprisingly long. How do I determine the hardware bottleneck?

I have a vanilla postgres database running on a small server with only one table called "posts".
The table is on the order of ~5GB and contains 9 million rows.
When I run a simple sequential scan operation it takes about 51 seconds:
EXPLAIN ANALYZE select count(*) from posts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=988701.41..988701.42 rows=1 width=0) (actual time=51429.607..51429.607 rows=1 loops=1)
-> Seq Scan on posts (cost=0.00..966425.33 rows=8910433 width=0) (actual time=0.004..49530.025 rows=9333639 loops=1)
Total runtime: 51429.639 ms
(3 rows)
The server specs:
Xeon E3-1220v2
4GB RAM
500GB hard drive (stock 7200rpm, No RAID)
postgres 9.1
Ubuntu 12.04
No L1 or L2 cache
Postgres runs on 1 of 4 cores
Postgres configuration is standard, nothing special
I have isolated the server, and nothing else significant is running on the server
When the query runs, the disk is being read at a rate of ~122M/s (according to iotop) with an "IO>" of ~90%. Only 1 core is being used, at about 12% of its capacity. It looks like little to no memory is used in this operation, maybe ~5MB.
From these statistics it sounds like the bottleneck is IO, but I'm confused because the disk is capable of reading much faster (from a speed test I did using sudo hdparm -Tt /dev/sda I was getting about 10,000M/s), yet at the same time iotop shows a value of 90%, which I don't fully understand.
Your disk certainly does not read at 10GB/sec :) That is cached performance: hdparm -T measures reads from the OS cache, not from the disk. The hardware is maxed out here; 120MB/sec is a typical sequential rate.
I see no indication of a hardware problem. The hardware is being used maximally efficiently.
51sec * 120MB/sec ~ 6GB
You say the table is 5GB in size. Probably it is more like 6GB.
The numbers make sense. No problem here.
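The same estimate in a couple of lines of Python, using the answer's throughput figure and an assumed ~6 GB on-disk table size:

table_gib = 6.0            # assumed on-disk size of the posts table
throughput_mb_s = 120.0    # sequential read rate from the answer

scan_seconds = table_gib * 1024 / throughput_mb_s
print(round(scan_seconds))  # ~51 seconds, matching the EXPLAIN ANALYZE runtime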