AWS Glue - Spark Job - how to increase Memory Limit or run more efficiently? - dataframe

While running a Spark (Glue) job, I get this error when writing a DataFrame to S3:
Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or
disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Is there an easy cure for this?
How can writing a DataFrame to S3 be optimized to use less memory?
How can memory be increased for the containers so that we have more room to work with?

As you may already know, AWS Glue jobs don't support increasing memory. But you can select G.1X as the worker type for the Glue job; AWS recommends it for memory-intensive workloads.
https://docs.aws.amazon.com/en_us/glue/latest/dg/add-job.html
Apart from that, I don't see any configuration option to increase the memory.
Did you check the memory profile in the job's runtime metrics?
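For context on the numbers in the error message, here is a minimal sketch of how a YARN container limit is typically derived from Spark executor settings. It assumes Spark's documented default overhead rule (10% of executor memory, with a 384 MiB floor) and assumes a 5 GiB executor heap, which is not stated in the question. The function name is made up for illustration.

```python
# Sketch: how a YARN container limit relates to Spark executor settings.
# Assumption: default overhead = max(384 MiB, 10% of executor memory).
# Assumption: the executor heap is 5 GiB (not stated in the question).

MIN_OVERHEAD_MIB = 384


def container_limit_mib(executor_memory_mib, overhead_mib=None):
    """Return the physical memory limit YARN would enforce for one executor."""
    if overhead_mib is None:
        # 10% of the executor memory, but never less than the floor.
        overhead_mib = max(MIN_OVERHEAD_MIB, executor_memory_mib // 10)
    return executor_memory_mib + overhead_mib


limit = container_limit_mib(5 * 1024)
print(f"{limit} MiB = {limit / 1024:.1f} GiB")
```

Under these assumptions the limit comes out at 5.5 GiB, which would match the figure in the error. Since the JVM heap is capped, the excess typically comes from memory outside the heap, which is why the message points at the overhead setting.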

Related

Why does the Ignite server show heap usage without any activity?

Ignite version : 2.12
OS : Windows 10
I am trying to understand Ignite's heap usage.
I started the Ignite server with the command below and no special VM args, as suggested by https://ignite.apache.org/docs/latest/quick-start/java
ignite.bat -v ..\examples\config\example-ignite.xml
After that, I started analyzing its heap usage with the VisualVM tool, and the heap usage looks like this
Next, I increased the heap memory and restarted the server.
Surprisingly, Ignite now consumes even more memory, as seen in this graph
I know the GC is working its way through clearing the heap, but why does Ignite's memory consumption increase when the heap space is increased?
How will this impact a server with ~40-60 GB of memory? How much memory can I expect Ignite to consume?
I'm planning to use Ignite as an in-memory cache along with Cassandra as the DB.
Just like Cassandra, Hadoop or Kafka, Ignite is Java middleware that uses the Java heap for various needs. But your data is always stored in off-heap memory, which lets Ignite use all available memory without worrying about garbage collection. This gives Ignite complete control over how the data is managed, and ensures the long-term performance of the system.
Ignite uses a page memory model for storing everything, including user data, indices, meta information, etc. This lets Ignite manage memory itself, improves performance, and lets it use the whole disk without any data modifications.
In other words, although page memory is accessed directly through memory pointers (outside of the JVM heap), some internal tasks, such as bootstrapping Ignite itself or performing local SQL processing, do require JVM heap because Ignite itself is written in Java.
See this page and that page for details.
"How will this impact a server with ~40-60G memory, how much memory I can expect to be consumed by Ignite?"
You would need 40-60 GB of RAM plus something for the JVM itself (the Java heap). Recommended values vary, but 2 GB of Java heap should be enough.

How does the YARN container use the allocated CPU?

I am struggling to understand how yarn containers are limited to allocated resources, especially the CPU.
I am running Spark or Flink jobs in the YARN cluster. Each executor or task manager requests a yarn container that has 1 CPU. Basically, the number of containers is equal to the number of CPUs available in the host.
I understand that YARN monitors the memory usage, and if the container exceeds the limit, it sends a kill signal. I am wondering about how CPU scheduling really works.
My JVM job in the YARN container (1 CPU) can try to create multiple CPU-bound worker threads. Will the JVM be limited to 1 CPU core to execute those threads, or will it steal resources from other containers? Can a YARN container technically affect other containers' CPU performance?
Let's say I have 10 CPUs in the host and I created a single container. Will that container's CPU performance be 10% of the host CPU performance?
By default, YARN allocates resources by RAM only, so it hopes everyone plays nicely, and you can be affected by CPU-hungry jobs. You can change this:
From Apache:
yarn.scheduler.capacity.resource-calculator The ResourceCalculator
implementation to be used to compare Resources in the scheduler. The
default i.e.
org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator only
uses Memory while DominantResourceCalculator uses Dominant-resource to
compare multi-dimensional resources such as Memory, CPU etc. A Java
ResourceCalculator class name is expected.
In general it's enough to estimate by memory. Most people actually estimate their requirements for memory and threads very poorly. It's usually best to ignore threads unless you encounter issues. If it remains an issue, then consider looking at DominantResourceCalculator. If/when you turn on DominantResourceCalculator, be ready for a lot of people to feel the impact. Users may have grossly over-allocated threads, and when you start counting threads, they will suddenly have to account for what they've asked for. (Or at least this was my experience.) This can appear to grossly shrink the capacity of your cluster, as space is reserved where it wasn't before.
TL;DR: Don't touch this unless you have a good reason. (Wait until it's a problem; don't optimize until there is a bottleneck.) Users can make innocent mistakes in their resource estimation, and it can be painful for them to learn to estimate correctly what they need.
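To make the "apparent shrink" concrete, here is a simplified model of how many containers a scheduler would place on one node under each accounting style. It is not YARN's code; the `Resource` class, the function names, and the node and request sizes are all hypothetical.

```python
# Simplified model of container placement on a single node.
# NOT YARN's implementation; it only contrasts memory-only accounting
# with accounting that also counts vcores.
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


def fits_memory_only(node: Resource, request: Resource) -> int:
    """Containers that fit when only memory is counted (DefaultResourceCalculator-style)."""
    return node.memory_mb // request.memory_mb


def fits_all_resources(node: Resource, request: Resource) -> int:
    """Containers that fit when memory and vcores both count (DominantResourceCalculator-style)."""
    return min(node.memory_mb // request.memory_mb, node.vcores // request.vcores)


node = Resource(memory_mb=65536, vcores=10)    # hypothetical 64 GB, 10-core host
request = Resource(memory_mb=2048, vcores=4)   # hypothetical request that over-asks for cores

print(fits_memory_only(node, request))    # 32
print(fits_all_resources(node, request))  # 2
```

In this made-up case the same node goes from 32 schedulable containers to 2 once cores are counted, which is the kind of sudden drop described above. Note that the calculator only changes how the scheduler counts resources; limiting the CPU a running container can actually use is a separate matter (cgroups on the NodeManager).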

Can I use MRJob to process big files in local mode?

I have a relatively big file to process, around 10 GB. I suspect it won't fit into my laptop's RAM if MRJob decides to sort it in RAM or something similar.
At the same time, I don't want to set up Hadoop or EMR: the job is not urgent, and I can simply start the worker before going to sleep and get the results the next morning. In other words, I'm quite happy with local mode. I know the performance won't be perfect, but it's OK for now.
So can it process such 'big' files on a single weak machine? If yes, what would you recommend doing (besides setting a custom tmp dir that points to the filesystem rather than the ramdisk, which will be exhausted quickly)? Let's assume we use version 0.4.1.
I think the RAM size won't be an issue with the Python runner of mrjob. The output of each step should be written out to a temporary file on disk, so I believe it should not fill up the RAM. Dumping output to disk is how Hadoop works as well (and the I/O is the reason it is slow). So I would just run the job and see how it goes.
If the RAM size is an issue, you can create enough swap space on your laptop to make it at least run, though it will be slow if the partition isn't on an SSD.
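As an illustration of why spilling to disk keeps memory bounded, here is a sketch of a disk-backed sort using only the standard library. It is not mrjob's implementation; the function name `external_sort` and its parameters are hypothetical, and it assumes the input lines contain no newline characters.

```python
# Sketch of a disk-backed sort: sort fixed-size chunks in RAM, spill each to a
# temp file, then stream a k-way merge. Illustrative only, not mrjob's code.
import heapq
import os
import tempfile
from itertools import islice


def _read_lines(handle):
    """Yield lines from an open file without their trailing newline."""
    for line in handle:
        yield line.rstrip("\n")


def external_sort(lines, chunk_size=100_000, tmp_dir=None):
    """Yield lines in sorted order while sorting at most chunk_size lines in RAM."""
    chunk_paths = []
    iterator = iter(lines)
    try:
        while True:
            chunk = sorted(islice(iterator, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".chunk")
            with os.fdopen(fd, "w") as handle:
                handle.writelines(line + "\n" for line in chunk)
            chunk_paths.append(path)

        files = [open(path) for path in chunk_paths]
        try:
            # heapq.merge pulls lazily, so it holds about one line per chunk.
            yield from heapq.merge(*(_read_lines(f) for f in files))
        finally:
            for handle in files:
                handle.close()
    finally:
        for path in chunk_paths:
            os.remove(path)


print(list(external_sort(["b\t2", "a\t1", "c\t3", "a\t0"], chunk_size=2)))
```

Passing `tmp_dir` a directory on a real disk (rather than a ramdisk) corresponds to the custom tmp dir mentioned in the question.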

Pig script minimum execution time

I'm currently learning Pig and I'm executing my scripts inside the Hortonworks Sandbox. What has been bugging me from the very beginning is that the minimum execution time for a Pig script seems to be at least 30-40 seconds. Is that because I'm using the Hortonworks Sandbox, or is it normal for Pig scripts? Is there a way to reduce the execution time? It is really slowing my learning progress. If this execution time is normal, can you explain to me what is going on and why?
PS
I've allocated 2GB RAM for the Hortonworks virtual machine. And just to mention I'm currently executing just simple scripts on small data sets.
If you execute pig in local mode (pig -x local) then it'll run a lot faster but it won't do map-reduce and won't access hdfs - it's good for learning though!
Yes, 30-40 seconds is absolutely normal for Pig, because it has a big overhead for compiling the job, launching JVMs, etc.
As stated in the other answer - you can try to run in local mode. It usually takes me about 15 seconds for a simple job with input containing just a few lines of data. My Cloudera VM is allocated with 4G of RAM, btw.

Redis - Can data size be greater than memory size?

I'm rather new to Redis, and before using it I'd like to learn some details that are important to me.
Redis uses RAM and HDD for storing data. RAM is used as fast read/write storage; HDD is used to make the data persistent. When Redis starts, does it load all data from HDD into RAM, or only the frequently queried data? What if I have 500 MB of Redis storage on HDD but only 100 MB of RAM for Redis? Where can I read about it?
Redis loads everything into RAM. All the data is written to disk, but will only be read for things like restarting the server or making a backup.
There are a couple of ways you can use it with less RAM than data though. You can set it up in combination with MySQL or another disk based store to work much like memcached - you manage cache misses and persistence manually.
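The "manage cache misses manually" approach can be sketched as follows. Plain Python containers stand in for Redis and for the disk-based store, so the class and its methods are hypothetical and no real client API is shown.

```python
# Cache-aside sketch: a small in-memory cache in front of a slower "disk" store.
# Plain dicts stand in for Redis and MySQL; all names here are hypothetical.
from collections import OrderedDict


class CacheAside:
    def __init__(self, backing_store, max_items):
        self.backing_store = backing_store   # stands in for the disk-based database
        self.cache = OrderedDict()           # stands in for Redis, bounded in size
        self.max_items = max_items
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        self.misses += 1                     # cache miss: fall back to the store
        value = self.backing_store[key]
        self._remember(key, value)
        return value

    def put(self, key, value):
        self.backing_store[key] = value      # the caller persists the write itself
        self._remember(key, value)

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_items:
            self.cache.popitem(last=False)   # evict the least recently used entry


store = {f"user:{i}": i for i in range(5)}   # 5 records "on disk"
cache = CacheAside(store, max_items=2)       # room for only 2 in "RAM"
for key in ["user:0", "user:1", "user:0", "user:2"]:
    cache.get(key)
print(cache.misses, list(cache.cache))       # 3 ['user:0', 'user:2']
```

The point of the design is that the full data set only has to fit in the backing store, while the cache holds whatever subset is currently hot.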
Redis has a VM mode where all keys must fit in RAM but infrequently accessed data can be on disk. However, I'm not sure if this is in the stable builds yet.
Recent versions (>2.0) have improved significantly and memory management is more efficient. See this blog post that explains how to use hashes to optimize RAM memory footprint: http://antirez.com/post/redis-weekly-update-7.html
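As a rough sketch of the hash technique: many small string keys are folded into fewer hashes, which Redis can store more compactly while each hash stays small. The bucketing scheme below follows the example commonly given in Redis memory optimization guidance; whether it matches the linked post exactly is an assumption, and the helper name and bucket size are illustrative.

```python
# Sketch of key bucketing: map many small string keys onto fewer hashes.
# Assumes numeric ids; the bucket size of 100 is an illustrative choice.

def bucket_key(prefix, object_id, bucket_size=100):
    """Return (hash_key, field) so that bucket_size consecutive ids share one hash."""
    bucket, field = divmod(object_id, bucket_size)
    return f"{prefix}:{bucket}", str(field)


hash_key, field = bucket_key("object", 1234)
print(hash_key, field)   # object:12 34
# With a Redis client this would become: HSET object:12 34 <value>
# instead of:                            SET object:1234 <value>
```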
The feature is called Virtual Memory, and it is officially deprecated:
Redis VM is now deprecated. Redis 2.4 will be the latest Redis version featuring Virtual Memory (but it also warns you that Virtual Memory usage is discouraged). We found that using VM has several disadvantages and problems. In the future of Redis we want to simply provide the best in-memory database (but persistent on disk as usual) ever, without considering at least for now the support for databases bigger than RAM. Our future efforts are focused into providing scripting, cluster, and better persistence.
More information about VM: https://redis.io/topics/virtual-memory