Solr shards abruptly go down

I am indexing around 7 billion documents per day into my SolrCloud cluster of 10 instances, each running with 5 GB Xmx and Xms values. Everything is pushed into a single collection named 'X'. Its schema has 150+ fields, almost all of which are indexed. Collection X has 240 shards with a replication factor of 2.
The problem I am currently facing is that, out of the 240 shards, 3 to 4 shards randomly go down with the following exception:
org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: X slice: shard118
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:747)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:733)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:305)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Following this, we found another exception in the Solr logs:
ERROR (zkCallback-4-thread-4-processing-n:<IP>:8983_solr) [c:X s:shard63 r:core_node57 x:X_shard63_replica1] o.a.s.c.Overseer Could not create Overseer node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
at org.apache.solr.cloud.Overseer.createOverseerNode(Overseer.java:731)
at org.apache.solr.cloud.Overseer.getStateUpdateQueue(Overseer.java:604)
at org.apache.solr.cloud.Overseer.getStateUpdateQueue(Overseer.java:591)
at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:314)
at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:56)
at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:348)
at org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(SolrZkClient.java:268)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
As a workaround, I deleted and recreated the replicas of the shards that were down. This brought the shards back, but it only works intermittently.
It also carries the risk of losing a good amount of data if no in-sync replicas are found.
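For reference, the delete/recreate workaround goes through the Collections API; below is a minimal sketch of it in Python (the Solr URL is a placeholder, and shard118 is simply the shard named in the error above):

import requests

SOLR = "http://localhost:8983/solr"   # any live node of the cluster (placeholder)
COLLECTION = "X"
SHARD = "shard118"                    # the shard reported without a registered leader

# Look up the replicas of the affected shard from the cluster state
state = requests.get(
    f"{SOLR}/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": COLLECTION, "wt": "json"},
).json()
replicas = state["cluster"]["collections"][COLLECTION]["shards"][SHARD]["replicas"]

# Delete the replicas of the shard that are no longer active...
for replica_name, info in replicas.items():
    if info.get("state") != "active":
        requests.get(
            f"{SOLR}/admin/collections",
            params={"action": "DELETEREPLICA", "collection": COLLECTION,
                    "shard": SHARD, "replica": replica_name, "wt": "json"},
        )

# ...then add a fresh replica; Solr picks a live node unless node=<host>:8983_solr is given
requests.get(
    f"{SOLR}/admin/collections",
    params={"action": "ADDREPLICA", "collection": COLLECTION, "shard": SHARD, "wt": "json"},
)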
Can anyone suggest a better way to resolve this issue? This is happening in production, so I cannot recreate the collection (which solves the problem, but the same issue may reappear after some time), nor can I afford to restart ZooKeeper, as many other Spark jobs depend on it.
I have been stuck on this for a long time.
UPDATE:
We are not performing any operations on the SolrCloud that could bring a shard down. The only thing that runs against this collection is a Spark batch job that processes its data. The job runs twice a day, but the shards do not go down while it is running.

Related

mondrian.olap.ResourceLimitExceededException: Mondrian Error:Number of members to be read exceeded limit (10,000)

I have a problem similar to the one reported in the Pentaho forum post below:
https://forums.pentaho.com/threads/47819-Help-regd-member-restriction/?p=141499
The query is very simple:
MDX: SELECT NON EMPTY {[Measures].[QTD_EMPRESAS]} on 0, NON EMPTY {[BAIRRO.BAIRRO_H].[TODOS_BAIRRO_H],[BAIRRO.BAIRRO_H].[06], [BAIRRO.BAIRRO_H].[061]} on 1 FROM [V_DM_EMPRESAS_LOCALIDADE] WHERE {[BAIRRO_F.BAIRRO_H].[06],[BAIRRO_F.BAIRRO_H].[061]}
If I use one filter, the query returns successfully, but with two filters, as in the query above, the error occurs. Both filters together add up to 200 records, far less than the limit reported in the error.
The BAIRRO dimension has more than 10,000 records, but I am filtering on only two BAIRRO members.
The error that occurred:
Caused by: mondrian.olap.ResourceLimitExceededException: Mondrian Error:Number of members to be read exceeded limit (10,000)
at mondrian.resource.MondrianResource$_Def11.ex(MondrianResource.java:1180)
at mondrian.rolap.SqlMemberSource.getMemberChildren2(SqlMemberSource.java:993)
at mondrian.rolap.SqlMemberSource.getMemberChildren(SqlMemberSource.java:891)
at mondrian.rolap.SqlMemberSource.getMemberChildren(SqlMemberSource.java:864)
at mondrian.rolap.NoCacheMemberReader.getMemberChildren(NoCacheMemberReader.java:179)
at mondrian.rolap.RolapCubeHierarchy$NoCacheRolapCubeHierarchyMemberReader.readMemberChildren(RolapCubeHierarchy.java:970)
at mondrian.rolap.RolapCubeHierarchy$NoCacheRolapCubeHierarchyMemberReader.getMemberChildren(RolapCubeHierarchy.java:1027)
at mondrian.rolap.NoCacheMemberReader.getMemberChildren(NoCacheMemberReader.java:159)
at mondrian.rolap.RolapSchemaReader.internalGetMemberChildren(RolapSchemaReader.java:186)
at mondrian.rolap.RolapSchemaReader.getMemberChildren(RolapSchemaReader.java:169)
at mondrian.rolap.RolapSchemaReader.getMemberChildren(RolapSchemaReader.java:162)
at mondrian.olap.DelegatingSchemaReader.getMemberChildren(DelegatingSchemaReader.java:78)
at mondrian.olap.fun.AggregateFunDef$AggregateCalc.getChildCount(AggregateFunDef.java:571)
at mondrian.olap.fun.AggregateFunDef$AggregateCalc.optimizeMemberSet(AggregateFunDef.java:490)
at mondrian.olap.fun.AggregateFunDef$AggregateCalc.optimizeChildren(AggregateFunDef.java:398)
at mondrian.olap.fun.AggregateFunDef$AggregateCalc.optimizeTupleList(AggregateFunDef.java:252)
at mondrian.rolap.RolapResult.<init>(RolapResult.java:314)
at mondrian.rolap.RolapConnection.executeInternal(RolapConnection.java:662)
at mondrian.rolap.RolapConnection.access$000(RolapConnection.java:52)
at mondrian.rolap.RolapConnection$1.call(RolapConnection.java:613)
at mondrian.rolap.RolapConnection$1.call(RolapConnection.java:611)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
To avoid this error, I configured a Role in the Mondrian schema, and in the HierarchyGrant configuration I set rollupPolicy to false.
Is this really a good strategy? Or is there a better one?
Role and Hierarchy grants are for security so I wouldn't set those unless you need them for data access control.
It looks like your mondrian.result.limit configuration setting is set too low at 10,000. For optimizing Mondrian performance, I would start with this mondrian.properties and tweak settings from there:
https://github.com/pentaho/pentaho-platform/blob/master/assemblies/pentaho-solutions/src/main/resources/pentaho-solutions/system/mondrian/mondrian.properties

Dataflow Apache Beam Python job stuck at GroupBy step

I am running a Dataflow job that reads from BigQuery, scans around 8 GB of data, and produces more than 50,000,000 records. At the group-by step I want to group on a key and concatenate one column. After concatenation, the concatenated column becomes larger than 100 MB, which is why I have to do the group by in the Dataflow job: it cannot be done at the BigQuery level due to the 100 MB row size limit.
The Dataflow job scales well while reading from BigQuery but gets stuck at the group-by step. I have two versions of the Dataflow code, and both get stuck there. When I checked the Stackdriver logs, I saw messages along the lines of "processing stuck/lull for more than 1010 seconds" and "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".
I expect the group-by step to complete within 20 minutes, but it stays stuck for more than an hour and never finishes.
I figured it out myself.
Below are the two changes I made to my pipeline:
1. I added a Combine function just after the GroupByKey (see the sketch after this list).
2. Since GroupByKey, when running on multiple workers, exchanges a lot of network traffic, and the network we use does not allow inter-worker communication by default, I had to create a firewall rule allowing traffic from one worker to another (i.e. opening the workers' IP range).
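For illustration, here is a minimal sketch of the first change in Beam Python. It expresses the combine as CombinePerKey with a custom CombineFn, which lets the runner pre-combine values on each worker before the shuffle rather than materializing every group behind a plain GroupByKey; the query, column names, and output path below are placeholders:

import apache_beam as beam


class ConcatFn(beam.CombineFn):
    """Concatenate the string values of each key incrementally."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, value):
        accumulator.append(value)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for accumulator in accumulators:
            merged.extend(accumulator)
        return merged

    def extract_output(self, accumulator):
        return ",".join(accumulator)


# On Dataflow, the usual PipelineOptions (project, region, temp_location, runner)
# would also be passed here; omitted to keep the sketch short.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
            query="SELECT key_col, value_col FROM `project.dataset.table`",
            use_standard_sql=True)
        | "ToKeyValue" >> beam.Map(lambda row: (row["key_col"], row["value_col"]))
        | "ConcatPerKey" >> beam.CombinePerKey(ConcatFn())
        | "Write" >> beam.io.WriteToText("gs://my-bucket/concatenated")
    )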

Kafka Connect S3 Connector OutOfMemory errors with TimeBasedPartitioner

I'm currently working with the Kafka Connect S3 Sink Connector 3.3.1 to copy Kafka messages over to S3 and I have OutOfMemory errors when processing late data.
I know it looks like a long question, but I tried my best to make it clear and simple to understand.
I highly appreciate your help.
High level info
The connector does a simple byte-to-byte copy of the Kafka messages and adds the length of the message at the beginning of the byte array (for decompression purposes).
This is the role of the CustomByteArrayFormat class (see configs below)
The data is partitioned and bucketed according to the Record timestamp
The CustomTimeBasedPartitioner extends the io.confluent.connect.storage.partitioner.TimeBasedPartitioner and its sole purpose is to override the generatePartitionedPath method to put the topic at the end of the path.
The total heap size of the Kafka Connect process is 24 GB (only one node)
The connector processes between 8,000 and 10,000 messages per second
Each message has a size close to 1 KB
The Kafka topic has 32 partitions
Context of OutOfMemory errors
Those errors only happen when the connector has been down for several hours and has to catch up on data
When the connector is turned back on, it begins to catch up but fails very quickly with OutOfMemory errors
Possible but incomplete explanation
The timestamp.extractor configuration of the connector is set to Record when those OOM errors happen
Switching this configuration to Wallclock (i.e. the time of the Kafka Connect process) does NOT throw OOM errors, and all of the late data can be processed, but the late data is no longer correctly bucketed
All of the late data will be bucketed into the YYYY/MM/dd/HH/mm/topic-name path of the time at which the connector was turned back on
So my guess is that while the connector is trying to correctly bucket the data according to the Record timestamp, it does too much parallel reading, leading to OOM errors
The "partition.duration.ms": "600000" parameter make the connector bucket data in six 10 minutes paths per hour (2018/06/20/12/[00|10|20|30|40|50] for 2018-06-20 at 12pm)
Thus, with 24h of late data, the connector would have to output data in 24h * 6 = 144 different S3 paths.
Each 10 minutes folder contains 10,000 messages/sec * 600 seconds = 6,000,000 messages for a size of 6 GB
If it does indeed read in parallel, that would make 864GB of data going into memory
I think that I have to correctly configure a given set of parameters in order to avoid those OOM errors but I don't feel like I see the big picture
The "flush.size": "100000" imply that if there is more dans 100,000 messages read, they should be committed to files (and thus free memory)
With messages of 1KB, this means committing every 100MB
But even if there is 144 parallel readings, that would still only give a total of 14.4 GB, which is less than the 24GB of heap size available
Is the "flush.size" the number of record to read per partition before committing? Or maybe per connector's task?
The way I understand "rotate.schedule.interval.ms": "600000" config is that data is going to be committed every 10 minutes even when the 100,000 messages of flush.size haven't been reached.
My main question is: what is the math that would allow me to plan for memory usage, given:
the number of records per second
the size of the records
the number of Kafka partitions of the topics I read from
the number of Connector tasks (if this is relevant)
the number of buckets written to per hour (here 6 because of the "partition.duration.ms": "600000" config)
the maximum number of hours of late data to process
Configurations
S3 Sink Connector configurations
{
"name": "xxxxxxx",
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"s3.region": "us-east-1",
"partition.duration.ms": "600000",
"topics.dir": "xxxxx",
"flush.size": "100000",
"schema.compatibility": "NONE",
"topics": "xxxxxx,xxxxxx",
"tasks.max": "16",
"s3.part.size": "52428800",
"timezone": "UTC",
"locale": "en",
"format.class": "xxx.xxxx.xxx.CustomByteArrayFormat",
"partitioner.class": "xxx.xxxx.xxx.CustomTimeBasedPartitioner",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"name": "xxxxxxxxx",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"s3.bucket.name": "xxxxxxx",
"rotate.schedule.interval.ms": "600000",
"path.format": "YYYY/MM/dd/HH/mm",
"timestamp.extractor": "Record"
}
}
Worker configurations
bootstrap.servers=XXXXXX
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
consumer.auto.offset.reset=earliest
consumer.max.partition.fetch.bytes=2097152
consumer.partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
group.id=xxxxxxx
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
rest.advertised.host.name=XXXX
Edit:
I forgot to add an example of the errors I have:
[2018-06-21 14:54:48,644] ERROR Task XXXXXXXXXXXXX-15 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:482)
java.lang.OutOfMemoryError: Java heap space
[2018-06-21 14:54:48,645] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:483)
[2018-06-21 14:54:48,645] ERROR Task XXXXXXXXXXXXXXXX-15 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:148)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I was finally able to understand how the Heap Size usage works in the Kafka Connect S3 Connector
The S3 Connector will write the data of each Kafka partition into partitioned paths
The way those paths are partitioned depends on the partitioner.class parameter;
By default, it is by timestamp, and the value of partition.duration.ms will then determine the duration of each partitioned paths.
The S3 Connector will allocate a buffer of s3.part.size bytes per Kafka partition (for all topics read) and per partitioned path
Example with 20 partitions read, a timestamp.extractor set to Record, partition.duration.ms set to 1h, s3.part.size set to 50 MB
The Heap Size needed each hour is then equal to 20 * 50 MB = 1 GB;
But, with timestamp.extractor set to Record, messages whose timestamp corresponds to an earlier hour than the one in which they are read will be buffered in that earlier hour's buffer. Therefore, in reality, the connector will need a minimum of 20 * 50 MB * 2h = 2 GB of memory, because there are always late events, and more if there are events with a lateness greater than 1 hour;
Note that this isn't true if timestamp.extractor is set to Wallclock because there will virtually never be late events as far as Kafka Connect is concerned.
Those buffers are flushed (i.e. leave the memory) at 3 conditions
rotate.schedule.interval.ms time has passed
This flush condition is always triggered.
rotate.interval.ms time has passed in terms of timestamp.extractor time
This means that if timestamp.extractor is set to Record, 10 minutes of Record time can pass in more or less than 10 minutes of actual time
For instance, when processing late data, 10 minutes worth of data will be processed in a few seconds, and if rotate.interval.ms is set to 10 minutes then this condition will trigger every second (as it should);
On the contrary, if there is a pause in the flow of events, this condition will not trigger until it sees an event with a timestamp showing that more than rotate.interval.ms has passed since the condition last triggered.
flush.size messages have been read in less than min(rotate.schedule.interval.ms, rotate.interval.ms)
As with rotate.interval.ms, this condition might never trigger if there are not enough messages.
Thus, you need to plan for Kafka partitions * s3.part.size of heap at the very least
If you are using a Record timestamp for partitioning, you should multiply it by max lateness in milliseconds / partition.duration.ms
This is a worst-case scenario where you have constantly late events in all partitions, across the whole range of max lateness in milliseconds (a worked example with your numbers follows below).
The S3 connector will also buffer consumer.max.partition.fetch.bytes bytes per partition when it reads from Kafka
This is set to 2.1 MB by default.
Finally, you should not assume that all of the heap size is available for buffering Kafka messages, because there are also a lot of other objects in it
A safe consideration would be to make sure that the buffering of Kafka messages does not go over 50% of the total available Heap Size.
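Applied to the configuration in this question (32 Kafka partitions, s3.part.size of 50 MB, partition.duration.ms of 10 minutes, and roughly 24 h of backlog), a rough worst-case sketch of that calculation looks like this:

# Worst-case buffer estimate for the configuration above (a sketch, not an exact model)
kafka_partitions = 32                    # partitions of the topics being read
s3_part_size = 52_428_800                # "s3.part.size" in bytes (50 MB)
partition_duration_ms = 600_000          # "partition.duration.ms" (10 minutes)
max_lateness_ms = 24 * 60 * 60 * 1000    # 24 hours of backlog to catch up

buckets_open_per_partition = max_lateness_ms // partition_duration_ms   # 144
worst_case_bytes = kafka_partitions * s3_part_size * buckets_open_per_partition

print(worst_case_bytes / 1024 ** 3)      # ~225 GiB

That worst case is far beyond the 24 GB heap, which is consistent with the OOM appearing only when catching up on late data with timestamp.extractor set to Record.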
@raphael has explained the workings perfectly.
Pasting a small variation of a similar problem that I faced (too few events to process, but spread across many hours/days).
In my case I had about 150 connectors, and 8 of them were failing with OOM as they had to process about 7 days' worth of data (our Kafka in the test environment was down for about 2 weeks)
Steps Followed:
Reduced s3.part.size from 25 MB to 5 MB for all connectors. (In our scenario, rotate.interval was set to 10 min with flush.size at 10000; most of our events should easily fit within these limits.)
After this change, only one connector was still getting OOM, and it went into OOM within 5 seconds of starting (based on heap analysis), shooting up from 200 MB to 1.5 GB of heap utilization. Looking at the Kafka offset lag, there were only 8K events to process across all 7 days, so this wasn't a case of too many events to handle but rather too few events to handle/flush.
Since we were using hourly partitioning, and for any given hour there were hardly 100 events, all the buffers for these 7 days were created without ever being flushed (i.e. without being released back to the JVM): 7 * 24 * 5 MB * 3 partitions = 2.5 GB (against an Xmx of 1.5 GB)
Fix:
Perform one of the steps below until your connector catches up, then restore your old config. (Recommended approach: 1; a sketch of applying it via the Connect REST API follows after this list.)
1. Update the connector config with a flush.size of 100 or 1000 records (depending on how your data is structured). Drawback: too many small files get created per hour if the actual number of events is more than 1000.
2. Change the partitioning to daily so there will only be daily partitions. Drawback: you'll now have a mix of hourly and daily partitions in your S3 bucket.
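For example, a minimal sketch of applying approach 1 temporarily through the Kafka Connect REST API (the Connect host and connector name below are placeholders):

import requests

CONNECT = "http://localhost:8083"   # Kafka Connect REST endpoint (placeholder)
NAME = "my-s3-sink-connector"       # connector name (placeholder)

# Fetch the current config, lower flush.size while catching up, and push it back
config = requests.get(f"{CONNECT}/connectors/{NAME}/config").json()
config["flush.size"] = "1000"       # or "100", depending on how sparse the backlog is
requests.put(f"{CONNECT}/connectors/{NAME}/config", json=config)

# Once the connector has caught up, restore the original flush.size the same way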

IBM JDK dumps too many heap dumps per OutOfMemoryError. How can I decrease this number?

We're new to the IBM JVM. When reviewing heap dumps caused by OutOfMemoryError (i.e. -XX:+HeapDumpOnOutOfMemoryError), we often see multiple dumps (.phd files) generated in the same instant. Example:
heapdump.20141111.011601.8944.0003.phd
heapdump.20141111.011601.8944.0005.phd
heapdump.20141111.011601.8944.0007.phd
heapdump.20141111.011601.8944.0009.phd
As I read these, the JVM generated these 4 heap dumps at 2014-11-11 01:16:01 AM, for PID 8944.
So why 4? And why 4 in the same second? (I assumed it was because 4 actual OOMs occurred in the same second.)
Reviewing these dumps, I found them to be practically identical. Dumps 2, 3, and 4 don't add any information; they only clutter and fill up the drive.
How can I configure the IBM JVM to dump only one heap dump? Can I configure a 'wait time' between heap dumps?
thanks
Not 100% sure about the cause of the multiple dumps; it could be that the dump/recovery process is generating another OOM.
You can use the IBM JDK -Xdump option (controls the way you use dump agents and dumps) to limit the number of dumps you get on OutOfMemoryErrors by providing a range.
For example, to limit the number to a single dump use a range of 0..1: -Xdump:heap:range=0..1
The default range is 1..4; you can see it by running the JVM with -Xdump:what.

Hive map join: OutOfMemory exception

I am trying to perform a map-side join with one big table (10 GB) and a small table (230 MB). From the small table I will use all the columns to produce output records, after joining on the key columns.
I have used the settings below:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=262144000;
Logs:
2013-09-20 02:43:50 Starting to launch local task to process map join; maximum memory = 1065484288
2013-09-20 02:44:05 Processing rows: 200000 Hashtable size: 199999 Memory usage: 430269904 rate:0.404
2013-09-20 02:44:14 Processing rows: 300000 Hashtable size: 299999 Memory usage: 643070664 rate:0.604
Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:313)
at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:308)
at java.util.jar.Manifest.read(Manifest.java:176)
at java.util.jar.Manifest.<init>(Manifest.java:50)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:168)
at java.util.jar.JarFile.getManifest(JarFile.java:149)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID:
Stage-7
Logs:
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
But still I am facing the OOM exception; the heap size set in my cluster is 1 GB.
Please advise which properties I need to consider and tune to make this map-side join work.
Processing rows: 300000 Hashtable size: 299999 Memory usage: 643070664 rate:0.604
At 300k rows the hash table already uses 60% of your heap. The first question to ask: are you sure you got the table order right? Is the small table in the join really the smaller table in your data? When writing the query, the large table should be the last one in the JOIN clause. Which Hive version are you on, 0.9 or 0.11?
If you are on Hive 0.11 and you are specifying the join correctly, then the first thing to try would be to increase the heap size. From the data above (300k rows ~> 650 MB of heap) you can estimate how much heap you need.
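For example, a back-of-the-envelope extrapolation from that log line (the small-table row count below is a made-up figure, purely for illustration):

# Extrapolate the heap needed for the map-join hash table from the local-task log
rows_loaded = 300_000
heap_used_bytes = 643_070_664                    # "Memory usage" at 300k rows in the log
bytes_per_row = heap_used_bytes / rows_loaded    # ~2.1 KB per hash-table entry

small_table_rows = 1_000_000                     # hypothetical row count of the 230 MB table
needed_bytes = small_table_rows * bytes_per_row
print(f"~{needed_bytes / 1024 ** 3:.1f} GB of heap for the hash table alone")  # ~2.0 GB > 1 GB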
set hive.auto.convert.join = false;
It will not give you a memory exception.
I faced this problem and was only able to get over it by using
set hive.auto.convert.join=false
set hive.auto.convert.join = false;
It won't give you a memory exception because it is not using a map-side join; it uses a normal MapReduce task instead.
You should take this into account: especially when tables are stored with compression, the table size may not look large, but when decompressed it can grow 10x or more, and on top of that, representing the data in a hash table takes even more space. So your table might be smaller than the ~260 MB you set for hive.mapjoin.smalltable.filesize, but the hash-table representation of its decompressed version might not be, and that's why Hive tries to load the table in memory, which eventually causes the OutOfMemoryError (a quick back-of-the-envelope illustration follows below the quote).
According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization:
"There is no check to see if the table is a compressed one or not and what the potential size of the table can be."