I'm running the prebuilt version of Spark 1.2 for CDH 4 on CentOS. I have copied the hive-site.xml file into the conf directory in Spark so it should see the Hive metastore.
I have three tables in Hive (facility, newpercentile, percentile), all of which I can query from the Hive CLI. After I start the Spark shell and create the HiveContext like so: val hiveC = new org.apache.spark.sql.hive.HiveContext(sc) I run into an issue querying these tables.
If I run the following command: val tableList = hiveC.hql("show tables") and do a collect() on tableList, I get this result: res0: Array[org.apache.spark.sql.Row] = Array([facility], [newpercentile], [percentile])
If I then run this command to get the count of the facility table: val facTable = hiveC.hql("select count(*) from facility"), I get the following output, which I take to mean that it cannot find the facility table to query it:
scala> val facTable = hiveC.hql("select count(*) from facility")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/12/26 10:27:26 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/12/26 10:27:26 INFO ParseDriver: Parsing command: select count(*) from facility
14/12/26 10:27:26 INFO ParseDriver: Parse Completed
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(355177) called with curMem=0, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 346.9 KB, free 264.6 MB)
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(50689) called with curMem=355177, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 49.5 KB, free 264.6 MB)
14/12/26 10:27:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on (size: 49.5 KB, free: 264.9 MB)
14/12/26 10:27:26 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/12/26 10:27:26 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:68
facTable: org.apache.spark.sql.SchemaRDD =
SchemaRDD[2] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Aggregate false, [], [Coalesce(SUM(PartialCount#38L),0) AS _c0#5L]
Exchange SinglePartition
Aggregate true, [], [COUNT(1) AS PartialCount#38L]
HiveTableScan [], (MetastoreRelation default, facility, None), None
Any assistance would be appreciated. Thanks.

scala> val facTable = hiveC.hql("select count(*) from facility")
Great! You have an RDD, now what do you want to do with it?
scala> facTable.collect()
Remember that an RDD is an abstraction on top of your data and is not materialized until you invoke an action on it such as collect() or count().
You would get a very obvious error if you tried to use a non-existent table name.
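The lazy behaviour described above can be seen in miniature with plain Java streams. This is only an analogy and does not use Spark: the class name LazyPipelineDemo and its helper are made up for the sketch, with the stream standing in for the SchemaRDD and the terminal collect standing in for collect() on it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; an analogy for RDD laziness, not Spark code.
class LazyPipelineDemo {
    // Counts how many rows the map step has actually processed.
    static final AtomicInteger evaluated = new AtomicInteger();

    // Only describes the work, much as hiveC.hql(...) only builds a plan.
    static Stream<Integer> buildPipeline(List<Integer> rows) {
        return rows.stream().map(row -> {
            evaluated.incrementAndGet();
            return row * 2;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3));
        System.out.println("rows processed after building: " + evaluated.get());

        // The terminal operation plays the role of collect() on the SchemaRDD.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("rows processed after collecting: " + evaluated.get());
        System.out.println("result: " + result);
    }
}
```

No row is processed when the pipeline is built; all three are processed once the terminal operation runs. In the same way, the query plan printed in the question is the description of the work, not its result.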


Apache Ignite + Spark Dataframes: Client vs Server Doubts

I've been trying to integrate Ignite and Spark. The goal of my application is to write and read Spark DataFrames to/from Ignite. However, I'm facing several issues with larger datasets (> 200 000 000 rows).
I have a 6-node Ignite cluster running on YARN. It has 160 GB of memory and 12 cores. I am trying to save a DataFrame (around 20 GB of raw text data) from Spark into an Ignite cache (partitioned, 1 backup):
def main(args: Array[String]) {
  val ignite = setupIgnite
  closeAfter(ignite) { _ ⇒
    implicit val spark: SparkSession = SparkSession.builder
      .appName("Ignite Benchmark")
      .getOrCreate()

    // Read the source tables from HDFS
    val customer = readDF("csv", "|", Schemas.customerSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/customer")
    val part = readDF("csv", "|", Schemas.partSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/part")
    val supplier = readDF("csv", "|", Schemas.supplierSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/supplier")
    val dateDim = readDF("csv", "|", Schemas.dateDimSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/date_dim")
    val lineorder = readDF("csv", "|", Schemas.lineorderSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/lineorder")

    // Write each DataFrame to an Ignite cache
    writeDF(customer, "customer", List("custkey"), TEMPLATES.REPLICATED)
    writeDF(part, "part", List("partkey"), TEMPLATES.REPLICATED)
    writeDF(supplier, "supplier", List("suppkey"), TEMPLATES.REPLICATED)
    writeDF(dateDim, "date_dim", List("datekey"), TEMPLATES.REPLICATED)
    writeDF(lineorder.limit(200000000), "lineorder", List("orderkey, linenumber"), TEMPLATES.NO_BACKUP)
  }
}
At some point, the Spark application fails with this error:
class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Out of memory in data region [name=default, initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false] Try the following:
^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
^-- Enable eviction or expiration policies
at org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl.allocatePage(
at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(
at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(
at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(
at org.apache.ignite.internal.managers.communication.GridIoManager$
at org.apache.ignite.internal.util.StripedExecutor$
I think the problem lies in the Ignite server being started before the Spark session, as in the official Ignite examples. This server starts caching the data that I am writing to the Ignite cache and exceeds the maximum size of its default region (12 GB, which differs from the 20 GB I defined for my YARN cluster). However, I don't understand why the examples and documentation tell us to create an Ignite server before the Spark context (and session, I assume). I understand that without this the application will hang once all the Spark jobs are terminated, but I don't understand the logic of having a server in the Spark application that starts caching data. I'm very confused by this concept, and for now I have set up this Ignite instance inside Spark to be a client.
This is strange behavior, as all my Ignite nodes (running on YARN) have 20 GB defined for the default region (I changed it and verified it). This suggests to me that the error must come from the Ignite servers started by Spark (I think there is one on the driver and one per worker), as I did not change the default region size in the ignite-config.xml of the Spark application (it defaults to 12 GB, as the error shows). However, does this make sense? Should Spark throw this error when its only goal is to read and write data from/to Ignite? Is Spark participating in caching any data, and does this mean that I should set client mode in the ignite-config.xml of my application, despite the fact that the official examples do not use client mode?
Best regards,
First, the Spark-Ignite connector already connects in client mode.
I'm going to assume that you have enough memory, but you can follow the example in the Capacity Planning guide to be sure.
However, I think the problem is that you're following the sample application a bit too closely(!). The sample -- so as to be self-contained -- includes both a server and a Spark client. If you already have an Ignite cluster, you don't need to start a server in your Spark client.
This is a slightly hacked down example from a real application (in Java, sorry):
try (SparkSession spark = SparkSession
        .builder()
        .config("spark.executor.extraClassPath", igniteClassPath())
        .getOrCreate()) {
    // Get source DataFrame
    Dataset<Row> results = ....

    // Write it to Ignite through the DataFrame writer
    results.write()
        .format(IgniteDataFrameSettings.FORMAT_IGNITE())
        .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), igniteCfgFile)
        .option(IgniteDataFrameSettings.OPTION_TABLE(), "Results")
        .option(IgniteDataFrameSettings.OPTION_STREAMER_ALLOW_OVERWRITE(), true)
        .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "name")
        .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "backups=1")
        .save();
}
I didn't test, but you should get the idea: you need to provide a URL to an Ignite configuration file; it creates the client to connect to that server behind the scenes.

Cloudera ToolRunner

I am using Hue to access the Hive service. I created a Hive table using
create table tablename(colname type,.....)
row format delimited fields terminated by ',';
I uploaded the data (300 000 records) without problems. But when I execute a query like:
select count(*) from tablename;
it creates a MapReduce job, and at that point I get the following warning. How can I resolve it?
WARN : Hadoop command-line option parsing not performed. Implement
the Tool interface and execute your application with ToolRunner to
remedy this.
Complete Log:
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1442315442114_0017
INFO : The url to track the job: http://dwiclmaster:8088/proxy/application_1442315442114_0017/
INFO : Starting Job = job_1442315442114_0017, Tracking URL = http://dwiclmaster:8088/proxy/application_1442315442114_0017/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hadoop/bin/hadoop job -kill job_1442315442114_0017
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2015-09-15 18:29:06,910 Stage-1 map = 0%, reduce = 0%
INFO : 2015-09-15 18:29:15,257 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.65 sec
INFO : 2015-09-15 18:29:21,513 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.19 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 190 msec
INFO : Ended Job = job_1442315442114_0017
This is just a warning from MapReduce, which appears because jobs submitted by Hive do not implement the Tool interface. It can be safely ignored.

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two-node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading a ~170 MB file from S3 immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

Hive always runs mapred jobs in local mode

We are testing a multi-node Hadoop cluster (2.4.0) with Hive (0.13.0). The cluster works fine, but when we run a query in Hive, the mapred jobs are always executed locally.
For example:
Without hive-site.xml (in fact, without any configuration file other than defaults) we set mapred.job.tracker:
hive> SET mapred.job.tracker=;
And run a query:
hive> select count(1) from suricata;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
OpenJDK 64-Bit Server VM warning: You have loaded library /hadoop/hadoop-2.4.0/lib/native/ which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/04/29 12:48:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/29 12:48:02 WARN conf.Configuration: file:/tmp/hadoopuser/hive_2014-04-29_12-47-57_290_2455239450939088471-1/-local-10003/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/29 12:48:02 WARN conf.Configuration: file:/tmp/hadoopuser/hive_2014-04-29_12-47-57_290_2455239450939088471-1/-local-10003/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/hadoopuser/hadoopuser_20140429124747_badfcce6-620e-4718-8c3b-e4ef76bdba7e.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-04-29 12:48:05,450 null map = 0%, reduce = 0%
2014-04-29 12:52:26,982 null map = 100%, reduce = 100%
Ended Job = job_local1983771849_0001
Execution completed successfully
**MapredLocal task succeeded**
Time taken: 270.176 seconds, Fetched: 1 row(s)
What are we missing?
Set as false which will disable the local mode execution in Hive
For each query, the compiler generates a DAG of map-reduce jobs. If the job runs in local mode, check the properties below.
If the auto option is enabled, then Hive runs the job in local mode if:
Total input size <
Total number of map tasks <
Total number of reduce tasks is 1 or 0
These options are available from Hive 0.7 onward.
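The conditions listed above can be written down as a single predicate. The following is a minimal sketch, not Hive's implementation: the class LocalModeRule, the parameter names maxInputBytes and maxMapTasks, and the threshold values used in main are all assumptions made for illustration, since the thread gives neither the property names nor their values.

```java
// Hypothetical helper that mirrors the conditions listed in the answer.
class LocalModeRule {
    /**
     * Returns true when every condition for automatic local mode holds.
     * maxInputBytes and maxMapTasks are placeholder names for the two
     * thresholds, which the thread does not spell out.
     */
    static boolean runsLocally(boolean autoEnabled, long inputBytes, long maxInputBytes,
                               int mapTasks, int maxMapTasks, int reduceTasks) {
        return autoEnabled
                && inputBytes < maxInputBytes
                && mapTasks < maxMapTasks
                && reduceTasks <= 1;
    }

    public static void main(String[] args) {
        // Illustrative thresholds only; check your own Hive configuration.
        long maxInputBytes = 128L * 1024 * 1024;
        int maxMapTasks = 4;

        // Small single-reducer query with the auto option on.
        System.out.println(runsLocally(true, 10_000_000L, maxInputBytes, 1, maxMapTasks, 1));
        // The same query with the auto option off.
        System.out.println(runsLocally(false, 10_000_000L, maxInputBytes, 1, maxMapTasks, 1));
    }
}
```

With the auto option switched off the predicate is false regardless of job size, which matches the advice at the top of this answer.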

Accumulo-Pig error - Connector info for AccumuloInputFormat can only be set once per job

Accumulo 1.5
Pig 0.10
Read/write data in/into Accumulo from Pig, using accumulo-pig.
Encountered an error - any insight into getting past this error is greatly appreciated.
Switching to Accumulo 1.4 is not an option as we are using the Accumulo Thrift Proxy in our C# codebase.
This is currently a roadblock in our project.
Source reference:
Source code -
When attempting to read a dataset in Accumulo from Pig, I get the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
Connector info for AccumuloInputFormat can only be set once per job
Code snippet:
DATA = LOAD 'accumulo://departments?instance=indra&user=root&password=xxxxxxx&zookeepers=cdh-dn01:2181' using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
dump DATA;
Try using the ACCUMULO-1783-1.5 branch from the same repository. The way that Pig sets up the InputFormat doesn't play nicely with how Accumulo sets up InputFormats (notably, Accumulo makes a funny assertion that you never call the same static method more than once for a Configuration).
I have been using Pig 0.12. I doubt there's a difference in how 0.10 sets up the InputFormats as opposed to 0.12, but I'm not positive. YMMV.
I just pushed a fix to the above branch that gets rid of the previously mentioned limitation on Hadoop version.
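To picture the "set once per Configuration" assertion mentioned above, here is a rough sketch in plain Java. It is not Accumulo's or accumulo-pig's code: the class SetOnceGuardDemo, the map standing in for a Hadoop Configuration, and the key names are all hypothetical, and the actual fix on the branch may work differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a set-once guard on a job configuration.
class SetOnceGuardDemo {
    // Made-up flag key; not the key Accumulo actually uses.
    static final String FLAG = "demo.connectorInfo.isSet";

    // Strict setter: a second call on the same configuration fails.
    static void setConnectorInfo(Map<String, String> conf, String user) {
        if (conf.containsKey(FLAG)) {
            throw new IllegalStateException(
                    "Connector info can only be set once per job");
        }
        conf.put("demo.user", user);
        conf.put(FLAG, "true");
    }

    // Tolerant wrapper: skips the call when the configuration is already set up.
    static void setConnectorInfoOnce(Map<String, String> conf, String user) {
        if (!conf.containsKey(FLAG)) {
            setConnectorInfo(conf, user);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setConnectorInfo(conf, "root");
        try {
            setConnectorInfo(conf, "root"); // second setup pass on the same conf
        } catch (IllegalStateException e) {
            System.out.println("second call failed: " + e.getMessage());
        }
        setConnectorInfoOnce(conf, "root"); // guarded call, no exception
        System.out.println("guarded call succeeded");
    }
}
```

A caller that may run its setup step more than once against the same configuration, as described for Pig above, needs something like the guarded wrapper to avoid tripping the strict check.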