Kylin cube build fails at the 3rd step - Hive

I am new to Kylin. I created a Kylin model and cube by following this URL:
http://kylin.apache.org/
Initially the build was successful. Then I created a new cube for the same model, and that cube build failed at the 3rd step:
#3 Step Name: Extract Fact Table Distinct Columns
Actually I had some duplicate rows, so I deleted those rows in Hive and synced the Kylin tables with the Hive tables, but the 3rd step still does not complete. I went through the logs and found the following error:
2016-12-29 11:50:45,421 ERROR [IPC Server handler 18 on 46096] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1482297779079_0128_m_000000_0 - exited : java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.putRowKeyToHLL(FactDistinctHiveColumnsMapper.java:179)
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.map(FactDistinctHiveColumnsMapper.java:155)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
2016-12-29 11:50:45,421 INFO [IPC Server handler 18 on 46096] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1482297779079_0128_m_000000_0: Error: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.putRowKeyToHLL(FactDistinctHiveColumnsMapper.java:179)
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.map(FactDistinctHiveColumnsMapper.java:155)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
Can anybody please share any idea on how to solve this? Also, what does cardinality mean in Kylin data sources?
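For context, one way to do the duplicate clean-up mentioned above is a plain Hive rewrite of the fact table. Below is a minimal sketch of that step run through the HiveServer2 JDBC driver; the table name fact_sales, the connection URL and the credentials are placeholder values, and the same statement can also be run from the Hive CLI.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DeduplicateFactTable {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, credentials and table name are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Rewrite the fact table keeping one copy of each distinct row,
            // then re-sync the table from the Kylin web UI.
            stmt.execute("INSERT OVERWRITE TABLE fact_sales SELECT DISTINCT * FROM fact_sales");
        }
    }
}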

Related

Issue in saving the content of a DataFrame to a table

I have a data source (Hive external tables) which refreshes its data in an ad hoc manner. To avoid any discrepancies during execution, I'm trying to save the data as a table in my own location.
First, I load the data from the data source into a DataFrame:
source = hqlContext.table("datasourcedb.table1") // this is working fine
Then I try to save it to my application location:
source.write.mode('overwrite').saveAsTable("appdb.table1") //No read/write operations on appdb.table1 while doing this action
The above action throws the following exception:
java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: BLOCK
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/03/02 04:31:32 ERROR TaskSetManager: Task 9 in stage 1.0 failed 4 times; aborting job
18/03/02 04:31:32 ERROR InsertIntoHadoopFsRelation: Aborting job.
Note: The size of the source is about 6 GB; hence, no persist action is planned.
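For completeness, here is the same load-and-save flow as a self-contained sketch against the Spark 1.x Java API (which appears to match the hqlContext usage and the stack trace above); the app name, SparkConf settings and everything other than the two table names from the snippet above are assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

public class CopySourceTable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CopySourceTable");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // HiveContext over the underlying SparkContext (Spark 1.x API).
        HiveContext hqlContext = new HiveContext(jsc.sc());

        // Load the refreshed source table into a DataFrame.
        DataFrame source = hqlContext.table("datasourcedb.table1");

        // Overwrite the copy kept in the application database.
        source.write().mode(SaveMode.Overwrite).saveAsTable("appdb.table1");

        jsc.stop();
    }
}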

HSQLDB throws an "Assert failed" exception and a file I/O error on the db.script.new file during CHECKPOINT

Our application is a Java-based desktop application which downloads binary data from a source, parses it, and adds it to an HSQLDB database. When downloading from the sources individually, the application works perfectly, but when doing the same from multiple sources simultaneously, with each source in its own thread, I get the following error:
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 23 in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
or sometimes,
java.sql.SQLException: Assert failed: java.lang.ArrayIndexOutOfBoundsException: 1016 in statement [CHECKPOINT]
followed by
java.sql.SQLException: File input/output error: C:\ProgramData\test\data\database\db.script.new in statement [CHECKPOINT]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.execute(Unknown Source)
Java: 1.8;
HSQL version: 1.8.10
We are not in a position to migrate HSQLDB to the latest version, for various reasons.
HSQL Properties:
hsqldb.script_format=0
runtime.gc_interval=0
sql.enforce_strict_size=false
hsqldb.cache_size_scale=8
readonly=false
hsqldb.nio_data_file=true
hsqldb.cache_scale=14
version=1.8.0
hsqldb.default_table_type=memory
hsqldb.cache_file_scale=1
hsqldb.log_size=200
modified=yes
hsqldb.cache_version=1.7.0
hsqldb.original_version=1.8.0
hsqldb.compatible_version=1.8.0
Any help or hint will be appreciated.
This is a 7-year-old version which is not ideal for multi-threaded usage.
The simple solution is to perform the database updates with a single thread. You can retrofit your multi-threaded application with a synchronized block over a singleton object around the code that performs the database update.
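A minimal sketch of what that retrofit could look like, assuming the updates go through plain JDBC; the table parsed_data, its columns, the connection URL and the credentials are placeholders for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ParsedDataWriter {
    // One shared lock object (effectively a singleton): every thread that
    // writes parsed data must enter this lock, so updates run one at a time.
    private static final Object DB_LOCK = new Object();

    public void insertRecord(long sourceId, byte[] payload) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver"); // HSQLDB 1.8 driver class
        synchronized (DB_LOCK) {
            // Placeholder file URL, user and password; table and columns are hypothetical.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hsqldb:file:C:/ProgramData/test/data/database/db", "sa", "");
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO parsed_data (source_id, payload) VALUES (?, ?)")) {
                ps.setLong(1, sourceId);
                ps.setBytes(2, payload);
                ps.executeUpdate();
            }
        }
    }
}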

Configuration values for hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode in HIVE

I am trying to add data to an external table using Apache Hive. I am getting the following error in the Hive logs:
2015-06-15 17:27:44,614 ERROR [LocalJobRunner Map Task Executor #0]: mr.ExecMapper (ExecMapper.java:map(171)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"transactiondate":"05-01-2015 08:26:21","transactiontype":"CASHOUT","transactionid":144590889,"sourcenumber":null,"destnumber":null,"amount":19000,"assumedfield1":880,"customerid":33394093,"transactionstatus":"COMPLETED","assumedfield2":325,"assumedfield3":175870}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 256
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:933)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:709)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
... 10 more
I googled this error and came across a link which says that we must set the hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode variables to higher values. What are the optimal values for these variables on a single-node Hadoop installation? None of the configuration values I have tried are working for me. Please help.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=250;
Please do not try to increase the Hive partition limits to very high values; it may cause a NameNode crash. If possible, change the partition column and apply new logic over it.
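If you do decide to raise the limits for a one-off load, they only need to be set at the session level rather than cluster-wide. Below is a minimal sketch through the HiveServer2 JDBC driver using the values from the answer above; the connection URL, table and column names are placeholders, and hive.exec.dynamic.partition.mode=nonstrict is included because dynamic-partition inserts usually require it.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DynamicPartitionLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Session-level settings: dynamic partitioning itself,
            // plus the two limits suggested in the answer above.
            stmt.execute("set hive.exec.dynamic.partition=true");
            stmt.execute("set hive.exec.dynamic.partition.mode=nonstrict");
            stmt.execute("set hive.exec.max.dynamic.partitions=1000");
            stmt.execute("set hive.exec.max.dynamic.partitions.pernode=250");
            // Dynamic-partition insert; table and column names are placeholders.
            stmt.execute("INSERT INTO TABLE transactions_ext PARTITION (txn_date) "
                    + "SELECT transactionid, amount, transactiondate AS txn_date "
                    + "FROM staging_transactions");
        }
    }
}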

PriviledgedActionException: able to populate HBase via Hive, however unable to query HBase via Hive

I'm using the current Cloudera Quick Start VM. I've created a Hive table with some data. Then I created an external table with the Hive HBase storage handler, and I was able to populate the HBase table. However, while querying the Hive/HBase table, I got the following error (NullPointerException):
14/04/16 01:18:51 ERROR security.UserGroupInformation: PriviledgedActionException as:hbase (auth:SIMPLE) cause:BeeswaxException(message:java.io.IOException: java.lang.NullPointerException, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, handle:QueryHandle(id:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11), SQLState: )
14/04/16 01:18:51 ERROR beeswax.BeeswaxServiceImpl: Caught BeeswaxException
BeeswaxException(message:java.io.IOException: java.lang.NullPointerException, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, handle:QueryHandle(id:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11, log_context:3ecc8100-e8f8-40a0-916b-00fa5a9b6b11), SQLState: )
at com.cloudera.beeswax.BeeswaxServiceImpl$RunningQueryState.fetch(BeeswaxServiceImpl.java:545)
at com.cloudera.beeswax.BeeswaxServiceImpl$5.run(BeeswaxServiceImpl.java:986)
at com.cloudera.beeswax.BeeswaxServiceImpl$5.run(BeeswaxServiceImpl.java:981)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at com.cloudera.beeswax.BeeswaxServiceImpl.doWithState(BeeswaxServiceImpl.java:772)
at com.cloudera.beeswax.BeeswaxServiceImpl.fetch(BeeswaxServiceImpl.java:980)
at com.cloudera.beeswax.api.BeeswaxService$Processor$fetch.getResult(BeeswaxService.java:987)
at com.cloudera.beeswax.api.BeeswaxService$Processor$fetch.getResult(BeeswaxService.java:971)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:244)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
I embedded the Guava, ZooKeeper, HBase and hive-hbase-handler JARs, and I followed the instructions in this tutorial: http://www.n10k.com/blog/hbase-via-hive-pt2/
The JobTracker and TaskTracker logs, as well as the Beeswax logs, tell me nothing.
Do you have any ideas about what I am doing wrong? I am thankful for any advice!
Best regards, Lena
This is the solution:
Nullpointer exception in HBase MapReduce
The logs were misleading (for me). HBase or Hive was not able to resolve the NameNode.

Create a backup of HBase data on S3 and then restore it

I had an HBase cluster running on Amazon EC2 nodes and wanted to create a backup of my HBase table, so I came up with this tool. I was able to create a backup of the table dummy on S3 using the following command:
java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=dummy
But when I tried to restore the same data to another table (model), it failed with the following:
13/10/24 10:52:52 WARN mapred.FileOutputCommitter: Output path is null in cleanup
13/10/24 10:52:52 WARN mapred.LocalJobRunner: job_local_0002
java.lang.NullPointerException
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveBlock(Jets3tFileSystemStore.java:209)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy5.retrieveBlock(Unknown Source)
at org.apache.hadoop.fs.s3.S3InputStream.blockSeekTo(S3InputStream.java:160)
at org.apache.hadoop.fs.s3.S3InputStream.read(S3InputStream.java:119)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
13/10/24 10:52:53 INFO mapred.JobClient: Job complete: job_local_0002
13/10/24 10:52:53 INFO mapred.JobClient: Counters: 0
Error in Job completetion Params
tablename inputputdir
model s3://mybucket/Wed_Oct_23_19_45_49_IST_2013/model
Access Failure to s3://mybucket/Wed_Oct_23_19_45_49_IST_2013/model , tries=1
The restore command was:
java com.bizosys.oneline.maintenance.HBaseBackup mode=restore backup.folder=s3://mybucket/Wed_Oct_23_19_45_49_IST_2013 tables="model"
FYI, please don't suggest that I install HBase and do the backup on EMR; I know about that option, but for certain reasons I am not using it.