Hudi supports `update` operation?

Hudi supports `update` operation? - apache-hudi

I have an exception when update record with spark sql for hudi as following.
update hudi.cow1 set price=1300 where id=2;
22/10/17 19:24:44 ERROR Executor: Exception in task 0.0 in stage 206.0 (TID 2442)
org.apache.avro.AvroRuntimeException: Not a valid schema field:
at org.apache.avro.generic.GenericData$Record.get(GenericData.java:256)
at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:503)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$11(HoodieSparkSqlWriter.scala:295)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/10/17 19:24:44 WARN TaskSetManager: Lost task 0.0 in stage 206.0 (TID 2442) (192.168.2.228 executor driver): org.apache.avro.AvroRuntimeException: Not a valid schema field:
at org.apache.avro.generic.GenericData$Record.get(GenericData.java:256)
at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:503)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$11(HoodieSparkSqlWriter.scala:295)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I created the table as following.
create table if not exists cow1 (
id int,
name string,
price double
) using hudi
options (
type = 'cow',
primaryKey = 'id'
);
My env is:
mac system;
spark: spark-3.2.2-bin-hadoop3.2
hudi: hudi-spark3.2-bundle_2.12-0.12.0.jar
I put the hudi jar in the jars dir under the spark home.
And I start spark sql with:
./spark-sql --jars ../../hudi-spark3.2-bundle_2.12-0.12.0.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Is this a hudi bug?

I asked in github. the preComineKey property is necessiary for update.

Related

Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks

I'm reading a huge csv file including 39,795,158 records and writing into MSSQL server, on Azure Databricks. The Databricks(notebook) is running on a cluster node with 56 GB Memory, 16 Cores, and 12 workers.
This is my code in Python and PySpark:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from time import sleep
url = "jdbc:sqlserver://{0}:{1};database={2}".format(server, port, database)
spark.conf.set("spark.databricks.io.cache.enabled", True)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Read csv file.
df_lake = spark.read \
.option('header', 'false') \
.schema(s) \
.option('delimiter', ',') \
.csv('wasbs://...')
batch_size = 60000
rows = df_lake.count()
org_pts = df_lake.rdd.getNumPartitions() # 566
new_pts = 1990
# Re-partition the DataFrame
df_repartitioned = df_lake.repartition(new_pts)
# Write the DataFrame into MSSQL server, by using JDBC driver
df_repartitioned.write \
.format("jdbc") \
.mode("overwrite") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", url) \
.option("dbtable", tablename) \
.option("user", username) \
.option("password", password) \
.option("batchsize", batch_size) \
.save()
sleep(10)
Then I got the logs and errors as following as:
rows: 39795158
org_pts: 566
new_pts: 1990
Copy error: An error occurred while calling o9647.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62 in stage 462.0 failed 4 times, most recent failure: Lost task 62.3 in stage 462.0 (TID 46609) (10.139.64.12 executor 27): com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1217)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3508)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:728)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:857)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1025)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1025)
at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2517)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2828)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2775)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2769)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2769)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1305)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1305)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1305)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3036)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2977)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2965)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1067)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2477)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2460)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2517)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2542)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$1(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:419)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:1023)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:855)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:63)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:96)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:213)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:257)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:167)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:166)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:1080)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:130)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:273)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:223)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:1080)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:469)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:439)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:312)
at sun.reflect.GeneratedMethodAccessor448.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1217)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3508)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:728)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:857)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1025)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1025)
at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2517)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
For 3 - 6 millions records, it was no problem. But for 10 millions or above records, it was failed.
I'm not sure why it was happened on 10 millions or above records.
Are there any solutions for huge DataFrame process on Azure Databricks?

Using too many partitions when reading from the external database risks overloading that database with too many queries. Most DBMS systems have limits on the concurrent connections. As a starting point, aim to have the number of partitions be close to the number of cores or task slots in your Spark cluster in order to maximize parallelism but keep the total number of queries capped at a reasonable limit.
Workaround
If you need lots of parallelism after fetching the JDBC rows (because you’re doing something CPU bound in Spark) but don’t want to issue too many concurrent queries to your database then consider using a lower numPartitions for the JDBC read and then doing an explicit repartition() in Spark.
Refer this official doc

I solved by reducing the Memory and Cores of my cluster. I setup the cluster again, with 14GB Memory, 4 Cores, and 8 Workers. It worked. It's writing without any error. I'm not sure why it was failed on bigger settings for cluster

FileNotFound Error While Fetching data from Hive External Table using HiveContext

I am trying to fetch data from a hive external table using HiveContext and storing it in a text file. The path of data for hive external table is hdfs:/data/abc/job_log. My code is failing intermittently with below error.
WARN TaskSetManager: Lost task 1524.0 in stage 0.0 (TID 1524, ): java.io.FileNotFoundException: File does not exist: /data/abc/job_log/abc_job_20171027001515.COPYING
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:672)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
I am using Spark 1.6.1 , Scala 2.10.5 and HDP 2.4.2 cluster.Any help will be appreciated.

HBase/Hive table queried from Squirrel SQL - Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler

I am trying to query a HBase table through Squirrel SQL. Created a Hive external table like the following
create external table tweets_hbase(key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,data:tweet_text")
tblproperties ("hbase.table.name" = "tweets_hbase")
I am able to query through command line HIVE
hive> select * from tweets_hbase;
OK
20160725001730109 {"createdat":"25-Jul-2016 12:17:03","tweet_date":"2016-07-25","text":"私のランドールスゴビ:) \n#abyssrium\nhts:t.co/NcKtQi9lzm ht/t.co/WNgQIxLU05","user":"uw_kyaaaan","uniqueid":1469420239464,"searchtag":"Apple"}
20160725001730266 {"createdat":"25-Jul-2016 12:17:03","tweet_date":"2016-07-25","text":"2016年7月24日\n8422 Steps\n移動距離 6.485 km\n消費カロリー 467.6 kcal\n\n#M7POPOPO ht/t.co/eFathZXTHD","user":"matsuwichi","uniqueid":1469420239465,"searchtag":"Apple"}
20160725001730308 {"createdat":"25-Jul-2016 12:17:03","tweet_date":"2016-07-25","text":"RT #JBCrewdotcom: Don't forget to leave a nice review for #Coldwater after purchasing! \niTunes: t.co/p5YKRwPKNw\nGoogle Play: ht\u2026","user":"2016OLLGAndUGRL","uniqueid":1469420239466,"searchtag":"Apple"}
However when i try to query through Squirrel SQL, i get an Error in loading. The necessary JARs have been added to Extra Class Path.
hive-hbase-handler-1.1.0.jar
hbase-client-1.1.5.jar
hbase-common-1.1.5.jar
hbase-protocal-1.1.5.jar
hbase-server-1.1.5.jar
hive-jdbc-1.1.1-standalone.jar
Please help
java.sql.SQLException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
at net.sourceforge.squirrel_sql.client.session.StatementWrapper.execute(StatementWrapper.java:165)
at net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.processQuery(SQLExecuterTask.java:369)
at net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.run(SQLExecuterTask.java:212)
at net.sourceforge.squirrel_sql.fw.util.TaskExecuter.run(TaskExecuter.java:82)
at java.lang.Thread.run(Unknown Source)

I solved this myself. The following is what I had to do:
Upgrade HBase to 1.2.2
While starting thriftServer start with the following jars with --jars option
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001
--hiveconf hive.server2.thrift.bind.host=xxx.xxx.xxx.xxx --hiveconf spark.cores.max=2 --master spark://xxx.xxx.xxx.xxx:7077 --name
ThriftServer --jars
file:///home/hadoop/software/apache-hive-1.2.1-bin/lib/hive-hbase-handler-1.2.1.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-common-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-protocol-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-client-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/guava-12.0.1.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-server-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/htrace-core-3.1.0-incubating.jar,file:///home/hadoop/software/hbase-1.2.2/lib/metrics-core-2.2.0.jar

sqoop-export is failing when I have \N as data

Iam getting below error when I run my sqoop export command.
This is my content to be exported by sqoop command
00001|Content|1|Content-article|\N|2015-02-1815:16:04|2015-02-1815:16:04|1 |\N|\N|\N|\N|\N|\N|\N|\N|\N
00002|Content|1|Content-article|\N|2015-02-1815:16:04|2015-02-1815:16:04|1 |\N|\N|\N|\N|\N|\N|\N|\N|\N
sqoop command
sqoop export --connect jdbc:postgresql://10.11.12.13:1234/db --table table1 --username user1 --password pass1--export-dir /hivetables/table/ --fields-terminated-by '|' --lines-terminated-by '\n' -- --schema schema
15/06/09 08:05:16 INFO mapreduce.Job: Task Id :
attempt_1431442954745_1210_m_000001_0, Status : FAILED Error:
java.io.IOException: Can't export data, please check failed map task
logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) Caused by: java.lang.RuntimeException: Can't parse input data: '\N'
at duser.__loadFromFields(duser.java:690)
at duser.parse(duser.java:558)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:202)
at duser.__loadFromFields(duser.java:627)
Can you help me resolve it ?

Try adding these arguments to the export statement
--input-null-string "\\\\N" --input-null-non-string "\\\\N"
From the documentation:
If --input-null-string is not specified, then the string "null" will
be interpreted as null for string-type columns. If
--input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string
columns.
If you don't add those arguments, it won't be able to understand that the \N in your data is actually null.

The problem seems to be the order in which columns are being imported. Sqoop doesn't automatically understand the column mapping. Try using --columns argument to specify the order the columns appear in. Here's how to use it:
sqoop export --connect jdbc:postgresql://10.11.12.13:5432/reports ... --columns col1,col2,col3,...
See http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_4 for documentation on how to use --columns.

hive-HBase ClassNotFound happend when do mapreduce job

I have a hive+hbase integration cluster.
I created a table by:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
it is ok when execute:
select * from hbase_table_1;
but when I execute count operation, the classnotfound error will happen.
select count(*) from hbase_table_1;
error info is:
java.io.IOException:cannot find class
at org.apache.............HiveInputformat.getRecordReader(HiveInputFormat.java:220)
...........
Caused by:java.lang.ClassNoteFoundException:
at java.lang.Class.forName0(Native Method)
those error message does not notice me which class.
Sorry for my poor English.
Any one encounter this issue?

1) COPY THESE FILES TO THE HADOOP LIBRARY.
sudo cp /usr/lib/hive/lib/hive-common-0.7.0-cdh3u0.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/hive/lib/hbase-0.90.1-cdh3u0.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/hive/lib/hbase-0.90.1-cdh3u0.jar /usr/lib/hadoop/lib/
2)CLOSE HBASE AND HADOOP USING FOLLOWING COMMOND
/usr/lib/hadoop/bin/stop-all.sh
/usr/lib/hbase/bin/stop-hbase.sh
3) RESTART HBASE AND HADOOP USING COMMOND
/usr/lib/hadoop/bin/start-all.sh
/usr/lib/hadoop/bin/start-hbase.sh
Now create table in hive using Hbase storage handler.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Hudi supports `update` operation? - apache-hudi

I asked in github. the preComineKey property is necessiary for update.

Related

Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks

FileNotFound Error While Fetching data from Hive External Table using HiveContext

HBase/Hive table queried from Squirrel SQL - Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler

sqoop-export is failing when I have \N as data

hive-HBase ClassNotFound happend when do mapreduce job

Categories

Resources