Solr Indexing using Nutch Crawler

I'm using Apache Nutch 1.13 and Solr 6.6.0.
I'm running the following command to crawl the content:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/seed.txt TestCrawl 2
I got this exception:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Error running:
/Users/myedlapalli/documents/nutch-solr-3/apache-nutch-1.13/runtime/local/bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch TestCrawl/crawldb -linkdb TestCrawl/linkdb TestCrawl/segments/20171017090519
Failed with exit value 255.
And in the logs:
2017-10-17 09:36:35,032 INFO solr.SolrIndexWriter - Indexing 1/1 documents
2017-10-17 09:36:35,032 INFO solr.SolrIndexWriter - Deleting 0 documents
2017-10-17 09:36:35,161 INFO solr.SolrIndexWriter - Indexing 1/1 documents
2017-10-17 09:36:35,161 INFO solr.SolrIndexWriter - Deleting 0 documents
2017-10-17 09:36:35,174 WARN mapred.LocalJobRunner - job_local193014604_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/nutch: ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:210)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:188)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2017-10-17 09:36:36,109 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Can someone please help me?
Thanks in advance.

In these cases it's usually a good idea to check the logs on the Solr side, but for this particular error you already have the answer, especially this portion:
ERROR: [doc=http://www.cmo.com/features/articles/2017/8/21/5-emerging-technologies-rewrite-the-media-and-entertainment-script-.html] unknown field 'sp_type'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
Solr is complaining that you're sending a document (with id http://www.cmo.com/features/articles/2017/8/21/5...) with one field that is not defined in the schema: sp_type.
You should check what you're sending in that field, or just add the field to the Solr schema.
Keep in mind that if you have more fields that are not defined in the Solr schema, this error will continue to appear. It's usually a good idea to run the bin/nutch indexchecker <URL> command to see what Nutch is going to send to Solr.
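One way to add the missing field, as a sketch: if the nutch core uses a managed schema, the Solr 6.6 Schema API can add it over HTTP (the string field type and the stored/indexed flags here are assumptions, adjust them to how you actually want to use sp_type):
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"sp_type","type":"string","stored":true,"indexed":true}}' http://localhost:8983/solr/nutch/schema
If the core uses a classic schema.xml instead, an equivalent <field name="sp_type" type="string" indexed="true" stored="true"/> entry plus a core reload should have the same effect.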

The first thing I can tell you is to follow the crawling process with the individual bin/nutch commands instead of the crawl script; https://cwiki.apache.org/confluence/display/nutch/NutchTutorial gives you more details.
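For reference, a rough sketch of what one round of the crawl script does, so each stage can be run and checked separately (directory names follow the question; urls is the directory containing seed.txt, and the segment path is whichever directory generate just created):
bin/nutch inject TestCrawl/crawldb urls
bin/nutch generate TestCrawl/crawldb TestCrawl/segments
s1=$(ls -d TestCrawl/segments/* | tail -1)
bin/nutch fetch "$s1"
bin/nutch parse "$s1"
bin/nutch updatedb TestCrawl/crawldb "$s1"
bin/nutch invertlinks TestCrawl/linkdb -dir TestCrawl/segments
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch TestCrawl/crawldb -linkdb TestCrawl/linkdb "$s1"
Running bin/nutch indexchecker against one of the crawled URLs between parse and index is the quickest way to see exactly which fields will be pushed to Solr.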

Related

Hive INSERT OVERWRITE query not writing to HDFS directory: Cannot get DistCp constructor

I have created a view of HBase in Hive with 10 million rows, and when I run the query below, distcp is invoked and it throws the following error.
INSERT OVERWRITE DIRECTORY '/mapred/INPUT' select hive_cdper1.cid,hive_cdper1.emptyp,hive_cdper1.ethtyp,hive_cdper1.gdtyp,hive_cdseg.mrtl from hive_cdper1 join hive_cdseg on hive_cdper1.cnm=hive_cdseg.cnm;
Output:map 100% reduce 100%
2016-10-17 15:05:34,688 INFO [main]: exec.Task (SessionState.java:printInfo(951)) - Moving data to: /mapred/INPUT
from
hdfs://mycluster/mapred/INPUT/.hive-staging_hive_2016-10-17_14-57-48_620_6609613978089243090-1/-ext-10000 2016-10-17 15:05:34,693 INFO [main]: common.FileUtils
(FileUtils.java:copy(551)) - Source is 483335659 bytes. (MAX: 4000000)
2016-10-17 15:05:34,693 INFO [main]: common.FileUtils
(FileUtils.java:copy(552)) - Launch distributed copy (distcp) job.
2016-10-17 15:05:34,695 ERROR [main]: exec.Task
(SessionState.java:printError(960)) - Failed with exception Unable to
move source
hdfs://mycluster/mapred/INPUT/.hive-staging_hive_2016-10-17_14-57-48_620_6609613978089243090-1/-ext-10000 to destination /mapred/INPUT
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move
source
hdfs://mycluster/mapred/INPUT/.hive-staging_hive_2016-10-17_14-57-48_620_6609613978089243090-1/-ext-10000 to destination /mapred/INPUT
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:105)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:222)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1653)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1412)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by: java.io.IOException: Cannot get DistCp constructor:
org.apache.hadoop.tools.DistCp.()
at org.apache.hadoop.hive.shims.Hadoop23Shims.runDistCp(Hadoop23Shims.java:1160)
at org.apache.hadoop.hive.common.FileUtils.copy(FileUtils.java:553)
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2622)
... 21 more
What I wonder here is: I am writing to the same cluster, so why is it invoking distcp instead of a normal cp?
Here I am using Hive 1.2.1 with Hadoop 2.7.2, and my cluster name is mycluster.
Note: I have tried setting hive.exec.copyfile.maxsize=4000000 but it didn't work.
Appreciate your suggestions.
1) Check the permissions of your destination path /mapred/INPUT.
2) If other users do not have write permission, grant it: hadoop fs -chmod a+w /mapred/INPUT
Setting the property below in hive-site.xml solved my issue. (Your log shows the staging output is 483335659 bytes while hive.exec.copyfile.maxsize was 4000000, so the file exceeded the single-copy threshold and Hive launched distcp instead of a plain copy; raising the threshold keeps it on the normal copy path.)
<property>
<name>hive.exec.copyfile.maxsize</name>
<value>3355443200</value>
<description>Maximum file size (in Mb) that Hive uses to do single HDFS copies between directories. Distributed copies (distcp) will be used instead for bigger files so that copies can be done faster.</description>
</property>
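If you prefer not to edit hive-site.xml, the same threshold can usually be passed for a single invocation instead; a minimal sketch, assuming the property is overridable in your Hive build (the placeholder stands for the INSERT OVERWRITE statement from the question):
hive --hiveconf hive.exec.copyfile.maxsize=3355443200 -e "<your INSERT OVERWRITE query>"
Any value larger than the 483335659-byte staging file reported in the log keeps the move on the plain-copy path.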

Why does 'neo4j console' work, and 'neo4j start' doesn't?

I want to use Neo4j. I installed neo4j-community 2.3.2-1 from the Arch Linux AUR, and when I use neo4j console everything works fine. But when I want to start the server in the background with neo4j start, the server won't start, with this error message:
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...process [20559]... waiting for server to be ready... Failed to start within 120 seconds.
Neo4j Server may have failed to start, please check the logs.
The server did not try to start for 120 seconds, more like 2 seconds. In addition, I cannot find the log files anywhere.
Google told me to try neo4j start-no-wait.
When I run this command I get:
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...process [21088]...Started the server in the background, returning...
But nothing is started, and the web client doesn't work like it does when I use neo4j console.
So my basic question is: why does neo4j console work and neo4j start does not? And how can I start Neo4j in the background and stop it again without killing the process?
EDIT:
the console.log says:
2016-01-29 12:51:18.338+0100 INFO Successfully shutdown Neo4j Server
2016-01-29 12:51:18.348+0100 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase#77e4c80f' was successfully initialized, but failed to start. Please see attached cause exception. Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase#77e4c80f' was successfully initialized, but failed to start. Please see attached cause exception.
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase#77e4c80f' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:67)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:234)
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:97)
at org.neo4j.server.CommunityBootstrapper.start(CommunityBootstrapper.java:48)
at org.neo4j.server.CommunityBootstrapper.main(CommunityBootstrapper.java:35)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.database.LifecycleManagingDatabase#77e4c80f' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:462)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:194)
... 3 more
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /var/lib/neo4j/data/graph.db/messages.log (Permission denied)
at org.neo4j.kernel.impl.factory.PlatformModule.createLogService(PlatformModule.java:261)
at org.neo4j.kernel.impl.factory.PlatformModule.<init>(PlatformModule.java:140)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.createPlatform(GraphDatabaseFacadeFactory.java:181)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.newFacade(GraphDatabaseFacadeFactory.java:124)
at org.neo4j.kernel.impl.factory.CommunityFacadeFactory.newFacade(CommunityFacadeFactory.java:43)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.newFacade(GraphDatabaseFacadeFactory.java:108)
at org.neo4j.server.CommunityNeoServer$1.newGraphDatabase(CommunityNeoServer.java:66)
at org.neo4j.server.database.LifecycleManagingDatabase.start(LifecycleManagingDatabase.java:95)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
... 5 more
Caused by: java.io.FileNotFoundException: /var/lib/neo4j/data/graph.db/messages.log (Permission denied)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.neo4j.io.fs.DefaultFileSystemAbstraction.openAsOutputStream(DefaultFileSystemAbstraction.java:61)
at org.neo4j.io.file.Files.createOrOpenAsOuputStream(Files.java:47)
at org.neo4j.logging.RotatingFileOutputStreamSupplier.openOutputFile(RotatingFileOutputStreamSupplier.java:254)
at org.neo4j.logging.RotatingFileOutputStreamSupplier.<init>(RotatingFileOutputStreamSupplier.java:138)
at org.neo4j.logging.RotatingFileOutputStreamSupplier.<init>(RotatingFileOutputStreamSupplier.java:122)
at org.neo4j.kernel.impl.logging.StoreLogService.<init>(StoreLogService.java:164)
at org.neo4j.kernel.impl.logging.StoreLogService.<init>(StoreLogService.java:43)
at org.neo4j.kernel.impl.logging.StoreLogService$Builder.toFile(StoreLogService.java:110)
at org.neo4j.kernel.impl.logging.StoreLogService$Builder.inStoreDirectory(StoreLogService.java:105)
at org.neo4j.kernel.impl.factory.PlatformModule.createLogService(PlatformModule.java:252)
... 13 more
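For what it's worth, the final Caused by above points at a plain permission problem: neo4j console runs the database as your own user, while neo4j start launches it as the packaged service user, which apparently cannot write /var/lib/neo4j/data/graph.db/messages.log. A hedged sketch of a fix, assuming the service user and group are both called neo4j (check how the AUR package actually sets this up):
sudo chown -R neo4j:neo4j /var/lib/neo4j/data
neo4j start
The open-files warning in the output is a separate issue; raising the limit for that user (for example via /etc/security/limits.conf) addresses it, as the startup message itself suggests.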

Nutch problems executing crawl

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands on Windows 7.
Nutch is running; I get results from running bin/nutch, but I keep getting error messages when I try to run a crawl.
I am getting the following error when I try to execute a crawl with Nutch:
Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seed.txt
Failed with exit value 127.
I have JAVA_HOME set, and I have altered the hosts file to include 127.0.0.1 as localhost.
I am curious whether I am referencing the right directory, and whether maybe that is the problem.
The full printout looks like:
User5#User5-PC /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local
$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/ TestCrawl/ 2
Injecting seed URLs
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Injector: starting at 2015-12-23 17:48:21
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl//crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/
Failed with exit value 127.
The hadoop log that I think may have something to do with the error I am getting is:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
2016-01-07 12:24:40,450 ERROR crawl.Injector - Injector: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://localhost:8983/solr
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3048)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 4 more
You are running Linux commands from Cygwin, and there is no C:\ path on Linux systems. The correct command should be something like:
/cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/bin/nutch inject TestCrawl/crawldb /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch1.11/runtime/local/urls/seed.txt
You have the answer to your problem in this message:
2016-01-07 12:24:40,360 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This is happening because the Hadoop version included with Nutch 1.11 is designed to work on Linux out of the box, not on Windows.
I had the same situation and ended up running Nutch 1.11 in an Ubuntu VirtualBox VM.
The hadoop-core jar file is needed when you are working with Nutch. For Nutch 1.11, the compatible hadoop-core jar is 0.20.0.
Please download the jar from this link:
http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm
Paste that jar into the "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.
The problem is pretty clear. According to your Hadoop log, it cannot find the winutils.exe file. Include winutils.exe in the %HADOOP_HOME%/bin folder.
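A minimal sketch of what that can look like from Cygwin, assuming the Windows Hadoop binaries were unpacked under C:\hadoop (any directory containing bin\winutils.exe works):
export HADOOP_HOME='C:\hadoop'
export PATH="$PATH:/cygdrive/c/hadoop/bin"
ls /cygdrive/c/hadoop/bin/winutils.exe
HADOOP_HOME is given as a Windows-style path here because it is the JVM, not Cygwin, that reads it when building the null\bin\winutils.exe path shown in the error above.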

Show databases command not working in hive?

I connected to Hive, and when I try to show all databases using the command below, I get the following error:
techgene#slaveone:~/apps/hive-0.12.0$ hive
Logging initialized using configuration in jar:file:/home/techgene/apps/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
hive> show databases;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Can you please provide a solution for this?
This problem usually occurs when a Hive CLI session is improperly ended. In that case, kill the improperly closed Hive CLI session as follows, and afterwards launch the Hive CLI fresh.
ramisetty#aspire:~$ jps
3710 SecondaryNameNode
4103 RunJar -------------------------> hive CLI instance.
4019 TaskTracker
3467 DataNode
3242 NameNode
4366 Jps
3788 JobTracker
ramisetty#aspire:~$ kill -9 4103
ramisetty#aspire:~$
If the problem still persists, follow the available solutions for: FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
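On the default embedded Derby metastore, one common culprit behind that error is a stale Derby lock file left by the crashed session; a hedged sketch, assuming metastore_db was created in the directory hive was launched from (as in the prompt above):
ls ~/apps/hive-0.12.0/metastore_db/*.lck
rm ~/apps/hive-0.12.0/metastore_db/*.lck
Only remove the lock files once you are sure no other Hive process is still using that metastore.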

Install Spark on existing Hadoop cluster (ISSUE with HIVE)

I am trying to get a Spark/Shark cluster up but keep running into the same problem. I have followed the instructions on https://github.com/amplab/shark/wiki/Running-Shark-on-a-Cluster and set up Hive as stated.
Here are the details, any help would be great.
I have already installed the following packages:
Spark/Shark 1.0.0
Apache Hadoop 2.4.0
Apache Hive 0.13
Scala 2.9.3
Java 7
I configured ~/spark/conf/spark-env.sh as:
export HADOOP_HOME=/path/to/hadoop/
export HIVE_HOME=/path/to/hive/
export MASTER=spark://xxx.xxx.xxx.xxx:7077
export SPARK_HOME=/path/to/spark
export SPARK_MEM=4g
export HIVE_CONF_DIR=/path/to/hive/conf/
source $SPARK_HOME/conf/spark-env.sh
When I start Spark with "./spark-withinfo", I get the following errors:
-hiveconf hive.root.logger=INFO,console
Starting the Shark Command Line Client
14/07/07 16:26:57 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
14/07/07 16:26:57 [main]: WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
Logging initialized using configuration in jar:file:/path/to/hive/lib/hive-exec-0.13.0.jar!/hive-log4j.properties
14/07/07 16:26:57 [main]: INFO SessionState:
Logging initialized using configuration in jar:file:/path/to/hive/lib/hive-exec-0.13.0.jar!/hive-log4j.properties
14/07/07 16:26:57 [main]: INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:344)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:128)
at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2444)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2456)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:338)
... 2 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1137)
... 7 more
Caused by: java.lang.NoSuchFieldError: METASTOREINTERVAL
at org.apache.hadoop.hive.metastore.RetryingRawStore.init(RetryingRawStore.java:78)
at org.apache.hadoop.hive.metastore.RetryingRawStore.<init>(RetryingRawStore.java:60)
at org.apache.hadoop.hive.metastore.RetryingRawStore.getProxy(RetryingRawStore.java:71)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:413)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:401)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:439)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:325)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:285)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4102)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:121)
... 12 more
I guess Spark cannot find some libs to connect to the Hive metastore, but I have been stuck here for a couple of days and don't know how to solve it. BTW, I use MySQL for the Hive metadata, and everything works fine in Hive itself.
Any help is appreciated. Thanks in advance.
You may need to add the MySQL connector jar file before you start Spark.
In my case, I appended the MySQL connector jar to the classpath in $SPARK_HOME/bin/compute-classpath.sh like below:
CLASSPATH=$CLASSPATH:/opt/big/hive/lib/mysql-connector-java-5.1.25-bin.jar