nutch 2.0 fetch page repeatedly when a job failed - apache

I use mysql as storage backend with nutch.
Job failed when crawling some sites. Got the following exception and exit nutch when reaching this page: http://www.appchina.com/users.html
Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
So I modify the ./src/java/org/apache/nutch/util/NutchJob.java
change the
if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
to
if (getConfiguration().getBoolean("fail.on.job.failure", false)) {
After recompiling, I won't get any exception, but unlimited restart crawling.
FetcherJob : timelimit set for : -1
FetcherJob: threads: 30
FetcherJob: parsing: false
FetcherJob: resuming: false
Using queue mode : byHost
Fetcher: threads: 30
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.appchina.com/
fetching http://www.appchina.com/users.html
-finishing thread FetcherThread0, activeThreads=29
-finishing thread FetcherThread29, activeThreads=28
...
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0.4 pages/s, 137 137 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://www.appchina.com/
Parsing http://www.appchina.com/users.html
UPDATE
error in hadoop.log
2012-09-17 18:48:51,257 WARN mapred.LocalJobRunner - job_local_0004
java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
... 6 more
Caused by: java.sql.SQLException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
... 8 more
UPDATE again
I've drop the table gora created and create a similar table with a VARCHAR(128) id and utf8mb4 DEFAULT CHARSET. It works now. Why?
Anyone help?

You need to add the hadoop logs for the Parse job. The stack trace attached is not showing that info. After u did that code change, did parsing happen successfully ?

Related

Structured Streaming in Databricks Azure throwing exception - java.lang.IllegalStateException: Error reading delta file dbfs:/raw_zone/1.delta

We are using Structured Streaming in Databricks environment, Every time while we run this program - kAFKA - Structured Streaming (DBR6.6, Spark 2.4.5) - Writing to CosmosDB, we are getting the same exception as below just before we do the final joins to save the data to Cosmos DB. We haven't modified any spark specific settings and leveraging the default spark /DBR configurations.
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 174 in stage 9353.0 failed 4 times, most recent failure:
Lost task 174.3 in stage 9353.0 (TID 60863, 10.139.64.9, executor 1):
java.lang.IllegalStateException:
Error reading delta file dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta of HDFSStateStoreProvider[id = (op=8,part=174),dir = dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues]:
dbfs:/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta does not exist
Caused by: java.io.FileNotFoundException:
/6455647419774311/raw_zone/uffRetail_jointbl_dev_cp1/state/8/174/left-keyToNumValues/1.delta

RavenDB build 35215 Concurrent merge failed warning

That warning is spamming the logs, it happens every few seconds. can someone tells me what does it mean, how is it affecting the raven server and how to fix it ?
Message
Concurrent merge failed
Exception
Lucene.Net.Index.MergePolicy+MergeException: Exception of type
'Lucene.Net.Index.MergePolicy+MergeException' was thrown. --->
Lucene.Net.Index.CorruptIndexException: docs out of order (0 <= 0 )
at Lucene.Net.Index.IndexWriter.HandleMergeException(Exception t,
OneMerge merge) at Lucene.Net.Index.IndexWriter.Merge(OneMerge
merge) at
Lucene.Net.Index.ConcurrentMergeScheduler.MergeThread.Run() --- End
of inner exception stack trace --- at
Lucene.Net.Index.ConcurrentMergeScheduler.HandleMergeException(Exception
exc) at
Raven.Database.Indexing.ErrorLoggingConcurrentMergeScheduler.HandleMergeException(Exception
exc) in
C:\Builds\RavenDB-Stable-3.5\Raven.Database\Indexing\ErrorLoggingConcurrentMergeScheduler.cs:line
15
Logged
3 hours ago (01/09/18, 11:57am)
Level
Warn
Logger
Raven.Database.Indexing.ErrorLoggingConcurrentMergeScheduler

Kylin build is failed at 3rd step

I am new to Kylin,I create kylin model and cube by following url,
http://kylin.apache.org/
initially it is successfull,again i created new cube for the same model,at that time cube build is failed at 3rd step as,
#3 Step Name: Extract Fact Table Distinct Columns
actually i have some duplicated rows,so i deleted those rows in hive and i did sync kylin tables with hive tables.But that is not completing that 3rd step.I gone through the logs,i find the following error,
2016-12-29 11:50:45,421 ERROR [IPC Server handler 18 on 46096] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1482297779079_0128_m_000000_0 - exited : java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.putRowKeyToHLL(FactDistinctHiveColumnsMapper.java:179)
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.map(FactDistinctHiveColumnsMapper.java:155)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
2016-12-29 11:50:45,421 INFO [IPC Server handler 18 on 46096] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1482297779079_0128_m_000000_0: Error: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.putRowKeyToHLL(FactDistinctHiveColumnsMapper.java:179)
at org.apache.kylin.engine.mr.steps.FactDistinctHiveColumnsMapper.map(FactDistinctHiveColumnsMapper.java:155)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
anybody please share any idea how to solve this one.what is cardinality means in kylin data sources.

Exception while extracting substring from hive column

I have one column "category" which contain data like this
"Failed extract of third-party root list from auto update cab at: <http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: The data is invalid."
I need to select url part in between " < > " sign of category column.
I have written a hive query -
select level,category,regexp_extract(category,'http://[^\>]*') AS url from event where level='Error';
I got an exception :
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201406122248_0014, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201406122248_0014
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201406122248_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-13 02:13:35,696 Stage-1 map = 0%, reduce = 0%
2014-06-13 02:14:13,895 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201406122248_0014 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201406122248_0014
Examining task ID: task_201406122248_0014_m_000002 (and more) from job job_201406122248_0014
Task with the most failures(4):
-----
Task ID:
task_201406122248_0014_m_000000
URL:
http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201406122248_0014&tipid=task_201406122248_0014_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"level":"Error","datetimes":"6/13/2014 9:24:05 AM","source":"Microsoft-Windows-CAPI2","eventid":4107,"task":"None","category":"\"Failed extract of third-party root list from auto update cab at: <http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab> with error: The data is invalid."}
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:159)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
how to fix this?
Please help.

Orchard.Alias.Implementation.Updater.AliasHolderUpdater - Exception during Alias refresh

crosspost: https://orchard.codeplex.com/discussions/473454
I want to start by saying I'm currently migrating from Orchard CMS 1.6 to 1.7.2. So it used to work in 1.6 but I'm now having issues with 1.7.2.
2 of my Content Types are having issues when creating items, they never finish saving and when I check the logs I get this:
Orchard.Alias.Implementation.Updater.AliasHolderUpdater - Exception during Alias refresh
NHibernate.Exceptions.GenericADOException: could not execute query
[ select aliasrecor0_.Id as Id1829_, aliasrecor0_.Path as Path1829_, aliasrecor0_.RouteValues as RouteVal3_1829_, aliasrecor0_.Source as Source1829_, aliasrecor0_.Action_id as Action5_1829_ from Orchard_Alias_AliasRecord aliasrecor0_ where aliasrecor0_.Id>#p0 order by aliasrecor0_.Id asc ]
Name:p1 - Value:48
[SQL: select aliasrecor0_.Id as Id1829_, aliasrecor0_.Path as Path1829_, aliasrecor0_.RouteValues as RouteVal3_1829_, aliasrecor0_.Source as Source1829_, aliasrecor0_.Action_id as Action5_1829_ from Orchard_Alias_AliasRecord aliasrecor0_ where aliasrecor0_.Id>#p0 order by aliasrecor0_.Id asc] ---> System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception: The wait operation timed out
When I stop it and view the site (anywhere really), it's entirely wrecked with this error:
Exception Details: System.ComponentModel.Win32Exception: The wait operation timed out
[Win32Exception (0x80004005): The wait operation timed out]
[SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.]
Line 162: return criteria
Line 163: .List<ContentItemVersionRecord>()
Line 164: .Select(x => ContentManager.Get(x.ContentItemRecord.Id, _versionOptions != null && _versionOptions.IsDraftRequired ? _versionOptions : VersionOptions.VersionRecord(x.Id)))
Source File: d:\Projects\Office Ignite\Main-1.7\src\Orchard\ContentManagement\DefaultContentQuery.cs Line: 162
I don't know why this is isolated with those two CTs. They don't have parts with custom tables or anything.
Any piece of information would be highly appreciated. Thanks!
I have same error, but it seems that problem is not related directly for my code.
I found two solutions for now:
1.) Taxonomy corruption problem https://orchard.codeplex.com/workitem/20411
2.) Static is dirty and lock which is default in select statment is heavly used https://serverfault.com/questions/419997/the-wait-operation-timed-out-when-running-sql-server-in-hyper-v