Nutch 1.8 and Apache Solr 4.8 integration job fails

I am trying to crawl the web using Nutch 1.8 and Solr 4.8 on Windows 7.
bin/crawl urls newsolr http://localhost:8983/solr/ 1 -depth 1
I keep getting the following error:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Here is the relevant part of the log file:
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: content dest: content
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: title dest: title
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: host dest: host
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: segment dest: segment
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: boost dest: boost
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: digest dest: digest
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: url dest: id
2014-07-01 16:58:33,613 INFO solr.SolrMappingReader - source: url dest: url
2014-07-01 16:58:33,643 INFO solr.SolrIndexWriter - Indexing 1 documents
2014-07-01 16:58:33,773 WARN mapred.LocalJobRunner - job_local_0001
org.apache.solr.common.SolrException: Method Not Allowed
Method Not Allowed
request: http://localhost:8983/solr/
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2014-07-01 16:58:34,628 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Finally, Solr's error log:
org.apache.solr.common.SolrException: ERROR: [doc=http://.com/] unknown field 'tstamp'
This is my first solr/nutch setup.

Just stop the Solr instance and start it again; that should solve your problem.
The error occurs because you changed the schema file but didn't restart Solr, so the changes were never loaded and Solr is unable to "see" the newly added fields.
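If restarting alone doesn't help, make sure the fields that Nutch maps (tstamp, segment, digest, boost, ...) are actually declared in Solr's schema.xml before the restart. A minimal sketch of the kind of declarations the Nutch indexer expects; the types and attributes here are illustrative, so compare them against the schema.xml shipped with Nutch:
<!-- illustrative field declarations for Solr's schema.xml; adjust types/attributes to your setup -->
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
After adding any missing fields, restart Solr so the schema is reloaded.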

Related

Problem with filebeat yml file on Windows

I am quite new to the Elastic Stack and am trying to experiment with visualizing Apache log files in Kibana. I am using Filebeat to ingest the Apache logs. However, when I run .\filebeat.exe setup -e, I get the following error:
2019-02-05T20:53:10.515+0530 INFO elasticsearch/client.go:165 Elasticsearch url: http://localhost:9200
2019-02-05T20:53:10.520+0530 INFO elasticsearch/client.go:721 Connected to Elasticsearch version 6.6.0
2019-02-05T20:53:10.520+0530 INFO kibana/client.go:118 Kibana url: http://localhost:5601
2019-02-05T20:53:10.567+0530 WARN fileset/modules.go:388 X-Pack Machine Learning is not enabled
2019-02-05T20:53:10.572+0530 ERROR instance/beat.go:911 Exiting: 1 error: error loading config file: invalid config: yaml: line 4: did not find expected hexdecimal number
My filebeat.yml file looks like this:
filebeat.inputs:
- type: log
  enabled: true
  paths: C:\Users\bigdataadmin\Downloads\ApacheLogs\*
#============================= Filebeat modules ===============================
filebeat.config.modules:
  path: C:\Program Files\Filebeat\modules.d\*.yml
  reload.enabled: true
  reload.period: 60s
#==================== Elasticsearch template setting ==========================
setup.template.settings:
  index.number_of_shards: 3
setup.kibana:
  host: "localhost:5601"
output.elasticsearch:
  hosts: ["localhost:9200"]
# Configure processors to enhance or manipulate events generated by the beat.
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
I also checked the yml on http://www.yamllint.com/ but didn't find any problems. I can't seem to figure out what's wrong with line 4 of this file.
I am using Filebeat 6.6.
The paths key (on line 4) is an array, so you need to represent it as an array there.
Example:
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - C:\Users\bigdataadmin\Downloads\ApacheLogs\*
Please be very careful about the data types you use in such config files. I made the same mistake while working on Filebeat and had to spend a lot of time on a small error...
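As an extra sanity check, Filebeat can validate the configuration file itself before you run setup; a quick sketch, assuming the Windows install layout from the question:
# run from the Filebeat install directory (PowerShell); the config path is illustrative
.\filebeat.exe test config -c .\filebeat.yml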

Ceph S3 / Swift bucket create failed / error 416

I am getting 416 errors while creating buckets using S3 or Swift. How can I solve this?
swift -A http://ceph-4:7480/auth/1.0 -U testuser:swift -K 'BKtVrq1...' upload testas testas
Warning: failed to create container 'testas': 416 Requested Range Not Satisfiable: InvalidRange
Object PUT failed: http://ceph-4:7480/swift/v1/testas/testas 404 Not Found b'NoSuchBucket'
Also, an S3 Python test:
File "/usr/lib/python2.7/dist-packages/boto/s3/connection.py", line 621, in create_bucket
response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 416 Requested Range Not Satisfiable
<?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRange</Code><BucketName>mybucket</BucketName><RequestId>tx00000000000000000002a-005a69b12d-1195-default</RequestId><HostId>1195-default-default</HostId></Error>
Here is my ceph status:
cluster:
    id: 1e4bd42a-7032-4f70-8d0c-d6417da85aa6
    health: HEALTH_OK
services:
    mon: 3 daemons, quorum ceph-2,ceph-3,ceph-4
    mgr: ceph-1(active), standbys: ceph-2, ceph-3, ceph-4
    osd: 3 osds: 3 up, 3 in
    rgw: 2 daemons active
data:
    pools: 7 pools, 296 pgs
    objects: 333 objects, 373 MB
    usage: 4398 MB used, 26309 MB / 30708 MB avail
    pgs: 296 active+clean
I am using a Ceph Luminous build with BlueStore:
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
User created:
sudo radosgw-admin user create --uid="testuser" --display-name="First User"
sudo radosgw-admin subuser create --uid=testuser --subuser=testuser:swift --access=full
sudo radosgw-admin key create --subuser=testuser:swift --key-type=swift --gen-secret
Logs on osd:
2018-01-25 12:19:45.383298 7f03c77c4700 1 ====== starting new request req=0x7f03c77be1f0 =====
2018-01-25 12:19:47.711677 7f03c77c4700 1 ====== req done req=0x7f03c77be1f0 op status=-34 http_status=416 ======
2018-01-25 12:19:47.711937 7f03c77c4700 1 civetweb: 0x55bd9631d000: 192.168.109.47 - - [25/Jan/2018:12:19:45 +0200] "PUT /mybucket/ HTTP/1.1" 1 0 - Boto/2.38.0 Python/2.7.12 Linux/4.4.0-51-generic
Linux ubuntu, 4.4.0-51-generic
Set the default pg_num and pgp_num to a lower value (8, for example), or set mon_max_pg_per_osd to a higher value in ceph.conf.
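A minimal ceph.conf sketch of both options; the values are illustrative, and the monitors/RGW need to be restarted (or new pools created) for them to take effect:
[global]
# raise the per-OSD placement group limit checked at pool/bucket creation
mon_max_pg_per_osd = 300
# or create new pools with fewer PGs so a small cluster stays under the limit
osd_pool_default_pg_num = 8
osd_pool_default_pgp_num = 8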

Flink 1.4.0 NoClassDefFoundError ... S3ErrorResponseHandler

I'm working on setting up a local test of Flink 1.4.0 that writes to S3, and I'm getting the following error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.internal.S3ErrorResponseHandler
at org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:363)
at org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:542)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem.createAmazonS3Client(PrestoS3FileSystem.java:639)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem.initialize(PrestoS3FileSystem.java:212)
at org.apache.flink.fs.s3presto.S3FileSystemFactory.create(S3FileSystemFactory.java:132)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:397)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:320)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.<init>(FsCheckpointStreamFactory.java:99)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createStreamFactory(FsStateBackend.java:277)
at org.apache.flink.streaming.runtime.tasks.StreamTask.createCheckpointStreamFactory(StreamTask.java:787)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:247)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Following the documentation here, I added flink-s3-fs-presto-1.4.0.jar from opt/ to lib/, so I'm not exactly sure why I'm getting this error. Any help would be appreciated; let me know if I can add additional information.
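For reference, the copy step amounts to something like this (paths relative to the Flink distribution root, shown only to make the setup explicit):
# from the flink-1.4.0 directory
cp opt/flink-s3-fs-presto-1.4.0.jar lib/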
Here is some more information about my system and process:
I start the local job manager:
[flink-1.4.0] ./bin/start-local.sh
Warning: this file is deprecated and will be removed in 1.5.
Starting cluster.
Starting jobmanager daemon on host MBP0535.local.
Starting taskmanager daemon on host MBP0535.local.
OS information:
[flink-1.4.0] system_profiler SPSoftwareDataType
Software:
System Software Overview:
System Version: macOS 10.13.2 (17C205)
Kernel Version: Darwin 17.3.0
Boot Volume: Macintosh HD
Then I try to run the jar:
[flink-1.4.0] ./bin/flink run streaming.jar
I'm actually having trouble reproducing the error. Here is the task manager log:
2018-01-18 10:17:07,668 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-01-18 10:17:07,668 INFO org.apache.flink.runtime.taskmanager.TaskManager - OS current user: k
2018-01-18 10:17:08,002 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current Hadoop/Kerberos user: k
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.152-b16
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 1024 MiBytes
2018-01-18 10:17:08,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.8.1
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms1024M
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx1024M
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/Users/k/flink-1.4.0/log/flink-k-taskmanager-0-MBP0535.local.log
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/Users/k/flink-1.4.0/conf/log4j.properties
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/Users/k/flink-1.4.0/conf/logback.xml
2018-01-18 10:17:08,087 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
2018-01-18 10:17:08,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
2018-01-18 10:17:08,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - /Users/k/flink-1.4.0/conf
2018-01-18 10:17:08,088 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /Users/k/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/Users/k/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/Users/k/flink-1.4.0/lib/flink-shaded-hadoop2-uber-1.4.0.jar:/Users/k/flink-1.4.0/lib/log4j-1.2.17.jar:/Users/k/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/Users/k/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
2018-01-18 10:17:08,089 INFO org.apache.flink.runtime.taskmanager.TaskManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-01-18 10:17:08,094 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 10240
2018-01-18 10:17:08,117 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /Users/k/flink-1.4.0/conf
2018-01-18 10:17:08,119 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: classloader.resolve-order, parent-first
2018-01-18 10:17:08,119 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: classloader.parent-first-patterns, java.;org.apache.flink.;javax.annotation;org.slf4j;org.apache.log4j;org.apache.logging.log4j;ch.qos.logback;com.mapr.;org.apache.
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: s3.access-key, XXXXXXXXXXXXXXXXXXXX
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: s3.secret-key, YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2018-01-18 10:17:08,120 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-01-18 10:17:08,121 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2018-01-18 10:17:08,121 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-01-18 10:17:08,121 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.port, 8082
2018-01-18 10:17:08,199 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to k (auth:SIMPLE)
2018-01-18 10:17:08,289 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-01-18 10:17:08,289 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-01-18 10:17:08,291 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address localhost/127.0.0.1:6123.
2018-01-18 10:17:08,472 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'MBP0535.local' (10.1.11.139) for communication.
2018-01-18 10:17:08,482 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2018-01-18 10:17:08,482 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at MBP0535.local:54024.
2018-01-18 10:17:08,484 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to start actor system at mbp0535.local:54024
2018-01-18 10:17:08,898 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-01-18 10:17:08,960 INFO akka.remote.Remoting - Starting remoting
2018-01-18 10:17:09,087 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@mbp0535.local:54024]
2018-01-18 10:17:09,097 INFO org.apache.flink.runtime.taskmanager.TaskManager - Actor system started at akka.tcp://flink@mbp0535.local:54024
2018-01-18 10:17:09,105 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-01-18 10:17:09,111 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
2018-01-18 10:17:09,115 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: MBP0535.local/10.1.11.139, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2018-01-18 10:17:09,118 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2018-01-18 10:17:09,122 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T': total 465 GB, usable 333 GB (71.61% usable)
2018-01-18 10:17:09,236 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 101 MB for network buffer pool (number of memory segments: 3255, bytes per segment: 32768).
2018-01-18 10:17:09,323 WARN org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. Please put the corresponding jar from the opt to the lib folder.
2018-01-18 10:17:09,324 WARN org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. Please put the corresponding jar from the opt to the lib folder.
2018-01-18 10:17:09,324 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2018-01-18 10:17:09,353 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 23 ms).
2018-01-18 10:17:09,378 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 25 ms). Listening on SocketAddress /10.1.11.139:54026.
2018-01-18 10:17:09,381 WARN org.apache.flink.runtime.taskmanager.TaskManagerLocation - No hostname could be resolved for the IP address 10.1.11.139, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2018-01-18 10:17:09,431 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2018-01-18 10:17:09,437 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/flink-io-186cf8c8-5a0d-44cc-9d78-e81c943b0b9f for spill files.
2018-01-18 10:17:09,439 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/flink-dist-cache-a9a568cd-c7cd-45c6-abbe-08912d051583
2018-01-18 10:17:09,509 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/flink-dist-cache-bd3cc98c-cebb-4569-98d3-5357393d8c5b
2018-01-18 10:17:09,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1044592356.
2018-01-18 10:17:09,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: 97b3a934f84ba25e20aae8a91a40e336 @ 10.1.11.139 (dataPort=54026)
2018-01-18 10:17:09,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
2018-01-18 10:17:09,518 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 112/1024/1024 MB, NON HEAP: 35/36/-1 MB (used/committed/max)]
2018-01-18 10:17:09,522 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@localhost:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2018-01-18 10:17:09,692 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@localhost:6123/user/jobmanager), starting network stack and library cache.
2018-01-18 10:17:09,696 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be localhost/127.0.0.1:54025. Starting BLOB cache.
2018-01-18 10:17:09,699 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/blobStore-77287aab-5128-4363-842c-1a124114fd91
2018-01-18 10:17:09,702 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /var/folders/sw/jcdfbbc15td51f3635hvt77w0000gp/T/blobStore-c9f62e97-bf53-4fc4-9e4a-1958706e78ec
2018-01-18 10:26:25,993 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task Source: Kafka -> Sink: S3 (1/1)
2018-01-18 10:26:25,993 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) switched from CREATED to DEPLOYING.
2018-01-18 10:26:25,994 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) [DEPLOYING]
2018-01-18 10:26:25,996 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) [DEPLOYING].
2018-01-18 10:26:25,998 INFO org.apache.flink.runtime.blob.BlobClient - Downloading 34e7c81bd4a0050e7809a1343af0c7cb/p-4eaec529eb247f30ef2d3ddc2308e029e625de33-93fe90509266a50ffadce2131cedc514 from localhost/127.0.0.1:54025
2018-01-18 10:26:26,238 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) [DEPLOYING].
2018-01-18 10:26:26,240 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) switched from DEPLOYING to RUNNING.
2018-01-18 10:26:26,249 INFO org.apache.flink.streaming.runtime.tasks.StreamTask - Using user-defined state backend: File State Backend @ s3://stream-data/checkpoints.
2018-01-18 10:26:26,522 INFO org.apache.flink.fs.s3hadoop.shaded.org.apache.hadoop.util.NativeCodeLoader - Skipping native-hadoop library for flink-s3-fs-hadoop's relocated Hadoop... using builtin-java classes where applicable
2018-01-18 10:26:29,041 ERROR org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink - Error while creating FileSystem when initializing the state of the BucketingSink.
java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.createHadoopFileSystem(BucketingSink.java:1196)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:411)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:355)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
2018-01-18 10:26:29,048 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kafka -> Sink: S3 (1/1) (95b54853308d69fbb84ee308508bf397) switched from RUNNING to FAILED.
java.lang.RuntimeException: Error while creating FileSystem when initializing the state of the BucketingSink.
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:358)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.createHadoopFileSystem(BucketingSink.java:1196)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initFileSystem(BucketingSink.java:411)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.initializeState(BucketingSink.java:355)
... 9 more

Unable to open iterator for alias <alias_name>

I know this is one of the most repeated questions. I have looked almost everywhere, and none of the resources could resolve the issue I am facing.
Below is a simplified version of my problem statement, but the actual data is a little more complex, so I have to use a UDF.
My input file (input.txt):
NotNeeded1,NotNeeded11;Needed1
NotNeeded2,NotNeeded22;Needed2
I want the output to be
Needed1
Needed2
So, I am writing the UDF below (Java code):
package com.company.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class myudf extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        String s = (String)input.get(0);
        String str = s.split("\\,")[1];
        String str1 = str.split("\\;")[1];
        return str1;
    }
}
And packaging it into
rollupreg_extract-jar-with-dependencies.jar
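(The -jar-with-dependencies suffix is the Maven Assembly Plugin's naming convention; assuming that plugin is configured in the project's pom.xml with the jar-with-dependencies descriptor, the jar would be built with something like the following, shown only for context.)
# hypothetical build step; assumes maven-assembly-plugin is already configured in pom.xml
mvn clean package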
Below is my Pig shell code:
grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
grunt> DEFINE myudf com.company.pig.myudf;
grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(',');
grunt> extract = FOREACH data GENERATE myudf($1);
grunt> DUMP extract;
And I get the below error:
2017-05-15 15:58:15,493 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-05-15 15:58:15,577 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-05-15 15:58:15,659 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-05-15 15:58:15,774 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2017-05-15 15:58:15,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-05-15 15:58:15,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-05-15 15:58:16,184 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:16,196 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:16,396 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:16,576 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-05-15 15:58:16,580 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:16,584 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-05-15 15:58:16,588 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2017-05-15 15:58:17,258 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar
2017-05-15 15:58:17,276 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2017-05-15 15:58:17,294 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2017-05-15 15:58:17,295 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2017-05-15 15:58:17,354 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2017-05-15 15:58:17,510 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:17,511 [JobControl] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:17,753 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-05-15 15:58:17,820 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-05-15 15:58:17,830 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-05-15 15:58:17,884 [JobControl] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2017-05-15 15:58:17,889 [JobControl] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
2017-05-15 15:58:17,922 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2017-05-15 15:58:18,525 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2017-05-15 15:58:18,692 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023
2017-05-15 15:58:18,879 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2017-05-15 15:58:18,973 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023
2017-05-15 15:58:19,029 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract
2017-05-15 15:58:19,030 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C: R:
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2017-05-15 15:58:19,044 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023]
2017-05-15 15:58:29,156 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-05-15 15:58:29,156 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs
2017-05-15 15:58:29,157 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-05-15 15:58:29,790 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:29,791 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:29,793 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,311 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:30,312 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:30,313 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-05-15 15:58:30,467 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.3.2.5.0.0-1245 root 2017-05-15 15:58:16 2017-05-15 15:58:30 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1494853652295_0023 data,extract MAP_ONLY Message: Job failed! hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225,
Input(s):
Failed to read data from "/pig_hdfs/input.txt"
Output(s):
Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1494853652295_0023
2017-05-15 15:58:30,472 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract
Details at logfile: /pig/pig_1494863836458.log
I know it complains that
Failed to read data from "/pig_hdfs/input.txt"
But I am sure this is not the actual issue; if I don't use the UDF and directly dump the data, I get the output. So this is not the issue.
First, you do not need a UDF to get the desired output. You can use the semicolon as the delimiter in the LOAD statement and get the needed column.
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';');
extract = FOREACH data GENERATE $1;
DUMP extract;
If you insist on using a UDF, then you will have to load the record into a single field and then apply the UDF. Also, your UDF is incorrect: you should split the string s with ';' as the delimiter, since the whole record is passed in from the Pig script.
String s = (String)input.get(0);
String str1 = s.split("\\;")[1];
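Putting those two fixes together, a sketch of the corrected UDF (package and class names kept from the question) could look like this:
package com.company.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class myudf extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        // the whole record, e.g. "NotNeeded1,NotNeeded11;Needed1", arrives as a single field
        String s = (String) input.get(0);
        // keep the part after the ';'
        return s.split("\\;")[1];
    }
}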
And in your Pig script, you need to load the entire record into one field and apply the UDF to field $0.
REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
DEFINE myudf com.company.pig.myudf;
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
extract = FOREACH data GENERATE myudf($0);
DUMP extract;

ERROR 1066: Unable to open iterator for alias

I am getting the error "Unable to open iterator for alias" for a simple LOAD and DUMP operation with Pig. I already took a look at the answers at:
ERROR 1066: Unable to open iterator for alias - Pig
But they don't help me.
My environment:
OS: Windows 7
Pig version: 0.13.0
Mode: Local
The error shows that the exception is 'Caused by' a failure to change the permissions of a file in the TMP directory. But when I checked the TMP directory, there was no such file (probably it got deleted after the command finished?).
Logs below (with -v and -w options) :
'D:\H\HADOOP-2.6.0\bin\hadoop-config.cmd' is not recognized as an internal or external command,
operable program or batch file.
15/01/24 09:20:22 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/01/24 09:20:22 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-01-24 09:20:22,909 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:29:34
2015-01-24 09:20:22,909 [main] INFO org.apache.pig.Main - Logging error messages to: d:\Pig\pig-0.13.0\bin\
2015-01-24 09:20:24,267 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file C:\Users\Venkat/.pigbootup not found
2015-01-24 09:20:24,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2015-01-24 09:20:26,205 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-01-24 09:20:26,236 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias a
2015-01-24 09:20:26,236 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias a
at org.apache.pig.PigServer.openIterator(PigServer.java:912)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:752)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:479)
at org.apache.pig.Main.main(Main.java:156)
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 0: java.io.IOException: Failed to set permissions of path: \tmp\temp946561981 to 0700
at org.apache.pig.impl.io.FileLocalizer.relativeRoot(FileLocalizer.java:484)
at org.apache.pig.impl.io.FileLocalizer.getTemporaryPath(FileLocalizer.java:515)
at org.apache.pig.impl.io.FileLocalizer.getTemporaryPath(FileLocalizer.java:511)
at org.apache.pig.PigServer.openIterator(PigServer.java:887)
... 7 more
Caused by: java.io.IOException: Failed to set permissions of path: \tmp\temp946561981 to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:286)
at org.apache.pig.backend.hadoop.datastorage.HPath.setPermission(HPath.java:122)
at org.apache.pig.impl.io.FileLocalizer.createRelativeRoot(FileLocalizer.java:495)
at org.apache.pig.impl.io.FileLocalizer.relativeRoot(FileLocalizer.java:481)
... 10 more
Details also at logfile: D:\Pig\pig-0.13.0\bin\data-1.txt1422071424329.log
Pig Stack Trace
ERROR 1066: Unable to open iterator for alias a
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias a
at org.apache.pig.PigServer.openIterator(PigServer.java:912)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:752)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:479)
at org.apache.pig.Main.main(Main.java:156)
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 0: java.io.IOException: Failed to set permissions of path: \tmp\temp946561981 to 0700
at org.apache.pig.impl.io.FileLocalizer.relativeRoot(FileLocalizer.java:484)
at org.apache.pig.impl.io.FileLocalizer.getTemporaryPath(FileLocalizer.java:515)
at org.apache.pig.impl.io.FileLocalizer.getTemporaryPath(FileLocalizer.java:511)
at org.apache.pig.PigServer.openIterator(PigServer.java:887)
... 7 more
Caused by: java.io.IOException: Failed to set permissions of path: \tmp\temp946561981 to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:286)
at org.apache.pig.backend.hadoop.datastorage.HPath.setPermission(HPath.java:122)
at org.apache.pig.impl.io.FileLocalizer.createRelativeRoot(FileLocalizer.java:495)
at org.apache.pig.impl.io.FileLocalizer.relativeRoot(FileLocalizer.java:481)
... 10 more
Pig file contents:
a = LOAD 's.csv' AS (NAME:chararray,COUNTRY:chararray,YEAR:int,SPORT:chararray,GOLD:int,SILVER:int,BRONZE:int,TOTAL:int);
DUMP a;
Contents of s.csv:
Yang Yilin China 2008 Gymnastics 1 0 2 3
Leisel Jones Australia 2000 Swimming 0 2 0 2
Is there anything wrong in the syntax of the LOAD statement?
Are there any environment variables that need to be set specifically, other than JAVA and JAVA_HOME?
First check the ownership of your Hadoop directory; if it is owned by root, change it to your user:
$ sudo chown -R testuser:testuser /(path of hadoop folder)
The permission problem should then be solved. I hope this works.
Generally we get this error, "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias", because of an incorrect path or because the file path is not accessible.
Check that the file location is correct.
Check that you are able to access the file.
I got this issue when the Hadoop NameNode's safe mode was on.
So I ran the command below to turn safe mode off:
sudo -u hdfs hadoop dfsadmin -safemode leave
Then it worked for me.
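If you want to confirm the NameNode's state first (and again afterwards), you can query safe mode directly:
# reports whether the NameNode is currently in safe mode
sudo -u hdfs hadoop dfsadmin -safemode get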