Something strange in Pig, Cloudera QuickStart - apache-pig

I do not understand why, when I run a Pig script in the editor, a workflow is created in Oozie along with three jobs (see the image below), rather than the script simply being run the way Hive runs a query.
[image]
entrada = LOAD '/user/cloudera/Divisas/Barril-WTI.csv' USING PigStorage(',') AS (Fecha:chararray, Valor:float);
entrada_sin_cabecera = FILTER entrada BY Fecha != 'Date';
orden = ORDER entrada_sin_cabecera BY Valor;
DUMP orden;
It also gives me the following error when I run: pig -f WTI.pig
Error before Pig is launched
ERROR 2997: Encountered IOException. File WTI.pig does not exist
java.io.FileNotFoundException: File WTI.pig does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:424)
at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:747)
at org.apache.pig.impl.io.FileLocalizer.fetchFile(FileLocalizer.java:688)
at org.apache.pig.Main.run(Main.java:424)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
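As for the FileNotFoundException: pig -f resolves the script path on the local filesystem, relative to the directory pig is launched from, so the likely fix is simply to run the command from the directory containing WTI.pig or to pass its full path. A hedged example, assuming the script lives in the cloudera user's home directory (a guess, not stated in the question):
# Run from the directory containing the script...
cd /home/cloudera && pig -f WTI.pig
# ...or pass the absolute path explicitly
pig -f /home/cloudera/WTI.pig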

Related

Pyspark -- java.lang.NullPointerException (when using JDBC to read from Teradata into a DataFrame)

Below is the PySpark code that loads data from the EDW (Teradata) into HDFS (Hadoop) using the JDBC driver:
from pyspark.sql import *
from pyspark.sql.types import *
# `spark` is the SparkSession provided by the PySpark shell/notebook.
# The pushdown subquery must be parenthesized and given an alias.
q = """(select columns
from TD.Table1 as a
inner join TD.Table2 as b
on a.col = b.col
where b.col = date '2017-09-30'
) foo"""
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:teradata://teradata-dns-sysa.ab.xyz.com') \
    .option('driver', 'com.teradata.jdbc.TeraDriver') \
    .option('user', 'username') \
    .option('password', '####') \
    .option('tmode', 'tera') \
    .option('dbtable', q) \
    .option('type', 'fastload') \
    .load()
df.show(20, False)
This throws the error below:
Py4JJavaError: An error occurred while calling o64.showString.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2367)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Basically, any action I try on this df, such as saving it as a Hive table or writing it out as an ORC file to HDFS, throws the same error.
I did a df.printSchema() and found that the reason may be that the dataset contains null values for (nullable = false) columns.
So I also tried to create a new DataFrame from the previous one, specifying the schema I want:
df2 = spark.createDataFrame(df.rdd, schema)
Still got the same NPE error.
Any idea how to solve this one? Thank you!
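No accepted fix appears here, but one hedged workaround, assuming the diagnosis above is right (the JDBC metadata claims NOT NULL while the join still yields nulls), is to eliminate the nulls inside the pushdown subquery itself, so Spark never materializes a null in a column whose schema says nullable = false. The column names and default values below are placeholders, not names from the original query:
q = """(select coalesce(a.col1, -1) as col1,     -- placeholder column and default
               coalesce(b.col2, 'n/a') as col2   -- placeholder column and default
from TD.Table1 as a
inner join TD.Table2 as b
on a.col = b.col
where b.col = date '2017-09-30'
) foo"""
The rest of the spark.read chain stays exactly as in the question.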

Spark parquet reading error

I am working on a Spark project. I have a file in Parquet format, and when I try to load it using Java it gives me the error below. But when I load the same file in Hive from the same path and run select * from table_name, it works fine and the data comes back properly. Please help me with this issue.
java.io.IOException: Could not read footer: java.lang.RuntimeException: corrupted file: the footer index is not within the file
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:754)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:743)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: corrupted file: the footer index is not within the file
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
You can try the options below:
1) sqlContext.read.parquet("path")
2) sqlContext.read.format(fileFormat)
     .option("header", header)           // use the first line of all files as the header
     .option("inferSchema", inferSchema) // automatically infer data types
     .load(source)
If that doesn't resolve your issue, please post a sample of your code.
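For reference, a minimal PySpark rendering of both options with the placeholder values spelled out; the paths are illustrative, and option 2 assumes the external spark-csv package, since Spark 1.x had no built-in CSV reader:
# Option 1: read Parquet directly (path is a placeholder)
df1 = sqlContext.read.parquet("hdfs:///path/to/file.parquet")
# Option 2: generic reader with explicit format and options
df2 = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("hdfs:///path/to/source.csv")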

how to remove headers using piggybank?

I have a directory which contains 10 files, and I want to remove the headers from the files in that directory. When I execute this using piggybank, I get an error. Is there any other way to remove the header from all the files present in a directory? My code is:
REGISTER /usr/lib/pig/piggybank.jar;
input = LOAD 'insurance_data' using CSVExcelStorage(
',','default','NOCHANGE','SKIP_INPUT_HEADER')
as (population:int, private:int,public:int,uninsecured:int);
dump input;
The error which I am getting is:
2016-09-13 14:01:48,239 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. mismatched input 'input' expecting EOF
Failed to parse: mismatched input 'input' expecting EOF
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:241)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:179)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1688)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1421)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:354)
at org.apache.pig.PigServer.executeBatch(PigServer.java:379)
at org.apache.pig.PigServer.executeBatch(PigServer.java:365)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:613)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
2016-09-13 14:01:48,250 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'input' expecting EOF
Details at logfile: /home/cloudera/pig_1473800504430.log
'input' is a reserved keyword in Pig (see the Pig reserved keywords reference). Change the name of the relation 'input' to something else:
REGISTER /usr/lib/pig/piggybank.jar;
-- Use the fully qualified class name so Pig can resolve the piggybank loader
A = LOAD 'insurance_data' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','default','NOCHANGE','SKIP_INPUT_HEADER') AS (population:int, private:int, public:int, uninsecured:int);
DUMP A;
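Equivalently, you can DEFINE a short alias for the loader once and keep later statements readable; this is a sketch using the same arguments as above:
REGISTER /usr/lib/pig/piggybank.jar;
-- Bind the alias CsvLoader to the piggybank loader and its constructor arguments
DEFINE CsvLoader org.apache.pig.piggybank.storage.CSVExcelStorage(',','default','NOCHANGE','SKIP_INPUT_HEADER');
A = LOAD 'insurance_data' USING CsvLoader AS (population:int, private:int, public:int, uninsecured:int);
DUMP A;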

Spark 1.4.1 - Using pyspark

I tried running the code below and I get an error.
Code
instances = sqlContext.sql("SELECT instance_id, instance_usage_code FROM ib_instances WHERE (instance_usage_code) = 'OUT_OF_ENTERPRISE'")
instances.write.format("orc").save("instances2")
hivectx.sql("""CREATE TABLE IF NOT EXISTS instances2 (instance_id string, instance_usage_code STRING)""")
hivectx.sql("LOAD DATA LOCAL INPATH '/home/hduser/instances2' into table instances2")
Error
Traceback (most recent call last):
File "/home/hduser/spark_script.py", line 57, in <module>
instances.write.format("orc").save("instances2")
File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 304, in save
File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o55.save.
: java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.hive.orc.DefaultSource.createRelation(OrcRelation.scala:54)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:322)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
My guess is that you created a standard SQLContext instead of a Hive one (which adds a couple of options). Create your sqlContext as a HiveContext instance. The Scala version is:
val sqlContext = new HiveContext(sc)
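Since the question is PySpark, the equivalent there, sketched for Spark 1.4.x where HiveContext lives in pyspark.sql, is:
from pyspark.sql import HiveContext

# HiveContext enables the Hive-backed features (such as the ORC data source)
# that a plain SQLContext lacks in Spark 1.x; sc is the existing SparkContext.
sqlContext = HiveContext(sc)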

Getting error message Unexpected character '‘' when running LOAD command in PIG

I have a file stored in HDFS at this path: /user/hdfs/countries
(the file is in comma separated format).
To import this HDFS data into Pig, I ran the command below:
test = load ‘/ user/hdfs/countries’ using PigStorage(',') as (id:int, Name:chararray, Language:chararray);
where ID is the primary key column in the HDFS file, and Name and Language are the other column names in the file.
I am getting below error when I run the above mentioned pig command:
Pig Stack Trace
ERROR 1200: <line 1, column 12> Unexpected character '‘'
Failed to parse: <line 1, column 12> Unexpected character '‘'
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:243)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:179)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1648)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1621)
at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Can someone please help me with this? Is my command incorrect, or is a jar file missing?
Thank you in advance!
It tells you exactly where the problem is: the ‘ should be replaced by ', which is not the same character.
Also, the space after the / looks fishy.
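With the quote and the stray space fixed, the load from the question becomes:
test = LOAD '/user/hdfs/countries' USING PigStorage(',') AS (id:int, Name:chararray, Language:chararray);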