java.io.IOException: Mkdirs failed to create file:/pig/deidentifiedDir/ - apache-pig

When I try to store data using the Pig command
STORE D into '/deidentifiedDir';
I get this error:
2015-05-06 20:17:14,587 [Thread-96] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0009
java.io.IOException: Mkdirs failed to create file:/pig/deidentifiedDir/_temporary/_attempt_local_0009_m_000000_0
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:366)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat.getRecordWriter(PigTextOutputFormat.java:98)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:83)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:488)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:610)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
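The file:/ scheme in the trace suggests the output path is resolving against the local filesystem, and Mkdirs typically fails there because the user running the job cannot create the directory. As a hedged sketch only (the output path and script name below are placeholders, and -x mapreduce is just one way to force the path to resolve against the cluster):
-- store to a location the current user can actually create, e.g. under the user's own directory (placeholder path):
STORE D INTO '/user/yourname/deidentifiedDir';
-- or run the script against the cluster instead of the local filesystem:
--   pig -x mapreduce deidentify.pig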

Related

Converting a DataFrame to CSV throws an error in PySpark

I have a huge DataFrame, around 7 GB of records.
I am trying to get the count of the DataFrame and download it as a CSV.
Both operations result in the error below.
Is there any other way to download the DataFrame without multiple partitions?
print(df.count())
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')
Error:
java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======> (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

ERROR 1066: Unable to open iterator for alias

Command run (trying to get the maximum run scored):
Run_M = foreach Run_Group_All generate (Match.Player, Match.Run) , MAX(Match.Run);
As per the log, the GROUP command is failing. Can anybody help with where the problem is?
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:556)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
at org.apache.pig.builtin.AlgebraicLongMathBase.exec(AlgebraicLongMathBase.java:93)
at org.apache.pig.builtin.AlgebraicLongMathBase.exec(AlgebraicLongMathBase.java:37)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:326)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextLong(POUserFunc.java:410)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:351)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:400)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:317)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:474)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:442)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:422)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:269)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Number
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
... 20 more
2017-09-03 07:48:03,212 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-09-03 07:48:03,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local1294624349_0011 has failed! Stop running all dependent jobs
2017-09-03 07:48:03,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-09-03 07:48:03,213 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2017-09-03 07:48:03,214 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2017-09-03 07:48:03,214 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-09-03 07:48:03,215 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.8.1 0.15.0 goldi 2017-09-03 07:48:01 2017-09-03 07:48:03 GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_local1294624349_0011 Cric,Match,Run_Group_All,Run_M GROUP_BY Message: Job failed! file:/tmp/temp-1949037811/tmp1601097545,
Input(s):
Failed to read data from "/home/goldi/Batting.csv"
Output(s):
Failed to produce result in "file:/tmp/temp-1949037811/tmp1601097545"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local1294624349_0011
2017-09-03 07:48:03,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-09-03 07:48:03,218 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Run_M
Details at logfile: /home/goldi/pig_1504365116860.log
Replace '(Match.Player, Match.Run)' with 'group'.
Run_M = foreach Run_Group_All generate FLATTEN(group) as (player,run) , MAX(Match.Run);
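For context, the underlying "DataByteArray cannot be cast to java.lang.Number" cause usually means the Run column was loaded without a numeric type, so MAX cannot operate on it. A hedged end-to-end sketch; the load schema, the comma delimiter, and the grouping key below are assumptions, not taken from the original script:
-- assumed schema: declare Run as a numeric type so MAX gets longs instead of bytearrays
Match = LOAD '/home/goldi/Batting.csv' USING PigStorage(',') AS (Player:chararray, Run:long);
Run_Group_All = GROUP Match BY (Player, Run);
Run_M = FOREACH Run_Group_All GENERATE FLATTEN(group) AS (player, run), MAX(Match.Run);
DUMP Run_M;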

Unexpected error running Liquibase: ERROR: relation "databasechangelog" does not exist

When running a Liquibase (version 3.5.3) deployment against a new PostgreSQL database, we get the error below. The databasechangelog table was not created by Liquibase, but the databasechangeloglock table was.
INFO 2/7/17 1:27 PM: liquibase: Successfully acquired change log lock
INFO 2/7/17 1:27 PM: liquibase: Successfully released change log lock
Unexpected error running Liquibase: ERROR: relation "audit.databasechangelog" does not exist
Position: 20
SEVERE 2/7/17 1:27 PM: liquibase: ERROR: relation "audit.databasechangelog" does not exist
Position: 20
liquibase.exception.DatabaseException: Error executing SQL SELECT MD5SUM FROM audit.databasechangelog WHERE MD5SUM IS NOT NULL LIMIT 1: ERROR: relation "audit.databasechangelog" does not exist
Position: 20
at liquibase.executor.jvm.JdbcExecutor.execute(JdbcExecutor.java:68)
at liquibase.executor.jvm.JdbcExecutor.query(JdbcExecutor.java:126)
at liquibase.executor.jvm.JdbcExecutor.query(JdbcExecutor.java:134)
at liquibase.executor.jvm.JdbcExecutor.queryForList(JdbcExecutor.java:200)
at liquibase.executor.jvm.JdbcExecutor.queryForList(JdbcExecutor.java:194)
at liquibase.changelog.StandardChangeLogHistoryService.init(StandardChangeLogHistoryService.java:212)
at liquibase.Liquibase.checkLiquibaseTables(Liquibase.java:1124)
at liquibase.Liquibase.update(Liquibase.java:205)
at liquibase.Liquibase.update(Liquibase.java:192)
at liquibase.integration.commandline.Main.doMigration(Main.java:1130)
at liquibase.integration.commandline.Main.run(Main.java:188)
at liquibase.integration.commandline.Main.main(Main.java:103)
Caused by: org.postgresql.util.PSQLException: ERROR: relation "audit.databasechangelog" does not exist
Position: 20
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2455)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2155)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:288)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:430)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:356)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:303)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:289)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:266)
at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:233)
at liquibase.executor.jvm.JdbcExecutor$QueryStatementCallback.doInStatement(JdbcExecutor.java:345)
at liquibase.executor.jvm.JdbcExecutor.execute(JdbcExecutor.java:55)
... 11 more
There are two schemas, ods and audit, and the search_path is ods, audit, public. We specify the target schema in the connection string (currentSchema=audit). Additionally, we ran successfully against the ods schema.
As a workaround, we can manually create the log table. However, I am wondering whether this is a bug in Liquibase or whether we are doing something wrong. My thought is that Liquibase is somehow seeing ods.databasechangelog and skipping creating it.
Any thoughts would be appreciated.
Maybe try using the following Liquibase parameter:
--defaultSchemaName=<schema>
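For example, a hedged command-line sketch; the connection details, credentials, and changelog file name below are placeholders, not values from this setup:
liquibase --url="jdbc:postgresql://dbhost:5432/mydb?currentSchema=audit" --username=myuser --password=mypass --changeLogFile=changelog.xml --defaultSchemaName=audit update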

Reading Multiple Orc Files in Pig

I am trying to read/load multiple ORC files present in a directory using Pig's OrcStorage(). I tried to use the glob technique, but that did not work for me and threw an error saying the file does not exist, whereas it is available. Please let me know how I can implement this functionality in Pig.
Sample Files Used:
hadoop fs -ls /sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
Found 2 items
-rw-r--r-- 3 as303e hdfs 302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000000_0
-rw-r--r-- 3 as303e hdfs 302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000001_0
Found 2 items
-rw-r--r-- 3 as303e ksndbx28 302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000000_0
-rw-r--r-- 3 as303e ksndbx28 302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000001_0
Code Used:
A = load '/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}' Using OrcStorage();
B= limit A 2;
DUMP B;
Error log:
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Store(hdfs://localhost:8020/tmp/temp666047359/tmp808921130:org.apache.pig.impl.io.InterStorage) - scope-5 Operator Key: scope-5): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:161)
at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:278)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
... 15 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNextTuple(POLimit.java:122)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
... 22 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:131)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
... 24 more
Caused by: org.apache.hadoop.mapred.InvalidInputException: File does not exist: hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:961)
at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:146)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:99)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:127)
... 25 more
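A hedged workaround sketch, not taken from this thread: load each partition directory explicitly and UNION the results, which avoids the {2015111900,2015111901} glob that OrcInputFormat does not appear to expand. The paths are reused from the listing above:
A1 = LOAD '/sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900' USING OrcStorage();
A2 = LOAD '/sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901' USING OrcStorage();
A = UNION A1, A2;
B = LIMIT A 2;
DUMP B;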

Avro : java.lang.RuntimeException: Unsupported type in record

Input: test.csv
100
101
102
Pig Script :
REGISTER required jars are registered;
A = LOAD 'test.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (code:chararray);
STORE A INTO 'test' USING org.apache.pig.piggybank.storage.avro.AvroStorage
('schema',
'{"namespace":"com.pig.test.avro","type":"record","name":"Avro_Test","doc":"Avro Test Schema",
"fields":[
{"name":"code","type":["string","null"],"default":null}
]}'
);
I am getting a runtime error during the STORE. Any inputs on resolving this?
Error Log:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Unsupported type in record:class java.lang.String
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:722)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMap
2015-06-02 23:06:03,934 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-06-02 23:06:03,934 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
Looks like this is a bug: https://issues.apache.org/jira/browse/PIG-3358
If you can, try to update to Pig 0.14; according to the comments, this has been fixed there.
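If you do upgrade, here is a hedged sketch of the same store using the AvroStorage that ships as a Pig builtin from 0.14 onward; whether the builtin accepts a full JSON schema as its first argument in exactly this form is an assumption, so treat it as a starting point rather than a verified fix:
A = LOAD 'test.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (code:chararray);
-- assumption: the builtin AvroStorage takes the JSON schema directly as its first argument
STORE A INTO 'test' USING org.apache.pig.builtin.AvroStorage('
{"namespace":"com.pig.test.avro","type":"record","name":"Avro_Test","doc":"Avro Test Schema",
"fields":[{"name":"code","type":["string","null"],"default":null}]}
');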