Load file delimited by double colon (::) in Pig

The following is a sample dataset delimited by a double colon (::).
1::Toy Story (1995)::Animation|Children's|Comedy
I want to extract three fields from the above dataset: movieID, title, and genre. I have written the following code for that:
movies = LOAD 'location/of/dataset/on/hdfs'
using PigStorage('::')
as
(MovieID:int,title:chararray,genre:chararray);
But I am getting the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file script.pig, line 1, column 9> pig script failed to validate:
java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[::]'

PigStorage accepts only a single-character delimiter, so it cannot split on '::'. Use MyRegExLoader instead; you will need piggybank.jar for this.
REGISTER '/path/to/piggybank.jar';
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('([^\\:]+)::([^\\:]+)::([^\\:]+)')
as (movieid:int, title:chararray, genre:chararray);
Output of DUMP A:
(1,Toy Story (1995),Animation|Children's|Comedy)
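If adding piggybank.jar is not an option, the same split can probably be done with built-ins only. The following is a minimal sketch, not run against the asker's data: it assumes the same '/path/to/dataset' placeholder, reads each record as one line with TextLoader, and splits it with STRSPLIT. The relation names raw, parts, and movies are illustrative.

```pig
-- Read every record as a single chararray line.
raw = LOAD '/path/to/dataset' USING TextLoader() AS (line:chararray);

-- STRSPLIT takes a regular expression and an optional limit;
-- a limit of 3 yields at most three fields per line.
parts = FOREACH raw GENERATE
            FLATTEN(STRSPLIT(line, '::', 3))
            AS (movieid:chararray, title:chararray, genre:chararray);

-- Cast the id to int to match the schema used in the question.
movies = FOREACH parts GENERATE (int)movieid AS movieid, title, genre;
DUMP movies;
```

The limit matters only if a title or genre could itself contain '::'; for the sample row it makes no difference.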

Related

Why does dbt run in the CLI but throw an error in the Cloud UI for the exact same model?

I am executing dbt run -s model_name on the CLI and the task completes successfully. However, when I run the exact same command on dbt Cloud, I get this error:
Syntax or semantic analysis error thrown in server while executing query.
Error message from server: org.apache.hive.service.cli.HiveSQLException:
Error running query: org.apache.spark.sql.AnalysisException: cannot
resolve '`pv.meta.uuid`' given input columns: []; line 6 pos 4;
\n'Project ['pv.meta.uuid AS page_view_uuid#249595,
'pv.user.visitorCookieId AS (80) (SQLExecDirectW)")
It looks like it fails to recognize the 'pv.meta.uuid' syntax, which extracts data from a JSON format. It is not clear to me what is going on. Any thoughts? Thank you!

Error during parsing. Encountered " <IDENTIFIER> "Error "" at line 1, column 2

I have a simple Pig script used to import a data file.
My data file is located in: /home/fs188
It's a CSV file that contains data such as:
011958029,00000024,,1,20100209,1
011951228,00000036,,1,20100209,1
011964431,00000814,,1,20100227,1
003526500,00000863,,1,20080122,1
011950864,00001478,,1,20100209,1
011999168,00002495,X0,1,20100331,0
001684881,00002641,,1,19861126,1
001677981,00003165,,1,19861119,1
001677457,00003311,,1,19870114,1
001677161,00003440,,1,19870116,1
002594705,00003475,,1,19870122,1
011958074,00004327,,1,20100210,1
I just want to execute my Pig script, named PigScript, and test it in local mode.
It contains this code:
ENEE_ENR_FILTER = LOAD '/home/fs188/DataExempleUdf.csv' USING PigStorage(',') AS (idt_gcp:chararray,idt_ent_pse:chararray,cd_not:chararray,idc_pse_pci:chararray,da_pram_ett:chararray,idc_cd_not:chararray);
DUMP ENEE_ENR_FILTER;
So I call my script:
pig -x local PigScript.pig
I get this error:
2019-08-07 12:03:14,277 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "-x "" at line 1, column 2.
This is weird because I don't have any syntax error.

Issue with complex data type processing in Pig with comma-delimited data

I have the data like this:
$ cat samp.txt
Ramesh,[city#Bangalore],123
Arun,[city#Anantapur],345
Pranith,[city#US],456
I have written the following Pig query:
A = load 'samp.txt' using PigStorage(',')
as(name:chararray,addr:map[chararray,chararray],empno:int);
When I execute the above code in Pig, I get the following error:
error: mismatched input ',' expecting RIGHT_BRACKET Details at logfile: /home/training/pig_1471586597209.log
Can anyone help me resolve this error?
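Declare the map as map[] instead. A Pig map schema takes at most one type, the value type (map[] or map[chararray]); keys are always chararray, so map[chararray,chararray] does not parse:
A = load 'pdemo/samp' using PigStorage(',') as (name:chararray,add:map[],empno:int);
Now it will work.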
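Once the map loads, individual values are looked up with the # operator. The following is a minimal sketch, assuming the samp.txt file and the field names from the question; the relation B and the alias city are illustrative, and map[chararray] declares only the value type.

```pig
-- Typed map: keys are always chararray, so only the value type is declared.
A = LOAD 'samp.txt' USING PigStorage(',')
    AS (name:chararray, addr:map[chararray], empno:int);

-- Look up the value stored under the key 'city' in each record's map.
B = FOREACH A GENERATE name, addr#'city' AS city, empno;
DUMP B;
```

A key that is missing from a record's map yields null for that row rather than an error.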

Unable to extract data with double pipe delimiter in Pig Script

I am trying to extract data that is double-pipe delimited in Pig. The following is my command:
L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');
I am getting the following error:
2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'
My sample input file has exactly 5 lines, as follows:
POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99
I tried several options, such as putting a backslash before the delimiter (\||, \|\|), but everything failed. I also tried with a schema but got the same error. I am using Hortonworks (HDP 2.2.4) and Pig (0.14.0).
Any help is appreciated. Please let me know if you need any further details.
I have faced this case, and from checking the PigStorage source code, I think the PigStorage argument must parse to a single character.
So we can use this code instead:
L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;
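This works if you know how many columns you have: splitting on a single '|' leaves an empty field between each pair of real fields, so the real values sit at the even positions ($0, $2, ...). It will not affect performance because it runs on the map side.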
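When you load data using PigStorage, it expects a single character as the delimiter.
If you still want to split on '||' in the loader, you can use MyRegExLoader from piggybank. Its argument is a regular expression with one capture group per field (nine for the sample data above), and each '|' must be escaped:
REGISTER '/path/to/piggybank.jar';
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)');
Add an AS clause naming the nine fields if you need a schema.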
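A variant that avoids both the empty in-between fields and the piggybank dependency is to split each line on the escaped two-character delimiter. The following is a minimal sketch, assuming the path from the question; raw is an illustrative relation name, and no schema is declared because the question does not name the fields.

```pig
-- Read each record as one chararray line.
raw = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING TextLoader() AS (line:chararray);

-- '|' is a regex metacharacter, so each pipe is escaped;
-- the doubled backslash is how the escape is written inside a Pig string.
L = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, '\\|\\|'));
DUMP L;
```

With no AS clause the resulting fields are addressed by position ($0 through $8 for the sample rows).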

Pig file load error

I am trying to run this command in the Pig environment.
grunt> A = LOAD inp;
But I am getting this error in the log files:
Pig Stack Trace:
ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Failed to parse: mismatched input 'inp' expecting QUOTEDSTRING
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1565)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
And in the console I am getting this:
grunt> A = LOAD inp;
2012-10-26 12:18:34,627 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Details at logfile: /usr/local/hadoop/pig_1351232517175.log
Can anybody provide an appropriate solution for this?
The LOAD syntax is wrong: the file name must be a quoted string. See the correct example in the documentation:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Example from http://pig.apache.org/docs
I believe the error log is self-explanatory: it says the parser was expecting a QUOTEDSTRING.
Put the file name in single quotes to solve this issue.
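For the command in the question, the fix would look like the sketch below. The second half is an assumption about intent: if inp was meant as a variable rather than a literal file name, Pig's parameter substitution can supply it when the statements are run as a script. The script name and the path in the comment are hypothetical.

```pig
-- The path is a string literal, so it goes in single quotes.
A = LOAD 'inp';

-- If inp was meant as a variable, pass it in from the command line.
-- Assumes the script is started as (hypothetical names):
--   pig -param inp=/path/to/data myscript.pig
B = LOAD '$inp';
DUMP B;
```

Either way, the value that reaches LOAD must sit inside single quotes.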