Error 1070: Pig could not resolve PigSorage using imports

I'm trying to read a file with Pig and I get the error indicated in the title.
data = LOAD '/user/cloudera/pigexample/commands' USING PigSorage('\n') as (command:chararray);
DUMP data;
The file contains the following:
**SOF**
whoami
pwd
ls
say
saw
source
<1>
source
<1>
exit
**EOF**
**SOF**
where's
<1>
I don't understand why I get the error, since the delimiter is supposedly the line break. I'm also trying with another file whose delimiter is the tab ('\t'), and it doesn't work either. Does anyone know what the delimiter should be? PS: I don't know what tag to put on the question.
As explained above, I expected it to load the file with the indicated delimiter, but it does not work.
I have already tried with '\t' and '\n'.

I have already found my mistake: PigSorage should be PigStorage.
Wrong: data = LOAD '/user/cloudera/pigexample/commands' USING PigSorage('\n') as (command:chararray);
Correct: data = LOAD '/user/cloudera/pigexample/commands' USING PigStorage('\n') as (command:chararray);
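For completeness, a minimal sketch of the corrected script, reusing the path and the DUMP from the question:
data = LOAD '/user/cloudera/pigexample/commands' USING PigStorage('\n') as (command:chararray);
DUMP data;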

Related

Bigquery error (ASCII 0) encountered for external table and when loading table

I'm getting this error
"Error: Error detected while parsing row starting at position: 4824. Error: Bad character (ASCII 0) encountered."
The data is not compressed.
My external table points to multiple CSV files, and one of them contains a couple of lines with that character. In my table definition I added "MaxBadRecords", but that had no effect. I also get the same problem when loading the data in a regular table.
I know I could use Dataflow or even try to fix the CSVs, but is there an alternative that does not involve writing a parser, and that is hopefully just as easy and efficient?
is there an alternative that does not involve writing a parser, and that is hopefully just as easy and efficient?
Try the following in the Google Cloud SDK Shell (it uses the tr utility):
gsutil cp gs://bucket/badfile.csv - | tr -d '\000' | gsutil cp - gs://bucket/fixedfile.csv
This will:
Read your "bad" file
Remove ASCII 0
Save the "fixed" content as a new file
After you have the new file, just make sure your table points to that fixed one.
Sometimes a stray NUL byte appears at the end of the file.
What can help is replacing it with:
tr '\0' ' ' < file1 > file2
You can clean the file using an external tool like Python or PowerShell. There is no way to load a file containing ASCII 0 into BigQuery.
This is a script that can clean the file with Python:
import os
import shutil
from tempfile import mkstemp

def replace_chars(file_path, original_string, new_string):
    # Create a temp file
    fh, abs_path = mkstemp()
    with os.fdopen(fh, 'w', encoding='utf-8') as new_file:
        with open(file_path, encoding='utf-8', errors='replace') as old_file:
            print("\nCurrent line: \t")
            i = 0
            for line in old_file:
                print(i, end="\r", flush=True)
                i = i + 1
                line = line.replace(original_string, new_string)
                new_file.write(line)
    # Copy the file permissions from the old file to the new file
    shutil.copymode(file_path, abs_path)
    # Remove the original file
    os.remove(file_path)
    # Move the new file into place
    shutil.move(abs_path, file_path)
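A possible invocation (a sketch; the file name below is a hypothetical placeholder) to replace ASCII 0 with a space in place:
replace_chars('data.csv', '\x00', ' ')  # 'data.csv' is a placeholder path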
The same but for PowerShell:
(Get-Content "C:\Source.DAT") -replace "`0", " " | Set-Content "C:\Destination.DAT"

Plink. Error: No people remaining after --keep

I'm trying to subset individuals from the ACB population from the file named allconcat39.vcf, using Plink 1.9. For that, I created a tab-delimited text file in R called indACB.txt, which looks like this:
head indACB.txt
684_HG01879 684_HG01879
685_HG01880 685_HG01880
686_HG01882 686_HG01882
687_HG01883 687_HG01883
688_HG01885 688_HG01885
689_HG01886 689_HG01886
690_HG01889 690_HG01889
691_HG01890 691_HG01890
694_HG01894 694_HG01894
695_HG01896 695_HG01896
when I run the following code:
./plink --vcf allconcat39.vcf --keep indACB.txt --recode --out allconcat39ACB
the following error occurs:
Error: No people remaining after --keep.
I made sure that the VCF and the indACB.txt file had compatible individual IDs and sample IDs. I don't know where else the problem could be. Any thoughts? Thank you in advance!
It was solved in another forum by Christopher Chang: Add --double-id to your command line; otherwise plink treats '_' as a delimiter between the FID and IID.
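For reference, this is the original command from the question with the suggested flag added (a sketch, not verified against your data):
./plink --vcf allconcat39.vcf --keep indACB.txt --double-id --recode --out allconcat39ACB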

Load CSV file in PIG

In Pig, when we load a CSV file using the LOAD statement without mentioning a schema, and with the default PigStorage delimiter ('\t'), what happens? Will the load work fine and can we dump the data? Or will it throw an error since the file has ',' and the PigStorage delimiter is '\t'? Please advise.
When you load a CSV file without defining a schema using PigStorage('\t'), since there are no tabs in the lines of the input file, each whole line will be treated as a single field of a one-field tuple. You will not be able to access the individual words in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter were a comma. But since the delimiter was a tab, and a tab does not exist in the input records, the whole line was treated as one field. Pig does not complain if a field is null; it just outputs nothing when there is a null. Hence you see only the trailing commas when you dump b.
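For comparison, a minimal sketch (assuming the same input file as above) that loads with a comma delimiter and would yield the expected three fields:
a = LOAD 'input' USING PigStorage(',');
b = foreach a generate $0, $1, $2;
dump b;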
Hope that was useful.

Unable to extract data with double pipe delimiter in Pig Script

I am trying to extract data which is pipe-delimited in Pig. The following is my command:
L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');
I am getting the following error:
2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'
My input sample file has exactly 5 lines as following
POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99
I tried several options like using a backslash before the delimiter (\||, \|\|), but everything failed. Also, I tried with a schema but got the same error. I am using Hortonworks (HDP 2.2.4) and Pig (0.14.0).
Any help is appreciated. Please let me know if you need any further details.
I have faced this case, and from checking the PigStorage source code, I think the PigStorage argument is expected to be only one character.
So we can use this code instead:
L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;
It's helpful if you know how many columns you have, and it will not affect performance because it's done on the map side.
When you load data using PigStorage, it only expects a single character as the delimiter.
However, if you still want to achieve this, you can use MyRegExLoader:
REGISTER '/path/to/piggybank.jar'
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('||')
as (movieid:int, title:chararray, genre:chararray);

Error in BQ shell Loading Datastore with write_disposition as Write append

1. I tried to load into an existing table [using a Datastore backup file].
2. The bq shell asked me to add write_disposition as WRITE_APPEND to load into the existing table.
3. If I do the above, it throws an error as follows:
load --source_format=DATASTORE_BACKUP --write_disposition=WRITE_append --allow_jagged_rows=None sample_red.t1estchallenge_1 gs://test.appspot.com/bucket/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.entity.backup_info
Error parsing command: flag --allow_jagged_rows=None: ('Non-boolean argument to boolean flag',None)
I tried --allow_jagged_rows=0 and --allow_jagged_rows=None; nothing works, just the same error.
Please advise on this.
UPDATE: As Mosha suggested, --allow_jagged_rows=false has worked. It should be placed before --write_disposition=WRITE_TRUNCATE. But this has led to another issue with encoding. Can anyone say what the encoding type should be for DATASTORE_BACKUP? I tried both --encoding=UTF-8 and --encoding=ISO-8859.
load --source_format=DATASTORE_BACKUP --allow_jagged_rows=false --write_disposition=WRITE_TRUNCATE sample_red.t1estchallenge_1 gs://test.appspot.com/STAGING/ahFzfnZpcmdpbi1yZWQtdGVzdHJBCxIcX0FFX0RhdGFzdG9yZUFkbWluX09wZXJhdGlvbhiBwLgCDAsSFl9BRV9CYWNrdXBfSW5mb3JtYXRpb24YAQw.entityname.backup_info
Please advise.
You should use "false" (or "true") with boolean arguments, i.e.
--allow_jagged_rows=false