I am trying to write a Pig script for compacting small files containing data in Parquet format. The lines below load the small files in the directory and then store them. The files have complex nested structures whose fields are nullable and contain a lot of NULLs.
LOGS = LOAD '/dt=20150307/hr=2015030700/*' USING parquet.pig.ParquetLoader();
STORE LOGS INTO '/user/compaction_output' USING parquet.pig.ParquetStorer();
I am getting the following error:
2015-04-29 17:00:45,883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2118: Cannot build an empty group
My suspicion is that it is because of the null values in the input files.
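If that's the cause, I was thinking of testing it with something like the sketch below, where nested_field is a hypothetical stand-in for one of the nullable nested columns:
-- hypothetical check: write out only the rows whose nullable nested field is populated
LOGS = LOAD '/dt=20150307/hr=2015030700/*' USING parquet.pig.ParquetLoader();
NON_NULL = FILTER LOGS BY nested_field IS NOT NULL;
STORE NON_NULL INTO '/user/compaction_output_nonnull' USING parquet.pig.ParquetStorer();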
Can someone help out?
Related
I keep getting an error when I try to upload data from my own documents in CSV format under the table destination. How do I fix this so I can write queries on my data?
I tried changing the names of the files I was uploading and following the instructions from my course exactly, without any luck. I was expecting the data to be uploaded into my project file so I could write queries to analyze it.
Pig uses variables to store the data.
When I load data from HDFS into a variable in Pig, where is the data temporarily stored?
What exactly happens in the background when we load the data into the variable?
Kindly help.
Pig evaluates most expressions lazily. In most cases it only checks for syntax errors and the like. For example,
a = load 'hdfs://I/Dont/Exist';
won't throw an error until you use STORE or DUMP or something along those lines that forces the evaluation of a.
Similarly, if a file exists and you load it into a relation and perform transformations on it, the file is usually spooled to the /tmp folder and then the transformations are performed. If you look at the messages that appear when you run commands in grunt, you'll notice file paths starting with file:///tmp/xxxxxx_201706171047235. These are the files that store intermediate data.
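A minimal grunt sketch of that behaviour (the path and field name are made up):
-- parsed but not executed: no error yet, even though the path does not exist
a = load 'hdfs://I/Dont/Exist' as (f1:chararray);
-- still nothing executed; the transformation is only added to the logical plan
b = filter a by f1 is not null;
-- DUMP forces evaluation: only now does Pig try to read the input,
-- and intermediate results of a real job land under the temp paths mentioned above
dump b;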
I have successfully loaded a large number of Avro files (of the same schema, into the same table), stored on Google Cloud Storage, using the bq CLI utility.
However, for some of the Avro files I am getting a very cryptic error while loading into BigQuery. The error says:
The Apache Avro library failed to read data with the follwing error: EOF reached (error code: invalid)
I validated with avro-tools that the Avro file is not corrupted; the repair report output:
java -jar avro-tools-1.8.1.jar repair -o report 2017-05-15-07-15-01_48a99.avro
Recovering file: 2017-05-15-07-15-01_48a99.avro
File Summary:
Number of blocks: 51
Number of corrupt blocks: 0
Number of records: 58598
Number of corrupt records: 0
I tried creating a brand new table with one of the failing files in case it was due to a schema mismatch, but that didn't help as the error was exactly the same.
I need help figuring out what could be causing the error here.
There is no way to pinpoint the issue without more information, but I ran into this error message and filed a ticket here.
In my case, a number of files in a single load job were missing columns, which was causing the error.
Explanation from the ticket:
BigQuery uses the alphabetically last file from the directory as the avro schema to read the other Avro files. I suspect the issue is with schema incompatibility between the last file and the "problematic" file. Do you know if all the files have the exact same schema or differ? One thing you could try to help verify this is to copy the alphabetically last file of the directory and the "problematic" file to a different folder and try to load those two files in one BigQuery load job and see if the error reproduces.
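If you want to try that, something along these lines should do it (bucket, folder, and dataset names below are placeholders):
# copy the alphabetically last file and the "problematic" file into a scratch folder
gsutil cp gs://mybucket/avro/zz_last_file.avro gs://mybucket/avro-debug/
gsutil cp gs://mybucket/avro/2017-05-15-07-15-01_48a99.avro gs://mybucket/avro-debug/

# load just those two files in a single job and see if the error reproduces
bq load --source_format=AVRO mydataset.avro_debug_table "gs://mybucket/avro-debug/*"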
I am new to BigQuery and I was trying to load an Avro file into a BigQuery table. The first two times I was able to load the Avro file into the table. From the third time onwards it started failing, and the error message is:
Waiting on bqjob_r77fb1a791c9ab204_0000015c88ab3ad8_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'xxx-yz-df:bqjob_r77fb1a791c9ab204_0000015c88ab3ad8_1': Provided Schema does not match Table xxx-yz-df:adityadb.avro_poc3_part_stage$20120611.
I tried many times. How can the schema mismatch for the same file if you try more than two times? The load command I was using is:
bq load --source_format=AVRO adityadb.avro_poc3_part_stage$20120611 gs://reair_ddh/apps/hive/warehouse/adityadb1.db/avro_poc3_part_txt/ingestion_time=20120611/000000_0
I don't know why this is happening. Any help would be appreciated. Thank you.
I have a set of avro files with slightly varying schemas which I'd like to load into one bq table.
Is there a way to do that with one line? Every automatic way to handle schema difference would be fine for me.
Here is what I tried so far.
0) If I try to do it in a straightforward way, bq fails with error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) Quick googling shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option which, when added to the bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, the schema id is mentioned in the file name, so the files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as the bq manual also states):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list happens to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder-by-folder, and this is what I'm doing now as a last resort, but this is only a small part of the data processing pipeline, and copying is not acceptable as a production solution.
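For reference, this is roughly what that folder-by-folder workaround looks like (bucket and folder names are placeholders, and using the ALLOW_FIELD_ADDITION flag from point 1 is only my guess at how to let the table schema grow between the per-schema loads):
# split the files by schema id into separate folders (one wildcard per copy)
gsutil -m cp "gs://mybucket/logs/*_schemaA_*" gs://mybucket/by_schema/schemaA/
gsutil -m cp "gs://mybucket/logs/*_schemaB_*" gs://mybucket/by_schema/schemaB/

# then load each folder with its own single-asterisk wildcard
bq load --source_format=AVRO --schema_update_option=ALLOW_FIELD_ADDITION myproject:mydataset.logs "gs://mybucket/by_schema/schemaA/*"
bq load --source_format=AVRO --schema_update_option=ALLOW_FIELD_ADDITION myproject:mydataset.logs "gs://mybucket/by_schema/schemaB/*"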