BQ loading: Received "unexpected error" during loading with no additional output - google-bigquery

I submitted a load job to Google BigQuery that loads 12 compressed (gzip) tabular files from Google Cloud Storage. Each file is about 2 GB compressed. The command I ran was similar to:
bq load --nosync --skip_leading_rows=1 --source_format=CSV
--max_bad_records=14000 -F "\t" warehouse:some_dataset.2014_lines
gs://bucket/file1.gz,gs://bucket/file2.gz,gs://bucket/file12.gz
schema.txt
I'm receiving the following error from my BigQuery load job with no explanation of why:
Error Reason:internalError. Get more information about this error at
Troubleshooting Errors: internalError.
Errors: Unexpected. Please try again.
I'm certain that the schema file is correctly formatted, as I've successfully loaded files using the same schema but a different set of files.
I'm wondering in what kinds of situations an internal error like this occurs, and what are some ways I could go about debugging the issue?
My BQ job id: bqjob_r78ca777a8ad4bdd9_0000014e2dc86e0e_1
Thank you!

There are some failure modes you can hit with large .gz input files that are not always reported with a clear cause. This happens especially (but not exclusively) with highly compressible text, where 1 GB of compressed data represents an unusually large amount of text.
The documented limit for compressed CSV/JSON is 1 GB. If that is current, I would actually expect an error on your 2 GB inputs. Let me check that.
Are you able to split these files into smaller pieces and try again?
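If splitting is an option, here is a minimal sketch of one way to do it, assuming each input is a gzipped, tab-separated file with a single header row (the chunk size and output naming below are made up and can be tuned):

# Split one large gzipped file into smaller gzipped chunks, repeating the
# header row in each chunk so the same bq flags (--skip_leading_rows=1) work.
import gzip

def split_gzip(path, lines_per_chunk=5_000_000):
    with gzip.open(path, "rt") as src:
        header = src.readline()
        chunk = None
        part = 0
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                if chunk:
                    chunk.close()
                part += 1
                chunk = gzip.open("%s.part%03d.gz" % (path, part), "wt")
                chunk.write(header)
            chunk.write(line)
        if chunk:
            chunk.close()

# e.g. split_gzip("file1.gz"), then copy the part files back to the bucket
# with gsutil and point bq load at those instead of the original 2 GB files.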
(Meta: Grace, you are correct that Google says that "Google engineers monitor and answer questions with the tag google-bigquery" on StackOverflow. I am a Google engineer, but there are also many knowledgeable people here who are not. Google's docs could perhaps give more explicit guidance: the questions that are most valuable to the StackOverflow community are ones where a future reader can recognize that they're seeing the same problem, and preferably ones that a non-Googler can answer from public information. It's tough in your case because the error is broad and the cause is unclear. But if you're able to reproduce the problem using an input file that you can make public, more people here will be able to take a crack at it. You can also file an issue for questions that really no one outside Google can do much with.)

Related

Silverstripe 4 large Files in Uploadfield

When uploading a large file with UploadField I get the error:
"Server responded with an error.
Expected a value of type "Int" but received: 4008021167"
To set the allowed file size I used $upload->getValidator()->setAllowedMaxFileSize(6291456000);
$upload is an UploadField.
Every file larger than 2 GB gets this error; smaller files are uploaded without any error.
Where can I adjust things so that I can upload bigger files?
I remember that there was a 2 GB limit in the past, but I don't know where to adjust it.
Thanks for your answers,
klaus
The regular file upload limits don't seem to be the issue if you are already at 2 GB. This might be the memory limit of the process itself. I would recommend looking into chunked uploads, which allow you to process larger files.
I know this answer is late, but the problem is rooted in the GraphQL type definition of the File type (its size field is set to Int). I've submitted a pull request to the upstream repository. Also, here is the sed one-liner to patch it:
sed -i 's/size\: Int/size\: Float/g' vendor/silverstripe/asset-admin/_graphql/types/File.yml
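For context on why the cutoff sits right around 2 GB: GraphQL's Int type is a signed 32-bit integer, so any byte count above 2**31 - 1 (such as the 4008021167 in the error message) cannot be represented until the size field is changed to Float. A quick check in Python:

print(2**31 - 1)                 # 2147483647, the largest value a GraphQL Int can hold (~2 GiB)
print(4008021167 > 2**31 - 1)    # True: the reported file size overflows Int, hence the error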

pyPDF2 error - PyPDF2.utils.PdfStreamError: Stream has ended unexpectedly

I am a Python newbie and wrote a script about a year back to retrieve PDF files and merge them into a single file/book. The script works and does what I need it to do, but lately, as I am no longer the only user, there seem to be some files that are causing it to crash. My suspicion is that newer PDF files with higher-resolution images, forms, etc. may be something PyPDF2 cannot handle, but I don't know that for sure. It would help to know which file(s) are creating the issue so I could scan those instead of using the native PDF files, but I can't figure that out since the error occurs when all the files are written at the end. Below is the snippet of my code that I use for merging them, along with the error. Is PyPDF2 still being updated? Is there a better supported/commercial utility out there that can do this? Any suggestions and help would be greatly appreciated!
# outfile is assumed to be a PyPDF2 PdfFileMerger created earlier in the script
import os
from PyPDF2 import PdfFileMerger

for f in Submittal_Files:
    h_path, f_name = os.path.split(f)   # split the path from the file name
    outfile.append(open(f, 'rb'))       # queue each source PDF for merging
outfile.write(open(Final_Submittal_File_Name + ".pdf", 'wb'))   # write the merged book
print("\nSUCCESSFULLY COMPLETED!")
print("PRESS ENTER TO END!")
program_holder = input()
break   # exits the enclosing loop in the full script
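One way to narrow down which input is breaking the merge is to probe each PDF on its own, since the stream error only surfaces once everything is parsed during the final write. A minimal sketch, assuming Submittal_Files is the same list used above; the temporary output name probe_tmp.pdf is made up:

# Try each file individually and collect the ones PyPDF2 cannot read.
import os
from PyPDF2 import PdfFileMerger
from PyPDF2.utils import PdfReadError   # PdfStreamError is a subclass of this

bad_files = []
for f in Submittal_Files:
    try:
        probe = PdfFileMerger()
        probe.append(open(f, 'rb'))
        probe.write(open("probe_tmp.pdf", 'wb'))   # forces the streams to be parsed
    except PdfReadError as err:
        bad_files.append((f, str(err)))

if os.path.exists("probe_tmp.pdf"):
    os.remove("probe_tmp.pdf")
print("Files PyPDF2 could not read:", bad_files)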
Sorry about the long error log before. I won't be able to get the files until Monday 7/23 when I will have access to them. I will provide 2 examples.
LOG:
https://www.dropbox.com/s/ymbe1tnak7uuvs7/PDF-Merge-Crash-log.txt?dl=0
File Set #1 - Large Submittal - I had to delete some files to protect the innocent. They are mostly cover pages for sections and I know those are not an issue as they are used all the time.
https://www.dropbox.com/s/xavrpfxmo6dr7mb/Case%20%231%20-%20JUST%20PDFs.rar?dl=0
File Set #2 - Smaller Submittal - Same as above.
https://www.dropbox.com/s/2lxk8bts0w2qsx8/Case%20%232%20-%20JUST%20PDFS.rar?dl=0

BigQuery InternalError loading from Cloud Storage (works with direct file upload)

Whenever I try to load a CSV file stored in Cloud Storage into BigQuery, I get an InternalError (both using the web interface and the command line). The CSV is an abbreviated part of the Google Ngram dataset.
A command like:
bq load 1grams.ngrams gs://otichybucket/import_test.csv word:STRING,year:INTEGER,freq:INTEGER,volume:INTEGER
gives me:
BigQuery error in load operation: Error processing job 'otichyproject1:bqjob_r28187461b449065a_000001504e747a35_1': An internal error occurred and the request could not be completed.
However, when I load this file directly using the web interface and the File upload as a source (loading from my local drive), it works.
I need to load from Cloud Storage, since I need to load much larger files (original ngrams datasets).
I tried different files; the result is always the same.
I'm an engineer on the BigQuery team. I was able to look up your job, and it looks like there was a problem reading the Google Cloud Storage object.
Unfortunately, we didn't log much of the context, but looking at the code, the things that could cause this are:
The URI you specified for the job is somehow malformed. It doesn't look malformed, but maybe there is some odd UTF-8 non-printing character that I didn't notice.
The 'region' for your bucket is somehow unexpected. Is there any chance you've set the data location on your GCS bucket to something other than {US, EU, or ASIA}? See here for more info on bucket locations. If so, and you've set the location to a region rather than a continent, that could cause this error.
There could have been some internal error in GCS that caused this. However, I didn't see this in any of the logs, and it should be fairly rare.
We're putting in more logging to detect this in the future and fixing the issue with regional buckets (regional buckets may still fail, because BigQuery doesn't support cross-region data movement, but at least they will fail with an intelligible error).
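If it helps to rule out the second suspected cause, the bucket's location can be inspected directly; a minimal sketch with today's google-cloud-storage Python client, using the project and bucket names from the job and command above:

# Print the bucket's location; anything other than US, EU, or ASIA would
# point at the regional-bucket issue described above.
from google.cloud import storage

client = storage.Client(project="otichyproject1")
bucket = client.get_bucket("otichybucket")
print(bucket.location)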

BigQuery - 1 file duplicate out of many using Java API

I am using the Java API quick start program to upload CSV files into BigQuery tables. I uploaded more than a thousand files, but in one of the tables one file's rows are duplicated in BigQuery. I went through the logs but found only one entry for its upload. I also went through many similar questions where @jordan mentioned that the bug is fixed. Can there be any reason for this behavior? I have yet to try the suggested solution of setting a job ID, but I just cannot see any reason on my side for the duplicate entries...
Since BigQuery is append-only by design, you need to accept that there will be some duplicates in your system, and write your queries in such a way that they select the most recent version of each row.
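The job-ID idea mentioned in the question is still worth trying for future loads, because BigQuery refuses to run two jobs with the same ID, which makes retries idempotent. A minimal sketch with the Python client rather than the Java API used in the question (the URI, dataset, and table names are made up):

# Derive a deterministic job ID from the file URI so a retried upload of the
# same file is rejected with a Conflict error instead of appending rows twice.
import hashlib
from google.cloud import bigquery

client = bigquery.Client()
uri = "gs://my-bucket/exports/file_0001.csv"
job_id = "load-" + hashlib.md5(uri.encode()).hexdigest()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    uri, "my_dataset.my_table", job_id=job_id, job_config=job_config
)
load_job.result()   # wait for the load to finish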

Checksum Exception when reading from or copying to hdfs in apache hadoop

I am trying to implement a parallelized algorithm using Apache Hadoop, but I am facing some issues when trying to transfer a file from the local file system to HDFS. A checksum exception is being thrown when trying to read from or transfer the file.
The strange thing is that some files are copied successfully while others are not (I tried with 2 files, one slightly bigger than the other, though both are small in size). Another observation I have made is that the Java FileSystem.getFileChecksum method is returning null in all cases.
A slight background on what I am trying to achieve: I am trying to write a file to HDFS, to be able to use it as a distributed cache for the MapReduce job that I have written.
I have also tried the hadoop fs -copyFromLocal command from the terminal, and the result is exactly the same behaviour as when it is done through the Java code.
I have looked all over the web, including other questions here on Stack Overflow, but I haven't managed to solve the issue. Please be aware that I am still quite new to Hadoop, so any help is greatly appreciated.
I am attaching the stack trace below which shows the exceptions being thrown. (In this case I have posted the stack trace resulting from the hadoop fs -copyFromLocal command from terminal)
name#ubuntu:~/Desktop/hadoop2$ bin/hadoop fs -copyFromLocal ~/Desktop/dtlScaleData/attr.txt /tmp/hadoop-name/dfs/data/attr2.txt
13/03/15 15:02:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 15:02:51 INFO fs.FSInputChecker: Found checksum error: b[0, 0]=
org.apache.hadoop.fs.ChecksumException: Checksum error: /home/name/Desktop/dtlScaleData/attr.txt at 0
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:230)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:176)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1183)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:130)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1762)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
copyFromLocal: Checksum error: /home/name/Desktop/dtlScaleData/attr.txt at 0
You are probably hitting the bug described in HADOOP-7199. What happens is that when you download a file with copyToLocal, it also copies a .crc file into the same directory, so if you modify the file and then try to do copyFromLocal, Hadoop will compute a checksum of your new file, compare it to your local .crc file, and fail with a non-descriptive error message.
To fix it, check whether you have this .crc file; if you do, just remove it and try again.
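A small sketch of that check, using the path from the stack trace above (Hadoop's local checksum files follow the hidden .<filename>.crc naming convention in the same directory):

# Remove the stale local checksum file so copyFromLocal recomputes it.
import os

src = "/home/name/Desktop/dtlScaleData/attr.txt"
crc = os.path.join(os.path.dirname(src), "." + os.path.basename(src) + ".crc")

if os.path.exists(crc):
    os.remove(crc)
    print("Removed stale checksum file:", crc)
else:
    print("No local .crc file found next to", src)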
I faced the same problem and solved it by removing the .crc files.
OK, so I managed to solve this issue and I'm writing the answer here just in case someone else encounters the same problem.
What I did was simply create a new file and copy all the contents from the problematic file into it.
From what I can presume, it looks like some crc file was created and attached to that particular file, so by trying with another file, another crc check is carried out. Another reason could be that I named the file attr.txt, which could be a conflicting file name with some other resource. Maybe someone could expand even more on my answer, since I am not 100% sure of the technical details and these are just my observations.
The CRC file holds a checksum for a particular block of data. The entire data set is split into blocks, and each block stores its metadata along with the CRC file inside the /hdfs/data/dfs/data folder. If someone makes changes to the CRC files, the stored and recomputed CRC values no longer match, and that causes the error. The best practice to fix this error is to overwrite the metadata file along with the CRC file.
I got the exact same problem and didn't find any solution. Since this was my first Hadoop experience, I could not follow some of the instructions on the internet. I solved this problem by formatting my namenode:
hadoop namenode -format