BigQuery InternalError loading from Cloud Storage (works with direct file upload) - google-bigquery

Whenever I try to load a CSV file stored in Cloud Storage into BigQuery, I get an InternalError (using both the web interface and the command line). The CSV is an abbreviated part of the Google Ngram dataset.
A command like:
bq load 1grams.ngrams gs://otichybucket/import_test.csv word:STRING,year:INTEGER,freq:INTEGER,volume:INTEGER
gives me:
BigQuery error in load operation: Error processing job 'otichyproject1:bqjob_r28187461b449065a_000001504e747a35_1': An internal error occurred and the request could not be completed.
However, when I load this file directly through the web interface with File upload as the source (from my local drive), it works.
I need to load from Cloud Storage, since I need to load much larger files (original ngrams datasets).
I have tried different files; the result is always the same.

I'm an engineer on the BigQuery team. I was able to look up your job, and it looks like there was a problem reading the Google Cloud Storage object.
Unfortunately, we didn't log much of the context, but looking at the code, the things that could cause this are:
The URI you specified for the job is somehow malformed. It doesn't look malformed, but maybe there is some odd UTF8 non-printing character that I didn't notice.
The 'region' for your bucket is somehow unexpected. Is there any chance you've set the data location on your GCS bucket to something other than US, EU, or ASIA? See here for more info on bucket locations. If so, and you've set the location to a region rather than a continent, that could cause this error.
There could have been some internal error in GCS that caused this. However, I didn't see this in any of the logs, and it should be fairly rare.
We're putting in more logging to detect this in the future and to fix the issue with regional buckets (regional buckets may still fail, because BigQuery doesn't support cross-region data movement, but at least they will fail with an intelligible error).
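If you want to rule out the bucket-location cause in the meantime, one quick check is to read the bucket's location metadata. A minimal sketch with the google-cloud-storage Python client (assuming default credentials are configured):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("otichybucket")
# "US", "EU" or "ASIA" should be fine; a regional value here could explain the error
print(bucket.location)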

Related

Dask dataframe read parquet format fails from http

I have been dealing with this problem for a week.
I use the command
from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")
I get "invalid parquet magic".
However, ddf.read_parquet works fine with "webhdfs://".
I would like ddf.read_parquet to work over HTTP, because I want to use it in a dask-ssh cluster whose workers have no HDFS access.
Although the comments already partly answer this question, I thought I would add some information as an answer.
HTTP(S) is supported by Dask (actually fsspec) as a backend filesystem, but to get partitioning within a file you need to know the size of that file, and to resolve globs you need to be able to get a list of links; neither of these is necessarily provided by any given server.
webHDFS (or indeed httpFS) doesn't work like a plain HTTP download: you need to use a specific API to open a file and fetch a final URL on a cluster member for that file, so the two methods are not interchangeable.
webHDFS is normally intended for use outside of the Hadoop cluster; within the cluster, you would probably use plain HDFS ("hdfs://"). However, Kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
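To make the distinction concrete, here is a minimal sketch (host, port, path and user are all hypothetical) of the two access modes:

from dask import dataframe as ddf

# Plain HTTP: fsspec just issues GET requests, so the bytes returned must be the
# raw parquet file. A bare /webhdfs/v1/... URL without the REST API's ?op=OPEN
# handshake returns API responses instead, hence "invalid parquet magic".
# df = ddf.read_parquet("http://namenode:9870/webhdfs/v1/user/me/data.parquet")

# webhdfs://: fsspec speaks the WebHDFS REST API (open, follow the redirect to a
# datanode, list directories, read file sizes), which is what globbing and
# partitioning need.
df = ddf.read_parquet(
    "webhdfs://namenode:9870/user/me/data/*.parquet",
    storage_options={"user": "me"},  # assumption: simple (non-Kerberos) auth
)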

GoogleHadoopFileSystemBase.setTimes() not working

I have a reference to GoogleHadoopFileSystemBase in my Java code, and I'm trying to call setTimes(Path p, long mtime, long atime) to modify the timestamp of a file. It doesn't seem to be working, even though other FileSystem APIs work fine. From the console UI, I can only see the "last uploaded" time. Does the Google filesystem only store the upload time, and thus ignore changes to the modification time and access time? I also checked the timestamp using hadoop fs -ls gs://mybucket/, but that timestamp also shows up as unchanged.
This is unfortunately a limitation of Google Cloud Storage (GCS) and the GCS Hadoop connector. Specifically, GCS objects do not have mtime and atime attributes, which is what causes the underlying failure. As a workaround right now, an HDFS attached to the cluster should function normally.
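As a small illustration (bucket and object names are hypothetical), the only timestamps a GCS object carries are server-managed, so there is nothing for setTimes() to write to. With the google-cloud-storage Python client:

from google.cloud import storage

blob = storage.Client().bucket("mybucket").get_blob("some/object.txt")
# Both fields are set by GCS when the object is written; clients cannot change them.
print(blob.time_created, blob.updated)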

Can Someone Help Me Troubleshoot Error In BQ "does not contain valid backup metadata."

I keep trying to upload a new table to my company's BigQuery, but I keep getting the error you see in the title ("does not contain valid backup metadata.").
For reference, I'm uploading a .csv file that has been saved to our Google Cloud Storage. It's being uploaded as a native table.
Can anyone help me troubleshoot this?
It sounds like you are specifying the file type DATASTORE_BACKUP. When you specify that file type, BigQuery takes whatever URI you provide (even if it has a .csv suffix) and searches for Cloud Datastore backup files relative to that URI.
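If that is the case, choosing CSV as the source format should resolve it. A minimal sketch with the BigQuery Python client (project, dataset, table and URI are hypothetical, and schema autodetection is an assumption):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # not DATASTORE_BACKUP
    skip_leading_rows=1,                      # assumption: the CSV has a header row
    autodetect=True,                          # or supply an explicit schema
)
job = client.load_table_from_uri(
    "gs://my-bucket/my_table.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
job.result()  # wait for the load job to finish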

Cannot process data in separate locations

I am trying to load a CSV file into BigQuery from Google Cloud Storage via the web UI.
But sometimes an error occurs.
Error message is "Cannot process data in separate locations".
What does it mean?
And how can I fix it?
This was an unintended consequence of an update to the BigQuery service. We'll provide additional follow-up on this bug:
https://code.google.com/p/google-bigquery/issues/detail?id=270

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3: you either provide a base path and it loads all the files under that path, or you specify a manifest file listing the specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket; however, we can't do that, because this bucket is the final storage place for another process that is not our own.
The only alternative I can think of is to have some other process running that tracks files that have been successfully loaded to Redshift, periodically compares that to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the copy process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem:
S3 -- (Lambda Trigger on newly created Logs) -- Lambda -- Firehose -- Redshift
It works at any scale: with more load there are more calls to Lambda and more data flowing to Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure dead-letter queues; events will be sent there, and you can reprocess them once you fix the Lambda.
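A minimal sketch of the Lambda in the middle of that pipeline (the delivery stream name is hypothetical, and it assumes a Firehose stream configured with a Redshift destination): it is triggered by S3 "object created" events and pushes the new object's lines into Firehose.

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")
STREAM = "my-redshift-delivery-stream"  # assumption: Firehose stream with a Redshift destination

def handler(event, context):
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        lines = s3.get_object(Bucket=bucket, Key=key)["Body"].read().splitlines()
        # Firehose accepts at most 500 records per PutRecordBatch call
        for i in range(0, len(lines), 500):
            firehose.put_record_batch(
                DeliveryStreamName=STREAM,
                Records=[{"Data": line + b"\n"} for line in lines[i:i + 500]],
            )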
Here I would like to mention the steps involved in loading data into Redshift.
Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during the export).
Split files into 10-15 MB chunks to get optimal performance during upload and the final data load.
Compress files to *.gz format so you don't end up with a $1000 surprise bill :) .. in my case text files were compressed 10-20 times.
List all file names in a manifest file so that when you issue the COPY command to Redshift they are treated as one unit of load.
Upload the manifest file to the Amazon S3 bucket.
Upload the local *.gz files to the Amazon S3 bucket.
Issue the Redshift COPY command with the appropriate options.
Schedule file archiving from on-premises and the S3 staging area on AWS.
Capture errors and set up restartability if something fails.
To do it the easy way, you can follow this link.
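As a minimal sketch of the manifest and COPY steps above (bucket, prefix, table and IAM role are all hypothetical), using boto3 to build and upload the manifest and a plain string for the COPY statement:

import json
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-staging-bucket", "exports/2024-01-01/"

# Collect the *.gz objects under the prefix and describe them in a Redshift manifest.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
keys = [o["Key"] for o in objects if o["Key"].endswith(".gz")]
manifest = {"entries": [{"url": f"s3://{bucket}/{k}", "mandatory": True} for k in keys]}
s3.put_object(Bucket=bucket, Key=prefix + "load.manifest",
              Body=json.dumps(manifest).encode())

# Run this against Redshift (e.g. via psycopg2 or the query editor):
copy_sql = f"""
COPY my_schema.my_table
FROM 's3://{bucket}/{prefix}load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
MANIFEST GZIP CSV;
"""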
In general, comparing loaded files against what currently exists on S3 is a bad (though possible) practice. The common "industrial" practice is to use a message queue between the data producer and the data consumer that actually loads the data. Take a look at RabbitMQ vs Amazon SQS, etc.
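For example, a minimal sketch of that pattern on AWS (the queue URL is hypothetical, and it assumes the bucket's "object created" notifications are delivered to SQS): the loader polls the queue, so it only ever sees keys it has not processed yet.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-s3-files"

def poll_new_keys():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for rec in body.get("Records", []):
            yield rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"]
        # Delete only after the message has been handed off to the COPY/manifest step.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])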