BigQuery create table (native or external) linked to Google Cloud Storage - google-bigquery

I have some files uploaded to Google Cloud Storage (CSV and JSON).
I was able to create BigQuery tables, native or external, linking to these files in Google Cloud Storage.
In the process of creating the BigQuery tables, I could check "Schema Automatically detect".
The "Schema Automatically detect" works well with the newline-delimited JSON file. But with the CSV file, whose first row contains the column names, BigQuery cannot do the automatic schema detection; it treats the first line as data, and the schema it then creates is string_field_1, string_field_2, etc.
Is there anything I need to do to my CSV file so that BigQuery's "Schema Automatically detect" works?
The CSV file I have is a "Microsoft Excel Comma Separated Values File".
Update:
If the first column is empty, BigQuery autodetect doesn't detect the headers:
custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
,A123,,,Medley of very old Viennese songs,,,,,,,
,A234,,,Suite de pièces No. 3 en Ré Mineur HWV 428 - Allemande,,,,,,,
But if the first column is not empty, it is OK:
custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
1,A123,,,Medley of very old Viennese songs,,,,,,,
2,A234,,,Suite de pièces No. 3 en Ré Mineur HWV 428 - Allemande,,,,,,,
Should it be a feature improvement request for BigQuery?

CSV autodetect does detect the header line in CSV files, so there must be something special about your data. It would be good if you could provide a real data snippet and the actual commands you used. Here is my example that demonstrates how it works:
~$ cat > /tmp/people.csv
Id,Name,DOB
1,Bill Gates,1955-10-28
2,Larry Page,1973-03-26
3,Mark Zuckerberg,1984-05-14
~$ bq load --source_format=CSV --autodetect dataset.people /tmp/people.csv
Upload complete.
Waiting on bqjob_r33dc9ca5653c4312_0000015af95f6209_1 ... (2s) Current status: DONE
~$ bq show dataset.people
Table project:dataset.people
   Last modified        Schema        Total Rows   Total Bytes   Expiration   Labels
 ----------------- ----------------- ------------ ------------- ------------ --------
  22 Mar 21:14:27   |- Id: integer    3            89
                    |- Name: string
                    |- DOB: date

custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
,A123,,,Medley of very old Viennese songs,,,,,,,
,A234,,,Suite de pièces No. 3 en Ré Mineur HWV 428 - Allemande,,,,,,,
If the first column is empty, Google BigQuery cannot detect the schema.
custom id,asset id,related isrc,iswc,title,hfa song code,writers,match policy,publisher name,sync ownership share,sync ownership territory,sync ownership restriction
1,A123,,,Medley of very old Viennese songs,,,,,,,
2,A234,,,Suite de pièces No. 3 en Ré Mineur HWV 428 - Allemande,,,,,,,
If I add a value to the first column, then Google BigQuery can detect the schema.
Should it be a feature improvement request for BigQuery?
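Until autodetect handles this case, one workaround is to skip the header row and declare the schema explicitly rather than relying on detection. Below is a minimal sketch using the Python client library; the bucket path, destination table, and column list are placeholders for illustration, not the asker's actual values:
# Sketch only: bucket path, destination table, and column list are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # the header row autodetect fails to recognise
    schema=[
        bigquery.SchemaField("custom_id", "STRING"),
        bigquery.SchemaField("asset_id", "STRING"),
        bigquery.SchemaField("title", "STRING"),
        # ... declare the remaining columns the same way
    ],
)
client.load_table_from_uri(
    "gs://my-bucket/assets.csv", "my_dataset.assets", job_config=job_config
).result()
With the schema supplied explicitly, detection is never attempted, so the empty first column no longer matters.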

Related

How to use Snowflake identifier function to reference a stage object

I can describe a stage with the identifier function:
desc stage identifier('db.schema.stage_name');
But I get an error when I try to use the stage with the at-symbol syntax.
I have tried these variations, but no dice so far:
list @identifier('db.schema.stage_name');
list identifier('@db.schema.stage_name');
list identifier('db.schema.stage_name');
list identifier(@'db.schema.stage_name');
list identifier("@db.schema.stage_name");
The use of IDENTIFIER may indicate the need to query/list the contents of a stage with the stage name provided as a variable.
An alternative approach could be to use directory tables:
Directory tables store a catalog of staged files in cloud storage. Roles with sufficient privileges can query a directory table to retrieve file URLs to access the staged files, as well as other metadata.
Enabling directory table on the stage:
CREATE OR REPLACE STAGE test DIRECTORY = (ENABLE = TRUE);
ALTER STAGE test REFRESH;
Listing content of the stage:
SET var = '@public.test';
SELECT * FROM DIRECTORY($var);
Output:
+---------------+------+---------------+-----+------+----------+
| RELATIVE_PATH | SIZE | LAST_MODIFIED | MD5 | ETAG | FILE_URL |
+---------------+------+---------------+-----+------+----------+
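If the stage name has to stay in a variable on the client side, the same directory-table query can also be issued through the Python connector. A minimal sketch, assuming the test stage above already has its directory table enabled and refreshed; the connection parameters are placeholders:
# Sketch only: connection parameters are placeholders; assumes the directory
# table on @public.test was enabled and refreshed as shown above.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    database="my_db", schema="public",
)
cur = conn.cursor()
cur.execute("SET var = '@public.test'")  # stage name kept in a session variable
cur.execute("SELECT relative_path, size, file_url FROM DIRECTORY($var)")
for relative_path, size, file_url in cur.fetchall():
    print(relative_path, size, file_url)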

Hitachi Content Platform (HCP) S3 - How do I disable or delete previous versions?

I am (unfortunately) using Hitachi Content Platform for S3 object storage, and I need to sync around 400 images to a bucket every 2 minutes. The filenames are always the same, and the sync "updates" the original file with the latest image.
Originally, I was unable to overwrite existing files. Unlike other platforms, on HCP you cannot update a file that already exists while versioning is disabled; it returns a 409 and won't store the file. So I've enabled versioning, which allows the files to be overwritten.
The issue now is that HCP is set to retain old versions for 0 days for my bucket (which my S3 admin says should cause it to retain no versions) and "Keep deleted versions" is also disabled, but the bucket is still filling up with objects (400 files every 2 minutes = ~288K per day). It seems to cap out at this amount; after the first day it stays at ~288K permanently (so it does appear to eventually remove the old versions after 1 day).
Here's an example script that simulates the problem:
# Generate 400 files with the current date/time in them
for i in $(seq -w 1 400); do
  echo $(date +'%Y%m%d%H%M%S') > "file_${i}.txt"
done
# Sync the current directory to the bucket
aws --endpoint-url $HCP_HOST s3 sync . s3://$HCP_BUCKET/
# Run this a few times to simulate the 2 minute upload cycle
The initial sync is very quick and takes less than 5 seconds, but throughout the day it becomes slower and slower as the bucket accumulates more versions, eventually sometimes taking over 2 minutes to sync the files (which is bad, since I need to sync the files every 2 minutes).
If I try to list the objects in the bucket after 1 day, only 400 files come back in the list, but it can take over 1 minute to return (which is why I need to add --cli-read-timeout 0):
# List all the files in the bucket
aws --endpoint-url $HCP_HOST s3 ls s3://$HCP_BUCKET/ --cli-read-timeout 0 --summarize
# Output
Total Objects: 400
Total Size: 400
I can also list and see all of the old unwanted versions:
# List object versions and parse output with jq
aws --endpoint-url $HCP_HOST s3api list-object-versions --bucket $HCP_BUCKET --cli-read-timeout 0 | jq -c '.Versions[] | {"key": .Key, "version_id": .VersionId, "latest": .IsLatest}'
Output:
{"key":"file_001.txt","version_id":"107250810359745","latest":false}
{"key":"file_001.txt","version_id":"107250814851905","latest":false}
{"key":"file_001.txt","version_id":"107250827750849","latest":false}
{"key":"file_001.txt","version_id":"107250828383425","latest":false}
{"key":"file_001.txt","version_id":"107251210538305","latest":false}
{"key":"file_001.txt","version_id":"107251210707777","latest":false}
{"key":"file_001.txt","version_id":"107251210872641","latest":false}
{"key":"file_001.txt","version_id":"107251212449985","latest":false}
{"key":"file_001.txt","version_id":"107251212455681","latest":false}
{"key":"file_001.txt","version_id":"107251212464001","latest":false}
{"key":"file_001.txt","version_id":"107251212470209","latest":false}
{"key":"file_001.txt","version_id":"107251212644161","latest":false}
{"key":"file_001.txt","version_id":"107251212651329","latest":false}
{"key":"file_001.txt","version_id":"107251217133185","latest":false}
{"key":"file_001.txt","version_id":"107251217138817","latest":false}
{"key":"file_001.txt","version_id":"107251217145217","latest":false}
{"key":"file_001.txt","version_id":"107251217150913","latest":false}
{"key":"file_001.txt","version_id":"107251217156609","latest":false}
{"key":"file_001.txt","version_id":"107251217163649","latest":false}
{"key":"file_001.txt","version_id":"107251217331201","latest":false}
{"key":"file_001.txt","version_id":"107251217343617","latest":false}
{"key":"file_001.txt","version_id":"107251217413505","latest":false}
{"key":"file_001.txt","version_id":"107251217422913","latest":false}
{"key":"file_001.txt","version_id":"107251217428289","latest":false}
{"key":"file_001.txt","version_id":"107251217433537","latest":false}
{"key":"file_001.txt","version_id":"107251344110849","latest":true}
// ...
I thought I could just run a job that cleans up the old versions on a regular basis, but I've tried to delete the old versions and it fails with an error:
# Try deleting an old version for the file_001.txt key
aws --endpoint-url $HCP_HOST s3api delete-object --bucket $HCP_BUCKET --key "file_001.txt" --version-id 107250810359745
# Error
An error occurred (NotImplemented) when calling the DeleteObject operation:
Only the current version of an object can be deleted.
I've tested this using MinIO and AWS S3 and my use-case works perfectly fine on both of those platforms.
Is there anything I'm doing incorrectly, or is there a setting in HCP that I'm missing that could make it so I can overwrite objects on sync while retaining no previous versions? Alternatively, is there a way to manually delete the previous versions?
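For reference, the cleanup job mentioned above would normally look something like the sketch below (the endpoint and bucket name are placeholders). It works against AWS S3 and MinIO, but on HCP every versioned delete currently comes back with the NotImplemented error shown earlier.
# Sketch of the version-cleanup job described above. Endpoint and bucket are
# placeholders; on HCP each delete_object call with a VersionId is rejected
# with NotImplemented, as shown in the question.
import boto3

s3 = boto3.client("s3", endpoint_url="https://hcp.example.com")
bucket = "my-bucket"

paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=bucket):
    for version in page.get("Versions", []):
        if not version["IsLatest"]:
            # delete every non-current version of the object
            s3.delete_object(
                Bucket=bucket,
                Key=version["Key"],
                VersionId=version["VersionId"],
            )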

Add file name and timestamp into each record in BigQuery using Dataflow

I have a few .txt files with JSON data to be loaded into a Google BigQuery table. Along with the columns in the text files, I need to insert the filename and the current timestamp for each row. This is in GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the file path and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the filepath to ReadFromText, and call a FileNameReadFunction ParDo.
(p
 | "read from file" >> ReadFromText(known_args.input)
 | "parse" >> beam.Map(json.loads)
 | "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
 | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
       known_args.output,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in Dataflow/apache beam - how to access current filename when passing in pattern? but I can't make it quite work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename,line) tuples.
To include the filename and timestamp in your output JSON record, you could change your "parse" step to something like the following (it needs from datetime import datetime at the top):
| "parse" >> beam.Map(lambda kv: {      # kv is a (filename, line) tuple
      **json.loads(kv[1]),
      "filename": kv[0],
      "timestamp": datetime.utcnow().isoformat()})  # ISO string loads cleanly into BigQuery

Unable to load CSV file from GCS into BigQuery

I am unable to load a 500 MB CSV file from Google Cloud Storage into BigQuery; I get this error:
Errors:
Too many errors encountered. (error code: invalid)
Job ID xxxx-xxxx-xxxx:bquijob_59e9ec3a_155fe16096e
Start Time Jul 18, 2016, 6:28:27 PM
End Time Jul 18, 2016, 6:28:28 PM
Destination Table xxxx-xxxx-xxxx:DEV.VIS24_2014_TO_2017
Write Preference Write if empty
Source Format CSV
Delimiter ,
Skip Leading Rows 1
Source URI gs://xxxx-xxxx-xxxx-dev/VIS24 2014 to 2017.csv.gz
I gzipped the 500 MB CSV file to .csv.gz before uploading it to GCS. Please help me solve this issue.
The internal details for your job show that there was an error reading row #1 of your CSV file. You'll need to investigate further, but it could be that you have a header row that doesn't conform to the schema of the rest of the file, so we're trying to parse a string in the header as an integer or boolean or something like that. You can set the skipLeadingRows property to skip such a row.
Other than that, I'd check that the first row of your data matches the schema you're attempting to import with.
Also, the error message you received is unfortunately very unhelpful, so I've filed a bug internally to make the error you received in this case more helpful.
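For reference, here is a minimal sketch of setting that property through the Python client library; the source URI and the destination table below are placeholders based on the job details rather than the asker's actual values:
# Sketch only: the GCS URI and destination table are placeholder assumptions.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    field_delimiter=",",
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,  # matches "Write if empty"
)
client.load_table_from_uri(
    "gs://my-bucket/VIS24_2014_to_2017.csv.gz",  # gzipped CSV is accepted by load jobs
    "DEV.VIS24_2014_TO_2017",
    job_config=job_config,
).result()  # .result() raises if the load job reports errors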

BigQuery console api "Cannot start a job without a project id"

I can run a SQL query in the BigQuery browser tool, and I installed the bq tool on CentOS and registered it. I can now connect to BigQuery and show the datasets or get table data with the head method, but when I run a query from the bq tool I get "BigQuery error in query operation: Cannot start a job without a project id." I searched on Google but found nothing helpful.
Has anyone run a SELECT query via "This is BigQuery CLI v2.0.1"?
BigQuery> ls
   projectId     friendlyName
 -------------- --------------
  XXXX           API Project
BigQuery> show publicdata:samples.shakespeare
Table publicdata:samples.shakespeare
   Last modified                  Schema                 Total Rows   Total Bytes   Expiration
 ----------------- ------------------------------------ ------------ ------------- ------------
  02 May 02:47:25   |- word: string (required)            164656       6432064
                    |- word_count: integer (required)
                    |- corpus: string (required)
                    |- corpus_date: integer (required)
BigQuery> query "SELECT title FROM [publicdata:samples.wikipedia] LIMIT 10 "
BigQuery error in query operation: Cannot start a job without a project id.
In order to run a query, you need to provide a project id, which is the project that gets billed for the query (there is a free quota of 25GB/month, but we still need a project to attribute the usage to). You can specify a project either with the --project_id flag or by setting a default project by running gcloud config set project PROJECT_ID. See the docs for bq, especially the 'Working with projects' section.
Also it sounds like you may have an old version of bq. The most recent can be downloaded here: https://cloud.google.com/sdk/docs/
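The same requirement applies outside the CLI: whichever client you use has to be told which project to bill. A minimal sketch with the Python client library; the project id is a placeholder:
# Sketch only: the project id is a placeholder for the project to bill.
from google.cloud import bigquery

client = bigquery.Client(project="my-billing-project")
rows = client.query(
    "SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10"
).result()
for row in rows:
    print(row["word"])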