Headers Not Loading on Google BigQuery Data Export - google-bigquery

I'm trying to export a BigQuery table to Cloud Storage, and for some reason, even though I have header = true, no column headers are being added to my export file upon creation. I've deleted the file and retried several times with no luck. Any feedback is greatly appreciated.
EXPORT DATA
OPTIONS (
  uri = 'gs://data-feeds-public/inventory_used_google_merchant*.csv',
  format = 'csv',
  overwrite = true,
  header = true)
AS
SELECT * FROM `mytable`

Related

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
This snippet works for me (replace the YOUR-... placeholders with your data):
from google.cloud import bigquery_datatransfer
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="YOUR-TRANSFER-NAME",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://PATH-TO-YOUR-DATA/*.csv",
        "destination_table_name_template": "YOUR-TABLE-NAME",
        "file_format": "CSV",
        "skip_leading_rows": "1",
        "delete_source_files": True,
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)

print(f"Created transfer config: {transfer_config.name}")
In this example, the table YOUR-TABLE-NAME must already exist in BigQuery; otherwise the transfer will fail with the error "Not found: Table YOUR-TABLE-NAME".
I used these packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the delete_source_files attribute in params. From the docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.

API call to bigquery.jobs.insert failed: Not Found: Dataset

I'm working on importing CSV files from Google Drive into BigQuery through Apps Script.
But when the code gets to the part where it needs to send the job to BigQuery, it states that the dataset is not found, even though the correct dataset ID is already in the code.
Thank you very much!
If you are using the Google example code, this error usually indicates more than a copy-and-paste problem. Validate that you have the following:
const projectId = 'XXXXXXXX';
const datasetId = 'YYYYYYYY';
const csvFileId = '0BwzA1Orbvy5WMXFLaTR1Z1p2UDg';

try {
  table = BigQuery.Tables.insert(table, projectId, datasetId);
  Logger.log('Table created: %s', table.id);
} catch (error) {
  Logger.log('unable to create table');
}
according to the documentation at this link:
https://developers.google.com/apps-script/advanced/bigquery
Also validate that the BigQuery service is enabled in the Services section.

Fetching data in Splunk using REST API

I want to import XML data into Splunk using the .py script below.
My concerns are:
1. Can I configure the .py script's output to index data directly in Splunk using inputs.conf, or do I need to save the output to a .csv file first? If the latter, can anyone suggest an approach so that the data does not get changed after being stored in a new .csv file?
2. How can I configure the .py file to fetch data every 5 minutes?
import requests
import xmltodict
import json

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)
print(content)
If you put your Python script into a [script://] stanza in inputs.conf, then not only can you have Splunk launch the script automatically every 5 minutes, but anything the script writes to stdout will be indexed in Splunk.
[script:///path/to/the/script.py]
interval = */5 * * * *
index = main
sourcetype = foo
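As a sketch, a script suited to that stanza could parse the XML and print one JSON event per line to stdout, which Splunk then indexes. This uses only the standard library, with an inline sample standing in for the live URL; the PLANT/COMMON/PRICE tags are taken from the sample plant_catalog.xml:

```python
import json
import xml.etree.ElementTree as ET

# Inline sample standing in for the body of the HTTP response
SAMPLE_XML = """<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><PRICE>$2.44</PRICE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><PRICE>$9.37</PRICE></PLANT>
</CATALOG>"""

root = ET.fromstring(SAMPLE_XML)
events = []
for plant in root.findall("PLANT"):
    # Flatten each XML record into one JSON object
    event = {child.tag: child.text for child in plant}
    events.append(event)
    print(json.dumps(event))  # one event per line for Splunk to index
```

Printing one JSON object per line avoids the intermediate .csv step entirely, so nothing gets reshaped on the way into the index.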

BigQuery: How to autoreload table with new storage JSON files?

I have just created a BigQuery table by linking available JSON files in Google Cloud Storage, but I do not see any option to auto-reload the table rows when new files are added to the Google Cloud Storage folder or bucket.
Currently, I have to go to the BigQuery console and then delete and recreate the same table to load new files. This solution is not scalable for us because we run a cron job on the BigQuery API. How can I auto-reload data in BigQuery?
Thanks
When you define an external table on top of files in Google Cloud Storage, you can use a wildcard in the Source Location, so your table will represent all files that match.
Then, when you query such a table, you can use the _file_name pseudo-column, which tells you which file a given row came from:
SELECT
  _file_name AS file,
  *
FROM `yourTable`
This way, whenever you add a new file in GCS, you will get it in the table "automatically".
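As a rough local illustration of how the wildcard selects files (BigQuery itself allows a single * in the Source Location, matched against object names), shell-style matching behaves like this; the bucket contents below are made up:

```python
from fnmatch import fnmatch

# Hypothetical object names in the GCS bucket
objects = [
    "exports/data_2024_01.json",
    "exports/data_2024_02.json",
    "exports/notes.txt",
]

# Wildcard as it would appear in the external table's Source Location
pattern = "exports/data_*.json"

matched = [name for name in objects if fnmatch(name, pattern)]
print(matched)  # only the two data_*.json files match
```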
With Google Cloud Functions you can automate BigQuery each time you receive a new file:
Create a new function at https://console.cloud.google.com/functions/add
Point "bucket" to the one receiving files.
Codewise, declare the BigQuery dependency in package.json:
{
  "dependencies": {
    "@google-cloud/bigquery": "^0.9.6"
  }
}
And in index.js you can act on the new file in any appropriate way:
var BigQuery = require('@google-cloud/bigquery');
var bigQuery = BigQuery({ projectId: 'your-project-id' });

exports.processFile = (event, callback) => {
  console.log('Processing: ' + JSON.stringify(event.data));
  query(event.data);
  callback();
};

function query(data) {
  const filename = data.name.split('/').pop();
  const full_filename = `gs://${data.bucket}/${data.name}`;
  // if you want to run a query (use a local name so the
  // query() function itself is not overwritten):
  const sql = '...';
  bigQuery.query({
    query: sql,
    useLegacySql: false
  });
}

How do I skip header row using Python glcoud.bigquery client?

I have a daily GCP billing export file in CSV format containing GCP billing details. This export contains a header row. I've set up a load job as follows (summarized):
from google.cloud import bigquery
job = client.load_table_from_storage(job_name, dest_table, source_gs_file)
job.source_format = 'CSV'
job.skipLeadingRows=1
job.begin()
This job produces the error:
Could not parse 'Start Time' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
This error means that it is still trying to parse the header row even though I specified skipLeadingRows=1. What am I doing wrong here?
You should use skip_leading_rows instead of skipLeadingRows when using the Python SDK.
skip_leading_rows: Number of rows to skip when reading data (CSV only).
Reference: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html
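For intuition, skip_leading_rows=1 simply tells the load job to ignore the first physical row of the file. The same idea, sketched locally with the standard library rather than the BigQuery client (the sample rows mimic the billing export's Start Time column):

```python
import csv
import io
from datetime import datetime

# Sample CSV with a header row, like the billing export
data = io.StringIO(
    "Start Time\n"
    "2017-02-04T00:00:00-08:00\n"
    "2017-02-03T00:00:00-08:00\n"
)

reader = csv.reader(data)
next(reader)  # skip the header row, the local analogue of skip_leading_rows=1

# With the header gone, every remaining value parses as a timestamp
timestamps = [
    datetime.strptime(row[0], "%Y-%m-%dT%H:%M:%S%z") for row in reader
]
print(len(timestamps))  # 2
```

Without the next(reader) call, the literal string 'Start Time' would hit the parser first, which is exactly the error the load job reported.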
I cannot reproduce this. I took the example you gave ("2017-02-04T00:00:00-08:00"), added 3 rows/timestamps to a csv file, uploaded it to GCS, and finally created an empty table in BigQuery with one column of type TIMESTAMP.
File contents:
2017-02-04T00:00:00-08:00
2017-02-03T00:00:00-08:00
2017-02-02T00:00:00-08:00
I then ran the example Python script found here, and it successfully loaded the file into the table:
Loaded 3 rows into timestamp_test:gcs_load_test.
import uuid

from google.cloud import bigquery


def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(job_name, table, source)
    job.begin()

    wait_for_job(job)  # wait_for_job is a helper defined in the same example script

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))