How do I skip the header row using the Python google.cloud.bigquery client? - google-bigquery

I have a daily GCP billing export file in CSV format containing GCP billing details. This export contains a header row. I've set up a load job as follows (summarized):
from google.cloud import bigquery
job = client.load_table_from_storage(job_name, dest_table, source_gs_file)
job.source_format = 'CSV'
job.skipLeadingRows=1
job.begin()
This job produces the error:
Could not parse 'Start Time' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
This error means that it is still trying to parse the header row even though I specified skipLeadingRows=1. What am I doing wrong here?

You should use skip_leading_rows instead of skipLeadingRows when using the Python SDK.
skip_leading_rows: Number of rows to skip when reading data (CSV only).
Reference: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html
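For example, the fix in the question's snippet is just the snake_case attribute name; a minimal sketch, assuming the same job_name, dest_table and source_gs_file as above:
job = client.load_table_from_storage(job_name, dest_table, source_gs_file)
job.source_format = 'CSV'
job.skip_leading_rows = 1  # snake_case property; mapped to skipLeadingRows in the API request
job.begin()
In current versions of the client library the same option is set on a job config instead, e.g. bigquery.LoadJobConfig(source_format='CSV', skip_leading_rows=1) passed to client.load_table_from_uri.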

I cannot reproduce this. I took the example you gave ("2017-02-04T00:00:00-08:00"), added 3 rows/timestamps to a csv file, uploaded it to GCS, and finally created an empty table in BigQuery with one column of type TIMESTAMP.
File contents:
2017-02-04T00:00:00-08:00
2017-02-03T00:00:00-08:00
2017-02-02T00:00:00-08:00
I then ran the example Python script found here, and it successfully loaded the file into the table:
Loaded 3 rows into timestamp_test:gcs_load_test.
import uuid
from google.cloud import bigquery

def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())
    job = bigquery_client.load_table_from_storage(job_name, table, source)
    job.begin()
    wait_for_job(job)  # helper from the same sample script; polls until the job completes
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
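For reference, the call that produced the output above would look something like this (the GCS path is a placeholder):
load_data_from_gcs('timestamp_test', 'gcs_load_test', 'gs://your-bucket/timestamps.csv')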

Related

BQ client.load_table_from_uri behaves differently in plain Python vs. in Airflow

I am trying to load a CSV into BQ using a custom operator in Airflow.
My custom operator uses:
load_job_config = bigquery.LoadJobConfig(
    schema=self.schema_fields,
    skip_leading_rows=self.skip_leading_rows,
    source_format=bigquery.SourceFormat.CSV
)
load_job = client.load_table_from_uri(
    'gs://' + self.source_bucket + "/" + self.source_object,
    self.dsp_tmp_dataset_table,
    job_config=load_job_config
)
The issue I am facing is that I always get this error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table nonprod-cloud-composer:dsp_data_transformation.tremorvideo_daily_datafeed. Field Date has changed type from TIMESTAMP to DATE
The exact same code works fine when run outside of Airflow as a standalone Python script.
I am using exactly the same schema object and the same source CSV file; only the environment is different.
Below are the high-level steps I followed:
1. Created the table in BQ.
2. Loaded data into it with:
LOAD DATA OVERWRITE XXXX
FROM FILES (
  format = 'CSV',
  uris = ['gs://xxx.csv']);
This worked fine and the data was loaded into the table.
3. Truncated the table and tried to run the custom operator that has the code listed above. This is where I faced the errors.
4. Created a simple standalone Python program to test the BQ load job, and that works fine too.
It's just that whenever the same load job is triggered from Airflow, the schema detection fails and leads to all sorts of errors.
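One thing a sketch like the following can help rule out (an assumption on my part, not a confirmed fix): the error says the Date field arrives as TIMESTAMP while the table expects DATE, which usually means the schema handed to LoadJobConfig does not match the existing table. Pinning the schema explicitly with SchemaField objects and disabling autodetection makes that comparison explicit; the field names and types below are placeholders, not taken from the question:
from google.cloud import bigquery

# Hypothetical schema: field names and types are placeholders.
schema_fields = [
    bigquery.SchemaField("Date", "DATE"),  # must match the existing table's column type exactly
    bigquery.SchemaField("Impressions", "INT64"),
    bigquery.SchemaField("Revenue", "FLOAT64"),
]

load_job_config = bigquery.LoadJobConfig(
    schema=schema_fields,
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=False,  # keep autodetection from overriding the explicit schema
)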

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
This snippet works for me (replace the YOUR-... placeholders with your data):
from google.cloud import bigquery_datatransfer
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="YOUR-TRANSFER-NAME",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://PATH-TO-YOUR-DATA/*.csv",
        "destination_table_name_template": "YOUR-TABLE-NAME",
        "file_format": "CSV",
        "skip_leading_rows": "1",
        "delete_source_files": True
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)

print(f"Created transfer config: {transfer_config.name}")
In this example, the table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will fail with the error "Not found: Table YOUR-TABLE-NAME".
I used these packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the delete_source_files attribute in params. From the docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.
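The config above is created as a regular (scheduled) transfer; if you also want to kick it off on demand from Python, the same client exposes start_manual_transfer_runs. A minimal sketch, reusing transfer_client and transfer_config from the snippet above:
import datetime

# Trigger a one-off run of the transfer config created above.
run_request = bigquery_datatransfer.StartManualTransferRunsRequest(
    parent=transfer_config.name,
    requested_run_time=datetime.datetime.now(datetime.timezone.utc),
)
response = transfer_client.start_manual_transfer_runs(run_request)
for run in response.runs:
    print(f"Started manual run: {run.name}")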

Fetching data into Splunk using a REST API

I want to import XML data into Splunk using the .py script below.
My concerns are:
Can I directly configure the .py script's output to index data in Splunk using inputs.conf, or do I need to save the output to a .csv file first? If so, can anyone please suggest an approach so that the data does not get changed after storing it in a new .csv file?
How can I configure that .py file to fetch data every 5 minutes?
import requests
import xmltodict
import json

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)
# Emit the parsed document as JSON so each fetch is a well-formed event on stdout.
print(json.dumps(content))
If you put your Python script into a [script://] stanza in inputs.conf, then not only can you have Splunk launch the script automatically every 5 minutes, but anything the script writes to stdout will be indexed in Splunk.
[script:///path/to/the/script.py]
interval = */5 * * * *
index = main
sourcetype = foo

Google Cloud Storage Joining multiple csv files

I exported a dataset from Google BigQuery to Google Cloud Storage; given the size of the file, BigQuery exported it as 99 csv files.
However, now I want to connect to my GCP bucket and perform some analysis with Spark, and I need to join all 99 files into a single large csv file to run my analysis.
How can this be achieved?
BigQuery splits the exported data into several files if it is larger than 1 GB. But you can merge these files with the gsutil tool; check the official documentation to learn how to perform object composition with gsutil.
As BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:
gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
The downside of this option is that the header row of each .csv file will be included in the composite object. But you can avoid this by modifying the job config to set the print_header parameter to False.
Here is a Python code sample, but you can use any other BigQuery client library:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = 'yourBucket'

project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

job_config = bigquery.job.ExtractJobConfig(print_header=False)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
Finally, remember to create a .csv file containing just the header row and include it as the first component, so the final composite object still has a header.
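If you would rather do the composition from Python than gsutil, here is a minimal sketch using the google-cloud-storage client; the bucket and object names are placeholders, and the 32-component limit per compose call mentioned above still applies, hence the batching:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('example-bucket')  # placeholder bucket name

# Placeholder object names: a one-line CSV holding only the header row,
# plus the header-less shards exported with print_header=False.
header = bucket.blob('header.csv')
shards = sorted(bucket.list_blobs(prefix='file-'), key=lambda b: b.name)

# Seed the final object with the header, then append shards in batches of
# at most 31 so each compose call stays within the 32-component limit.
composite = bucket.copy_blob(header, bucket, 'composite.csv')
for i in range(0, len(shards), 31):
    composite.compose([composite] + shards[i:i + 31])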
I got kind of tired of doing multiple recursive compose operations, stripping headers, etc., especially when dealing with 3500 split gzipped csv files.
I therefore wrote a CSV merge tool (sorry, Windows only though) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download the latest release, unzip, and use.
I also wrote an article with a use case and usage example for it:
https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV as it tends to have fewer data issues.

Export a BigQuery table locally

I have a BigQuery table that I would like to work on using a pandas DataFrame. The table is big, and using the pd.read_gbq() function gets stuck and does not manage to retrieve the data.
I implemented a chunking mechanism using pandas that works, but it takes a long time to fetch (an hour for 9M rows), so I'm looking for a new solution.
I would like to download the table as a csv file and then read it. I saw this code in the Google Cloud docs:
# from google.cloud import bigquery
# client = bigquery.Client()
# bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
but all the URIs shown in the examples are Google Cloud Storage bucket URIs and not local ones, and I didn't manage to download it (I tried to put a local URI, which gave me an error).
Is there a way to download the table's data as a csv file without using a bucket?
As mentioned here
The limitation with bigquery export is - You cannot export data to a local file or to Google Drive, but you can save query results to a local file. The only supported export location is Cloud Storage.
Is there a way to download the table's data as csv file without using a bucket?
So now that we know query results can be saved to a local file, you can use something like this:
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT * FROM `project_name.dataset_name.table_name`')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
This rows variable will contain all the table rows, and you can either use it directly or write it to a local file.
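For example, a minimal sketch of writing the result to a local CSV with the standard library (the query and output path are placeholders):
import csv

from google.cloud import bigquery

client = bigquery.Client()
rows = client.query(
    'SELECT * FROM `project_name.dataset_name.table_name`').result()

# Write the result to a local CSV file, header row first.
with open('table_export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([field.name for field in rows.schema])
    for row in rows:
        writer.writerow(row.values())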