BigQuery: How to autoreload table with new storage JSON files? - google-bigquery

I have just created a BigQuery table by linking the available JSON files in Google Cloud Storage, but I do not see any option to auto-reload the table rows when new files are added to the Google Cloud Storage folder or bucket.
Currently, I have to go to the BigQuery console and delete & recreate the same table to load the new files. This is not scalable for us because we run a cron job against the BigQuery API. How can I auto-reload data in BigQuery?
Thanks

When you define an external table on top of files in Google Cloud Storage, you can use a wildcard for the source location, so your table will represent all files that match.
Then, when you query such a table, you can use the _file_name pseudo-column, which "tells" you which file a given row came from:
SELECT
  _file_name AS file,
  *
FROM `yourTable`
This way, whenever you add a new file in GCS, you will get it in the table "automatically".
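If you want to set this up programmatically rather than through the console, here is a minimal sketch with the Python client (the project, dataset, table, and bucket names are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# External configuration pointing at every JSON file that matches the wildcard
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://your-bucket/your-folder/*.json"]
external_config.autodetect = True

table = bigquery.Table("your-project-id.your_dataset.yourTable")
table.external_data_configuration = external_config
client.create_table(table)  # queries now see whatever files currently match the wildcard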

With Google Cloud Functions you can automate BigQuery each time you receive a new file:
Create a new function at https://console.cloud.google.com/functions/add
Point "bucket" to the one receiving files.
Code-wise, add the BigQuery client as a dependency in package.json:
{
  "dependencies": {
    "@google-cloud/bigquery": "^0.9.6"
  }
}
And in index.js you can act on the new file in any appropriate way:
const BigQuery = require('@google-cloud/bigquery');
const bigQuery = BigQuery({ projectId: 'your-project-id' });

// Entry point: triggered for every object change in the bucket
exports.processFile = (event, callback) => {
  console.log('Processing: ' + JSON.stringify(event.data));
  query(event.data);
  callback();
};

function query(data) {
  const filename = data.name.split('/').pop();
  const full_filename = `gs://${data.bucket}/${data.name}`;
  // if you want to run a query:
  const query = '...';
  bigQuery.query({
    query: query,
    useLegacySql: false
  });
}
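If the goal is to append each new file to an existing table rather than run a query, a load job is the more direct route. A rough equivalent sketch in Python (the entry point, dataset, and table names are placeholders; assumes the google-cloud-bigquery package and a GCS-triggered background function):
from google.cloud import bigquery

def process_file(event, context):  # background Cloud Function triggered by the bucket
    uri = 'gs://{}/{}'.format(event['bucket'], event['name'])
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    # Append only the newly uploaded file to the target table
    client.load_table_from_uri(uri, 'your_dataset.your_table', job_config=job_config).result()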

Related

How can I write Parquet files with int64 timestamps (instead of int96) from AWS Kinesis Firehose?

Why do int96 timestamps not work for me?
I want to read the Parquet files with S3 Select. S3 Select does not support timestamps saved as int96 according to the documentation. Also, storing timestamps in parquet as int96 is deprecated.
What did I try?
Firehose uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe for serialization to Parquet. (The exact Hive version used by AWS is unknown.) While reading the Hive code, I came across the config switch hive.parquet.write.int64.timestamp and tried to apply it by changing the SerDe parameters in the AWS Glue table config.
Unfortunately, this did not make a difference, and my timestamp column is still stored as int96 (checked by downloading a file from S3 and inspecting it with parq my-file.parquet --schema).
While I was not able to make Firehose write int64 timestamps, I found a workaround to convert the int96 timestamps returned by the S3 Select query result into something useful.
I used the approach described in "How parquet stores timestamp data in S3?" and "boto 3 - loosing date format" to write the following conversion function in TypeScript:
const hideTimePart = BigInt(64);
const maskToHideJulianDayPart = BigInt('0xffffffffffffffff');
const unixEpochInJulianDay = 2_440_588;
const nanoSecsInOneSec = BigInt(1_000_000_000);
const secsInOneDay = 86_400;
const milliSecsInOneSec = 1_000;

export const parseS3SelectParquetTimeStamp = (ts: string) => {
  const tsBigInt = BigInt(ts);
  // High bits hold the Julian day, the low 8 bytes hold nanoseconds into that day
  const julianDay = Number(tsBigInt >> hideTimePart);
  const secsSinceUnixEpochToStartOfJulianDay = (julianDay - unixEpochInJulianDay) * secsInOneDay;
  const nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;
  const secsSinceStartOJulianDay = Number(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);
  return new Date(
    (secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOJulianDay) * milliSecsInOneSec,
  );
};
parseS3SelectParquetTimeStamp('45377606915595481758988800'); // Result: '2022-12-11T20:58:33.000Z'
Note that, contrary to what one might expect, the timestamps returned by S3 Select store the Julian day part at the beginning and not in the last 4 bytes; the nanosecond time part is stored in the last 8 bytes. Furthermore, the byte order is not reversed.
(Regarding the Julian day constant 2440588: using 2440587.5 would be wrong in this context, according to https://docs.oracle.com/javase/8/docs/api/java/time/temporal/JulianFields.html)
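For cross-checking the layout described above, the same decoding can be sketched in Python (the function name is mine; the sample value is the one from the example):
def parse_s3_select_parquet_timestamp(ts: str) -> float:
    value = int(ts)
    julian_day = value >> 64                 # high bits: Julian day number
    nanos = value & 0xFFFFFFFFFFFFFFFF       # low 8 bytes: nanoseconds into that day
    return (julian_day - 2_440_588) * 86_400 + nanos / 1_000_000_000  # Unix seconds

parse_s3_select_parquet_timestamp('45377606915595481758988800')  # 1670792313.0 == 2022-12-11T20:58:33Z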

Headers Not Loading on Google Big Query Data Export

I'm trying to export a BigQuery table to Cloud Storage and, for some reason, even though I have header=true, no column headers are added to my export file upon creation. I've deleted the file and re-tried several times with no luck. Any feedback is greatly appreciated.
EXPORT DATA
  OPTIONS (
    uri = 'gs://data-feeds-public/inventory_used_google_merchant*.csv',
    format = 'csv',
    overwrite = true,
    header = true)
AS
SELECT * FROM `mytable`
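For what it's worth, the equivalent export can also be run as an extract job via the Python client, where print_header controls the CSV header row; a minimal sketch (the project and dataset names here are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    print_header=True,  # emit the column headers in each exported file
)
client.extract_table(
    "your-project.your_dataset.mytable",
    "gs://data-feeds-public/inventory_used_google_merchant*.csv",
    job_config=job_config,
).result()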

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a Bigquery Data Transfer Service Job Manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python Client to create on-demand Data Transfer Service Jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my Source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
This snippet works for me (replace the YOUR-... placeholders with your data):
from google.cloud import bigquery_datatransfer
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="YOUR-TRANSFER-NAME",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://PATH-TO-YOUR-DATA/*.csv",
        "destination_table_name_template": "YOUR-TABLE-NAME",
        "file_format": "CSV",
        "skip_leading_rows": "1",
        "delete_source_files": True
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)

print(f"Created transfer config: {transfer_config.name}")
In this example, table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will crash with error "Not found: Table YOUR-TABLE-NAME".
I used these packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the delete_source_files attribute in params. From the docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.
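Since the question mentions on-demand jobs: once the config exists, you can also trigger a run immediately. Something like the following should work, continuing the snippet above:
import time
from google.protobuf.timestamp_pb2 import Timestamp

# Trigger an immediate run of the transfer config created above
transfer_client.start_manual_transfer_runs(
    bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config.name,
        requested_run_time=Timestamp(seconds=int(time.time())),
    )
)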

API call to bigquery.jobs.insert failed: Not Found: Dataset

I'm working on importing CSV files from Google Drive into BigQuery through Apps Script.
But when the code gets to the part where it needs to send the job to BigQuery, it states that the dataset is not found, even though the correct dataset ID is already in the code.
Thank you very much!
If you are using the Google example code, this error is more than a copy-and-paste issue. Validate that you have the following:
const projectId = 'XXXXXXXX';
const datasetId = 'YYYYYYYY';
const csvFileId = '0BwzA1Orbvy5WMXFLaTR1Z1p2UDg';

try {
  // "table" is the table resource from the linked example; its tableReference
  // must use this same projectId/datasetId, and the dataset must already exist.
  table = BigQuery.Tables.insert(table, projectId, datasetId);
  Logger.log('Table created: %s', table.id);
} catch (error) {
  Logger.log('Unable to create table: %s', error);
}
according to the documentation at:
https://developers.google.com/apps-script/advanced/bigquery
Also validate that the BigQuery advanced service is enabled under the Services section of your Apps Script project.

How to work with exported Stack Driver logs from Google Cloud Projects into BigQuery

I have created an "export" from my Stackdriver Logging page in my Google Cloud project. I configured the export to go to a BigQuery dataset.
When I go to BigQuery, I see the dataset.
There are no tables in my dataset, since Stackdriver export created the BigQuery dataset for me.
How do I see the data that was exported? Since there are no tables I cannot perform a "select * from X". I could create a table but I don't know what columns to add nor do I know how to tell Stackdriver logging to write to that table.
I must be missing a step.
Google has a short 1-minute video on exporting to BigQuery, but it stops exactly at the point where I am in the process.
When a new Stackdriver export is defined, it will then start to export newly written log records to the target sink (BQ in this case). As per the documentation found here:
https://cloud.google.com/logging/docs/export/
it states:
Since exporting happens for new log entries only, you cannot export log entries that Logging received before your sink was created.
If one wants to export existing logs to a file, one can use gcloud (or API) as described here:
https://cloud.google.com/logging/docs/reference/tools/gcloud-logging#reading_log_entries
The output of this "dump" of existing log records can then be used in whatever manner you see fit. For example, it could be imported into a BQ table.
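For example, a rough sketch of that route with the Python clients (the filter, dataset, and table names are placeholders; assumes google-cloud-logging and google-cloud-bigquery):
from google.cloud import bigquery
from google.cloud import logging as cloud_logging

logging_client = cloud_logging.Client()
bq_client = bigquery.Client()

# Dump existing entries matching a filter into plain dicts
rows = [
    {"timestamp": entry.timestamp.isoformat(), "log_name": entry.log_name, "payload": str(entry.payload)}
    for entry in logging_client.list_entries(filter_='severity>=WARNING')
]

# Import the dump into a BigQuery table
job_config = bigquery.LoadJobConfig(autodetect=True)
bq_client.load_table_from_json(rows, "your_dataset.historical_logs", job_config=job_config).result()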
To export logs from Stackdriver into BigQuery, you have to create a logging sink, either in code or in the GCP Logging UI.
When creating the sink, add a filter and point its destination at a BigQuery dataset:
https://cloud.google.com/logging/docs/export/configure_export_v2
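If you prefer to create the sink in code, a minimal sketch with the Python logging client (the sink, project, and dataset names are placeholders):
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
# Destination format for BigQuery sinks
destination = "bigquery.googleapis.com/projects/your-project-id/datasets/your_dataset"
sink = client.sink("my-bigquery-sink", filter_="severity>=INFO", destination=destination)
sink.create()  # remember to grant the sink's writer identity access to the dataset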
Then write log entries to Stackdriver from your code, for example in Java:
// "logging" is an initialized com.google.cloud.logging.Logging client and
// limitMap(...) is a helper defined elsewhere in the original code.
public static void writeLog(Severity severity, String logName, Map<String, String> jsonMap) {
  List<Map<String, String>> maps = limitMap(jsonMap);
  for (Map<String, String> map : maps) {
    LogEntry logEntry = LogEntry.newBuilder(Payload.JsonPayload.of(map))
        .setSeverity(severity)
        .setLogName(logName)
        .setResource(monitoredResource)
        .build();
    logging.write(Collections.singleton(logEntry));
  }
}

private static MonitoredResource monitoredResource =
    MonitoredResource.newBuilder("global")
        .addLabel("project_id", logging.getOptions().getProjectId())
        .build();
See also: https://cloud.google.com/bigquery/docs/writing-results