How can I write Parquet files with int64 timestamps (instead of int96) from AWS Kinesis Firehose?

Why do int96 timestamps not work for me?
I want to read the Parquet files with S3 Select. S3 Select does not support timestamps saved as int96 according to the documentation. Also, storing timestamps in parquet as int96 is deprecated.
What did I try?
Firehose uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe for serialization to Parquet. (The exact Hive version used by AWS is unknown.) While reading the Hive code, I came across the following config switch: hive.parquet.write.int64.timestamp. I tried to apply this config switch by changing the Serde parameters in the AWS Glue table config:
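(A rough boto3 sketch of that change, assuming the Glue table is updated in place; the database and table names are placeholders:)
import boto3

glue = boto3.client('glue')

# Placeholder names for the Glue database/table backing the Firehose stream.
database = 'my_database'
table_name = 'my_firehose_table'

# Fetch the current table definition and add the Hive switch to the SerDe parameters.
table = glue.get_table(DatabaseName=database, Name=table_name)['Table']
serde_info = table['StorageDescriptor']['SerdeInfo']
serde_info.setdefault('Parameters', {})['hive.parquet.write.int64.timestamp'] = 'true'

# update_table only accepts TableInput fields, so strip the read-only keys.
allowed = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
           'PartitionKeys', 'TableType', 'Parameters'}
table_input = {k: v for k, v in table.items() if k in allowed}

glue.update_table(DatabaseName=database, TableInput=table_input)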
Unfortunately, this did not make a difference and my timestamp column is still stored as int96 (checked by downloading a file from S3 and inspecting it with parq my-file.parquet --schema)

While I was not able to make Firehose write int64 timestamps, I found a workaround to convert the int96 timestamps returned by the S3 Select query result into something useful.
I used the approach described in
How parquet stores timestamp data in S3?
boto 3 - loosing date format
to write the following conversion function in TypeScript:
const hideTimePart = BigInt(64);
const maskToHideJulianDayPart = BigInt('0xffffffffffffffff');
const unixEpochInJulianDay = 2_440_588;
const nanoSecsInOneSec = BigInt(1_000_000_000);
const secsInOneDay = 86_400;
const milliSecsInOneSec = 1_000;

export const parseS3SelectParquetTimeStamp = (ts: string) => {
  const tsBigInt = BigInt(ts);
  // upper 4 bytes: Julian day number
  const julianDay = Number(tsBigInt >> hideTimePart);
  const secsSinceUnixEpochToStartOfJulianDay = (julianDay - unixEpochInJulianDay) * secsInOneDay;
  // lower 8 bytes: nanoseconds since the start of that Julian day
  const nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;
  const secsSinceStartOfJulianDay = Number(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);
  return new Date(
    (secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOfJulianDay) * milliSecsInOneSec,
  );
};
parseS3SelectParquetTimeStamp('45377606915595481758988800'); // Result: '2022-12-11T20:58:33.000Z'
Note that, contrary to what one might expect, the timestamps returned by S3 Select store the Julian day part at the beginning and not in the last 4 bytes. The nanosecond time part is stored in the last 8 bytes. Furthermore, the byte order is not reversed.
(Regarding the Julian day constant 2440588: using 2440587.5 would be wrong in this context according to https://docs.oracle.com/javase/8/docs/api/java/time/temporal/JulianFields.html)
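If you consume the S3 Select results from Python instead, the same conversion might look like this (a sketch mirroring the TypeScript function above, not something taken from the original setup):
from datetime import datetime, timezone

UNIX_EPOCH_IN_JULIAN_DAYS = 2_440_588
NANO_SECS_IN_ONE_SEC = 1_000_000_000
SECS_IN_ONE_DAY = 86_400


def parse_s3_select_parquet_timestamp(ts: str) -> datetime:
    ts_int = int(ts)
    julian_day = ts_int >> 64                       # first 4 bytes: Julian day number
    nanos_in_day = ts_int & 0xFFFFFFFFFFFFFFFF      # last 8 bytes: nanoseconds within that day
    secs = (julian_day - UNIX_EPOCH_IN_JULIAN_DAYS) * SECS_IN_ONE_DAY + nanos_in_day // NANO_SECS_IN_ONE_SEC
    return datetime.fromtimestamp(secs, tz=timezone.utc)


print(parse_s3_select_parquet_timestamp('45377606915595481758988800'))
# 2022-12-11 20:58:33+00:00 (same instant as in the TypeScript example)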

Related

Google Cloud Storage Joining multiple csv files

I exported a dataset from Google BigQuery to Google Cloud Storage; given the size of the dataset, BigQuery exported it as 99 csv files.
However now I want to connect to my GCP Bucket and perform some analysis with Spark, yet I need to join all 99 files into a single large csv file to run my analysis.
How can this be achieved?
BigQuery splits the exported data into several files if it is larger than 1GB. But you can merge these files with the gsutil tool; check this official documentation to see how to perform object composition with gsutil.
As BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:
gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object
Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
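If you have more than 32 parts, one way around the limit is to compose in batches. A rough sketch with the google-cloud-storage Python client (bucket, prefix and object names are placeholders):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('example-bucket')

# Collect the exported shards and compose them in batches of 32,
# appending each batch to the growing composite object.
parts = sorted(bucket.list_blobs(prefix='component-obj-'), key=lambda b: b.name)
composite = bucket.blob('composite-object')

composite.compose(parts[:32])
for i in range(32, len(parts), 31):
    composite.compose([composite] + parts[i:i + 31])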
The downside of this option is that the header row of each .csv file will be added to the composite object. But you can avoid this by modifying the job config to set the print_header parameter to False.
Here is a Python sample code, but you can use any other BigQuery Client library:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = 'yourBucket'
project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.job.ExtractJobConfig(print_header=False)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
Finally, remember to compose an empty .csv containing just the header row.
I got kind of tired of doing multiple recursive compose operations, stripping headers, etc., especially when dealing with 3500 split gzipped csv files.
Therefore I wrote a CSV merge tool (sorry, Windows only) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download latest release, unzip and use.
Also wrote an article with a use case and usage example for it:
https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV, as it tends to have fewer data issues.

How to change the name of the Athena results stored in S3?

The results of an Athena query are saved under the query id (a long string) in S3. I was wondering if there's a way to save the results of the query under a pre-specified name (that can later be easily looked up)?
You can do so by a simple AWS Lambda function.
Change names of AWS Athena results stored in S3 bucket
import time
import boto3

client = boto3.client('athena')
s3 = boto3.resource('s3')

# run query
queryStart = client.start_query_execution(
    QueryString='''
        -- PUT YOUR QUERY HERE
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext={
        'Database': 'covid_data'  # YOUR_ATHENA_DATABASE_NAME
    },
    ResultConfiguration={
        # query result output location you mentioned in AWS Athena
        'OutputLocation': 's3://bucket-name-X/folder-Y/'
    }
)

# executes query and waits 3 seconds
queryId = queryStart['QueryExecutionId']
time.sleep(3)

# copies the newly generated csv file with an appropriate name
# query result output location you mentioned in AWS Athena
queryLoc = 'bucket-name-X/folder-Y/' + queryId + '.csv'
# destination location and file name
s3.Object('bucket-name-A', 'report-2018.csv').copy_from(CopySource=queryLoc)

# deletes the Athena-generated csv and its metadata file
s3.Object('bucket-name-X', 'folder-Y/' + queryId + '.csv').delete()
s3.Object('bucket-name-X', 'folder-Y/' + queryId + '.csv.metadata').delete()

print('report-2018.csv generated')
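Instead of a fixed sleep, one could poll until Athena reports the query as finished. A rough sketch that reuses the client and queryId from the script above:
# Poll the query status rather than sleeping for a fixed 3 seconds.
while True:
    status = client.get_query_execution(QueryExecutionId=queryId)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)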
Unfortunately no (at least not yet)! The best way to do this as of now is to write a script that goes through all the results of each run and renames (moves + deletes) all the files in that S3 bucket.
For named queries your results location will be structured as follows:
s3://athena-query-results-<account>-<region>/<query-name>/<year>/<month>/<day>/<UUID>.csv
I don't know of any way to specify the UUID from the client side. But you could look for the newest file within the S3 folder of your named query.
Alternatively, you could use the s3 API or the aws cli to copy the result into a location of your choice.
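For example, a sketch of picking the newest result under a named query's prefix and copying it to a fixed name (the bucket, prefix and target key are placeholders):
import boto3

s3 = boto3.client('s3')
bucket = 'your-athena-results-bucket'
prefix = 'your-named-query/'

# Find the newest CSV result under the named query's prefix.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', [])
csv_results = [o for o in objects if o['Key'].endswith('.csv')]
newest = max(csv_results, key=lambda o: o['LastModified'])

# Copy it to a predictable, pre-specified name.
s3.copy_object(
    Bucket=bucket,
    Key=prefix + 'latest.csv',
    CopySource={'Bucket': bucket, 'Key': newest['Key']},
)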
Does this answer your question?
import boto3

def delete_metadata():
    s3 = boto3.resource('s3')
    client_s3 = boto3.client('s3')
    bucket = s3.Bucket('testing')
    for obj in bucket.objects.filter(Prefix='prepared/'):
        if obj.key.endswith('.metadata'):
            print(obj.key)
            client_s3.delete_object(Bucket=bucket.name, Key=obj.key)

BigQuery: How to autoreload table with new storage JSON files?

I have just created one BigQuery table by linking available JSON files in Google Cloud Storage. But I do not see any option to auto-reload table rows with new files added in Google Cloud Storage folder or bucket.
Currently, I have to go to BigQuery console and then delete & recreate the same table to load new files. But this solution is not scalable for us because we run a cron job on BigQuery API. How to auto-reload data in BigQuery?
Thanks
When you define an external table on top of files in Google Cloud Storage, you can use a wildcard for the source location, so your table will represent all files that match.
Then, when you query such a table, you can use the _file_name field, which will "tell" you which file a given row came from:
SELECT
  _file_name AS file,
  *
FROM `yourTable`
This way, whenever you add a new file in GCS, you will get it in the table "automatically".
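A sketch of defining such a wildcard-backed external table with the Python client (the project, dataset, table and bucket names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# External table over a GCS wildcard, so new JSON files are picked up
# automatically at query time.
external_config = bigquery.ExternalConfig('NEWLINE_DELIMITED_JSON')
external_config.source_uris = ['gs://your-bucket/your-folder/*.json']

table = bigquery.Table('your-project.your_dataset.your_external_table')
table.external_data_configuration = external_config
client.create_table(table)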
With Google Cloud Functions you can automate BigQuery each time you receive a new file:
Create a new function at https://console.cloud.google.com/functions/add
Point "bucket" to the one receiving files.
Codewise, import BigQuery inside package.json:
{
  "dependencies": {
    "@google-cloud/bigquery": "^0.9.6"
  }
}
And on index.js you can act on the new file in any appropriate way:
var BigQuery = require('@google-cloud/bigquery');
var bigQuery = BigQuery({ projectId: 'your-project-id' });

exports.processFile = (event, callback) => {
  console.log('Processing: ' + JSON.stringify(event.data));
  query(event.data);
  callback();
};

function query(data) {
  const filename = data.name.split('/').pop();
  const full_filename = `gs://${data.bucket}/${data.name}`;
  // if you want to run a query:
  const sql = '...';
  bigQuery.query({
    query: sql,
    useLegacySql: false
  });
}

How do I skip the header row using the Python gcloud.bigquery client?

I have a daily GCP billing export file in csv format containing GCP billing details. This export contains a header row. I've setup a load job as follows (summarized):
from google.cloud import bigquery
job = client.load_table_from_storage(job_name, dest_table, source_gs_file)
job.source_format = 'CSV'
job.skipLeadingRows=1
job.begin()
This job produces the error:
Could not parse 'Start Time' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
This error means that it is still trying to parse the header row even though I specified skipLeadingRows=1. What am I doing wrong here?
You should use skip_leading_rows instead of skipLeadingRows when using the Python SDK.
skip_leading_rows: Number of rows to skip when reading data (CSV only).
Reference: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJobConfig.html
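For reference, a sketch of the same load with a current version of the google-cloud-bigquery client, using the snake_case option (the bucket, file and table names are placeholders, and the destination table is assumed to already exist with a matching schema):
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # snake_case, not skipLeadingRows
)

load_job = client.load_table_from_uri(
    'gs://your-bucket/billing-export.csv',
    'your_dataset.billing',
    job_config=job_config,
)
load_job.result()  # waits for the job to complete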
I cannot reproduce this. I took the example you gave ("2017-02-04T00:00:00-08:00"), added 3 rows/timestamps to a csv file, uploaded it to GCS, and finally created an empty table in BigQuery with one column of type TIMESTAMP.
File contents:
2017-02-04T00:00:00-08:00
2017-02-03T00:00:00-08:00
2017-02-02T00:00:00-08:00
I then ran the example Python script found here, and it successfully loaded the file into the table:
Loaded 3 rows into timestamp_test:gcs_load_test.
import uuid
from google.cloud import bigquery

def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(job_name, table, source)
    job.begin()

    wait_for_job(job)  # helper defined in the linked sample script

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

Exporting data to GCS from BigQuery - Split file size control

I am currently exporting data from BigQuery to GCS buckets. I am doing this programmatically using the following extract job configuration:
query_request = bigquery_service.jobs()

DATASET_NAME = "#######"
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()
Since there is a constraint that only 1GB per file can be exported to GCS, I used the single wildcard URI (https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple). This splits the file into multiple smaller parts, and each of the parts is gzipped as well.
My question: Can I control the file sizes of the split files? For example, if I have a 14GB file to export to GCS, this will be split into 14 1GB files. But is there a way to change that 1GB into another size (smaller than 1GB before gzipping)? I checked the various parameters that are available for modifying the configuration.extract object (refer: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract).
If you specify multiple URI patterns, the data will be sharded between them. So if you used, say, 28 URI patterns, each shard would be about half a GB. You'd end up with a second, zero-size file for each pattern, as this is really meant for MapReduce jobs, but it's one way to accomplish what you want.
More info here (see the Multiple Wildcard URIs section): Exporting Data From BigQuery
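To illustrate the multiple-pattern approach with the job configuration from the question (the number of shards and file names are arbitrary, and this must be set before inserting the extract job):
# Shard the export across multiple wildcard URIs so each shard
# (and therefore each file) is smaller.
NUM_SHARDS = 28  # e.g. roughly 0.5GB per shard for a 14GB table
query_data['configuration']['extract']['destinationUris'] = [
    DESTINATION_PATH + 'my-files-shard{:02d}-*.gz'.format(i)
    for i in range(NUM_SHARDS)
]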