Is there a way to filter blobs that are not in the archive tier in Azure Blob Storage?

I have a blob container, say "demo". It has many files; some are in the Hot tier and some are in the Archive tier. I want to process only the files in the Hot tier and ignore the Archive tier files.
Using the Get Metadata activity lists all files, including the Archive tier files.
Using "az storage blob list" throws the error: 'This operation is not permitted on an archived blob.'
Please point me in the right direction.

You can use the List Blobs operation, which returns a list of the blobs under the specified container.
Method: GET
Request URI: https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list
HTTP Version: HTTP/1.1
This will return the response body in XML format, which you can then filter on the access tier names.
For the above request, also set the x-ms-version header to a recent service version (for example x-ms-version: 2019-12-12) if a recent version isn't applied by default, since older versions don't include tier information in the listing.
For version 2017-04-17 and above, List Blobs returns the AccessTier
element if an access tier has been explicitly set. For Blob Storage or
General Purpose v2 accounts, valid values are Hot/Cool/Archive. If the
blob is in rehydrate pending state then ArchiveStatus element is
returned with one of the valid values
rehydrate-pending-to-hot/rehydrate-pending-to-cool.
Refer to List Blobs for more details.
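A rough Python sketch of that approach, assuming you have a SAS token with list permission for the container (the account, container and token names are placeholders; the query string and the AccessTier element follow the List Blobs documentation):
import requests
import xml.etree.ElementTree as ET

def list_hot_blobs(account_name, container_name, sas_token):
    # List Blobs: GET https://<account>.blob.core.windows.net/<container>?restype=container&comp=list
    url = f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}"
    params = {"restype": "container", "comp": "list"}
    headers = {"x-ms-version": "2019-12-12"}  # a version >= 2017-04-17 so AccessTier is returned
    resp = requests.get(url, params=params, headers=headers)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    hot_blobs = []
    for blob in root.iter("Blob"):
        # keep only blobs explicitly in the Hot tier; Archive blobs are skipped
        if blob.findtext("Properties/AccessTier") == "Hot":
            hot_blobs.append(blob.findtext("Name"))
    return hot_blobs
You can then feed the resulting names into whatever processing step you had planned (for example a filtered copy in Data Factory).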

Related

In BigQuery, query to get GCS metadata (filenames in GCS)

We have a GCS bucket with a subfolder at url https://storage.googleapis.com/our-bucket/path-to-subfolder. This sub-folder contains files:
file_1_3.png
file_7_4.png
file_3_2.png
file_4_1.png
We'd like to create a table in BigQuery with a column number1 with values 1, 7, 3, 4 (the first number in each filename) and a column number2 with the second numbers. String splitting is easy once the data (a column of filenames) is in BigQuery. How can the filenames be retrieved? Is it possible to query a GCS bucket for metadata on its files?
EDIT: want to do this
Updating the answer to reflect the question of how to retrieve GCS bucket metadata on files.
There are two options here, depending on the use case:
Use a Cloud Function on a cron schedule to read the metadata (as in the example you shared), then perform an insert with the BQ client library (a minimal sketch follows this list), and finally apply the regex listed below.
Use a remote function. This option relies on a feature that is in preview, so you may not have the functionality yet, though you may be able to request it. It gets you the latest data at read time and involves the following:
Create a Cloud Function that returns an array of blob names, see code below.
Create a connection resource in BigQuery (the overall process is listed here; however, since the remote function portion is in preview, the documentation and possibly your UI may not reflect the necessary options; it did not in mine).
Create a remote function (third code block in the link).
Call the function from your code, then manipulate the result as needed with a regexp.
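A minimal sketch of the insert step in option 1, assuming a table with a single STRING column file_name already exists (the table and project names are placeholders; the bucket and prefix come from the question):
from google.cloud import bigquery, storage

def sync_file_names(bucket_name="our-bucket", table_id="my_project.my_dataset.gcs_files"):
    # read the metadata (here just the names) from the bucket...
    blobs = storage.Client().list_blobs(bucket_name, prefix="path-to-subfolder/")
    rows = [{"file_name": blob.name} for blob in blobs]
    # ...and stream it into BigQuery with the client library
    errors = bigquery.Client().insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(errors)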
Example CF for option 2:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    blob_array = []
    for blob in blobs:
        blob_array.append(blob.name)  # collect each blob's name
    return blob_array
Example remote function from documentation:
CREATE FUNCTION mydataset.remoteMultiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION us.myconnection
OPTIONS(endpoint="https://us-central1-myproject.cloudfunctions.net/multiply");
Once it's in, it will return the full GCS path of the file. From there you can use a regex like the following, regexp_extract(_FILE_NAME, 'file_(.+)_'), to extract the important information.
Now that BQ Remote Functions (RF) are GA, as well as JSON, I thought I'd share a way to get any property of the blobs in a bucket, right from BQ SQL.
!! Make sure to carefully read the official documentation first on how to set up an RF, as it's easy to miss a step. There are slight differences if you'd rather use a 2nd gen function or Cloud Run.
Create the following storage Cloud Function (here in Python) - 1st gen is good enough:
import json
from google.cloud import storage

storage_client = storage.Client()

def list_blobs(request):
    print(request_json := request.get_json())  # print for debugging
    calls = request_json['calls']
    bucket_name = calls[0][0]
    blobs = storage_client.list_blobs(bucket_name)
    reply = [b._properties for b in blobs]
    return json.dumps({'replies': [reply]})
Create BQ remote function (assumes fns dataset, us.api connection and my_project_id):
CREATE FUNCTION fns.list_blobs(bucket STRING)
RETURNS JSON
REMOTE WITH CONNECTION us.api
OPTIONS(endpoint="https://us-central1-my_project_id.cloudfunctions.net/storage")
The trick to returning multiple values for a single request is to use the JSON type.
Then SELECT whatever properties you want:
SELECT STRING(blob.name), STRING(blob.size), CAST(STRING(blob.updated) AS TIMESTAMP)
FROM
UNNEST(
JSON_EXTRACT_ARRAY(
fns.list_blobs('my_bucket')
)
) blob
The JSON is converted to an ARRAY, and UNNEST() pivots it to multiple rows - unfortunately not to columns too.
Voila! I wish there were an easier way to fully parse a JSON array into a table, populating all columns at once, but as of this writing you must explicitly extract the properties you want.
You can do many more cool things by extending the functions (cloud and remote) so you don't have to leave SQL, for example:
generate and return signed URLs to display/download right from a query result (e.g. in a BI tool)
use user_defined_context and branch logic in the CF code to perform other operations, like deleting blobs (a minimal sketch of this follows)
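For instance, here is a rough sketch of how the same Cloud Function could branch on user_defined_context (set in the remote function's OPTIONS); the 'mode' key and the two-argument call shape are assumptions, while the request field names follow the remote function request format:
import json
from google.cloud import storage

storage_client = storage.Client()

def storage_rf(request):
    request_json = request.get_json()
    # the user_defined_context set on the remote function arrives with each request
    mode = request_json.get('userDefinedContext', {}).get('mode', 'list')
    calls = request_json['calls']
    if mode == 'delete':
        # assumed call shape: each call is [bucket_name, blob_name]
        for bucket_name, blob_name in calls:
            storage_client.bucket(bucket_name).blob(blob_name).delete()
        return json.dumps({'replies': ['deleted'] * len(calls)})
    # default: list blob properties, as in the function above
    bucket_name = calls[0][0]
    blobs = storage_client.list_blobs(bucket_name)
    return json.dumps({'replies': [[b._properties for b in blobs]]})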
Object tables are read-only tables containing a metadata index over the unstructured data objects in a specified Cloud Storage bucket. Each row of the table corresponds to an object, and the table columns correspond to the object metadata generated by Cloud Storage, including any custom metadata.
With object tables we can get the file names and do operations on top of them in BigQuery itself.
https://cloud.google.com/bigquery/docs/object-table-introduction
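As a quick illustration, once an object table has been created over the bucket (mydataset.my_object_table below is a placeholder name), its metadata columns such as uri, size and updated can be queried like any other table, for example from the Python client:
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT uri, size, updated
    FROM mydataset.my_object_table
    ORDER BY updated DESC
"""
for row in client.query(query).result():
    print(row.uri, row.size, row.updated)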

Automatic ETL data before loading to Bigquery

I have CSV files added to a GCS bucket daily or weekly, and each file name contains (date + specific parameter).
The files contain the schema (id + name) columns, and we need to automatically load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and specific parameter from the file name into Dataflow.
And we tried a Cloud Function (we can get the date and specific parameter value from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article for this kind of problem using Cloud Workflows, for when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. The linked article is a complete guide on how to work with Workflows: connecting any Google Cloud API, working with subworkflows and arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file, we will then use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
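If you would rather stay with the Cloud Function approach from the question instead of Workflows, a rough Python sketch of the same idea (pull the date and the specific parameter out of the filename and add them as columns before loading) could look like the following; the table ID and the date_parameter.csv filename pattern are assumptions:
import io
import re
import pandas as pd
from google.cloud import bigquery, storage

def load_new_file(bucket_name, blob_name, table_id="my_project.my_dataset.my_table"):
    # assumed filename pattern, e.g. "2023-05-01_paramA.csv"
    match = re.match(r"(\d{4}-\d{2}-\d{2})_(.+)\.csv$", blob_name)
    if not match:
        return
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))  # columns: id, name
    df["date"] = match.group(1)
    df["specific_parameter"] = match.group(2)
    bigquery.Client().load_table_from_dataframe(df, table_id).result()
Wired to a Cloud Storage trigger, this would run once per newly added file.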

Getting files and folders in the datalake while reading from datafactory

While reading Azure SQL table data (which actually consists of paths to directories) from Azure Data Factory, how can I use those paths to dynamically get the files from the data lake?
Can anyone tell me what I should give in the dataset?
You could use a Lookup activity to read the data from Azure SQL, and follow it with a ForEach activity. Then pass @item().<your path column> to your dataset parameter k1.

Enumerate blob names in Azure Data Factory v2

I need to enumerate all the blob names that sit in an Azure Blobs container and dump the list to a file in another blob storage.
The part that I cannot master is the enumeration.
Thanks.
The Get Metadata activity is what you want.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
Use childItems to get all the files, and then use a ForEach activity to iterate over the childItems.
Inside the ForEach activity, you may want to check whether each item is a file. You could use an If Condition activity with an expression such as @equals(item().type, 'File').
Then in the "If true" branch, assuming you want to copy data, you could use @item().name to get each file name.
You can find more documentation at the link above.
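If you only need the list of names and are open to doing it outside Data Factory, here is a rough sketch with the azure-storage-blob Python SDK instead (the connection string and container names are placeholders):
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("source-container")
names = [blob.name for blob in source.list_blobs()]

# dump the list to a text blob in another container
dest = service.get_container_client("output-container")
dest.upload_blob("blob_names.txt", "\n".join(names), overwrite=True)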

Exporting query results as JSON via Google BigQuery API

I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as json in a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) As you mention, those steps are the way to go. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; also check the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first need to export to GCS, then transfer to the local machine. If you have a message queue system (like Beanstalkd) in place to drive all of this, it's easy to chain the operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temporary table.
Please also note that you can update the table via the API and set the expirationTime property; with this approach you don't need to delete it yourself.
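For completeness, a rough sketch of those steps with the Python BigQuery client (project, dataset, table and bucket names are placeholders), using a table expiration instead of an explicit delete:
import datetime
from google.cloud import bigquery

client = bigquery.Client()
tmp_table = "my_project.my_dataset.tmp_results"

# 1) run the query into an explicit destination table
#    (standard SQL does not need allowLargeResults)
job_config = bigquery.QueryJobConfig(destination=tmp_table)
client.query("SELECT * FROM `bigquery-public-data.samples.shakespeare`",
             job_config=job_config).result()

# 2) set an expiration instead of deleting the table manually
table = client.get_table(tmp_table)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
client.update_table(table, ["expires"])

# 3) extract the destination table to GCS as newline-delimited JSON
extract_config = bigquery.ExtractJobConfig(destination_format="NEWLINE_DELIMITED_JSON")
client.extract_table(tmp_table, "gs://my_bucket/results-*.json",
                     job_config=extract_config).result()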
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve some exports locally, but it has certain other limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json
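Similarly, a small Python sketch that streams query results straight into a local newline-delimited JSON file (fine for modest result sizes; the file name and query are placeholders):
import json
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 1000")
with open("export.json", "w") as f:
    for row in query_job.result():
        f.write(json.dumps(dict(row), default=str) + "\n")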