snowflake current_ip_address to Bigquery - google-bigquery

Is there a BigQuery equivalent of Snowflake's current_ip_address() SQL function? If not, can it be implemented via a JavaScript UDF? I could not find any relevant information and hence need some input.
Here is the snowflake doc about the function:
https://docs.snowflake.com/en/sql-reference/functions/current_ip_address.html

BigQuery does not have the IP address of the client that submitted the request, so there is no equivalent (and it can't be implemented using a JavaScript UDF, either).

Unfortunately, BigQuery doesn't have any function like current_ip_address(), not even via JavaScript. Its built-in NET functions can only manipulate IP addresses that you already have as values.
You can, however, get the external or internal IP address of a VM. For example:
External IP
gcloud compute instances list --filter="name=my-instance" --format "get(networkInterfaces[0].accessConfigs[0].natIP)"
Internal IP
gcloud compute instances list --filter="name=my-instance" --format "get(networkInterfaces[0].networkIP)" ...
With Cloud SQL, you can get the IP address of a database instance:
export DB_IP=$(gcloud sql instances describe $DATABASE_ID --project $PROJECT_ID --format 'value(ipAddresses.ipAddress)')

Related

BigQuery API on gcp compute instance

New to BigQuery on GCP
I'm trying to query tables on a public dataset on gcp.
I'd like to query the tables via my compute instance (debian).
Is there a step by step out there?
Thanks
MS
Please find the steps below.
Create your Compute Engine instance with a custom scope and enable BigQuery access.
If you have already created the instance, stop it and then click Edit; this will allow you to change the service account scopes.
Now run your query with bq as below.
bq query --nouse_legacy_sql \
'SELECT
  *
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 2'

In BigQuery, query to get GCS metadata (filenames in GCS)

We have a GCS bucket with a subfolder at the URL https://storage.googleapis.com/our-bucket/path-to-subfolder. This subfolder contains the files:
file_1_3.png
file_7_4.png
file_3_2.png
file_4_1.png
We'd like to create a table in BigQuery with a column number1 with values 1,7,3,4 (the first number in each filename) and a column number2 with the second numbers. String splitting is easy once the data (a column of filenames) is in BigQuery, but how can the filenames be retrieved? Is it possible to query a GCS bucket for metadata on its files?
EDIT: want to do this
Updating the answer to reflect the question of how to retrieve GCS bucket metadata on files.
There are two options here, depending on the use case:
Use a Cloud Function on a cron schedule to read the metadata (like in the example you shared), then perform an insert with the BigQuery client library. Then apply the regex listed below.
Use a remote function, a feature currently in preview, so you may not have the functionality available, though you may be able to request it. This option gets you the latest data on read. It involves the following:
Create a Cloud Function that returns an array of blob names; see the code below.
Create a connection resource in BigQuery (the overall process is listed here; however, since the remote function portion is in preview, the documentation, and potentially your UI, may not reflect the necessary options; it did not in mine).
Create a remote function (third code block in the link).
Call the function from your code, then manipulate the result as needed with regexp.
Example CF for option 2:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blob names in the bucket."""
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    blob_array = []
    for blob in blobs:
        blob_array.append(blob.name)  # collect each blob's name
    return blob_array
Example remote function from documentation:
CREATE FUNCTION mydataset.remoteMultiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION us.myconnection
OPTIONS(endpoint="https://us-central1-myproject.cloudfunctions.net/multiply");
Once it's in, it will return the full GCS path of the file. From there you can use a regex like the following, regexp_extract(_FILE_NAME, 'file_(.+)_'), to extract the important information.
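As an aside, the same extraction can be sketched locally in Python. The sample filenames are taken from the question; the exact pattern (here file_(\d+)_(\d+)) is an assumption about the naming scheme and may need adjusting for your data:

```python
import re

# Filenames as listed in the question.
filenames = ["file_1_3.png", "file_7_4.png", "file_3_2.png", "file_4_1.png"]

# Assumes names always look like file_<number1>_<number2>.<ext>.
pattern = re.compile(r"file_(\d+)_(\d+)\.")

rows = []
for name in filenames:
    m = pattern.search(name)
    if m:
        # (number1, number2) for each file
        rows.append((int(m.group(1)), int(m.group(2))))

print(rows)  # → [(1, 3), (7, 4), (3, 2), (4, 1)]
```

The same two capture groups translate directly into two REGEXP_EXTRACT calls (or one REGEXP_EXTRACT_ALL) in BigQuery SQL.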
Now that BigQuery Remote Functions (RF) are GA, as is the JSON type, I thought I'd share a way to get any property of the blobs in a bucket right from BigQuery SQL.
!! Make sure to carefully read the official documentation first on how to set up an RF, as it's easy to miss a step. There are slight differences if you'd rather use a 2nd-gen function or Cloud Run.
Create the following storage Cloud Function (here in Python); 1st gen is good enough:
import json
from google.cloud import storage

storage_client = storage.Client()

def list_blobs(request):
    print(request_json := request.get_json())  # print for debugging
    calls = request_json['calls']
    bucket_name = calls[0][0]
    blobs = storage_client.list_blobs(bucket_name)
    reply = [b._properties for b in blobs]
    return json.dumps({'replies': [reply]})
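For clarity, here is a minimal local sketch of the request/reply shapes the function above assumes: BigQuery POSTs a JSON body with one entry in calls per input row (each entry being that row's argument list), and expects one entry in replies per call. The bucket name and blob properties below are made up for illustration:

```python
import json

# Hypothetical request body as BigQuery would POST it: one call per row,
# each call being the list of arguments (here just the bucket name).
request_json = {"calls": [["my_bucket"]]}

# Stand-in for the blob properties the function would read from GCS.
fake_properties = [
    {"name": "file_1_3.png", "size": "123"},
    {"name": "file_7_4.png", "size": "456"},
]

# Mirror of the function body: one reply per call; here the single reply
# is itself a JSON array of blob-property objects.
reply = fake_properties
body = json.dumps({"replies": [reply]})

parsed = json.loads(body)
print(len(parsed["replies"]))  # → 1
```

Packing the whole array into a single reply is what lets one SQL call return many blobs.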
Create BQ remote function (assumes fns dataset, us.api connection and my_project_id):
CREATE FUNCTION fns.list_blobs(bucket STRING)
RETURNS JSON
REMOTE WITH CONNECTION us.api
OPTIONS(endpoint="https://us-central1-my_project_id.cloudfunctions.net/storage")
The trick to returning multiple values for a single request is to use the JSON type.
Then SELECT whatever properties you want:
SELECT STRING(blob.name), STRING(blob.size), CAST(STRING(blob.updated) AS TIMESTAMP)
FROM UNNEST(
    JSON_EXTRACT_ARRAY(
        fns.list_blobs('my_bucket')
    )
) blob
The JSON is converted to an ARRAY, and UNNEST() pivots it into multiple rows (unfortunately not into columns too).
Voilà! I wish there were an easier way to fully parse a JSON array into a table, populating all columns at once, but as of this writing you must explicitly extract the properties you want.
You can do many more cool things by extending the functions (cloud and remote) so you don't have to leave SQL, like:
generate and return a signed URL to display/download right from a query result (e.g. in a BI tool)
use user_defined_context and branch logic in the Cloud Function code to perform other operations, like deleting blobs
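The branching idea can be sketched as follows. BigQuery forwards the function's user-defined context in the request body as userDefinedContext; the mode key and the "list"/"delete" operations below are made-up examples, not a real API:

```python
import json

def handle(request_json):
    """Branch on the user-defined context BigQuery sends with each request.

    `mode` is a hypothetical context key; 'list' and 'delete' are made-up
    operations for illustration only.
    """
    context = request_json.get("userDefinedContext", {})
    mode = context.get("mode", "list")
    calls = request_json["calls"]
    if mode == "delete":
        replies = [f"deleted {args[0]}" for args in calls]
    else:
        replies = [f"listed {args[0]}" for args in calls]
    return json.dumps({"replies": replies})

print(handle({"userDefinedContext": {"mode": "delete"},
              "calls": [["old_file.png"]]}))
# → {"replies": ["deleted old_file.png"]}
```

One Cloud Function can then back several remote functions, each created with a different user-defined context.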
Object tables are read-only tables containing metadata index over the unstructured data objects in a specified Cloud Storage bucket. Each row of the table corresponds to an object, and the table columns correspond to the object metadata generated by Cloud Storage, including any custom metadata.
With object tables, we can get the file names and perform operations on them within BigQuery itself.
https://cloud.google.com/bigquery/docs/object-table-introduction

Can't access external Hive metastore with Pyspark

I am trying to run a simple piece of code that simply shows the databases I created previously on my hive2 server. (Note: I tried this in both Python and Scala, with the same results.)
If I log in into a hive shell and list my databases I see a total of 3 databases.
When I start the Spark shell (2.3) with pyspark, I do the usual and add the following property to my SparkSession:
sqlContext.setConf("hive.metastore.uris","thrift://*****:9083")
and restart the SparkContext within my session.
If I run either of the following lines to see all the configs:
pyspark.conf.SparkConf().getAll()
spark.sparkContext._conf.getAll()
I can indeed see that the parameter has been added. I then start a new HiveContext:
hiveContext = pyspark.sql.HiveContext(sc)
But If I list my databases:
hiveContext.sql("SHOW DATABASES").show()
It does not show the same results as the hive shell.
I'm a bit lost; for some reason it looks like it is ignoring the config parameter. I am sure the address I'm using is my metastore, because the address I get from running:
hive -e "SET" | grep metastore.uris
is the same address I also get if I run:
ses2 = spark.builder.master("local").appName("Hive_Test").config('hive.metastore.uris','thrift://******:9083').getOrCreate()
ses2.sql("SET").show()
Could it be a permissions issue? Perhaps some tables are not set to be visible outside the hive shell/user.
Thanks
Managed to solve the issue: because of a communication problem, Hive was not hosted on that machine. I corrected the code and everything works fine.

How to save Google Cloud Datalab output into BigQuery using R

I am using R in Google Cloud Datalab and I want to save my output, which is a table containing Strings that is created in the code itself, to BigQuery. I know there is a way to do it with Python by using bqr_create_table so I am looking for the equivalent in R.
I have found this blog post from Gus Class on Google Cloud Platform which uses this code to write to BigQuery:
# Install BigRQuery if you haven't already...
# install.packages("devtools")
# devtools::install_github("rstats-db/bigrquery")
# library(bigrquery)
insert_upload_job("your-project-id", "test_dataset", "stash", stash)
Here "test_dataset" is the dataset in BigQuery, "stash" is the table inside the dataset, and stash is any data frame you have defined with your data.
There is more information on how to authorize with bigrquery.

ExecuteSQL processor returns corrupted data

I have a flow in NiFi in which I use the ExecuteSQL processor to get a merge of sub-partitions named dt from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1, and dt=1000 under sikid=2.
What I did is select * from my_table where dt=1000.
Unfortunately, what I got back from the ExecuteSQL processor is corrupted data, including rows that have dt=NULL, while the original table does not have even one row with dt=NULL.
The DBCPConnectionPool is configured to use HiveJDBC4 jar.
Later I tried using the compatible jar for my CDH release; that didn't fix it either.
The ExecuteSQL processor is configured as such:
Normalize Table/Column Names: true
Use Avro Logical Types: false
Hive version: 1.1.0
CDH: 5.7.1
Any ideas what's happening? Thanks!
EDIT:
Apparently my returned data includes extra rows... a few thousand of them... which is quite weird.
Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it didn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData they have the table name prepended, with a period (.) separator. This is some of the custom code that is in HiveJdbcCommon (used by SelectHiveQL) vs JdbcCommon (used by ExecuteSQL).
If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify auth information on the JDBC URL rather than in a hive-site.xml file, for example? If you ask your auth question (using SelectHiveQL) as a separate SO question and link to it here, I will do my best to help out on that front and get you past this.
Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.