BigQuery query result to dataframe with Airflow - sql

I am trying to query data from BigQuery and write it to a dataframe with Airflow, but I keep hitting one of these errors: file not found (for the service account key), file name too long, or an EOF line read error.
I have tried with hooks as well, but I am not able to pass the key file as JSON because it says it is too long.
Any tips on how I can achieve this?
def get_data_from_GBQ():
    global customer_data
    ofo_cred = Variable.get("ofo_cred")
    logging.info(ofo_cred)
    logging.info("Variable is here")
    customer_data_query = """ SELECT FirstName, LastName, Organisation FROM `bigquery-bi.ofo.Customers` LIMIT 2 """
    logging.info("test")
    # Creating a connection to the google bigquery
    client = bigquery.Client.from_service_account_json(ofo_cred)
    logging.info("after client")
    customer_data = client.query(customer_data_query).to_dataframe()
    logging.info("after client")
    print(customer_data)

dag = DAG(
    'odoo_gbq_connection',
    default_args=default_args,
    description='A connection between ',
    schedule_interval=timedelta(days=1),
)
And the error is:
FileNotFoundError: [Errno 2] No such file or directory: '{\r\n "type": "service_account",\r\n "project_id":...

The bigquery.Client.from_service_account_json function expects the file name of the service account key file, but you are passing it the contents of that file, so it tries to open a file whose path starts with {\r\n "type": "servi... and fails with FileNotFoundError.
Potential fix:
client = bigquery.Client.from_service_account_json(path_to_ofo_cred)
https://googleapis.dev/python/google-api-core/latest/auth.html#service-accounts
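If you would rather keep the key in the Airflow Variable instead of a file on disk, here is a minimal sketch of that alternative (it assumes the Variable holds the complete JSON key contents and that the google-auth package is installed): parse the JSON and build the credentials object directly.

import json
import logging

from airflow.models import Variable
from google.cloud import bigquery
from google.oauth2 import service_account

def get_data_from_GBQ():
    # The Variable holds the key file *contents*, so parse it instead of
    # treating it as a file path.
    info = json.loads(Variable.get("ofo_cred"))
    credentials = service_account.Credentials.from_service_account_info(info)
    client = bigquery.Client(credentials=credentials, project=credentials.project_id)

    customer_data_query = """
        SELECT FirstName, LastName, Organisation
        FROM `bigquery-bi.ofo.Customers`
        LIMIT 2
    """
    customer_data = client.query(customer_data_query).to_dataframe()
    logging.info("Fetched %d rows", len(customer_data))
    return customer_data

Returning the dataframe (instead of relying on a global) also makes the task easier to test outside Airflow.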

Related

Synapse Spark: Python logging to log file in Azure Data Lake Storage

I am working in Synapse Spark and building a logger function to handle error logging. I intend to push the logs to an existing log file (data.log) located in AzureDataLakeStorageAccount/Container/Folder/.
In addition to the root logger I have added a StreamHandler, and I am trying to set up a FileHandler to manage the log file write-out.
The log file path I am specifying is in this format: 'abfss:/container#storageaccountname.dfs.core.windows.net/Data/logging/data.log'
When I run the below code, I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_<number/container_number/abfss:/container#storageaccountname.dfs.core.windows.net/Data/logging/data.log'
The default mount path is getting prefixed to the ADLS file path.
Here is the code:
import sys
import logging

def init_logger(name: str, logging_level: int = logging.DEBUG) -> logging.Logger:
    _log_format = "%(levelname)s %(asctime)s %(name)s: %(message)s"
    _date_format = "%Y-%m-%d %I:%M:%S %p %z"
    _formatter = logging.Formatter(fmt=_log_format, datefmt=_date_format)
    _root_logger = logging.getLogger()
    _logger = logging.getLogger(name)
    _logger.setLevel(logging_level)

    # Root and Stream Handler
    if _root_logger.handlers:
        for handler in _root_logger.handlers:
            handler.setFormatter(_formatter)
        _logger.setLevel(logging_level)
    else:
        _handler = logging.StreamHandler(sys.stderr)
        _handler.setLevel(logging_level)
        _handler.setFormatter(_formatter)
        _logger.addHandler(_handler)

    # File Handler
    __handler = logging.FileHandler(LogFilepath, 'a')
    __handler.setLevel(logging_level)
    __handler.setFormatter(_formatter)
    _logger.addHandler(__handler)

    return _logger
To address the mount path prefix I added a series of '../' to move up levels, but even with this I end up with a solitary '/' prefixed to my ADLS path.
I have not found any online assistance or article where this has been implemented in an Azure Data Lake setup. Any assistance will be appreciated.
Looking at the error, the path of the file seems to be the issue here. Instead of giving the abfss path, you can try giving the path in the form /synfs/{jobid}/<mount point>/filepath.
Use the below code for that.
jobid=mssparkutils.env.getJobId()
LogFilepath='/synfs/'+jobid+'/<mount_point>/filename.log'
print(LogFilepath)
The above path might work for you. If not, you can try this method with open() as an alternative for logging to an ADLS file.
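If the container is already mounted, here is a minimal sketch of wiring that path into the FileHandler from the question (it assumes the notebookutils import available on Synapse Spark pools; the mount point name 'data' is a placeholder for whatever was created with mssparkutils.fs.mount):

from notebookutils import mssparkutils

job_id = mssparkutils.env.getJobId()
# Local view of the mounted ADLS path; 'data' is a placeholder mount point.
LogFilepath = '/synfs/' + job_id + '/data/data.log'

# init_logger() from the question reads the module-level LogFilepath,
# so define it before calling.
logger = init_logger("adls_logger")
logger.info("Logger initialised, writing to %s", LogFilepath)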

How can I reference an external SQL file using Airflow's BigQuery operator?

I'm currently using Airflow with the BigQuery operator to trigger various SQL scripts. This works fine when the SQL is written directly in the Airflow DAG file. For example:
bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    bql='SELECT * FROM `example.table`',
    destination_dataset_table='example.destination'
)
However, I'd like to store the SQL in a separate file saved to a storage bucket. For example:
bql='gs://example_bucket/sample_script.sql'
When calling this external file I receive a "Template Not Found" error.
I've seen some examples load the SQL file into the Airflow DAG folder, however, I'd really like to access files saved to a separate storage bucket. Is this possible?
You can reference any SQL file in your Google Cloud Storage bucket. Here's an example where I call the file Query_File.sql from the SQL directory in my Airflow DAG bucket.
CONNECTION_ID = 'project_name'

with DAG('dag', schedule_interval='0 9 * * *', template_searchpath=['/home/airflow/gcs/dags/'], max_active_runs=15, catchup=True, default_args=default_args) as dag:

    battery_data_quality = BigQueryOperator(
        task_id='task-id',
        sql='/SQL/Query_File.sql',
        destination_dataset_table='project-name.DataSetName.TableName${{ds_nodash}}',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id=CONNECTION_ID,
        use_legacy_sql=False,
        dag=dag
    )
You can also consider using the gcs_to_gcs operator to copy things from your desired bucket into one that is accessible by composer.
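For the copy route, a hedged sketch with the GCS-to-GCS operator (Airflow 1.10.x contrib import path; the bucket names and object paths below are placeholders):

from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

copy_sql = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy-sql-into-dag-bucket',
    source_bucket='example_bucket',                # placeholder source bucket
    source_object='sample_script.sql',
    destination_bucket='my-composer-dag-bucket',   # placeholder Composer DAG bucket
    destination_object='dags/SQL/sample_script.sql',
    dag=dag,
)

Once the file sits in the DAG bucket, the template_searchpath approach above can pick it up.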
Note that download works differently in GoogleCloudStorageDownloadOperator between Airflow versions 1.10.3 and 1.10.15.
def execute(self, context):
    self.object = context['dag_run'].conf['job_name'] + '.sql'
    logging.info('filename in GoogleCloudStorageDownloadOperator: %s', self.object)
    self.filename = context['dag_run'].conf['job_name'] + '.sql'

    self.log.info('Executing download: %s, %s, %s', self.bucket,
                  self.object, self.filename)
    hook = GoogleCloudStorageHook(
        google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
        delegate_to=self.delegate_to
    )

    file_bytes = hook.download(bucket=self.bucket,
                               object=self.object)

    if self.store_to_xcom_key:
        if sys.getsizeof(file_bytes) < 49344:
            context['ti'].xcom_push(key=self.store_to_xcom_key,
                                    value=file_bytes.decode('utf-8'))
        else:
            raise RuntimeError(
                'The size of the downloaded file is too large to push to XCom!'
            )
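For context, here is a hedged sketch of how that operator is typically wired up so the downloaded SQL ends up in a query task (Airflow 1.10.x contrib import paths; the bucket, object, and task names are placeholders): the file contents are pushed to XCom and pulled into the templated sql field of the downstream task.

from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

download_sql = GoogleCloudStorageDownloadOperator(
    task_id='download_sql',
    bucket='example_bucket',          # placeholder bucket outside the DAG folder
    object='sample_script.sql',
    store_to_xcom_key='sql_text',     # push the file contents to XCom
    dag=dag,
)

run_query = BigQueryOperator(
    task_id='run_query',
    sql="{{ task_instance.xcom_pull(task_ids='download_sql', key='sql_text') }}",
    use_legacy_sql=False,
    destination_dataset_table='example.destination',
    dag=dag,
)

download_sql >> run_query

Keep in mind the XCom size limit shown in the execute method above; large SQL files are better copied into the DAG bucket instead.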

(InternalError) when calling the SelectObjectContent operation in boto3

I have a series of files that are in JSON that need to be split into multiple files to reduce their size. One issue is that the files are extracted using a third party tool and arrive as a JSON object on a single line.
I can use S3 select to process a small file (say around 300Mb uncompressed) but when I try and use a larger file - say 1Gb uncompressed (90Mb gzip compressed) I get the following error:
[ERROR] EventStreamError: An error occurred (InternalError) when calling the SelectObjectContent operation: We encountered an internal error. Please try again.
The query that I am trying to run is:
select count(*) as rowcount from s3object[*][*] s
I can't run the query from the console because the file is larger than 128Mb but the code that is performing the operation is as follows:
def execute_select_query(bucket, key, query):
    """
    Runs a query against an object in S3.
    """
    if key.endswith("gz"):
        compression = "GZIP"
    else:
        compression = "NONE"

    LOGGER.info("Running query |%s| against s3://%s/%s", query, bucket, key)
    return S3_CLIENT.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={"JSON": {"Type": "DOCUMENT"}, "CompressionType": compression},
        OutputSerialization={'JSON': {}},
    )
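For reference, the call returns an event stream, and the InternalError can surface either from the call itself or while iterating the stream. A minimal sketch of consuming the result (the bucket and key below are placeholders; assumes LOGGER and S3_CLIENT are configured as in the question):

response = execute_select_query(
    "my-bucket", "exports/data.json.gz",
    "select count(*) as rowcount from s3object[*][*] s"
)

for event in response["Payload"]:
    if "Records" in event:
        # Query output arrives as raw bytes inside Records events.
        print(event["Records"]["Payload"].decode("utf-8"))
    elif "Stats" in event:
        details = event["Stats"]["Details"]
        print("Scanned:", details["BytesScanned"],
              "Processed:", details["BytesProcessed"])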

Issue automating CSV import to an RSQLite DB

I'm trying to automate writing CSV files to an RSQLite DB.
I am doing so by indexing csvFiles, which is a list of data.frame variables stored in the environment.
I can't seem to figure out why my dbWriteTable() code works perfectly fine when I enter it manually but not when I try to index the name and value fields.
### CREATE DB ###
mydb <- dbConnect(RSQLite::SQLite(), "")

# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in 1:length(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = csvFiles[i], overwrite = T)
  i = i + 1
}

# EXAMPLE CODE THAT SUCCESSFULLY MANUALLY IMPORTS INTO mydb
dbWriteTable(mydb, "DEPARTMENT", DEPARTMENT)
When I run the for loop above, I'm given this error:
"Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'DEPARTMENT': No such file or directory
# note that 'DEPARTMENT' is the value of csvFiles[1]
Here's the dput output of csvFiles:
c("DEPARTMENT", "EMPLOYEE_PHONE", "PRODUCT", "EMPLOYEE", "SALES_ORDER_LINE",
"SALES_ORDER", "CUSTOMER", "INVOICES", "STOCK_TOTAL")
I've researched this error and it seems to be related to my working directory; however, I don't really understand what to change, as I'm not even trying to manipulate files from my computer, simply data.frames already in my environment.
Please help!
Simply use get() for the value argument as you are passing a string value when a dataframe object is expected. Notice your manual version does not have DEPARTMENT quoted for value.
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in seq_along(csvFiles)) {
  dbWriteTable(mydb, name = csvFiles[i], value = get(csvFiles[i]), overwrite = T)
}
Alternatively, consider building a list of named dataframes with mget and loop element-wise between list's names and df elements with Map:
dfs <- mget(csvFiles)

output <- Map(function(n, d) dbWriteTable(mydb, name = n, value = d, overwrite = T),
              names(dfs), dfs)

pyhs2/hive No files matching path file and file Exists

Using the hive or beeline client, I have no problem executing this statement:
hive -e "LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2"
The data from the file is loaded successfully into hive.
However, when using pyhs2 from the same machine, the file is not found:
import pyhs2

conn_str = {'authMechanism': 'NOSASL', 'host': 'azus'}
conn = pyhs2.connect(**conn_str)
with conn.cursor() as cur:
    cur.execute("LOAD DATA LOCAL INPATH '/tmp/tmpBKe_Mc' INTO TABLE unit_test_hs2")
Throws exception:
Traceback (most recent call last):
File "data_access/hs2.py", line 38, in write
cur.execute("LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (csv_file.name, table_name))
File "/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.py", line 63, in execute
raise Pyhs2Exception(res.status.errorCode, res.status.errorMessage)
pyhs2.error.Pyhs2Exception: "Error while compiling statement: FAILED: SemanticException Line 1:23 Invalid path ''/tmp/tmpBKe_Mc'': No files matching path file:/tmp/tmpBKe_Mc"
I've seen similar questions posted about this problem, and the usual answer is that the query is running on a different server that doesn't have the local file '/tmp/tmpBKe_Mc' stored on it. However, if that is the case, why would running the command directly from the CLI work but using pyhs2 not work?
(Secondary question: how can I show which server is trying to handle the query? I've tried cur.execute("set"), which returns all configuration parameters but when grepping for "host" the returned parameters don't seem to contain a real hostname.)
Thanks!
This happens because pyhs2 submits the statement to HiveServer2, which resolves the LOCAL path on the server running the query, not on your machine.
The solution is to save your source file to an HDFS location instead of local /tmp and load it from there, as sketched below.
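A hedged sketch of that approach (the HDFS staging path is a placeholder; assumes the hdfs CLI is available on the client machine):

import subprocess
import pyhs2

# 1. Copy the local temp file into HDFS so HiveServer2 can see it.
subprocess.check_call(["hdfs", "dfs", "-put", "-f",
                       "/tmp/tmpBKe_Mc", "/user/myuser/staging/tmpBKe_Mc"])

# 2. Load from HDFS; note there is no LOCAL keyword this time.
conn = pyhs2.connect(authMechanism='NOSASL', host='azus')
with conn.cursor() as cur:
    cur.execute("LOAD DATA INPATH '/user/myuser/staging/tmpBKe_Mc' "
                "INTO TABLE unit_test_hs2")

Be aware that LOAD DATA INPATH moves the file out of the staging directory rather than copying it.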