Azure Event Hubs Capture to Storage with Data Lake Gen2 enabled - azure-data-lake

I'm trying to use the Capture feature of Event Hubs to store in a Storage Account v2 with Data Lake Storage Gen2 enabled.
In the portal, after choosing the Storage Account, the containers don't show up and I can't create a new one.
In Azure CLI, I ran the following command:
az eventhubs eventhub update -n hubtest --namespace-name #removed# -g #removed# --enable-capture True --capture-interval 300 --capture-size-limit 262144000 --storage-account #removed# --blob-container #removed# --destination-name capturetest
And I'm getting the following error:
'NoneType' object has no attribute 'enabled'
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/knack/cli.py", line 206, in invoke
cmd_result = self.invocation.execute(args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 328, in execute
raise ex
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 386, in _run_jobs_serially
results.append(self._run_job(expanded_arg, cmd_copy
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 379, in _run_job
six.reraise(*sys.exc_info())
File "/opt/az/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 356, in _run_job
result = cmd_copy(params)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 171, in __call__
return self.handler(*args, **kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/arm.py", line 477, in handler
instance = custom_function(instance=instance, **custom_func_args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/eventhubs/custom.py", line 112, in cli_eheventhub_update
instance.capture_description.enabled = enabled
AttributeError: 'NoneType' object has no attribute 'enabled'

I can reproduce your issue. It seems that enabling Azure Event Hubs Capture to a Data Lake Storage Gen2 account is not supported yet; keep in mind that Data Lake Storage Gen2 is still in preview.
See this link: https://learn.microsoft.com/en-gb/azure/storage/blobs/data-lake-storage-upgrade?toc=%2fazure%2fstorage%2fblobs%2ftoc.json#azure-ecosystem

As long as you have first created your Azure Storage account with Data Lake Storage Gen2 - see the image from the portal below:
[Enable Data Lake Storage Gen2 on storage account]
https://i.stack.imgur.com/J55kC.png
You can then just use 'Azure Storage' as the capture provider and proceed to select the storage account container - see the image from the portal below:
[storage account selection]
https://i.stack.imgur.com/FhI1x.png
Note: If you don't already have a container configured, you will be asked to create one as part of the selection steps.
Bit of an old question I know, but I needed to do just that today. Hope it helps.
Reference:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-enable-through-portal

Event Hubs Capture is now supported on Azure Data Lake Storage Gen2.

What is the recommended architecture for using amazon neptune in a scalable way?

I am building an application backed by a Neptune database. Because I want the application to be scalable, I am using AWS Lambda + API gateway to build a REST API to interact with the database. This seems to be a reasonable idea based on the fact that this use case is documented in the Neptune docs.
The Neptune docs recommend reusing the websocket connection to the database across the entire execution context of the function, which is what I am doing at the moment. The docs also recommend resetting the connection and retrying upon errors (see here), which I am also using. However, I am seeing exceptions every now and then (perhaps every 20 requests on average). One of the exceptions I get is
ConnectionResetError: Cannot write to closing transport
which seems to be the same as this issue.
The other one is:
Traceback (most recent call last):
File "/var/task/chalice/app.py", line 1685, in _get_view_function_response
response = view_function(**function_args)
File "/var/task/app.py", line 57, in resource
return Resource(app.current_request, g).process()
File "/var/task/backoff/_sync.py", line 94, in retry
ret = target(*args, **kwargs)
File "/var/task/chalicelib/handlers/resource.py", line 106, in get
values = resources.valueMap().with_(WithOptions.tokens).toList()
File "/var/task/gremlin_python/process/traversal.py", line 57, in toList
return list(iter(self))
File "/var/task/gremlin_python/process/traversal.py", line 47, in __next__
self.traversal_strategies.apply_strategies(self)
File "/var/task/gremlin_python/process/traversal.py", line 548, in apply_strategies
traversal_strategy.apply(traversal)
File "/var/task/gremlin_python/driver/remote_connection.py", line 63, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/var/task/gremlin_python/driver/driver_remote_connection.py", line 60, in submit
results = result_set.all().result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/task/gremlin_python/driver/resultset.py", line 90, in cb
f.result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/lang/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/var/task/gremlin_python/driver/connection.py", line 82, in _receive
data = self._transport.read()
File "/var/task/gremlin_python/driver/aiohttp/transport.py", line 104, in read
raise RuntimeError("Connection was already closed.")
RuntimeError: Connection was already closed.
In case it is relevant, I am using gremlinpython==3.5.1
It seems to me that these issues are ultimately a consequence of using AWS Lambda, namely the mismatch between the longevity of websocket connections and the ephemeral nature of Lambda execution contexts. The question then is: am I doing the wrong thing by trying to use AWS Lambda for my API? Would it be more appropriate to set up an EC2 instance and deal with scalability some other way?
P.S. Previously I created and closed a connection in every function execution (as previously recommended in the Neptune docs), which worked fine but was naturally slow.
The latest version of Neptune only supports Gremlin 3.4.11 (https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.5.1.html). I would start by using gremlin-python 3.4.11 and see if that resolves your issue. gremlin-python 3.5 replaced Tornado with aiohttp (ref) for websocket connections, and I suspect that change may be causing a slight difference in behavior that a future release supporting Gremlin 3.5 will address.
I wonder whether the 'Connection was already closed' error message is not being treated as a retriable error by the retry logic?
What happens if you add this error message to the list of retriable_error_msgs in the Python example in the docs?
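For illustration, here is a minimal sketch of that idea, not the exact code from the Neptune docs (the endpoint and query are placeholders): reset the connection and retry with exponential backoff whenever the error message matches a known retriable message.
import backoff
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NEPTUNE_ENDPOINT = "wss://your-neptune-endpoint:8182/gremlin"  # hypothetical endpoint

# Messages treated as retriable; the last one is the error discussed above.
retriable_error_msgs = [
    "Cannot write to closing transport",
    "Connection was already closed.",
]

conn = None
g = None

def reset_connection():
    global conn, g
    if conn is not None:
        try:
            conn.close()
        except Exception:
            pass
    conn = DriverRemoteConnection(NEPTUNE_ENDPOINT, "g")
    g = traversal().withRemote(conn)

def is_retriable(e):
    return any(msg in str(e) for msg in retriable_error_msgs)

@backoff.on_exception(backoff.expo, (RuntimeError, ConnectionResetError),
                      max_tries=5, giveup=lambda e: not is_retriable(e))
def query_vertices():
    if g is None:
        reset_connection()
    try:
        return g.V().limit(10).valueMap().toList()
    except Exception as e:
        if is_retriable(e):
            reset_connection()  # drop the stale websocket before the retry
        raise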

Streaming job failure-State Schema not Compatible issue

My streaming job is now failing with the below error. The job worked fine for almost two months; it is a completely stateless transformation that just appends new rows to the destination Delta table. Before streaming, I manually provide the schema for the CSV files, and I have verified that the streaming job's schema and the downstream table's schema match perfectly, including the data types.
I'm not sure why I'm getting the below error even for a stateless transformation. Any help would be appreciated.
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "/databricks/spark/python/pyspark/sql/utils.py", line 195, in call
raise e
File "/databricks/spark/python/pyspark/sql/utils.py", line 192, in call
self.func(DataFrame(jdf, self.sql_ctx), batch_id)
File "<command-422857213447422>", line 2, in write_to_managed_table
print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
File "/databricks/spark/python/pyspark/sql/dataframe.py", line 670, in count
return int(self._jdf.count())
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/sql/utils.py", line 110, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o433.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 13792.0
failed 4 times, most recent failure: Lost task 28.3 in stage 13792.0 (TID 752198)
(10.139.64.13 executor 45):
org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema
doesn't match to the schema for existing state! Please note that Spark allow difference of
field name: check count of fields and data type of each field.
There might be a problem with the CSV files; one of them could be corrupted.
You can skip corrupted records by setting the "mode" option to "PERMISSIVE" or "DROPMALFORMED".
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html#csv(path:String):org.apache.spark.sql.DataFrame
spark.read.format("csv")
.option("header,"true")
.option("path","your.csv")
.option("mode","DROPMALFORMED")
.schema(csvSchema)
.load()
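For a streaming read the same option applies. Below is a minimal PySpark sketch under assumed paths and schema (none of these names come from the original job; spark is the session provided by the Databricks notebook):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema and paths, for illustration only.
csv_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

stream_df = (spark.readStream
    .format("csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")  # silently drop malformed records
    .schema(csv_schema)
    .load("/mnt/landing/csv/"))

(stream_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/csv_to_delta")
    .start("/mnt/delta/target_table"))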

Saving Matplotlib Output to Blob Storage on Databricks

I'm trying to write matplotlib figures to the Azure blob storage using the method provided here:
Saving Matplotlib Output to DBFS on Databricks.
However, when I replace the path in the code with
path = 'wasbs://test#someblob.blob.core.windows.net/'
I get this error
[Errno 2] No such file or directory: 'wasbs://test#someblob.blob.core.windows.net/'
I don't understand the problem...
As per my research, you cannot save Matplotlib output to Azure Blob Storage directly.
You may follow the below steps to save Matplotlib output to Azure Blob Storage:
Step 1: You need to first save it to the Databricks File System (DBFS) and then copy it to Azure Blob Storage.
Saving Matplotlib output to Databricks File System (DBFS): We are using the below command to save the output to DBFS: plt.savefig('/dbfs/myfolder/Graph1.png')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'fruits':['apple','banana'], 'count': [1,2]})
plt.close()
df.set_index('fruits',inplace = True)
df.plot.bar()
plt.savefig('/dbfs/myfolder/Graph1.png')
Step 2: Copy the file from Databricks File System to Azure Blob Storage.
There are two methods to copy the file from DBFS to Azure Blob Storage.
Method 1: Access Azure Blob storage directly
Access Azure Blob Storage directly by setting the account key with spark.conf.set, then copy the file from DBFS to Blob Storage.
spark.conf.set("fs.azure.account.key.<Blob Storage Name>.blob.core.windows.net", "<Azure Blob Storage Key>")
Use dbutils.fs.cp to copy the file from DBFS to Azure Blob Storage:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'wasbs://<Container>@<Storage Name>.blob.core.windows.net/Azure')
Method 2: Mount Azure Blob storage containers to DBFS
You can mount a Blob storage container or a folder inside a container to Databricks File System (DBFS). The mount is a pointer to a Blob storage container, so the data is never synced locally.
dbutils.fs.mount(
  source = "wasbs://sampledata@chepra.blob.core.windows.net/Azure",
  mount_point = "/mnt/chepra",
  extra_configs = {"fs.azure.sas.sampledata.chepra.blob.core.windows.net": dbutils.secrets.get(scope = "azurestorage", key = "azurestoragekey")})
Use dbutils.fs.cp to copy the file to the Azure Blob Storage container via the mount:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'dbfs:/mnt/chepra/Graph1.png')
By following Method1 or Method2 you can successfully save the output to Azure Blob Storage.
For more details, refer "Databricks - Azure Blob Storage".
Hope this helps. Do let us know if you have any further queries.
You can write with .savefig() directly to Azure Blob Storage; you just need to mount the blob container first.
The following works for me, where I had mounted the blob container as /mnt/mydatalakemount
plt.savefig('/dbfs/mnt/mydatalakemount/plt.png')
or
fig.savefig('/dbfs/mnt/mydatalakemount/fig.png')
Documentation on mounting blob container is here.
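For completeness, a minimal sketch putting the mount and the savefig call together (the container, storage account, and secret scope names are placeholders, not values from this answer):
import matplotlib.pyplot as plt

# Mount the container once; skip if the mount point already exists.
if not any(m.mountPoint == "/mnt/mydatalakemount" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="wasbs://<container>@<storage-account>.blob.core.windows.net",
        mount_point="/mnt/mydatalakemount",
        extra_configs={"fs.azure.account.key.<storage-account>.blob.core.windows.net":
                       dbutils.secrets.get(scope="<scope>", key="<key>")})

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
# Local file APIs such as savefig reach DBFS through the /dbfs FUSE prefix.
fig.savefig("/dbfs/mnt/mydatalakemount/fig.png")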
This is what I also came up with so far. In order to reload the image from blob storage and display it as a PNG in a Databricks notebook again, I use the following code:
from io import BytesIO
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

blob_path = ...
dbfs_path = ...
dbutils.fs.cp( blob_path, dbfs_path )
with open( dbfs_path, "rb" ) as f:
    im = BytesIO( f.read() )
img = mpimg.imread( im )
imgplot = plt.imshow( img )
display( imgplot.figure )
I didn't succeed using dbutils, which could not be created correctly in my environment.
But I did succeed by mounting the file-shares to a Linux path, like this:
https://learn.microsoft.com/en-us/azure/azure-functions/scripts/functions-cli-mount-files-storage-linux

SQLite3 database is Locked in Azure

I have a Flask server running on Azure App Service with sqlite3 as the database. I am unable to update sqlite3, as it reports that the database is locked.
2018-11-09T13:21:53.854367947Z [2018-11-09 13:21:53,835] ERROR in app: Exception on /borrow [POST]
2018-11-09T13:21:53.854407246Z Traceback (most recent call last):
2018-11-09T13:21:53.854413046Z File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
2018-11-09T13:21:53.854417846Z response = self.full_dispatch_request()
2018-11-09T13:21:53.854422246Z File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
2018-11-09T13:21:53.854427146Z rv = self.handle_user_exception(e)
2018-11-09T13:21:53.854431646Z File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
2018-11-09T13:21:53.854436146Z reraise(exc_type, exc_value, tb)
2018-11-09T13:21:53.854440346Z File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
2018-11-09T13:21:53.854444746Z raise value
2018-11-09T13:21:53.854448846Z File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
2018-11-09T13:21:53.854453246Z rv = self.dispatch_request()
2018-11-09T13:21:53.854457546Z File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
2018-11-09T13:21:53.854461846Z return self.view_functions[rule.endpoint](**req.view_args)
2018-11-09T13:21:53.854466046Z File "/home/site/wwwroot/application.py", line 282, in borrow
2018-11-09T13:21:53.854480146Z cursor.execute("UPDATE books SET stock = stock - 1 WHERE bookid = ?",(bookid,))
2018-11-09T13:21:53.854963942Z sqlite3.OperationalError: database is locked
Here is the route -
@app.route('/borrow', methods=["POST"])
def borrow():
    # import pdb; pdb.set_trace()
    body = request.get_json()
    user_id = body["userid"]
    bookid = body["bookid"]
    conn = sqlite3.connect("database.db")
    cursor = conn.cursor()
    date = datetime.now()
    expiry_date = date + timedelta(days=30)
    cursor.execute("UPDATE books SET stock = stock - 1 WHERE bookid = ?", (bookid,))
    # conn.commit()
    cursor.execute("INSERT INTO borrowed (issuedate,returndate,memberid,bookid) VALUES (?,?,?,?)", ("xxx", "xxx", user_id, bookid,))
    conn.commit()
    cursor.close()
    conn.close()
    return json.dumps({"status": 200, "conn": "working with datess update"})
I tried checking the database integrity using a PRAGMA; there was no integrity loss. So I don't know what might be causing that error. Any help is appreciated :)
I use Azure App Service on Docker on Linux and have the same issue. If you are using Azure App Service on Windows, the problem is different from mine.
The problem is that /home is mounted as a CIFS filesystem, which cannot handle SQLite3 locking.
My workaround is to copy the db.sqlite3 file to some directory other than /home, and properly set the permissions and ownership of the db.sqlite3 file and its directory, then let my project read/write it there. However, this workaround is pretty awkward, so I don't recommend it.
Presumably this solution is not safe for production workloads but at least I got it working by executing the following command:
sqlite3 <database-file> 'PRAGMA journal_mode=wal;'
After running the above command, my database stored on an Azure File share works inside a container Web App.
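For reference, the same pragma can be set from Python before the app starts serving requests; here is a minimal sketch (the database file name mirrors the question's code and is only illustrative):
import sqlite3

conn = sqlite3.connect("database.db")
# WAL mode is persisted in the database file itself, so this only needs to run once;
# later connections will open the database in WAL mode automatically.
conn.execute("PRAGMA journal_mode=WAL;")
conn.close()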
I got it working by setting up the Azure mount options with the following configuration:
dir_mode=0777,file_mode=0777,uid=0,gid=0,mfsymlinks,nobrl,cache=strict
But the essential part is the nobrl flag (no byte-range locks).
A StorageClass example for Kubernetes:
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azureclass
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - nobrl
  - cache=strict
parameters:
  skuName: Standard_LRS
This answer appears toward the top of a typical Google search for this issue so I thought I'd add a couple of additional tips:
For those running JavaScript and using Sequelize as the interface to your SQLite DB, running
await sequelize.query('PRAGMA journal_mode=WAL;')
prior to creating your database will allow you to read/write the DB file in an Azure web app running under a Linux service plan. I have a separate script that creates one via a call to sequelize.sync(). I'm storing the DB file in a separate directory under /home within the file system for the Linux container. It seems to run fine and my workload is expected to be very light. Note that you don't need to set the journal mode again when your app starts and you try to connect to the database, that mode will be set in the file itself (this wasn't obvious from the SQLite docs).

How to upgrade odoo 8 to odoo 9 database?

I am trying to upgrade an odoo installation from 8.0 to 9.0. What I've done so far is the following:
Backup the odoo database from the production system
Installed the backup DB as test in my current system
Copied the odoo folder in a folder on my system
Checked, if everything works. It works!
Updated to the latest v8.0 version, still works
Did a git checkout 9.0 followed by a git pull.
Started odoo 9.0 with the command ./openerp-server -d testDB -u all
This command breaks with the following error and does not update my database:
LINE 1: select model, transient from ir_model where state='manual'
^
, in query select model, transient from ir_model where state=%s
2015-10-26 00:37:29,823 4501 CRITICAL testDB openerp.service.server:
Failed to initialize database `testDB`.
Traceback (most recent call last):
File "/opt/odoo/openerp/service/server.py", line 885, in preload_registries
registry = RegistryManager.new(dbname, update_module=update_module)
File "/opt/odoo/openerp/modules/registry.py", line 385, in new
openerp.modules.load_modules(registry._db, force_demo, status, update_module)
File "/opt/odoo/openerp/modules/loading.py", line 279, in load_modules
loaded_modules, processed_modules = load_module_graph(cr, graph, status, perform_checks=update_module, report=report)
File "/opt/odoo/openerp/modules/loading.py", line 136, in load_module_graph
registry.setup_models(cr, partial=True)
File "/opt/odoo/openerp/modules/registry.py", line 185, in setup_models
cr.execute('select model, transient from ir_model where state=%s', ('manual',))
File "/opt/odoo/openerp/sql_db.py", line 139, in wrapper
return f(self, *args, **kwargs)
File "/opt/odoo/openerp/sql_db.py", line 215, in execute
res = self._obj.execute(query, params)
ProgrammingError: column "transient" does not exist
LINE 1: select model, transient from ir_model where state='manual'
Are there any steps I have to follow to upgrade the database, or does everything have to be done by hand? And if so, what should I do? Obviously it failed because that specific column does not exist in my database. But is there any update script? I fear that if I change this by hand, the next error will just be waiting for me.
Thanks in advance.
You can ask the Odoo company to do that task for you by going to this link, but they will charge money for it. If you want to do it yourself, here is the documentation on how to do that:
https://doc.therp.nl/openupgrade/intro.html
Option 2: We can use pgAdmin (a PostgreSQL GUI tool). Select your database, open the SQL window at the top, and issue a query to display all the data (you must know the name of the table you want to retrieve), then export it. The exported file contains all the data with column headings; you may have to rearrange the columns to match the Odoo 9 database. Once that is done, select the Odoo 9 database, right-click on the table you want to import the data into, and choose the import option. It may take a while, and it should finish with the message "data imported successfully".
I found the answer on GitHub.
The trick is to create a field called transient, a Boolean with the default value false, in the table ir_model.
As I expected, this is not the complete solution, as there are other problems in the database that need adjustments.
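For illustration only, here is a minimal sketch of that manual fix using psycopg2 (the connection parameters are placeholders, and as noted above this only addresses the first of several schema differences):
import psycopg2

# Placeholder connection settings; adjust to your environment.
conn = psycopg2.connect(dbname="testDB", user="odoo", password="odoo", host="localhost")
cur = conn.cursor()
# Add the column the 9.0 code base expects, defaulting to false like a fresh install.
cur.execute("ALTER TABLE ir_model ADD COLUMN transient boolean NOT NULL DEFAULT false;")
conn.commit()
cur.close()
conn.close()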
You are trying to run an Odoo 8.0 database on Odoo 9.0.
The column 'transient' exists in the 9.0 code base but not in 8.0, so the 8.0 database is being run on the 9.0 code base without having been upgraded properly.
As stated in the previous answer, you can either get Odoo to do the upgrade or do it yourself.