Query blob storage with Get-AzDataLakeGen2ChildItem? - azure-powershell

Our PowerShell test harness used to use Get-AzDataLakeGen2ChildItem to list blobs in non-Data Lake storage accounts. Today I updated the PowerShell and Az module versions we had been locked to, and now, when issuing the command (specifying a FileSystem/container and a context), the following error is returned:
Get-AzDataLakeGen2ChildItem: Input string was not in a correct format.
I'm assuming something has changed and this function can no longer handle results from non-Data Lake storage.
For one reason or another, we switched away from Get-AzStorageBlob a while back, so I'm interested to know whether there's any way to keep this call working, rather than having to deviate from Get-AzDataLakeGen2ChildItem where required.

One workaround is to list the subdirectories and files in a directory or filesystem of an Azure Storage account using Get-AzDataLakeGen2ChildItem.
To do that, the storage account must have Hierarchical Namespace enabled.
With that in place, a listing will look something like the sketch further below.
NOTE: If you are using an existing storage account that does not have Hierarchical Namespace enabled, you need to upgrade that storage account first (the upgrade can be performed from the Azure portal).
For more information, please refer to the Microsoft documentation for Get-AzDataLakeGen2ChildItem and Get-AzStorageBlob, and the Stack Overflow thread covering a similar issue.
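
As a rough illustration, here is what the two listing approaches look like; the resource group, account, filesystem and container names are placeholders, and this assumes the Az.Storage module is loaded and you are already signed in with Connect-AzAccount:

# Check whether the account has Hierarchical Namespace enabled (placeholder names).
$account = Get-AzStorageAccount -ResourceGroupName "my-rg" -Name "mystorageacct"
$account.EnableHierarchicalNamespace    # expected to be True before relying on the Gen2 cmdlets

# OAuth-based context; requires a suitable data-plane role such as Storage Blob Data Reader.
$ctx = New-AzStorageContext -StorageAccountName "mystorageacct" -UseConnectedAccount

# Hierarchical Namespace enabled: list directories and files in a filesystem.
Get-AzDataLakeGen2ChildItem -Context $ctx -FileSystem "myfilesystem" -Recurse

# Flat (non-HNS) account: fall back to the blob cmdlet instead.
Get-AzStorageBlob -Context $ctx -Container "mycontainer"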

Related

Azure Blob Storage - How to read source string: wasbs://training@dbtrainsouthcentralus.blob.core.windows.net

I am doing a lab for an Azure Data course and there was some code to run from within Azure Databricks.
I noticed that it seemed to mount something from the following location:
wasbs://training@dbtrainsouthcentralus.blob.core.windows.net
So I am trying to figure out how to deconstruct the above string
wasbs looks to mean "windows azure storage blob"
The string training@dbtrainsouthcentralus.blob.core.windows.net looks like it means "container name"@"account name" - which I would think should be something in my Azure Data Lake.
I dug around in my ADLS and was not able to find anything related to "training@dbtrainsouthcentralus.blob.core.windows.net".
So I was wondering, where on earth did this come from? How can I trace back to where this path came from?
The URL is indeed constructed as follows:
wasbs://[container-name]@[storage-account-name].blob.core.windows.net/[directory-name] (source)
I dug around in my ADLS ...
You won't find it in ADLS; it is a separate resource in your subscription. There should be a storage account named dbtrainsouthcentralus.
Note: it could also be a publicly accessible storage account in some training subscription you do not have access to, provided by Microsoft for training purposes.
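
If you want to verify from PowerShell whether that account lives in one of your own subscriptions (it may well not, per the note above), a minimal sketch, assuming the Az module and an active Connect-AzAccount session, would be:

# Search every subscription you can see for the storage account name from the wasbs URL.
Get-AzSubscription | ForEach-Object {
    Set-AzContext -Subscription $_.Id | Out-Null
    Get-AzStorageAccount | Where-Object { $_.StorageAccountName -eq "dbtrainsouthcentralus" }
}

If this returns nothing, the account almost certainly sits in a subscription you do not have access to.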

Download big number of files (400k) from S3 bucket into Azure Datalake Gen2 using Azure Data Factory

I need to download a large number of files (around 400k) from an S3 bucket. I have the paths stored in a CSV file. Some of the paths may not exist.
The two options I see are:
Use the ForEach activity and somehow pass the contents of the file to it. But I think this would flood my monitor pane with a huge number of runs, and it feels like it is meant for smaller pipelines.
Use the listOfFiles option which is supported by the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Data Lake Gen2 (if anybody knows why, please let me know as well).
I have tried the listOfFiles way, but the pipeline fails as soon as it hits the first missing file. The fault tolerance options contain a "skip missing file" option, but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can I approach this with ADF? I'm looking for a solution that uses the predefined transformations; ideally I would like not to involve Azure Batch or Azure Functions for such a simple task.

AWS Data pipeline postStepCommand unable to access INPUT1_STAGING_DIR

In the EMR Activity of a Data Pipeline, I am trying to use postStepCommand (as documented here) to invoke a shell script. As part of it, I am trying to access the standard directory paths ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR}.
But it seems it's not able to access their values. Is this by design?

Transferring Storage Accounts Table into Data Lake using Data Factory

I am trying to use Data Factory to transfer a table from Storage Accounts into Data Lake. Microsoft claims that one can "store files of arbitrary sizes and formats into Data Lake". I use the online wizard and try to create a pipeline. The pipeline gets created, but I then always get an error saying:
Copy activity encountered a user error: ErrorCode=UserErrorTabularCopyBehaviorNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=CopyBehavior property is not supported if the source is tabular data source.,Source=Microsoft.DataTransfer.ClientLibrary,'.
Any suggestions what I can do to be able to use Data Factory to transfer data from Storage Accounts table into Data Lake?
Thanks.
Your case is supported by ADF. As for the error you hit, there is a known defect where, in some cases, the copy wizard mis-generates a "CopyBehavior" property that is not applicable. We are fixing that now.
As a workaround, go to the Azure portal -> Author and deploy -> select that pipeline -> find the "CopyBehavior": "MergeFiles" under AzureDataLakeStoreSink (see the illustrative fragment below) and remove that line -> then deploy and rerun the activity.
If you happened to author a run-once pipeline, please re-author it as a scheduled one, given that the former is hard to update using JSON.
Thanks,
Linda
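
For orientation, the sink section of the generated pipeline JSON looks roughly like the fragment below; apart from the CopyBehavior/MergeFiles pair and the AzureDataLakeStoreSink type, which come from the answer above, the surrounding structure is only illustrative:

"sink": {
    "type": "AzureDataLakeStoreSink",
    "CopyBehavior": "MergeFiles"
}

Deleting the "CopyBehavior": "MergeFiles" line and redeploying is the workaround described above.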

WSO2 Gadget Gen Tool

I have an external Hadoop cluster (CDH4) with Hive. I used the Gadget Gen tool (BAM 2.3.0) to create a simple table gadget, but no data is populated when I add the gadget to a dashboard using the URL supplied from the gadget gen tool.
Here are my data source settings from the Gadget Generator Wizard:
jdbc:hive://x.x.x.x:10000/default
org.apache.hadoop.hive.jdbc.HiveDriver
I added the following jar files to make sure I had everything required for the JDBC connection and restarted wso2server:
hive-exec-0.10.0-cdh4.2.0.jar hive-jdbc-0.10.0-cdh4.2.0.jar
hive-metastore-0.10.0-cdh4.2.0.jar hive-service-0.10.0-cdh4.2.0.jar
libfb303-0.9.0.jar commons-logging-1.0.4.jar slf4j-api-1.6.4.jar
slf4j-log4j12-1.6.1.jar hadoop-core-2.0.0-mr1-cdh4.2.0.jar
I see map reduce jobs running on my cluster during step 2 and 3 of the wizard (and the wizard shows me previews of the actual data), but I don't see any jobs submitted after the gadget is generated.
Any help appreciated.
The Gadget Gen tool is for RDBMS databases such as MySQL, H2, etc. You can't provide a Hive URL in the Gadget Gen tool and run it.
Generally in WSO2 BAM, Hive is used to summarize the collected data that was stored in Cassandra and to write the summarized final result to an RDBMS database. Then, from the Gadget Gen tool, the gadget XMLs are created by pointing to the RDBMS database where the final result is stored.
You can find more information on WSO2 BAM 2.3.0 documentation. http://docs.wso2.org/wiki/display/BAM230/Gadget+Generation+Tool
Make sure the URL generated for the location of the Gadget XML has the correct IP/host name. Check whether the given gadget XML is located in the registry location of the generated URL. You do not have to worry about the Hive/Hadoop/Cassandra stuff, as it is not relevant to the gadget; only the RDBMS (H2 by default) data matters. Hopefully your problem will be resolved once the gadget location is corrected.
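
For comparison with the Hive settings quoted in the question, a data source that the Gadget Gen tool can actually work with would point at an RDBMS. The host, port, database name and file path below are placeholders rather than values from this setup:

jdbc:mysql://localhost:3306/bam_summary_db
com.mysql.jdbc.Driver

jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
org.h2.Driver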