Metadata in Azure Data Lake

I have written an Azure Function in C# that recursively goes through the data lake and generates a file with metadata (file name, path, size, modified date, etc.) for all files and folders in the data lake.
This takes quite a while since we have a lot of files and folders. So I was wondering whether there is a metadata store that we could pull this data from directly? I'm thinking of something like the sys tables in SQL Server.
Thanks in advance!

There are some features around file information that will soon be released and that give you some of the file system metadata properties, but you would still need to enumerate your folder hierarchies yourself.
For example:
@data =
    EXTRACT vehicle_id int
          , entry_id long
          , event_date DateTime
          , latitude float
          , longitude float
          , speed int
          , direction string
          , trip_id int?
          , uri = FILE.URI()
          , modified_date = FILE.MODIFIED()
          , created_date = FILE.CREATED()
          , file_sz = FILE.LENGTH()
    FROM "/Samples/Data/AmbulanceData/vehicle{*}"
    USING Extractors.Csv();

OUTPUT @data
TO "/output/releasenotes/winter2018/fileprops.csv"
USING Outputters.Csv(outputHeader : true);
I suggest that you file a request for a file system meta-data catalog view (e.g., usql.files and usql.filesystem) at http://aka.ms/adlfeedback to augment our metadata catalog views.

Related

How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from two different sources using Azure Mapping Data Flow and loaded into an ADLS Gen2 data lake container/folder, for example /staging/EDW/Current/products.parquet.
I now need to process this file in staging using Azure Mapping Data Flow and load it into its corresponding dimension table using the SCD Type 2 method to maintain history.
However, I want to try creating and processing this dimension table as a "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. The problem is that SCD Type 2 requires a source lookup to check whether there are any existing records/rows and, if not, insert them all, or update the records that changed, and so on (say, during the first-time load).
For that, I need to first create a default/blank "Delta" table in an Azure Data Lake folder, for example /curated/Delta/Dimension/Products/, just like we would have done in Azure SQL DW (Dedicated Pool), where we could have first created a blank dbo.dim_products table with just the schema/structure and no rows.
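For illustration, this is roughly the kind of empty table I mean (a sketch only; the column list mirrors the Delta table definition further below, and the distribution/index options are just placeholders):
-- Empty dimension table in a dedicated SQL pool: structure only, no rows.
CREATE TABLE dbo.dim_products
(
    product_skey     INT,              -- surrogate key
    product_id       INT,              -- business key
    product_name     NVARCHAR(4000),
    product_category NVARCHAR(4000),
    product_price    DECIMAL(38, 18),
    valid_from       DATE,             -- SCD Type 2 validity window
    valid_to         DATE,
    is_active        CHAR(1)
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);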
I am trying to implement a data lakehouse architecture by utilizing and evaluating the best features of both Delta Lake and the Azure Synapse Serverless SQL pool using Azure Mapping Data Flow, for performance, cost savings, ease of development (low code), and understanding. At the same time, I want to avoid a Logical Data Warehouse (LDW) kind of architecture implementation for now.
For this, I tried creating a new database under the built-in Azure Synapse Serverless SQL pool and defined a data source, a file format, and a blank Delta table/schema structure (without any rows), but no luck.
create database delta_dwh;
create external data source deltalakestorage
with ( location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/' );
create external file format deltalakeformat
with (format_type = delta);
drop external table products;
create external table dbo.products
(
product_skey int,
product_id int,
product_name nvarchar(max),
product_category nvarchar(max),
product_price decimal (38,18),
valid_from date,
valid_to date,
is_active char(1)
)
with
(
location='https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products',
data_source = deltalakestorage,
file_format = deltalakeformat
);
However, this fails since a Delta table/file requires a _delta_log/*.json folder/file to be present, which maintains the transaction log. That means I would first have to write a few (dummy) rows in Delta format to the target folder, and only then could I read it and run queries like the following, which is used in the SCD Type 2 implementation:
select isnull(max(product_skey), 0)
FROM OPENROWSET(
BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/*.parquet',
FORMAT = 'DELTA') as rows
Any thoughts, inputs, or suggestions?
Thanks!
You may try to create an initial/dummy data flow + pipeline to create these empty Delta files.
It's only a simple workaround.
Create a CSV with your sample table data.
Create a data flow named initDelta.
Use this CSV as the source in the data flow.
In the projection panel, set up the correct data types.
Add a filter after the source and set up a dummy filter condition such as 1=2.
Add a sink with Delta output.
Put your initDelta data flow into a dummy pipeline and run it.
The folder structure for Delta should be created.
You mentioned that your initial data is in a Parquet file. You can use this file; the table schema (columns and data types) will be imported from the file. Filter out all rows and save the result as Delta.
I think this should work, unless I have missed something in your problem.
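Once the empty Delta folder and its _delta_log exist, the serverless pool should be able to read it. A rough sketch, reusing the storage path and key column from your question (note that with FORMAT = 'DELTA' the BULK path points at the Delta table's root folder, not at a *.parquet wildcard):
-- Returns 0 while the Delta table is still empty, otherwise the current max surrogate key.
SELECT ISNULL(MAX(product_skey), 0) AS max_product_skey
FROM OPENROWSET(
        BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
        FORMAT = 'DELTA'
     ) AS [result];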
I don't think you can use the Serverless SQL pool to create a Delta table... yet. I think it is coming soon, though.

Azure Synapse Serverless - SQL query to return rows in directory for each file

I have an Azure Data Lake Gen2 Container in which I have several json files. I would like to write a query that returns a record for each file. I am not interested in parsing the files, I just want to know what files are there and have this returned in a view. Does anyone have any tips on how I might do this? Everything I have found is about how to parse/read the files...I am going to let Power BI do that since the json format is not standard. In this case I just need a listing of files. Thanks!
You can use the filepath() and filename() function in Azure Synapse Analytics serverless SQL pools to return those. You can even GROUP BY them to return aggregated results. A simple example:
SELECT
    [result].filepath() AS filepath,
    [result].filename() AS filename,
    COUNT(*) AS records
FROM OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=2019/puMonth=4/*.parquet',
        FORMAT = 'PARQUET'
     ) AS [result]
GROUP BY [result].filepath(), [result].filename()
See the documentation for further examples.
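If you also want this returned as a view over a JSON container, one possible sketch is below. The storage URL and view name are placeholders, and it relies on the 0x0b-delimiter trick so that the JSON content itself is never parsed; run it in a user database rather than master.
-- List JSON files (one row per file after grouping) without parsing their content.
CREATE VIEW dbo.json_file_listing AS
SELECT
    [result].filepath() AS filepath,
    [result].filename() AS filename,
    COUNT(*) AS line_count          -- rough line count per file, not JSON records
FROM OPENROWSET(
        BULK 'https://yourstorageaccount.dfs.core.windows.net/yourcontainer/*.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',   -- 0x0b never occurs in the data,
        FIELDQUOTE = '0x0b'         -- so each line is read as a single text column
     ) WITH (doc NVARCHAR(MAX)) AS [result]
GROUP BY [result].filepath(), [result].filename();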

Data Factory Childitem modified or created date

I have a Data Factory V2 pipeline consisting of 'Get Metadata' and 'ForEach' activities that reads a list of files on a file share (on-premises) and logs it in a database table. Currently, I'm only able to read the file name, but I would also like to retrieve the date modified and/or date created property of each file. Any help, please?
Thank you
According to the MS documentation, both the File system and SFTP connectors support the lastModified property, but we can only get the lastModified of one file or folder at a time.
I'm using the File system connector to do the test. The process is basically the same as in the previous post; we need to add a Get Metadata activity inside the ForEach activity.
First, I created a table for logging.
create table Copy_Logs (
Copy_File_Name varchar(max),
Last_modified datetime
)
In ADF, I'm using Child Items in the Get Metadata1 activity to get the file list of the folder.
Then I add the dynamic content @activity('Get Metadata1').output.childItems to the ForEach1 activity.
Inside the ForEach1 activity, I'm using Last modified in the Get Metadata2 activity.
In the dataset of the Get Metadata2 activity, I key in @item().name.
I'm using a CopyFiles_To_Azure activity to copy the local files to Azure Data Lake Storage Gen2.
I key in @item().name in the source dataset of the CopyFiles_To_Azure activity.
In the Create_Logs activity, I'm using the following SQL to get the info we need.
select '@{item().name}' as Copy_File_Name, '@{activity('Get Metadata2').output.lastModified}' as Last_modified
In the end, sink to the SQL table we created previously.
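Once the pipeline has run, the log table can be queried directly, for example to see the most recently modified files:
-- Most recently modified files first (Copy_Logs is the table created above).
SELECT Copy_File_Name, Last_modified
FROM Copy_Logs
ORDER BY Last_modified DESC;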
One way I can think of is to add a new Get Metadata activity inside the ForEach loop, use a parameterized dataset, and pass the file name as the parameter. I tested the same approach.
HTH.

Support for creating a table out of a limited number of columns in Presto

I was playing around with Presto. I uploaded a Parquet file with 10 columns. I want to create a table (external location S3) in the metastore with 5 columns using the presto-cli. It looks like Presto doesn't support this?
Is there any other way to get this working?
That should be easily possible if you are using the Parquet or ORC file formats. This is another advantage of keeping metadata separate from the actual data. As mentioned in the comments, you should use column names to access the fields instead of their index.
One example:
CREATE TABLE hive.web.request_logs (
    request_time timestamp,
    url varchar,
    ip varchar,
    user_agent varchar
)
WITH (
    format = 'parquet',
    external_location = 's3://my-bucket/data/logs/'
)
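For instance, even though the underlying Parquet files may contain more columns, you can query just the ones declared above. A small usage sketch against the example table:
-- Only the declared columns are visible; extra columns in the Parquet files are ignored.
SELECT request_time, url, user_agent
FROM hive.web.request_logs
LIMIT 10;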
Reference:
https://prestodb.github.io/docs/current/connector/hive.html#examples

How to resolve special character issue in SQL Server data warehouse

I have to load the data from the data lake into a SQL Server data warehouse using PolyBase tables. I have created the setup for the creation of external tables and created the external tables, but when I run select * from ext_t1 I get ???? for a column in the external table.
Below is my external table script. I have found that the issue is caused by special characters in the data. How can we escape the special characters, given that we need to use only the varchar data type and not nvarchar? Can someone help me with this issue?
CREATE EXTERNAL FILE FORMAT [CSVFileFormat_Test]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = N',',
        STRING_DELIMITER = N'"',
        DATE_FORMAT = 'yyyy-MM-dd',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = True,
        Encoding = 'UTF8'
    )
);

CREATE EXTERNAL TABLE [dbo].[EXT_TEST1]
(
    A VARCHAR(10),
    B VARCHAR(20)
)
WITH (
    DATA_SOURCE = [Azure_Datalake],
    LOCATION = N'/A/Test_CSV/',
    FILE_FORMAT = [CSVFileFormat_Test],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
);
Data (special characters in the CSV for column A, as follows):
НК ВНЗМ Завод
НК ВНЗМ Застройщик
This is a data mismatch issue, and the following may help you.
External Table Considerations
Creating an external table is easy, but there are some nuances that need to be discussed.
External Tables are strongly typed. This means that each row of the data being ingested must satisfy the table schema definition. If a row does not match the schema definition, the row is rejected from the load.
The REJECT_TYPE and REJECT_VALUE options allow you to define how many rows or what percentage of the data must be present in the final table. During load, if the reject value is reached, the load fails. The most common cause of rejected rows is a schema definition mismatch. For example, if a column is incorrectly given the schema of int when the data in the file is a string, every row will fail to load.
Data Lake Storage Gen1 uses Role Based Access Control (RBAC) to control access to the data. This means that the Service Principal must have read permissions to the directories defined in the location parameter and to the children of the final directory and files. This enables PolyBase to authenticate and load that data.
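As a rough illustration of the reject options mentioned above, here is a sketch that reuses the object names from the question but switches to a percentage-based reject threshold (the table name and threshold values are hypothetical):
-- Hypothetical variant of the question's external table: fail the load only if
-- more than 5 percent of the sampled rows are rejected.
CREATE EXTERNAL TABLE [dbo].[EXT_TEST1_PCT]
(
    A VARCHAR(10),
    B VARCHAR(20)
)
WITH (
    DATA_SOURCE = [Azure_Datalake],
    LOCATION = N'/A/Test_CSV/',
    FILE_FORMAT = [CSVFileFormat_Test],
    REJECT_TYPE = PERCENTAGE,       -- reject threshold expressed as a percentage
    REJECT_VALUE = 5,               -- allow up to 5 percent rejected rows
    REJECT_SAMPLE_VALUE = 1000      -- re-evaluate the percentage every 1000 rows
);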