External Table in ADX for ADLS data: No records - azure-data-lake

I have log data stored in ADLS Gen2 storage. I am trying to query it from ADX, so I created an external table in ADX, but no records are populating; the external table returns no records.
Created External Table:
.create external table extable1 (AppId:string)
kind=blob
dataformat=json
(
h@'https://clickstreamstorelake2.blob.core.windows.net/streamout/0_56da70eca49745f8b830da45ff6aba57_1.json;secret_key_here'
)
with
(
docstring = "Docs",
folder = "ExternalTables",
namePrefix="Prefix"
)
Json Mapping
.create external table extable1 json mapping "map1" '[{ "column" : "AppId", "datatype" : "string", "path" : "$.AppId"}]'
ADLS Gen2 file (screenshot)

Wrong "nameprefix" parameter was passed which caused the no records.
namePrefix : string If set, indicates the prefix of the blobs. On write operations, all blobs will be written with this prefix. On read operations, only blobs with this prefix are read.
It should be in accordance with the blobs present in ADLS.
Below code works well if there is no nameprefix set in ADLS container.
.create external table extable1 (AppId:string)
kind=blob
dataformat=json
(
h@'https://clickstreamstorelake2.blob.core.windows.net/streamout/0_56da70eca49745f8b830da45ff6aba57_1.json;secret_key_here'
)
with
(
folder = "ExternalTables"
)
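If the blobs do share a common name prefix, a sketch of the matching variant; the container-level path and the "0_" prefix are assumptions based on the blob name shown above:
.create external table extable1 (AppId:string)
kind=blob
dataformat=json
(
h@'https://clickstreamstorelake2.blob.core.windows.net/streamout;secret_key_here'
)
with
(
folder = "ExternalTables",
// namePrefix must match how the blobs are actually named,
// e.g. "0_" for blobs like 0_56da70eca49745f8b830da45ff6aba57_1.json
namePrefix = "0_"
)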

Related

Trino S3 partitions folder structure

I do not understand what paths Trino needs in order to create a table from existing files. I use S3 + Hive metastore.
My JSON file:
{"a":1,"b":2,"snapshot":"partitionA"}
Create table command:
create table trino.partitioned_jsons (a INTEGER, b INTEGER, snapshot varchar) with (external_location = 's3a://bucket/test/partitioned_jsons/*', format='JSON', partitioned_by = ARRAY['snapshot'])
What I have tried:
Store JSON file in s3://bucket/test/partitioned_jsons/partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot=partitionA/file.json
Store JSON file in s3://bucket/test/partitioned_jsons/snapshot/partitionA.json
But all of them return just an empty table.
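For what it's worth, with the Hive connector the snapshot=partitionA layout is the conventional one, but partitions still have to be registered in the metastore (see the Presto answer further down). A minimal sketch, assuming external_location points at the directory itself without the /* wildcard and the catalog exposes the Hive connector's procedures:
-- register the snapshot=... directories found under the external location
CALL system.sync_partition_metadata('trino', 'partitioned_jsons', 'FULL');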

Getting Null values while loading parquet data from s3 to snowflake

Problem statement: load Parquet data from AWS S3 into a Snowflake table.
Command I am using:
COPY INTO schema.test_table FROM (
    SELECT $1:ID::INTEGER, $1:DATE::TIMESTAMP, $1:TYPE::VARCHAR
    FROM @s3_external_stage/folder/part-00000-c000.snappy.parquet
)
file_format = (type=parquet);
As a result, I am getting NULL values.
I queried the Parquet data directly in S3 and it does have values.
Not sure what I am missing.
Also, is there any way to load data from Parquet files into tables recursively (a sketch follows the folder listing below)?
For example:
s3_folder/
|---- fileabc.parquet
|---- file_xyz.parquet
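A minimal sketch of loading everything under the folder; the quoted column names are an assumption about the actual case of the fields inside the Parquet files (Snowflake resolves $1:<name> paths case-sensitively, which is also a common cause of unexpected NULLs):
COPY INTO schema.test_table
FROM (
    -- pointing the stage path at the folder (a prefix) makes COPY consider
    -- every file under it, e.g. fileabc.parquet and file_xyz.parquet
    SELECT $1:"ID"::INTEGER, $1:"DATE"::TIMESTAMP, $1:"TYPE"::VARCHAR
    FROM @s3_external_stage/folder/
)
file_format = (type = parquet);
-- a PATTERN option such as pattern = '.*\.parquet' can be added to filter files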

How to read parquet data with partitions from Aws S3 using presto?

I have data stored in S3 in the form of Parquet files with partitions. I am trying to read this data using Presto. I am able to read the data if I give the complete location of the Parquet files including the partition. Below is the query to read data from "section a":
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');
But my data is partitioned with different sections i.e. s3://bucket/presto folder contains multiple folders like "section=a", "section=b", etc.
I am trying to read the data with partitions as follows:
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
The table gets created, but when I try to select the data, the table is empty.
I am new to Presto, please help.
Thanks
You create the table correctly:
create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
However, in "Hive table format" the partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:
explicit declaration of partitions allows you to publish a partition "atomically", once you're done writing
section=a, section=b is only the convention, the partition location may be different. In fact the partition can be located in some other S3 bucket, or different storage
To auto-discover partitions in a case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto.
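A minimal sketch of invoking it for the table above; the hive. catalog prefix is an assumption and depends on your setup:
-- scan s3://bucket/presto for section=... directories and register them as partitions
CALL hive.system.sync_partition_metadata('default', 'sample', 'FULL');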

Azure Data Lake - how to insert into external table in AzureSQL DB?

From Azure Data Lake, inserting records into an external table in Azure SQL DB produces the following error:
Error E_CSC_USER_CANNOTMODIFYEXTERNALTABLE Modifying external table 'credDB.dbo.BuildInfosClone' is not supported.
Modifying external table 'credDB.dbo.BuildInfosClone' is not supported.
External tables are read-only tables.
How can I insert records into the external database? My credential has read-write access. I am using a regular Azure SQL DB, not SQL Data Warehouse.
Complete U-SQL code
CREATE DATA SOURCE myDataSource
FROM AZURESQLDB
WITH
(
PROVIDER_STRING = "Database=MedicusMT2",
CREDENTIAL = credDB.rnddref_admin,
REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime)
);
CREATE EXTERNAL TABLE IF NOT EXISTS dbo.BuildInfosClone
(
[Key] string,
[Value] string
)
FROM myDataSource LOCATION "dbo.BuildInfosClone";
INSERT INTO dbo.BuildInfosClone
( [Key], [Value] )
VALUES
("SampleKey","SampleValue");
You cannot currently write directly to Azure SQL DB (or Azure SQL Data Warehouse) tables using U-SQL; external tables are read-only from U-SQL. You could either write your data out to a flat file and then import it using PolyBase, or use Data Factory to orchestrate the copy.
Alternatively, you can use Azure Databricks to write directly to SQL Data Warehouse as per this tutorial.
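For illustration, a minimal U-SQL sketch of the flat-file route; the @rows rowset and the output path are placeholders:
// Write the rowset to a CSV file in the Data Lake store; the file can then be
// copied into the Azure SQL DB table with Data Factory or a bulk import.
OUTPUT @rows
TO "/output/BuildInfosClone.csv"
USING Outputters.Csv(outputHeader : true);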

mapping JSON object stored in HBase to struct<Array<..>> Hive external table

I have an HBase table that contains a column in JSON format. So, I want to create a Hive external table that contains a struct<array<...>> column.
HBase table named smms:
column name: nodeid, value: "4545781751" in STRING format
column name: events, in JSON format
value: [{"id":12542, "status":"true", ..},{"id":"1477", "status":"false", ..}]
Hive external table :
CREATE EXTERNAL TABLE msg (
  `key` INT,
  nodeid STRING,
  events ARRAY<STRUCT<id:INT, status:STRING>>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:nodeid,data:events")
TBLPROPERTIES ("hbase.table.name" = "smms");
The Hive query select * from msg; returns the following result:
nodeid : 4545781751
events : NULL
Thanks
The HBaseStorageHandler (de)serialiser only supports string and binary fields (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration).
What you store in HBase is actually a string (which indeed contains JSON), but you cannot map it to a complex Hive type.
The solution would be to define events as a string, and then export the data to another Hive table using a Hive JSON SerDe (https://github.com/rcongiu/Hive-JSON-Serde).
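For illustration, a minimal sketch of the first step; the table name msg_raw is made up, and the follow-up export into a typed table with the JSON SerDe is not shown:
-- Map the HBase JSON column as a plain STRING first
CREATE EXTERNAL TABLE msg_raw (
  `key` INT,
  nodeid STRING,
  events STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:nodeid,data:events")
TBLPROPERTIES ("hbase.table.name" = "smms");

-- The raw JSON can then be inspected with Hive's built-in JSON functions, e.g.
-- SELECT nodeid, get_json_object(events, '$[0].id') FROM msg_raw;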