SQL Polybase External Table - Dealing with file metadata - azure-data-lake

I cannot find any reference for dealing with file metadata when creating an External Table starting from a partitioned source of files. More precisely: I have a set of partitioned parquet files. The partition strategy is in the form:
{YEAR}/{MONTH}/{filename}.parquet
Now I can create an external table referencing the whole set, with the LOCATION pointing at the root of the partition hierarchy and using a recursive strategy.
LOCATION = 'folder_or_filepath' Specifies the folder or the file path
and file name for the actual data in Hadoop or Azure blob storage. The
location starts from the root folder. The root folder is the data
location specified in the external data source.
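For reference, a minimal sketch of the kind of external table described above, with hypothetical object names (the external data source and the parquet file format are assumed to already exist):

-- hypothetical sketch: LOCATION points at the root of the partition tree,
-- so the {YEAR}/{MONTH}/{filename}.parquet files are read recursively,
-- but the partition values themselves never surface as columns
CREATE EXTERNAL TABLE dbo.MyEvents
(
    EventId INT,
    Payload NVARCHAR(200)
)
WITH
(
    LOCATION    = '/events/',           -- root folder of the {YEAR}/{MONTH} hierarchy
    DATA_SOURCE = MyDataLakeSource,     -- assumed, pre-created external data source
    FILE_FORMAT = MyParquetFileFormat   -- assumed, pre-created PARQUET file format
);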
In this context, it would be crucial to be able to access partitioning metadata like {YEAR}, {MONTH} or {filename} and store them as columns in the newly created external table for further use.
From my research, accessing file metadata seems to be a missing feature right now, but I'm not sure.
It is certainly not possible to leverage a PARTITION BY functionality, as evidenced here:
https://feedback.azure.com/forums/307516-azure-synapse-analytics/suggestions/19520860-polybase-partitioned-by-functionality-when-creati
Is there some mitigation strategy? I'm about to set up a Data Factory Mapping Data Flow which will do the dirty job, but I'm still unsure about these two options:
Reducing the partitioned set to a single file, adding the metadata columns to each row;
Just adding the metadata columns to each file and keeping the partitioned hierarchy.
Bonus: any suggestion?

Related

Which file format should I use that supports appending?

We currently use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents every 20 minutes, which implies we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by about 2 GB every day, and it is a highly costly process to download ~10 GB files, read them, append to them and upload them again.
Question:
Is there another file format that offers appends/inserts and can also be queried by Athena?
From this article it seems Avro is such a file format, but I'm not sure:
can Athena be used to query it?
are there any other issues?
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
-- run once per new/updated partition (YYYYMMDD is a placeholder);
-- `date` is backtick-escaped because it is a reserved word in Athena DDL
alter table dataset.tablename add if not exists
partition (`date` = 'YYYYMMDD')
location 's3://your-bucket/path_to_table/date=YYYYMMDD/';

What is hive.metastore.warehouse.dir for?

I am new to Hive. I am trying to set up a Hive metastore service with a standalone MySQL DB, and I realized that I need to configure hive.metastore.warehouse.dir in hive-site.xml, but I am having a hard time understanding what it is for.
1. None of the metadata will be stored in this location, because all of the metadata will be stored in the MySQL DB.
2. None of the data files will be stored in this location, because I am not setting up a Hive data service, just a metastore service, and when creating Hive tables I will specify the location of each table.
Why do I still need to set this configuration?
spark.sql.warehouse.dir is a static configuration property that sets Hive’s hive.metastore.warehouse.dir property, i.e. the location of default database for the Hive warehouse
That is correct. This directory indicates where the actual data in the tables will reside.
It sounds like in most of your situations the data will reside outside of whatever you set for this directory. However, if a user forgets to set a location, or if there are any internal/automated calls that use the "default" database, this is where that "default" data will reside.
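As a purely hypothetical illustration of that fallback, assuming hive.metastore.warehouse.dir is set to /user/hive/warehouse:

-- managed tables created without an explicit LOCATION land under the warehouse directory
CREATE TABLE default.events (id INT, payload STRING);
-- data files go to /user/hive/warehouse/events

CREATE TABLE mydb.events (id INT, payload STRING);
-- data files go to /user/hive/warehouse/mydb.db/events (assuming database mydb exists)

-- a table created with an explicit LOCATION ignores the warehouse directory
CREATE TABLE default.events_elsewhere (id INT, payload STRING)
LOCATION '/data/events_elsewhere/';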

hive external table location vs load path

From reading about external and managed tables on the internet, I understood that we need to specify the LOCATION while creating an external table, as Hive will create the table in the given location, whereas for a managed table the default directory mentioned in hive.metastore.warehouse.dir will be used.
Please correct me if anything is wrongly stated.
What confuses me is:
Is the LOCATION clause used to specify where the data already exists for an external table, or where to create the directory that stores the actual data?
If the LOCATION clause is used to specify where the data exists, then why are we using the PATH clause in the LOAD statement?
The LOCATION clause in the DDL of an external table is used to specify the HDFS location where the table's data is (or will be) stored. Later on, when we query the table, the data is read from this specified path.
The path in LOAD DATA INPATH is the path of the source file from which the data is loaded into the table. The source can be either an HDFS path, or a local file path if you use LOAD DATA LOCAL INPATH.
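A minimal sketch with hypothetical names and paths, to make the distinction concrete:

-- LOCATION: where the table's data lives; queries read from here
CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/warehouse/sales/';

-- LOAD ... INPATH: where the source file comes from; the file is moved (HDFS source)
-- or copied (local source) into the table's LOCATION
LOAD DATA INPATH '/staging/sales_2020.csv' INTO TABLE sales;       -- HDFS source
LOAD DATA LOCAL INPATH '/tmp/sales_2020.csv' INTO TABLE sales;     -- local filesystem source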
Hope that clears up your confusion.

Does Hive duplicate data?

I have a large log file which I loaded into HDFS. HDFS will replicate it to different nodes based on rack awareness.
Now I load the same file into a hive table. The commands are as below:
create table log_analysis (logtext string) STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/';
LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;
Now when I look at the '/user/hive/warehouse/' directory, there is a table file, and after copying it to local it contains all the log file data.
My question is: the existing file in HDFS is replicated, and then the copy loaded into the Hive table, also stored on HDFS, gets replicated as well.
Isn't that the same file stored 6 times (assuming a replication factor of 3)? That would be such a waste of resources.
Correct: in case you are loading the data from HDFS, the data moves from HDFS to /user/hive/warehouse/yourdatabasename/tablename.
Your question indicates that you have created an internal (managed) table using Hive and you are loading data into the Hive table from an HDFS location.
When you load data into an internal table using the LOAD DATA INPATH command, it moves the data from the original location to another location; in your case that should be /user/hive/warehouse/log_analysis. So basically the data gets a new HDFS address, and you won't see anything in the previous location.
When you move data from one location to another on HDFS, the NameNode receives the new location of the data and deletes the old metadata for it. Hence there is no duplicate copy of the data: with a replication factor of 3, it is still stored only 3 times.
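To see where the data ended up after the load, you can inspect the table's metadata, for example:

-- the "Location:" row in the output shows the table's single HDFS location;
-- the source file at /user/log/apache.log is gone because LOAD DATA INPATH
-- moves it rather than copying it
DESCRIBE FORMATTED log_analysis;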
I hope that is clear to you.

Where does Hive store its tables?

I am new to Hadoop and I just started working with Hive. In my understanding, it provides a query language to process data in HDFS, and with HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in our HDFS and we want to make a Hive table out of that data, what will be the size of that table and where will it be stored?
If my understanding of this concept is wrong, please correct me.
If the table is 100 GB you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table the data itself will still be stored on HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a mapping of it in the metastore, whereas a managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data, as opposed to dropping a Hive external table, which only drops the metadata in the metastore referencing that data.
Either way you are using only 100 GB as viewed by the user, and you are taking advantage of HDFS's robustness through replication of the data.
Hive will create a directory on HDFS. If you didn't specify any location, it will create a directory at /user/hive/warehouse on HDFS. After the load command, the files are moved to /user/hive/warehouse/tablename. You can also point to an HDFS directory if it contains partitions (if the files are partitioned), or use the external table concept.
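A short hypothetical sketch of the two options:

-- managed table: the data is moved under hive.metastore.warehouse.dir,
-- and DROP TABLE deletes the files as well
CREATE TABLE logs_managed (line STRING);

-- external table: Hive only stores metadata; the existing 100 GB stays where it is,
-- and DROP TABLE removes just the metastore entry
CREATE EXTERNAL TABLE logs_external (line STRING)
LOCATION '/user/me/biglog/';   -- hypothetical HDFS path to the existing data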