Presto query engine with Azure Data Lake

I have a requirement to deploy a Presto server that can help me query data stored in ADLS in Avro file format.
I have gone through this tutorial, and it seems that Hive is used as a catalog/connector in Presto to query data from ADLS. Can I bypass Hive and use another connector to extract data from ADLS?

Can I bypass Hive and use another connector to extract data from ADLS?
No.
Hive plays two roles here:
storage for metadata. The metastore contains information such as:
schema and table names
columns
data format
data location
execution
it is capable of reading data from distributed file systems (like HDFS, S3, ADLS)
it determines how execution can be distributed.
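For illustration, a minimal sketch of how this looks on the Presto side, assuming a catalog named adls and placeholder host/table names: the Hive connector is configured through a catalog properties file, and queries then run against that catalog.
-- etc/catalog/adls.properties (file name is arbitrary) would contain roughly:
--   connector.name=hive-hadoop2
--   hive.metastore.uri=thrift://metastore-host:9083
--   hive.config.resources=/etc/hadoop/conf/core-site.xml   (ADLS credentials/config live here)
-- Once the metastore has a table defined over the Avro files in ADLS,
-- Presto queries it like any other table:
SELECT * FROM adls.default.my_avro_table LIMIT 10;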

Related

Hive table in Power Bi

I want to create a Hive table which will store data in ORC format with Snappy compression. Will Power BI be able to read from that table? Also, do you suggest any other format/compression for my table?
ORC is a file format mainly used with Hive, and it is highly optimized for HDFS read operations. Power BI can connect to Hive using the Hive ODBC data connection. So, if you will be using Hive all the time, you can use this format to store the data. But if you want the flexibility of both Hive and Impala, and want to use the Cloudera-provided Impala ODBC driver, you can consider Parquet.
Both ORC and Parquet have their own advantages and disadvantages. The main deciding factors are the tools that access the data, how nested the data is, and how many columns there are.
If you have many columns with nested data and want to use both Hive and Impala to access the data, go with Parquet. If you have few columns with a flat data structure and a huge amount of data, go with ORC.
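For the first part of the question, such a table can be declared in Hive roughly like this (table and column names are made up for illustration):
CREATE TABLE sales_orc (
  id BIGINT,
  amount DOUBLE,
  sold_at TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
Power BI would then read it through the Hive ODBC connection mentioned above.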

Read hive table (or HDFS data in parquet format) in Streamsets DC

Is it possible to read a Hive table (or HDFS data in Parquet format) in StreamSets Data Collector? I don't want to use Transformer for this.
Reading raw Parquet files runs counter to the way Data Collector works, so that would be a better use case for Transformer.
That said, I have successfully used the JDBC origin against either Impala or Hive to achieve this; there are some additional hurdles to clear with the JDBC source.
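As a rough sketch of that JDBC route, assuming a JDBC Query Consumer origin pointed at HiveServer2 (host, database, table and offset column are placeholders):
-- JDBC connection string for the origin, e.g.:
--   jdbc:hive2://hive-host:10000/default
-- Incremental query; ${OFFSET} is the origin's offset placeholder and the result
-- must be ordered by the offset column:
SELECT * FROM my_parquet_table WHERE id > ${OFFSET} ORDER BY id;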

How to Load data into bigquery from buckets partitioned with Year/Month/Day

We have data stored in a GCS bucket in the format below:
gs://gcptest/Year=2020/Month=06/day=18/test1.parquet, with many files under the day=18 folder.
I want to create a table in BigQuery with the columns present in the files, partitioned by the Year, Month, and Day values present in the file path.
So when I load the data into the table, I can just select the path from the GCS bucket and the loaded data will be partitioned by the Year/Month/Day values present in the path.
BigQuery supports loading externally partitioned data in Avro, Parquet, ORC, CSV and JSON formats that is stored on Cloud Storage using a default hive partitioning layout.
Support is currently limited to the BigQuery web UI, command-line tool, and REST API.
You can see more in the Loading externally partitioned data documentation.
Also see how to Query externally partitioned data.
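For the querying side, a hedged sketch of the BigQuery DDL for an external table over the layout above (dataset and table names are placeholders; the loading side uses bq load with the equivalent hive-partitioning flags):
CREATE EXTERNAL TABLE mydataset.test_external
WITH PARTITION COLUMNS  -- Year, Month, day are inferred from the path
OPTIONS (
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://gcptest/',
  uris = ['gs://gcptest/*']
);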

Indexing and partitioning Parquet in S3

Is it possible to both index and partition a Parquet file in S3, or is this functionality only available on file-storage types of volumes?
I'm looking for a way to give researchers access to the same data in S3 via EMR notebooks for (a) generic R and Python scripts, and (b) Spark-enabled querying. The proprietary solution and query language we have right now provide indexing and partitioning on an NFS store, so I want to preserve this functionality. I see that Delta Lake provides this, but I'm wondering if it can be achieved with simpler tools like Arrow.
You could use Delta Lake to partition a Parquet file; Delta tables are also indexed by default.
You can do it like this:
%sql
CREATE TABLE UsableTable_unpartitioned
USING DELTA
LOCATION 'Location of the Parquet File on S3';
CREATE TABLE UsableTable
USING DELTA
PARTITIONED BY (my_partitioned_column)
LOCATION 'MyS3Location'
AS SELECT * FROM UsableTable_unpartitioned;
DROP TABLE UsableTable_unpartitioned;
Verify that the partitions and all the required table details were created:
%sql
describe detail UsableTable
You could then expose this table over JDBC.

Move data from hive tables in Google Dataproc to BigQuery

We are doing data transformations using Google Dataproc, and all our data resides in Dataproc Hive tables. How do I transfer/move this data to BigQuery?
Transferring from Hive to BigQuery follows a standard pattern:
dump your Hive tables into Avro files
load those files into BigQuery
See an example here: Migrate hive table to Google BigQuery
As mentioned above, take care with type compatibility between Hive/Avro/BigQuery.
For the first run, it would not hurt to do some validation by comparing that the tables in both Hive and BigQuery contain the same data: https://github.com/bolcom/hive_compared_bq
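A minimal sketch of the dump step from Hive on Dataproc, assuming placeholder table and bucket names (the GCS connector shipped with Dataproc makes gs:// paths usable directly from Hive):
-- Dump the Hive table as Avro files to a GCS staging location:
INSERT OVERWRITE DIRECTORY 'gs://my-staging-bucket/export/my_table'
STORED AS AVRO
SELECT * FROM my_table;
-- The files can then be loaded with, e.g.:
--   bq load --source_format=AVRO mydataset.my_table 'gs://my-staging-bucket/export/my_table/*'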