Accessing Spark Tables (Parquet on ADLS) using Presto from a Local Linux Machine

We would like to know whether Spark external tables, with MS SQL as the metastore and the data files on Azure Data Lake, can be accessed through the Hive Metastore service (for Presto) from a Linux machine.
We are trying to access Spark Delta tables, backed by Parquet files on ADLS, through Presto. The scenario is below. I would like to know if there is a way to achieve this. This is only a POC, and we believe the answer will take us to the next step.
Our central data repository consists entirely of Spark Delta tables created by many pipelines. The data is stored in Parquet format, and MS SQL is the external metastore. Other teams and applications use the data in these Spark tables and would like to access it through Presto.
We learnt that Presto uses the Hive metastore service to get table details. We tried accessing the tables from Hive first, on the assumption that if Hive works, Presto will too, but we ran into problems with the different file systems. We have set up Hadoop and Hive on a single Linux machine; the versions are 3.1.2 and 3.1.1 respectively. The Hive service connects to the SQL metastore and returns results for a few basic commands. However, when it tries to read the actual data stored as Parquet at an ADLS path, it fails with a file system exception. I understand that the problem involves the interaction of several file systems (ADLS, HDFS, the local Linux file system), but I cannot find any blog that guides us through it. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Data Lake files and bring in the data?
I believe my metastore should have the exact full ADLS path where the files are stored. If so, how will my Hive/Hadoop on Linux understand that path?
If it can recognize the path, in which configuration file (any .xml) should I provide the credentials for accessing the data lake?
How can the different file systems interact?
Kindly help. Thanks for all the inputs.
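The `No FileSystem for scheme "dbfs"` error above suggests that the table locations stored in the metastore are Databricks `dbfs:/` paths, which a plain Hadoop/Hive install has no file system implementation for. As a rough illustration only, the sketch below maps such a location to an `abfss://` URI. It assumes ADLS Gen2, and the mount point, container, and storage account names are hypothetical placeholders, not values from the setup described here.

```python
# Hypothetical mapping from DBFS mount points to (container, storage account).
# These names are placeholders; the real ones come from how the mounts were created.
MOUNTS = {
    "/mnt/datalake": ("mycontainer", "mystorageaccount"),
}


def dbfs_to_abfss(location, mounts=MOUNTS):
    """Translate a dbfs:/ location into an abfss:// URI (ADLS Gen2 assumed)."""
    if not location.startswith("dbfs:"):
        return location  # already a non-DBFS location, leave untouched
    # Normalise dbfs:/mnt/x and dbfs:///mnt/x to /mnt/x
    path = "/" + location[len("dbfs:"):].lstrip("/")
    # Try the longest mount point first so nested mounts resolve correctly.
    for mount in sorted(mounts, key=len, reverse=True):
        if path == mount or path.startswith(mount + "/"):
            container, account = mounts[mount]
            rest = path[len(mount):]
            return f"abfss://{container}@{account}.dfs.core.windows.net{rest}"
    raise ValueError(f"no known mount point for {location!r}")


print(dbfs_to_abfss("dbfs:/mnt/datalake/orders/part-0.parquet"))
```

Even with translated locations, Hadoop needs the `hadoop-azure` ABFS connector on its classpath (as far as I know it is bundled only from Hadoop 3.2 onward) and credentials in `core-site.xml`, for example the `fs.azure.account.key.<account>.dfs.core.windows.net` property.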

Related

Can bigquery tables be accessed through .hql file which will run on dataproc

I am trying to access a BigQuery table from my .hql file that I will be running on a Dataproc cluster.
I have written the code below to set the tables as variables in the Hive environment:
Set hivevar:source_table_name=project_id:dataset_name:table_name;
Set hivevar:destination_table_name=project_id:datasetname:dest_tablename;
Then I wrote a query to insert the output into the table present in BigQuery.
Insert into ${destination_table_name} select count(*) from ${source_table_name} where name like 'A%';
After running the job from Dataproc, I get a "table not found" error, but the table is present in the BigQuery dataset.
Can someone please help resolve the issue?

You are trying to access BigQuery tables in the Hive environment. When you run this query, Hive searches for the name you provided in its own list of tables, not in BigQuery's tables.
In this link you can find a package that allows you to connect BigQuery and Hive.
As you can see in the description:
This is a Hive StorageHandler plugin that enables Hive to interact
with BigQuery. It allows you keep your existing pipelines but move to
BigQuery. It utilizes the high throughput BigQuery Storage API to read
data and uses the BigQuery API to write data.
The following steps are performed under Dataproc cluster in Google
Cloud Platform. If you need to run in your cluster, you will need
setup Google Cloud SDK and Google Cloud Storage connector for Hadoop.
I hope it helps.
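To see why Hive reports "table not found", it can help to look at the statement after variable substitution. The sketch below only approximates Hive's textual `${name}` / `${hivevar:name}` substitution in Python, using the variable values from the question; it is not Hive's actual implementation.

```python
import re

# Variables exactly as set in the question's .hql file.
HIVEVARS = {
    "source_table_name": "project_id:dataset_name:table_name",
    "destination_table_name": "project_id:datasetname:dest_tablename",
}


def substitute(query, hivevars):
    """Approximate Hive's textual ${name} / ${hivevar:name} substitution."""
    def repl(match):
        # Unknown variables are left as-is.
        return hivevars.get(match.group(1), match.group(0))
    return re.sub(r"\$\{(?:hivevar:)?(\w+)\}", repl, query)


query = ("Insert into ${destination_table_name} "
         "select count(*) from ${source_table_name} where name like 'A%';")
rendered = substitute(query, HIVEVARS)
print(rendered)
```

The rendered statement still names `project_id:dataset_name:table_name` as if it were a Hive table, so Hive looks it up in its own metastore unless a table backed by a BigQuery storage handler has been defined there.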

Hive ORC ACID table on AZURE Blob Storage possible for MERGE

On HDFS, Hive ORC ACID tables work with Hive MERGE without issue.
On S3 it is not possible.
For Azure HDInsight, the docs do not make clear to me whether such a table on Azure Blob Storage is possible. I am seeking confirmation either way.
I am pretty sure it is a no-go. However, see the update I added to the answer.

According to the official Azure HDInsight documentation (Azure HDInsight 4.0 overview), HDInsight does not support MapReduce for Hive.
As far as I know, Hive MERGE requires MapReduce, so it is also not possible.
UPDATE by question poster
HDInsight 4.0 doesn't support MapReduce for Apache Hive; use Apache Tez instead. So with Tez it will still work, and according to https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-version-release, Spark with Hive 3 and the Hive Warehouse Connector are also options.

Presto and Hive

I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only needed the Hive metastore to save my table configurations in Presto, or whether I have to create the tables there first.
The documentation makes it seem that you can use Presto without having to CONFIGURE Hive, just using Hive syntax. Is that accurate? In my experience so far, I have not been able to connect to AWS S3.

Presto syntax is similar to Hive syntax. For most simple queries, the identical syntax would function in both. However, there are some key differences that make Presto and Hive not entirely the same thing. For example, in Hive, you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such examples of nuanced syntactical differences between the two.
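To make the LATERAL VIEW EXPLODE versus CROSS JOIN UNNEST difference concrete, here is a small sketch that renders the same "flatten an array column" query in both dialects. The table and column names are hypothetical, and the helper is for illustration only.

```python
def explode_query(dialect, table, id_col, array_col, alias="item"):
    """Render a query that flattens an array column, per SQL dialect."""
    if dialect == "hive":
        # Hive: LATERAL VIEW with the explode() table-generating function
        return (f"SELECT {id_col}, {alias} FROM {table} "
                f"LATERAL VIEW explode({array_col}) x AS {alias}")
    if dialect == "presto":
        # Presto: CROSS JOIN against UNNEST with a column alias list
        return (f"SELECT {id_col}, {alias} FROM {table} "
                f"CROSS JOIN UNNEST({array_col}) AS x ({alias})")
    raise ValueError(f"unsupported dialect: {dialect}")


# Hypothetical table "orders" with an array column "items".
hive_sql = explode_query("hive", "orders", "order_id", "items")
presto_sql = explode_query("presto", "orders", "order_id", "items")
print(hive_sql)
print(presto_sql)
```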

It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only a distributed execution engine and holds no metadata about tables. The Presto coordinator therefore needs the Hive metastore to retrieve the table metadata required to parse and execute a query.
However, you can use AWS Athena, which is managed Presto, to run queries on top of S3.
Another option: as of release 0.198, Presto can connect to AWS Glue and retrieve table metadata for files in S3.

I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query/process data living in S3/CSV formats.
Also, I'm curious: what solution did you end up implementing to resolve your issue?

Use hive metastore service WITHOUT Hadoop/HDFS

I know the question is a little strange. I love Hadoop and HDFS, but recently I have been working on SparkSQL with the Hive Metastore.
I want to use SparkSQL as a vertical SQL engine to run OLAP queries across different data sources such as RDBs, Mongo, Elastic ... without an ETL process. I register the different schemas as external tables in the Metastore with the corresponding Hive storage handlers.
Moreover, HDFS is not used as a data source in my work, and MapReduce is already replaced by the Spark engine. It sounds to me as though Hadoop/HDFS is useless here except as a base for installing Hive. I don't want to buy into all of them.
I wonder what kind of issues I will run into if I start only the Hive metastore service, without Hadoop/HDFS, to support SparkSQL. Would I be putting myself into the jungle?

What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" in the page).
Also this may help.
This configuration is only suggested if you are experimenting locally. But in this case you only need the metastore.
Also, from here:
Spark SQL uses the Hive Metastore even when we don't configure it to. When not configured, it uses a default Derby DB as the metastore.
So this seems to be quite legitimate:
Arrange your metastore in Hive
Start Hive in local mode
And make Spark use Hive metastore
Use Spark as an SQL engine for all datasources supported by Hive.
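The "make Spark use Hive metastore" step can be sketched roughly as follows. This assumes a metastore service is already listening on a Thrift endpoint; the URI, warehouse directory, and app name are placeholders, and `build_session` needs `pyspark` installed (it is imported lazily so the config part runs without it).

```python
def metastore_conf(metastore_uri, warehouse_dir):
    """Spark settings for talking to an external Hive metastore."""
    return {
        # Same effect as enableHiveSupport(): use the Hive catalog
        "spark.sql.catalogImplementation": "hive",
        # Thrift endpoint of the standalone metastore service (placeholder)
        "hive.metastore.uris": metastore_uri,
        # Default location for managed tables; a local path is assumed here
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(conf, app_name="metastore-only"):
    """Create a SparkSession from the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import, only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


conf = metastore_conf("thrift://localhost:9083", "/tmp/spark-warehouse")
print(conf)
# spark = build_session(conf); spark.sql("SHOW TABLES").show()
```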

Tool to "Data Load" or "ETL" -- from SQL Server into Amazon Redshift

I am trying to find a decent but simple tool that I can host myself on AWS EC2 and that will let me pull data out of SQL Server 2005 and push it to Amazon Redshift.
I basically have a view in SQL Server on which I am doing SELECT *, and I just need to put all this data into Redshift. The biggest concern is that there is a lot of data, and this needs to be configurable so that I can queue it, run it as a nightly/continuous job, etc.
Any suggestions?

alexeypro,
dump tables to files, then you have two fundamental challenges to solve:
Transporting data to Amazon
Loading data to Redshift tables.
Amazon S3 will help you with both:
S3 supports fast upload of files to Amazon from your SQL server location. See this great article. It is from 2011 but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data and 16 uploader threads were ok, as I'm not on backbone. Key thing to remember is that compression and parallel upload are your friends to cut down the time for upload.
Once the data is on S3, Redshift supports high-performance parallel loading from files on S3 into table(s) via the COPY SQL command. To get the fastest load performance, pre-partition your data based on the table distribution key and pre-sort it to avoid expensive vacuums. All of this is well documented in Amazon's best practices. I have to say these guys know how to make things neat & simple, so just follow the steps.
If you are a coder, you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command-line access to Redshift (psql) to launch the COPY command.
Another option is Java: there are libraries for S3 upload and JDBC access to Redshift.
As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. Took care of automating the process as well as partitioning/presorting the files for load.
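For the load step, a COPY statement for gzip-compressed CSV files under an S3 prefix might be built like this. It is a minimal sketch: the table name, bucket, and IAM role ARN are hypothetical, and it uses IAM_ROLE authorization rather than whatever credentials method your cluster is set up for.

```python
def build_copy_sql(table, s3_prefix, iam_role):
    """Build a Redshift COPY statement for gzip-compressed CSV files."""
    def quote(value):
        # Minimal SQL string literal quoting for this illustration
        return "'" + value.replace("'", "''") + "'"

    # All files whose key starts with s3_prefix are loaded in parallel.
    return (f"COPY {table} FROM {quote(s3_prefix)} "
            f"IAM_ROLE {quote(iam_role)} CSV GZIP;")


# Placeholder names; replace with your own table, bucket and role.
sql = build_copy_sql(
    "public.orders",
    "s3://my-bucket/orders/part-",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(sql)
```

The resulting string can be run through psql or any client that speaks the PostgreSQL protocol.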

SSIS PowerPack (a third-party suite from ZappySys, not from Microsoft) is now available, so you can do it natively in SSIS.
SSIS Amazon Redshift Data Transfer Task
Very fast bulk copy from on-premises data to Amazon Redshift in few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from Flat Files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift Datawarehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32 bit and 64 bit)
Why SSIS PowerPack?
High performance suite of Custom SSIS tasks, transforms and adapters

With existing ETL tools, an alternate option to avoid staging data in Amazon (S3/Dynamo) is to use the commercial DataDirect Amazon Redshift Driver which supports a high performance load over the wire without additional dependencies to stage data.
https://blogs.datadirect.com/2014/10/recap-amazon-redshift-salesforce-data-integration-oow14.html

For getting data into Amazon Redshift, I made DataDuck http://dataducketl.com/
It's like Ruby on Rails but for building ETLs.
To give you an idea of how easy it is to set up, here's how you get your data into Redshift.
Add gem 'dataduck' to your Gemfile.
Run bundle install
Run dataduck quickstart and follow the instructions
This will autogenerate files representing the tables and columns you want to migrate to the data warehouse. You can modify these to customize it, e.g. remove or transform some of the columns.
Commit this code to your own ETL project repository
Git pull the code on your EC2 server
Run dataduck etl all on a cron job, from the EC2 server, to transfer all the tables into Amazon Redshift

Why not Python+boto+psycopg2 script?
It will run on EC2 Windows or Linux instance.
If the OS is Windows, you could:
Extract the data from SQL Server (using sqlcmd.exe).
Compress it (using gzip.GzipFile).
Upload it to S3 with a multipart upload (using boto).
Append it to the Amazon Redshift table (using psycopg2).
Similarly, it worked for me when I wrote Oracle-To-Redshift-Data-Loader
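The compress and split-for-multipart steps above can be sketched with the standard library alone. The actual S3 upload (boto) and the Redshift load (psycopg2) are left out here, and the sample rows are made up. S3 multipart uploads require every part except the last to be at least 5 MiB, which is why that is the default part size.

```python
import csv
import gzip
import io

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for all parts but the last


def rows_to_gzip_csv(rows):
    """Serialise rows to CSV text and gzip-compress the result."""
    text = io.StringIO()
    csv.writer(text, lineterminator="\n").writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def split_parts(data, part_size=MIN_PART_SIZE):
    """Cut a byte string into consecutive chunks for a multipart upload."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


# Made-up sample rows standing in for the sqlcmd.exe extract.
blob = rows_to_gzip_csv([(1, "a"), (2, "b")])
parts = split_parts(blob)
print(len(blob), "compressed bytes in", len(parts), "part(s)")
```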
