I want to design a web UI that fetches data from HDFS and generates reports from it in my own custom format. I am writing REST APIs to fetch the data, but running Hive queries introduces too much latency. I am therefore looking for a different approach, and I can think of two:
Using Impala to create the tables. But I am not sure about REST support for Impala.
Using Hive, but with Spark instead of MapReduce as the execution engine.
spark-job-server provides REST support, and the data could be fetched with Spark SQL; a rough sketch of this path is below.
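For reference, a minimal sketch of what the Spark SQL path might look like, assuming a Hive-enabled SparkSession; the table name reports_table and the query are placeholders:

    from pyspark.sql import SparkSession

    # Assumes Spark was built with Hive support and can reach the Hive metastore.
    spark = (SparkSession.builder
             .appName("report-queries")
             .enableHiveSupport()   # read tables registered in the Hive metastore
             .getOrCreate())

    # The same HiveQL that is slow on MapReduce runs on Spark's engine here;
    # "reports_table" is a placeholder for an actual Hive table.
    df = spark.sql("SELECT report_id, metric, value "
                   "FROM reports_table WHERE dt = '2018-01-01'")
    print(df.collect())   # collect() is fine for small result sets only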
Which of these approaches would be suitable, or is there a better approach for this?
Can anyone please help? I am very new to this.
I'd prefer Impala if latency is the main consideration. It is dedicated to low-latency SQL processing on HDFS and does it well. As for the REST API and the application logic you are building, something along the lines of the sketch below could be a starting point.
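As far as I know, Impala does not expose a REST query API of its own (clients speak the HiveServer2 protocol), so the usual pattern is a thin REST service in front of it. A hypothetical sketch using Flask and the impyla client; host, port, table, and query are placeholders:

    from flask import Flask, jsonify
    from impala.dbapi import connect

    app = Flask(__name__)

    @app.route("/reports/<report_id>")
    def get_report(report_id):
        # 21050 is the default HiveServer2-protocol port of impalad.
        conn = connect(host="impala-host", port=21050)
        cur = conn.cursor()
        # Bind the URL parameter instead of concatenating it into the SQL.
        cur.execute("SELECT * FROM reports WHERE report_id = %(rid)s",
                    {"rid": report_id})
        cols = [d[0] for d in cur.description]
        return jsonify([dict(zip(cols, row)) for row in cur.fetchall()])

    if __name__ == "__main__":
        app.run()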
I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only need the Hive metastore to store my table definitions for Presto, or whether I have to create the tables in Hive first.
The documentation makes it seem that you can use Presto without having to configure Hive, while still using Hive syntax. Is that accurate? In my experience, I have not been able to connect to AWS S3.
Presto syntax is similar to Hive syntax, and for most simple queries the identical SQL works in both. However, there are some key differences that keep Presto and Hive from being interchangeable. For example, in Hive you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such nuanced syntactic differences between the two.
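To make that concrete, here is the same query over a hypothetical events table whose tags column is an array, written in both dialects:

    # Hive: flatten an array column with LATERAL VIEW EXPLODE.
    hive_query = """
        SELECT id, tag
        FROM events
        LATERAL VIEW EXPLODE(tags) t AS tag
    """

    # Presto: the equivalent uses CROSS JOIN UNNEST.
    presto_query = """
        SELECT id, tag
        FROM events
        CROSS JOIN UNNEST(tags) AS t (tag)
    """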
It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only the distributed execution engine; it has no table metadata of its own. The Presto coordinator therefore needs the Hive metastore to retrieve table metadata in order to parse and execute a query.
However, you can use AWS Athena, which is a managed Presto service, to run queries on top of S3.
Another option: the recent 0.198 release of Presto adds the ability to connect to AWS Glue and retrieve table metadata for files in S3.
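If you go the Athena route, the whole setup collapses to a few lines. A hypothetical sketch with the pyathena client, where bucket names, region, and columns are placeholders; note that Athena accepts the same Hive-style DDL for external CSV tables:

    from pyathena import connect

    # s3_staging_dir is where Athena writes query results; both buckets are placeholders.
    cursor = connect(
        s3_staging_dir="s3://my-athena-results/",
        region_name="us-east-1",
    ).cursor()

    # Register the CSV directory as an external table (Hive-style DDL).
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS logs (
            ts string,
            user_id string,
            amount double
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3://my-data-bucket/csv-dir/'
    """)

    # Plain SQL from then on.
    cursor.execute("SELECT user_id, sum(amount) FROM logs GROUP BY user_id LIMIT 10")
    print(cursor.fetchall())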
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects to S3 out of the box and can query and process CSV data stored there.
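A minimal sketch of that approach, assuming the s3a connector (hadoop-aws) is on the classpath and AWS credentials come from the environment; the bucket path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-on-s3").getOrCreate()

    # Read the whole CSV directory; infer column names and types from the files.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://my-bucket/csv-dir/"))

    # Expose it to SQL and query it like a table.
    df.createOrReplaceTempView("my_csv")
    spark.sql("SELECT count(*) FROM my_csv").show()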
Also, I'm curious: what solution did you end up implementing to resolve your issue?
I am planning to move my data into HBase, but there is so much logic in my stored procedures that I think it would be easiest to move the data to HBase using SQL. How can I do that? I haven't found anything. Please help.
I have Oracle Big Data. I have created a table in Hive, and I am able to view the data through Hue. But when I try to browse to the underlying file, I get an error related to the HDFS superuser.
Please assist.
Make sure that WebHDFS is configured properly.
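One way to check both the WebHDFS setup and the permission side of the error is to call the WebHDFS REST endpoint directly, since browsing files in Hue goes through WebHDFS/HttpFS. A hypothetical check with Python's requests; host, port (50070 is the common NameNode HTTP port), path, and user.name are placeholders:

    import requests

    # List a warehouse directory as a given user; an AccessControlException
    # in the response points at HDFS permissions rather than WebHDFS itself.
    resp = requests.get(
        "http://namenode-host:50070/webhdfs/v1/user/hive/warehouse/mytable",
        params={"op": "LISTSTATUS", "user.name": "hdfs"},  # also try the failing user
    )
    print(resp.status_code)
    print(resp.json())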
I need to create a simple data warehouse. The data sources for it are heterogeneous, so I'm experimenting with frameworks like Apache Flume for data collection. I went through the documentation but didn't find anything about SQL (http://flume.apache.org/FlumeDeveloperGuide.html and http://flume.apache.org/FlumeUserGuide.html#flume-sources).
Question: Is there any (native) way to connect an Apache Flume source to an SQL server?
Apache Flume is designed to collect, aggregate and move log data to HDFS.
If you are considering moving large amounts of data from a SQL database, take a look at Apache Sqoop:
http://sqoop.apache.org/
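For example, a single table can be pulled into HDFS with a one-line import; the connection string, credentials, and paths below are placeholders:

    sqoop import \
      --connect jdbc:mysql://db-host/sales \
      --username etl \
      --password-file /user/etl/.db-password \
      --table orders \
      --target-dir /warehouse/staging/orders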
Look into the flume-ng-sql-source project. Here are some examples as well:
http://www.toadworld.com/platforms/oracle/w/wiki/11093.streaming-oracle-database-logs-to-hdfs-with-flume
http://www.toadworld.com/platforms/oracle/w/wiki/11100.streaming-mysql-table-data-to-oracle-nosql-database-with-flume