I would like to expose Hive as a web service, so that my PHP programs can invoke the web service and show the output in the UI. I am not sure how to do this in Hive.
There is a JDBC driver for Hive. It supports only a subset of the JDBC API and SQL syntax; these limitations reflect the capabilities of Hive itself. Hive is much better suited to batch commands, for example filtering a large subset of an enormous data set, possibly joining with other Hive tables. Nevertheless, using JDBC you can conceivably access Hive data via web services. You might consider creating asynchronous web services. If you are going to access Hive in synchronous web service calls, make sure your timeouts are large enough to accommodate the time needed to run Hive commands.
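The JDBC driver itself is Java-only, but HiveServer2 also speaks Thrift, so a non-Java service layer can reach it too. Here is a minimal sketch in Python, assuming HiveServer2 listens on hive-host:10000 and the pyhive package is installed; the host, user, and table names are hypothetical placeholders:

```python
# A minimal sketch, assuming HiveServer2 at hive-host:10000 and pyhive
# installed; host, user, and table names are placeholders.
from pyhive import hive

def run_hive_query(sql):
    # Hive queries can take minutes: call this from an asynchronous worker
    # rather than inside a synchronous request handler.
    conn = hive.connect(host='hive-host', port=10000, username='webapp')
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()

rows = run_hive_query('SELECT page, SUM(hits) FROM weblogs GROUP BY page')
```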
I am looking for the recommended way of streaming database changes from Cloud SQL (Postgres) to BigQuery. I see that CDC streaming does not seem to be available for Postgres; does anyone know the timeline for this feature?
Thanks a lot for your help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process that will automatically transform data from Postgres into BigQuery. There are many ways to do that, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves the use of the COPY command to migrate data between PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement with SQL functions or PL/pgSQL procedures, which gives a lot of flexibility to extract data either as a full dump or incrementally. Be aware that it is a time-consuming process and requires you to invest engineering bandwidth!
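As a minimal sketch of this approach, assuming psycopg2 and a reachable Postgres instance (the connection details, the orders table, and the updated_at watermark column are hypothetical placeholders):

```python
# A minimal sketch of full and incremental extraction via COPY; table and
# column names are placeholders.
import psycopg2

conn = psycopg2.connect('dbname=mydb user=etl host=pg-host')
cur = conn.cursor()

# Full dump of a table to a CSV file.
with open('orders_full.csv', 'w') as f:
    cur.copy_expert('COPY orders TO STDOUT WITH CSV HEADER', f)

# Incremental extract: COPY accepts an arbitrary SELECT, so a watermark
# column can limit the dump to recent changes.
with open('orders_delta.csv', 'w') as f:
    cur.copy_expert(
        "COPY (SELECT * FROM orders WHERE updated_at > now() - interval '1 day') "
        'TO STDOUT WITH CSV HEADER', f)
conn.close()
```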
Also, you could try different tech stacks to implement the above; I recommend this one: Java Spring Data Flow.
Way 2:
Using Dataflow
You can automate the ETL process using GCP's Dataflow without coding your own solution. It is faster, but it costs money, of course.
Dataflow is unified stream and batch data processing that's serverless, fast, and cost-effective.
Check more details and learn more here.
Also check this.
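As a rough illustration of what such a Dataflow job can look like, here is a minimal Apache Beam (Python SDK) sketch that loads extracted CSV files from GCS into BigQuery; the bucket, table, and schema are hypothetical:

```python
# A minimal sketch using the Apache Beam Python SDK. Run with
# --runner=DataflowRunner (plus --project and --temp_location) to execute
# on Dataflow; bucket, table, and schema below are placeholders.
import apache_beam as beam

def parse_csv(line):
    # Assumed two-column CSV layout: user_id,amount
    user_id, amount = line.split(',')
    return {'user_id': int(user_id), 'amount': float(amount)}

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/orders-*.csv')
     | 'Parse' >> beam.Map(parse_csv)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:analytics.orders',
           schema='user_id:INTEGER,amount:FLOAT',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```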
I want to design a web UI which fetches data from HDFS. I want to generate some reports using this data, which is stored in HDFS. I have my own custom report format. I am writing REST APIs to fetch the data. But running Hive queries gives latency issues, hence I want a different approach. I can think of two:
Using Impala to create tables. But I am not sure about REST support for Impala.
Using Hive, but with Spark instead of MapReduce as the execution engine. spark-jobserver provides REST support, and data can be fetched with Spark SQL.
Which of the approaches will be suitable, or is there a better approach for this?
Please can anyone help, as I am very new to this.
I'd prefer Impala if latency is the main consideration. It's dedicated to SQL processing on HDFS and does it well. As for the REST API and the application logic you are trying to achieve, this seems to be a good example.
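As a minimal sketch of such a REST layer, assuming the impyla and Flask packages and an impalad listening on impala-host:21050 (the table and endpoint are hypothetical):

```python
# A minimal sketch of a REST endpoint backed by Impala; host, port, and
# table are placeholders.
from flask import Flask, jsonify
from impala.dbapi import connect

app = Flask(__name__)

@app.route('/reports/top-pages')
def top_pages():
    conn = connect(host='impala-host', port=21050)
    cursor = conn.cursor()
    cursor.execute('SELECT page, COUNT(*) AS hits FROM weblogs '
                   'GROUP BY page ORDER BY hits DESC LIMIT 10')
    rows = cursor.fetchall()
    conn.close()
    return jsonify(rows)

if __name__ == '__main__':
    app.run()
```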
Background: We currently have our data split between two relational databases (Oracle and Postgres). There is a need to run ad-hoc queries that involve tables in both databases. Currently we are doing this in one of two ways:
ETL from one database to another. This requires a lot of developer time.
Oracle foreign data wrapper on our Postgres server. This is working, but the queries run extremely slowly.
We already use Google Cloud Platform (for the project that uses the Postgres server). We are familiar with Google BigQuery (BQ).
What we want to do:
We want most of our tables from both these databases (as-is) available at a single location, so querying them is easy and fast. We are thinking of copying over the data from both DB servers into BQ, without doing any transformations.
It looks like we need to take full dumps of our tables on a periodic basis (daily) and update BQ, since BQ is append-only. The recently added DML support in BQ seems to be very limited.
We are aware that loading the tables as is to BQ is not an optimal solution and we need to denormalize for efficiency, but this is a problem we have to solve after analyzing the feasibility.
My question is whether BQ is a good solution for us, and if yes, how to efficiently keep BQ in sync with our DB data, or whether we should look at something else (like, say, Redshift)?
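Not an official recipe, but the daily full-dump pattern described in the question can be sketched with the google-cloud-bigquery Python client; WRITE_TRUNCATE replaces the table on each load, which sidesteps the append-only/limited-DML concern. The bucket, dataset, and table names are hypothetical:

```python
# A minimal sketch of a daily full-dump-and-replace load, assuming the
# dumps are already staged in GCS; all names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project='my-project')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # Replacing the whole table each day avoids relying on BQ DML.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/dumps/customers-*.csv',
    'my-project.warehouse.customers',
    job_config=job_config,
)
load_job.result()  # block until the load finishes
```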
WePay has been publishing a series of articles on how they solve these problems. Check out https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka.
To keep everything synchronized they:
The flow of data starts with each microservice's MySQL database. These databases run in Google Cloud as CloudSQL MySQL instances with GTIDs enabled. We've set up a downstream MySQL cluster specifically for Debezium. Each CloudSQL instance replicates its data into the Debezium cluster, which consists of two MySQL machines: a primary (active) server and secondary (passive) server. This single Debezium cluster is an operational trick to make it easier for us to operate Debezium. Rather than having Debezium connect to dozens of microservice databases directly, we can connect to just a single database. This also isolates Debezium from impacting the production OLTP workload that the master CloudSQL instances are handling.
And then:
The Debezium connectors feed the MySQL messages into Kafka (and add their schemas to the Confluent schema registry), where downstream systems can consume them. We use our Kafka Connect BigQuery connector to load the MySQL data into BigQuery using BigQuery's streaming API. This gives us a data warehouse in BigQuery that is usually less than 30 seconds behind the data that's in production. Other microservices, stream processors, and data infrastructure consume the feeds as well.
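For flavor, here is a minimal, hypothetical sketch of registering a Debezium MySQL source connector through the Kafka Connect REST API; the hosts, credentials, and database names are placeholders, and the exact config keys vary by Debezium version:

```python
# A minimal sketch of registering a Debezium MySQL source connector via
# the Kafka Connect REST API; all hosts, credentials, and names are
# placeholders, and config keys differ across Debezium versions.
import json
import requests

connector = {
    'name': 'payments-connector',
    'config': {
        'connector.class': 'io.debezium.connector.mysql.MySqlConnector',
        'database.hostname': 'debezium-mysql-host',
        'database.port': '3306',
        'database.user': 'debezium',
        'database.password': 'secret',
        'database.server.id': '184054',
        'database.server.name': 'payments',
        'database.include.list': 'payments',
        'database.history.kafka.bootstrap.servers': 'kafka:9092',
        'database.history.kafka.topic': 'schema-changes.payments',
    },
}
resp = requests.post('http://connect-host:8083/connectors',
                     headers={'Content-Type': 'application/json'},
                     data=json.dumps(connector))
resp.raise_for_status()
```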
As part of my Spark job I am storing the output to a Hive table on HDInsight. I now want to expose the data to any COTS tool that can consume an OData feed, like Tableau or other such tools. I was wondering if anyone has some pointers on how this can be accomplished?
It's easy to do if the data is stored in Hive.
An HDInsight Spark cluster has a Thrift server set up that allows BI tools like Tableau/Power BI to process data via Spark.
See:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/
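Tableau and Power BI connect through the Spark ODBC driver; as a quick connectivity check from code, here is a minimal pyodbc sketch, assuming a DSN named SparkHDInsight has been configured with the Microsoft Spark ODBC driver (the table name is a placeholder):

```python
# A minimal sketch verifying Thrift-server connectivity with pyodbc,
# assuming a pre-configured "SparkHDInsight" DSN; table is a placeholder.
import pyodbc

conn = pyodbc.connect('DSN=SparkHDInsight', autocommit=True)
cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM my_hive_table')
print(cursor.fetchone()[0])
conn.close()
```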
I am trying to figure out a decent but simple tool which I can host myself on AWS EC2, which will allow me to pull data out of SQL Server 2005 and push it to Amazon Redshift.
I basically have a view in SQL Server on which I am doing SELECT *, and I just need to put all this data into Redshift. The biggest concern is that there is a lot of data, and this will need to be configurable so I can queue it, run it as a nightly/continuous job, etc.
Any suggestions?
alexeypro, once you dump the tables to files, you have two fundamental challenges to solve:
Transporting data to Amazon
Loading data to Redshift tables.
Amazon S3 will help you with both:
S3 supports fast upload of files to Amazon from your SQL Server location. See this great article. It is from 2011, but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data, and 16 uploader threads were OK, as I'm not on a backbone connection. The key thing to remember is that compression and parallel upload are your friends for cutting down upload time.
Once the data is on S3, Redshift supports high-performance parallel loads from files on S3 into table(s) via the COPY SQL command. To get the fastest load performance, pre-partition your data based on the table distribution key and pre-sort it to avoid expensive vacuums. It is all well documented in Amazon's best practices. I have to say these guys know how to make things neat & simple, so just follow the steps.
If you are a coder, you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command-line access to Redshift (psql) to launch the COPY command.
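As a minimal sketch of that orchestration in Python, assuming boto3 and psycopg2 are installed and with hypothetical bucket, cluster, and IAM role names (boto3's upload_file already performs multipart/parallel upload for large files):

```python
# A minimal sketch of upload-to-S3 plus Redshift COPY; bucket, cluster,
# table, and IAM role names are placeholders.
import boto3
import psycopg2

# upload_file uses S3's multipart transfer manager under the hood.
s3 = boto3.client('s3')
s3.upload_file('orders.csv.gz', 'my-etl-bucket', 'loads/orders.csv.gz')

conn = psycopg2.connect(
    host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='warehouse', user='loader', password='secret')
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
    COPY orders
    FROM 's3://my-etl-bucket/loads/orders.csv.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
    GZIP CSV;
""")
conn.close()
```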
Another option is Java; there are libraries for S3 upload and JDBC access to Redshift.
As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. It took care of automating the process as well as partitioning/pre-sorting the files for the load.
SSIS PowerPack has since been released, so you can do it natively.
SSIS Amazon Redshift Data Transfer Task
Very fast bulk copy from on-premises data to Amazon Redshift in a few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from flat files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift data warehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32-bit and 64-bit)
Why SSIS PowerPack?
A high-performance suite of custom SSIS tasks, transforms, and adapters
With existing ETL tools, an alternate option to avoid staging data in Amazon (S3/Dynamo) is to use the commercial DataDirect Amazon Redshift Driver, which supports a high-performance load over the wire without additional dependencies to stage data.
https://blogs.datadirect.com/2014/10/recap-amazon-redshift-salesforce-data-integration-oow14.html
For getting data into Amazon Redshift, I made DataDuck http://dataducketl.com/
It's like Ruby on Rails but for building ETLs.
To give you an idea of how easy it is to set up, here's how you get your data into Redshift.
Add gem 'dataduck' to your Gemfile.
Run bundle install
Run dataduck quickstart and follow the instructions
This will autogenerate files representing the tables and columns you want to migrate to the data warehouse. You can modify these to customize it, e.g. remove or transform some of the columns.
Commit this code to your own ETL project repository
Git pull the code on your EC2 server
Run dataduck etl all on a cron job, from the EC2 server, to transfer all the tables into Amazon Redshift
Why not a Python+boto+psycopg2 script?
It will run on an EC2 Windows or Linux instance.
If it's a Windows OS, you could:
Extract data from SQL Server (using sqlcmd.exe)
Compress it (using gzip.GzipFile)
Multipart upload it to S3 (using boto)
Append it to the Amazon Redshift table (using psycopg2)
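A minimal sketch of the extract-and-compress steps, assuming sqlcmd.exe is on the PATH and with placeholder server, view, and file names; the upload and COPY steps would then follow as in the boto3/psycopg2 sketch earlier:

```python
# A minimal sketch of extracting a SQL Server view with sqlcmd and
# compressing it for upload; server, view, and file names are placeholders.
import gzip
import shutil
import subprocess

# Extract the view to a pipe-delimited file (no headers, trusted auth).
subprocess.check_call([
    'sqlcmd', '-S', 'myserver', '-d', 'mydb', '-E',
    '-Q', 'SET NOCOUNT ON; SELECT * FROM dbo.MyView',
    '-s', '|', '-W', '-h', '-1', '-o', 'myview.txt',
])

# Compress before upload to cut transfer time and S3 cost.
with open('myview.txt', 'rb') as src, gzip.open('myview.txt.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)
```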
Similarly, it worked for me when I wrote Oracle-To-Redshift-Data-Loader.