The MADlib website suggests it is compatible with PostgreSQL. Amazon Redshift is based on PostgreSQL. Can I install MADlib on Redshift?
The MADlib documentation does suggest it is compatible with Postgres, but you only get the full advantage of MADlib when you run it on an MPP (massively parallel processing) database. It also uses some internal Python libraries that are similar in Postgres and Greenplum, which may not be the case in Amazon Redshift. It is better to use it with Greenplum, which is now open source and is based on Postgres; otherwise you will not be able to get the most out of it.
Related
Since YugabyteDB is a Postgres-based implementation, will Postgres large objects (https://www.postgresql.org/docs/9.2/largeobjects.html) work in YSQL?
YugabyteDB currently supports BYTEA, the same as PostgreSQL, but it does not support large objects the way PostgreSQL does (splitting the blob internally into chunks). There is a feature request issue on GitHub for large object support.
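Until large objects are supported, one possible workaround is to do the chunking on the client and store each chunk as an ordinary BYTEA row. The following is only a minimal sketch of that idea, not a YugabyteDB API: the function names, the (chunk_no, chunk) row layout, and the 1 MiB chunk size are all assumptions made for illustration, and the actual INSERT/SELECT statements are left out.

```python
from typing import Iterator, List, Tuple

# Hypothetical chunk size (1 MiB); an assumption, tune it for your workload.
CHUNK_SIZE = 1024 * 1024


def split_blob(blob: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (chunk_no, chunk) pairs; each chunk would go into one BYTEA row."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for chunk_no, start in enumerate(range(0, len(blob), chunk_size)):
        yield chunk_no, blob[start:start + chunk_size]


def join_chunks(rows: List[Tuple[int, bytes]]) -> bytes:
    """Reassemble the blob from (chunk_no, chunk) rows fetched in any order."""
    ordered = sorted(rows, key=lambda row: row[0])
    return b"".join(chunk for _, chunk in ordered)
```

Each pair from split_blob would be inserted as one row keyed by some blob id plus chunk_no, and join_chunks rebuilds the value after selecting those rows back.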
SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody shed some more light on how exactly these two scenarios differ, and on the pros and cons of both approaches?
When SparkSQL uses Hive
SparkSQL can use HiveMetastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to do better optimization of the queries that it executes. Here Spark is the query processor.
When Hive uses Spark
See the JIRA entry: HIVE-7292
Here the data is accessed via Spark, and Hive is the query processor. So we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2 2016.
There is a third option to process data with SparkSQL
Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore. And the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
SparkSQL vs Spark API: you can simply imagine you are in the RDBMS world.
SparkSQL is pure SQL, and the Spark API is the language for writing stored procedures.
Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses Spark as its execution engine. SparkSQL uses Hive's syntax, so as a language, I would say they are almost the same.
But Hive on Spark has much better support for Hive features, especially hiveserver2 and security features. The Hive features in SparkSQL are really buggy: there is a hiveserver2 implementation in SparkSQL, but in the latest release version (1.6.x), hiveserver2 in SparkSQL no longer works with the hivevar and hiveconf arguments, and the username for login via JDBC doesn't work either.
see https://issues.apache.org/jira/browse/SPARK-13983
I believe Hive support in the Spark project is really a very low priority.
Sadly, Hive on Spark integration is not that easy; there are a lot of dependency conflicts, such as
https://issues.apache.org/jira/browse/HIVE-13301
And when I try Hive with Spark integration, for debugging purposes I always start the Hive CLI like this:
export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive --hiveconf hive.root.logger=DEBUG,console
Our requirement is to use Spark with hiveserver2 in a secure way (with authentication and authorization). Currently SparkSQL alone cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
I hope this helps you get a better idea of which direction you should go.
Here is a related answer I found on the official Hive site:
1.3 Comparison with Shark and Spark SQL
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL.
● The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
● Spark SQL is a feature in Spark. It uses Hive’s parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive’s integration with authorization, monitoring, auditing, and other operational tools.
3. Hive-Level Design
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design.
The approach of executing Hive’s MapReduce primitives on Spark that is different from what Shark or Spark SQL does has the following direct advantages:
1. Spark users will automatically get the whole set of Hive’s rich features, including any new features that Hive might introduce in the future.
2. This approach avoids or reduces the necessity of any customization work in Hive’s Spark execution engine.
3. It will also limit the scope of the project and reduce longterm maintenance by keeping Hive-on-Spark congruent to Hive MapReduce and Tez.
The DHIS2 documentation mentions that it supports MySQL (https://docs.dhis2.org/2.28/en/implementer/html/installation.html); however, that is the last point at which MySQL is mentioned.
Does the current version really support MySQL? If it does, will GIS still work?
From a direct DHIS2 support email:
Up until and including version 2.28, mysql should work.
However, from version 2.29 we require PostgreSQL as the database platform, together with the PostGIS spatial extension. This means that MySQL is no longer supported.
The minimum version required is PostgreSQL 9.1. However we recommend upgrading to a later version as we plan to take advantage of some of the useful features part of PostgreSQL 10 such as logical replication and native partitioning in future versions of DHIS 2.
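The cutoff described in the email can be written down as a small check. This is only an illustrative sketch: the function names are hypothetical and nothing here is part of DHIS2 itself. It compares version numbers as integers, because a plain string comparison would rank "2.9" above "2.28".

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Turn a version string such as '2.28' into (2, 28) for numeric comparison."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def mysql_supported(dhis2_version: str) -> bool:
    """True up to and including 2.28, per the support email quoted above."""
    return parse_version(dhis2_version) <= (2, 28)


def postgis_required(dhis2_version: str) -> bool:
    """True from 2.29 on, where PostgreSQL with PostGIS is required."""
    return parse_version(dhis2_version) >= (2, 29)
```

For example, mysql_supported("2.28") is true, while mysql_supported("2.29") is false and postgis_required("2.29") is true.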
First of all, it is recommended to use Postgres.
Secondly, most of the testing and QA is done on instances with Postgres.
Thirdly, the PostGIS extension is available only in Postgres, which can become a hurdle for you at a later stage.
Fourthly, the GIS data points and boundaries are stored in a format that is better handled in the Postgres DB structure.
Therefore, please go with Postgres and chill.
We are just starting to use Google Analytics data in BigQuery and previously used only MSSQL Server in our work environment. We would like to move some of the analysis to GCP and BigQuery, but cannot decide which is the better option to use: standard or legacy SQL?
In both cases we would have to adjust to the new language version, but the real question is what is the best choice when it comes to Google Analytics data analysis? Is there something that from the technical point of view should make us choose legacy over standard, or the other way around?
It is confusing to us that there are two versions, because legacy seems to be more developed now, but perhaps standard will be the main SQL version in BQ in the future?
BigQuery Standard SQL is the way to go. It has many more features than Legacy SQL.
Note: it is not a binary choice. You can always use Legacy SQL if there is something you find easier to express with it. In my experience it is mostly the opposite, with very few exceptions. The most prominent one for me is table decorators: support for table decorators in standard SQL is planned but not yet implemented.
I would recommend looking into Migrating from legacy SQL, not from a migration point of view, as you are new to BigQuery, but because it is a good place to see and compare the features of both dialects in one place.
I also recommend checking the BigQuery Issue Tracker for some extra insight.
Standard SQL is the preferred SQL dialect for use in BigQuery, as stated in the migration guide. While legacy SQL has been around for quite some time--and is still the default at the time of this writing--there is no active development work on it. If you are evaluating which to use, you should pick standard SQL, since in addition to being more similar to T-SQL (SQL Server's dialect) it is more expressive, has fewer surprising edge cases, and generally has more features.
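One of the first differences you hit when moving between the dialects is how tables are named: legacy SQL uses square brackets with a colon after the project, such as [project:dataset.table], while standard SQL uses backticks with dots, such as `project.dataset.table`. The helper below is a small sketch of that rewrite; the function name is hypothetical and it makes no attempt to handle legacy-only features such as table decorators.

```python
import re

# Legacy SQL table reference: [project:dataset.table] or [dataset.table].
_LEGACY_REF = re.compile(r"\[(?:([^\]:]+):)?([^\].]+)\.([^\]]+)\]")


def legacy_to_standard_ref(ref: str) -> str:
    """Rewrite a legacy SQL table reference in standard SQL backtick form."""
    match = _LEGACY_REF.fullmatch(ref.strip())
    if match is None:
        raise ValueError("not a legacy SQL table reference: " + repr(ref))
    project, dataset, table = match.groups()
    # The project part is optional in legacy references, so drop it if absent.
    parts = [part for part in (project, dataset, table) if part]
    return "`" + ".".join(parts) + "`"


if __name__ == "__main__":
    print(legacy_to_standard_ref("[bigquery-public-data:samples.shakespeare]"))
```

Run as a script, this prints the backtick-quoted, dot-separated form of the public Shakespeare sample table.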
Go with Standard SQL, as that's on the longterm roadmap.
From experience some queries are faster under Legacy SQL, but this is changing as Standard SQL is the one that is actively developed.
I'm using a third-party tool to query our data stored in BigQuery. The third-party tool uses a BigQuery JDBC driver. I would like to take advantage of UDFs, but I do not see any documentation or support for UDFs in the JDBC driver. Is it supported? If not, is there an ETA?
We're currently only supporting UDFs through the API, but we do have work planned to support declarative definition (and persistence) of your functions. This won't be shipped before 2016.
Just out of curiosity, what tool are you using? Tableau?
There is an inline way to define a UDF, but it is "alpha", unsupported, and undocumented. Use at your own risk.
https://stackoverflow.com/a/36208489/2259571