Is visualization with Druid and Superset only possible for time-series data? - data-visualization

When I connect Druid and Superset, can I use the data I have only for time-series charts, or can I use it for other chart types as well (deck.gl Scatterplot, Mapbox, etc.)?

Through SQLAlchemy, Superset generates database-specific queries behind the no-code Explore interface. As long as the database is well supported in Superset, all chart types should work!
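For what it's worth, a minimal sketch of what sits underneath, assuming the pydruid SQLAlchemy dialect that Superset typically uses for Druid; the host, port, and the "wikipedia" datasource are placeholders:

```python
# Hedged sketch: Superset issues ordinary SQL to Druid through pydruid's
# SQLAlchemy dialect, so non-time-series charts are just GROUP BY queries.
from sqlalchemy import create_engine, text

# Placeholder broker host/port; pydruid must be installed alongside Superset.
engine = create_engine("druid://localhost:8082/druid/v2/sql/")

with engine.connect() as conn:
    # A plain aggregation with no __time column, the kind of query a
    # bar chart or a deck.gl Scatterplot would need.
    rows = conn.execute(
        text("SELECT channel, COUNT(*) AS edits FROM wikipedia GROUP BY channel")
    )
    for channel, edits in rows:
        print(channel, edits)
```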

Related

Time Series data on PostgreSQL

MongoDB 5.0 comes with support for time series collections: https://docs.mongodb.com/manual/core/timeseries-collections/
I wonder what the status of time series support in PostgreSQL is.
I could quickly find TimescaleDB (https://github.com/timescale/timescaledb), which is an open-source extension for PostgreSQL,
and a detailed PostgreSQL usage example from AWS: https://aws.amazon.com/blogs/database/designing-high-performance-time-series-data-tables-on-amazon-rds-for-postgresql/
Please comment on these options or suggest others. Maybe other SQL databases have been extended to natively support time series.
Postgres has a very good native storage format for time series data: arrays. I remember using them for this purpose 20 years ago. You can transform the array data back into a table with the unnest function, and arrays use transparent compression; a minimal sketch follows below.
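A minimal sketch of this array-plus-unnest layout, assuming psycopg2 and a local database; the metrics table, connection string, and values are illustrative only:

```python
# Hedged sketch: pack samples into a Postgres array per row, then explode
# them back into individual readings with unnest().
import psycopg2

conn = psycopg2.connect("dbname=tsdemo")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Hypothetical layout: one row per sensor per day, samples packed in an array.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            sensor_id int,
            day       date,
            samples   double precision[]
        )
    """)
    cur.execute(
        "INSERT INTO metrics VALUES (%s, %s, %s)",
        (1, "2021-01-01", [20.1, 20.4, 19.8]),
    )
    # unnest() turns the packed array back into one row per sample.
    cur.execute("""
        SELECT sensor_id, day, unnest(samples) AS value
        FROM metrics
        WHERE sensor_id = 1
    """)
    for row in cur.fetchall():
        print(row)
```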
TimescaleDB uses this internally, combined with massive partitioning and some extensions to the optimizer and planner, so the use of arrays is transparent to users. TimescaleDB is a very good product and a smart solution, and it extends Postgres's possibilities very well. Postgres only has generic compression, while Timescale adds compression techniques designed specifically for time series data, which are much more effective. It is a really good product.
Working with time series data is usually much simpler than working with general data. Time series data (and queries) have a clean, predictable shape, and the data are append-only, so it is not too hard to write a time series database; there are a lot of products of this kind. The benefit of Timescale is that it is an extension of Postgres: if you know Postgres, you know TimescaleDB. And unlike other time series databases, it is SQL-based, so if you know SQL you don't need to learn a new special language.
Attention: a lot of SQL databases support temporal data, but that is different from time series data.

Apache Ignite analogue of Spark vector UDF and distributed compute in general

I have been using Spark for some time now with success in Python; however, we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark, but it is a little restrictive at the moment.
Regarding Ignite, on the surface it seems like a decent alternative: it has good .NET support, clustering, and the ability to distribute compute across the grid.
However, I was wondering if it really can replace Spark in our use case. What we need is a distributed way to perform data-frame-type operations. In particular, a lot of our code in Python was implemented using Pandas UDFs, and we let Spark worry about the data transfer and merging of results.
If I wanted to use Ignite, where our data is really more like a table (typically CSV-sourced) than key/value based, is there an efficient way to represent that data across the grid and to send computations to the cluster that execute on an arbitrary subset of the data, the way Spark does? In particular, the results of the calculations should just become 1..n more columns in the dataframe without having to collect all the results back to the main program.
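For reference, the Pandas UDF pattern described above looks roughly like this in PySpark; the CSV path and column names are placeholders:

```python
# Hedged sketch of the existing Spark-side pattern (PySpark 3.x with pyarrow):
# a Pandas UDF runs on the workers over Arrow batches and its output simply
# becomes one more column, with no collect() back to the driver.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # placeholder file

@pandas_udf("double")
def scaled(value: pd.Series) -> pd.Series:
    # Arbitrary per-batch computation; Spark handles partitioning and transfer.
    return value * 2.0

df = df.withColumn("value_scaled", scaled(df["value"]))
```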
You can load your structured data (CSV) into Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
It will give you distributed SQL queries over this data, with index support. Spark also provides the possibility to work with structured data using SQL, but there are no indexes; indexes will help you significantly increase the performance of your SQL operations.
If you already have a solution that works with Spark data frames, you can also keep the same logic but use the Ignite integration with Spark instead:
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all the data stored in Ignite SQL tables and run SQL queries and other operations through Spark.
Here you can see an example of how to load CSV data into Ignite using Spark DataFrames and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite
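A hedged sketch of that Spark-to-Ignite path from PySpark, assuming the ignite-spark integration jars are on the classpath; the config path, table name, and primary key column are placeholders:

```python
# Hedged sketch: write a CSV into an Ignite SQL table through the Spark
# DataFrame API, then read it back for distributed SQL with indexes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignite-df-example").getOrCreate()
config_file = "/opt/ignite/config/default-config.xml"  # placeholder Ignite config

# Load the CSV with Spark and persist it as an Ignite SQL table.
csv = spark.read.csv("people.csv", header=True, inferSchema=True)
(csv.write
    .format("ignite")
    .option("config", config_file)
    .option("table", "person")
    .option("primaryKeyFields", "id")   # assumed key column in the CSV
    .mode("overwrite")
    .save())

# Read the Ignite table back as a regular Spark DataFrame.
persons = (spark.read
    .format("ignite")
    .option("config", config_file)
    .option("table", "person")
    .load())
persons.show()
```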

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL more than in PySpark, since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs scala/python code)
you can run it easily on the GUI if needed
fast without having to really reason about partitioning, caching and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow is often like:
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something?
Is there any con to using BigQuery this way instead of Spark?
Thanks
One con I can see is the additional time the Hadoop cluster needs to create and finish the job; by making a direct request to BigQuery, this extra time can be reduced.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ Client Libraries and separate your current tasks:
BigQuery Client Libraries. They are optimized to connect to BQ. Here is a QuickStart, and you can use different programming languages such as Python or Java, among others (see the sketch after this list).
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ, you can use the Dataproc-BQ connector. While this connector is installed on Dataproc by default, you can also install it on-premises so that you can continue running your SparkML jobs with BQ data. Just in case it helps, you might want to consider some GCP services like AutoML, BQ ML, and AI Platform Notebooks; they are specialized services for machine learning and AI.
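As a reference point for the first option, a minimal sketch with the Python client library, assuming google-cloud-bigquery is installed and default credentials are configured; the query uses a public dataset purely as an example:

```python
# Hedged sketch: query BigQuery directly, no Spark cluster involved.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project/credentials from the environment
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.name, row.total)
```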
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I will summarize my view of the pros and cons of one system against the other. I do admit that your environment could be different, so something I list as a pro might not be one for you.
Pros of Spark:
better testing of the code; it is simpler to build unit tests and run them with mocked data and classes than to try to do this with BigQuery
it's possible to use SQL (SparkSQL) for operations and even combine operations over different data sources (DB, files, BQ); see the sketch after this answer
we have JSON files in a format that BigQuery cannot parse (even though the files are valid JSON)
it is possible to implement more complicated logic naturally for some cases, for example traversing arrays in nested fields and other complicated calculations
better custom monitoring is possible; when we need to check specific metrics in the pipeline, we can send related metrics (StatsD, etc.) more easily
more natural for CI/CD processes
Pros of BigQuery (all with the caveat that the data must already be available):
simplicity of SQL, when all data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scenes, very performant
With both approaches it's possible to check the result quickly in a GUI; for example, a Jupyter Notebook lets you run PySpark instantly. I cannot add notes about ML-related traits, though.
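To make the SparkSQL point above concrete, a hedged sketch of one query spanning a file and a BigQuery table, assuming the spark-bigquery connector is available (as it is on Dataproc); paths and table names are placeholders:

```python
# Hedged sketch: join a JSON file with a BigQuery table in one SparkSQL query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-sources").getOrCreate()

events = spark.read.json("gs://my-bucket/events/*.json")       # placeholder path
users = (spark.read.format("bigquery")
         .option("table", "my_project.analytics.users")        # placeholder table
         .load())

events.createOrReplaceTempView("events")
users.createOrReplaceTempView("users")

joined = spark.sql("""
    SELECT u.country, COUNT(*) AS n_events
    FROM events e
    JOIN users u ON e.user_id = u.id
    GROUP BY u.country
""")
joined.show()
```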

How to Visualize data in Apache Kudu?

Is it possible to visualize data in Apache Kudu? Is there any guideline for it?
Kudu itself does not have any built-in data visualization tool, just as Oracle is an RDBMS that does not come with one either. However, there are a few options:
Build a custom visualization tool yourself using the Java, Python or C++ API: https://kudu.apache.org/docs/developing.html
Impala is a SQL engine that has built-in integration with Kudu, and it also supports ODBC/JDBC drivers. Thus you can hook almost any BI tool up to Impala to query the data in Kudu and build visualizations (see the sketch after this list).
You can also use Arcadia Data visualization, which connects directly to Kudu tables without going through Impala. Arcadia is built specifically for big data applications and resides on a distributed cluster (a distributed visualization tool for distributed computing).
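For the Impala route, a hedged sketch of pulling Kudu data out through Impala with the impyla client; the host, port, and table are placeholders, and the table is assumed to be stored in Kudu:

```python
# Hedged sketch: query a Kudu-backed Impala table and hand the rows to any
# plotting or BI layer you like.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)  # placeholders
cur = conn.cursor()
# "metrics" is assumed to be an Impala table created with STORED AS KUDU.
cur.execute("SELECT host, AVG(cpu) AS avg_cpu FROM metrics GROUP BY host")
for host, avg_cpu in cur.fetchall():
    print(host, avg_cpu)
```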

How to use spatial queries in google spanner

Can we use geospatial queries in Google Spanner? If not, is there an alternate way we can use spatial queries in SQL by moving the spatial calculations to the application server?
Cloud Spanner doesn't currently support geospatial queries, but might in the future.
One way to implement it yourself is to use something like Google's S2 library to map your geopoints into S2 cells. You can then store that cell id alongside the point and convert your queries into checks against this geohash-like index.
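A hedged sketch of that cell-id mapping, using the s2sphere Python port of S2 as one example; the level and coordinates are arbitrary:

```python
# Hedged sketch: map a lat/lng to an S2 cell id that can be stored in an
# indexed Spanner column and queried via sets/ranges of covering cells.
import s2sphere

def cell_id_for_point(lat, lng, level=12):
    latlng = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(latlng).parent(level).id()

# Example point; note S2 ids are unsigned 64-bit, so you may need to handle
# the signed/unsigned conversion (or store them as strings) for Spanner INT64.
print(cell_id_for_point(40.7484, -73.9857))
```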
If you don't need Spanner's scale, perhaps a Cloud SQL Postgres instance would be better.