How to use spatial queries in Google Spanner - SQL

Can we use geospatial queries in Google Spanner? If not, is there an alternative way to use spatial queries in SQL by moving the spatial calculations into the application server?

Cloud Spanner doesn't currently support geospatial queries, but might in the future.
One way to implement it yourself is to use something like Google's S2 library to map your geopoints into S2 cells. You can then convert your queries into the same format and check them against this geohash-style index.
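As a rough, hedged sketch of that approach in Python, using the s2sphere port of the S2 library: the cell level, the s2_cell_id column name, and the bounding-box query shape below are assumptions for illustration, not anything Spanner provides out of the box.

    # pip install s2sphere
    import s2sphere

    CELL_LEVEL = 13  # hypothetical precision (roughly km-scale cells); tune for your data

    def point_to_cell_id(lat, lng):
        """Map a geopoint to the 64-bit id of its containing S2 cell.
        Store this in an indexed INT64 column (e.g. s2_cell_id) in Spanner."""
        ll = s2sphere.LatLng.from_degrees(lat, lng)
        return s2sphere.CellId.from_lat_lng(ll).parent(CELL_LEVEL).id()

    def bbox_to_cell_ranges(lat_lo, lng_lo, lat_hi, lng_hi):
        """Approximate a bounding-box search as a set of S2 cell-id ranges."""
        coverer = s2sphere.RegionCoverer()
        coverer.min_level = coverer.max_level = CELL_LEVEL
        rect = s2sphere.LatLngRect.from_point_pair(
            s2sphere.LatLng.from_degrees(lat_lo, lng_lo),
            s2sphere.LatLng.from_degrees(lat_hi, lng_hi))
        return [(c.range_min().id(), c.range_max().id())
                for c in coverer.get_covering(rect)]

    # Each (lo, hi) pair then becomes a predicate like
    #   WHERE s2_cell_id BETWEEN @lo AND @hi
    # in the Spanner query, with an exact distance filter applied in the app.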
If you don't need Spanner's scale, perhaps a Cloud SQL Postgres instance would be better.

Related

Is Redshift really required when QuickSight can query directly from S3 using Athena?

We have data dumped into S3 buckets and we use it to pull reports in QuickSight: some reports access S3 directly as a data source, and for others we use Athena to query S3.
At which point does one need to use Redshift? Is there any advantage of using Redshift over S3 + Athena?
No, you might be perfectly fine with just QuickSight, Athena and S3 - it will also be relatively cheaper if you keep Redshift out of the equation. Athena is based on Presto and is pretty comprehensive in terms of functionality for most SQL reporting needs.
You would need Redshift if you approach or hit QuickSight's SPICE limits and would still like your reports to be snappy and load quickly. From a data engineering side, if you need to update existing records, it is easier to micro-batch and update records in Redshift. With Athena/S3 you also need to take care of optimising the storage format (use ORC/Parquet file formats, use partitions, avoid small files, etc.) - it is not rocket science, but some people prefer paying for Redshift and not having to worry about that at all.
In the end, Redshift will probably scale better when your data grows very large (into the petabyte scale). However, my suggestion would be to keep using Athena and follow its best practices, and only use Redshift if you anticipate huge growth and want to be sure that you can scale the underlying engine on demand (and, of course, are willing to pay extra for it).
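As a small, hedged sketch of the storage-format point above: with the awswrangler library you can write partitioned Parquet to S3, register it in the Glue catalog, and query it from Athena so that partition filters prune what gets scanned. The bucket, database and table names here are made up.

    # pip install awswrangler pandas
    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 25.5, 7.2],
        "dt": ["2023-01-01", "2023-01-01", "2023-01-02"],
    })

    # Write as partitioned Parquet and register the table in the Glue catalog,
    # so Athena can prune partitions instead of scanning the whole prefix.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-analytics-bucket/orders/",   # hypothetical bucket/prefix
        dataset=True,
        database="reporting",                      # hypothetical Glue database
        table="orders",
        partition_cols=["dt"],
    )

    # A query that filters on the partition column only reads the matching files.
    daily = wr.athena.read_sql_query(
        "SELECT dt, SUM(amount) AS revenue FROM orders WHERE dt = '2023-01-01' GROUP BY dt",
        database="reporting",
    )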

Time Series data on PostgreSQL

MongoDB 5.0 comes with support for time series: https://docs.mongodb.com/manual/core/timeseries-collections/
I wonder, what is the status of PostgreSQL support for time series?
I could quickly find TimescaleDB (https://github.com/timescale/timescaledb), which is an open-source extension for PostgreSQL, and a detailed PostgreSQL usage example from AWS: https://aws.amazon.com/blogs/database/designing-high-performance-time-series-data-tables-on-amazon-rds-for-postgresql/
Please comment on these options or suggest others. Maybe there are other SQL databases extended to natively support time series.
Postgres has very good storage for time series data: arrays. I remember using them for this purpose 20 years ago. You can transform the data back into a table with the unnest function, and arrays use transparent compression.
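A minimal, hedged sketch of that pattern (the table layout and connection string are hypothetical):

    # pip install psycopg2-binary
    import psycopg2

    conn = psycopg2.connect("dbname=metrics user=postgres")  # hypothetical DSN
    cur = conn.cursor()

    # One row per sensor and day; the day's samples live in an array column.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sensor_id int,
            day       date,
            samples   double precision[]
        )
    """)
    cur.execute(
        "INSERT INTO readings VALUES (%s, %s, %s)",
        (42, "2023-01-01", [20.1, 20.4, 19.8, 21.0]),
    )

    # unnest() expands the array back into one row per sample when you need
    # a relational view of the data.
    cur.execute("""
        SELECT sensor_id, day, s.sample
        FROM readings, unnest(samples) AS s(sample)
        WHERE sensor_id = 42
    """)
    print(cur.fetchall())
    conn.commit()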
TimescaleDB uses this internally, with massive partitioning and some extensions to the optimizer and planner, so the use of arrays is transparent to users. TimescaleDB is a very good product and a smart solution, and it extends Postgres's possibilities very well. Postgres has only generic compression, but Timescale supports special compression techniques designed for time series data, which are much more effective.
Working with time series data is usually much simpler than working with general data: time series data (and queries) have clean characteristics and the data is append-only. It's not too hard to write a time series database, and there are lots of products of this kind. The benefit of TimescaleDB is that it is an extension of Postgres, so if you know Postgres, you know TimescaleDB. And unlike other time series databases, it is SQL based, so if you know SQL, you don't need to learn a new special language.
Note: a lot of SQL databases support temporal data, but that is different from time series data.

Spark SQL vs Hive vs Presto SQL for analytics on top of Parquet files

I have terabytes of data stored in Parquet format for an analytics use case. There are multiple big tables which need joins, and there are heavy queries. The system is expected to be highly scalable. Currently, I am evaluating Spark SQL, Hive and Presto SQL. On paper, all of them seem to meet the requirements. Could you please shed some light on the differences and on what should be considered for the above-mentioned use case? Tableau will be used for visualization on top of this.

Apache Ignite analogue of Spark vector UDF and distributed compute in general

I have been using Spark for some time now with success in Python; however, we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark, but it is a little restrictive at the moment.
Regarding Ignite, on the surface it seems like a decent alternative. It has good .NET support, clustering ability, and the ability to distribute compute across the grid.
However, I was wondering if it can really be used to replace Spark in our use case - what we need is a distributed way to perform data-frame-type operations. In particular, a lot of our code in Python was implemented using Pandas UDFs, and we let Spark worry about the data transfer and merging of results (a simplified sketch of that pattern follows below).
If I wanted to use Ignite, where our data is really more like a table (typically CSV sourced) rather than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data, in the same way Spark does? In particular, can the results of the calculations simply become 1..n more columns in the dataframe, without having to collect all the results back to the main program?
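A simplified sketch of that Pandas UDF pattern (the column and file names are made up, not real production code):

    # pip install pyspark pandas pyarrow
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

    # Hypothetical CSV-sourced table of trades.
    df = spark.read.csv("trades.csv", header=True, inferSchema=True)

    @pandas_udf("double")
    def notional(price: pd.Series, qty: pd.Series) -> pd.Series:
        # Runs on executor-side pandas batches; Spark handles the data movement.
        return price * qty

    # The result simply becomes one more column; nothing is collected back
    # to the driver program.
    df = df.withColumn("notional", notional("price", "qty"))
    df.show()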
You can load your structured data (CSV) to Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
This will give you the ability to run distributed SQL queries over this data, with index support. Spark also provides the ability to work with structured data using SQL, but there are no indexes; indexes will help you significantly increase the performance of your SQL operations.
If you already have a solution working with Spark data frames, then you can also keep the same logic but use the Ignite integration with Spark instead:
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all of the data stored in Ignite SQL tables and run SQL queries and other operations through Spark.
Here you can see an example of how to load CSV data into Ignite using a Spark data frame and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite
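As a rough, hedged sketch of the first suggestion (an Ignite SQL table queried from the Python thin client): the table layout, index and file name below are assumptions, and for real data volumes you would use the Spark data-frame integration linked above rather than row-by-row inserts.

    # pip install pyignite
    import csv
    from pyignite import Client

    client = Client()
    client.connect('127.0.0.1', 10800)

    # A partitioned SQL table spreads the rows across the cluster nodes.
    client.sql("""
        CREATE TABLE IF NOT EXISTS trades (
            id INT PRIMARY KEY,
            symbol VARCHAR,
            price DOUBLE,
            qty INT
        ) WITH "template=partitioned"
    """)
    client.sql("CREATE INDEX IF NOT EXISTS idx_symbol ON trades (symbol)")

    # Load the CSV (parsed in the application here, just for illustration).
    with open("trades.csv") as f:
        for row in csv.DictReader(f):
            client.sql(
                "INSERT INTO trades (id, symbol, price, qty) VALUES (?, ?, ?, ?)",
                query_args=[int(row["id"]), row["symbol"],
                            float(row["price"]), int(row["qty"])],
            )

    # The query is executed distributed across the cluster and can use the index.
    for symbol, avg_price in client.sql("SELECT symbol, AVG(price) FROM trades GROUP BY symbol"):
        print(symbol, avg_price)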

Can I use the same programming language in BigQuery and Google Cloud Dataflow?

I want to use the same function for parsing events in two different technologies: Google BigQuery and Dataflow. Is there a language I can do this in? If not, is Google planning to support one any time soon?
Background: Some of this parsing is complex (e.g., applying custom URL extraction rules, extracting information out of the user agent) but it's not computationally expensive and doesn't involve joining the events to any other large look-up tables. Because the parsing can be complex, I want to write my parsing logic in only one language and run it wherever I need it: sometimes in BigQuery, sometimes in other environments like DataFlow. I want to avoid writing the same complex parsers/extractors in different languages because of the bugs and inconsistencies that can result from that.
I know BigQuery supports JavaScript UDFs. Is there a clean way to run JavaScript on Google Cloud Dataflow? Will BigQuery someday support UDFs in some other language?
We tend to use Java to drive BigQuery jobs and parse their resulting data, and we do the same in Dataflow.
Likewise, you have leeway in how much SQL you write versus auto-generate from the code base, and in how much you lean on BigQuery versus Dataflow.
(We have found, with our larger amounts of data, that there is a big benefit to offloading as much of the initial grouping/filtering as possible into BigQuery before pulling the data into Dataflow.)
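To illustrate that last point, a rough Beam (Python) sketch might push the grouping/filtering into a BigQuery SQL query and keep the custom parsing in one shared Python function; the project, dataset, table and parsing logic here are all placeholders.

    # pip install 'apache-beam[gcp]'
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_user_agent(row):
        # Placeholder for the shared, complex parsing logic, kept in one module
        # so the same code can be reused anywhere Python runs.
        row = dict(row)
        row["is_mobile"] = "Mobile" in (row.get("user_agent") or "")
        return row

    # Do the heavy grouping/filtering in BigQuery SQL first, then run the
    # custom parsing in Dataflow.
    query = """
        SELECT user_agent, url, COUNT(*) AS hits
        FROM `my-project.analytics.events`
        WHERE DATE(event_ts) = '2023-01-01'
        GROUP BY user_agent, url
    """

    # PipelineOptions would normally carry the project, runner and temp locations.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadFiltered" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
         | "Parse" >> beam.Map(parse_user_agent)
         | "Print" >> beam.Map(print))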