Apache Cassandra and Spark - sql

I am an experienced RDBMD's developer and admin. But I am new to Apache Cassandra and Spark. I learned Cassandra's CQL, and the documentation says that CQL does not support joins and sub-queries because it would be too inefficient in Cassandra because of its distributed data nature.
So, I concluded that in distributed data env., joins and sub-queries are not supported because they will affect performance badly.
But then I learned Spark, which also works with distributed data, but Spark supports all SQL features including joins and sub-queries. Even though Spark is not database system and thus does not even have indexes... So, my question is how Spark does support joins and sub-queries on distributed data?, and does it do it efficiently?.
Thanks in advance.

Spark does the "hard work" required to do a join on distributed data. It performs large shuffles to align data on keys before actually performing joins. This basically means that any join requires a very large amount of data movement unless the original data sources are partitioned based on the keys used for joining.
C* does not allow for generic joins like this because of the cost involved, it is geared towards OLTP workloads and requiring a full data shuffle is inherently OLAP.

Apache spark has a concept of RDD(Resilient Distributed DataSet)which gets created in memory.
Its basically a fundamental data structure in spark.
Joins, queries are performed on this RDDs and as it operates in memory ,that`s the reason it is very efficient.
Please go through the docs below for getting some idea on Resilient Dataset
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

Related

What is the relationship between BlazingSQL and dask?

I'm trying to understand if BlazingSQL is a competitor or complementary to dask.
I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage.
IIUC I can query, join, aggregate, groupby with BlazingSQL using SQL syntax, but I can also read the data into CuDF using dask_cudf and do all same operations using python/dataframe syntax.
So, it seems to me that they're direct competitors?
Is it correct that (one of) the benefits of using dask is that it can operate on partitions so can operate on datasets larger than GPU memory whereas BlazingSQL is limited to what can fit on the GPU?
Why would one choose to use BlazingSQL rather than dask?
Edit:
The docs talk about dask_cudf but the actual repo is archived saying that dask support is now in cudf itself. It would be good to know how to leverage dask to operate on larger-than-gpu-memory datasets with cudf
Full disclosure I'm a co-founder of BlazingSQL.
BlazingSQL and Dask are not competitive, in fact you need Dask to use BlazingSQL in a distributed context. All distibured BlazingSQL results return dask_cudf result sets, so you can then continuer operations on said results in python/dataframe syntax. To your point, you are correct on two counts:
BlazingSQL is currently limited to GPU memory, and actually some system memory by leveraging CUDA's Unified Virtual Memory. That will change soon, we are estimating around v0.13 which is scheduled for an early March release. Upon that release, memory will spill off and cache to system memory, local drives, or even our supported storage plugins such as AWS S3, Google Cloud Storage, and HDFS.
You can totally write SQL operations as dask_cudf functions, but it is incumbent on the user to know all of those functions, and optimize their usage of them. SQL has a variety of benefits in that it is more accessible (more people know it, and it's very easy to learn), and there is a great deal of research around optimizing SQL (cost-based optimizers for example) for running queries at scale.
If you wish to make RAPIDS accessible to more users SQL is a pretty easy onboarding process, and it's very easy to optimize for because of the reduced scope necessary to optimize SQL operations over Dask which has many other considerations.

Understanding distributed join of Apache Ignite

We are exploring to use Apache Ignite in our project. Basically, we have dozens of oracle tables.And we want to load each table into Ignite Cache ,and then do join between these caches. There are many joins between our tables(so there will be many distributed join between caches).
The uncertain thing it that it could be really hard to collocate our data using the affinity-collocation feature... as described here:
https://apacheignite.readme.io/docs/affinity-collocation
So, I would ask if our data in cache is not collocated, then does Ignite distributed join support this(we are using Ignite 1.7.0)? I would imagine there will be many data movement when doing the join(This would be very similar to SQL on Hadoop, like Hive or Spark SQL)
Also, I am wondering the performance between non-collocation distributed join and spark sql.
I would add that if you use distributed non-collocated mode for SQL queries then it doesn't mean that the data will be silly moved all the time. The engine will try all its best to optimize the execution and, even, it may result in no data movement at all. However, it depends on a type of query and how data is spread our across the cluster.
In any case, my recommendation will be to collocate as much data as you can so that you can rely on the most performant collocated mode and fallback to non-collocated mode for the rest of the scenarios.
I do believe that the performance of non-collocated Ignite queries will be still better than the performance of Spark SQL engine simply because Ignite allows you to index the data while Spark doesn't which is essential.
You are right, non-collocated joins causes many data movement. http://apacheignite.gridgain.org/docs/sql-queries#distributed-joins
Ignite tries to reduce unnecessary data movement using all available ways. Affinity-collocation, Replicated caches, Near Caches, Indices, In-memory data storage.
Also, if you already use Spark, you can try to back it by Ignite to improve performance.
http://insidebigdata.com/2016/06/20/apache-ignite-and-apache-spark-complementary-in-memory-computing-solutions/

access sql database as nosql (couchbase)

I hope to access sql database as the way of nosql key-value pairs/document.
This is for future upgrade if user amount increases a lot,
I can migrate from sql to nosql immediately while application code changes nothing.
Of course I can write the api/solution by myself, just wonder if there is any person has done same thing as I said before and published the solution.
Your comment welcome
While I agree with everything scalabilitysolved has said, there is an interesting feature in the offing for Postgres, scheduled for the 9.4 Postgres release, namely, jsonb: http://www.postgresql.org/docs/devel/static/datatype-json.html with some interesting indexing and query possibilities. I mention this as you tagged Mongodb and Couchbase, both of which use JSON (well, technically BSON in Mongodb's case).
Of course, querying, sharding, replication, ACID guarantees, etc will still be totally different between Postgres (or any other traditional RDBMS) and any document-based NoSQL solution and migrations between any two RDBMS tends to be quite painful, let alone between an RDBMS and a NoSQL data store. However, jsonb looks quite promising as a potential half-way house between two of the major paradigms of data storage.
On the other hand, every release of MongoDB brings enhancements to the aggregation pipeline, which is certainly something that seems to appeal to people used to the flexibility that SQL offers and less "alien" than distributed map/reduce jobs. So, it seems reasonable to conclude that there will continue to be cross pollination.
See Explanation of JSONB introduced by PostgreSQL for some further insights into jsonb.
No no no, do not consider this, it's a really bad idea. Pick either a RDBMS or NoSQL solution based upon how your data is modelled and your usage patterns. Converting from one to another is going to be painful and especially if your 'user amount increases a lot'.
Let's face it, either approach would deal with a large increase in usage and both would benefit more from specific optimizations to their database then simply swapping because one 'scales more'.
If your data model fits RDBMS and it needs to perform better than analyze your queries, check your indexes are optimized and look into caching and better data pattern access.
If your data model fits a NoSQL database then as your dataset grows you can add additional nodes (Couchbase),caching expensive map reduce jobs and again optimizing your data pattern access.
In summary, pick either SQL or NoSQL dependent on your data needs, don't just assume that NoSQL is a magic bullet as with easier scaling comes a much less flexible querying model.

MongoDB and PostgreSQL thoughts

I've got an app fully working with PostgreSQL. After reading about MongoDB, I was interested to see how the app would work with it. After a few weeks, I migrated the whole system to MongoDB.
I like a few things with MongoDB. However, I found certain queries I was doing in PostgreSQL, I couldn't do efficiently in MongoDB. Especially, when I had to join several tables to calculate some logic. For example, this.
Moreover, I am using Ruby on Rails 3, and an ODM called Mongoid. Mongoid is still in beta release. Documentation was good, but again, at times I found the ODM to be very limiting compared to what Active Record offered with traditional (SQL) database systems.
Even to this date, I feel more comfortable working with PostgreSQL than MongoDB. Only because I can join tables and do anything with the data.
I've made two types of backups. One with PostgreSQL and the other with MongoDB. Some say, some apps are more suitable with one or the other type of db. Should I continue with MongoDB and eventually hope for its RoR ODM (Mongoid) to fully mature, or should I consider using PostgreSQL?
A few more questions:
1) Which one would be more suitable for developing a social networking site similar to Facebook.
2) Which one would be more suitable for 4-page standard layout type of website (Home, Products, About, Contact)
You dumped a decades-tested, fully featured RDBMS for a young, beta-quality, feature-thin document store with little community support. Unless you're already running tens of thousands of dollars a month in servers and think MongoDB was a better fit for the nature of your data, you probably wasted a lot of time for negative benefit. MongoDB is fun to toy with, and I've built a few apps using it myself for that reason, but it's almost never a better choice than Postgres/MySQL/SQL Server/etc. for production applications.
Let's quote what you wrote and see what it tells us:
"I like a few things with Mongodb. However, I found certain queries I was
doing in PostgreSql, I couldn't do efficiently in Mongodb. Especially,
when I had to join several tables to calculate some logic."
"I found the ODM to be very limiting compared to what Active Record offered
with traditional (SQL) database systems."
"I feel more comfortable working with PostgreSql than Mongodb. Only because
I can join tables and do anything with the data."
Based on what you've said it looks to me like you should stick with PostgreSQL. Keep an eye on MongoDB and use it if and when it's appropriate. But given what you've said it sounds like PG is a better fit for you at present.
Share and enjoy.
I haven't used MongoDB yet, and may never get round to it as I haven't found anything I can't do with Postgres, but just to quote the PostgreSQL 9.2 release notes:
With PostgreSQL 9.2, query results can be returned as JSON data types.
Combined with the new PL/V8 Javascript and PL/Coffee database
programming extensions, and the optional HStore key-value store, users
can now utilize PostgreSQL like a "NoSQL" document database, while
retaining PostgreSQL's reliability, flexibility and performance.
So looks like in new versions of Postgres you can have the best of both worlds. I haven't used this yet either but as a bit of a fan of PostgreSQL (excellent docs / mailing lists) I wouldn't hesitate using it for almost anything RDBMS related.
First of all postgres is an RDBMS and MongoDB is NoSQL .
but Stand-alone NoSQL technologies do not meet ACID standards because they sacrifice critical data protections in favor of high throughput performance for unstructured applications.
Postgres 9.4 providing NoSQL capabilities along with full transaction support, storing JSON documents with constraints on the fields data.
so you will get all advantages from both RDBMS and NoSQL
check it out for detailed article http://www.aptuz.com/blog/is-postgres-nosql-database-better-than-mongodb/
To experience Postgres' NoSQL performance for yourself. Download the pg_nosql_benchmark at GitHub. here is the link https://github.com/EnterpriseDB/pg_nosql_benchmark
We also have research on the same that which is better. PostGres or MongoDb. but with all facts and figures in hand, we found that PostGres is far better to use than MongoDb. in MongoDb, beside eats up memory and CPU, it also occupies large amount of disk space. It's increasing 2x size of disk on certain interval.
My experience with Postgres and Mongo after working with both the databases in my projects .
Postgres(RDBMS)
Postgres is recommended if your future applications have a complicated schema that needs lots of joins or all the data have relations or if we have heavy writing. Postgres is open source, faster, ACID compliant and uses less memory on disk, and is all around good performant for JSON storage also and includes full serializability of transactions with 3 levels of transaction isolation.
The biggest advantage of staying with Postgres is that we have best of both worlds. We can store data into JSONB with constraints, consistency and speed. On the other hand, we can use all SQL features for other types of data. The underlying engine is very stable and copes well with a good range of data volumes. It also runs on your choice of hardware and operating system. Postgres providing NoSQL capabilities along with full transaction support, storing JSON documents with constraints on the fields data.
General Constraints for Postgres
Scaling Postgres Horizontally is significantly harder, but doable.
Fast read operations cannot be fully achieved with Postgres.
NO SQL Data Bases
Mongo DB (Wired Tiger)
MongoDB may beat Postgres in dimension of “horizontal scale”. Storing JSON is what Mongo is optimized to do. Mongo stores its data in a binary format called BSONb which is (roughly) just a binary representation of a superset of JSON. MongoDB stores objects exactly as they were designed. According to MongoDB, for write-intensive applications, Mongo says the new engine(Wired Tiger) gives users an up to 10x increase in write performance(I should try this), with 80 percent reduction in storage utilization, helping to lower costs of storage, achieve greater utilization of hardware.
General Constraints of MongoDb
The usage of a schema less storage engine leads to the problem of implicit schemas. These schemas aren’t defined by our storage engine but instead are defined based on application behavior and expectations.
Stand-alone NoSQL technologies do not meet ACID standards because they sacrifice critical data protections in favor of high throughput performance for unstructured applications. It’s not hard to apply ACID on NoSQL databases but it would make database slow and inflexible up to some extent.
“Most of the NoSQL limitations were optimized in the newer versions and releases which have overcome its previous limitations up to a great extent”.
Which one would be more suitable for developing a social networking site similar to Facebook?
Facebook currently uses combination of databases like Hive and Cassandra.
Which one would be more suitable for 4-page standard layout type of website (Home, Products, About, Contact)
Again it depends how you want to store and process your data. but any SQL or NOSQL database would do the job.

Scalability of Using MySQL as a Key/Value Database

I am interested to know the performance impacts of using MySQL as a key-value database vs. say Redis/MongoDB/CouchDB. I have used both Redis and CouchDB in the past so I'm very familiar with their use cases, and know that it's better to store key/value pairs in say NoSQL vs. MySQL.
But here's the situation:
the bulk of our applications already have lots of MySQL tables
We host everything on Heroku (which only has MongoDB and MySQL, and is basically 1-db-type per app)
we don't want to be using multiple different databases in this case.
So basically, I'm looking for some info on the scalability of having a key/value table in MySQL. Maybe at three different arbitrary tiers:
1000 writes per day
1000 writes per hour
1000 writes per second
1000 reads per hour
1000 reads per second
A practical example is in building something like MixPanel's Real-time Web Analytics Tracker, which would require writing very often depending on traffic.
Wordpress and other popular software use this all the time: Post has "Meta" model which is just key/value, so you can add arbitrary properties to an object which can be searched over.
Another option is to store a serializable hash in a blob but that seems worse.
What is your take?
I'd say that you'll have to run your own benchmark because it is only you that knows the following important aspects:
the size of the data to be stored in this KV table
the level of parallelism you want to achieve
the number of existing queries reaching your MySQL instance
I'd also say that depending on the durability requirements for this data, you'll also want to test multiple engines: InnoDB, MyISAM.
While I do expect some NoSQL solutions to be faster, based on your constraints you may find out that MySQL will perform good enough for your requirements.
SQL databases are more and more used as a persistance layer, with computations and delivery cached in Key-Value repositories.
With this in mind, those guys have done quite a test here:
InnoDB inserts 43,000 records per second AT ITS PEAK*;
TokuDB inserts 34,000 records per second AT ITS PEAK*;
This KV inserts 100 millions of records per second (2,000+ times more).
To answer your question, a Key-Value repository is more than likely to outdo MySQL by several orders of magnitude:
Processing 100,000,000 items:
kv_add()....time:....978.32 ms
kv_get().....time:....297.07 ms
kv_free()....time:........0.00 ms
OK, your test was 1,000 ops per second, but it can't hurt to be able to do 1,000 times more!
See this for further details (they also compare it with Tokyo Cabinet).
There is no doubt that using a NOSQL solution is going to be faster, since it is simpler.
NOSQL and Relational do not compete with each other, they are different tools that can solve different problems.
That being said for 1000 writes/day or per hour, MySQL will have no problem.
For 1000 per second you will need some fancy hardware to get there. For the NOSQL solution you will probably still need some distributed file system.
It also depends on what you are storing.
Check out the series of blog posts here where the author runs tests comparing MongoDB and MySQL performance, and fights through the MySQL performance tuning mess. MongoDB was doing ~100K row reads per second, MySQL in c/s mode was doing 43K max, but with the embedded library he managed to get it up to 172K row reads per second.
It sounds a little complicated to get that high on a single node, so ymmv.
The writes/second question is a little harder, but this still might give you some ideas on configs to try.
You should first implement it in the simplest way then compare that. Always test things. This means:
Create a schema that's representative of your use case.
Create queries representative of your use case.
Create significant amounts of dummy data representive of your use case.
In a variety of loops, including both random access and sequential, bench mark it.
Ensure you use concurrency (run many processes randomly hammering the server with all kinds of queries representative of your use cases).
Once you have that, measure, test. There are different ways you can go about it. Some tests can be simple but might be less realistic. Measure throughput and latency.
Then try to optimise it.
MySQL has one particular limitation for KV which is the standard Engines with persistence use indexes optimised for range lookups, not for KV, which might introduce some overhead, though it's also difficult to have things such as hash work with persistent storage due to rehashing. Memory tables support a hash index.
Many people associate certain things with being slow such as SQL, RELATIONAL, JOINS, ACID, etc.
When using an ACID capable relational database, you don't have to necessarily use ACID or relations.
While joins have a bad reputation for being slow this is usually down to misconceptions about joins. Often people simply write bad queries. This is made more difficult as SQL is declarative, it can get things wrong, especially with JOINs where there are often multiple ways to perform the join. What people are actually getting out of NoSQL in this case is imperative. NoDeclaritive would be more accurate as that's the problem with SQL a lot of people are having. Quite often people simply lack indexes. That's not an argument in favour of joins but rather to illuminate where people can get it wrong on speed.
Traditional databases can be extremely fast if you do certain special things for that such as ignoring data integrity or handling it elsewhere. You don't have to wait for the harddrive to flush writes, you don't have to enforce relations, you don't have to enforce unique constraints, you don't have to use transactions but if you do replace safety with speed then you need to know what you're doing.
NoSQL solutions by comparison first and foremost tend to be designed to support various modes of scaling out of the box. The performance of an individual node might not be quite what you expect. NoSQL solutions also struggle for general use with many having quite unusual performance characteristics or limited feature sets.