I am pretty new to Cassandra; I just started learning it a week ago.
I first read that it was a NoSQL database, but when I started using CQL,
I began to wonder whether Cassandra is a NoSQL or an SQL DB.
Can someone explain why CQL is more or less like SQL?
CQL is declarative like SQL and the very basic structure of the query component of the language (select things where condition) is the same. But there are enough differences that one should not approach using it in the same way as conventional SQL.
The obvious items: 1. There are no joins or subqueries. 2. No transactions
Less obvious but equally important to note:
Except for the primary key, you can only apply a WHERE condition on a column if you have created an index on that column. In SQL, you don't have to index a column to filter on it, but in CQL the select statement will fail outright.
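For example, a minimal CQL sketch of that behaviour (the users table and its columns are made up for illustration):
-- hypothetical table: id is the primary key, city is an ordinary column
CREATE TABLE users (id uuid PRIMARY KEY, email text, city text);

SELECT * FROM users WHERE city = 'Boston';   -- rejected: city is neither the key nor indexed
CREATE INDEX ON users (city);                -- create a secondary index on city
SELECT * FROM users WHERE city = 'Boston';   -- now accepted
(Newer Cassandra versions also let you force the un-indexed query with ALLOW FILTERING, at a potentially large cost.)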
There are no OR or NOT logical operators, only AND. It is very important to model your data so you won't need these two, and it is very easy to forget this and design a model that needs them.
Date handling is profoundly different. CQL permits ONLY the equality operator for timestamps, so extremely common and useful expressions like this do not work: where dateField > TO_TIMESTAMP('2013-01-01','YYYY-MM-DD'). Also, CQL does not permit string inserts of dates with millisecond precision (seconds only), but it does permit entry of milliseconds since the epoch as a long int, which most other DB engines do NOT permit. Lastly, the timezone (as a GMT offset) is invisibly captured for both long-millis and string formats without a timezone. This can lead to confusion for systems that deliberately do not conflate local time and GMT offset.
You can ONLY update a table based on primary key (or an IN list of primary keys). You cannot update based on other column data, nor can you do a mass update like this: update table set field = value; CQL demands a where clause with the primary key.
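A hedged sketch of the same restriction, reusing the hypothetical users table from the earlier example:
UPDATE users SET email = 'a@b.org'
WHERE id = 123e4567-e89b-12d3-a456-426655440000;   -- OK: primary key in the WHERE clause

UPDATE users SET email = 'a@b.org';                -- rejected: no WHERE clause at all
UPDATE users SET email = 'a@b.org'
WHERE city = 'Boston';                             -- rejected: city is not the primary key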
Grammar for AND does not permit parens. To be fair, they are not necessary because of the lack of the OR operator, but this means traditional SQL rewriters that add "protective" parens around expressions will not work with CQL, e.g.: select * from www where (str1 = 'foo2') and (dat1 = 12312442);
In general, it is best to use Cassandra as a big, resilient permastore of data for which a small number of very high level, very high performance queries can be applied to drag out a subset of data to work with at the application layer. That subset might be 1 million rows, yes. CQL and the Cassandra model is not designed for 2 page long SELECT statements with embedded cases, aggregations, etc. etc.
For all intents and purposes, CQL is SQL, so in the strictest sense Cassandra is an SQL database. However, most people closely associate SQL with the relational databases it is usually applied to. Under this (mis)interpretation, Cassandra should not be considered an "SQL database" since it is not relational, and does not support ACID properties.
Docs for CQLV3.0
CQL has DESCRIBE to get the schema of a keyspace, column family, or cluster.
CQL doesn't support some things I knew from SQL, like joins, GROUP BY, triggers, cursors, transactions, and stored procedures.
CQL 3.0 supports ORDER BY.
CQL supports all DML and DDL functionality.
CQL supports BATCH.
BATCH is not an analogue for SQL ACID transactions.
The docs mentioned above are the best reference :)
Does anyone have any good references to read regarding whether PostgreSQL conforms to all 12 of Codd's rules?
If not, are there any veteran PostgreSQL users that have an opinion on this topic?
Thank you.
PostgreSQL does not comply with all 12 rules. The rules that PG is unable to respect are:
2 - The information rule: because you can create a table without a PRIMARY KEY. But most RDBMSs (Oracle, DB2, SQL Server...) allow tables without PKs, except SQL Azure.
3 - Systematic treatment of null values: the one and only marker accepted for unknown or inapplicable values is NULL.
PostgreSQL adds some extra markers like infinity or -infinity (other RDBMSs, like SQL Server, do not use such ugly things, which cannot be combined in algebraic operations or functions...).
6 - The view updating rule: PostgreSQL does not accept INSERT, UPDATE or DELETE on a view containing JOINs. Most professional RDBMSs accept this (Oracle, DB2, SQL Server...) but limit it to one table in the view, per the general standard SQL rule about INSERTs/UPDATEs/DELETEs (see the sketch after this list).
7 - Relational Operations Rule (for high-level insert, update, and delete): PostgreSQL is unable to UPDATE keys (PRIMARY or UNIQUE) properly. This is because, when updating a batch of rows, PostgreSQL acts row by row and checks the constraint each time it has updated a row, rather than, as all other RDBMSs do, once all rows have been updated. The theory of relational databases is based on the fact that updates must be set-based and not row by row... Of course there is an ugly possible workaround for PostgreSQL (deferrable constraints...), but this affects some other logic (see the sketch after this list).
11 - Distribution independence. When partitioning a table or an index (and more generally for any storage management operation) there must be no logical aspect likely to affect applications... More generally, Codd's theory is based on the fact that logical objects and logical processing must in no case depend on physical objects, and vice versa. Physical/logical separation is essential and all major RDBMSs provide it (Oracle, SQL Server, IBM DB2), but PostgreSQL requires the creation of additional tables to be able to divide a table into partitions...
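To make rules 6 and 7 above concrete, here are two minimal, hypothetical SQL sketches (all table names are made up).
For rule 6, a view over a JOIN and an UPDATE against it:
CREATE VIEW order_details AS
  SELECT o.id, o.order_date, c.name
  FROM   orders o
  JOIN   customers c ON c.id = o.customer_id;

-- PostgreSQL (at the time of writing) rejects this outright;
-- Oracle, DB2 and SQL Server accept it as long as only one base table is touched.
UPDATE order_details SET order_date = CURRENT_DATE WHERE id = 1;
For rule 7, renumbering a key column as a single set operation:
CREATE TABLE t (n int PRIMARY KEY);
INSERT INTO t VALUES (1), (2), (3);

-- A valid set-based update, but a row-by-row uniqueness check fails
-- as soon as row 1 becomes 2, because 2 still exists at that moment.
UPDATE t SET n = n + 1;

-- The workaround mentioned above: declare the key DEFERRABLE
-- so the constraint is only checked at commit time.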
Of course, RDBMSs that respect all those rules are very rare. I have worked with about 20 different RDBMSs, but the only one that I think fully respects all 12 of Codd's rules is Microsoft Azure SQL.
My answer is based on other answers via Google.
The first ANSI standard of the SQL language was created in 1986. The latest ANSI standard is ANSI SQL:2003. Its requirements are nevertheless implemented in only a few RDBMSs. So far the most widespread RDBMSs respect ANSI SQL:1999, or the older ANSI SQL:1992. PostgreSQL supports ANSI SQL:1999 completely and ANSI SQL:2003 partly.
Strictly speaking about Codd's rules:
Codd's 12 rules aren't all there is to the relational model. In fact, he expanded these 12 to 40 rules in his 1990 book The Relational Model for Database Management.
Furthermore, if you care to read Christopher J. Date's 1999 An Introduction to Database Systems, you will see that the relational model comprises some basic elements and some principles.
The basic element is the domain, or data type. PostgreSQL does not really enforce domains because it accepts NULL, which by definition is not part of any domain. Thus the triplet of domain, name and value called an attribute breaks down, and so does the tuple -- because it represents a proposition, and a proposition with missing information is another proposition, not the one declared in the relation's header -- and so the relation breaks down too.
Furthermore, a relation is a set, not a bag. A bag accepts duplicates, but a relation does not. So because PostgreSQL does not enforce the declaration of a candidate key for each and every table, its tables are not necessarily relations, but quite possibly and commonly bags -- not of tuples, as shown above, but simply of rows.
Also, the first principle is the Information Principle: the whole database must be represented as data. Object IDs violate this, with serious consequences for data independence, which by the way is necessary for another relational-model sine qua non, namely the separation between user, logical and physical schemas. That is also not properly supported by PostgreSQL.
While we believe that NoSQL databases have come to fill a number of gaps that are challenging for RDBMSs, I have had several challenges over time with NoSQL DBs in the area of their query ecosystems.
Couchbase, for example, like its parent CouchDB, has had major improvements in reading data using views, lists, key lookups, map-reduce, etc. Couchbase has even moved to create an SQL-like query engine for its big 2.X version. MongoDB has also made serious improvements, complex queries are possible on it, and there are many other NoSQL DB developments going on out there.
Most NoSQL DBs can perform complex queries based on LOGICAL and COMPARISON operators, e.g. AND, OR, ==, etc. However, aggregation and performing complex relations on data are a problem for me. For example, in CouchDB and/or Couchbase, views span only a single DB. It is not possible to write a view which will aggregate data from two or more databases.
Let me now get to the problem: functions (whether aggregate or not) such as AVG, SUM, ROUND, TRUNC, MAX, MIN, etc. The lack of data types makes it impossible to work efficiently with dates and times, hence the lack of date and time functions, e.g. TO_DATE, SYSDATE (for the system date/time), ADD_MONTHS, DATE BETWEEN, date/time format conversion, etc. It is true that many will say these databases lack schemas, types and so on, but I have found that I cannot run away from the need for at least one of the functions listed above. For example, because NoSQL DBs have no date/time data type, it is hard to perform queries based on those, even though you might want to analyse trends over time. Others have tried to use UNIX/epoch timestamps and the like to solve this, but it isn't a one-size-fits-all solution.
Map-reduce can be used to attain aggregation to a certain (small) degree, but the overhead has turned out to be great. Moreover, the lack of GROUP BY functionality makes it a strenuous way to filter down to what you want. Look at the query below:
SELECT
doc.field1, doc.field3, SUM(doc.field2 + doc.field4)
FROM
couchdb.my_database
GROUP BY doc.field1, doc.field3
HAVING SUM(doc.field2 + doc.field4) > 20000;
This is not very easy to attain on CouchDB or Couchbase, and I am not sure whether it is possible on MongoDB. I wish it were possible out of the box. This has made it difficult to use NoSQL as a data warehouse or OLTP/OLAP solution. I found that each time a complex analysis needs to be made, one needs to do it in the middleware by paging through different datasets. Now, some experienced players (e.g. Cloudant) have tweaked Lucene to perform complex queries, but because it was initially meant for indexing and text search, it has not solved the lack of functions and data aggregation on most NoSQL DBs.
Because of the lack of functions, most NoSQL DBs have the NULL data type but lack the option of converting NULL values to something else, as you can in some RDBMSs. For example, in Oracle I could use NVL(COLUMN, 0) in order to include all the rows while performing, say, an AVG calculation on a given column (since, by default, the NULL values are not counted/included in the query processing).
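As a hedged illustration of that point (Oracle-flavoured SQL, hypothetical employees table):
-- AVG normally skips NULLs; NVL lets you treat them as zero instead
SELECT AVG(salary)         AS avg_of_known_salaries,  -- NULL rows are ignored
       AVG(NVL(salary, 0)) AS avg_over_all_rows       -- NULL rows counted as 0
FROM   employees;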
To fully understand the problem, CouchDB views for example operate within the scope of a doc like this below:
function(doc){
  // if statements, logical operators, comparison operators,
  // etc. go here, until you do an emit of that doc
  // if it satisfies the conditions set,
  // e.g. emit(null, doc) or emit(doc.x, [doc.y, doc.z]) etc.
  // You can only emit JavaScript data types anyway.
  emit(doc.field1, doc);
}
The docs which satisfy the filters are let through and go on to the next stage or to a reduce function. Imagine a doc structure like the one below:
{
x: '',
y: '',
z: {
p: '',
n: N // integer or number data type
},
date: 'DD/MON/YYYY' // date format
}
Now, let's imagine the possibility of this kind of query:
function(){
var average = select AVG(doc.z.n) from couchdb.my_database;
var Result = select doc.x,doc.y from couchdb.my_database where
doc.z.n > average and doc.y = 'some string' and
doc.date between '01-JUN-2012' and '03-AUG-2012';
emit(Result);
}
OR if this query were possible:
function(){
var latest = select MAX(doc.date) from couchdb.my_database;
var Result = select
doc.x,doc.z.p,MONTHS_BETWEEN(doc.date,latest) as "Months_interval"
from couchdb.my_database where doc.y like '%john%'
order by doc.z.p;
emit(Result);
}
Qn 1: Which NoSQL database solution has attained, to a great degree, the query capability discussed above? What key features make it stand out?
Qn 2: Is the lack of a schema, or the key-value nature of these databases, a reason for the lack of functions when querying them? What is the reason for the lack of aggregate functionality in most NoSQL DBs?
Qn 3: If the query ability above is possible in any of the NoSQL DBs, show how the last two (2) query problems above can be solved using the existing NoSQL infrastructure (consider any NoSQL technology of your choice).
MongoDB has something called the Aggregation Framework, and it works pretty well. I would say that almost every SQL aggregation query could be carried out with this framework. Here you have some examples of "conversion" from SQL to the Aggregation Framework.
Anyway, MongoDB is a document-oriented database and not key-value like CouchDB, so I don't know if it fits your requirements.
I'm trying to develop something which should be portable between the bigger RDBMSs.
The issue is around generating and using auto-increment numbers as the primary key for a table.
There are two topics here:
(1) The mechanism used to generate the auto-increment numbers.
(2) How to specify that you want to use this as the primary key on a table.
I'm looking for verification for what I think is the current state of affairs:
Unfortunately standardization came late to this area and in some respects is still not implemented (as a mandatory standard). This means that, even in 2013, it is impossible to write a CREATE TABLE statement in a portable way ... if you want it with an auto-generated primary key.
Can this really be so?
Re (1). This is standardized because it came in SQL:2003. As far as I understand, the way to go is SEQUENCEs. I believe these are a mandatory part of SQL:2003, right? The other possibility is the IDENTITY keyword, which is also defined in SQL:2003, but that one is - as far as I can tell - an optional part of the standard ... which means a key player like Oracle doesn't implement it... and can still claim compliance. Ok, so SEQUENCEs are the designated portable method for this, right?
Re (2). Database vendors implement this in different ways. In PostgreSQL you can link the CREATE TABLE statement directly with the sequence; in Oracle you would have to create a trigger to ensure the SEQUENCE is used with the table.
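A rough sketch of that difference, with made-up names (PostgreSQL on one side, Oracle 11g-style on the other):
-- PostgreSQL: the column default can reference the sequence directly
CREATE SEQUENCE foo_seq;
CREATE TABLE foo (
  id   bigint PRIMARY KEY DEFAULT nextval('foo_seq'),
  name varchar(100)
);

-- Oracle (11g and earlier): the sequence has to be wired in with a trigger
CREATE SEQUENCE foo_seq;
CREATE TABLE foo (
  id   NUMBER PRIMARY KEY,
  name VARCHAR2(100)
);
CREATE OR REPLACE TRIGGER foo_bi
BEFORE INSERT ON foo
FOR EACH ROW
BEGIN
  :NEW.id := foo_seq.NEXTVAL;
END;
(PostgreSQL's serial type is just shorthand for the sequence-plus-default pattern shown here.)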
So my conclusion is that without a standardized solution to (2) it really doesn't help much that all the major players now support SEQUENCEs. I would still have to write db-specific code for something as simple as a CREATE TABLE statement.
Is this right?
Standards and their implementation aside, I would also be interested if anyone has a portable solution to the problem, no matter if it is a hack from an RDBMS best-practice perspective. For such a solution to work it would have to be independent of any application, i.e. it must be the database that solves the issue, not the application layer. Perhaps if both the concept of TRIGGERs and SEQUENCEs can be said to be standardized, then a solution that combines the two of them would be portable?
As for "portable create table statements": It starts with the data types: Whether boolean, int or long data types are part of any SQL standard or not, I really appreciate these types. PostgreSql supports these data types, Oracle does not. Ironically Oracle supports boolean in PL/SQL, but not as a data type in a table. Even the length of table/column names etc. are restricted in Oracle to 30 characters. So not even the most simple "create table" is always portable.
As for auto-generated primary keys: I am not aware of a syntax which is portable, so I do not define this in the "create table". Of course this only delays the problem, and leaves it to the insert statements. This topic is connected with another problem: getting the generated key after an insert using JDBC in the most efficient way. This differs substantially between Oracle and PostgreSql, and if you have ever dared to use case-sensitive table/column names in Oracle, it won't be funny.
As for constraints, I prefer to add them in separate statements after "create table". The set of constraints may differ, if you implement a boolean data type in Oracle using char(1) together with a check constraint whereas PostgreSql supports this data type directly.
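For instance, a minimal sketch of that boolean workaround (the accounts table and constraint name are hypothetical):
-- Oracle: emulate boolean with CHAR(1) plus a check constraint
ALTER TABLE accounts ADD active CHAR(1) DEFAULT 'N' NOT NULL;
ALTER TABLE accounts ADD CONSTRAINT accounts_active_ck CHECK (active IN ('Y', 'N'));

-- PostgreSql: the type exists natively
ALTER TABLE accounts ADD active boolean DEFAULT false NOT NULL;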
As for "standards": One example
SQL99 standard: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This message is from PostgreSql; Oracle 11g does not complain. After 14 years, will they change it?
Generally speaking, you still have to write database specific code.
As for your conclusion: In our scenario we implemented a portable database application using a model driven approach. This logical meta data is used by the application, and there are different back ends for different database types. We do not use any ORM, just "direct SQL", because this simplifies tuning of SQL statements, and it gives full access to all SQL features. We wrote our own library, and later we found out that the key ideas match those of "Anorm".
The good news is that while there are tons of small annoyances, it works pretty well, even with complex queries. For example, window aggregate functions are quite portable (row_number(), partition by). You have to use listagg on Oracle, whereas you need string_agg on PostgreSql. Recursive common table expressions require "with recursive" in PostgreSql, while Oracle does not like it. PostgreSql supports "limit" and "offset" in queries; you need to wrap this in Oracle. It drives you crazy if you use SQL arrays both in Oracle and PostgreSql (arrays as columns in tables). There are materialized views on Oracle, but they do not exist in PostgreSql. Surprisingly enough, it is possible to write database stored procedures not only in Java but also in Scala, and this works amazingly well in both Oracle and PostgreSql. This list is not complete. But so far we have managed to find an acceptable (= fast) solution for any "portability problem".
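To give one concrete taste of those small annoyances, here is a hedged sketch of the paging difference (customers is a made-up table):
-- PostgreSql
SELECT id, name FROM customers ORDER BY name LIMIT 20 OFFSET 40;

-- Oracle 11g: wrap the ordered query and filter on a row_number() column
SELECT id, name
FROM (
  SELECT id, name, row_number() OVER (ORDER BY name) AS rn
  FROM   customers
)
WHERE rn BETWEEN 41 AND 60
ORDER BY rn;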
Does it pay off? In our scenario, there is a central Oracle installation (RAC, read/write), but there are distributed PostgreSql installations as localhost databases on each application server (only readonly). This gives a big performance and scalability boost, without the cost penalty.
If you really want to have it solved in the database only, there is one possibility: Put anything in stored procedures, write these in Java/Scala, and restrict yourself in the application to call these procedures, and to read the result sets. This of course just moves the complexity from the application layer into the database, but you accepted hacks :-)
Triggers are quite standardized, if you use Java stored procedures. And if it is supported by your databases, by your management, your data center people, and your colleagues. The non-technical/social aspects are to be considered as well. I have even heard of database tuning people who do not accept the general "left outer join" syntax; they insisted on the Oracle way of using "(+)".
So even if triggers (PL/SQL) and sequences were standardized, there would be so many other things to consider.
Update
As for returning the generated primary keys, I can only judge the situation from JDBC's perspective.
PostgreSql returns it, if you use Statement.getGeneratedKeys (I consider this the normal way).
Oracle requires you to specify the (primary key) column(s) whose values you want to get back explicitly when you create the prepared statement. This works, but only if you are not using case sensitive table names. In that case all you receive is a misleading ORA-00942: table or view does not exist thrown in Oracle's JDBC driver: There was/is a bug in Oracle's JDBC driver, and I have not found a way to get the value using a portable JDBC method. So at the cost of an additional proprietary "select sequence.currVal from dual" within the same transaction right after the insert, you can get back the primary key. The additional time was acceptable in our case, we compared the times to insert 100000 rows: PostgreSql is faster until the 10000th row, after that Oracle performs better.
See a Stack Overflow question regarding the ways to get the primary key, and the bug report with case-sensitive table names from 2008.
This example shows pretty well the problems. Normally PostgreSql follows the way you expect it to work, but you may have to find a special way for Oracle.
Say I have a table:
Id int
Region int
Name nvarchar
select * from table1 where region = 1 and name = 'test'
select * from table1 where name = 'test' and region = 1
Will there be a difference in performance?
Assume no indexes.
Is it the same with LINQ?
Because your qualifiers are, in essence, the same (it doesn't matter what order the WHERE conditions are put in), no, there's no difference between those.
As for LINQ, you will need to know what query LINQ to SQL actually emits (you can use a SQL Profiler to find out). Sometimes the query will be the simplest query you can think of, sometimes it will be a convoluted variety of such without you realizing it, because of things like dependencies on FKs or other such constraints. LINQ also wouldn't use an * for select.
The only real way to know is to find out the SQL Server Query Execution plan of both queries. To read more on the topic, go here:
SQL Server Query Execution Plan Analysis
Should it? No. SQL is a relational algebra and the DBMS should optimize irrespective of order within the statement.
Does it? Possibly. Some DBMS' may store data in a certain order (e.g., maintain a key of some sort) despite what they've been told. But, and here's the crux: you cannot rely on it.
You may need to switch DBMS' at some point in the future. Even a later version of the same DBMS may change its behavior. The only thing you should be relying on is what's in the SQL standard.
Regarding the query given: with no indexes or primary key on the two fields in question, you should assume that you'll need a full table scan for both cases. Hence they should run at the same speed.
I don't recommend the *, because the engine has to look up the table schema before executing the query. Instead, list the table fields you want, to avoid unnecessary overhead.
And yes, the engine optimizes your queries, but help it along :)
Best regards!
For simple queries, likely there is little or no difference, but yes indeed the way you write a query can have a huge impact on performance.
In SQL Server (performance issues are very database specific), a correlated subquery will usually have poor performance compared to doing the same thing in a join to a derived table.
Other things in a query that can affect performance include using SARGable1 where clauses instead of non-SARGable ones, selecting only the fields you need and never using select * (especially not when doing a join, as at least one field is repeated), using a set-based query instead of a cursor, avoiding a wildcard as the first character in a LIKE clause, and on and on. There are very large books that devote chapters to more efficient ways to write queries.
1 "SARGable", for those that don't know, are stage 1 predicates in DB2 parlance (and possibly other DBMS'). Stage 1 predicates are more efficient since they're parts of indexes and DB2 uses those first.
What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they affect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in SQL Server Query Analyzer (make sure you turn on "show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran SQL Server Profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that the Auto Statistics is set to on. SQL needs this to help decide the optimal execution. See Mike Gunderloy's great post for more info. Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented. Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
Use a WITH statement to handle query filtering.
Limit each subquery to the minimum number of rows possible,
then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master,
taxReturns
WHERE master.ssn = taxReturns.ssn
Subqueries within a WITH statement may end up being the same as inline views or automatically generated temp tables. I find in the work I do (retail data) that about 70-80% of the time there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL query analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. If you frequently use a column as a way to order or limit your dataset, an index can make a big difference. I saw in a recent article that select distinct can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balance things accordingly.
Indexes
Statistics
On the Microsoft stack, the Database Engine Tuning Advisor
Some other points (mine are based on SQL Server; since each DB backend has its own implementation, they may or may not hold true for all databases):
Avoid correlated subqueries in the SELECT part of a statement; they are essentially cursors.
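A hedged sketch of what that means in practice (customers and orders are made-up tables):
-- Correlated subquery in the SELECT list: runs once per customer row
SELECT c.id,
       (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count
FROM   customers c;

-- Usually cheaper: aggregate once in a derived table, then join to it
SELECT c.id, COALESCE(oc.order_count, 0) AS order_count
FROM   customers c
LEFT JOIN (SELECT customer_id, COUNT(*) AS order_count
           FROM orders
           GROUP BY customer_id) oc ON oc.customer_id = c.id;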
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR statements (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION if (and only if) the two statements are mutually exclusive and return the same results either way.
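For example (made-up orders table), the OR-to-UNION rewrite might look like this; these particular branches can overlap, so plain UNION rather than UNION ALL is needed to match the OR:
-- OR across two different columns often defeats index use
SELECT * FROM orders WHERE customer_id = 42 OR salesperson_id = 7;

-- Each branch can use its own index; UNION removes the overlap
SELECT * FROM orders WHERE customer_id = 42
UNION
SELECT * FROM orders WHERE salesperson_id = 7;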
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID IS NULL.
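A hedged sketch of those three patterns for "customers with no orders" (made-up tables again):
-- Usually the fastest of the three
SELECT c.id FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Often slower, and behaves badly if orders.customer_id can be NULL
SELECT c.id FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);

-- The left-join-with-NULL-check variant
SELECT c.id FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;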
In an UPDATE query add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be done when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 orders in a report. Pre-calculations should be done in triggers so that they are always up to date as the underlying data changes. And it doesn't have to be just numbers either; we have a calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp tables tend to be faster for large data sets and table variables faster for small ones. In addition, you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious but you would not believe how often I end up fixing this. Do not join to tables unless you are using them to filter the records or actually referencing one of their fields in the SELECT part of the statement. Unnecessary joins can be very expensive.
It is a very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once, and creating millions of records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting, not just the user interface for entering data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables; it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (Able to use indexes).