Date and String handling in Ignite SQL queries - ignite

I am about to try Ignite to compare it with Hazelcast, performance-wise, as a data grid. While researching the features I need, I found no mention of how dates (java.util.Date) and Strings are compared in SQL queries (less than, greater than, etc.). I am guessing it relies on Comparable (I would), but I would like to know the exact answer.
Another related question (probably better asked separately) is about indexes. Hazelcast has indexes and the so-called Portable serialization format which essentially stores a subset of fields separately from the serialized object to avoid deserialization. How can I guarantee to avoid it in Ignite SQL queries? All fields indexed? What about compound indexes, etc. I am wondering how complex queries work internally, as there are no compound indexes according to the documentation.

Date and String are actually treated as the SQL types TIMESTAMP and VARCHAR respectively, so it is not about Comparable. For any non-standard types, however, Ignite SQL will rely on Comparable if they participate in indexes or queries.
According to the docs, compound indexes are supported and are called Group indexes. And complex queries work pretty well :)
Currently Ignite does not store indexed values separately; instead it keeps the deserialized Java object and uses reflection to access properties. In the near future (hopefully in weeks) Ignite is going to release a feature that allows indexing serialized objects and accessing fields without keeping Java objects (or even having the indexed Java classes on the nodes).

Related

GridGain SQL queries without data model + other GridGain SQL questions

I have been checking out GridGain for a while and came across some features regarding GridGain's SQL capabilities, which led me to some questions (for which I couldn't find a firm answer in the docs).
From the examples, there is always an explicit data model. I am using Java, so that means there's always a class definition of the model to be queried. The examples in the API docs: http://atlassian.gridgain.com/wiki/display/GG60/SQL,+Scan,+And+Full+Text+Queries begin by showing how properties must be annotated, which suggests to me an explicit model is always required. Properties of the model can be annotated for SQL querying with annotations such as "@GridCacheQuerySqlField". Is an explicit data model always required? Ideally, I would like a way to not have to explicitly state the model, as my use case changes often and has complex relations.
What subset of SQL queries can be performed through GridGain's SQL API? My use cases often require very complex queries. For example, the docs (same link as above) state that "Continuous Queries cannot be used with SQL. Only predicate-based queries are supported." Where can I find which subset of SQL is supported, and under what conditions? (In the example given, continuous queries are available only when the queries are predicate-based.)
Thanks in advance for the insight
GridGain supports a non-fixed data model in the Enterprise version, namely portable objects. Portable objects let you represent the data model as a map-like nested structure, which allows dynamic structure changes, indexing and portability across different languages (Java, C#, .NET). You can take a look at portable objects in the GridGain Enterprise edition examples and read the documentation here: http://entdoc.gridgain.org/latest/Portable+Cross+Platform+Objects
In the open-source version an explicit class definition is always required.
The SQL limitations are described in GridCacheQuery javadoc: http://gridgain.com/sdk/6.5.0/javadoc/org/gridgain/grid/cache/query/GridCacheQuery.html
Group by and sort by statements are applied separately on each node, so result set will likely be incorrectly grouped or sorted after results from multiple remote nodes are grouped together.
Aggregation functions like sum, max, avg, etc. are also applied on each node. Therefore you will get several results containing aggregated values, one for each node.
Joins will work correctly only if joined objects are stored in collocated mode or at least one side of the join is stored in REPLICATED cache.
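To illustrate the aggregation limitation above, here is a minimal sketch in plain Python (not GridGain code) of what a client would have to do with the per-node results. The node names and values are made up; the point is only that each node returns its own partial aggregate and that the partials must be merged, and that averaging the per-node averages gives the wrong answer.

```python
# Simulated data held by two nodes; names and values are hypothetical.
node_rows = {
    "node-a": [10, 20, 30],
    "node-b": [40, 50],
}

def per_node_aggregates(rows_by_node):
    """What each node computes on its own: one (sum, count, max) per node."""
    return {node: (sum(rows), len(rows), max(rows))
            for node, rows in rows_by_node.items()}

def merge_aggregates(partials):
    """Client-side reduce step that combines the per-node partial results."""
    total = sum(s for s, _, _ in partials.values())
    count = sum(c for _, c, _ in partials.values())
    overall_max = max(m for _, _, m in partials.values())
    return {"sum": total, "count": count, "avg": total / count, "max": overall_max}

partials = per_node_aggregates(node_rows)
merged = merge_aggregates(partials)

# Naively averaging the per-node averages weights small nodes too heavily.
naive_avg = sum(s / c for s, c, _ in partials.values()) / len(partials)
```

Carrying sum and count separately is what makes the average mergeable; the same idea applies to any aggregate that is not simply a sum or a max.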

Can we use user-defined (non-scalar) SQL-types for ORM?

I'm wondering if it's possible to do ORM using SQL:2003 object types (aka STRUCTs, aka non-scalar types).
The idea behind that is to avoid the "n+1 selects" problem by retrieving complete objects directly from the database. Sort of an eager "FetchMode.JOIN", but in the database.
Are there any ORM frameworks for Java or .Net which support SQL object types at all?
At least JDBC has the getObject method, and I've also found an example of user-defined types in ADO.Net.
As an Oracle developer, I may be biased towards database-centric approaches, and I haven't used an ORM before. But Oracle features Object Views, which let you compose objects from several relational tables. I bet these could be orders of magnitude faster than pulling all those single records out of the database, let alone issuing n+1 selects.
I am the developer of jOOQ, and I am striving to make jOOQ exactly what you need. jOOQ currently supports any of these Oracle features:
All types of SQL constructs, including nested selects, aliasing, union operations, etc
Stored procedures and functions
Packages
VARRAY types (mapped to Java arrays)
UDT types (mapped to Java objects)
combinations thereof
More support will be added in the near future, for advanced Oracle concepts such as
TABLE types
CURSOR, REF CURSOR types
Other collection types
Object views are currently not supported in the way you described, but I'll clearly put them on the roadmap.
See more on http://www.jooq.org

Which Database can I Safely use a GUID as Primary Key besides SQL Server?

The reason I want to use a Guid is that, in the event that I have to split the database in two, I won't have primary keys that overlap across the two databases. So if I use a Guid there won't be any overlap. I also want to use the GUID in the URL, so the Guid will need to be indexed.
I will be using ASP.NET C# as my web server.
Postgres has a UUID type. MySQL has a UUID function. Oracle has a SYS_GUID function.
As others have said, you can use GUIDs/UUIDs in pretty much any modern DB. The algorithm for generating a GUID is pretty straightforward and you can be reasonably sure that you won't get dupes; however, there are some considerations.
+) Although GUIDs are generally representations of 128-bit values, the actual format used differs from implementation to implementation - you may want to consider normalizing them by removing non-significant characters (usually dashes or spaces).
+) To absolutely ensure uniqueness you can also append a value to the guid. For example, if you're worried about MS and Oracle guids colliding, add "MS" to the former and "Or" to the latter - now even if the guids themselves do collide, the keys won't.
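As a minimal sketch of the normalization idea in the first point, Python's standard uuid module can parse the common textual forms (braces, dashes, upper or lower case) and emit one canonical form. The helper name normalize_guid and the sample value are just for illustration.

```python
import uuid

def normalize_guid(text):
    """Return a canonical 32-character lowercase hex form of a GUID string."""
    # uuid.UUID accepts optional braces, dashes and a 'urn:uuid:' prefix.
    return uuid.UUID(text.strip()).hex

# The same 128-bit value written in two different styles.
a = normalize_guid("{6F9619FF-8B86-D011-B42D-00C04FC964FF}")
b = normalize_guid("6f9619ff8b86d011b42d00c04fc964ff")
```

Both calls yield the same string, so values coming from different systems can be compared or used as keys without worrying about formatting differences.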
As others have mentioned, however, there is a potentially severe price to pay here: your keys will be large (128 bits) and won't index very well (although this is somewhat dependent on the implementation).
The technique works very well for small databases (especially those where the entire dataset can fit in memory), but as DBs grow you'll definitely have to accept a performance trade-off.
One thing you might consider is a hybrid approach. Without more information it's hard to really know what you're trying to do so these might not help:
1) Remember that primary keys don't have to be a single column - you can have a simple numeric key to identify your rows and another column, containing a single value, that identifies the database that hosts the data or created the key. Creating the primary key as a composite of both columns lets the index work with simpler values and should be significantly faster.
2) You can "fake it" by constructing the key as a concatenated field (as in the above idea to append a DB identifier to the key). So your key would be a simple number followed by some DB identifier (perhaps a guid for each DB).
Indexing such a value (since the values would still be sequential) should be much faster.
In both cases you'll have some manual work to do if you ever do split the DB(s) - you'll have to update some keys with a new DB ID, but this would be a one-time, infrequent event. In exchange you can tune your DB much better.
There are definitely other ways to ensure data integrity across multiple databases. Many enterprise DBMSs have tools built-in for clustering data across multiple servers or databases, some have special tools or design patterns that make it easier, etc.
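The first hybrid idea (a composite key made of a database identifier plus a simple local number) might look like the following sketch. It uses SQLite through Python's sqlite3 module purely for illustration; the table and column names are invented, and a real deployment would use whichever DBMS you settle on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        db_id    INTEGER NOT NULL,  -- identifies the database that created the row
        local_id INTEGER NOT NULL,  -- simple per-database counter
        payload  TEXT,
        PRIMARY KEY (db_id, local_id)
    )
    """
)

# The same local_id can exist in both databases without colliding.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, "from db 1"), (2, 1, "from db 2")],
)
rows = conn.execute(
    "SELECT db_id, local_id FROM orders ORDER BY db_id, local_id"
).fetchall()

# A true duplicate of the full composite key is still rejected.
try:
    conn.execute("INSERT INTO orders VALUES (1, 1, 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```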
In short I would say that guids are nice and simple and do what you want, but that you should only consider them if either a) the dataset is small or b) the DBMS has specific features to optimize their use as keys (for example sequential guids). If the datasets are going to be very large or if you're trying to limit DBMS-specific dependencies I would play around more with optimizing a "key + identifier" strategy.
Most any RDBMS you will use can take any number and type of columns as a PK. So, if you're storing the GUID as a CHAR(n) for some length n, you should be fine. Now, I'm not sure if this is advisable, as I'm guessing indexing on CHARs is not as efficient as on integers.
Hope that helps.
I suppose you could store a GUID as an int128 as well.
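Postgres supports a native GUID datatype (it's called UUID but it's the same thing). MySQL has a UUID() function but no dedicated UUID column type, so there the value is typically stored as CHAR(36) or BINARY(16).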
Unless I have completely lost my memory, a properly designed 3rd+ normal form database schema does not rely on unique ints, or by extension GUIDs or UUIDs for primary keys. Nor does it use intermediate lookup tables of ints/GUIDS/UUIDS to relate the tables containing the data.
You should grind your schema until it expresses the relations amongst tables of data in terms of the data in the tables, not auto-generated identifiers that have no intrinsic relationship to the data.
I freely grant that you may just possibly be doing something that really really requires GUIDs (or auto-increment integers) for primary keys. But I seriously doubt that is the case - it almost never is.
You can implement your own membership provider based on whatever database schema you choose to design. It's nowhere near as tricky as it may look at first.
Google "roll your own membership provider" for plenty of pointers.
In my theoretical little world, you'd be able to do this with SQLite. You'd generate the Guid from .Net and write it to the SQLite database as a string. You could also index that field.
You do lose some of the index benefits because it'd be stored as a string, but it should be fully backwards compatible so that you could import/export to/from SQL Server.
From looking through the comments, it looks like you are trying to use a database other than MS SQL with the ASP.NET membership provider. As others have mentioned, you could roll your own provider to use a different DB; however, a quick Google search turned up a few ready-made options:
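A rough sketch of the SQLite approach described above, with Python's uuid.uuid4() standing in for a Guid generated in .Net (an assumption for the sake of a runnable example; the table and column names are also invented). The GUID is stored as text, and declaring it as the primary key gives it an index.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# TEXT PRIMARY KEY makes SQLite create an automatic index on the column.
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")

user_id = str(uuid.uuid4())  # e.g. generated by the application, not the DB
conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, "alice"))

found = conn.execute(
    "SELECT name FROM users WHERE id = ?", (user_id,)
).fetchone()

# Ask SQLite how it would run the lookup; the last column is the plan text.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE id = ?", (user_id,)
).fetchall()
uses_index = any("INDEX" in row[-1] for row in plan)
```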
MySQL Provider
MySQL Provider 2
SqlLite Provider
Hope these help
If you are using other MS technologies already you should consider Sql Server Express.
http://www.microsoft.com/express/sql/default.aspx
It is a real implementation of MS Sql Server and it is free. It does have significant limitations as you might imagine, but if your product can fit inside those you get the support, developer community and stability of Sql Server and a clear upgrade path if you need to grow.

When is sqlite's manifest typing useful?

sqlite uses something that the authors call "Manifest Typing", which basically means that sqlite is dynamically typed: You can store a varchar value in a "int" column if you want to.
This is an interesting design decision, but whenever I've used sqlite, I've used it like a standard RDBMS and treated the types as if they were static. Indeed, I've never even wished for dynamically typed columns when designing databases in other systems.
So, when is this feature useful? Has anybody found a good use for it in practice that could not have been done just as easily with statically typed columns?
It really just makes types easier to use. You don't need to worry about how big a field needs to be at the database level anymore, or how many digits your integers can have. More or less it is a 'why not?' thing.
On the other side, static typing in SQL Server allows the system to search and index better, in some cases much better, but for half of all applications I doubt that those database performance improvements would matter, or their performance is 'poor' for other reasons (temp tables being created on every select, exponential selects, etc).
I use SQLite all the time for my .NET projects as a client cache because it is just too easy to use. Now if they can only get it to use GUIDs the same as SQL Server I would be a happy camper.
Dynamic typing is useful for storing things like configuration settings. Take the Windows Registry, for example. Each key is a lot like having an SQLite table of the form:
CREATE TABLE Settings (Name TEXT PRIMARY KEY, Value);
where Value can be NULL (REG_NONE) or an INTEGER (REG_DWORD/REG_QWORD), TEXT (REG_SZ), or BLOB (REG_BINARY).
Also, I'll have to agree with the Jasons about the usefulness of not enforcing a maximum size for strings, because much of the time those limits are purely arbitrary, and you can count on someday finding a 32-byte string that needs to be stored in your VARCHAR(30).
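The Settings table above can be tried out with a short sketch using Python's sqlite3 module. The row names are made up to mirror the registry types mentioned; typeof() reports the storage class SQLite actually used for each value. The second table (T, also invented) shows what happens when text goes into a column declared INT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Settings (Name TEXT PRIMARY KEY, Value)")
conn.executemany(
    "INSERT INTO Settings VALUES (?, ?)",
    [
        ("none", None),            # like REG_NONE
        ("dword", 42),             # like REG_DWORD
        ("sz", "hello"),           # like REG_SZ
        ("binary", b"\x00\x01"),   # like REG_BINARY
    ],
)
types = dict(conn.execute("SELECT Name, typeof(Value) FROM Settings"))

# A column declared INT still accepts text; numeric-looking text is converted.
conn.execute("CREATE TABLE T (N INT)")
conn.executemany("INSERT INTO T VALUES (?)", [("abc",), ("123",)])
int_column_types = [row[0] for row in conn.execute("SELECT typeof(N) FROM T ORDER BY rowid")]
```

The untyped Value column keeps every value in the storage class it arrived in, while the INT column converts text that looks like a number and stores anything else as text.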

What should one map strings to in a database ORM?

Strings are unbounded, but it seems every normal relational database requires that a column declare its maximum length. This seems to be a rather significant discrepancy and I'm curious how typical ORMs handle this.
Using a 'text' column type would theoretically give you much more string-like storage, but as I understand it text columns are not queryable, or at least not efficiently (non-indexed?).
I'm thinking of using something like NHibernate perhaps, but my ORM needs are relatively simple, so if I can just write it myself it might save some bloat.
SqlServer, for instance, stores only the actual size of the data. For me, it is sufficient to define sizes for strings that are large enough that the user never notices the limits.
Examples:
name of a product, person etc: 500
Paths, Urls etc: 1000
Comments, free text: 2000 or even more
NHibernate does not do anything with the size at runtime. You need to use some kind of validator, or let the database either truncate the value or throw an exception.
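A validator of the kind mentioned here can be very small. The sketch below is plain Python rather than anything NHibernate-specific; the field names and the validate_lengths helper are hypothetical, and the limits are simply the example sizes listed above.

```python
# Maximum lengths per field, taken from the example sizes above.
MAX_LENGTHS = {"name": 500, "url": 1000, "comment": 2000}

def validate_lengths(record, limits=MAX_LENGTHS):
    """Return a list of error messages for string fields that exceed their limit."""
    errors = []
    for field, value in record.items():
        limit = limits.get(field)
        if limit is not None and len(value) > limit:
            errors.append(f"{field}: length {len(value)} exceeds {limit}")
    return errors

ok_errors = validate_lengths({"name": "Widget", "url": "http://example.com"})
bad_errors = validate_lengths({"name": "x" * 501, "comment": "fine"})
```

Running a check like this before saving lets the application report a readable error instead of relying on the database to truncate or reject the value.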
Quote: "my ORM needs are relatively simple". It's hard to say if NHibernate is overkill or not. Data access isn't generally that simple.
As a simple guide off the top of my head, take NHibernate if:
You have a fine-grained or complex domain model. You need to map inheritance.
You want your domain model somewhat independent from the database model.
You need some lazy loading features
You want to be database independent, eg. run it on SqlServer or Oracle
If you think that a class per table is what you need, you don't actually need an ORM.
To my knowledge, they don't handle it. Either you let the ORM define the schema, in which case:
The ORM will decide the size, either defaulted or defined by your config.
Or the schema is not defined by the ORM, in which case it just has to obey the rules: if you insert strings that are too large, you'll get errors from the DB.
I would stay on varcharish types, e.g. varchar2 for oracle or nvarchar for sql server, unless you're dealing with clobs.