Can I create Storage Attached Indexes on a map's values in Cassandra 4.x?

Is it possible to create a Storage Attached Index on a map's values in Cassandra? In my case, I have a column named coordinates which is of data type map<text,float> and contains the latitude and longitude of the sensors' locations. I would therefore like to create a SAI on the map's values so as to be able to query the table based on those values.
Is this an anti-pattern? Would it be better to have two separate columns for latitude and longitude?

No, creating a SAI index is not supported on a map (or other "complex" types) yet.
Is this an anti-pattern?
Unfortunately, yes. Even if it worked, filtering in the WHERE clause by coordinates alone would not be a partition-based query, and would be slow. Not sure how the key is/was modeled here, but some work would be needed to ensure that a query could be served by a single node, and SAI/SASI-based index queries can't guarantee that.
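As a hedged sketch of the separate-columns alternative (the sensor_id partition key is an assumption; the question doesn't describe the key), storing latitude and longitude as plain columns lets the location be read back with a partition-based query:

CREATE TABLE sensors (
    sensor_id text PRIMARY KEY,  -- hypothetical partition key
    lat float,
    lng float
);

-- Served by a single node, because it filters on the partition key:
SELECT lat, lng FROM sensors WHERE sensor_id = 'sensor-42';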

Related

PostgreSQL - What should I put inside a JSON column

The data I want to store has these characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs its own fields).
Here's how it would look in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Below are the ideas I already had:
Keep everything in one regular table (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or one that I have not thought of) better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition, and you compare with =, a single GIN index on a jsonb column might be better than having many B-tree indexes, because the maintenance cost of many B-tree indexes would be higher.
The definitive answer depends on the SQL statements that you plan to run on that table.
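As a minimal sketch of the jsonb variant (table and column names invented for illustration), a single GIN index can serve equality-style containment queries on any key:

CREATE TABLE items (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    category text NOT NULL,  -- common column
    attrs    jsonb NOT NULL  -- category-specific fields
);

CREATE INDEX ix_items_attrs ON items USING gin (attrs);

-- The containment operator @> can use the GIN index:
SELECT id FROM items WHERE attrs @> '{"color": "red"}';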
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Ease of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility. It is easy to add new columns. But you don't need that. JSON has a cost in data size and data type flexibility (notably JSON doesn't support date/times explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
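A minimal sketch of the multiple-table layout from the previous paragraph (all names invented): common columns in one table, category-specific columns in a 1-1 table, with a view hiding the split:

CREATE TABLE products (
    id       bigint PRIMARY KEY,
    category text NOT NULL
);

CREATE TABLE product_books (
    id     bigint PRIMARY KEY REFERENCES products (id),  -- 1-1 on the primary key
    author text,
    isbn   text
);

-- The view hides whether the data lives in one table or several:
CREATE VIEW products_full AS
SELECT p.id, p.category, b.author, b.isbn
FROM products p
LEFT JOIN product_books b ON b.id = p.id;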
It depends on how much data you want to store, but as long as the set of fields is finite it shouldn't make a big difference whether it contains a lot of NULLs or not.

Is there any workaround for indexing a list in Apache Ignite and using it in a WHERE clause?

I am not sure how to index a List/array in Apache Ignite. I want to use my list/array in a WHERE clause. I can write a custom function, but it would scan the whole data set; I am looking for indexing of the list/array.
Please help me.
A common way to store lists in SQL database is to create a table of pairs, representing one-to-many relation.
Columns of this table of pairs can be indexed and used in where clauses after joining with the initial table.
To make the joins fast, you will probably need to collocate the records of these two tables by affinity.
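A hypothetical sketch in Ignite SQL (all names invented): the list becomes a child table of (owner_id, value) pairs, collocated with its owner via the affinity key so the join stays node-local:

CREATE TABLE owners (
    id   BIGINT PRIMARY KEY,
    name VARCHAR
);

CREATE TABLE owner_values (
    owner_id BIGINT,
    val      VARCHAR,
    PRIMARY KEY (owner_id, val)
) WITH "affinity_key=owner_id";

CREATE INDEX ix_owner_values_val ON owner_values (val);

-- The index on val serves the WHERE clause; the affinity key keeps the join local:
SELECT o.id, o.name
FROM owners o
JOIN owner_values v ON v.owner_id = o.id
WHERE v.val = 'foo';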

Wikipedia Graph Database Insertion

I am trying to create a database from DBpedia RDF triples. I have a table Categories which contains all the categories in Wikipedia. To store categorizations I have created a table with child and parent fields, both foreign keys to the Categories table. To load categories from N-Triples I am using the following SQL query:
INSERT INTO CatToCat (`child`, `parent`)
VALUES ((SELECT id FROM Categories WHERE BINARY identifier='Bar'),
        (SELECT id FROM Categories WHERE BINARY identifier='Bar'));
But the insertion is very slow; inserting 2.5 million relationships would take a very long time. Is there a better way to optimize the query or the schema?
You could try a graph database like Neo4j with RDF layers on top; there is, for instance, the Tinkerpop SAIL implementation, see https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation
That should work a bit better than an RDBMS, at least for Neo4j.
/peter
Consider loading SELECT id, identifier FROM Categories into a hash table (or trie) on the client side, and using that to fill CatToCat. On a database the size of Wikipedia, I'd expect to see a huge performance difference between constant-time hash lookups (or trie lookups, which are constant with respect to the number of different data items) and log n B-tree lookups. (Of course, you need to have the memory available.)
Consider using a single PreparedStatement, with parameter binding so that MySQL doesn't have to re-parse and re-optimize the query for every insertion.
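A hedged sketch of that idea in plain MySQL syntax (the ids 42 and 17 are invented placeholders; the same effect is usually achieved through the driver's parameter binding, e.g. in JDBC):

PREPARE ins_cat FROM
    'INSERT INTO CatToCat (child, parent) VALUES (?, ?)';

SET @child = 42, @parent = 17;  -- placeholder ids resolved on the client side
EXECUTE ins_cat USING @child, @parent;
-- ...repeat EXECUTE with new values; the statement is parsed and optimized only once.

DEALLOCATE PREPARE ins_cat;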
You'll have to benchmark these to figure out how much of an improvement they actually are.
I solved the problem. It was an indexing issue: I made identifier in Categories unique and binary. I guess that sped up the two SELECTs.
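For reference, a hypothetical form of that fix (the actual column type isn't given in the question):

ALTER TABLE Categories
    MODIFY identifier VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin,  -- binary collation
    ADD UNIQUE INDEX ux_categories_identifier (identifier);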

Non-trivial geolocation query db caching

I first have to say that I really am a rookie in caching, so please do elaborate on any explanation and bear with me if my question is stupid.
I have a server with pretty limited resources, so I'm really interested in caching db-queries as effectively as I can. My issue is this:
I have a MySQL DB with a table for geolocations. There are two columns, lat and lng; I only indexed lat, since a query will always have both lat and lng, and to my understanding only one index can be used effectively per query (?).
The queries vary constantly in coordinates, like:
select lat, lng
from geolocations
where lat BETWEEN 123123123 AND 312412312 AND lng BETWEEN 235124231 AND 34123124
where the long numbers that are the boundaries of the BETWEEN query are constantly changing. So IS there a way to cache this the smart way, so that the cache doesn't have to be an exact query match, but the values of previous BETWEEN queries can be held against a new one to save some DB resources?
I hope you get my question - if not please ask.
Thank you so much
Update 24/01/2011
Now that I've gotten some responses, I want to know what the most efficient way of querying would be:
1. Would the BETWEEN query with int values execute faster, or
2. would the radius calculation with POINT values execute faster?
If 1., what would the optimal index look like?
If your table is MyISAM you can use Point datatype (see this answer for more details)
If you are not willing or able to use spatial indexes, you should create two separate indexes:
CREATE INDEX ix_mytable_lat_lon ON mytable (lat, lon);
CREATE INDEX ix_mytable_lon_lat ON mytable (lon, lat);
In this case, MySQL can use an index merge (intersection) over these indexes, which is sometimes faster than mere filtering with a single index.
Even if it does not, it can pick the more selective of the two indexes.
As for caching: all pages read from the indexes are cached and reside in memory until they are overwritten with hotter data (if the whole database doesn't fit in the cache).
This saves MySQL from having to read the data from disk.
MySQL is also able to cache whole result sets in memory; however, this requires the query to be repeated verbatim, with all parameters exactly the same.
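A hedged sketch of the spatial variant mentioned above (all names invented; MyISAM was required for spatial indexes in MySQL of that era): store the coordinates as a POINT with a SPATIAL index and query with a bounding rectangle:

CREATE TABLE locations (
    id    INT PRIMARY KEY,
    coord POINT NOT NULL,
    SPATIAL INDEX (coord)
) ENGINE=MyISAM;

-- MBRContains can use the spatial index for a bounding-box lookup
-- (in modern MySQL the function is ST_GeomFromText):
SELECT id
FROM locations
WHERE MBRContains(
    GeomFromText('POLYGON((55.5 12.4, 55.8 12.4, 55.8 12.7, 55.5 12.7, 55.5 12.4))'),
    coord);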
I think to do significantly better you'll need to characterize your data better. If you've got data that's uniformly distributed across longitude and latitude, with no correlation, and if your queries are similarly distributed and independent - you're stuck. But if your data or your queries cluster in interesting ways, you may find that you can introduce new columns that make at least some queries quicker. If most queries happen within some hard range, maybe you can set that data aside - add a flag, link it to some other table, even put the frequently-requested data into its own table. Can you tell us any more about the data?

When are computed columns appropriate?

I'm considering designing a table with a computed column in Microsoft SQL Server 2008. It would be a simple calculation like (ISNULL(colA,(0)) + ISNULL(colB,(0))) - like a total. Our application uses Entity Framework 4.
I'm not completely familiar with computed columns so I'm curious what others have to say about when they are appropriate to be used as opposed to other mechanisms which achieve the same result, such as views, or a computed Entity column.
Are there any reasons why I wouldn't want to use a computed column in a table?
If I do use a computed column, should it be persisted or not? I've read about different performance results using persisted, not persisted, with indexed and non indexed computed columns here. Given that my computation seems simple, I'm inclined to say that it shouldn't be persisted.
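For concreteness, a minimal sketch of the column described above (the table and surrounding columns are invented; SQL Server syntax):

CREATE TABLE dbo.Orders (
    OrderID INT IDENTITY PRIMARY KEY,
    colA    DECIMAL(10,2) NULL,
    colB    DECIMAL(10,2) NULL,
    -- Non-persisted: recomputed on read. Append PERSISTED to store it physically.
    Total AS (ISNULL(colA, 0) + ISNULL(colB, 0))
);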
In my experience, they're most useful/appropriate when they can be used in other places like an index or a check constraint, which sometimes requires that the column be persisted (physically stored in the table). For further details, see Computed Columns and Creating Indexes on Computed Columns.
If your computed column is not persisted, it will be calculated every time you access it in e.g. a SELECT. If the data it's based on changes frequently, that might be okay.
If the data doesn't change frequently, e.g. if you have a computed column to turn your numeric OrderID INT into a human-readable ORD-0001234 or something like that, then definitely make your computed column persisted - in that case, the value will be computed and physically stored on disk, and any subsequent access to it is like reading any other column on your table - no re-computation over and over again.
We've also come to use (and highly appreciate!) computed columns to extract certain pieces of information from XML columns and surface them on the table as separate (persisted) columns. That makes querying for those items much more efficient than constantly having to poke into the XML with XQuery to retrieve the information. For this use case, I think persisted computed columns are a great way to speed up your queries!
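A hedged sketch of the persisted variant from the ORD-0001234 example above (reusing the hypothetical dbo.Orders table sketched in the question; the format is invented for illustration). The expression is deterministic, so SQL Server allows it to be persisted and indexed:

ALTER TABLE dbo.Orders ADD
    OrderCode AS ('ORD-' + RIGHT('0000000' + CAST(OrderID AS VARCHAR(7)), 7)) PERSISTED;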
Let's say you have a computed column called ProspectRanking that is the result of the evaluation of the values in several columns: ReadingLevel, AnnualIncome, Gender, OwnsBoat, HasPurchasedPremiumGasolineRecently.
Let's also say that many decentralized departments in your large mega-corporation use this data, and they all have their own programmers on staff, but you want the ProspectRanking algorithms to be managed centrally by IT at corporate headquarters, who maintain close communication with the VP of Marketing. Let's also say that the algorithm is frequently tweaked to reflect some changing conditions, like the interest rate or the rate of inflation.
You'd want the computation to be part of the back-end database engine and not in the client consumers of the data, if managing the front-end clients would be like herding cats.
If you can avoid herding cats, do so.
Make Sure You Are Querying Only Columns You Need
I have found using computed columns to be very useful, even if not persisted, especially in an MVVM model where you are only getting the columns you need for that specific view. So long as you are not putting poorly performing logic in the computed-column code, you should be fine. The bottom line is that those computed (non-persisted) columns would have to be computed anyway if you are using that data.
When it Comes to Performance
For performance, you narrow your query to the rows and the computed columns you need. If you were putting an index on the computed column (I checked whether that is allowed here, and it is not), I would be cautious, because the execution engine might decide to use that index and hurt performance by computing those columns. Most of the time you are just getting a name or description from a join table, so I think this is fine.
Don't Brute Force It
The only time it wouldn't make sense to use a lot of computed columns is if you are using a single view-model class that captures all the data in all columns including those computed. In this case, your performance is going to degrade based on the number of computed columns and number of rows in your database that you are selecting from.
Computed Columns Work Great with an ORM
An object-relational mapper such as Entity Framework allows you to query a subset of the columns in your query. This works especially well using LINQ to Entities. By using computed columns you don't have to clutter your ORM class with mapped views for each of the model types.
var data = from e in db.Employees
           select new NarrowEmployeeView { Id = e.Id, Name = e.Name };
Only Id and Name are queried.
var data = from e in db.Employees
           select new WiderEmployeeView { Id = e.Id, Name = e.Name, DepartmentName = e.DepartmentName };
Assuming DepartmentName is a computed column, the computation is executed only for this latter query.
Performance Profiler
If you use a performance profiler and filter on the SQL queries, you can see that the computed columns are in fact ignored when they are not in the SELECT statement.
Computed columns can be appropriate if you plan to query by that information.
For instance, if you have a dataset that you are going to present in the UI, having a computed column will allow you to page the view while still allowing sorting and filtering on the computed column. If that computed value exists only in code, it will be much more difficult to reasonably sort or filter the dataset for display based on that value.
A computed column is a business rule, and it's more appropriate to implement it on the client and not in the storage layer. A database is for storing/retrieving data, not for business-rule processing. The fact that it can do something doesn't mean you should do it that way. You too are free to jump off the Eiffel Tower, but it would be a bad decision :)