Can RethinkDB efficiently handle lots of sparseness? - schema

usecase: as part of a data-infra I'm contemplating storing many* entities of various schema.org types in the same RethinkDB table.
Given the inherent type-hierarchy of schema.org, some properties are shared by all types, some properties are only available on 1 type, and everything in between.
For example, Person, Organization, and LocalBusiness share properties like name, description, postalAddress, etc., while some are only used by Person, such as firstName.
Mapping this to a RethinkDB table will result in many properties (fields in Rethink-speak) being empty for many entities. As a guess I'd say a field will be empty about 90% of the time on average. About ~150 fields exist.
Would RethinkDB be able to efficiently handle such a sparse layout? This is a broad question I realize, but I'm looking for specifics like:
If I were to build indexes on some (not all) of these fields, would empty values consume space in these indexes?
What would the performance penalty (CPU and memory) be if these fields were all allowed to be multivalued, i.e. arrays?
*) a couple of million to start with

RethinkDB works well with sparse data. Indexes are currently always sparse indexes, so your index won't be cluttered up by documents that don't have the indexed field.
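To make the idea of a sparse index concrete, here is a minimal pure-Python sketch. It is not RethinkDB's implementation or API; the function name `build_sparse_index` and the sample documents are made up for illustration, and treating `None` like a missing field is an assumption of the sketch.

```python
from collections import defaultdict


def build_sparse_index(docs, field):
    """Map each value of `field` to the ids of the documents that carry it.

    Documents without the field are skipped, so they take no space in the
    index. List values are indexed once per element (a multi-valued field).
    """
    index = defaultdict(set)
    for doc in docs:
        if doc.get(field) is None:
            continue  # sparse: no field (or None, by assumption) means no entry
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].add(doc["id"])
    return dict(index)


# Hypothetical schema.org-like entities sharing one table
docs = [
    {"id": 1, "type": "Person", "name": "Ada", "firstName": "Ada"},
    {"id": 2, "type": "Organization", "name": "Acme"},
    {"id": 3, "type": "LocalBusiness", "name": "Cafe", "keywords": ["coffee", "wifi"]},
]

first_name_index = build_sparse_index(docs, "firstName")
keyword_index = build_sparse_index(docs, "keywords")
```

Only document 1 appears in the firstName index, and the array field contributes one entry per element, which is where the extra cost of multivalued fields would come from.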

Related

Postgresql using json objects to store data sets

I'm fairly new to using postgres and would like to hear some opinions before I take the time to set this up. Not that it matters what framework I'm using, but to better explain the scenario, I'm using django with many custom models (essentially post types) that require many fields which equates to many many columns in each db table.
I know postgres is well optimized for this type of scenario but in my case, I'm planning for this to be a large scale application and when it gets larger, I'll be making many calls to the database which could negatively affect the performance.
I honestly have no idea if this would be a viable option but could setting up my database to store the data in a json format in a single column be a good solution?
for example, let's say I have a model called 'laptop' and rather than have individual columns for specs like:
brand:
model:
screen_size:
ram_capacity:
processor_type:
processor_speed:
hard_drive_capacity:
graphics_card:
drive: ....
and end up having 30 to 40 columns would it be viable to have it set up like:
id(primary_key):
brand:
model:
specs(json object): {
brand:
model:
screen_size:
ram_capacity:
processor_type:
processor_speed:
hard_drive_capacity:
graphics_card:
drive:
}
I'm just using the laptop model as an example but obviously there are millions of potential use cases. Just wanted to know the pros and cons of this potential set up and any insight is appreciated.
Is it viable? Yes. Is it a good idea? Generally no.
Why not? JSON is powerful, but it has some shortcomings:
JSON repeats the headers, so the size of the data is much larger using JSON than without.
JSON's types don't have the same flexibility as Postgres types. For instance, it is missing dates.
It is harder to enforce constraints on JSON objects.
JSON is overly flexible, so there is little validation of the data going into a JSON object (other than the JSON format).
Note that JSONB overcomes some of these limitations.
So, if you have columns that you know are going to be part of each entity, storing them as separate columns is generally the best approach.
When do you use JSON? You use it when you have optional columns that would not be shared among most of the rows in the table. For instance, different manufacturers might have specs specific to that manufacturer and that would be suitable for a JSON object.
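A rough sketch of that split, assuming SQLite as a stand-in for Postgres (where the extra column would more likely be a jsonb column). The table layout, the column names `ram_capacity_gb` and `extra_specs`, and the sample rows are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE laptops (
        id INTEGER PRIMARY KEY,
        brand TEXT NOT NULL,              -- shared by every row: real column
        model TEXT NOT NULL,
        ram_capacity_gb INTEGER NOT NULL,
        extra_specs TEXT NOT NULL DEFAULT '{}'  -- optional specs as JSON text
    )
    """
)
conn.execute(
    "INSERT INTO laptops (brand, model, ram_capacity_gb, extra_specs) VALUES (?, ?, ?, ?)",
    ("Acme", "X1", 16, json.dumps({"fingerprint_reader": True})),
)
conn.execute(
    "INSERT INTO laptops (brand, model, ram_capacity_gb) VALUES (?, ?, ?)",
    ("Acme", "X2", 8),
)

# Filtering happens on a typed column; the JSON is only decoded afterwards.
rows = conn.execute(
    "SELECT model, extra_specs FROM laptops WHERE ram_capacity_gb >= 16"
).fetchall()
laptops_16gb = [(model, json.loads(extra)) for model, extra in rows]
```

The point of the design is that anything you filter or sort on stays a typed column, while the JSON holds only the manufacturer-specific leftovers.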
@Gordon Linoff explains many of the reasons why that is a bad idea. One question I would also ask is how you plan to look up items in the database if all the information is packed into JSON. You will find it difficult to search for items with a particular processor or a particular processor speed, for example.
You have not really thought about using the database structure very well. If there is a lot of overlap between your products, you not only waste space but will also suffer from insert and deletion anomalies. For example, you include processor type and processor speed in the record, whereas a properly normalised database would look more like:
CREATE TABLE manufacturers (
name ...
);
CREATE TABLE processors (
uid INTEGER GENERATED ALWAYS AS IDENTITY,
model TEXT NOT NULL,
manufacturer INTEGER REFERENCES manufacturers (uid),
speed INTEGER NOT NULL,
cache INTEGER ....
);
CREATE TABLE models (
name ...
processor INTEGER REFERENCES processors (uid)
...
);
i.e. it would be normalised. This reduces data loads considerably and also helps prevent update anomalies. For example, with your scheme there has to be a model A with an AMD xyz processor in order to record the details of that processor.
If you really do not want to do database normalisation, then you might be interested in the HSTORE type, which is a set of key/value pairs you can keep in a column. This is not as good as a properly normalised database, but it makes it possible to do queries on the keys and values.
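For anyone who wants to try the normalised layout, here is a small runnable sketch using SQLite through Python rather than Postgres. The columns left elided in the answer are filled with guesses (`uid` keys, `speed_mhz`, the sample rows), so treat those names as assumptions and not as the answerer's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    CREATE TABLE manufacturers (
        uid INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE processors (
        uid INTEGER PRIMARY KEY,
        model TEXT NOT NULL,
        manufacturer INTEGER NOT NULL REFERENCES manufacturers (uid),
        speed_mhz INTEGER NOT NULL
    );
    CREATE TABLE models (
        uid INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        processor INTEGER NOT NULL REFERENCES processors (uid)
    );
    INSERT INTO manufacturers (uid, name) VALUES (1, 'AMD');
    -- The processor is recorded once, even before any laptop uses it.
    INSERT INTO processors (uid, model, manufacturer, speed_mhz) VALUES (1, 'xyz', 1, 3200);
    INSERT INTO models (uid, name, processor) VALUES (1, 'Laptop A', 1), (2, 'Laptop B', 1);
    """
)

# Searching by processor speed is a plain join on typed columns.
rows = conn.execute(
    """
    SELECT models.name
    FROM models
    JOIN processors ON processors.uid = models.processor
    WHERE processors.speed_mhz >= 3000
    ORDER BY models.name
    """
).fetchall()
fast_models = [name for (name,) in rows]
```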

How can Datomic users cope without composite indexes?

In Datomic, how do you efficiently perform queries such as 'find all people living in Washington older than 50' (city and age may vary)? In relational databases and most NoSQL databases you use composite indexes for this purpose; Datomic, as far as I'm aware, does not support anything like this.
I built several, say, medium-sized web-apps and not a single one would perform quick enough, if not for composite indexes. How are Datomic users dealing with this? Or are they just playing with datasets small enough not to suffer from this? Am I missing something?
This problem and its solution are not identical in Datomic due to the structure of data (datoms) in Datomic. There are two performance characteristics/strategies that may add some shading to this:
(1) When you fetch data in Datomic, you fetch an entire leaf segment from the index tree (not an individual item) - with segments being composed of potentially many thousands of datoms. This is then cached automatically so that you don't have to reach out over the network to get more datoms.
If you're querying a single person, i.e. a single entity, for their age and where they live, it's very likely that the query's navigation of the EAVT or AEVT indexes has already cached everything you need. You've effectively cached the datom, how to navigate to it, and related datoms (by locality in the index).
(2) Partitions can provide a manual means to specify locality of reference. Partitions impact the entity ID's value (it's encoded in the high bits) and ensure that related entities are sorted near each other. So for an alternative implementation of the above problem, if you needed information from the city and person entities both, you could include them in the same partition.
I've written a library to handle this: https://github.com/arohner/datomic-compound-index
Update 2019-06-28: Since 0.9.5927 (Datomic On-Prem) / 480-8770 (Datomic Cloud), Datomic supports Tuples as a new Attribute Type, which allows you to have compound indexes.
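For readers unsure what a compound (composite) index buys you, here is a generic Python illustration. It says nothing about how Datomic stores tuples internally; the sample people and the helper `older_than` are invented for the sketch.

```python
import bisect

people = [
    {"name": "Ann", "city": "Washington", "age": 62},
    {"name": "Bob", "city": "Washington", "age": 41},
    {"name": "Cid", "city": "Boston", "age": 70},
    {"name": "Dee", "city": "Washington", "age": 51},
]

# Composite index: entries sorted by (city, age), each pointing back to a record.
index = sorted((p["city"], p["age"], i) for i, p in enumerate(people))


def older_than(city, min_age):
    """Names of people in `city` with age > min_age, found by one range scan."""
    # First entry in `city` whose age is strictly greater than min_age.
    lo = bisect.bisect_right(index, (city, min_age, float("inf")))
    # One past the last entry for `city`.
    hi = bisect.bisect_right(index, (city, float("inf")))
    return [people[i]["name"] for _, _, i in index[lo:hi]]


result = older_than("Washington", 50)
```

Because entries are sorted by city first and age second, the matching rows sit next to each other and the query never touches people from other cities.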

How is a SQL table different from an array of structs? (In terms of usage, not implementation)

Here's a simple example of an SQL table:
CREATE TABLE persons
(
id INTEGER,
name VARCHAR(255),
height DOUBLE
);
Since I haven't used SQL very much, I haven't yet learned to think in its terms. Effectively, my brain translates the above into this:
struct Person
{
int id;
string name;
double height;
Person(int id_, const char* name_, double height_)
:id(id_),name(name_),height(height_)
{}
};
Person persons[64];
Then, inserting some elements, in SQL:
INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125);
INSERT INTO persons (id, name, height) VALUES (5678, 'Jesse', 6.333);
...and how I'm thinking of it:
persons[0] = Person(1234, "Frank", 5.125);
persons[1] = Person(5678, "Jesse", 6.333);
I've read that SQL can be thought of as two major parts: data manipulation and data definition. I'm more concerned about organizing my data in the first place, as opposed to querying and modifying it. There, the distinctions of SQL are immediately obvious. To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic. Where does the array-of-structs analogy I'm automatically drawing for myself break down?
To give a concrete example, let's say that I want each entry in my persons table (or each of my Person objects) to contain a field denoting the names of that person's children (actual fruit-of-your-loins children, not hierarchical data structure children). In reality, these would probably be cross-table references (or pointers to objects), but let's keep things simple and make this field contain zero or more names. In my C++ example, I'd modify the declaration like so:
vector<string> namesOfChildren;
...and do something like this:
persons[0].namesOfChildren.push_back("John");
persons[0].namesOfChildren.push_back("Jane");
But, from what I can tell, the typical usage of SQL doesn't mirror this approach. If I'm wrong and there's a simple, straightforward solution, great. If not, I'm sure a SQL novice like myself could benefit greatly from a little cogitation on the subject of how databases of SQL tables are meant to be used in contrast to bare, generic data structures.
To me, it seems like the subtleties of how data can and should be structured in SQL is a more obscure topic.
It's called "data(base) modeling" and is somewhere between engineering discipline and art (like much of the computer programming). If you are really interested in the topic, take a look at ERwin Methods Guide.
Where does the array-of-structs analogy I'm automatically drawing for myself break down?
At persistency, concurrency, consistency and scalability.
Persistency: The table is automatically saved to the permanent storage. It'll stay there and survive reboots (not that a real database server will reboot much) until you explicitly delete it or there is a catastrophic hardware failure. DBMSes have well-oiled backup procedures that should help in the latter case.
Concurrency: Tables are meant to be accessed and (if need be) modified by many clients concurrently. Mechanisms such as locking and multi-version concurrency control are employed to ensure clients will not "step on each other's toes".
Consistency: You can define certain constraints (such as uniqueness, foreign keys or checks) and the DBMS will make sure they are never broken. Furthermore, this can often be done in a declarative manner, minimizing chance for errors. On top of that, everything you do in a database is transactional, so you reap the benefits of atomicity, consistency, isolation and durability (aka. "ACID"). In a nutshell, the database will defend itself from bad data.
Scalability: A well designed database schema can grow well beyond the confines of the available RAM, and still keep good performance, using techniques such as indexing, partitioning, clustering etc... Furthermore, SQL is declarative and set-based, which means that the DBMS has the latitude to pick the optimal "query execution plan" for the data at hand, auto-parallelize the query, cache the results in hope they will be reused etc... without changing the meaning of the query.
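The consistency point is easy to try out. Below is a small sketch using SQLite through Python's sqlite3 module; the UNIQUE and CHECK constraints are additions to the question's persons table, chosen only to have something to violate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE persons (
        id INTEGER PRIMARY KEY,
        name VARCHAR(255) NOT NULL UNIQUE,   -- assumed: names must be unique
        height DOUBLE CHECK (height > 0)     -- assumed: heights are positive
    )
    """
)
conn.execute("INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125)")
conn.commit()

rejected = []
for row in [(5678, "Frank", 6.333), (9999, "Jesse", -1.0)]:
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO persons (id, name, height) VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])  # the database refused the bad row

count = conn.execute("SELECT COUNT(*) FROM persons").fetchone()[0]
```

With an array of structs, both bad rows would have been stored unless every piece of calling code remembered to check them first.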
Your analogy to the array of structs is not bad ... for the beginning.
After this beginning the differences start in relation to organizing data.
Database people love their "Normal Forms" laws. We do not have these laws in C++ or similar programming languages. Organizing data in tables according to these laws helps database engines do their magic (queries, joins) better, i.e. keep databases compact, crunch millions of rows in fractions of a second, and serve multiple requests concurrently. They are not absolute laws: 1NF (1st Normal Form) is followed in 99.9999% of cases, but the bigger the number (2NF, 3NF, ...), the more often DB planners allow themselves to deviate from them.
Description of normal forms can be found for example here.
I will try to illustrate differences on your example.
In your example the fields of your struct correspond to the columns of the database table. Adding a vector of names as a new field of the struct would correspond to adding a comma-separated list of names in a new column of your table. This is a violation of 1NF, which demands that one cell hold one value, not a list of values. To normalize your data you will need two arrays: one of Person structs, and a new one of Child structs. While in C++ we can just use pointers to link each child to its parent, in SQL we must use the mechanism of the key. You already added an id field to the Person struct; now we need to add a ParentId field to the Child struct so that the database engine can find the parent. The ParentId column is called a foreign key. Another way to satisfy 1NF, instead of creating a new table/struct for children, is to switch to children-centric thinking and have just one table with a record per child that includes all the information about the child's parent. The parent's information will obviously be repeated in as many records as the parent has children.
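The two-table version described above might look roughly like this, run against SQLite from Python. The children table and its column names (`parent_id`, etc.) are assumptions made for the sketch; the persons table and its rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    CREATE TABLE persons (
        id INTEGER PRIMARY KEY,
        name VARCHAR(255),
        height DOUBLE
    );
    CREATE TABLE children (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES persons (id),  -- the foreign key
        name VARCHAR(255) NOT NULL
    );
    INSERT INTO persons (id, name, height) VALUES (1234, 'Frank', 5.125);
    INSERT INTO persons (id, name, height) VALUES (5678, 'Jesse', 6.333);
    INSERT INTO children (parent_id, name) VALUES (1234, 'John');
    INSERT INTO children (parent_id, name) VALUES (1234, 'Jane');
    """
)

# The counterpart of persons[0].namesOfChildren, with an explicit order.
rows = conn.execute(
    """
    SELECT children.name
    FROM children
    JOIN persons ON persons.id = children.parent_id
    WHERE persons.name = 'Frank'
    ORDER BY children.name
    """
).fetchall()
franks_children = [name for (name,) in rows]
```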
Note (this is also considered part of 1NF) that while in the array of structs we always know the order of the elements, in databases it is up to the engine in what order to store the records. It is just mathematical un-ordered set of records, the engine can resort it in internal storage for various optimizations as it likes. When you retrieve the records from the database with the SELECT statement, if you care about the order, you need to provide ORDER BY clause.
2NF is about removing repetitions from your records. Imagine you also had workplace-related fields as part of your Person struct, say the name of the company and the company address. If many Persons in your dataset work at the same company, you would repeat the company's address in their records. We probably wouldn't do these repetitions in C++ either, but nevertheless extracting them into a separate table would satisfy 2NF. Strictly speaking, even if there are no repetitions and all your Persons work in different places, 2NF still requires extracting the workplace data into a separate table, because it requires that one table represent one entity.
3NF is about removing transitive dependency and is considered kind of optional, so I will not describe it here. See link above.
Another feature of databases quite different from conventional programming of data structures in C++ is the database index. Simplifying, an index is just a copy of a column (or columns), i.e. a vertical slice, into a separate table where the values are stored in their natural order and each record in the index retains a reference to the whole record. So, in your example, to create an index by height you would create another array of 64 elements of the new
struct HeightIndexElem
{
double height;
Person* pFullRecord;
};
and sort them by height in this array. This will allow the DB engine to automatically optimize certain queries. The database engine itself decides when to use certain index. In C++ we usually create maps (Dictionaries in C#) to speed up finding element by certain characteristic but we must use these maps ourselves - no automatic aspect there.
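Here is a hand-rolled version of that height index in Python, to show what the engine maintains for you. It is only a sketch of the idea: the third person and the helper `taller_than` are made up, and a real engine would use a B-tree rather than a sorted list.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Person:
    id: int
    name: str
    height: float


persons = [
    Person(1234, "Frank", 5.125),
    Person(5678, "Jesse", 6.333),
    Person(42, "Kim", 5.5),  # hypothetical extra row
]

# Index entries: (height, position in persons), kept sorted by height.
height_index = sorted((p.height, i) for i, p in enumerate(persons))


def taller_than(min_height):
    """Names of persons strictly taller than min_height, in height order."""
    # len(persons) is larger than any valid position, so equal heights are skipped.
    start = bisect.bisect_right(height_index, (min_height, len(persons)))
    return [persons[i].name for _, i in height_index[start:]]


result = taller_than(5.2)
```

Note that every insert, update, or delete on persons would also have to update height_index by hand, which is exactly the bookkeeping a database does automatically.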
There are major differences:-
SQL tables are persistent -- (English Tran: written to disk)
They are transactional -- (really written to disk)
They can be an arbitrary size -- (Tables of several hundred million rows are quite common)
They support relational algebra -- (Joins with other tables, filtering etc.)
Relational Algebra is provable -- For a given SELECT statement there is only one possible correct answer.
The biggest difference is that when you "UPDATE" and "COMMIT" you know your data is saved in the database and will be there until you decide to "DELETE" it. When you update a structure within an array, it's gone when the machine is switched off.
The other big difference is scale. The size of a modern DBMS is only limited by your hard disk budget.
[I really like farfareast's answer from an academic stand point, but I feel the need to add a more practically oriented answer too.]
SQL tables themselves are "bare, generic data structures" as you call C++'s structures. They are only different data structures: a table is always an array of (fixed size) structs and the only pointers you can use are foreign keys.
For example, when you are adding a vector<string> to your struct, you are already using pointers internally, as strings are only a "fancy" way of writing char*. This would already require a second table in SQL (using a secondary index column to keep the elements in order). Of course there are things like PostgreSQL's arrays that can help in this specific case, but those are "only" shortcuts for similar hand-writeable constructs.
The real difference in data structure and algorithms comes from the fact that you can easily add declarations of index structures. Say you know you need to always access Persons in the order of their height. In C++ you'd use some kind of tree or sorted list to keep them in order. There is an STL container for that. The downside is, that when you need to access them in a different order (say by name), you'll have to add a second tree and duplicate the data or start using pointers to Persons. If you add a Person, you need to update all containers and so on. This becomes cumbersome and soon you'll be on the front page of The Daily WTF. SQL tables on the other side can have attached indices which automatically keep up with new and changed data. Of course, their maintenance also must be paid in performance, but the management of them is basically deciding which are required by your access patterns -- something needed in every case -- and defining them. In contrast to having to rewrite large parts of an application, this is a much more favorable situation.

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc) that should be searched together (ie, a single search would return results of the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of documents with one search, it's better to keep all types in one index. In the index you can define additional field types that you want to tokenize or store term vectors for.
It takes time to point each IndexSearcher at a directory that contains indexes.
If you want to search each type separately, it would be better to index each type in its own index.
A single index is more structured than multiple indexes.
On the other hand, we can balance load with multiple indexes.
Not necessarily answering your direct questions, but... ;)
I'd go with one index, add a Keyword (indexed, stored) field for the type, it'll let you filter if needed, as well as tell the difference between the results you receive back.
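To show what the type field buys you, here is a toy inverted index in Python. It is not Lucene and does not use Lucene's API; the documents, the whitespace tokenizer, and the `search` helper are all invented for the illustration.

```python
from collections import defaultdict

documents = [
    {"id": 1, "type": "forum_message", "text": "lucene index tuning"},
    {"id": 2, "type": "text_document", "text": "index design notes"},
    {"id": 3, "type": "user_profile", "text": "likes hiking"},
]

# One shared index: term -> ids of documents containing it.
inverted = defaultdict(set)
for doc in documents:
    for term in doc["text"].split():
        inverted[term].add(doc["id"])

# The stored "type" keyword for each document.
types = {doc["id"]: doc["type"] for doc in documents}


def search(term, doc_type=None):
    """Return (id, type) hits for `term`, optionally restricted to one type."""
    hits = sorted(inverted.get(term, set()))
    return [(i, types[i]) for i in hits if doc_type is None or types[i] == doc_type]


all_hits = search("index")
only_documents = search("index", "text_document")
```

One search returns results across all kinds of data, and the stored type both labels each hit and lets you narrow the search when needed.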
(and maybe in the vein of your questions... using separate indexes will allow each corpus to have its own relevancy score; I don't know if excessively repeated terms in one corpus will throw off the relevancy of documents in others?)
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule your index architecture is similar to how you would design databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As @llama pointed out, creating a single uber-index affects relevance scores and security/access issues, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
Agree that each kind of data should have its own index. So that all the index options can be set accordingly - like analyzers for the fields, what is stored for the fields for term vectors and similar. And also to be able to use different dynamic when IndexReaders/Writers are reopened/committed for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make it easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager

Best pattern for storing (product) attributes in SQL Server

We are starting a new project where we need to store product and many product attributes in a database. The technology stack is MS SQL 2008 and Entity Framework 4.0 / LINQ for data access.
The products (and Products Table) are pretty straightforward (a SKU, manufacturer, price, etc..). However there are also many attributes to store with each product (think industrial widgets). These may range from color to certification(s) to pipe size. Every product may have different attributes, and some may have multiples of the same attribute (Ex: Certifications).
The current proposal is that we will basically have a name/value pair table with a FK back to the product ID in each row.
An example of the attributes Table may look like this:
ProdID  AttributeName  AttributeValue
123     Color          Blue
123     FittingSize    1.25
123     Certification  AS1111
123     Certification  EE2212
123     Certification  FM.3
456     Pipe           11
678     Color          Red
999     Certification  AE1111
...
Note: Attribute name would likely come from a lookup table or enum.
So the main question here is: Is this the best pattern for doing something like this? How will the performance be? Queries will be based on a JOIN of the product and attributes table, and generally need many WHEREs to filter on specific attributes - the most common search will be to find a product based on a set of known/desired attributes.
If anyone has any suggestions or a better pattern for this type of data, please let me know.
Thanks!
-Ed
You are about to re-invent the dreaded EAV model, Entity-Attribute-Value. This is notorious for having problems in real-life, for various reasons, many covered by Dave's answer.
Luckily the SQL Customer Advisory Team (SQLCAT) has a whitepaper on the topic, Best Practices for Semantic Data Modeling for Performance and Scalability. I highly recommend this paper. Unfortunately, it does not offer a panacea, a cookie-cutter solution, since the problem has no solution. Instead, you'll learn how to find the balance between a fixed queryable schema and a flexible EAV structure, a balance that works for your specific case:
Semantic data models can be very complex and until semantic databases are commonly available, the challenge remains to find the optimal balance between the pure object model and the pure relational model for each application. The key to success is to understand the issues, make the necessary mitigations for those issues, and then test, test, and test. Scalability testing is a critical success factor if you are going to find that optimal design.
This is going to be problematic for a couple of reasons:
Your entity queries will be much harder to write. Transforming the results of those queries into something resembling a ViewModel when it comes time for presentation is going to be painful because it will involve a pivot for each product.
Understanding what your datatypes will be is going to be tough when it comes time to read certain types of data. Are you planning on storing this as strings? For example, DateTimes hold more data than the default .ToString() implementation writes to the string. You're also going to have issues if you try to store floating-point values.
Your objects' data integrity is at risk. There will be a temptation to put properties which should be just attributes of your main product tables in this "bucket o' data". Maybe the design will be semi-sane to begin with, but I guarantee you that after a certain amount of time, folks will start to just throw properties in the bag. It'll then be very tough to keep your objects' integrity with such a loosely defined structure.
Your indexes will most likely be suboptimal. Again think of a property which should be on your product table. Instead of being able to index on just one column, you will now be forced to make a potentially very large composite index on your "type" table.
Since you're apparently planning to throw out proper datatypes and use strings, the performance of range queries for numeric data will likely be poor.
Your table will get big, slowing backups and queries. Instead of an integer being 4 bytes, you're going to have to store far more for an integer of any size.
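To give a feel for the query and datatype problems listed above, here is a small sketch against SQLite from Python. The table and column names are assumed from the question's example, values are stored as text as the proposal implies, and SQL Server would behave differently in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_attributes (
        prod_id INTEGER NOT NULL,
        attribute_name TEXT NOT NULL,
        attribute_value TEXT NOT NULL
    );
    INSERT INTO product_attributes VALUES
        (123, 'Color', 'Blue'),
        (123, 'FittingSize', '1.25'),
        (123, 'Certification', 'AS1111'),
        (123, 'Certification', 'EE2212'),
        (678, 'Color', 'Red'),
        (999, 'Certification', 'AE1111');
    """
)

# "Blue AND certified EE2212" already needs GROUP BY / HAVING bookkeeping.
rows = conn.execute(
    """
    SELECT prod_id
    FROM product_attributes
    WHERE (attribute_name = 'Color' AND attribute_value = 'Blue')
       OR (attribute_name = 'Certification' AND attribute_value = 'EE2212')
    GROUP BY prod_id
    HAVING COUNT(DISTINCT attribute_name) = 2
    """
).fetchall()
matches = [prod_id for (prod_id,) in rows]

# Numbers stored as strings compare as text: '10' sorts before '9'.
text_compare = conn.execute("SELECT '10' < '9'").fetchone()[0]
```

Every additional attribute in the filter adds another OR branch and bumps the HAVING count, and any numeric range filter needs a cast on each row first.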
Better to normalize the table in a more "traditional" way using "IS-A" relationships. For example, you might have Pipes, which are a type of Product, but have a couple more attributes. You might have Stoves, which are a type of product, but have a couple more attributes still.
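A sketch of that "IS-A" layout, again using SQLite from Python instead of SQL Server. The subtype columns (`diameter_inches`, `burner_count`) and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        sku TEXT NOT NULL UNIQUE,
        price REAL NOT NULL
    );
    -- A pipe IS-A product: it shares the product's key and adds typed columns.
    CREATE TABLE pipes (
        product_id INTEGER PRIMARY KEY REFERENCES products (id),
        diameter_inches REAL NOT NULL
    );
    CREATE TABLE stoves (
        product_id INTEGER PRIMARY KEY REFERENCES products (id),
        burner_count INTEGER NOT NULL
    );
    INSERT INTO products (id, sku, price) VALUES (1, 'PIPE-1', 9.5), (2, 'STOVE-1', 499.0);
    INSERT INTO pipes (product_id, diameter_inches) VALUES (1, 1.25);
    INSERT INTO stoves (product_id, burner_count) VALUES (2, 4);
    """
)

# A numeric range query works directly because the column is typed.
rows = conn.execute(
    """
    SELECT products.sku
    FROM products
    JOIN pipes ON pipes.product_id = products.id
    WHERE pipes.diameter_inches BETWEEN 1.0 AND 2.0
    """
).fetchall()
pipe_skus = [sku for (sku,) in rows]
```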
If you really have a generic database and all sorts of other properties which aren't going to be subject to data integrity rules, you very well may want to consider storing data in an XML column. It's hard to tell you what the correct design choice is unless I know a lot more about your business.
IMO this is a design antipattern. The siren song of this idea has lured many a developer onto the rocks of an unmaintainable application.
I know it is an old one - however there might be other readers...
I have seen the balance EAV to attribute modeled approach. Well - it is still EAV. "EAV's are like drugs" is pretty much true. So what about thinking it through once more - and let's be aggressive really:
I still liked the supertype approach, where a lot of tables use the same primary key from a key generator. Let's reuse this one. So what about creating a new table for each set of attributes, all having the primary key from the same key generator? E.g. you would have a table with the fields "color,pipe", another table "fittingsize,pipe", and so on. The requirement "volatility of attributes" screams for a carefully (automatically) maintained data dictionary anyway.
This approach is fully normalized and can be fully automated. You can check whether a specific attribute set has already been materialized as a table by hashing attribute name clusters, e.g. crc32(lower('color~fittingsize~pipe')), where the attribute names need to be sorted alphabetically. Of course this requires having the hash in the data dictionary. Based on the data dictionary each object can be searched (using 'UNION'), especially if the data dictionary itself is a table. Having the data dictionary as a table also allows you to use its primary (surrogate) key as the basis for unique table names, to end up with tables like 'attributes1', 'attributes2', ... Most databases nowadays support some billion tables, so we are sort of safe on that end as well. You could even have a product catalogue with very common attributes that references the extended attribute tables.
An open issue is 1:n data sets. I am afraid you need to sort them out in separate tables. However, this very much depends on your data presentation and querying strategy. Should they always be presented as a comma-separated string attached to the product, or do you want to be able to, e.g., query for all products with a certain Certification?
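The hashing step could look something like the following Python sketch. The function names and the in-memory dict standing in for the data dictionary table are assumptions; a real implementation would also need to handle CRC32 collisions, which this sketch ignores.

```python
import zlib


def attribute_set_key(attribute_names):
    """CRC32 of the lowercased, alphabetically sorted names joined by '~'."""
    canonical = "~".join(sorted(name.lower() for name in attribute_names))
    return zlib.crc32(canonical.encode("utf-8"))


def table_name_for(attribute_names, data_dictionary):
    """Look up (or register) the table that stores this attribute set.

    `data_dictionary` maps hash -> table name; collisions are not handled here.
    """
    key = attribute_set_key(attribute_names)
    if key not in data_dictionary:
        data_dictionary[key] = f"attributes{len(data_dictionary) + 1}"
    return data_dictionary[key]


data_dictionary = {}
first = table_name_for(["Pipe", "Color", "FittingSize"], data_dictionary)
same_set = table_name_for(["color", "fittingsize", "pipe"], data_dictionary)
other_set = table_name_for(["color", "pipe"], data_dictionary)
```

Sorting and lowercasing before hashing is what makes the lookup independent of the order and casing in which the attributes arrive.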
Before you flame this approach please consider this: It is meant for use cases where you have a very high volatility of attributes - in quantity and quality - only. Also it was preset, that you cannot know most of the attributes at the point in time when the solution is created. So do not discuss this in a context where you can model your attributes upfront which would enable you to balance trade offs much better.
In short, you cannot go all one route. If you use an EAV like your example you will have a myriad of problems like those outlined by the other posters not the least of which will be performance and data integrity. Let me reiterate, that using an EAV as the core of your solution will fail when you get to reporting and analysis. However, as you have also stated, you might have hundreds of attributes that change regularly.
The solution, IMO, is a hybrid. For common attributes, use columns/standard schema. For additional, arbitrary attributes, use an EAV. However, the rule with the EAV data is that you can never, ever, under any circumstances, write a query that includes a sort or filter on an attribute. I.e., you can never write Where AttributeName = 'Foo'. The EAV portion of the schema represents a bag of data that is merely there for tracking purposes. In fact, I have seen many people implement this solution by using Xml for the EAV portion. The moment someone does want to search, filter, sort or place an EAV value in a specific spot on a report, that attribute must be elevated to a top level column in the products table.
The key to this hybrid approach is discipline. It will seem simple enough to add a filter, sort or put an attribute in a specific spot somewhere on a report especially when you get pressure from management. You must resist this temptation. Once you go down the dark path... If you do not think that you can maintain that level of discipline in your development team, then I would not use an EAV. As I've mentioned before, EAV's are like drugs: in small quantities and used under the right circumstances they can be beneficial. Too much will kill you.
Rather than have a name-value table, create the usual Product table structure containing all the common attributes, and add an XML column for the attributes that vary by product.
I have used this structure before and it worked quite well.
As @Dave Markle mentions, the name-value approach can lead to a world of pain.