Good practices between SQL and Elasticsearch - sql

Imagine you have a SQL database such as MySQL or PostgreSQL with two tables: user and car. One user can drive N cars and a car can be driven by N users, so you have a third "drive" table with two foreign keys.
Now you want the user table to be available in Elasticsearch, because you want to search users by name, email, etc. Maybe you also need to search the car table.
I see three ways to achieve this, and I'd like to know which is best:
1) Abandon the SQL database. All your tables are now in Elasticsearch. You can search on whatever you want, but you must enforce all your constraints manually.
2) Keep the structure in the SQL database: you keep your three tables, the primary keys and the foreign keys, but the tables contain only the Elasticsearch ID of the associated document. For example, in the user table you keep user_id and add a user_elasticsearch_id that points to the Elasticsearch document holding the name, the email, etc. So you have your SQL constraints and you can search, but you must maintain two stores.
3) Duplicate. You don't touch your SQL database; you duplicate all the rows into Elasticsearch. You have your constraints and you can search, but again you must maintain two stores, and you have twice the data and twice the storage.
Now, brave fellows of Stack Overflow, what would you do in this case?
Thank you.

The most common setup for business-critical data is a SQL database as your primary datastore with Elasticsearch as an additional search index (your solution 3).
An alternative for non-business-critical data such as logs is running Elasticsearch standalone.
Solution 2 seems weird and is not an option for me.

Because you may have a lot of business rules mixed into your database and the application using it, I would be conservative and keep the DB, and use ES to index the user attributes I want to search on. ES would return scored results. When a result is selected, I would switch to the DB to retrieve all the information and relations.
So I would choose "2b": keep the DB and store the PK in ES, rather than the ES ID in the DB.
Keep in mind you can force the ID in ES. It could be "user_PK" or something similar.
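The "2b" flow described above can be sketched roughly as follows. This is only an illustration, not Elasticsearch client code: sqlite3 stands in for the SQL database, and the SearchIndex class is a hypothetical in-memory stand-in for the search engine. The point is the flow: documents are indexed under an ID derived from the SQL primary key, the search returns those IDs, and the authoritative rows are then read from the database.

```python
import sqlite3

# Primary datastore: the SQL database keeps all rows and constraints.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
db.executemany(
    "INSERT INTO user (user_id, name, email) VALUES (?, ?, ?)",
    [(1, "Alice Martin", "alice@example.com"), (2, "Bob Stone", "bob@example.com")],
)


class SearchIndex:
    """Hypothetical stand-in for the search engine (not a real client)."""

    def __init__(self):
        self.docs = {}  # document ID -> searchable text

    def index(self, doc_id, text):
        self.docs[doc_id] = text.lower()

    def search(self, term):
        # Naive substring match; a real engine would return scored hits.
        return sorted(d for d, text in self.docs.items() if term.lower() in text)


index = SearchIndex()
# Force the document ID to be derived from the SQL primary key ("user_<PK>").
for user_id, name, email in db.execute("SELECT user_id, name, email FROM user").fetchall():
    index.index(f"user_{user_id}", f"{name} {email}")


def find_users(term):
    """Search the index, then load the full rows from the SQL database."""
    rows = []
    for doc_id in index.search(term):
        pk = int(doc_id.split("_", 1)[1])  # recover the primary key from the ID
        rows.append(
            db.execute(
                "SELECT user_id, name, email FROM user WHERE user_id = ?", (pk,)
            ).fetchone()
        )
    return rows


print(find_users("alice"))
```

Only the searchable attributes need to live in the index; everything else, including the user/car relations, stays in the database.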

Related

What is the most correct way to store a "list" in a SQL Database?

So, I've read a lot about how stashing multiple values into one column is a bad idea and violates the first rule of data normalisation (which, surprisingly, is not "Do Not Talk About Data Normalisation") so I need some help.
At the moment I'm designing an ASP.NET webpage for the place I work for. I want to display data on a web page depending on what Active Directory groups the person belongs to. The first way of doing this that comes to mind is to have a table with, essentially, one column containing the AD group and a second column containing the list of computers that belong to that group.
I've learnt that this is showing great disregard for relational databases, so what is a better way to do it? I want to control this access by SQL tables, so I can add/remove from these tables and change end users' access accordingly.
Thanks for the help! :)
EDIT: To describe exactly what I want to do is this:
We have a certain group of computers that need to be checked up on, however these computers are in physically difficult to reach locations. The organisation I belong to has remote control enabled for these computers, however they're not in the business of giving out the remote control password (understandable).
The added layer of complexity is that, depending on who you are, our clients should only be able to see a certain group of computers (that is, the group of computers that their area owns). So, if Group A has Thomas in it, and Group B has Jones in it, if you belong to either group then you would just see one entry. However, if you belong to both groups you should see both Thomas and Jones computers in it.
The reason why I think that storing this data in a SQL cell is the way to go is because, to store them in tables would require (in my mind) a new table for each new "group" of computers. I don't want to crank out SQL tables for every new group, I'd much rather just have an added row in a SQL table somewhere.
Does this make any sense?
You basically have three options in SQL Server:
- Storing the values in a single column.
- Storing the values in a junction table.
- Storing the values as XML (or as some other structured data format).
(Other databases have other options, such as arrays, nested tables, and JSON.)
In almost all cases, using a junction table is the correct approach. Why? Here are some reasons:
- SQL Server has (relatively) lousy string manipulation, so doing something as simple as ensuring a unique list is really, really hard.
- A junction table allows you to store lots of other information (When was a machine added? What is the full description of the machine? etc. etc.).
- Most queries that you want are pretty easy with a junction table (with the one exception of getting a comma-delimited list, alas -- which is just counterintuitive rather than "hard").
- All the types are stored natively.
- A junction table allows you to enforce constraints (both check and foreign key) on the elements of the list.
Although a delimited list is almost never the right solution, it is possible to think of cases where it might be useful:
- The list doesn't change and presentation of the list is very important.
- Space usage is an issue (alas, denormalization often results in fewer pages).
- Queries do not really access elements of the list, just the entire thing.
XML is also a reasonable choice under some circumstances. In the most recent versions of SQL Server, this can be made pretty efficient. However, it incurs the overhead of reading and parsing XML -- and things like duplicate elimination are still not obvious.
So, you do have options. In almost all cases, the junction table is the right approach.
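To make the constraint and comma-delimited-list points above concrete, here is a minimal sketch. It uses Python with SQLite as a stand-in for SQL Server, and the table and column names (ad_group, computer, group_computer, added_on) are made up for illustration. The list query uses SQLite's group_concat aggregate; SQL Server 2017 and later offer STRING_AGG for the same purpose.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to
db.executescript("""
    CREATE TABLE ad_group (group_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE computer (computer_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE group_computer (
        group_id    INTEGER NOT NULL REFERENCES ad_group (group_id),
        computer_id INTEGER NOT NULL REFERENCES computer (computer_id),
        added_on    TEXT NOT NULL DEFAULT CURRENT_DATE,  -- extra per-element information
        PRIMARY KEY (group_id, computer_id)              -- keeps each list unique
    );
    INSERT INTO ad_group VALUES (1, 'Group A'), (2, 'Group B');
    INSERT INTO computer VALUES (1, 'PC-01'), (2, 'PC-02'), (3, 'PC-03');
    INSERT INTO group_computer (group_id, computer_id) VALUES (1, 1), (1, 2), (2, 3);
""")


def try_add(group_id, computer_id):
    """Return True if the pair was stored, False if a constraint rejected it."""
    try:
        db.execute(
            "INSERT INTO group_computer (group_id, computer_id) VALUES (?, ?)",
            (group_id, computer_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_add(1, 1)   # already in the list: rejected by the primary key
orphan_ok = try_add(1, 99)     # no such computer: rejected by the foreign key


def comma_lists():
    """Build one comma-delimited list of computer names per group."""
    rows = db.execute("""
        SELECT g.name, group_concat(c.name, ', ')
        FROM ad_group AS g
        JOIN group_computer AS gc ON gc.group_id = g.group_id
        JOIN computer AS c ON c.computer_id = gc.computer_id
        GROUP BY g.group_id, g.name
    """).fetchall()
    return dict(rows)


print(duplicate_ok, orphan_ok, comma_lists())
```

The order of names inside each concatenated list is not guaranteed here, so code that depends on it should sort explicitly.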
There is an "it depends" that you should consider. If the data is never going to be queried (or queried very rarely) storing it as XML or JSON would be perfectly acceptable. Many DBAs would freak out but it is much faster to get the blob of data that you are going to send to the client than to recompose and decompose a set of columns from a secondary table. (There is a reason document and object databases are becoming so popular.)
... though I would ask why you are replicating Active Directory to your database and how you are planning on keeping the two in sync.
It is not really a bad idea to store multiple values in one column, but it depends on the searches you want to run.
If you only want to know which persons are part of a group, then you can store the persons in one column with the group id as the key. To update, you just rewrite the entire list for that group.
But if you want to search for a specific person who belongs to a group, then storing multiple persons in one column is not recommended. In that case it is better to use an intermediate table that stores person id and group id.
Sounds like you want a table that maps users to group IDs and a second table that maps group IDs to which computers are in that group. I'm not sure, your language describing the problem was a bit confusing to me.
A list has some columns like name, family name, phone number, etc.,
and rows like name=john familyName=lee number=12321321
name=... familyName=... number=...
A SQL database works the same way. Every row in a SQL table is a record, so you just add the records of your list to the database using INSERT queries.
complete explanation in here:
http://www.w3schools.com/sql/sql_insert.asp
This sounds like a typical many-to-many problem. You have many groups and many computers and they are related to each other. In this situation, it is often recommended to use a mapping table, a.k.a. a "junction table" or "cross-reference table". This table consists solely of the two foreign keys to your other tables.
If your tables look like this:
Computer
- computerId
- otherComputerColumns
Group
- groupId
- othergroupColumns
Then your mapping table would look like this:
GroupComputer
- groupId
- computerId
And you would insert a single record for every relationship between a group and a computer. This complies with the rules of third normal form with regard to database normalization.
You can have a table with the group and group id, another table with the computer and computer id and a third table with the relation of group id and computer id.
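The mapping-table layout described in the answers above could look roughly like this. It is a sketch in Python with SQLite rather than SQL Server; the UserGroup table, the user names and the sample rows are assumptions added for illustration (in practice group membership might come straight from Active Directory). It also shows that a new group is just new rows, not a new table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Computer (computerId INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE "Group" (groupId INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE GroupComputer (
        groupId    INTEGER NOT NULL REFERENCES "Group" (groupId),
        computerId INTEGER NOT NULL REFERENCES Computer (computerId),
        PRIMARY KEY (groupId, computerId)
    );
    -- Hypothetical: which groups each user belongs to.
    CREATE TABLE UserGroup (
        userName TEXT NOT NULL,
        groupId  INTEGER NOT NULL REFERENCES "Group" (groupId),
        PRIMARY KEY (userName, groupId)
    );
    INSERT INTO Computer VALUES (1, 'Thomas'), (2, 'Jones');
    INSERT INTO "Group" VALUES (1, 'Group A'), (2, 'Group B');
    INSERT INTO GroupComputer VALUES (1, 1), (2, 2);
    INSERT INTO UserGroup VALUES ('anna', 1), ('ben', 2), ('carol', 1), ('carol', 2);
""")


def visible_computers(user_name):
    """Return the computers a user may see through any of their groups."""
    rows = db.execute("""
        SELECT DISTINCT c.name
        FROM UserGroup AS ug
        JOIN GroupComputer AS gc ON gc.groupId = ug.groupId
        JOIN Computer AS c ON c.computerId = gc.computerId
        WHERE ug.userName = ?
        ORDER BY c.name
    """, (user_name,)).fetchall()
    return [name for (name,) in rows]


# Adding a group of computers later only means inserting rows.
db.execute("""INSERT INTO "Group" VALUES (3, 'Group C')""")
db.execute("INSERT INTO Computer VALUES (3, 'Smith')")
db.execute("INSERT INTO GroupComputer VALUES (3, 3)")
db.execute("INSERT INTO UserGroup VALUES ('anna', 3)")

print(visible_computers("anna"), visible_computers("carol"))
```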

Implementing a cache layer (including some sql db tables) in couchbase

Suppose I have many SQL tables with at least 10 columns each.
Let's take for example:
HR Table: ID, FirstName, LastName, PhoneNumber, Gender, City, Street, Height, Weight, IQ
I need to build a cache layer for all of my SQL tables.
What would be the best way to store the data in Couchbase ?
Should I store the whole document for each row ?
Here is a potential key, for example a key that returns a JSON document containing all the columns of the row with ID=4:
HR_4
Or should I implement it like a key-value store?
For instance, a key that returns one specific value (not all the columns):
HR_4_FirstName
Please keep in mind that I DO need to get an entire row per key in my application, but sometimes I need just one specific column.
The question is: should I go for the second way and, if I need several values, send several requests from my application and aggregate them?
On the other hand, the second way means many more keys to handle (effectively one key for each DB field).
I would look at how your application uses and accesses the data. It may be worthwhile to have several objects for the data you are trying to store depending on access patterns and what you want to optimize for. May I recommend this article on data modeling for a user profile store in Couchbase. Let me know if this does not help.
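As a rough way to compare the two key layouts from the question, here is a sketch in Python. A plain dict stands in for the cache, so this is not Couchbase client code, and the sample row values are invented. It only illustrates how many keys and lookups each layout needs for the two access patterns mentioned.

```python
import json

cache = {}  # stand-in for the key-value cache (assumption, not a Couchbase client)

row = {
    "ID": 4, "FirstName": "Dana", "LastName": "Levi", "PhoneNumber": "555-0100",
    "Gender": "F", "City": "Haifa", "Street": "Main St", "Height": 170,
    "Weight": 60, "IQ": 120,
}

# Layout 1: one JSON document per row, stored under "<table>_<id>".
cache["HR_4"] = json.dumps(row)


def get_row(table, row_id):
    """One lookup returns the whole row."""
    return json.loads(cache[f"{table}_{row_id}"])


def get_column(table, row_id, column):
    """A single column still costs one lookup, but transfers the whole row."""
    return get_row(table, row_id)[column]


# Layout 2: one key per field, stored under "<table>_<id>_<column>".
for column, value in row.items():
    cache[f"HR_4_{column}"] = json.dumps(value)


def get_row_from_fields(table, row_id, columns):
    """Rebuilding a row costs one lookup per column."""
    return {c: json.loads(cache[f"{table}_{row_id}_{c}"]) for c in columns}


field_keys = [k for k in cache if k.startswith("HR_4_")]
print(get_column("HR", 4, "FirstName"), len(field_keys))
```

With ten columns, the per-field layout needs ten keys and ten lookups to rebuild one row, so it tends to pay off only when single-column reads dominate.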

What is this form of database called?

I'm new to databases and I'm thinking of creating one for a website. I started with SQL, but I really am not sure if I'm using the right kind of database.
Here's the problem:
What I have right now is the first option. That means my query result looks something like this:
user_id  photo_id  photo_url
0        0         abc.jpg
0        1         123.jpg
0        2         lol.png
etc. But to me that seems a little inefficient once the database becomes big. So what I want is the second option shown in the picture. Something like this:
user_id  photos
0        {abc.jpg, 123.jpg, lol.png}
Or something like this:
user_id  photo_ids
0        {0, 1, 2}
I couldn't find anything like that; I only find ordinary SQL. Is there any way to do something like that (even if it isn't considered a "database")? If not, why is SQL more efficient for those kinds of situations? How can I make it more efficient?
Thanks in advance.
Your initial approach to having a user_id, photo_id, photo_url is correct. This is the normalized relationship that most database management systems use.
This relationship is called "one to many," as a user can have many photos.
You may want to go as far as separating the photo details and just providing a reference table between the users and photos.
The reason your second approach is inefficient is because databases are not designed to search or store multiple values in a single column. While it's possible to store data in this fashion, you shouldn't.
If you wanted to locate a particular photo for a user using your second approach, you would have to search using LIKE, which will most likely not make use of any indexes. The process of extracting or listing those photos would also be inefficient.
You can read more about basic database principles here.
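The indexing argument above can be checked with a small experiment. This sketch uses Python with SQLite; the table names, the index name and the sample data are assumptions. It asks SQLite for its query plan in both layouts: a lookup by user_id on the one-row-per-photo table can use an index, while a LIKE with a leading wildcard on the list column has to scan every row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- First approach: one row per photo, indexed by user.
    CREATE TABLE photos (
        user_id INTEGER NOT NULL, photo_id INTEGER NOT NULL, photo_url TEXT NOT NULL
    );
    CREATE INDEX idx_photos_user ON photos (user_id);
    INSERT INTO photos VALUES
        (0, 0, 'abc.jpg'), (0, 1, '123.jpg'), (0, 2, 'lol.png'), (1, 0, 'cat.gif');

    -- Second approach: all photos of a user packed into one column.
    CREATE TABLE user_photo_list (user_id INTEGER PRIMARY KEY, photos TEXT NOT NULL);
    INSERT INTO user_photo_list VALUES
        (0, '{abc.jpg, 123.jpg, lol.png}'), (1, '{cat.gif}');
""")


def plan(sql, params=()):
    """Return the human-readable steps of SQLite's query plan."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


row_sql = "SELECT photo_url FROM photos WHERE user_id = ?"
list_sql = "SELECT user_id FROM user_photo_list WHERE photos LIKE ?"

row_plan = plan(row_sql, (0,))             # expected to be an index search
list_plan = plan(list_sql, ("%123.jpg%",))  # expected to be a full scan

print(row_plan)
print(list_plan)
```

The exact wording of the plan differs between SQLite versions, and other database systems report plans differently, but the index-search versus full-scan distinction is the part that matters.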
Your first example looks like a traditional relational database, where a table stores a single record per row with one value per column. This is how data is stored in RDBMSs like Oracle, MySQL and SQL Server. Your second example looks more like a document database or NoSQL database, where data is stored in nested data objects (like hashes and arrays). This is how data is stored in database systems like MongoDB.
There are benefits and costs to storing data in either model. With relational databases, where data is spread across multiple tables and linked by keys, it is easy to get at data from multiple angles and aggregate it for multiple purposes. With document databases, data is typically more difficult to join in single queries, but much faster to retrieve, and also typically formatted for quicker application use.
For your application, the latter (document database model) might be best if you only care about referencing a user's images when you have a user ID. This would not be ideal for say, querying for all images of category 'profile pic' or for all images uploaded after a certain date. You could probably accomplish your task with either database type, and choosing the right database will always depend on the application(s) that it will be used for, but as a general rule-of-thumb, relational databases are more flexible and hard to go wrong with.
What you want (having user -> (photo1, photo2, ...)) is essentially an INDEX:
When you execute your query, the database goes to the index on "user" in the photos table and gets the list of photos to fetch. It does not scan the whole table, so the lookup is optimised.
I would do something like this:
One user, one photo
Keep a Users_Table with all the columns that every user has. If each user has only one photo, just add a photo_url column to this table.
One user, many photos
If one user can have multiple photos, create a separate photos table that contains only the UserID from Users_Table, the Photo_ID and the Photo_File.
Many users, many photos
If one photo can be assigned to multiple users, create a separate Photos table with PhotoID and Photo_File, plus a third table, User_Photos, that holds the UserID from Users_Table and the Photo_ID from the Photos table.

Find key by value

The thing I'm trying to implement is an ID table. Basically it has the structure (user_id, lecturer_id), where user_id refers to the primary key of my User table and lecturer_id refers to the primary key of my Lecturer table.
I'm trying to implement this in Redis, but if I use the user's primary ID as the key, then a query like "get all the records with lecturer id = 5" cannot be answered in O(1) time, since the lecturer is the value, not the key.
How can I build a structure like the ID table I described above, or does Redis not support that?
One of the things you learn fast while working with Redis is that you get to design your data structure around your access needs, especially when it comes to relations (it's not a relational database after all).
There is no way to search by "value" with a O(1) time complexity as you already noticed, but there are ways to approach what you describe using redis. Here's what I would recommend:
Store your user data by user id (in e.g. a hash) as you are already doing.
Have an additional set for each lecturer id containing all user ids that correspond to the lecturer id in question.
This might seem like duplicating the data of the relation, since your user data would have to store the lecturer id, and your lecturer data would store user ids, but that's the (tiny) price to pay if one is to build relations in a non-relational data store like Redis. In practical terms this works well; memory is rarely a bottleneck for small-ish data sets (think thousands of ids).
To get a better picture of how people are using Redis to model applications with relations, I recommend reading Design and implementation of a simple Twitter clone and the source code of Lamernews, both of which are written by Redis author Salvatore Sanfilippo.
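The two-structure layout recommended above can be sketched as follows. To stay runnable without a server, this Python sketch models the Redis hash and sets with plain dicts and sets; the comments name the Redis commands each step would correspond to (HSET, SADD, SREM, SMEMBERS), and the key names user:<id> and lecturer:<id>:users are only a suggested convention.

```python
from collections import defaultdict

users = {}                          # models one hash per user, key "user:<id>"
lecturer_users = defaultdict(set)   # models one set per lecturer, key "lecturer:<id>:users"


def add_user(user_id, name, lecturer_id):
    # HSET user:<id> name <name> lecturer_id <lecturer_id>
    users[user_id] = {"name": name, "lecturer_id": lecturer_id}
    # SADD lecturer:<lecturer_id>:users <user_id>
    lecturer_users[lecturer_id].add(user_id)


def users_of_lecturer(lecturer_id):
    # SMEMBERS lecturer:<lecturer_id>:users
    return sorted(lecturer_users.get(lecturer_id, set()))


def move_user(user_id, new_lecturer_id):
    """Both structures must be updated together, since the index is manual."""
    old_lecturer_id = users[user_id]["lecturer_id"]
    lecturer_users[old_lecturer_id].discard(user_id)   # SREM on the old set
    lecturer_users[new_lecturer_id].add(user_id)       # SADD on the new set
    users[user_id]["lecturer_id"] = new_lecturer_id    # HSET on the user hash


add_user(1, "ann", 5)
add_user(2, "bob", 5)
add_user(3, "cem", 7)
move_user(2, 7)

print(users_of_lecturer(5), users_of_lecturer(7))
```

Against a real server, the updates in move_user would normally be sent together (for example in a MULTI/EXEC transaction) so the hash and the sets cannot drift apart.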
As already answered, in vanilla Redis there is no way to store the data only once and have Redis query it for you.
You have to maintain secondary indexes yourself.
However, with Redis modules this is not necessarily true. Modules like zeeSQL or RediSearch let you store data directly in Redis and retrieve it with a SQL query (zeeSQL) or a SQL-like query language (RediSearch).
In your case, a small example with zeeSQL:
> ZEESQL.CREATE_DB DB
OK
> ZEESQL.EXEC DB COMMAND "CREATE TABLE user(user_id INT, lecture_id INT);"
OK
> ZEESQL.EXEC DB COMMAND "SELECT * FROM user WHERE lecture_id = 3;"
... your result ...

Simulating variable column names in sqlite

I want to store entries (a set of key=>value pairs) in a database, but the keys vary from entry to entry.
I thought of storing with two tables, (1) of the keys for each entry and (2) of the values of specific keys for each entry, where entries share a common id field in both tables, but I am not sure how to pull entries as a key=>value pairs in sql with this sort of configuration.
Is there a better method? If this is not possible in sqlite, is it possible in mysql? Thanks!
It sounds like you are looking for the Entity-Attribute-Value model.
Alternatives are to create different tables for different types of entities, or to have a table with a column for every possible key and set the value to NULL for entities that don't have that key.
You might want to take a look at Bill Karwin's presentation SQL Antipatterns where he covers some of the pros and cons of the EAV model and suggests possible alternatives. The relevant part starts from slide 16.
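For reference, pulling entries back out as key => value pairs from an EAV layout could look roughly like this. It is a sketch in Python with SQLite; the table and column names (entry, entry_attribute, attribute, value) and the sample entries are made up. It also exposes one of the costs of EAV: every value ends up in the same text column, so native types are lost.

```python
import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE entry (entry_id INTEGER PRIMARY KEY);
    CREATE TABLE entry_attribute (
        entry_id  INTEGER NOT NULL REFERENCES entry (entry_id),
        attribute TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (entry_id, attribute)   -- one value per key per entry
    );
""")


def save_entry(entry_id, pairs):
    """Store one entry as a set of (entry_id, attribute, value) rows."""
    db.execute("INSERT INTO entry (entry_id) VALUES (?)", (entry_id,))
    db.executemany(
        "INSERT INTO entry_attribute (entry_id, attribute, value) VALUES (?, ?, ?)",
        [(entry_id, key, str(value)) for key, value in pairs.items()],
    )


def load_entries():
    """Rebuild each entry as a dict of key => value pairs."""
    entries = defaultdict(dict)
    query = "SELECT entry_id, attribute, value FROM entry_attribute ORDER BY entry_id"
    for entry_id, attribute, value in db.execute(query):
        entries[entry_id][attribute] = value
    return dict(entries)


save_entry(1, {"title": "SQLite notes", "pages": 120})
save_entry(2, {"title": "Untitled", "author": "kim", "draft": "yes"})

print(load_entries())
```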
@Mark Byers is right, this is the EAV model. You should read Bad CaRMa before you go down that dark path. It's a story of how this database design practically destroyed a company.
In a relational database, every row in a relation must include the same columns. That's part of the definition for a relation. This is true in SQLite, MySQL, or any other relational database.
Also see my presentation Practical Object-Oriented Models in SQL or my book SQL Antipatterns, in which I show the problems caused by the EAV model.
If you need variable columns per entity, you need a non-relational database. There are document-oriented databases like CouchDB or MongoDB that are growing in popularity.
Or try Berkeley DB if you want an embeddable single-user solution like SQLite.