Key-Value Store (Redis) for Phi Coefficient Use-Case - redis

I want to build an application in which the user can assign multiple tags (strings) to a date (a YYYY-MM-DD string). The main use case is to calculate the Phi coefficient for a combination of two tags (A and B), which requires putting every date into one of the following categories:
Neither tag is assigned
Tag A is assigned, but not Tag B
Tag B is assigned, but not Tag A
Both Tag A and Tag B are assigned
The crucial information is how many dates belong to each category, not which dates.
The question is how to persist that data so that it can be looked up quickly for the categorization described above.
With a key-value store such as Redis, one option would be to store sets with the tags as keys and the dates for each tag as members, which makes it easy to fill the store with new information. For the lookup, the intersection (SINTER) of A and B would form the fourth category, and the differences (SDIFF) between A and B, and between B and A, would form the second and third categories, respectively.
The question remains how to calculate the first category: the number of dates to which neither Tag A nor Tag B applies. The only option that comes to my mind is to read out the dates by iterating over all keys and to subtract the counts of categories 2, 3, and 4 from the total number of dates. Is there a more elegant and more efficient way to achieve this? Or would I be better off using an SQL database for this use case?
EDIT: Another idea would be to store not only the dates by tag but also, redundantly, the tags by date, so that retrieving all dates is easier.

There are two basic approaches here: store the data in a single canonical form and use that to compute derived data as you need it; or store the information in multiple ways up front to optimize lookup speed.
Up to now you've taken the first approach. That's good, since storing information in a single place makes many things simpler, and eliminates the risk of having inconsistent data. The downside is that computing derived values can be slow. In your case you're talking about O(n) set operations in the best case, and iterating over all keys in the worst case. Although it's always worth doing performance testing before making things more complex, my intuition is that you're right to be worried.
Storing the derived data separately from the canonical data allows you to optimize your lookup performance. Your last paragraph suggests storing the same information in multiple ways, but as long as you're doing that you might as well store the actual desired derived values rather than keeping the existing dates-by-tag data structure.
Specifically, my suggestion is to store the tags by date while separately storing the counts for categories 1-4. Each time you record a new (or changed, or deleted) input value you both update your canonical data structure and update your counts. You could probably do so atomically with a fairly simple Lua script. You can then access the desired counts in O(1) time and be confident that they accurately reflect the underlying data.
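As a rough sketch of that bookkeeping, assuming the redis-py client (key names are purely illustrative): instead of precomputed per-pair counts, this variant keeps a set of all known dates, so category 1 falls out by subtraction while categories 2-4 come straight from set operations. The per-pair counters described above could be layered on top in the same pipeline or in a Lua script.

# Rough sketch, assuming redis-py; key names are illustrative.
import redis

r = redis.Redis()

def add_tag(date, tag):
    # Update both directions plus the universe of dates atomically.
    pipe = r.pipeline(transaction=True)
    pipe.sadd(f"tags:{date}", tag)    # tags by date (canonical data)
    pipe.sadd(f"dates:{tag}", date)   # dates by tag (for SINTER/SDIFF)
    pipe.sadd("dates:all", date)      # all dates ever tagged
    pipe.execute()

def category_counts(tag_a, tag_b):
    total   = r.scard("dates:all")
    both    = len(r.sinter(f"dates:{tag_a}", f"dates:{tag_b}"))  # A and B
    only_a  = len(r.sdiff(f"dates:{tag_a}", f"dates:{tag_b}"))   # A, not B
    only_b  = len(r.sdiff(f"dates:{tag_b}", f"dates:{tag_a}"))   # B, not A
    neither = total - both - only_a - only_b                     # neither tag
    return neither, only_a, only_b, both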

Related

Querying SQL using a Code column vs extended table

I am setting up a fairly large dataset (catalogue) on an SQL database (I'd guesstimate ~100k records) to store information regarding products. Each product is characterized by about 20-30 properties, so that would basically mean 20-30 columns. The system is set up so that each of these properties is linked to a code, and each product is therefore characterized by a unique string made by concatenating all these properties (the string has to be unique; if two product codes are the same, then the two products are actually the same product). What I am trying to figure out is whether, SQL-wise, there is any difference between storing the catalogue as a table of 20-30 columns and just having one column with the code and decoding the properties from the code. The difference being that in one case I would do
SELECT * FROM Catalogue WHERE Color='RED'
versus
SELECT * FROM Catalogue WHERE Code LIKE '____R____________'
Also, it might make it easier to check whether a product already exists, as I would only be comparing a single column rather than 20-30 columns. I could also add an extra column to the full table to store the code and use one method for one operation and the other method for another.
I have almost no knowledge of how the SQL engine works so I might be completely off with my reasoning here.
The code approach seems silly. Why do I phrase it this way?
You have a few dozen columns with attributes and you know what they are. Why would you NOT include that information in the data model?
I am also amused by how you are going to distinguish these comparisons:
WHERE Code LIKE '____R____________'
WHERE Code LIKE '___R_____________'
WHERE Code LIKE '_____R___________'
WHERE Code LIKE '____R___________'
That just seems like a recipe for spending half the rest of your future life on debugging -- if not your code then someone else's.
And, with separate columns, you can create indexes for commonly used combinations.
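For instance, a small illustrative sketch (SQLite via Python's standard library; names are hypothetical): the query against a real column can use an index, while the positional LIKE pattern has to scan and silently encodes which attribute is meant by the position of the underscores.

# Illustrative sketch; table contents and index name are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Catalogue (Code TEXT UNIQUE, Color TEXT, Size TEXT)")
conn.execute("CREATE INDEX idx_catalogue_color ON Catalogue (Color)")
conn.execute("INSERT INTO Catalogue VALUES ('ABCDR0123456789XX', 'RED', 'L')")

# Explicit about the attribute, and able to use the index:
print(conn.execute("SELECT * FROM Catalogue WHERE Color = 'RED'").fetchall())

# Encodes the attribute only by underscore position, and cannot seek an index:
print(conn.execute(
    "SELECT * FROM Catalogue WHERE Code LIKE '____R____________'").fetchall())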
If not all rows have all attributes -- or if the attributes can be expanded in the future -- you might want a structure with a separate line for each attribute:
entityId  code   value
1         Color  Red
This is called an entity-attribute-value (EAV) model and is appropriate under some circumstances.

Best Practice - Should I make one table or two for two similar sets of data?

I need a table to store types of tests. I've been provided with two excel spreadsheets, one for microbial tests, one for pathogens. Microbial has 5 columns and Pathogens has 10. The 5 columns are in both tables. So one has 5 extra columns.
Just to give you an idea, the table columns would be something like this:
**Microbial**
Test Method IncubationStage1
**Pathogens**
Test Method IncubationStage1 IncubationStage2 Enrichment
So is it better to have one table for Microbial and one for Pathogens, or better to have one table for Tests and put both within it? Is it bad to have a Microbial test in a table where I know for certain only half the columns will be utilized? Or is it better to keep related items in the same table and separate them by a column "Type"?
Obviously both will work fine but I'm wondering which is better.
The answer to these sorts of questions is always "it depends."
In my opinion, if you think you'll ever want to aggregate the data by test or by method across pathogenic or microbial types, then certainly you should put the data in the same table with an additional column that differentiates them.
You also could potentially better "normalize" your tables like this:
Table1: ExperimentID_PK ExperimentTypeID_FK Test Method
Table2: MeasurementRecordID_PK ExperimentID_FK Timestamp (other metadata about the record)
Table3: MeasurementID_PK MeasurementTypeID_FK MeasurementValue MeasurementRecordID_FK
Table4: MeasurementTypeID_PK (metadata about measurement types)
Table5: ExperimentTypeID_PK (metadata about experiment types)
... where all the leaf data elements point back to their parent data elements through foreign keys, and then you'd join data together in SQL statements, with indexes applied for optimal performance based on the types of queries you wanted to make. Obviously one of your rows in the question would end up appearing as multiple rows across multiple tables in this schema, and only at query time could they conceivably be reunited into individual rows (e.g. bound by MeasurementRecordID).
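A compact sketch of that five-table layout (SQLite syntax purely for illustration; the names and types are assumptions, not the asker's actual schema):

# Compact sketch of the five-table layout described above; all names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ExperimentType  (ExperimentTypeID  INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE MeasurementType (MeasurementTypeID INTEGER PRIMARY KEY, Name TEXT, Unit TEXT);
CREATE TABLE Experiment (
    ExperimentID     INTEGER PRIMARY KEY,
    ExperimentTypeID INTEGER REFERENCES ExperimentType(ExperimentTypeID),
    Test             TEXT,
    Method           TEXT
);
CREATE TABLE MeasurementRecord (
    MeasurementRecordID INTEGER PRIMARY KEY,
    ExperimentID        INTEGER REFERENCES Experiment(ExperimentID),
    Timestamp           TEXT
);
CREATE TABLE Measurement (
    MeasurementID       INTEGER PRIMARY KEY,
    MeasurementTypeID   INTEGER REFERENCES MeasurementType(MeasurementTypeID),
    MeasurementRecordID INTEGER REFERENCES MeasurementRecord(MeasurementRecordID),
    MeasurementValue    REAL
);
""")

# One spreadsheet-style row is reassembled at query time with joins, e.g.:
rows = conn.execute("""
    SELECT e.Test, e.Method, mt.Name, m.MeasurementValue
    FROM Measurement m
    JOIN MeasurementRecord r ON r.MeasurementRecordID = m.MeasurementRecordID
    JOIN Experiment e        ON e.ExperimentID = r.ExperimentID
    JOIN MeasurementType mt  ON mt.MeasurementTypeID = m.MeasurementTypeID
""").fetchall()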
But there are other patterns too; in NoSQL land, normalization can be the enemy. Slicing and dicing data sets turns out to be easier in some domains if the data is stored in a more bloated format that makes query structures more obvious. So it comes down to thinking through your use cases.

SQL many value in one var [duplicate]

So, per Mehrdad's answer to a related question, I get it that a "proper" database table column doesn't store a list. Rather, you should create another table that effectively holds the elements of said list and then link to it directly or through a junction table. However, the type of list I want to create will be composed of unique items (unlike the linked question's fruit example). Furthermore, the items in my list are explicitly sorted - which means that if I stored the elements in another table, I'd have to sort them every time I accessed them. Finally, the list is basically atomic in that any time I wish to access the list, I will want to access the entire list rather than just a piece of it - so it seems silly to have to issue a database query to gather together pieces of the list.
AKX's solution (linked above) is to serialize the list and store it in a binary column. But this also seems inconvenient because it means that I have to worry about serialization and deserialization.
Is there any better solution? If there is no better solution, then why? It seems that this problem should come up from time to time.
... just a little more info to let you know where I'm coming from. As soon as I had just begun understanding SQL and databases in general, I was turned on to LINQ to SQL, and so now I'm a little spoiled because I expect to deal with my programming object model without having to think about how the objects are queried or stored in the database.
Thanks All!
John
UPDATE: So in the first flurry of answers I'm getting, I see "you can go the CSV/XML route... but DON'T!". So now I'm looking for explanations of why. Point me to some good references.
Also, to give you a better idea of what I'm up to: In my database I have a Function table that will have a list of (x,y) pairs. (The table will also have other information that is of no consequence for our discussion.) I will never need to see part of the list of (x,y) pairs. Rather, I will take all of them and plot them on the screen. I will allow the user to drag the nodes around to change the values occasionally or add more values to the plot.
No, there is no "better" way to store a sequence of items in a single column. Relational databases are designed specifically to store one value per row/column combination. In order to store more than one value, you must serialize your list into a single value for storage, then deserialize it upon retrieval. There is no other way to do what you're talking about (because what you're talking about is a bad idea that should, in general, never be done).
I understand that you think it's silly to create another table to store that list, but this is exactly what relational databases do. You're fighting an uphill battle and violating one of the most basic principles of relational database design for no good reason. Since you state that you're just learning SQL, I would strongly advise you to avoid this idea and stick with the practices recommended to you by more seasoned SQL developers.
The principle you're violating is called first normal form, which is the first step in database normalization.
At the risk of oversimplifying things, database normalization is the process of defining your database based upon what the data is, so that you can write sensible, consistent queries against it and be able to maintain it easily. Normalization is designed to limit logical inconsistencies and corruption in your data, and there are a lot of levels to it. The Wikipedia article on database normalization is actually pretty good.
Basically, the first rule (or form) of normalization states that your table must represent a relation. This means that:
You must be able to differentiate one row from any other row (in other words, your table must have something that can serve as a primary key). This also means that no row should be duplicated.
Any ordering of the data must be defined by the data, not by the physical ordering of the rows (SQL is based upon the idea of a set, meaning that the only ordering you should rely on is that which you explicitly define in your query)
Every row/column intersection must contain one and only one value
The last point is obviously the salient point here. SQL is designed to store your sets for you, not to provide you with a "bucket" for you to store a set yourself. Yes, it's possible to do. No, the world won't end. You have, however, already crippled yourself in understanding SQL and the best practices that go along with it by immediately jumping into using an ORM. LINQ to SQL is fantastic, just like graphing calculators are. In the same vein, however, they should not be used as a substitute for knowing how the processes they employ actually work.
Your list may be entirely "atomic" now, and that may not change for this project. But you will, however, get into the habit of doing similar things in other projects, and you'll eventually (likely quickly) run into a scenario where you're now fitting your quick-n-easy list-in-a-column approach where it is wholly inappropriate. There is not much additional work in creating the correct table for what you're trying to store, and you won't be derided by other SQL developers when they see your database design. Besides, LINQ to SQL is going to see your relation and give you the proper object-oriented interface to your list automatically. Why would you give up the convenience offered to you by the ORM so that you can perform nonstandard and ill-advised database hackery?
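To make that concrete for the (x,y) example from the question, a minimal sketch of the "correct table" (SQLite syntax; the table and column names are hypothetical) might look like this, with the ordering stored as data rather than implied by row position:

# Hypothetical normalized layout for the Function/(x,y) example; SQLite for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Function (
    FunctionId INTEGER PRIMARY KEY,
    Name       TEXT
);
CREATE TABLE FunctionPoint (
    FunctionId INTEGER REFERENCES Function(FunctionId),
    SortOrder  INTEGER,   -- the ordering is part of the data, not the row order
    X          REAL,
    Y          REAL,
    PRIMARY KEY (FunctionId, SortOrder)
);
""")
conn.execute("INSERT INTO Function VALUES (1, 'demo curve')")
conn.executemany("INSERT INTO FunctionPoint VALUES (1, ?, ?, ?)",
                 [(0, 0.0, 0.0), (1, 1.0, 2.5), (2, 2.0, 3.1)])

# The whole list still comes back in one query, in its defined order:
points = conn.execute(
    "SELECT X, Y FROM FunctionPoint WHERE FunctionId = 1 ORDER BY SortOrder"
).fetchall()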
You can just forget SQL altogether and go with a "NoSQL" approach. RavenDB, MongoDB and CouchDB jump to mind as possible solutions. With a NoSQL approach, you are not using the relational model; you aren't even constrained to schemas.
What I have seen many people do is this (it may not be the best approach, correct me if I am wrong):
The table I am using in the example is given below (the table holds nicknames that you have given to your girlfriends; each girlfriend has a unique id):
nicknames(id, seq_no, names)
Suppose you want to store many nicknames under one id. That is why we have included a seq_no field.
Now, fill these values into your table:
(1,1,'sweetheart'), (1,2,'pumpkin'), (2,1,'cutie'), (2,2,'cherry pie')
If you want to find all the names that you have given to the girlfriend with id 1, you can use:
select names from nicknames where id = 1;
Simple answer: If, and only if, you're certain that the list will always be used as a list, then join the list together on your end with a character (such as '\0') that will not be used in the text ever, and store that. Then when you retrieve it, you can split by '\0'. There are of course other ways of going about this stuff, but those are dependent on your specific database vendor.
As an example, you can store JSON in a Postgres database. If your list is text, and you just want the list without further hassle, that's a reasonable compromise.
Others have ventured suggestions of serializing, but I don't really think that serializing is a good idea: part of the neat thing about databases is that several programs written in different languages can talk to one another. And a list serialized using Java's format would not do all that well if a Lisp program wanted to load it.
If you want a good way to do this sort of thing, there are usually array-or-similar types available. Postgres, for instance, offers array as a type and lets you store an array of text if that's what you want; there are similar tricks for MySQL and MS SQL using JSON, and IBM's DB2 offers an array type as well (documented in their own helpful documentation). This would not be so common if there weren't a need for it.
What you do lose by going that road is the notion of the list as a bunch of things in sequence. At least nominally, databases treat fields as single values. But if that's all you want, then you should go for it. It's a value judgement you have to make for yourself.
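If you go the JSON route mentioned above, a hedged sketch of the round trip could look like this (shown with SQLite only because it ships with Python; Postgres would let you use a native json or array column instead). JSON also keeps the stored value readable from any language, which addresses the cross-language concern above.

# Sketch: storing a whole list as JSON text in one column and reading it back.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Function (FunctionId INTEGER PRIMARY KEY, Points TEXT)")

points = [(0.0, 0.0), (1.0, 2.5), (2.0, 3.1)]
conn.execute("INSERT INTO Function VALUES (?, ?)", (1, json.dumps(points)))

stored = conn.execute(
    "SELECT Points FROM Function WHERE FunctionId = 1").fetchone()[0]
print(json.loads(stored))   # [[0.0, 0.0], [1.0, 2.5], [2.0, 3.1]]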
In addition to what everyone else has said, I would suggest you analyze your approach over a longer term than just now. It is currently the case that the items are unique. It is currently the case that re-sorting the items would require a new list. It is almost a requirement that the lists are currently short. Even though I don't have the domain specifics, it is not much of a stretch to think those requirements could change. If you serialize your list, you are baking in an inflexibility that is not necessary in a more normalized design. By the way, that does not necessarily mean a full many-to-many relationship; you could just have a single child table with a foreign key to the parent and a character column for the item.
If you still want to go down the road of serializing the list, you might consider storing it in XML. Some databases, such as SQL Server, even have an XML data type. The only reason I'd suggest XML is that, almost by definition, this list needs to be short. If the list is long, then serializing it in general is an awful approach. If you go the CSV route, you need to account for values containing the delimiter, which means you are compelled to use quoted identifiers. Presuming that the lists are short, it probably will not make much difference whether you use CSV or XML.
If you need to query on the list, then store it in a table.
If you always want the list, you could store it as a delimited list in a column. Even in this case, unless you have VERY specific reasons not to, store it in a lookup table.
Many SQL databases allow a table to contain a subtable as a component. The usual method is to allow the domain of one of the columns to be a table. This is in addition to using some convention like CSV to encode the substructure in ways unknown to the DBMS.
When Ed Codd was developing the relational model in 1969-1970, he specifically defined a normal form that would disallow this kind of nesting of tables. Normal form was later called First Normal Form. He then went on to show that for every database, there is a database in first normal form that expresses the same information.
Why bother with this? Well, databases in first normal form permit keyed access to all data. If you provide a table name, a key value into that table, and a column name, the database will contain at most one cell containing one item of data.
If you allow a cell to contain a list or a table or any other collection, now you can't provide keyed access to the sub items, without completely reworking the idea of a key.
Keyed access to all data is fundamental to the relational model. Without this concept, the model isn't relational. As to why the relational model is a good idea, and what might be the limitations of that good idea, you have to look at the 50 years worth of accumulated experience with the relational model.
I'd just store it as CSV, if it's simple values then it should be all you need (XML is very verbose and serializing to/from it would probably be overkill but that would be an option as well).
Here's a good answer for how to pull out CSVs with LINQ.
One option not yet mentioned in the answers: you can de-normalize your DB design. You need two tables: one table contains the proper list, one item per row; the other table contains the whole list in one column (comma-separated, for example).
Here is the 'traditional' DB design:
List(ListID, ListName)
Item(ItemID, ItemName)
List_Item(ListID, ItemID, SortOrder)
Here is the de-normalized table:
Lists(ListID, ListContent)
The idea here is that you maintain the Lists table using triggers or application code. Every time you modify List_Item content, the appropriate rows in Lists get updated automatically. If you mostly read lists, it can work quite well. Pros: you can read a list in one statement. Cons: updates take more time and effort.
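A sketch of the application-code variant (trigger syntax differs per DBMS, so this simply rebuilds the denormalized row after each change; SQLite is used only for illustration and the helper name is made up):

# Sketch: keep the denormalized Lists row in sync from application code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item      (ItemID INTEGER PRIMARY KEY, ItemName TEXT);
CREATE TABLE List_Item (ListID INTEGER, ItemID INTEGER, SortOrder INTEGER);
CREATE TABLE Lists     (ListID INTEGER PRIMARY KEY, ListContent TEXT);
""")

def refresh_list(conn, list_id):
    # Rebuild the comma-separated copy from the normalized tables.
    names = [row[0] for row in conn.execute(
        """SELECT i.ItemName FROM List_Item li
           JOIN Item i ON i.ItemID = li.ItemID
           WHERE li.ListID = ? ORDER BY li.SortOrder""", (list_id,))]
    conn.execute("INSERT OR REPLACE INTO Lists VALUES (?, ?)",
                 (list_id, ",".join(names)))

conn.executemany("INSERT INTO Item VALUES (?, ?)", [(1, "red"), (2, "green")])
conn.executemany("INSERT INTO List_Item VALUES (1, ?, ?)", [(1, 0), (2, 1)])
refresh_list(conn, 1)   # Lists now holds (1, 'red,green')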
I was very reluctant to choose the path I finally decided to take, because of many of the answers. While they add more understanding of what SQL is and of its principles, I decided to become an outlaw. I was also hesitant to post my findings, as for some people it's more important to vent frustration at someone breaking the rules than to accept that there are very few universal truths.
I have tested it extensively and, in my specific case, it was way more efficient than both using array type (generously offered by PostgreSQL) or querying another table.
Here is my answer:
I have successfully implemented a list in a single field in PostgreSQL by making use of the fixed length of each item of the list. Let's say each item is a color as an ARGB hex value, which means 8 characters. Then you can create your array of at most 10 items by multiplying by the length of each item:
ALTER TABLE product ADD COLUMN color varchar(80);
In case your list items differ in length, you can always fill the padding with \0.
NB: Obviously this is not necessarily the best approach for hex numbers, since a list of integers would consume less storage, but it is just for the purpose of illustrating the idea of an array that allocates a fixed length to each item.
The reasons why:
1/ Very convenient: retrieve item i at substring [i*n, (i+1)*n).
2/ No overhead of cross-table queries.
3/ More efficient and cost-saving on the server side; the list is like a mini blob that the client has to split.
While I respect people who follow the rules, many of the explanations are very theoretical and often fail to acknowledge that, in some specific cases, especially when aiming for cost-optimal, low-latency solutions, some minor tweaks are more than welcome.
"God forbid that it is violating some holy sacred principle of SQL": adopting a more open-minded and pragmatic approach before reciting the rules is always the way to go. Otherwise you might end up like a candid fanatic reciting the Three Laws of Robotics before being obliterated by Skynet.
I don't pretend that this solution is a breakthrough, nor that it is ideal in terms of readability and database flexibility, but it can certainly give you an edge when it comes to latency.
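A small sketch of that fixed-width encoding in application code (8 characters per ARGB value, as in the example above; the function names are made up):

# Sketch: pack/unpack a list of 8-char ARGB hex values into one fixed-width string.
ITEM_LEN = 8

def pack_colors(colors):
    # Right-pad shorter items with \0 so every slot stays exactly ITEM_LEN wide.
    return "".join(c.ljust(ITEM_LEN, "\0") for c in colors)

def get_color(packed, i):
    # Item i lives at substring [i*n, (i+1)*n), as described above.
    return packed[i * ITEM_LEN:(i + 1) * ITEM_LEN].rstrip("\0")

packed = pack_colors(["FF00AA33", "80FFFFFF"])   # value stored in the varchar(80) column
print(get_color(packed, 1))                      # 80FFFFFF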
What I do: if the list to be stored is small, I just convert it to a string and split it later when required.
Example in Python:
b = ["apple", "banana", "cherry"]   # example list to be stored
text1 = ""
for y in b:
    if text1 == "":
        text1 = y
    else:
        text1 = text1 + f"~{y}"     # '~' as delimiter, assuming it never appears in the data
Then, when I need it, I just read the string back from the DB and:
out = query.split('~')   # 'query' holds the string read back from the DB
print(out)
This gives back a list, while only a single string is stored in the DB. But if you are storing a lot of data in the list, then creating a table is the better option.
If you really want to store it in a column and have it queryable, a lot of databases support XML now. If you're not querying it, you can store the values comma-separated and parse them out with a function when you need them separated. I agree with everyone else, though: if you are looking to use a relational database, a big part of normalization is separating data like that. I am not saying that all data fits a relational database, though; you could always look into other types of databases if a lot of your data doesn't fit the model.
I think in certain cases you can create a fake "list" of items in the database. For example, if a piece of merchandise has a few pictures to show its details, you can concatenate all the picture IDs, separated by commas, and store the string in the DB; then you just need to parse the string when you need it. I am working on a website now and I am planning to use this approach.
You can store it as text that looks like a list and create a function that returns its data as an actual list. Example:
database:
 word | letters
------+-------------
 me   | '[m, e]'
 you  | '[y, o, u]'
 for  | '[f, o, r]'
 in   | '[i, n]'
(note that the letters column is of type TEXT)
And the list compiler function (written in Python, but it should be easy to translate to most other programming languages). TEXT here represents the text loaded from the SQL table. The function returns a list of strings parsed from a string containing a list; if you want it to return ints instead of strings, set mode to 'int', and likewise for 'string', 'bool', or 'float'.
def string_to_list(string, mode):
    # Parse a stored text value such as "[m, e]" back into a Python list.
    items = []
    item = ""
    for char in string[1:]:              # skip the leading '['
        if char not in [',', '[', ']']:
            item += char
        else:                            # a delimiter or bracket ends the current item
            if item.strip():
                items.append(item.strip())
            item = ""
    if mode == "string":
        return items
    newItems = []
    if mode == "int":
        for i in items:
            newItems.append(int(i))
    elif mode == "float":
        for i in items:
            newItems.append(float(i))
    elif mode == "bool":
        for i in items:
            if i in ["true", "True"]:
                newItems.append(True)
            elif i in ["false", "False"]:
                newItems.append(False)
            else:
                newItems.append(None)
    else:
        raise ValueError("the 'mode'/second parameter of string_to_list() "
                         "must be one of: 'int', 'string', 'bool', or 'float'")
    return newItems
Also here is a list-to-string function in case you need it.
def list_to_string(lst):
    # Build a text value such as "[m,e]" from a Python list.
    string = "["
    for i in lst:
        string += str(i) + ","
    if string[-1] == ',':
        string = string[:-1] + "]"   # drop the trailing comma
    else:
        string += "]"                # lst was empty
    return string
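For example, a round trip with the two helpers above would look like this:

letters = ["m", "e"]
stored = list_to_string(letters)              # "[m,e]" is what goes into the TEXT column
restored = string_to_list(stored, "string")   # ['m', 'e'] comes back out
print(stored, restored)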
Imagine your grandmother's box of recipes, all written on index cards. Each of those recipes is a list of ingredients, which are themselves ordered pairs of items and quantities. If you create a recipe database, you wouldn't need to create one table for the recipe names and a second table where each ingredient was a separate record. That sounds like what we're saying here. My apologies if I've misread anything.
From Microsoft's T-SQL Fundamentals:
Atomicity of attributes is subjective in the same way that the definition of a set is subjective. As an example, should an employee name in an Employees relation be expressed with one attribute (fullname), two (firstname and lastname), or three (firstname, middlename, and lastname)? The answer depends on the application. If the application needs to manipulate the parts of the employee's name separately (such as for search purposes), it makes sense to break them apart; otherwise, it doesn't.
So, if you needed to manipulate your list of coordinates via SQL, you would need to split the elements of the list into separate records. But if you just wanted to store a list and retrieve it for use by some other software, then storing the list as a single value makes more sense.

Building a MySQL database that can take an infinite number of fields

I am building a MySQL-driven website that will analyze customer surveys distributed by a variety of clients. Generally, these surveys are structured fairly consistently, and most of our clients' data can be reduced to the same normalized database structure.
However, every client inevitably ends up including highly specific demographic questions for their customers that are irrelevant to every other one of our clients. For instance, although all of our clients will ask about customer satisfaction, only our auto clients will ask whether the customers know how to drive manual transmissions.
Up to now, I have been adding columns to a respondents table for all general demographic information, with a lot of default null's mixed in. However, as we add more clients, it's clear that this will end up with a massive number of columns which are almost always null.
Is there a way to do this consistently? I would rather keep as much of the standardized data as possible in the respondents table, since our import script is already written for that table. One thought of mine is to build a respondent_supplemental_demographic_info table that has the columns response_id, demographic_field, demographic_value (so the manual transmissions example might become: 'ID999', 'can_drive_manual_indicator', true). This could hold an infinite number of demographic_fields, but it would be incredibly painful to work with from both a processing and a programming perspective. Any ideas?
Your solution to this problem is called entity-attribute-value (EAV). This "unpivots" columns so they are rows in a table and then you tie them together into a single view.
EAV structures are a bit tricky to learn how to deal with. They require many more joins or aggregations to get a single view out. Also, the types of the values become challenging. Generally there is one value column, so everything is stored as a string. You can, of course, have a type column with different types.
They also take up more space, because the entity id is repeated on each row (I think that is the response_id in your case).
Although not ideal in all situations, they are appropriate in a situation such as the one you describe. You are adding attributes indefinitely, and you would quickly run into the maximum number of columns allowed in a single table (typically between 1,000 and 4,000 depending on the database). You can also keep track of each value in each column separately -- if they are added at different times, for instance, you can keep a timestamp on when they go in.
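A minimal sketch of that structure, using the table and column names already proposed in the question (SQLite shown purely to keep the example self-contained):

# Sketch: EAV-style supplemental demographics, one attribute per row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE respondent_supplemental_demographic_info (
    response_id       TEXT,
    demographic_field TEXT,
    demographic_value TEXT,   -- everything stored as text; add a type column if needed
    PRIMARY KEY (response_id, demographic_field)
)""")
conn.execute(
    "INSERT INTO respondent_supplemental_demographic_info VALUES (?, ?, ?)",
    ("ID999", "can_drive_manual_indicator", "true"))

# Client-specific fields are pulled back out (or pivoted) at query time:
rows = conn.execute("""
    SELECT response_id, demographic_value
    FROM respondent_supplemental_demographic_info
    WHERE demographic_field = 'can_drive_manual_indicator'
""").fetchall()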
Another alternative is to maintain a separate table for each client, and then use some other process to combine the data into a common data structure.
Do not fall for a table with key-value pairs (field id, field value) as that is inefficient.
In your case I would create a table per customer, plus metadata tables (in a separate DB) describing those tables. With that metadata you can generate SQL and so on. That is definitely superior to having many null columns, or to copied and adapted scripts. It requires a bit of programming, where an application uses the metadata to generate SQL, collect the data (without customer-specific semantic knowledge) and generate reports.

separate table for picture items with image field

I store different items (notes, articles, pictures, files) in a single table (there are many metadata in common for all the item types - for example, categories, tags, rating, statistics etc.).
My first design was like this: table Items, plus another "detail" table for each of the item types (NoteItems, ArticleItems, PictureItems etc.). To retrieve a single item, tables must be joined one-to-one (SELECT * FROM Items INNER JOIN PictureItems ON Items.Id = PictureItems.Id WHERE Items.Id = N).
I'm pretty sure that this "by-the-book" design would work nicely (done that several times), however, I start to wonder whether the design is overkill. It would be much simpler to have a single table (Items).
Let's say that there are about 5% of items of picture or file type.
And now, the question: if I go for the (almost) single table design, would it be better to have detail tables for image fields anyway (for picture and file items, of course)?
Scenario 1: only one table: Items (for storing notes, articles, pictures, files...)
Scenario 2: two tables: Items (for storing notes, articles, picture files), ImageItems (for storing only image field of item types picture, file); one-to-one relation
(Scenario 3 would be a minor variation of Scenario 2; with 3 tables (Items, PictureItems, FileItems))
Advantages of scenario 1 are:
simpler select queries (no joins)
transaction-less updates (only one table is updated on INSERT/UPDATE)
performance, scalability due to transaction-less updates?
Advantages of scenario 2 are:
cleaner design
lower data consumption (in scenario 1, about 95% of items of type other than picture or file would have NULL value in the image field, that's about 16 bytes wasted for the pointer)
Which scenario would you choose: 1 (transaction-less updates) or 2 (lower data consumption)? Thanks for your opinions.
If programmers are sensible enough to query only the required columns from the table instead of using "SELECT *", the first design approach looks okay.
You need to take care of indexing, referential constraints, etc. for the second design.
If the database doesn't need to know what's in those items (won't be indexing or searching on them) then option 1 seems the best option (assuming you just have one 'Item' column as a BLOB) - you can just read out the items as binary data and deal with them yourself as appropriate - thus avoiding that inner join.
I don't believe scenario 2 gives you lower data consumption - you can just use a BLOB field (and anyway the overhead of an extra ImageItems table is probably comparable to 16 bytes per row)
So I would personally go for option 1, but of course it depends on how you handle the Items when they come out of the database.
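A sketch of that single-table approach, fetching the binary column only when it is actually needed (illustrative column names; SQLite used only to keep the example self-contained):

# Sketch: one Items table with a nullable BLOB column; fetch the BLOB only on demand.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Items (
    Id        INTEGER PRIMARY KEY,
    ItemType  TEXT,        -- note, article, picture, file, ...
    Title     TEXT,
    ImageData BLOB         -- NULL for the ~95% of rows without binary data
)""")
conn.execute("INSERT INTO Items VALUES (1, 'picture', 'sunset', ?)",
             (b"\x89PNG...",))

# List views avoid SELECT *, so the BLOB never travels with ordinary queries:
print(conn.execute("SELECT Id, ItemType, Title FROM Items").fetchall())

# The binary data is pulled only for the single item being displayed:
blob = conn.execute("SELECT ImageData FROM Items WHERE Id = 1").fetchone()[0]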
The first approach is usually penalized if you are using some sort of ORM or automatically generate your DAL (SubSonic?). You will retrieve the Image column (and its data) every time you pass around a DAL object (or a collection), so normally I would use scenario 2 (or 3).
From the SQL perspective, either scenario will work about the same depending on your storage engine (ISAM, InnoDB, etc.), but even there the benefits and differences between the scenarios are minor.
If you're right that only about 5% of your rows actually have additional image/binary data, then I'd definitely say use the one-table approach, combined with the hint Murthy gave - make sure not to do SELECT * on this table, but only request the columns that you really need, and leave out the BLOB columns as often as possible.
If your database grows, you might also want to check out a separate filegroup for the BLOB data to keep things separated and clean (but that's really only when you're dealing with hundreds of thousands of rows or more, and if you can split the filegroups onto several separate disks).
KISS - Keep It Smart and Simple - whenever possible! :-)
Marc