Modeling a database with many (0,n) relations - SQL

I have a database to manage characters and I would like to make some changes.
Currently:
One character has multiple pieces of equipment, and one piece of equipment belongs to multiple characters.
One piece of equipment has multiple stats, and one stat belongs to multiple pieces of equipment.
We can know the side of a piece of equipment for a character if we know the character and the equipment, so I put "side" on the pivot. The problem is that the side only concerns rings, so my data for "side" looks like:
null | null | null | left | null | right | null |null | null | null
Is it a problem to have an attribute only concerning one type?
Also, I want to add elements for a stat. The problem is that out of something like 30 stats, only 2 are concerned by elements. And for a stat concerned by an element, the character chooses one or more elements. It will probably look like this:
For the moment, I manage the stats like this:
I retrieve all data for a character with his equipment; once I know his equipment I can retrieve all stats (I know their values through "value_stat"). Thanks to Laravel, I do it with eager loading.
Now, to get the "value_stat" I must know id_character, id_equipement, id_stats and id_element, because the element chosen depends on the character. Is that pattern correct? Even with that pattern, in most cases id_element will be NULL because elements only apply to some stats.
Thanks!

Is it a problem to have an attribute only concerning one type?
Reworded: "Is it OK to have an attribute which will be non-NULL for only a subset of rows?
Answer: Yes. This is common in SQL databases. You just have to make sure that your application handles NULL values correctly.
Now to have the "value_stat" I must know id_character, id_equipement, id_stats, id_element because the element chosen depends on the character. Is that pattern correct?
Answer: It depends on whether the actual object model is correct! For example, is the relation between equipment and stat really many-to-many? I mean: can one stat record really refer to many different pieces of equipment? If that's the case, no problem. From an SQL perspective there is nothing wrong with using an arbitrarily complex relationship model. But from a normalization perspective, it must match the real world (in this case: the game logic).
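For illustration, the pivot could look something like this (a minimal sketch; the column names follow the question, while the referenced table names are assumptions):

CREATE TABLE character_equipement_stat (
    id_character  INT NOT NULL,
    id_equipement INT NOT NULL,
    id_stats      INT NOT NULL,
    id_element    INT NULL,        -- only filled for the few stats that have elements
    value_stat    INT NOT NULL,
    FOREIGN KEY (id_character)  REFERENCES characters (id),
    FOREIGN KEY (id_equipement) REFERENCES equipements (id),
    FOREIGN KEY (id_stats)      REFERENCES stats (id),
    FOREIGN KEY (id_element)    REFERENCES elements (id)
);

If a character really can pick several elements for one stat, the element choice could also be split into its own pivot table, so id_element never has to be NULL in this one.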

Related

SQL: List-Field contains sublist

Quick preface: I use the Haskell persistence library persistent together with esqueleto.
Anyway, I want to have a SQL table with a column of type [String], i.e. a list of strings. Now I want to make a query which gives me all the records where a given list is a sublist of the one in the record.
For instance the table with
ID Category
0 ["math", "algebra"]
1 ["personal", "life"]
2 ["algebra", "university", "personal"]
with a query of ["personal", "algebra"] would return only the record with ID=2, since ["personal", "algebra"] is a sublist of ["algebra", "university", "personal"].
Is a query like this possible with a variable-length sought-after sublist and "basic" SQL operators?
If someone knows their way around persistent/esqueleto that would of course be awesome.
Thanks.
Expanding on the comment of Gordon Linoff and the previous answer:
SQL databases are sometimes limited in their power. Since the order of your Strings in [String] does not seem to matter, you are trying to put something like a set into a relational database, and for your query you suggest something like an is a subset of operator.
If there were a database engine that provided those structures, there would be nothing wrong with using it (I don't know of any). However, approximating your set logic (or any logic that is not natively supported by the database) has disadvantages:
You have to explicitly deal with edge cases (cf. xnyhps' answer)
Instead of hiding the complexity of storing data, you need to explicitly deal with it in your code
You need to study the database engine rather than writing your Haskell code
The interface between database and Haskell code becomes blurry
A mightier approach is to reformulate your storage task into something that fits easily into the relational database concept, i.e. try to put it in terms of relations.
Entities and relations are simple, thus you avoid edge cases. You don't need to bother with how exactly the db backend stores your data. You don't have to bother with the database much at all. And your interface is reduced to rather straightforward queries (making use of joins). Everything that cannot be (comparatively) easily realized with a query belongs (probably) in the Haskell code.
Of course, the details differ based on the specific circumstances.
In your specific case, you could use something like this:
Table: Category
ID Description
0 math
1 algebra
2 personal
3 life
4 university
Table: CategoryGroup
ID CategoryID
0 0
0 1
1 2
1 3
2 1
2 4
2 2
... where the foreign key relation allows you to have groups of categories. Here, you are using a relational database where it excels. In order to query for CategoryGroup you would join the two tables, resulting in a result of type
[(Entity CategoryGroup, Entity Category)]
which I would transform in Haskell to something like
[(Entity CategoryGroup, [Entity Category])]
where the Category entities are collected for each CategoryGroup (that requires deriving (Eq, Ord) in your CategoryGroup model).
The set logic described above, for a given list cs :: [Entity Category], would then go like:
import qualified Data.Set as Set
import Data.Set (isSubsetOf)
let s = Set.fromList ["personal", "algebra"]
let s0 = Set.fromList $ map (categoryDescription . entityVal) cs
if s `isSubsetOf` s0 -- ... ?
Getting used to the restrictions of relational databases can be annoying in the beginning. I guess, for something of central importance (persisting data), a robust concept is often better than a mighty one, and it pays off to always know exactly what your database is doing.
By using [String], persistent converts the entire list to a quoted string, making it very hard to work with from SQL.
You can do something like:
mapM_ (\cat ->
  where_ (x ^. Category `like` (%) ++. val (show cat) ++. (%)))
  ["personal", "algebra"]
But this is very fragile (may break when the categories contain ", etc.).
Better approaches are:
You could do the filtering in Haskell if the database is small enough.
It would be much easier to model your data as:
Objects:
ID ...
0 ...
1 ...
2 ...
ObjectCategories:
ObjectID Category
0 math
0 algebra
1 personal
1 life
2 algebra
2 university
2 personal
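With tables like these, the sublist check from the question can be expressed directly in SQL (a sketch; Objects/ObjectCategories and the column names are the illustrative ones above):

SELECT oc.ObjectID
FROM ObjectCategories AS oc
WHERE oc.Category IN ('personal', 'algebra')    -- the sought-after sublist
GROUP BY oc.ObjectID
HAVING COUNT(DISTINCT oc.Category) = 2;         -- 2 = length of the sought-after sublist

For the example data this returns only ObjectID 2, which matches the expected result.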

SSIS: 2 Conditional Splits not working as expected

I am new to SSIS and was going through a use case where I want to implement SCD Type 2 without using the SCD component (that is the requirement), where I have to use more than one Conditional Split and a Lookup. When I use a single Lookup and a single Conditional Split it works like a charm, but the moment I introduce a second Conditional Split it stops working. I have added a data viewer to see the data, but it is not showing any data either.
Can you help me?
My data flow looks like this:
Up to the first Conditional Split everything looks fine, but after inserting the second and third Conditional Splits it stops working.
Remember Occam's razor. Which is more likely, that a product is developed that works with 0 to 1 components but utterly fails with 2 or that your implementation is flawed?
I find that situations like the following are most likely to go against expectations, which I translate into "I haven't read any documentation and am fumbling my way through the UI". That's not intended as an insult, just an approach many, including myself, employ. The trick, of course, is that when presented with something that behaves contrary to expectations, you then consult the Fine Manual.
High level design issues
Your conditional splits do not account for all the possibilities, therefore you're "losing" data. Something your data flow lacks is row counts. How many rows did I start with? How many rows have gone to the various sinks/destinations? At the end of the day/data flow, you must be able to account for all the data. If not, then something's gone awry and your data load is invalid.
I'd also pick nits at your lack of useful component names and your OLE DB Command objects, but that's maintainability and scalability, which is premature optimization when the answers are wrong.
What's likely the cause
Getting down to brass tacks, I'm willing to wager you are losing data to the following conditions in your data:
NULL
case sensitivity
From your path annotations, your second Conditional Split, Conditional Split 1, has 2-3 outputs: Male, Female, and, assuming you didn't rename the Default output to Male or Female, Default.
You state you're losing all the data at this split. It's likely that it's all going to the Default output. I expect you have Expressions in your Conditional Split like:
Male := [GenderColumn] == "Male"
Female := [GenderColumn] == "Female"
However, if your source data contains male, mAle, female, FEMALE and all the permutations in between, you're only ever going to match based on a strict case-sensitive comparison, which none of your data has matched. To resolve this, you would want to compare consistent values.
Here I am arbitrarily converting everything to upper case. LOWER works just as well. The important thing is that they need to result in the same value. I'm also lazy in that I'm applying a function to a constant.
Male := UPPER([GenderColumn]) == UPPER("Male")
Female := UPPER([GenderColumn]) == UPPER("Female")
But wait, what if I have NULLs? Great question: what to do? A NULL value is neither Male nor Female, so how should that data be treated? Right now, it's going down the Default output path. Maybe it should be treated as Male since we have a gender bias in our product. Your business user will likely know what should be done with unknown values, so you should consult them. You would then add in an OR condition, via ||, and test whether the current value of our column is NULL:
Male := UPPER([GenderColumn]) == UPPER("Male") || ISNULL([GenderColumn])
Female := UPPER([GenderColumn]) == UPPER("Female")
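To see which variants (and NULLs) actually exist in the source, a quick profiling query against the source table can help (a sketch; dbo.SourceTable and GenderColumn are placeholder names):

SELECT GenderColumn, COUNT(*) AS cnt
FROM dbo.SourceTable
GROUP BY GenderColumn
ORDER BY cnt DESC;

NULL shows up as its own group here, so you can see at a glance how much data would otherwise fall through to the Default output.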

SQL: How to design best table for my data structure [closed]

I am designing a PostgreSQL database for an online card game. I want to provide users with the option to access their play history - a user can see a log of his games.
The data I want to save has the following structure:
Player Ids | game_data
-----------|--------------
A,B,C,D | ..game log..
A,D,E,F | ..game log..
D,C,A | ..game log..
D,A | ..game log..
Each game can have up to 22 participants, so the max number of players is 22 and the min number of players is 2.
So far I have about 100M records. Every day I add about 500K records. I have about 500K players. A Player-Id is a 32-byte string (MD5).
I want players to be able to access their game_data, so I want a player to be able to select the last XX game_logs by Player-Id. I need this to be as fast as possible. What would be the best way to do it with Postgres? I would prefer to keep all this data in a single table.
So far I am considering two approaches:
Approach 1
Make a field of JSON type, save all Player-Ids in a JSON array, and query the JSON in the SELECT statement.
Approach 2
Make 22 fields in the table, one for each player (if there is no player, the field is NULL), and write an ugly query over all the fields.
So far I don't like either of these approaches. Is there a better way to do it?
Added
A typical request would be: SELECT LAST XXX GAMES FOR PLAYER_ID = 'A'
A user can have many keys, and a key may belong to many users; that's why you should have a third table to contain user-key pairs. You should keep the JSON value of each key in the keys table.
Here is how it should be.
I would use a table with a serial field (just to follow the convention), an array of integers to store the keys and a json field to hold the data.
You can add indexes to array columns in PostgreSQL.
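A minimal sketch of that idea (assuming players get integer ids; all names here are illustrative):

CREATE TABLE games (
    id         serial    PRIMARY KEY,
    player_ids integer[] NOT NULL,    -- the 2 to 22 participants
    game_data  json      NOT NULL
);

-- A GIN index makes array containment checks fast.
CREATE INDEX games_player_ids_idx ON games USING gin (player_ids);

-- Last 20 games for player 42: @> tests array containment.
SELECT id, game_data
FROM games
WHERE player_ids @> ARRAY[42]
ORDER BY id DESC
LIMIT 20;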
You will really regret approach number 2. It's hell to check every USER1...22 field for playerid = X when retrieving GameData per user. I have seen it used many times in systems designed for easy data entry with little thought to getting the data out. Your SQL or other code will be brittle and you will be loath to write test code.
Do you really need to keep it to just one table? The standard ("normalized") many-to-many approach can be really fast with proper db indexing and tuning. Use integer keys when possible. Call that option #3 (e.g. make a many-to-many GameUsers table with two columns, user_id and game_id).
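A sketch of option #3 (assuming a games table holding the game_data; all names are illustrative):

CREATE TABLE game_users (
    game_id integer NOT NULL REFERENCES games (id),
    user_id integer NOT NULL,
    PRIMARY KEY (game_id, user_id)
);

-- Index leading on user_id to serve the "games for one player" lookup.
CREATE INDEX game_users_user_idx ON game_users (user_id, game_id);

-- Last 20 games for player 42.
SELECT g.id, g.game_data
FROM games AS g
JOIN game_users AS gu ON gu.game_id = g.id
WHERE gu.user_id = 42
ORDER BY g.id DESC
LIMIT 20;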
I have a similar situation, and do BOTH #1 and the many-to-many tables. For your solution #1, instead of JSON I insert a delimited list of user names in a text field. I store these user_ids (which in my case are text, which I don't like either) comma-delimited (my customer likes commas); however, I add leading and trailing commas like this: ",A,D,E,F,"
The simplistic SQL:
select game_log from game_data where user_list LIKE ('%,D,%')
or, if user_id is a variable or column:
select game_log from game_data where user_list LIKE ('%,' || user_id || ',%')
You need both delimiters because user names can overlap (e.g. "Mirko" and "Mirkota"), and you don't have to waste time checking for cases where it's at the beginning or end of the list. Of course you must use a delimiter not allowed in user ids, and make sure to strip out this delimiter (and other forbidden characters) from user input data to avoid SQL injection.
The big downside of doing both is keeping them in sync, but given the data in the many-to-many tables (approach #3) you can re-generate the user list using string_agg(expression, delimiter) and concatenate the extra delimiters, as sketched below.
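For example, the delimited list could be rebuilt from the many-to-many table like this (a sketch, reusing the illustrative game_users names from above):

SELECT game_id,
       ',' || string_agg(user_id::text, ',' ORDER BY user_id) || ',' AS user_list
FROM game_users
GROUP BY game_id;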
I don't think that serial or "array" fields are going to help here, even with positional indexing. You still have to search each position in the array, and the vast majority of your 22-length arrays will be mostly empty.
In my case I am doing social network analysis, so I need to know when users are together, and using multiple LIKE conditions is faster than multiple joins.

Standard use of 'Z' instead of NULL to represent missing data?

Outside of the argument of whether or not NULLs should ever be used: I am responsible for an existing database that uses NULL to mean "missing or never entered" data. It is different from empty string, which means "a user set this value, and they selected 'empty'."
Another contractor on the project is firmly on the "NULLs do not exist for me; I never use NULL and nobody else should, either" side of the argument. However, what confuses me is that since the contractor's team DOES acknowledge the difference between "missing/never entered" and "intentionally empty or indicated by the user as unknown," they use a single character 'Z' throughout their code and stored procedures to represent "missing/never entered" with the same meaning as NULL throughout the rest of the database.
Although our shared customer has asked for this to be changed, and I have supported this request, the team cites this as "standard practice" among DBAs far more advanced than I; they are reluctant to change to use NULLs based on my ignorant request alone. So, can anyone help me overcome my ignorance? Is there any standard, or small group of individuals, or even a single loud voice among SQL experts which advocates the use of 'Z' in place of NULL?
Update
I have a response from the contractor to add. Here's what he said when the customer asked for the special values to be removed to allow NULL in columns with no data:
Basically, I designed the database to avoid NULLs whenever possible. Here is the rationale:
• A NULL in a string [VARCHAR] field is never necessary because an empty (zero-length) string furnishes exactly the same information.
• A NULL in an integer field (e.g., an ID value) can be handled by using a value that would never occur in the data (e.g, -1 for an integer IDENTITY field).
• A NULL in a date field can easily cause complications in date calculations. For example, in logic that computes date differences, such as the difference in days between a [RecoveryDate] and an [OnsetDate], the logic will blow up if one or both dates are NULL -- unless an explicit allowance is made for both dates being NULL. That's extra work and extra handling. If "default" or "placeholder" dates are used for [RecoveryDate] and [OnsetDate] (e.g., "1/1/1900") , mathematical calculations might show "unusual" values -- but date logic will not blow up.
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
In my 15 years as a DBA, I've found it best to avoid NULLs wherever possible.
This seems to validate the mostly negative reaction to this question. Instead of applying an accepted 6NF approach to designing out NULLs, special values are used to "avoid NULLs wherever possible." I posted this question with an open mind, and I am glad I learned more about the "NULLs are useful / NULLs are evil" debate, but I am now quite comfortable labeling the 'special values' approach to be complete nonsense.
an empty (zero-length) string furnishes exactly the same information.
No, it doesn't; in the existing database we are modifying, NULL means "never entered" and empty string means "entered as empty".
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
Yes, but those mistakes have been made thousands of times by thousands of developers, and the lessons and caveats for avoiding those mistakes are known and documented. As has been mentioned here: whether you accept or reject NULLs, representation of missing values is a solved problem. There is no need to invent a new solution just because developers continue to make easy-to-overcome (and easy-to-identify) mistakes.
As a footnote: I have been a DBE and developer for more than 20 years (which is certainly enough time for me to know the difference between a database engineer and a database administrator). Throughout my career I have always been in the "NULLs are useful" camp, though I was aware that several very smart people disagreed. I was extremely skeptical about the "special values" approach, but not well-versed enough in the academics of "How To Avoid NULL the Right Way" to make a firm stand. I always love learning new things, and I still have lots to learn after 20 years. Thanks to all who contributed to make this a useful discussion.
Sack your contractor.
Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.
I would actually contend that using 'Z' or any other placeholder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)
At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conforms to standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, mis-understanding, and in the end increased costs and mistakes.
EDIT
There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.
I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.
This still can change behaviour from what may be expected, and so should be used with care:
WHERE end-date IS NULL no longer gives data that is still valid
You just created your own millennium bug
etc.
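For illustration of that trade-off, here is the simplification that such placeholder dates buy, and what the NULL version looks like instead (a sketch; the table and column names are made up):

-- With sentinel dates, a validity check needs no NULL handling:
SELECT *
FROM membership
WHERE '2015-06-01' BETWEEN start_date AND end_date;

-- With NULLable bounds, the open ends must be spelled out:
SELECT *
FROM membership
WHERE (start_date IS NULL OR start_date <= '2015-06-01')
  AND (end_date IS NULL OR end_date >= '2015-06-01');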
This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.
Still, sack the contractor.
This is easily one of the weirdest opinions I've ever heard. Using a magic value to represent "no data" rather than NULL means that every piece of code that you have will have to post-process the results to account/discard the "no-data"/"Z" values.
NULL is special because of the way that the database handles it in queries. For instance, take these two simple queries:
select * from mytable where name = 'bob';
select * from mytable where name != 'bob';
If name is ever NULL, it obviously won't show up in the first query's results. More importantly, neither will it show up in the second query's results. NULL doesn't match anything other than an explicit search for NULL, as in:
select * from mytable where name is NULL;
And what happens when the data could have Z as a valid value? Let's say you're storing someone's middle initial? Would Zachary Z Zonkas be lumped in with those people with no middle initial? Or would your contractor come up with yet another magic value to handle this?
Avoid magic values that require you to implement database features in code that the database is already fully capable of handling. This is a solved and well understood problem, and it may just be that your contractor never really grokked the notion of NULL and therefore avoids using it.
If the domain allows missing values, then using NULL to represent 'undefined' is perfectly OK (that's what it is there for). The only downside is that code that consumes the data has to be written to check for NULLs. This is the way I've always done it.
I have never heard of (or seen in practice) the use of 'Z' to represent missing data. As to "the contractor cites this as 'standard practice' among DBAs", can he provide some evidence of that assertion? As @Dems mentioned, you also need to document that 'Z' doesn't mean 'Z': what about a MiddleInitial column?
Like Aaron Alton and many others, I believe that NULL values are an integral part of database design, and should be used where appropriate.
Even if you somehow manage to explain to all your current and future developers and DBAs about "Z" instead of NULL, and even if they code everything perfectly, you will still confuse the optimizer because it will not know that you've cooked this up.
Using a special value to represent NULL (which is itself already a special value representing missing data) will result in skews in the data. E.g. so many things happened on 1-Jan-1900 that it will throw off the optimizer's ability to understand the actual range of dates that really are relevant to your application.
This is like a manager deciding: "Wearing a tie is bad for productivity, so we're all going to wear masking tape around our necks. Problem solved."
I've never heard of the widespread use of 'Z' as a substitute for NULL.
(Incidentally, I'd not particularly like to work with a contractor who tells you in the face that they and other "advanced" DBAs are so much more knowledgeable and better than you.)
+=================================+
|         FavoriteLetters         |
+=================================+
|    Person    |  FavoriteLetter  |
+--------------+------------------+
| 'Anna'       | 'A'              |
| 'Bob'        | 'B'              |
| 'Claire'     | 'C'              |
| 'Zaphod'     | 'Z'              |
+---------------------------------+
How would your contractor interpret the data from the last row?
Probably he would choose a different "magic value" in this table to avoid collision with the real data 'Z'? Meaning you'd have to remember several magic values and also which one is used where... how is this better than having just one magic token NULL, and having to remember the three-valued logic rules (and pitfalls) that go with it? NULL at least is standardized, unlike your contractor's 'Z'.
I don't particularly like NULL either, but mindlessly substituting it with an actual value (or worse, with several actual values) everywhere is almost definitely worse than NULL.
Let me repeat my above comment here for better visibility: If you want to read something serious and well-grounded by people who are against NULL, I would recommend the short article "How to handle missing information without using NULLs" (links to a PDF from The Third Manifesto homepage).
Nothing in principle requires nulls for correct database design. In fact there are plenty of databases designed without using null and there are plenty of very good database designers and whole development teams who design databases without using nulls. In general it's a good thing to be cautious about adding nulls to a database because they inevitably lead to incorrect or ambiguous results later on.
I've not heard of using Z being called "standard practice" as a placeholder value instead of nulls, but I expect your contractor is referring to the concept of sentinel values in general, which are sometimes used in database design. However, a much more common and flexible way to avoid nulls without using "dummy" data is simply to design them out. Decompose the table such that each type of fact is recorded in a table that doesn't have "extra", unspecified attributes.
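For example, instead of one table with a nullable column, the optional fact can get its own table (a sketch with made-up names):

-- Instead of: Person (PersonID, Name, MiddleInitial NULL)
CREATE TABLE Person (
    PersonID int          NOT NULL PRIMARY KEY,
    Name     varchar(100) NOT NULL
);

-- A middle initial is only recorded for people who actually have one.
CREATE TABLE PersonMiddleInitial (
    PersonID      int     NOT NULL PRIMARY KEY REFERENCES Person (PersonID),
    MiddleInitial char(1) NOT NULL
);

A query that needs the middle initial joins the two tables; rows simply don't exist where the fact is unknown, so no NULL and no sentinel value is needed.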
In reply to the contractor's comments:
Empty string <> NULL
An empty string requires 2 bytes of storage + an offset read
NULL uses the null bitmap = quicker
IDENTITY doesn't always start at 1 (why waste half your range?)
The whole concept is flawed as per most other answers here
While I have never seen 'Z' as a magic value to represent null, I have seen 'X' used to represent a field that has not been filled in. That said, I have only ever seen this in one place, and my interface to it was not a database, but rather an XML file… so I would not be prepared to use this as an argument for it being common practice.
Note that we do have to handle the 'X' specially, and, as Dems mentioned, we do have to document it, and people have been confused by it. In our defence, this is forced on us by an external supplier, not something that we cooked up ourselves!

Table Design For SystemSettings, Best Model

Someone suggested changing a table full of settings, where each column is a setting name (or type) and the rows are the customers and their respective values for each setting:
ID | IsAdmin | ImagePath
---|---------|----------------
12 | 1       | \path\to\images
34 | 0       | \path\to\images
The downside to this is that every time we want a new setting name (or type), we have to alter the table (via SQL) to add the new setting name/type as a column, and then update the rows (so that each customer now has a value for that setting).
The new table design proposal is to have one column for the setting name and another column for the setting value:
ID | SettingName | SettingValue
---|-------------|----------------
12 | IsAdmin     | 1
12 | ImagePath   | \path\to\images
34 | IsAdmin     | 0
34 | ImagePath   | \path\to\images
The point they made was that adding a new setting is as easy as a simple INSERT statement, with no added column.
But something doesn't feel right about the second design; it looks bad, but I can't come up with any arguments against it. Am I wrong?
This is a variation of an "Entity Attribute Value" schema (Joel and random SO question).
It has a few pros and more cons, and is pretty much guaranteed to end in tears.
The second approach actually resembles a dictionary. I found this to be a more convenient choice for an application I am working on for the reasons you mentioned. There are a few caveats to this approach, so you need to be careful about them:
Keep your key strings static, never rename.
Make sure each time the settings dictionary is retrieved you update it to the newest version (usually by adding keys and setting default values/prompting the user).
It's tricky to mix string and e.g. decimal data; you'll either need to choose one or provide multiple nullable columns so you can store data in the appropriate format (see the sketch after this list). Keep that metadata around somewhere.
The code that deals with the dictionary should wrap it in a strongly typed fashion, never expose it as a real dictionary (in the sense of a datastructure), provide a class instead.
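A minimal sketch of such a settings table, with one nullable column per supported data type (all names are illustrative):

CREATE TABLE CustomerSettings (
    CustomerID   int           NOT NULL,
    SettingName  varchar(100)  NOT NULL,   -- keep these key strings static, never rename
    StringValue  varchar(255)  NULL,
    DecimalValue decimal(18,4) NULL,
    BoolValue    bit           NULL,       -- exactly one *Value column is non-NULL per row
    PRIMARY KEY (CustomerID, SettingName)
);

The wrapping class mentioned above would pick the right column based on the stored metadata for each setting name.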
Using column names to distinguish settings is usually a terrible idea. The entity you are dealing with is the SETTING, and it has the attributes NAME and VALUE. If you need to use the same name in different contexts, make the SETTING hierarchical, i.e. each setting except the root gets a parent. Your customers could then have the root as their parent, and the path under each customer would be the same for each setting. You can use different columns for additional data types if you want.
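A sketch of that hierarchical variant (illustrative names; only the root row has a NULL parent):

CREATE TABLE Setting (
    SettingID int          NOT NULL PRIMARY KEY,
    ParentID  int          NULL REFERENCES Setting (SettingID),  -- NULL for the root only
    Name      varchar(100) NOT NULL,
    Value     varchar(255) NULL
);

-- e.g. root -> customer 12 -> IsAdmin = '1'
-- (1, NULL, 'root', NULL), (2, 1, 'Customer 12', NULL), (3, 2, 'IsAdmin', '1')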