Is the name IsFemale inappropriate for a database column? - naming-conventions

Just browsing our db schema and found a field named 'IsFemale'
Is the name good, or is it kinda laughable?

You should stick with the ISO standard if at all possible.
ISO/IEC 5218 Information technology — Codes for the representation of human sexes is an international standard that defines a representation of human sexes through a language-neutral single-digit code. ...
The four codes specified in ISO/IEC 5218 are:
0 = not known,
1 = male,
2 = female,
9 = not applicable.
The standard specifies that its use may be referred to by the designator "SEX".
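For application code that reads such a column, the four codes can be mirrored in a small enumeration. This is only a sketch: the class name, the member names and the choice to map a missing value to "not known" are assumptions, not part of the standard.

```python
from enum import IntEnum


class Iso5218Sex(IntEnum):
    """The four codes listed above; class and member names are illustrative."""
    NOT_KNOWN = 0
    MALE = 1
    FEMALE = 2
    NOT_APPLICABLE = 9


def parse_sex_code(raw):
    """Map a stored value to a code, treating missing input as 'not known' (an assumption)."""
    if raw is None:
        return Iso5218Sex.NOT_KNOWN
    return Iso5218Sex(int(raw))  # raises ValueError for codes outside the four above
```

Rejecting unlisted codes up front keeps stray values such as 3 from silently entering the data.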

Female and male are not mutually exclusive, so you'll have to come up with something for transsexuals, unisex, etc.
To make this as enterprisey as possible, create a GenderTypeID column:
GenderTypes
-----------
GenderTypeID  Name     Greeting
1             Male     Dear Sir
2             Female   Dear Madam
3             Unisex   Dear Sir and Madam
4             Unknown  Dear Sir or Madam
5             Android  Dear Artificial Life Form
... and so on.
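A minimal sketch of how such a lookup table might be wired up, using Python's built-in sqlite3 module so it can be run anywhere. The Person table and its columns are hypothetical, and only a few of the rows above are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE GenderTypes (
        GenderTypeID INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL UNIQUE,
        Greeting     TEXT NOT NULL
    );
    -- Person is a hypothetical table that refers to the lookup table.
    CREATE TABLE Person (
        PersonID     INTEGER PRIMARY KEY,
        FullName     TEXT NOT NULL,
        GenderTypeID INTEGER NOT NULL REFERENCES GenderTypes (GenderTypeID)
    );
    INSERT INTO GenderTypes VALUES
        (1, 'Male', 'Dear Sir'),
        (2, 'Female', 'Dear Madam'),
        (4, 'Unknown', 'Dear Sir or Madam');
    INSERT INTO Person VALUES (1, 'Sam Allen', 4);
""")

# The greeting comes from the lookup row, not from code that branches on a flag.
greeting = conn.execute("""
    SELECT g.Greeting
    FROM Person AS p
    JOIN GenderTypes AS g ON g.GenderTypeID = p.GenderTypeID
    WHERE p.PersonID = 1
""").fetchone()[0]
```

Adding a new type is then an INSERT into GenderTypes rather than a schema change, and the foreign key rejects IDs that have no lookup row.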

Maybe naming the column "gender" (char with 'M', 'F') would be more "sensitive".

Well, the typical thing is to have a "sex" column, but you may end up with clueless clients trying to fill it with values such as "twice a week".
Another problem is that it's language dependent. For example, in English M means Man, while in Spanish it may mean Mujer (woman).

isFemale indicates a bigger problem with your schema: something like that should be generalized, or possibly even normalized out.
For example, you could have a sex column on your table that is an FK to a sex table:
---------------------
| ID | Type         |
|-------------------|
| 1  | Male         |
| 2  | Female       |
| 3  | Yes Please   |
---------------------
Note: don't actually do this, it's silly, unless you plan on supporting unusual genders. I still think a generic column is better than an isFemale bit, though.

It is clear the questioner is concerned (rightly) about the failure of the database designer to take into account value-neutral language.
(Please be aware that politically correct is (rightly) no longer considered accepted, value-neutral language.)
As a computer designer, you have a particular obligation to ensure your designs do not, inadvertently or not, include or propagate gender-preference or superiority.
While the designer may have naively presumed IsFemale would give females a 1 and therefore higher/superior value, true values are often given the value of -1. Not to mention cultures where 0 is a sacred value.
In next week's installment, we'll cover people who are intersex and queer theory and its implications for variable naming standards.

Are you only ever going to have to check for "IsFemale" true/false?
Wouldn't a column like "PersonType" or something like that be more appropriate? That way, you could have "female", "male", "company" and so forth - more possible values.
Marc
PS: but if you choose to use a "bit" (boolean) column, then the "Is" or "Has" prefix is a good choice, in my opinion - makes it quite clear it's a boolean!

What is wrong with the classic "sex" and supporting sub types such as M,F, etc...

Every database I've ever worked with has used a column name of Gender with values 0 for Female and 1 for Male. I've always assumed these values were assigned in much the same way that electronics equipment has connectors that are described as being female or male.
Whether or not IsFemale is laughable depends on the intent of the system; however, it does seem to have painted the application into a corner. A Gender field, for instance, can grow to accommodate additional types, but IsFemale is only ever going to be true or false and is therefore not at all extensible.

Gender seems a better choice, but if you want or need to use a boolean column, there are only two choices - IsMale or IsFemale.

If you're asking if it's laughable, that depends. Would isMale be better, you chauvinist? (Just kidding!)
My guess is that whoever designed the database thought of special business rules for females and designed it this way.
There really isn't a reason why male should be true and female should be false, and it is possible that your database is less efficient in using a character and character comparison.
From a programming point of view, avoiding booleans in tables makes sense for the same reasons that avoiding booleans in function parameters makes sense.

For simple systems where there is no sexual ambiguity I use IsMale. The alternative would be to use a lookup table if your requirements include intersexed individuals. If you only have males and females, using anything other than a boolean value introduces unnecessary complexity and ambiguity to the system.

Related

split a fullname into first, middle, last and suffix

I have full_name column data like
|Victoria Brown |
|Sam Allen JR |
|Ray M James III |
I want to split it into first name, last name and so on, based on the number of spaces.
Here is what I did, but the last CASE statement comes out wrong: it still picks up the suffix when there are 3 spaces. I also need to combine them into one column, please.
This is unfortunately a lot more complex than it may first seem, and how you handle it may have to do largely with the end goal for the data.
Here's a post that covers this same issue in fairly great depth -
SQL: parse the first, middle and last name from a fullname field
As Gilbert pointed out, some names are just different, and it will be hard to get everything right, but there are certainly things you can do to limit errors.
One of the better pieces of advice from that article would be to alter your collection method to get First/Middle/Last/Suffixes/Prefixes entered separately and join them after, rather than try to parse through richer text that contains them all.
Here is a function that you could get creative with - https://blog.seandaylor.com/sql-server-split-part/
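To illustrate the kind of rule involved, here is a rough sketch in Python rather than SQL. It assumes names are separated by whitespace and that suffixes come from a short, known list; both the list and the function name are hypothetical, and real-world names will break these assumptions.

```python
KNOWN_SUFFIXES = {"JR", "SR", "II", "III", "IV"}  # assumed list, extend as needed


def split_full_name(full_name):
    """Split on whitespace into first, middle, last and suffix (empty if absent)."""
    parts = full_name.split()
    suffix = ""
    # Peel off a trailing suffix first so it is never mistaken for the last name.
    if len(parts) > 2 and parts[-1].rstrip(".").upper() in KNOWN_SUFFIXES:
        suffix = parts.pop()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return {"first": first, "middle": middle, "last": last, "suffix": suffix}
```

Checking for the suffix before counting the remaining parts is what keeps "Sam Allen JR" from being read as first, middle and last name.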

Modeling database with many (0,n) relations

I have a database to manage characters and I would like to add some changes:
Now:
One character has multiple pieces of equipment, and one piece of equipment belongs to multiple characters.
One piece of equipment has multiple stats, and one stat belongs to multiple pieces of equipment.
We can know the side of a piece of equipment for a character if we know the character and the equipment, so I put "side" on the pivot. The problem is that the side only concerns rings, so my data for side looks like
null | null | null | left | null | right | null |null | null | null
Is it a problem to have an attribute only concerning one type?
Also, I want to add elements to a stat. The problem is that out of something like 30 stats, only 2 are concerned by elements. And for a stat concerned by an element, the character chooses one or more elements. It will probably look like this:
For the moment, I manage the stats like this:
I retrieve all data for a character with his equipment; once I know his equipment I can retrieve all stats (I know their value through "value_stat"). Thanks to Laravel, I do it with eager loads.
Now to have the "value_stat" I must know id_character, id_equipement, id_stats, id_element because the element chosen depends on the character. Is that pattern correct? Even with that pattern, in most cases id_element will be set to null because element only applies to some stats.
Thanks!
Is it a problem to have an attribute only concerning one type?
Reworded: "Is it OK to have an attribute which will be non-NULL for only a subset of rows?"
Answer: Yes. This is common in SQL databases. You just have to make sure that your application handles NULL values correctly.
Now to have the "value_stat" I must know id_character, id_equipement, id_stats, id_element because the element chosen depends on the character. Is that pattern correct?
Answer: It depends on whether the actual object model is correct! For example, is the relation between equipment and stat really many-to-many? I mean, can one stat record really refer to many different pieces of equipment? If that's the case, no problem. From an SQL perspective there is nothing bad about using an arbitrarily complex relationship model. But from a normalization perspective, it must match the real world (in this case: the game logic).

SSIS 2 Conditional Split not working as Expected

I am new to SSIS and am working through a use case where I want to implement SCD type 2 without using the SCD component (that is the requirement), so I have to use more than one conditional split and a Lookup. When I use a single lookup and a single conditional split it works like a charm, but the moment I introduce a second conditional split it stops working. I added a data viewer to inspect the data, but it shows no data either.
Can you help me?
My data flow looks like this:
Up to the first conditional split everything looks fine, but after inserting the second and third conditional splits it is not working.
Remember Occam's razor. Which is more likely, that a product is developed that works with 0 to 1 components but utterly fails with 2 or that your implementation is flawed?
I find that the following situations are most likely to go against expectations which I translate into "I haven't read any documentation and am fumbling my way through the UI". That's not intended as an insult, just an approach many, including myself, employ. The trick of course, is when presented with something that behaves contrary, you then consult the Fine Manual.
High level design issues
Your conditional splits do not account for all the possibilities, therefore you're "losing" data. Something your data flow lacks, is row counts. How many rows did I start with? How many rows have gone to the various sinks/destinations? At the end of the day/data flow, you must be able to account for all the data. If not, then something's gone awry and your data load is invalid.
I'd also pick nits at your lack of useful component names and at the OLE DB Command objects, but that's maintainability and scalability, which is premature optimization when the answers are wrong.
What's likely the cause
Getting down to brass tacks, I'm willing to wager you are losing data to the following conditions in your data
NULL
case sensitivity
From your path annotations, your second Conditional Split 1 has 2-3 outputs. Male, Female, and assuming you didn't rename the Default output to Male or Female, Default.
You state you're losing all the data at this split. It's likely that it's all going to the Default output. I expect you have expressions in your conditional split like
Male := [GenderColumn] == "Male"
Female := [GenderColumn] == "Female"
However, if your source data contains male, mAle, female, FEMALE and all the permutations in between, you're only ever going to match on a strict case-sensitive comparison, which none of your data has satisfied. To resolve this, you would want to compare consistent values.
Here I am arbitrarily converting everything to upper case. LOWER works just as well. The important thing is that they need to result in the same value. I'm also lazy in that I'm applying a function to a constant.
Male := UPPER([GenderColumn]) == UPPER("Male")
Female := UPPER([GenderColumn]) == UPPER("Female")
But wait, what if I have NULLs? Great question, wat do? A NULL value is neither Male nor Female, so how should that data be treated? Right now, it's going down the Default output path. Maybe it should be treated as Male since we have a gender bias in our product. Your business user will likely know what should be done with unknown values, so you should consult them. You would then add an OR condition, via ||, that tests whether the current value of the column is NULL.
Male := UPPER([GenderColumn]) == UPPER("Male") || ISNULL([GenderColumn])
Female := UPPER([GenderColumn]) == UPPER("Female")
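The same routing and row-count reconciliation can be prototyped outside SSIS. The sketch below is plain Python, not SSIS expression syntax; the function names are hypothetical, None stands in for NULL, and trimming whitespace is an extra assumption that the expressions above do not make. Unlike the last expression above, it leaves NULLs on the Default output until the business decides where they belong.

```python
from collections import Counter


def route_gender(value):
    """Return the output name a row would take; None stands in for a NULL."""
    if value is None:
        return "Default"  # placeholder: ask the business where NULLs belong
    normalized = value.strip().upper()  # trimming is an added assumption
    if normalized == "MALE":
        return "Male"
    if normalized == "FEMALE":
        return "Female"
    return "Default"


def count_outputs(values):
    """Tally rows per output so the totals can be reconciled with the input."""
    counts = Counter(route_gender(v) for v in values)
    assert sum(counts.values()) == len(values), "rows were lost in the split"
    return counts
```

Running the tally against a sample of the source column shows quickly whether everything is falling through to Default.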

Data Quality - Is SOUNDEX the solution?

I work for an organization that has a serious data quality problem with names. There are fifteen databases that contain information about people. For example:
Database 1
Name=Fre&d Blo-ggs DOB 01/01/1980
Database 2
Name=Freddy Bloggs DOB 01/01/1980
If a user searches for Fred Bloggs using my search tool then I want both records to be found. I was thinking about something like this:
SELECT * FROM Person WHERE Soundex('Fred Bloggs') = Soundex('Fre&d Blo-ggs')
Is it advisable to use Soundex like this rather than using replace statements like this:
select Replace(Replace(Replace(Name, ',', ''), '&', ''), '#', '') from Person
where Replace(Replace(Replace(Name, ',', ''), '&', ''), '#', '') = @Name
@Name is the variable passed in. Is there a better way of doing it, e.g. using regular expressions? Does Soundex affect performance?
Nice idea. I would not suggest using it, though. I suppose that "John Right" is not the same as "John Write", even though they sound the same. In the end, what matters is what you want to compare: if you want to check whether two names sound the same, then SOUNDEX is fine.
However, I would suggest correcting your data somehow. This would be a real solution, although I can imagine that is not an easy one.
Hope I helped!
Whether soundex is better than regex depends on your data. For example, there are different soundex versions for different languages. You have to check against your data which works better.
Of course soundex affects performance, as does any other additional function you call. If performance becomes a problem, I would advise adding an extra column holding the precomputed soundex or normalized name and creating an index on it.
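A small sketch of the precomputed-column idea, using Python's sqlite3 module for portability. The NameKey column, the index name and the normalization rule (lower-case, ASCII letters and spaces only) are assumptions chosen for the example; accented letters would need a more careful rule.

```python
import re
import sqlite3


def normalize_name(name):
    """Lower-case the name and drop everything except ASCII letters and spaces."""
    letters_only = re.sub(r"[^a-z ]", "", name.lower())
    return " ".join(letters_only.split())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Name TEXT NOT NULL, NameKey TEXT NOT NULL)")
conn.execute("CREATE INDEX IX_Person_NameKey ON Person (NameKey)")

# The key is computed once at write time instead of on every search.
for raw in ("Fre&d Blo-ggs", "Freddy Bloggs"):
    conn.execute("INSERT INTO Person VALUES (?, ?)", (raw, normalize_name(raw)))


def find_by_name(search_term):
    """Look up rows whose stored key equals the normalized search term."""
    rows = conn.execute(
        "SELECT Name FROM Person WHERE NameKey = ?", (normalize_name(search_term),)
    )
    return [row[0] for row in rows]
```

Searching for "Fred Bloggs" finds the "Fre&d Blo-ggs" row through the index, but not "Freddy Bloggs": stripping punctuation does nothing for nicknames, so those still need some form of fuzzy matching.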
From my own experience, a normalized or simplified search criterion, e.g. parts of the surname, first name and month of birth, should be sufficient to find all candidate persons, but not too many, so the user can decide which person (s)he really wants to choose.
Soundex won't help you. You will get stuck if a consonant appears in the name by mistake.
It's better to go for string distance and specify a percentage: a kind of fuzzy matching.
Have a look at the below link for fuzzy matching using levenshtein edit distance algorithm.
Levenshtein edit distance - MS SQL SERVER
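For reference, the edit distance itself is short to write down. The following is a generic Python sketch of the standard dynamic-programming algorithm, not the SQL Server implementation from the link; the similarity helper and any cut-off applied to it are assumptions to be tuned against real data.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score from 0.0 to 1.0; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

With this scoring, "Fred Bloggs" and "Fre&d Blo-ggs" are two edits apart, so a cut-off of around 80% would treat them as a match.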

Standard use of 'Z' instead of NULL to represent missing data?

Outside of the argument of whether or not NULLs should ever be used: I am responsible for an existing database that uses NULL to mean "missing or never entered" data. It is different from empty string, which means "a user set this value, and they selected 'empty'."
Another contractor on the project is firmly on the "NULLs do not exist for me; I never use NULL and nobody else should, either" side of the argument. However, what confuses me is that since the contractor's team DOES acknowledge the difference between "missing/never entered" and "intentionally empty or indicated by the user as unknown," they use a single character 'Z' throughout their code and stored procedures to represent "missing/never entered" with the same meaning as NULL throughout the rest of the database.
Although our shared customer has asked for this to be changed, and I have supported this request, the team cites this as "standard practice" among DBAs far more advanced than I; they are reluctant to change to use NULLs based on my ignorant request alone. So, can anyone help me overcome my ignorance? Is there any standard, or small group of individuals, or even a single loud voice among SQL experts which advocates the use of 'Z' in place of NULL?
Update
I have a response from the contractor to add. Here's what he said when the customer asked for the special values to be removed to allow NULL in columns with no data:
Basically, I designed the database to avoid NULLs whenever possible. Here is the rationale:
• A NULL in a string [VARCHAR] field is never necessary because an empty (zero-length) string furnishes exactly the same information.
• A NULL in an integer field (e.g., an ID value) can be handled by using a value that would never occur in the data (e.g, -1 for an integer IDENTITY field).
• A NULL in a date field can easily cause complications in date calculations. For example, in logic that computes date differences, such as the difference in days between a [RecoveryDate] and an [OnsetDate], the logic will blow up if one or both dates are NULL -- unless an explicit allowance is made for both dates being NULL. That's extra work and extra handling. If "default" or "placeholder" dates are used for [RecoveryDate] and [OnsetDate] (e.g., "1/1/1900") , mathematical calculations might show "unusual" values -- but date logic will not blow up.
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
In my 15 years as a DBA, I've found it best to avoid NULLs wherever possible.
This seems to validate the mostly negative reaction to this question. Instead of applying an accepted 6NF approach to designing out NULLs, special values are used to "avoid NULLs wherever possible." I posted this question with an open mind, and I am glad I learned more about the "NULLs are useful / NULLs are evil" debate, but I am now quite comfortable labeling the 'special values' approach to be complete nonsense.
an empty (zero-length) string furnishes exactly the same information.
No, it doesn't; in the existing database we are modifying, NULL means "never entered" and empty string means "entered as empty".
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
Yes, but those mistakes have been made thousands of times by thousands of developers, and the lessons and caveats for avoiding those mistakes are known and documented. As has been mentioned here: whether you accept or reject NULLs, representation of missing values is a solved problem. There is no need to invent a new solution just because developers continue to make easy-to-overcome (and easy-to-identify) mistakes.
As a footnote: I have been a DBE and developer for more than 20 years (which is certainly enough time for me to know the difference between a database engineer and a database administrator). Throughout my career I have always been in the "NULLs are useful" camp, though I was aware that several very smart people disagreed. I was extremely skeptical about the "special values" approach, but not well-versed enough in the academics of "How To Avoid NULL the Right Way" to make a firm stand. I always love learning new things, and I still have lots to learn after 20 years. Thanks to all who contributed to make this a useful discussion.
Sack your contractor.
Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.
I would actually contend that using 'Z' or any other place holder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)
At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conforms to standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, mis-understanding, and in the end increased costs and mistakes.
EDIT
There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.
I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.
This still can change behaviour from what may be expected, and so should be used with care:
WHERE end-date IS NULL no longer gives data that is still valid
You just created your own millennium bug
etc.
This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.
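The trade-off can be seen in a small sketch using Python's sqlite3 module. The two table names, the columns and the dates are made up for the example; ISO-formatted date strings are used because they compare correctly as text in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PriceWithNull     (Price REAL, StartDate TEXT, EndDate TEXT);
    CREATE TABLE PriceWithSentinel (Price REAL, StartDate TEXT, EndDate TEXT NOT NULL);
    INSERT INTO PriceWithNull     VALUES (9.99, '2010-01-01', NULL);
    INSERT INTO PriceWithSentinel VALUES (9.99, '2010-01-01', '2079-12-31');
""")
as_of = "2015-06-30"

# With NULL meaning "open-ended", the range test needs an explicit extra branch.
with_null = conn.execute(
    "SELECT Price FROM PriceWithNull"
    " WHERE StartDate <= ? AND (EndDate >= ? OR EndDate IS NULL)", (as_of, as_of)
).fetchall()

# With a far-future placeholder, a plain BETWEEN is enough...
with_sentinel = conn.execute(
    "SELECT Price FROM PriceWithSentinel WHERE ? BETWEEN StartDate AND EndDate", (as_of,)
).fetchall()

# ...but the familiar "still open" test now silently finds nothing.
still_open = conn.execute(
    "SELECT COUNT(*) FROM PriceWithSentinel WHERE EndDate IS NULL"
).fetchone()[0]
```

Both range queries return the row, but the last query returns zero, which is exactly the change in expected behaviour listed above.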
Still, sack the contractor.
This is easily one of the weirdest opinions I've ever heard. Using a magic value to represent "no data" rather than NULL means that every piece of code that you have will have to post-process the results to account/discard the "no-data"/"Z" values.
NULL is special because of the way that the database handles it in queries. For instance, take these two simple queries:
select * from mytable where name = 'bob';
select * from mytable where name != 'bob';
If name is ever NULL, it obviously won't show up in the first query's results. More importantly, neither will it show up in the second query's results. NULL doesn't match anything other than an explicit search for NULL, as in:
select * from mytable where name is NULL;
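These three queries can be tried against a throwaway in-memory table. The sketch below uses Python's sqlite3 module; the three sample rows and the count_where helper are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (name TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?)", [("bob",), ("alice",), (None,)])


def count_where(condition):
    """Count the rows of mytable matching a fixed, trusted WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM mytable WHERE {condition}").fetchone()[0]


equal_bob = count_where("name = 'bob'")    # matches the 'bob' row only
not_bob = count_where("name != 'bob'")     # matches 'alice'; the NULL row is excluded
is_null = count_where("name IS NULL")      # the only way to reach the NULL row
```

The first two counts add up to two, not three: the NULL row satisfies neither comparison.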
And what happens when the data could have Z as a valid value? Let's say you're storing someone's middle initial? Would Zachary Z Zonkas be lumped in with those people with no middle initial? Or would your contractor come up with yet another magic value to handle this?
Avoid magic values that require you to implement database features in code that the database is already fully capable of handling. This is a solved and well understood problem, and it may just be that your contractor never really grokked the notion of NULL and therefore avoids using it.
If the domain allows missing values, then using NULL to represent 'undefined' is perfectly OK (that's what it is there for). The only downside is that code that consumes the data has to be written to check for NULLs. This is the way I've always done it.
I have never heard of (or seen in practice) the use of 'Z' to represent missing data. As to "the contractor cites this as 'standard practice' among DBAs", can he provide some evidence of that assertion? As @Dems mentioned, you also need to document that 'Z' doesn't mean 'Z': what about a MiddleInitial column?
Like Aaron Alton and many others, I believe that NULL values are an integral part of database design, and should be used where appropriate.
Even if you somehow manage to explain to all your current and future developers and DBAs about "Z" instead of NULL, and even if they code everything perfectly, you will still confuse the optimizer because it will not know that you've cooked this up.
Using a special value to represent NULL (which is already a special value to represent NULL) will result in skews in the data. e.g. So many things happened on 1-Jan-1900 that it will throw out the optimizer's ability to understand that actual range of dates that really are relevant to your application.
This is like a manager deciding: "Wearing a tie is bad for productivity, so we're all going to wear masking tape around our necks. Problem solved."
I've never heard about the wide-spread use of 'Z' as a substitute for NULL.
(Incidentally, I'd not particularly like to work with a contractor who tells you in the face that they and other "advanced" DBAs are so much more knowledgeable and better than you.)
+=================================+
|         FavoriteLetters         |
+=================================+
| Person       | FavoriteLetter   |
+--------------+------------------+
| 'Anna'       | 'A'              |
| 'Bob'        | 'B'              |
| 'Claire'     | 'C'              |
| 'Zaphod'     | 'Z'              |
+---------------------------------+
How would your contractor interpret the data from the last row?
Probably he would choose a different "magic value" in this table to avoid collision with the real data 'Z'? Meaning you'd have to remember several magic values and also which one is used where... how is this better than having just one magic token NULL, and having to remember the three-valued logic rules (and pitfalls) that go with it? NULL at least is standardized, unlike your contractor's 'Z'.
I don't particularly like NULL either, but mindlessly substituting it with an actual value (or worse, with several actual values) everywhere is almost definitely worse than NULL.
Let me repeat my above comment here for better visibility: If you want to read something serious and well-grounded by people who are against NULL, I would recommend the short article "How to handle missing information without using NULLs" (links to a PDF from The Third Manifesto homepage).
Nothing in principle requires nulls for correct database design. In fact there are plenty of databases designed without using null and there are plenty of very good database designers and whole development teams who design databases without using nulls. In general it's a good thing to be cautious about adding nulls to a database because they inevitably lead to incorrect or ambiguous results later on.
I've not heard of using Z being called "standard practice" as a placeholder value instead of nulls but I expect your contractor is referring to the concept of sentinel values in general, which are sometimes used in database design. However, a much more common and flexible way to avoid nulls without using "dummy" data is simply to design them out. Decompose the table such that each type of fact is recorded in a table that doesn't have "extra", unspecified attributes.
In reply to the contractor's comments:
Empty string <> NULL
Empty string requires 2 bytes storage + an offset read
NULL uses null bitmap = quicker
IDENTITY doesn't always start at 1 (why waste half your range?)
The whole concept is flawed as per most other answers here
While I have never seen 'Z' as a magic value to represent null, I have seen 'X' used to represent a field that has not been filled in. That said, I have only ever seen this in one place, and my interface to it was not a database, but rather an XML file… so I would not be prepared to use this as an argument for it being common practice.
Note that we do have to handle the 'X' specially, and, as Dems mentioned, we do have to document it, and people have been confused by it. In our defence, this is forced on us by an external supplier, not something that we cooked up ourselves!