Database - (rows or records, columns or fields)? - sql

In database terminology:
What is the difference between a row and a record?
Likewise, aren't columns and fields the same thing?
On Joe Celko's blog The SQL Apprentice, I noticed that the banner mentions that they are different things.

Row and record can arguably be considered as the same thing.
Fields and columns are different, a field is the intersection of a row and a column.
i.e. if your table has 10 rows and 10 columns, it has 100 fields.
When you create a table using DDL statements, you define columns (metadata).
When you add rows using DML statements, you define rows and their fields.
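
A minimal sketch of that distinction (table and column names invented for illustration):

CREATE TABLE employee (            -- DDL: defines two columns (metadata)
    id   INT,
    name VARCHAR(50)
);

INSERT INTO employee (id, name)    -- DML: adds one row, i.e. two new fields
VALUES (1, 'Alice');

After the INSERT, the table has 1 row x 2 columns = 2 fields.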

In a broader sense, rows and columns refer to a matrix structure. When a database, not necessarily a relational one, holds matrix-structured data, this terminology can be borrowed, though there may be a more specific one.
In relational databases, for example, a table is always a matrix, so each column in a table corresponds to a field in a record, and each row corresponds to a record: different concepts pointing to the same object.
A field can be present even in NoSQL databases, where often there's a free schema (no columns) and each row can have a different number of fields.
Similarly, a record can be a complex value in non-relational databases: it can contain fields with multiple distinct values (not 1NF). A row (a tuple, in relational algebra), by contrast, contains a single value for each field.

As stated in a previous answer to this question, row and record can arguably be used interchangeably.
Column and field can also arguably be used interchangeably. See the following article: Column (database)
Here's a quote (as of this writing), from the article mentioned above, which makes that point:
"The term field is often used interchangeably with column, although many consider it more correct to use field (or field value) to refer specifically to the single item that exists at the intersection between one row and one column."
Here's some additional background info which may be helpful:
During my IT career as an analyst and programmer, I've typically used the terms field and record, not column and row, in both programming and relational database contexts. I think that comes from the instruction that I received during my university studies, and the fact that I learned the basic data hierarchy of bit, byte, field, record, file, before learning about relational databases.
In researching this question, I found that it is common practice, and arguably correct, to use row and record interchangeably and to use column and field interchangeably. I was actually quite surprised, though, when my research indicated that row and column are preferred terms over record and field, in database terminology.

The terms record and field predate relational databases; they come from a time when computerized file systems ruled persistent storage, mainframes ruled the computing market, and DBAs/data analysts were called DP (data processing) specialists.
In such a file, data is organized in a 2-D matrix form, where a piece of information is called a field (column) and a collection of related fields a record (row). This kind of data file is similar to a table (minus the standardized relationships governing the contents), so the terms used during the file-processing era were inherited. Technically, though, row <> record and column <> field.
For more information: Database Systems: Design, Implementation & Management - Rob & Coronel (Chapter 1, Section 5)

Records and fields make up a database table. Rows and Columns are found in spreadsheets.

Related

What is atomicity in dbms

I read something like the following about 1NF in DBMS.
There was a sentence as follows:
"Every column should be atomic."
Can anyone please explain it to me thoroughly with an example?
Re "atomic"
In Codd's original 1969 and 1970 papers he defined relations as having a value for every attribute in a row. The value could be anything, including a relation. This used no notion of "atomic". He explained that "atomic" meant not relation-valued (ie not table-valued):
So far, we have discussed examples of relations which are defined on
simple domains--domains whose elements are atomic (nondecomposable)
values. Nonatomic values can be discussed within the relational
framework. Thus, some domains may have relations as elements.
He used "simple", "atomic" and "nondecomposable" as informal expository notions. He understood that a relation has rows of which each column has an associated name and value; attributes are by definition "single-valued"; the value is of any type. The only structural property that matters relationally is being a relation. It is also just a value, but you can query it relationally. Then he used "nonsimple" etc meaning relation-valued.
By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:
From a database perspective, data can be classified into two types:
atomic and compound. Atomic data cannot be decomposed into smaller
pieces by the DBMS (excluding certain special functions). Compound
data, consisting of structured combinations of atomic data, can be
decomposed by the DBMS.
In the relational model there is only one type of compound data: the
relation. The values in the domains on which each relation is defined
are required to be atomic with respect to the DBMS. A relational
database is a collection of relations of assorted degrees. All of the
query and manipulative operators are upon relations, and all of them
generate relations as results. Why focus on just one type of compound
data? The main reason is that any additional types of compound data
add complexity without adding power.
"In the relational model there is only one type of compound data: the relation."
Sadly, "atomic = non-relation" is not what you're going to hear. (Unfortunately Codd was not the clearest writer and his expository remarks get confused with his bottom line.) Virtually all presentations of the relational model get no further than what was for Codd merely a stepping stone. They promote an unhelpful confused fuzzy notion canonicalized/canonized as "atomic" determining "normalized". Sometimes they wrongly use it to define realtion. Whereas Codd used everyday "nonatomic" to introduce defining relational "nonatomic" as relation-valued and defined "normalized" as free of relation-valued domains.
(Neither is "not a repeating group" helpful as "atomic", defining it as not something that is not even a relational notion. And sure enough in 1970 Codd says "terms attribute and repeating group in present database terminology are roughly analogous to simple domain and nonsimple domain, respectively".)
Eg: This misinterpretation was promoted for a long time from early on by Chris Date, honourable early relational explicator and proselytizer, primarily in his seminal still-current book An Introduction to Database Systems, which now (2004, 8th edition) thankfully presents the helpful relationally-oriented extended notion of distinguishing relation, row and "scalar" (non-relation non-row) domains:
This definition merely states that all [relation variables] are in 1NF
Eg: Maier's classic The Theory of Relational Databases (1983):
The definition of atomic is hazy; a value that is atomic in one application could be non-atomic in another. For a general guideline, a value is non-atomic if the application deals with only a part of the value.
Eg: The current Wikipedia article on First NF (Normal Form) section Atomicity actually quotes from the introductory parts above. And then ignores the precise meaning. (Then it says something unintelligible about when the nonatomic turtles should stop.):
Codd states that the "values in the domains on which each
relation is defined are required to be atomic with respect to the
DBMS." Codd defines an atomic value as one that "cannot be decomposed
into smaller pieces by the DBMS (excluding certain special functions)"
meaning a field should not be divided into parts with more than one
kind of data in it such that what one part means to the DBMS depends
on another part of the same field.
Re "normalized" and "1NF"
When Codd used "normalize" in 1970, he meant eliminate relation-valued ("non-simple") domains from a relational database:
For this reason (and others to be cited below) the possibility of
eliminating nonsimple domains appears worth investigating. There is,
in fact, a very simple elimination procedure, which we shall call
normalization.
Later the notion of "higher NFs" (involving FDs (functional dependencies) & then JDs (join dependencies)) arose and "normalize" took on a different meaning. Since Codd's original normalization paper, normalization theory has always given results relevant to all relations, not just those in Codd's 1NF. So one can "normalize" in the original sense of going from just relations to a "normalized" "1NF" without relation-valued columns. And one can "normalize" in the normalization-theory sense of going from a just-relations "1NF" to higher NFs while ignoring whether domains are relations. And "normalization" is commonly also used for the "hazy" notion of eliminating values with "parts". And "normalization" is also wrongly used for designing a relational version of a non-relational database (whether just relations and/or some other sense of "1NF").
Relational spirit is to eschew multiple columns with the same meaning or domains with interesting parts in favour of another base table. But we must always come to an informal ergonomic decision about when to stop representing parts and just treat a column as "atomic" (non-relation-valued) vs "nonatomic" (relation-valued).
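As a rough sketch of Codd's original elimination procedure (the employee/jobhistory shape echoes his 1970 paper; the SQL rendering and names here are invented): a relation-valued jobhistory attribute is removed by giving the nested relation its own table that carries the parent key down.

-- Before: employee(emp_no, name, jobhistory)   -- jobhistory is relation-valued
-- After normalizing in Codd's original sense:
CREATE TABLE employee (
    emp_no INT PRIMARY KEY,
    name   VARCHAR(50)
);

CREATE TABLE jobhistory (
    emp_no  INT REFERENCES employee,   -- parent key propagated into the child table
    jobdate DATE,
    title   VARCHAR(50)
);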
Normalization in database management system
Atomicity and 1NF... that is not about atomic transactions, but about definition and column content.
"Atomic" means "cannot be divided or split in smaller parts". Applied to 1NF this means that a column should not contain more than one value. It should not compose or combine values that have a meaning of their own.
This typically concerns two very common mistakes made by database designers:
1. multiple values in one column (list columns)
columns that contain a list of values, typically space- or comma-separated, like this blog post table:
id  title      date_posted  content  tags
1   new idea   2014-05-23   ...      tag1,tag2,tag3
2   why this?  2014-05-24   ...      tag2,tag5
3   towel day  2014-05-26   ...      tag42
or this contacts table:
id  room  phones
4   432   111-111-111 222-222-222
5   456   999-999-999
6   512   888-888-8888 333-3333-3333
This type of denormalization is rare, as most database designers see it cannot be a good thing. But you do find tables like this. They usually come from later modifications to the database, where it may seem simpler to widen a column and stuff multiple values into it than to add a normalized related table (which often breaks existing applications).
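A minimal sketch of the normalized alternative for the tags example above (names invented): one row per post/tag pair instead of a list column.

CREATE TABLE post (
    id          INT PRIMARY KEY,
    title       VARCHAR(100),
    date_posted DATE,
    content     TEXT
);

CREATE TABLE post_tag (
    post_id INT REFERENCES post,
    tag     VARCHAR(30),
    PRIMARY KEY (post_id, tag)    -- no comma-separated lists anywhere
);

Finding all posts tagged tag2 then becomes a plain join instead of a LIKE over a comma-separated string.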
2. complex multi-part columns
In this case one column contains different bits of information and could maybe be designed as a set of separate columns.
Typical examples are fullname and address columns:
id  fullname        address
1   Mark Tomers     56 Tomato Road
2   Fred Askalong   3277 Hadley Drive
3   May Anne Brice  225 Century Avenue - apartment 43/a
These types of denormalizations are very common, as it is quite difficult to draw the line between what is atomic and what is not. Depending on the application, a multi-part column could very well be the best solution in some cases. It is less structured, but simpler.
Structuring an address in many atomic columns may mean having more complex code to handle results for output. Another complexity comes from the structure not being adequate to fit all types of addresses. Using one single VARCHAR column does not pose this problem, but may pose others... typically about searching and sorting.
An extreme case of multi-part columns are dates and times. Most RDBMS provide date and time data types and provide functions to handle date and time algebra and the extraction of the various parts (month, hour, etc...). Few people would consider it convenient to have separate year, month, and day columns in a relational database. But I've seen it... and with good reasons: the use case was birthdates for a justice department database. They had to handle many immigrants with few or no documents. Sometimes you just knew a person was born in a certain year, but you would not know the day or month of birth. You can't handle that type of info with a single date column.
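A sketch of how that birthdate case could be modelled (hypothetical names), with independently nullable parts instead of one date column:

CREATE TABLE person (
    id          INT PRIMARY KEY,
    birth_year  INT,    -- usually known
    birth_month INT,    -- NULL when unknown
    birth_day   INT,    -- NULL when unknown
    CHECK (birth_day IS NULL OR birth_month IS NOT NULL)  -- a known day needs a known month
);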
"Every column should be atomic."
Chris Date says, "Please note very carefully that it is not just simple things like the integer 3 that are legitimate values. On the contrary, values can be arbitrarily complex; for example, a value might be a geometric point, or a polygon, or an X ray, or an XML document, or a fingerprint, or an array, or a stack, or a list, or a relation (and so on)."[1]
He also says, "A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute."[2]
He generally discourages the use of the word atomic, because it has confusing connotations. Single value is probably a better term to use.
For example, a date like '2014-01-01' is a single value. It's not indivisible; on the contrary, it quite clearly is divisible. But the dbms does one of two things with single values that have parts. The dbms either returns those values as a whole, or the dbms provides functions to manipulate the parts. (Clients don't have to write code to manipulate the parts.)[3]
In the case of dates, SQL can
return dates as a whole (SELECT CURRENT_DATE),
return one or more parts of a date (EXTRACT(YEAR FROM CURRENT_DATE)),
add and subtract intervals (CURRENT_DATE + INTERVAL '1' DAY),
subtract one date from another (CURRENT_DATE - DATE '2014-01-01'),
and so on. In this (narrow) respect, SQL is quite relational.
An Introduction to Database Systems, 8th ed, p 113. Emphasis in the original.
Ibid, p 358.
In the case of a "user-defined" type, the "user" is presumed to be a database programmer, not a client of the database.
It means a column should not contain multiple values (like comma-separated values).
Please see the link below:
http://www.studytonight.com/dbms/database-normalization.php

Column oriented database vs row oriented database

I have used row-oriented database design for a long time and, except for data warehouse projects and big data samples, I have not used column-oriented database design for an OLTP app.
My row-oriented table looks like:

ID  Make  Model  Month  Miles  Cost
1   BMW   Z3     12     12000  100
Some people in our team advocating column oriented database design.
They suggest that all the column names should be property names in a Property table.
Then another table Quote will have two columns PropertyName and PropertyValue.
In the .net code, we read each key and compare and convert to strongly typed object. The code is really getting messy.
if (qwi.DomainCode == typeof(CoreBO.Base.iQQConstants.MBPCollateralInfo).Name)
{
    if (qwi.RefCode == iQQConstants.MBPCollateralInfo.ENGINETYPE)
    {
        Aspiration = qwi.Value;
    }
    else if (qwi.RefCode == iQQConstants.MBPCollateralInfo.FUELTYPE)
    {
        FuelType = qwi.Value;
    }
    else if (qwi.RefCode == iQQConstants.MBPCollateralInfo.MAKE)
    {
        Make = qwi.Value;
    }
    else if (qwi.RefCode == iQQConstants.MBPCollateralInfo.MILEAGE)
    {
        int reading = 0;
        bool success = int.TryParse(qwi.Value, out reading);
        if (success)
        {
            OdometerReading = reading;
        }
    }
}
The argument for this column-oriented design is that we won't have to change the table schema and the stored procs (we are still using stored procs instead of Entity Framework).
Seems like we are heading into a real problem. Is column-oriented design well accepted in the industry?
I am having trouble with your terminology. You are describing an EAV structure (standing for Entity-Attribute-Value).
Aside: A "column-oriented" database usually refers to a database that stores each column separately from others (when I learned about databases, this was called "vertical partitioning", but I don't think that caught on). Examples include Paracel and Vertica.
An entity-attribute-value database is storing each attribute for an entity as a separate row.
The first problem that you have with your particular structure is typing. Some of the attributes are strings and some are numbers. This becomes a management nightmare in an EAV world. Either you store everything as strings (losing the ability to type-check values and to guarantee that arithmetic works) or you include multiple columns for different types with a type column (making queries much more complicated).
Similarly, constraints and foreign key references are much harder to implement. Also, because you are repeating the entity id and attribute id on each row, the data often takes up more space; NULL values in an ordinary table, by contrast, are typically quite space efficient.
On the OLTP side, you have another problem. When you want to insert an entity, you typically want to insert a bunch of attributes as well. One insert has now turned into many inserts, and you'll want to start wrapping these in transactions, affecting performance.
Given all these shortcomings, you might think EAV models should never be used. But there is a place for them. They are particularly useful when attributes change over time, say in an application where users can put in their own information with tags. In such cases, a hybrid approach is the best solution: use a regular relational table with many columns for the common information, and an EAV table for the optional information for each entity.
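A minimal sketch of that hybrid approach, loosely based on the question's quote data (all names invented):

CREATE TABLE quote (                 -- common, strongly typed attributes
    id    INT PRIMARY KEY,
    make  VARCHAR(20),
    model VARCHAR(20),
    cost  DECIMAL(10, 2)
);

CREATE TABLE quote_attribute (       -- optional, free-form attributes
    quote_id INT REFERENCES quote,
    name     VARCHAR(30),
    value    VARCHAR(100),           -- everything is a string here; see the typing caveat above
    PRIMARY KEY (quote_id, name)
);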
Source: Wikipedia
Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
Row-oriented organizations are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).
In addition to the problems Gordon Linoff mentions, EAV data models are also fiendishly hard to query - find all cars where the make is BMW and the months between 12 and 24 and the cost < 10000 becomes a huge jumble of nasty SQL, especially if you're doing string comparison on numbers...
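To make that concrete, here is roughly what such a query looks like against an EAV table eav(entity_id, property_name, property_value) (names assumed for illustration), with one self-join per attribute and a cast for every numeric comparison:

SELECT  mk.entity_id
FROM    eav mk
JOIN    eav mo ON mo.entity_id = mk.entity_id AND mo.property_name = 'Month'
JOIN    eav co ON co.entity_id = mk.entity_id AND co.property_name = 'Cost'
WHERE   mk.property_name = 'Make'
  AND   mk.property_value = 'BMW'
  AND   CAST(mo.property_value AS INT) BETWEEN 12 AND 24
  AND   CAST(co.property_value AS DECIMAL(10, 2)) < 10000;

And the casts blow up as soon as any property_value fails to parse as a number.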
Generally, row-oriented and column-oriented describe the storage mechanism at the low level (on disk). How good each one is depends on your requirements: in some scenarios column-oriented storage performs better, and in some scenarios row-oriented does.
The HBase database uses the concept of a column family, which is a group of columns.
The difference is that a row-oriented store lays the logical table out one row per row block, whereas a column-oriented store lays it out one column per column block.
Row-oriented storage results in poor performance for analytical queries (like the sum or average of salaries) but works fine when we need to access the individual details of a row or insert a new record. Column-oriented storage, conversely, works well for analytical queries but performs poorly when inserting individual records or accessing all the details of a row.
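As a rough picture of the difference, using the car table from the question (layout simplified):

row blocks:     [1, BMW, Z3, 12, 12000, 100] [2, ...] [3, ...]
column blocks:  [1, 2, 3] [BMW, ...] [Z3, ...] [12, ...] [12000, ...] [100, ...]

An analytical query like SELECT AVG(Cost) FROM cars (table name assumed) only has to read the Cost block in the column layout, but must scan every row block in the row layout.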
You can visit this link, which describes different scenarios with their pros and cons, examples, and a summary of the differences:
http://geekrandomstuff.blogspot.tw/2014/04/row-oriented-database-vs-column.html
From my experience, EAV is great for storing application settings, i.e. relatively static data without any further need for joining and transforming, and nothing more than that.

When to split a Large Database Table?

I'm working on an 'Employee' database and the fields are beginning to add up (20, say). The database would be populated from different UIs, say:
Personal Information UI: populates fields of the 'Employee' table such as birthday, surname, gender etc
Employment Details UI: populates fields of the 'Employee' table such as employee number, date employed, grade level etc
Having all the fields populated from a single UI (as you would imagine) is messy and results in one very long form that you'd need to scroll.
I'm thinking of splitting this table into several smaller tables, such that each smaller table captures a related information of an employee (i.e. splitting the table logically according to the UI).
The tables will then be joined by the employee id. I understand that splitting tables with a one-to-one relationship is generally not a good idea (multiple-database-tables), but could splitting the table logically help, such that the employee information is captured in several INSERT statements?
Thanks.
Your data model should not abide by rules imposed by the UI just for convenience. One way to reduce the column set for a given UI component is to use views (in most databases, you can also INSERT / UPDATE / DELETE through simple views). Another is to avoid SELECT *; you can always select subsets of your table's columns.
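For example (view and column names assumed from the question):

CREATE VIEW employee_personal AS
    SELECT id, surname, birthday, gender
    FROM   employee;

CREATE VIEW employee_employment AS
    SELECT id, employee_number, date_employed, grade_level
    FROM   employee;

Each UI form then reads (and, in most databases, writes) through its own narrow view while the underlying table stays whole.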
"could splitting the table logically help, such that the employee information is captured in several INSERT statements?"
No.
How could it help?
20 fields is a fairly small number of fields for a relational database. There was a question a while ago on SO where a developer expected to have around 3,000 fields for a single table (which was actually beyond the capability of the RDBMS in question) - under such circumstances, it does make sense to split up the table.
It could also make sense to split up the table if a subset of columns were only ever going to be populated for a small proportion of rows (for example, if there were attributes that were specific to company directors).
However, from the information supplied so far, there is no apparent reason to split the table.
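For completeness, if you did hit the company-directors case mentioned above, the usual shape is a 1-to-1 dependent table, sketched here with invented names:

CREATE TABLE employee (
    id   INT PRIMARY KEY,
    name VARCHAR(50)
    -- ...the other common columns
);

CREATE TABLE director_details (            -- a row exists only for the few directors
    employee_id    INT PRIMARY KEY REFERENCES employee,
    board_position VARCHAR(50)
);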
Briefly put, you want to normalize your data model. Normalization is the systematic restructuring of data into tables that formed the theoretical foundation of the relational data model developed by E. F. Codd forty years ago. There are levels of normalization - non-normalized and then first, second, third etc normal forms.
Normalization is now barely an afterthought in many database shops, ostensibly because it is erroneously believed to slow database performance.
I found a terrific summary on an IBM site covering normalization which may help. http://publib.boulder.ibm.com/infocenter/idshelp/v10/topic/com.ibm.ddi.doc/ddi56.htm
The original Codd book, later revised by C. J. Date, is not very accessible, unfortunately. I can recommend "Database Systems: Design, Implementation & Management" by Rob, P. & Coronel, C.M. I TA'ed a class on database design that used this textbook and I've kept using it for reference.

What is considered a large sql table considering number of fields

I know the question does not make that much sense.
What is a typical number of columns in a small, medium, and large table (in a database) in a professional environment?
I want to get an idea of what good design is, and how wide I can grow my table column-wise. Are 100 or 200 columns in a table OK?
It totally depends upon the nature of the subject you are modeling. If you need one column, you need one; if you need 500, then you need 500. Properly designed, the size of the tables you end up with will always be "just right".
How big can they be, what performs well, what if you need more columns than you can physically stuff into SQL... those are all implementation, maintenance, and/or performance questions, and that's a different and secondary subject. Get the models right first, then worry about implementation.
Well, for SQL Server 2005, according to the Maximum Capacity Specifications, the maximum number of columns per base table is 1,024. So that's a hard upper limit.
You should normalize your tables to at least third normal form (3NF). That should be your primary guide.
You should consider the row size. 100 sparse columns are different from 100 varchar(3000) columns. It is almost always better to make a related table (with an enforced 1-1 relationship) once you start to exceed the record size that SQL Server can store on one page.
You also should consider how the data will be queried. Are many of those fields ones that will not frequently need to be returned? Do they have a natural grouping (think a person record versus a user login record), and will that natural grouping dictate how they will be queried? In that case it might be better to separate them out.
And of course you should consider normalization. If you are adding multiple columns to avoid a one-to-many relationship and a join, you should not do that even if you only end up with 6 columns. Joins (with the key fields indexed) are generally preferable to denormalized tables except in data warehousing situations. It is better if you don't have to add new columns just because you need to store a new phone type.
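Sketching that last point (names invented): instead of home_phone, work_phone, and mobile_phone columns, a related table means a new phone type is just a new row, not a schema change.

CREATE TABLE person_phone (
    person_id  INT REFERENCES person,
    phone_type VARCHAR(10),     -- 'home', 'work', 'mobile', ...
    number     VARCHAR(20),
    PRIMARY KEY (person_id, phone_type)
);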

Database table with just 1 or 2 optional fields... split into multiple tables?

In a DB I'm designing, there's one fairly central table representing something that's been sold or is for sale. It distinguishes between personal sales (like eBay) and sales from a proper company. This means there are literally one or two fields which are not equally appropriate to both cases... for instance, one field is only used in one case, another is optional in one case but mandatory in the other.
If there were more specialized fields it would be sensible to have a core table and then two tables with the fields relevant to the specific cases. But here, creating two tables just to contain one field each plus the reference to the core table seems both aesthetically bad and painful to the query designer and DB software.
What do you think? Is it ok to bend the rules slightly by having a single table with weakened constraints - meaning the DB cannot 100% prevent inconsistent data being added (in a very limited way) - or do I suck it up and create dumb-looking 1-field tables?
What you're describing with one table for common columns and dependent tables for subtype-specific columns is called Class Table Inheritance. It's a perfectly good thing to do.
What @Scott Ferguson seems to be describing (two distinct tables for the two types of sales) is called Concrete Table Inheritance. It can also be a good solution depending on your needs, but more often it just makes it harder to write queries across both subtypes.
If all you need is one or two columns that apply only to a given subtype, I agree it seems like overkill to create dependent tables. Remember that most brands of SQL database support CHECK constraints or triggers, so you can design data integrity rules into the metadata.
CREATE TABLE Sales (
    sale_id     SERIAL,
    is_business INT NOT NULL,   -- 1 for corporate, 0 for personal
    sku         VARCHAR(20),    -- only for corporate
    paypal_id   VARCHAR(20),    -- mandatory but only for personal
    CONSTRAINT personal_sale_has_paypal
        CHECK (is_business = 1 OR paypal_id IS NOT NULL),
    CONSTRAINT sku_only_for_corporate
        CHECK (is_business = 1 OR sku IS NULL)
);
I think keeping these fields in one table is not going to hurt you today and is the choice I would go for. Just remember that as your database evolves you may need to refactor into two separate tables (if you need more fields).
There are some who insist that inapplicable fields should never be allowed, but I think this is one of those rules that someone wrote in a book and now we're all supposed to follow it without questioning why. In the case you're describing, a single table sounds like the simple, intelligent solution.
I would certainly not create two tables. Then all the common fields would be duplicated, and all your queries would have to join or union two tables. So the real question is, One table or three. But you seem to realize that.
You didn't clarify what the additional fields are. If the presence or absence of one field implies the record type, then I sometimes use that fact as the record type indicator rather than creating a redundant type. Like, if the only difference between a "personal sale" and a "business sale" is that a business sale has a foreign key for a company filled in, then you could simply state that you define a business sale as one with a company filled in, and no ambiguity is possible. But if the situation gets even slightly more complicated, this can be a trap: I've seen applications that say if a is null and b = c and d / 7 = e then it's record type A, else if b is null and etc., etc. If you can't do it with one test on one field, forget it and put in a record type field.
You can always enforce consistency with code or constraints.
I worry a lot more about redundant data creating consistency problems than about inapplicable fields. Redundant data creates all sorts of problems. Data inapplicable to a record type? In the worst case, just ignore it. If it's a "personal sale" and somehow a company got filled in, ignore it or null it out on sight. Problem solved.
If there are two distinct entities, "Personal Sales" and "Company Sales", then perhaps you ought to have two tables to represent those entities?
News flash: the DB cannot prevent 100% of corrupt data no matter which way you cut it. So far you have only considered what I call level 1 corruption (level 0 corruption is essentially what would happen if you wrote garbage over your database with a hex editor).
I have yet to see a database that could prevent level 2 corruption (syntactically correct records but when taken as a whole mean something perverse).
The PRO for keeping all fields in one table is that you get rid of JOINs, which makes your queries faster.
The CONTRA is that your table grows larger, which makes your queries slower.
Which one impacts you more, totally depends on your data distribution and which queries you issue most often.
In general, splitting is better for OLTP systems, joining is better for data analysis (that tends to scan the tables).
Let's imagine 2 scenarios:
Split fields. There are 1,000,000 rows, the average row size is 20 bytes, and the split field is filled once per 50 rows (i.e. 20,000 records in the split table).
We want to query like this:
SELECT  SUM(mainfield + COALESCE(splitfield, 0))
FROM    maintable
LEFT JOIN
        splittable
ON      splitid = mainid
This will require scanning 20,000,000 bytes plus nested loops (or hash lookups) to find the 20,000 records.
Each hash lookup is roughly equivalent to scanning 10 rows, so the total time will be equivalent to scanning 20,000,000 + 10 * 20,000 * 20 = 24,000,000 bytes.
Joined fields. There are 1,000,000 rows, the average row size is 24 bytes, so the query will scan 24,000,000 bytes.
As you can see, the times tie.
However, if either parameter changes (field is filled more often or more rarely, the row size is more or less, etc), one or another solution will become better.