Why & When should I use SPARSE COLUMN? (SQL SERVER 2008)

After going through some tutorials on SQL Server 2008's new SPARSE COLUMN feature, I have found that it doesn't take any space if the column value is 0 or NULL, but that when there is a value it takes 4 times the space a regular (non-sparse) column holds.
If my understanding is correct, why would I choose it when designing a database?
And if I do use it, in what situations would it pay off?
Also, out of curiosity, how is no space reserved when a column is defined as sparse (that is, what is the internal implementation)?

A sparse column doesn't use 4x the amount of space to store a value, it uses a (fixed) 4 extra bytes per non-null value. (As you've already stated, a NULL takes 0 space.)
So a non-null value stored in a bit column would be 1 bit + 4 bytes = 4.125 bytes. But if 99% of these are NULL, it is still a net savings.
A non-null value stored in a GUID (UniqueIdentifier) column is 16 bytes + 4 bytes = 20 bytes. So if only 50% of these are NULL, that's still a net savings.
So the "expected savings" depends strongly on what kind of column we're talking about, and your estimate of what ratio will be null vs non-null. Variable width columns (varchars) are probably a little more difficult to predict accurately.
This Books Online page has a table showing, for each data type, what percentage of the values would need to be NULL before you end up with a net benefit.
So when should you use a sparse column? When you expect a significant percentage of the rows to have a NULL value. Some examples that come to mind (a minimal table sketch follows these examples):
An "Order Return Date" column in an Orders table: you would hope that only a very small percentage of sales result in returned products.
A "4th Address Line" column in an Address table: most mailing addresses, even if you need a department name and a "Care Of" line, probably don't need 4 separate lines.
A "Suffix" column in a Customer table: a fairly low percentage of people have a "Jr.", "III", or "Esquire" after their name.

Storing a null in a sparse column takes up no space at all.
To any external application the column behaves the same.
Sparse columns work really well with filtered indexes, as you will usually only want to index the rows where the sparse column actually has a value (see the sketch after this list).
You can create a column set over the sparse columns that returns an XML fragment of all of the non-null data from the columns covered by the set. The column set behaves like a column itself. Note: you can only have one column set per table.
Change Data Capture and transactional replication both work, but not the column sets feature.
Downsides
If a sparse column has data in it, it takes 4 more bytes than a normal column, e.g. even a bit (0.125 bytes normally) becomes 4.125 bytes, and a uniqueidentifier rises from 16 bytes to 20 bytes.
Not all data types can be sparse: text, ntext, image, timestamp, user-defined data types, geometry, geography, and varbinary(max) with the FILESTREAM attribute cannot be sparse.
Computed columns can't be sparse (although sparse columns can take part in the calculation of another computed column).
You can't apply rules or have default values.
Sparse columns cannot form part of a clustered index. If you need to do that use a computed column based on the sparse column and create the clustered index on that (which sort of defeats the object).
Merge replication doesn't work.
Data compression doesn't work.
Access (read and write) to sparse columns is more expensive, but I haven't been able to find any exact figures on this.
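A minimal sketch of the filtered-index point above (the table and column names are made up for illustration):

-- Hypothetical table: only a small fraction of rows ever get a return date.
CREATE TABLE dbo.Orders
(
    OrderId    INT IDENTITY(1,1) PRIMARY KEY,
    OrderDate  DATETIME NOT NULL,
    ReturnDate DATETIME SPARSE NULL
);

-- Filtered index covering only the rows where the sparse column has a value,
-- so the index stays small even though the table is large.
CREATE NONCLUSTERED INDEX IX_Orders_ReturnDate
    ON dbo.Orders (ReturnDate)
    WHERE ReturnDate IS NOT NULL;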

You're reading it wrong - it never takes 4x the space.
Specifically, it says 4* (4 extra bytes, see the footnote), not 4x (multiply by 4). The only case where a value takes exactly 4x the space is a char(4), which would still see savings if the NULLs exist more than 64% of the time.
"*The length is equal to the average of the data that is contained in the type, plus 2 or 4 bytes."

| datetime NULL | datetime SPARSE NULL | datetime SPARSE NULL |
|--------------------|----------------------|----------------------|
| 20171213 (8 bytes) | 20171213 (12 bytes) | 20171213 (12 bytes) |
| NULL (8 bytes) | 20171213 (12 bytes) | 20171213 (12 bytes) |
| 20171213 (8 bytes) | NULL (0 bytes) | NULL (0 bytes) |
| NULL (8 bytes) | NULL (0 bytes) | NULL (0 bytes) |
You lose the 4 extra bytes not just once per row, but for every cell in the row that is not null.
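A hedged way to see this for yourself (the table name is made up; sys.dm_db_index_physical_stats reports record sizes when run in DETAILED mode):

-- Hypothetical repro: one plain nullable datetime next to two sparse ones.
CREATE TABLE dbo.SparseSizeDemo
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    d1 DATETIME NULL,
    d2 DATETIME SPARSE NULL,
    d3 DATETIME SPARSE NULL
);

INSERT INTO dbo.SparseSizeDemo (d1, d2, d3)
VALUES ('20171213', '20171213', '20171213'),  -- each non-null sparse cell costs 8 + 4 bytes
       (NULL, NULL, NULL);                    -- each null sparse cell costs 0 bytes

-- The min/max/avg record sizes expose the per-cell overhead.
SELECT index_level, min_record_size_in_bytes, max_record_size_in_bytes, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SparseSizeDemo'), NULL, NULL, 'DETAILED');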

From SQL SERVER – 2008 – Introduction to SPARSE Columns – Part 2 by Pinal Dave:
All SPARSE columns are stored as one XML column in the database. Let us see some of the advantages and disadvantages of SPARSE columns.
Advantages of SPARSE columns are:
INSERT, UPDATE, and DELETE statements can reference the sparse columns by name. SPARSE columns can also work together as one XML column.
SPARSE columns can take advantage of filtered indexes, which cover only the rows where data is present.
SPARSE columns save a lot of database space when there are zero or null values in the database.
Disadvantages of SPARSE columns are:
A SPARSE column cannot have the IDENTITY or ROWGUIDCOL property.
SPARSE cannot be applied to text, ntext, image, timestamp, geometry, geography, or user-defined data types.
A SPARSE column cannot have a default value or rule, and cannot be a computed column.
A clustered index or a unique primary key index cannot be applied to a SPARSE column, and a SPARSE column cannot be part of a clustered index key.
A table containing a SPARSE column can have a maximum row size of 8018 bytes instead of the regular 8060 bytes. A table operation that involves a SPARSE column takes a performance hit over a regular column.
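The "one XML column" behaviour mentioned above refers to column sets. A minimal sketch, with made-up table and column names:

-- Hypothetical product table with rarely-used attributes.
CREATE TABLE dbo.Products
(
    ProductId INT IDENTITY(1,1) PRIMARY KEY,
    Name      NVARCHAR(100) NOT NULL,
    Colour    NVARCHAR(20)  SPARSE NULL,
    Voltage   DECIMAL(5,2)  SPARSE NULL,
    -- The column set exposes every sparse column as a single untyped XML column.
    SpecialAttributes XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);

INSERT INTO dbo.Products (Name, Colour) VALUES (N'Desk lamp', N'Red');

-- SpecialAttributes returns something like: <Colour>Red</Colour>
SELECT ProductId, Name, SpecialAttributes FROM dbo.Products;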


MS Access 2010 SQL query is rounding automatically to whole numbers

All,
I'm running the SQL query below in MS Access 2010. Everything works fine except that the "a.trans_amt" column is rounded to a whole number (i.e. the query returns 12.00 instead of 12.15, or 96.00 instead of 96.30). Any ideas? I'd like it to display 2 decimal places. I tried using the ROUND function but didn't have any success.
Thanks!
INSERT INTO [2-Matched Activity] ( dbs_eff_date, batch_id_r1, jrnl_name,
ledger, entity_id_s1, account_s2, intercompany_s6, trans_amt,
dbs_description, icb_name, fdt_key, combo )
SELECT a.dbs_eff_date,
a.batch_id_r1,
a.jrnl_name,
a.ledger,
a.entity_id_s1,
a.account_s2,
a.intercompany_s6,
a.trans_amt,
a.dbs_description,
a.icb_name,
a.fdt_key,
a.combo
FROM [1-ICB Daily Activity] AS a
INNER JOIN
(
SELECT
b.dbs_eff_date,
b.batch_id_r1,
b.jrnl_name,
sum(b.trans_amt) AS ["trans_amt"],
b.icb_name
FROM [1-ICB Daily Activity] AS b
GROUP BY dbs_eff_date, batch_id_r1, jrnl_name, icb_name
HAVING sum(trans_amt) = 0
) AS b
ON (a.dbs_eff_date = b.dbs_eff_date) AND (a.batch_id_r1 = b.batch_id_r1) AND
(a.jrnl_name = b.jrnl_name) AND (a.icb_name = b.icb_name);
Essentially, you are attempting to append decimal-precise values to an integer column. While MS Access does not raise a type exception, it will implicitly reduce precision to fit the destination storage. To avoid these undesired results, set the destination column's type appropriately ahead of time.
According to the MSDN docs, the MS Access database engine maintains the following numeric types:
REAL (4 bytes): A single-precision floating-point value with a range of ...
FLOAT (8 bytes): A double-precision floating-point value with a range of ...
SMALLINT (2 bytes): A short integer between -32,768 and 32,767.
INTEGER (4 bytes): A long integer between -2,147,483,648 and 2,147,483,647.
DECIMAL (17 bytes): An exact numeric data type that holds values ...
And the MS Access GUI exposes these as Field Sizes in the table design interface, where the default Field Size for a Number field is Long Integer.
Byte — For integers that range from 0 to 255. Storage requirement is a single byte.
Integer — For integers that range from -32,768 to +32,767. Storage requirement is two bytes.
Long Integer — For integers that range from -2,147,483,648 to +2,147,483,647 ...
Double — For numeric floating point values that range from -1.797 x 10^308 to ...
Replication ID — For storing a GUID that is required for replication...
Decimal — For numeric values that range from -9.999... x 10^27 to +9.999...
Therefore, in designing your database, schema, and tables, select the appropriate types to accommodate your needed precision. If you are not using the MS Access GUI, you can define the type in a DDL command:
CREATE TABLE [2-Matched Activity] (
...
trans_amt DOUBLE,
...
)
If the table already exists, consider altering the design with another DDL command:
ALTER TABLE [2-Matched Activity] ALTER COLUMN trans_amt DOUBLE
Do note: if you run CREATE and ALTER commands in the Query Design window, no prompts or confirmations will appear, but the changes will take effect.
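And if, after fixing the column type, you still want values rounded to two decimal places on output, ROUND can be applied in the SELECT list; a sketch using the same table and column names as the question:

SELECT a.dbs_eff_date,
       ROUND(a.trans_amt, 2) AS trans_amt_rounded
FROM [1-ICB Daily Activity] AS a;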

Calling preprocessing.scale on a heterogeneous array

I have this TypeError (shown below). I have checked my df and it contains only numbers, so could this be caused by the conversion to a numpy array? After the conversion the array has items like
[Timestamp('1993-02-11 00:00:00') 28.1216 28.3374 ...]
Any suggestion how to solve this, please?
df:
Date Open High Low Close Volume
9 1993-02-11 28.1216 28.3374 28.1216 28.2197 19500
10 1993-02-12 28.1804 28.1804 28.0038 28.0038 42500
11 1993-02-16 27.9253 27.9253 27.2581 27.2974 374800
12 1993-02-17 27.2974 27.3366 27.1796 27.2777 210900
X = np.array(df.drop(['High'], 1))
X = preprocessing.scale(X)
TypeError: float() argument must be a string or a number
While you're saying that your dataframe "all contains numbers only", you also note that the first column consists of datetime objects. The error is telling you that preprocessing.scale only wants to work with float values.
The real question, however, is what you expect to happen to begin with. preprocessing.scale centers values on the mean and normalizes the variance. This is such that measured quantities are all represented on roughly the same footing. Now, your first column tells you what dates your data correspond to, while the rest of the columns are numeric data themselves. Why would you want to normalize the dates? How would you normalize the dates?
Semantically speaking, I believe you should leave your dates alone. Whatever post-processing you're planning to perform on your numerical data, the normalized data should still be parameterized by the original dates. If you want to process your dates too, you need to come up with an explicit way to handle your dates to something numeric (say, elapsed time from a given date in given units).
So I believe you should drop your dates from your processing round altogether, and start with
X = df.drop(['Date', 'High'], axis=1).to_numpy()  # as_matrix() was removed in newer pandas versions

nvarchar(4001)?

MSDN has this to say on the subject:
nvarchar [ ( n | max ) ]
Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.
This leaves me confused. I can define a column as being 1 to 4,000 characters long, or 2,147,483,647 long, but nothing in between? Is my understanding correct? Why can't I be explicit about values in between?
NVARCHAR(MAX) covers everything else (not just 2 billion characters). If you need more than 4,000 characters the data is most certainly going to be off-page, so as far as behavior is concerned it doesn't matter if you've used 4,001 characters, 10,000 characters, or 10,000,000 characters. It only occupies the space you need, so don't think that you are wasting (2 billion characters - the length of your actual string).
MAX will accept values between 4,001 and 1,073,741,823 characters (bear in mind the storage size is approximately 2x the length of the actual string).
The restriction is basically that anything over 4,000 characters must be a MAX.
Because 4,000 characters or less has one behavior in terms of storage, and MAX has another behavior in terms of storage. And you really don't want to start forcing string-length calculations on things that are 1M characters long, do you? My current understanding is that up to 4,000 characters are stored in-table and MAX is stored out-of-table.
Also NVARCHAR(MAX) and VARCHAR(MAX) are replacements for text and ntext.
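A quick, hedged illustration of the "it only occupies the space you need" point (the variable names are arbitrary):

DECLARE @short nvarchar(4000) = N'hello';
DECLARE @big   nvarchar(max)  = N'hello';

-- Both report 10 bytes (5 characters x 2 bytes); MAX does not pre-allocate 2 GB.
SELECT DATALENGTH(@short) AS short_bytes,
       DATALENGTH(@big)   AS big_bytes;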

Assigning an empty binary value to a varbinary(MAX) column creates a column 8000 bytes long

I have a table with a varbinary(max) column. I am trying to assign to that column a zero-length binary buffer, but instead of getting a zero-length value in the table, I am getting an 8000-byte value filled with zeros:
* The dataSize column in the query shown was added using DATALENGTH(data) ("SELECT _index, dataSize=DATALENGTH(data), data FROM....") and shows the actual size of the value stored in the table.
Where does the 8000-byte empty buffer come from? Is this some kind of default behavior?
If your source column is binary(8000), then DATALENGTH(data) will return 8000 (it is fully padded) and data will contain the full 8000 bytes.
But since you are using
SELECT _index, dataSize=DATALENGTH(data), data FROM
It cannot be a binary(8000) column - because a fixed-size column would report the same DATALENGTH for all rows. It is likely that some data was copied there from a BINARY(8000) variable, or by some other means, at some point in the past.
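A hedged sketch of that likely cause: assigning an empty value through a fixed-length binary(8000) variable pads it to the full length, while assigning 0x directly to a varbinary(max) does not.

DECLARE @empty varbinary(max) = 0x;   -- truly empty
DECLARE @fixed binary(8000)   = 0x;   -- right-padded with zeros to 8000 bytes

SELECT DATALENGTH(@empty) AS varbinary_len,   -- 0
       DATALENGTH(@fixed) AS binary_len;      -- 8000

-- If the fixed-length variable is what ends up being inserted, the table stores
-- 8000 zero bytes even though the intended value was empty.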

Power-law distribution in T-SQL

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.
I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.
The curve doesn't have to fit precisely at all. Just give me about 1% "Smith" and about 1 in a million "Alderink"
Here's what I have so far.
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)
But this of course yields a uniform distribution.
I promise I'll still be trying to figure this out myself by the time a smarter person responds.
Why settle for a power-law distribution when you can draw from the actual distribution?
I suggest you alter the LastNames table to include a numeric column which would contain a value representing the actual number of individuals with a more common name. You'll probably want a number on a smaller but proportional scale, say, 10,000 for each percent of representation.
The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing about White, Johnson et al)
Smith 0
White 10,060
Johnson 19,123
Williams 28,456
...
Sanders 200,987
..
Alderink 999,997
And the name selection would be
SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(1000000 * RAND(), 0)
ORDER BY [number_described_above] DESC
That picks the first name whose number does not exceed the [uniformly distributed] random number. Note how the query uses less-than and descending ordering; this guarantees that the very first entry (Smith) can be picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard the random draws smaller than this value.
Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, like the two other responses so far, is essentially the same as the one suggested in dmckee's answer to the question referenced in this question. The idea is to use the CDF (cumulative distribution function).
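A hedged T-SQL sketch of that CDF approach, assuming the table stores a Freq column holding each name's frequency as a fraction (e.g. 0.01006 for Smith); the column name and the 1,000,000 scale are illustrative, not from the question:

-- Cumulative count of (scaled) individuals with a more common name than each row,
-- computed on the fly; in practice you would persist it, as suggested above.
;WITH cdf AS (
    SELECT LN.[LastName],
           (SELECT ISNULL(CAST(SUM(x.Freq) * 1000000 AS INT), 0)
            FROM [LastNames] AS x
            WHERE x.[Rank] < LN.[Rank]) AS more_common
    FROM [LastNames] AS LN
)
SELECT TOP 1 [LastName]
FROM cdf
WHERE more_common <= ROUND(1000000 * RAND(), 0)   -- single uniform draw, evaluated once per query
ORDER BY more_common DESC;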
Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power-law function which somehow conveys the "long tail" shape of the real distribution. You may want to tweak the @PwrCoef value (which, BTW, needn't be an integer); essentially, the bigger the coefficient, the more skewed towards the beginning of the list the function is.
DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)
Notes:
- the extra ".0" in the expression above is important to force SQL to perform floating-point operations rather than integer operations.
- the reason we subtract the power calculation from 88799 is that the calculation's distribution is such that the closer a number is to the end of our scale, the more likely it is to be drawn. Since the list of family names is sorted in reverse order (most likely names first), we need this subtraction.
Assuming a power of, say, 3 the query would then look something like
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)
Which is the query from the question except for the last line.
Re-Edit:
Looking at the actual distribution, as apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers, i.e. to perform an equal number of draws in each of, say, the three thirds (or four quarters, or...) of the cumulative distribution; within each of these partial lists we would draw using a power-law function, possibly with the same coefficient, but with different ranges.
For example
Assuming thirds, the list divides as follow:
First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, from to Gainer
Last third = 82,097 names, from Frisby to the end
If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coefficient for the first third (where we are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:
-- Random Drawing of a single Name in top third
-- Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)
-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
= (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)
-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
= (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)
Instead of storing the pdf as the rank, store the CDF (the sum of all frequencies up to that name, starting from Alderink).
Then modify your select to retrieve the first LN with rank greater than your formula result.
I read the question as "I need to get a stream of names which will mirror the frequency of last names from the 1990 US Census"
I might have read the question a bit differently than the other suggestions, and although an answer has been accepted, and a very thorough answer it is, I will contribute my experience with the Census last names.
I had downloaded the same data from the 1990 census. My goal was to produce a large number of names to be submitted for search testing during performance testing of a medical record app. I inserted the last names and the percentage of frequency into a table. I added a column and filled it with an integer which was the product of "total names required * frequency". The frequency data from the census did not add up to exactly 100%, so my total number of names was also a bit short of the requirement. I was able to correct the number by selecting random names from the list and increasing their count until I had exactly the required number; the randomly added count never amounted to more than 0.05% of the total of 10 million.
I generated 10 million random numbers in the range of 1 to 88,799. With each random number I would pick that name from the list and decrement the counter for that name. My approach was to simulate dealing a deck of cards, except my deck had many more distinct cards and a varying number of each card.
Do you store the actual frequencies with the ranks?
Converting the algebra from that accepted answer to T-SQL is no bother, if you know what values to use for n. y would be what you currently have, ROUND(88799 * RAND(), 0), and x0, x1 = 1, 88799, I think, though I might misunderstand it. The only non-standard maths operator involved from a T-SQL perspective is ^, which is just POWER(x,y) == x^y.
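For reference, a hedged T-SQL sketch of that algebra, i.e. the inverse-CDF form x = ((x1^(n+1) - x0^(n+1)) * y + x0^(n+1))^(1/(n+1)) for a power law p(x) ~ x^n on [x0, x1]. The exponent here is only an example, and y is taken as a plain RAND() draw in [0, 1) rather than the scaled value mentioned above:

-- Assumptions: n <> -1; n = -2 is chosen purely for illustration.
DECLARE @n  FLOAT = -2.0;
DECLARE @x0 FLOAT = 1.0;
DECLARE @x1 FLOAT = 88799.0;
DECLARE @y  FLOAT = RAND();   -- uniform draw in [0, 1)

SELECT ROUND(
           POWER((POWER(@x1, @n + 1) - POWER(@x0, @n + 1)) * @y + POWER(@x0, @n + 1),
                 1.0 / (@n + 1)),
           0) AS SampledRank;   -- skewed towards rank 1 (the most common names)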