I am trying to copy what salesforce did in their database architecture. Basically, they have a single oracle table with a thousand varchar(max) columns. They store all the customer data in this table. I am trying to accomplish the same thing with SQL Server. However, I am only able to get 308 varchar(max) fields in SQL server. I would like to know how is Oracle able to achieve 1000 column limit. I'd like to do the same thing in sql server.
A VARCHAR(MAX) field can hold GB's of information... but the max row size is 8060 bytes, so how does that add up? Well it doesn't store the 2GB in the row, it stores a 24 byte pointer instead. Those pointers are adding up to exceed your row size limit.
You could split the table out into multiple tables with fewer columns, but I don't think there is a way to override this limitation.
IMHO, a thousand columns seems more trouble than its worth. Perhaps you could take a more normalized approach.
For example, I have an Object Def table which is linked to an Extended Properties table. The XP table is linked to the OD and has fields XP-ITEM, XP-VALUE, XP-LM-UTC, and XP-LM-Usr. This structure allows any object to have any number of extended properties ... standard and/or non-standard.
The image below may give you a better visualization.
Just a couple of notes:
1) This is not for high volume transactional data i.e. Daily Loan Balances
2) Each Item ID can be linked back to an object which has it's own properties like Pick Lists, Excel Formats, etc.
3) Once can see the entire history of edits (who, what, and when)
Related
I'm playing with in-memory storage in oracle sql. I would like to compare the results of compression, I meant the amount of used space. For example, I'm running these queries:
ALTER TABLE RENTING INMEMORY MEMCOMPRESS FOR QUERY LOW(RETURN_DATE);
vs
ALTER TABLE RENTING INMEMORY MEMCOMPRESS FOR CAPACITY HIGH(RETURN_DATE);
Is there any easy way to check the size used by these compressions in SQL developer?
I found this article https://blogs.oracle.com/in-memory/database-in-memory-compression, there is a table containing 'space used' for each type of compression. This exactly what I am trying to do on my own. Thanks for any advices.
Querying v$im_segments after population will show you how many bytes from the table were loaded and how much of the in-memory store was utilised.
Since the column space is part of the In-Memory Compression Units (IMCU), there is no way to see how much space is consumed by individual columns. It is possible to display the individual column level compression setting in the view v$im_column_level though. The closest you could come would be to compare the populated size between the two compression levels. As Connor said, you can do this with v$im_segments or you can display individual IMCU information for an object with the view v$im_header.
Currently I am working on a problem where I have to log data in a Oracle10g database. I want to store data from up to 40 devices (but not necessarily always 40) as one data point, these share a bit of information and the rest is device specific.
So I could either create arrays for every device-specific column and if the device is in use the according array field is getting populated.
ID TIMESTAMP BOARD DEVICE_ID[40] ERROR_CNT[40] TEMP[40] MORE_DATA[40]...
But I think I would be wasting a lot of database space by doing it like that, because the arrays would be hardly populated
The other method I can think of would be to just use the same ID for a multi-line entry and then I put as many rows into the table as I have used devices.
ID TIMESTAMP BOARD DEVICE_ID ERROR_CNT TEMP MORE_DATA
1 437892 1 1 100 25 xxx
1 437892 1 2 50 28 yyy
Now the shared information is multiple times in the database and the data is shattered among multiple lines.
Another issue is that there might be columns used by a part of the devices and some do not carry that information, so there might be even more unused fields. So maybe it would be best to create multiple tables and split the devices into groups according to the information they have and log their data in the corresponding tables.
I appreciate any help, maybe I am even paranoid about wasted db space and should not worry about that and simply follow the 'easiest' approach, which I think would be the one with arrays.
Never store arrays in a database. Violating first normal form is a big mistake.
Worry more about how the data is queried than how it is stored. Keep the data model "dumb" and there are literally millions of people who can understand how to use it. There are probably only a few hundred people who understand Oracle object types.
For example, using object types, here is the simplest code to create a table, insert data, and query it:
drop table device;
create or replace type error_count_type is table of number;
create table device(id number, error_count error_count_type)
nested table error_count store as error_count_table;
insert into device values(1, error_count_type(10, 20));
commit;
select sum(column_value) error_count
from device
cross join table(error_count);
Not many people or tools understand creating types, store as, instantiating types, COLUMN_VALUE, or TABLE(...). Internally, Oracle stores arrays as tables anyway so there's no performance benefit.
Do it the simple way, with multiple tables. As Gordon pointed out, it's a small database anyway. Keep it simple.
I think this is too long for a comment:
1000 hours * 12/hour * 40 devices = 480,000 rows.
This is not a lot of data, so I wouldn't worry about duplication of values. You might want to go with the "other method" because it provides a lot of flexibility.
You can store all the data in columns, but if you get the columns wrong, you have to start messing around with alter table statements and that might affect queries you have already written.
I'm not sure which is faster. I have the need to store lists of possible data.
Currently I have an SQL table with the following structure being accessed with php.
boxID
place
name -- (serialNum, itemNum, idlock, etc, etc)
data
--(Note: The Primary Key here would be boxId, place, name, and data, to prevent duplicate data.)
The reason i set it up like this was to prevent creating columns per named data. Its a possibility in the future to have 5-10 different named data or more. Also possible to store 1,000 - 10,000 entries of data in one week for just one named data. It will be searched as well, like when i get place from a specific serialNum, then getting all data related to that place. (A specific serialNum, itemNum, idLock, etc, etc,)
But my concern is that my structure could be slower than just creating a named column for each named data. For example:
boxID
place
serialNum
itemNum
idLock
etc
etc
--(Note: Not even sure how to add keys to this if i would do it this way)
To sum it up: Which is faster and better practice? (keep in mind im still a novice with SQL)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically which turn into columns. That is, the physical model and the logic model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have -- and if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space. That is, it is more efficient to store it in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns. A volume of data of 1,000 - 10,000 rows per week is not that much data. That turns into 50,000 - 500,000 rows per year, which SQL Server should be easily able to handle the volume. You don't say how many named entities you have, but table with millions or tens of millions of rows are quite reasonable for modern databases.
I am looking for an up to date tool to accurately calculate the total row size and page-density of any SQL table definition for SQL Server 2005+.
Please note that there are plenty of resources concerning calculating sizes of rows in existing tables, estimating techniques for sizing, etc... However, I am designing tables and have some options about column size which I am trying to balance with efficient data access - meaning that I can relocate less-frequently accessed long text into dedicated tables to allow the most frequent access of these new tables to operate at optimum speed.
Ideally there would be an online facility where a create statement can be cut and pasted, or a sproc I can run on a dev db.
and The answer is a simple one until you start making proper table design and balance that against joins and FK data and disk access.
I'd have a look an see how many data pages you are using and remember that one reads an extend (8 data pages) from disk, not only the data page you are looking for. Then there is the option for data compression in your table as well as sparse columns and out of row type of data storage and variable length characters.
It's not about how much data is in a column, it's really about how many data reads and CPU you need to get it. this you can test when executing a Query and looking against the ACTUAL QUERY PLAN.
As for space used you can use a stored procedure called sp_spaceused. here is a source you can use to see how one could use it in dbforms
Hope it helps
Walter
I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the to older DBs to be stored in one table. This table has only a primary key, a foreign key (both int's), a datetime and a decimal field, but adding the count of rows from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly. consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes), you might see that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually splitting the tables in 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and provide functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size savings will probably have a cost performance wise if you need to do complex datetime queries. Although on cubes there is the standard technique of storing, instead of an epoch, an int in YYYYMMDD format.
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store similar set of items, you need one and one only table. Setting multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require to automate a bunch of tasks. Let's see you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert if you must create a new table? How it would affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes which means that you can store the entire table in less than 4 GB of memory. If you have a decent server(8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time before the cache is populated.