Saving different types of values to a SQL database - sql

I am currently writing an application which will have a lot of transactions.
Each transaction will have a value, although the value can be an int, bit, short string, large string, etc.
I want to try to keep processing and storage to a minimum as I would like to run this in the cloud. Should I have a lot of different fields on the transaction, e.g.
TransactionLine.valueint
TransactionLine.valuestring
TransactionLine.valuedecimal
TransactionLine.valuebool
or should I have separate tables for each transaction value type?
TransactionLine - Table
-----------------------
TransactionLine.ValueId

ValueInt - Table
----------------
ValueInt.ValueId
ValueInt.Value

ValueString - Table
-------------------
ValueString.ValueId
ValueString.Value
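
For reference, here is the first option as a rough T-SQL sketch. The names are just reused from the sketches above, and the CHECK constraint is an optional extra to keep exactly one typed column populated per row.

-- Sketch only: one nullable column per supported type (names taken from above).
CREATE TABLE TransactionLine (
    TransactionLineId INT IDENTITY(1,1) PRIMARY KEY,
    ValueInt     INT            NULL,
    ValueDecimal DECIMAL(18, 4) NULL,
    ValueBool    BIT            NULL,
    ValueString  NVARCHAR(MAX)  NULL,
    -- Optional: ensure exactly one of the typed columns is filled in per row.
    CONSTRAINT CK_TransactionLine_OneValue CHECK (
        (CASE WHEN ValueInt     IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN ValueDecimal IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN ValueBool    IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN ValueString  IS NOT NULL THEN 1 ELSE 0 END) = 1
    )
);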

You could store key-value pairs in the database. The only data type that can store any other data type is a VARCHAR(MAX) or a BLOB. That means that all data must be converted to a string before it can be stored. That conversion will take processing time.
In the opposite direction, when you want to do a SUM or a MAX or an AVG of numeric data, you will first have to convert the string back to its real data type. That conversion too will take processing time.
Databases are read a lot more than written to. The conversion nightmare will bring your system to its knees. There has been a lot of debate on this topic; the high cost of conversions is the killer.
There are systems that store the whole database in one single table. But in those cases the whole system is built with one clear goal: to support that system in an efficient way in a fast compiled programming language, like C, C++ or C#, not in a relational database language like SQL.
I'm not sure I fully understand what you really want. If you only want to store the transactions, this may be worth trying. But why do you want to store them one field at a time? Data is stored in groups, in records. And the data type of each and every column in a record is known at the creation time of the table.
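
As a rough illustration of that conversion cost, assuming a hypothetical key-value table where every value is stored as a string (TRY_CAST assumes SQL Server 2012 or later; on older versions it would be a plain CAST):

-- Hypothetical key-value layout: every value forced into a string.
CREATE TABLE TransactionValue (
    ValueId   INT          NOT NULL,
    ValueKey  VARCHAR(100) NOT NULL,
    ValueText VARCHAR(MAX) NOT NULL   -- ints, decimals, bools and text all stored as text
);

-- Summing "numeric" data means converting the string back on every row it touches,
-- which burns CPU and prevents the optimizer from using a numeric index.
SELECT SUM(TRY_CAST(ValueText AS DECIMAL(18, 4))) AS Total
FROM TransactionValue
WHERE ValueKey = 'amount';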

You should really look into Cassandra. When you say a lot of transactions, do you mean millions of records? For Cassandra, handling millions of records is the norm. You will have a column family (in an RDBMS, a table is similar to a column family) storing many rows, and for each row you do not need to predefine the columns. They can be defined on demand, thus reducing storage dramatically, especially if you are dealing with a lot of records.
You do not need to worry whether the data is of type int, string, decimal or bool, because the default data type for a column value is BytesType. There are other data types which you can predefine too in the column family's column metadata if you want to. Since you are starting to write an application, I suggest you spend some time reading into Cassandra and how it would help you in your situation.

Related

NoSQL or SQL database for big temp data

I'm new to the data world.
I want to store temporary data in a database. The volume might be 10,000 rows per transaction. Data will be removed in less than 1 hour (based on creationTime). There won't be any complex queries, and there might be 1,000 or more transactions per minute.
For this need, which database type would be useful?
It depends on the data structure and the use cases, not on the volume. If the data always follows the same structure, I'd go for SQL. If the data sets are not easy to structure (e.g. they have many different fields but only a few are filled most of the time), NoSQL seems to be the better match.
There are databases for both types, which are able to handle large amounts of data.
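
If the SQL route is taken, a minimal sketch (table and column names here are purely hypothetical) is a narrow table with an index on creationTime and a scheduled cleanup, e.g. from a SQL Server Agent job:

-- Sketch: narrow staging table; all names are illustrative.
CREATE TABLE TempTransactionData (
    Id            BIGINT IDENTITY(1,1) PRIMARY KEY,
    TransactionId INT            NOT NULL,
    Payload       NVARCHAR(400)  NULL,
    CreationTime  DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Index to make the time-based cleanup and any time-range reads cheap.
CREATE INDEX IX_TempTransactionData_CreationTime
    ON TempTransactionData (CreationTime);

-- Run on a schedule (e.g. every few minutes) to enforce the 1-hour retention.
DELETE FROM TempTransactionData
WHERE CreationTime < DATEADD(HOUR, -1, SYSUTCDATETIME());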

Index on string column on big MSSQL database

I have to design a database to store some production process measurement data. The data would be collected from PLCs. I would like to store this data on a remote server, to which all the machines have access. I would like to store the following data:
timestamp
plant
production line
machine
workpiece number
measurement unit
measurement type
Some machines share the same measurement types, some have unique ones. Is it a good solution to make the measurement type column of type VARCHAR and give the PLC programmers a free hand in naming new measurement types, without my intervention to insert a new entry in a related foreign-key table and hand them the new ids? The expected count of unique measurement types is around 100. Is an index on this column a solution for later filtering and selecting from this table, which is expected to have around 50 billion rows in a year?
The size of the table would probably also become a big issue.
EDIT: Should I also separate the measurement value and measurement type into a different table from the part information?
Is there a way for SQL Server to take care of adding new measurement types to some internal table and handling the ids?
Hopefully I have explained my question well enough; otherwise, please ask in a comment.
Regards
I would say create a new measurement type table. Since, as you mentioned, there are fewer than 100 types, make the Id column a TINYINT, which will save space and also help when creating the index.
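
A sketch of that (all names hypothetical): the big measurement table carries only the one-byte id, and a small lookup table holds the ~100 names. A small "insert if missing" stored procedure on the lookup table would also cover the question about letting SQL Server hand out the ids without manual intervention.

-- Sketch only: names are illustrative.
CREATE TABLE MeasurementType (
    MeasurementTypeId TINYINT IDENTITY(1,1) PRIMARY KEY,
    Name              VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE Measurement (
    MeasurementId     BIGINT IDENTITY(1,1) PRIMARY KEY,
    MeasurementTypeId TINYINT NOT NULL
        REFERENCES MeasurementType (MeasurementTypeId),
    MeasuredAt        DATETIME2      NOT NULL,
    Value             DECIMAL(18, 6) NOT NULL
);

-- An index on the 1-byte id is far smaller than one on a repeated VARCHAR name,
-- which matters at 50 billion rows a year.
CREATE INDEX IX_Measurement_TypeId ON Measurement (MeasurementTypeId);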

How is Oracle able to achieve 1000 columns?

I am trying to copy what Salesforce did in their database architecture. Basically, they have a single Oracle table with a thousand varchar(max) columns. They store all the customer data in this table. I am trying to accomplish the same thing with SQL Server. However, I am only able to get 308 varchar(max) fields in SQL Server. I would like to know how Oracle is able to achieve a 1000-column limit; I'd like to do the same thing in SQL Server.
A VARCHAR(MAX) field can hold GBs of information... but the max row size is 8060 bytes, so how does that add up? Well, it doesn't store the 2 GB in the row; it stores a 24-byte pointer instead. Those pointers add up to exceed your row size limit.
You could split the table out into multiple tables with fewer columns, but I don't think there is a way to override this limitation.
IMHO, a thousand columns seems more trouble than it's worth. Perhaps you could take a more normalized approach.
For example, I have an Object Def table which is linked to an Extended Properties table. The XP table is linked to the OD and has fields XP-ITEM, XP-VALUE, XP-LM-UTC, and XP-LM-Usr. This structure allows any object to have any number of extended properties ... standard and/or non-standard.
Just a couple of notes:
1) This is not for high-volume transactional data, e.g. daily loan balances.
2) Each Item ID can be linked back to an object which has its own properties like pick lists, Excel formats, etc.
3) One can see the entire history of edits (who, what, and when).
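
A rough T-SQL approximation of that shape, with table and column names adapted from the description above (they are not the exact originals):

-- Sketch only: names approximated from the description above.
CREATE TABLE ObjectDef (
    ObjectId   INT IDENTITY(1,1) PRIMARY KEY,
    ObjectName NVARCHAR(200) NOT NULL
);

CREATE TABLE ExtendedProperty (
    ObjectId  INT NOT NULL REFERENCES ObjectDef (ObjectId),
    XP_Item   NVARCHAR(200) NOT NULL,   -- property name / item id
    XP_Value  NVARCHAR(MAX) NULL,       -- property value (standard or non-standard)
    XP_LM_UTC DATETIME2     NOT NULL,   -- last-modified timestamp (UTC)
    XP_LM_Usr NVARCHAR(128) NOT NULL,   -- last-modified user
    -- Keeping the timestamp in the key preserves the full who/what/when edit history.
    PRIMARY KEY (ObjectId, XP_Item, XP_LM_UTC)
);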

Best way to structure streamed data with missing table fields to benefit filesize

I currently have a service which provides live data every second in JSON format, and I save it to a SQL Server table.
Typically the table is approx 20 fields of varchar, int and decimal, and each row/record is a single timestamp for each second. Both the JSON and the INSERT query contain data for all fields on every timestamp.
In order to speed up response times and reduce transmitted bytes, the JSON in future will only contain changes to the data (i.e. the value is different from the previous value), so many fields will not be contained in the JSON.
My question is what is the best way to store this in SQL to also benefit from the reduction in data - is there a better way to do this? If I used the same table structure with NULL entries, then surely this would be the same byte size based on the field type anyway?
Edit: The new streaming format would mean the following
Each timeframe will still have data values but they would not be in the JSON array if there was no data change from previous values.
I'm looking at saving disk space. I'm happy to rebuild the data when required with post processing outside of SQL to get 'full' data for any particular timestamp.
Possibly it might be better to just store the full JSON response string with timestamp?
Not familiar with JSON and the whole idea, but the best way to store possible NULLs is to put fixed-size fields (INT, BIT, SMALLINT, DECIMAL, FLOAT, etc.) at the beginning of the table and variable-sized fields (VARCHAR, NVARCHAR, XML, JSON, etc.) at the end of your table.
A second piece of advice would be to use temporal tables in SQL Server 2016. That may store the data in the best way (it needs some research), but it will significantly simplify extracting and handling the data.
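
A minimal sketch of that second suggestion, assuming SQL Server 2016+ system-versioned temporal tables and hypothetical column names: keep one "current state" row per stream, apply each JSON delta as an UPDATE of only the changed columns, and let the history table hold the previous versions.

-- Sketch, assuming SQL Server 2016+; all names are illustrative.
CREATE TABLE StreamState (
    StreamId  INT            NOT NULL PRIMARY KEY,
    Price     DECIMAL(18, 6) NULL,
    Quantity  INT            NULL,
    Label     VARCHAR(50)    NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.StreamStateHistory));

-- Rebuilding the "full" row for any particular second is then a point-in-time query:
SELECT *
FROM StreamState FOR SYSTEM_TIME AS OF '2016-06-01T12:00:00'
WHERE StreamId = 1;

Because rows are only updated when a delta actually arrives, history rows accumulate only for changes, which lines up with the delta-only stream.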

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding up the row counts from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes); you might find that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
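
For the missing-index part, SQL Server's missing-index DMVs are one place to start; a rough query (the ordering is a simple heuristic, not a definitive ranking):

-- Rough starting point: candidate missing indexes suggested by the optimizer.
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details     AS d
JOIN sys.dm_db_missing_index_groups      AS g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;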
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
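
As a sketch of the partitioning idea (names and boundary dates are illustrative; note that table partitioning in SQL Server 2005 requires Enterprise Edition):

-- Sketch: range partitioning by year; extend the boundary list as years are added.
CREATE PARTITION FUNCTION pfByYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01', '2010-01-01');

CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- Columns mirror the description in the question: two ints, a datetime and a decimal.
CREATE TABLE HistoryFact (
    Id         INT           NOT NULL,
    SupplierId INT           NOT NULL,
    RecordedAt DATETIME      NOT NULL,
    Amount     DECIMAL(9, 2) NOT NULL,
    CONSTRAINT PK_HistoryFact PRIMARY KEY (RecordedAt, Id)
) ON psByYear (RecordedAt);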
Another possible size improvement is using an int (like an epoch) instead of a datetime and providing functions to convert from datetime to int, giving queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably have a performance cost if you need to do complex datetime queries, although in cubes there is the standard technique of storing an int in YYYYMMDD format instead of an epoch.
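
The dateTimeToEpoch helper in that query is not built in; a sketch of it as a scalar function (assuming Unix-epoch seconds and ignoring time zones) might be:

-- Sketch: seconds since 1970-01-01 as an INT (fits until 2038 in this data type).
CREATE FUNCTION dbo.dateTimeToEpoch (@dt DATETIME)
RETURNS INT
AS
BEGIN
    RETURN DATEDIFF(SECOND, '1970-01-01', @dt);
END;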
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL Server 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store a similar set of items, you need one and only one table. Creating multiple tables to store the same type of thing will cause problems, like:
Queries would be extremely difficult to write, and performance would suffer if you have to query across multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted at 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would that affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL Server 2005 does not limit the number of rows in a table.
If you go down this road and run into performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes, which means that you can store the entire table in less than 4 GB of memory. If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super-fast queries after a slower warm-up time before the cache is populated.