SQLite database - attribute as table vs column?

I'm planning on making a SQLite database to hold data acquired from numerous sensors. The data would be simple things like date added, volume in mL, temperature, etc.
Would it be good practice to create one table per sensor? Or am I better off creating a column for the sensor name instead and putting everything under one table? I plan to query data from sensor(s) based on the date added attribute.
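As a rough illustration of the single-table option, here is a minimal sketch using Python's built-in sqlite3 module. The table name `reading`, its column names, and the sample sensor names are assumptions made up for this example; the question only says the data includes date added, volume in mL and temperature. Dates are stored as ISO-8601 text so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reading (
        sensor      TEXT NOT NULL,   -- sensor name (hypothetical column)
        date_added  TEXT NOT NULL,   -- ISO-8601 text: string order = time order
        volume_ml   REAL,
        temperature REAL
    );
    -- One composite index serves "this sensor, this date range" queries.
    CREATE INDEX idx_reading_sensor_date ON reading (sensor, date_added);
""")

rows = [
    ("tank_a", "2024-01-01T08:00:00", 120.0, 21.5),
    ("tank_a", "2024-01-02T08:00:00", 118.5, 21.9),
    ("tank_b", "2024-01-01T08:05:00", 300.0, 19.0),
]
conn.executemany("INSERT INTO reading VALUES (?, ?, ?, ?)", rows)


def readings_for(sensor, start, end):
    """Return readings for one sensor with start <= date_added < end."""
    cur = conn.execute(
        "SELECT date_added, volume_ml, temperature FROM reading "
        "WHERE sensor = ? AND date_added >= ? AND date_added < ? "
        "ORDER BY date_added",
        (sensor, start, end),
    )
    return cur.fetchall()


tank_a_jan = readings_for("tank_a", "2024-01-01", "2024-02-01")
```

With this layout, adding a sensor means inserting rows with a new `sensor` value; no schema change is needed.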


Index on string column on big MSSQL database

I have to design a database to store some production process measurements data. The data would be collected from PLCs. I would like to store this data on a remote server, to which all the machines have access. I would like to store the following data:
production line
workpiece number
measurement unit
measurement type
Some machines share the same measurement types, while others have unique ones. Would it be a good solution to make the measurement type column a VARCHAR and give the PLC programmers a free hand in naming new measurement types, so that I don't have to insert a new entry into a related lookup table and hand them the new ids? The expected number of unique measurement types is around 100. Would an index on this column be enough for later filtering and selecting from this table, which is expected to grow by around 50 billion rows a year?
Probably also the size of the table would become a big issue.
EDIT: Should I also separate measurement value and measurement type into a different table from the part information?
Is there a way for SQL Server to take care of adding a new measurement type to some internal table and handling the ids?
I hope I have explained my question well enough; otherwise, please ask in a comment.
I would create a separate measurement type table, as you mentioned. Since you expect only around 100 measurement types, make its Id column a TINYINT (0 to 255 in SQL Server), which saves space in the large table and keeps the index on that column small.
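A minimal sketch of that lookup-table idea, using Python's sqlite3 module as a stand-in for SQL Server. All table, column and function names here (`measurement_type`, `measurement`, `type_id`, `record`) are hypothetical, and SQLite has no TINYINT storage class, so the id is a plain INTEGER. The "get or create" helper shows one way new type names could be registered without manual intervention; it is an assumption of this example, not something the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement_type (
        id   INTEGER PRIMARY KEY,        -- would be TINYINT on SQL Server
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE measurement (
        workpiece_no INTEGER NOT NULL,
        type_id      INTEGER NOT NULL REFERENCES measurement_type (id),
        value        REAL NOT NULL
    );
    CREATE INDEX idx_measurement_type ON measurement (type_id);
""")


def type_id(name):
    """Return the id for a type name, creating the lookup row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO measurement_type (name) VALUES (?)", (name,)
    )
    row = conn.execute(
        "SELECT id FROM measurement_type WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def record(workpiece_no, type_name, value):
    """Store one measurement, referencing the type by its small integer id."""
    conn.execute(
        "INSERT INTO measurement VALUES (?, ?, ?)",
        (workpiece_no, type_id(type_name), value),
    )


record(1001, "torque", 12.5)
record(1001, "press_depth", 3.2)
record(1002, "torque", 12.7)
type_count = conn.execute("SELECT COUNT(*) FROM measurement_type").fetchone()[0]
```

The big table then stores and indexes a small integer per row instead of a repeated string.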

SQL Structure, Dynamic Two Columns or Unique Columns

I'm not sure which is faster. I have the need to store lists of possible data.
Currently I have an SQL table with the following structure being accessed with php.
name -- (serialNum, itemNum, idlock, etc, etc)
--(Note: The Primary Key here would be boxId, place, name, and data, to prevent duplicate data.)
The reason I set it up like this was to avoid creating a column per named data item. In the future there may be 5-10 different named data items or more. It is also possible that 1,000 - 10,000 entries will be stored in one week for just one named data item. The data will be searched as well, for example when I get the place from a specific serialNum and then fetch all data related to that place (a specific serialNum, itemNum, idLock, etc.).
But my concern is that my structure could be slower than just creating a named column for each named data. For example:
--(Note: Not even sure how to add keys to this if i would do it this way)
To sum it up: Which is faster and better practice? (Keep in mind I'm still a novice with SQL.)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically turn into columns. That is, the physical model and the logical model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have -- and if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space. That is, it is more efficient to store it in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns. A volume of 1,000 - 10,000 rows per week is not that much data. That turns into roughly 50,000 - 500,000 rows per year, a volume SQL Server should handle easily. You don't say how many named entities you have, but tables with millions or tens of millions of rows are quite reasonable for modern databases.

Correlation between amount of rows and amount columns in database performance

Is there a correlation between the number of rows/number of columns used and its impact on (MS)SQL database performance?
A little more background:
We have to store lots of data from measurement devices. These devices send a string with data back to us around 100 times a day. Each string contains roughly 300 fields. Assuming we have 100 devices in operation, we get 10,000 records back every day. At our back-end we split these data strings and put them into the database. If the format of these data strings is fixed, we add around 10,000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and add a new column now and then when it's needed.
For ease of maintenance we'd like to choose the first approach. However, that means we're adding 100*100*300 = 3,000,000 rows each day. Data has to be stored for 1 year and a month (395 days), so we end up at around 1.2 billion rows, not counting the expected growth.
Is it from a performance perspective smarter to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing for the 'horizontal' solution, are there recommendations for adding columns to the table?
I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair, we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity Attribute Value, models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are:
What is so bad about EAV, anyway?
dave’s guide to the eav
My guess is these sensors don't just start sending you extra fields. You have to release new sensors or sensor code for this to happen. That's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you this argument is null and void and you may be stuck with an EAV.
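To make the trade-off concrete, here is a small sketch in Python's sqlite3 (standing in for MS SQL Server) that stores the same message both ways and then adds a new field. The table names `reading_wide` and `reading_eav` and the fields `temp`, `pressure` and `humidity` are invented for illustration. It shows why inserts are easy in the vertical layout while reading a full record back needs a pivot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Horizontal: one row per message, one column per field.
    CREATE TABLE reading_wide (device_id INTEGER, ts TEXT, temp REAL, pressure REAL);
    -- Vertical (EAV): one row per field per message.
    CREATE TABLE reading_eav (device_id INTEGER, ts TEXT, field TEXT, value REAL);
""")
conn.execute("INSERT INTO reading_wide VALUES (1, '2024-01-01T00:00:00', 20.5, 1.01)")
conn.executemany(
    "INSERT INTO reading_eav VALUES (1, '2024-01-01T00:00:00', ?, ?)",
    [("temp", 20.5), ("pressure", 1.01)],
)

# Horizontal: a new field is a schema change; existing rows get NULL.
conn.execute("ALTER TABLE reading_wide ADD COLUMN humidity REAL")
# Vertical: a new field is just another row, no schema change.
conn.execute(
    "INSERT INTO reading_eav VALUES (1, '2024-01-01T00:00:00', 'humidity', 0.45)"
)

# Reading a whole record back: trivial for the wide table...
wide_row = conn.execute(
    "SELECT temp, pressure, humidity FROM reading_wide"
).fetchone()

# ...but the EAV table has to be pivoted, one CASE expression per field.
eav_row = conn.execute("""
    SELECT MAX(CASE WHEN field = 'temp'     THEN value END),
           MAX(CASE WHEN field = 'pressure' THEN value END),
           MAX(CASE WHEN field = 'humidity' THEN value END)
    FROM reading_eav
    GROUP BY device_id, ts
""").fetchone()
```

With about 300 fields per message, the pivot query grows to about 300 CASE expressions, which is one reason selects on EAV tables tend to be painful.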
For the horizontal option you can split the table, putting the frequently used columns in one table and the less-used ones in a second; both tables have the same primary key values, so you can link the less-used columns to the more-used ones. You can also use the RDBMS's built-in partitioning functionality to separate each day's (or week's or month's) data from the others'.
Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
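The "device x on date y" case can be sketched with Python's sqlite3 module. The table `entry`, its columns and the index name are assumptions for this example, and the query planner shown is SQLite's, not SQL Server's, though the idea of a composite index on (device id, timestamp) carries over. The plan text is inspected to check that the index is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (device_id INTEGER, ts TEXT, payload TEXT)")
# Composite index: equality on device_id first, then a range on ts.
conn.execute("CREATE INDEX idx_entry_device_ts ON entry (device_id, ts)")

# 100 devices x 28 days of sample rows, one entry per device per day.
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [
        (device, f"2024-01-{day:02d}T12:00:00", "x")
        for device in range(1, 101)
        for day in range(1, 29)
    ],
)

query = (
    "SELECT ts, payload FROM entry "
    "WHERE device_id = ? AND ts >= ? AND ts < ? ORDER BY ts"
)
params = (42, "2024-01-10", "2024-01-11")

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params))
rows = conn.execute(query, params).fetchall()
```

If the plan mentions `idx_entry_device_ts`, the engine is seeking directly to the matching rows rather than scanning the whole table.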
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is equally split between the types. So you have one table with 100 rows of 33 columns. That's it. It never changes. Linked to that is a table with at least 100 rows of 33 columns, to which maybe several new rows are added each day. Finally, linked to the second table is a table with rows of 33 columns that possibly grows by the full 10K every day.
This minimizes the grow-space required by the online database. The warehouse could then denormalize back to one huge table for ease of querying.
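The three-way split might look roughly like the following sketch, again in Python's sqlite3. Every table and column name (`device`, `device_config`, `reading`, `model`, `firmware`, `temp`) is a placeholder invented for this example, with one column standing in for each group of about 33. The final join shows how the data could be flattened back into one wide row per reading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (                -- type 1: never changes
        device_id INTEGER PRIMARY KEY,
        model     TEXT NOT NULL
    );
    CREATE TABLE device_config (         -- type 2: rarely changes
        config_id  INTEGER PRIMARY KEY,
        device_id  INTEGER NOT NULL REFERENCES device (device_id),
        valid_from TEXT NOT NULL,
        firmware   TEXT NOT NULL
    );
    CREATE TABLE reading (               -- type 3: changes with every message
        config_id INTEGER NOT NULL REFERENCES device_config (config_id),
        ts        TEXT NOT NULL,
        temp      REAL
    );
""")
conn.execute("INSERT INTO device VALUES (1, 'M-100')")
conn.execute("INSERT INTO device_config VALUES (10, 1, '2024-01-01', 'fw-1.0')")
conn.executemany(
    "INSERT INTO reading VALUES (10, ?, ?)",
    [("2024-01-01T00:00:00", 20.5), ("2024-01-01T00:15:00", 20.7)],
)

# Denormalize back to one wide row per reading, as a warehouse load might.
flat = conn.execute("""
    SELECT d.device_id, d.model, c.firmware, r.ts, r.temp
    FROM reading r
    JOIN device_config c ON c.config_id = r.config_id
    JOIN device d        ON d.device_id = c.device_id
    ORDER BY r.ts
""").fetchall()
```

Only the `reading` table grows with every message; the other two stay small.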

How do you append to a Hive array?

I have a Hive table where, for each user ID, I have a ts column holding a time series stored as an array. I want to maintain the time series as a window of the most recent values.
(a) how do I append a new number to the end of each column from another table joined by ID?
(b) how do I drop the leading number?
Data in Hive is typically stored in HDFS. HDFS has limited append capabilities. If the constant modification of data is at the core of your analytics systems, then perhaps you should consider using alternatives like HBase or Cassandra.
However, if the data updates are a small part of your workflow, I would encourage you to continue using Hive (in order to make use of its SQL-like functionality) but reconsider your design for storing these updates.
A quick solution to your problem would be to have more than one record per user ID in your table. Each record would hold one point of the time series for that user ID. When you want to do your last-N analysis on the time series, select from the table using DISTRIBUTE BY on the user ID column. Your custom reducer will simply pick out the last N (or fewer, if the time series is shorter than N) timestamps and return them.
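The reducer logic itself is small. Below is a plain-Python sketch of "keep the last N values per user"; it does not use any Hive API, and the function name `last_n_per_user` and the `(user_id, timestamp, value)` record shape are assumptions for illustration. A bounded deque covers both parts of the question: appending a new value and dropping the leading one once the window is full.

```python
from collections import defaultdict, deque


def last_n_per_user(records, n):
    """Keep the n most recent values per user.

    records: iterable of (user_id, timestamp, value) tuples in any order.
    Returns {user_id: [values, oldest first]}.
    """
    windows = defaultdict(lambda: deque(maxlen=n))
    # Sort by user, then timestamp, so values arrive oldest-to-newest.
    for user_id, _ts, value in sorted(records, key=lambda r: (r[0], r[1])):
        # Appending to a full deque silently drops the oldest entry.
        windows[user_id].append(value)
    return {user: list(window) for user, window in windows.items()}


records = [
    ("u1", 3, 30.0),
    ("u1", 1, 10.0),
    ("u2", 1, 5.0),
    ("u1", 2, 20.0),
    ("u1", 4, 40.0),
]
result = last_n_per_user(records, 3)
```

Here user `u1` has four points, so the oldest is dropped; `u2` has fewer than N and keeps everything.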
Harish Butani also did some work on Windowing functions in Hive. You can also take a look at his work and associated documentation to gain some more insight. Good luck, Alexy!

archiving the table : searching for the best way

There is a table which has 80,000 rows.
Every day I will clone this table to another log table with a name like 20101129_TABLE, and the prefix will change each day according to the date.
By that calculation, the data will grow by 2,400,000 rows every month.
Any advice on saving space and keeping access fast, along with other advantages and disadvantages, would be welcome. How should I approach creating the best archive or log?
The table holds account information: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design-choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That gives you the option of storing the database on a separate disk (cost-efficiency), and you can have a separate backup schedule for that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this depends really on how you plan on accessing the data later.
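A minimal sketch of the single-table variant, using Python's sqlite3 as a stand-in for SQL Server. The table names `account` and `account_archive`, the column names and the `snapshot` helper are hypothetical; the question only says the table holds account info such as branch code and balance. Each daily run copies the current rows into the archive table, stamped with the archive date, instead of creating a new dated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        account_no  INTEGER PRIMARY KEY,
        branch_code TEXT,
        balance     REAL
    );
    CREATE TABLE account_archive (
        archive_date TEXT NOT NULL,      -- ISO date of the snapshot
        account_no   INTEGER NOT NULL,
        branch_code  TEXT,
        balance      REAL,
        PRIMARY KEY (archive_date, account_no)
    );
""")
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, "B01", 100.0), (2, "B02", 250.0)],
)


def snapshot(archive_date):
    """Copy the current state of account into the archive for one date."""
    conn.execute(
        "INSERT INTO account_archive "
        "SELECT ?, account_no, branch_code, balance FROM account",
        (archive_date,),
    )


snapshot("2010-11-29")
conn.execute("UPDATE account SET balance = 90.0 WHERE account_no = 1")
snapshot("2010-11-30")

# History of one account across snapshots, with no dynamic table names needed.
history = conn.execute(
    "SELECT archive_date, balance FROM account_archive "
    "WHERE account_no = 1 ORDER BY archive_date"
).fetchall()
```

Querying across days then needs no dynamically built table names, and the primary key on (archive_date, account_no) prevents archiving the same day twice.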
Consider TABLE PARTITIONING (MSDN) - it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all the data in the same table and let MSSQL do the hard work in the background (choosing which partition to use based on the select criteria, etc.).