So, I am trying to learn how to set up good, and usable databases. I have ran into a problem involving storing large amounts of data correctly. The database I am using is MSSQL 2008. For example:
We test about 50,000 devices a week. Each one of these devices have a lot of data associated with them. Overall, we are just looking at the summary of data calculated from the raw data. The summary is easy to handle, its just the raw data I'm trying to enter into a database for future use in case someone wants more details.
For the summary, I have a database full of tables for each set of 50,000 devices. But, each device there is data similar to this:
("DevID") I,V,P I,V,P I,V,P ...
("DevID") WL,P WL,P WL,P ...
Totaling to 126 (~882 chars) data points for the first line and 12000 (~102,000 chars) data points for the second line. What would be the best way to store this information? Create a table for each and every device (this seems unruly)? Is there a data type that can handle this much info? I am just not sure.
Thanks!
EDIT: Updated ~char count and second line data points.
You could just normalize everything into one table
CREATE TABLE device
( id BIGINT AUTO_INCREMENT PRIMARY KEY
, DevID INT
, DataPoint VARCHAR
, INDEX(DevID))
Psudocode obviously, since I don't know your exact requirements.
Does this data represent a series of readings over time? Time-series data tends to be highly repetetive. So a common strategy is to compress it in ways that avoid storing every single value. For example use run-length encoding, or associate time intervals with each value instead of single points.
Related
I'm running SQL Server and I have a table of user profiles which contains columns for the user's personal info and a profile picture.
When setting up the project, I was given advice to store the profile image in the database. This seemed OK and worked fine, but now I'm dealing with real data and querying more rows the data is taking a lifetime to return.
To pull just the personal data, the query takes one second. To pull the images I'm looking at upwards of 6 seconds for 5 records.
The column is of type varchar(max) and the size of the data varies. Here's an example of the data lengths:
28171
4925543
144881
140455
25955
630515
439299
1700483
1089659
1412159
6003
4295935
Is there a way to optimize my fetching of this data? My query looks like this:
SELECT *
FROM userProfile
ORDER BY id
Indexing is out of the question due to the data lengths. Should I be looking at compressing the images before storing?
If takes time to return data. Five seconds seems a little long for a few megabytes, but there is overhead.
I would recommend compressing the data, if retrieval time is so important. You may be able to retrieve and uncompress the data faster than reading the uncompressed data.
That said, you should not be using select * unless you specifically want the image column. If you are using this in places where it is not necessary, that can improve performance. If you want to make this save for other users, you can add a view without the image column and encourage them to use the view.
If it is still possible to take one step back.Drop the idea of Storing images in table. Instead save path in DB and image in folder.This is the most efficient .
SELECT *
FROM userProfile
ORDER BY id
Do not use * and why are you using order by ? You can order by AT UI code
I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all - I don't really see any bigger show-stoppers than those you already know and enlisted
Meantime,
though wouldn't really matter if using a query-builder ...
based on above excerpt - I wanted to touch upon some aspect of this approach (storing all as strings)
While we usually concerned about CASTing from string to native type to apply relevant functions and so on, I realized that building complex and generic query with some sort of query builder in some cases requires opposite - cast native type to string for applying function like STRING_AGG [just] as a quick example
So, my thoughts are:
When table is designed for direct user's access with trivial or even complex queries - having native types is beneficial and performance wise and being more friendly for user to understand, etc.
Meantime, if you are developing your own query builder and you design table such that it will be available to users for querying via that query builder with some generic logic being implemented - having all fields in string can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance but you can win in being able to better implement generic query builder. And such balance depend on nature of your business - both from data prospective and what kind of query you envision to support
Note: your question is quite broad and opinion based (which is btw not much respected on SO) so, obviously my answer - is totally my opinion but based on quite an experience with BigQuery
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
Does BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
ANY_VALUE(SAFE_CAST(age AS INT64)) > 10
and ANY_VALUE(SAFE_CAST(age AS INT64)) < 30
It is normal that with less columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in future, you'll need to convert string to date to get the age and then be able to do the manupulation.
Data size: The data size has broader impacts that can not be seen at the start. For example if you have N parallel test teams which require own test systems, you'll need to allocate more disk space.
Read Performance: When you have more bytes to read in huge tables it will cost you considerable time. For example typically telco operators have a couple of billions of rows data per month.
If your code complexity increase, you'll need to replicate conversions in multiple places.
Even single of above items should push one to distance from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
From the solution below, you might face some storage and performance problems, you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation, remember that the BigQuery Engine will have to deal with a CAST operation for each value per row.
In order to test the compute cost of this operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However if we execute the same query adding the the CAST operator.
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, that might cause problems when escalating the operation size.
Also, remember that each time that you want to use the data type properties of each data type, you will have to cast your value, and deal with the compute operation time required.
Finally, referring to the storage performance, as you mentioned Strings do not have a fixed size, and that might cause a size increase.
I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in Web UI by simply typing (and not running) below query changing to field of your interest
SELECT <column_name>
FROM YourTable
and looking into Validation Message that consists of respective size
Important - you do not need to run it – just check validation message for bytesProcessed and this will be a size of respective column
Validation is free and invokes so called dry-run
If you need to do such “columns profiling” for many tables or for table with many columns - you can code this with your preferred language using Tables.get API to get table schema ; then loop thru all fields and build respective SELECT statement and finally Dry Run it (within the loop for each column) and get totalBytesProcessed which as you already know is the size of respective column
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.
I used to work with SQL like MySQL, Postgres or MSSQL.
Now I want to play with Redis. I'm working on a little home project, that I think is the best choice for starting using Redis.
I have a machine that reads temperature (indoor and outdoor) and humidity. I need to store the readings into Redis. Can you help me to understand the best data structure to do so?
Other than this data I need to store the time (ex. unix timestamp) of the temperature reading for use plotting a graphic.
I installed Redis read the documentation, so I understand the commands and data types.
Since this is your first Redis project and it's a home project, I'd be careful about being to careful. Here's a couple ways to consider designing it (NOTE: I only dug deep into REDIS this past weekend so hopefully others will weigh in).
IDEA 1:
Four ordered sets
KEY for sets are "indoor_temps", "outdoor_temps", "indoor_humidity", "outdoor_humidity"
VALUES are the temperatures / humidities
SCORE is the date stored as EPOCH
IDEA 2:
Four types of keys (best shown by example)
datetime_key = /year:2014/month:07/day:12/hour:07/minute:32/second:54
type_keys = [indoor_temps, outdoor_temps, indoor_humidity, outdoor_humidity]
keys are of form type + "/" + datetime_key
values are the temp and humidity itself
You probably want to implement some initial design and then work with the data immediately - graph it, do stats, etc. Whatever you plan to do with it. That will expose flaws and if they are major, flush the database and try again. These designs should really only take ~1 hour to implement since the only thing you're really changing is a few Redis commands and some string manipulation to convert the data to keys.
I like Tony's suggestions, but I'll also throw out another possibility.
4 lists
keys are "indoor_temps", "outdoor_temps", "indoor_humidity", "outdoor_humidity"
values are of the form < timestamp >_< reading > ie.( "1403197981_27.2" )
Push items onto the front of the list using LPUSH. Get a set of readings using LRANGE. The list will always be ordered by the time of the reading. Obviously split the value on "_" to get your time and reading...
In all honesty, this will give the same properties as Tony's first example, with slightly worse lookup performance, but better memory usage. I'm guessing for this project you'll be neither memory, nor CPU constrained, so the choice is probably not an issue. That said, if you expect to be saving 100's of thousands or more readings, I would suggest the list unless you want to consume a large portion of your system's memory.
Also, it's a good idea to call EXPIRE on your entries with some reasonable TTL that encompasses the length of time you want to save the readings for. If your plan is to have them live in perpetuity then you may want to look at backing them up to a disk DB over time, and just use Redis as a quick lookup cache for recent readings.
Thank to all answer, I choose this strucure:
4 lists: tempIN, tempOut, humidIN and humidOUT
values are: [value]:[timestamp]. For example: "25.4:1403615247"
As suggested from wallacer i want to backup old entries out from Redis.
For main frontend i need only last two days of sample.
For example i can create Redis RDB file snapshot and "trim" the live lists. This solution is not convenient in the event that, in the future you want to recover old values.
Do you have any tips on what kind of procedure to adopt to store the data? Maybe use of SQLIte DB?
I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86.400 seconds
1 month = 2.592.000 seconds
Around 1000 values to keep track of every seconds.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool - it provides a round robin database, or circular buffer, for time series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function, for example (sum, min, max, avg) for a given period, 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been agregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data saving approach and instead of saving 'raw' data as values I would save 5-20 minutes of data in an array (Memory, BL side), compress that array using LZ based algorithm and then store the data in the database as binary data. Also, it would be nice to save Max/Min/Avg/etc.. info for that binary chunk.
When you want to process the data you can process the data chunk after chunk and by that you keep a low memory profile for your application. this approach is a little more complex but very scalable in terms of memory/processing.
hope this helps.
Is the problem the database schema?
1 second to many trends obviously first shows you a separate table with a seconds-table foreign key. Alternatively, if the "many trend values" is represented by the columns and not rows you can always append the columns to the seconds table and incur null values.
Have you tried that? Was performance poor?