Tables with the same structure and similar data - SQL

This is a question about best practice, really.
I am developing a system where I will be collecting some measurements (call them HR and RR) and calculating average values of these measurements. For the user interface we are only interested in these average values, but for the in-depth data analysis later on we need all individual measurements (to export to MATLAB) as well as all the average calculations (don't ask - user requirement; I would otherwise just save the individual measurements and calculate averages later if needed).
Here are the details about average calculations etc:
- HR: we get readings every 500 - 1500ms (variable). We calculate the average based on 4-12 readings (depending on time between the readings).
- RR: we get readings every 3-17sec (variable). We calculate average based on 2-3 readings (depending on time between the readings).
For both we save:
- Average value (decimal) together with the timestamp of first reading from the readings used for the average calculation.
- Each individual reading (decimal) together with timestamps of when the reading was taken.
As you can see the data is the same for average calculations and individual readings. The same with HR/RR - the data is the same and could be represented as:
- - - - - - - - - -
| Reading |
- - - - - - - - - -
| Timestamp |
| Value |
- - - - - - - - - -
Since we compute the data at different time intervals etc., we cannot store HR+RR as a single row in the database; we need separate rows or tables.
The questions are:
1. Is it better practice to create separate tables for HR and RR? Or is it better to store them in the same table as separate rows, with a column indicating whether a given row is HR or RR?
2. Is it better to create separate tables for the individual readings? Or is it better to create a self-referencing table, where each individual reading would reference a row in the same table holding the average calculation it was used in?
I am not that great with DB design and I am not sure what the best practices are in this situation.
I was also considering using MongoDB rather than a SQL database (probably MSSQL, since the project is C# based); that would probably make life easier, since I could have an array of individual measurements embedded in a document together with the average calculation etc. As far as I know, writes to Mongo are very fast...
Any pointers? Thanks.

As wishy-washy as it sounds, it depends. On your first question, one could very legitimately look at this as one table of readings or as two more specific tables. That said, years ago I would have said a single table, but over the years I have gravitated toward the two tables. For one, your key values become more specific - (Reading) vs. (Reading + Type) - and otherwise you'll find yourself adding "AND ReadType = ..." in your sleep. It also leaves you more flexibility when someone decides one reading needs to be stored to a different precision, or also needs to record the color of the shirt the technician was wearing.
On the second question, again, opinions will vary, but I'd lean toward a parent table of reading sets and a detail table of individual readings. The self-referencing table wins some style points, but joining a table back to itself can get tricky depending on the answers you're trying to get. Also, your final DB platform choice may or may not include some of the specialized options, like MSSQL's CTEs, that address some of these complexities.
Overall, you could probably have:
ReadingSet (ReadingSetID [, other info as needed])
ReadingR (ReadingRID, ReadingSetID, Value, TimeStamp)
ReadingH (ReadingHID, ReadingSetID, Value, TimeStamp)
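If it helps picture it, here is a minimal MSSQL-style sketch of that layout (column names, types, and putting the average on the set row are illustrative assumptions, not prescriptions):

-- One row per averaging window; the detail tables hold the readings that fed it
CREATE TABLE ReadingSet (
    ReadingSetID int IDENTITY(1,1) PRIMARY KEY,
    AvgValue decimal(9,2) NOT NULL,     -- the calculated average
    StartTime datetime2 NOT NULL        -- timestamp of the first reading used
)

CREATE TABLE ReadingH (                 -- individual HR readings
    ReadingHID int IDENTITY(1,1) PRIMARY KEY,
    ReadingSetID int NOT NULL REFERENCES ReadingSet(ReadingSetID),
    Value decimal(9,2) NOT NULL,
    TimeStamp datetime2 NOT NULL
)

CREATE TABLE ReadingR (                 -- individual RR readings
    ReadingRID int IDENTITY(1,1) PRIMARY KEY,
    ReadingSetID int NOT NULL REFERENCES ReadingSet(ReadingSetID),
    Value decimal(9,2) NOT NULL,
    TimeStamp datetime2 NOT NULL
)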

Related

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster).
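Roughly, the two variants looked like this (a sketch with placeholder names, not the exact query):

-- Variant 1: group by the value as stored (native type)
SELECT some_id, COUNT(*) AS cnt
FROM my_table
GROUP BY some_id

-- Variant 2: group by the value cast to a string
SELECT CAST(some_id AS STRING) AS some_id_str, COUNT(*) AS cnt
FROM my_table
GROUP BY CAST(some_id AS STRING)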
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all - I don't really see any bigger show-stoppers than those you already know and listed.
Meanwhile,
though wouldn't really matter if using a query-builder ...
Based on the excerpt above, I wanted to touch on one aspect of this approach (storing everything as strings).
While we are usually concerned about CASTing from string to a native type in order to apply the relevant functions and so on, I realized that building a complex, generic query with some sort of query builder sometimes requires the opposite - casting a native type to string in order to apply a function like STRING_AGG, just as a quick example.
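A quick illustration of that point (table and column names here are hypothetical):

SELECT
  name,
  STRING_AGG(CAST(age AS STRING), ', ') AS ages   -- a native INT64 has to be cast back to STRING for STRING_AGG
FROM my_table
GROUP BY name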
So, my thoughts are:
When a table is designed for direct user access with trivial or even complex queries, having native types is beneficial both performance-wise and in being friendlier for users to understand, etc.
Meanwhile, if you are developing your own query builder, and you design the table so that it will be available to users for querying via that query builder with some generic logic implemented, having all fields as strings can be helpful in building the query builder itself.
So it is a balance - you can lose a little in performance but you can win in being able to better implement a generic query builder. And that balance depends on the nature of your business - both from the data perspective and in terms of what kind of queries you envision supporting.
Note: your question is quite broad and opinion-based (which, by the way, is not much respected on SO), so obviously my answer is totally my opinion, but it is based on quite a lot of experience with BigQuery.
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9" (as strings, "10" sorts before "9")?
Data types provide some basic data validation mechanism at the database level.
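A tiny illustration of what typing buys you (hypothetical dataset/table names):

-- With a typed INT64 column, bad data is rejected at insert time:
--   INSERT INTO mydataset.people (age) VALUES ('young')   -- fails with a type error
-- With an all-string table it gets in silently, and later casts just degrade to NULL:
SELECT SAFE_CAST('young' AS INT64) AS age   -- returns NULL instead of raising an error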
Does BigQuery have a notion of indexes?
If yes, then most likely those indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
SAFE_CAST(age AS INT64) > 10
and SAFE_CAST(age AS INT64) < 30
It is normal that with fewer columns/rows you don't feel the problems. You start to feel the problems when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks you to retrieve teenagers in the future, you'll need to convert the string to a date to get the age before you can do the manipulation.
Data size: The data size has broader impacts that cannot be seen at the start. For example, if you have N parallel test teams which require their own test systems, you'll need to allocate more disk space.
Read performance: When you have more bytes to read in huge tables, it will cost you considerable time. For example, telco operators typically have a couple of billion rows of data per month.
If your code complexity increases, you'll need to replicate conversions in multiple places.
Even a single one of the items above should be enough to push one away from using strings for everything.
I would think the biggest issue with this would be other users of this table/data. For instance, if someone is trying to write reports with it and do calculations, charts, or date ranges, it could be a big headache to always have to cast or convert the data with whatever tool they are using. You or someone else would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it once to the table/data and be done with it.
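That one-time conversion could be as simple as a typed copy of the table, roughly along these lines (a sketch only; the dataset, table, and column names, and the date format, are assumptions based on the example in the question):

-- Hypothetical one-time conversion of the all-string table into a typed one
CREATE OR REPLACE TABLE mydataset.people_typed AS
SELECT
  name,
  PARSE_DATE('%m/%d/%Y', dob) AS dob,   -- assumes the m/d/Y format shown above
  SAFE_CAST(age AS INT64) AS age,
  info                                  -- left as a string; could be parsed into an ARRAY later
FROM mydataset.people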
With the approach described, you might face some storage and performance problems; you can find some guidance in the official documentation.
The main performance problem will come from the CAST operation: remember that the BigQuery engine will have to deal with a CAST for each value in each row.
In order to test the compute cost of this operation, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However, if we execute the same query adding the CAST operator:
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, which might cause problems as the operation size grows.
Also, remember that each time you want to use the properties of a proper data type, you will have to cast your value and pay for the compute time required.
Finally, regarding storage: as you mentioned, strings do not have a fixed size, and that might cause a size increase.

Storing large amount of data in Redis / NoSQL or Relational db?

I need to store and access financial market candle stick information.
The number of candlesticks that I will need to store is beginning to look staggering (huge). There are 1000s of markets and each one has many trading pairs, and each pair has many time frames, and each time frame is an array of candles like the one below. The array below could be for hourly price data or daily price data, for example.
I need to make this information available to multiple users at any given time, so need to store it and make it available somehow.
The data looks something like this:
[
  {
    time: 1528761600,
    openPrice: 100,
    closePrice: 20,
    highestPrice: 120,
    lowestPrice: 10
  },
  {
    time: 1528761610,
    openPrice: 100,
    closePrice: 20,
    highestPrice: 120,
    lowestPrice: 10
  },
  {
    time: 1528761630,
    openPrice: 100,
    closePrice: 20,
    highestPrice: 120,
    lowestPrice: 10
  }
]
Consumers of the data will mostly be a complex Javascript based charting app, but other consumers will be node code, and perhaps other backend code.
My current best idea is to save the candlesticks in Redis, though I have also considered a NoSQL database. I'm not super experienced in either, so I'm not 100% sure Redis is the right choice. It seems to be the most performant option, but perhaps harder to work with, since I am having to learn a lot, and I'm not convinced that the method of saving and retrieval used by Redis is going to make this very easy, since I will need to continually add candles to each array.
I'm currently thinking something like:
Do an initial fetch from the candlestick API and either:
Create a Redis hash with a suitable label and stringify the whole array of candles into the hash, so that it can be parsed back by JavaScript etc.
Drawbacks of this approach:
Every time a new candle is created, I have to parse the JSON, add any new candlesticks, and stringify and save it again.
Pros of this approach:
I can use JavaScript to manage the array and make sure it's sorted etc.
Or: create a Redis list of timestamps, which allows me to just push new candles onto the list and trust it to be in the right order. I can then do a Redis SCAN? to return timestamps between the specific dates and then use the timestamps to pull the data out of a Redis hash. After retrieving all of this, I would build a JSON object similar to the above to pass to JavaScript.
I have to say that both of these approaches feel far more painful to me than putting the data in a relational database. I imagine that a NoSQL database could also be way easier, but I'm not experienced with them, so I can't say for sure.
I'm a bit lost and out of my experience here, as you can tell, and would love any advice anyone can give me.
Thanks :)
Your data is very regular - each candlestick has essentially one 64-bit long for the timestamp and four 32-bit numbers for the prices. This makes it very amenable to a Redis bitfield.
Storing the data
Here is how I would store it -
stock-symbol:daily_prices = bitfield with 30 * 4 integers, assuming you are storing data for the past 30 days
stock-symbol:hourly_prices = bitfield with 24 * 4 integers
This way, your memory is (30*4 + 24*4) * 4 bytes = 864 bytes per symbol + constant overhead per key.
You don't need to store the timestamp (see below). Also, I have assumed 4 bytes to store the price. You can store it as a whole number by eliminating the decimal.
Writing the data
To insert hourly prices, find the current hour (say 07:00 hours). If you treat the bitfield as an array of 4 byte integers, you will have to skip 7 * 4 = 28 integers. You then insert the prices at position 28, 29, 30, 31 (0 based indexes).
So, to store price for AAPL at 07:00 hours, you would run the command
BITFIELD AAPL:hourly_prices SET i32 #28 <open price> SET i32 #29 <close price> SET i32 #30 <highest price> SET i32 #31 <lowest price>
(The # prefix makes each offset count in i32-sized slots rather than bits, matching the 0-based integer positions above.)
You would do something similar for daily prices as well.
Reading Data
If you are building a charting library, most likely you would want to return data for multiple symbols for a given time range. Let's say you want to pull out daily prices for past 7 days, your logic will be -
For each symbol:
Get start and end range within the array
Invoke the Get Range command.
If you run this in a pipeline, it will be very fast.
Other tips
Usually, you would want to filter by some property of the symbol. For example, "show me graphs of the top 10 tech companies for the last 5 days".
A symbol itself is relational data. I would recommend storing that in a relational database. Just get the symbol names as a list from the relational database, and then fetch the stock prices from Redis.
Redis has its limits, like anything, but they're pretty high, and if you're clever about it, you can get amazing performance out of redis. If you outgrow one instance you can start thinking about clustering, which should scale relatively linearly to a level where budget is a bigger concern than performance.
Without having a really great grasp of the data you're describing and its relations, it sounds like what you're looking for is a sorted set, perhaps sorted by date. You can ZSCAN a sorted set to move through it sequentially, or you can do lots of other great things against one as well. You might have data that requires a few different things - e.g. a hash for some data and an entry in an index for the hash itself, or even entries in a few different indexes. A simple Redis list might also do the job for you, since it's inherently ordered by insertion order (this may or may not work for your case, of course; it may depend on whether your input is inherently temporally ordered).
At the end of the day, redis performance is generally dictated by how "well" the data is stored in redis - in other words, how well the native redis capabilities have been mapped into your problem domain. It's pretty easy to use and to program against. I'd highly recommend you look into it.

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space of my BigQuery and Google Storage tables. Is there a way to easily find out the cumulative space that each field in a table takes? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field of your interest
SELECT <column_name>
FROM YourTable
and looking at the validation message, which includes the respective size.
Important - you do not need to run it; just check the validation message for bytesProcessed, and this will be the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables, or for a table with many columns, you can code this in your preferred language using the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement, dry-run it (within the loop, for each column), and read totalBytesProcessed, which, as you already know, is the size of the respective column.
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.
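For example, something along these lines gives a rough per-column estimate (table and column names are placeholders; per the pricing documentation, each STRING value costs 2 bytes plus its UTF-8 encoded length):

-- Rough estimate of the storage used by one string column
SELECT
  COUNT(*) AS row_count,
  AVG(LENGTH(my_string_col)) AS avg_chars,
  SUM(2 + BYTE_LENGTH(my_string_col)) AS approx_bytes
FROM mydataset.my_table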

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data, i.e. the values corresponding to each specific second?
1 day = 86,400 seconds
1 month = 2,592,000 seconds
There are around 1,000 values to keep track of every second.
Currently there are 50 tables grouping the trend data, 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool? It provides a round-robin database, or circular buffer, for time series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function, for example (sum, min, max, avg), for a given period: 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been aggregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data-saving approach: instead of saving 'raw' data as individual values, I would hold 5-20 minutes of data in an array (in memory, on the business-logic side), compress that array using an LZ-based algorithm, and then store the data in the database as binary data. It would also be useful to save max/min/avg/etc. info for that binary chunk.
When you want to process the data you can process it chunk after chunk, and that way you keep a low memory profile for your application. This approach is a little more complex but very scalable in terms of memory/processing.
Hope this helps.
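A rough sketch of what such a chunk table might look like (illustrative only; names, types, and the exact summary columns are assumptions):

-- Hypothetical layout for the compressed-chunk approach:
-- one row per 5-20 minute block of raw samples per trend,
-- with summary values kept alongside the compressed payload
CREATE TABLE TREND_CHUNK (
    TREND_CHUNK_ID bigint IDENTITY(1,1) PRIMARY KEY,
    TREND_ID int NOT NULL,              -- which of the ~1000 tracked values
    START_TIME datetime2 NOT NULL,
    END_TIME datetime2 NOT NULL,
    MIN_VALUE real NOT NULL,
    MAX_VALUE real NOT NULL,
    AVG_VALUE real NOT NULL,
    PAYLOAD varbinary(max) NOT NULL     -- LZ-compressed array of raw samples
)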
Is the problem the database schema?
One second related to many trends obviously suggests, first, a separate table with a foreign key to a seconds table. Alternatively, if the "many trend values" are represented by columns rather than rows, you can always append the columns to the seconds table and accept null values.
Have you tried that? Was performance poor?
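For reference, the narrow alternative would look something like this (a sketch; names are assumed, and it trades 20 columns per row for 20 rows per timestamp):

-- Hypothetical narrow ("tall") layout: one row per trend value per second
CREATE TABLE TREND_VALUE (
    TREND_TIME datetime NOT NULL,       -- or a foreign key to a seconds table
    TREND_ID int NOT NULL,              -- identifies which trend the value belongs to
    TREND_DATA real NOT NULL,
    PRIMARY KEY (TREND_TIME, TREND_ID)
)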

dimensional and unit analysis in SQL database

Problem:
A relational database (Postgres) storing timeseries data of various measurement values. Each measurement value can have a specific "measurement type" (e.g. temperature, dissolved oxygen, etc) and can have specific "measurement units" (e.g. Fahrenheit/Celsius/Kelvin, percent/milligrams per liter, etc).
Question:
Has anyone built a similar database such that dimensional integrity is conserved? Have any suggestions?
I'm considering building a measurement_type and a measurement_unit table; both of these would have two columns, ID and text. Then I would create foreign keys to these tables in the measured_value table. Text worries me somewhat because there's the possibility of non-unique duplicates (e.g. 'ug/l' vs 'µg/l' for micrograms per liter).
The purpose of this would be so that I can both convert and verify units on queries, or via programming externally. Ideally, I would have the ability later to include strict dimensional analysis (e.g. linking µg/l to the value 'M/V' (mass divided by volume)).
Is there a more elegant way to accomplish this?
I produced a database sub-schema for handling units an aeon ago (okay, I exaggerate slightly; it was about 20 years ago, though). Fortunately, it only had to deal with simple mass, length, time dimensions - not temperature, or electric current, or luminosity, etc. Rather less simple was the currency side of the game - there were a myriad different ways of converting between one currency and another depending on date, currency, and period over which conversion rate was valid. That was handled separately from the physical units.
Fundamentally, I created a table 'measures' with an 'id' column, a name for the unit, an abbreviation, and a set of dimension exponents - one each for mass, length, time. This gets populated with names such as 'volume' (length = 3, mass = 0, time = 0), 'density' (length = -3, mass = 1, time = 0) - and the like.
There was a second table of units, which identified a measure and then the actual units used by a particular measurement. For example, there were barrels, and cubic metres, and all sorts of other units of relevance.
There was a third table that defined conversion factors between specific units. This consisted of two units and the multiplicative conversion factor that converted unit 1 to unit 2. The biggest problem here was the dynamic range of the conversion factors. If the conversion from U1 to U2 is 1.234E+10, then the inverse is a rather small number (8.103727714749e-11).
The comment from S.Lott about temperatures is interesting - we didn't have to deal with those. A stored procedure would have addressed that - though integrating one stored procedure into the system might have been tricky.
The scheme I described allowed most conversions to be described once (including hypothetical units such as furlongs per fortnight, or less hypothetical but equally obscure ones - outside the USA - like acre-feet), and the conversions could be validated (for example, both units in the conversion factor table had to have the same measure). It could be extended to handle most of the other units - though the dimensionless units such as angles (or solid angles) present some interesting problems. There was supporting code that would handle arbitrary conversions - or generate an error when the conversion could not be supported. One reason for this system was that the various international affiliate companies would report their data in their locally convenient units, but the HQ system had to accept the original data and yet present the resulting aggregated data in units that suited the managers - where different managers each had their own idea (based on their national background and length of duty in the HQ) about the best units for their reports.
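In Postgres terms, that sub-schema might look roughly like this (an illustrative sketch, not the original DDL; names and types are assumptions):

-- Measures: named combinations of dimension exponents
CREATE TABLE measure (
    measure_id serial PRIMARY KEY,
    name text NOT NULL,            -- e.g. 'volume', 'density'
    abbrev text NOT NULL,
    mass_exp integer NOT NULL,     -- dimension exponents (mass, length, time)
    length_exp integer NOT NULL,
    time_exp integer NOT NULL
);

-- Units: concrete units belonging to a measure
CREATE TABLE unit (
    unit_id serial PRIMARY KEY,
    measure_id integer NOT NULL REFERENCES measure(measure_id),
    name text NOT NULL             -- e.g. 'barrel', 'cubic metre'
);

-- Multiplicative conversion factors between units of the same measure
CREATE TABLE unit_conversion (
    from_unit integer NOT NULL REFERENCES unit(unit_id),
    to_unit integer NOT NULL REFERENCES unit(unit_id),
    factor double precision NOT NULL,   -- value_in_from_unit * factor = value_in_to_unit
    PRIMARY KEY (from_unit, to_unit)
);
-- The "same measure" rule can be enforced with a trigger or a composite foreign key.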
"Text worries me somewhat because there's the possibility for non-unique duplicates"
Right. So don't use text as a key. Use the ID as a key.
"Is there a more elegant way to accomplish this?"
Not really. It's hard. Temperature is its own problem because temperature is itself an average and doesn't sum like distance does; plus the F to C conversion is not a multiply (as it is with every other unit conversion).
A note about conversions: a lot of units are linearly related, and can be converted using a formula like "y = A + Bx", where A and B are constants which could be stored in the database for each pair of units that you need to convert between. For example, for Celsius to Fahrenheit the constants are A=32, B=1.8.
However, there are also rare exceptions. Converting between logarithmic and non-logarithmic units, for example. Or converting between mass-per-volume and moles-per-volume (in which case you would need to know the molar mass of the compound being measured).
Of course, if you are sure that all the conversions required by the system are linear, then there's no need for over-engineering, just store the two constants. You can then extract standardized results from the database using straight SQL joins with calculated fields.
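A sketch of what that could look like (the measured_value table is from the question; the linear_conversion table, its columns, and the target unit id are hypothetical):

-- Convert every stored value to a chosen target unit using y = a + b * x
SELECT
    m.timestamp,
    m.value AS raw_value,
    m.measurement_unit_id AS raw_unit,
    c.a + c.b * m.value AS converted_value,
    c.to_unit AS converted_unit
FROM measured_value AS m
JOIN linear_conversion AS c
  ON c.from_unit = m.measurement_unit_id
 AND c.to_unit = 42               -- 42 = id of the desired target unit (placeholder)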