Bigtable row key scenario to avoid hotspotting?
A company needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?
A. Rowkey: date#device_id, Column data: data_point
B. Rowkey: date, Column data: device_id, data_point
C. Rowkey: device_id, Column data: date, data_point
D. Rowkey: data_point, Column data: device_id, date
E. Rowkey: date#data_point, Column data: device_id
Which would be the best option above?
According to the Bigtable schema documentation:
Rows are sorted lexicographically by row key.
This means that in order to avoid hotspotting, common queries should return row results that are sequential.
Essentially, you want to be querying rows for a given date and device id. Google Cloud Bigtable allows you to query rows by a row key prefix. Since the most common query is for all the data for a given device on a given day, the device and the date need to be part of the row key prefix, and must be the first two components of the row key.
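For illustration (the dates and device ids below are made up), with a date#device_id key the rows for one device on one day share a single prefix, so they sort together and can be fetched with one prefix scan on 20180101#device001:
20180101#device001
20180101#device002
20180102#device001
20180102#device002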
You have two kinds of solution.
Bigtable orders rows lexicographically by row key, so you can shape how the rows are distributed:
1 - Prefix each row key with a letter (a salt) so that Bigtable's lexicographic ordering spreads your rows across the key space and avoids I/O collisions on a single node. This technique is called key salting (a "salted table").
Example - original keys:
123
456
789
101112
131415
Salted keys:
a123
a456
b789
b101112
c131415
2 - You can use an MD5 hash of the key, which avoids repeating the same prefix and guarantees a variety of prefixes, so Bigtable spreads the row keys across the instance's nodes.
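For example (these hash prefixes are purely hypothetical), the keys above might end up as:
5f4d#123
9c07#456
e3b0#789
Note that fully hashing the key spreads writes well but makes prefix and range scans on the natural key harder, so a short salt is often preferable when you still need range queries.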
So I am an SQL newbie, and I would like to organize a structured database that has metadata and data in SQLite. I am not sure how to do this, and I have looked around at different internet sites but I haven't found anything helpful.
Basically what I would want is something like this (using different data collection stations as an example):
SQL TABLE:
Location
Lat
Long
Other Important info about the station
And then somehow when I query this table and want to see info about specific stations data I would be able to pull up the data that would look something like this:
datetime data
1/1/1980 11.6985
1/2/1980 43.6431
1/3/1980 54.9089
1/4/1980 63.1225
1/5/1980 72.4399
1/6/1980 79.1363
1/7/1980 82.2778
1/8/1980 86.0785
1/9/1980 86.8612
1/10/1980 84.3342
1/11/1980 80.4646
1/12/1980 77.1508
1/13/1980 74.827
1/14/1980 73.387
1/15/1980 72.1774
1/16/1980 71.6423
Since I don't know much about table hierarchy, I don't know how to do this, but I feel like it is probably possible. Any help would be appreciated!
"using different data collection stations"
Immediately indicates that a separate table for stations should be used and that the readings should relate to/associate with/reference the stations table.
For the stations table you could have something like :-
CREATE TABLE IF NOT EXISTS stations (id INTEGER PRIMARY KEY, station_name TEXT, station_latitude REAL, station_longitude REAL);
This will create a table (if it doesn't already exist) that has 4 columns :-
The first column id is a unique identifier that will be generated automatically and is what you would use to reference a specific station.
The second column, station_name, is for the name of the station and is of type TEXT.
The third and fourth columns are for the station's location according to lat and long.
You could add a couple of stations using :-
INSERT INTO stations (station_name, station_latitude,station_longitude) VALUES("Zebra", 100.7892, 60.789);
INSERT INTO stations (station_name, station_latitude,station_longitude) VALUES("Yankee", 200.2967, 95.234);
You could display/return these using :-
SELECT * FROM stations
that is, SELECT all columns (*) FROM the table called stations; the result would be the two stations just inserted, along the lines of :-
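id  station_name  station_latitude  station_longitude
1   Zebra         100.7892          60.789
2   Yankee        200.2967          95.234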
Next you could create the readings table e.g. :-
CREATE TABLE IF NOT EXISTS readings(recorded_datetime INTEGER DEFAULT (datetime('now')), data_recorded REAL, station_reference INTEGER);
This will create a table named readings (if it doesn't already exist) it will have 3 columns :-
recorded_datetime which is of type INTEGER (can store an integer of up to 8 bytes, i.e. pretty large). This will be used to store a timestamp. Perhaps not what you want, but as an example, a default value - the current datetime - will be used if no value is specified for this column.
data_recorded as a REAL that is for the data.
station_reference this will refer to the station's id.
You could then insert a reading for the Zebra station using :-
INSERT INTO readings (data_recorded,station_reference) VALUES(11.6985,1);
As the recorded_datetime column has not been provided, the current datetime will be used.
If :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),11.6985,1);
Then this reading would be for 1/1/1980 at 10:40 for station 1.
Using :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),13.6985,2);
INSERT INTO readings VALUES(datetime('1966-03-01 10:40'),15.6985,2);
INSERT INTO readings VALUES(datetime('2000-01-01 10:40'),11.6985,2);
Will add some readings for Yankee station (id 2).
Using SELECT station_reference, recorded_datetime, data_recorded FROM readings; will select all the columns, but station_reference will be the first column in the result, and so on.
The obvious progression is to display the data including the respective station. For this a JOIN will be used. That is, the readings table will be joined with the stations table, matching each reading's station_reference value to the respective station's id.
However, let's say that we wanted the Station info to be something like stationname (Long=???? - Lat=????) date/time data and be sorted according to station name and then according to date/time. Then the following could be used :-
SELECT
stations.station_name ||
'(Long='||station_longitude||' - Lat='||station_latitude||')'
AS stationinfo,
readings.recorded_datetime,
readings.data_recorded
FROM readings
JOIN stations ON readings.station_reference = stations.id
ORDER BY stations.station_name ASC, readings.recorded_datetime
Note this is shown more as an example of the quite complex things you can do in SQL, rather than with an expectation of fully understanding the coding.
This would return one row per reading, labelled with the station info and sorted as described.
You may ask (or some would argue): why can't I just have a single table with reading, datetime, station name, station latitude and station longitude?
Well you could. BUT :-
Say there were a directive to change station Zebra's name. Then you'd have to trawl through all the rows and change the name numerous times. Easy to code but relatively costly in terms of resource usage. Using the two tables means that just a single update is required.
For each row you would have to duplicate data, very likely wasting disk space and thus increasing the resources needed to access the data. That is, "Zebra" would take at least 5 bytes and the two REALs take 8 bytes each, so that's at least 21 bytes of station detail per reading. A reference to one occurrence of that repeated data would take at most 8 bytes for an integer (initially just a single byte), so the duplication costs around 13 bytes per reading.
I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields, one for each day of the month, in each row.
In these fields I have a char of 1 or 2 characters representing the car's status on that date. I need to make a query to get the count of each possible value; the field for any day can have the values: D, I, R, TA, RZ, BV and LR.
I need to count, in each row, the amount of each value in that row - how many I, how many D, and so on - and this for every row in the table.
What would be the best approach here? Also, maybe there is a better way than having a field in the table for each day, because that obviously makes over 30 fields.
There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
Then your query would simply be:
select status, count(*)
from CarStatuses
where date >= #month_start and date < #month_end
group by status;
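A minimal sketch of such a table (names chosen to match the query above; adjust types to your database) might be:
CREATE TABLE CarStatuses (
  CarId  INTEGER NOT NULL,      -- references your existing cars table
  Date   DATE    NOT NULL,      -- the day this status applies to
  Status VARCHAR(2) NOT NULL,   -- one of D, I, R, TA, RZ, BV, LR
  PRIMARY KEY (CarId, Date)     -- one status per car per day
);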
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
from t
) union all
(select status_02
from t
) union all
. . .
(select status_31
from t
)
) s
group by status;
You seem to need to start with the most basic tutorials on relational databases and SQL design. Some classic works like "Martin Gruber - Understanding SQL" may help, or others. At the moment you are missing the basics.
A few hints.
Documents that you print for the user or receive from the user do not have to mirror your internal data structures. They are created/parsed precisely for that machine-to-human interface; inside, your program should structure the data for ease of storing and processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from the "R" status you can transition to either the "D" status or the "BV" status, but not to any other. In other words, you had better draft the "directed graph" of possible status transitions. You would keep it either in extra columns of that dictionary table or in one more specialized helper table: a dictionary of transitions for the dictionary of possible statuses.
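A rough sketch of those two dictionary tables (names and types here are assumptions) could be:
CREATE TABLE statuses (
  id           INTEGER PRIMARY KEY,
  abbreviation VARCHAR(2)   NOT NULL UNIQUE,   -- D, I, R, TA, RZ, BV, LR
  description  VARCHAR(100) NOT NULL           -- human-readable meaning
);

CREATE TABLE status_transitions (
  from_status INTEGER NOT NULL REFERENCES statuses(id),
  to_status   INTEGER NOT NULL REFERENCES statuses(id),
  PRIMARY KEY (from_status, to_status)          -- one edge of the transition graph
);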
Your paper form combines in the same row both totals and per-day detail. That is easy for a human to look at, but for a computer it in a sense violates the single responsibility principle: a row should be responsible either for a primary record or for a derived total. You had better have two tables - one for the primary day-by-day records and another for summing up the per-month totals.
A bonus is that when you change values in the primary data table you can ask the server to automatically recalculate the corresponding month totals. Read about SQL triggers.
Your triggers may also check whether the new state is a valid transition from the previous day's state, as described in the "business rules". They may also have to check that there are no gaps between days: if there is a record for "March 03" and a new record is inserted for "March 05", then a record for "March 04" should exist, or the server should prohibit adding such a row. Well, maybe not - that depends on your business processes. The general idea is that the server should reject storing any data that it can know is not valid.
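Trigger syntax differs between engines; as a rough MySQL-style sketch (the table and column names daily_status, monthly_totals, cnt and so on are assumptions, and both tables are sketched a bit further below), bumping a month total on insert might look like:
CREATE TRIGGER trg_daily_status_ai
AFTER INSERT ON daily_status
FOR EACH ROW
  UPDATE monthly_totals
     SET cnt = cnt + 1
   WHERE car_id    = NEW.car_id
     AND status_id = NEW.status_id
     AND month     = MONTH(NEW.status_date)
     AND year      = YEAR(NEW.status_date);
-- assumes the matching monthly_totals row already exists; a real implementation
-- would also handle UPDATE and DELETE on the primary table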
Your per-date and per-month tables should have proper UNIQUE constraints prohibiting duplicate rows. It also means the former should have a DATE-type column, and the latter should either have month and year INTEGER-type columns or a DATE-type column whose day part is always "1" - you would want a CHECK constraint for that.
If your company has some registry of cars (and it probably does - it does not look like those cars were driven in by random one-time customers passing by), you have to introduce a dictionary table of cars: integer ID (PK), registration plate, engine factory number, wagon factory number, colour and whatever else.
The per-month totals table would not have one column per status. It would instead have a separate row for every status! The structure would probably be like this: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be of integer type (some may be SmallInt or BigInt, but that is minor nuancing). All the columns together (except the count column) should constitute a UNIQUE constraint, or even better a compound Primary Key. Adding a special dedicated PK column to the totalling table seems redundant to me.
Consequently, your per-day and per-month tables would not hold literal (textual, immediate) data for the status and car id. Instead they would hold integer IDs referencing the proper records in the corresponding cars and statuses dictionary tables. That is what you would code as a FOREIGN KEY.
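Putting these points together, a sketch of the cars dictionary, the per-day table and the per-month totals table (all names and types are assumptions) might look like:
CREATE TABLE cars (
  id                 INTEGER PRIMARY KEY,
  registration_plate VARCHAR(20) NOT NULL UNIQUE,
  engine_number      VARCHAR(30),
  colour             VARCHAR(20)
);

CREATE TABLE daily_status (
  car_id      INTEGER NOT NULL REFERENCES cars(id),
  status_id   INTEGER NOT NULL REFERENCES statuses(id),
  status_date DATE    NOT NULL,
  UNIQUE (car_id, status_date)                  -- one status per car per day
);

CREATE TABLE monthly_totals (
  month     INTEGER NOT NULL CHECK (month BETWEEN 1 AND 12),
  year      INTEGER NOT NULL,
  car_id    INTEGER NOT NULL REFERENCES cars(id),
  status_id INTEGER NOT NULL REFERENCES statuses(id),
  cnt       INTEGER NOT NULL DEFAULT 0,
  PRIMARY KEY (month, year, car_id, status_id)  -- compound key, no surrogate id needed
);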
Remember the rule of thumb: it is easy to add/delete a row to any table but quite hard to add/delete a column.
With a column-oriented design like yours, what would happen if next year the boss introduced some more statuses? You would have to redesign the table, the program in many places, and so on.
With the row-oriented design you would just have to add one row to the statuses dictionary and maybe a few rows to the transition-rules dictionary, and the rest keeps working without any change - you would not need to touch the schema at all.
I don't know what is the best wording of the question, but I have a table that has 2 columns: ID and NAME.
When I delete a record from the table, the related ID is deleted with it and the sequence is spoiled.
Take this example:
If I delete row number 2, the sequence of the ID column will be: 1, 3, 4.
How do I make it: 1, 2, 3?
ID's are meant to be unique for a reason. Consider this scenario:
**Customers**
id value
1 John
2 Jackie
**Accounts**
id customer_id balance
1 1 $500
2 2 $1000
In the case of a relational database, say you were to delete "John" from the database and the ids were resequenced. Now Jackie would take on the customer id of 1. When Jackie goes in to check her balance, she will now show $500 short.
Granted, you could go through and update all of her other records, but A) this would be a massive pain in the ass. B) It would be very easy to make mistakes, especially in a large database.
Ids (primary keys in this case) are meant to be the rock that holds your relational database together, and you should always be able to rely on that value regardless of the table.
As JohnFx pointed out, should you want a value that shows the order of the user, consider using a built in function when querying.
In SQL Server identity columns are not guaranteed to be sequential. You can use the ROW_NUMBER function to generate a sequential list of ids when you query the data from the database:
SELECT
ROW_NUMBER() OVER (ORDER BY Id) AS SequentialId,
Id As UniqueId,
Name
FROM dbo.Details
If you want sequential numbers don't store them in the database. That is just a maintenance nightmare, and I really can't think of a very good reason you'd even want to bother.
Just generate them dynamically using T-SQL's ROW_NUMBER function when you query the data.
The whole point of an Identity column is creating a reliable identifier that you can count on pointing to that row in the DB. If you shift them around you undermine the main reason you WANT an ID.
In a real world example, how would you feel if the IRS wanted to change your SSN every week so they could keep the Social Security Numbers sequential after people died off?
I don't clearly understand the purpose of using $PARTITION in SQL Server. I have read the content on MSDN but still don't understand.
What benefit do you get from using it?
Partitioning a table in a database means splitting it into multiple pieces while still treating it like a single table. We are talking here about horizontal partitioning, which means that each partition contains all of the columns but only some of the rows. To do this, you have to define a partition function which determines which partition a row belongs to, based on that row's value in the partitioning column. $PARTITION tells you which partition a row belongs to.
Let's say we had an integer column that represented the rating of our rows and we expect that users will often ask only for items rated above 4. We could partition based on rating. Our partition function could be:
CREATE PARTITION FUNCTION partitionByRating (int) AS RANGE LEFT FOR VALUES (4);
Then, to find out which partition the rows of our table belong to, we could query:
SELECT $partition.partitionByRating(i.Rating) AS [Partition Number],
rating AS [Rating],
name AS [Item Name]
FROM dbo.Items AS i
And we might get a result like:
Partition Rating Item Name
1 1 Ball
1 4 Scooter
2 5 Blaster
Which means that Ball and Scooter will be stored together, but depending on how you assign partitions to storage, Blaster might be stored somewhere else.
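How partitions map to storage is controlled by a partition scheme; a minimal sketch (the filegroup names fg_low and fg_high are assumptions) would be:
CREATE PARTITION SCHEME psByRating
AS PARTITION partitionByRating
TO (fg_low, fg_high);
-- a table created ON psByRating(Rating) then stores each row in the mapped filegroup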
You can use this function to determine the partition number that will be used for a certain value. For example, with a table partitioned by date, this will tell you which partition will be used for July 1, 2001, even before you insert the value into the table:
SELECT $PARTITION.PF_RangeByYear('20010701')
You can also use it to find the row count in each partition number:
SELECT $partition.PF_RangeByYear(orderdate) AS partition#,
COUNT(*) AS cnt
FROM Orders
GROUP BY $partition.PF_RangeByYear(orderdate);
Which, arguably, you could do much more simply from sys.partitions or sys.dm_db_partition_stats.
(Both of these examples taken from Itzik's SQL Mag article here.)
I am working with an extensive amount of third-party data. Each data set has items with unique identifiers, so it is very easy for me to use a UNIQUE column in SQLite to enforce some data integrity.
Out of thousands of records, I have one id from third-party source A matching 2 unique ids from third-party source B.
Is there a way of bending the rules, and allowing a duplicate entry in a unique column? If not how should I reorganise my data to take care of this single edge case.
UPDATE:
CREATE TABLE "trainer" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"name" TEXT NOT NULL,
"betfair_id" INTEGER NOT NULL UNIQUE,
"racingpost_id" INTEGER NOT NULL UNIQUE
);
Problem data:
Miss Beverley J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=20514
Miss B J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=11096
vs. Miss Beverley J. Thomas http://form.horseracing.betfair.com/form/trainer/1/00008861
Both Racingpost entries (my primary data source) match a single Betfair entry. This is the only one (so far) out of thousands of records.
If Racingpost should have had only 1 match, it is an error condition.
If racingpost is allowed to have 2 matches per id, you must either have two ids, select one, or combine the data.
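If you go the two-ids route, one possible sketch (assuming SQLite and the trainer table above) is to move racingpost_id into its own table, so one trainer can be linked to several Racingpost ids while each Racingpost id stays unique:
CREATE TABLE "trainer_racingpost" (
  "racingpost_id" INTEGER PRIMARY KEY,                   -- still unique across all trainers
  "trainer_id" INTEGER NOT NULL REFERENCES trainer(id)   -- several rows may point at one trainer
);
-- racingpost_id would then be dropped from the trainer table; betfair_id stays UNIQUE there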
Since racingpost is your primary source, having 2 ids may make sense. However if you want to improve upon that data set combining that data or selecting the most useful may be more accurate. The real question is how much data overlaps between these two records and when it does can you detect it reliably. If the overlap is small or you have good detection of an overlap condition, then combining makes more sense. If the overlap is large and you cannot detect it reliably, then selecting the most recent updated or having two ids is more useful.