How to structure an SQL table with metadata - sql

So I am an SQL noobie, and I would like to organize a structured database that has metadata and data in SQLite. I am not sure how to do this, and I have looked around at different internet sites but I haven't found anything helpful.
Basically what I would want is something like this (using different data collection stations as an example):
SQL TABLE:
Location
Lat
Long
Other Important info about the station
And then somehow when I query this table and want to see info about specific stations data I would be able to pull up the data that would look something like this:
datetime data
1/1/1980 11.6985
1/2/1980 43.6431
1/3/1980 54.9089
1/4/1980 63.1225
1/5/1980 72.4399
1/6/1980 79.1363
1/7/1980 82.2778
1/8/1980 86.0785
1/9/1980 86.8612
1/10/1980 84.3342
1/11/1980 80.4646
1/12/1980 77.1508
1/13/1980 74.827
1/14/1980 73.387
1/15/1980 72.1774
1/16/1980 71.6423
Since I don't know much about table hierarchy, I don't know how to do this, but I feel like it is probably possible. Any help would be appreciated!

using different data collection stations
This immediately indicates that a separate table should be used for the stations, and that the readings should reference (relate to) the stations table.
For the stations table you could have something like :-
CREATE TABLE IF NOT EXISTS stations (id INTEGER PRIMARY KEY, station_name TEXT, station_latitude REAL, station_longitude REAL);
This will create a table (if it doesn't already exist) that has 4 columns :-
The first column, id, is a unique identifier that will be generated automatically and is what you would use to reference a specific station.
The second column, station_name, is for the name of the station and is of type TEXT.
The third and fourth columns are for the station's location as latitude and longitude.
You could add a couple of stations using :-
INSERT INTO stations (station_name, station_latitude, station_longitude) VALUES('Zebra', 100.7892, 60.789);
INSERT INTO stations (station_name, station_latitude, station_longitude) VALUES('Yankee', 200.2967, 95.234);
You could display/return these using :-
SELECT * FROM stations
that is, SELECT all columns (*) FROM the table called stations. The result would be the two stations just inserted, with the automatically generated ids 1 (Zebra) and 2 (Yankee).
Next you could create the readings table e.g. :-
CREATE TABLE IF NOT EXISTS readings(recorded_datetime INTEGER DEFAULT (datetime('now')), data_recorded REAL, station_reference INTEGER);
This will create a table named readings (if it doesn't already exist) with 3 columns :-
recorded_datetime, which is declared as INTEGER (an integer can take up to 8 bytes, i.e. it can be pretty large). It will hold the time stamp. Note that datetime() returns text in the form YYYY-MM-DD HH:MM:SS, and because SQLite's typing is flexible the value is stored as that text despite the INTEGER declaration; declaring the column as TEXT would describe it more accurately. Although perhaps not what you want, as an example a default value, the current datetime, is used if no value is specified for this column.
data_recorded, a REAL, which is for the data.
station_reference, which will refer to the station's id.
You could then insert a reading for the Zebra station using :-
INSERT INTO readings (data_recorded,station_reference) VALUES(11.6985,1);
As no value has been provided for the recorded_datetime column, the current datetime will be used.
If you instead supply the datetime yourself :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),11.6985,1);
then this reading would be for 1/1/1980 at 10:40 for station 1.
The following will add some readings for the Yankee station (id 2) :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),13.6985,2);
INSERT INTO readings VALUES(datetime('1966-03-01 10:40'),15.6985,2);
INSERT INTO readings VALUES(datetime('2000-01-01 10:40'),11.6985,2);
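As a rough check of how the default and the explicit datetime behave, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database (the in-memory database and the typeof() check are my additions for illustration, not part of the original answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.execute(
    "CREATE TABLE readings ("
    "recorded_datetime INTEGER DEFAULT (datetime('now')), "
    "data_recorded REAL, station_reference INTEGER)"
)
# No timestamp supplied: the column default fills in the current datetime.
conn.execute(
    "INSERT INTO readings (data_recorded, station_reference) VALUES (11.6985, 1)"
)
# Explicit timestamp supplied for station 2.
conn.execute(
    "INSERT INTO readings VALUES (datetime('1980-01-01 10:40'), 13.6985, 2)"
)

rows = conn.execute(
    "SELECT recorded_datetime, typeof(recorded_datetime), station_reference "
    "FROM readings ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Both rows report a storage type of text, which is why comparing and sorting these values works as plain string comparison on the YYYY-MM-DD HH:MM:SS form.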
Using SELECT station_reference, recorded_datetime, data_recorded FROM readings; will select all the columns, but with station_reference as the first column of the result, followed by the others in the order listed.
The obvious progression is to display the data together with the respective station. For this a JOIN will be used. That is, the readings table will be joined with the stations table, with each reading picking up the details of the station whose id matches its station_reference value.
However, let's say that we wanted the Station info to be something like stationname (Long=???? - Lat=????) date/time data and be sorted according to station name and then according to date/time. Then the following could be used :-
SELECT
stations.station_name ||
'(Long='||station_longitude||' - Lat='||station_latitude||')'
AS stationinfo,
readings.recorded_datetime,
readings.data_recorded
FROM readings
JOIN stations ON readings.station_reference = stations.id
ORDER BY stations.station_name ASC, readings.recorded_datetime
Note that this is shown more as an example that you can do quite complex things in SQL, rather than with any expectation that you fully understand the coding.
This would result in one row per reading: Yankee's readings first, in date/time order, followed by Zebra's.
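To try the two tables and the JOIN end to end, something like the following sketch could be used. It assumes Python's built-in sqlite3 module and an in-memory database, and it writes the timestamps as fixed text values (rather than relying on the current-datetime default) so that the output is repeatable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE stations (id INTEGER PRIMARY KEY, station_name TEXT,
                       station_latitude REAL, station_longitude REAL);
CREATE TABLE readings (recorded_datetime TEXT, data_recorded REAL,
                       station_reference INTEGER);
INSERT INTO stations (station_name, station_latitude, station_longitude)
    VALUES ('Zebra', 100.7892, 60.789), ('Yankee', 200.2967, 95.234);
INSERT INTO readings VALUES
    ('1980-01-01 10:40:00', 11.6985, 1),
    ('1980-01-01 10:40:00', 13.6985, 2),
    ('1966-03-01 10:40:00', 15.6985, 2),
    ('2000-01-01 10:40:00', 11.6985, 2);
""")

# Same query as in the answer: station details joined onto each reading.
joined = conn.execute("""
    SELECT stations.station_name
           || '(Long=' || station_longitude || ' - Lat=' || station_latitude || ')'
           AS stationinfo,
           readings.recorded_datetime,
           readings.data_recorded
    FROM readings
    JOIN stations ON readings.station_reference = stations.id
    ORDER BY stations.station_name ASC, readings.recorded_datetime
""").fetchall()
for row in joined:
    print(row)
```

The three Yankee readings come out first, oldest to newest, and the Zebra reading last, because the ORDER BY sorts on station name before date/time.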
You may (or some would) ask: why can't I just have a single table with reading, datetime, station name, station latitude and station longitude?
Well, you could. BUT :-
Say there were a directive to change station Zebra's name; then you'd have to trawl through all the rows and change the name numerous times. That is easy to code but relatively costly in terms of resource usage. Using the two tables means that just a single update is required.
For each row you would have to duplicate the station data, very likely wasting disk space and thus increasing the resources needed to access the data. That is, 'Zebra' would take at least 5 bytes and the two REALs take 8 bytes each, so that's 5 + 16 = 21 bytes per reading. A reference to a single occurrence of that data takes a maximum of 8 bytes for an integer (initially just a single byte). So the single table would cost at least 13 extra bytes per reading.
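The "single update" point can be sketched as follows, again assuming Python's sqlite3 module and an in-memory database. The new name 'Zulu' is made up purely for the example, and the stations table is cut down to the columns that matter here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE stations (id INTEGER PRIMARY KEY, station_name TEXT);
CREATE TABLE readings (recorded_datetime TEXT, data_recorded REAL,
                       station_reference INTEGER);
INSERT INTO stations (station_name) VALUES ('Zebra'), ('Yankee');
INSERT INTO readings VALUES
    ('1980-01-01', 11.6985, 1),
    ('1980-01-02', 43.6431, 1),
    ('1980-01-03', 54.9089, 1);
""")

# One UPDATE touching one row renames the station for every reading
# that references it ('Zulu' is a hypothetical new name).
cur = conn.execute("UPDATE stations SET station_name = 'Zulu' WHERE id = 1")
rows_changed = cur.rowcount

names = [
    row[0]
    for row in conn.execute(
        "SELECT s.station_name FROM readings r "
        "JOIN stations s ON r.station_reference = s.id"
    )
]
print(rows_changed, names)
```

With a single combined table the same rename would have to rewrite one row per reading instead of one row in total.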

Related

Partitioned by in Apache HIVE, more questions

There are some good questions/answers here
Hive clustered by on more than one column
hive subquery optimization using cluster by
difference between Cluster By and CLUSTERED BY in hive?
What is the difference between partitioning and bucketing a table in Hive ?
but I have a few more, unfortunately there is no good explanation here on page 24:
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/using-hiveql/hive_using_hiveql.pdf
My questions:
In below example from the above:
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23') VALUES
('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp) VALUES ('tjohnson',
'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null,
'2014-09-21');
Why does "datestamp STRING" not exist in the schema of pageviews?
Why is it defined as a string? Shouldn't it be a TIMESTAMP?
Why does the second insert leave the value out of the PARTITION clause, naming only the column, yet include it among the values (i.e. '2014-09-23' and '2014-09-21')?
Why does "datestamp STRING" not exist in the schema of pageviews?
Although datestamp looks and behaves like a standard column defined in the schema, it's actually just a reference to a particular partition of the underlying data for the table. When you see '2014-09-23' in the datestamp column, it's not actually showing you a value contained in a particular record in one of the data files. Instead it's telling you that the data in the rest of the row comes from an HDFS directory called 'datestamp=2014-09-23' that contains a partition or "chunk" of the data. This is where a lot of the optimization comes in, since filtering a query to a particular partition allows Hive to simply go to the data in that particular directory and ignore the data contained in all the other partitions.
Why is it defined as a string? Shouldn't it be a TIMESTAMP?
Since a partition simply refers to a directory name, it makes sense that the type is a string representation of a specific date format instead of a timestamp or date. Conceptually, a date field would not make sense: although '2014-09-23' and '9/23/2014' are two equal datestamps, they would be considered different directories if they were directory names. In other words, if a directory is named '2014-09-23', you cannot refer to it by any other name, which makes it more like a string and less like a date, which has many alternate forms that are all equivalent. Furthermore, Hive has traditionally treated dates as strings, which makes a string a better choice than, say, an int. For example, in Hive versions before 2.1.0, passing a timestamp to the built-in to_date() function returns the date as a string.
Also, since you mentioned timestamp: using a full timestamp that has fractions of a second in it is a bad idea for partitions, even if you use a string representation of it. You would end up with a huge number of partitions and probably one, or at most a few, records in each partition. I would imagine you would quickly lose any of the performance benefits of partitioning.
Why does the second insert leave the value out of the PARTITION clause, naming only the column, yet include it among the values (i.e. '2014-09-23' and '2014-09-21')?
This is simply a different syntax that produces the same result. When you include partitions, Hive will assume that the values at the end of the values list refer to the partitions. So if you have a table with 3 columns in your schema and 1 partition, when you perform an insert into table command and specify partition (datestamp), you can just pass in 4 values and Hive will know that the first 3 values are to be inserted into the 3 columns in your schema, and that the fourth value says which datestamp partition this record's data should be added to.
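As a toy model of that rule (this is plain Python, not Hive code, and it calls no Hive API), the helper below splits a values tuple into the schema columns and the trailing partition value, and builds the directory name the answer describes. The function name split_row is made up for the illustration:

```python
# Toy model only: mimics the rule described above, it is not how Hive is implemented.
SCHEMA_COLUMNS = ["userid", "link", "from"]   # columns declared in CREATE TABLE
PARTITION_COLUMNS = ["datestamp"]             # columns declared in PARTITIONED BY


def split_row(values):
    """Split a VALUES tuple into (stored column values, partition directory)."""
    n = len(SCHEMA_COLUMNS)
    stored = dict(zip(SCHEMA_COLUMNS, values[:n]))
    partition = dict(zip(PARTITION_COLUMNS, values[n:]))
    directory = "/".join(f"{key}={value}" for key, value in partition.items())
    return stored, directory


stored, directory = split_row(("tlee", "finance.com", None, "2014-09-21"))
print(stored)
print(directory)  # datestamp=2014-09-21
```

The first three values end up as the record itself, while the fourth only decides which directory the record is filed under, which is why datestamp never appears among the schema columns.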

Row Stores vs Column Stores

Assuming that the database is already populated with data, and that each of the following SQL statements is the one and only query that an application will perform, why is it better to use row-wise or column-wise record storage for the following queries?...
1) SELECT * FROM Person
2) SELECT * FROM Person WHERE id=5
3) SELECT AVG(YEAR(DateOfBirth)) FROM Person
4) INSERT INTO Person (ID,DateOfBirth,Name,Surname) VALUES(2e25,'1990-05-01','Ute','Muller')
In those examples Person.id is the primary key.
The article Row Store and Column Store Databases gives a general discussion on this, but I am specifically concerned about the four queries above.
SELECT * FROM ... queries are better suited to row stores, since a column store has to access numerous files (typically one per column) to put each row back together.
A column store is good for aggregation over a large volume of data, or when you have queries that only need a few fields from a wide table.
Therefore:
1st query: row-wise
2nd query: row-wise
3rd query: column-wise
4th query: row-wise
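A small in-memory sketch of the two layouts may make that reasoning more concrete. It is only an illustration in plain Python: the sample people are made up, and real storage engines are far more involved than lists and dictionaries:

```python
from datetime import date

# Row-wise layout: the fields of each record are stored together.
rows = [
    (1, date(1985, 3, 2), "Anna", "Schmidt"),
    (5, date(1990, 5, 1), "Ute", "Muller"),
    (7, date(1978, 11, 23), "Jan", "Becker"),
]

# Column-wise layout: the values of each column are stored together.
columns = {
    "id": [r[0] for r in rows],
    "DateOfBirth": [r[1] for r in rows],
    "Name": [r[2] for r in rows],
    "Surname": [r[3] for r in rows],
}

# Query 2 (SELECT * ... WHERE id=5): the row layout hands back one whole record.
person_row_store = next(r for r in rows if r[0] == 5)

# The column layout has to visit every column to reassemble the same record.
position = columns["id"].index(5)
person_col_store = tuple(columns[name][position] for name in columns)

# Query 3 (AVG(YEAR(DateOfBirth))): the column layout reads just one column.
years = [d.year for d in columns["DateOfBirth"]]
avg_year = sum(years) / len(years)

# Query 4 (INSERT): one append in the row layout, one append per column otherwise.
new_person = (9, date(2000, 1, 1), "Eva", "Klein")
rows.append(new_person)
for name, value in zip(columns, new_person):
    columns[name].append(value)

print(person_row_store == person_col_store, avg_year)
```

Queries 1, 2 and 4 touch every field of a record, which favours the row layout, while query 3 touches one field of every record, which favours the column layout.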
I have no idea what you are asking. You have this statement:
INSERT INTO Person (ID, DateOfBirth, Name, Surname)
VALUES('2e25', '1990-05-01', 'Ute', 'Muller');
This suggests that you have a table with four columns, one of which is an id. Each person is stored in their own row.
You then have three queries. The first cannot be optimized. The second is optimized, assuming that id is a primary key (a reasonable assumption). The third requires a full table scan -- although that could be ameliorated with an index only on DateOfBirth.
If the data is already in this format, why would you want to change it?
This is a very simple data structure. Three of your four query examples access all columns. I see no reason why you would not use a regular row-store table structure.

in sql in a table, in a given column with data type text, how can we show the rest of the entries in that column after a particular entry

in sql, in any given table, in a column named "name", with data type text,
if there are ten entries, suppose one entry in the column is "rohit". i want to show all the entries in the name column after rohit, and i do not know the row id or id. can it be done?
select * from your_table where name > 'rohit'
but in general you should not treat text columns like that.
a database is more than a collection of tables.
think about how to organize your data, what defines a datarow.
maybe, beside their name, there is another thing how you would classify such a row? some things like "shall be displayed?" "is modified" "is active"?
so if you had a second column, say display of type int and your table looked like
CREATE TABLE MYDATA (
NAME TEXT,
DISPLAY INT NOT NULL DEFAULT 1
);
you could flag every row with 1 or 0 whether it should be displayed or not and then your query could look like
SELECT * FROM MYDATA WHERE DISPLAY=1 ORDER BY NAME
to get your list of values.
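here is a small sketch of both queries, assuming Python's built-in sqlite3 module and an in-memory database. the four names and their insertion order are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE MYDATA (NAME TEXT, DISPLAY INT NOT NULL DEFAULT 1);
INSERT INTO MYDATA (NAME) VALUES ('vikram'), ('rohit'), ('amit'), ('sunil');
-- hide the rows up to and including rohit (in insertion order)
UPDATE MYDATA SET DISPLAY = 0 WHERE NAME IN ('vikram', 'rohit');
""")

# NAME > 'rohit' is an alphabetical comparison, not "rows stored after rohit".
after_rohit = [
    row[0]
    for row in conn.execute(
        "SELECT NAME FROM MYDATA WHERE NAME > 'rohit' ORDER BY NAME"
    )
]

# the flag column selects rows by an explicit property instead.
displayed = [
    row[0]
    for row in conn.execute(
        "SELECT NAME FROM MYDATA WHERE DISPLAY = 1 ORDER BY NAME"
    )
]
print(after_rohit, displayed)
```

the first query returns the names that sort after 'rohit' alphabetically, while the flag query returns exactly the rows that were marked for display, whatever their names are.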
it's not much of a difference with ten rows, you don't even need indexes here, but if you build something bigger, say 10,000+ rows, you'd be surprised how slow that would become!
in general, TEXT columns are good to select and display, but should be avoided in a WHERE condition as much as you can. use descriptive columns, preferably int fields, which can be indexed with extremely high efficiency, so that the application doesn't get slower even if the number of records goes over 100k.
You can use "default" keyword for it.
CREATE TABLE Persons (
ID int NOT NULL,
name varchar(255) DEFAULT 'rohit'
);

I need help counting char occurencies in a row with sql (using firebird server)

I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields, one for each day of the month, in each row.
In these fields I have a char value of 1 or 2 characters representing the car's status on that date. I need to make a query to get the number of occurrences of each possible value; the field for any day can have the values D, I, R, TA, RZ, BV and LR.
I need to count, in each row, how many times each value occurs in that row.
Like how many I, how many D and so on. And this for every row in the table.
What would be the best approach here? Also, maybe there is a better way than having a field in the database table for each day, because that obviously makes over 30 fields.
There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
Then your query would simply be:
select status, count(*)
from CarStatuses
where date >= :month_start and date < :month_end
group by status;
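To see that query run, here is a minimal sketch using Python's built-in sqlite3 module rather than Firebird (an assumption made so the example is self-contained). The date column is called StatusDate here to avoid clashing with the DATE keyword, and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Firebird (assumption)
conn.executescript("""
CREATE TABLE CarStatuses (CarId INTEGER, StatusDate TEXT, Status TEXT);
INSERT INTO CarStatuses VALUES
    (1, '2024-03-01', 'D'), (1, '2024-03-02', 'D'), (1, '2024-03-03', 'TA'),
    (2, '2024-03-01', 'I'), (2, '2024-03-02', 'D'), (2, '2024-04-01', 'RZ');
""")
month = ("2024-03-01", "2024-04-01")  # month start (inclusive), next month (exclusive)

# Counts per status across all cars, as in the query above.
totals = conn.execute("""
    SELECT Status, COUNT(*)
    FROM CarStatuses
    WHERE StatusDate >= ? AND StatusDate < ?
    GROUP BY Status
    ORDER BY Status
""", month).fetchall()

# Variant: counts per car and status, closer to "for every row" in the question.
per_car = conn.execute("""
    SELECT CarId, Status, COUNT(*)
    FROM CarStatuses
    WHERE StatusDate >= ? AND StatusDate < ?
    GROUP BY CarId, Status
    ORDER BY CarId, Status
""", month).fetchall()
print(totals)
print(per_car)
```

Adding CarId to the GROUP BY is all it takes to get the per-car breakdown, which is the kind of change that is painful with one column per day.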
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
from t
) union all
(select status_02
from t
) union all
. . .
(select status_31
from t
)
) s
group by status;
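A cut-down version of that UNION ALL approach, with only three day columns instead of 31, can be tried like this. It uses Python's built-in sqlite3 module instead of Firebird (an assumption for the sake of a runnable example), and the sample statuses are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Firebird (assumption)
conn.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY,
                status_01 TEXT, status_02 TEXT, status_03 TEXT);
INSERT INTO t (status_01, status_02, status_03) VALUES
    ('D', 'D', 'TA'),
    ('I', 'D', 'RZ');
""")

# Stack the day columns on top of each other, then count as usual.
counts = conn.execute("""
    SELECT status, COUNT(*)
    FROM (SELECT status_01 AS status FROM t
          UNION ALL
          SELECT status_02 FROM t
          UNION ALL
          SELECT status_03 FROM t) s
    GROUP BY status
    ORDER BY status
""").fetchall()
print(counts)
```

UNION ALL (rather than UNION) matters here, because plain UNION would remove duplicate statuses before they could be counted.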
You seem to need to start with the most basic tutorials on relational databases and SQL design. Some classic works like "Martin Gruber - Understanding SQL" may help, or others. At the moment you are missing the basics.
A few hints.
Documents that you print for the user or receive from the user do not represent your internal data structures. They are created/parsed for that very purpose: the machine-to-human interface. Inside, your program should structure the data for ease of storing and processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from the "R" status you can transition to either the "D" status or the "BV" status, but not to any other. In other words, you had better draft the "directed graph" of possible status transitions. You would keep it in extra columns of that dictionary table or in one more specialized helper table: a dictionary of transitions for the dictionary of possible statuses.
Your paper form combines both totals and per-day details in the same row. That is easy for a human to look at, but for a computer it in a sense violates the single responsibility principle. A row should be responsible either for a primary record or for a derived total. You had better have two tables: one for the primary day-by-day records and another for the per-month totals.
A bonus is that when you change values in the primary data table you can ask the server to automatically recalculate the corresponding month totals. Read about SQL triggers.
Your triggers may also check whether the new state is a proper transition from the previous day's state, as described in the "business rules". They might also have to check that there are no gaps between days. If there is a record for "march 03" and a new record for "march 05" is inserted, then a record for "march 04" should exist, or the server would prohibit adding such a row. Well, maybe not; that depends on your business processes. The general idea is that the server should reject any data that is not valid, whenever the server is able to know that.
Your per-date and per-month tables should have proper UNIQUE CONSTRAINTs prohibiting the entry of duplicate rows. This also means the former should have a DATE-type column, and the latter should either have month and year INTEGER-type columns or have a DATE-type column whose day part is always "1"; you would want a CHECK CONSTRAINT for that.
If your company has some registry of cars (and it probably does, as it does not look like those cars were driven in by random one-time customers passing by), you have to introduce a dictionary table of cars: integer ID (PK), registration plate, engine factory number, body (chassis) factory number, colour and whatever else.
The per-month totals table would not have a separate column for every status. It would instead have a separate row for every status! The structure would probably be like this: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be of integer type (some may be SmallInt or BigInt, but that is a minor nuance). All the columns together (without the count column) should constitute a UNIQUE CONSTRAINT or, even better, a "compound" Primary Key. Adding a dedicated PK column to the totals table seems redundant to me.
Consequently, your per-day and per-month tables would not hold literal (textual and immediate) data for status and car id. Instead they would hold integer IDs referencing the proper records in the corresponding cars dictionary and status dictionary tables. You would code that as a FOREIGN KEY.
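A rough sketch of such a layout, with the dictionary tables, the foreign keys and a unique constraint on the per-day table, could look like this. It uses Python's built-in sqlite3 module instead of Firebird (an assumption so the example runs anywhere), and every table and column name, as well as the sample data, is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Firebird (assumption)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE statuses (id INTEGER PRIMARY KEY,
                       abbreviation TEXT NOT NULL UNIQUE,
                       description TEXT);
CREATE TABLE cars (id INTEGER PRIMARY KEY,
                   registration TEXT NOT NULL UNIQUE);
CREATE TABLE car_day_status (
    car_id INTEGER NOT NULL REFERENCES cars(id),
    status_date TEXT NOT NULL,
    status_id INTEGER NOT NULL REFERENCES statuses(id),
    UNIQUE (car_id, status_date)       -- at most one status per car per day
);
INSERT INTO statuses (abbreviation, description) VALUES ('D', 'hypothetical meaning');
INSERT INTO cars (registration) VALUES ('ABC-123');
INSERT INTO car_day_status VALUES (1, '2024-03-01', 1);
""")


def try_insert(row):
    """Insert a (car_id, status_date, status_id) row; return 'ok' or the error text."""
    try:
        conn.execute("INSERT INTO car_day_status VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


print(try_insert((1, "2024-03-01", 1)))   # duplicate day for the same car
print(try_insert((1, "2024-03-02", 99)))  # status id that is not in the dictionary
print(try_insert((1, "2024-03-02", 1)))   # valid row
```

The first two inserts are rejected by the server itself, which is the point made above: invalid data never gets stored, whatever the client program does.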
Remember the rule of thumb: it is easy to add or delete a row in any table, but quite hard to add or delete a column.
With a column-oriented design like yours, what would happen if next year the boss introduced some more statuses? You would have to redesign the table, change the program in many places, and so on.
With the row-oriented design you would just have to add one row to the statuses dictionary and maybe a few rows to the transition rules dictionary, and the rest would work without any change.
That way you would not

How does Oracle perform read operation?

Suppose we have a table which holds information about person. Columns like NAME or SURNAME are small (I mean their size isn't very large), but columns that hold a photo or maybe a person's video (blob columns) may be very large. So when we perform a select operation:
select * from person
it will retrieve all this information. But in most cases we need only retrieve name or surname of person, so we perform this query:
select name, surname from person
Question: will Oracle read the whole record (including the blob columns) and then simply filter out name and surname columns, or will it only read name and surname columns?
Also, even if we create a separate table for such large data(person's photo and video) and have a foreign key to that table in person's table and want to retrieve only photo, so we perform this query:
select photo
from person p
join largePesonData d on p.largeDataID = d.largeDataID
where p.id = 1
Will Oracle read a whole record in person table and whole record in largePesonData or will it simply read the column with photo in largePesonData?
Oracle reads the data in blocks.
Let's assume that your block size is 8192 bytes and your average row size is 100 bytes. That would mean each block holds about 8192/100 = 81 rows (this is not exact, since there is some overhead from the block header, but I'm trying to keep things simple).
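The same back-of-the-envelope arithmetic, as a small Python sketch (plain arithmetic only, ignoring the block header overhead just like the estimate above; the helper name blocks_for is made up):

```python
import math

BLOCK_SIZE = 8192      # bytes, from the example above
AVG_ROW_SIZE = 100     # bytes, from the example above

rows_per_block = BLOCK_SIZE // AVG_ROW_SIZE   # integer division


def blocks_for(row_count):
    """Blocks needed to hold row_count rows, ignoring block-header overhead."""
    return math.ceil(row_count / rows_per_block)


print(rows_per_block)      # 81
print(blocks_for(1))       # 1
print(blocks_for(10_000))  # 124
```

Even a query that wants a single row still costs at least one whole block read, which is the point of the explanation that follows.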
So when you
select name, surname from person;
you actually retrieve at least one block with all of its data (81 rows), and only afterwards is it filtered down to the data you requested.
Two exceptions to this are:
BLOB column - "select name, surname from person" will not retrieve the BLOB contents themselves, because a BLOB column contains a reference to the actual BLOB, which sits somewhere else in the tablespace or even in another tablespace. (By default, only small LOBs, up to roughly 4,000 bytes, may be stored inline in the row.)
Indexed columns - if you created an index on the table using the columns name and surname, it is possible that Oracle will scan only this index and retrieve only those two columns.