How does Oracle perform a read operation? - sql

Suppose we have a table that holds information about people. Columns like NAME or SURNAME are small, but columns that hold a photo or maybe a person's video (BLOB columns) may be very large. So when we perform a select operation:
select * from person
it will retrieve all of this information. But in most cases we only need to retrieve the name or surname of a person, so we perform this query:
select name, surname from person
Question: will Oracle read the whole record (including the BLOB columns) and then simply filter out the name and surname columns, or will it read only the name and surname columns?
Also, suppose we create a separate table for such large data (a person's photo and video), keep a foreign key to that table in the person table, and want to retrieve only the photo, so we perform this query:
select photo
from person p
join largePersonData d on p.largeDataID = d.largeDataID
where p.id = 1
Will Oracle read the whole record in person and the whole record in largePersonData, or will it read only the photo column from largePersonData?

Oracle reads the data in blocks.
Let's assume that your block size is 8192 bytes and your average row size is 100 bytes - that would mean each block holds roughly 8192/100 ≈ 81 rows (it's not exact, since there is some overhead from the block header - but I'm trying to keep things simple).
So when you
select name, surname from person;
you actually retrieve at least one block with all of its data (81 rows); the block is then screened and only the columns you requested are returned.
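If you want to check the block size your own database uses, you can query the instance parameters (a minimal sketch; it assumes you have privileges to read v$parameter):
-- returns the block size in bytes, e.g. 8192
select value from v$parameter where name = 'db_block_size';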
Two exceptions to this whole-block behaviour are:
BLOB columns - "select name, surname from person" will not retrieve the BLOB contents themselves, because a BLOB column contains a reference (a LOB locator) to the actual BLOB, which sits somewhere else in the tablespace, or even in another tablespace.
Indexed columns - if you created an index on the table using the columns name and surname, it is possible that Oracle will scan only that index and retrieve those two columns without touching the table blocks at all.
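For example, a covering index along these lines (a sketch; the index name is made up) can let Oracle answer the query from the index alone:
-- composite index covering both selected columns;
-- if at least one of the columns is NOT NULL, Oracle can satisfy
-- "select name, surname from person" with an index fast full scan,
-- never touching the table blocks
create index person_name_ix on person (name, surname);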

Related

Does the number of columns affect read query performance with projection?

Consider that I have a table USERS_1 with 2 columns: id and name,
and another table USERS_2 with 3 columns: id, name and age.
I have indexes on id on both tables, and both tables contain 20 rows with the same data for id and name. Let's consider a Postgres DB as an example.
Will there be a performance difference between the following queries:
SELECT id, name FROM USERS_1 WHERE id < 10
SELECT id, name FROM USERS_2 WHERE id < 10
Let's say this WHERE clause matches 5 rows in both tables.
I have heard that since USERS_2 has more columns, more I/O operations might be needed, because the DB server has to read the entire row from disk before projecting; projection only helps in transferring less data to the client. Is that correct?
Ref: https://community.oracle.com/tech/developers/discussion/3764712/does-the-number-of-columns-in-a-table-can-affect-the-performance#:~:text=So%20yes%2C%20250%20columns%20typically,rows%20of%205%20cols%20each.
I do know that the number of rows and columns here is too small to observe any performance difference, but the intent is to understand how projection and I/O reads are related.
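One way to check this yourself in Postgres is to compare the buffer counts reported by the planner (a sketch, assuming the two tables described above):
-- the "Buffers: shared hit=N read=M" lines in the output show how many
-- 8 kB pages each query actually touched
EXPLAIN (ANALYZE, BUFFERS) SELECT id, name FROM USERS_1 WHERE id < 10;
EXPLAIN (ANALYZE, BUFFERS) SELECT id, name FROM USERS_2 WHERE id < 10;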

Row Stores vs Column Stores

Assuming that the database is already populated with data, and that each of the following SQL statements is the one and only query that an application will perform, why is it better to use row-wise or column-wise record storage for the following queries?...
1) SELECT * FROM Person
2) SELECT * FROM Person WHERE id=5
3) SELECT AVG(YEAR(DateOfBirth)) FROM Person
4) INSERT INTO Person (ID,DateOfBirth,Name,Surname) VALUES(2e25,'1990-05-01','Ute','Muller')
In those examples Person.id is the primary key.
The article Row Store and Column Store Databases gives a general discussion on this, but I am specifically concerned about the four queries above.
SELECT * FROM ... queries are better for row stores, since a column store would have to access many separate column files to reassemble each row.
A column store is good for aggregation over a large volume of data, or when you have queries that only need a few fields from a wide table.
Therefore:
1st query: row-wise
2nd query: row-wise
3rd query: column-wise
4th query: row-wise
I have no idea what you are asking. You have this statement:
INSERT INTO Person (ID, DateOfBirth, Name, Surname)
VALUES('2e25', '1990-05-01', 'Ute', 'Muller');
This suggests that you have a table with four columns, one of which is an id. Each person is stored in their own row.
You then have three queries. The first cannot be optimized. The second is optimized, assuming that id is a primary key (a reasonable assumption). The third requires a full table scan -- although that could be ameliorated with an index only on DateOfBirth.
If the data is already in this format, why would you want to change it?
This is a very simple data structure. Three of your four query examples access all columns. I see no reason why you would not use a regular row-store table structure.
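If the third query did become a bottleneck in a row store, the index mentioned above is a one-liner (a sketch; the index name is made up and a dialect with a YEAR function, such as MySQL, is assumed):
-- lets the engine compute the average by reading just the index
-- instead of scanning every full row
CREATE INDEX idx_person_dob ON Person (DateOfBirth);
SELECT AVG(YEAR(DateOfBirth)) FROM Person;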

In SQL, in a table, in a given column with data type text, how can we show the rest of the entries in that column after a particular entry

In SQL, in any given table, in a column named "name" with data type text:
if there are ten entries, and suppose an entry in the column is "rohit", I want to show all the entries in the name column after "rohit", and I do not know the row id or id. Can it be done?
select * from your_table where name > 'rohit'
Note that this returns the rows whose name sorts alphabetically after 'rohit', not the rows that were entered after it - without an explicit ordering column, "after" has no defined meaning in a table. But in general you should not treat text columns like that.
A database is more than a collection of tables.
Think about how to organize your data, and what defines a data row.
Maybe, besides their name, there is another way you would classify such a row? Things like "shall be displayed?", "is modified", "is active"?
So if you had a second column, say display of type int, and your table looked like
CREATE TABLE MYDATA (
    NAME TEXT,
    DISPLAY INT NOT NULL DEFAULT 1
);
you could flag every row with 1 or 0 according to whether it should be displayed or not, and then your query could look like
SELECT * FROM MYDATA WHERE DISPLAY=1 ORDER BY NAME
to get your list of values.
It's not much of a difference with ten rows - you don't even need indexes here - but if you build something bigger, say 10,000+ rows, you'd be surprised how slow that would become!
In general, TEXT columns are fine to select and display, but should be avoided in WHERE conditions as much as you can. Use describing columns, preferably int fields, which can be indexed with extremely high efficiency, so an application doesn't get slower even when the record count goes over 100k.
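As a sketch of that advice (the index name is made up), an index on the int flag keeps the query above fast as the table grows:
-- int columns index compactly; the WHERE DISPLAY=1 filter can now use it
CREATE INDEX IDX_MYDATA_DISPLAY ON MYDATA (DISPLAY);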
You can use the "default" keyword for it.
CREATE TABLE Persons (
ID int NOT NULL,
name varchar(255) DEFAULT 'rohit'
);

How to structure an SQL table with metadata

So I am an SQL newbie, and I would like to organize a structured database that has metadata and data in SQLite. I am not sure how to do this, and I have looked around at different internet sites, but I haven't found anything helpful.
Basically what I would want is something like this (using different data collection stations as an example):
SQL TABLE:
Location
Lat
Long
Other Important info about the station
And then somehow, when I query this table and want to see info about a specific station's data, I would be able to pull up data that would look something like this:
datetime data
1/1/1980 11.6985
1/2/1980 43.6431
1/3/1980 54.9089
1/4/1980 63.1225
1/5/1980 72.4399
1/6/1980 79.1363
1/7/1980 82.2778
1/8/1980 86.0785
1/9/1980 86.8612
1/10/1980 84.3342
1/11/1980 80.4646
1/12/1980 77.1508
1/13/1980 74.827
1/14/1980 73.387
1/15/1980 72.1774
1/16/1980 71.6423
Since I don't know much about table hierarchy, I don't know how to do this, but I feel like it is probably possible. Any help would be appreciated!
using different data collection stations
Immediately indicates that a separate table for stations should be used and that the readings should relate to/associate with/reference the stations table.
For the stations table you could have something like :-
CREATE TABLE IF NOT EXISTS stations (id INTEGER PRIMARY KEY, station_name TEXT, station_latitude REAL, station_longitude REAL);
This will create a table (if it doesn't already exist) that has 4 columns :-
The first column id is a unique identifier that will be generated automatically and is what you would use to reference a specific station.
The second column, station_name, is for the name of the station and is of type TEXT.
The third and fourth columns are for the station's location according to lat and long.
You could add a couple of stations using :-
INSERT INTO stations (station_name, station_latitude,station_longitude) VALUES("Zebra", 100.7892, 60.789);
INSERT INTO stations (station_name, station_latitude,station_longitude) VALUES("Yankee", 200.2967, 95.234);
You could display/return these using :-
SELECT * FROM stations
that is, SELECT all columns (*) FROM the table called stations; the result would be the two rows just inserted, id 1 for Zebra and id 2 for Yankee (the id values having been generated automatically).
Next you could create the readings table e.g. :-
CREATE TABLE IF NOT EXISTS readings(recorded_datetime INTEGER DEFAULT (datetime('now')), data_recorded REAL, station_reference INTEGER);
This will create a table named readings (if it doesn't already exist); it will have 3 columns :-
recorded_datetime, which is of type INTEGER (can store an integer of up to 8 bytes, i.e. pretty large). This will be used to store a timestamp. Perhaps not exactly what you want but, as an example, a default value (the current datetime) will be used if no value is specified for this column.
data_recorded, a REAL, that is for the data itself.
station_reference, which will refer to the station's id.
You could then insert a reading for the Zebra station using :-
INSERT INTO readings (data_recorded,station_reference) VALUES(11.6985,1);
As the recorded_datetime column has not been provided, the current datetime will be used.
If :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),11.6985,1);
Then this reading would be for 1/1/1980 at 10:40 for station 1.
Using :-
INSERT INTO readings VALUES(datetime('1980-01-01 10:40'),13.6985,2);
INSERT INTO readings VALUES(datetime('1966-03-01 10:40'),15.6985,2);
INSERT INTO readings VALUES(datetime('2000-01-01 10:40'),11.6985,2);
will add some readings for the Yankee station (id 2).
Using SELECT station_reference, recorded_datetime, data_recorded FROM readings; will select all the columns, but with station_reference as the first column in the result, and so on.
The obvious progression is to display the data including the respective station. For this a JOIN will be used. That is, the readings table will be joined with the stations table, with the respective station's details found according to the station_reference value matching the station's id.
However, let's say that we wanted the Station info to be something like stationname (Long=???? - Lat=????) date/time data and be sorted according to station name and then according to date/time. Then the following could be used :-
SELECT
stations.station_name ||
'(Long='||station_longitude||' - Lat='||station_latitude||')'
AS stationinfo,
readings.recorded_datetime,
readings.data_recorded
FROM readings
JOIN stations ON readings.station_reference = stations.id
ORDER BY stations.station_name ASC, readings.recorded_datetime
Note this is shown more as an example of the quite complex things you can do in SQL, rather than with an expectation of fully understanding the coding.
This would result in one row per reading, each prefixed with the stationinfo string, sorted by station name and then by date/time.
You may ask (or some would argue): why can't I just have a single table with reading, datetime, station name, station latitude and station longitude?
Well, you could. BUT :-
Say there were a directive to change station Zebra's name: then you'd have to trawl through all the rows and change the name numerous times. Easy to code, but relatively costly in terms of resource usage. Using the two tables means that just a single update is required (see the sketch after this list).
For each row you would have to duplicate data, very likely wasting disk space and thus increasing the resources needed to access the data. That is, 'Zebra' would take at least 5 bytes and the two REALs 8 bytes each, so that's 21 bytes. A reference to one occurrence of that repeated data would take at most 8 bytes for an integer (initially just a single byte). So there would be a cost of around 13 bytes per reading.
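For instance, renaming the Zebra station is then a single-row change (a sketch; the new name is made up):
-- one row changes; every reading that references id 1 picks up the new name
UPDATE stations SET station_name = 'Zulu' WHERE id = 1;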

decompose source data into new, many-to-many schema

I am using MS Access 2010 to do some transformations of data. Specifically, I need to create the data structure for a many-to-many relationship between concept (summarized by rxnconso.rxcui) and word (summarized by drugwords.id; note that each value of drugwords.id needs to correspond with a unique value of name from the words table in the images below). To accomplish this, I need to create two tables, drugwords and drugwordsConsoJunction, and also decompose the contents of an existing table words into the drugwords and drugwordsConsoJunction tables. The structure of the destination tables is:
drugwords table: (this table needs to be created)
id (autonumber pk needs to be created from distinct values of words.name)
name
drugwordsConsoJunction: (this table needs to be created)
word_id (fk to drugwords.id)
rxcui (fk to rxnconso.rxcui)
rxnconso (this table already exists):
rxcui
...other fields
The source table for this transformation is called words and has two columns; a value for rxcui, and a value for name. As you can see from the images below, there can be many name values for a given rxcui value. And the second image below shows that there can be many rxcui values for a given name value.
How do I write the SQL to transform words into drugwords and drugwordsConsoJunction, as per the above specifications?
I have uploaded a copy of the database to a file sharing site. You can download it at this link.
If the proposed [drugwords] table is already going to have unique values in its [name] column, then you don't need an AutoNumber ID column; you can just use the [name] field as the primary key. In that case, the table that maps "words" to the corresponding [rxcui] values can be created by simply doing
SELECT DISTINCT rxcui, [name] INTO drugwordsConsoJunction FROM words
Then you can use the "words" themselves instead of introducing another layer of mapping from (distinct) "words" to (distinct) "IDs".
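If you do want the surrogate-key layout the question describes, here is a sketch in Access SQL (it assumes an AutoNumber column id is added to drugwords in table design between the two steps):
-- 1) one row per distinct word
SELECT DISTINCT [name] INTO drugwords FROM words;
-- 2) after adding the AutoNumber id to drugwords, build the junction
--    table by joining back to the source on the word text
SELECT d.id AS word_id, w.rxcui
INTO drugwordsConsoJunction
FROM words AS w
INNER JOIN drugwords AS d ON w.[name] = d.[name];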