Row Stores vs Column Stores - sql

Assuming that the database is already populated with data, and that each of the following SQL statements is the one and only query that an application will perform, is it better to use row-wise or column-wise record storage for each of the following queries, and why?
1) SELECT * FROM Person
2) SELECT * FROM Person WHERE id=5
3) SELECT AVG(YEAR(DateOfBirth)) FROM Person
4) INSERT INTO Person (ID,DateOfBirth,Name,Surname) VALUES(2e25,'1990-05-01','Ute','Muller')
In those examples Person.id is the primary key.
The article Row Store and Column Store Databases gives a general discussion on this, but I am specifically concerned about the four queries above.

SELECT * FROM ... queries are better served by a row store, since a column store has to access numerous files (typically one per column) to reassemble each row.
A column store is good for aggregation over a large volume of data, or when you have queries that only need a few fields from a wide table.
Therefore:
1st query: row-wise
2nd query: row-wise
3rd query: column-wise
4th query: row-wise
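
To make the reasoning above a bit more concrete, here is a toy sketch in Python of the two layouts. It is not how any real engine stores data; the list-of-tuples and dict-of-lists structures, the function names and the sample people other than Ute Muller are all made up for illustration.

```python
from datetime import date

FIELDS = ("id", "DateOfBirth", "Name", "Surname")

# Row-wise layout: all fields of one record sit together.
rows = [
    (1, date(1990, 5, 1), "Ute", "Muller"),
    (2, date(1985, 3, 12), "Jan", "Novak"),    # made-up sample row
    (3, date(2000, 7, 30), "Ana", "Silva"),    # made-up sample row
]

# Column-wise layout: all values of one field sit together.
columns = {name: [r[i] for r in rows] for i, name in enumerate(FIELDS)}


def select_by_id_row_store(person_id):
    # Query 2 on a row store: finding the record yields every field at once.
    for record in rows:
        if record[0] == person_id:
            return record
    return None


def select_by_id_column_store(person_id):
    # Query 2 on a column store: locate the position in the id column,
    # then visit every other column to rebuild the record.
    if person_id not in columns["id"]:
        return None
    pos = columns["id"].index(person_id)
    return tuple(columns[name][pos] for name in FIELDS)


def avg_birth_year_column_store():
    # Query 3 on a column store: only the DateOfBirth column is read.
    years = [d.year for d in columns["DateOfBirth"]]
    return sum(years) / len(years)


def insert_column_store(record):
    # Query 4 on a column store: one append per column,
    # where a row store would need a single append.
    for name, value in zip(FIELDS, record):
        columns[name].append(value)
```

Queries 1, 2 and 4 touch every field of a record, which is cheap when the fields sit together; query 3 touches a single field of every record, which is cheap when that field's values sit together.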

I have no idea what you are asking. You have this statement:
INSERT INTO Person (ID, DateOfBirth, Name, Surname)
VALUES('2e25', '1990-05-01', 'Ute', 'Muller');
This suggests that you have a table with four columns, one of which is an id. Each person is stored in their own row.
You then have three queries. The first cannot be optimized. The second is optimized, assuming that id is a primary key (a reasonable assumption). The third requires a full table scan -- although that could be ameliorated with an index only on DateOfBirth.
If the data is already in this format, why would you want to change it?
This is a very simple data structure. Three of your four query examples access all columns. I see no reason why you would not use a regular row-store table structure.

Related

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily by:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select and the IDs don't always have a pattern (which means I can't use something like BETWEEN x AND y)? Does this select statement still work, or are there better ways of doing so?
The actual application is this. Filters are specified by users and are compared to some attributes of the records. From those filters, we create a subset of the data which is of interest to a particular user. There are about 30 million records, each with roughly 3000 attributes (which are stored in roughly 30 tables, but every table has ID as a primary key), so every time someone makes a query about their desired subset of records, we'd have to join many tables, apply those filters, and figure out what their subset looks like. In order to avoid joining many tables all the time, I thought maybe it's a better idea to join the tables once and figure out the ids of the selected subset. This way, each time a new query is made, all we have to do is select the relevant columns of the rows that match the filtered ids.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions. And that could start to get into limits on the length of the query -- either specified by the database, the tool you are using, or intermediate libraries.
If you have so many ids, I would strongly recommend that you load them into a table in the database with the id as the primary key. Then use join or exists to identify the rows in your table that match.
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.
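
A minimal sketch of the "load the ids into a table and join" suggestion, using Python's built-in sqlite3 module only so that it can be run as-is. The temp table name wanted_ids, the helper select_by_ids and the sample data are assumptions for illustration; the same idea applies to other databases with their own temp-table syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT)"
)
# Made-up sample data: ids 1..1000.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(i, f"a{i}", f"b{i}", f"c{i}") for i in range(1, 1001)],
)


def select_by_ids(conn, ids):
    # Load the wanted ids into a temp table keyed on id (hypothetical name),
    # then let the database match them with a join instead of a huge IN list.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS wanted_ids (id INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM wanted_ids")
    conn.executemany(
        "INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)",
        ((i,) for i in ids),
    )
    return conn.execute(
        "SELECT m.id, m.col1, m.col2, m.col3 "
        "FROM mytable AS m JOIN wanted_ids AS w ON w.id = m.id "
        "ORDER BY m.id"
    ).fetchall()


result = select_by_ids(conn, [3, 17, 999, 5000])
```

The query text stays the same size no matter how many ids are loaded, so query-length limits no longer matter; ids that do not exist in mytable simply produce no row.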

Sql statement to combine three tables with char column criteria

I have to fix a very poorly designed database.
The problem:
One Job Advertisement has one jobtitle, but many qualifying degrees.
(e.g., JobTitle:Analyst, Qualifications: Accounting Degree, or Finance Degree or Business Degree)
The tables:
TableName: UniqueJobName Columns: jobName(char) uniqueJobUid(bigint)
TableName: UniqueDegree Columns: degreeName(char) degreeUid(bigint)
TableName: Jobs Columns: jobName(char) jobUid(bigint),uniqueJobUid(bigint)
TableName: Job_Degree: jobUid(char) degreeName(char)
Relations
onetomany UniqueJobName.uniqueJobUid -> Jobs.uniqueJobUid
onetomany Jobs.jobUid-> Job_Degree.jobUid
There is NO relation between Jobs and UniqueDegree.
Technical Requirement
Rather than creating a column in Job_Degree for degreeUid, I want to create a new table: UniqueJob_UniqueDegree_Job (There are reasons for this that I won't explain here)
UniqueJob_UniqueDegree_Job will have three columns:
uniqueJobUid
jobId
degreeId
The trouble is that the Job table is already very big, 500,000 rows (and the Job_Degree table even bigger)
QUESTION:
What is the most efficient SQL statement for creating the UniqueJob_UniqueDegree_Job table given that part of the statement will be comparing the char column of UniqueDegree.degreeName and Job_Degree.degreeName?
Any hints would be most appreciated.
select j.uniquejobuid, j.jobuid as jobid, ud.degreeuid as degreeid
into UniqueJob_UniqueDegree_Job
from jobs j
join job_degree jd on j.jobuid = jd.jobuid
join uniquedegree ud on ud.degreename = jd.degreename
Having a hard time with getting uppercase letters etc because I use a worthless cellphone.
This should do it, however. Note that for SELECT ... INTO ... FROM to work, the target table must not already exist (you can use CONVERT or CAST on each attribute in the select statement to get the data types exactly right).
If the table already exists, then change the query to
insert Into ..
select ...
from ....
500k rows is rather small as well. This shouldn't take more than a couple of seconds I'd estimate.
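
For anyone who wants to try the join on a small scale, here is a rough sketch using Python's sqlite3 module. SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT is swapped in; the sample rows, the index name, and the use of an integer jobUid in Job_Degree (the question lists it as char) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UniqueDegree (degreeName TEXT, degreeUid INTEGER);
CREATE TABLE Jobs (jobName TEXT, jobUid INTEGER, uniqueJobUid INTEGER);
CREATE TABLE Job_Degree (jobUid INTEGER, degreeName TEXT);

-- Made-up sample data.
INSERT INTO UniqueDegree VALUES ('Accounting', 1), ('Finance', 2), ('Business', 3);
INSERT INTO Jobs VALUES ('Analyst', 10, 100), ('Junior Analyst', 11, 100);
INSERT INTO Job_Degree VALUES (10, 'Accounting'), (10, 'Finance'), (11, 'Business');

-- An index on the char column used in the join (hypothetical index name).
CREATE INDEX idx_uniquedegree_name ON UniqueDegree (degreeName);

-- Build the new table in one statement from the three-way join.
CREATE TABLE UniqueJob_UniqueDegree_Job AS
SELECT j.uniqueJobUid AS uniqueJobUid,
       j.jobUid       AS jobId,
       ud.degreeUid   AS degreeId
FROM Jobs AS j
JOIN Job_Degree AS jd ON jd.jobUid = j.jobUid
JOIN UniqueDegree AS ud ON ud.degreeName = jd.degreeName;
""")

rows = conn.execute(
    "SELECT uniqueJobUid, jobId, degreeId "
    "FROM UniqueJob_UniqueDegree_Job ORDER BY jobId, degreeId"
).fetchall()
```

Whether the index on degreeName actually helps depends on the database and its planner, so it is worth checking the query plan on the real data.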

SQL or statement vs multiple select queries

I have a table with an id and a name.
I get a list of ids and I need their names.
As far as I know, I have two options.
Create a for loop in my code which executes:
SELECT name from table where id=x
where x is always a number.
or write a single query like this:
SELECT name from table where id=1 OR id=2 OR id=3
The list of ids and names is enormous, so I think you wouldn't want that.
The problem with the ids is that an id is not always a number but a randomly generated id containing numbers and characters, so talking about ranges is not a solution.
I'm asking this from a performance point of view.
What's a nice solution for this problem?
SQLite has limits on the size of a query, so if there is no known upper limit on the number of IDs, you cannot use a single query.
When you are reading multiple rows (note: IN (1, 2, 3) is easier than many ORs), you don't know to which ID a name belongs unless you also SELECT that, or sort the results by the ID.
There should be no noticeable difference in performance; SQLite is an embedded database without client/server communication overhead, and the query does not need to be parsed again if you use a prepared statement.
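
A small sketch of how the points above could be combined in Python's sqlite3 module: the ids are bound as parameters, sent in chunks so that no single query grows without bound, and the id is selected alongside the name so each name can be matched to its id. The table name items, the helper names_by_id and the chunk size of 500 are assumptions, not anything from the question.

```python
import sqlite3


def names_by_id(conn, ids, chunk_size=500):
    # Returns {id: name} for every id that exists in the table.
    # chunk_size is a conservative guess to stay below SQLite's
    # limit on bound parameters per statement.
    ids = list(ids)
    found = {}
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))  # "?,?,?" for three ids
        sql = f"SELECT id, name FROM items WHERE id IN ({placeholders})"
        for item_id, name in conn.execute(sql, chunk):
            found[item_id] = name
    return found


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")
# Made-up sample data with non-numeric ids.
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(f"k{i}", f"name{i}") for i in range(2000)],
)

lookup = names_by_id(conn, ["k5", "k1500", "missing"])
```

Ids that are not in the table are simply absent from the returned dictionary, so the caller can tell which lookups failed.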
A "nice" solution is using the IN operator:
SELECT name from table where id in (1,2,3)
Also, the IN operator is syntactic sugar built for exactly this purpose.
SELECT name from table where id IN (1,2,3,4,5,6.....)
Assuming that you receive the list of IDs to look up as an input temp table #InputIDTable:
SELECT name from table WHERE ID IN (SELECT id from #InputIDTable)

Transpose to Count columns of Boolean values on Access SQL

OK, so I have a Student table that has 6 fields (StudentID, HasBamboo, HasFlower, HasAloe, HasFern, HasCactus). The "HasPlant" fields are boolean: 1 for having the plant, 0 for not having the plant.
I want to find the average number of plants that a student has. There are hundreds of students in the table. I know this could involve transposing of some sort and of course counting the boolean values and getting an average. I did look at this question SQL to transpose row pairs to columns in MS ACCESS database for information on Transposing (never done it before), but I'm thinking there would be too many columns perhaps.
My first thought was using a for loop, but I'm not sure those exist in SQL in Access. Maybe a SELECT/FROM/WHERE/IN type structure?
Just hints on the logic and some possible reading material would be greatly appreciated.
You could just get individual totals per category:
SELECT COUNT(*) FROM STUDENTS WHERE HasBamboo
add them all up, and divide by
SELECT COUNT(*) FROM STUDENTS
It's not a great database design though... Better normalized would be:
Table Students; fields StudentID, StudentName
Table Plants; fields PlantID, PlantName
Table OwnedPlants; fields StudentID,PlantID
The last table then stores one record for each plant that a student owns. You could then easily add different information at the right place (apartment number to Students; Latin name to Plants; date acquired to OwnedPlants) without completely redesigning the table structure and adding lots of fields (DateAcquiredBamboo, DateAcquiredFlower, etc.).
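
With the normalized layout, the average number of plants per student is just the number of OwnedPlants rows divided by the number of Students rows. Here is a rough sketch of that, run through Python's sqlite3 module rather than Access, so the syntax may need adjusting; the sample students and ownership rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, StudentName TEXT);
CREATE TABLE Plants (PlantID INTEGER PRIMARY KEY, PlantName TEXT);
CREATE TABLE OwnedPlants (
    StudentID INTEGER,
    PlantID INTEGER,
    PRIMARY KEY (StudentID, PlantID)
);

-- Made-up sample data: student 4 owns no plants at all.
INSERT INTO Students VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
INSERT INTO Plants VALUES
    (1, 'Bamboo'), (2, 'Flower'), (3, 'Aloe'), (4, 'Fern'), (5, 'Cactus');
INSERT INTO OwnedPlants VALUES (1, 1), (1, 2), (1, 5), (2, 3), (3, 1), (3, 4);
""")

# Total owned plants divided by total students; the CAST avoids
# integer division, and students without plants still count.
avg_plants = conn.execute("""
    SELECT CAST((SELECT COUNT(*) FROM OwnedPlants) AS REAL)
           / (SELECT COUNT(*) FROM Students)
""").fetchone()[0]
```

Dividing by the count of Students rather than by the distinct students in OwnedPlants matters: the latter would ignore students who own nothing and overstate the average.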

SQL query: have results into a table named the results name

I have a very large database I would like to split up into tables. I would like to make it so when I run a distinct, it will make a table for every distinct name. The name of the table will be the data in one of the fields.
EX:
A --------- Data 1
A --------- Data 2
B --------- Data 3
B --------- Data 4
would result in 2 tables, 1 named A and another named B. Then the entire row of data would be copied into that table.
select distinct [name] from [maintable]
-make table for each name
-select [name] from [maintable]
-copy into table name
-drop row from [maintable]
Any help would be great!
I would advise you against this.
One solution is to create indexes, so you can access the data quickly. If you have only a handful of names, though, this might not be particularly effective, because each name would match a large share of the records and the index would save little work.
Another solution is something called partitioning. The exact mechanism differs from database to database, but the underlying idea is the same. Different portions of the table (as defined by name in your case) would be stored in different places. When a query is looking only for values for a particular name, only that data gets read.
Generally, it is bad design to have multiple tables with exactly the same data columns. Here are some reasons:
Adding a column, changing a type, or adding an index has to be done multiple times instead of once.
It is very hard to enforce a primary key constraint on a column across the tables -- you lose the primary key.
Queries that touch more than one name become much more complicated.
Insertions and updates are more complex, because you have to first identify the right table. This often results in overuse of dynamic SQL for otherwise basic operations.
Although there may be some simplifications (security comes to mind), most databases have other mechanisms that are superior to splitting the data into separate tables.
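
As a small illustration of keeping everything in one table, here is a sketch using Python's sqlite3 module with an index on the name column. Partitioning syntax varies too much between databases to show generically, so only the index variant is sketched; the column name data, the index name and the helper rows_for are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE maintable (name TEXT, data TEXT);
INSERT INTO maintable VALUES
    ('A', 'Data 1'), ('A', 'Data 2'), ('B', 'Data 3'), ('B', 'Data 4');

-- One index instead of one table per name (hypothetical index name).
CREATE INDEX idx_maintable_name ON maintable (name);
""")


def rows_for(conn, name):
    # What "the table named A" would have held is just a filtered query.
    return conn.execute(
        "SELECT name, data FROM maintable WHERE name = ? ORDER BY data",
        (name,),
    ).fetchall()


rows_a = rows_for(conn, "A")

# A query that spans all names stays trivial with a single table.
per_name = conn.execute(
    "SELECT name, COUNT(*) FROM maintable GROUP BY name ORDER BY name"
).fetchall()
```

The name is passed as an ordinary parameter, so no dynamic SQL is needed to pick the right table.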
What you want is:
CREATE TABLE new_table
AS (SELECT .... /* the data that you want in this table */);