Distinct top 10 from multiple tables - sql

I have these two tables in SQLite
CREATE TABLE "freq" (
`id` INTEGER,
`translation_id` INTEGER,
`freq` INTEGER DEFAULT NULL,
`year` INTEGER,
PRIMARY KEY(`id`),
FOREIGN KEY(`translation_id`) REFERENCES `translation`(`id`) DEFERRABLE INITIALLY DEFERRED
)
CREATE TABLE translation (
id INTEGER PRIMARY KEY,
w_id INTEGER,
word TEXT DEFAULT NULL,
located_in TEXT,
UNIQUE (word, language)
ON CONFLICT ABORT
)
Based on the values from these tables I want to create a third one which contains the top 10 words for every translation.located_in for every freq.year. This could look like this:
CREATE TABLE top_ten_words_by_country (
id INTEGER PRIMARY KEY,
located_in TEXT,
year INTEGER,
`translation_id` INTEGER,
freq INTEGER,
FOREIGN KEY(`translation_id`) REFERENCES `translation`(`id`) DEFERRABLE INITIALLY DEFERRED
)
That's what I tried so far (for one country and one year):
SELECT * FROM freq f, translation t
WHERE t.located_in = 'Germany' ORDER BY f.freq DESC
which has these problems:
it doesn't add up multiple words from translation which have the same w_id (which means they are translations of each other)
it only works for one year and one country
it takes a very long time (I know joins are expensive, so it's not that important to speed this up)
it contains duplicate translation.word
So can anyone provide me a way to do what I want?
The speed is the least important thing here for me.

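Look, you have a cartesian product (there's no join condition relating your tables), which is why the query is so slow and returns duplicate words.
Besides, you have to use a GROUP BY clause to add up the frequencies.
And you can create a view instead of a table.
Change your query to: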
SELECT sum(f.freq) total_freq
, t.w_id
, t.located_in
, f.year
FROM freq f
, translation t
WHERE f.translation_id = t.id
group by t.w_id
, t.located_in
, f.year
ORDER BY total_freq DESC
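The query above gives the summed frequencies but not yet the top 10 per country and year. One possible way to get there is a ROW_NUMBER() window function, sketched below through Python's sqlite3 module. This assumes SQLite 3.25 or later (window functions), and the sample rows, the simplified table definitions and the TOP_N name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE translation (
    id INTEGER PRIMARY KEY,
    w_id INTEGER,
    word TEXT,
    located_in TEXT
);
CREATE TABLE freq (
    id INTEGER PRIMARY KEY,
    translation_id INTEGER REFERENCES translation(id),
    freq INTEGER,
    year INTEGER
);
-- Hypothetical sample data: translations 1 and 2 share w_id 10.
INSERT INTO translation VALUES
    (1, 10, 'Haus', 'Germany'),
    (2, 10, 'Heim', 'Germany'),
    (3, 11, 'Baum', 'Germany'),
    (4, 12, 'maison', 'France');
INSERT INTO freq (translation_id, freq, year) VALUES
    (1, 5, 2000), (2, 4, 2000), (3, 7, 2000), (4, 3, 2000), (3, 2, 2001);
""")

TOP_N = 10  # how many words to keep per (located_in, year)

rows = conn.execute("""
WITH totals AS (
    -- Same aggregation as the answer's query, written with an explicit JOIN.
    SELECT t.located_in, f.year, t.w_id, SUM(f.freq) AS total_freq
    FROM freq AS f
    JOIN translation AS t ON f.translation_id = t.id
    GROUP BY t.located_in, f.year, t.w_id
),
ranked AS (
    -- Number the words within each country/year, most frequent first.
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY located_in, year
        ORDER BY total_freq DESC, w_id
    ) AS rn
    FROM totals
)
SELECT located_in, year, w_id, total_freq
FROM ranked
WHERE rn <= ?
ORDER BY located_in, year, rn
""", (TOP_N,)).fetchall()

for row in rows:
    print(row)
```

The result could be stored with INSERT INTO top_ten_words_by_country ... SELECT, or kept as a view as suggested above.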


How to write SQL query to find duplicates in tables

I'm currently trying to write a SQL query that finds any conflicts where any rows have the same x and y values.
Here are the tables I'm currently working with:
CREATE TABLE Slot (
sid INT,
wall varchar(200),
x FLOAT,
y FLOAT,
PRIMARY KEY (sid)
)
CREATE TABLE Route (
rid INT,
name varchar(200),
circuit varchar(200),
PRIMARY KEY (rid)
)
CREATE TABLE Placement (
rid INT FOREIGN KEY REFERENCES Route(rid),
hid INT FOREIGN KEY REFERENCES Hold(hid),
sid INT FOREIGN KEY REFERENCES Slot(sid)
)
So I'm trying to find any Slots that are on the same wall and have identical x and y values. In addition to this, I want them to all be the same Route circuit.
I don't know if I should be trying to use the third table of "Placement", as I'm pretty new to this and got confused when trying to join them because they don't have any shared columns.
Here is what I currently have
SELECT
DISTINCT
S.sid
FROM
Slot as S,
Route as R
WHERE
R.circuit = 'Beginner'
GROUP BY
S.x,
S.y,
S.wall
HAVING
COUNT(*) > 1
But this throws an error because I have to be using 'S.sid' in a GROUP BY or an aggregate function, but I don't want to group by that.
Here are the INSERT statements I was using as example data for what I have so far.
INSERT INTO Slot (sid, wall, x, y) VALUES (2345, 'south', 4, 7)
INSERT INTO Slot (sid, wall, x, y) VALUES (4534, 'south', 4, 7)
INSERT INTO Slot (sid, wall, x, y) VALUES (2456, 'west', 1, 7)
So here it would return the sid's 2345 and 4534 because they're both on the South wall and have the same x and y values.
Some things you need to be aware of:
The ancient style of JOIN, where you do SELECT ... FROM x, y WHERE x.a = y.b, should not be used. I wish modern RDBMSs would block queries using it (outside of any compatibility mode).
Always use explicit JOIN clauses, for the sake of readability and maintainability (while performance shouldn't be different, using explicit JOINs makes it far, far easier to investigate performance issues should they occur).
Don't use the float or real types for representing exact quantities (by "exact" I don't mean integral: you can have exact fractional quantities), instead the decimal type should be preferred.
This is because performing equality-checks on float and real values in T-SQL is a pain. This includes evaluating JOIN criteria.
You should always include the Schema Name (default is dbo.) when referencing tables in queries as it solves issues relating to ambiguity.
Including the schema-name is now required when referencing any UDFs, UDTs, and other more modern SQL Server features.
Your many-to-many linking table dbo.Placement also allows duplicates because it doesn't have a PK defined.
Don't use short, cryptic column names like rid, hid and sid. Software should be self-documenting. I would name those columns to RouteId, HoldId, and SlotId respectively.
Don't fall for the mistake of naming a column just Id. Column names should not need the name of their parent table to be understandable (this is because queries can/will/do expose your data, often with their original column names, in contexts without their original table names, such as in CTEs, derived-table queries, VIEWs, etc).
It's subjective, but I believe the table-names should be plural, not singular (after-all, a table holds multiple rows - I'd only give a table a singular name if that table will only ever hold a single row).
The worst argument I've heard so far advocating for singular instead of plural is because (apparently) some ORMs and code-gen tools lack the ability to convert a plural noun to a singular noun. Yeesh. That hasn't been true for 20+ years now.
First, to avoid problems caused by using float types in JOIN conditions I'll change your dbo.Slot table to use decimal:
CREATE TABLE dbo.Slot2 (
sid int NOT NULL,
wall varchar(200) NOT NULL,
x decimal(19,6) NOT NULL, -- 6 decimal places should be enough.
y decimal(19,6) NOT NULL,
CONSTRAINT PK_Slot PRIMARY KEY ( sid ),
-- CONSTRAINT UK_SlotValues UNIQUE ( wall, x, y ) -- This will prevent duplicate values in future.
);
INSERT INTO dbo.Slot2 ( sid, wall, x, y )
SELECT
sid,
wall,
CONVERT( decimal(19,6), x ) AS x2,
CONVERT( decimal(19,6), y ) AS y2
FROM
dbo.Slot;
DROP TABLE dbo.Slot;
EXEC sp_rename 'dbo.Slot2', 'Slot';
With that taken care of, let's now get the duplicate values in the set of slots (i.e. find the identical wall, x, y values without other values):
SELECT
wall,
x,
y
FROM
dbo.Slot
GROUP BY
wall,
x,
y
HAVING
COUNT(*) >= 2
Then we do an INNER JOIN between the original dbo.Slot table and this set of duplicate values, as well as adding a ROW_NUMBER value to make it easier to choose a single row to keep if the other duplicates are removed:
WITH duplicateValues AS (
SELECT
wall,
x,
y
FROM
dbo.Slot
GROUP BY
wall,
x,
y
HAVING
COUNT(*) >= 2
)
SELECT
ROW_NUMBER() OVER ( PARTITION BY s.wall, s.x, s.y ORDER BY s.sid ) AS n,
s.*
FROM
dbo.Slot AS s
INNER JOIN duplicateValues AS d ON
s.wall = d.wall
AND
s.x = d.x
AND
s.y = d.y
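To see what this returns on the three sample rows from the question, here is a small sketch that runs the same CTE and join in an in-memory SQLite database through Python. SQLite is only a stand-in for SQL Server here (so there is no dbo. schema prefix and NUMERIC replaces decimal), and it assumes SQLite 3.25 or later for ROW_NUMBER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for dbo.Slot; NUMERIC replaces decimal(19,6).
CREATE TABLE Slot (
    sid  INTEGER PRIMARY KEY,
    wall TEXT    NOT NULL,
    x    NUMERIC NOT NULL,
    y    NUMERIC NOT NULL
);
-- The sample rows from the question.
INSERT INTO Slot (sid, wall, x, y) VALUES (2345, 'south', 4, 7);
INSERT INTO Slot (sid, wall, x, y) VALUES (4534, 'south', 4, 7);
INSERT INTO Slot (sid, wall, x, y) VALUES (2456, 'west', 1, 7);
""")

dup_rows = conn.execute("""
WITH duplicateValues AS (
    -- Every (wall, x, y) combination that occurs at least twice.
    SELECT wall, x, y
    FROM Slot
    GROUP BY wall, x, y
    HAVING COUNT(*) >= 2
)
SELECT
    ROW_NUMBER() OVER (PARTITION BY s.wall, s.x, s.y ORDER BY s.sid) AS n,
    s.sid,
    s.wall
FROM Slot AS s
INNER JOIN duplicateValues AS d
    ON s.wall = d.wall AND s.x = d.x AND s.y = d.y
ORDER BY s.sid
""").fetchall()

for row in dup_rows:
    print(row)
```

Rows with n = 1 are the ones to keep if the remaining duplicates are deleted.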
In your post you mentioned wanting to also consider the Placement table, however we need further details because your post doesn't explain how the Placement table should work.
However your Placement table should still have a PK. I'm assuming that the Placement table's HoldId column is not a key column, so should look like this:
CREATE TABLE dbo.Placement (
RouteId int NOT NULL,
SlotId int NOT NULL,
HoldId int NOT NULL,
CONSTRAINT PK_Placement PRIMARY KEY ( RouteId, SlotId ),
CONSTRAINT FK_Placement_Route FOREIGN KEY ( RouteId ) REFERENCES dbo.Route ( rid ),
CONSTRAINT FK_Placement_Slot FOREIGN KEY ( SlotId ) REFERENCES dbo.Slot ( sid ),
CONSTRAINT FK_Placement_Hold FOREIGN KEY ( HoldId ) REFERENCES dbo.Hold ( hid )
);

Outputting the name of the column in SQLite

I have created two tables and now I want to find the movie that yielded the highest revenue for each platform (Hulu, Disney and Netflix). The problem is that I do not know how to output the name of the platform, because it is a column title. Can anyone help me?
CREATE TABLE "StreamedMovies" (
"Title" TEXT,
"Netflix" INTEGER, -- 1 if the movie is streamed in this platform, 0 otherwise
"Hulu" INTEGER, -- 1 if the movie is streamed in this platform, 0 otherwise
"Disney" INTEGER, -- 1 if the movie is streamed in this platform, 0 otherwise
"ScreenTime" REAL,
PRIMARY KEY("Title")
)
CREATE TABLE "MovieData" (
"Title" TEXT,
"Genre" TEXT,
"Director" TEXT,
"Casting" TEXT,
"Rating" REAL,
"Revenue" REAL,
PRIMARY KEY("Title")
)
You'll have to write a case statement.
select
Title,
case
when Netflix = 1 then 'Netflix'
when Hulu = 1 then 'Hulu'
when Disney = 1 then 'Disney'
end as Platform
from StreamedMovies
This indicates a flaw in your design. A number of flaws. For example, there's nothing stopping a row from having multiple platforms. Or no platforms. Or having a platform set to 42.
Instead, add a platforms table and a join table to indicate which movies are streaming on which platforms.
While we're at it we'll fix some other issues.
Titles can change. Use a simple integer primary key.
Don't quote column and table names unless you have to; in many databases (though not SQLite) quoting makes them case sensitive.
Declare your foreign keys.
Use not null to require important data.
-- The platforms available for streaming.
create table platforms (
id integer primary key,
name text not null
);
insert into platforms (name)
values ('Netflix'), ('Hulu'), ('Disney+');
-- The movies.
create table movies (
id integer primary key,
title text not null
);
insert into movies (title) values ('Bad Taste');
-- A join table for which platforms movies are streaming on.
create table streamed_movies (
movie_id integer not null references movies,
platform_id integer not null references platforms
);
insert into streamed_movies (movie_id, platform_id) values (1, 1), (1, 3);
select
movies.title, platforms.name
from streamed_movies sm
join movies on sm.movie_id = movies.id
join platforms on sm.platform_id = platforms.id
title name
--------- -------
Bad Taste Netflix
Bad Taste Disney+
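With this layout, the original goal (the highest-revenue movie per platform) becomes an ordinary ranking query. The sketch below is only an illustration: the movies table above has no revenue column, so a hypothetical revenue column is added here, and the titles and numbers are invented. It assumes SQLite 3.25 or later for ROW_NUMBER().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE platforms (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
INSERT INTO platforms (name) VALUES ('Netflix'), ('Hulu'), ('Disney+');

-- 'revenue' is a hypothetical column, not part of the schema above.
CREATE TABLE movies (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    revenue REAL NOT NULL
);
-- Made-up movies and revenue figures.
INSERT INTO movies (title, revenue) VALUES
    ('Movie A', 100), ('Movie B', 250), ('Movie C', 80);

CREATE TABLE streamed_movies (
    movie_id    INTEGER NOT NULL REFERENCES movies,
    platform_id INTEGER NOT NULL REFERENCES platforms
);
INSERT INTO streamed_movies (movie_id, platform_id) VALUES
    (1, 1), (2, 1), (2, 2), (3, 3), (1, 3);
""")

top_rows = conn.execute("""
SELECT platform, title, revenue
FROM (
    SELECT p.name AS platform, m.title, m.revenue,
           -- Rank movies within each platform, highest revenue first.
           ROW_NUMBER() OVER (
               PARTITION BY p.id
               ORDER BY m.revenue DESC, m.id
           ) AS rn
    FROM streamed_movies AS sm
    JOIN movies AS m ON sm.movie_id = m.id
    JOIN platforms AS p ON sm.platform_id = p.id
)
WHERE rn = 1
ORDER BY platform
""").fetchall()

for row in top_rows:
    print(row)
```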

How to make sure only one column is not null in postgresql table

I'm trying to set up a table and add some constraints to it. I was planning on using partial indexes to add constraints to create some composite keys, but ran into the problem of handling NULL values. We have a situation where we want to make sure that in a table only one of two columns is populated for a given row, and that the populated value is unique. I'm trying to figure out how to do this, but I'm having a tough time. Perhaps something like this:
CREATE INDEX foo_idx_a ON foo (colA) WHERE colB is NULL
CREATE INDEX foo_idx_b ON foo (colB) WHERE colA is NULL
Would this work? Additionally, is there a good way to expand this to a larger number of columns?
Another way to write this constraint is to use the num_nonnulls() function:
create table table_name
(
a integer,
b integer,
check ( num_nonnulls(a,b) = 1)
);
This is especially useful if you have more columns:
create table table_name
(
a integer,
b integer,
c integer,
d integer,
check ( num_nonnulls(a,b,c,d) = 1)
);
You can use the following check:
create table table_name
(
a integer,
b integer,
check ((a is null) != (b is null))
);
If there are more columns, you can use the trick with casting boolean to integer:
create table table_name
(
a integer,
b integer,
...
n integer,
check ((a is not null)::integer + (b is not null)::integer + ... + (n is not null)::integer = 1)
);
In this example only one column can be not null (it simply counts not null columns), but you can make it any number.
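The counting check can be tried out quickly without a PostgreSQL server. The sketch below uses an in-memory SQLite database through Python purely as a stand-in: SQLite evaluates IS NOT NULL to 0 or 1, so the ::integer casts are not needed there. The table name foo and the try_insert helper are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE foo (
    a INTEGER,
    b INTEGER,
    c INTEGER,
    -- Count the non-null columns; exactly one must be populated.
    CHECK ((a IS NOT NULL) + (b IS NOT NULL) + (c IS NOT NULL) = 1)
)
""")


def try_insert(a, b, c) -> bool:
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        conn.execute("INSERT INTO foo (a, b, c) VALUES (?, ?, ?)", (a, b, c))
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "only a": try_insert(1, None, None),
    "a and b": try_insert(1, 2, None),
    "nothing": try_insert(None, None, None),
}
print(results)
```

Changing the right-hand side of the comparison (for example to <= 1 or >= 1) gives the "at most one" and "at least one" variants.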
One can do this with an insert/update trigger or checks, but having to do so indicates it could be done better. Constraints exist to give you certainty about your data so you don't have to be constantly checking if the data is valid. If one or the other is not null, you have to do the checks in your queries.
This is better solved with table inheritance and views.
Let's say you have (American) clients. Some are businesses and some are individuals. Everyone needs a Taxpayer Identification Number, which can be one of several things, such as either a Social Security Number or an Employer Identification Number.
create table generic_clients (
id bigserial primary key,
name text not null
);
create table individual_clients (
ssn numeric(9) not null
) inherits(generic_clients);
create table business_clients (
ein numeric(9) not null
) inherits(generic_clients);
SSN and EIN are both Taxpayer Identification Numbers and you can make a view which will treat both the same.
create view clients as
select id, name, ssn as tin from individual_clients
union
select id, name, ein as tin from business_clients;
Now you can query clients.tin or if you specifically want businesses you query business_clients.ein and for individuals individual_clients.ssn. And you can see how the inherited tables can be expanded to accommodate more divergent information between types of clients.

SQLite3 subqueries

I'm having a problem that I can't figure out, even after researching here and at sqlite.org.
So, I have these tables:
CREATE TABLE MEDICO(
idMedico INTEGER PRIMARY KEY AUTOINCREMENT,
nome VARCHAR(50) NOT NULL,
morada VARCHAR(50) NOT NULL,
telefone VARCHAR(9) NOT NULL
);
CREATE TABLE PRESCRICAO(
idPrescricao INTEGER PRIMARY KEY AUTOINCREMENT,
idConsulta INTEGER,
idMedico INTEGER NOT NULL,
nrOperacional INTEGER NOT NULL,
FOREIGN KEY(idConsulta) REFERENCES CONSULTA(idConsulta),
FOREIGN KEY(idMedico) REFERENCES MEDICO(idMedico),
FOREIGN KEY(nrOperacional) REFERENCES UTENTE(nrOperacional)
);
CREATE TABLE PRESCRICAO_MEDICAMENTO(
idPrescricao INTEGER ,
idMedicamento INTEGER,
nrEmbalagens INTEGER NOT NULL,
FOREIGN KEY(idPrescricao) REFERENCES PRESCRICAO(idPrescricao),
FOREIGN KEY(idMedicamento) REFERENCES MEDICAMENTO(idMedicamento),
PRIMARY key(idPrescricao, idMedicamento)
);
I want the idMedicamento that is most used by the MEDICO with, let's say, idMedico=7.
Up to here everything's fine. I'm doing:
SELECT idmedicamento, MAX(total) as maximum
FROM (SELECT idMedicamento, COUNT(idMedicamento) AS total
FROM PRESCRICAO_MEDICAMENTO
WHERE PRESCRICAO_MEDICAMENTO.idPrescricao IN (
SELECT idPrescricao FROM PRESCRICAO
WHERE PRESCRICAO.idmedico= 7
)
GROUP BY idMedicamento);
and I get:
IDmedicamento:3
maximum:5
which is what I want, and it is correct.
But when I do:
SELECT idMedicamento
FROM (SELECT idMedicamento, MAX(total) as maximum
FROM (SELECT idMedicamento, COUNT(idMedicamento) AS total
FROM PRESCRICAO_MEDICAMENTO
WHERE PRESCRICAO_MEDICAMENTO.idPrescricao IN (
SELECT idPrescricao FROM PRESCRICAO
WHERE PRESCRICAO.idmedico= 7
)
GROUP BY idMedicamento));
All I get is the idMedicamento last used by the MEDICO, in this case the MEDICAMENTO with idMedicamento=5.
Any idea what I'm doing wrong? I really can't figure it out.
Thanks
In many cases, the easiest way to get other columns from the record with a maximum value is to use ORDER BY/LIMIT 1:
SELECT idMedicamento
FROM PRESCRICAO_MEDICAMENTO
WHERE idPrescricao IN (SELECT idPrescricao
FROM PRESCRICAO
WHERE idmedico = 7)
GROUP BY idMedicamento
ORDER BY COUNT(*) DESC
LIMIT 1
The second query is wrong because the first query is wrong; it isn't valid standard SQL, and it only runs because SQLite tolerates non-aggregated ("bare") columns in an aggregate query.
The inner query (which I'll call Q1) of the first query is ok: SELECT id, COUNT(id) AS total FROM ... GROUP BY id;
But the outer query of the first query is broken: SELECT id, MAX(total) FROM ...; without a GROUP BY. This is wrong because the MAX forces an aggregation over the entire table (which is what you want), but the 'id' is not aggregated.
If you remove 'id, ' from the query, you should correctly get the maximum: SELECT MAX(total) AS maximum FROM ...; which I'll call Q2.
Then it gets ugly if your SQLite doesn't support CTEs (WITH clauses were only added in SQLite 3.8.3). Basically it is:
SELECT id FROM (Q1) WHERE total = (Q2);
but you have to write out Q1 and Q2, and there's a lot of repetition because Q2 includes Q1.
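On SQLite 3.8.3 or later, which does accept WITH clauses, Q1 only has to be written once. Here is a minimal sketch of the Q1/Q2 pattern through Python's sqlite3 module. The tables are cut down to the columns the query needs and the sample rows are invented; unlike ORDER BY ... LIMIT 1, this form returns every idMedicamento that ties for the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced versions of the question's tables.
CREATE TABLE PRESCRICAO (
    idPrescricao INTEGER PRIMARY KEY,
    idMedico     INTEGER NOT NULL
);
CREATE TABLE PRESCRICAO_MEDICAMENTO (
    idPrescricao  INTEGER REFERENCES PRESCRICAO(idPrescricao),
    idMedicamento INTEGER,
    nrEmbalagens  INTEGER NOT NULL,
    PRIMARY KEY (idPrescricao, idMedicamento)
);
-- Invented data: doctor 7 wrote prescriptions 1-3, doctor 8 wrote 4.
INSERT INTO PRESCRICAO VALUES (1, 7), (2, 7), (3, 7), (4, 8);
INSERT INTO PRESCRICAO_MEDICAMENTO VALUES
    (1, 3, 1), (2, 3, 2), (3, 3, 1),
    (1, 5, 1), (3, 5, 1), (4, 5, 4);
""")

most_used = conn.execute("""
WITH q1 AS (
    -- Q1: how often each medication was prescribed by this doctor.
    SELECT idMedicamento, COUNT(*) AS total
    FROM PRESCRICAO_MEDICAMENTO
    WHERE idPrescricao IN (SELECT idPrescricao
                           FROM PRESCRICAO
                           WHERE idMedico = ?)
    GROUP BY idMedicamento
)
SELECT idMedicamento
FROM q1
WHERE total = (SELECT MAX(total) FROM q1)  -- Q2: the maximum count
""", (7,)).fetchall()

print(most_used)
```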

How to combine particular rows in a pl/pgsql function that returns set of a view row type?

I have a view, and I have a function that returns records from this view.
Here is the view definition:
CREATE VIEW ctags(id, name, descr, freq) AS
SELECT tags.conc_id, expressions.name, concepts.descr, tags.freq
FROM tags, concepts, expressions
WHERE concepts.id = tags.conc_id
AND expressions.id = concepts.expr_id;
The column id refers to the table tags, which references another table, concepts, which in turn references the table expressions.
Here are the table definitions:
CREATE TABLE expressions(
id serial PRIMARY KEY,
name text,
is_dropped bool DEFAULT FALSE,
rank float(53) DEFAULT 0,
state text DEFAULT 'never edited',
UNIQUE(name)
);
CREATE TABLE concepts(
id serial PRIMARY KEY,
expr_id int NOT NULL,
descr text NOT NULL,
source_id int,
equiv_p_id int,
equiv_r_id int,
equiv_len int,
weight int,
is_dropped bool DEFAULT FALSE,
FOREIGN KEY(expr_id) REFERENCES expressions,
FOREIGN KEY(source_id),
FOREIGN KEY(equiv_p_id) REFERENCES concepts,
FOREIGN KEY(equiv_r_id) REFERENCES concepts,
UNIQUE(id,equiv_p_id),
UNIQUE(id,equiv_r_id)
);
CREATE TABLE tags(
conc_id int NOT NULL,
freq int NOT NULL default 0,
UNIQUE(conc_id, freq)
);
The table expressions is also referenced from my view (ctags).
I want my function to combine rows of my view that have equal values in the column name and that refer to rows of the table concepts with equal values of the column equiv_r_id so that these rows are combined only once. The combined row should have one of the ids (it doesn't matter which), its descr should be concatenated from the descr values of the rows being combined, and its freq should be the sum of their freq values. I have no idea how to do it; any help would be appreciated.
Basically, what you describe looks like this:
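CREATE FUNCTION f_test()
RETURNS TABLE(min_id int, name text, all_descr text, sum_freq int) AS
$x$
SELECT min(t.conc_id) -- AS min_id
,e.name
,string_agg(c.descr, ', ') -- AS all_descr
,sum(t.freq)::int -- AS sum_freq; sum(int) yields bigint, so cast back
FROM tags t
JOIN concepts c ON c.id = t.conc_id -- tags has no id column, so USING (id) cannot work
JOIN expressions e ON e.id = c.expr_id
-- WHERE e.name IS DISTINCT FROM
GROUP BY e.name; -- needed because e.name is selected next to aggregates
$x$
LANGUAGE sql;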
Major points:
I ignored the view ctags altogether as it is not needed.
So far you could also write this as a view; the function wrapper is not necessary.
You need PostgreSQL 9.0+ for string_agg(). Else you have to substitute with
array_to_string(array_agg(c.descr), ', ')
The only unclear part is this:
and that refer to rows of the table concepts with equal values of the column equiv_r_id so that these rows are combined only once
What column exactly refers to what column in table concepts?
concepts.equiv_r_id equals what exactly?
If you can clarify that part, I might be able to incorporate it into the solution.