Clustering/Similarity between text cells in a Postgres aggregate - sql

I've got a table that has a text column and some other identifying features. I want to be able to group by one of the features and find out whether the text within each group is similar or not. I want to use this to determine whether there are multiple groups in my data or a single group (with some possible bad spelling), so that I can provide a rough "confidence" value showing whether the aggregate represents a single group or not.
CREATE TABLE data_test (
Id serial primary key,
Name VARCHAR(70) NOT NULL,
Job VARCHAR(100) NOT NULL);
INSERT INTO data_test
(Name, Job)
VALUES
('John', 'Astronaut'),
('John', 'Astronaut'),
('Ann', 'Sales'),
('Jon', 'Astronaut'),
('Jason', 'Sales'),
('Pranav', 'Sales'),
('Todd', 'Sales'),
('John', 'Astronaut');
I'd like to run a query that was something like:
select
Job,
count(Name),
Similarity_Agg(Name)
from data_test
group by Job;
and receive
Job count Similarity
Sales 4 0.1
Astronaut 4 0.9
Basically showing that Astronaut names are very similar (or, more likely in my data, all the rows are referring to a single astronaut) and the Sales names aren't (more people working in sales than in space). I see there is a Postgres Module that can handle comparing two strings but it doesn't seem to have any aggregate functions in it.
Any ideas?

One option is a self-join:
select
d.job,
count(distinct d.id) cnt,
avg(similarity(d.name, d1.name)) avg_similarity
from data_test d
inner join data_test d1 on d1.job = d.job
group by d.job
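To try the self-join idea without a Postgres instance, here is a minimal sketch in Python using the built-in sqlite3 module. The `similarity` function it registers is a stand-in built on `difflib.SequenceMatcher`; that is an assumption for illustration only and not the same metric as the trigram similarity in Postgres, so the numbers will differ. The extra `d1.id <> d.id` condition is also my addition: it keeps a row from being compared with itself, at the cost of dropping groups that contain a single row.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in metric (assumption): difflib ratio, not trigram similarity.
    return SequenceMatcher(None, a, b).ratio()


conn = sqlite3.connect(":memory:")
conn.create_function("similarity", 2, similarity)
conn.executescript("""
    CREATE TABLE data_test (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        job TEXT NOT NULL
    );
    INSERT INTO data_test (name, job) VALUES
        ('John', 'Astronaut'), ('John', 'Astronaut'), ('Ann', 'Sales'),
        ('Jon', 'Astronaut'), ('Jason', 'Sales'), ('Pranav', 'Sales'),
        ('Todd', 'Sales'), ('John', 'Astronaut');
""")

rows = conn.execute("""
    SELECT d.job,
           COUNT(DISTINCT d.id) AS cnt,
           AVG(similarity(d.name, d1.name)) AS avg_similarity
    FROM data_test d
    JOIN data_test d1
      ON d1.job = d.job
     AND d1.id <> d.id          -- skip comparing a row with itself
    GROUP BY d.job
""").fetchall()

# Map each job to (row count, average pairwise similarity).
scores = {job: (cnt, avg) for job, cnt, avg in rows}
for job, (cnt, avg) in scores.items():
    print(job, cnt, round(avg, 2))
```

With this stand-in metric the Astronaut group scores much higher than the Sales group, which is the kind of rough "confidence" signal the question asks for.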

Related

Group By clause in SQLite

Aim:
I would like to query the table to pick only the latest version of each item.
Question:
Why does Query1 work in SQLite? (I was expecting the group by clause to throw an error, because the select statement contains the column content and it is not part of the group by clause.)
Would Query1 throw an error in Oracle ?
Is Query1 better than Query2 ?
Is there a better way to write the query ?
Query1:
select item_id,
max(version_number),
content
from item_version
group by item_id;
Query2:
select iv.*
from item_version iv,
(select item_id,
max(version_number) latest_version_number
from item_version
group by item_id) liv
where iv.item_id = liv.item_id
and iv.version_number = liv.latest_version_number;
Setting up the table:
create table item_version(
item_id varchar,
version_number integer,
content varchar,
primary key (item_id, version_number)
);
insert into item_version values (1, 1, null);
insert into item_version values (2, 1, 'Content A');
insert into item_version values (2, 2, 'Content B');
insert into item_version values (3, 1, 'Content C');
insert into item_version values (3, 2, null);
insert into item_version values (4, 1, 'Content D');
insert into item_version values (4, 2, null);
From the documentation:
In most SQL implementations, output columns of an aggregate query may only reference aggregate functions or columns named in the GROUP BY clause. It does not make good sense to reference an ordinary column in an aggregate query because each output row might be composed from two or more rows in the input table(s).
SQLite does not impose this restriction. The output columns from an aggregate query can be arbitrary expressions that include columns not found in GROUP BY clause.
With SQLite (but not any other SQL implementation that we know of) if an aggregate query contains a single min() or max() function, then the values of columns used in the output are taken from the row where the min() or max() value was achieved. If two or more rows have the same min() or max() value, then the columns values will be chosen arbitrarily from one of those rows.
For example to find the highest paid employee:
SELECT max(salary), first_name, last_name FROM employee;
In the query above, the values for the first_name and last_name columns will correspond to the row that satisfied the max(salary) condition.
If a query contains no aggregate functions at all, then a GROUP BY clause can be added as a substitute for the DISTINCT ON clause. In other words, output rows are filtered so that only one row is shown for each distinct set of values in the GROUP BY clause. If two or more output rows would have otherwise had the same set of values for the GROUP BY columns, then one of the rows is chosen arbitrarily.
Your query 1 would cause an error in most databases, yes, but as long as you're only going to use it with sqlite, it's perfectly fine.
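As a quick check of the behaviour quoted above, here is a small sketch using Python's built-in sqlite3 module. It runs Query1 next to a join-based form equivalent to Query2 and compares the results. One simplification is mine: item_id is declared as an integer here rather than varchar, just to keep the printed rows easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_version (
        item_id INTEGER,            -- simplified from varchar (assumption)
        version_number INTEGER,
        content TEXT,
        PRIMARY KEY (item_id, version_number)
    );
    INSERT INTO item_version VALUES
        (1, 1, NULL),
        (2, 1, 'Content A'), (2, 2, 'Content B'),
        (3, 1, 'Content C'), (3, 2, NULL),
        (4, 1, 'Content D'), (4, 2, NULL);
""")

# Query1: the bare column "content" sits next to a single max() aggregate,
# so SQLite takes it from the row that holds the maximum.
query1 = conn.execute("""
    SELECT item_id, max(version_number), content
    FROM item_version
    GROUP BY item_id
    ORDER BY item_id
""").fetchall()

# Query2: portable form that joins back to the per-item maximum.
query2 = conn.execute("""
    SELECT iv.item_id, iv.version_number, iv.content
    FROM item_version iv
    JOIN (SELECT item_id, max(version_number) AS latest_version_number
          FROM item_version
          GROUP BY item_id) liv
      ON iv.item_id = liv.item_id
     AND iv.version_number = liv.latest_version_number
    ORDER BY iv.item_id
""").fetchall()

print(query1)
print(query1 == query2)
```

Both queries return the latest version of each item here, but only the second form can be expected to behave the same way outside SQLite.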
An alternative way to find the highest version of each item uses the window functions added in SQLite 3.25:
SELECT item_id, version_number, content
FROM (SELECT item_id, version_number, content
, row_number() OVER (PARTITION BY item_id ORDER BY version_number DESC) AS rnk
FROM item_version) AS sq
WHERE rnk = 1
ORDER BY item_id;
giving
item_id     version_number  content
----------  --------------  ----------
1           1
2           2               Content B
3           2
4           2
This one should work on other databases like Oracle, as long as they support window functions too.
Shawn does a really good job of explaining the issue. A typical way to solve this uses a correlated subquery:
select iv.*
from item_version iv
where iv.version_number = (select max(iv2.version_number)
from item_version iv2
where iv2.item_id = iv.item_id
);
With an index on item_version(item_id, version_number) this may be the fastest way to get the results that you want. You already have this index with your primary key definition.

SQL Query and Sort From Multiple Tables

Working with SQL via a NOVA Oracle DB. Need to know how to query from multiple tables and arrange results based on being sorted by the highest values. Here are a few lines of code to reflect the three tables:
INSERT INTO VEHICLES
(vehicleVIN,vehicleType,vehicleMake,vehicleModel,vehicleWhereFrom,vehicleWholesaleCost,vehicleTradeID)
VALUES
('147258HHE91K3RT','compact','chevrolet','spark','Maryland',20583.00,NULL);
INSERT INTO VEHICLES
(vehicleVIN,vehicleType,vehicleMake,vehicleModel,vehicleWhereFrom,vehicleWholesaleCost,vehicleTradeID)
VALUES
('789456ERT0923RFB6','Midsize','ford','Taurus','washington, d.c.',25897.22,1);
INSERT INTO VEHICLES
(vehicleVIN,vehicleType,vehicleMake,vehicleModel,vehicleWhereFrom,vehicleWholesaleCost,vehicleTradeID)
VALUES
('1234567890QWERTYUIOP','fullsize','Lincoln','towncar','Virginia',44222.10,NULL);
AND
INSERT INTO SALES
(saleID,grossSalePrice,vehicleStatus,saleDate,saleMileage,customerID,salespersonID,vehicleVIN)
VALUES
(1,25987.28,'sold',date '2012-10-15',10,1,1,'147258HHE91K3RT');
INSERT INTO SALES
(saleID,grossSalePrice,vehicleStatus,saleDate,saleMileage,customerID,salespersonID,vehicleVIN)
VALUES
(2,29999.99,'sold',date '2012-10-17',50087,2,2,'789456ERT0923RFB6');
INSERT INTO SALES
(saleID,grossSalePrice,vehicleStatus,saleDate,saleMileage,customerID,salespersonID,vehicleVIN)
VALUES
(3,47490.88,'sold',date '2012-11-05',30,3,3,'1234567890QWERTYUIOP');
AND
INSERT INTO CUSTOMERS
(customerID,customerFirName,customerLasName,customerMiName,customerStreet,customerState,customerCity,customerZip)
VALUES
(1,'Regorna','Trasper','J','11111 Address Way','Maryland','Hollywood','20636');
INSERT INTO CUSTOMERS
(customerID,customerFirName,customerLasName,customerMiName,customerStreet,customerState,customerCity,customerZip)
VALUES
(2,'Bob','Seagram','A','22222 Seagram Lane','Texas','Houston','77001');
INSERT INTO CUSTOMERS
(customerID,customerFirName,customerLasName,customerMiName,customerStreet,customerState,customerCity,customerZip)
VALUES
(3,'Sally','Anderson','P','33333 Pheonix Drive','Arizona','Pheonix','85001');
Obviously there are other tables that come into play here (salesperson, etc.), however these are the only tables needed for the query. The query I want to pull needs to show the total count of sales for each model, sorted by the highest values, and the total count of sales for each zip code, sorted by the highest values. An example (using the data provided above) would look similar to this:
MODEL    NUMBER OF SALES  ZIP CODE  NUMBER OF SALES
spark    1                20636     1
Taurus   1                77001     1
towncar  1                85001     1
The results need to be sorted by highest values, based on the number of sales. I'm also trying to accomplish this via a single SELECT query.
I've tried some ideas, but haven't been able to find anything that hits the home run yet. Thanks for the help!
See if this is what you're after:
SELECT DISTINCT v.VEHICLEMODEL, COUNT(*) OVER (PARTITION BY v.VEHICLEMODEL) "CAR_SALES"
, c.CUSTOMERZIP, COUNT(*) OVER (PARTITION BY c.CUSTOMERZIP) "TOTAL_SALES_AT_ZIP"
FROM SALES s, VEHICLES v, CUSTOMERS c
WHERE s.VEHICLEVIN = v.VEHICLEVIN
AND c.CUSTOMERID = s.CUSTOMERID
ORDER BY 2 DESC, 4 DESC

Oracle SQL - Insertion with subquery returning multiple rows

I have an issue with an n-n relationship while trying to insert into the "middle table".
The goal is to associate Commune and ZipCode. (In France, a commune is a city, and a city name can have multiple zip codes because there are communes with the same name in different places.)
A ZipCode can also cover multiple cities, hence the n-n relationship.
Here is the query I use:
INSERT INTO FR(IDCODEPOSTAL, IDCOM_SIM)
VALUES
('24209 CEDEX', (SELECT DISTINCT IDCOM_SIM FROM COMMUNE WHERE NCCENR='Creysse'));
But here the SELECT returns 2 rows. I've read much but I didn't find a way to deal with this.
I'm not sure what you are trying to achieve, but usually you'd use an INSERT ... SELECT (without the VALUES) to insert multiple rows with a single statement:
INSERT INTO FR
(IDCODEPOSTAL, IDCOM_SIM)
SELECT '24209 CEDEX', IDCOM_SIM
FROM COMMUNE
WHERE NCCENR='Creysse';
If, however, you want to insert only a single row, you need to make sure the sub-select returns only one row. This is usually done using an aggregate function such as max():
INSERT INTO FR
(
IDCODEPOSTAL,
IDCOM_SIM
)
VALUES
(
'24209 CEDEX',
(SELECT max(IDCOM_SIM) FROM COMMUNE WHERE NCCENR='Creysse')
);
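To see the INSERT ... SELECT form in action, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The column types and the commune ids (101, 202, 303) are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE COMMUNE (IDCOM_SIM INTEGER PRIMARY KEY, NCCENR TEXT);
    CREATE TABLE FR (IDCODEPOSTAL TEXT, IDCOM_SIM INTEGER);
    -- Hypothetical ids: two communes share the name 'Creysse'.
    INSERT INTO COMMUNE VALUES
        (101, 'Creysse'), (202, 'Creysse'), (303, 'Bergerac');
""")

# INSERT ... SELECT adds one row to FR per row returned by the SELECT,
# so both communes named 'Creysse' get linked to the zip code.
conn.execute("""
    INSERT INTO FR (IDCODEPOSTAL, IDCOM_SIM)
    SELECT '24209 CEDEX', IDCOM_SIM
    FROM COMMUNE
    WHERE NCCENR = 'Creysse'
""")

inserted = conn.execute(
    "SELECT IDCODEPOSTAL, IDCOM_SIM FROM FR ORDER BY IDCOM_SIM"
).fetchall()
print(inserted)
```

Each matching commune produces its own row in the middle table, which is usually what an n-n link table needs.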

Placing different rows in succession

I started working with Access around a month ago and I'm making a tool for preventive medicine so they can use a digital version of their current paper form.
While the program is nearly finished, the doctor who requested it now wants to export to Excel (the easy part) all the data for a patient, their treatment, and all the medicines used during that treatment in a single line (the problem).
I've been beating my head over this for two days, trying things and researching on Google, but all I could find was how to put values from a column in a single cell, and that's not how it has to be displayed.
So far, my best attempt (which is far from a good one) has been something like this:
CREATE TABLE Patient
(`SIP` int, `name` varchar(10));
INSERT INTO Patient
(`SIP`, `name`)
VALUES
(70,'John');
-- A patient can have multiple treatments
CREATE TABLE Treatment
(`id` int, `SIPFK` int);
INSERT INTO Treatment
(`id`,`SIPFK`)
VALUES
(1,70);
-- A treatment can have multiple medicines used while it's open
CREATE TABLE Medicine
(`Id` int, `Name` varchar(8), `TreatFK` int);
INSERT INTO Medicine
(`Id`, `Name`, `TreatFK`)
VALUES
(7, 'Apples', 1),
(7, 'Tomatoes', 1),
(7, 'Potatoes', 1),
(8, 'Banana', 2),
(8, 'Peach', 2);
-- The query
select c.id, c.Name, p.id as id2, p.Name as name2, r.id as id3, r.Name as name3
from Medicine as c, Medicine as p, Medicine as r
where c.id = 7 and p.id=7 and r.id=7;
The output I was trying to get was:
7 | Apples | 7 | Tomatoes | 7 | Potatoes
The Medicine table will have more columns than that, and I need to show every row related to a treatment in a single row along with the treatment.
But the values keep repeating themselves on different rows and the output on the subsequent columns besides the first ones is not as expected. Also GROUP BY won't solve the problem and DISTINCT doesn't work.
The output of the query is as follows: sqlfiddle.com
If any one could give me a hint, I would be grateful.
EDIT: Since Access is a derp and won't let me use any good SQL fix, nor recognize DISTINCT to stop the data from the queries repeating itself, I will try to find a way to organize the rows directly in the exported Excel file.
Thank you all for your help. I'll save it because I'm sure it'll save me hours of hands in the head.
This is a bit problematic, because MS Access does not support recursive CTEs and I don't see a way of doing this without ranking.
Hence, I have tried to reproduce the results by using a subquery which ranks the medicines and stores them in a temporary table.
create table newtable
select c.id
, c.Name
,(SELECT COUNT(T1.Name) FROM Medicine AS T1 WHERE t1.id=c.id and T1.Name >= c.Name) AS Rank
from Medicine as c;
Afterwards, it is easy because my query is mostly based on Ranks and IDs.
select distinct id
,(select Name from newtable t2 where t1.id=t2.id and rank=1) as firstMed
,(select Name from newtable t2 where t1.id=t2.id and rank=2) as secMed
,(select Name from newtable t2 where t1.id=t2.id and rank=3) as ThirdMed
from newtable t1;
In my opinion, the self-join concept and the notion of recursive CTEs are the most important points for this particular example, and it would be good practice to do some research on these.
for reference: http://sqlfiddle.com/#!2/f80a9/2
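The same rank-then-pivot idea can be sketched in one statement with conditional aggregation. The example below uses Python's built-in sqlite3 module, not Access, so treat it as an illustration of the technique rather than something that will paste into Access unchanged. The rank is computed with `<=` so that names come out in alphabetical order, and it assumes names are unique within an Id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Medicine (Id INTEGER, Name TEXT, TreatFK INTEGER);
    INSERT INTO Medicine VALUES
        (7, 'Apples', 1), (7, 'Tomatoes', 1), (7, 'Potatoes', 1),
        (8, 'Banana', 2), (8, 'Peach', 2);
""")

rows = conn.execute("""
    SELECT Id,
           MAX(CASE WHEN rnk = 1 THEN Name END) AS first_med,
           MAX(CASE WHEN rnk = 2 THEN Name END) AS second_med,
           MAX(CASE WHEN rnk = 3 THEN Name END) AS third_med
    FROM (
        -- Rank each medicine alphabetically within its Id.
        SELECT c.Id, c.Name,
               (SELECT COUNT(*)
                FROM Medicine t
                WHERE t.Id = c.Id AND t.Name <= c.Name) AS rnk
        FROM Medicine c
    ) AS ranked
    GROUP BY Id
    ORDER BY Id
""").fetchall()

for row in rows:
    print(row)
```

Each Id ends up on a single row, with NULL in any column for which there is no medicine of that rank.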

Is SQL GROUP BY a design flaw? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Why does SQL require that I specify on which attributes to group? Why can't it just use all non-aggregates?
If an attribute is not aggregated and is not in the GROUP BY clause, then nondeterministic choice would be the only option, assuming tuples are unordered (MySQL kind of does this), and that is a huge gotcha. As far as I know, PostgreSQL requires that all attributes not appearing in the GROUP BY must be aggregated, which reinforces that it is superfluous.
Am I missing something or is this a language design flaw that promotes loose implementations and makes queries harder to write?
If I am missing something, what is an example query where group attributes can not be inferred?   
You don't have to group by exactly the same thing you're selecting, e.g.:
SQL:select priority,count(*) from rule_class
group by priority
PRIORITY COUNT(*)
70 1
50 4
30 1
90 2
10 4
SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class
group by priority
DECO COUNT(*)
Odd 1
Norm 4
Odd 1
Odd 2
Odd 4
SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class
group by decode(priority,50,'Norm','Odd')
DECO COUNT(*)
Norm 4
Odd 8
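Here is a small reproduction of that difference using Python's built-in sqlite3 module. SQLite has no decode(), so a CASE expression stands in for it; the rule_class rows are rebuilt from the counts in the first listing, and everything else about the table is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rule_class (priority INTEGER)")

# Row counts per priority, taken from the PRIORITY / COUNT(*) listing.
counts = {70: 1, 50: 4, 30: 1, 90: 2, 10: 4}
for priority, n in counts.items():
    conn.executemany("INSERT INTO rule_class VALUES (?)", [(priority,)] * n)

# CASE stands in for Oracle's decode(priority, 50, 'Norm', 'Odd').
label = "CASE priority WHEN 50 THEN 'Norm' ELSE 'Odd' END"

# Grouping by the raw column keeps one row per priority.
by_column = conn.execute(
    f"SELECT {label}, COUNT(*) FROM rule_class GROUP BY priority"
).fetchall()

# Grouping by the expression collapses all 'Odd' priorities together.
by_expression = conn.execute(
    f"SELECT {label} AS label, COUNT(*) FROM rule_class "
    f"GROUP BY {label} ORDER BY label"
).fetchall()

print(sorted(by_column))
print(by_expression)
```

The select list is identical in both queries, so the grouping could not have been inferred from it.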
There is one more reason why SQL requires that you specify which attributes to group on.
Let's say we have two simple tables, friend and car, where we store info about our friends and their cars.
And let's say we want to show all our friends' data (from table friend) and, for every one of our friends, how many cars they own now, have sold, have crashed, and the total number. Oh, and we want the elders first, younger last.
We'd do something like:
SELECT f.id
, f.firstname
, f.lastname
, f.birthdate
-- COUNT(expr) counts every non-NULL value, so the boolean flags are wrapped in CASE
, COUNT(CASE WHEN NOT c.sold AND NOT c.crashed THEN 1 END) AS owned
, COUNT(CASE WHEN c.sold THEN 1 END) AS sold
, COUNT(CASE WHEN c.crashed THEN 1 END) AS crashed
, COUNT(c.friendid) AS totalcars
FROM friend f
LEFT JOIN car c -- to catch (shame!) those friends who have never had a car
ON f.id = c.friendid
GROUP BY f.id
, f.firstname
, f.lastname
, f.birthdate
ORDER BY f.birthdate
But do we really need all those fields in the GROUP BY? Isn't every friend uniquely determined by his id? In other words, aren't the firstname, lastname and birthdate functionally dependent on the f.id? Why not just do (as we can in MySQL):
SELECT f.id
, f.firstname
, f.lastname
, f.birthdate
, COUNT(CASE WHEN NOT c.sold AND NOT c.crashed THEN 1 END) AS owned
, COUNT(CASE WHEN c.sold THEN 1 END) AS sold
, COUNT(CASE WHEN c.crashed THEN 1 END) AS crashed
, COUNT(c.friendid) AS totalcars
FROM friend f
LEFT JOIN car c -- to catch (shame!) those friends who have never had a car
ON f.id = c.friendid
GROUP BY f.id
ORDER BY f.birthdate
And what if we had 20 fields in the SELECT (plus ORDER BY) parts? Isn't the second query shorter, clearer and probably faster (in the RDBMSs that accept it)?
I say yes. So do the SQL:1999 and SQL:2003 specs, if this article is correct: Debunking group by myths
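For anyone who wants to experiment with the shorter form, here is a minimal sketch using Python's built-in sqlite3 module (SQLite also accepts GROUP BY f.id with the other friend columns in the select list). The sample rows and the 0/1 encoding of sold and crashed are assumptions of mine. Note that COUNT(expr) counts every non-NULL value, so the per-status counts are written with CASE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friend (
        id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, birthdate TEXT
    );
    CREATE TABLE car (friendid INTEGER, sold INTEGER, crashed INTEGER);
    -- Hypothetical sample rows; sold/crashed are 0/1 flags.
    INSERT INTO friend VALUES
        (1, 'Ann', 'Lee', '1950-01-01'),
        (2, 'Bob', 'Ray', '1980-05-05'),
        (3, 'Cy',  'Orr', '1965-03-03');
    INSERT INTO car VALUES (1, 0, 0), (1, 1, 0), (1, 0, 1), (2, 0, 0);
""")

rows = conn.execute("""
    SELECT f.id, f.firstname,
           COUNT(CASE WHEN NOT c.sold AND NOT c.crashed THEN 1 END) AS owned,
           COUNT(CASE WHEN c.sold THEN 1 END) AS sold,
           COUNT(CASE WHEN c.crashed THEN 1 END) AS crashed,
           COUNT(c.friendid) AS totalcars
    FROM friend f
    LEFT JOIN car c ON f.id = c.friendid   -- keeps friends without cars
    GROUP BY f.id                          -- firstname depends on f.id
    ORDER BY f.birthdate                   -- oldest first
""").fetchall()

for row in rows:
    print(row)
```

Cy has no cars, so the LEFT JOIN yields only NULLs for that friend and every count comes out as 0.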
I would say if you have a large number of items in the group by clause then perhaps the core info should be pulled out into a tabular sub-query which you inner join into.
There is probably a performance hit, but it makes for neater code.
select id, count(a), b, c, d
from table
group by
id, b, c, d
becomes
select t.id, myCountTable.myCount, t.b, t.c, t.d
from table t
inner join (
select id, count(*) as myCount
from table
group by id
) as myCountTable on myCountTable.id = t.id
That said, I'm interested to hear counter-arguments for doing this as opposed to having a large group by clause.
I agree it's verbose that the GROUP BY list isn't implicitly the same as the non-aggregated select columns. In SAS there are data aggregation operations that are more succinct.
Also : it's hard to come up with an example where it would be useful to have a longer list of columns in the group list than the select list. The best I can come up with is ...
create table people
( Nam char(10)
,Adr char(10)
)
insert into people values ('Peter', 'Tibet')
insert into people values ('Peter', 'OZ')
insert into people values ('Peter', 'OZ')
insert into people values ('Joe', 'NY')
insert into people values ('Joe', 'Texas')
insert into people values ('Joe', 'France')
-- Give me people where there is a duplicate address record
select * from people where nam in
(
select nam
from People
group by nam, adr -- group list different from select list
having count(*) > 1
)
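To check what that query returns, here is a small sketch using Python's built-in sqlite3 module with the same rows. The column types are simplified to TEXT, and the ORDER BY is my addition so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (Nam TEXT, Adr TEXT);
    INSERT INTO people VALUES
        ('Peter', 'Tibet'), ('Peter', 'OZ'), ('Peter', 'OZ'),
        ('Joe', 'NY'), ('Joe', 'Texas'), ('Joe', 'France');
""")

# The subquery groups by (Nam, Adr) but selects only Nam:
# the group list is longer than the select list.
rows = conn.execute("""
    SELECT * FROM people
    WHERE Nam IN (
        SELECT Nam
        FROM people
        GROUP BY Nam, Adr
        HAVING COUNT(*) > 1
    )
    ORDER BY Adr
""").fetchall()

for row in rows:
    print(row)
```

Only Peter has a duplicated address, so all three of Peter's rows come back and none of Joe's.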
If your issue is just about an easier way to write scripts, here is one tip:
In MS SQL Management Studio, write your query as text, something like select * from my_table,
then select the text, right-click, and choose "Design Query in Editor..."
Management Studio will open a new editor with all the fields filled in; right-click again and select "Add Group By".
Management Studio will add the code for you.
I find this method extremely useful for insert statements. When I need to write a script that inserts a lot of fields into a table, I just do select * from table_where_want_to_insert and then change the type of the select statement to insert.
I Agree
I quite agree with the question. I asked the same one here.
I honestly think it's a language flaw.
I realise that there are arguments against that, but I have yet to use a GROUP BY clause containing anything other than all the non-aggregated fields from the SELECT clause in the real world.
This thread provides some useful explanations.
http://social.msdn.microsoft.com/Forums/en/transactsql/thread/52482614-bfc8-47db-b1b6-deec7363bd1a
I'd say it is more likely to be a language design choice that decisions be explicit, not implicit. For instance, what if I wish to group the data in a different order than that in which I output the columns? Or if I want to group by columns that aren't included in the columns selected? Or if I want to output grouped columns only and not use aggregate functions? Only by explicitly stating my preferences in the group by clause are my intentions clear.
You also have to remember that SQL is a very old language (it dates from the 1970s). Look at how LINQ flipped everything around in order to make IntelliSense work - it looks obvious to us now, but SQL predates IDEs and so couldn't have taken such issues into account.
The "superfluous" attributes can influence the ordering of the result (without an ORDER BY, the order is whatever the database happens to produce).
Consider:
create table gb (
a number,
b varchar(3),
c varchar(3)
);
insert into gb values ( 3, 'foo', 'foo');
insert into gb values ( 1, 'foo', 'foo');
insert into gb values ( 0, 'foo', 'foo');
insert into gb values ( 20, 'foo', 'bar');
insert into gb values ( 11, 'foo', 'bar');
insert into gb values ( 13, 'foo', 'bar');
insert into gb values ( 170, 'bar', 'foo');
insert into gb values ( 144, 'bar', 'foo');
insert into gb values ( 130, 'bar', 'foo');
insert into gb values (2002, 'bar', 'bar');
insert into gb values (1111, 'bar', 'bar');
insert into gb values (1331, 'bar', 'bar');
This statement
select sum(a), b, c
from gb
group by b, c;
results in
44 foo bar
444 bar foo
4 foo foo
4444 bar bar
while this one
select sum(a), b, c
from gb
group by c, b;
results in
444 bar foo
44 foo bar
4 foo foo
4444 bar bar
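The row order of a grouped result is not something SQL promises unless there is an ORDER BY, so it can differ between databases and versions. As a rough sketch, the snippet below loads the same gb rows into SQLite through Python's built-in sqlite3 module (with `number` and `varchar` simplified to INTEGER and TEXT) and checks that both groupings contain the same rows, and that adding ORDER BY pins the order down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gb (a INTEGER, b TEXT, c TEXT)")
conn.executemany("INSERT INTO gb VALUES (?, ?, ?)", [
    (3, 'foo', 'foo'), (1, 'foo', 'foo'), (0, 'foo', 'foo'),
    (20, 'foo', 'bar'), (11, 'foo', 'bar'), (13, 'foo', 'bar'),
    (170, 'bar', 'foo'), (144, 'bar', 'foo'), (130, 'bar', 'foo'),
    (2002, 'bar', 'bar'), (1111, 'bar', 'bar'), (1331, 'bar', 'bar'),
])

# Same groups either way; only the (unspecified) row order may differ.
group_bc = conn.execute(
    "SELECT sum(a), b, c FROM gb GROUP BY b, c"
).fetchall()
group_cb = conn.execute(
    "SELECT sum(a), b, c FROM gb GROUP BY c, b"
).fetchall()

# An explicit ORDER BY makes the order independent of the GROUP BY list.
ordered = conn.execute(
    "SELECT sum(a), b, c FROM gb GROUP BY c, b ORDER BY sum(a)"
).fetchall()

print(sorted(group_bc) == sorted(group_cb))
print(ordered)
```

If the order matters to the caller, it seems safer to state it with ORDER BY than to rely on the order of the GROUP BY columns.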