How do I do a DISTINCT and ORDER BY in PostgreSQL? - sql

PostgreSQL is about to make me punch small animals. I'm doing the following SQL statement for MySQL to get a listing of city/state/countries that are unique.
SELECT DISTINCT city
, state
, country
FROM events
WHERE (city > '')
AND (number_id = 123)
ORDER BY occured_at ASC
But doing that makes PostgreSQL throw this error:
PGError: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
But if I add occured_at to the SELECT, then it kills of getting back the unique list.
Results using MySQL and first query:
BEDFORD PARK IL US
ADDISON IL US
HOUSTON TX US
Results if I add occured_at to the SELECT:
BEDFORD PARK IL US 2009-11-02 19:10:00
BEDFORD PARK IL US 2009-11-02 21:40:00
ADDISON IL US 2009-11-02 22:37:00
ADDISON IL US 2009-11-03 00:22:00
ADDISON IL US 2009-11-03 01:35:00
HOUSTON TX US 2009-11-03 01:36:00
The first set of results is what I'm ultimately trying to get with PostgreSQL.

Well, how would you expect Postgres to determine which occured_at value to use in creating the sort order?
I don't know Postgres syntax particularly, but you could try:
SELECT DISTINCT city, state, country, MAX(occured_at)
FROM events
WHERE (city > '') AND (number_id = 123) ORDER BY MAX(occured_at) ASC
or
SELECT city, state, country, MAX(occured_at)
FROM events
WHERE (city > '') AND (number_id = 123)
GROUP BY city, state, country ORDER BY MAX(occured_at) ASC
That's assuming you want the results ordered by the MOST RECENT occurrence. If you want the first occurrence, change MAX to MIN.
Incidentally, your title asks about GROUP BY, but your syntax specifies DISTINCT.

Related

How to get the differences between two rows **and** the name of the field where the difference is, in BigQuery?

I have a table in BigQuery like this:
Name
Phone Number
Address
John
123456778564
1 Penny Lane
John
873452987424
1 Penny Lane
Mary
845704562848
87 5th Avenue
Mary
845704562848
54 Lincoln Rd.
Amy
342847327234
4 Ocean Drive Avenue
Amy
347907387469
98 Truman Rd.
I want to get a table with the differences between two consecutive rows and the name of the field where occurs the difference:
I mean this:
Name
Field
Before
After
John
Phone Number
123456778564
873452987424
Mary
Address
87 5th Avenue
54 Lincoln Rd.
Amy
Phone Number
342847327234
347907387469
Amy
Address
4 Ocean Drive Avenue
98 Truman Rd.
How can I do this ? I've looked on other posts but couldn't find something that corresponds to my need.
Thank you
Consider below BigQuery'ish solution
select Name, ['Phone Number', 'Address'][offset(offset)] Field,
prev_field as Before, field as After
from (
select timestamp, Name, offset, field,
lag(field) over (partition by Name, offset order by timestamp) as prev_field
from yourtable,
unnest([`Phone Number`, Address]) field with offset
)
where prev_field != field
if applied to sample data in your question - output is
As you can see here - no matter how many columns in your table that you need to compare - it is still just one query - no unions and such.
You just need to enumerate your columns in two places
['Phone Number', 'Address'][offset(offset)] Field
and
unnest([`Phone Number`, Address]) field with offset
Note: you can further refactor above using scripting's execute immediate to compose such lists within the query on the fly (check my other answers - I frequently use such technique in them)
One method is just use to use lag() and union all
select name, 'phone', prev_phone as before, phone as after
from (select name, phone,
lag(phone) over (partition by name order by timestamp) as prev_phone
from t
) t
where prev_phone <> phone
union all
select name, 'address', prev_address as before, address as afte4r
from (select name, address,
lag(address) over (partition by name order by timestamp) as prev_address
from t
) t
where prev_address <> address

Select top three records grouping by two factors

I am trying to identify the three records with the highest values grouped by two factors. I realize this question is similar to this one PostgreSQL: select top three in each group, but I can't figure out how to generalize from this example which includes a single factor, to two factors. I have tried searching stack overflow for an answer to this question beyond the one listed above and I can't find one, but perhaps I'm not searching for the correct terms.
Briefly, I'm connecting to a table with the following schema
city, country, value
I only have a single row per city, country combination, but I have a variable, but the number of city entries I have per country is variable. For example, I have a few dozen cities for Canada, a hundred for the United States, but only two for Uzbekistan.
What I want, as output is a table with the same schema, but only countaining the rows containing the highest three values for city, nested within country. For example, if Canada has the cities and values of
{Canada, toronto, 100}, {Canada, vancouver, 80},
{Canada, montreal,112}, {Canada, calgary, 109},
{Canada, edmonton, 76}, {Canada, winnipeg, 73},
and the United States has the entries of
{{us, nyc, 104}, {us, chicago, 87},
{us, boston, 98}, {us, seattle, 105},
{us, sanfran, 88}, {us, minneapolis, 84},
{us, miami, 103}, {us, houston, 112},
{us, dallas, 78}, {us, tucson, 83}}
and Uzbekistan has the entries of
{uzbekistan, qarshi, 95}, {uzbeckistan, gluiston, 101}
What I would like as output would be
Canada, Montreal, 112
Canada, Toronto, 100
Canada, Calgary, 109
us, houston, 112
us, seattle, 105
us, nyc, 103,
uzbeckistan, qarshi, 95
uzbeckistan, gluiston 101
I've tried the following query
SELECT logincity, logincountry, VAL
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY logincountry, logincity ORDER BY
val DESC) AS Row_ID
FROM a_table)
WHERE Row_ID < 4
ORDER BY logincity
But I end up with more than three cities per country.
Can someone help me out?
Thanks Stack Overflow!
I think you only need partition by logincountry
SELECT logincity, logincountry, VAL
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY logincountry
ORDER BY val DESC) AS Row_ID
FROM a_table ) T
WHERE Row_ID < 4
ORDER BY logincity
TIP: You probably will realize the problem if you include the Row_id on the SELECT
SELECT logincity, logincountry, VAL, Row_ID
On your query all Row_ID = 1
TIP 2: Your query want top 3 cities for each country, so you only have one partition country. So the linked question is the right answer, top 3 of each group in this case country.

Trying to find duplication in records where address is different only in the one field and only by a certain number

I have a table of listings that has NAP fields and I wanted to find duplication within it - specifically where everything is the same except the house number (within 2 or 3 digits).
My table looks something like this:
Name Housenumber Streetname Streettype City State Zip
1 36 Smith St Norwalk CT 6851
2 38 Smith St Norwalk CT 6851
3 1 Kennedy Ave Campbell CA 95008
4 4 Kennedy Ave Campbell CA 95008
I was wondering how to set up a qry to find records like these.
I've tried a few things but can't figure out how to do it - any help would be appreciated.
Thanks
Are you looking to find something that shows the amount of these rows you have like this?
SELECT
StreenName,
City,
State,
Zip,
COUNT(*)
FROM YourTable
group by StreenName, City, State, Zip
HAVING COUNT(*) >1
Or maybe trying to find all of the rows that have the same street, city, state, and zip?
SELECT
A.HouseNumber,
A.StreetName,
A.City,
A.State,
A.Zip
FROM YourTable as A
INNER JOIN YourTable as B
ON A.StreetName = B.StreetName
AND A.City = B.City
AND A.State = B.State
AND A.Zip = B.Zip
AND A.HouseNumber <> B.HouseNumber
Here is one way to do it. You'll need a unique ID for the table to run this, as you wouldn't want to select the exact same person if theyre the only one there. This'll just spit out all the results where there is at least one duplicate.
Edit: Woops, just realized in comments it says varchar for the street number...hmm. So you could just run a cast on it. The OP never said anything about house numbers in varchar or being letters and numbers in the original post. As for letters in the street number field, I've been a third party shipping provider for 2 yrs in the past and I have never seen one; with the exception of an apt., which would be a diff field. Its just as likely that someone put varchar there for some other reason(leading 0's), or for no reason. Of oourse there could be, but no way of knowing whats in the field without response from OP. To run cast to int its the same except this for each instance: Cast(mt.HouseNumber as int)
select *
from MyTable mt
where exists (select 1
from MyTable mt2
where mt.name = mt2.name
and mt.street = mt2.street
and mt.state = mt2.state
and mt.city = mt2.city
and mt2.HouseNumber between (mt.HouseNumber -3) and (mt.HouseNumber +3)
and mt.UID != mt2.UID
)
order by mt.state, mt.city, mt.street
;
Not sure how to run the -3 +3 if there are letters involed...unless you know excatly where they are and you can just simply cut them out then cast.

Count number of rows that have a specific word in a varchar (in postgresql)

I have a table similar to the below:
id | name | direction |
--------------------------------------
1 Jhon Washington, DC
2 Diego Miami, Florida
3 Michael Orlando, Florida
4 Jenny Olympia, washington
5 Joe Austin, Texas
6 Barack Denver, Colorado
and I want to count how many people live in a specific state:
Washington 2
Florida 2
Texas 1
Colorado 1
How can I do this? (By the way this is just an question with an academic point of view )
Thanks in advance!
Postgres offers the function split_part(), which will break up a string by a delimiter. You want the second part (the part after the comma):
select split_part(direction, ', ', 2) as state, count(*)
from t
group by split_part(direction, ', ', 2);
Initially I would obtain the state from the direction field. Once you have that, it's quite simple:
SELECT state, count(*) as total FROM initial_table group by state.
To obtain the state, some functions depending on the dbms are useful. It depends on the language.
A possible pseudocode (given a function like substring_index of MySQL) for the query would be:
SELECT substring_index(direction,',',-1) as state, count(*) as total
FROM initial_table group by substring_index(direction,',',-1)
Edit: As it is suggested above, the query should return 1 for the Washington state.
My way do making such a queries is two-step - first, prepare fields you need, second, do you grouping or other calculation. That way you're following DRY principle and don't repeating yourself. I think CTE is the best tool for this:
with cte as (
-- we don't need other fields, only state
select
split_part(direction, ', ', 2) as state
from table1
)
select state, count(*)
from cte
group by state
sql fiddle demo
If you writing queries that way, it's easy to change grouping field in the future.
Hope that helps, and remember - readability counts! :)

SQL Selecting distinct rows from multiple columns based on max value in one column

This is my SQL View - lets call it MyView :
ECode SHCode TotalNrShare CountryCode Country
000001 +00010 100 UKI United Kingdom
000001 ABENSO 900 USA United States
000355 +00012 1000 ESP Spain
000355 000010 50 FRA France
000042 009999 10 GER Germany
000042 +00012 999 ESP Spain
000787 ABENSO 500 USA United States
000787 000150 500 ITA Italy
001010 009999 100 GER Germany
I would like to return the single row with the highest number in the column TotalNrShare for each ECode.
For example, I’d like to return these results from the above view:
ECode SHCode TotalNrShare CountryCode Country
000001 ABENSO 900 USA United States
000355 +00012 1000 ESP Spain
000042 +00012 999 ESP Spain
000787 ABENSO 500 USA United States
001010 009999 100 GER Germany
(note in the case of ECode 000787 where there are two SHCode's with 500 each, as they are the same amount we can just return the first row rather than both, it isnt important for me which row is returned since this will happen very rarely and my analysis doesnt need to be 100%)
Ive tried various things but do not seem to be able to return either unqiue results or the additional country code/country info that I need.
This is one of my attempts (based on other solutions on this site, but I am doing something wrong):
SELECT tsh.ECode, tsh.SHCode, tsh.TotalNrShare, tsh.CountryCode, tsh.Country
FROM dbo.MyView AS tsh INNER JOIN
(SELECT DISTINCT ECode, MAX(TotalNrShare) AS MaxTotalSH
FROM dbo.MyView
GROUP BY ECode) AS groupedtsh ON tsh.ECode = groupedtsh.ECode AND tsh.TotalNrShare = groupedtsh.MaxTotalSH
WITH
sequenced_data AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY ECode ORDER BY TotalNrShare) AS sequence_id
FROM
myView
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
This should, however, give the same results as your example query. It's simply a different approach to accomplish the same thing.
As you say that something is wrong, however, please could you elaborate on what is going wrong? Is TotalNrShare actually a string for example? And is that messing up your ordering (and so the MAX())?
EDIT:
Even if the above code was not compatible with your SQL Server, it shouldn't crash it out completely. You should just get an error message. Try executing Select * By Magic, for example, and it should just give an error. I strongly suggest getting your installation of Management Studio looked at and/or re-installed.
In terms of an alternative, you could do this...
SELECT
*
FROM
(SELECT ECode FROM MyView GROUP BY ECode) AS base
CROSS APPLY
(SELECT TOP 1 * FROM MyView WHERE ECode = base.ECode ORDER BY TotalNrShare DESC) AS data
Ideally you would replace the base sub-query with a table that already has a distinct list of all the ECodes that you are interested in.
try this;
with cte as(
SELECT tsh.ECode, tsh.SHCode, tsh.TotalNrShare, tsh.CountryCode, tsh.Country,
ROW_NUMBER() over (partition by ECode order by SHCode ) as row_num
FROM dbo.MyView)
select * from cte where row_num=1