SQL - Count instances in 1 field with variations in 2nd

Thanks for taking a second to read this... Have tried this in SQL, Python, and VBA with no luck (various reasons).
The data and concept are, I think, pretty simple - but I can't seem to make it work.
Column 1 has a stock market ticker; Column 2 has the company name. However, many of the company names have been truncated, or have changed over time. I want to find every instance of each ticker, where that ticker has more than 1 name.
So for example, my file has these 4 lines
IBM | Int Bus Mach
IBM | International Business M
IBM | Intl Bus Machines
IBM | Int Bus Mach
I would like to see the 3 unique company names
IBM | Int Bus Mach
IBM | International Business M
IBM | Intl Bus Machines
Any ideas are certainly appreciated!
Thanks!

According to your data examples, you should do something like this:
SELECT
market_ticker,
company_name,
count(*)
FROM yourTable
GROUP BY
market_ticker,
company_name
The extra column will give how many times the market ticker and company name repeat.
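If only the tickers that really have more than one name are wanted, the grouping idea can be combined with a HAVING filter. Below is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine; the table name yourTable and the MSFT rows are made up for illustration.

```python
import sqlite3

# Hypothetical table and sample rows; SQLite is only a stand-in engine here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (market_ticker TEXT, company_name TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [
        ("IBM", "Int Bus Mach"),
        ("IBM", "International Business M"),
        ("IBM", "Intl Bus Machines"),
        ("IBM", "Int Bus Mach"),
        ("MSFT", "Microsoft Corp"),  # one name only, so it should be filtered out
        ("MSFT", "Microsoft Corp"),
    ],
)

query = """
SELECT DISTINCT t.market_ticker, t.company_name
FROM yourTable AS t
WHERE t.market_ticker IN (
    -- tickers that appear with at least two different names
    SELECT market_ticker
    FROM yourTable
    GROUP BY market_ticker
    HAVING COUNT(DISTINCT company_name) > 1
)
ORDER BY t.market_ticker, t.company_name
"""
rows = conn.execute(query).fetchall()
for ticker, name in rows:
    print(ticker, "|", name)
```

With the sample rows this lists the three IBM spellings once each and leaves MSFT out.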

Here is a two-step SQL solution.
First, a simple DISTINCT query gets all ticker and company combinations and numbers the names within each ticker:
select *,
ROW_NUMBER() OVER(PARTITION BY TickerColumn ORDER BY NameColumn) AS NameNum
into #Temp
From
(
Select distinct TickerColumn, NameColumn
from Table
)x
Then pivot the numbered names into columns:
Select TickerColumn, [1],[2],[3],[4]
From
(select * from #temp) as Source
Pivot
(
Max(NameColumn)
For NameNum in ([1],[2],[3],[4]) ---You can add or reduce number of columns
) As PivotTable;
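PIVOT is SQL Server syntax. On engines without it, the same shape can probably be produced with conditional aggregation. The sketch below assumes SQLite 3.25 or later (for ROW_NUMBER) through Python's sqlite3 module; the table name Tickers, the MSFT row and the Name1..Name4 aliases are made up.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for the real engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tickers (TickerColumn TEXT, NameColumn TEXT)")
conn.executemany(
    "INSERT INTO Tickers VALUES (?, ?)",
    [
        ("IBM", "Int Bus Mach"),
        ("IBM", "International Business M"),
        ("IBM", "Intl Bus Machines"),
        ("IBM", "Int Bus Mach"),
        ("MSFT", "Microsoft Corp"),
    ],
)

pivot_sql = """
WITH numbered AS (
    -- step 1: distinct ticker/name pairs, numbered within each ticker
    SELECT TickerColumn,
           NameColumn,
           ROW_NUMBER() OVER (PARTITION BY TickerColumn ORDER BY NameColumn) AS NameNum
    FROM (SELECT DISTINCT TickerColumn, NameColumn FROM Tickers) AS d
)
-- step 2: one MAX(CASE ...) per output column replaces PIVOT
SELECT TickerColumn,
       MAX(CASE WHEN NameNum = 1 THEN NameColumn END) AS Name1,
       MAX(CASE WHEN NameNum = 2 THEN NameColumn END) AS Name2,
       MAX(CASE WHEN NameNum = 3 THEN NameColumn END) AS Name3,
       MAX(CASE WHEN NameNum = 4 THEN NameColumn END) AS Name4
FROM numbered
GROUP BY TickerColumn
ORDER BY TickerColumn
"""
pivoted = conn.execute(pivot_sql).fetchall()
for row in pivoted:
    print(row)
```

As with the PIVOT version, the number of name columns is fixed in the query, so a ticker with more than four names would lose the extras.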

Maybe try something like this; it should bring back the distinct names.
WITH cte
AS (
SELECT market_ticker
,Company_name
FROM Stocks
)
SELECT a.market_ticker
,b.Company_name
FROM cte a
INNER JOIN stocks b
ON a.market_ticker = b.market_ticker
GROUP BY a.market_ticker
,b.company_name
HAVING count(DISTINCT a.company_name) >= 2
ORDER BY 1
,2

Related

How to query partitions that you get from using window functions?

I have a table that has the following structure
------------------------------------
|Company_ID| Company_Name| Join_Key|
------------------------------------
| 1 | ACompany | AC |
| 2 | BCompany | BC |
While this table doesn't have many columns, there are somewhere around 4 million rows.
I want to calculate some string distance calculations on these company names. I have the following query
select a.Company_Name as Name1,
b.Company_Name as Name2,
Fuzzy_Match(a.Company_Name, b.Company_Name, 'JaccardDistance') as Jaccard --this is a custom function
from [Companies] a, [Companies] b
While something like this would work on a smaller database, since my database is so large, there is no way for me to be able to get through all of the combinations in a reasonable amount of time. So I thought about partitioning the database with a window function.
select Company_Name,
ROW_NUMBER() over(partition by Join_Key order by Join_Key asc) as row_num,
Join_Key
from [Companies]
This gives me a list of the companies numbered and partitioned by their join_key, but the thing that I'm not sure of is how to do both things.
How can I perform a cross join and calculate the string similarity measures for each partition so that I'm only comparing companies that both have 'AC' as their join key?
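One way this might be approached is to skip the window function and turn the cross join into a self-join on Join_Key, so that pairs are only formed inside each key. The sketch below assumes Python's sqlite3 module as a stand-in engine. The jaccard_distance function is a hypothetical replacement for the custom Fuzzy_Match (a character-set Jaccard distance, which may not match the real function), and the extra sample rows are invented.

```python
import sqlite3


def jaccard_distance(a, b):
    """Hypothetical stand-in for Fuzzy_Match: Jaccard distance on character sets."""
    set_a, set_b = set(a.lower()), set(b.lower())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


conn = sqlite3.connect(":memory:")
conn.create_function("jaccard_distance", 2, jaccard_distance)
conn.execute(
    "CREATE TABLE Companies (Company_ID INTEGER PRIMARY KEY, Company_Name TEXT, Join_Key TEXT)"
)
conn.executemany(
    "INSERT INTO Companies VALUES (?, ?, ?)",
    [
        (1, "ACompany", "AC"),
        (2, "AComp", "AC"),
        (3, "BCompany", "BC"),
        (4, "BCo", "BC"),
        (5, "A Company Inc", "AC"),
    ],
)

pairs_sql = """
SELECT a.Company_Name AS Name1,
       b.Company_Name AS Name2,
       a.Join_Key,
       jaccard_distance(a.Company_Name, b.Company_Name) AS Jaccard
FROM Companies AS a
JOIN Companies AS b
  ON a.Join_Key = b.Join_Key        -- compare only within the same key
 AND a.Company_ID < b.Company_ID    -- each unordered pair once, no self-pairs
ORDER BY a.Company_ID, b.Company_ID
"""
pairs = conn.execute(pairs_sql).fetchall()
for row in pairs:
    print(row)
```

The condition a.Company_ID < b.Company_ID halves the work compared with a plain join on the key. On a table with millions of rows an index on Join_Key would likely matter, and very large key groups would still produce many pairs.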

SQL Remove Duplicates, save lowest of certain column

I've been looking for an answer to this but couldn't find anything the same as this particular situation.
So I have one table that I want to remove duplicates from.
__________________
| JobNumber-String |
| JobOp - Number |
------------------
So there are multiples of these two values; together they make the key for the row. I want to keep all distinct job numbers with the lowest job op. How can I do this? I've tried a bunch of things, mainly the MIN function, but that only seems to work on the entire table, not on each JobNumber set. Thanks!
Original Table Values:
JobNumber Jobop
123 100
123 101
456 200
456 201
780 300
Code Ran:
DELETE FROM table
WHERE CONCAT(JobNumber,JobOp) NOT IN
(
SELECT CONCAT(JobNumber,MIN(JobOp))
FROM table
GROUP BY JobNumber
)
Ending Table Values:
JobNumber Jobop
123 100
456 200
780 300
With SQL Server 2008 or higher you can enhance the MIN function with an OVER clause specifying a PARTITION BY section.
Please have a look at https://msdn.microsoft.com/en-us/library/ms189461.aspx
You can simply select the values you want to keep:
select JobNumber, min(JobOp) from table group by JobNumber
Then you can delete the records you don't want:
DELETE t FROM table t
left JOIN (select JobNumber, min(JobOp) as minJobOp from table group by JobNumber) e
ON t.JobNumber = e.JobNumber and t.JobOp = e.minJobOp
Where e.JobNumber is null
I like to do this with window functions:
with todelete as (
select t.*, min(jobop) over (partition by JobNumber) as minjop
from table t
)
delete from todelete
where jobop > minjop;
It sounds like you are not using the correct GROUP BY clause when using the MIN function. This sql should give you the minimum JobOp value for each JobNumber:
SELECT JobNumber, MIN(JobOp) FROM test.so_test GROUP BY JobNumber;
Using this in a subquery, along with CONCAT (this is from MySQL; SQL Server might use a different function) because both fields form your key, gives you this SQL, which returns the rows that are not the lowest JobOp for their JobNumber:
SELECT * FROM so_test WHERE CONCAT(JobNumber,JobOp)
NOT IN (SELECT CONCAT(JobNumber,MIN(JobOp)) FROM test.so_test GROUP BY JobNumber);
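For a quick check of the idea against the sample data, here is a minimal sketch using Python's sqlite3 module as a stand-in engine; the table name jobs is made up. It compares the two key columns directly with a correlated subquery instead of CONCAT, since concatenated keys can in principle collide (for example '12' + '3100' and '123' + '100' give the same string).

```python
import sqlite3

# Hypothetical table name; rows are the sample values from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (JobNumber TEXT, JobOp INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("123", 100), ("123", 101), ("456", 200), ("456", 201), ("780", 300)],
)

# Delete every row whose JobOp is above the minimum for its own JobNumber.
conn.execute(
    """
    DELETE FROM jobs
    WHERE JobOp > (
        SELECT MIN(j2.JobOp)
        FROM jobs AS j2
        WHERE j2.JobNumber = jobs.JobNumber
    )
    """
)

remaining = conn.execute(
    "SELECT JobNumber, JobOp FROM jobs ORDER BY JobNumber"
).fetchall()
for row in remaining:
    print(row)
```

The remaining rows match the "Ending Table Values" shown in the question.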

sql merge tables side-by-side with nothing in common

I'm looking for an sql answer on how to merge two tables without anything in common.
So let's say you have these two tables without anything in common:
Guys              Girls
id  name          id  name
--- -------       --- ------
1   abraham       5   sarah
2   isaak         6   rachel
3   jacob         7   rebeka
                  8   leah
and you want to merge them side-by-side like this:
Couples
id  name          id  name
--- -------       --- ------
1   abraham       5   sarah
2   isaak         6   rachel
3   jacob         7   rebeka
                  8   leah
How can this be done?
You can do this by creating a key, which is the row number, and joining on it.
Most dialects of SQL support the row_number() function. Here is an approach using it:
select gu.id, gu.name, gi.id, gi.name
from (select g.*, row_number() over (order by id) as seqnum
from guys g
) gu full outer join
(select g.*, row_number() over (order by id) as seqnum
from girls g
) gi
on gu.seqnum = gi.seqnum;
Just because I wrote it up anyway, an alternative using CTEs;
WITH guys2 AS ( SELECT id,name,ROW_NUMBER() OVER (ORDER BY id) rn FROM guys),
girls2 AS ( SELECT id,name,ROW_NUMBER() OVER (ORDER BY id) rn FROM girls)
SELECT guys2.id guyid, guys2.name guyname,
girls2.id girlid, girls2.name girlname
FROM guys2 FULL OUTER JOIN girls2 ON guys2.rn = girls2.rn
ORDER BY COALESCE(guys2.rn, girls2.rn);
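The row-number pairing can be tried out with the sample data as follows. This is a sketch under a few assumptions: Python's sqlite3 module as a stand-in engine, SQLite 3.25 or later for ROW_NUMBER, and the FULL OUTER JOIN emulated with two LEFT JOINs, because older SQLite versions do not have it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE guys (id INTEGER, name TEXT);
    CREATE TABLE girls (id INTEGER, name TEXT);
    INSERT INTO guys VALUES (1, 'abraham'), (2, 'isaak'), (3, 'jacob');
    INSERT INTO girls VALUES (5, 'sarah'), (6, 'rachel'), (7, 'rebeka'), (8, 'leah');
    """
)

merge_sql = """
WITH guys2 AS (SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM guys),
     girls2 AS (SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM girls)
-- every guy, with the girl that has the same row number (if any)
SELECT guys2.id, guys2.name, girls2.id, girls2.name, guys2.rn
FROM guys2 LEFT JOIN girls2 ON guys2.rn = girls2.rn
UNION ALL
-- girls left over once the guys run out
SELECT NULL, NULL, girls2.id, girls2.name, girls2.rn
FROM girls2 LEFT JOIN guys2 ON guys2.rn = girls2.rn
WHERE guys2.rn IS NULL
ORDER BY 5
"""
# Drop the helper row number before printing.
couples = [row[:4] for row in conn.execute(merge_sql)]
for row in couples:
    print(row)
```

The last row has NULLs (None in Python) on the guys side, matching the blank cells next to leah in the question.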
Assuming, you want to match guys up with girls in your example, and have some sort of meaningful relationship between the records (no pun intended)...
Typically you'd do this with a separate table to represent the association (relationship) between the two.
This wouldn't give you a physical table, but it would enable you to write an SQL query representing the final results:
SELECT Girls.ID AS GirlId, Girls.Name AS GirlName, Guys.ID AS GuyId, Guys.Name AS GuyName
FROM Couples INNER JOIN
Girls ON Couples.GirlId = Girls.ID INNER JOIN
Guys ON Couples.GuyId = Guys.ID
which you could then use to create a table on the fly using the Select Into syntax
SELECT Girls.ID AS GirlId, Girls.Name AS GirlName, Guys.ID AS GuyId, Guys.Name AS GuyName
INTO MyNewTable
FROM Couples INNER JOIN
Girls ON Couples.GirlId = Girls.ID INNER JOIN
Guys ON Couples.GuyId = Guys.ID
(But standard Normalization rules would say it's best to keep them in distinct tables rather than creating a temp table, unless there's a performance reason not to do so.)
I need this all the time when creating templates in Excel using input from my tables. This pulls from one table that has my regions and another with the quarters in a year. With no join condition it is a cross join, so the result gives me one row per region name for each quarter/period.
SELECT b.quarter_qty, a.mkt_name FROM TBL_MKTS a, TBL_PERIODS b

How to concatenate rows delimited with comma using standard SQL?

Let's suppose we have a table T1 and a table T2. There is a relation of 1:n between T1 and T2. I would like to select all T1 along with all their T2, every row corresponding to T1 records with T2 values concatenated, using only SQL-standard operations.
Example:
T1 = Person
T2 = Popularity (by year)
for each year a person has a certain popularity
I would like to write a selection using SQL-standard operations, resulting something like this:
Person.Name Popularity.Value
John Smith 1.2,5,4.2
John Doe NULL
Jane Smith 8
where there are 3 records in the popularity table for John Smith, none for John Doe and one for Jane Smith, their values being the values represented above. Is this possible? How?
I'm using Oracle but would like to do this using only standard SQL.
Here's one technique, using recursive Common Table Expressions. Unfortunately, I'm not confident on its performance.
I'm sure that there are ways to improve this code, but it shows that there doesn't seem to be an easy way to do something like this using just the SQL standard.
As far as I can see, there really should be some kind of STRINGJOIN aggregate function that would be used with GROUP BY. That would make things like this much easier...
This query assumes that there is some kind of PersonID that joins the two relations, but the Name would work too.
WITH cte (id, Name, Value, ValueCount) AS (
SELECT id,
Name,
CAST(Value AS VARCHAR(MAX)) AS Value,
1 AS ValueCount
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) AS id,
Name,
Value
FROM Person AS per
INNER JOIN Popularity AS pop
ON per.PersonID = pop.PersonID
) AS e
WHERE id = 1
UNION ALL
SELECT e.id,
e.Name,
cte.Value + ',' + CAST(e.Value AS VARCHAR(MAX)) AS Value,
cte.ValueCount + 1 AS ValueCount
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) AS id,
Name,
Value
FROM Person AS per
INNER JOIN Popularity AS pop
ON per.PersonID = pop.PersonID
) AS e
INNER JOIN cte
ON e.id = cte.id + 1
AND e.Name = cte.Name
)
SELECT p.Name, agg.Value
FROM Person p
LEFT JOIN (
SELECT Name, Value
FROM (
SELECT Name,
Value,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ValueCount DESC)AS id
FROM cte
) AS p
WHERE id = 1
) AS agg
ON p.Name = agg.Name
This is an example result:
--------------------------------
| Name | Value |
--------------------------------
| John Smith | 1.2,5,4.2 |
--------------------------------
| John Doe | NULL |
--------------------------------
| Jane Smith | 8 |
--------------------------------
In Oracle you can use LISTAGG to achieve this:
select t1.Person_Name, listagg(t2.Popularity_Value, ',')
within group(order by t2.Popularity_Value)
from t1, t2
where t1.Person_Name = t2.Person_Name (+)
group by t1.Person_Name
I hope this will solve your problem.
But as for the comment you gave after @DavidJashi's question: this is not standard SQL, and I think he is correct. I agree with David that you cannot achieve this in a pure SQL statement.
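To show what a string-aggregate function gives for the example in the question, here is a small sketch using SQLite's GROUP_CONCAT through Python's sqlite3 module. Like LISTAGG on Oracle, this is an engine-specific aggregate and not the "standard SQL only" answer the question asks for. The PersonID and Year columns and the year values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: PersonID links the 1:n relation.
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Popularity (PersonID INTEGER, Year INTEGER, Value NUMERIC);
    INSERT INTO Person VALUES (1, 'John Smith'), (2, 'John Doe'), (3, 'Jane Smith');
    INSERT INTO Popularity VALUES (1, 2001, 1.2), (1, 2002, 5), (1, 2003, 4.2), (3, 2001, 8);
    """
)

agg_sql = """
SELECT p.Name, GROUP_CONCAT(pop.Value, ',') AS Vals
FROM Person AS p
LEFT JOIN Popularity AS pop ON pop.PersonID = p.PersonID  -- keep people with no rows
GROUP BY p.PersonID, p.Name
ORDER BY p.PersonID
"""
result = conn.execute(agg_sql).fetchall()
for name, vals in result:
    print(name, vals)
```

John Doe comes back with NULL because the LEFT JOIN finds no popularity rows for him. SQLite does not guarantee the order of the concatenated values, so the list for John Smith may not come out in year order.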
I know that I'm SUPER late to the party, but for anyone else that might find this, I don't believe that this is possible using pure SQL92. As I discovered in the last few months fighting with NetSuite to try to figure out what Oracle methods I can and cannot use with their ODBC driver, I discovered that they only "support and guarantee" SQL92 standard.
I discovered this, because I had a need to perform a LISTAGG(). Once I found out I was restricted to SQL92, I did some digging through the historical records, and LISTAGG() and recursive queries (common table expressions) are NOT supported in SQL92, at all.
LISTAGG() was added in Oracle SQL version 11g Release 2 (2009 – 11 years ago: reference https://oracle-base.com/articles/misc/string-aggregation-techniques#listagg). CTEs were added to Oracle SQL in version 9.2 (2007 – 13 years ago: reference https://www.databasestar.com/sql-cte-with/).
VERY frustrating that it's completely impossible to accomplish this kind of effect in pure SQL92, so I had to solve the problem in my C# code after I pulled a ton of extra unnecessary data. Very frustrating.

How to Select and Order By columns not in Group By SQL statement - Oracle

I have the following statement:
SELECT
IMPORTID,Region,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,RefObligor
There exists some extra columns in table Positions that I want as output for "display data" but I don't want in the group by statement.
These are Site, Desk
Final output would have the following columns:
IMPORTID,Region,Site,Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
Ideally I'd want the data sorted like:
Order BY
IMPORTID,Region,Site,Desk,RefObligor
How to achieve this?
It does not make sense to include columns that are not part of the GROUP BY clause. Consider if you have a MIN(X), MAX(Y) in the SELECT clause, which row should other columns (not grouped) come from?
If your Oracle version is recent enough, you can use SUM - OVER() to show the SUM (grouped) against every data row.
SELECT
IMPORTID,Site,Desk,Region,RefObligor,
SUM(NOTIONAL) OVER(PARTITION BY IMPORTID, Region,RefObligor) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
Order BY
IMPORTID,Region,Site,Desk,RefObligor
Alternatively, you need to make an aggregate out of the Site, Desk columns
SELECT
IMPORTID,Region,Min(Site) Site, Min(Desk) Desk,RefObligor,SUM(NOTIONAL) AS SUM_NOTIONAL
From
Positions
Where
ID = :importID
GROUP BY
IMPORTID, Region,RefObligor
Order BY
IMPORTID, Region,Min(Site),Min(Desk),RefObligor
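The difference between the two options is easier to see on a few rows. The sketch below uses Python's sqlite3 module as a stand-in for Oracle (SQLite 3.25 or later for the window function). The sample rows are invented, and the filter uses IMPORTID because the made-up table has no separate ID column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Invented sample data for illustration only.
    CREATE TABLE Positions (
        IMPORTID INTEGER, Region TEXT, Site TEXT, Desk TEXT,
        RefObligor TEXT, NOTIONAL REAL
    );
    INSERT INTO Positions VALUES
        (1, 'EMEA', 'London', 'Rates',  'ACME', 100),
        (1, 'EMEA', 'Paris',  'Credit', 'ACME', 50),
        (1, 'APAC', 'Tokyo',  'Rates',  'ACME', 70);
    """
)

# Option 1: window sum, every detail row is kept and carries its group total.
window_sql = """
SELECT IMPORTID, Region, Site, Desk, RefObligor,
       SUM(NOTIONAL) OVER (PARTITION BY IMPORTID, Region, RefObligor) AS SUM_NOTIONAL
FROM Positions
WHERE IMPORTID = :importID
ORDER BY IMPORTID, Region, Site, Desk, RefObligor
"""
windowed = conn.execute(window_sql, {"importID": 1}).fetchall()

# Option 2: aggregate the display columns, one row per group.
grouped_sql = """
SELECT IMPORTID, Region, MIN(Site) AS Site, MIN(Desk) AS Desk, RefObligor,
       SUM(NOTIONAL) AS SUM_NOTIONAL
FROM Positions
WHERE IMPORTID = :importID
GROUP BY IMPORTID, Region, RefObligor
ORDER BY IMPORTID, Region, MIN(Site), MIN(Desk), RefObligor
"""
grouped = conn.execute(grouped_sql, {"importID": 1}).fetchall()

for row in windowed:
    print("window ", row)
for row in grouped:
    print("grouped", row)
```

Note that in the grouped result the EMEA row shows Site 'London' with Desk 'Credit', a combination that does not exist in any single source row, because MIN is applied to each column independently.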
I believe this is
select
IMPORTID,
Region,
Site,
Desk,
RefObligor,
Sum(Sum(Notional)) over (partition by IMPORTID, Region, RefObligor)
from
Positions
group by
IMPORTID, Region, Site, Desk, RefObligor
order by
IMPORTID, Region, RefObligor, Site, Desk;
... but it's hard to tell without further information and/or test data.
A great blog post that covers this dilemma in detail is here:
http://bernardoamc.github.io/sql/2015/05/04/group-by-non-aggregate-columns/
Here are some snippets of it:
Given:
CREATE TABLE games (
game_id serial PRIMARY KEY,
name VARCHAR,
price BIGINT,
released_at DATE,
publisher TEXT
);
INSERT INTO games (name, price, released_at, publisher) VALUES
('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
('Project Druid', 20, '2015-05-01', 'shortcircuit'),
('Chroma Squad', 40, '2015-04-30', 'Behold Studios'),
('Soul Locus', 30, '2015-04-30', 'Fat Loot Games'),
('Subterrain', 40, '2015-04-30', 'Pixellore');
SELECT * FROM games;
game_id | name | price | released_at | publisher
---------+--------------------+-------+-------------+----------------
1 | Metal Slug Defense | 30 | 2015-05-01 | SNK Playmore
2 | Project Druid | 20 | 2015-05-01 | shortcircuit
3 | Chroma Squad | 40 | 2015-04-30 | Behold Studios
4 | Soul Locus | 30 | 2015-04-30 | Fat Loot Games
5 | Subterrain | 40 | 2015-04-30 | Pixellore
(5 rows)
Trying to get something like this:
SELECT released_at, name, publisher, MAX(price) as most_expensive
FROM games
GROUP BY released_at;
But name and publisher are not added due to being ambiguous when aggregating...
Let’s make this clear:
Selecting the MAX(price) does not select the entire row.
The database can’t know and when it can’t give the right answer every
time for a given query it should give us an error, and that’s what it
does!
Ok… Ok… It’s not so simple, what can we do?
Use an inner join to get the additional columns
SELECT g1.name, g1.publisher, g1.price, g1.released_at
FROM games AS g1
INNER JOIN (
SELECT released_at, MAX(price) as price
FROM games
GROUP BY released_at
) AS g2
ON g2.released_at = g1.released_at AND g2.price = g1.price;
Or Use a left outer join to get the additional columns, and then filter by the NULL of a duplicate column...
SELECT g1.name, g1.publisher, g1.price, g2.price, g1.released_at
FROM games AS g1
LEFT OUTER JOIN games AS g2
ON g1.released_at = g2.released_at AND g1.price < g2.price
WHERE g2.price IS NULL;
Hope that helps.
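Both join variants can be run against the games data from the blog snippet. This is a sketch assuming Python's sqlite3 module as a stand-in for PostgreSQL, so the column types are simplified and an ORDER BY is added to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE games (
        game_id INTEGER PRIMARY KEY, name TEXT, price INTEGER,
        released_at TEXT, publisher TEXT
    );
    INSERT INTO games (name, price, released_at, publisher) VALUES
        ('Metal Slug Defense', 30, '2015-05-01', 'SNK Playmore'),
        ('Project Druid', 20, '2015-05-01', 'shortcircuit'),
        ('Chroma Squad', 40, '2015-04-30', 'Behold Studios'),
        ('Soul Locus', 30, '2015-04-30', 'Fat Loot Games'),
        ('Subterrain', 40, '2015-04-30', 'Pixellore');
    """
)

# Variant 1: join back to the per-date maximum price.
inner_sql = """
SELECT g1.name, g1.publisher, g1.price, g1.released_at
FROM games AS g1
INNER JOIN (
    SELECT released_at, MAX(price) AS price
    FROM games
    GROUP BY released_at
) AS g2
  ON g2.released_at = g1.released_at AND g2.price = g1.price
ORDER BY g1.released_at, g1.name
"""
via_inner = conn.execute(inner_sql).fetchall()

# Variant 2: keep rows for which no more expensive game exists on the same date.
anti_sql = """
SELECT g1.name, g1.publisher, g1.price, g1.released_at
FROM games AS g1
LEFT OUTER JOIN games AS g2
  ON g1.released_at = g2.released_at AND g1.price < g2.price
WHERE g2.price IS NULL
ORDER BY g1.released_at, g1.name
"""
via_anti = conn.execute(anti_sql).fetchall()

for row in via_inner:
    print(row)
```

Both queries return the same rows here. Ties are kept: 2015-04-30 has two games priced at 40, so both appear.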