SQL SELECT DISTINCT and GROUP BY both produce duplicates - sql

I have a decent-sized table with 20+ columns and almost 3 million rows, and I want to select all the unique values from a single column and insert them into a newly created table. After some research, I attempted this using both the DISTINCT and GROUP BY approaches, but both produce duplicate values. Furthermore, I've set the new column in the new table as a primary key, which I don't believe should allow duplicate values.
I'm definitely a beginner here, so perhaps there is something simple I'm doing wrong. Here's some sample code:
Using GROUP BY
INSERT INTO ResourceGroups(ResourceGroup)
SELECT ResourceGroup
FROM dbo.UsageData
WHERE ResourceGroup IS NOT NULL
GROUP BY ResourceGroup
Using DISTINCT
INSERT INTO ResourceGroups(ResourceGroup)
SELECT DISTINCT ResourceGroup
FROM dbo.UsageData
WHERE ResourceGroup IS NOT NULL
The results of both of these seem to be the same. Here's a sample of the first few rows:
ResourceGroup
aiiInnovationTime
Api-Default-Central-US
Api-Default-Central-US
applicationinsights
applicationinsights
azurefunctions-southeastasia
azurefunctions-southeastasia
The query returned 532 rows, so it clearly eliminated some duplicates when consolidating down from 3 million. However, there are obviously still duplicates here, and they were also successfully inserted into a primary key column, which shouldn't allow duplicates. Furthermore, there's a blank row despite my attempt to filter out NULLs (though maybe there's a space or something there?). Needless to say, I'm a bit confused about what I'm doing wrong, and would greatly appreciate any help that this community can provide!

Both of the queries you mentioned should give you unique results. The anomaly is most likely due to leading or trailing whitespace.
Depending on the database, you can modify the query accordingly, e.g.:
For Oracle: use the TRIM function, which removes both leading and trailing whitespace.
For SQL Server: older versions have no single function for this, so you have to combine LTRIM and RTRIM (TRIM was only added in SQL Server 2017).

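Assuming there are spaces or line breaks in your data:
SELECT DISTINCT
LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(REPLACE(ResourceGroup, CHAR(13) + CHAR(10), ' ... '),
CHAR(10) + CHAR(13), ' ... '), CHAR(13), ' '), CHAR(10), ' ... '))) AS ResourceGroup
FROM dbo.UsageData
WHERE LTRIM(RTRIM(ResourceGroup)) <> ''
LTRIM trims leading spaces and RTRIM trims trailing spaces, while the nested REPLACE calls substitute carriage returns (CHAR(13)) and line feeds (CHAR(10)). Comparing the trimmed value with '' filters out blank values as well as NULLs, because a comparison with NULL is never true. Try this out and see if it works!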

As Chetan Ranpariya mentioned, check for leading and trailing spaces. How you do that depends on the SQL engine. For instance, in MySQL you can use TRIM: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim.
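
To see why DISTINCT can appear to return duplicates, it can help to look at the length and raw bytes of each value. The following is only a minimal sketch, not the asker's actual setup: it uses Python's built-in sqlite3 module as a stand-in engine, and the sample rows (a trailing carriage return, a leading space, an empty string) are made up for illustration. Engines differ in how they compare trailing spaces, so the behaviour on SQL Server may not match exactly.

```python
import sqlite3

# In-memory stand-in for dbo.UsageData; the rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UsageData (ResourceGroup TEXT)")
conn.executemany(
    "INSERT INTO UsageData VALUES (?)",
    [
        ("applicationinsights",),
        ("applicationinsights\r",),      # trailing carriage return
        (" Api-Default-Central-US",),    # leading space
        ("Api-Default-Central-US",),
        ("",),                           # empty string, which is not NULL
        (None,),
    ],
)

# Plain DISTINCT: values that look identical on screen are still different.
# LENGTH and HEX expose the hidden characters.
raw_values = conn.execute(
    "SELECT DISTINCT ResourceGroup, LENGTH(ResourceGroup), HEX(ResourceGroup) "
    "FROM UsageData WHERE ResourceGroup IS NOT NULL ORDER BY ResourceGroup"
).fetchall()

# Strip spaces, tabs, CR and LF from both ends before applying DISTINCT,
# and drop values that are empty after trimming.
clean_values = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT TRIM(ResourceGroup, :ws) AS rg FROM UsageData "
        "WHERE ResourceGroup IS NOT NULL AND TRIM(ResourceGroup, :ws) <> '' "
        "ORDER BY rg",
        {"ws": " \t\r\n"},
    )
]

for value, length, hex_bytes in raw_values:
    print(repr(value), length, hex_bytes)
print(clean_values)
```

If the lengths or hex dumps of two "identical" values differ, the extra bytes show which characters need to be stripped before inserting into the key column.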

Related

postgresql - check if a row contains a string without considering spaces

Is it possible to check if a row contains a string without considering spaces?
Suppose I have a table like the one above. I want to know whether the query column contains a given string, even if that string has a different number of consecutive spaces than the stored one, or vice versa.
For example: the first row's query is select id, username from postgresql, and the one I want to look for in the table is:
select id, username
from postgresql
That is to say, the string I want to look for is indented differently and hence has a different number of spaces.
You can use REGEXP_REPLACE; this will likely be very slow on a large data set.
SELECT * from table
where REGEXP_REPLACE('select id, username from postgresql ', '\s+$', '') = REGEXP_REPLACE(query, '\s+$', '')
I think you would phrase this as:
where $str ~ replace('select id, username from postgresql', ' ', '[\s]+')
Note: This assumes that your string does not have other regular expression special characters.
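
The idea behind both answers is to compare the strings after normalizing whitespace. As a rough illustration outside the database, here is a small Python sketch; the helper name normalize_whitespace and the sample strings are made up for this example. Inside PostgreSQL, regexp_replace(query, '\s+', ' ', 'g') performs a similar collapse, where the 'g' flag makes it replace every run rather than only the first one.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stored value and a differently indented candidate.
stored = "select id, username from postgresql"
candidate = "select id, username\n    from postgresql"

# The two strings differ byte for byte but match once normalized.
same_query = normalize_whitespace(stored) == normalize_whitespace(candidate)
print(same_query)
```

Normalizing both sides avoids building a regular expression from the search string, so special characters in the query text do not need escaping.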

Trailing spaces can't be removed in WHERE clause?

I need to list the records where column_A and column_B match; both are of type char(10). The issue is that column_A has trailing spaces in it, so even though both columns hold the same data, they don't match.
I tried LTRIM(RTRIM(column_A)), REPLACE(column_A, CHAR(32), ''), etc., but none of them work.
Could someone suggest another method to solve this issue?
Note: The methods mentioned above work fine in the SELECT clause.
Thanks in advance.
This should work:
WHERE LTRIM(RTRIM(COLA)) = LTRIM(RTRIM(COLB))

Remove a comma in a number in Redshift

My data contains numbers like 100,000.89 and so on. What function should I use in Redshift to remove the comma and keep it like 100000.89? Do we write the function while creating a table since it is at column level or after its creation and then post process the table?
To remove commas from text columns, use replace():
select replace(col, ',', '')
from t
EDIT: In case of NULL data, use coalesce():
select coalesce(replace(col,',', ''), '')
from t
I just wrapped all the columns in my insert query with coalesce, since all of them had null values somewhere, and it worked like a charm. The Redshift error for missing data in NOT NULL fields is misleading here, as mentioned in: https://forums.aws.amazon.com/thread.jspa?threadID=119640
I changed my copy command too, and added BLANKSASNULL. This worked! Thanks for all your help. Below is my command:
insert into test.t_final
(select
coalesce(project_number) as project_number,
coalesce(contract_po) as contract_po,
coalesce(tracking_date) as tracking_date,
(coalesce(replace(amount,',',''))) as amount,
(coalesce(replace(tax,',',''))) as tax,
(coalesce(replace(contract_value,',',''))) as contract_value,
coalesce(comments) from test.t)
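
The replace() and coalesce() combination from the answer can be tried on a toy table. This is only a sketch: it runs against SQLite through Python's built-in sqlite3 module rather than Redshift, and the table t with a single text column amount is a made-up example.

```python
import sqlite3

# Toy table with formatted numbers stored as text; one value is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (amount TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("100,000.89",), ("1,234",), (None,)],
)

# REPLACE strips the thousands separators; COALESCE turns NULL into ''.
cleaned = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(REPLACE(amount, ',', ''), '') FROM t ORDER BY rowid"
    )
]

# Once the commas are gone, the non-empty strings parse as numbers.
numeric = [float(value) for value in cleaned if value != ""]

print(cleaned)
print(numeric)
```

Note that replacing NULL with an empty string only makes sense for text columns; for a numeric target column, a numeric default or keeping NULL is usually the safer choice.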

SQL - find rows with values that has 2 spaces or more in between

I have an SQL table with name and surname. The surname is in its own column. The problem is with users who have two surnames: sometimes they put more than one space between the surnames, and then I have to find and fix them manually.
How to find these surnames with more than one space in between?
If you want to find records which have more than one space then you can use the following trick:
SELECT surname
FROM yourTable
WHERE LENGTH(REPLACE(surname, ' ', '')) < LENGTH(surname) - 1
This query will detect two or more spaces anywhere in the surname column, whether or not they are adjacent. If you also want to do an UPDATE, that is possible, but it would be fairly database-specific, and you had not specified your database at the time I wrote this answer.
First remove those extra spaces. Then add a constraint that makes sure it doesn't happen again:
alter table tablename add constraint surname_verify check (surname not like '%  %')
(Or, even better, have a trigger making sure the surnames are properly spaced, cased etc.)
How to remove extra spaces? Depends on the dbms.
You can perhaps do something like:
update tablename set surname = replace(surname, '  ', ' ')
where surname like '%  %'
The where clause isn't needed, but makes the transaction much smaller.
(Iterate to get rid of triple or more spaces.) Or use regexp_replace.
Even tidier:
select string = replace(replace(replace('  select   single     spaces',' ','<>'),'><',''),'<>',' ')
Output (each run of spaces, including the leading one, is reduced to a single space):
 select single spaces
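
The detection and clean-up approaches above can be compared side by side on a few sample values. This is a sketch under assumptions: it uses SQLite through Python's built-in sqlite3 module, and the people table and its surnames are hypothetical. The strings are built with explicit space counts so the number of spaces is unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (surname TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [
        ("Garcia Lopez",),               # one space: fine
        ("Garcia" + " " * 2 + "Lopez",),  # two adjacent spaces
        ("de la Cruz",),                 # two spaces, but not adjacent
        ("Smith" + " " * 4 + "Jones",),   # four adjacent spaces
    ],
)

# LENGTH trick: flags any value containing two or more spaces in total.
by_length = [
    row[0]
    for row in conn.execute(
        "SELECT surname FROM people "
        "WHERE LENGTH(REPLACE(surname, ' ', '')) < LENGTH(surname) - 1 "
        "ORDER BY rowid"
    )
]

# LIKE with a double space: flags only values with adjacent spaces.
by_like = [
    row[0]
    for row in conn.execute(
        "SELECT surname FROM people WHERE surname LIKE ? ORDER BY rowid",
        ("%" + " " * 2 + "%",),
    )
]

# Triple REPLACE: collapses every run of spaces into a single space.
# It assumes '<' and '>' never occur in the data.
collapsed = [
    row[0]
    for row in conn.execute(
        "SELECT REPLACE(REPLACE(REPLACE(surname, ' ', '<>'), '><', ''), '<>', ' ') "
        "FROM people ORDER BY rowid"
    )
]

print(by_length)
print(by_like)
print(collapsed)
```

The LENGTH check also reports "de la Cruz", which has two separate single spaces, so the LIKE pattern is the closer match when only repeated spaces should be flagged.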

SQL Query - Remove Commas Within Select Statement

I am running a SQL query on some customer data. Within my SELECT statement I ideally need to remove all commas. (I am not looking to update my source tables.)
Not certain how to do this on all fields and all columns, so any help or insight would be much appreciated.
If you need any additional information, please let me know.
Thank you,
Allison
I assume that you need to remove the commas from the data in your columns,
replacing each comma with a space:
Select Replace(ColA,',',' ') as ColA, Replace(ColB,',',' ') as ColB ...
From