Postgres Query to match column - sql

I need help identifying the matching records and filtering them out.
Input:
field1       field2
test_data    test data
test_data    test_data
test_data    test_data
test_data    test_data
test_data    test_da
test data    test_data
Output:
I need a query that identifies only those records that do not match:
I have tried the query below:
select *
from test.table
where lower(trim(field1)) != replace(lower(trim(field2)), '_', ' ')
  and lower(trim(field1)) != lower(trim(field2))
I am getting the expected output, but I would like to know whether there is a cleaner or better-performing query for this.
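One equivalent rewrite, offered only as a sketch rather than a definitive improvement: Postgres lets you compute each normalized value once in a CROSS JOIN LATERAL subquery, so the lower(trim(...)) expressions are written a single time. Table and column names match the question; the alias n is made up for illustration.
select t.*
from test.table t
cross join lateral (
    -- normalize both fields once so the WHERE clause stays readable
    select lower(trim(t.field1)) as f1,
           lower(trim(t.field2)) as f2
) n
where n.f1 != replace(n.f2, '_', ' ')  -- still differ after '_' -> ' '
  and n.f1 != n.f2;                    -- and differ as-is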

Related

Concatenate IDs only IF not unique in table

I am a bit stuck writing a query that might be easy for some of you. I am working with Redshift in Coginiti. This is the problem I want to solve:
I have a big table, but for this particular query I will only use 3 columns: ID, X, Y.
The requirement: if ID is unique, I should leave it as is. If ID is not unique, I want to concatenate ID, X, and Y. I am not looking to overwrite the column but rather to create a new column I would call NEW_ID:
if ID is unique in table T-->ID
else concatenate(ID,X,Y) using '_' as delimiter
I had a rough idea for a solution: write a subquery to get the count of each ID, then an if statement saying if count(ID) = 1 then ID, else the concatenation, but I am blanking on how to actually implement it in SQL.
Thanks, I appreciate your help in advance :)
SELECT *, CONCAT(ID, X, Y)
FROM table
LEFT JOIN ... -- got stuck here on how to tie it to the next part
SELECT ID, COUNT(ID)
FROM table
GROUP BY ID
HAVING COUNT(ID) <> 1 -- or perhaps = 1; I need to work with all values anyway
This should be straightforward. On Redshift I like the decode() function for cases like this, but a CASE expression works just as well.
select id, x, y,
       decode(id_count, 1, id::text,                          -- unique: keep id
              id::text || '_' || x || '_' || y) as concat_col -- duplicate: concatenate
from (
    select id, x, y, count(*) over (partition by id) as id_count
    from <table>
) t;
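For reference, a minimal CASE version of the same idea, using the same placeholder table name; it assumes id, x, and y cast cleanly to text:
select id, x, y,
       case when id_count = 1 then id::text
            else id::text || '_' || x || '_' || y
       end as new_id
from (
    select id, x, y, count(*) over (partition by id) as id_count
    from <table>
) t;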

Deduplicate rows in complex schema in a BigQuery partition

I have read some threads, but I know too little SQL to solve my problem.
I have a table with a complex schema with records and nested fields.
Below you see a query which finds the exact row that I need to deduplicate.
SELECT *
FROM `my-data-project-214805.rfid_data.rfid_data_table`
WHERE DATE(_PARTITIONTIME) = "2020-02-07"
  AND DetectorDataMessage.Header.MessageID = '478993053'
DetectorDataMessage.Header.MessageID is supposed to be unique.
How can I delete one of these rows? (there are two)
If possible I would like to deduplicate the whole table, but it's partitioned and I can't get it right. I tried the suggestions in the threads below, but I get this error: Column DetectorDataMessage of type STRUCT cannot be used in...
Threads of interest:
Deduplicate rows in a BigQuery partition
Delete duplicate rows from a BigQuery table
Any suggestions? Can you guide me in the right direction?
Try using a MERGE to remove the existing duplicate rows and insert a single identical one back. In this case I'm going for a specific date and ID, as in the question:
MERGE `temp.many_random` t
USING (
  # choose a single row to replace the duplicates
  SELECT a.*
  FROM (
    SELECT ANY_VALUE(a) a
    FROM `temp.many_random` a
    WHERE DATE(_PARTITIONTIME) = '2018-10-01'
      AND DetectorDataMessage.Header.MessageID = '478993053'
    GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID
  )
)
ON FALSE
WHEN NOT MATCHED BY SOURCE
  # delete the duplicates
  AND DATE(_PARTITIONTIME) = '2018-10-01'
  AND DetectorDataMessage.Header.MessageID = '478993053'
  THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Based on this answer:
Deduplicate rows in a BigQuery partition
If all of the values in the duplicate rows are the same, just use 'SELECT distinct'.
If not, I would use the ROW_NUMBER() function to create a rank for each unique index, and then just choose the first rank.
I don't know what your columns are, but here's an example:
WITH subquery AS (
  SELECT MessageID,
         ROW_NUMBER() OVER (PARTITION BY MessageID ORDER BY MessageID ASC) AS rank
  FROM `project.dataset.table`  # placeholder; the original snippet omitted the FROM clause
)
SELECT *
FROM subquery
WHERE rank = 1
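Since the STRUCT columns are what break the usual comparisons (the error quoted in the question), here is one workaround sketch with a placeholder table name: serialize each full row with TO_JSON_STRING so identical rows fall into the same group, then keep one row per group.
#standardSQL
# Sketch: full-row dedup despite STRUCT columns; table name is a placeholder.
SELECT AS VALUE ANY_VALUE(t)  # keep one surviving row per group
FROM `project.dataset.rfid_data_table` t
WHERE DATE(_PARTITIONTIME) = "2020-02-07"
GROUP BY TO_JSON_STRING(t)    # identical rows serialize identically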

Hive : Identification of exact duplicate records

I have a requirement.
I have a Hive table which has more than 200 columns.
Now I have to write an insert query to load the data into another Hive table after removing all identical duplicate records.
I know I can achieve it by using row_number() over ().
Code snippet:
insert into table target
select col1, col2, ..., col200
from (
  select col1, col2, ..., col200,
         row_number() over (partition by col1, col2, ..., col200 order by null) as rn
  from source
) a
where rn = 1
But this would be very lengthy, as I would need to write all 200 column names multiple times.
Is there an easier solution available?
Thanks for your advice.
You can use select distinct:
insert into table target
select distinct col1, col2, ..., col200
from source;
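If the target table has exactly the same columns as the source, a hedged shortcut: Hive also accepts the * wildcard with DISTINCT, which avoids typing the 200-column list at all (table names as in the question):
-- Sketch: assumes target and source share an identical schema.
insert into table target
select distinct * from source;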

How to concatenate and group rows in large BigQuery table with "Resources exceeded" problems

I have a single field table with 1.1 billion rows in BigQuery.
Table properties:
One field where Field name - id and Field Type - String
Table total size - 8.3GB
I would like to create a new table as follows:
The first column is a UUID field using GENERATE_UUID()
The second column, id_str, holds 25,000 id records concatenated into one comma-separated string
I have tried different solutions but keep running into
"Resources exceeded"
Is there a smart way around this limitation? Any other approach to solve my problem inside BigQuery?
The code I have at the moment, which generates the above-mentioned error:
SELECT
  GENERATE_UUID() AS batch_id,
  STRING_AGG(id) AS ids_str
FROM (
  WITH vars AS (
    SELECT 25000 AS rec_count
  )
  SELECT
    CAST(CEILING(ROW_NUMBER() OVER () / 25000) AS INT64) AS batch_count,
    25000 AS rec_count,
    CAST(id AS STRING) AS id
  FROM tbl_profile
)
GROUP BY rec_count
Any other approach to solve my problem inside BigQuery?
If your use case allows you to relax the requirement a little, so that instead of
"the second column is exactly 25,000 ids concatenated into one column"
it becomes
"the second column is about (close to) 25,000 ids concatenated into one column",
then the query below (for BigQuery Standard SQL) can/should work well for you:
#standardSQL
SELECT
  GENERATE_UUID() AS batch_id,
  COUNT(1) AS batch_size,
  STRING_AGG(id) AS ids_str
FROM (
  SELECT
    CAST((cnt * RAND()) / 25000 + 0.5 AS INT64) AS batch_count,
    CAST(id AS STRING) AS id
  FROM `project.dataset.table`
  CROSS JOIN (SELECT COUNT(1) cnt FROM `project.dataset.table`)
)
GROUP BY batch_count
The number of ids in each row will not be exactly 25,000, but close enough to it.
Hope this can be an option for you.
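If the nondeterminism of RAND() is a concern, a variant sketch with the same placeholder table name (batch sizes are still approximate, not exactly 25,000): hash each id into a precomputed number of buckets with FARM_FINGERPRINT, so reruns assign ids to the same batches.
#standardSQL
# Sketch: deterministic bucketing via hashing; sizes remain approximate.
SELECT
  GENERATE_UUID() AS batch_id,
  COUNT(1) AS batch_size,
  STRING_AGG(id) AS ids_str
FROM (
  SELECT
    CAST(id AS STRING) AS id,
    ABS(MOD(FARM_FINGERPRINT(CAST(id AS STRING)),
            CAST(cnt / 25000 AS INT64) + 1)) AS batch_count
  FROM `project.dataset.table`
  CROSS JOIN (SELECT COUNT(1) AS cnt FROM `project.dataset.table`)
)
GROUP BY batch_count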

Select into with max()

I have a basic query I use to determine the max value of a column in a table:
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
This results in ~580 rows (the entire table has over 2400 rows).
This works just fine for my query results, but what I don't know is how to insert the 580 rows into a new table based on the max value. I realize this isn't the right code, but what I am thinking of would look something like this:
select * into new_table from rev_code_lookup where max(revenue_code_version)
You can use the row_number() function to get the data you want. Combine with the other answer to insert the results into a table (I've made up a couple of extra columns as an example):
Select
x.revenue_code_id,
x.revenue_code_version,
x.update_timestamp,
x.updated_by
From (
Select
revenue_code_id,
revenue_code_version,
update_timestamp,
updated_by,
row_number() over (partition by revenue_code_id Order By revenue_code_version Desc) as rn
From
revenue_code_lookup
) x
Where
x.rn = 1
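Putting the two answers together, a sketch in the same style (Postgres-style CREATE TABLE AS; the update_timestamp and updated_by columns were made up above as examples):
Create Table new_table As
Select
  x.revenue_code_id,
  x.revenue_code_version,
  x.update_timestamp,
  x.updated_by
From (
  Select
    revenue_code_id,
    revenue_code_version,
    update_timestamp,
    updated_by,
    row_number() over (partition by revenue_code_id Order By revenue_code_version Desc) as rn
  From
    revenue_code_lookup
) x
Where
  x.rn = 1;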
Inserting into another table always works the same way, no matter how complicated your select is:
insert into table
[unbelievablycomplicatedselecthere]
So in your case:
insert into new_table
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
Similarly, if you need to create a brand new table, do this first:
CREATE TABLE new_table
AS
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
This will create the table with the corresponding schema and load the selected rows; after that, you can use the previous insert query to add more data.