Identify duplicate records in a DataFrame - SQL

I have a dataframe as below which holds the full name of a person:
-------------------
| f_name | l_name |
-------------------
| abc | xyz |
| xyz | abc |
| pqr | lmn |
-------------------
Here the second row is basically the same as the first row.
Consider the case where an entry comes in with the last name mistakenly put under first name (f_name) and the first name put under last name (l_name).
How can I identify and drop/resolve such duplicate/erroneous records in a Spark dataframe?
Desired Result:
-------------------
| f_name | l_name |
-------------------
| abc | xyz |
| pqr | lmn |
-------------------
The solution could be with a UDF or SQL or both. Thanks!

Use the dropDuplicates function, available on Dataset, with a proper key:
import spark.implicits._
import org.apache.spark.sql.functions.{array, array_sort}

val df = Seq(
  ("abc", "xyz"),
  ("xyz", "abc"),
  ("pqr", "lmn")
).toDF("f_name", "l_name")

// Build an order-insensitive key so ("abc", "xyz") and ("xyz", "abc") collide
df.withColumn("key", array_sort(array('f_name, 'l_name))).dropDuplicates("key").show()
+------+------+----------+
|f_name|l_name| key|
+------+------+----------+
| pqr| lmn|[lmn, pqr]|
| abc| xyz|[abc, xyz]|
+------+------+----------+
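Chain .drop("key") afterwards if you don't want the helper column in the result. Since the question also asks for SQL, here is a minimal sketch of the same idea in Spark SQL, assuming the dataframe is registered as a temp view named people (a hypothetical name):

SELECT f_name, l_name
FROM (
  SELECT *,
         row_number() OVER (
           PARTITION BY array_sort(array(f_name, l_name))
           ORDER BY f_name
         ) AS rn
  FROM people
)
WHERE rn = 1

This keeps one representative row per unordered name pair; adjust the ORDER BY if a specific row should win.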

Related

Amending the output by adding extra value in SQL

I want to amend the output from a SQL table, for instance adding extra text or an element to a selected column. The query below fails to execute with a mismatched input error.
select date, '123' & text from database123
Normal Output
| Date | Text |
| -------- | ------|
| 01/01/2021 | Car |
| 01/02/2021 | Car |
Expected Output
| Date | Text |
| -------- | ------ |
| 01/01/2021 | 123Car |
| 01/02/2021 | 123Car |
You can use concat or ||:
SELECT concat('123', text), '123' || text
FROM database123
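Note that NULL handling can differ between the two. In Postgres, for example, concat() treats NULL arguments as empty strings while || propagates NULL; a quick check, assuming text can be NULL:

SELECT concat('123', text) AS with_concat,  -- NULL treated as empty string
       '123' || text       AS with_pipes    -- NULL propagates
FROM database123;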

Check the elements in two columns of different dataframes

I have two dataframes.
Df1
Id | Name | Remarks
--------------------
1  | A    | Not bad
1  | B    | Good
2  | C    | Very bad
Df2
Id | Name | Place | Job
------------------------
1  | A    | Can   | IT
2  | C    | Cbe   | CS
4  | L    | anc   | ME
5  | A    | cne   | IE
Output
Id | Name | Remarks  | Results
-------------------------------
1  | A    | Not bad  | True
1  | B    | Good     | False
2  | C    | Very bad | True
That is, the result should be True if the same Id and Name are present in both dataframes. I tried
df1['Results']=np.where(Df1['id','Name'].isin(Df2['Id','Name']),'true','false')
But it was not successful.
Use DataFrame.merge with the indicator parameter and test whether each row matched:
df = Df1[['Id','Name']].merge(Df2[['Id','Name']], indicator='Results', how='left')
df['Results'] = df['Results'].eq('both')
Your solution is possible by comparing index values, using DataFrame.set_index with Index.isin:
Df1['Results'] = Df1.set_index(['Id','Name']).index.isin(Df2.set_index(['Id','Name']).index)
Or compare tuples from both columns:
Df1['Results'] = Df1[['Id','Name']].agg(tuple, 1).isin(Df2[['Id','Name']].agg(tuple, 1))
You can easily achieve this with merge, as in @jezrael's answer.
You can also achieve it with a list comprehension and zip. Note that zipping Df1 against Df2 directly only compares rows in the same position (and the dataframes have different lengths), so build a set of keys from Df2 and test membership:
keys = set(zip(Df2['Id'], Df2['Name']))
Df1['Results'] = [(i, j) in keys for i, j in zip(Df1['Id'], Df1['Name'])]

Snowflake Create View with JSON (VARIANT) field as columns with dynamic keys

I am having a problem creating a VIEW in Snowflake over a VARIANT field that stores JSON data whose keys are dynamic; the key definitions are stored in another table. So I want to create a VIEW that has dynamic columns based on the foreign key.
Here is what my tables look like:
companies:
| id | name |
| -- | ---- |
| 1 | Company 1 |
| 2 | Company 2 |
invoices:
| id | invoice_number | custom_fields | company_id |
| -- | -------------- | ------------- | ---------- |
| 1 | INV-01 | {"1": "Joe", "3": true, "5": "2020-12-12"} | 1 |
| 2 | INV-01 | {"2":"Hello", "4": 1000} | 2 |
customization_fields:
| id | label | data_type | company_id |
| -- | ----- | --------- | ---------- |
| 1 | manager | text | 1 |
| 2 | reference | text | 2 |
| 3 | emailed | boolean | 1 |
| 4 | account | integer | 2 |
| 5 | due_date | date | 1 |
So I want to create a view getting each company's invoices, something like:
CREATE OR REPLACE VIEW companies_invoices AS SELECT * FROM invoices WHERE company_id = 1
which should produce a result like below:
| id | invoice_number | company_id | manager | emailed | due_date |
| -- | -------------- | ---------- | ------- | ------- | -------- |
| 1 | INV-01 | 1 | Joe | true | 2020-12-12 |
My challenge here is that I cannot know the keys when I write the query. If I knew them, I could write:
SELECT
    id,
    invoice_number,
    company_id,
    custom_fields:"1" AS manager,
    custom_fields:"3" AS emailed,
    custom_fields:"5" AS due_date
FROM invoices
WHERE company_id = 1
The keys and labels are stored in the customization_fields table, and I have tried different approaches without being able to make it work.
Could anyone tell me whether this can be done? If it can, an example would really help.
You cannot do what you want to do with a view. A view has a fixed set of columns and they have specific types. Retrieving a dynamic set of columns requires some other mechanism.
If you're trying to change the number of columns or the names of the columns based on the rows in the customization_fields table, you can't do it in a view.
If you have a defined schema and just need to grab dynamic JSON properties, you may want to consider looking into Snowflake's GET function. It allows you to get any part of a JSON using a string for the path rather than using a literal path in the SQL statement. For example:
create temp table foo(v variant);
insert into foo select parse_json('{ "name":"John", "age":30, "car":null }');
-- This uses a literal path in the SQL to get to a JSON property
select v:name::string as first_name from foo;
-- This uses the GET function to get the value from a path in a string
select get(v, 'name')::string as first_name from foo;
You can replace the 'name' in the second parameter of the GET function with the value stored in the customization_fields table.
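For example, a rough sketch against the tables above that pulls each invoice's custom field values using the keys defined for its company (note this yields one row per invoice per field, in long form, rather than pivoted columns):

SELECT i.id,
       i.invoice_number,
       c.label,
       GET(i.custom_fields, TO_VARCHAR(c.id))::string AS value  -- key looked up dynamically
FROM invoices i
JOIN customization_fields c
  ON c.company_id = i.company_id;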
Otherwise, in Snowflake you would have to use a stored procedure to build and return a dynamic set of columns.

Replacing multiple strings from a database column with distinct replacements

I have a hive table as below:
+----+---------------+-------------+
| id | name | partnership |
+----+---------------+-------------+
| 1 | sachin sourav | first |
| 2 | sachin sehwag | first |
| 3 | sourav sehwag | first |
| 4 | sachin_sourav | first |
+----+---------------+-------------+
In this table I need to replace strings such as "sachin" with "ST" and "sourav" with "SG". I am using the following query, but it does not solve the problem.
Query:
select
    *,
    case
        when name regexp('\\bsachin\\b')
        then regexp_replace(name, 'sachin', 'ST')
        when name regexp('\\bsourav\\b')
        then regexp_replace(name, 'sourav', 'SG')
        else name
    end as newName
from sample1;
Result:
+----+---------------+-------------+---------------+
| id | name | partnership | newname |
+----+---------------+-------------+---------------+
| 4 | sachin_sourav | first | sachin_sourav |
| 3 | sourav sehwag | first | SG sehwag |
| 2 | sachin sehwag | first | ST sehwag |
| 1 | sachin sourav | first | ST sourav |
+----+---------------+-------------+---------------+
Problem: my intention is that for id = 1 the newName column should contain "ST SG"; that is, both strings should be replaced.
You can nest the replaces:
select s.*,
       replace(replace(s.name, 'sachin', 'ST'), 'sourav', 'SG') as newName
from sample1 s;
You don't need regular expressions, so just use replace().
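One caveat: plain replace() also rewrites the id = 4 row, so "sachin_sourav" becomes "ST_SG". If, as in your original query, matches should respect word boundaries (leaving "sachin_sourav" untouched), a sketch nesting regexp_replace instead:

select s.*,
       regexp_replace(
           regexp_replace(s.name, '\\bsachin\\b', 'ST'),
           '\\bsourav\\b', 'SG') as newName
from sample1 s;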

postgres: Multiply column of table A with rows of table B

Fellow SOers,
I am currently stuck on the following problem.
Say we have table "data" and table "factor"
"data":
---------------------
| col1 | col2 |
----------------------
| foo | 2 |
| bar | 3 |
----------------------
and table "factor" (the amount of rows is variable)
---------------------
| name | val |
---------------------
| f1 | 7 |
| f2 | 8 |
| f3 | 9 |
| ... | ... |
---------------------
and the result should look like this:
---------------------------------
| col1 | f1 | f2 | f3 | ...|
---------------------------------
| foo | 14 | 16 | 18 | ...|
| bar | 21 | 24 | 27 | ...|
---------------------------------
So basically I want column "col2" multiplied by all the "val" entries of table "factor", and the contents of column "name" should act as the column headers of the result.
We are using Postgres 9.3 (an upgrade to a higher version may be possible). An extended search turned up multiple possible solutions: crosstab (though even with crosstab I was not able to figure this one out) and a WITH CTE (preferred, but also no luck). It can probably also be done with the correct use of array() and unnest().
Hence, any help on how to achieve this is appreciated (the less code, the better).
Thanks in advance!
This package seems to do what you want:
https://github.com/hnsl/colpivot
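A rough sketch of how colpivot might be applied here, assuming the colpivot() function from that repository has been installed (the output table name pivoted is hypothetical; colpivot writes its result to a temporary table):

-- Compute the products in long form, then let colpivot
-- turn the distinct values of "name" into columns.
select colpivot('pivoted', $$
    select d.col1, f.name, d.col2 * f.val as product
    from data d
    cross join factor f
$$, array['col1'], array['name'], '#.product', null);

select * from pivoted;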