I have a table where I am seeing duplicate rows. I am seeing DQ issues for the supplier name. Some rows are missing the supplier code (3213) or are simply not known.
NK_1  | PRODUCT | SUPPLIER            | HASH
------+---------+---------------------+----------------------------------
32187 | Mango   | Happy Fruits        | a160d756c2f0dbd88f1b07b82a504fc6
32187 | Mango   | Happy Fruits (3213) | 9b08d35d051bc0cc08188b17c0bc9180
32187 | Mango   | Not Known           | 634b8c15cbb6f0b41c05542f07af5664
32187 | Mango   | Happy Fruits (3213) | 097bfc53b4bbcb45078baf9bb8a601b20
I want to filter for all rows where the Company name and code are present. This table contains thousands of different products and has the DQ issue with every supplier. How best to tackle this situation in your opinion?
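In SQL you can use regexp_like to match a pattern in a column.
Code example:
select nk_1, product, supplier, hash
from dq_table
where 1=1
and regexp_like(supplier, '^[A-Za-z].*[(]\d{4,}[)]$')
;
The objective of the regexp is to match values that begin with a letter and end with a code of four or more digits in parentheses, such as "Happy Fruits (3213)". Regex syntax varies between databases, so check how yours treats \d and backslashes in string literals.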
REGEXP_LIKE() should do the trick.
Here, I'm looking for one or more words, each followed by a blank, then an open parenthesis, one or more digits, and a closing parenthesis:
WITH
-- your input
indata(NK_1,PRODUCT,SUPPLIER,HASH) AS (
SELECT 32187,'Mango','Happy Fruits','a160d756c2f0dbd88f1b07b82a504fc6'
UNION ALL SELECT 32187,'Mango','Happy Fruits (3213)','9b08d35d051bc0cc08188b17c0bc9180'
UNION ALL SELECT 32187,'Mango','Not Known','634b8c15cbb6f0b41c05542f07af5664'
UNION ALL SELECT 32187,'Mango','Happy Fruits (3213)','097bfc53b4bbcb45078baf9bb8a601b20'
)
SELECT
*
FROM indata
WHERE REGEXP_LIKE(supplier,'(\w+ )+[(]\d+[)]')
Result:
NK_1   | PRODUCT | SUPPLIER            | HASH
-------+---------+---------------------+----------------------------------
32,187 | Mango   | Happy Fruits (3213) | 9b08d35d051bc0cc08188b17c0bc9180
32,187 | Mango   | Happy Fruits (3213) | 097bfc53b4bbcb45078baf9bb8a601b20
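To sanity-check the pattern outside the database, here is a minimal Python sketch that applies the same expression to the four sample supplier values. Python's re module stands in for REGEXP_LIKE, which is an assumption: regex dialects differ between engines, so treat this as an illustration, not a guarantee of how your database behaves. The names suppliers and has_name_and_code are made up for the example.

```python
import re

# Sample supplier values copied from the question.
suppliers = [
    "Happy Fruits",
    "Happy Fruits (3213)",
    "Not Known",
    "Happy Fruits (3213)",
]

# Same pattern as the REGEXP_LIKE call: one or more "word + blank" groups,
# then an open parenthesis, one or more digits, and a closing parenthesis.
HAS_NAME_AND_CODE = re.compile(r"(\w+ )+[(]\d+[)]")


def has_name_and_code(supplier: str) -> bool:
    """Return True when the value carries both a supplier name and a code."""
    return HAS_NAME_AND_CODE.search(supplier) is not None


kept = [s for s in suppliers if has_name_and_code(s)]
print(kept)
```

Only the two "Happy Fruits (3213)" values pass the filter; the bare name and "Not Known" are dropped.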
HELP! Kind of new to SQL. I've been working with simple statements for a few years but I need a little advanced help. I know it can be done and will save me time.
Here is my example to try to find results:
select top 1 apples, color from fruits
where apples in ('gala', 'fuji', 'granny')
and (inStock is not null and inStock <> '')
In the above query I would get the first color in 'gala' apples and that's it. What I want is the first color in 'gala', the first in 'fuji', the first in 'granny' and so on.
InStock isn't as important - it's just an additional filter in the search results.
What I want is a two column list. Left Column being apple types and right column being the first color result for each apple type.
You can use the row_number() window function to number the colors within each apple type, then keep only the first row of each group.
with cte as
(
select apples, color,
row_number() over (partition by apples order by apples) rn
from fruits
where apples in ('gala', 'fuji', 'granny')
and (inStock is not null and inStock <> '')
)
select apples, color from cte where rn = 1
Note that ordering by apples inside each partition leaves the choice of "first" row up to the database. Order by a column that defines the sequence you care about if you need a repeatable result.
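As a rough, runnable illustration of the row_number() approach, the sketch below uses Python's sqlite3 module (SQLite 3.25 or later is needed for window functions). The sample rows and the id column used for ordering are assumptions made for this example; the original fruits table is not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: id is an assumed column that defines which row is "first".
conn.execute("CREATE TABLE fruits (id INTEGER, apples TEXT, color TEXT, inStock TEXT)")
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?, ?)",
    [
        (1, "gala", "red", "yes"),
        (2, "gala", "pink", "yes"),
        (3, "fuji", "yellow", ""),     # filtered out: empty inStock
        (4, "fuji", "red", "yes"),
        (5, "granny", "green", "yes"),
        (6, "granny", "lime", None),   # filtered out: NULL inStock
    ],
)

query = """
WITH cte AS (
    SELECT apples, color,
           ROW_NUMBER() OVER (PARTITION BY apples ORDER BY id) AS rn
    FROM fruits
    WHERE apples IN ('gala', 'fuji', 'granny')
      AND inStock IS NOT NULL AND inStock <> ''
)
SELECT apples, color FROM cte WHERE rn = 1 ORDER BY apples
"""
first_colors = conn.execute(query).fetchall()
print(first_colors)
```

Ordering by id inside each partition is what makes "first" well defined here; with a different ordering column the chosen color may change.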
I think one issue you might have here is the concept of "first". A color is a categorical variable and tables don't typically attach meaning to a "first" or "last" value with a few exceptions. If you're dead set on returning the first row for each fruit, one easy way to get the result utilizes union all.
SELECT top 1 apples, color from fruits where apples = 'gala'
UNION ALL
SELECT top 1 apples, color from fruits where apples = 'fuji'
UNION ALL
SELECT top 1 apples, color from fruits where apples = 'granny'
Let's assume, a table has the following rows
ID Name Value
1 Apple Red
1 Taste Sour
2 Apple Yellow
2 Taste Sweet
3 Apple Red
3 Taste Sour
4 Apple Green
4 Taste Tart
5 Apple Yellow
5 Taste Sweet
How can I select the IDs corresponding to distinct combinations of Apple and Taste? For example, ID=1 corresponds to a red sour apple, so ID=3 can be omitted from the query result. Similarly, ID=2 is a yellow sweet apple, so ID=5 can be excluded from the query result, etc. A valid query result can be any of the following ID sets: (1,2,4), (1,4,5), (2,3,4) etc.
The query or the model could be improved with more understanding of the problem.
But assuming the model is correct and the problem is presented as this, this would be my quick approach.
SELECT MIN(a.ID) as ID
FROM Table a
INNER JOIN Table b ON a.ID = b.ID AND a.Name > b.Name
GROUP BY a.Value, b.Value
This query joins the table to itself on ID. On its own, that join produces four rows for each ID (Apple-Apple, Taste-Taste, Apple-Taste and Taste-Apple), so it is not enough to require that the two names differ (that would still leave both Apple-Taste and Taste-Apple); you need to require that one is greater than the other. That is what a.Name > b.Name does: it puts the Taste row on one side of the join and the Apple row on the other.
You then group by both values, so each combination of Apple value and Taste value appears only once. That leaves only three rows.
The exact SELECT syntax may depend on the RDBMS (I used SQL Server syntax); it picks the lowest ID in each group. Since you don't care which ID is returned, you could use either MIN or MAX. MIN returns 1, 2, 4 and MAX returns 3, 4, 5.
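The self-join can be tried against the sample rows with a short Python sqlite3 sketch. The table name attrs and the helper pick_ids are placeholders chosen for this example (the answer's "Table" is itself a stand-in name), and SQLite is only assumed to behave like the answer's SQL Server here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attrs (ID INTEGER, Name TEXT, Value TEXT)")
conn.executemany(
    "INSERT INTO attrs VALUES (?, ?, ?)",
    [
        (1, "Apple", "Red"), (1, "Taste", "Sour"),
        (2, "Apple", "Yellow"), (2, "Taste", "Sweet"),
        (3, "Apple", "Red"), (3, "Taste", "Sour"),
        (4, "Apple", "Green"), (4, "Taste", "Tart"),
        (5, "Apple", "Yellow"), (5, "Taste", "Sweet"),
    ],
)


def pick_ids(aggregate: str) -> list:
    """Return one ID per distinct (Taste, Apple) pair, using MIN or MAX."""
    if aggregate not in ("MIN", "MAX"):
        raise ValueError("aggregate must be MIN or MAX")
    query = f"""
    SELECT {aggregate}(a.ID) AS picked_id
    FROM attrs a
    INNER JOIN attrs b ON a.ID = b.ID AND a.Name > b.Name  -- a = Taste, b = Apple
    GROUP BY a.Value, b.Value
    ORDER BY picked_id
    """
    return [row[0] for row in conn.execute(query)]


min_ids = pick_ids("MIN")
max_ids = pick_ids("MAX")
print(min_ids, max_ids)
```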
Let's say there is a table called ITEM that contains 3 attributes (name, id, price):
name id price
Apple 1 3
Orange 1 3
Banana 2 4
Cherry 3 5
Mango 1 3
How should I write a query that uses a constant selection operator to select the items that have the same price and the same id? The first thing that comes to mind is to use the rename operator to rename id to id' and price to price', then union the result with the ITEM table. But since I need to test two conditions (price = price' and id = id'), how can I select them without using the conjunction operator in relational algebra?
Thank you.
I'm not quite sure but for me, it would be something like this in relational calculus:
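The expression itself does not appear in the text here. As a hedged reconstruction only, one tuple-calculus formulation consistent with the SQL version of this answer would be the following, where i and u range over ITEM tuples:

```latex
\{\, i.\mathit{name} \mid \mathit{ITEM}(i) \land \exists u \,\bigl( \mathit{ITEM}(u)
   \land u.\mathit{name} \neq i.\mathit{name}
   \land u.\mathit{price} = i.\mathit{price}
   \land u.\mathit{id} = i.\mathit{id} \bigr) \,\}
```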
and then in SQL:
SELECT i.name FROM ITEM i WHERE
EXISTS (
SELECT 1 FROM ITEM u
WHERE u.name != i.name
AND u.price = i.price
AND u.id = i.id
)
But still, I think your assumption is right, you can still do it by renaming. I do believe it is a bit longer than what I did above.
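For a quick check of the idea, the sketch below runs a correlated EXISTS subquery against the sample ITEM rows using Python's sqlite3 module. It is an illustration under the assumption that "same price and same id" means another row with a different name shares both values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ITEM (name TEXT, id INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO ITEM VALUES (?, ?, ?)",
    [("Apple", 1, 3), ("Orange", 1, 3), ("Banana", 2, 4), ("Cherry", 3, 5), ("Mango", 1, 3)],
)

# Keep an item when some other item (different name) has the same id and price.
query = """
SELECT i.name
FROM ITEM i
WHERE EXISTS (
    SELECT 1
    FROM ITEM u
    WHERE u.name != i.name
      AND u.price = i.price
      AND u.id = i.id
)
ORDER BY i.name
"""
shared = [row[0] for row in conn.execute(query)]
print(shared)
```

Banana and Cherry drop out because no other row shares their id and price.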
I have a Hive table, titled 'UK.Choices' with a column, titled 'Fruit', with each row as follows:
AppleBananaAppleOrangeOrangePears
BananaKiwiPlumAppleAppleOrange
KiwiKiwiOrangeGrapesAppleKiwi
etc.
etc.
There are 2.5M rows and the rows are much longer than the above.
I want to count the number of instances that the word 'Apple' appears.
For example above, it is:
Number of 'Apple'= 5
My sql so far is:
select 'Fruit' from UK.Choices
Then, in chunks of 300,000, I copy and paste into Excel, where I'm more proficient and able to do this using formulas. The problem is that it takes up to an hour and a half to generate each chunk of 300,000 rows.
Anyone know a quicker way to do this bypassing Excel? I can do simple things like counts using where clauses, but something like the above is a little beyond me right now. Please help.
Thank you.
I think I am 2 years too late. But since I was looking for the same answer and I finally managed to solve it, I thought it was a good idea to post it here.
Here is how I do it.
Solution 1:
+-----------------------------------+---------------------------+-------------+-------------+
| Fruits | Transform 1 | Transform 2 | Final Count |
+-----------------------------------+---------------------------+-------------+-------------+
| AppleBananaAppleOrangeOrangePears | #Banana#OrangeOrangePears | ## | 2 |
| BananaKiwiPlumAppleAppleOrange | BananaKiwiPlum##Orange | ## | 2 |
| KiwiKiwiOrangeGrapesAppleKiwi | KiwiKiwiOrangeGrapes#Kiwi | # | 1 |
+-----------------------------------+---------------------------+-------------+-------------+
Here is the code for it:
SELECT length(regexp_replace(regexp_replace(fruits, "Apple", "#"), "[A-Za-z]", "")) as number_of_apples
FROM fruits;
You may have numbers or other special characters in your fruits column and you can just modify the second regexp to incorporate that. Just remember that in hive to escape a character you may need to use \\ instead of just one \.
Solution 2:
SELECT size(split(fruits,"Apple"))-1 as number_of_apples
FROM fruits;
This first splits the string using "Apple" as a separator, which produces an array. The size function then returns the size of that array. Note that the size of the array is one more than the number of separators, hence the -1.
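The arithmetic behind both solutions can be checked with a small Python sketch on the three sample rows. This mirrors the logic only; Python's str.replace, re.sub and str.split are stand-ins for Hive's regexp_replace and split, and the function names are invented for the example.

```python
import re

rows = [
    "AppleBananaAppleOrangeOrangePears",
    "BananaKiwiPlumAppleAppleOrange",
    "KiwiKiwiOrangeGrapesAppleKiwi",
]


def count_by_replace(text: str, word: str = "Apple") -> int:
    # Solution 1: mark each occurrence with '#', strip all letters,
    # and measure what is left.
    marked = text.replace(word, "#")
    return len(re.sub(r"[A-Za-z]", "", marked))


def count_by_split(text: str, word: str = "Apple") -> int:
    # Solution 2: splitting on the word yields one more piece than
    # there are separators.
    return len(text.split(word)) - 1


per_row_replace = [count_by_replace(r) for r in rows]
per_row_split = [count_by_split(r) for r in rows]
total = sum(per_row_split)
print(per_row_replace, per_row_split, total)
```

Both approaches give the per-row counts from the table above, and summing them gives the overall figure of 5 that the question asks for.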
This is straightforward if you have a delimiter (e.g. a comma) between the fruit names. The idea is to split the column into an array, and explode the array into multiple rows using the 'explode' function.
SELECT fruit, count(1) as count FROM
( SELECT
explode(split(Fruit, ',')) as fruit
FROM UK.Choices ) X
GROUP BY fruit
From your example, it looks like the fruits are delimited by capital letters. One idea is to split the column on capital letters, assuming no two fruits share the same suffix.
SELECT fruit_suffix, count(1) as count FROM
( SELECT
explode(split(Fruit, '[A-Z]')) as fruit_suffix
FROM UK.Choices ) X
WHERE fruit_suffix <> ''
GROUP BY fruit_suffix
The downside is that the output will be missing the first letter of each fruit:
pple - 5
range - 4
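If losing the first letter is a problem, one alternative is to match whole names instead of splitting on the capitals. The Python sketch below assumes every fruit name is one capital letter followed by lowercase letters; it illustrates the idea only and is not Hive code.

```python
import re
from collections import Counter

rows = [
    "AppleBananaAppleOrangeOrangePears",
    "BananaKiwiPlumAppleAppleOrange",
    "KiwiKiwiOrangeGrapesAppleKiwi",
]

# Assumption: a fruit name is a capital letter followed by lowercase letters,
# so each match is one complete name with its first letter intact.
NAME = re.compile(r"[A-Z][a-z]*")

counts = Counter(name for row in rows for name in NAME.findall(row))
print(counts["Apple"], counts["Orange"])
```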
I think you want to do this in one select and use the Hive if UDF to sum the different cases, instead of a join as in the above query. Something like the following:
select sum( if( fruit like '%Apple%' , 1, 0 ) ) as apple_count,
sum( if( fruit like '%Orange%', 1, 0 ) ) as orange_count
from UK.Choices
where ID > start and ID < end;
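A small sqlite3 sketch shows what the conditional sum returns on the three sample rows. CASE WHEN stands in for Hive's if(), and the choices table with an integer id is an assumption for the example. Note that this counts rows containing the word, not occurrences of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for UK.Choices with an assumed integer id column.
conn.execute("CREATE TABLE choices (id INTEGER PRIMARY KEY, fruit TEXT)")
conn.executemany(
    "INSERT INTO choices (id, fruit) VALUES (?, ?)",
    [
        (1, "AppleBananaAppleOrangeOrangePears"),
        (2, "BananaKiwiPlumAppleAppleOrange"),
        (3, "KiwiKiwiOrangeGrapesAppleKiwi"),
    ],
)

# Each row contributes 1 to a sum when its fruit string matches the pattern.
query = """
SELECT SUM(CASE WHEN fruit LIKE '%Apple%'  THEN 1 ELSE 0 END) AS apple_rows,
       SUM(CASE WHEN fruit LIKE '%Orange%' THEN 1 ELSE 0 END) AS orange_rows
FROM choices
WHERE id > ? AND id < ?
"""
apple_rows, orange_rows = conn.execute(query, (0, 4)).fetchone()
print(apple_rows, orange_rows)
```

All three rows contain "Apple", so apple_rows is 3 even though the word occurs 5 times in total.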
No experience of Hive, I'm afraid, so this may or may not work. But on SQLServer, Oracle etc I'd do something like this:
Assuming that you have an int PK called ID on the row, something along the lines of:
select AppleCount, OrangeCount, AppleCount - OrangeCount score
from
(
select count(*) as AppleCount
from UK.Choices
where ID > start and ID < end
and Fruit like '%Apple%'
) a,
(
select count(*) as OrangeCount
from UK.Choices
where ID > start and ID < end
and Fruit like '%Orange%'
) o
I'd leave the division by the total count to the end, when you have all the rows in the spreadsheet and can count them there.
However, I'd urgently ask my boss to let me change the Fruit field to be a table with an FK to Choices and one fruit name per row. Unless this is something you can't do in Hive, this design is something that makes kittens cry.
PS I'd missed that you wanted the count of occurrences of Apple, which this won't do. I'm leaving my answer up, because I reckon that my However... para is actually a good answer. :(
I've got two tables in SQL, one with a project and one with categories that projects belong to, i.e. the JOIN would look roughly like:
Project | Category
--------+---------
Foo | Apple
Foo | Banana
Foo | Carrot
Bar | Apple
Bar | Carrot
Qux | Apple
Qux | Banana
(Strings replaced with IDs from a higher normal form, obviously, but you get the point here.)
What I want to do is allow filtering such that users can select any number of categories and results will be filtered to items that are members of all the selected categories. For example, if a user selects categories "Apple" and "Banana", projects "Foo" and "Qux" show up. If a user selects categories "Apple", "Banana", and "Carrot", then only the "Foo" project shows up.
The first thing I tried was a simple SELECT DISTINCT Project FROM ... WHERE Category = 'Apple' AND Category = 'Banana', but of course that doesn't work since Apple and Banana show up in the same column in two different rows for any common project.
GROUP BY and HAVING don't do me any good, so tell me: is there an obvious way to do this that I'm missing, or is it really so complicated that I'm going to have to resort to recursive joins?
This is in PostgreSQL, by the way, but of course standard SQL code is always preferable when possible.
See this article in my blog for performance details:
PostgreSQL: selecting items that belong to all categories
The solution below:
Works on any number of categories
Is more efficient than COUNT and GROUP BY, since it checks existence of any project / category pair exactly once, without counting.
SELECT *
FROM (
SELECT DISTINCT Project
FROM mytable
) mo
WHERE NOT EXISTS
(
SELECT NULL
FROM (
SELECT 'Apple' AS Category
UNION ALL
SELECT 'Banana'
UNION ALL
SELECT 'Carrot'
) list
WHERE NOT EXISTS
(
SELECT NULL
FROM mytable mii
WHERE mii.Project = mo.Project
AND mii.Category = list.Category
)
)
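To see the double NOT EXISTS at work on the sample data, here is a minimal Python sqlite3 sketch. It keeps a project only if there is no wanted category that the project lacks. The helper name projects_in_all and the way the category list is built from parameters are choices made for this example, and it expects at least one category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (Project TEXT, Category TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        ("Foo", "Apple"), ("Foo", "Banana"), ("Foo", "Carrot"),
        ("Bar", "Apple"), ("Bar", "Carrot"),
        ("Qux", "Apple"), ("Qux", "Banana"),
    ],
)


def projects_in_all(categories: list) -> list:
    """Projects that belong to every category in a non-empty list."""
    # Inline category list: SELECT ? AS Category UNION ALL SELECT ? ...
    wanted = " UNION ALL ".join("SELECT ? AS Category" for _ in categories)
    query = f"""
    SELECT mo.Project
    FROM (SELECT DISTINCT Project FROM mytable) mo
    WHERE NOT EXISTS (
        SELECT NULL
        FROM ({wanted}) list
        WHERE NOT EXISTS (
            SELECT NULL
            FROM mytable mii
            WHERE mii.Project = mo.Project
              AND mii.Category = list.Category
        )
    )
    ORDER BY mo.Project
    """
    return [row[0] for row in conn.execute(query, list(categories))]


two = projects_in_all(["Apple", "Banana"])
three = projects_in_all(["Apple", "Banana", "Carrot"])
print(two, three)
```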
Since a project can only be in a category once, we can use COUNT to pull this stunt off:
SELECT project, COUNT(category) AS cat_count
FROM /* your join */
WHERE category IN ('Apple', 'Banana')
GROUP BY project
-- PostgreSQL does not accept the cat_count alias in HAVING, so repeat the aggregate
HAVING COUNT(category) = 2
A project with a category of only apple or banana will get a count of 1, and thus fail the HAVING clause. Only a project with both categories will get a count of 2.
If for some reason you have duplicate categories, you can use something like COUNT(DISTINCT category). COUNT(*) should work as well, and differs only if category can be null.
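The counting approach can be sketched the same way with Python's sqlite3 module. This version uses COUNT(DISTINCT category) and compares it with the number of selected categories, so it does not depend on how many are chosen. The table name mytable and the helper projects_in_all are placeholders for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (Project TEXT, Category TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        ("Foo", "Apple"), ("Foo", "Banana"), ("Foo", "Carrot"),
        ("Bar", "Apple"), ("Bar", "Carrot"),
        ("Qux", "Apple"), ("Qux", "Banana"),
    ],
)


def projects_in_all(categories: list) -> list:
    """Projects whose matching-category count equals the number requested."""
    wanted = sorted(set(categories))  # drop duplicate selections
    placeholders = ", ".join("?" for _ in wanted)
    query = f"""
    SELECT Project
    FROM mytable
    WHERE Category IN ({placeholders})
    GROUP BY Project
    HAVING COUNT(DISTINCT Category) = ?
    ORDER BY Project
    """
    return [row[0] for row in conn.execute(query, [*wanted, len(wanted)])]


two = projects_in_all(["Apple", "Banana"])
three = projects_in_all(["Apple", "Banana", "Carrot"])
print(two, three)
```

Deduplicating the selected categories first keeps the comparison correct if the same category is passed twice.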
One other solution is, of course, something like "SELECT DISTINCT Project FROM ... AS a WHERE 'Apple' IN (SELECT Category FROM ... AS b WHERE a.Project = b.Project) AND 'Banana' IN (SELECT Category FROM ... AS b WHERE a.Project = b.Project)", but that gets pretty computationally expensive pretty quickly. I was hoping for something more elegant, and you guys haven't disappointed. I'm including this one mostly for completeness in case someone else consults this question. It's clearly worth zero points. :)