substring and trim in Teradata - sql

I am working in Teradata with some descriptive data that needs to be transformed from a generic varchar(60) into different field lengths based on the type of data element and the attribute value. So I need to take whatever is in the varchar(60) and, based on field 'ABCD', act on field 'XYZ'. In this case XYZ is a varchar(3). To do this I am using CASE logic within my select. What I want to do is:
eliminate all occurrences of non-alphanumeric data. All I want left are upper-case alpha chars and numbers.
In this case, where abcd = 'GROUP', xyz should come out as '000', '002', 'A', 'C'
eliminate extra padding
Shift everything Right
abcd xyz
1 GROUP NULL
2 GROUP $
3 GROUP 000000000000000000000000000000000000000000000000000000000000
4 GROUP 000000000000000000000000000000000000000000000000000000000002
5 GROUP A
6 GROUP C
7 GROUP r
To do this I have tried TRIM and SUBSTR, amongst several other things that did not work. I have pasted what I have working now, but I am not reliably working through the data within the select. I am really looking for some options on how to better work with strings in Teradata. I have been working out of the "SQL Functions, Operators, Expressions and Predicates" online PDF. Is there a better reference? We are on TD 13.
SELECT abcd
, CASE
-- xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WHEN abcd= 'GROUP'
THEN(
CASE
WHEN SUBSTR(tx.abcd,60, 4) = 0
THEN (
SUBSTR(tx.abcd,60, 3)
)
ELSE
TRIM (TRAILING FROM tx.abcd)
END
)
END AS abcd
FROM db.descr tx
WHERE tx.abcd IS IN ( 'GROUP')
The end result should look like this
abcd xyz
1 GROUP 000
2 GROUP 002
3 GROUP A
4 GROUP C
I will have to deal with approx 60 different "abcd" types, but they should all conform to the type of data I am currently seeing, i.e. mixed case, non-numeric, non-alphabetic, padded, etc.
I know there is a better way, but I have gone in several circles trying to figure this out over the weekend and need a little push in the right direction.
Thanks in advance,
Pat

The SQL below uses the CHARACTER_LENGTH function to first determine if there is a need to perform what amounts to a RIGHT(tx.xyz, 3) using the native functions in Teradata 13.x. I think this may accomplish what you are looking to do. I hope I have not misinterpreted your explanation:
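SELECT CASE WHEN tx.abcd = 'GROUP'
AND CHARACTER_LENGTH(TRIM(BOTH FROM tx.xyz)) > 3
-- start two characters before the end so exactly three characters come back
THEN SUBSTRING(TRIM(BOTH FROM tx.xyz) FROM (CHARACTER_LENGTH(TRIM(BOTH FROM tx.xyz)) - 2))
-- three characters or fewer: return the trimmed value as it is
ELSE TRIM(BOTH FROM tx.xyz)
END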
FROM db.descr tx;
EDIT: Fixed parenthesis in SUBSTRING
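For checking the expected output outside the database, the rule from the question (keep only upper-case letters and digits, drop the padding, keep the right-most three characters) can be sketched in a few lines of Python. This is only a sketch of the logic, not Teradata SQL; the function name clean_xyz and the default width of 3 are assumptions made for illustration.

```python
import re
from typing import Optional

def clean_xyz(value: Optional[str], width: int = 3) -> Optional[str]:
    """Keep only A-Z and 0-9, then keep the right-most `width` characters.

    Hypothetical helper: the name and the default width are assumptions.
    """
    if value is None:
        return None  # NULL stays NULL
    kept = re.sub(r"[^A-Z0-9]", "", value)  # drops padding, '$', lower case
    return kept[-width:]

# Sample values taken from the question's table.
samples = [None, "$", "0" * 60, "0" * 59 + "2", "A", "C", "r"]
cleaned = [clean_xyz(s) for s in samples]

# Rows that end up NULL or empty are assumed to be filtered out afterwards.
result = [c for c in cleaned if c]
```

With the sample rows from the question, result holds the four values shown in the desired output.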

Related

New column based on list of values SQL

I am new to SQL and working on a database that needs a binary indicator based on the presence of string values in a column. I'm trying to make a new table as follows:
Original:
Indicator
a, b, c
c, d, e
Desired:
Indicator | type
a, b, c | 1
c, d, e | 0
SQL code:
SELECT
ID,
Contract,
Indicator,
CASE
WHEN Indicator IN ('a', 'b')
THEN 1
ELSE 0
END as Type
INTO new_table
FROM old_table
The table I keep creating reports every type as 0.
I also have 200+ distinct indicators, so it will be really time-consuming to write each as:
CASE
WHEN Indicator = 'a' THEN '1'
WHEN Indicator = 'b' THEN '1'
Is there a more streamlined way to think about this?
Thanks!
I think the first step is to understand why your code doesn’t work right now.
If the values in your Indicator column are literally the strings you noted (a, b, c in one string and c, d, e in another), then your CASE expression is saying "I am looking for an exact match on the full value of Indicator against the following list":
the letter a, or
the letter b
Essentially, you are asking "Hey SQL, does 'a, b, c' match 'a'? Or does 'a, b, c' match 'b'?"
SQL's answer is "these don't match", which is why you get all 0s.
You can try wildcard matching with the LIKE syntax.
CASE WHEN Indicator LIKE '%a%' OR Indicator LIKE '%b%' THEN 1 ELSE 0 END AS Type
Now, if the abc and cde strings aren’t REALLY what’s in your database then this approach may not work well for you.
Example, let’s say your real values are words that are all slapped together in a single string.
Let’s say that your strings are 3 words each.
Cat, Dog, Man
Catalog, Stick, Shoe
Hair, Hellcat, Belt
And let’s say that Cat is a value that should cause Type to be 1.
If you write CASE WHEN Indicator LIKE '%cat%' THEN 1 ELSE 0 END AS Type, all 3 rows will get a 1, because the wildcard will match Cat in Catalog and cat in Hellcat.
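The difference between the exact match and the wildcard match can be reproduced with a small in-memory SQLite table. This is a sketch under assumptions: it uses Python's sqlite3 module rather than your database, the table name old_table is borrowed from the question, and SQLite's LIKE is case-insensitive for ASCII by default, which may not be true of your server's collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_table (Indicator TEXT)")
conn.executemany(
    "INSERT INTO old_table VALUES (?)",
    [("Cat, Dog, Man",), ("Catalog, Stick, Shoe",), ("Hair, Hellcat, Belt",)],
)

# Substring match: every row contains the letters "cat" somewhere.
like_flags = [row[0] for row in conn.execute(
    "SELECT CASE WHEN Indicator LIKE '%cat%' THEN 1 ELSE 0 END "
    "FROM old_table ORDER BY rowid"
)]

# Exact match on the full value, as in the original query: nothing matches.
in_flags = [row[0] for row in conn.execute(
    "SELECT CASE WHEN Indicator IN ('Cat') THEN 1 ELSE 0 END "
    "FROM old_table ORDER BY rowid"
)]
```

Here like_flags flags all three rows, including the two false positives, while in_flags flags none.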
I think the bottom line is that unless your Indicator values really are 3 letters and your match criteria is a single letter, you very well could be better off writing a 200 line long case statement if you need this done any time soon.
Whether a better approach is worth considering depends on things like whether you are going to have 300 different combinations a week, a month, or a year from now.
If so, wouldn’t it be nice if you had a table with a total of 6 rows, like so?
Indicator | Indicator_Parsed
a,b,c | a
a,b,c | b
a,b,c | c
c,d,e | c
c,d,e | d
c,d,e | e
Then you could write the query as you have it, CASE WHEN Indicator_Parsed IN ('a', 'b') THEN 1 ELSE 0 END AS Type, as one piece of a more verbose solution.
If this approach seems useful to you, here’s a link to the page that lets you parse those comma-separated-values into additional rows. Turning a Comma Separated string into individual rows
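To make the parsed-rows idea concrete, here is a rough Python sketch of the same logic: split each comma-separated value into individual items, then test the items (not the whole string) against the wanted list. The function names parse_indicators and type_flag are made up for this illustration; in the database you would do the splitting with whatever string-splitting facility your engine offers.

```python
def parse_indicators(values):
    """Turn each comma-separated string into (original, single_item) rows."""
    rows = []
    for value in values:
        for item in value.split(","):
            rows.append((value, item.strip()))
    return rows

def type_flag(value, wanted):
    """Return 1 if any parsed item of `value` is in `wanted`, else 0."""
    items = {item.strip() for item in value.split(",")}
    return 1 if items & set(wanted) else 0

# The two sample values from the question, parsed into 6 rows.
parsed = parse_indicators(["a,b,c", "c,d,e"])
flags = {v: type_flag(v, ["a", "b"]) for v in ["a,b,c", "c,d,e"]}
```

Because the comparison is on whole items, a value like "Catalog, Stick, Shoe" is not flagged when the wanted item is "Cat".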
On MySQL/SQL Server you can do it as follows:
insert into table2
select Indicator,
CASE WHEN Indicator like '%a%' or Indicator like '%b%' THEN 1 ELSE 0 END As type
from table1;
You can use the REGEXP operator to check for presence of either a, b or both.
SELECT Indicator,
Indicator REGEXP '.*[ab].*'
FROM tab
If you need that into a table, you either create it from scratch
CREATE TABLE your_table AS
SELECT Indicator,
Indicator REGEXP '.*[ab].*'
FROM tab
or you insert values in it:
INSERT INTO your_table
SELECT Indicator,
Indicator REGEXP '.*[ab].*'
FROM tab

Horizontal and vertical output from same SQL Server table

I am not sure this can be done, and tried numerous searches but no real result yet.
I have a SQL Server database with a table where I want to output results from a single table both horizontally and vertically. I realise this will be a complex SQL statement and have managed part of the vertical using a UNION but the horizontal eludes me.
The table has a field called "reference" and contains a string of characters such as "A03ACCEVEN18JS-SN1AA" or "A02ACCVCOM18JS-FN1AA". I want to create an output with a row for the count of references commencing A02 then a row for A03, A04 etc that also contain "18". Then expand horizontally to count the references with different letters after the hyphen, i.e. "-s" and "-f" etc. So the output would look like below,
S_Count | F_Count | J_Count etc
---------------------------------
A02 Row --> 58 | 23 | 16
A03 Row --> 22 | 43 | 53
A04 Row --> 7 | 31 | 23
etc
I managed to get one column so far with multiple where clauses and UNIONS like below but I now need the vertical. Can this be done please?
SELECT COUNT(reference) FROM mytable
WHERE reference LIKE 'A02%' AND reference LIKE '%%18%%' AND PATINDEX('%-P%',
reference) <> 0
UNION
SELECT COUNT(reference) FROM mytable
WHERE reference LIKE 'A03%' AND reference LIKE '%%18%%' AND PATINDEX('%-P%',
reference) <> 0
UNION
SELECT COUNT(reference) AS TOTAL FROM mytable
WHERE reference LIKE 'A04%' AND reference LIKE '%%18%%' AND PATINDEX('%-P%',
reference) <> 0;
Let's do it all in one hit :)
SELECT
LEFT(reference, 3) as ao_number,
SUM(CASE WHEN reference LIKE '%-S%' THEN 1 ELSE 0 END) as s_count,
SUM(CASE WHEN reference LIKE '%-F%' THEN 1 ELSE 0 END) as f_count,
SUM(CASE WHEN reference LIKE '%-J%' THEN 1 ELSE 0 END) as j_count
FROM
mytable
WHERE
reference like 'A0%18%'
GROUP BY
LEFT(reference, 3)
Notes:
LEFT(reference, 3) pulls the A0x number off the start. Grouping by this will give us one row per distinct A0x number, so if a thousand variations of A00 to A09 are present, we'll get 10 rows
You don't need to (and shouldn't) say WHERE reference LIKE 'A03%' AND reference LIKE '%%18%%' etc. I just combined them into 'A0%18%'. Note that I didn't combine them into 'A03%18%', as that would restrict our data too much. Don't double up your percent signs when doing a LIKE
The SUM performs a count; the CASE WHEN looks at the reference and, if it has e.g. an -S in it, returns 1, else 0. Summing these effectively counts the reference patterns
By the way, for future searching purposes, this type of query is called a PIVOT. Most databases have some proprietary syntax to carry out pivoting, but I tend to remember/utilize this pattern because it's a bit more flexible and is cross-db compatible
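If you want to try the conditional-aggregation pattern without touching the real table, a throwaway SQLite database is enough. This is only a sketch: the five reference values are invented around the two examples in the question, and SQLite has no LEFT(), so substr(reference, 1, 3) stands in for LEFT(reference, 3).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (reference TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [
        ("A02ACCVCOM18JS-FN1AA",),
        ("A02ACCVCOM18JS-SN1AA",),
        ("A03ACCEVEN18JS-SN1AA",),
        ("A03ACCEVEN18JS-JN1AA",),
        ("A03ACCEVEN17JS-SN1AA",),  # no "18", so the WHERE clause drops it
    ],
)

# One row per prefix, one counted column per letter after the hyphen.
rows = conn.execute(
    """
    SELECT substr(reference, 1, 3) AS prefix,
           SUM(CASE WHEN reference LIKE '%-S%' THEN 1 ELSE 0 END) AS s_count,
           SUM(CASE WHEN reference LIKE '%-F%' THEN 1 ELSE 0 END) AS f_count,
           SUM(CASE WHEN reference LIKE '%-J%' THEN 1 ELSE 0 END) AS j_count
    FROM mytable
    WHERE reference LIKE 'A0%18%'
    GROUP BY substr(reference, 1, 3)
    ORDER BY prefix
    """
).fetchall()
```

Each tuple in rows is (prefix, s_count, f_count, j_count), which is the horizontal-and-vertical layout asked for.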

Oracle SQL - Multiple return from case

I may be trying it wrong. I am looking for any approach which is best.
Requirement:
My Query joins 4-5 tables based on few fields.
I have a column called product id. In my table there are 1.5 million rows. Of those, only 10% of the rows have product ids with the following pattern
A300X-%
A500Y-%
300,500, 700 are valid model numbers. X and Y are classifications. My query picks all the systems.
I have a check as follows
CASE
WHEN PID LIKE 'A300X%'
THEN 'A300'
...
END AS MODEL
Similarly
CASE
WHEN PID LIKE 'A300X%'
THEN 'X'
...
END AS GENRE
I am looking for the best option from the below
How do I Combine both case statement and add another[third] case which will have these two cases. i.e
CASE
WHEN desc in ('AAA')
First Case
Second Case
ELSE
don't do anything for other systems
END
Is there any regex way of doing this? Before first - take the string. Look for X, Y and also 300,500,700.
Is there any other way of doing this? Or doing via code is the best way?
Any suggestions?
EDIT:
Sample desc:
AAA,
SoftwARE,
sw-app
My query picks all the desc. But the case should be running for AAA alone.
And Valid models are
A300X-2x-P
A500Y-5x-p
A700X-2x-p
A50CE-2x-P
I have to consider only 300,500,700. And the above two cases.
Expected result:
MODEL GENRE
A300 X
A500 Y
A300 Y
Q: How do I Combine both CASE statement expressions
Each CASE expression will return a single value. If the requirement is to return two separate columns in the resultset, that will require two separate expressions in the SELECT list.
For example:
DESC PID model_number genre
---- ---------- ------------ ------
AAA A300X-2x-P 300 X
AAA A500Y-5x-p 500 Y
AAA A700X-2x-p 700 X
AAA A50CE-2x-P (NULL) (NULL)
FOO A300X-2x-P (NULL) (NULL)
There will need to be an expression to return the model_number column, and a separate expression to return the genre column.
It's not possible for a single expression to return two separate columns.
Q: and add another[third] case which will have these two cases.
A CASE expression returns a value; we can use a CASE expression almost anywhere in a SQL statement where we can use a value, including within another CASE expression.
We can also combine multiple conditions in a WHEN test with AND and OR
As an example of combining conditions and nesting CASE expressions...
CASE
WHEN ( ( t.PID LIKE '_300%' OR t.PID LIKE '_500%' OR t.PID LIKE '_700%' )
AND ( t.DESC = 'AAA' )
)
THEN CASE
WHEN ( t.PID LIKE '____X%' )
THEN 'X'
WHEN ( t.PID LIKE '____Y%' )
THEN 'Y'
ELSE NULL
END
ELSE NULL
END AS genre
There are other expressions that will return an equivalent result; the example shown here isn't necessarily the best expression. It just serves as a demonstration of combining conditions and nesting CASE expressions.
Note that to return another column model we would need to include another expression in the SELECT list. Similar conditions will need to be repeated; it's not possible to reference the WHEN conditions in another CASE expression.
Based on your sample data, logic such as this would work:
(CASE WHEN REGEXP_LIKE(PID, '^A[0-9]{3}[A-Z]-')
THEN SUBSTR(PID, 1, 4)
ELSE PID
END) AS MODEL,
(CASE WHEN REGEXP_LIKE(PID, '^A[0-9]{3}[A-Z]-')
THEN SUBSTR(PID, 5, 1)
ELSE PID
END) AS GENRE
This assumes that the "model number" always starts with "A" and is followed by three digits (as in your example data). If the model number is more complicated, you may need regexp_substr() to extract the values you want.
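If it helps to prototype the pattern before writing the regexp_substr() version, the extraction can be sketched in Python. The assumptions here are mine: the pattern only accepts the model numbers 300, 500 and 700 and the classifications X and Y, the description must be exactly 'AAA', and the function name model_and_genre is made up for the example.

```python
import re

# Assumed shape: "A", a valid model number, one classification letter, "-".
PID_PATTERN = re.compile(r"^A(300|500|700)([XY])-")

def model_and_genre(desc, pid):
    """Return (model, genre), or (None, None) when the row does not qualify."""
    if desc != "AAA":
        return (None, None)  # other systems are left alone
    match = PID_PATTERN.match(pid)
    if match is None:
        return (None, None)  # e.g. A50CE-2x-P is not a valid model
    return ("A" + match.group(1), match.group(2))

# Sample product ids from the question.
pids = ["A300X-2x-P", "A500Y-5x-p", "A700X-2x-p", "A50CE-2x-P"]
results = [model_and_genre("AAA", pid) for pid in pids]
```

Both output columns come from one match here, but in SQL you would still need one expression per column, as the other answer explains.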

Count particular substring text within column

I have a Hive table, titled 'UK.Choices' with a column, titled 'Fruit', with each row as follows:
AppleBananaAppleOrangeOrangePears
BananaKiwiPlumAppleAppleOrange
KiwiKiwiOrangeGrapesAppleKiwi
etc.
etc.
There are 2.5M rows and the rows are much longer than the above.
I want to count the number of instances that the word 'Apple' appears.
For example above, it is:
Number of 'Apple'= 5
My sql so far is:
select 'Fruit' from UK.Choices
Then in chunks of 300,000 I copy and paste into Excel, where I'm more proficient and able to do this using formulas. The problem is, it takes up to an hour and a half to generate each chunk of 300,000 rows.
Anyone know a quicker way to do this bypassing Excel? I can do simple things like counts using where clauses, but something like the above is a little beyond me right now. Please help.
Thank you.
I think I am 2 years too late. But since I was looking for the same answer and I finally managed to solve it, I thought it was a good idea to post it here.
Here is how I do it.
Solution 1:
+-----------------------------------+---------------------------+-------------+-------------+
| Fruits | Transform 1 | Transform 2 | Final Count |
+-----------------------------------+---------------------------+-------------+-------------+
| AppleBananaAppleOrangeOrangePears | #Banana#OrangeOrangePears | ## | 2 |
| BananaKiwiPlumAppleAppleOrange | BananaKiwiPlum##Orange | ## | 2 |
| KiwiKiwiOrangeGrapesAppleKiwi | KiwiKiwiOrangeGrapes#Kiwi | # | 1 |
+-----------------------------------+---------------------------+-------------+-------------+
Here is the code for it:
SELECT length(regexp_replace(regexp_replace(fruits, "Apple", "#"), "[A-Za-z]", "")) as number_of_apples
FROM fruits;
You may have numbers or other special characters in your fruits column and you can just modify the second regexp to incorporate that. Just remember that in hive to escape a character you may need to use \\ instead of just one \.
Solution 2:
SELECT size(split(fruits,"Apple"))-1 as number_of_apples
FROM fruits;
This first splits the string using "Apple" as a separator, which produces an array. The size function then returns the size of that array. Note that the size of the array is one more than the number of separators.
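Both solutions are easy to sanity-check outside Hive. The Python sketch below mirrors them on the three sample rows from the question; the function names count_by_replace and count_by_split are just labels for this illustration, and Python's str.replace and str.split stand in for Hive's regexp_replace and split.

```python
import re

rows = [
    "AppleBananaAppleOrangeOrangePears",
    "BananaKiwiPlumAppleAppleOrange",
    "KiwiKiwiOrangeGrapesAppleKiwi",
]

def count_by_replace(text, word="Apple"):
    """Solution 1: mark each occurrence with '#', strip letters, measure."""
    marked = text.replace(word, "#")
    return len(re.sub(r"[A-Za-z]", "", marked))

def count_by_split(text, word="Apple"):
    """Solution 2: the pieces around the separator number one more than it."""
    return len(text.split(word)) - 1

per_row = [count_by_split(row) for row in rows]
total = sum(per_row)
```

Summing the per-row counts gives the overall number of 'Apple' occurrences the question asks for.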
This is straight-forward if you have any delimiter ( eg: comma ) between the fruit names. The idea is to split the column into an array, and explode the array into multiple rows using the 'explode' function.
SELECT fruit, count(1) as count FROM
( SELECT
explode(split(Fruit, ',')) as fruit
FROM UK.Choices ) X
GROUP BY fruit
From your example, it looks like fruits are delimited by Capital letters. One idea is to split the column based on capital letters, assuming there are no fruits with same suffix.
SELECT fruit_suffix, count(1) as count FROM
( SELECT
explode(split(Fruit, '[A-Z]')) as fruit_suffix
FROM UK.Choices ) X
WHERE fruit_suffix <> ''
GROUP BY fruit_suffix
The downside is that the output will not have the first letter of the fruit:
pple - 5
range - 4
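One possible way around the missing first letter, sketched in Python rather than Hive: instead of splitting on the capital letter, match each capital letter together with the lower-case letters that follow it. Note that this swaps the split for a regex find-all, and it assumes every fruit name is exactly one capital letter followed by lower-case letters.

```python
import re
from collections import Counter

rows = [
    "AppleBananaAppleOrangeOrangePears",
    "BananaKiwiPlumAppleAppleOrange",
    "KiwiKiwiOrangeGrapesAppleKiwi",
]

# Assumption: a fruit name is one capital letter plus trailing lower case.
counts = Counter()
for row in rows:
    counts.update(re.findall(r"[A-Z][a-z]*", row))
```

The counts mapping then holds the full fruit names, so the 5 and 4 above show up under Apple and Orange.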
I think you want to run in one select, and use the Hive if UDF to sum for the different cases. Something like the following...
select sum( if( fruit like '%Apple%' , 1, 0 ) ) as apple_count,
sum( if( fruit like '%Orange%', 1, 0 ) ) as orange_count
from UK.Choices
where ID > start and ID < end;
instead of a join in the above query.
No experience of Hive, I'm afraid, so this may or may not work. But on SQLServer, Oracle etc I'd do something like this:
Assuming that you have an int PK called ID on the row, something along the lines of:
select AppleCount, OrangeCount, AppleCount - OrangeCount score
from
(
select count(*) as AppleCount
from UK.Choices
where ID > start and ID < end
and Fruit like '%Apple%'
) a,
(
select count(*) as OrangeCount
from UK.Choices
where ID > start and ID < end
and Fruit like '%Orange%'
) o
I'd leave the division by the total count to the end, when you have all the rows in the spreadsheet and can count them there.
However, I'd urgently ask my boss to let me change the Fruit field to be a table with an FK to Choices and one fruit name per row. Unless this is something you can't do in Hive, this design is something that makes kittens cry.
PS I'd missed that you wanted the count of occurrences of Apple, which this won't do. I'm leaving my answer up, because I reckon that my However... para is actually a good answer. :(

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac") we sort its letters, look it up in our table of sorted letters joined to the table of the original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
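As a rough idea of what that pre-processing and lookup could look like, here is a small Python sketch. The names signature and build_index are made up for the example, and an in-memory dictionary stands in for the sorted_text table.

```python
from collections import defaultdict

def signature(text):
    """Letters of the text in sorted order, e.g. 'cat' -> 'act'."""
    return "".join(sorted(text.lower()))

def build_index(words):
    """Map each sorted-letter signature to the words that produce it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index

# A tiny stand-in for the real word list.
index = build_index(["act", "cat", "tom", "moot", "mote"])
matches = index.get(signature("tac"), [])
```

This finds words that use exactly the input letters, duplicates included, because the signature keeps repeated letters.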
If you use the solution you provide, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that the rows you retrieve are in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
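The same bitmask idea, sketched in Python for anyone who wants to check the values by hand. The function name letter_mask is an assumption for this example; like the query above, it ignores how many times a letter occurs.

```python
def letter_mask(word):
    """Sum a distinct power of two for each distinct letter a-z in the word."""
    mask = 0
    for ch in set(word.lower()):
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))  # a -> 1, b -> 2, c -> 4, ...
    return mask

# The same five words as in the query above.
words = ["act", "cat", "tom", "moot", "mote"]
masks = {word: letter_mask(word) for word in words}
```

Words that share the same set of letters end up with the same mask, regardless of letter order or repetition.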
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select wls.word, (LEN(wls.word) - SUM(LettersInWord))
from
(
-- REPLACE takes three arguments; replacing with '' removes the letter
select w.word, (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
from word w
cross join wordletters wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(LettersInWord))
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word, (LEN(wls.word) - SUM(LettersInWord))
from
(
select w.word,
(case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= maxcount
then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
else maxcount end) as LettersInWord
from word w
cross join
(
select letter, count(*) as maxcount
from wordletters wl
group by letter
) wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(LettersInWord))
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
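To see the capped-count logic in isolation, here is a short Python sketch of the "<= maxcount" variant. It is not the SQL itself: letters_in_word and can_make are hypothetical helper names, and the input letters are passed as a plain string instead of a wordletters table.

```python
from collections import Counter

def letters_in_word(word, letter):
    """The len - len(replace) trick: how often `letter` occurs in `word`."""
    return len(word) - len(word.replace(letter, ""))

def can_make(word, letters):
    """True if every letter of `word` is covered by the supplied letters."""
    available = Counter(letters)  # letter -> maxcount
    covered = sum(
        min(letters_in_word(word, letter), maxcount)
        for letter, maxcount in available.items()
    )
    return covered == len(word)

# Sample checks against the supplied letters "tac" and "tom".
checks = {
    "cat": can_make("cat", "tac"),
    "at": can_make("at", "tac"),
    "moot": can_make("moot", "tom"),
}
```

A word passes only when the capped counts add up to its full length, so "moot" fails when a single "o" is supplied.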
In my experience, I have actually seen decent performance with small cross joins. This might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.