Extracting number from a description filed for Hive

Extracting number from a description filed for Hive - sql

I am trying to extract numbers from a column which contains number and characters. They are however, structured hence I would like to know if we can just extract the number. I wonder if explode will work.
The current description column:
I need a help in setting up a campaign soon. Revenue: 1000
What I tried to do is to create a new column for that number called revenue.
My current command is:
SELECT description, X.value
FROM task
lateral view
explode(description) X as value

You could try using the Split function like this
SELECT
description,
split (description, ':\\s')[1] as Revenue
FROM task
Where :\\s is the regex pattern to match a colon followed by a space.
-------- EDIT: --------
If there are multiple : in the data then you could try (not sure if it will work though) the following (assuming that the last split will always contain the digits)
SELECT
description,
split (description, ':\\s')[size(split (description, ':\\s')) - 1] as Revenue
FROM task
Also your try of using Revenue\\s:\\s as the pattern may not be working due to the extra space matching try `Revenue:\s'
---------------------------
Or alternatively if the description doesn't always have the colon you could use the method regexp_extract(string subject, string pattern, int index)
Something like:
SELECT
description,
regexp_extract(description, '.*?(\d+)$', 1) as Revenue
FROM task
Where the regex pattern .*?(\\d+)$ will match multiple digits at the end of the description (but only if they are at the end)
With the latter option you should be able to find a suitable pattern if the description is not always consistent.

You can also use the following to remove any non-numeric characters:
select regexp_replace(description, '[^0-9]', '') as Revenue from task
This only works, though, if there is only a single number in the [description] field. If it's reliably formatted, using a more specific RegEx would likely be preferable.

Related

Query to retrieve only columns which have last name starting with 'K'

]
The name column has both first and last name in one column.
Look at the 'rep' column. How do I filter only those rep names where the last name is starting with 'K'?

The way that table is defined won't allow you to do that query reliably--particularly if you have international names in the table. Not all names come with the given name first and the family name second. In Asia, for example, it is quite common to write names in the opposite order.
That said, assuming you have all western names, it is possible to get the information you need--but your indexes won't be able to help you. It will be slower than if your data had been broken out properly.
SELECT rep,
RTRIM(LEFT(LTRIM(RIGHT(rep, LEN(rep) - CHARINDEX(' ', rep))), CHARINDEX(' ', LTRIM(RIGHT(rep, LEN(rep) - CHARINDEX(' ', rep)))) - 1)) as family_name
WHERE family_name LIKE 'K%'
So what's going on in that query is some string manipulation. The dialect up there is SQL Server, so you'll have to refer to your vendor's string manipulation function. This picks the second word, and assumes the family name is the second word.
LEFT(str, num) takes the number of characters calculated from the left of the string
RIGHT(str, num) takes the number of characters calculated from the right of the string
CHARINDEX(char, str) finds the first index of a character
So you are getting the RIGHT side of the string where the count is the length of the string minus the first instance of a space character. Then we are getting the LEFT side of the remaining string the same way. Essentially if you had a name with 3 parts, this will always pick the second one.
You could probably do this with SUBSTRING(str, start, end), but you do need to calculate where that is precisely, using only the string itself.
Hopefully you can see where there are all kinds of edge cases where this could fail:
There are a couple records with a middle name
The family name is recorded first
Some records have a title (Mr., Lord, Dr.)
It would be better if you could separate the name into different columns and then the query would be trivial--and you have the benefit of your indexes as well.
Your other option is to create a stored procedure, and do the calculations a bit more precisely and in a way that is easier to read.

Assuming that the name is <firstname> <lastname> you can use:
where rep like '% K%'

How run Select Query with LIKE on thousands of rows

Newbie here. Been searching for hours now but I can seem to find the correct answer or properly phrase my search.
I have thousands of rows (orderids) that I want to put on an IN function, I have to run a LIKE at the same time on these values since the columns contains json and there's no dedicated table that only has the order_id value. I am running the query in BigQuery.
Sample Input:
ORD12345
ORD54376
Table I'm trying to Query: transactions_table
Query:
SELECT order_id, transaction_uuid,client_name
FROM transactions_table
WHERE JSON_VALUE(transactions_table,'$.ordernum') LIKE IN ('%ORD12345%','%ORD54376%')
Just doesn't work especially if I have thousands of rows.
Also, how do I add the order id that I am querying so that it appears under an order_id column in the query result?
Desired Output:

Option one
WITH transf as (Select order_id, transaction_uuid,client_name , JSON_VALUE(transactions_table,'$.ordernum') as o_num from transactions_table)
Select * from transf where o_num like '%ORD12345%' or o_num like '%ORD54376%'
Option two
split o_num by "-" as separator , create table of orders like (select 'ORD12345' as num
Union
Select 'ORD54376' aa num) and inner join it with transf.o_num

One method uses OR:
WHERE JSON_VALUE(transactions_table, '$.ordernum') LIKE IN '%ORD12345%' OR
JSON_VALUE(transactions_table, '$.ordernum') LIKE '%ORD54376%'
An alternative method uses regular expressions:
WHERE REGEXP_CONTAINS(JSON_VALUE(transactions_table, '$.ordernum'), 'ORD12345|ORD54376')

According to the documentation, here, the LIKE operator works as described:
Checks if the STRING in the first operand X matches a pattern
specified by the second operand Y. Expressions can contain these
characters:
A percent sign "%" matches any number of characters or
bytes.
An underscore "_" matches a single character or byte.
You can escape "\", "_", or "%" using two backslashes. For example, "\%". If
you are using raw strings, only a single backslash is required. For
example, r"\%".
Thus , the syntax would be like the following:
SELECT
order_id,
transaction_uuid,
client_name
FROM
transactions_table
WHERE
JSON_VALUE(transactions_table,
'$.ordernum') LIKE '%ORD12345%'
OR JSON_VALUE(transactions_table,
'$.ordernum') LIKE '%ORD54376%
Notice that we specify two conditions connected with the OR logical operator.
As a bonus information, when querying large datasets it is a good pratice to select only the columns you desire in your out output ( either in a Temp Table or final view) instead of using *, because BigQuery is columnar, one of the reasons it is faster.
As an alternative for using LIKE, you can use REGEXP_CONTAINS, according to the documentation:
Returns TRUE if value is a partial match for the regular expression, regex.
Using the following syntax:
REGEXP_CONTAINS(value, regex)
However, it will also work if instead of a regex expression you use a STRING between single/double quotes. In addition, you can use the pipe operator (|) to allow the searched components to be logically ordered, when you have more than expression to search, as follows:
where regexp_contains(email,"gary|test")
I hope if helps.

how to query some characters in one field?

My database is Oracle 11g.
I want to do a in query in sql. The query criteria is to matched some of the characters of a field:
description
CWLV321900017391;EFHU3832239
CWLV321900017491;ERHU3832239
CWLV321900017591;ERHU3832239
CWLV321900017691;ERHU3832239
My query is like this:
select * from product where description in ('CWLV321900017391', 'CWLV321900017491');
It returns no records in result.
I expect the result like below:
description
CWLV321900017391;EFHU3832239
CWLV321900017491;ERHU3832239
How to get it by SQL?
thanks.

You are using IN to search for a partial string match in the description table. This is not going to return the results, as IN will only match exact values.
Instead, one way to achieve this is to use a LIKE operator with %:
select *
from product
where (description LIKE 'CWLV321900017391%' OR
description LIKE 'CWLV321900017491%');
The % at the end indicates that anything can follow after the specified text.
This will return any description that starts with CWLV321900017391 or CWLV321900017491.
Incidentally, if your search term occurs anywhere in the description field, you will need to use a % at each end of the search term:
description LIKE '%CWLV321900017391%' OR description LIKE '%CWLV321900017491%'

There are various ways to solve it. Here is one:
select * from product
where instr (description, 'CWLV321900017391') > 0
or instr (description, 'CWLV321900017491') > 0;
If you know you're always searching for the start of the description you can use:
select * from product
where substr (description, 1, 16) in ('CWLV321900017391','CWLV321900017491')
Also, there's LIKE or REGEX_LIKE solution. It depends really on what strings you're searching for.
Of course none of these solutions is truly satisfactory, and for large volumes of data may exhibit suck-y performance. The problem is the starting data model violates First Normal Form by storing non-atomic values. Poor data models engender clunky SQL.

You can try below way -
select * from product
where substr(description,1,instr(description,';',1,1)-1) in ('CWLV321900017391', 'CWLV321900017491')

I am opposed to storing multiple values in a delimited string format. There are many better alternatives, but let me assume that you don't have a choice on the data model.
If you are stuck with strings in this format, you should take the delimiters into account. You can do this using like:
where ';' || description || ';' like '%;CWLV321900017391;%'
You can also do something similar with regexp_like() if you want to look for one of several values:
where regexp_like(';' || description || ';',
'(;CWLV321900017391;)|(;CWLV321900017391;)'
)

Returning text within a string in Oracle SQL

I have string data in a column of a table that will contain a monetary amount in it somewhere.
E.G. the column may contain something like:
"Dave once paid £50.00 to a lottery syndicate"
"Total Investment Returns for the fund in 2017 came to £150,964.39"
How can I search for the occurrence of the '£' sign and then return the number that occurs after it?
Thanks

Here is one way. The search expression is a bit complicated because it must allow for thousand separators and decimal points, all optional. It assumes "Western" usage of thousands separators - it would have to be modified slightly to allow for Lakh (Indian) notation, for example. It will produce NULL when there is no pound sign, or if there is a pound sign not immediately followed by at least one digit. (So it will have to be modified slightly if you allow things like £.60 instead of £0.60.) You can also capture just the amount (without the currency symbol) if desired - that is also a slight modification to the use of REGEXP_SUBSTR (use capture groups).
The biggest change would be needed if you may have more than one amount per input row.
with
inputs ( str ) as (
select 'Dave once paid £50.00 to a lottery syndicate.' from dual union all
select 'Total Returns in 2017 came to £150,964.39.' from dual
)
-- End of simulated inputs (for testing purposes only, not part of the solution).
-- Use your actual table and column names in the SQL query below.
select str, regexp_substr(str, '£\d{1,3}(,?\d{3})*(\.\d+)?') as amount
from inputs
;
STR AMOUNT
--------------------------------------------- -----------
Dave once paid £50.00 to a lottery syndicate. £50.00
Total Returns in 2017 came to £150,964.39. £150,964.39
Edit
In a Comment below, the OP asked how to obtain just the amount, without the currency symbol. The easiest way is to use capture groups directly in the REGEXP_SUBSTR() function. The version below uses all six arguments to the function: as before the first is the input string and the second is the search pattern. The third and forth are the starting position and the occurrence (both always equal to 1 for this problem). The fifth, NULL, is for some special options we don't need. The sixth argument is relevant: 1 means return the first capture group, i.e. the part of the search pattern included in the first pair of matching parentheses (counting from left to right). Notice the additional pair of parentheses in the search pattern, to isolate the amount from the pound symbol:
regexp_substr(str, '£(\d{1,3}(,?\d{3})*(\.\d+)?)', 1, 1, null, 1)
Edit #2
To extract the amount in NUMBER data type, it is not necessary to remove the pound sign; the TO_NUMBER() function can handle it. Instead, the substring that is just the pound sign followed by the amount must be wrapped within TO_NUMBER(), using the proper format model and explicit currency symbol:
to_number(regexp_substr(str, '£\d{1,3}(,?\d{3})*(\.\d+)?'),
'L999,999,999,999,999.000000', 'nls_currency=£')
Just make sure to include enough digits on the right of the decimal point to accommodate all possible amounts. (Too many digits in the format model is never a problem.)

Comparing fields when a field has data in between 2 characters that match the field being compared

I have code that looks like this:
left outer join
gme_batch_header bh
on
substr(ln.lot_number,instr(ln.lot_number,'(') + 1,
instr(ln.lot_number,')') - instr(ln.lot_number,'(') - 1)
=
bh.batch_no
It works fine, but I have come across a few lot numbers that have two sections of strings that are between parenthesis. How would I compare what is between the second set of parenthesis? Here is an example of the data in the lot number field:
E142059-307-SCRAP-(74055)
This one works with the code,
58LF-3-B-2-2-2 (SCRAP)-(61448)
This one tries comparing SCRAP with the batch no, which isn't correct. It needs to be the 61448.
The result is always the last item in parenthesis.

After more research, I actually got it to work with this code:
substr(ln.lot_number,instr(ln.lot_number,'(',-1) + 1, instr(ln.lot_number,')',-1) - instr(ln.lot_number,'(',-1) - 1)

Assuming SQL2005+, and it is always the last occurrence you want, then I would suggest finding the last instance of a ( in your query and substring to there. To get the last instance you could use something like:
REVERSE(SUBSTRING(REVERSE(lot_number),0,CHARINDEX('(',REVERSE(lot_number))))

If your version of Oracle supports regular expressions try this:
substr(regexp_substr(ln.lot_number,'[0-9]+\)$'),1,length(regexp_substr(ln.lot_number,'[0-9]+\)$'))-1)
Explanation:
regexp_substr(scrap_row,'[0-9]+\)$' ==> find me just numbers in the string that ends in ). This returns the numbers but it includes the closing parenthesis.
To remove the closing parenthsis, just send it through substring and extract first number through the length of the number stopping at 1 character from the end of the string.
Query for analysis:
with scrap
as (select '58LF-3-B-2-2-2 (SCRAP)-(61448)' as scrap_row from dual)
select scrap_row,
regexp_substr(scrap_row,'[0-9]+\)$') as regex_substring,
length(regexp_substr(scrap_row,'[0-9]+\)$')) as length_regex_substring,
substr(regexp_substr(scrap_row,'[0-9]+\)$'),1,length(regexp_substr(scrap_row,'[0-9]+\)$'))-1) as regex_sans_parenthesis
from scrap

If you have 11g, this will do it pretty simply by using the subgroup argument of regexp_substr() and constructing the regex appropriately:
SQL> with tbl(data) as
(
select 'E142059-307-SCRAP-(74055)' from dual
union
select '58LF-3-B-2-2-2 (SCRAP)-(61448)' from dual
)
select data from tbl
where regexp_substr(data, '\((\d+)\)$', 1, 1, NULL, 1)
= '61448';
DATA
------------------------------
58LF-3-B-2-2-2 (SCRAP)-(61448)
The regular expression can be read as:
\( - Search for a literal left paren
( - Start a remembered subgroup
\d+ - followed by 1 more more digits
) - End remembered subgroup
\) - followed by a literal right paren
$ - at the end of the line.
The regexp_substr function arguments are:
Source - the source string
Pattern - The regex pattern to look for
position - Position in the string to start looking for the pattern
occurrence - If the pattern occurs multiple times, which occurrence you want
match_params - See the docs, not used here
subexpression - which subexpression to use (the remembered group)
So in English, look for a series of 1 or more digits surrounded by parens, where it occurs at the end of the line and save the digit part only to use to compare. IMHO a lot easier to follow/maintain than nested instr(), substr().
For re-useability, make a function called get_last_number_in_parens() that contains this code and uses an argument of the string to search. This way that logic is encapsulated and can be re-used by folks that may not be so comfortable with regular expressions, but can benefit from the power! One place to maintain code too. Then call like this:
select data from tbl
where get_last_number_in_parens(data) = '61448';
How easy is that?!

Hello you can check with this code. It works whaever the condition may be
SELECT SUBSTR('58LF-3-B-2-2-2-(61448)',instr('58LF-3-B-2-2-2-(61448)','(',-1)+1,LENGTH('58LF-3-B-2-2-2-(61448)')-instr('58LF-3-B-2-2-2-(61448)','(',-1)-1)
FROM dual;
SELECT SUBSTR('58LF-3-B-2-2-2 (SCRAP)-(61448)',instr('58LF-3-B-2-2-2 (SCRAP)-(61448)','(',-1)+1,LENGTH('58LF-3-B-2-2-2 (SCRAP)-(61448)')-instr('58LF-3-B-2-2-2 (SCRAP)-(61448)','(',-1)-1)
FROM dual;
Output
==================================
61448
==================================

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas