Working with Strings to convert to numeric value with range - sql

I am working with the Texas business email dataset and I want to target all companies that have 25 to 300 employees. The schema currently stores the employee count as a string, with values like Employee_count: "25 to 300", "1 to 100", etc., and others that are simply a single number, like Employee_Count: "10", "3,000", etc. Is there a way for me to first parse the string so that it converts both numbers into a numeric range, or at least get the larger of the two numbers, so that I can grab companies by employee count ranges?
I tried using CAST, JSON_FUNCTIONS, etc., but I am also fairly new to SQL, so any tips would be greatly appreciated.
The end result I'm trying to get is a list of employers with 25 to 300 and 301 to 1,000 employees.

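You want to split the string at the "to" position, trim all spaces, and remove all commas. Since the cleanup is used twice, we create a temporary function. For a single value such as "10" there is no second part, so Employee_count_high comes back as NULL.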
create temp function help_parse(str string) as (
safe_cast(replace(trim(str),",","") as int64)
);
with tbl as (Select * from unnest(["25 to 300","1 to 100" ,"10","3,000","1200"]) Employee_count)
select * ,
help_parse(split(Employee_count,"to")[safe_offset(0)]) as Employee_count_low,
help_parse(split(Employee_count,"to")[safe_offset(1)]) as Employee_count_high,
from tbl
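The same parsing steps can be sketched outside SQL to check the logic against the sample values. This is a minimal Python sketch, not part of the original answer: `parse_range` and `upper_bound` are hypothetical helper names, and `upper_bound` assumes that a single value such as "10" should count as its own upper bound.

```python
from typing import Optional, Tuple


def help_parse(text: Optional[str]) -> Optional[int]:
    """Mirror of the SQL helper: trim, drop commas, None if not an integer."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None


def parse_range(employee_count: str) -> Tuple[Optional[int], Optional[int]]:
    """Split on "to" and parse both sides; a single value has no high bound."""
    parts = employee_count.split("to")
    low = help_parse(parts[0])
    high = help_parse(parts[1]) if len(parts) > 1 else None
    return low, high


def upper_bound(employee_count: str) -> Optional[int]:
    """Larger of the two numbers, falling back to the single value (assumption)."""
    low, high = parse_range(employee_count)
    return high if high is not None else low
```

With these helpers, the "25 to 300" bucket could be selected by keeping rows whose `upper_bound` lies between 25 and 300. In the query itself, a comparable fallback would probably be a COALESCE of the high value with the low value.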

Related

How to filter based on string in hive

So, I have a table of following schema
user_id int,
movie_id int,
score float,
demography string
The demography column is a comma-delimited string
like 'm,22,ca,.....'. It can have a variable number of elements.
Now, I want to filter the records based on certain characteristics,
for example if demography is "m" or is from "ca", etc.
Currently I split the string into an array (split(table.demography, "\\,")), explode it, and filter with a where clause:
Where exploded_demography = 'm' or exploded_demography='ca' (etc.)
But explode causes the records to, well, explode. I am trying to avoid that, as it seems to bloat the number of records.
Is there a way I can do this without exploding the records?
Try using:
find_in_set('ca', table.demography) > 0
From: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
int find_in_set(string str, string strList) Returns the first
occurrence of str in strList where strList is a comma-delimited string.
Returns null if either argument is null. Returns 0 if the first
argument contains any commas. For example, find_in_set('ab',
'abc,b,ab,c,def') returns 3.
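To make the quoted semantics concrete, here is a small Python sketch of the behaviour described above. It is an illustration rather than Hive's implementation, and it assumes that a value missing from the list yields 0, which is what the `> 0` test in the suggested filter relies on.

```python
from typing import Optional


def find_in_set(needle: Optional[str], str_list: Optional[str]) -> Optional[int]:
    """1-based position of needle among the comma-separated items of str_list."""
    # NULL in, NULL out
    if needle is None or str_list is None:
        return None
    # A needle containing a comma can never match a single item
    if "," in needle:
        return 0
    for position, item in enumerate(str_list.split(","), start=1):
        if item == needle:  # whole-item match, not a substring match
            return position
    return 0  # assumption: not found is reported as 0
```

Because items are compared whole, searching for 'c' does not match the element 'ca', which is the property that makes this safer than a plain substring search on the demography column.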

SQL Server: How to select rows which contain value comprising of only one digit

I am trying to write a SQL query that only returns rows where a specific column (let's say the 'amount' column) contains numbers consisting of a single repeated digit, e.g. only '1's (1111111...) or only '2's (2222222...), etc.
In addition, the 'amount' column contains numbers with decimal points, and these values should also be returned, e.g. 1111.11, 2222.22, etc.
If you want to make the query generic, so that you don't have to specify each possible digit, you could change the WHERE clause to the following:
WHERE LEN(REPLACE(REPLACE(amount,LEFT(amount,1),''),'.','')) = 0
This will always use the first digit as comparison for the rest of the string
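The idea behind that condition (strip every copy of the first character, strip the decimal point, and check that nothing is left) can be sketched in Python as follows. The function name is hypothetical, and treating an empty string as a non-match is my own assumption rather than something the SQL version does.

```python
def is_single_repeated_digit(amount: str) -> bool:
    """True if amount uses only its first character plus optional '.' characters."""
    if not amount:
        return False  # assumption: an empty value should not count as a match
    # Remove every occurrence of the first character, then the decimal point
    remainder = amount.replace(amount[0], "").replace(".", "")
    return len(remainder) == 0
```

Like the SQL expression, this only compares against the first character, so a signed value such as "-111" would not match, since the first character is then the minus sign rather than a digit.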
If you are using SQL Server, then you can try this script:
SELECT *
FROM (
SELECT CAST(amount AS VARCHAR(30)) AS amount
FROM TableName
)t
WHERE LEN(REPLACE(REPLACE(amount,'1',''),'.','')) = 0 OR
LEN(REPLACE(REPLACE(amount,'2',''),'.','')) = 0
I tried it like this (replace 1111111 with the column name):
Select replace(Str(1111111, 12, 2),0,left(11111,1))

SQL: Replacing dates contained within a text string

I am using SQL Server Management Studio 2012. I work with medical records and need to de-identify reports. The reports are structured in a table with columns Report_Date, Report_Subject, Report_Text, etc... The string I need to update is in report_text and there are ~700,000 records.
So if I have:
"patient had an EKG on 04/09/2012"
I need to replace that with:
"patient had an EKG on [DEIDENTIFIED]"
I tried
UPDATE table
SET Report_Text = REPLACE(Report_Text, '____/___/____', '[DEIDENTIFIED]')
because I need to replace anything in there that looks like a date. It runs but doesn't actually replace anything, apparently because REPLACE does not support the _ wildcard.
Any recommendations on this? Thanks in advance!
You can use PATINDEX to find the location of a date and then use SUBSTRING and REPLACE to replace it.
Since there may be multiple dates in the text, you have to run a WHILE loop to replace all of them.
The SQL below works for all dates of the form MM/DD/YYYY:
WHILE EXISTS( SELECT 1 FROM dbo.MyTable WHERE PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%',Report_Text) > 0 )
BEGIN
UPDATE t
SET Report_Text = REPLACE(Report_Text, DateToBeReplaced, '[DEIDENTIFIED]')
FROM ( SELECT * ,
SUBSTRING(Report_Text,PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%',Report_Text), 10) AS DateToBeReplaced
FROM dbo.MyTable AS a
WHERE PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%',Report_Text) > 0
) AS t
END
I have tested the above SQL on a dummy table with a few rows. I don't know how it will scale for your data, but I recommend giving it a try.
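For testing the replacement logic on sample strings before running it against ~700,000 rows, the loop can be mirrored in Python. This is only a sketch of the same find-then-replace idea: it uses a regular expression in place of PATINDEX, and the names `DATE_PATTERN`, `PLACEHOLDER` and `deidentify_dates` are made up for the example.

```python
import re

# Same shape as the PATINDEX pattern: two digits, slash, two digits, slash, four digits
DATE_PATTERN = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")
PLACEHOLDER = "[DEIDENTIFIED]"


def deidentify_dates(report_text: str) -> str:
    """Repeatedly find the first date-shaped substring and replace all its copies."""
    while True:
        match = DATE_PATTERN.search(report_text)
        if match is None:
            return report_text
        date_to_be_replaced = match.group(0)
        # Replaces every occurrence of this exact date, like REPLACE in the UPDATE
        report_text = report_text.replace(date_to_be_replaced, PLACEHOLDER)
```

The loop ends because the placeholder contains no digits, so each pass removes at least one match. As with the SQL version, the pattern only checks the shape of the text: "99/99/9999" would be replaced, while "4/9/2012" would be missed.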
To keep it simple, assume that a number represents an identifying element in the string so look for the position of the first number in the string and the position of the last number in the string. Not sure if this will apply to your entire set of records but here is the code ...
I created two test strings ... the one you supplied and one with the date at the beginning of the string.
Declare @tstString varchar(100)
-- The second Set overrides the first; comment one out to test the other string
Set @tstString = 'patient had an EKG on 04/09/2012'
Set @tstString = '04/09/2012 EKG for patient'
Select @tstString
-- Calculate 1st Occurrence of a Number
,PATINDEX('%[0-9]%',@tstString)
-- Calculate last Occurrence of a Number
,LEN(@tstString) - PATINDEX('%[0-9]%',REVERSE(@tstString))
,CASE
-- No numbers in the string, return the string
WHEN PATINDEX('%[0-9]%',@tstString) = 0 THEN @tstString
-- Number is the first character, so find the last digit position and remove the front
WHEN PATINDEX('%[0-9]%',@tstString) = 1 THEN
CONCAT('[DEIDENTIFIED]',SUBSTRING(@tstString, LEN(@tstString)-PATINDEX('%[0-9]%',REVERSE(@tstString))+2,LEN(@tstString)))
-- Just select string up to the first number
ELSE CONCAT(SUBSTRING(@tstString,1,PATINDEX('%[0-9]%',@tstString)-1),'[DEIDENTIFIED]')
END AS 'newString'
As you can see, this is messy in SQL.
I would rather achieve this with a parser service and move the data with SSIS and call the service.

Searching a varchar field for numeric values of a certain range

Using Oracle SQL Developer v3.2.20.09:
I have a table of data where, among all the other data, I have a column of all numeric results (for examples' sake, RESULT_NUM) but due to the way it is stored and used, is a varchar2 field. I need to pull all the records for the Body Temperature Codes (BT, TEMP, TEMPERATURE in the VT_CODE field) where result_num > 100 (everything is in Fahrenheit, so searching on the result alone will work).
So, my simple statement is:
Select * from VITALS where VT_CODE in ('BT', 'TEMP', 'TEMPERATURE');
This kicks back all of the body temp records, which is over 2M. Now I need to refine it to get results that are over 100.
When I try to add "and result_num > 100" I get an error because it is a varchar field I am trying to search with a number.
When I try to add "and result_num > '100'", it executes without error because it is a character value, but it returns everything greater than 1, not 100, which is everything, obviously.
Please help.
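Try the following (note that this form is SQL Server syntax; in Oracle, CONVERT is a character-set function, so it will not work there):
and convert(int,result_num) > 100
or, which does work in Oracle:
and CAST(result_num as INTEGER) > 100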
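The difference between comparing the values as text and comparing them as numbers can be reproduced in a few lines of Python. The readings below are invented sample data, and `to_number` and `is_over` are hypothetical helpers that skip values which are not numeric.

```python
from typing import Optional


def to_number(result_num: str) -> Optional[float]:
    """Convert a text reading to a number, or None if it is not numeric."""
    try:
        return float(result_num.strip())
    except ValueError:
        return None


def is_over(result_num: str, threshold: float = 100.0) -> bool:
    """Numeric comparison; non-numeric readings never match."""
    value = to_number(result_num)
    return value is not None and value > threshold


# Invented sample readings stored as text, as in the varchar2 column
readings = ["98.6", "100.4", "101", "99", "n/a"]

# Character-by-character comparison: '9' sorts after '1', so "98.6" > "100"
as_text = [r for r in readings if r > "100"]

# Numeric comparison keeps only the readings that are really above 100
as_number = [r for r in readings if is_over(r)]
```

Here `as_text` ends up containing every reading, while `as_number` contains only "100.4" and "101". Two things worth checking before relying on a cast in the query: a non-numeric value in the column would typically make the conversion fail, and, as far as I know, an integer cast rounds 100.4 down to 100, so a conversion that keeps decimals (such as TO_NUMBER in Oracle) may be the safer choice for temperatures.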

select using wildcard to find ending in two character then numeric

I am querying to find things ending in "ST" followed by a number 1 - 999.
SELECT NUMBER WHERE NUMBER LIKE '%ST' -- works correctly to return everything ending in "ST"
SELECT NUMBER WHERE NUMBER LIKE '%[1-999]' -- works correctly to return everything ending in 1 - 999
SELECT NUMBER WHERE NUMBER LIKE '%ST[1-999]' -- doesn't work - returns nothing
Also tried:
SELECT NUMBER WHERE NUMBER LIKE '%ST%[1-999]' -- works, but also returns things like "GRASTNT3" that have extra things between the "ST" and the number
Can anyone help this struggling beginner?
Thanks!
The problem is that [1-999] doesn't mean what you think it does.
SQL Server interprets that as a set of values (1-9, 9, 9) which basically means that if there's more than 1 digit after the ST, the entry won't be returned.
So far as I can tell, your best bet is:
SELECT NUMBER WHERE
NUMBER LIKE '%ST[1-9][0-9][0-9]' OR
NUMBER LIKE '%ST[1-9][0-9]' OR
NUMBER LIKE '%ST[1-9]'
(assuming that your numbers don't have leading zeros; if they do, change each [1-9] to [0-9])
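The three LIKE patterns together mean "ends in ST followed by one to three digits, the first of which is not zero". As a quick way to test sample values, the same rule can be written as a single regular expression in Python. This is a sketch for checking the logic only: the names are made up, and unlike LIKE under a case-insensitive collation, the regex is case-sensitive.

```python
import re

# ST, then a digit 1-9, then up to two more digits, then end of string
ENDS_WITH_ST_NUMBER = re.compile(r"ST[1-9][0-9]{0,2}\Z")

# What '%ST[1-999]' effectively asks for: ST plus exactly one character from 1-9
ASKER_PATTERN = re.compile(r"ST[1-9]\Z")


def ends_with_st_number(value: str) -> bool:
    """True if value ends in ST followed by a number from 1 to 999."""
    return ENDS_WITH_ST_NUMBER.search(value) is not None


def matches_asker_pattern(value: str) -> bool:
    """True only when a single digit 1-9 follows the final ST."""
    return ASKER_PATTERN.search(value) is not None
```

For example, "MAINST42" passes `ends_with_st_number` but fails `matches_asker_pattern`, which is the behaviour the question ran into, and "GRASTNT3" fails both because other characters sit between the "ST" and the digit.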
You need to do
SELECT NUMBER WHERE
NUMBER LIKE '%ST[1-9][0-9][0-9]'
OR NUMBER LIKE '%ST[1-9][0-9]'
OR NUMBER LIKE '%ST[1-9]';
The group inside the [] is a set of single characters (Char/NChar), not an Int range.
Better still normalise and type your data, so you have an ST bit and an int column for the number.
If you find you need to define different filters on variable string data, consider Full Text Searching or another Lucene related technology depending on your RDBMS.