SQL: How to find similar strings in a tuple

I tried to use difflib's get_close_matches on tuple data, but it does not work. I have used difflib on a JSON file before, but I couldn't make it work against SQL data. Expected result: I want to find words similar to the given input, even if there is a spelling mistake. For example, if the input is treeeee or TREEEEE or Treeea, my program should return the nearest match, which is tree, similar to the "Did you mean?" feature in Google. I also tried SELECT * FROM Dictionary WHERE Expression LIKE '%s, but the problem persists. Please help me solve this. Thanks in advance.

The SQL functions SOUNDEX and DIFFERENCE look like the closest fit.
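If the fuzzy matching can be done on the application side instead, here is a minimal sketch using Python's difflib against an in-memory stand-in for the Dictionary table from the question (the cutoff value is an assumption to tune):
import sqlite3
from difflib import get_close_matches

# In-memory stand-in for the Dictionary table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dictionary (Expression TEXT)")
conn.executemany("INSERT INTO Dictionary VALUES (?)",
                 [("tree",), ("three",), ("free",), ("car",)])

words = [row[0] for row in conn.execute("SELECT Expression FROM Dictionary")]

def did_you_mean(user_input, candidates, cutoff=0.6):
    # Case-insensitive fuzzy match; returns up to 3 closest candidates, best first.
    return get_close_matches(user_input.lower(),
                             [w.lower() for w in candidates],
                             n=3, cutoff=cutoff)

print(did_you_mean("Treeea", words))  # 'tree' comes back as the closest match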

Related

How to use regexp_contains for the similar first word

I have data with entrance_page_name values:
/search?q=
/search?
/search?ast
Can I get the data with the similar first word? This is what I tried:
WHEN REGEXP_CONTAINS(entrance_page_name, '^/search/q=') THEN 'search?q='
WHEN REGEXP_CONTAINS(entrance_page_name, '^search?') THEN 'search?'
But it doesn't really work. Any assistance with this would be greatly appreciated!
Thank you
You can use a raw string and an escape character to escape the ? symbol.
SELECT CASE WHEN REGEXP_CONTAINS('/search?q=SDbmoLZK89s', r'^/search\?q=') THEN 'search?q=' END as test
The above code should ideally work in your situation.

Not Like in Teradata

I am new to Teradata and trying to figure out how to do a NOT LIKE statement with multiple wildcards. I've tried several different ways, but haven't found a way that works. Most recently I've tried the code below.
WHERE DIAG_CD NOT IN ALL ('S060%','S340%')
Any help you all can provide would be much appreciated.
Thanks!
You are on the right track. You can use the ANY / ALL quantifiers with LIKE or NOT LIKE.
WHERE DIAG_CD NOT LIKE ALL ('S060%','S340%')
or
WHERE NOT (DIAG_CD LIKE ANY ('S060%','S340%'))
IN does not support wildcards. You need to repeat the conditions:
where diag_cd not like 'S060%' and diag_cd not like 'S340%'
Or you can do regex matching instead: ^ represents the beginning of the string, and | stands for or. This syntax is easier to extend with more string patterns.
where not regexp_like(diag_cd, '(^S060)|(^S340)')

CountVectorizer method get_feature_names() produces codes but not words

I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generates, but instead I get a list of codes, not words. What does this mean, and how can I deal with this problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE:
I managed to deal with this problem, but now when I look at my words I see many tokens that are not actually words, just senseless sets of letters (see screenshot attached). Does anybody know how to filter these words out before I use CountVectorizer?
You are using min_df=1, which keeps every word that appears in at least one document, i.e. all of the words. min_df can be treated as a hyperparameter in its own right: raising it removes terms that appear in too few documents. I would recommend using spaCy to tokenize the words and join them back into strings before giving them as input to CountVectorizer.
Note: the feature names that you see are actually part of your vocabulary; they are just noise. If you want to remove them, set min_df > 1.
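As a rough sketch of that tuning, using a toy stand-in for df['message_encoding'] (the min_df value and the alphabetic token_pattern are illustrative choices, not something either answer prescribes):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for df['message_encoding'] from the question.
df = pd.DataFrame({'message_encoding': [
    "copy 00001 of the tree report",
    "the tree report was sent twice",
    "code 000044392000001 attached to the tree report",
]})

# min_df=2 drops tokens that appear in fewer than 2 documents;
# the token_pattern keeps only alphabetic tokens, filtering out numeric "codes".
vectorizer = CountVectorizer(min_df=2,
                             stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X = vectorizer.fit_transform(df['message_encoding'])
# ['report', 'tree']; use get_feature_names_out() on scikit-learn >= 1.2
print(vectorizer.get_feature_names())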
Here is what you can do to get exactly what you want:
vectorizer = CountVectorizer()
vectorizer.fit_transform(df['message_encoding'])
feat_dict = vectorizer.vocabulary_.keys()
Instead of vectorizer.get_feature_names(), you can call vectorizer.vocabulary_.keys() to get the words.

SSRS if field value in list

I've looked through a number of tutorials and asks, and haven't found a working solution to my problem.
Suppose my dataset has two columns: sort_order and field_value. sort_order is an integer and field_value is a numeric(10,2).
I want to format some rows as #,#0 and others as #,#0.00.
Normally I would just do
iif( fields!sort_order.value = 1 or fields!sort_order.value = 23 or .....
unfortunately, the list is fairly long.
I'd like to do the equivalent of if fields!sort_order.value in (1,2,21,63,78,...) then...)
As recommended in another post, I tried the following (if sort_order is in the list, output 0, else 1; this is just to test the functionality of the IN operator):
=iif( fields!sort_order.Value IN split("1,2,3,4,5,6,8,10,11,15,16,17,18,19,20,21,26,30,31,33,34,36,37,38,41,42,44,45,46,49,50,52,53,54,57,58,59,62,63,64,67,68,70,71,75,76,77,80,81,82,92,98,99,113,115,116,120,122,123,127,130,134,136,137,143,144,146,147,148,149,154,155,156,157,162,163,164,165,170,171,172,173,183,184,185,186,192,193,194,195,201,202,203,204,210,211,212,213,263",","),0,1)
However, it doesn't look like the SSRS expression editor wants to accept the "IN" operator, which is strange, because all the examples I've found that solve this problem use it.
Any advice?
Try using the IndexOf function:
=IIF(Array.IndexOf(split("1,2,3,4,...",","),fields!sort_order.Value)>-1,0,1)
Note all values must be inside quotations.
Considering the recommendation of #Jakub, I recommend this solution if you are feeding your report via a stored procedure and you can't touch it.
Let me know if this helps.

What is the correct name for this data format?

I am a perfectionist and need a good name for a function that parses data that has this type of format:
userID:12,year:2010,active:1
Perhaps
parse_meta_data()
I'm not sure what the correct name for this type of data format is. Please advise! Thanks for your time.
parse_dict or parse_map
Except for the lack of braces and the quotes around the keys, it looks like either JSON or a Python dict.
parse_tagged_csv()
parse_csv()
parse_structured_csv()
parse_csv_with_attributes()
parse_csvattr()
If it’s a proprietary data format, you can name it whatever you want. But it would be good to use a common term like serialized data or mapping list.
If it's just a list of simple items, each of which has a name and a value, then "key-value pairs" is probably the right term.
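Whichever name you choose, the parsing itself is only a few lines; here is a minimal Python sketch (the function name parse_key_value_pairs and the digit-to-int coercion are illustrative choices, not part of any standard):
def parse_key_value_pairs(line):
    # Parse 'userID:12,year:2010,active:1' into a dict,
    # converting purely numeric values to int.
    result = {}
    for pair in line.split(','):
        key, _, value = pair.partition(':')
        value = value.strip()
        result[key.strip()] = int(value) if value.isdigit() else value
    return result

print(parse_key_value_pairs("userID:12,year:2010,active:1"))
# {'userID': 12, 'year': 2010, 'active': 1}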
I would go with:
parse_named_records()
ParseCommaSeparatedNameValuePairs()
ParseDelimitedNameValuePairs()
ParseCommaSeparatedKeyValuePairs()
ParseDelimitedKeyValuePairs()
From the one line you gave, ParseJson() seems appropriate.