Word2Vec word containing numeric values

When I add sentences to a Word2Vec model, it seems to remove words that start or end with numeric values. For example, "ISO 9001" is returned as "ISO ". I'm guessing it's something simple...
Thanks in advance.

I think you already answered your question in the tags you gave to this question. Most likely your tokenizer splits on blank spaces and leaves out numbers. If you paste the tokenizer code you use here, we will be able to help you further.
Good luck!
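Without the asker's code this is only a guess, but here is a minimal Python sketch of how a letters-only tokenizer would drop "9001" while a slightly broader pattern keeps it. Both function names and the sample sentence are made up for illustration; they are not from the question or from any Word2Vec library.

```python
import re

def tokenize_letters_only(text):
    # Hypothetical tokenizer: keeps only runs of ASCII letters, so digits vanish.
    return re.findall(r"[A-Za-z]+", text)

def tokenize_keep_numbers(text):
    # Keeps any run of letters, digits, or underscores as a token.
    return re.findall(r"\w+", text)

sentence = "Certified to ISO 9001 since 2015"
letters_only = tokenize_letters_only(sentence)
with_numbers = tokenize_keep_numbers(sentence)
print(letters_only)
print(with_numbers)
```

If the tokens passed to the model look like the first list, the numbers are being lost before Word2Vec ever sees them.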

Related

Pentaho - Spoon Decimal from Text File Input

I'm new to Pentaho and have a small problem with the Text File Input step.
I need to write several data records to a database. In the files, the decimal separator is a point.
Pentaho currently transforms the number 123.3659 € into 12.33 €.
Can someone help?

When you read the file, do you read it as CSV, Excel, or something like that? If so, you can specify the format of the column so that the number is interpreted correctly (I think; I'm talking from memory now). Changing the language (locale) of the file might also work.
If it's a file containing a string, you can use a step such as the string operator to replace the point with a comma.

This problem might have various causes, but I think the following steps will solve it:
- First, add a "Replace in String" step.
- Then search for the dot and replace it with nothing, or with a comma if the number is a float.
Hope this helped!
Give feedback if so!
Have a good day!

How hive sentences function breaks each sentence

Before posting, I tried Hive's sentences function and searched around, but I couldn't get a clear understanding. My question is: on what delimiter does the sentences function break each sentence? The Hive manual says "appropriate boundary"; what does that mean? Below are my attempts. I tried a period (.) and an exclamation mark (!) in the sentence and got different outputs. Can someone explain this?
With a period (.):
select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 1 array
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
with '!'
select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 2 arrays
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

Understanding what sentences() does should clear up your doubt.
Definition of sentences(str):
Splits str into arrays of sentences, where each sentence is an array of words.
Example:
SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;
[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]
SELECT sentences('review . language') FROM movies;
[["review","language"]]
An exclamation point is a punctuation mark that goes at the end of a sentence. Related punctuation marks include periods and question marks, which also end sentences. But as per the definition of sentences(), unnecessary punctuation, such as periods and commas in English, is automatically stripped. So we get two arrays of words with "!". The details depend on the locale handling in Java (java.util.Locale).

I don't know the actual reason, but I observed that it works if the period (.) is followed by a space and the next word starts with a capital letter.
Here I changed "where" to "Where" and it worked. This is not required for "!", however.
Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.
This gives the output below:
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
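To make the observed behavior concrete, here is a rough Python sketch of a splitter that follows the rule inferred from the outputs above: "!" and "?" always end a sentence, while "." ends one only when the next word starts with a capital letter. This is an assumption based on the examples in this thread, not Hive's actual implementation, and the function name merely mirrors the UDF for readability.

```python
import re

# Assumed rule, inferred from the outputs above (not taken from Hive's source):
# "!" and "?" always end a sentence; "." ends one only when the following
# word starts with an uppercase letter.
BOUNDARY = re.compile(r"(?<=[!?])\s+|(?<=\.)\s+(?=[A-Z])")

def sentences(text):
    result = []
    for chunk in BOUNDARY.split(text):
        words = re.findall(r"\w+", chunk)  # keeps words, drops punctuation
        if words:
            result.append(words)
    return result

print(sentences("Hello there! I am a UDF."))
print(sentences("review . language"))
```

Under this rule, "sentences. where" stays in one array and "sentences. Where" splits into two, which matches what was reported above.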

Regex positive lookbehind

Let me apologize first. I've been fighting this SO editor for an hour. Sorry for the lousy formatting.
If I have a regex that matches a given input, then I put that regex into the positive look-behind wrapper, won't it still match the input it matched before?
For example, this input :
(NSString*)
will register a match with this regex:
\(\w*\*\)
I have confirmed this on gskinner.com. When I put that regex into the look-behind wrapper like so
(?<=\(\w*\*\))....
with this as the input:
(NSString*)help
I do not receive the word help as a return.
This leads me to think I just plainly don't understand the look-behind concept. I watched a tutorial on this concept, but I am at a loss as to why this won't work. If I want to match:
(NSString*)
and return the next word, how can I go about that?

You have a space as the last character of the look behind, but your input has no space before "help". Also, there is no colon character before the input text, yet your look behind requires one.
Remove the space and the colon:
(?<=\(\w*\*\))\w+
Note that many regex engines disallow variable-length look-behinds, so a workaround is to limit the number of characters in the word to some large number, e.g. 99:
(?<=\(\w{1,99}\*\))\w+
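For engines that reject variable-length look-behinds altogether, another option is to skip the look-behind, match the prefix normally, and capture the following word in a group. A minimal sketch using Python's re module, which is one such engine; the variable names are only illustrative:

```python
import re

text = "(NSString*)help"

# Python's re module rejects variable-width look-behinds at compile time.
try:
    re.compile(r"(?<=\(\w*\*\))\w+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Alternative: match the prefix normally and capture the word after it.
match = re.search(r"\(\w*\*\)(\w+)", text)
next_word = match.group(1) if match else None
print(lookbehind_ok, next_word)
```

The capture-group form returns the whole match plus the word, so you read the word from group 1 instead of from the match itself.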

DataTable.Select with ' (single quote) Character in the query vb.net

I have a string like "Hello'World" and a DataTable with some records in it. One of those records is "Hello'World".
The problem is that when I do a .Select on the DataTable, it only searches for the "Hello" part and throws an error on "World", because it interprets the ' (single quote) as the closing quote, as in SQL.
DataTable.Select("text = 'Hello'World'")
I have gone through the MSDN docs, which say I can escape some characters with [] brackets or backslashes (\), but I just can't figure it out: .select("text = 'Hello[']world'")
I've done some reading: in "Verbatim in vb - c#", "jmcilhinney" explains it really well, but it did not answer my question for what I want to do. On stackoverflow.com the same question is posted for C#, but I can't find a way to use @ in VB.
Can you please point me to more docs or examples, or has any of you encountered this problem?

Use '' (that is, two ' characters).
DataTable.Select("text = 'Hello''World'")
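If the value comes from user input, the doubling can be done in a small helper before the filter string is built. The sketch below shows only the string manipulation, in Python for brevity; both helper names are hypothetical, and in VB.NET the equivalent would be a Replace("'", "''") call on the value.

```python
def quote_literal(value):
    # Double every single quote so the filter parser reads it as a literal quote.
    return "'" + value.replace("'", "''") + "'"

def build_filter(column, value):
    # Hypothetical helper: builds the expression string to pass to DataTable.Select.
    return f"{column} = {quote_literal(value)}"

expr = build_filter("text", "Hello'World")
print(expr)
```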

How to remove strings contained in a list in VB.NET?

How can I find words like and, or, to, a, no, with, for, etc. in a sentence using VB.NET and remove them? Also, where can I find a complete list of words like these?

Note that unless you use Regex word boundaries you risk falling afoul of the Scunthorpe (Sfannythorpe) problem.
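// Requires: using System.Text.RegularExpressions;
string pattern = @"\band\b"; // \b matches only at word boundaries
Regex re = new Regex(pattern);
string input = "a band loves and its fans";
string output = re.Replace(input, ""); // "a band loves  its fans" (a double space is left behind)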
Notice the 'and' in 'band' is untouched.

You can indeed replace your list of words using the .Replace function (as colithium described) ...
myString.Replace("and", "")
Edit:
... but indeed, a nicer way is to use Regular Expressions (as edg suggested) to avoid replacing parts of words.
Since your question suggests that you want to clean up a sentence and keep only the meaningful words, you have to do more than just remove two- and three-letter words.
What you need is a list of stop words:
http://en.wikipedia.org/wiki/Stop_word
A comma-separated list of stop words for the English language can be found here:
http://www.textfixer.com/resources/common-english-words.txt

The easiest way is:
myString.Replace("and", "")
You'd loop over your word list and have a statement like the above. Google for a list of common English words?
List of English 2 Letter Words
List of English 3 Letter Words

You can match the words and remove them using regular expressions.
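As a rough illustration of the regex approach, here is a short Python sketch that removes a list of stop words using word boundaries, so that "and" inside "band" is left alone. The word list is just the sample from the question and the function name is made up; the same pattern works with Regex.Replace in VB.NET.

```python
import re

# Sample list taken from the question; a real stop-word list would be longer.
STOP_WORDS = ["and", "or", "to", "a", "no", "with", "for"]

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # One alternation wrapped in \b so only whole words are matched.
    pattern = r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b"
    stripped = re.sub(pattern, "", sentence, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words.
    return " ".join(stripped.split())

print(remove_stop_words("a band loves and its fans"))
```

Collapsing whitespace afterwards avoids the double spaces that a plain replace would leave.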
