Is there a function in sPacy to get the string given hash? - spacy

Basically looking for opposite of spacy.strings.get_string_id() which does not need to load the language model to get the vocabulary. I tried StringStore methods, but you need to add the string first or else you get a "Can't retrieve string for hash 'xxx'" error.
Use case is the hash is serialized then it is unserialized somewhere else.

No, you will need to keep a copy of the StringStore from the pipeline you used to process your documents in order to look up strings for hashes in the future.
In the end, it's nothing more than a list of strings that have been seen before, either as tokens or annotations, which you can simply re-add to a new StringStore.

Related

SHOW KEYS in Aerospike?

I'm new to Aerospike and am probably missing something fundamental, but I'm trying to see an enumeration of the Keys in a Set (I'm purposefully avoiding the word "list" because it's a datatype).
For example,
To see all the Namespaces, the docs say to use SHOW NAMESPACES
To see all the Sets, we can use SHOW SETS
If I want to see all the unique Keys in a Set ... what command can I use?
It seems like one can use client.scan() ... but that seems like a super heavy way to get just the key (since it fetches all the bin data as well).
Any recommendations are appreciated! As of right now, I'm thinking of inserting (deleting) into (from) a meta-record.
Thank you #pgupta for pointing me in the right direction.
This actually has two parts:
In order to retrieve original keys from the server, one must -- during put() calls -- set policy to save the key value server-side (otherwise, it seems only a digest/hash is stored?).
Here's an example in Python:
aerospike_client.put(key, {'bin': 'value'}, policy={'key': aerospike.POLICY_KEY_SEND})
Then (modified Aerospike's own documentation), you perform a scan and set the policy to not return the bin data. From this, you can extract the keys:
Example:
keys = []
scan = client.scan('namespace', 'set')
scan_opts = { 'concurrent': True, 'nobins': True, 'priority': aerospike.SCAN_PRIORITY_MEDIUM }
for x in (scan.results(policy=scan_opts)): keys.append(x[0][2])
The need to iterate over the result still seems a little clunky to me; I still think that using a 'master-key' Record to store a list of all the other keys will be more performant, in my case -- in this way, I can simply make one get() call to the Aerospike server to retrieve the list.
You can choose not bring the data back by setting includeBinData in ScanPolicy to false.

Possible to write an Express GET route that accepts an array of unknown length?

Like the title says, is it possible to write an Express GET route that accepts an array of unknown length?
I know I can use a POST request and just include an array in the body, but it isn't posting something so much as getting something!
I need to know how to encode the url. Most of what I am seeing is for arrays of particular length. This could be for 2 or 20, or more.
YES!
You can use the query parameter and use a delimiter, as search engines of old where your search string was actually in the url with spaces demarcated with +. This allows for an array of indeterminate length.
Did not get any feedback for HOW to encode an array to accept a url, so this is the approach I am going with. Even in my googling I couldn't find much on HOW to encode a URL to have an array in the params object, rather than the query object, possibly because I was searching for how to do it for an array of indeterminate length?

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).

sorting and getting uniques

i have a string that looks like this
"apples,fish,oranges,bananas,fish"
i want to be able to sort this list and get only the uniques. how do i do it in vb.net? please provide code
A lot of your questions are quite basic, so rather than providing the code I'm going to provide the thought process and let you learn from implementing it.
Firstly, you have a string that contains multiple items separated by commas, so you're going to need to split the string at the commas to get a list. You can use String.Split for that.
You can then use some of the extension methods for IEnumerable<T> to filter and order the list. The ones to look at are Enumerable.Distinct and Enumerable.OrderBy. You can either write these as normal methods, or use Linq syntax.
If you need to get it back into a comma-separated string, then you'll need to re-join the strings using the String.Join method. Note that this needs an array so Enumerable.ToArray will be useful in conjunction.
You can do it using LINQ, like this:
Dim input = "apples,fish,oranges,bananas,fish"
Dim strings = input.Split(","c).Distinct().OrderBy(Function(s) s)
I'm not a VB.NET programmer, but I can give you a suggestion:
Split the string into an array
Create a second array
Cycle through the first array, adding any value that is not in the second.
Upon completion, your second array will have only unique values.

Is it safe to convert a mysqlpp::sql_blob to a std::string?

I'm grabbing some binary data out of my MySQL database. It comes out as a mysqlpp::sql_blob type.
It just so happens that this BLOB is a serialized Google Protobuf. I need to de-serialize it so that I can access it normally.
This gives a compile error, since ParseFromString() is not intended for mysqlpp:sql_blob types:
protobuf.ParseFromString( record.data );
However, if I force the cast, it compiles OK:
protobuf.ParseFromString( (std::string) record.data );
Is this safe? I'm particularly worried because of this snippet from the mysqlpp documentation:
"Because C++ strings handle binary data just fine, you might think you can use std::string instead of sql_blob, but the current design of String converts to std::string via a C string. As a result, the BLOB data is truncated at the first embedded null character during population of the SSQLS. There’s no way to fix that without completely redesigning either String or the SSQLS mechanism."
Thanks for your assistance!
It doesn't look like it would be a problem judging by that quote (it's basically saying if a null character is found in the blob it will stop the string there, however ASCII strings won't have random nulls in the middle of them). However, this might present a problem for internalization (multibyte charsets may have nulls in the middle).