So, I'm at work at the moment, and I had a co-worker create some SQL code for me to extract text from a larger description field. The problem I'm running into is that it doesn't stop extracting where I need it to. I need it to stop once it either sees the word "Specifications:" or finds two CRLFs back to back. That would let it grab only the "Features" section, which is what I'm after.
Here's an example of the current code:
SELECT IN_Desc, Replace(IN_Desc, Left(IN_Desc, InStr(IN_Desc, "- ") - 1), "")
FROM Inventory
WHERE IN_MfgName = "BERK"
Here's an example of the text it's looking through:
Gulp! has 400 times more scent dispersion than ordinary plastic bait.
The extreme scent dispersion greatly expands the strike zone allowing
you to catch more fish! Even more impressive, the natural formulation
of Gulp! out fishes live bait in head to head field tests. Berkley
Gulp! truly is the next generation in soft bait!
Features:
Ideal on jigs or as a trailer
Favorite for all SW species when targeting big fish
Proven tail action design swims under all conditions
Expand your strike zone with 400x more scent dispersion than plastic baits
15 years of Gulp! evolution…the best keeps getting better
Specifications:
Bait Length: 6"
Color: White
Quantity: Per 4
Packaging: Bag
Desired output:
Ideal on jigs or as a trailer
Favorite for all SW species when targeting big fish
Proven tail action design swims under all conditions
Expand your strike zone with 400x more scent dispersion than plastic baits
15 years of Gulp! evolution…the best keeps getting better
Thanks to everyone in advance for any and all help.
This is a bit ugly, but it seems to do the trick. It may need some tweaking to get exactly what you want, but this will get everything between Features and the next double carriage return/line feed:
Mid(yourfield, InStr(1,yourfield,"Features:")+Len("Features: "), InStr(InStr(1,yourfield,"Features:")+Len("Features: "), yourfield, Chr(13) & Chr(10) & Chr(13) & Chr(10)) - (InStr(1,yourfield,"Features:")+Len("Features: ")))
I'm certain that it could be written prettier, but my Access is rusty as hell. I feel like a VBA UDF would be a lot cleaner, and then you could employ regex to really pull this apart.
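For what it's worth, here is roughly what that regex approach looks like, as a quick Python sketch rather than VBA (the extract_features name and the sample string are just made up for illustration); the same pattern could be ported into a VBA UDF via the VBScript.RegExp object:

import re

def extract_features(description):
    # Everything after "Features:" up to the first blank line (two CRLFs)
    # or "Specifications:", whichever comes first.
    match = re.search(
        r"Features:\r?\n(.*?)(?:\r?\n\r?\n|\r?\nSpecifications:|\Z)",
        description, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

sample = ("Berkley Gulp! truly is the next generation in soft bait!\r\n"
          "Features:\r\n"
          "Ideal on jigs or as a trailer\r\n"
          "Favorite for all SW species when targeting big fish\r\n"
          "Specifications:\r\n"
          "Bait Length: 6\"\r\n")
print(extract_features(sample))   # prints only the two feature lines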
Sorry if this is documented somewhere, but I haven't been able to find it. When using brace delimiters with qq, code is not interpolated:
qq.raku
#!/usr/bin/env raku
say qq{"Two plus two": { 2 + 2 }};
say qq["Two plus two": { 2 + 2 }];
$ ./qq.raku
"Two plus two": { 2 + 2 }
"Two plus two": 4
Obviously, this isn't a big deal since I can use a different set of delimiters, but I ran across it and thought I'd ask.
Update
As raiph pointed out, I forgot to put the actual question: Is this the way it's supposed to work?
The quote language "nibbler" (the bit of the grammar that eats its way through a quoted string) looks like this:
[
<!stopper>
[
|| <starter> <nibbler> <stopper>
|| <escape>
|| .
]
]*
That is, until we see a stopper, eat whichever comes first of:
A starter (the opening { in your case), followed by some internal stuff, followed by a stopper (the }); this allows for nesting of the construct inside of the string
An escape (and closure interpolation is considered a kind of escape)
Any other character
This ordering in the grammar means that a nesting of the chosen quote starter/stopper will always win over an escape. This issue was discussed during the language design; we could, after all, have reordered the alternation in the grammar to have escapes win. On balance, however, it was felt that the choice of starter/stopper was the more local decision than the general properties of the quoting language, and so should take precedence. (This is also consistent with how quote languages are constructed: we take the base quoted string grammar and mix starter/stopper methods into it.)
Obviously, this isn't a big deal since I can use a different set of delimiters, but I ran across it and thought I'd ask.
You didn't ask anything. :)
Let's say you've got some text. And you want to use double quote processing to get interpolation, except you don't want braced text to be interpolated as code. You could write, say, qq:!c '...'. But don't you think it's a lot easier to remember, write, and read qq{ ... }?
Nice little touch, right?
Which is why it's the way it is -- it's a very nice touch.
And, perhaps, why it's not documented -- it's little, and, once you encounter it, obvious what you need to do.
That said, the Q lang escapes include ones to recursively re-enter the Q lang:
say qq{"Two plus two": \qq[{ 2 + 2 }] }; # "Two plus two": 4
Does that answer your question? :)
We have a source file ("source-A") that looks like this:
The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences
Each sentence in "source-A" is on its own line and terminates with a newline (\n)
We have a dictionary/conversion file ("converse-B") that looks like this:
aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes
"converse-B" is a two column, tab delimited file.
Each equivalence map (term-on-left<tab>term-on-right) is on its own line and terminates with a newline (\n)
How can I read "converse-B", replace every occurrence in "source-A" of a term from column 1 of "converse-B" with the corresponding term from column 2, and then write the result to an output file ("output-C")?
For example, the "output-C" would look like this:
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
The tricky part is the term potato.
If a "simple" awk solution cannot handle a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.
In other words, an awk solution can stipulate that it only works for an unambiguous word or a term composed of space separated, unambiguous words.
An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.
sed probably suits better, since it's only phrase/word replacements. Note that if the same words appear in multiple phrases, it's first come, first served; so change your dictionary order accordingly.
$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
The sed statement inside the process substitution converts the dictionary entries into sed substitution expressions, and the main sed uses them for the content replacements.
NB: a production-quality script should take care of word case and word boundaries to eliminate unwanted substring substitutions; both are ignored here.
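If you later need the word-boundary handling in something scriptable, here is a rough Python sketch of the same dictionary-driven replacement (the convert function name is made up for illustration; the file names follow the question). It respects word boundaries but, like the sed version, still ignores case:

import re

def convert(source_path, dict_path, output_path):
    # Read the tab-delimited dictionary: column 1 -> column 2.
    with open(dict_path, encoding="utf-8") as f:
        pairs = [line.rstrip("\n").split("\t", 1) for line in f if "\t" in line]
    with open(source_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in src:
            for old, new in pairs:   # first come, first served, like the sed version
                line = re.sub(r"\b" + re.escape(old) + r"\b", new, line)
            out.write(line)

convert("source-A", "converse-B", "output-C")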
Question 1:
In wide_n_deep_tutorial.py, there is a hyper-parameter named hash_bucket_size for both tf.feature_column.categorical_column_with_hash_bucket and tf.feature_column.crossed_column methods, and the value is hash_bucket_size=1000.
But why 1000? How should this parameter be set?
Question 2:
The second question is about crossed_columns, that is,
crossed_columns = [
    tf.feature_column.crossed_column(["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column([age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(["native_country", "occupation"], hash_bucket_size=1000)]
in wide_n_deep_tutorial.py,
Why choose ["education", "occupation"], [age_buckets, "education", "occupation"] and ["native_country", "occupation"] as crossed_columns? Are there any rules of thumb?
Regarding the hash_bucket_size: if you set it too low, there will be many hash collisions, where different categories will be mapped to the same bucket, forcing the neural network to use other features to distinguish them. If you set it too high then you will use a lot of RAM for nothing: I am assuming that you will wrap the categorical_column_with_hash_bucket() in an embedding_column() (as you generally should), in which case the hash_bucket_size will determine the number of rows of the embedding matrix.
The probability of a collision if there are k categories is approximately equal to: 1 - exp(-k*(k-1)/2/hash_bucket_size) (source), so if there are 40 categories and you use hash_bucket_size=1000, the probability is surprisingly high: about 54%! To convince yourself, try running len(np.unique(np.random.randint(1000, size=40))) several times (it picks 40 random numbers between 0 and 999 and counts how many unique numbers there are), and you will see that the result is quite often less than 40. You can use this equation to choose a value of hash_bucket_size that does not cause too many collisions.
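To make that concrete, here is a small snippet (variable names are mine) that compares the formula above with the suggested simulation:

import numpy as np

k, hash_bucket_size = 40, 1000   # 40 categories hashed into 1000 buckets

# Closed-form estimate from the formula above.
analytic = 1 - np.exp(-k * (k - 1) / 2 / hash_bucket_size)

# Monte Carlo version of the np.unique experiment.
trials = 10_000
collided = sum(
    len(np.unique(np.random.randint(hash_bucket_size, size=k))) < k
    for _ in range(trials))

print(f"analytic : {analytic:.1%}")           # roughly 54%
print(f"simulated: {collided / trials:.1%}")  # roughly the same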
That said, if there are just a couple collisions, it's probably not going to be too bad in practice, as the neural network will still be able to use other features to distinguish the colliding categories. The best option may be to experiment with different values of hash_bucket_size to find the value below which performance starts to degrade, then increase it by 10-20% to be safe.
For the hash_bucket
The general idea is that, ideally, the hash function should not produce any collisions (otherwise you/the algorithm would not be able to distinguish between two cases). Hence the 1000 is, in this case, 'just' a value. If you look at the number of unique entries for occupation and country (16 and 43), you'll see that this number is high enough:
edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 7 | sort | uniq -c | wc -l
16
edb@lapelidb:/tmp$ cat adult.data | cut -d , -f 14 | sort | uniq -c | wc -l
43
Feature crossing
I think the rule of thumb there is that crossing makes sense if the combination of the features actually has meaning. In this example, education and occupation are linked. As for the second one, it probably makes sense to define people as 'junior engineer with a PhD' vs. 'senior cleaning staff without a degree'. Another typical example you see quite often is the crossing of longitude and latitude, since they have more meaning together than individually.
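To make the latitude/longitude case concrete, here is a sketch using the same tf.feature_column API as the tutorial; the feature names, bucket boundaries, and embedding size are placeholders, not values from wide_n_deep_tutorial.py:

import tensorflow as tf

# Coordinates as raw numeric columns.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# Bucketize first; crossing two raw floats would be meaningless.
lat_buckets = tf.feature_column.bucketized_column(latitude, boundaries=list(range(-90, 91, 10)))
lon_buckets = tf.feature_column.bucketized_column(longitude, boundaries=list(range(-180, 181, 10)))

# The cross encodes "which cell of the grid", which carries more meaning
# than either coordinate on its own.
lat_x_lon = tf.feature_column.crossed_column([lat_buckets, lon_buckets], hash_bucket_size=1000)

# For the deep part of a wide-and-deep model, wrap the cross in an embedding.
lat_x_lon_emb = tf.feature_column.embedding_column(lat_x_lon, dimension=8)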
As per the title, I am looking for a method to search data on an equivalence basis.
I.e. if a user searches for a value of 20", it will also search for 20 inch, 20 inches, etc.
I've looked at possibly using full-text search and a thesaurus, but I would have to build my own equivalence library.
Are there any other alternatives I should be looking at? Or are there common symbol/word equivalence libraries already written?
EDIT:
I don't mean the LIKE keyword and wildcards.
if my data is
A pipe that is 20" wide
A pipe that is 20'' wide - NOTE: (this is 2 single quotes)
A pipe that is 20 cm wide
A pipe that is 20 inch wide
A pipe that is 20 inches wide
I would like to search for '20 inch' and be returned
A pipe that is 20" wide
A pipe that is 20'' wide
A pipe that is 20 inch wide
A pipe that is 20 inches wide
Just answering this in case anyone else comes across it, as I finally figured it out.
I ended up using an FTS thesaurus to assign equivalence to inch, inches, and ", and this worked wonderfully for inch and inches but returned no results when I searched for 6".
It eventually turned out that the underlying issue was that characters such as " are treated as word breakers by full-text search.
I found that custom dictionary entries seem to override the language's word breakers. Introducing a file called Custom0009.lex, containing " and a few other characters/terms I wanted included that contained word breakers, into C:\Program Files\Microsoft SQL Server\{instance name}\MSSQL\Binn, then restarting the fdhost and rebuilding the index, allowed my search for
select * from tbldescriptions where FREETEXT(MainDesc,'"')
or
select * from tbldescriptions where contains(MainDesc,'FORMSOF(Thesaurus,"""")')
Notice the doubled " in the CONTAINS one: since the search term is already inside " characters, it needed to be escaped to be seen.
Given a list like:
Dog bone
Cat catnip
Human ipad
Dog collar
Dog collar
Cat collar
Human car
Human laptop
Cat catnip
Human ipad
How can I get results like this, using awk:
Dog bone 1
Dog collar 2
Cat catnip 2
Cat collar 1
Human car 1
Human laptop 1
Human ipad 2
Do I need a sub array? It seems to me like I need an array of "owners" which is populated by arrays of "things."
I'd like to use awk to do this, as this is a subscript of another program in awk, and for now, I'd rather not create a separate program.
By the way, I can already do it using sort and grep -c, and a few other pipes, but I really won't be able to do that on gigantic data files, as it would be too slow. Awk is generally much faster for this kind of thing, I'm told.
Thanks,
Kevin
EDIT: Be aware that the columns are actually not next to each other like this; in the real file, they are more like columns $8 and $11. I say this because I suppose if they were next to each other I could incorporate an awk regex ~/Dog\ Collar/ or something. But I won't have that option. Thanks!
awk does not have multi-dimensional arrays, but you can manage by constructing 2D-ish array keys:
awk '{count[$1 " " $2]++} END {for (key in count) print key, count[key]}' | sort
which, from your input, outputs
Cat catnip 2
Cat collar 1
Dog bone 1
Dog collar 2
Human car 1
Human ipad 2
Human laptop 1
Here, I use a space to separate the key values. If your data contains spaces, you can use some other character that does not appear in your input. I typically use array[$a FS $b] when I have a specific field separator, since that's guaranteed not to appear in the field values.
GNU Awk has some support for multi-dimensional arrays, but it's really just cleverly concatenating keys to form a sort of compound key.
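For comparison, here is what the same count looks like with a genuinely nested structure, sketched in Python (a Perl hash-of-hashes would read much the same); the file name and field positions are placeholders:

from collections import defaultdict

# owner -> thing -> count, as a nested structure rather than a compound key
counts = defaultdict(lambda: defaultdict(int))

with open("pets.txt") as f:                  # placeholder file name
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue
        owner, thing = fields[0], fields[1]  # swap in fields[7], fields[10] for $8/$11
        counts[owner][thing] += 1

for owner in sorted(counts):
    for thing, n in sorted(counts[owner].items()):
        print(owner, thing, n)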
I'd recommend learning Perl, which will be fairly familiar to you if you like awk, and which supports true Lists of Lists. In general, Perl will take you much further than awk.
Re your comment:
I'm not trying to be superior. I understand you asked how to accomplish a task with a specific tool, awk. I did give a link to the documentation for simulating multi-dimensional arrays in awk. But awk doesn't do that task well, and it was effectively replaced by Perl nearly 20 years ago.
If you ask how to cross a lake on a bicycle, and I tell you it'll be easier in a boat, I don't think that's unreasonable. If I tell you it'll be easier to first build a bridge, or first invent a Star Trek transporter, then that would be unreasonable.