In some situations a user wants only the plural of a noun, such as:
mkN("apple");
What is the best practice to get the machine to print out the result apples, instead of printing the whole set apple, apple's, apples, apples'?
Relevant flags to cc in the GF shell
To get cc (short for compute_concrete) in the GF shell to print only one result, you can use the flag -one, like this:
$ gf
> i -retain alltenses/ParadigmsEng.gfo
> cc mkN "apple"
{s = table
ParamX.Number
[table ResEng.Case ["apple"; "apple's"];
table ResEng.Case ["apples"; "apples'"]];
g = ResEng.Neutr; lock_N = <>}
> cc -one mkN "apple"
apple
If you apply the flag to the N, the first thing it prints out is the singular form. So how do we fix the number to get the plural instead?
Fixed number
N is a category of nouns, and nouns have inflection tables. N can be used to create many different things: the nominative forms apple and apples can become heads of noun phrases, singular or plural. The genitive forms can become determiners ("the apple's sweet taste" or "all my apples' cores are rotten"). So N is open for number and case. When you print out the forms of a N without any extra flags, it makes sense to show them all.
If you want to go a step further and restrict apples to be plural, you need to make it into a plural NP:
apples_NP = mkNP aPl_Det (mkN "apple") ;
Open for case
Note that a NP is still open for case. In fact, the inflection table of NP is as big as N's, even though we know the number. That's because NPs can be made out of pronouns, and pronouns can inflect more than nouns. This is the worst case for a NP:
> cc -table i_NP
s . ResEng.NCase ResEng.Nom => I
s . ResEng.NCase ResEng.Gen => my
s . ResEng.NPAcc => me
s . ResEng.NPNomPoss => mine
Of course, for a NP made out of a noun, most of those fields are identical.
> cc -table apples_NP
s . ResEng.NCase ResEng.Nom => apples
s . ResEng.NCase ResEng.Gen => apples'
s . ResEng.NPAcc => apples
s . ResEng.NPNomPoss => apples
But because some NPs are different in all 4 fields, the GF lincat for NP needs to have them all.
Display apples in the GF shell
To get the GF shell to display only apples, you need to create a NP out of the noun, and then call cc -one on the plural NP. Here's a GF module you can paste into a file called Apples.gf.
resource Apples = open ParadigmsEng, SyntaxEng in {
  oper
    apple_N : N = mkN "apple" ;
    apples_NP : NP = mkNP aPl_Det apple_N ;
}
Then go to the GF shell:
> i -retain Apples.gf
> cc -one apples_NP
apples
Output apples in any other situation
If you use apples_NP as a subject or object in any sentence, you will get the string apples. If you give it as an argument to Extend.GenNP, you get a quantifier with the string apples'.
Related
In Redis, let's say I have a set called animals.
127.0.0.1:6379> sadd animals cat:bob cat:fred dog:joe dog:rover hamster:harvey
I know I can use SRANDMEMBER to pull a random member from the set, which might be any of the values.
And I know I can get all cats out of the set with something like SSCAN animals 0 MATCH cat:*
But is there a way to retrieve a random cat?
Edit for clarity: My example has the important designator at start of string, but I'm looking for something that follows a general pattern where the "match" might be anywhere within the string.
Not in a single command. If you are using a Sorted Set, you can get ranges of values based on the lexical content:
> ZADD animals 0 cat:bob 0 cat:fred 0 dog:joe 0 dog:rover 0 hamster:harvey
> ZRANGESTORE cats animals [cat: (cau BYLEX
> ZRANDMEMBER cats
> DEL cats
Note that [cat: means "range starting with cat, inclusive" and (cau means "range ending with cau, exclusive". I picked "cau" because it would be next in the sequence and would only pick cats.
This is, admittedly, a bit of a hack. 😉
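If the match can occur anywhere in the string (as in your edit), another option is to filter on the client and pick randomly there. Here's a minimal sketch using the Python redis-py client (it assumes a local server and the animals set from above):

import random
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis server

# SSCAN filters server-side with a glob pattern; use '*cat*' instead of
# 'cat:*' if the designator can appear anywhere in the member string.
cats = list(r.sscan_iter('animals', match='cat:*'))

if cats:
    print(random.choice(cats))  # a random cat

This gives up the single-command elegance, but it works for any glob pattern, at the cost of materializing all matches on the client.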
How can I configure John the Ripper to generate only mangled (Jumbo) palindromes from a word-list to crack a password hash?
(I've googled it but only found "how to avoid palindromes")
In john/john.conf, append the following rules at the end (e.g. for 9- and 10-letter palindromes):
# End of john.conf file.
# Keep this comment, and blank line above it, to make sure a john-local.conf
# that does not end with \n is properly loaded.
[List.Rules:palindromes]
f
f D5
Then run john with your wordlist plus the newly created "palindromes" rules:
$ john --wordlist=wordlist.lst --rules:palindromes hashfile.hash
Rule f simply appends a reflection of the current word, e.g. P4ss! -> P4ss!!ss4P
Rule f D5 not only reflects the word but then deletes the 5th character, e.g. P4ss! -> P4ss!ss4P
I haven't found a way to "delete the middle character", so as of now the rule has to be adjusted to the required length of palindromes, e.g. f D4 for a length of 7, f D6 for a length of 11, etc.
Edit: Possible solution for variable length (not tested yet):
f
Mr[6
M = Memorize current word, r = Reverse the entire word, [ = Delete first character, 6 = Prepend the word saved to memory to current word
With this approach the palindromes could additionally be "turned inside out" (word from the wordlist at the end of the resulting palindrome instead of at the beginning):
f
Mr[6
Mr]4
M = Memorize current word, r = Reverse the entire word, ] = Delete last character, 4 = Append the word saved to memory to current word
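For clarity, here is a rough Python sketch (not John itself, just an illustration of the string manipulations the rules above describe):

def palindromes(word):
    # Rule f: append a reflection of the word (even-length palindrome).
    even = word + word[::-1]
    # Rule f D<n> with n = len(word): also delete the duplicated middle
    # character, giving the odd-length palindrome.
    odd = even[:len(word)] + even[len(word) + 1:]
    return even, odd

print(palindromes("P4ss!"))  # ('P4ss!!ss4P', 'P4ss!ss4P')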
Print cities from the visited_cities list in alphabetical order using .sort()
Only print cities whose names start with "Q" or earlier (meaning A to Q)
visited_cities = ["New York", "Shanghai", "Munich", "Tokyo", "Dubai", "Mexico City", "São Paulo", "Hyderabad"]
The .sort() was easy to do, but I don't know how to figure out the second part of the problem.
You could do it with regular expressions and filtering, like:
import re
regex = re.compile('[A-Q].*')
cities = list(filter(lambda city: re.match(regex, city), visited_cities))
print(*cities, sep='\n')
The regex matches any city name starting with a letter from A to Q.
There is an even easier solution utilizing the Unicode code point of a character: look at the built-in function ord.
for city in visited_cities:
    first_character = city[0]
    if ord(first_character) >= ord('A') and ord(first_character) <= ord('Q'):
        print(city)
The Unicode code points are ordered, so A is at 65, B is at 66, ..., Q is at 81, ..., Z is at 90. So if you want to print only those cities starting with letters from A to Q, you have to make sure the first character's code point is between 65 and 81.
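Since Python compares strings by code point anyway, you can also skip the explicit ord calls and use a chained comparison on the first character; the same idea, just shorter:

for city in visited_cities:
    if 'A' <= city[0] <= 'Q':
        print(city)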
I am doing a sort and would like to control the cmp of alpha values to be case insensitive (viz. https://perl6.org/archive/rfc/143.html).
Is there some :i adverb for this, perhaps?
Perl 6 doesn't currently have that as an option, but it is a very mutable language so let's add it.
Since the existing proto doesn't allow named values, we have to add a new one, or write an only sub.
(That is, you can just use the multi below, except declared as an only sub instead.)
This only applies lexically, so if you write this you may want to mark the proto/only sub as being exportable depending on what you are doing.
proto sub infix:<leg> ( \a, \b, *% ){*}
multi sub infix:<leg> ( \a, \b, :ignore-case(:$i) ){
    $i
    ?? &CORE::infix:<leg>( fc(a), fc(b) )
    !! &CORE::infix:<leg>( a, b )
}
say 'a' leg 'A'; # More
say 'a' leg 'A' :i; # Same
say 'a' leg 'A' :!i; # More
say 'a' leg 'A' :ignore-case; # Same
Note that :$i is short for :i( $i ) so the two named parameters could have been written as:
:ignore-case( :i( $i ) )
Also I used the sub form of fc() rather than the method form .fc because it allows the native form of strings to be used without causing autoboxing.
If you want a "dictionary" sort order, @timotimo is on the right track when they suggest collate and coll for sorting.
Use collate() on anything listy to sort it. Use coll as an infix operator in case you need a custom sort.
$ perl6
To exit type 'exit' or '^D'
> <a Zp zo zz ab 9 09 91 90>.collate();
(09 9 90 91 a ab zo Zp zz)
> <a Zp zo zz ab 9 09 91 90>.sort: * coll *;
(09 9 90 91 a ab zo Zp zz)
You can pass a code block to sort. If the arity of the block is one, it is applied to each element before doing the comparison. Here's an example showing the .fc from the previous answer.
> my @a = <alpha BETA gamma DELTA>;
[alpha BETA gamma DELTA]
> @a.sort
(BETA DELTA alpha gamma)
> @a.sort(*.fc)
(alpha BETA DELTA gamma)
From the documentation:
In order to do case-insensitive comparison, you can use .fc
(fold-case). The problem is that people tend to use .lc or .uc, and it
does seem to work within the ASCII range, but fails on other
characters. This is not just a Perl 6 trap, the same applies to other
languages.
For example:
say ‘groß’.fc eq ‘GROSS’.fc; # ← RIGHT; True
If you are working with regexes, then there is no need to use .fc and you can use the :i (:ignorecase) adverb instead.
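The same trap exists in other languages, as the documentation notes. For example, Python draws the identical distinction between plain lowercasing and case folding:

print('groß'.lower() == 'gross')     # False: lowercasing keeps the ß
print('groß'.casefold() == 'gross')  # True: case folding maps ß to ss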
I have a big table in my database with a lot of words from various texts in the text order. I want to find the number of times/frequency that some set of words appears together.
Example: Suppose I have these 4 words in many texts: United | States | of | America. I would get as a result:
United States: 50
United States of: 45
United States of America: 40
(This is only an example with 4 words; there can be sets with fewer or more than 4.)
Is there some algorithm that can do this, or something similar?
Edit: Some R or SQL code showing how to do it is welcome. I need a practical example of what I need to do.
Table Structure
I have two tables: Token, which has id and text. The text is UNIQUE and each entry in this table represents a different word.
TextBlockHasToken is the table which keeps the text order. Each row represents a word in a text.
It has textblockid, which is the block of the text the token belongs to; sentence, which is the sentence of the token; position, which is the token's position inside the sentence; and tokenid, which is the Token table reference.
It is called an N-gram; in your case a 4-gram. It can indeed be obtained as the by-product of a Markov chain, but you could also use a sliding window (size 4) to walk through the (linear) text while updating a 4-dimensional "histogram".
UPDATE 2011-11-22:
A Markov chain is a way to model the probability of switching to a new state, given the current state. This is the stochastic equivalent of a "state machine". In the natural language case, the "state" is formed by the "previous N words", which implies that you consider the prior probability (before the previous N words) as equal to one. Computer people will most likely use a tree for implementing Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words to follow are the probabilities of the current node's offspring. But every time that we choose a new child node, we actually shift down the tree and "forget" the root node; our window is only N words wide, which translates to N levels deep into the tree.
You can easily see that if you are walking a Markov chain/tree like this, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word it is P(w2 | w1), etc. So, when processing the corpus you build a Markov tree (:= update the frequencies in the nodes); at the end of the ride you can estimate the probability of a given choice of word by freq(word) / SUM(freq(siblings)). For a word 5 levels deep into the tree this is the probability of the word, given the previous 4 words. If you want the N-gram probabilities, you want the product of all the probabilities in the path from the root to the last word.
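Here is a minimal sketch of the sliding-window variant in Python (assuming the corpus is already tokenized into a list of words); a Counter stands in for the frequency-annotated tree nodes:

from collections import Counter

def ngram_counts(tokens, max_n=4):
    # Slide a window of each size 1..max_n over the token list.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the united states of america and the united states".split()
counts = ngram_counts(tokens)
print(counts[("united", "states")])  # 2

# P(word | previous 3 words), estimated as freq(4-gram) / freq(3-gram prefix):
p = counts[("united", "states", "of", "america")] / counts[("united", "states", "of")]
print(p)  # 1.0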
This is a typical use case for Markov chains. Estimate the Markov model from your textbase and find high probabilities in the transition table. Since these indicate probabilities that one word will follow another, phrases will show up as high transition probabilities.
By counting the number of times the phrase-start word showed up in the texts, you can also derive absolute numbers.
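Since the question also asks for SQL: the counting can be done with one self-join per extra word in the phrase. Here is a minimal sketch for bigrams against the schema described in the question, via Python's sqlite3 (the database file name is hypothetical and loading the data is left out):

import sqlite3

conn = sqlite3.connect("corpus.db")

# Join every token row to its successor in the same sentence.
rows = conn.execute("""
    SELECT t1.text || ' ' || t2.text AS bigram, COUNT(*) AS freq
    FROM TextBlockHasToken a
    JOIN TextBlockHasToken b
      ON b.textblockid = a.textblockid
     AND b.sentence = a.sentence
     AND b.position = a.position + 1
    JOIN Token t1 ON t1.id = a.tokenid
    JOIN Token t2 ON t2.id = b.tokenid
    GROUP BY bigram
    ORDER BY freq DESC
""").fetchall()

for bigram, freq in rows[:10]:
    print(bigram, freq)

For trigrams, add a third copy of TextBlockHasToken joined on position + 2, and so on.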
Here is a small snippet that calculates all combinations/ngrams of a text for a given set of words. In order to work for larger datasets it uses the hash library, though it is probably still pretty slow...
require(hash)
get.ngrams <- function(text, target.words) {
  text <- tolower(text)
  split.text <- strsplit(text, "\\W+")[[1]]
  ngrams <- hash()
  current.ngram <- ""
  for (i in seq_along(split.text)) {
    word <- split.text[i]
    word_i <- i
    # Grow the n-gram for as long as consecutive words are target words.
    while (word %in% target.words) {
      if (current.ngram == "") {
        current.ngram <- word
      } else {
        current.ngram <- paste(current.ngram, word)
      }
      # Count every prefix of the run as its own n-gram.
      if (has.key(current.ngram, ngrams)) {
        ngrams[[current.ngram]] <- ngrams[[current.ngram]] + 1
      } else {
        ngrams[[current.ngram]] <- 1
      }
      word_i <- word_i + 1
      word <- split.text[word_i]
    }
    current.ngram <- ""
  }
  ngrams
}
So the following input ...
some.text <- "He states that he loves the United States of America,
and I agree it is nice in the United States."
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)
... would result in the following hash:
>usa.ngrams
<hash> containing 10 key-value pair(s).
america : 1
of : 1
of america : 1
states : 3
states of : 1
states of america : 1
united : 2
united states : 2
united states of : 1
united states of america : 1
Notice that this function is case-insensitive and registers any permutation of the target words, e.g.:
some.text <- "States of united America are states"
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)
...results in:
>usa.ngrams
<hash> containing 10 key-value pair(s).
america : 1
of : 1
of united : 1
of united america : 1
states : 2
states of : 1
states of united : 1
states of united america : 1
united : 1
united america : 1
I'm not sure if it's of help to you, but here is a little Python program I wrote about a year ago that counts N-grams (well, only mono-, bi-, and trigrams). (It also calculates the entropy of each N-gram.) I used it to count those N-grams in a large text.
Link