Count the frequency of a string in map in clojure - file-io

I am fairly new to clojure and just practicing some exam questions to prepare for a final exam.
I am trying to find the frequency of all the names is in a file. I read the file line by line and I save each string if it contains a specific keyword in a map. Since I do not want any repetitions, I am trying put distinct out in front of it, but I still keep getting the repeating elements.
(defn readFile []
(map (fn [line] (clojure.string/split line #";"))
(with-open [rdr (reader "C:/Users/Rohil's Computer/Desktop/textfile.txt")]
(doseq [[idx line] (map-indexed vector(line-seq rdr))]
(if(.contains line "2007")
(if(.contains line "May")
(if(not(.contains line "Batman"))
(map save [(nth(clojure.string/split line #";")3 (nth(clojure.string/split line #";")19)])
(distinct(map))
)
)
)
)
)
)
)
Here is my sample output:
I want to get rid of the 2 iron man elements.
(May 2007 Spiderman)
Clojure2.clj:
(March 2007 Iron man)
Clojure2.clj:
(March 2007 Iron man)
Clojure2.clj:
(April 2007 Captain America)
Is there anything I am missing?

With Clojure a good approach is to break down what you need to accomplish a task into functions that are as simple as possible. Once you have written and tested those simple functions you will be able to combine them together. Often times, the simple function you need already exists, as is here the case with frequencies:
(frequencies ["foo" "bar" "foo" "quux" "foo"])
=> {"foo" 3, "bar" 1, "quux" 1}
So it sounds like all you need to do really is tokenize the input file and apply frequencies to the list of tokens.

Related

How do I generate numbered lists with pod?

Looking at https://docs.raku.org/language/pod#Lists. I don't see a way to create a numbered list:
one
three
four
Is there an undocumented way to do it?
There is not currently (as of January 2022) an implemented way to use ordered list in Pod6.
The historical design documents contain Pod6 syntax for ordered lists and, as far as I know, this remains something that we'd like to add. Once that syntax is implemented, you'll be able to write something like:
=item1 # Animal
=item2 # Vertebrate
=item2 # Invertebrate
=item1 # Phase
=item2 # Solid
=item2 # Liquid
=item2 # Gas
This would produce output along the lines of:
1. Animal
1.1 Vertebrate
1.2 Invertebrate
2. Phase
2.1 Solid
2.2 Liquid
2.3 Gas
(Though the exact syntax for rendering the list would be up to the implementation of the Pod renderer.)
But until that's implemented, there isn't any way to use Pod6 syntax to create an ordered list.
Edit:
I just checked the actual parsed Pod6, and it looks like (to my surprise) the ordered list syntax I showed above actually is parsed internally. For example, running say $=pod[5].raku with the Pod6 shows the following (based on the =item2 # Liquid line):
Pod::Item.new(level => 2, config => {:numbered(1)}, contents => [Pod::Block::Para.new(config => {}, contents => ["Liquid"])])
So the parsing work is in place; it's just the Pod::To::_ renderer that need to add support. (And there could even be some out there that have that support. I do know that neither Rakudo's Pod::To::Text nor Raku's Pod::To::HTML (v0.8.1) currently render ordered lists, however.)
Workarounds
Depending on the output formats you're targeting, you could of course write the ordered list yourself (pretty easy if you're rendering to plain text, more annoying to do if you're printing to HTML). This does, of course, sacrifice Pod6's multi-output-format support, which is one of its key features.
For a workaround that doesn't sacrifice Pod's multi-output nature, you'd probably want to look into manipulating/reformatting the Pod text programmatically. If you do so, the docs to start with are the Pod6 section on accessing Pod and the (unfortunately very short) section on the DOC phaser.
Just use a list and a loop?
my #list = [ (1, 2, 3), (1, 2, ),
[<a b c>, <d e f>],
[[1]]];
for #list -> #element {
say "{#element} → {#element.^name}";
for #element -> $sub-element {
say $sub-element;
}
}
# OUTPUT:
#1 2 3 → List
#1
#2
#3
#1 2 → List
#1
#2
#a b c d e f → Array
#(a b c)
#(d e f)
#1 → Array
#1

How to keep translations separated where the same word is used in English but a different one in other languages?

Imagine I have a report, a letter actually, which I need to translate to several languages. I have created a greeting field in the form which is filled programatically by an onchange event method.
if self.partner_id.gender == 'female':
self.letter_greeting = _('Dear %s %s,') % ( # the translation should be "Estimada"
self.repr_recipient_id.title.shorcut, surname
)
elif self.partner_id.gender == 'male':
self.letter_greeting = _('Dear %s %s,') % ( # translation "Estimado"
self.repr_recipient_id.title.shorcut, surname
)
else:
self.letter_greeting = _('Dear %s %s,') % ( # translation: "Estimado/a"
self.partner_id.title.shorcut, surname
)
In that case the word Dear should be translated to different Spanish translations depending on which option is used, this is because we use different termination depending on the gender. Exporting the po file I found that all the options are altogether, that make sense because almost all the cases the translations will be the same, but not in this case:
#. module: custom_module
#: code:addons/custom_module/models/sale_order.py:334
#: code:addons/custom_module/models/sale_order.py:338
#: code:addons/custom_module/models/sale_order.py:342
#, python-format
msgid "Dear %s %s,"
msgstr "Dear %s %s,"
Solutions I can apply directly
Put all the terms in different entries to avoid the same translation manually every time I need to update the po file. This can be cumbersome if you have many different words with that problem. If I do it and I open the file with poedit, this error appears: duplicate message definition
Put all the possible combinations with slashes, this is done y some other parts of Odoo. For both gender would be:
#. module: stock
#: model:res.company,msg:stock.res_company
msgid "Dear"
msgstr "Estimado/a"
This is just an example. I can think of many words that look the same in English, but they use different spelling or meanings in other languages depending on the context.
Possible best solutions
I don't know if Odoo know anything aboutu the context of a word to know if it was already translated or not. Adding a context manually could solve the problem, at least for words with different meanings.
The nicest solution would be to have a parameter to the translation module to make sure that the word is exported as an isolated entry for that especific translation.
Do you think that I am giving to it too much importance haha? Do you know if there is any better solution? Why is poedit not taking into account that problem at all?
I propose an extension of models res.partner.title and res.partner.
res.partner.title should get a translateable field for saving salutation prefixes like 'Dear' or 'Sehr geehrter' (German). Maybe it's worth to get something about genders, too, but i won't get into detail here about that.
You probably want to show the configuring user an example like "Dear Mr. Name" or something like that. A computed field should work.
On res.partner you should just implement either a computed field or just a method to get a full salutation for a partner record.
To some degree this is a linguistics problem. I believe the best solution would be to use a different "Source Language", one made up of keys, and then have English as another Translation. The word "Dear" in English does not have a gender context (and typically, much of English doesn't), while the word "Estimado" in Spanish does. The translation from that Spanish word to English is more appropriately "Masculine Dear." Therefore, using keys as your source language, you would have this:
SourceText (EnglishDescription) -> Translation (English) -> Translation (Spanish)
DearMasculine -> Dear -> Estimado
DearFeminine -> Dear -> Estimada
DearNuetral -> Dear -> Estimado/a

Sentence segmentation and dependency parser

I’m pretty new to python (using python 3) and spacy (and programming too). Please bear with me.
I have three questions where two are more or less the same I just can’t get it to work.
I took the “syntax specific search with spacy” (example) and tried to make different things work.
My program currently reads txt and the normal extraction
if w.lower_ != 'music':
return False
works.
My first question is: How can I get spacy to extract two words?
For example: “classical music”
With the previous mentioned snippet I can make it extract either classical or music. But if I only search for one of the words I also get results I don’t want like.
Classical – period / era
Or when I look for only music
Music – baroque, modern
The second question is: How can I get the dependencies to work?
The example dependency with:
elif w.dep_ != 'nsubj': # Is it the subject of a verb?
return False
works fine. But everything else I tried does not really work.
For example, I want to extract sentences with the word “birthday” and the dependency ‘DATE’. (so the dependency is an entity)
I got
if d.ent_type_ != ‘DATE’:
return False
To work.
So now it would look like:
def extract_information(w,d):
if w.lower_ != ‘birthday’:
return False
elif d.ent_type_ != ‘DATE’:
return False
else:
return True
Does something like this even work?
If it works the third question would be how I can filter sentences for example with a DATE. So If the sentence contains a certain word and a DATE exclude it.
Last thing maybe, I read somewhere that the dependencies are based on the “Stanford typed dependencies manual”. Is there a list which of those dependencies work with spacy?
Thank you for your patience and help :)
Before I get into offering some simple suggestions to your questions, have you tried using displaCy's visualiser on some of your sentences?
Using an example sentence 'John's birthday was yesterday', you'll find that within the parsed sentence, birthday and yesterday are not necessarily direct dependencies of one another. So searching based on the birthday word having a dependency of a DATE type entity, might not be yield the best of results.
Onto the first question:
A brute force method would be to look for matching subsequent words after you have parsed the sentence.
doc = nlp(u'Mary enjoys classical music.')
for (i,token) in enumerate(doc):
if (token.lower_ == 'classical') and (i != len(doc)-1):
if doc[i+1].lower_ == 'music':
print 'Target Acquired!'
If you're unsure of what enumerate does, look it up. It's the pythonic way of using python.
To questions 2 and 3, one simple (but not elegant) way of solving this is to just identify in a parsed sentence if the word 'birthday' exists and if it contains an entity of type 'DATE'.
doc = nlp(u'John\'s birthday was yesterday.')
for token in doc:
if token.lower_ == 'birthday':
for entities in doc.ents:
if entities.label_ == 'DATE':
print 'Found ya!'
As for the list of dependencies, I presume you're referring to the Part-Of-Speech tags. Check out the documentation on this page.
Good luck! Hope that helped.

Pig - comparing two similar statement : one working, the other not

I begin to be really annoyed with PIG :the language seems really not stable, the documentation is poor, there are not that many examples on internet, and any small change in the code can give radical differences :from failure to expected result.... Here is another kind of this last theme :
grunt> describe actions_by_unite;
actions_by_unite: {
group: chararray,
nb_actions_by_unite_and_action: {
(
unite: chararray,
lib_type_action: chararray,
double
)
}
}
-- works :
z = foreach actions_by_unite {
generate group, SUM(nb_actions_by_unite_and_action.$2);};
-- doesn't work :
z = foreach actions_by_unite {
x = SUM(nb_actions_by_unite_and_action.$2);
generate group, x;};
-- error :
2015-05-08 14:43:44,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 107, column 16> Invalid scalar projection: x : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /private/tmp/pig-err.log
And so :
-- doesn't work neither:
z = foreach actions_by_unite { x = SUM(nb_actions_by_unite_and_action.$2);
generate group, x.$0;};
--error :
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (AC,EMAIL,1.1186133550060547E-4), 2nd :(AC,VISITE,6.25755280560356E-4)
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:120)
Does anyone would know why ?
Do you have some nice blog / ressources to propose with examples to master this language ?
I have the o'reilly book, but it seems a bit old, I have the 'Agile Data Science' and the "Hadoop definitive guide" book with some examples in it... I found this page really interesting : https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
Any good video on coursera or other inputs ? Do you guys also have problems with this language ? or I am simply dumb ?....
That thing in particular is not because of Pig being unstable, it's because what you are trying to do is correct in the first approach, but wrong in the others.
When you make a group by, you have for each group a bag that contains X tuples. Inside a nested foreach, you have one group with its bag for each iteration, which means that a SUM inside there will yield a scalar value: the sum of the bag you are currently working with. Apache Pig does not work with scalars, it works with relations, therefore you cannot assign a scalar value to an alias, which is exactly what you are doing in the second and third approach.
Therefore, the error comes from attempting something like:
A = foreach B {
x = SUM(bag.$0);
}
However, if you want to emit for each of the groups a scalar, you can perfectly do this as long as you never assign a scalar to an alias. That is why it works perfectly if you do the sum at the end of the foreach, because you are returning for each of the groups a tuple with two values: the group and the sum.

USING Filter in a Nested FOREACH in PIG

I have two pig relations. The first one count_pairs shows pairs of words and how many times they were seen. ex ((car,tire), 4). The second is word_counts, which keeps track of how many times each word was seen ex. (car, 20). I would like to find the percentage of how many times each pair was seen compared to how many times just the first word was seen. In our case I would want ((car,tire), 4/20). I tried to write a nested foreach to solve this problem :
> percent_count_pairs = FOREACH count_pairs {
> denom = FILTER word_counts BY ($0 ==count_pairs.pair.word1);
> GENERATE pair, count2/(double)denom.$1;}
I keep getting this error:
'Pig script failed to parse:
<file src/cluster.pig, line 27, column 15> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)'
This point to the line with the FILTER;
googling this error did not lead me to anything helpful. Please help!
(ps. this does work if I take the line with FILTER out of the foreach...)
After more googling I came to realize that this is a bug in Pig that will not allow this:
https://issues.apache.org/jira/browse/PIG-1798. I ended up writing my own UDF to filter.