segmenting bs4.element.Tag - beautifulsoup

Is it possible to segment a bs4.element.Tag into several bs4.element.Tag?
You can think of an application as the following:
1- The original bs4.element.Tag contains a paragraph.
2- We want to segment the paragraph in the original bs4.element.Tag into sentences and get a bs4.element.Tag corresponding to each sentence.
Example:
paragraphs = soup.find_all('p') gives all the paragraphs in an HTML file.
Suppose a paragraph (which is also a bs4.element.Tag instance) is the following:
<p><i>Le Bassin Aux Nymphéas</i>, 1919. Monet's late series of water lily paintings are among his best-known works.
I would like to turn this bs4.element.Tag instance (which is also a paragraph) into 2 bs4.element.Tag instances as the following (one for each sentence):
First bs4.element.Tag should correspond to the first sentence:
<i>Le Bassin Aux Nymphéas</i>, 1919.
Second bs4.element.Tag should correspond to the second sentence:
Monet's late series of water lily paintings are among his best-known works.

Related

Line from another line objects

Is there any way to extract some line from linestring where parallel line object is near?
The thing is I have multiple lines (some are intersecting the major line some not) that I have to somehow mark or extract from the major line. Big black line is the major, call it Pipe, and red is secondary line, call it Road.
My job is to mark or extract sections (that is shown between thin black lines) from Road on Pipe and create line from that sections. I have tried something with STStartPoint and STEndPoint then connect to line with ShortestLineTo but without any success. I'm new at spatial SQL and I really can't figure out how to do that.

How do I determine the model organism from FASTA format

So I have this fasta format: For example
>sp|A9X7L0|ANMT_RUTGR Anthranilate N-methyltransferase OS=Ruta graveolens OX=37565 PE=1 SV=1
MGSLSESHTQYKHGVEVEEDEEESYSRAMQLSMAIVLPMATQSAIQLGVFEIIAKAPGGR
LSASEIATILQAQNPKAPVMLDRMLRLLVSHRVLDCSVSGPAGERLYGLTSVSKYFVPDQ
DGASLGNFMALPLDKVFMESWMGVKGAVMEGGIPFNRVHGMHIFEYASSNSKFSDTYHRA
MFNHSTIALKRILEHYKGFENVTKLVDVGGGLGVTLSMIASKYPHIQAINFDLPHVVQDA
ASYPGVEHVGGNMFESVPEGDAILMKWILHCWDDEQCLRILKNCYKATPENGKVIVMNSV
VPETPEVSSSARETSLLDVLLMTRDGGGRERTQKEFTELAIGAGFKGINFACCVCNLHIM
EFFK
So I am wondering how do I determine if one is:
Bacteria
Viruses
Archaea
Eukaryota
The anwser can be found when looking at the OS part of the header of your FASTA file. But suppose you don't have this information, then you would perform a BLAST search. If the letters in your sequence would consist of only A, T, C and G it would be a DNA sequence. But since they are not you are dealing with a protein sequence. So we need to use protein BLAST.
Copy/paste the FASTA file in the online tool:
Leave the rest at the default settings and click on the BLAST button. After some time you will get the following results:
You will see that there is a 100% similarity match found with Ruta graveolens (as mentioned in the FASTA header) and around 80% similarity match found in Citrus sinensis.
If you want to know to which domain these species belong, you can click on the link to the accession records. For Ruta graveolens that is A9X7L0.1. There you see that the common name of this plant is common rue which has the following taxonomy:
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Sapindales; Rutaceae; Rutoideae;
Ruta.

What are terminal and nonterminal symbols?

I am reading Rebol Wikipedia page.
"Parse expressions are written in the parse dialect, which, like the do dialect, is an expression-oriented sublanguage of the data exchange dialect. Unlike the do dialect, the parse dialect uses keywords representing operators and the most important nonterminals"
Can you explain what are terminals and nonterminals? I have read a lot about grammars, but did not understand what they mean. Here is another link where this words are used very often.
Definitions of terminal and non-terminal symbols are not Parse-specific, but are concerned with grammars in general. Things like this wiki page or intro in Grune's book explain them quite well. OTOH, if you're interested in how Red Parse works and yearn for simple examples and guidance, I suggest to drop by our dedicated chat room.
"parsing" has slightly different meanings, but the one I prefer is conversion of linear structure (string of symbols, in a broad sense) to a hierarchical structure (derivation tree) via a formal recipe (grammar), or checking if a given string has a tree-like structure specified by a grammar (i.e. if "string" belongs to a "language").
All symbols in a string are terminals, in a sense that tree derivation "terminates" on them (i.e. they are leaves in a tree). Non-terminals, in turn, are a form of abstraction that is used in grammar rules - they group terminals and non-terminals together (i.e. they are nodes in a tree).
For example, in the following Parse grammar:
greeting: ['hi | 'hello | 'howdy]
person: [name surname]
name: ['john | 'jane]
surname: ['doe | 'smith]
sentence: [greeting person]
greeting, person, name, surname and sentence are non-terminals (because they never actually appear in the linear input sequence, only in grammar rules);
hi, hello, howdy with john, jane, doe and smith are terminals (because parser cannot "expand" them into a set of terminals and non-terminals as it does with non-terminals, hence it "terminates" by reaching the bottom).
>> parse [hi jane doe] sentence
== true
>> parse [howdy john smith] sentence
== true
>> parse [wazzup bubba ?] sentence
== false
As you can see, terminal and non-terminal are disjoint sets, i.e. a symbol can be either in one of them, but not in both; moreso, inside grammar rules, only non-terminals can be written on the left side.
One grammar can match different strings, and one string can be matched by different grammars (in the example above, it could be [greeting name surname], or [exclamation 2 noun], or even [some noun], provided that exclamation and noun non-terminals are defined).
And, as usual, one picture is worth a thousand words:
Hope that helps.
think of it like that
a digit can be 1-9
now i will tell you to write down on a page a digit.
so you know that you can write down 1,2,3,4,5,6,7,8,9
basically the nonterminal symbol is "digit"
and the terminals symbols are the 1,2,3,4,5,6,7,8,9
when i told you to write down on a page a digit you wrote down 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9
you didn't wrote down the word "digit" you wrote down the 1 or 2 or 3....
do you see where i'm going ?
let's try to make our own "rules"
let's "create" a nonterminal symbol we will call it "Olaf"
Olaf can be a dog (NOTE: dog is terminal)
Olaf can be a cat (NOTE: cat is terminal)
Olaf can be a digit (NOTE: digit is nonterminal)
Now i'm telling you that you can write down on a page an Olaf.
so that's mean that you can write down "dog"
you can also write down "cat"
you can also write down a digit so that's mean you can write down 1 or 2 or 3...
because digit is nonterminal symbol you dont write down "digit" you write down
the symbols that digit is referring to which is 1 or 2 or 3 etc...
in the end only terminals symbols are written on the "page"
one more thing i have to say is something that you may encounter one day, basically when you say "a nonterminal can be something".
there is a special term for that and that's basically called a "production rule"(can also be called a "production")
for example
Olaf can be "dog"
Olaf can be "cat"
Olaf can be digit
we got 3 productions here in other words we got here 3 definitions of Olaf
specifications of Programming languages use those ideas quite a lot when defining a syntax of a language

How to awk to read a dictionary and replace words in a file?

We have a source file ("source-A") that looks like this (if you see blue text, it comes from stackoverflow, not the text file):
The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences
Each sentence in "source-A" is on its own line and terminates with a newline (\n)
We have a dictionary/conversion file ("converse-B") that looks like this:
aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes
"converse-B" is a two column, tab delimited file.
Each equivalence map (term-on-left<tab>term-on-right) is on its own line and terminates with a newline (\n)
How to read "converse-B", and replace terms in "source-A" where a term in "converse-B" column-1 is replaced with the term in column-2, and then write to an output file ("output-C")?
For example, the "output-C" would look like this:
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
The tricky part is the term potato.
If a "simple" awk solution cannot handle a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.
In other words, an awk solution can stipulate that it only works for an unambiguous word or a term composed of space separated, unambiguous words.
An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.
sed probably suits better since since it's only phrase/word replacements. Note that if the same words appear in multiple phrases first come first serve; so change your dictionary order accordingly.
$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content
The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
file substitute sed statement converts dictionary entries into sed expressions and the main sed uses them for the content replacements.
NB: Note that production quality script should take of word cases and also word boundaries to eliminate unwanted substring substitution, which are ignored here.

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote beings with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = #"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthly amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expressions, but I have not truly figured out how to include any character or line break, except right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
#"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
#"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.