Delete everything before a character on certain lines in a large text in OpenRefine

I’ve looked around but did not find an answer.
I’m cleaning a large number of texts in OpenRefine. I am trying to remove every line (the text between two end-of-line markers, \n) that contains a specific character, in this case %. The text looks like this:
...En trois mots, la bouffe lyonnaise, ça se résume à quoi?\n« Réconfortante, savoureuse, chaleureuse. » \n \nLa quenelle de brochet et sa sauce aux écrevisses %\nL'extra avec ça?\nLe chef Viola concoctera une soupe géante et celle-ci sera partagée GRATUITEMENT le samedi 25 février 2017! Stay tuned! \nLe bouchon lyonnais du Balmoral, c'est un rendez-vous! \nMontréal en Lumière - volet gastronomie\n23 février au 11 mars 2016 \nLE BALMORAL\n514 288-5992
I am looking for this result (without the line containing %):
...En trois mots, la bouffe lyonnaise, ça se résume à quoi?\n« Réconfortante, savoureuse, chaleureuse. » \n \n\nL'extra avec ça?\nLe chef Viola concoctera une soupe géante et celle-ci sera partagée GRATUITEMENT le samedi 25 février 2017! Stay tuned! \nLe bouchon lyonnais du Balmoral, c'est un rendez-vous! \nMontréal en Lumière - volet gastronomie\n23 février au 11 mars 2016 \nLE BALMORAL\n514 288-5992
I need to do this for many instances across multiple texts.
Help would be greatly appreciated.

I'm not sure whether the "\n" are literal or a representation of the LF character, but I'll assume the former; adjust the formula if necessary. The solution splits the text into lines, iterates through them, replaces each line containing '%' with an empty string, and joins the lines again. Use the following formula in the "Edit Cells -> Transform" dialog:
forEach(value.split('\\n'),l,if(l.contains('%'),'',l)).join('\\n')
To break it down:
value.split('\\n') yields an array of split lines
forEach(array,l,f) iterates through the array assigning each line to the variable l and applying function f
if(l.contains('%'),'',l) returns the empty string if l contains a percent sign ('%'), otherwise the original string
array.join('\\n') joins your filtered lines back together again
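
For readers who want to check the logic outside OpenRefine, the same split, blank and join steps can be sketched in Python. This is only an illustration of the technique, not GREL; the function name `blank_lines_containing` and the shortened sample text are assumptions made for the example, and like the formula above it assumes the "\n" are the two literal characters backslash and n.

```python
def blank_lines_containing(text, marker="%", sep="\\n"):
    """Replace every line that contains `marker` with an empty string."""
    # sep defaults to the two literal characters backslash + n,
    # matching the assumption made in the GREL formula.
    lines = text.split(sep)
    kept = ["" if marker in line else line for line in lines]
    return sep.join(kept)


# Shortened, hypothetical sample based on the text in the question.
sample = "chaleureuse. » \\n \\nLa quenelle de brochet %\\nL'extra avec ça?"
result = blank_lines_containing(sample)
print(result)
```

If the cells hold real LF characters instead, passing `sep="\n"` would be the corresponding adjustment.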

Can Raku range operator on strings mimic Perl's behaviour?

In Perl, the expression "aa" .. "bb" creates a list with the strings:
aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az ba bb
In Raku, however, (at least with Rakudo v2021.08), the same expression creates:
aa ab ba bb
Even worse, while "12" .. "23" in Perl creates a list of strings with the numbers 12, 13, 14, 15, ..., 23, in Raku the same expression creates the list ("12", "13", "22", "23").
The docs seem to be quite silent about this behaviour; at least, I could not find an explanation there. Is there any way to get Perl's behaviour for Raku ranges?
(I know that the second problem can be solved via typecast to Int. This does not apply to the first problem, though.)
It's possible to get the Perl behavior by using a sequence with a custom generator:
say 'aa', *.succ … 'bb';
# OUTPUT: «aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az ba bb»
say '12', *.succ … '23';
# OUTPUT: «12 13 14 15 16 17 18 19 20 21 22 23»
(Oh, and a half solution for the '12'..'23' case: you already noted that you can cast the endpoints to a Numeric type to get the output you want. But you don't actually need to cast both endpoints – just the bottom. So 12..'23' still produces the full output. As a corollary, because ^'23' is sugar for 0..^'23', any Range built with &prefix:<^> will be numeric.)
For the "why" behind this behavior, please refer to my other answer to this question.
TL;DR Add one or more extra characters to the endpoint string. It doesn't matter what the character(s) is/are.
Ten years after the current doc corpus was kick-started by Moritz Lenz++, Raku's doc is, as ever, a work in progress.
There's a goldmine of more than 16 years' worth of chat logs that I sometimes spelunk, looking for answers. A search for range "as words" with nick: TimToady netted me this in a few minutes:
TimToady beginning and ending of the same length now do the specced semantics
considering each position as a separate character range
My instant reaction:
Here's why it does what it does. The guy who designed how Perl's range works not only deliberately specced it to work how it now does in Raku but implemented it in Rakudo himself in 2015.
It does that iff "beginning and ending of the same length". Hmm. 💡
A few seconds later:
say flat "aa" .. "bb (like perl)";
say flat "12" .. "23 (like perl)";
displays:
(aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az ba bb)
(12 13 14 15 16 17 18 19 20 21 22 23)
😊
[I'm splitting this into a separate answer because it addresses the "why" instead of the "how"]
I did a bit of digging, and learned that:
For Sequences, having "aa"…"bb" produce "aa", "ab", "ba", "bb" is specified in Roast
The original use case provided for this behavior was generating sequences of octal numbers (as Strs) (discussed again in 2018)
For Ranges, the behavior of "aa".."bb" is currently unspecified and there does not appear to be consensus about what it should be.
(As you already know), Rakudo's implementation has "aa".."bb" behave the same as "aa"…"bb".
In 2018, lizmat (Elizabeth Mattijsen, https://stackoverflow.com/users/7424470/elizabeth-mattijsen on StackOverflow) changed .. to make "aa".."bb" behave the way it does in Perl but reverted that change pending consensus on the correct behavior.
So I suppose we (as a community) are still thinking about it? Personally, I'm inclined to agree with lizmat that having "aa".."bb" provide the longer range (like Perl) makes sense: if users want the shorter one, they can use a sequence. (Or, for an octal range, something like (0..0o377).map: *.fmt('%03o'))
But, either way, I definitely agree with that 2018 commit that we should pin this down in Roast – and then get it noted in the docs.
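
To make the "each position as a separate character range" rule quoted from TimToady concrete, here is a small Python sketch. It only models the rule as described for endpoints of equal length; it is not Rakudo's implementation, and the function name `positional_range` is made up for the illustration.

```python
from itertools import product


def positional_range(start, end):
    """Treat each character position as its own range and combine them."""
    if len(start) != len(end):
        raise ValueError("this sketch only covers endpoints of equal length")
    # One list of candidate characters per position, e.g. ['a', 'b'] for a..b.
    per_position = [
        [chr(code) for code in range(ord(lo), ord(hi) + 1)]
        for lo, hi in zip(start, end)
    ]
    # The cartesian product varies the last position fastest.
    return ["".join(chars) for chars in product(*per_position)]


print(positional_range("aa", "bb"))  # ['aa', 'ab', 'ba', 'bb']
print(positional_range("12", "23"))  # ['12', '13', '22', '23']
```

Under this rule the second position never runs past "b" (or "3"), which is why the result is shorter than Perl's successor-based list.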

load data from text with pandas with a two characters as separator

I'm trying to load data with pandas from a txt table.
The column separator was defined as "|#" as you can see in the example:
LINEA DE NEGOCIO|#NOMBRE CLIENTE|#NUMERO CLIENTE|#NUMERO DE CONTRATO|#TIPO DE SEGURO
The system does not allow me to use "|#" as the separator.
Could you help me with this loading?
Thanks in advance.
I share the code:
df = pd.read_table('D:/Art_492/Encabezado.txt', sep='|#', index_col=0).astype(str)
The | character is the OR (alternation) operator in regular expressions, so it must be escaped with \. Change the separator to \|# (written as a raw string so that Python leaves the backslash alone) and set engine='python' to get the desired result.
pd.read_table(data, sep=r'\|#', engine='python', index_col=0).astype(str)
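
pandas interprets separators longer than one character as regular expressions, so the effect of the escape can be checked with Python's own `re` module, without pandas. The sketch below uses the header line from the question; `re.escape` is shown as an alternative way to build the pattern from the literal separator.

```python
import re

header = (
    "LINEA DE NEGOCIO|#NOMBRE CLIENTE|#NUMERO CLIENTE"
    "|#NUMERO DE CONTRATO|#TIPO DE SEGURO"
)

# Escaped by hand: the backslash makes | a literal pipe instead of "OR".
columns = re.split(r"\|#", header)

# re.escape builds an equivalent pattern from the literal separator.
columns_via_escape = re.split(re.escape("|#"), header)

print(columns)
```

Both calls split the header into the five column names; the string returned by `re.escape("|#")` could equally be passed as `sep`.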

French character display on xaml page

I have some special characters in my French content. When I view the characters in the XAML code, the text is displayed properly in Visual Studio. At runtime, however, the text is not rendered properly.
For example: <TextBlock xml:lang="fr-CA" Foreground="Black" Text="Bay Nº doit contenir uniquement caractères alphabétiques ou numériques"/>
In the given text, the underline that appears after the N and below the o is missing when the page runs.
Has anyone faced this issue/does anyone have any idea on resolving this issue?
I think you're running into the difference between the ordinal indicator (º) and the numero sign (№). A workaround is to use the numero sign's Unicode hex character reference directly instead of combining N with a separate ordinal indicator.
Numero hex: &#x2116; (№)
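Shown as an example, which should render as desired both in the designer and at runtime:
<TextBlock xml:lang="fr-CA" Foreground="Black" Text="Bay &#x2116; doit contenir uniquement caractères alphabétiques ou numériques"/>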
Hope this helps, cheers!
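
To see that these really are two different characters, their Unicode names can be looked up with Python's standard `unicodedata` module. This is a quick check, assuming the original text uses U+00BA after the N; it does not touch the XAML itself.

```python
import unicodedata

ordinal = "\u00ba"  # º, the character assumed to follow N in the question
numero = "\u2116"   # №, the single numero sign suggested in the answer

for ch in (ordinal, numero):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The hex value of the second code point, 2116, is what the XAML character reference &#x2116; encodes.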

SQL Server : searching strings for equivalent phrasing such as inch, inches,'' and "

As per the title, I am looking for a method to search data on an equivalence basis.
I.e., when a user searches for a value of 20", the search should also match 20 inch, 20 inches, etc.
I've looked at possibly using full-text search with a thesaurus, but I would have to build my own equivalence library.
Are there any other alternatives I should be looking at? Or are there common symbol/word equivalence libraries already written?
EDIT:
I don't mean the LIKE keyword and wildcards.
If my data is
A pipe that is 20" wide
A pipe that is 20'' wide - NOTE::(this is 2 single quotes)
A pipe that is 20 cm wide
A pipe that is 20 inch wide
A pipe that is 20 inches wide
I would like to search for '20 inch' and get back
A pipe that is 20" wide
A pipe that is 20'' wide
A pipe that is 20 inch wide
A pipe that is 20 inches wide
Just answering this in case anyone else comes across it, as I finally figured it out.
I ended up using an FTS thesaurus to assign equivalence to inch, inches and ". This worked wonderfully for inch and inches, but returned no results when I searched for 6".
It eventually turned out that the underlying issue was that characters such as " are treated as word breakers by full-text search.
I found that custom dictionary items seem to override the language's word breakers. I created a file called Custom0009.lex containing a few lines with " and a few other characters/terms I wanted included that had word breakers in them, placed it in C:\Program Files\Microsoft SQL Server\{instance name}\MSSQL\Binn, restarted fdhost and rebuilt the index. That allowed my search for
select * from tbldescriptions where FREETEXT(MainDesc,'"')
or
select * from tbldescriptions where contains(MainDesc,'FORMSOF(Thesaurus,"""")')
Notice the doubled " in the CONTAINS query: because the search term is already enclosed in ", the quote had to be escaped to be seen.
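
The quote doubling can be easy to get wrong by hand, so here is a minimal Python sketch that builds the search condition string for an arbitrary term. The helper name `formsof_thesaurus` is hypothetical, and the sketch only covers the doubling of " inside the term; the resulting string would still need to be passed to SQL Server as a query parameter rather than concatenated into the statement.

```python
def formsof_thesaurus(term):
    """Wrap `term` in FORMSOF(Thesaurus,"...") with inner double quotes doubled."""
    escaped = term.replace('"', '""')
    return f'FORMSOF(Thesaurus,"{escaped}")'


condition = formsof_thesaurus('"')
print(condition)  # FORMSOF(Thesaurus,"""")
```

For the single " character this reproduces the four consecutive quotes used in the CONTAINS query above.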

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote begins with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthy amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expression, but I have not truly figured out how to match any character or line break except a right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
@"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
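
The claim that a negated character class matches across line breaks can be checked quickly with Python's `re` module, which handles this pattern the same way. The sample text below is a shortened stand-in for the text in the question, not the original data, and the pattern is written without the Objective-C string escapes.

```python
import re

FOOTNOTE = re.compile(r"\[Footnote[^\]]*\]")

# Shortened, hypothetical sample: one multi-line and one single-line footnote.
text = (
    "<p>[Footnote 1: first paragraph.</p>\n"
    "<p>2. second paragraph.</p>\n"
    "<p>3. third paragraph.]</p>\n"
    "Body text that must be kept.\n"
    "[Footnote 2: a single-line note.]\n"
)

matches = FOOTNOTE.findall(text)
print(len(matches))  # 2
print(FOOTNOTE.sub("", text))
```

Because [^\]] excludes only the closing bracket, newlines are consumed like any other character, and no (?s) flag is needed.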
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
@"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.
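
The "peeling" idea can be sketched in Python as a loop that keeps removing innermost footnotes until none are left. This is only an illustration under the assumption that nested footnotes also start with a capital "Footnote" (the pattern is case-sensitive); the sample string is made up.

```python
import re

OUTER = re.compile(r"\[Footnote[^\]]*\]")        # stops at the first ]
INNERMOST = re.compile(r"\[Footnote[^\]\[]*\]")  # allows no brackets inside

nested = "before [Footnote a [Footnote b] c] after"

# The first pattern ends too early on nested input.
print(OUTER.search(nested).group())  # [Footnote a [Footnote b]

# Peel innermost footnotes layer by layer until nothing matches.
stripped = nested
while INNERMOST.search(stripped):
    stripped = INNERMOST.sub("", stripped)
print(stripped)
```

Each pass removes one nesting level, so the loop runs as many times as the deepest footnote is nested.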