OpenRefine: key collision-fingerprint clustering + diacritics

I think there is a bug (or a very surprising feature...) in the way OpenRefine handles diacritics in "key collision-fingerprint" clustering:
row 1 : école
row 2 : école école ecole
-> clustering -> 0 cluster
same issue with
row 1 : école
row 2 : école école ecole
-> 0 cluster
But this case works well:
row 1 : ecole
row 2 : école école école
-> 1 cluster

Not too surprising. Fingerprint clustering simply applies the fingerprint() function to each cell and then groups the cells whose fingerprints are identical. Here is the result of fingerprint() in the three cases you mention:
Case 1
row | value | value.fingerprint()
1 | école | ecole
2 | école école ecole | ecole ecole
Case 2
row | value | value.fingerprint()
1 | école | ecole
2 | école école ecole | ecole ecole
Case 3
row | value | value.fingerprint()
1 | ecole | ecole
2 | école école école | ecole
Why this difference in the third case? Because the fingerprint algorithm actually performs the following operations, in a strict order.
1. remove leading and trailing whitespace
" école école école " -> "école école école"
2. change all characters to their lowercase representation
"éCole écoLe école" -> "école école école"
3. remove all punctuation and control characters
"école-école, école" -> "école école école"
4. split the string into whitespace-separated tokens
"école école école" -> ["école", "école", "école"]
5. sort the tokens and remove duplicates
["école", "école", "école"] -> ["école"]
6. join the tokens back together
["école"] -> "école"
7. normalize extended western characters to their ASCII representation
"école" -> "ecole"
One might wonder whether operation 7 should be performed earlier. But in your example, the bug, if there is one, is arguably in the 3rd case. The string "école" is very different from the string "ecole école école"; they should not be merged, in my opinion. Likewise, the given name "John-John" is not equivalent to "John".
EDIT: One of the developers agrees with you.
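The seven operations listed above can be sketched in Python as follows. This is only an illustration of the order of operations as described in this answer, not OpenRefine's actual code: the function names are made up, punctuation is replaced by a space to match the example given for step 3, and the ASCII step is approximated by stripping combining marks.

```python
import string
import unicodedata


def to_ascii(text):
    """Approximate step 7: drop diacritics by removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(value):
    """Hypothetical re-implementation of the steps listed in the answer."""
    s = value.strip()                    # 1. remove leading/trailing whitespace
    s = s.lower()                        # 2. lowercase
    s = "".join(                         # 3. punctuation and control characters
        " " if ch in string.punctuation or unicodedata.category(ch) == "Cc" else ch
        for ch in s
    )
    tokens = s.split()                   # 4. split into tokens
    tokens = sorted(set(tokens))         # 5. sort and deduplicate (accents still present)
    joined = " ".join(tokens)            # 6. join
    return to_ascii(joined)              # 7. ASCII normalization happens last


print(fingerprint("école"))
print(fingerprint("école école ecole"))
print(fingerprint("école école école"))
```

Because deduplication (step 5) runs before the accents are removed (step 7), "école" and "ecole" still count as two different tokens in the second row of cases 1 and 2, so that row ends up as "ecole ecole" rather than "ecole".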


Get Position of a String in a field with delimiters BigQuery

I want to get the position of a word in a field that has the following data with the delimiter as "->":
Example:
Row 1| "ACT -> BAT -> CAT -> DATE -> EAT"
Row 2| "CAT -> ACT -> EAT -> BAT -> DATE"
I would like to extract, say, the position of CAT in each row.
Output would be -
Row 1| 3
Row 2| 1
I've tried regexp_instr and instr, but I think they both return the position of the character, not of the word.
Consider below
select *,
array_length(split(regexp_extract(col, r'(.*?)CAT'), '->')) as position
from your_table
If applied to the sample data in your question, the output is position 3 for Row 1 and position 1 for Row 2.
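To see why the query works, the same idea can be sketched in Python: take everything before the first occurrence of the word, split it on the delimiter, and count the pieces. The function name word_position is made up for this illustration and is not part of BigQuery.

```python
import re


def word_position(path, word, delimiter="->"):
    """Return the 1-based position of `word` among the delimited items, or None."""
    # Lazily capture everything before the first occurrence of the word,
    # mirroring regexp_extract(col, r'(.*?)CAT') in the query.
    match = re.search(r"(.*?)" + re.escape(word), path)
    if match is None:
        return None
    # Splitting the prefix yields one piece per preceding item, plus the
    # (empty or blank) piece where the word itself starts.
    return len(match.group(1).split(delimiter))


print(word_position("ACT -> BAT -> CAT -> DATE -> EAT", "CAT"))
print(word_position("CAT -> ACT -> EAT -> BAT -> DATE", "CAT"))
```

Note that, like the query, this looks for the word as a substring, so an item such as CATALOG would also be treated as a hit for CAT.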

Elm : How to make use of value with (Result String Value)

For example
fromIsoString : String -> Result String Date
fromIsoString produces Ok value on success. Is there any way to do something with that value?
From what I tested, this works:
text ( `Value` |> Date.add Days -1|> Date.toIsoString)
What I tried: Date.fromIsoString "2018-09-26" |> Result.withDefault 0 gives an error saying it expects:
Result String #Date#
Ideally, I want to parse an ISO date string (2020-05-10) into a Date and then do something with it, such as subtracting one day.
Reference :
https://github.com/justinmimbs/date/blob/3.2.0/src/Date.elm
You’re seeing this Result String #Date# error because you’ve passed Result.withDefault a number where it expects a Date. If we look at the withDefault type annotation:
> Result.withDefault
<function> : a -> Result x a -> a
withDefault expects a default of the same type a as the successful result. Because you’ve specified 0 : number as the default, its type becomes:
> \result -> Result.withDefault 0 result
<function> : Result x number -> number
Note that result's type is Result x number, which doesn't line up with fromIsoString's Result String Date output type.
TLDR: Pass a Date as the default argument, e.g.:
> defaultDate = Date.fromCalendarDate 2020 Jan 1
RD 737425 : Date
> Date.fromIsoString "2018-09-26" |> Result.withDefault defaultDate
RD 736963 : Date
Take a look at the Elm Result documentation for other functions you can call on values of type Result String Date.
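For readers who find the rule easier to see outside Elm, here is a rough Python analogy (not Elm code, and parse_iso_or is a made-up helper): the fallback has to be a date, just like the successful result, so that the next step can treat both outcomes the same way.

```python
from datetime import date, timedelta


def parse_iso_or(default, text):
    """Parse an ISO date string, falling back to `default` (itself a date)."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return default


default_date = date(2020, 1, 1)

# Whether parsing succeeds or falls back, the result is a date,
# so subtracting one day is valid in both cases.
previous_day = parse_iso_or(default_date, "2018-09-26") - timedelta(days=1)
print(previous_day.isoformat())
```

Passing 0 as the default here would fail in the same spirit as the Elm error: the subtraction on the last line would no longer make sense for the fallback value.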

F# Deedle and Multi Index

I have recently started to learn F# for data science (coming from basic C# and Python). I am starting to get used to the power of the functional-first paradigm for science.
However, I am still confused about how to handle a problem I could easily solve using pandas in Python. It is related to multi-index time series / data frames. I have looked extensively at Deedle, but I am still not sure whether Deedle could help me achieve such a table:
Column Index 1: A || B
Column Index 2: A1 A2 || B1 B2
Column Index 3: p1 p2 | p1 p2 || p1 p2 | p1 p2
Row Index:
date1 0.5 2. | 2. 0.5 || 3. 0. | 2. 3.
date2 ......
The idea is to be able to sum all p1 series where Index 1 = A, and so on.
I did not find an example of such a thing using Deedle.
If it is not available, what structure would you recommend for my data?
Thanks for helping an F# newbie (who is already in love with the language).
In Deedle, you can create a frame or a series with hierarchical index by using a tuple as the key:
let ts =
  series
    [ ("A", "A1", "p1") => 0.5
      ("A", "A1", "p2") => 2.
      ("A", "A2", "p3") => 2.
      ("A", "A2", "p4") => 0.5 ]
Deedle does have some special handling for this. For example, it will output the data as:
A A1 p1 -> 0.5
     p2 -> 2
  A2 p3 -> 2
     p4 -> 0.5
To apply aggregation over a part of the hierarchy, you can use the applyLevel function:
ts |> Series.applyLevel (fun (l1, l2, l3) -> l1) Stats.mean
ts |> Series.applyLevel (fun (l1, l2, l3) -> l1, l2) Stats.mean
The first argument is a function that gets the tuple of keys and selects what part of the level you want to group - so the above two create an aggregation over the top and top two levels, respectively.
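As a language-neutral illustration of what grouping by part of a tuple key means, here is a small Python sketch. It is not Deedle and not its API: apply_level is a made-up stand-in, and the series is modeled as a plain dictionary keyed by tuples.

```python
from collections import defaultdict
from statistics import mean

# The same hierarchical series as above, modeled as a dict with tuple keys.
ts = {
    ("A", "A1", "p1"): 0.5,
    ("A", "A1", "p2"): 2.0,
    ("A", "A2", "p3"): 2.0,
    ("A", "A2", "p4"): 0.5,
}


def apply_level(series, level_of, aggregate):
    """Group values by the part of the key chosen by `level_of`, then aggregate."""
    groups = defaultdict(list)
    for key, value in series.items():
        groups[level_of(key)].append(value)
    return {level: aggregate(values) for level, values in groups.items()}


by_top_level = apply_level(ts, lambda key: key[0], mean)
by_top_two_levels = apply_level(ts, lambda key: key[:2], mean)
print(by_top_level)
print(by_top_two_levels)
```

Swapping mean for sum gives the "sum all series where Index 1 = A" aggregation asked about in the question.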

Aligning text columns of different size and content

In a past posting, I asked about commands in Bash to align text columns against one another by row. It has become clear to me that the desired task (i.e., aligning text columns of different size and content by row) is much more complex than initially anticipated and that the proposed answer, while acceptable for the past posting, is insufficient on most empirical data sets. Thus, I would like to query the community on the following pseudocode. Specifically, I would like to know if and in what way the following pseudocode could be optimized.
Assume a file with n columns of strings. Some strings might be missing, others might be duplicated. The longest column may not be the first one listed in the file, but shall be the reference column. The order of the rows of this reference column must be maintained.
> cat file # where n=3; first row contains column headers
CL1 CL2 CL3
foo foo bar
bar baz qux
baz qux
qux foo
    bar
Pseudocode attempt 1 (totally inadequate):
Shuffle columns so that columns are ordered by size (i.e., longest column is first in matrix)
Rownames = strings of first column (i.e., of longest column)
For rownames
    For (colname among columns 2:end)
        if (string in current cell == rowname) {keep string in location}
        if (string in current cell != rowname) {
            if (string in current cell == rowname of next row) {add row to bottom of table; move each string of current column one row down}
            if (string in current cell != rowname of next row) {add row to bottom of table; move each string of all other columns one row down}
        }
Order columns by size:
> cat file_columns_ordered_by_size
CL2 CL1 CL3
foo foo bar
baz bar qux
qux baz
foo qux
bar
Sought output:
> my_code_here file_columns_ordered_by_size
CL2 CL1 CL3
foo foo
    bar bar
baz baz
qux qux qux
foo
bar
Edit: Ugh, this doesn't produce the output you wanted. I guess I don't understand the problem. Maybe it will help, anyway.
If you don't mind slurping the entire table into memory, associative arrays (hashes) would work. (Or you can use trees, maps, dictionaries, etc.) There would be one for each column, mapping strings (found in the cells of that column) to the number of times that string is found in that column. Let's name the hashes after their column headers. After slurping, they would look something like this:
CL2 = {'foo':2, 'baz':1, 'bar':1, 'qux':1}
CL1 = {'foo':1, 'baz':1, 'bar':1, 'qux':1}
CL3 = {'bar':1, 'qux':1}
# Store the columns in an array
columnCounts = [CL2, CL1, CL3]
Then write a loop that produces the output, deleting from the associative arrays at each iteration:
while (columnCounts still has at least one non-empty hash) {
    key = the hash-key that is present in most (a plurality) of the hashes
    for each hash in columnCounts {
        if the key is in the hash {
            print key
            Decrement hash[key]
        }
        else {
            print whitespace
        }
    }
    print newline
}
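A runnable Python sketch of the loop above might look like this. Two details are assumptions of mine rather than part of the pseudocode: ties between equally common keys are broken by first appearance, and a key is removed from a hash once its count reaches zero. The function name is made up.

```python
from collections import Counter


def align_by_plurality(columns):
    """Emit rows by repeatedly picking the key present in the most columns."""
    # One hash per column: string -> number of occurrences in that column.
    counts = [Counter(column) for column in columns]
    rows = []
    while any(counts):  # at least one hash is still non-empty
        # Remaining keys in first-seen order, so ties are broken deterministically.
        candidates = dict.fromkeys(k for c in counts for k in c)
        key = max(candidates, key=lambda k: sum(k in c for c in counts))
        row = []
        for c in counts:
            if key in c:
                row.append(key)
                c[key] -= 1
                if c[key] == 0:
                    del c[key]  # assumption: exhausted keys are dropped
            else:
                row.append("")  # blank cell
        rows.append(row)
    return rows


columns = [
    ["foo", "baz", "qux", "foo", "bar"],  # CL2
    ["foo", "bar", "baz", "qux"],         # CL1
    ["bar", "qux"],                       # CL3
]
for row in align_by_plurality(columns):
    print("\t".join(row))
```

As the edit note at the top of this answer already concedes, this lines up equal strings but does not preserve the row order of the reference column, so it does not reproduce the sought output.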

Regular Expression puzzle

In (Visual Basic, .NET):
Dim result As Match = Regex.Match(aStr, aMatchStr)
If result.Success Then
Dim result0 As String = result.Groups(0).Value
Dim result1 As String = result.Groups(1).Value
End If
With: aStr equal to (whitespace is normal space and there are seven spaces between n and ():
"AMEVDIEERPK + 7 Oxidation (M)"
Why does result1 become an empty string for aMatchStr equal to
"\s*(\d*).*?Oxidation\s+\(M\)"
but becomes "7" for aMatchStr equal to
"\s*(\d*)\s*Oxidation\s+\(M\)"
?
(result0 becomes equal to "AMEVDIEERPK + 7 Oxidation (M)")
(This is from MSQuant, MascotResultParser.vb, function modificationParseMatch()).
\s* Zero or more whitespace
(\d*) Zero or more digits (captured)
.*? Any characters (non-greedy, so as few as possible up to the next match)
Oxidation Matches the word Oxidation
\s+\(M\) Matches one or more whitespace, then (M)
The problem here is that .*? matches zero or more of any character before the word Oxidation, including any digits, so it can eat the digits that \d* was meant to capture.
\s*(\d*)\s*Oxidation\s+\(M\)
The difference here is that you allow only whitespace before Oxidation, so the digits are not eaten.
Change the \d* to \d+ to catch the numbers.
I think it's because the matching starts at the first character and moves on from there...
For your first regular expression:
Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*).*?Oxidation\s+\(M\)"? Yes.. stop matching.
For your second regular expression:
Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
Does "MEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
Does "EVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? No...
...
Does " 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"? Yes
If for the first regular expression you'd used \d+ instead of \d* you'd have got a better result.
This is not exactly how regular expressions work, but you get the idea.
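The behaviour described above can be reproduced with Python's re module, which treats these particular constructs the same way as .NET. This is only a demonstration using the sample string from the question (with single spaces); the variable names are made up.

```python
import re

text = "AMEVDIEERPK + 7 Oxidation (M)"

# First pattern: the match succeeds right at the first character, where
# \s* and (\d*) both match nothing and .*? stretches up to "Oxidation".
lazy = re.search(r"\s*(\d*).*?Oxidation\s+\(M\)", text)

# Second pattern: only whitespace is allowed between the digits and
# "Oxidation", so the engine has to slide forward until it reaches " 7 ".
strict = re.search(r"\s*(\d*)\s*Oxidation\s+\(M\)", text)

# With \d+ the first pattern is forced to start at an actual digit.
forced = re.search(r"\s*(\d+).*?Oxidation\s+\(M\)", text)

print(repr(lazy.group(1)), lazy.start())
print(repr(strict.group(1)), strict.start())
print(repr(forced.group(1)))
```

The start offsets make the difference visible: the first pattern matches from offset 0 with an empty group, while the second only matches from the space just before the 7.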
Thanks for the quick responses!
The numbers in the input are left out if there is only one
(peptide) modification instead of 7 as in the previous
example, e.g.:
"AMEVDIEERPK + Oxidation (M)"
and there would be no match if "\d+" was used. But maybe I
should use two regular expressions, one for each of these two
cases. This would increase the complexity of the program
somewhat (as I want to avoid memory garbage from
constructing regular expression for each string to be
matched), but is acceptable.
What I really wanted to do was to let the user specify a
match rule without requiring the rule to match from the
beginning of the (peptide) modification (that's why I tried
to introduce the non-greedy match).
Right now the user's rule is prepended with "\s*(\d*)\s*"
and the user must thus specify "Oxidation\s+\(M\)" to
match. Specifying e.g. "dation\s+\(M\)" will not work.
To answer your second message, you (or your user) can specify \w*dation\s+\(M\) to match either Oxydation (M) or Gradation (M) or dation (M).
With the syntax update, it seems we don't need to worry about the difference between \d+ and \d*. There's always a + sign present, even if there are no digits. Matching this + constrains the regex to the point that it works as expected:
"\s* // whitespace before +
\+ // The + sign itself
\s* // whitespace after +
(\d*) // optional digits
.*? // any non-digit between the last digit and Oxidation (M)
Oxidation\s+\(M\)"
Since the + must be matched first, and must be matched precisely once, the AMEVDIEERPK prefix cannot be matched by .*?.
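A quick way to check this claim is to run the plus-anchored pattern against both forms of the input, with and without a count. The sketch below uses Python's re.VERBOSE so the pattern can keep its comments; the name PLUS_ANCHORED is made up, and the second sample string is the digit-free example quoted earlier in the thread.

```python
import re

PLUS_ANCHORED = re.compile(
    r"""
    \s*                 # whitespace before +
    \+                  # the + sign itself
    \s*                 # whitespace after +
    (\d*)               # optional digits (captured)
    .*?                 # anything between the digits and the modification name
    Oxidation\s+\(M\)
    """,
    re.VERBOSE,
)

with_count = PLUS_ANCHORED.search("AMEVDIEERPK + 7 Oxidation (M)")
without_count = PLUS_ANCHORED.search("AMEVDIEERPK + Oxidation (M)")

print(repr(with_count.group(1)))
print(repr(without_count.group(1)))
```

In both cases the match can only begin at the whitespace in front of the + sign, so the peptide sequence never gets a chance to be swallowed by .*?.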
I settled on using \w* for now. The user will be required
to specify matching for any white space, but it covers the
majority of cases for this particular application and how it
is commonly used.
So for the example the regular expression is then:
\s*(\d*)\s*\w*Oxidation\s+\(M\)
".*?" does the shortest possible match, and "\d*" is allowed to match zero digits. Because the match is attempted from the first character of the string, "\d*" matches zero digits there and ".*?" then expands only as far as needed to reach "Oxidation".
Reference: Quantifiers in Regular Expressions (MSDN)
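The prefixing scheme discussed in this thread (the user's rule prepended with "\s*(\d*)\s*", plus the "\w*" that was settled on) can be tried out with a few lines of Python. This is a sketch of the idea only, not the MSQuant code; COUNT_PREFIX and modification_count are made-up names.

```python
import re

# Prefix quoted in the thread, extended with the \w* that was settled on.
COUNT_PREFIX = r"\s*(\d*)\s*\w*"


def modification_count(user_rule, text):
    """Return the digits captured before the user's rule.

    Returns '' when the rule matches without a count and None when the
    rule does not match at all.
    """
    match = re.search(COUNT_PREFIX + user_rule, text)
    return match.group(1) if match else None


sample = "AMEVDIEERPK + 7 Oxidation (M)"
print(repr(modification_count(r"Oxidation\s+\(M\)", sample)))
print(repr(modification_count(r"dation\s+\(M\)", sample)))
print(repr(modification_count(r"Oxidation\s+\(M\)", "AMEVDIEERPK + Oxidation (M)")))
```

With the \w* in the prefix, a partial rule such as "dation\s+\(M\)" still finds the count, because \w* absorbs the leading "Oxi" of the modification name.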
I am sorry, there is more to the syntax...
The plus sign can not be relied on. It separates the
(peptide) sequence and the (peptide) modifications. There
can be more than one modification for each sequence. Sample
with two modifications (there are 7 spaces between "2" and
"L"):
"KLIDLTQFPAFVTPMGK + Oxidation (M); 2 Lysine-13C615N2 (K-full)"
The user could specify "\S+\s+\(K-full\)" for the second
modification and "2" should be extracted.
Here are some more sample lines (after the plus sign):
" Phospho (ST); 2 Dimethyl (K); Dimethyl (N-term)"
" Phospho (ST); 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"
" N-Acetyl (Protein)"
" 2 Dimethyl:2H(4) (K); Dimethyl:2H(4) (N-term)"
" N-Acetyl (Protein); 2 Lysine-13C615N2 (K-full)"
" Oxidation (M); N-Acetyl (Protein)"
" Oxidation (M); N-Acetyl (Protein); Lysine-13C615N2 (K-full)"
" N-Acetyl (Protein); Lysine-13C615N2 (K-full)"
" Oxidation (M); Lysine-13C615N2 (K-full)"
" Oxidation (M)"
" 2 Oxidation (M); Lysine-13C615N2 (K-full)"
A sample file with user defined rules can be found at
(packed in 7-zip format):
<http://www.pil.sdu.dk/1/MSQuant/CEBIquantModes,2008-11-10.7z>