How to detect a Latin / non-English word in a file in MuleSoft - Mule

I want to detect Latin / non-English words in a file in a Mule application running in Anypoint Studio (MuleSoft products). Can anyone help me?
Basically my code fetches a file from a legacy system, reads it, and posts the data to Salesforce. While reading the file I need to detect whether any Latin / non-English words are present in the name column.

There is no built-in function to detect characters outside the English alphabet that I'm aware of in Mule.
One alternative is to create a custom DataWeave function and use the charCode() or charCodeAt() functions to compare the Unicode code point of each character in the file against the allowed English characters, iterating over the characters of the file. This assumes that the file is a text file that can be read as a string.
Another alternative is to implement the same algorithm in a Java class and call it using the Java Module.
This is a solution with DataWeave using a recursive function to iterate over the characters. It would be more efficient if there were a way to avoid the recursion:
%dw 2.0
output application/json
import * from dw::core::Strings

// A-Z, a-z, or space
fun isEnglishChar(c) =
    (c >= 65 and c <= 90) or (c >= 97 and c <= 122) or (c == 32)

// check the code of the first character, then recurse on the rest
fun isEnglishWord(s) =
    if (sizeOf(s) > 1) isEnglishChar(charCode(s)) and isEnglishWord(s[1 to -1])
    else if (sizeOf(s) == 1) isEnglishChar(charCode(s))
    else true
---
payload map isEnglishWord($.name)
Input:
[
  {
    "name": "has space"
  },
  {
    "name": "JustEnglish"
  },
  {
    "name": "ñó"
  }
]
Output:
[
true,
true,
false
]
Using functions makes it easy to reuse and to modify the logic if needed.
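As a side note, one way to avoid the recursion entirely (a sketch, not part of the answer above) is to invert the test and let contains scan the whole string with a regex:

%dw 2.0
output application/json
---
// true when the name has no character outside A-Z, a-z and space
payload map not ($.name contains /[^A-Za-z ]/)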

Use the regex below to find non-English words. I created a simple example as below.
input:
{
  "message": "你好"
}
code:
%dw 2.0
output application/json
---
payload.message contains (/[^\x00-\x7F]+/)
Output: true
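To apply the same check to the name column from the original question, a small sketch (the name field comes from the question; hasNonEnglish is just an illustrative output field):

%dw 2.0
output application/json
---
payload map { name: $.name, hasNonEnglish: $.name contains /[^\x00-\x7F]+/ }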

Assuming you are able to parse the input and get the string value of the name field, you can iterate over the string and, for each word in it, apply the logic below.
Logic: assuming a word is either an English word or a non-English word, pick the first letter of the word and check whether it is one of the 26 English letters. For an English word the value of No_Latin_Word should be true, otherwise false.
%dw 2.0
output application/json
//var name = "ĥć"
var name = "hc"
---
// true when the uppercased first letter is one of the 26 English letters
No_Latin_Word: upper(name[0]) contains /[A-Z]/
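With var name = "hc" this evaluates to the following (and to false for the commented-out "ĥć", since its uppercased first letter "Ĥ" is outside A-Z):

{
  "No_Latin_Word": true
}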


How do I replace part of a string with a lua filter in Pandoc, to convert from .md to .pdf?

I am writing markdown files in Obsidian.md and trying to convert them via Pandoc and LaTeX to PDF. Text itself works fine doing this; however, in Obsidian I use ==equal signs== to highlight something, and this doesn't work in LaTeX.
So I'd like to create a filter that either removes the equal signs entirely, or replaces them with something LaTeX can render, e.g. \hl{something}. I think this would be the same process.
I have a filter that looks like this:
return {
  {
    Str = function (elem)
      if elem.text == "hello" then
        return pandoc.Emph {pandoc.Str "hello"}
      else
        return elem
      end
    end,
  }
}
This works: it replaces any instance of "hello" with an italicized version of the word. HOWEVER, it only works with whole words, e.g. if "hello" were part of a word, it wouldn't touch it. Since the equal signs are read as part of one word, it won't touch those.
How do I modify this (or, please, suggest another filter) so that it CAN replace and change parts of a word?
Thank you!
A string like Hello, World! becomes a list of inlines in pandoc: [ Str "Hello,", Space, Str "World!" ]. Lua filters don't make matching on that particularly convenient: the best method is currently to write a filter for Inlines and then iterate over the list to find matching items.
For a complete example, see https://gist.github.com/tarleb/a0646da1834318d4f71a780edaf9f870.
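As a minimal sketch of the idea (simplified: it only handles a highlight contained in a single word, like ==hello==; the gist above covers the general, multi-word case):

-- Turn a single-word highlight like "==hello==" into a Span with class
-- 'mark'. Multi-word highlights need the list-walking Inlines approach.
function Str (elem)
  local inner = elem.text:match('^==(.-)==$')
  if inner then
    return pandoc.Span({pandoc.Str(inner)}, pandoc.Attr('', {'mark'}))
  end
end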
Assuming we already found the highlighted text and converted it to a Span with class mark, we can then convert that to LaTeX with
function Span (span)
  if span.classes:includes 'mark' then
    return {pandoc.RawInline('latex', '\\hl{')} ..
      span.content ..
      {pandoc.RawInline('latex', '}')}
  end
end
Note that the current development version of pandoc, which will become pandoc 3 at some point, supports highlighted text out of the box when called with
pandoc --from=markdown+mark ...
E.g.,
echo '==Hi Mom!==' | pandoc -f markdown+mark -t latex
⇒ \hl{Hi Mom!}

How to solve parse issues when a CSV has a field content escaped with double quotes

The input is received from a Salesforce Bulk API query.
INPUT
"RecordTypeId","Name","Description"
"AAA","Talent 2022 - Skills Renewal - ABC","DF - 14/03 - Monty affirmed that the ""mastercard approach"" would best fit in this situation. I will connect (abc, def, ghi) and the confirm booking tomorrow (15/03)"
SCRIPT:
%dw 2.0
output application/csv separator=",", ignoreEmptyLine=false, quoteValues=true, quoteHeader=true, lineSeparator="\r\n"
---
payload
OUTPUT:
"RecordTypeId","Name","Description"
"AAA","Talent 2022 - Skills Renewal - ABC","DF - 14/03 - Monty affirmed that the , def, ghi) and the confirm booking tomorrow (15/03)"
Expected OUTPUT:
The Description column has " and , in it, and therefore some of the description content is getting lost and some is getting shifted to different columns. I need the entire description value in one column.
The escape character has to be set to a double quote (") for DataWeave to recognize that "" is an escaped quote and not the end of a string. You cannot use replace or any other string operation because they are executed after the input is parsed.
You need to configure the reader properties in the source of that payload, for example in the SFTP or HTTP listener, or whatever connector or operation reads the CSV. There you can add the outputMimeType attribute and set the input type and its properties. Note that because the flow is in an XML file you have to XML-escape the double quotes (&quot;), and you also need to escape the double quote as DataWeave expects, with a backslash (\).
Example:
outputMimeType="application/csv; escape=&quot;\&quot;&quot;"
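For instance, on an SFTP read operation it could look like this (a sketch; the config name and path are illustrative):

<sftp:read config-ref="SFTP_Config" path="input.csv"
    outputMimeType="application/csv; escape=&quot;\&quot;&quot;" />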
It looks like your payload is using " as the escape character. By default DataWeave expects \ as the escape character for CSV, so you will need to specify the escape character explicitly while reading your input, after which DataWeave should be able to read the complete description as a single value.
For example, the DataWeave below shows how you can use the input directive to read your CSV correctly. I do not know what exactly your expected output is, so I am just giving an example that writes the value of Description as text.
%dw 2.0
input payload application/csv escape='"'
output text
---
payload[0].Description
The output of this will be:
DF - 14/03 - Monty affirmed that the "mastercard approach" would best fit in this situation. I will connect (abc, def, ghi) and the confirm booking tomorrow (15/03)
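If you also need to write the payload back out as CSV with the same quoting convention, a sketch under the same assumptions:

%dw 2.0
input payload application/csv escape='"'
output application/csv quoteValues=true, escape="\""
---
payload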

How do I match using :global in Raku grammar?

I'm trying to write a Raku grammar that can parse commands that ask for programming puzzles.
This is a simplified version just for my question, but the commands combine a difficulty level with an optional list of languages.
Sample valid input:
No language: easy
One language: hard javascript
Multiple languages: medium javascript python raku
I can get it to match one language, but not multiple languages. I'm not sure where to add the :g.
Here's an example of what I have so far:
grammar Command {
    rule TOP { <difficulty> <languages>? }
    token difficulty { 'easy' | 'medium' | 'hard' }
    rule languages { <language>+ }
    token language { \w+ }
}
multi sub MAIN(Bool :$test) {
    use Test;
    plan 5;

    # These first 3 pass.
    ok Command.parse('hard', :token<difficulty>), '<difficulty> can parse a difficulty';
    nok Command.parse('no', :token<difficulty>), '<difficulty> should not parse random words';

    # Why does this parse <languages>, but <language> fails below?
    ok Command.parse('js', :rule<languages>), '<languages> can parse a language';

    # These last 2 fail.
    ok Command.parse('js', :token<language>), '<language> can parse a language';

    # Why does this not match both words? Can I use :g somewhere?
    ok Command.parse('js python', :rule<languages>), '<languages> can parse multiple languages';
}
This works, even though my test #4 fails:
my token wrd { \w+ }
'js' ~~ &wrd; #=> 「js」
Extracting multiple languages works with a regex using this syntax, but I'm not sure how to use that in a grammar:
'js python' ~~ m:g/ \w+ /; #=> (「js」 「python」)
Also, is there an ideal way to make the order unimportant so that difficulty could come anywhere in the string? Example:
rule TOP { <languages>* <difficulty> <languages>? }
Ideally, I'd like for anything that is not a difficulty to be read as a language. Example: raku python medium js should read medium as a difficulty and the rest as languages.
There are two things at issue here.
To specify a subrule in a grammar parse, the named argument is always :rule, regardless of whether in the grammar it's a rule, token, method, or regex. Your first two tests are passing because they represent valid full-grammar parses (that is, TOP), since the unknown :token named argument is ignored.
That gets us:
ok Command.parse('hard', :rule<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :rule<difficulty>), '<difficulty> should not parse random words';
ok Command.parse('js', :rule<languages> ), '<languages> can parse a language';
ok Command.parse('js', :rule<language> ), '<language> can parse a language';
ok Command.parse('js python', :rule<languages> ), '<languages> can parse multiple languages';
# Output
ok 1 - <difficulty> can parse a difficulty
ok 2 - <difficulty> should not parse random words
ok 3 - <languages> can parse a language
ok 4 - <language> can parse a language
not ok 5 - <languages> can parse multiple languages
The second issue is how implied whitespace is handled in a rule. In a token, the following are equivalent:
token foo { <alpha>+ }
token bar { <alpha> + }
But in a rule, they would be different. Compare the token equivalents for the following rules:
rule foo { <alpha>+ }
token foo { <alpha>+ <.ws> }
rule bar { <alpha> + }
token bar { [<alpha> <.ws>] + }
In your case, you have <language>+, and since language is \w+, it's impossible to match two (because the first one will consume all the \w). Easy solution though, just change <language>+ to <language> +.
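Putting that together, a sketch of the grammar with only that one change:

grammar Command {
    rule TOP { <difficulty> <languages>? }
    token difficulty { 'easy' | 'medium' | 'hard' }
    rule languages { <language> + }   # the space before '+' allows <.ws> between languages
    token language { \w+ }
}

say Command.parse('medium javascript python raku');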
To allow the <difficulty> token to float around, the first solution that jumps to my mind is to match it and bail in a <language> token:
token language { <!difficulty> \w+ }
<!foo> will fail if, at that position, it can match <foo>. This will work almost perfectly until you get a language like 'easyFoo'. The easy fix there is to ensure that the difficulty token always occurs at a word boundary:
token difficulty {
    [
    | easy
    | medium
    | hard
    ]
    >>
}
where >> asserts a word boundary on the right.
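Combining both ideas, a sketch that lets the difficulty float anywhere in the command (the spaces before each + are significant, for the reason explained above):

grammar Command {
    rule TOP { <language> * <difficulty> <language> * }
    token difficulty { [ easy | medium | hard ] >> }
    token language { <!difficulty> \w+ }
}

# 'medium' is read as the difficulty, the rest as languages
say Command.parse('raku python medium js');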

Formula delimiter differences on different locales

I'm trying to append cells with hyperlinks to a spreadsheet file by following the instructions here: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#celldata
A hyperlink this cell points to, if any. This field is read-only. (To set it, use a =HYPERLINK formula in the userEnteredValue.formulaValue field.)
The problem is that some formulas have multiple parameters delimited by a comma, but the delimiter differs on spreadsheets with different locales, like Turkey. In the Turkey locale the delimiter is a semicolon, not a comma. I didn't check whether the delimiters differ in other locales as well.
After I tried to add the link as formulaValue, the result looks like this on a spreadsheet with the Turkey locale:
https://user-images.githubusercontent.com/5789670/77210180-61581500-6b11-11ea-9302-81dcf84256f8.png
and this is from a spreadsheet with the United States locale:
https://user-images.githubusercontent.com/5789670/77210238-8e0c2c80-6b11-11ea-9eb8-ea82fdc869d2.png
Both spreadsheets have the same formulas; the only difference is this (compared to a blank spreadsheet):
https://user-images.githubusercontent.com/5789670/77210339-cc095080-6b11-11ea-8805-92b3f6c59b0b.png
It's not really possible for me to track/identify the delimiter configuration for every locale. I'm simply looking for a way to generate the hyperlink formula without delimiter issues.
Something like a function
.getDelimiter("Europe/Istanbul")
or a field in the properties to indicate which delimiter is used in the target spreadsheet file:
// SpreadsheetProperties
"properties": {
  "title": string,
  "locale": string,
  "timeZone": string,
  "formulaDelimiter": string, // read-only
  ...
}
Environment details
OS: Ubuntu 18.04
Node.js version: v12.13.0
npm version: 6.13.7
googleapis version: ^48.0.0
Steps to reproduce
Have two different spreadsheets with the United States and Turkey locales.
Use the following data to append a cell with the batchUpdate API:
{
  "requests": [
    {
      "appendCells": {
        "fields": "*",
        "rows": [
          {
            "values": [
              {
                "userEnteredFormat": {},
                "userEnteredValue": {
                  "formulaValue": "=HYPERLINK('https://google.com','20006922')"
                }
              }
            ]
          }
        ],
        "sheetId": 111111
      }
    }
  ]
}
Thanks!
The original issue is on GitHub and can be found here: https://github.com/googleapis/google-api-nodejs-client/issues/1994
In your case, how about this modification?
Issue and workaround:
When the comma , is used, as in "formulaValue": "=HYPERLINK('https://google.com','20006922')", in a locale which uses the semicolon ;, and the formula is put using the batchUpdate method of the Sheets API, the comma is kept without being replaced. This is what causes the error.
On the other hand, when the semicolon is used as the delimiter instead of the comma in a locale which uses the comma, and the formula is put using the Sheets API, the semicolon is automatically replaced with a comma. So no error occurs.
Given this situation, how about the following modification? In this case I also replaced ' with ".
From:
"formulaValue": "=HYPERLINK('https://google.com','20006922')"
To:
"formulaValue": "=HYPERLINK(\"https://google.com\";\"20006922\")"

How to convert a string to camel case in Mule 4

Is there any function to convert a string, e.g. "iamhuman", to camel case ("iAmHuman") in a Mule 4 application?
There is a camelize function you can use in DataWeave. BUT it will NOT work with your example because it won't know where the word breaks are. If you had another separator, such as underscores or hyphens, then this would work:
%dw 2.0
import * from dw::core::Strings
output application/json
---
{ "camelize" : camelize("i_am_human") }