Textual representation of LaBSE preprocessor output? - tokenize

I use the following model to tokenize sentences from multiple languages:
https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2
Which, for the following input:
"I wish you a pleasant flight and a good meal aboard this plane."
outputs the following tokens:
[101, 146, 34450, 15100, 170, 147508, 48088, 14999, 170, 17072, 66369, 351617, 15272, 69746, 119, 102]
From this output, I would like to recover a textual representation of the tokens. Something like :
[START, I, wish, ..., plane, .]
So far I've been looking for the token<=>text mapping, but found resources mostly about BERT, which has got several MONO-lingual models, while I want to stay language-agnostic.
Any clue about how to do that?
Thanks in advance for your help,

The default cache location for the google/universal-sentence-encoder-cmlm/multilingual-preprocess/2 model is /tmp/tfhub_modules/8e75887695ac632ead11c556d4a6d45194718ffb. In its assets directory, you'll find cased_vocab.txt, which is the vocabulary the preprocessor uses:
!cat /tmp/tfhub_modules/.../assets/cased_vocab.txt | sed -n 102p
> [CLS]
!cat /tmp/tfhub_modules/.../assets/cased_vocab.txt | sed -n 147p
> I
!cat /tmp/tfhub_modules/.../assets/cased_vocab.txt | sed -n 34451p
> wish
...
Note that sed uses 1-based line numbers while the token IDs output by the preprocessor are 0-based, so token ID n is on line n+1 of the vocabulary file.
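To look up a whole list of IDs in one go rather than one sed call per token, something along these lines might work. It is only a sketch: ids_to_tokens is a hypothetical helper name, and the demo uses a tiny stand-in vocabulary instead of the real cased_vocab.txt.

```shell
# Hypothetical helper: print the vocabulary entry for each token ID, in order.
# Usage: ids_to_tokens VOCAB_FILE ID [ID...]
ids_to_tokens() {
    vocab=$1
    shift
    # First input (the ID list on stdin): remember the requested IDs in order.
    # Second input (the vocab file): token ID n lives on line n+1.
    printf '%s\n' "$@" |
    awk 'NR == FNR { want[$1]; order[++n] = $1; next }
         (FNR - 1) in want { tok[FNR - 1] = $0 }
         END { for (i = 1; i <= n; i++) print tok[order[i]] }' - "$vocab"
}

# Demo with a stand-in vocabulary (assumption: NOT the real vocabulary file).
demo_vocab=$(mktemp)
printf '%s\n' '[PAD]' '[CLS]' 'I' 'wish' > "$demo_vocab"
ids_to_tokens "$demo_vocab" 1 2 3
rm -f "$demo_vocab"
```

With the real file you would pass the path to cased_vocab.txt as the first argument, followed by the IDs returned by the preprocessor (101 146 34450 ...).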

Issues converting a small Hex value to a Binary value

I am trying to take the contents of a file that has a Hex number and convert that number to Binary and output to a file.
This is what I am trying but not getting the binary value:
xxd -r -p Hex.txt > Binary.txt
The contents of Hex.txt is: ff
I have also tried FF and 0xFF, but would like to just use ff since the device I am pulling the info from has it in that format.
Instead of 11111111 which it should be, I get a y with 2 dots above it.
If I change it to ee, I get an i with 2 dots. It seems to be reading it just fine, but according to what I have read on the xxd -r -p command, it is not outputting it in the correct format.
The other ways I have found to convert Hex to Binary have either not worked or are pretty big Bash scripts that seem unnecessary for what I thought would be a simple task.
This also gives me the y with 2 dots.
$ for i in $(cat Hex.txt) ; do printf "\x$i" ; done > Binary.txt
For some reason almost every solution I find gives me this format instead of a human readable Binary value with 1s and 0s.
Any help is appreciated. I am planning on using this in a script to pull the Relay values from Digital Loggers devices using curl and giving Home Assistant a readable file to record the Relay State. Digital Loggers curl cmd gives the state of all 8 relays at once using Hex instead of being able to pull the status of a specific relay.
If "file.txt" contains:
fe
0a
and you run this:
perl -ane 'printf("%08b\n",hex($_))' file.txt
You'll get this:
11111110
00001010
If you use it a lot, you might want to make a bash function of it in your login profile along these lines - being extremely respectful of spaces and semi-colons that might look unnecessary:
bin(){ perl -ane 'printf("%08b\n",hex($_))' "$1" ; }
Then you'll be able to do:
bin file.txt
If you dislike Perl for some reason, you can achieve something similar without it as follows:
tr '[:lower:]' '[:upper:]' < file.txt |
while read h ; do
echo "obase=2; ibase=16; $h" | bc
done
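Note that the bc loop does not pad to eight digits (0a comes out as 1010). If you want padded output without Perl or bc, plain shell arithmetic can do it. This is a rough sketch: hex_to_bin8 and relay_state are hypothetical names, the input is assumed to be a bare hex byte such as ff (no 0x prefix), and the mapping of relay 1 to the lowest bit is an assumption you would need to check against your device.

```shell
# Hypothetical helper: print one hex byte (e.g. ff) as 8 binary digits.
hex_to_bin8() {
    n=$((0x$1))                  # parse the hex digits as a number
    bits=""
    i=0
    while [ "$i" -lt 8 ]; do
        bits="$((n % 2))$bits"   # prepend the current lowest bit
        n=$((n / 2))
        i=$((i + 1))
    done
    printf '%s\n' "$bits"
}

# Hypothetical helper: state (0 or 1) of relay N in a hex byte.
# ASSUMPTION: relay 1 is the least significant bit.
relay_state() {
    printf '%s\n' "$(( (0x$1 >> ($2 - 1)) & 1 ))"
}

hex_to_bin8 ff      # prints 11111111
hex_to_bin8 0a      # prints 00001010
relay_state fe 1    # prints 0
```

For a file with one value per line you could loop over it with while read -r h; do hex_to_bin8 "$h"; done < Hex.txt.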

Capture and parse output of Whateverable bots

Since that is the standard way to present output in the Perl 6 documentation, I'm using the whateverable bots to evaluate expressions via the #perl6 IRC channel or the #whateverable channel. Produced output is something like this:
10:28:19 jmerelo | p6: say 333444777 ~~ /(3+)/ │
10:28:19 evalable6 | jmerelo, rakudo-moar 5ce24929f: OUTPUT: «「333」␤ 0 => 「333」␤»
(in the WeeChat console program). From that output, I cut and paste to the document, erasing the parts I'm not interested in.
I was wondering if there was some easy way to parse and save that output directly, either server-based (some Whateverable bots save to gists, for instance), or client-based via scripting the irssi or WeeChat platform.
I think the most convenient solution in this case would be to bypass IRC bots and define a bash function. Something like this:
d6() { echo -n '# OUTPUT: «'; perl6 -e "$1" | sed -z 's/\n/␤/g'; echo '»'; }
Then you can use it like this:
d6 'say 42'
Which will produce this output:
# OUTPUT: «42␤»
Of course, you'd need a different solution for other operating systems.
As a bonus, you can also put it into the clipboard automatically:
d6 'say 42' | tee >(xclip -selection clipboard)

Removing the asterisk (*) from the end of a FASTA sequence in a multi-FASTA file

I have a multi-FASTA file containing predicted proteins from 2 ab initio tools. Every sequence has an asterisk (*) at the end, and I want to remove it from the file. My sequences look like this:
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP*
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP*
I want the sequences like this:
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP
Can anyone help me with this? Thank you.
If the text is stored in a file "temp.txt", you can use this command:
sed -i 's/\*$//' temp.txt
In awk, if you keep your fastas in file:
$ awk '{sub(/\*$/,"")}1' file
>snapgene1
SFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP
>snapgene2
SFLPSAEAIEKVLSHIIIIAAAAKKKPPFFDDMKAELQQPEMRWFWP
It replaces trailing * with nothing.
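Both commands above touch every line, headers included. If a header could ever end in * and should stay as it is, a variant that skips lines starting with > might look like this. It is a sketch only, and strip_stop is a made-up name.

```shell
# Hypothetical helper: strip one trailing "*" from sequence lines only.
# Usage: strip_stop FILE...   (or pipe the FASTA text on stdin)
strip_stop() {
    awk '/^>/ { print; next }        # header line: print unchanged
         { sub(/\*$/, ""); print }   # sequence line: drop a trailing *
        ' "$@"
}

printf '>snapgene1\nSFLPSAEAIEKVLSHMSRRIIDDMKAELQQPEMRWFWP*\n' | strip_stop
```

Redirect the output to a new file (strip_stop file > cleaned.fasta) rather than back onto the input.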

awk match pattern from file

I have very large data sets in which I need to find specific patterns located in a specific column index and need the entire line output. I've gotten [successfully] as far as a single cmd line pattern match:
awk -F'|' -v OFS='|' '$1=="100002"{print $1,$22,$11,$12,$13,$28,$25,$27}' searchfile > outfile
100002 - being the search pattern, is an exact match and located in column 1
searchfile - is the data file with 3.8 million lines and 60 columns all | delimited
Now I want to modify this search by specifying an input patternfile, because I have a little over 800 patterns that need to be matched and outputted. I've done my best to search the site and did find the use of the -f flag however I don't know how to integrate that with my search criteria per above. I need to be able to specify: exact match, specific column index search, specify specific columns to output, and specific in/out delimiter.
sample data set (note this has been modified to protect data owner):
100001|0|60|100001|AAR Corp| | |Industrial|Aerospace/Defense|Aerospace/Defense-Equip|US|US|US|IL|DE|;2;6;1;1;1100 North Wood Dale Road;1; ;1;Wood Dale;1;IL;1;60191;1;United States;|
15460796|0|60|15460796|PayPal Data Services Inc|348546|eBay Inc|Consumer, Non-cyclical|Commercial Services|Inactive/Unknown|US|US|US|CA|DE|;2;6;1;1;2211 North 1st Street;1; ;1;San Jose;1;CA;1;95125;1;United States;|
100003|0|60|100003|Abex Inc|170435|Mafco Consolidated Group Inc|Industrial|Aerospace/Defense|Aerospace/Defense-Equip|US|US|US|NH|DE|;2;6;1;1;Liberty Lane;1; ;1;Hampton;1;NH;1;03842;1;United States;|
100004|0|60|100004|Abitibi-Consolidated Inc|23165941|Resolute Forest Products Inc|Basic Materials|Forest Products&Paper|Paper&Related Products|CA|CA|CA|QC|QC|;2;6;1;1;1155 Metcalfe Street;1;Suite 800;1;Montreal;1;QC;1;M5J 2P5;1;Canada;|
100005|0|60|100005|Acme Electric Corp|100763|Hubbell Inc|Industrial|Electrical Compo&Equip|Power Conv/Supply Equip|US|US|US|NC|NY|;2;6;1;1;400 Quaker Road;1; ;1;East Aurora;1;NY;1;14052;1;United States;|
100006|0|60|100006|ACME-Cleveland Corp|100430|Danaher Corp|Industrial|Hand/Machine Tools|Mach Tools&Rel Products|US|US|US|OH|OH|;2;6;1;1;30100 Chagrin Boulevard;1;Suite 100;1;Pepper Pike;1;OH;1;44124-5705;1;United States;|
100007|0|60|100007|Acuson Corp|196005|Siemens Corp|Consumer, Non-cyclical|Healthcare-Products|Ultra Sound Imaging Sys|US|US|US|CA|DE|;2;6;1;1;1220 Charleston Road;1; ;1;Mountain View;1;CA;1;94039;1;United States;|
100009|0|60|100009|ADT Ltd|101520|Tyco International Plc|Consumer, Non-cyclical|Commercial Services|Protection-Safety|BM|BM|BM| | |;2;6;1;1;Cedar House;1;41 Cedar Avenue;1;Hamilton;1; ;1;HM 12;1;Bermuda;|
100010|0|60|100010|Advanced Micro Devices Inc| | |Technology|Semiconductors|Electronic Compo-Semicon|US|US|US|CA|DE|;2;6;1;1;One AMD Place;1;PO Box 3453;1;Sunnyvale;1;CA;1;94088-3453;1;United States;|
input pattern search:
100006
100052
You can externalize all the variables from the script:
$ awk -v sep='|' -v matchindex='1' -v matchvalue='100002' -v columns='1,22,11,12,13,28,25,27' \
'BEGIN{FS=OFS=sep; n=split(columns,c,",")}
$matchindex==matchvalue{for(i=1;i<n;i++)
printf "%s",$c[i] OFS; printf "%s\n", $c[n]}' searchfile > outfile
and perhaps write another script to generate the first line from a config file.
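For the pattern file itself, the usual awk idiom is to read it first into an array and then test each data line against that array. Here is a minimal sketch of that idea, assuming one key per line in the pattern file with no trailing whitespace or carriage returns. match_keys is a hypothetical name, and the printed columns (1, 5, 8) are chosen only so the demo works on the short sample rows.

```shell
# Hypothetical helper: print selected columns of rows whose first field
# exactly equals one of the keys listed in a pattern file.
# Usage: match_keys PATTERNFILE DATAFILE
match_keys() {
    awk -F'|' -v OFS='|' '
        NR == FNR { keys[$1]; next }     # first file: remember every key
        $1 in keys { print $1, $5, $8 }  # second file: exact match on column 1
    ' "$1" "$2"
}

# Demo with two rows from the sample data and the sample pattern list.
demo_pat=$(mktemp); demo_dat=$(mktemp)
printf '%s\n' 100006 100052 > "$demo_pat"
printf '%s\n' \
  '100005|0|60|100005|Acme Electric Corp|100763|Hubbell Inc|Industrial' \
  '100006|0|60|100006|ACME-Cleveland Corp|100430|Danaher Corp|Industrial' > "$demo_dat"
match_keys "$demo_pat" "$demo_dat"   # prints 100006|ACME-Cleveland Corp|Industrial
rm -f "$demo_pat" "$demo_dat"
```

On the real data you would replace the print list with $1,$22,$11,$12,$13,$28,$25,$27 and redirect to outfile. Because the lookup is an array membership test and not a regex, it is an exact match and the data file is read only once, even with 800 patterns.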

grep/awk - how to filter out a certain keyword

I have the following line of text, where I want to filter out the N from (KEY_N), etc. Keep in mind that the N is not constant; it can be anything, like (KEY_J), (KEY_K), (KEY_L), (KEY_I), (KEY_SPACE) and so on.
Event: time 1442439135.995248, type 1 (EV_KEY), code 49 (KEY_N), value 0
Update:
I hope that I got the question properly, if not then please let me know.
Having GNU grep you can use this:
grep -oP '.*\(\K[^)]+' file
An alternative on non GNU systems might be to use sed:
sed 's/.*(\([^)]\{1,\}\)).*/\1/' file
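Both commands above print the whole name inside the last parentheses (KEY_N). If what you actually want is only the part after KEY_, a small variation of the sed version might do. This is a sketch, and key_name is a hypothetical name.

```shell
# Hypothetical helper: print the key name without the KEY_ prefix.
# Lines that contain no "(KEY_...)" group produce no output.
key_name() {
    sed -n 's/.*(KEY_\([^)]*\)).*/\1/p' "$@"
}

printf '%s\n' 'Event: time 1442439135.995248, type 1 (EV_KEY), code 49 (KEY_N), value 0' | key_name   # prints N
```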