Is there a way to get the location of the substring from which a certain token was produced in BERT?

I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings.
string = tokenizer.decode(...)
However, the reconstruction is not perfect. If you use an uncased pretrained model, the uppercase letters get lost. Also, if the tokenizer splits a word into 2 tokens, the second token will start with '##'. For example, the word 'coronavirus' gets split into 2 tokens: 'corona' and '##virus'.
So my question is: is there a way to get the indices of the substring from which every token is created?
For example, take the string "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record". The 9th token is the token corresponding to 'virus'.
['[CLS]', 'tokyo', 'to', 'report', 'nearly', '370', 'new', 'corona', '##virus', 'cases', ',', 'setting', 'new', 'single', '-', 'day', 'record', '[SEP]']
I want something that tells me that the token '##virus' comes from the 'virus' substring in the original string, which is located between the indices 37 and 41 of the original string.
sentence = "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record"
print(sentence[37:42]) # --> outputs 'virus'

As far as I know, there is no built-in method for that, but you can create one yourself:
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record"
b = []
b.append(([101],))  # [CLS]
for m in re.finditer(r'\S+', sentence):
    w = m.group(0)
    # (token ids of the word, (start index, inclusive end index))
    t = (tokenizer.encode(w, add_special_tokens=False), (m.start(), m.end() - 1))
    b.append(t)
b.append(([102],))  # [SEP]
b
Output:
[([101],),
([5522], (0, 4)),
([2000], (6, 7)),
([3189], (9, 14)),
([3053], (16, 21)),
([16444], (23, 25)),
([2047], (27, 29)),
([21887, 23350], (31, 41)),
([3572, 1010], (43, 48)),
([4292], (50, 56)),
([2047], (58, 60)),
([2309, 1011, 2154], (62, 71)),
([2501], (73, 78)),
([102],)]

I'd like to make an update to the answer. Since Hugging Face introduced their (much faster) Rust-written fast tokenizers, this task has become much easier:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
sentence = "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record"
encodings = tokenizer(sentence, return_offsets_mapping=True)
for token_id, pos in zip(encodings['input_ids'], encodings['offset_mapping']):
    print(token_id, pos, sentence[pos[0]:pos[1]])
101 (0, 0)
5522 (0, 5) Tokyo
2000 (6, 8) to
3189 (9, 15) report
3053 (16, 22) nearly
16444 (23, 26) 370
2047 (27, 30) new
21887 (31, 37) corona
23350 (37, 42) virus
3572 (43, 48) cases
1010 (48, 49) ,
4292 (50, 57) setting
2047 (58, 61) new
2309 (62, 68) single
1011 (68, 69) -
2154 (69, 72) day
2501 (73, 79) record
102 (0, 0)
What's more, if you feed the tokenizer a list of words instead of a plain string (and set is_split_into_words=True), you can easily distinguish the first token of each word from its subsequent tokens (the first value of the offset tuple will be zero), which is a very common need in token classification tasks.
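As a minimal illustration of that last point (no model download needed; the offsets below are hard-coded to mimic what the fast tokenizer returns when is_split_into_words=True):

```python
# Hypothetical per-word offsets, mimicking return_offsets_mapping=True
# together with is_split_into_words=True: offsets restart at 0 for each
# word, so a token whose start is > 0 continues the previous word.
word_level_offsets = [
    (0, 0),   # [CLS]
    (0, 5),   # tokyo
    (0, 6),   # corona   (first piece of "coronavirus")
    (6, 11),  # ##virus  (continuation piece: start > 0)
    (0, 5),   # cases
    (0, 0),   # [SEP]
]
is_continuation = [start > 0 for start, end in word_level_offsets]
print(is_continuation)  # [False, False, False, True, False, False]
```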

Related

How to get the number of the sentence that includes a span or token in spaCy?

from spacy.matcher import PhraseMatcher
import spacy
from spacy.tokens import Doc, Span, Token
nlp = spacy.load("en_core_web_sm", disable=["ner", 'lemmatizer', 'attribute_ruler', 'tagger'])
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("obama", [nlp("Barack Obama"), nlp("Baracko Obama"), nlp("Baracko OBAMA")])
doc = nlp("BARACK OBAMA lifts America one last time in emotional. Farewell Baracko Obama baracko obama")
matches = matcher(doc)
spans = [Span(doc, match[1], match[2]) for match in matches]
print(spans)
matches
The results are:
[BARACK OBAMA, Baracko Obama, baracko obama]
[(15955766757638404248, 0, 2),
(15955766757638404248, 11, 13),
(15955766757638404248, 13, 15)]
How can I get the sentence, or the number of the sentence in the doc, that includes (15955766757638404248, 0, 2), (15955766757638404248, 11, 13), and (15955766757638404248, 13, 15)? For example, I want (15955766757638404248, 0, 2) to map to sentence 1, and (15955766757638404248, 11, 13) and (15955766757638404248, 13, 15) to sentence 2. Is this possible in spaCy?
You can use the .sent property on a token or span to get the corresponding sentence. (If your span covers multiple sentences, the first is returned.) So given your data you can do something like this:
match_id, start, end = (..., 0, 2)
span = doc[start:end]
print(span.sent)
It sounds like you want to know if things are in the same sentence more than the sentence number, so this should be enough. If you actually need the number of the sentence, that isn't provided directly, but you can add it easily enough by counting the sentences or using extension attributes on spans or similar.
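If you do need a sentence number, one library-agnostic way to count is to record each sentence's start token index and binary-search it (a sketch with hypothetical inputs, not a spaCy API):

```python
import bisect

def sentence_index(sent_starts, token_start):
    # sent_starts: sorted token indices where each sentence begins,
    # e.g. [0, 10] for the doc above (the second sentence starts at
    # token 10, "Farewell"). Returns the 0-based sentence index.
    return bisect.bisect_right(sent_starts, token_start) - 1

# The match (…, 0, 2) starts at token 0 -> sentence 0; the matches
# starting at tokens 11 and 13 fall in sentence 1.
print(sentence_index([0, 10], 0))   # 0
print(sentence_index([0, 10], 11))  # 1
```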

Find the frequency (number of occurrences) of a list of substrings (which are elements in a list of dictionaries) in another string

I would like to find the frequency (number of occurrences) of substrings (which are elements in a list of dictionaries that determine their categories) in another string.
See the sample input and output below.
Find the number of repetitions of the elements of st in the string named stgs.
The code:
def freqcounter(st, stgs):
    """
    :param st: A mapping of category name to keyword list.
    :type st: dict of str -> list
    :param stgs: A list of strings.
    :return: A mapping of category name to keyword occurrences in the strings.
    :rtype: dict of str -> int
    """
    stgs = str(stgs).split(" ")
    dic = {}
    count = 0
    for k, v in st.items():
        for i in range(len(v)):
            for j in range(len(stgs)):
                if v[i] == stgs[j]:
                    count += 1
        dic[k] = count
    return dic
if __name__ == '__main__':
    stgs = ['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.', 'I like my new tree, John!', "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
    st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
    outtts = freqcounter(st, stgs)
    outtts = sorted(list(outtts.items()))
    for outtt in outtts:
        print(outtt[0])
        print(outtt[1])
sample input:
#For instance, if `stgs` is the input, and I would like to count frequency of each `st` in this text.
stgs=['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.', 'I like my new tree, John!', "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
I would like to calculate two cases:
1. Do not treat word+suffix the same as the word. For instance, do not count 'lower', 'higher' as 'low', 'high'.
Sample output:
'Height': 3, 'Topic Work': 7, 'John Smith': 1
because 'height', 'high', 'low' (the elements of 'Height') were found 3 times, 'tree', 'trees' (the elements of 'Topic Work') were found 7 times, and 'John Smith' (the element of 'John Smith') was found 1 time.
2. Treat word+suffix the same as the word. For instance, count 'lower', 'higher' as 'low', 'high'.
Sample output:
'Height': 5, 'Topic Work': 7, 'John Smith': 1
My expectation is to show how many of each of them are found.
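For reference, case 1 (exact matches only) can be sketched with word-boundary regular expressions; a minimal, case-insensitive version, not the asker's code:

```python
import re

def freq_counter(categories, texts):
    # Count case-insensitive, whole-word (or whole-phrase) matches
    # of each category's keywords across all texts.
    blob = " ".join(texts).lower()
    return {
        name: sum(len(re.findall(r'\b' + re.escape(kw.lower()) + r'\b', blob))
                  for kw in keywords)
        for name, keywords in categories.items()
    }

stgs = ['John Smith sells trees, he said the height of his tree is high. I expected more trees with lower price, but it is higher than my expectation.',
        'I like my new tree, John!',
        "100 dollars per each tree is very high. Tree is source of oxygen. Next time I do my shopping from a Cheap Trees shoppers."]
st = {'Height': ['low', 'high', 'height'], 'Topic Work': ['tree', 'trees'], 'John Smith': ['John Smith']}
print(freq_counter(st, stgs))  # {'Height': 3, 'Topic Work': 7, 'John Smith': 1}
```

The \b boundaries are what keep 'lower' and 'higher' from matching 'low' and 'high'; dropping them (or adding a stemmer) would move toward case 2.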

How do I convert an integer to an index in an "if" statement?

My first question on this forum, and I'm completely new to programming, so apologies if I do something wrong.
I'm trying to program something to do a Collatz sequence on a number I put in. To do that, I have to check whether the number I put in is even or odd. The easy fix would be to use the built-in % 2 operator, but I want to do it "dirty" to learn better.
So my plan was to check whether the last digit of the input number is in the list (0, 2, 4, 6, 8), since an even number ends in one of those and is odd otherwise. The problem is how to check the very last index of the number to see if it's in that list. I tried the following code:
def Collatz(input_number):
    input_number_num = int(input_number)
    lenght = len(input_number_num)
    position = lenght - 1
    if [position] in input_number_num is in (0, 2, 4, 6, 8)
        return (input_number_num / 2)
    else:
        return (input_number_num * 3 + 1)
This gives me a syntax error, I'm guessing since it reads [lenght] as a tuple rather than the index.
You could convert the number to a string and then use the -1 index to extract the last character:
if input_number[-1] in ('0', '2', '4', '6', '8'):
But to be completely honest, I can't see any advantage of doing this over just using % 2.
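Putting that together, a corrected version of the original function might look like this (a sketch; note that // keeps the result an integer, and that len() only works on the string form of the number):

```python
def collatz(input_number):
    n = int(input_number)
    # "Dirty" even check: look at the last decimal digit instead of n % 2.
    if str(n)[-1] in ('0', '2', '4', '6', '8'):
        return n // 2
    return 3 * n + 1

print(collatz(6))  # 3
print(collatz(7))  # 22
```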

Function on CTE

Is a function on a CTE planned for the SQL standard, or available in any current RDBMS? Somewhat like this:
with strahler(node, sn) function(_parent int) as
(
    select
        s.node,
        case
            -- If the node is a leaf (has no children),
            -- its Strahler number is one.
            when count(st.*) = 0 then
                1
            when count(st.*) >= 2 then
                case
                    -- If the node has one child with Strahler number i,
                    -- and all other children have Strahler numbers less than i,
                    -- then the Strahler number of the node is i again.
                    when min(st.sn) < max(st.sn) then
                        max(st.sn)
                    -- If the node has two or more children with Strahler number i,
                    -- and no children with greater number,
                    -- then the Strahler number of the node is i + 1.
                    when min(st.sn) = max(st.sn) then
                        max(st.sn) + 1
                end
        end
    from streams s
    left join lateral strahler(s.node) st on true
    where _parent = 0 or s.to_node = _parent
    group by s.node
)
select st.*, s.expected_order
from strahler(0) st
join streams s on st.node = s.node
order by st.node;
I have a hard time devising a recursive CTE solution to this Stack Overflow question: How to determine Strahler number on a directed graph for a stream network.
Note that the conceptualized "function on CTE" works if the function is created separately. See: https://www.db-fiddle.com/f/8z58LCVhD62YvkeJjriW8d/3
I'm wondering whether that solution can be done with a pure CTE alone, without writing a function. I tried, but a CTE cannot do a left join on itself.
Anyway, I'll just re-post the nature of the problem here.
CREATE TABLE streams (
    node integer PRIMARY KEY,
    to_node integer REFERENCES streams(node),
    expected_order integer
);
INSERT INTO streams(node, to_node, expected_order) VALUES
(1, NULL, 4),
(2, 1, 4),
(3, 2, 3),
(4, 2, 3),
(5, 4, 3),
(6, 3, 2),
(7, 3, 2),
(8, 5, 2),
(9, 5, 2),
(10, 6, 1),
(11, 6, 1),
(12, 7, 1),
(13, 7, 1),
(14, 8, 1),
(15, 8, 1),
(16, 9, 1),
(17, 9, 1),
(18, 4, 1),
(19, 1, 1);
From that data, using the following algorithm (sourced from wikipedia)...
All trees in this context are directed graphs, oriented from the root towards the leaves; in other words, they are arborescences. The degree of a node in a tree is just its number of children. One may assign a Strahler number to all nodes of a tree, in bottom-up order, as follows:
If the node is a leaf (has no children), its Strahler number is one.
If the node has one child with Strahler number i, and all other
children have Strahler numbers less than i, then the Strahler number
of the node is i again.
If the node has two or more children with
Strahler number i, and no children with greater number, then the
Strahler number of the node is i + 1.
...this is produced:
See the expected_order column above for what the Strahler order number of each node should be when the algorithm is applied.
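For reference, the bottom-up algorithm itself is short when written outside SQL; a minimal Python sketch over the rows above (hypothetical helper, not a CTE answer), checked against expected_order:

```python
def strahler(children, node):
    kids = children.get(node, [])
    if not kids:
        return 1  # leaf
    s = sorted(strahler(children, k) for k in kids)
    # Increment only when the two largest child numbers tie.
    return s[-1] + 1 if len(s) >= 2 and s[-1] == s[-2] else s[-1]

# (node, to_node, expected_order) rows from the INSERT above.
rows = [(1, None, 4), (2, 1, 4), (3, 2, 3), (4, 2, 3), (5, 4, 3),
        (6, 3, 2), (7, 3, 2), (8, 5, 2), (9, 5, 2), (10, 6, 1),
        (11, 6, 1), (12, 7, 1), (13, 7, 1), (14, 8, 1), (15, 8, 1),
        (16, 9, 1), (17, 9, 1), (18, 4, 1), (19, 1, 1)]
children = {}
for node, to_node, _ in rows:
    children.setdefault(to_node, []).append(node)
assert all(strahler(children, n) == expected for n, _, expected in rows)
```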

Kettle Join Rows - Closest Element Larger than x

Using Pentaho Kettle (also known as PDI), I have a "Join Rows (cartesian product)" step which merges two streams of data.
Both the first and second stream have a numeric value attached. For example,
Stream 1 - Values 1, 3, 5
Stream 2 - Values 2, 4, 6
I want to join the two streams to get the following output:
(1, 2)
(3, 4)
(5, 6)
I would describe the correct output as having each value from stream 1 pick the smallest value from stream 2 which is larger than it.
Within the Join Rows step, I can specify stream 2 having a value greater than the stream 1 value. Unfortunately, this produces the following incorrect outcome:
(1, 2)
(1, 4)
(1, 6)
(3, 4)
(3, 6)
(5, 6)
Is there a different step that I should use instead of "Join Rows" in Kettle? Or am I missing a setting on the join rows step?
Note: I also looked at using a Stream Lookup step, but it only works for equality, not for my logic.
Thanks.
You're already halfway there.
You have two inputs: Stream1 (1, 3, 5) and Stream2 (2, 4, 6).
You join rows (make sure you sort them before joining) on value(Stream2) > value(Stream1).
You sort the resulting stream on {value(Stream1), value(Stream2)}.
This gives you
(1, 2)
(1, 4)
(1, 6)
(3, 4)
(3, 6)
(5, 6)
Put the "Add Value Fields Changing Sequence" step and set "Init sequence if value of the following fields change" to value(Stream1).
Resulting stream is:
(Stream1, Stream2, result)
(1, 2, 1)
(1, 4, 2)
(1, 6, 3)
(3, 4, 1)
(3, 6, 2)
(5, 6, 1)
Put a filter step and filter on "result=1".
The resulting stream from the "true" branch of the filter is the desired result.
I uploaded "example.ktr" with the solution (I used Kettle 4.3. version):
example.ktr
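The same join / sort / sequence / filter logic, sketched in Python for reference (not Kettle code):

```python
stream1, stream2 = [1, 3, 5], [2, 4, 6]

# Cartesian join with the condition value(Stream2) > value(Stream1),
# then sort on (Stream1, Stream2).
pairs = sorted((a, b) for a in stream1 for b in stream2 if b > a)

# "Add Value Fields Changing Sequence" + filter on result = 1:
# keep only the first (smallest Stream2) row for each Stream1 value.
result, seen = [], set()
for a, b in pairs:
    if a not in seen:
        seen.add(a)
        result.append((a, b))

print(result)  # [(1, 2), (3, 4), (5, 6)]
```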