Using Pentaho Kettle (also known as PDI), I have a "Join Rows (cartesian product)" step which merges two streams of data.
Both the first and second stream have a numeric value attached. For example,
Stream 1 - Values 1, 3, 5
Stream 2 - Values 2, 4, 6
I want to join the two streams to get the following output:
(1, 2)
(3, 4)
(5, 6)
I would describe the correct output as: for each value from stream 1, pick the smallest stream 2 value that is larger than it.
Within the Join Rows step, I can specify stream 2 having a value greater than the stream 1 value. Unfortunately, this produces the following incorrect outcome:
(1, 2)
(1, 4)
(1, 6)
(3, 4)
(3, 6)
(5, 6)
Is there a different step that I should use instead of "Join Rows" in Kettle? Or am I missing a setting on the join rows step?
Note: I also looked at using a Stream Lookup step, but it only works for equals and not for my logic.
Thanks.
You're already halfway there.
You have two inputs: Stream1 (1, 3, 5) and Stream2 (2, 4, 6).
You join rows (make sure you sort them before joining) on value(Stream2) > value(Stream1).
You sort the resulting stream on {value(Stream1), value(Stream2)}.
This gives you
(1, 2)
(1, 4)
(1, 6)
(3, 4)
(3, 6)
(5, 6)
Put the "Add Value Fields Changing Sequence" step and set the "Init
sequence if value of the following fields change" to value(Stream1).
Resulting stream is:
(Stream1, Stream2, result)
(1, 2, 1)
(1, 4, 2)
(1, 6, 3)
(3, 4, 1)
(3, 6, 2)
(5, 6, 1)
Put a filter step and filter on "result=1".
Resulting stream from "true" branch of the filter is the deisired result.
I uploaded "example.ktr" with the solution (I used Kettle 4.3. version):
example.ktr
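Outside Kettle, the same chain of steps can be expressed in a few lines of Python. This is only a minimal sketch of the logic (cross-join on the greater-than condition, sort, keep the first row per stream 1 value), not a Kettle replacement:

stream1 = [1, 3, 5]
stream2 = [2, 4, 6]

# Join Rows: keep the pairs where the stream 2 value is greater.
pairs = [(a, b) for a in stream1 for b in stream2 if b > a]

# Sort on (stream 1 value, stream 2 value), then keep only the first
# pair per stream 1 value -- the "result = 1" filter from the answer.
pairs.sort()
first_match = {}
for a, b in pairs:
    first_match.setdefault(a, b)

print(sorted(first_match.items()))  # [(1, 2), (3, 4), (5, 6)]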
I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings.
string = tokenizer.decode(...)
However, the reconstruction is not perfect. If you use an uncased pretrained model, the uppercase letters get lost. Also, if the tokenizer splits a word into 2 tokens, the second token will start with '##'. For example, the word 'coronavirus' gets split into 2 tokens: 'corona' and '##virus'.
So my question is: is there a way to get the indices of the substring from which every token is created?
For example, take the string "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record". The 9th token is the token corresponding to 'virus'.
['[CLS]', 'tokyo', 'to', 'report', 'nearly', '370', 'new', 'corona', '##virus', 'cases', ',', 'setting', 'new', 'single', '-', 'day', 'record', '[SEP]']
I want something that tells me that the token '##virus' comes from the 'virus' substring in the original string, which is located between the indices 37 and 41 of the original string.
sentence = "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record"
print(sentence[37:42]) # --> outputs 'virus'
As far as I know, there is no built-in method for that, but you can create one yourself:
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record"

b = []
b.append(([101],))  # [CLS] has no source span
for m in re.finditer(r'\S+', sentence):
    w = m.group(0)
    # Encode each whitespace-separated word separately and pair its token
    # ids with the word's (start, end) character span (end is inclusive).
    t = (tokenizer.encode(w, add_special_tokens=False), (m.start(), m.end() - 1))
    b.append(t)
b.append(([102],))  # [SEP] has no source span
b
Output:
[([101],),
([5522], (0, 4)),
([2000], (6, 7)),
([3189], (9, 14)),
([3053], (16, 21)),
([16444], (23, 25)),
([2047], (27, 29)),
([21887, 23350], (31, 41)),
([3572, 1010], (43, 48)),
([4292], (50, 56)),
([2047], (58, 60)),
([2309, 1011, 2154], (62, 71)),
([2501], (73, 78)),
([102],)]
I'd like to add an update to the answer. Since HuggingFace introduced their much faster, Rust-written fast tokenizers, this task has become much easier:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
sentence = "Tokyo to report nearly 370 new coronavirus cases, setting new single-day record"
encodings = tokenizer(sentence, return_offsets_mapping=True)
for token_id, pos in zip(encodings['input_ids'], encodings['offset_mapping']):
    print(token_id, pos, sentence[pos[0]:pos[1]])
101 (0, 0)
5522 (0, 5) Tokyo
2000 (6, 8) to
3189 (9, 15) report
3053 (16, 22) nearly
16444 (23, 26) 370
2047 (27, 30) new
21887 (31, 37) corona
23350 (37, 42) virus
3572 (43, 48) cases
1010 (48, 49) ,
4292 (50, 57) setting
2047 (58, 61) new
2309 (62, 68) single
1011 (68, 69) -
2154 (69, 72) day
2501 (73, 79) record
102 (0, 0)
What's more, if you feed the tokenizer a list of words instead of a regular string (and set is_split_into_words=True), you can easily distinguish the first token of each word from the subsequent ones (the first value of the offset tuple will be zero), which is a very common need for token classification tasks.
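Here is a minimal sketch of that idea (assuming a fast tokenizer; with is_split_into_words=True the offsets are relative to each word, so a token whose offset starts at 0 begins a word):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
words = ["Tokyo", "reported", "new", "coronavirus", "cases"]

encodings = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
for token_id, (start, end) in zip(encodings['input_ids'], encodings['offset_mapping']):
    # Offsets are per word here, so start == 0 marks the first token of a
    # word; continuation tokens like '##virus' start later in the word.
    # Special tokens ([CLS], [SEP]) map to (0, 0).
    print(token_id, (start, end), start == 0 and end > 0)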
Is a "function on CTE" planned for the SQL standard, or available in any of the current RDBMSes? Somewhat like this:
with strahler(node, sn) function(_parent int) as
(
    select
        s.node,
        case
            -- If the node is a leaf (has no children),
            -- its Strahler number is one.
            when count(st.*) = 0 then
                1
            when count(st.*) >= 2 then
                case
                    -- If the node has one child with Strahler number i,
                    -- and all other children have Strahler numbers less than i,
                    -- then the Strahler number of the node is i again.
                    when min(st.sn) < max(st.sn) then
                        max(st.sn)
                    -- If the node has two or more children with Strahler number i,
                    -- and no children with greater number,
                    -- then the Strahler number of the node is i + 1.
                    when min(st.sn) = max(st.sn) then
                        max(st.sn) + 1
                end
        end
    from streams s
    left join lateral strahler(s.node) st on true
    where _parent = 0 or s.to_node = _parent
    group by s.node
)
select st.*, s.expected_order
from strahler(0) st
join streams s on st.node = s.node
order by st.node;
I have a hard time devising a recursive CTE solution to this stackoverflow question: How to determine Strahler number on a directed graph for a stream network
Note that the conceptualized "function on CTE" is working if a function is created separately. See: https://www.db-fiddle.com/f/8z58LCVhD62YvkeJjriW8d/3
I'm wondering if that solution can be done with pure CTE alone, without writing a function. I tried, but a CTE cannot do a left join on itself.
Anyway, I'll just re-post the nature of the problem here.
CREATE TABLE streams (
    node integer PRIMARY KEY,
    to_node integer REFERENCES streams(node),
    expected_order integer
);
INSERT INTO streams(node, to_node, expected_order) VALUES
(1, NULL, 4),
(2, 1, 4),
(3, 2, 3),
(4, 2, 3),
(5, 4, 3),
(6, 3, 2),
(7, 3, 2),
(8, 5, 2),
(9, 5, 2),
(10, 6, 1),
(11, 6, 1),
(12, 7, 1),
(13, 7, 1),
(14, 8, 1),
(15, 8, 1),
(16, 9, 1),
(17, 9, 1),
(18, 4, 1),
(19, 1, 1);
From that data, using the following algorithm (sourced from wikipedia)...
All trees in this context are directed graphs, oriented from the root towards the leaves; in other words, they are arborescences. The degree of a node in a tree is just its number of children. One may assign a Strahler number to all nodes of a tree, in bottom-up order, as follows:
If the node is a leaf (has no children), its Strahler number is one.
If the node has one child with Strahler number i, and all other children have Strahler numbers less than i, then the Strahler number of the node is i again.
If the node has two or more children with Strahler number i, and no children with greater number, then the Strahler number of the node is i + 1.
...this is produced:
The expected_order field above shows what the Strahler order number of each node should be when the algorithm is applied.
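To make the algorithm concrete, here is a small Python sketch (independent of SQL) that computes the Strahler numbers bottom-up on the sample data and reproduces the expected_order values:

from collections import defaultdict

# (node, to_node) pairs from the INSERT statements above.
edges = [(1, None), (2, 1), (3, 2), (4, 2), (5, 4), (6, 3), (7, 3),
         (8, 5), (9, 5), (10, 6), (11, 6), (12, 7), (13, 7), (14, 8),
         (15, 8), (16, 9), (17, 9), (18, 4), (19, 1)]

children = defaultdict(list)
for node, to_node in edges:
    children[to_node].append(node)

def strahler(node):
    kids = children.get(node)
    if not kids:                      # leaf: Strahler number is 1
        return 1
    s = sorted(strahler(k) for k in kids)
    if len(s) >= 2 and s[-1] == s[-2]:
        return s[-1] + 1              # two or more children share the max: i + 1
    return s[-1]                      # a single child holds the max: i again

for node, _ in sorted(edges):
    print(node, strahler(node))       # matches expected_order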
I have a sorted set that keeps growing in real time, and it contains some IDs which I want to retrieve 5 at a time in reverse order of rank. This is basically to implement pagination. These IDs are keys into a hashmap. Is there any way to efficiently get 5 elements at a time using Redis ZSet operations?
For example, in the sorted set below, let's say I want to get the 5 elements before "572c7d87e53156245a3fd167". How could I do that, given that new IDs could keep getting added after my last element at run time? The expected result should give me the IDs 572c7c58e53156245a3fd166, 572c7ad2e53156245a3fd165, 572c746e1eeba6b059b08f1b, 572c74531eeba6b059b08f1a, and 572c6fc9612ad65757cca4f9.
1) "572b58c0dd319a1a4703eba8"
2) "1462429760.8629999"
3) "572c697e612ad65757cca4f7"
4) "1462499582.6889999"
5) "572c6a8e612ad65757cca4f8"
6) "1462499854.056"
7) "572c6fc9612ad65757cca4f9"
8) "1462501193.927"
9) "572c74531eeba6b059b08f1a"
10) "1462502355.5250001"
11) "572c746e1eeba6b059b08f1b"
12) "1462502382.313"
13) "572c7ad2e53156245a3fd165"
14) "1462504018.325"
15) "572c7c58e53156245a3fd166"
16) "1462504408.1370001"
17) "572c7d87e53156245a3fd167"
18) "1462504711.4200001"
19) "572c7da3e53156245a3fd168"
20) "1462504739.352"
One option is to look at ZRANGEBYLEX or ZRANGEBYSCORE and use the offset/count.
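For that first option, here is a hedged sketch with redis-py (the key name and client setup are assumptions): anchor on the score of the last member you have, then take the next five strictly below it, highest score first:

import redis

r = redis.Redis()

def prev_page(key, last_member, count=5):
    # Look up the anchor member's score, then page backwards from it.
    score = r.zscore(key, last_member)
    if score is None:
        return []
    # '(' makes the max exclusive, so the anchor itself is skipped.
    return r.zrevrangebyscore(key, '(' + repr(score), '-inf', start=0, num=count)

prev_page('ids', '572c7d87e53156245a3fd167')

Because the query is anchored on a score rather than a rank, items appended later (with higher scores) don't shift the page. Note that any other member sharing exactly the anchor's score would be skipped too.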
However, what I usually do for pagination is create a new list (kind of a snapshot of the original list) that doesn't change dynamically, and load data from there. That way it doesn't feel like chasing a moving target. I just set a TTL on it and forget about it.
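A hedged sketch of that snapshot approach (assuming Redis >= 6.2 for ZRANGESTORE; key names are examples):

import redis

r = redis.Redis()

def snapshot(src, dst, ttl=300):
    # Copy the whole sorted set server-side and let the copy expire later.
    r.zrangestore(dst, src, 0, -1)
    r.expire(dst, ttl)

def page(dst, page_no, page_size=5):
    # Reverse rank order; ranks are stable because the copy never changes.
    start = page_no * page_size
    return r.zrevrange(dst, start, start + page_size - 1)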
Let's say we are dealing with the keys 1-15. To get the worst-case performance of a regular BST, you would insert the keys in ascending or descending order as follows:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
Then the BST would essentially become a linked list.
For the best case of a BST, you would insert the keys in the following order; they are arranged so that each key inserted is the midpoint of the remaining range, so the first is (1 + 15) / 2 = 8, then (1 + 7) / 2 = 4, and so on:
8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15
Then the BST would be a well balanced tree with optimal height 3.
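You can check both claims with a bare-bones (unbalanced) BST sketch in Python:

def insert(tree, key):
    if tree is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < tree['key'] else 'right'
    tree[side] = insert(tree[side], key)
    return tree

def height(tree):
    # Height in edges; an empty tree has height -1.
    if tree is None:
        return -1
    return 1 + max(height(tree['left']), height(tree['right']))

worst = list(range(1, 16))
best = [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]

for order in (worst, best):
    root = None
    for key in order:
        root = insert(root, key)
    print(height(root))  # 14 for the ascending order, 3 for the balanced order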
The best case for a red black tree can also be constructed with the best case from a BST. But how do we construct the worst case for a red black tree? Is it the same as the worst case for a BST? Is there a specific pattern that will yield the worst case?
You are looking for a skinny tree, right? This can be produced by inserting the keys 1, ..., 2^(n+1) - 2 in reverse order.
You won't be able to. A Red-Black Tree keeps itself "bushy", so it would rotate to fix the imbalance. The chain in your worst case above is limited to two elements for a Red-Black Tree, and even that is not a "bad" case; it's what's expected, as lg(2) = 1, and you have one layer past the root with two elements. As soon as you add the third element, you get this:
B              B
 \            / \
  R     =>   R   R
   \
    R
I am trying to compare two entries of 6 numbers, where each number can be either 0 or 1 (e.g. 100001 or 011101). If 3 out of 6 match, I want the output to be .5. If 2 out of 6 match, I want the output to be .33, etc.
Here are the SQL commands to create the table
CREATE TABLE sim (
    sim_key int,
    string int
);
INSERT INTO sim (sim_key, string)
VALUES (1, 111000);
INSERT INTO sim (sim_key, string)
VALUES (2, 111111);
I want to compare the two strings, which share 50% of their characters, and output 50%.
Is it possible to do this sort of comparison in SQL? Thanks in advance.
This returns the percentage of equal 1 bits in both strings:
select bit_count(conv(a.string, 2, 10) & conv(b.string, 2, 10)) / 6 * 100 as percent_match
from sim a, sim b
where a.sim_key = 1 and b.sim_key = 2;
As you store your bitfields as base-2 representations saved as decimal numbers, we first need to convert them: conv(a.string, 2, 10) and conv(b.string, 2, 10).
Then we keep only bits that are 1 in each field: conv(a.string, 2, 10) & conv(b.string, 2, 10)
And we count them: bit_count(conv(a.string, 2, 10) & conv(b.string, 2, 10))
And finally we just compute the percentage: bit_count(conv(a.string, 2, 10) & conv(b.string, 2, 10)) / 6 * 100.
The query returns 50 for 111000 and 111111.
Here is another version that also counts matching zeros:
select bit_count(
           (conv(a.string, 2, 10) & conv(b.string, 2, 10))
           | ((0xFFFFFFFF >> (32 - 6)) & ~(conv(a.string, 2, 10) | conv(b.string, 2, 10)))
       ) / 6 * 100 as percent_match
from sim a, sim b
where a.sim_key = 1 and b.sim_key = 2;
Note that, while this solution works, you should really store this field like this instead:
INSERT INTO sim (sim_key, string)
VALUES (1, conv("111000", 2, 10));
INSERT INTO sim (sim_key, string)
VALUES (2, conv("111111", 2, 10));
Or to update existing data:
UPDATE sim SET string = conv(string, 2, 10);
Then this query gives the same results (if you updated your data as described above):
select bit_count(a.string & b.string) / 6 * 100 as percent_match
from sim a, sim b
where a.sim_key = 1 and b.sim_key = 2;
And to count zeros too:
select bit_count((a.string & b.string) | ((0xFFFFFFFF >> (32 - 6)) & ~(a.string | b.string))) / 6 * 100 as percent_match
from sim a, sim b
where a.sim_key = 1 and b.sim_key = 2;
(Replace the 6s with the size of your bitfields.)
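For reference, the same arithmetic can be checked in a few lines of plain Python (outside SQL):

a, b = 0b111000, 0b111111   # the two sample bitfields
width = 6
mask = (1 << width) - 1     # 0b111111, like 0xFFFFFFFF >> (32 - 6)

ones_match = bin(a & b).count('1')           # positions where both bits are 1
all_match = bin(~(a ^ b) & mask).count('1')  # positions where the bits are equal

print(ones_match / width * 100)  # 50.0
print(all_match / width * 100)   # 50.0 here, since the matching bits are all 1s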
Since you are storing them as numbers, you can do this:
SELECT BIT_COUNT(s1.string & s2.string) / BIT_COUNT(s1.string | s2.string)
FROM sim s1, sim s2
WHERE s1.sim_key = 1 AND s2.sim_key = 2