Cannot use BigQuery UDF (bqutil) in processing location us-west2 - google-bigquery

We are trying to use the community UDFs from https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community in us-west2.
This first query processes just fine, in US.
This second query won't run.
Our dataset models is in us-west2. It seems that all queries from the second query editor are then processed in us-west2, where bqutil apparently does not exist. How can we call the function bqutil.fn.levenshtein when processing in us-west2 (where all our datasets live)?

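To use the Levenshtein UDF against your BigQuery table, you need to create the UDF in the location where your dataset resides, because a query can only reference functions stored in the same location as the data it processes.
You can refer to the UDF below; in this example the stackdemo dataset resides in the us-west2 location.
UDF :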
CREATE OR REPLACE FUNCTION
  `stackdemo.fn_LevenshteinDistance`(in_a STRING, in_b STRING) RETURNS INT64 LANGUAGE js AS R"""
  var a = in_a.toLowerCase();
  var b = in_b.toLowerCase();
  if(a.length == 0) return b.length;
  if(b.length == 0) return a.length;
  var matrix = [];
  // increment along the first column of each row
  var i;
  for(i = 0; i <= b.length; i++){
    matrix[i] = [i];
  }
  // increment each column in the first row
  var j;
  for(j = 0; j <= a.length; j++){
    matrix[0][j] = j;
  }
  // Fill in the rest of the matrix
  for(i = 1; i <= b.length; i++){
    for(j = 1; j <= a.length; j++){
      if(b.charAt(i-1) == a.charAt(j-1)){
        matrix[i][j] = matrix[i-1][j-1];
      } else {
        matrix[i][j] =
          Math.min(matrix[i-1][j-1] + 1, // substitution
          Math.min(matrix[i][j-1] + 1, // insertion
          matrix[i-1][j] + 1)); // deletion
      }
    }
  }
  return matrix[b.length][a.length];
""";
Query :
SELECT
  source,
  target,
  `stackdemo.fn_LevenshteinDistance`(source, target) distance,
FROM UNNEST([
  STRUCT('analyze' AS source, 'analyse' AS target),
  STRUCT('opossum', 'possum'),
  STRUCT('potatoe', 'potatoe'),
  STRUCT('while', 'whilst'),
  STRUCT('aluminum', 'alumininium'),
  STRUCT('Connecticut', 'CT')
]);
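The JavaScript body of the UDF can also be checked locally (for example with Node.js) before the function is created in the target region. The following is a minimal sketch, assuming the same matrix-based logic as the UDF above; the wrapper name levenshteinDistance and the local harness are made up for illustration and are not part of BigQuery.

```javascript
// Hypothetical local wrapper around the UDF logic; not a BigQuery API.
function levenshteinDistance(inA, inB) {
  const a = inA.toLowerCase();
  const b = inB.toLowerCase();
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // matrix[i][j] = distance between the first i chars of b
  // and the first j chars of a.
  const matrix = [];
  for (let i = 0; i <= b.length; i++) matrix[i] = [i];
  for (let j = 0; j <= a.length; j++) matrix[0][j] = j;

  for (let i = 1; i <= b.length; i++) {
    for (let j = 1; j <= a.length; j++) {
      const cost = b.charAt(i - 1) === a.charAt(j - 1) ? 0 : 1;
      matrix[i][j] = Math.min(
        matrix[i - 1][j - 1] + cost, // substitution (or match)
        matrix[i][j - 1] + 1,        // insertion
        matrix[i - 1][j] + 1         // deletion
      );
    }
  }
  return matrix[b.length][a.length];
}

console.log(levenshteinDistance('analyze', 'analyse'));
console.log(levenshteinDistance('opossum', 'possum'));
```

Because the inputs are lowercased first, the comparison is case-insensitive, just as in the UDF.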

Related

apps script: increase variable with each execution

Hi, I have a list of values, each representing the output of one packaging shift. I want to calculate the average output over 8 weeks, so the average changes each time a shift passes. My idea is to trigger a function after each shift that calculates the average. Here is my problem: how do I get a variable (the one holding the row of the first value) to increase after each trigger of the function? What I tried is to declare the variable before the function and increase it by 1 inside the function. But of course the starting value doesn't change this way. There is probably an easy way to do this that I just don't know yet (programming newbie here :)).
let i = 7;
let j = 126;
function schnitt() {
  var summe = 0;
  var counter = 0;
  i++;
  j++;
  while(i <= j){
    var aktuell = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Auswertung").getRange(i,6,1,1).getValue();
    if(aktuell != ""){
      summe = summe + aktuell;
      counter++;
      i++;
    }
    else{
      i++
    }
  }
  var durchschnitt = summe / counter;
  var ausgabe = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Auswertung").getRange(8,7,1,1).setValue(durchschnitt);
}
I have found a workaround: I store i and j in cells, then read and update them on each run, like this:
function schnitt() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Auswertung");
  // The first and last row of the window are stored in cells N3 and N4.
  var i = sheet.getRange(3,14,1,1).getValue();
  var j = sheet.getRange(4,14,1,1).getValue();
  var summe = 0;
  var counter = 0;
  // Sum the non-empty cells of column F between rows i and j.
  while(i <= j){
    var aktuell = sheet.getRange(i,6,1,1).getValue();
    if(aktuell != ""){
      summe = summe + aktuell;
      counter++;
    }
    i++;
  }
  var durchschnitt = summe / counter;
  sheet.getRange(8,7,1,1).setValue(durchschnitt);
  // Re-read both bounds (the loop changed i) and shift the window down by one row.
  i = sheet.getRange(3,14,1,1).getValue();
  i++;
  sheet.getRange(3,14,1,1).setValue(i);
  j = sheet.getRange(4,14,1,1).getValue();
  j++;
  sheet.getRange(4,14,1,1).setValue(j);
}
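The averaging step itself does not depend on the spreadsheet, so it can be pulled out into a plain function and checked outside Apps Script. Here is a minimal sketch, assuming the cell values of the window have already been read into a flat array; the function name averageNonEmpty is made up for illustration.

```javascript
// Hypothetical helper: averages the entries of a window, skipping
// empty cells (assumed here to arrive as empty strings or null).
function averageNonEmpty(values) {
  let sum = 0;
  let count = 0;
  for (const value of values) {
    if (value !== "" && value !== null) {
      sum += value;
      count++;
    }
  }
  // Avoid dividing by zero when the whole window is empty.
  return count === 0 ? null : sum / count;
}

console.log(averageNonEmpty([10, "", 20, 30]));
```

Unlike the loose != "" check in the function above, this sketch counts a value of 0 as a real entry, which may or may not be what you want for a shift with no output.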

Needing help on printing mode in java

I was given the task of writing a method that takes an ArrayList of Integer objects as a parameter and prints out the sum, the average, and the mode.
I can't figure out how to find the mode. The method should print the number if there is exactly one mode, and "no single mode" if there is more than one mode (or none). My method only ever prints "no single mode". How can I fix my code so that the mode is printed?
This is what I have for my code:
public static void printStatistics(ArrayList<Integer> arr){
    int sum = 0;
    for(int i : arr){
        sum += i;
    }
    System.out.println("Sum: "+sum);
    System.out.println("Average: "+(double)sum/arr.size());
    int temp = 0, counter = 0, max = 0;
    for(int j = 0; j < arr.size() - 1; j++){
        for(int k = j+1; k < arr.size(); k++){
            if(arr.get(j) == arr.get(k)){
                counter++;
                if(counter > max){
                    max = counter;
                    temp = arr.get(j);
                }
                if(counter == max){
                    temp = -1;
                }
            }
        }
    }
    if(temp > 0){
        System.out.println("Mode: "+temp);
    }
    else if(temp < 0){
        System.out.println("Mode: no single mode");
    }
}
The problem lies here:
if(counter > max){
    max = counter;
    temp = arr.get(j);
}
if(counter == max){
    temp = -1;
}
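You assign the value of counter to max in the first condition, so the second condition, if(counter == max), is always true right afterwards. That sets temp to -1, which satisfies else if(temp < 0), and this is why you get Mode: no single mode as the output every time.
Changing the condition stops temp from being overwritten on every match:
if(counter < max){
    temp = -1;
}
Note that counter is never reset between values of j, and arr.get(j) == arr.get(k) compares Integer references rather than values, so resetting counter at the start of each outer iteration and comparing with equals() are also needed for the mode to be reliable.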
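A different way to get the "single mode or nothing" behaviour is to count the occurrences first and then look for a unique maximum. The sketch below illustrates that idea in JavaScript rather than Java, to stay in line with the other scripts on this page; the function name singleMode is made up, and in Java the same logic would typically be built on a HashMap<Integer, Integer>.

```javascript
// Hypothetical helper: returns the mode if exactly one value has the
// highest count, otherwise null (meaning "no single mode").
function singleMode(numbers) {
  const counts = new Map();
  for (const n of numbers) {
    counts.set(n, (counts.get(n) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  let tie = false;
  for (const [value, count] of counts) {
    if (count > bestCount) {
      best = value;
      bestCount = count;
      tie = false; // a new strict maximum clears any earlier tie
    } else if (count === bestCount) {
      tie = true;
    }
  }
  return tie || best === null ? null : best;
}

const mode = singleMode([1, 2, 2, 3]);
console.log(mode === null ? "Mode: no single mode" : "Mode: " + mode);
```

Returning null instead of a sentinel such as -1 keeps the function usable for lists that contain zero or negative numbers.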

BIGQUERY - Query Exceeded resource limit

I am running the query below to join two tables and pick records based on fuzzy matching (Levenshtein distance):
WITH main_table as (
  select *
  from
    `project.data.Roof_Address`
), reference_table as (
  select *
  from `project.data.DATA_TREE_Address`
)
select
  DR_NBR,
  ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,
  ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)] distance_score
FROM main_table l
CROSS JOIN reference_table n
GROUP BY 1
having ARRAY_AGG(
  EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
)[OFFSET(0)] < 10
This query returns Project_Id (DR_NBR) from the first table and Project_area (LotSizeSqFt) from the second table, filtered by the Levenshtein score at the end.
The query results in the error below.
Any suggestions on how to optimize the query above?
The distance function I am using is defined below:
#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };
  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;
      // two rows
      var prevRow = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;
      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }
      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;
        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;
          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }
        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }
      return nextCol;
    }
  };
  // Declare both locals so the second one does not leak as an implicit global.
  var the_string1, the_string2;
  try {
    the_string1 = decodeURI(string1).toLowerCase();
  } catch (ex) {
    the_string1 = string1.toLowerCase();
  }
  try {
    the_string2 = decodeURI(string2).toLowerCase();
  } catch (ex) {
    the_string2 = string2.toLowerCase();
  }
  return Levenshtein.get(the_string1, the_string2)
""";
Snapshot for Roof_Address table
Snapshot for DATA_TREE_Address
The main cost of the query is most likely the ORDER BY inside:
ARRAY_AGG(
  STRUCT(n.LotSizeSqFt)
  ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
)[OFFSET(0)].*,
I see you are only returning a single record from each ARRAY_AGG.
I'd recommend removing the ARRAY_AGG and taking a MAX or MIN over the results of EDIT_DISTANCE instead. A MAX or MIN is much cheaper than ordering all records and taking the first or last one.
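To illustrate why a running minimum is cheaper than ordering: sorting all candidates in a group costs on the order of n log n comparisons, while tracking the smallest distance needs a single pass. Below is a minimal JavaScript sketch of that idea, outside BigQuery. The names closestMatch and lengthGap are hypothetical, and the toy distance is only a stand-in for EDIT_DISTANCE.

```javascript
// Hypothetical helper: finds the candidate with the smallest distance
// to the target in one pass, without sorting the candidates.
function closestMatch(target, candidates, distance) {
  let best = null;
  let bestScore = Infinity;
  for (const candidate of candidates) {
    const score = distance(target, candidate);
    if (score < bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return { match: best, score: bestScore };
}

// Toy distance for demonstration only: difference in string length.
const lengthGap = (a, b) => Math.abs(a.length - b.length);
console.log(closestMatch("main st", ["main street", "main rd", "elm"], lengthGap));
```

The distance function is still called once per pair, so this removes the sorting work but not the cost of comparing every row of one table with every row of the other.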

How to count letter differences of two strings in bigquery?

For example, I have:
1: 6c71d997ba39
2: 6c71d997d269
I need to get 4.
You can consider using the Levenshtein distance for your use case:
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
The example below is for BigQuery Standard SQL:
#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };
  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;
      // two rows
      var prevRow = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;
      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }
      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;
        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;
          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }
        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }
      return nextCol;
    }
  };
  // Declare both locals so the second one does not leak as an implicit global.
  var the_string1, the_string2;
  try {
    the_string1 = decodeURI(string1).toLowerCase();
  } catch (ex) {
    the_string1 = string1.toLowerCase();
  }
  try {
    the_string2 = decodeURI(string2).toLowerCase();
  } catch (ex) {
    the_string2 = string2.toLowerCase();
  }
  return Levenshtein.get(the_string1, the_string2)
""";
WITH strings AS (
  SELECT '1: 6c71d997ba39' string1, '2: 6c71d997d269' string2
)
SELECT string1, string2, EDIT_DISTANCE(string1, string2) changes
FROM strings
with result
Row  string1          string2          changes
1    1: 6c71d997ba39  2: 6c71d997d269  4
SELECT
  (SELECT COUNTIF(c != s2[OFFSET(off)])
   FROM UNNEST(SPLIT(s1, '')) AS c WITH OFFSET off) AS count
FROM dataset.table
Source: https://stackoverflow.com/a/57499387/11059644
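The position-by-position comparison in the query above can be sketched outside BigQuery as well. Here is a minimal JavaScript version, assuming both strings have the same length; the function name countDifferences is made up for illustration.

```javascript
// Hypothetical helper: counts the positions at which two equal-length
// strings differ (the Hamming distance).
function countDifferences(s1, s2) {
  if (s1.length !== s2.length) {
    throw new Error("strings must have the same length");
  }
  let count = 0;
  for (let i = 0; i < s1.length; i++) {
    if (s1[i] !== s2[i]) count++;
  }
  return count;
}

console.log(countDifferences("6c71d997ba39", "6c71d997d269"));
```

By my count the two hex strings from the question differ at three positions. The EDIT_DISTANCE example reports 4 because its inputs also include the labels '1: ' and '2: ', which differ in one character.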
Ready to use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa'), fhoffa.x.levenshtein('googgle', 'goggles'), fhoffa.x.levenshtein('is this the', 'Is This The')

What is the time complexity of this function?

Here's a sample solution for Sliding Window Maximum problem in Java.
Given an array nums, there is a sliding window of size k which is
moving from the very left of the array to the very right. You can only
see the k numbers in the window. Each time the sliding window moves
right by one position.
I want to get the time and space complexity of this function. Here's what I think would be the answer:
Time: O((n-k)(k * logk)) == O(nklogk)
Space (auxiliary): O(n) for return int[] and O(k) for pq. Total of O(n).
Is this correct?
private static int[] maxSlidingWindow(int[] a, int k) {
    if(a == null || a.length == 0) return new int[] {};
    PriorityQueue<Integer> pq = new PriorityQueue<Integer>(k, new Comparator<Integer>() {
        // max heap
        public int compare(Integer o1, Integer o2) {
            return o2 - o1;
        }
    });
    int[] result = new int[a.length - k + 1];
    int count = 0;
    // time: n - k times
    for (int i = 0; i < a.length - k + 1; i++) {
        for (int j = i; j < i + k; j++) {
            // time k*logk (the part I'm not sure about)
            pq.offer(a[j]);
        }
        // logk
        result[count] = pq.poll();
        count = count + 1;
        pq.clear();
    }
    return result;
}
You're mostly right, except for this part:
for (int j = i; j < i + k; j++) {
    // time k*logk (the part I'm not sure about)
    pq.offer(a[j]);
}
Here the total cost of the k insertions is log1 + log2 + log3 + log4 + ... + logk, because each offer costs the logarithm of the current heap size. The sum of this series is
log1 + log2 + log3 + log4 + ... + logk = log(k!)
which is Θ(k log k) by Stirling's approximation, so the inner loop is still O(k log k) and the overall time is O((n - k + 1) * k log k).
As a second thought, you can do better than this by using a double-ended queue, which brings the running time down to O(n). Here is my solution:
public int[] maxSlidingWindow(int[] nums, int k) {
    if (nums == null || k <= 0) {
        return new int[0];
    }
    int n = nums.length;
    int[] result = new int[n - k + 1];
    int indx = 0;
    Deque<Integer> q = new ArrayDeque<>();
    for (int i = 0; i < n; i++) {
        // remove numbers out of range k
        while (!q.isEmpty() && q.peek() < i - k + 1) {
            q.poll();
        }
        // remove smaller numbers in k range as they are useless
        while (!q.isEmpty() && nums[q.peekLast()] < nums[i]) {
            q.pollLast();
        }
        q.offer(i);
        if (i >= k - 1) {
            result[indx++] = nums[q.peek()];
        }
    }
    return result;
}
HTH.
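For readers who want to try the deque idea without a Java toolchain, here is a minimal JavaScript sketch of the same monotonic-queue technique. It is an illustration under my own assumptions, not a verified port of the Java code. The head index stands in for removing from the front of the deque, and the sketch returns an empty array when k is larger than the input.

```javascript
// Sliding window maximum with a monotonic deque of indices.
// The values at the stored indices are in non-increasing order.
function maxSlidingWindow(nums, k) {
  if (!nums || k <= 0 || k > nums.length) return [];
  const result = [];
  const deque = []; // indices into nums
  let head = 0;     // logical front of the deque
  for (let i = 0; i < nums.length; i++) {
    // Drop front indices that fell out of the window [i - k + 1, i].
    while (head < deque.length && deque[head] < i - k + 1) head++;
    // Drop smaller values from the back; they can never be a maximum again.
    while (deque.length > head && nums[deque[deque.length - 1]] < nums[i]) {
      deque.pop();
    }
    deque.push(i);
    if (i >= k - 1) result.push(nums[deque[head]]);
  }
  return result;
}

console.log(maxSlidingWindow([1, 3, -1, -3, 5, 3, 6, 7], 3));
```

Each index is pushed once and removed at most once, which is where the O(n) bound comes from.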