I recently posted on Stack Overflow about an OOM error while running many UDFs on a single table (BigQuery UDF out of memory issues). That error seems to have been partially fixed; however, I am now running into a new error when running a UDF on a 10,000-row table. Here is the error message:
Error: An error occurred while communicating with a subprocess. message: "Communication channel error during 4 command" sandbox_process_error { }
Error Location: User-defined function
Job ID: broad-cga-het:bquijob_32bc01d_1569f11b8a2
The error does not occur when I remove the emit statement in the UDF, so it appears to occur when the UDF tries to write its output back to a table.
Here is a copy of the UDF itself:
bigquery.defineFunction(
'permute',
['obj_nums','num_obj_per_indiv','row_number'], // Names of input columns
[{"name": "num_cooccurrences_list","type": "string","mode":"nullable"}], // Output schema
permute
);
function permute(row, emit) {
var obj_ids = row['obj_nums'].split(",").map(function (x) {
return parseInt(x, 10);
});
var num_obj_per_indiv = row['num_obj_per_indiv'].split(",").map(function (x) {
return parseInt(x, 10);
});
var row_number = row['row_number']
// randomly shuffle objs using Durstenfeld shuffle algorithm
obj_ids = shuffle_objs(obj_ids);
// form dictionary of obj_pairs from obj_ids
var perm_run_obj_set = new Set(obj_ids);
var perm_run_obj_unique = Array.from(perm_run_obj_set);
perm_run_obj_unique.sort();
var perm_run_obj_pairs_dict = {};
output = {}
for (var i = 0; i < perm_run_obj_unique.length - 1; i++) {
for (var j = i + 1; j < perm_run_obj_unique.length; j++) {
var obj_pair = [perm_run_obj_unique[i],perm_run_obj_unique[j]].sort().join("_")
perm_run_obj_pairs_dict[obj_pair] = 0
}
}
// use fixed number of objs per indiv and draw from shuffled objs
var perm_cooccur_dict = {};
//num_obj_per_indiv = num_obj_per_indiv.slice(0,3);
for(var index in num_obj_per_indiv) {
var obj_count = num_obj_per_indiv[index]
var perm_run_objs = [];
for(var j = 0; j < obj_count; j++) {
perm_run_objs.push(obj_ids.pop());
}
perm_run_objs = new Set(perm_run_objs);
perm_run_objs = Array.from(perm_run_objs)
while(perm_run_objs.length > 1) {
current_obj = perm_run_objs.pop()
for(var pair_obj_ind in perm_run_objs) {
var pair_obj = perm_run_objs[pair_obj_ind]
var sorted_pair = [current_obj,pair_obj].sort().join("_")
perm_run_obj_pairs_dict[sorted_pair] += 1
// console.log({"obj_pair":[current_obj,pair_obj].sort().join("_"),"perm_run_id":row_number})
// emit({"obj_pair":[current_obj,pair_obj].sort().join("_"),"perm_run_id":row_number});
}
}
}
// emit({"obj_pair":[current_obj,pair_obj].sort().join("_"),"perm_run_id":row_number});
// form output dictionary
num_cooccur_output = ""
for (var obj_pair in perm_run_obj_pairs_dict) {
//emit({"obj_pair":obj_pair,"num_cooccur":perm_run_obj_pairs_dict[obj_pair]});
num_cooccur_output += String(perm_run_obj_pairs_dict[obj_pair])
num_cooccur_output += ","
}
num_cooccur_output = num_cooccur_output.substring(0, num_cooccur_output.length - 1);
emit({"num_cooccurrences_list":num_cooccur_output});
}
/**
* Randomize array element order in-place.
* Using Durstenfeld shuffle algorithm.
*/
function shuffle_objs(obj_array) {
for (var i = obj_array.length - 1; i > 0; i--) {
var j = Math.floor(Math.random() * (i + 1));
var temp = obj_array[i];
obj_array[i] = obj_array[j];
obj_array[j] = temp;
}
return obj_array;
}
Any help would be greatly appreciated!
Thank you,
Daniel
Not that this directly answers your original question, but I'm not convinced that you need a UDF for this type of transformation. For example, using standard SQL (uncheck "Use Legacy SQL" under "Show Options") you can perform an array transformation such as:
WITH T AS (SELECT [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] AS arr)
SELECT
x,
new_x
FROM T,
UNNEST(
ARRAY(SELECT AS STRUCT
x,
arr[OFFSET(CAST(RAND() * (off + 1) AS INT64))] AS new_x
FROM UNNEST(arr) AS x WITH OFFSET off));
+---+-------+
| x | new_x |
+---+-------+
| 0 | 1 |
| 1 | 2 |
| 2 | 0 |
| 3 | 2 |
| 4 | 4 |
| 5 | 0 |
| 6 | 5 |
| 7 | 4 |
| 8 | 8 |
| 9 | 3 |
+---+-------+
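I can explain more if it helps, but the gist of the query is that it picks a random element of arr for each position, using the index formula from your UDF above. Unlike the UDF's in-place swap, it reads from the original array each time, so values can repeat, as in the output above. Also note that CAST(... AS INT64) rounds rather than truncates, so the index can reach off + 1 (which is why x = 0 maps to 1 above) and can run past the end of the array for the last element; wrapping the product in FLOOR() avoids that. The FROM T, UNNEST(... unrolls the elements of the array to make them easier to see, but I could have alternatively done: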
WITH T AS (SELECT [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] AS arr)
SELECT
ARRAY(SELECT AS STRUCT
x,
arr[OFFSET(CAST(RAND() * (off + 1) AS INT64))] AS new_x
FROM UNNEST(arr) AS x WITH OFFSET off)
FROM T;
This gives an array of structs as output, where each x is associated with new_x.
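To make the difference between the query's per-position lookup and the UDF's in-place shuffle concrete, here is a minimal Python sketch. The function names are made up for illustration, and the index is truncated (as with Math.floor in the UDF); it is not a description of how BigQuery evaluates the query.

```python
import random


def sample_like_query(arr, rng):
    """For each offset, read a random element from arr[0..offset] of the ORIGINAL array.

    Values may repeat, because nothing is swapped or removed.
    """
    return [arr[int(rng.random() * (off + 1))] for off in range(len(arr))]


def durstenfeld_shuffle(arr, rng):
    """In-place style shuffle on a copy: the result is always a permutation of arr."""
    out = list(arr)
    for i in range(len(out) - 1, 0, -1):
        j = int(rng.random() * (i + 1))  # index in [0, i]
        out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make runs repeatable
    arr = list(range(10))
    print(sample_like_query(arr, rng))    # may contain duplicates
    print(durstenfeld_shuffle(arr, rng))  # contains each value exactly once
```

If a true permutation is required, the per-position lookup is not enough on its own; it only matches the shuffle's index formula, not its swapping.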
How to write Kotlin code to store all odd numbers starting at 7 till 101 and print the sum of them?
My code goes like this:
var sum:Int = 0
var num:Int? = null
for(num in 7..101 )
if(num % 2 != 0)
print("$num ")
var result = sum + num
num++
println("$result")
Simply filter the range 7..101 and sum the items:
val total = (7..101).filter { it % 2 == 1 }.sum()
println(total)
Or use sumBy() (deprecated since Kotlin 1.5 in favor of sumOf()):
val total = (7..101).sumBy { if (it % 2 == 1) it else 0}
println(total)
Or first create a list of all the odd numbers and then get the sum:
val list = (7..101).filter { it % 2 == 1 }
val total = list.sum()
println(total)
If you need to store them, just create a MutableList and add the odd numbers during the forEach execution:
val oddNumbers = mutableListOf<Int>()
(7..101).forEach { n ->
    if (n % 2 != 0) {
        oddNumbers.add(n) // keep each odd number
    }
}
println(oddNumbers.sum())
You can also try this:
val array = IntArray(48) { 2 * it + 7 } // 7, 9, ..., 101
print(array.sum())
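As a quick cross-check of the arithmetic (written in Python only for brevity, with illustrative variable names), the odd numbers from 7 to 101 form an arithmetic series of 48 terms, so the sum can also be computed in closed form:

```python
# Odd numbers from 7 to 101 inclusive, stored in a list.
odds = list(range(7, 102, 2))

count = (101 - 7) // 2 + 1            # number of terms in the series
closed_form = count * (7 + 101) // 2  # n * (first + last) / 2

print(count, sum(odds), closed_form)  # prints: 48 2592 2592
```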
What's the best way to pass and return arrays of floats in AssemblyScript?
Can I pass an array from JS (by reference) for WASM to edit?
export function nBodyForces(data: f64[], result: f64[]): void {}
Below is what I have now. Ignore the implementation details; it currently returns 2000 and then increments it to roughly 8000.
What's the best way to return an array of new values?
export function nBodyForces(data: f64[]): f64[] {
// Each body has x,y,z,m passed in.
if (data.length % bodySize !== 0) return new Array<f64>(10);
const numBodies: i32 = data.length / bodySize;
// return a 3-force x,y,z vector for each body
let ret: f64[] = new Array<f64>(numBodies * forceSize);
/**
* Calculate the 3-vector each unique pair of bodies applies to each other.
*
* 0 1 2 3 4 5
* 0 x x x x x
* 1 x x x x
* 2 x x x
* 3 x x
* 4 x
* 5
*
* Sum those forces together into an array of 3-vector x,y,z forces
*/
// For all bodies:
for (let i: i32 = 0; i < numBodies; i++) {
// Given body i: pair with every body[j] where j > i
for (let j: i32 = i + 1; j < numBodies; j++) {
// Calculate the force the bodies apply to one another
const bI: i32 = i * 4
const bJ: i32 = j * 4
let f: f64[] = twoBodyForces(
// b0
data[bI], data[bI+1], data[bI+2], data[bI+3], // x,y,z,m
// b1
data[bJ], data[bJ+1], data[bJ+2], data[bJ+3], // x,y,z,m
);
// Add this pair's force on one another to their total forces applied x,y,z
// body0
ret[bI] = ret[bI] + f[0];
ret[bI+1] = ret[bI+1] + f[1];
ret[bI+2] = ret[bI+2] + f[2];
// body1
ret[bJ] = ret[bJ] + f[0];
ret[bJ+1] = ret[bJ+1] + f[1];
ret[bJ+2] = ret[bJ+2] + f[2];
}
}
// For each body, return the sum of forces all other bodies applied to it.
return ret;
}
For faster interop with JS, I recommend using typed arrays if possible:
export const FLOAT64ARRAY_ID = idof<Float64Array>();
export function nBodyForces(data: Float64Array): Float64Array { ... }
And later on the JavaScript side:
const loader = require("assemblyscript/lib/loader");
const imports = {};
const wasm = await loader.instantiateStreaming(fetch("optimized.wasm"), imports);
const dataArray = [... your data ...]
const dataRef = wasm.__retain(wasm.__allocArray(wasm.FLOAT64ARRAY_ID, dataArray));
const resultRef = wasm.nBodyForces(dataRef);
const resultArray = wasm.__getFloat64Array(resultRef);
// release ARC resources
wasm.__release(dataRef);
wasm.__release(resultRef);
console.log("result: " + resultArray);
Given the data set below:
a | b | c | d
1 | 3 | 7 | 11
1 | 5 | 7 | 11
1 | 3 | 8 | 11
1 | 5 | 8 | 11
1 | 6 | 8 | 11
Perform a reverse Cartesian product to get:
a | b | c | d
1 | 3,5 | 7,8 | 11
1 | 6 | 8 | 11
I am currently working with Scala, and my input/output data type is currently:
ListBuffer[Array[Array[Int]]]
I have come up with a solution (seen below), but I feel it could be optimized. I am open to optimizations of my approach as well as completely new approaches. Solutions in Scala and C# are preferred.
I am also curious if this could be done in MS SQL.
My current solution:
def main(args: Array[String]): Unit = {
// Input
val data = ListBuffer(Array(Array(1), Array(3), Array(7), Array(11)),
Array(Array(1), Array(5), Array(7), Array(11)),
Array(Array(1), Array(3), Array(8), Array(11)),
Array(Array(1), Array(5), Array(8), Array(11)),
Array(Array(1), Array(6), Array(8), Array(11)))
reverseCartesianProduct(data)
}
def reverseCartesianProduct(input: ListBuffer[Array[Array[Int]]]): ListBuffer[Array[Array[Int]]] = {
val startIndex = input(0).size - 1
var results:ListBuffer[Array[Array[Int]]] = input
for (i <- startIndex to 0 by -1) {
results = groupForward(results, i, startIndex)
}
results
}
def groupForward(input: ListBuffer[Array[Array[Int]]], groupingIndex: Int, startIndex: Int): ListBuffer[Array[Array[Int]]] = {
if (startIndex < 0) {
val reduced = input.reduce((a, b) => {
mergeRows(a, b)
})
return ListBuffer(reduced)
}
val grouped = if (startIndex == groupingIndex) {
Map(0 -> input)
}
else {
groupOnIndex(input, startIndex)
}
val results = grouped.flatMap{
case (index, values: ListBuffer[Array[Array[Int]]]) =>
groupForward(values, groupingIndex, startIndex - 1)
}
results.to[ListBuffer]
}
def groupOnIndex(list: ListBuffer[Array[Array[Int]]], index: Int): Map[Int, ListBuffer[Array[Array[Int]]]] = {
var results = Map[Int, ListBuffer[Array[Array[Int]]]]()
list.foreach(a => {
val key = a(index).toList.hashCode()
if (!results.contains(key)) {
results += (key -> ListBuffer[Array[Array[Int]]]())
}
results(key) += a
})
results
}
def mergeRows(a: Array[Array[Int]], b: Array[Array[Int]]): Array[Array[Int]] = {
val zipped = a.zip(b)
val merged = zipped.map{ case (array1: Array[Int], array2: Array[Int]) =>
val m = array1 ++ array2
quickSort(m)
m.distinct
.array
}
merged
}
The way this works is:
1. Loop over the columns from right to left. The groupingIndex specifies which column to run on; this is the only column whose values do not have to be equal for rows to be merged.
2. Recursively group the data on all other columns (not groupingIndex).
3. After grouping on all columns, the rows in each group are assumed to have equal values in every column except the grouping column.
4. Merge the rows with the matching columns. Take the distinct values for each column and sort each one.
I apologize if some of this does not make sense, my brain is not functioning today.
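The steps above can be sketched compactly as follows. This is a Python illustration of the described idea rather than a port of the Scala code: the function name is made up, the input is assumed to be a non-empty list of equal-length rows, and a dictionary keyed on the other columns replaces the recursive grouping.

```python
def reverse_cartesian_product(rows):
    """Merge rows column by column, from the rightmost column to the leftmost.

    For each column, rows that agree on every OTHER column are merged,
    and the values of the current column are combined, de-duplicated and sorted.
    """
    # Each cell becomes a tuple of values so merged cells can hold several.
    table = [tuple((value,) for value in row) for row in rows]
    width = len(table[0])

    for col in range(width - 1, -1, -1):
        groups = {}  # key: all cells except `col`; value: set of values seen in `col`
        for row in table:
            key = row[:col] + row[col + 1:]
            groups.setdefault(key, set()).update(row[col])
        # Rebuild each row with the merged cell put back at position `col`.
        table = [
            key[:col] + (tuple(sorted(values)),) + key[col:]
            for key, values in groups.items()
        ]
    return table


if __name__ == "__main__":
    data = [
        [1, 3, 7, 11],
        [1, 5, 7, 11],
        [1, 3, 8, 11],
        [1, 5, 8, 11],
        [1, 6, 8, 11],
    ]
    for row in reverse_cartesian_product(data):
        print(row)
    # ((1,), (3, 5), (7, 8), (11,))
    # ((1,), (6,), (8,), (11,))
```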
Here is my take on this. Code is in Java but could easily be converted into Scala or C#.
I run groupingBy on every combination of n-1 columns and go with the one that yields the lowest group count, meaning the largest merge depth, so this is a greedy approach. It is not guaranteed to find the optimal solution (minimizing the number of rows k is NP-hard; see link here for an explanation), but it will find a valid solution and do so fairly fast.
Full example here: https://github.com/jbilander/ReverseCartesianProduct/tree/master/src
Main.java
import java.util.*;
import java.util.stream.Collectors;
public class Main {
public static void main(String[] args) {
List<List<Integer>> data = List.of(List.of(1, 3, 7, 11), List.of(1, 5, 7, 11), List.of(1, 3, 8, 11), List.of(1, 5, 8, 11), List.of(1, 6, 8, 11));
boolean done = false;
int rowLength = data.get(0).size(); //4
List<Table> tables = new ArrayList<>();
// load data into table
for (List<Integer> integerList : data) {
Table table = new Table(rowLength);
tables.add(table);
for (int i = 0; i < integerList.size(); i++) {
table.getMap().get(i + 1).add(integerList.get(i));
}
}
// keep track of count, needed so we know when to stop iterating
int numberOfRecords = tables.size();
// start algorithm
while (!done) {
Collection<List<Table>> result = getMinimumGroupByResult(tables, rowLength);
if (result.size() < numberOfRecords) {
tables.clear();
for (List<Table> tableList : result) {
Table t = new Table(rowLength);
tables.add(t);
for (Table table : tableList) {
for (int i = 1; i <= rowLength; i++) {
t.getMap().get(i).addAll(table.getMap().get(i));
}
}
}
numberOfRecords = tables.size();
} else {
done = true;
}
}
tables.forEach(System.out::println);
}
private static Collection<List<Table>> getMinimumGroupByResult(List<Table> tables, int rowLength) {
Collection<List<Table>> result = null;
int min = Integer.MAX_VALUE;
for (List<Integer> keyCombination : getKeyCombinations(rowLength)) {
switch (rowLength) {
case 4: {
Map<Tuple3<TreeSet<Integer>, TreeSet<Integer>, TreeSet<Integer>>, List<Table>> map =
tables.stream().collect(Collectors.groupingBy(t -> new Tuple3<>(
t.getMap().get(keyCombination.get(0)),
t.getMap().get(keyCombination.get(1)),
t.getMap().get(keyCombination.get(2))
)));
if (map.size() < min) {
min = map.size();
result = map.values();
}
}
break;
case 5: {
//TODO: Handle n = 5
}
break;
case 6: {
//TODO: Handle n = 6
}
break;
}
}
return result;
}
private static List<List<Integer>> getKeyCombinations(int rowLength) {
switch (rowLength) {
case 4:
return List.of(List.of(1, 2, 3), List.of(1, 2, 4), List.of(2, 3, 4), List.of(1, 3, 4));
//TODO: handle n = 5, n = 6, etc...
}
return List.of(List.of());
}
}
Output of tables.forEach(System.out::println)
Table{1=[1], 2=[3, 5, 6], 3=[8], 4=[11]}
Table{1=[1], 2=[3, 5], 3=[7], 4=[11]}
or rewritten for readability:
a | b | c | d
--|-------|---|---
1 | 3,5,6 | 8 | 11
1 | 3,5 | 7 | 11
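For comparison, the same greedy idea can be sketched without hard-coding the row length. This is a hedged Python illustration, not a translation of the Java classes: the function name is hypothetical, frozensets stand in for the TreeSet cells, and ties between equally good groupings are broken by trying the left-out column from left to right.

```python
def greedy_reverse_cartesian(rows):
    """Repeatedly group on the n-1 columns that give the fewest groups, then merge."""
    # Each cell is a frozenset so that whole rows can be used as dictionary keys.
    table = [tuple(frozenset([value]) for value in row) for row in rows]
    width = len(table[0])

    while True:
        best_groups = None
        for free in range(width):  # `free` is the column left out of the key
            groups = {}
            for row in table:
                key = row[:free] + row[free + 1:]
                groups.setdefault(key, []).append(row)
            if best_groups is None or len(groups) < len(best_groups):
                best_groups = groups
        if len(best_groups) >= len(table):
            break  # no grouping reduces the row count any further
        # Merge every group into one row by taking the union of each column.
        table = [
            tuple(frozenset().union(*(row[i] for row in group)) for i in range(width))
            for group in best_groups.values()
        ]
    return [[sorted(cell) for cell in row] for row in table]


if __name__ == "__main__":
    data = [
        [1, 3, 7, 11],
        [1, 5, 7, 11],
        [1, 3, 8, 11],
        [1, 5, 8, 11],
        [1, 6, 8, 11],
    ]
    for row in greedy_reverse_cartesian(data):
        print(row)
    # [[1], [3, 5], [7], [11]]
    # [[1], [3, 5, 6], [8], [11]]
```

On the sample data this reproduces the two rows shown above; like the Java version, it stops as soon as no grouping shrinks the table, so the result is valid but not necessarily minimal.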
If you were to do all this in SQL (MySQL), you could possibly use group_concat(). I think MS SQL has something similar (see simulating-group-concat, or STRING_AGG on SQL Server 2017), but I think you would have to work with text columns, which is a bit nasty in this case:
e.g.
create table my_table (A varchar(50) not null, B varchar(50) not null,
C varchar(50) not null, D varchar(50) not null);
insert into my_table values ('1','3,5','4,15','11'), ('1','3,5','3,10','11');
select A, B, group_concat(C order by C) as C, D from my_table group by A, B, D;
This gives the result below, so you would have to parse, sort, and update the comma-separated result before the next merge iteration (group by) for it to be correct.
['1', '3,5', '3,10,4,15', '11']
I want to write a program that detects potential duplicates at 3 severity levels.
Consider that my data has only two columns, but thousands of rows.
The data in the second column is delimited only by commas. Example data:
Number | Material
1 | helmet,valros,42
2 | helmet,iron,knight
3 | valros,helmet,42
4 | knight,helmet
5 | valros,helmet,42
6 | plain,helmet
7 | helmet, leather
and my 3 levels are:
very high : A,B,C vs A,B,C
high : A,B,C vs B,C,A
so so : A,B,C vs A,B
So far I can only do the first level; I don't know how to do the 2nd and 3rd levels.
What I've tried:
Sub duplicates_separation()
Dim duplicate(), i As Long
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, shtOut As Worksheet
Set shtIn = ThisWorkbook.Sheets("input")
Set shtOut = ThisWorkbook.Sheets("output")
x = 2
y = 1
Set delrange = shtIn.Range("b1:b10000") 'set your range here
ReDim duplicate(0)
'search duplicates in 2nd column
For cell = 1 To delrange.Cells.Count
If Application.CountIf(delrange, delrange(cell)) > 1 Then
ReDim Preserve duplicate(i)
duplicate(i) = delrange(cell).Address
i = i + 1
End If
Next
'print duplicates
For i = UBound(duplicate) To LBound(duplicate) Step -1
shtOut.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
x = x + 1 'move to the next output row
Next i
End Sub
Duplicates detected by the program:
3 | valros,helmet,42
5 | valros,helmet,42
What I expected:
Number | Material
1 | helmet,valros,42
3 | valros,helmet,42
5 | valros,helmet,42
4 | knight,helmet
2 | helmet,iron,knight
I have an idea for detecting level 2 duplicates, but I think it will be complicated and make the program slow:
1. Turn column 2 into columns with the "Text to Columns" command
2. Sort the columns from A to Z (alphabetically)
3. Concatenate the columns
4. Do a COUNTIF like the one used for detecting level 1 duplicates
Is there a way to detect the 2nd and 3rd level duplicates?
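To pin down what the three levels mean, here is a small Python sketch of the split-and-sort idea. It is an illustration, not VBA: the function names are made up, and "so so" is assumed to mean that one item list is a strict subset of the other, which is one possible reading of "A,B,C vs A,B".

```python
def normalize(material):
    """Split a comma-delimited cell into trimmed, non-empty items."""
    return [part.strip() for part in material.split(",") if part.strip()]


def severity(a, b):
    """Classify a pair of cells as 'very high', 'high', 'so so' or None."""
    items_a, items_b = normalize(a), normalize(b)
    if items_a == items_b:
        return "very high"  # same items in the same order
    if sorted(items_a) == sorted(items_b):
        return "high"       # same items in a different order
    set_a, set_b = set(items_a), set(items_b)
    if set_a < set_b or set_b < set_a:
        return "so so"      # one list is contained in the other (assumption)
    return None


if __name__ == "__main__":
    print(severity("valros,helmet,42", "valros,helmet,42"))    # very high
    print(severity("helmet,valros,42", "valros,helmet,42"))    # high
    print(severity("knight,helmet", "helmet,iron,knight"))     # so so
    print(severity("plain,helmet", "helmet, leather"))         # None
```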
UPDATE
Yesterday I went to a friend's house to consult about this problem, but his solution is in Java, which I don't understand:
public class ali {
static void sPrint(String[] Printed) {
for (int iC = 0; iC < Printed.length; iC++) {
System.out.print(String.valueOf(Printed[iC]) + " | ");
}
System.out.println();
}
public static void main(String Args[]) {
int defaultLength = 10;
int indexID = 0;
int indexDesc = 1;
String[] DETECTORP1 = new String[defaultLength];
String[] DETECTORP2 = new String[defaultLength];
String[] DETECTORP3 = new String[defaultLength];
String[] DETECTORP4 = new String[defaultLength];
String[][] theString = new String[5][2];
theString[0] = new String[]{"1", "A, B, C, D"};
theString[1] = new String[]{"2", "A, B, C, D"};
theString[2] = new String[]{"3", "A, B, C, D, E"};
theString[3] = new String[]{"4", "A, B, D, C, E"};
theString[4] = new String[]{"5", "A, B, D, C, E, F"};
int P1 = 0;
int P2 = 0;
int P3 = 0;
int P4 = 0;
for (int iC = 0; iC < theString.length; iC++) {
System.out.println(theString[iC][indexID] + " -> " + theString[iC][indexDesc]);
}
for (int iC = 0; iC < theString.length; iC++) {
int LEX;
String theReference[] = theString[iC][indexDesc].replace(",", ";;").split(";;");
for (int iD = 0; iD < theString.length; iD++) {
if (iC != iD) {
String theCompare[] = theString[iD][1].replace(",", ";;").split(";;");
if (theReference.length == theCompare.length) {
LEX=0;
int theLength = theReference.length;
for (int iE = 0; iE < theLength; iE++) {
if (theReference[iE].equals(theCompare[iE])) {
LEX += 1;
}
}
if (LEX == theLength) {
DETECTORP1[P1] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P1 += 1;
} else {
LEX = 0;
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (theReference[iF].equals(theCompare[iG])) {
LEX += 1;
break;
}
}
}
if (LEX == theReference.length) {
DETECTORP2[P2] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P2 += 1;
}
}
} else {
LEX = 0;
if (theReference.length > theCompare.length) {
for (int iF = 0; iF < theReference.length; iF++) {
for (int iG = 0; iG < theCompare.length; iG++) {
if (iG == iF) {
if (theReference[iF].equals(theCompare[iF])) {
LEX += 1;
break;
}
}
}
}
if (LEX <= theReference.length && LEX >= theCompare.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
} else {
LEX =0;
for (int iF = 0; iF < theCompare.length; iF++) {
for (int iG = 0; iG < theReference.length; iG++) {
if (iG == iF) {
if (theCompare[iF].equals(theReference[iF])) {
LEX += 1;
// System.out.println(theReference[iG] + "==" + theCompare[iG]);
break;
}
}
}
}
if (LEX <= theCompare.length && LEX >= theReference.length) {
DETECTORP3[P3] = theString[iC][indexID] + " WITH " + theString[iD][indexID];
P3 += 1;
}
}
}
}
}
}
sPrint(DETECTORP1);
sPrint(DETECTORP2);
sPrint(DETECTORP3);
}
}
How can I do this in VBA?
Really, it depends how you want to define "severity level". Here's one way to do it, not necessarily the best: Use the Levenshtein distance.
Represent each of your items by a one-character attribute symbol, e.g.
H helmet
K knight
I iron
$ Leather
^ Valros
╔ Plain
¢ Whatever
etc.
Then convert your Material lists into strings containing the sequence of characters representing these attributes:
HIK = helmet,iron,knight
╔H = plain,helmet
Then compute the Levenshtein distance between those two strings. That will be your "severity level".
Debug.Print LevenshteinDistance("HIK","╔H")
'returns 3
Two implementations of the Levenshtein distance are shown in Wikipedia. And indeed you are in luck: someone on StackOverflow ported this to VBA.
In the comments section below, you say you don't like having to represent each of your possible attributes by one-character symbols. Fair enough; I agree this is a bit silly. Workaround: It is, in fact, possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. I show how to make this change in my answer to your follow-up question.
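As a rough illustration of that workaround (in Python rather than VBA, with a hypothetical function name), the standard dynamic-programming form of the Levenshtein distance works unchanged on any sequence, so it can compare lists of items as easily as strings of symbols:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of items).

    Counts the minimum number of insertions, deletions and substitutions,
    keeping only the previous and current rows of the DP table.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y (or keep it)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Compare item lists directly instead of one-character symbols.
    first = "helmet,iron,knight".split(",")
    second = "plain,helmet".split(",")
    print(levenshtein(first, second))  # 3
```

Note that this distance is order-sensitive, so a reordered list such as valros,helmet,42 versus helmet,valros,42 gets a non-zero distance; whether that matches the intended "high" level is a design decision.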
Is there any existing LINQ function, or a similar function, to detect how often the values in an ordered list change from less than zero to greater than zero?
For example, given the values:
5
2
-2
-5
8 <--- First
6
2
0
1
-3
-5
-3
2 <--- Second
Total count: 2
Sure - it's certainly easy if you're using .NET 4 or higher, using Zip:
// TODO: Consider how you want to handle 0 itself
var count = list.Zip(list.Skip(1), (x, y) => new { x, y })
.Count(pair => pair.x < 0 && pair.y > 0);
That shouldn't be hard to convert into VB if you know VB well :)
Alternatively, if you've really got a list, you can just do it "manually" pretty easily without LINQ:
int count = 0;
for (int i = 0; i < list.Count - 1; i++)
{
if (list[i] < 0 && list[i + 1] > 0)
{
count++;
}
}
You can implement this in one pass using Aggregate:
seq.Aggregate(new { Count=0, LastN = 0}, (state, n) => new {
Count = state.Count + (n > 0 && state.LastN < 0 ? 1 : 0),
LastN = n == 0 ? state.LastN : n
}).Count
This takes into account your wish to include "gradual" transitions such as -1,0,1.
However, a foreach may be easier, simply because it's more conventional. It'll also be faster:
var count = 0;
var lastN = 0;
foreach(var n in seq) {
if(n > 0 && lastN < 0)
count++;
if (n != 0)
lastN = n;
}
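To check the single-pass logic against the sample values from the question, here is the same loop as a short Python sketch (the function name is made up for illustration). Zeros are skipped, so a gradual transition such as -1, 0, 1 still counts once.

```python
def count_negative_to_positive(values):
    """Count how often the sequence goes from below zero to above zero.

    Zeros do not update the remembered sign, so -1, 0, 1 counts as one change.
    """
    count = 0
    last_nonzero = 0
    for n in values:
        if n > 0 and last_nonzero < 0:
            count += 1
        if n != 0:
            last_nonzero = n
    return count


if __name__ == "__main__":
    sample = [5, 2, -2, -5, 8, 6, 2, 0, 1, -3, -5, -3, 2]
    print(count_negative_to_positive(sample))  # 2
```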