How to flatten a BigQuery record with multiple repeated fields?

I'm trying to query against app-engine datastore backup data. In python, the entities are described as something like this:
class Bar(ndb.Model):
    property1 = ndb.StringProperty()
    property2 = ndb.StringProperty()

class Foo(ndb.Model):
    bar = ndb.StructuredProperty(Bar, repeated=True)
    baz = ndb.StringProperty()
Unfortunately, when Foo gets backed up and loaded into BigQuery, the table schema is loaded as:
bar | RECORD | NULLABLE
bar.property1 | STRING | REPEATED
bar.property2 | STRING | REPEATED
baz | STRING | NULLABLE
What I would like to do is to get a table of all bar.property1 and associated bar.property2 where baz = 'baz'.
Is there a simple way to flatten Foo so that the bar records are "zipped" together? If that's not possible, is there another solution?

As hinted in a comment by @Mosha, BigQuery supports user-defined functions (UDFs). You can enter one in the UDF Editor tab of the web UI. In this case, I used something like:
function flattenTogether(row, emit) {
  // emit one output row per index of the parallel repeated fields
  if (row.bar && row.bar.property1) {
    for (var i = 0; i < row.bar.property1.length; i++) {
      emit({property1: row.bar.property1[i],
            property2: row.bar.property2[i]});
    }
  }
};
bigquery.defineFunction(
  'flattenBar',
  ['bar.property1', 'bar.property2'],
  [{'name': 'property1', 'type': 'string'},
   {'name': 'property2', 'type': 'string'}],
  flattenTogether);
And then the query looked like:
SELECT
  property1,
  property2,
FROM
  flattenBar(
    SELECT
      bar.property1,
      bar.property2,
    FROM
      [dataset.foo]
    WHERE
      baz = 'baz')

Since baz is not repeated, you can simply filter on it in the WHERE clause without any flattening:
SELECT bar.property1, bar.property2 FROM t WHERE baz = 'baz'
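For readers on today's standard SQL (the answers above use legacy SQL), the two repeated fields can be zipped positionally with UNNEST ... WITH OFFSET, no UDF required. A sketch, assuming the same schema as above:
SELECT p1, p2
FROM `dataset.foo`,
  UNNEST(bar.property1) AS p1 WITH OFFSET i1,
  UNNEST(bar.property2) AS p2 WITH OFFSET i2
WHERE i1 = i2
  AND baz = 'baz'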

Scala MatchError while joining a dataframe and a dataset

I have one dataframe and one dataset:
Dataframe 1:
+------------------------------+-----------+
|City_Name                     |Level      |
+------------------------------+-----------+
|{City -> Paris}               |86         |
+------------------------------+-----------+
Dataset 2:
+-----------------------------------+-----------+
|Country_Details                    |Temperature|
+-----------------------------------+-----------+
|{City -> Paris, Country -> France} |31         |
+-----------------------------------+-----------+
I am trying to join them by checking whether the map in the column "City_Name" is included in the map in the column "Country_Details".
I am using the following UDF to check the condition:
val mapEqual = udf((col1: Map[String, String], col2: Map[String, String]) => {
  if (col2.nonEmpty) {
    col2.toSet subsetOf col1.toSet
  } else {
    true
  }
})
And I am making the join this way:
dataset2.join(dataframe1, mapEqual(dataset2("Country_Details"), dataframe1("City_Name"), "leftanti"))
However, I get this error:
terminated with error scala.MatchError: UDF(Country_Details#528) AS City_Name#552 (of class org.apache.spark.sql.catalyst.expressions.Alias)
Has anyone previously got the same error ?
I am using Spark version 3.0.2 with SQLContext, in Scala.
There are two issues here. The first is that when calling your function, you pass an extra parameter, "leftanti" (you meant to pass it to the join function, but you passed it to the udf instead).
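That is, the option string belongs in the join call itself:
dataset2.join(dataframe1, mapEqual(dataset2("Country_Details"), dataframe1("City_Name")), "leftanti")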
The second is that the udf logic won't work as expected; I suggest you use this instead:
val mapContains = udf { (col1: Map[String, String], col2: Map[String, String]) =>
  col2.keys.forall { key =>
    // every key/value pair of col2 must also be present in col1
    col1.get(key).exists(_ == col2(key))
  }
}
Result:
scala> ds.join(df1 , mapContains(ds("Country_Details"), df1("City_Name")), "leftanti").show(false)
+----------------------------------+-----------+
|Country_Details |Temperature|
+----------------------------------+-----------+
|{City -> Paris, Country -> France}|31 |
+----------------------------------+-----------+

How to consistently retrieve a property from a multi-valued attribute using Kusto

I have Azure AD audit events sent to a Log Analytics workspace, and I'd like to build a query that shows me all Unified Groups created with the IsPublic property set to True.
I have the relevant events in TargetResources[0].modifiedProperties; however, this is a multi-valued object, and depending on how the group was provisioned, the position of the attribute I'm looking for differs.
For example, TargetResources[0].modifiedProperties may contain IsPublic in the third position, but sometimes it's in the second or fourth position.
[
  {"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},
  {"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},
  {"displayName":"IsPublic","oldValue":"[]","newValue":"[false]"}
]
I am guessing there is a way to find the exact property and value dynamically?
Sincerely,
Tonino Bruno
You could use the mv-apply operator.
Below are a few examples:
datatable(i:int, TargetResources:dynamic)
[
  1, dynamic([{"p2":"v2","modifiedProperties":[{"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},{"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},{"displayName":"IsPublic","oldValue":"[]","newValue":"[false]"}]}]),
  2, dynamic([{"p4":"v4","modifiedProperties":[{"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},{"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},{"displayName":"IsSomething","oldValue":"[]","newValue":"[false]"},{"displayName":"IsPublic","oldValue":"[]","newValue":"[true]"}]}]),
  3, dynamic([{"p2":"v2","modifiedProperties":[{"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},{"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},{"displayName":"IsPublic","oldValue":"[]","newValue":"[true]"}]}]),
]
| project i, mp = TargetResources[0].modifiedProperties
| mv-apply mp on (
    where mp.displayName == "IsPublic" and mp.newValue == '[true]'
)
datatable(i:int, TargetResources:dynamic)
[
  1, dynamic([{"p2":"v2","modifiedProperties":[{"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},{"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},{"displayName":"IsPublic","oldValue":"[]","newValue":"[false]"}]}]),
  2, dynamic([{"p4":"v4","modifiedProperties":[{"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},{"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},{"displayName":"IsSomething","oldValue":"[]","newValue":"[false]"},{"displayName":"IsPublic","oldValue":"[]","newValue":"[true]"}]}]),
  3, dynamic([{"p2":"v2","modifiedProperties":[{"displayName":"DisplayName","oldValue":"[]","newValue":"[\"Test Group\"]"},{"displayName":"GroupType","oldValue":"[]","newValue":"[\"Unified\"]"},{"displayName":"IsPublic","oldValue":"[]","newValue":"[true]"}]}]),
]
| project i, mp = TargetResources[0].modifiedProperties
| mv-apply mp on (
    project displayName = tostring(mp.displayName), newValue = tostring(parse_json(tostring(mp.newValue))[0])
    | summarize b = make_bag(pack(displayName, newValue))
)
| where b.GroupType == "Unified" and b.IsPublic == "true"
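If you only need the matching rows (as in the first example), a plain mv-expand over the same projected array also works — a sketch, appended to the same sample datatable as above:
| project i, mp = TargetResources[0].modifiedProperties
| mv-expand mp
| where mp.displayName == "IsPublic" and mp.newValue == '[true]'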

Spark dataset filter elements

I have two Spark datasets, lessonDS and latestLessonDS.
These are my dataset POJOs:
Lesson class:
  private List<SessionFilter> info;
  private lessonId;
LatestLesson class:
  private String id;
SessionFilter class:
  private String id;
  private String sessionName;
I want to get all Lesson data whose info.id values do not appear among the LatestLesson ids,
something like this:
lessonDS.filter(explode(col("info.id")).notEqual(latestLessonDS.col("value"))).show();
latestLessonDS contains:
100A
200C
300A
400A
lessonDS contains:
1,[[100A,jon],[200C,natalie]]
2,[[100A,jon]]
3,[[600A,jon],[400A,Kim]]
result:
3,[[600A,jon]]
If latestLessonDS is reasonably small, you can collect and broadcast it;
a simple filter transformation on lessonDS will then give you the desired result, like:
import scala.collection.JavaConversions._
import spark.implicits._

val bc = spark.sparkContext.broadcast(latestLessonDS.collectAsList().toSeq)
lessonDS.mapPartitions(itr => {
  // ids present in latestLessonDS (assumes a standard getId getter on the POJO)
  val latestIds = bc.value.map(_.getId).toSet
  itr.filter(x =>
    // keep a lesson if at least one of its info ids is absent from latestLessonDS
    x.getInfo.exists(sf => !latestIds.contains(sf.getId)))
})
Usually the array_contains function could be used as the join condition when joining lessonDs and latestLessonDs. But it does not work here, as the join condition requires that all elements of lessonDs.info.id appear in latestLessonDs.
A way to get the result is to explode lessonDs, join with latestLessonDs, and then check whether an entry in latestLessonDs exists for all elements of lessonDs.info by comparing the number of info elements before and after the join:
lessonDs
  .withColumn("noOfEntries", size('info))
  .withColumn("id", explode(col("info.id")))
  .join(latestLessonDs, "id")
  .groupBy("lessonId", "info", "noOfEntries").count()
  .filter("noOfEntries = count")
  .drop("noOfEntries", "count")
  .show(false)
prints
+--------+------------------------------+
|lessonId|info |
+--------+------------------------------+
|1 |[[100A, jon], [200C, natalie]]|
|2 |[[100A, jon]] |
+--------+------------------------------+

How to automate a field mapping using a table in snowflake

I have a one-column table in my Snowflake database that contains a JSON mapping structure like the following:
ColumnMappings : {"Field Mapping": "blank=Blank,E=East,N=North,"}
How do I write a query such that if I feed the field mapping a value of E I get East, if the value is N I get North, and so on, without hard-coding the values in the query the way a CASE statement would?
You really want your mapping in this JSON form:
{
  "blank" : "Blank",
  "E" : "East",
  "N" : "North"
}
You can achieve that in Snowflake, e.g. with a simple JavaScript UDF:
create or replace table x(cm variant) as
select parse_json(*) from values('{"fm": "blank=Blank,E=East,N=North,"}');

create or replace function mysplit(s string)
returns variant
language javascript
as $$
  // split the comma-separated "key=value" pairs and fold them into an object
  res = S
    .split(",")
    .reduce(
      (acc, val) => {
        var vals = val.split("=");
        acc[vals[0]] = vals[1];
        return acc;
      },
      {});
  return res;
$$;
select cm:fm, mysplit(cm:fm) from x;
-------------------------------+--------------------+
 CM:FM                         | MYSPLIT(CM:FM)     |
-------------------------------+--------------------+
 "blank=Blank,E=East,N=North," | {                  |
                               |   "E": "East",     |
                               |   "N": "North",    |
                               |   "blank": "Blank" |
                               | }                  |
-------------------------------+--------------------+
And then you can simply extract values by key with GET, e.g.
select cm:fm, get(mysplit(cm:fm), 'E') from x;
-------------------------------+--------------------------+
CM:FM | GET(MYSPLIT(CM:FM), 'E') |
-------------------------------+--------------------------+
"blank=Blank,E=East,N=North," | "East" |
-------------------------------+--------------------------+
For performance, you might want to make sure you call mysplit only once per value in your mapping table, or even pre-materialize it.
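For example, a sketch of the pre-materialization idea, reusing the example table and UDF above (the table name x_parsed is just for illustration):
create or replace table x_parsed as
  select cm:fm::string as fm, mysplit(cm:fm) as map from x;

select fm, get(map, 'E') from x_parsed;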

Active Record Query - Search Multiple Columns for Multiple Strings and Return Only if They Are All Included

I need help designing a query via Active Record & Postgresql.
• The query must search across all of the following columns...
The Model looks like this:
Collection
  item_id1: String
  item_id2: String
  item_id3: String
  item_id4: String
  item_id5: String
  item_id6: String
• The query needs to pass in an array of strings that need to be searched across all of the item_id fields.
• The query must also only return results of Records containing all of the strings within the array.
Note: I also have the Textacular full-text search gem installed. However, I tested a search that I believe is supposed to return matches only if the records include all of the passed-in strings, and it came up with nothing, despite records with those strings existing in my database. Like this: Collection.advanced_search('B0066AJ5TK&B0041KJSL2')
Just to clarify: you want records where each of the strings in the array is found somewhere within the six item_id fields?
There's probably a more elegant way to do this, but here's what I've got off the top of my head:
terms = ['foo', 'bar', 'baz']
conditions = []
values = {}
terms.each_with_index do |t, i|
  arg_id = "term#{i}".to_sym
  conditions << "(item_id1 = :#{arg_id} OR item_id2 = :#{arg_id} OR item_id3 = :#{arg_id} OR item_id4 = :#{arg_id} OR item_id5 = :#{arg_id} OR item_id6 = :#{arg_id})"
  values[arg_id] = t
end
Collection.where(conditions.join(' AND '), values)
This should produce a query like this:
SELECT "collections".* FROM "collections" WHERE ((item_id1 = 'foo' OR item_id2 = 'foo' OR item_id3 = 'foo' OR item_id4 = 'foo' OR item_id5 = 'foo' OR item_id6 = 'foo') AND (item_id1 = 'bar' OR item_id2 = 'bar' OR item_id3 = 'bar' OR item_id4 = 'bar' OR item_id5 = 'bar' OR item_id6 = 'bar') AND (item_id1 = 'baz' OR item_id2 = 'baz' OR item_id3 = 'baz' OR item_id4 = 'baz' OR item_id5 = 'baz' OR item_id6 = 'baz'))
Which is long and ugly, but should get the results you want.
If you meant that the fields might contain the strings to be searched for, rather than be equal to them, you could instead use
item_id1 LIKE :#{arg_id}
and
values[arg_id] = "%#{t}%"
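Putting the substring variant together (same setup as above; only the condition and the bound value change):
terms.each_with_index do |t, i|
  arg_id = "term#{i}".to_sym
  # build "item_idN LIKE :termI" for each of the six columns
  like_clauses = (1..6).map { |n| "item_id#{n} LIKE :#{arg_id}" }
  conditions << "(#{like_clauses.join(' OR ')})"
  values[arg_id] = "%#{t}%"
end
Collection.where(conditions.join(' AND '), values)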