How to create a filter on an aws glue dynamicframe that filters out set of (literal) values - aws-glue-spark

In a glue script (running in a zeppelin notebook forwarding to a dev endpoint in glue), I've created a dynamicframe from a glue table, that I would like to filter on field "name" not being in a static list of values, i.e. ("a","b","c").
Filtering on non-equality works just fine like this:
def unknownNameFilter(rec: DynamicRecord): Boolean = {
rec.getField("name").exists(_ != "a")
}
I have tried several things like
!rec.getField("name").exists(_ isin ("a","b","c"))
but it gives errors (value isin is not a member of Any), and I can only find pyspark examples and examples that first convert the dynamicframe to a dataframe on the web (which I want to prevent if possible).
Help much appreciated, thanks.

Okay, found my answer, I'll post it for anyone else looking for this, it is done with
!(knownevents.contains(eventname))
Like this in a filter function:
def unknownEventFilter(rec: DynamicRecord): Boolean = {
val knownevents = List("evt_a","evt_b")
rec.getField("name") match {
case Some(eventname: String) => !(knownevents.contains(eventname))
case _ => throw new IllegalArgumentException(s"Unable to extract field name")
}
}
val dfUnknownEvents = df.filter(unknownEventFilter)

Related

How to read Key Value pair in spark SQL?

How do I get this output using spark sql or scala ? I have a table with columns storing such values - need to split in seprate columns.
Input :
Output :
It pretty much depends on what libs you want to use (as you mentioned in Scala or Spark).
Using spark
val rawJson = """
{"Name":"ABC.txt","UploaddedById":"xxxxx1123","UploadedByName":"James"}
"""
spark.read.json(Seq(rawJson).toDS)
Using common json libraries:
// play
Json.parse(rawJson) match {
case obj: JsObject =>
val values = obj.values
val keys = obj.keys
// construct dataframe having keys and values
case other => // handle other types (like JsArray, etc,.)
}
// circe
import io.circe._, io.circe.parser._
parse(rawJson) match {
case Right(json) => // fetch key values, construct df, much like above
case Left(parseError) => ...
}
You can use almost any json library to parse your json object, and then convert it to spark df very easily.

DBArrayList to List<Map> Conversion after Query

Currently, I have a SQL query that returns information to me in a DBArrayList.
It returns data in this format : [{id=2kjhjlkerjlkdsf324523}]
For the next step, I need it to be in a List<Map> format without the id: [2kjhjlkerjlkdsf324523]
The Datatypes being used are DBArrayList, and List.
If it helps any, the next step is a function to collect the list and then to replace all single quotes if any [SQL-Injection prevention]. Using:
listMap = listMap.collect() { "'" + Util.removeSingleQuotes(it) + "'" }
public static String removeSingleQuotes(s) {
return s ? s.replaceAll(/'"/, '') : s
}
I spent this morning working on it, and I found out that I needed to actually collect the DBArrayList like this:
listMap = dbArrayList.collect { it.getAt('id')}
If you're in a bind like I was and restrained to a specific schema this might help, but #ou_ryperd has the correct answer!
While using a DBArrayList is not wrong, Groovy's idiom is to use the db result as a collection. I would suggest you use it that way directly from the db:
Map myMap = [:]
dbhandle.eachRow("select fieldSomeID, fieldSomeVal from yourTable;") { row ->
map[row.fieldSomeID] = row.fieldSomeVal.replaceAll(/'"/, '')
}

Spark Streaming + Spark SQL

I am trying to process logs via Spark Streaming and Spark SQL. The main idea is to have a "compacted" dataset with Parquet format for "old" data converted to DataFrame as needed for queries, the compacted dataset loading is done with:
SQLContext sqlContext = JavaSQLContextSingleton.getInstance(sc.sc());
DataFrame compact = null;
compact = sqlContext.parquetFile("hdfs://auto-ha/tmp/data/logs");
As the uncompacted dataset (I compact the dataset daily) is composed of many files, I would like to have the data in the current day within a DStream in order to get those queries fast.
I have tried the DataFrame approach without results....
DataFrame df = JavaSQLContextSingleton.getInstance(sc.sc()).createDataFrame(lastData, schema);
df.registerTempTable("lastData");
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
#Override
public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
DataFrame df = JavaSQLContextSingleton.getInstance(v1.context()).createDataFrame(v1, schema);
......drop old data from lastData table
df.insertInto("lastData");
}
});
Using this approach I do not get any results if I query the temp table in a different thread for example.
I have also tried to use the RDD transform method, more specifically I tried to follow the Spark Example where I create a empty RDD and then I union the DSStream RDD contents with the empty RDD:
JavaRDD<Row> lastData = sc.emptyRDD();
JavaDStream SumStream = inputStream.transform(new Function<JavaRDD<Row>, JavaRDD<Object>>() {
#Override
public JavaRDD<Object> call(JavaRDD<Row> v1) throws Exception {
lastData.union(v1).filter(let only recent data....);
}
});
This approach does not work too as I do not get any contents in the lastData
Could I use for this purpose Windowed computations or updateStateBy key?
Any suggestions?
Thanks for your help!
Well I finally got it.
I use updateState function and return 0 if the timestamp is older than 24 hour like this.
final static Function2<List<Long>, Optional<Long>, Optional<Long>> RETAIN_RECENT_DATA
= (List<Long> values, Optional<Long> state) -> {
Long newSum = state.or(0L);
for (Long value : values) {
newSum += value;
}
//current milis uses UTC
if (System.currentTimeMillis() - newSum > 86400000L) {
return Optional.absent();
} else {
return Optional.of(newSum);
}
};
Then on each batch I register the DataFrame as temp table:
finalsum.foreachRDD((JavaRDD<Row> rdd, Time time) -> {
if (!rdd.isEmpty()) {
HiveContext sqlContext1 = JavaSQLContextSingleton.getInstance(rdd.context());
if (sqlContext1.cacheManager().isCached("alarm_recent")) {
sqlContext1.uncacheTable("alarm_recent");
}
DataFrame wordsDataFrame = sqlContext1.createDataFrame(rdd, schema);
wordsDataFrame.registerTempTable("alarm_recent");
wordsDataFrame.cache();//
wordsDataFrame.first();
}
return null;
});
You could use mapwithState with Spark1.6.
The mapwithState function is much more efficient and easy to implement.
Take a look at this link.
mapwithState supports cool functionality like State time out and initialRDD which comes handy while maintaining a Stateful Dstream.
Thanks
Manas

Strings concatenation in Spark SQL query

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following:
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select("1"+"s.codeA".attr, "e.name".attr)
Let's say my tables contain:
sim:
codeA,codeB
0001,abcd
0002,efgh
events:
codeA,name
0001,freddie
0002,mercury
And I would want as output:
10001,freddie
10002,mercury
In SQL or HiveQL I know I have the concat function available, but it seems Spark SQL doesn't support this feature. Can somebody suggest me a workaround for my issue?
Thank you.
Note:
I'm using Language Integrated Queries but I could use just a "standard" Spark SQL query, in case of eventual solution.
The output you add in the end does not seem to be part of your selection, or your SQL logic, if I understand correctly. Why don't you proceed by formatting the output stream as a further step ?
val results = sqlContext.sql("SELECT s.codeA, e.code FROM foobar")
results.map(t => "1" + t(0), t(1)).collect()
It's relatively easy to implement new Expression types directly in your project. Here's what I'm using:
case class Concat(children: Expression*) extends Expression {
override type EvaluatedType = String
override def foldable: Boolean = children.forall(_.foldable)
def nullable: Boolean = children.exists(_.nullable)
def dataType: DataType = StringType
def eval(input: Row = null): EvaluatedType = {
children.map(_.eval(input)).mkString
}
}
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select(Concat("1", "s.codeA".attr), "e.name".attr)

Spark JavaPairRDD iteration

How can iterate on JavaPairRDD. I have done a group by and got back a RDD as below JavaPairRDD (Tuple 7 set of Strings and List of Objects)
Now I have to iterate over this RDD and do some calculations like FOR EACH in Pig.
Basically I would like to iterate the key and the list of values and do some operations and then return back a JavaPairRDD?
JavaPairRDD<Tuple7<String, String,String,String,String,String,String>, List<Records>> sizes =
piTagRecordData.groupBy( new Function<Records, Tuple7<String, String,String,String,String,String,String>>() {
private static final long serialVersionUID = 2885738359644652208L;
#Override
public Tuple7<String, String,String,String,String,String,String> call(Records row) throws Exception {
Tuple7<String, String,String,String,String,String,String> compositeKey = new Tuple7<String, String, String, String, String, String, String>(row.getAsset_attribute_id(),row.getDate_time_value(),row.getOperation(),row.getPi_tag_count(),row.getAsset_id(),row.getAttr_name(),row.getCalculation_type());
return compositeKey;
}
});
After this I want to perform FOR EACH member of sizes (JavaPairRDD), operation -- something like
rejected_records = FOREACH sizes GENERATE FLATTEN(Java function on the List of Records based on the group key
I am using Spark 0.9.0
Even though you are talking about "FOR EACH", it really sounds like you want the flatMap operation, since you want to produce new values and flatten them. This is available for Java RDDs, including a JavaPairRDD.
You can use void foreach(VoidFunction<T> f) method. More info and methods: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/api/java/JavaRDDLike.html#foreach(org.apache.spark.api.java.function.VoidFunction)
if you want to view some value of JavaPairRDD, I would do like this
for (Tuple2<String, String> test : pairRdd.take(10)) //or pairRdd.collect()
{
System.out.println(test._1);
System.out.println(test._2);
}
Note:Tuple2 (assuming you have strings inside the JavaPairRDD), change the datatype according to the data type stored in the JavaPairRDD.