Iterate through a Dataset column that holds arrays of key-value pairs and find the pair with the max value - dataframe

I have data in a dataframe, obtained from Azure Event Hubs.
I convert this data to JSON objects and store the required data in a dataset as shown below.
Code for obtaining the data from Event Hubs and storing it in a dataframe:
import java.time.{Duration, Instant}
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
import org.apache.spark.sql.functions._
import spark.implicits._

val connectionString = ConnectionStringBuilder(<ENDPOINT URL>)
  .setEventHubName(<EVENTHUB NAME>).build
val currTime = Instant.now
val ehConf = EventHubsConf(connectionString)
  .setConsumerGroup("<CONSUMER GRP>")
  .setStartingPosition(EventPosition
    .fromEnqueuedTime(currTime.minus(Duration.ofMinutes(30))))
  .setEndingPosition(EventPosition.fromEnqueuedTime(currTime))
val reader = spark.read.format("eventhubs").options(ehConf.toMap).load()

// Extract the required fields from the JSON body of each event.
val body = $"body".cast("string")
val SIGNALS = reader.select(
  get_json_object(body, "$.NUM").alias("NUM"),
  get_json_object(body, "$.SIG1").alias("SIG1"),
  get_json_object(body, "$.SIG2").alias("SIG2"),
  get_json_object(body, "$.SIG3").alias("SIG3"),
  get_json_object(body, "$.SIG4").alias("SIG4"))

// Keep only rows where all four signals are present.
val SIGNALSFiltered = SIGNALS.filter(col("SIG1").isNotNull &&
  col("SIG2").isNotNull && col("SIG3").isNotNull && col("SIG4").isNotNull)
The data obtained in SIGNALSFiltered is shown below.
+-----------------+--------------------+--------------------+--------------------+--------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|XXXXX01|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX02|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|[{"TIME":15695604780...|
|XXXXX03|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX04|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX05|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX06|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|[{"TIME":15695605340...|
|XXXXX07|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
|XXXXX08|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|[{"TIME":15695605310...|
If we check the entire data for a single row, it is as below.
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825},{"TIME":1569560475000,"VALUE":3.7812},{"TIME":1569560483000,"VALUE":3.7812},{"TIME":1569560491000,"VALUE":34.7875}]|
[{"TIME":1569560537000,"VALUE":3.7825},{"TIME":1569560481000,"VALUE":34.7825},{"TIME":1569560489000,"VALUE":34.7825},{"TIME":1569560497000,"VALUE":34.7825}]|
[{"TIME":1569560505000,"VALUE":34.7825},{"TIME":1569560513000,"VALUE":34.7825},{"TIME":1569560521000,"VALUE":34.7825},{"TIME":1569560527000,"VALUE":34.7825}]|
[{"TIME":1569560535000,"VALUE":34.7825},{"TIME":1569560479000,"VALUE":34.7825},{"TIME":1569560487000,"VALUE":34.7825}]
I want only the pair with the highest TIME from each column, not all of the TIME-VALUE pairs. The output should be as shown below.
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":4.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":5.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":6.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":7.7825}]|
|XXXXX03|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":8.7825}]|
How can I iterate through each column in each row and get the highest TIME-VALUE pair?
After getting the highest pair in each column (SIG1, ..., SIG4), I then have to update only the TIME value in all columns with the highest TIME among them.
Also, is there any way to convert the base dataset as below? Each element in a column should become a new row.
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
| NUM| SIG1| SIG2| SIG3| SIG4|
+-----------------+-----------------------------+---------------------------------------+---------------------------------------+----------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]| null |[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531000,"VALUE":3.7825}]|```
Any leads or help is appreciated! Thanks in advance.

You have to write a user-defined function (UDF) like the one below, which loops over your data and picks the pair with the maximum TIME.
Note: the UDF is just for reference; you can change it as per your requirement.
How can I iterate through each column in each row and get the highest TIME-VALUE pair?
import scala.util.parsing.json.JSON
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

def MaxTime: UserDefinedFunction = udf((json: String) => {
  val pars = JSON.parseFull(json)
  var output = ""
  pars.foreach { x =>
    // The parsed JSON is a list of {"TIME": ..., "VALUE": ...} maps.
    val y = x.asInstanceOf[List[Any]]
    var i = 1
    val TimeMap = scala.collection.mutable.Map[String, Long]()
    val ValueMap = scala.collection.mutable.Map[String, Double]()
    y.foreach { zz =>
      val z = zz.asInstanceOf[Map[String, Double]]
      TimeMap(i.toString) = z("TIME").toLong
      ValueMap(i.toString) = z("VALUE")
      i = i + 1
    }
    // Keep only the pair with the maximum TIME.
    val maxEntry = TimeMap.maxBy(_._2)
    output = """[{"TIME": """ + maxEntry._2 + """, "VALUE": """ + ValueMap(maxEntry._1) + """}]"""
  }
  output
})
SIGNALSFiltered
  .withColumn("SIG1", MaxTime(col("SIG1")))
  .withColumn("SIG2", MaxTime(col("SIG2")))
  .withColumn("SIG3", MaxTime(col("SIG3")))
  .withColumn("SIG4", MaxTime(col("SIG4")))
  .show(false)
After getting the highest pair in each column (SIG1, ..., SIG4), I have to update only the TIME value in all columns with the highest among them.
Write the same kind of UDF as above and pass the complete row as a parameter. Then parse each column value into a Map and get the maximum TIME among all columns; a minimal sketch follows.
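A minimal sketch of that idea (my illustration, not part of the original answer), assuming Spark 2.x, where a Scala UDF may take a struct argument as a Row, and assuming the MaxTime UDF above has already reduced every SIG column of a frame (called reduced here) to a single-element array like [{"TIME":...,"VALUE":...}]. parsePair, maxTimeOfRow and alignTime are hypothetical names introduced here:

import scala.util.parsing.json.JSON
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Hypothetical helper: parse a single-element array such as
// [{"TIME":1569560531000,"VALUE":3.7825}] into a (time, value) tuple.
def parsePair(json: String): (Long, Double) = {
  val m = JSON.parseFull(json).get
    .asInstanceOf[List[Map[String, Double]]].head
  (m("TIME").toLong, m("VALUE"))
}

// Highest TIME across all signal columns of one row (passed in as a struct).
val maxTimeOfRow = udf((row: Row) =>
  (0 until row.length).map(i => parsePair(row.getString(i))._1).max)

// Rewrite one column's TIME with the row-wide maximum, keeping its own VALUE.
val alignTime = udf((json: String, maxTime: Long) => {
  val value = parsePair(json)._2
  s"""[{"TIME":$maxTime,"VALUE":$value}]"""
})

val sigCols = Seq("SIG1", "SIG2", "SIG3", "SIG4")
val aligned = sigCols
  .foldLeft(reduced.withColumn("MAX_TIME", // "reduced" = frame after applying MaxTime
    maxTimeOfRow(struct(sigCols.map(col): _*))))(
    (df, c) => df.withColumn(c, alignTime(col(c), col("MAX_TIME"))))
  .drop("MAX_TIME")

For the last part of the question, turning each array element into its own row, one possible sketch (again mine, assuming Spark 2.4+ for arrays_zip, and reusing sigCols from above) parses the JSON arrays with from_json and zips them position by position, which pads shorter arrays with null exactly as in the desired output:

import org.apache.spark.sql.functions.{arrays_zip, col, explode, from_json}
import org.apache.spark.sql.types._

// Schema of one signal column: an array of {TIME, VALUE} structs.
val pairArray = ArrayType(StructType(Seq(
  StructField("TIME", LongType), StructField("VALUE", DoubleType))))

// Parse every SIG column from a JSON string into an array of structs.
val parsed = sigCols.foldLeft(SIGNALSFiltered)((df, c) =>
  df.withColumn(c, from_json(col(c), pairArray)))

// Zip the four arrays element-wise and explode: one output row per position,
// with null wherever an array has no element at that position.
val exploded = parsed
  .select(col("NUM"), explode(arrays_zip(
    col("SIG1"), col("SIG2"), col("SIG3"), col("SIG4"))).as("z"))
  .select(col("NUM"), col("z.SIG1").as("SIG1"), col("z.SIG2").as("SIG2"),
    col("z.SIG3").as("SIG3"), col("z.SIG4").as("SIG4"))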

Related

How to get row data from web table in Karate framework

The web table has a combination of textboxes, spans and checkboxes. I need to get all of the first row's data in a single def and verify it against the DB in order.
For example, the first row of the table has columns like:
OrderID (span), EmpName (input), IsHeEligible (checkbox), Address (span)
Using the below,
def tebleFirstData = scriptAll("table/tbody/tr/td", '_.textContent')
I'm able to get only the span text data, not the input tags' data.
I tried the below,
data = attribute("table/tbody/tr/td[5]/input", 'value')
but I'm able to get only a single input tag's attribute value.
How can I get all the data, i.e. the span data and the input data, in a single def?
Below is the solution to get all the row data from the table:
* def UiFirstRowElements = locateAll("Row xpath")
* print UiFirstRowElements
* def tableData = []
* def RowHtml = UiFirstRowElements[0].html
* print RowHtml
* eval
"""
for (var i = 0; i < UiFirstRowElements.length; i++) {
  if (UiFirstRowElements[i].html.contains("input") && UiFirstRowElements[i].html.contains("date")) {
    tableData.push(locate("//table/tbody/tr/td[" + (i + 1) + "]/div/input").property('value'))
  } else if (UiFirstRowElements[i].html.contains("input") && UiFirstRowElements[i].html.contains("checkbox")) {
    tableData.push(locate("//table/tbody/tr/td[" + (i + 1) + "]/div/input").property('checked'))
  } else {
    tableData.push(locate("//table/tbody/tr/td[" + (i + 1) + "]").property('textContent'))
  }
}
"""
* print tableData

Dart SQL: How do I read multiple row values (with multiple columns) from SQL query results in Dart?

Right now I am able to read data from the columns in the single returned row as below:
var resultsDoc = await conn
.query('select fNAME, SPECIALITY, CATENAME from doctor where id = ?', [5]);
for (var rowDoc in resultsDoc) {
ffname = rowDoc[0]; //reading value at column 1
sspeciality = rowDoc[1]; //reading value at column 2
ccategory = rowDoc[2]; //reading value at column 3
}
What if I wanted to return the entire table (multiple rows, each with multiple columns)?
Please assist on how to read those values.
Thank you.
You were close!
for (var rowDoc in resultsDoc) {
print(rowDoc.join('\t'));
}
Or did you want to transpose the list of lists, so that you'd have 3 lists, one for each column?

Outputting values not equal to certain values in yii2

I would like to output the variables not equal to certain values, but it returns an error of:
Failed to prepare SQL: SELECT * FROM `tblsuunit` WHERE `unitid` != :qp0
There are two models. In the first model I am getting the array of ids:
public function actionSunits($id){
    $unitslocation = new Unitslocation();
    $id2 = Unitslocation::find()->where(['officelocationid'=>$id])->all();
    foreach($id2 as $ids){
        print_r($ids['unitid']."<br>");
    }
}
This outputs the ids as:
8
9
11
12
13
14
16
I would then like to take these ids, compare them with another model (the Units model), get the id values not among the above, and output them.
So I have added:
$idall = Units::find()->where(['!=', 'unitid', $ids])->all();
So the whole controller action becomes:
public function actionSunits($id){
    $unitslocation = new Unitslocation();
    $id2 = Unitslocation::find()->where(['officelocationid'=>$id])->all();
    foreach($id2 as $ids){
        $idall = Units::find()->where(['!=', 'unitid', $ids])->all();
    }
    var_dump($idall);
}
This is the units model table:
If it were working it should return 7 and 10
What could be wrong?
You should fix your code and simply use a not in condition, e.g.:
// $uls will be an array of Unitslocation objects
$uls = Unitslocation::find()->where(['officelocationid'=>$id])->all();
// $uids will contain the unitids
$uids = \yii\helpers\ArrayHelper::getColumn($uls, 'unitid');
// then simply use a not in condition
$units = Units::find()->where(['not in', 'unitid', $uids])->all();
$idall = \yii\helpers\ArrayHelper::getColumn($units, 'unitid');
Read more about ActiveQuery::where() and ArrayHelper::getColumn().
Try with:
$idall = Units::find()->where(['not in','unitid',$ids])->all();
Info: https://github.com/yiisoft/yii2/blob/master/docs/guide/db-query-builder.md
operand 1 should be a column or DB expression. Operand 2 can be either
an array or a Query object.

Programmatically adding several columns to Spark DataFrame

I'm using Spark with Scala.
I have a DataFrame with 3 columns: ID, Time, RawHexdata.
I have a user-defined function which takes RawHexdata and expands it into X more columns. It is important to state that X is the same for every row (the columns do not vary), but before I receive the first data, I do not know what the columns are. Once I have the first row, I can deduce them.
I would like a second DataFrame with said columns: Id, Time, RawHexData, NewCol1, ..., NewCol3.
The "easiest" method I can think of is:
1. serialize each row into JSON (every data type is serializable here),
2. add my new columns,
3. deserialize a new DataFrame from the altered JSON.
However, that seems like a waste, as it involves two costly and redundant JSON serialization steps. I am looking for a cleaner pattern.
Using case classes seems like a bad idea, because I don't know the number of columns or the column names in advance.
What you can do to dynamically extend your DataFrame is to operate on the row RDD, which you can obtain by calling dataFrame.rdd. Having a Row instance, you can access the RawHexdata column and parse the contained data. By adding the newly parsed columns to the resulting Row, you've almost solved your problem. The only thing still necessary to convert an RDD[Row] back into a DataFrame is to generate the schema for your new columns. You can do this by collecting a single RawHexdata value on the driver and then extracting the column types.
The following code illustrates this approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object App {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val input = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val dataFrame = input.toDF()
    dataFrame.show()

    // create the extended rows RDD: here the "age" column plays the role
    // of the blob that gets expanded into several new columns
    val rowRDD = dataFrame.rdd.map { row =>
      val blob = row(1).asInstanceOf[Int]
      val newColumns: Seq[Any] = Seq(blob, blob * 2, blob * 3)
      Row.fromSeq(row.toSeq.init ++ newColumns)
    }

    val schema = dataFrame.schema
    // we know that the new columns are all integers
    val newColumns = StructType {
      Seq(StructField("1", IntegerType), StructField("2", IntegerType), StructField("3", IntegerType))
    }
    val newSchema = StructType(schema.init ++ newColumns)

    val newDataFrame = sqlContext.createDataFrame(rowRDD, newSchema)
    newDataFrame.show()
  }
}
SELECT is your friend, solving it without going back to the RDD.
import org.apache.spark.sql.functions.{col, lit}

case class Entry(Id: String, Time: Long)

val entries = Seq(
  Entry("x1", 100L),
  Entry("x2", 200L)
)

val newColumns = Seq("NC1", "NC2", "NC3")

// Append one null literal column per new column name.
val df = spark.createDataFrame(entries)
  .select(col("*") +: (newColumns.map(c => lit(null).as(c))): _*)

df.show(false)
+---+----+----+----+----+
|Id |Time|NC1 |NC2 |NC3 |
+---+----+----+----+----+
|x1 |100 |null|null|null|
|x2 |200 |null|null|null|
+---+----+----+----+----+
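To tie this back to the question (a sketch of mine, not part of the original answer): once the new column names have been deduced from the first row, the same select pattern can derive each new column from RawHexdata with a UDF instead of lit(null). parseField and expand below are hypothetical names, and the byte-wise parsing is only a stand-in for the real expansion logic:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical: pull the i-th byte out of the hex blob as an Int.
def parseField(i: Int) = udf((raw: String) =>
  Integer.parseInt(raw.substring(2 * i, 2 * i + 2), 16))

// Assuming a frame with columns ID, Time, RawHexdata and column names
// deduced from the first row, append one derived column per name.
def expand(df: DataFrame, newCols: Seq[String]): DataFrame =
  df.select((col("*") +: newCols.zipWithIndex.map {
    case (name, i) => parseField(i)(col("RawHexdata")).as(name)
  }): _*)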

Lua: get the index names of a table as a table

Is there any way to get every index (key) of a table?
Example:
local mytbl = {
  ["Hello"] = 123,
  ["world"] = 321
}
I want to get this:
{"Hello", "world"}
local t = {}
for k, v in pairs(mytbl) do
  table.insert(t, k) -- or t[#t + 1] = k
end
Note that the order of how pairs iterates a table is not specified. If you want to make sure the elements in the result are in a certain order, use:
table.sort(t)