We have a table in Hive that stores trading-order data for each end of day, keyed by order_date. The other important columns are:
product
contract
price (price of the order placed)
ttime (transaction time)
status (Insert, Update or Remove)
We have to build a charting table in tick-data fashion from the main table, where each row (order) carries the max and min order prices seen from the morning market open up to that row's time. That is, for a given order we would populate four columns: maxPrice (max price till now), maxPriceOrderId (order_id of the max-price order), minPrice and minPriceOrderId.
This has to be computed per product and contract, i.e. the max and min prices are taken only among orders of that product and contract.
While calculating these values we need to exclude all closed orders from the aggregation, i.e. max and min over all order prices till now, excluding orders whose status is "Remove".
We are using Spark 2.2, and the input data format is Parquet.
Input records:
Output records:
To give a simple SQL view, the problem works out as a self join and would look like this:
With the data set ordered on ttime, we have to get the max and min prices for a particular product and contract for each row (order) from the morning up to that order's time. This will run in batch for each eod (order_date) data set:
select mainSet.order_id, mainSet.product, mainSet.contract, mainSet.order_date, mainSet.price, mainSet.ttime, mainSet.status,
       max(aggSet.price) over (partition by mainSet.product, mainSet.contract, mainSet.ttime) as max_price,
       first_value(aggSet.order_id) over (partition by mainSet.product, mainSet.contract, mainSet.ttime order by aggSet.price desc, aggSet.ttime desc) as maxOrderId,
       min(aggSet.price) over (partition by mainSet.product, mainSet.contract, mainSet.ttime) as min_price,
       first_value(aggSet.order_id) over (partition by mainSet.product, mainSet.contract, mainSet.ttime order by aggSet.price, aggSet.ttime) as minOrderId
from order_table mainSet
join order_table aggSet
  on mainSet.product = aggSet.product
 and mainSet.contract = aggSet.contract
 and mainSet.ttime >= aggSet.ttime
 and aggSet.status <> 'Remove'
Writing it in Spark:
We started with Spark SQL and the DataFrame API like below:
val mainDF: DataFrame = sparkSession.sql("select * from order_table where order_date = 'eod_date'")

val ndf = mainDF.alias("mainSet").join(mainDF.alias("aggSet"),
      col("mainSet.product") === col("aggSet.product")
        && col("mainSet.contract") === col("aggSet.contract")
        && col("mainSet.ttime") >= col("aggSet.ttime")
        && col("aggSet.status") =!= "Remove",
      "inner")
  .select(col("mainSet.order_id"), col("mainSet.ttime"), col("mainSet.product"), col("mainSet.contract"),
          col("mainSet.order_date"), col("mainSet.price"), col("mainSet.status"),
          col("aggSet.order_id").as("agg_orderid"), col("aggSet.ttime").as("agg_ttime"), col("aggSet.price").as("agg_price")) // rename the aggSet columns
val max_window = Window.partitionBy(col("product"),col("contract"),col("ttime"))
val min_window = Window.partitionBy(col("product"),col("contract"),col("ttime"))
val maxPriceCol = max(col("agg_price")).over(max_window)
val minPriceCol = min(col("agg_price")).over(min_window)
val firstMaxorder = first(col("agg_orderid")).over(max_window.orderBy(col("agg_price").desc, col("agg_ttime").desc))
val firstMinorder = first(col("agg_orderid")).over(min_window.orderBy(col("agg_price"), col("agg_ttime")))
val priceDF= ndf.withColumn("max_price",maxPriceCol)
.withColumn("maxOrderId",firstMaxorder)
.withColumn("min_price",minPriceCol)
.withColumn("minOrderId",firstMinorder)
priceDF.show(20)
Volume stats:
average count: 7 million records
average count per group (product, contract): 600K
The job runs for hours and just does not finish. I have tried increasing memory and other parameters, but no luck.
The job keeps getting stuck, and many times I hit memory issues: "Container killed by YARN for exceeding memory limits. 4.9 GB of 4.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead".
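For reference, here is a minimal sketch of how that overhead hint (and shuffle parallelism) can be passed when the session is built; the numbers are illustrative placeholders rather than tuned recommendations, and the same settings can equally be given as --conf flags to spark-submit:

import org.apache.spark.sql.SparkSession

// Sketch only: raise off-heap headroom per executor and shuffle parallelism.
// The values are placeholders; tune them to your cluster.
val sparkSession = SparkSession.builder()
  .appName("order-minmax-batch")                          // hypothetical app name
  .config("spark.yarn.executor.memoryOverhead", "2048")   // MB of non-heap memory per executor container
  .config("spark.sql.shuffle.partitions", "400")          // more, smaller shuffle partitions
  .enableHiveSupport()                                    // needed to read the Hive order_table
  .getOrCreate()

That said, as the answer further down suggests, the root cause here is the exploding self join rather than the memory settings.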
ANOTHER APPROACH:
Repartition on our lowest-level group columns (product and contract) and then sort within partitions on ttime, so that each row arrives ordered on time inside the mapPartitions function.
Execute mapPartitions while maintaining a collection (key = order_id, value = price) at the partition level to compute the max and min prices and their order ids.
We keep removing orders with status "Remove" from the collection as and when we receive them.
Once the collection is updated for a given row within mapPartitions, we can calculate the max and min values from the collection and return the updated row.
val mainDF: DataFrame = sparkSession.sql(
    "select order_id, ttime, product, contract, order_date, price, status, " +
    "null as maxPrice, null as maxPriceOrderId, null as minPrice, null as minPriceOrderId " +
    "from order_table where order_date = 'eod_date'")
  .repartitionByRange(col("product"), col("contract"))

case class summary(order_id: String, ttime: String, product: String, contract: String, order_date: String,
                   price: BigDecimal, status: String,
                   var maxPrice: BigDecimal, var maxPriceOrderId: String,
                   var minPrice: BigDecimal, var minPriceOrderId: String)

val summaryEncoder = Encoders.product[summary]

val priceDF = mainDF.as[summary](summaryEncoder)
  .sortWithinPartitions(col("ttime"))
  .mapPartitions(iter => {
    // collection at partition level: key = order_id, value = price
    var priceCollection = Map[String, BigDecimal]()

    iter.map(row => {
      val orderId = row.order_id
      val rowPrice = row.price

      row.status match {
        case "Remove" => if (priceCollection.contains(orderId)) priceCollection -= orderId
        case _        => priceCollection += (orderId -> rowPrice)
      }

      if (priceCollection.nonEmpty) {
        val (maxId, maxPrice) = priceCollection.maxBy(_._2) // (order_id, price) entry with the max price
        val (minId, minPrice) = priceCollection.minBy(_._2) // (order_id, price) entry with the min price
        row.maxPrice = maxPrice
        row.maxPriceOrderId = maxId
        row.minPrice = minPrice
        row.minPriceOrderId = minId
      }
      row
    })
  })(summaryEncoder)

priceDF.show(20)
This runs fine and finishes within 20 minutes for smaller data sets, but for 23 million records (with 17 distinct product/contract combinations) the results do not look correct. I can see data from one partition (input split) of mapPartitions ending up in another partition, which messes up the values.
--> Can we achieve a situation in which I can guarantee that each mapPartitions task gets all of the data for the functional key (product and contract)?
As I understand it, mapPartitions executes the function on each Spark partition (similar to an input split in MapReduce), so how can I force Spark to create partitions that contain all of the values for a given product and contract group?
--> Is there any other approach to this problem?
Would really appreciate the help as we are stuck here.
Edit: Here's an article on why many small files are bad
Why is poorly compacted data bad?
Poorly compacted data is bad for Spark applications in the sense that
it is extremely slow to process. Continuing with our previous example,
anytime we want to process a day’s worth of events we have to open up
86,400 files to get to the data. This slows down processing massively
because our Spark application is effectively spending most of its time
just opening and closing files. What we normally want is for our Spark
application to spend most of its time actually processing the data.
We’ll do some experiments next to show the difference in performance
when using properly compacted data as compared to poorly compacted
data.
I bet that if you partitioned your source data to match how you're joining AND got rid of all those windows, you'd end up in a much better place.
Every time you hit partitionBy you force a shuffle, and every time you hit orderBy you force an expensive sort.
I would suggest you take a look at the Dataset API and learn some groupBy and flatMapGroups/reduce/sliding for O(n) time computation. You can get your min/max in one pass.
Furthermore, it sounds like your driver is running out of memory due to the many-little-files problem. Try to compact your source data as much as possible and properly partition your tables. In this particular case I would suggest partitioning by order_date (maybe daily?) and then sub-partitioning by product and contract; a sketch of such a layout follows.
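A minimal sketch of what that compacted, partitioned write could look like (the output path, function name and exact column choices are illustrative assumptions about your pipeline, not a drop-in fix):

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Hypothetical one-off compaction job: collapse the day's many small files into
// fewer, larger Parquet files, laid out by the keys used for joining/grouping.
def compactOrders(orders: DataFrame): Unit = {
  orders
    .repartition(col("product"), col("contract"))        // co-locate each group before writing
    .write
    .mode(SaveMode.Overwrite)
    .partitionBy("order_date", "product", "contract")    // directory layout: date, then product, then contract
    .parquet("/warehouse/order_table_compacted")          // illustrative output path
}

Subsequent jobs that filter on order_date, product and contract then only open the files under the matching directories.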
Here's a snippet that took me about 30 minutes to write and probably runs dramatically better than your windowing functions. It should run in O(n) time, but it won't make up for a many-small-files problem. Let me know if anything's missing.
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import scala.collection.mutable
case class Summary(
order_id: String,
ttime: String,
product: String,
contract: String,
order_date: String,
price: BigDecimal,
status: String,
maxPrice: BigDecimal = 0,
maxPriceOrderId: String = null,
minPrice: BigDecimal = 0,
minPriceOrderId: String = null
)
class Workflow()(implicit spark: SparkSession) {
import MinMaxer.summaryEncoder
val mainDs: Dataset[Summary] =
spark.sql(
"""
select order_id, ttime, product, contract, order_date, price, status
from order_table where order_date ='eod_date'
"""
).as[Summary]
MinMaxer.minMaxDataset(mainDs)
}
object MinMaxer {
implicit val summaryEncoder: Encoder[Summary] = Encoders.product[Summary]
implicit val groupEncoder: Encoder[(String, String)] = Encoders.product[(String, String)]
object SummaryOrderer extends Ordering[Summary] {
def compare(x: Summary, y: Summary): Int = x.ttime.compareTo(y.ttime)
}
def minMaxDataset(ds: Dataset[Summary]): Dataset[Summary] = {
ds
.groupByKey(x => (x.product, x.contract))
.flatMapGroups({ case (_, t) =>
val sortedRecords: Seq[Summary] = t.toSeq.sorted(SummaryOrderer)
generateMinMax(sortedRecords)
})
}
def generateMinMax(summaries: Seq[Summary]): Seq[Summary] = {
summaries.foldLeft(mutable.ListBuffer[Summary]())({case (b, summary) =>
if (b.lastOption.nonEmpty) {
val lastSummary: Summary = b.last
var minPrice: BigDecimal = 0
var minPriceOrderId: String = null
var maxPrice: BigDecimal = 0
var maxPriceOrderId: String = null
if (summary.status != "remove") {
if (lastSummary.minPrice >= summary.price) {
minPrice = summary.price
minPriceOrderId = summary.order_id
} else {
minPrice = lastSummary.minPrice
minPriceOrderId = lastSummary.minPriceOrderId
}
if (lastSummary.maxPrice <= summary.price) {
maxPrice = summary.price
maxPriceOrderId = summary.order_id
} else {
maxPrice = lastSummary.maxPrice
maxPriceOrderId = lastSummary.maxPriceOrderId
}
b.append(
summary.copy(
maxPrice = maxPrice,
maxPriceOrderId = maxPriceOrderId,
minPrice = minPrice,
minPriceOrderId = minPriceOrderId
)
)
} else {
b.append(
summary.copy(
maxPrice = lastSummary.maxPrice,
maxPriceOrderId = lastSummary.maxPriceOrderId,
minPrice = lastSummary.minPrice,
minPriceOrderId = lastSummary.minPriceOrderId
)
)
}
} else {
b.append(
summary.copy(
maxPrice = summary.price,
maxPriceOrderId = summary.order_id,
minPrice = summary.price,
minPriceOrderId = summary.order_id
)
)
}
b
})
}
}
The method that you have used to repartition the data, repartitionByRange, partitions the data on these column expressions but does a range partitioning. What you want is hash partitioning on these columns.
Change the method to repartition and pass the same columns to it; that should ensure that the same value groups end up in one partition. A sketch of the change is below.
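For illustration, a minimal sketch of that change applied to the mapPartitions pipeline from the question (same column names as in the question; everything after the repartition stays as before):

import org.apache.spark.sql.functions.col

// Hash-partition by the functional key so that all rows of a given (product, contract)
// pair land in the same partition, then order by transaction time within each partition.
val partitionedDF = mainDF
  .repartition(col("product"), col("contract"))   // hash partitioning instead of repartitionByRange
  .sortWithinPartitions(col("ttime"))

// partitionedDF can now be fed to .as[summary](summaryEncoder).mapPartitions(...) as before.

Note that hash partitioning guarantees a key is never split across partitions, but several (product, contract) groups can still share one partition, so the per-partition state inside mapPartitions may need to be keyed by (product, contract) as well (or the sort extended to product, contract, ttime) to keep groups from mixing.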
Related
I have a Hive table partitioned on snapshot_dt.
I have to delete records from each partition that have turned 180 days old or 7 years old, depending on a condition.
I am iterating over each partition one by one inside a for loop:
selecting all the records from the partition into the whole_data_df dataframe,
selecting the records to be deleted into the save_to_data dataframe,
using except to get the records I want to retain.
var cons_rec_to_retain_df = whole_data_df.except(save_to_data)
Now I want to overwrite that particular Hive partition with the records in the dataframe cons_rec_to_retain_df.
Please find my code below:
var query5=s"""select distinct snapshot_dt from workspace_x215579.ess_consldt_cust_prospect_ord_orc_test"""
var partition_load_dates = createDataframeFromSql(hc,query5)
var output = partition_load_dates.collect().toList
output.foreach(println)
for (d1<- output){
var partition_date:String=d1(0).toString
println("=====================this is the partition date========================")
println(partition_date)
var whole_data = s"""select * From workspace_x215579.ess_consldt_cust_prospect_ord_orc_test where snapshot_dt='$partition_date'"""
println("=====================this is whole partition data df========================")
var whole_data_df = createDataframeFromSql(hc,whole_data)
whole_data_df.show(5,false)
var select_data = s"""select * From workspace_x215579.ess_consldt_cust_prospect_ord_orc_test where snapshot_dt='$partition_date'
and
bus_bacct_num_src_id = 130
and
(bacct_status_cd = 'T' and datediff(to_date(current_timestamp()), snapshot_dt) > 180)
OR
(bacct_status_cd <> 'T' and cast(datediff(to_date(current_timestamp()), snapshot_dt)/365.25 as int) > 7)
"""
var save_to_data = createDataframeFromSql(hc,select_data)
println("=====================this is df to be removed========================")
save_to_data.show(5,false)
var cons_rec_to_retain_df = whole_data_df.except(save_to_data)
println("=====================this is df REcord to be inserted=======================")
cons_rec_to_retain_df.show(5,false)
println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
cons_rec_to_retain_df.show(5)
cons_rec_to_retain_df.write.mode("append").format("orc").insertInto("workspace_x215579.ess_consldt_cust_prospect_ord_orc_test")
}//end of for loop
At the start I am setting:
hc.sql("""set hive.exec.dynamic.partition=true""")
hc.sql("""set hive.exec.dynamic.partition.mode=nonstrict""")
After running the code, the record count is the same as before. Can someone please suggest a proper way of doing this?
The table named student_master in BigQuery has 70,000 rows, and I would like to retrieve rows using this query. I get no error when doing this; however, it only retrieves 52,226 rows (i.e., not all of them). I tried using row_number() over a partition as in the code below, but still didn't get all the data. What should I do?
I also tried the idea of using two queries ordered by id_student with limit 35000, one ascending (query1) and one descending (query2), but that will not work if the data grows (say, to 200,000 rows).
data = []
sql = ("SELECT id_student, class, name, "
       "ROW_NUMBER() OVER (PARTITION BY class ORDER BY class ASC) row_num "
       "FROM [project_name.dataset.student_master] "
       "WHERE NOT class = " + element['class'])

query = client.run_sync_query(sql)
query.timeout_ms = 20000
query.run()

for row in query.rows:
    data.append(row)
return data
I was able to gather 200,000+ rows by querying a public dataset, verified by using a counter variable:
query_job = client.query("""
SELECT ROW_NUMBER() OVER (PARTITION BY token_address ORDER BY token_address ASC) as row_number,token_address
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE token_address = '0x001575786dfa7b9d9d1324ec308785738f80a951'
ORDER BY 1
""")
contador = 0
for row in query_job:
    contador += 1
    print(contador, row)
In general, for big exports you should run an export job which will place your data into files in GCS.
https://cloud.google.com/bigquery/docs/exporting-data
But in this case you might just need to go through more pages of results:
If the rows returned by the query do not fit into the initial response, then we need to fetch the remaining rows via fetch_data():
query = client.run_sync_query(LIMITED)
query.timeout_ms = TIMEOUT_MS
query.max_results = PAGE_SIZE
query.run() # API request
assert query.complete
assert query.page_token is not None
assert len(query.rows) == PAGE_SIZE
assert [field.name for field in query.schema] == ['name']
iterator = query.fetch_data() # API request(s) during iteration
for row in iterator:
    do_something_with(row)
https://gcloud-python.readthedocs.io/en/latest/bigquery/usage.html
I am using RODBC to pull down server data into a data frame using this statement:
df <- data.frame(
sqlQuery(
channel = ODBC_channel_string,
query = SQLquery_string
)
)
The resultant data set has the following grouping attributes of interest:
Scenario
Other Group By 1
Other Group By 2
...
Other Group By K
With key variables:
[Time] = Future Year(i)
[Spot] = Projected Effective Discount Rate For Year(i-1) to Year(i)
Abbreviated Table Snip
What I would like to do is transform the [Spot] column into a discount factor that is dependent on consistent preceding values:
Scenario
Other Group By 1
Other Group By 2
...
Other Group By K
With key variables:
[Time] = Future Year(i)
[Spot] = Projected Effective Discount Rate For Year(i-1) to Year(i)
[Disc_0] = prod([Value]), for [All Grouping] = [This Grouping] and [Time] <= [This Time]
Excel Version of Abbreviated Goal Snip
I could code the solution using a for loop, but I suspect that will be very inefficient in R if there are significant row counts in the original data frame.
What I am hoping is to use some creative implementation of dplyr's mutate:
df %>% mutate(Disc_0 = objective_function{?})
I think that R should be able to do this kind of data wrangle quickly, but I am not sure if that is the case. I am more familiar with SQL and may attempt to produce the necessary variable there.
I'm trying to compare scraped retail item price data in BigQuery (~2-3B rows depending on the time period and retailers included); with the intent to identify meaningful price differences. For example $1.99 vs $2.00 isn't meaningful, but $1.99 vs $2.50 is meaningful. Meaningful is quantified as a 2% difference between prices.
Example dataset for one item looks like this:
ITEM    Price($)   Meaningful (this is the column I'm trying to flag)
Apple   $1.99      Y  (the lowest price is always flagged)
Apple   $2.00      N  ($1.99 vs $2.00)
Apple   $2.01      N  ($1.99 vs $2.01)  still using $1.99 for comparison
Apple   $2.50      Y  ($1.99 vs $2.50)  still using $1.99 for comparison
Apple   $2.56      Y  ($2.50 vs $2.56)  now using $2.50 as the new comparison price
Apple   $2.62      Y  ($2.56 vs $2.62)  now using $2.56 as the new comparison price
I was hoping to solve the problem just using SQL window functions (lead, lag, partition over, etc.), comparing the current row's price to the one in the following row. However, that doesn't work correctly once I reach a non-meaningful price, because I always want the next value to be compared to the most recent meaningful price (see the $2.50 row in the example above, which is compared to $1.99 and NOT to the $2.01 in the prior row).
My Questions:
Is it possible to solve this with SQL alone in BigQuery? (e.g. What creative SQL logic solution am I overlooking, like bucketing based on the variance amounts?)
What programmatic options do I have since I can't use stored procedures with BQ? Python/Dataframes in GCP Datalab? BQ UDFs?
Below is for BigQuery Standard SQL
#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
var result = [];
var last = 0;
var flag = '';
for (i = 0; i < prices.length; i++){
if (i == 0) {
last = prices[i];
flag = 'Y'
} else {
if ((prices[i] - last)/last > 0.02) {
last = prices[i];
flag = 'Y'
} else {flag = 'N'}
}
var rec = [];
rec.price = prices[i];
rec.flag = flag;
result.push(rec);
}
return result;
""";
SELECT item, rec.*
FROM (
SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
FROM `yourTable`
GROUP BY item
), UNNEST(x(prices) ) AS rec
-- ORDER BY item, price
You can play with / test it using the dummy data from your question below:
#standardSQL
CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
LANGUAGE js AS """
var result = [];
var last = 0;
var flag = '';
for (i = 0; i < prices.length; i++){
if (i == 0) {
last = prices[i];
flag = 'Y'
} else {
if ((prices[i] - last)/last > 0.02) {
last = prices[i];
flag = 'Y'
} else {flag = 'N'}
}
var rec = [];
rec.price = prices[i];
rec.flag = flag;
result.push(rec);
}
return result;
""";
WITH `yourTable` AS (
SELECT 'Apple' AS item, 1.99 AS price UNION ALL
SELECT 'Apple', 2.00 UNION ALL
SELECT 'Apple', 2.01 UNION ALL
SELECT 'Apple', 2.50 UNION ALL
SELECT 'Apple', 2.56 UNION ALL
SELECT 'Apple', 2.62
)
SELECT item, rec.*
FROM (
SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
FROM `yourTable`
GROUP BY item
), UNNEST(x(prices) ) AS rec
ORDER BY item, price
Result is as below
item price flag
---- ----- ----
Apple 1.99 Y
Apple 2.0 N
Apple 2.01 N
Apple 2.5 Y
Apple 2.56 Y
Apple 2.62 Y
This is my query; it returns the accurate result that I want. I want to write this in LINQ.
select i.reportdate,co.naam,i.issueid,i.vrijetekst,i.lockuser,te.teamnaam, count(ie.issueid) as events, sum(ie.bestedetijd) as Tijd
from company co,hoofdcontracten hc,subcontracten sc,sonderhoud so,g2issues i,g2issueevents ie, g2issuestatus iss,teams te,locatie l
Where co.companyid = hc.companyid And
hc.hcontractid = sc.hcontractid and
so.scontractid = sc.scontractid and
sc.scontractid = i.scontractid and
i.issueid = ie.issueid and
so.teamid = te.teamid and
ie.locatieid = l.locatieid and
l.bezoek = 0 and
i.issuestatusid = iss.issuestatusid and
fase < 7 and
co.companyid <> 165
group by i.reportdate,co.naam,i.issueid,i.vrijetekst,i.lockuser,te.teamnaam ,i.reportdate
having sum(ie.bestedetijd)>123
I am trying this but am confused at the select clause. How can I use an aggregate function in the select clause together with a group by clause?
var myList = (from co in _context.Company
from hc in _context.Hoofdcontracten
from sc in _context.Subcontracten
from so in _context.Sonderhoud
from i in _context.G2issues
from ie in _context.G2issueEvents
from iss in _context.G2issueStatus
from te in _context.Teams
from l in _context.Locatie
where
co.CompanyId == hc.CompanyId
&& hc.HcontractId == sc.HcontractId
&& so.ScontractId == sc.ScontractId
&& sc.ScontractId == i.ScontractId
&& i.IssueId == ie.IssueId
&& so.Teamid == te.Teamid
&& ie.LocatieId == l.LocatieId
&& l.Bezoek == false
&& i.IssuestatusId == iss.IssueStatusId
&& iss.Fase < 7
&& co.CompanyId != 165
select new { }).ToList();
My! Someone was trying to save a few minutes of typing time by using all kinds of undecipherable variable names like hc, sc, so, i, ie, iss, without considering that this makes it take ten times longer to understand what happens.
I haven't tried to decipher your query fully; apparently you thought it was not needed to show your entity framework classes and the relations between them.
What I gather is that you want to perform a big join and then select several columns from the join. You want to group the resulting collection into groups of items that have the same reportdate, naam, issueid, .... You want the sum of bestedetijd over all items in each group, and you want the number of items in the group.
LINQ has two types of syntaxes which in fact do the same: Query syntax and Method syntax. Although I can read the query syntax, I'm more familiar with the method syntax, so my answer will be using the method syntax. I guess you'll understand what happens.
I'll do it in smaller steps, you can concatenate all steps into one if you want.
First you do a big join, after which you select some columns. A big join is one of the few statements that are easier when written in query syntax (See SO: LINQ method syntax for multiple left join)
var bigJoin = from co in _context.Company
              from hc in _context.Hoofdcontracten
from sc in _context.Subcontracten
from so in _context.Sonderhoud
...
from l in _context.Locatie
where
co.CompanyId == hc.CompanyId
&& hc.HcontractId == sc.HcontractId
...
&& iss.Fase < 7
&& co.CompanyId != 165
select new
{
hc,
sc,
so,
... // etc select only items you'll use to create your end result
};
Now you've got a big table with the join results. You want to divide this table into groups with the same value for
{
i.reportdate,
co.naam,
i.issueid,
i.vrijetekst,
i.lockuser,
te.teamnaam,
}
(by the way: you mentioned reportdate twice in your GroupBy, I guess that's not what you meant)
This grouping is done using Enumerable.GroupBy.
var groups = bigJoin.GroupBy(joinedItem => new
{   // the items to group on: all elements in a group have the same values for these
    ReportDate = joinedItem.i.reportdate,
    CompanyName = joinedItem.co.naam,
    IssueId = joinedItem.i.issueid,
    FreeText = joinedItem.i.vrijetekst,
    LockUser = joinedItem.i.lockuser,
    TeamName = joinedItem.te.teamnaam,
});
The result is a collection of groups. Each group contains the original bigJoin items that have the same value for for ReportDate, CompanyName, etc. This common value is in group.Key
Now from every group you want the following items:
Several of the common values in the group (ReportDate, CompanyName, IssueId, ...). We get them from the Key of the group
Tijd = the sum of ie.bestedeTijd of all elements in the group
Events = the count of ie.IssueId over all elements in the group. As every element in the group has exactly one ie.IssueId, this is simply the number of elements in the group.
From this result, you only want to keep those with a Tijd > 123
So we'll do a new select on the groups, followed by a Where on Tijd
var result = groups.Select(group => new
{
Tijd = group.Sum(groupElement => groupElement.ie.bestedetijd),
Events = group.Count(),
// the other fields can be taken from the key
ReportDate = group.Key.ReportDate,
CompanyName = group.Key.CompanyName,
IssueId = group.Key.IssueId,
FreeText = group.Key.FreeText,
LockUser = group.Key.LockUser,
TeamName = group.Key.TeamName,
})
.Where(resultItem => resultItem.Tijd > 123);