We have a table in Hive that stores trading order data for each end of day, keyed by order_date. The other important columns are:
product
contract
price (price of the order placed)
ttime (transaction time)
status (Insert, Update or Remove)
We have to build a charting table, in tick-data fashion, from the main table, containing the max and min price orders for each row (order) from the morning when the market opened up to that row's time. That is, for a given order we would populate four columns: maxPrice (max price so far), maxPriceOrderId (order_id of the max price), minPrice and minPriceOrderId.
This has to be done per product and contract, i.e. the max and min prices are taken among rows with the same product and contract.
While calculating these values we need to exclude all closed orders from the aggregation, i.e. the max and min over all order prices so far must exclude orders with status "Remove".
We are using Spark 2.2 and the input data format is Parquet.
Input records
Output Records
To give a simple SQL view, the problem works out as a self join and would look like this:
With the data set ordered on ttime, we have to get the max and min prices for a particular product and contract for each row (order) from the morning up to that order's time. This will run in batch for each EOD (order_date) data set:
select mainSet.order_id, mainSet.product, mainSet.contract, mainSet.order_date, mainSet.price, mainSet.ttime, mainSet.status,
       max(aggSet.price) over (partition by mainSet.product, mainSet.contract, mainSet.ttime) as max_price,
       first_value(aggSet.order_id) over (partition by mainSet.product, mainSet.contract, mainSet.ttime
                                          order by aggSet.price desc, aggSet.ttime desc) as maxOrderId,
       min(aggSet.price) over (partition by mainSet.product, mainSet.contract, mainSet.ttime) as min_price,
       first_value(aggSet.order_id) over (partition by mainSet.product, mainSet.contract, mainSet.ttime
                                          order by aggSet.price, aggSet.ttime) as minOrderId
from order_table mainSet
join order_table aggSet
  on mainSet.product  = aggSet.product
 and mainSet.contract = aggSet.contract
 and mainSet.ttime   >= aggSet.ttime
 and aggSet.status   <> 'Remove'
Writing it in Spark:
We started with Spark SQL like below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, max, min}

val mainDF: DataFrame = sparkSession.sql("select * from order_table where order_date = 'eod_date'")

val ndf = mainDF.alias("mainSet").join(mainDF.alias("aggSet"),
    col("mainSet.product") === col("aggSet.product")
      && col("mainSet.contract") === col("aggSet.contract")
      && col("mainSet.ttime") >= col("aggSet.ttime")
      && col("aggSet.status") =!= "Remove",
    "inner")
  .select(col("mainSet.order_id"), col("mainSet.ttime"), col("mainSet.product"), col("mainSet.contract"),
    col("mainSet.order_date"), col("mainSet.price"), col("mainSet.status"),
    col("aggSet.order_id").as("agg_orderid"), col("aggSet.ttime").as("agg_ttime"), col("aggSet.price").as("agg_price")) // renaming of columns
val max_window = Window.partitionBy(col("product"),col("contract"),col("ttime"))
val min_window = Window.partitionBy(col("product"),col("contract"),col("ttime"))
val maxPriceCol = max(col("agg_price")).over(max_window)
val minPriceCol = min(col("agg_price")).over(min_window)
val firstMaxorder = first(col("agg_orderid")).over(max_window.orderBy(col("agg_price").desc, col("agg_ttime").desc))
val firstMinorder = first(col("agg_orderid")).over(min_window.orderBy(col("agg_price"), col("agg_ttime")))
val priceDF= ndf.withColumn("max_price",maxPriceCol)
.withColumn("maxOrderId",firstMaxorder)
.withColumn("min_price",minPriceCol)
.withColumn("minOrderId",firstMinorder)
priceDF.show(20)
Volume Stats:
average count: 7 million records
average count per group (product, contract): 600K
The job runs for hours and just does not finish. I have tried increasing memory and other parameters but no luck.
The job gets stuck, and many times I get memory issues: Container killed by YARN for exceeding memory limits. 4.9 GB of 4.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
ANOTHER APPROACH:
Repartition on our lowest-level group columns (product and contract) and then sort within partitions on ttime, so that the mapPartitions function receives each row ordered by time.
Execute mapPartitions while maintaining a collection (order_id as key, price as value) at partition level to calculate the max and min price and their order ids.
We keep removing orders with status "Remove" from the collection as and when we receive them.
Once the collection is updated for a given row within mapPartitions, we can calculate the max and min values from the collection and return the updated row.
val mainDF: DataFrame= sparkSession.sql("select order_id,product,contract,order_date,price,status,null as maxPrice,null as maxPriceOrderId,null as minPrice,null as minPriceOrderId from order_table where order_date ='eod_date' ").repartitionByRange(col("product"),col("contract"))
case class summary(order_id:String ,ttime:string,product:String,contract :String,order_date:String,price:BigDecimal,status :String,var maxPrice:BigDecimal,var maxPriceOrderId:String ,var minPrice:BigDecimal,var minPriceOrderId String)
val summaryEncoder = Encoders.product[summary]
val priceDF= mainDF.as[summary](summaryEncoder).sortWithinPartitions(col("ttime")).mapPartitions( iter => {
//collection at partition level
//key as order_id and value as price
var priceCollection = Map[String, BigDecimal]()
iter.map( row => {
val orderId= row.order_id
val rowprice= row.price
priceCollection = row.status match {
case "Remove" => if (priceCollection.contains(orderId)) priceCollection -= orderId
case _ => priceCollection += (orderId -> rowPrice)
}
row.maxPrice = if(priceCollection.size > 0) priceCollection.maxBy(_._2)._2 // Gives key,value tuple from collectin for max value )
row.maxPriceOrderId = if(priceCollection.size > 0) priceCollection.maxBy(_._2)._1
row.minPrice = if(priceCollection.size > 0) priceCollection.minBy(_._2)._2 // Gives key,value tuple from collectin for min value )
row.minPriceOrderId = if(priceCollection.size > 0) priceCollection.minBy(_._2)._1
row
})
}).show(20)
This runs fine and finishes within 20 minutes for smaller data sets, but I found that for 23 million records (with 17 different products and contracts) the results are not correct. I can see that data from one partition (input split) of mapPartitions is going to another partition and thus messing up the values.
--> Can we achieve a situation in which I can guarantee that each mapPartitions task gets all of the data for the functional key (product and contract)?
As far as I know, mapPartitions executes a function on each Spark partition (similar to input splits in MapReduce), so how can I force Spark to create input splits/partitions that contain all of the values for that product and contract group?
--> Is there any other approach to this problem?
Would really appreciate the help as we are stuck here.
Edit: Here's an article on why many small files are bad
Why is poorly compacted data bad?
Poorly compacted data is bad for Spark applications in the sense that
it is extremely slow to process. Continuing with our previous example,
anytime we want to process a day’s worth of events we have to open up
86,400 files to get to the data. This slows down processing massively
because our Spark application is effectively spending most of its time
just opening and closing files. What we normally want is for our Spark
application to spend most of its time actually processing the data.
We’ll do some experiments next to show the difference in performance
when using properly compacted data as compared to poorly compacted
data.
I bet if you properly partitioned your source data to match how you're joining AND got rid of all those windows, you'd end up in a much better place.
Every time you hit partitionBy, you are forcing a shuffle and every time you hit orderBy, you are forcing an expensive sort.
I would suggest you take a look at the Dataset API and learn some groupBy and flatMapGroups/reduce/sliding for O(n) time computation. You can get your min/max in one pass.
Furthermore, it sounds like your driver is running out of memory due to the many little files problem. Try to compact your source data as much as possible and properly partition your tables. In this particular case I would suggest a partitioning by order_date (maybe daily?) then sub partitions of product and contract.
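For illustration only, here is a rough sketch of what writing the source table with that layout could look like (the rawOrders DataFrame name and the output path are assumptions, not part of the original setup):
import org.apache.spark.sql.functions.col

// Sketch: compact the raw order data and lay it out by the suggested partition columns,
// so a day's worth of data for one product/contract can be read without opening many tiny files.
rawOrders
  .repartition(col("order_date"), col("product"), col("contract"))  // keeps the file count per output partition low
  .write
  .partitionBy("order_date", "product", "contract")
  .mode("overwrite")
  .parquet("/warehouse/order_table_compacted")  // illustrative path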
Here's a snippet for the min/max computation itself that took me about 30 minutes to write and probably runs far better than your windowing approach. It should run in O(n) time, but it doesn't make up for a many-small-files problem. Let me know if anything's missing.
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import scala.collection.mutable
case class Summary(
order_id: String,
ttime: String,
product: String,
contract: String,
order_date: String,
price: BigDecimal,
status: String,
maxPrice: BigDecimal = 0,
maxPriceOrderId: String = null,
minPrice: BigDecimal = 0,
minPriceOrderId: String = null
)
class Workflow()(implicit spark: SparkSession) {
import MinMaxer.summaryEncoder
val mainDs: Dataset[Summary] =
spark.sql(
"""
select order_id, ttime, product, contract, order_date, price, status
from order_table where order_date ='eod_date'
"""
).as[Summary]
MinMaxer.minMaxDataset(mainDs)
}
object MinMaxer {
implicit val summaryEncoder: Encoder[Summary] = Encoders.product[Summary]
implicit val groupEncoder: Encoder[(String, String)] = Encoders.product[(String, String)]
object SummaryOrderer extends Ordering[Summary] {
def compare(x: Summary, y: Summary): Int = x.ttime.compareTo(y.ttime)
}
def minMaxDataset(ds: Dataset[Summary]): Dataset[Summary] = {
ds
.groupByKey(x => (x.product, x.contract))
.flatMapGroups({ case (_, t) =>
val sortedRecords: Seq[Summary] = t.toSeq.sorted(SummaryOrderer)
generateMinMax(sortedRecords)
})
}
def generateMinMax(summaries: Seq[Summary]): Seq[Summary] = {
summaries.foldLeft(mutable.ListBuffer[Summary]())({case (b, summary) =>
if (b.lastOption.nonEmpty) {
val lastSummary: Summary = b.last
var minPrice: BigDecimal = 0
var minPriceOrderId: String = null
var maxPrice: BigDecimal = 0
var maxPriceOrderId: String = null
if (summary.status != "Remove") { // match the "Remove" status value used in the source data
if (lastSummary.minPrice >= summary.price) {
minPrice = summary.price
minPriceOrderId = summary.order_id
} else {
minPrice = lastSummary.minPrice
minPriceOrderId = lastSummary.minPriceOrderId
}
if (lastSummary.maxPrice <= summary.price) {
maxPrice = summary.price
maxPriceOrderId = summary.order_id
} else {
maxPrice = lastSummary.maxPrice
maxPriceOrderId = lastSummary.maxPriceOrderId
}
b.append(
summary.copy(
maxPrice = maxPrice,
maxPriceOrderId = maxPriceOrderId,
minPrice = minPrice,
minPriceOrderId = minPriceOrderId
)
)
} else {
b.append(
summary.copy(
maxPrice = lastSummary.maxPrice,
maxPriceOrderId = lastSummary.maxPriceOrderId,
minPrice = lastSummary.minPrice,
minPriceOrderId = lastSummary.minPriceOrderId
)
)
}
} else {
b.append(
summary.copy(
maxPrice = summary.price,
maxPriceOrderId = summary.order_id,
minPrice = summary.price,
minPriceOrderId = summary.order_id
)
)
}
b
})
}
}
The method that you have used to repartition the data, repartitionByRange, partitions the data on these column expressions but does range partitioning. What you want is hash partitioning on these columns.
Change the method to repartition and pass these columns to it; that should ensure that the same value groups end up in one partition.
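A minimal sketch of that change, reusing the mainDF from the question (column names are assumed to match the question's schema):
import org.apache.spark.sql.functions.col

// Hash-partition on the functional key so that every row for a given (product, contract)
// lands in the same partition, then sort by transaction time within each partition
// before running mapPartitions.
val partitionedDS = mainDF
  .repartition(col("product"), col("contract"))   // instead of repartitionByRange
  .sortWithinPartitions(col("ttime"))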
I'm new to DAX and I have a problem that I don't know how to solve. I'll simplify it with an artificial example. I'm working in the context of an SSAS tabular model.
Let's say I have a factory of "zirkbols" (an invented product) and a table representing the sales of zirkbols. Each customer bought a different number of zirkbols and gave a rating from 1 to 5.
The table looks like this:
with this code to generate it:
= DATATABLE(
"ClientId"; INTEGER;
"CountryCode"; STRING;
"OrderDate"; DATETIME;
"OrderAmount"; DOUBLE;
"Rating"; INTEGER;
{
{123; "US"; "2018-01-01"; 502; 1};
{124; "US"; "2018-01-01"; 400; 4};
{125; "US"; "2018-01-03"; 60; 5};
{126; "US"; "2018-01-02"; 160; 4};
{124; "US"; "2018-01-05"; 210; 3};
{128; "JP"; "2018-01-03"; 22; 5};
{129; "JP"; "2018-01-07"; 540; 2};
{130; "JP"; "2018-01-03"; 350; 4};
{131; "JP"; "2018-01-09"; 405; 4};
{132; "JP"; "2018-01-09"; 85; 5}
}
)
I need to create measures that give me statistics for the sample of clients that account for 30% of my sales, taken from the most satisfied ones. This means that I need to rank by "Rating" and sum the "OrderAmount" values until I reach at least 30% of the total. This sample is my happy zirkbol owners. For these happy zirkbol owners I would like to know, for example, their average rating.
I think this could be easier if I could put the running total of the order amounts in a calculated column, but I would like to give the analyst the possibility to filter, for example, only the "US" sales, and I don't know if that is possible in a calculated column.
On the other hand I suppose that the ranking by rating can be stored in a calculated column (Ranking = RANK.EQ([Rating];ClientOrders[Rating])).
I expect the following result:
As I said, I'm new to SSAS and DAX, so I don't know if I'm approaching this problem from the wrong angle...
Regards,
Nicola
P.S. Please see the comments on the accepted answer as well
I've got some DAX mostly working, but I'll need to come back to it.
In the meantime, here's some of the code:
Happy owners amount =
VAR Summary =
SUMMARIZE (
Orders,
Orders[CountryCode],
Orders[ClientId],
Orders[Rating],
"Amount", SUM ( Orders[OrderAmount] )
)
VAR Ranked =
ADDCOLUMNS ( Summary, "Rank", RANKX ( Summary, Orders[Rating] + 1 / [Amount] ) )
VAR Cumulative =
ADDCOLUMNS (
Ranked,
"CumAmt", CALCULATE (
SUM ( Orders[OrderAmount] ),
FILTER ( Ranked, [Rank] <= EARLIER ( [Rank] ) )
)
)
VAR CutOff =
MINX (
FILTER (
Cumulative,
[CumAmt]
> 0.3 * CALCULATE ( SUM ( Orders[OrderAmount] ), ALLSELECTED ( Orders ) )
),
[Rank]
)
RETURN
SUMX ( FILTER ( Cumulative, [Rank] <= CutOff ), [Amount] )
I have a query that returns the probability that a token has a certain classification.
token class probPaired
---------- ---------- ----------
potato A 0.5
potato B 0.5
potato C 1.0
potato D 0.5
time A 0.5
time B 1.0
time C 0.5
I need to aggregate the probabilities of each class by multiplying them together.
-- Imaginary MUL operator
select class, MUL(probPaired) from myTable group by class;
class probability
---------- ----------
A 0.25
B 0.5
C 0.5
D 0.5
How can I do this in SQLite? SQLite doesn't have features like LOG/EXP or variables, which are the solutions mentioned in other questions.
In general, if SQLite can't do it you can write a custom function instead. The details depend on what programming language you're using, here it is in Perl using DBD::SQLite. Note that functions created in this way are not stored procedures, they exist for that connection and must be recreated each time you connect.
For an aggregate function, you have to create a class which handles the aggregation. MUL is pretty simple, just an object to store the product.
{
package My::SQLite::MUL;
sub new {
my $class = shift;
my $mul = 1;
return bless \$mul, $class;
}
sub step {
my $self = shift;
my $num = shift;
$$self *= $num;
return;
}
sub finalize {
my $self = shift;
return $$self;
}
}
Then you'd install that as the aggregate function MUL which takes a single argument and uses that class.
my $dbh = ...doesn't matter how the connection is made...
$dbh->sqlite_create_aggregate("MUL", 1, "My::SQLite::MUL");
And now you can use MUL in queries.
my $rows = $dbh->selectall_arrayref(
"select class, MUL(probPaired) from myTable group by class"
);
Again, the details will differ with your particular language, but the basic idea will be the same.
This is significantly faster than fetching each row and taking the aggregate product.
You can calculate row numbers and then use a recursive CTE for the multiplication. Then get the max rnum (calculated row number) value for each class, which holds the final result of the multiplication.
--Calculating row numbers
with rownums as (select t1.*,
(select count(*) from t t2 where t2.token<=t1.token and t1.class=t2.class) as rnum
from t t1)
--Getting the max rnum for each class
,max_rownums as (select class,max(rnum) as max_rnum from rownums group by class)
--Recursive cte starts here
,cte(class,rnum,probPaired,running_mul) as
(select class,rnum,probPaired,probPaired as running_mul from rownums where rnum=1
union all
select t.class,t.rnum,t.probPaired,c.running_mul*t.probPaired
from cte c
join rownums t on t.class=c.class and t.rnum=c.rnum+1)
--Final value selection
select c.class,c.running_mul
from cte c
join max_rownums m on m.max_rnum=c.rnum and m.class=c.class
I have a Pig Latin question. I have a table with the following:
ID:Seller:Price:BID
1:John:20:B1
1:Ben:25:B1
2:John:60:B2
2:Chris:35:B2
3:John:20:B3
I'm able to group the table by ID using the following (assuming A is the loaded relation):
W = GROUP A BY ID;
But what I can't seem to figure out is the command to only return the values for the lowest price for each ID. In this example the final output should be:
1:John:20:B1
2:Chris:35:B2
3:John:20:B3
Cheers,
Shivedog
Generally you'll want to GROUP by the BID, then use MIN. However, since you want the whole tuple associated with the minimum value you'll want to use a UDF to do this.
myudfs.py
@outputSchema('vals: (ID: int, Seller: chararray, Price: chararray, BID: chararray)')
def get_min_tuple(bag):
return min(bag, key=lambda x: x[2])
myscript.pig
register 'myudfs.py' using jython as myudfs ;
-- A: (ID: int, Seller: chararray, Price: chararray, BID: chararray)
B = GROUP A BY BID ;
C = FOREACH B GENERATE group AS BID, FLATTEN(myudfs.get_min_tuple(A)) ;
-- Now you can do any further JOINs you need on C
Remember to change the types (int, chararray, etc.) to the appropriate values.
Note: If multiple items in A have the same minimum price for an ID, then this will only return one of them.
option (1) - get all records with the minimum price:
Use the new (Pig 0.11) RANK operator:
A = LOAD ...;
B = RANK A BY Price ASC;
C = FILTER B BY $0 == 1;
option (2) - get all records with the minimum price:
Pig versions below 0.11:
a = load ...;
b = group a by all;
c = foreach b generate MIN(a.price) as minprice;
d = JOIN a BY price, c BY minprice;
option (3) - use org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField to get one of the tuples with the minimum price (Price is the 3rd field, 1-based):
define mMin org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField( '3', 'min' );
a = load ...;
b = group a by all;
c = foreach b generate mMin(a);
How do you convert a SQL query with nested SELECT statements to a LINQ statement?
I have the following SQL statement which outputs the results I need, but I'm not sure how to replicate this in LINQ.
SELECT X.ITMGEDSC, (SUM(X.[ENDQTY_I]) - SUM(X.[ORDERLINES])) AS AVAIL
FROM (SELECT T1.[MANUFACTUREORDER_I], T2.[ITMGEDSC], T1.[ENDQTY_I],
             (SELECT (COUNT(VW.[MANUFACTUREORDER_I]) - 1)
              FROM [PLCD].dbo.[vw_WIP_Srl] VW
              WHERE VW.[MANUFACTUREORDER_I] = T1.[MANUFACTUREORDER_I]
              GROUP BY VW.[MANUFACTUREORDER_I]) AS ORDERLINES
      FROM [PLCD].dbo.[vw_WIP_Srl] T1
      INNER JOIN [PLCD].dbo.IV00101 T2 ON T2.ITEMNMBR = T1.ITEMNMBR
      GROUP BY T1.[MANUFACTUREORDER_I], T2.[ITMGEDSC], T1.[ENDQTY_I]) AS X
GROUP BY X.ITMGEDSC
ITEMNMBR is the ID of an item including a revision number, for example A1008001. The last 3 digits denote the revision, so A1008001 and A1008002 are the same item, just different revisions. In my query I need to treat these as the same item and output only the quantity for the parent item number (A1008). This parent item number is the column IV00101.ITMGEDSC.
The above code would take the following data
MANUFACTUREORDER_I ITEMNMBR ENDQTY_I
MAN00003140 A1048008 15
MAN00003507 A1048008 1
MAN00004880 A10048001 15
MAN00004880 A10048001 15
MAN00004880 A10048001 15
and output the following results
ITEMNMBR QTY
A1048008 16
A10048001 13*
The reason that this value is 13 and NOT 45 is that the rows are all part of the same MANUFACTUREORDER_I. In the system this means that there were 15 in stock, but two of these have since been transacted out of stock to be used. Hence the 3 rows: one for the goods coming into stock, the other two for the two items going out of stock (ignore the quantity in those rows).
As I mentioned at the start, the SQL above gives me the output I'm after but I'm unsure how to replicate this in Linq.
UPDATE - JEFF'S ORIGINAL SOLUTION
var query = from item in db.vw_WIP_Srls
group new { item.MANUFACTUREORDER_I, item.ENDQTY_I } by item.ITEMNMBR into items
select new
{
ItemNumber = items.Key,
QtyAvailable = (from item in items
//assumes quantities equal per order number
group 1 by item into orders
select orders.Key.ENDQTY_I - (orders.Count() - 1))
.Sum()
};
Here you go. Unfortunately I couldn't see the comment you left me anymore but I believe this should be equivalent. I changed the names to match closely with your query.
var query = from a in db.vw_WIP_Srl
join b in db.IV00101 on a.ITEMNMBR equals b.ITEMNMBR
group new { a.MANUFACTUREORDER_I, a.ENDQTY_I } by b.ITMGEDSC into g
select new
{
ITMGEDSC = g.Key,
AVAIL = (from item in g
group 1 by item into orders
select orders.Key.ENDQTY_I - (orders.Count() - 1))
.Sum()
};