OPTAPLANNER: How to apply GroupBy to MultiConstraintStream

I have some doubts about how GroupBy works in a MultiConstraintStream. I need to group entries by two fields at the same time, and I am unsure how to do it.
For context, I am trying to build a constraint in OptaPlanner for a job scheduling problem. I want to limit the maximum amount of output that can be produced per day for each different type of job.
The constraint would go like this...
private Constraint MaximumDailyOutput(ConstraintFactory constraintFactory) {
// Limits maximum output per day.
return constraintFactory.forEach(TimeSlotOpta.class) // iterate for each timeslot (days)
// join time slots with jobs
.join(JobOpta.class)
// filter if jobs are being done that day
.filter((timeslot, job) -> job.isActive(timeslot.getDay()))
// join with job types, and filter, not sure if this is necessary or optimal
.join(JobTypeOpta.class)
.filter((timeSlot, job, jobType) -> job.getJobType() == jobType)
// HERE: now I would like to group the jobs that are active
// during a time slot and that are of the same type (job.getJobType()).
// For each group obtained, I need to sum the outputs of the jobs,
// which can be obtained using job.getDailyOutput().
// Therefore, for each day (timeslot) and for each job type,
// I should obtain a sum that cannot overcome
// the daily maximum for that job type (jobType.getMaximumDailyOutput())
.groupBy((timeSlot, job, jobType) -> ...)
...
.penalize("Maximum daily output exceeded", HardMediumSoftScore.ONE_HARD,
(timeSlot, jobType, dailyOutput) -> dailyOutput - jobType.getMaximumDailyOutput());
}

You can do this by specifying multiple group key functions in your groupBy:
private Constraint MaximumDailyOutput(ConstraintFactory constraintFactory) {
// Limits maximum output per day.
return constraintFactory.forEach(TimeSlotOpta.class) // iterate for each timeslot (days)
// join time slots with jobs
.join(JobOpta.class)
// filter if jobs are being done that day
.filter((timeslot, job) -> job.isActive(timeslot.getDay()))
// join with job types, and filter, not sure if this is necessary or optimal
.join(JobTypeOpta.class)
.filter((timeSlot, job, jobType) -> job.getJobType() == jobType)
// calculate total output for a given timeslot of a given jobType
.groupBy((timeSlot, job, jobType) -> timeSlot,
(timeSlot, job, jobType) -> jobType,
ConstraintCollectors.sum((timeSlot, job, jobType) -> job.getDailyOutput()))
// include only timeslot/jobType pairs where dailyOutput exceeds maximum allowed
.filter((timeSlot, jobType, dailyOutput) -> dailyOutput > jobType.getMaximumDailyOutput())
.penalize("Maximum daily output exceeded", HardMediumSoftScore.ONE_HARD,
(timeSlot, jobType, dailyOutput) -> dailyOutput - jobType.getMaximumDailyOutput());
}
For groupBy, you can include up to four group key functions and collectors combined (if you have 1 key function, you can have up to 3 collectors; if you have 2 key functions, you can have up to 2 collectors, etc.). The key functions are always before the collectors.
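For instance, a minimal sketch of a groupBy with two key functions and two collectors (the Shift class and its getters here are made up for illustration, not taken from the question above):
// Group shifts by employee and by day, collecting both the number of shifts
// and the sum of their durations, then penalize days with too many minutes.
private Constraint maximumMinutesPerDay(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(Shift.class)
            .groupBy(Shift::getEmployee,                                    // key function 1
                    Shift::getDay,                                          // key function 2
                    ConstraintCollectors.count(),                           // collector 1
                    ConstraintCollectors.sum(Shift::getDurationInMinutes))  // collector 2
            .filter((employee, day, shiftCount, totalMinutes) -> totalMinutes > 480)
            .penalize("Maximum minutes per day exceeded", HardMediumSoftScore.ONE_HARD,
                    (employee, day, shiftCount, totalMinutes) -> totalMinutes - 480);
}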

Related

Constraint to require a certain number of matches

I am trying to write a hard constraint that requires that a certain value has been chosen a certain number of times. I have a constraint written below, which (I think) filters to a set of results that match this criteria, and I want it to penalize if there are no such results. I cannot figure out how to work .ifNotExists() into this. I think I am missing some understanding.
fun cpMustUseN(constraintFactory: ConstraintFactory): Constraint {
return constraintFactory.forEach(MealMenu::class.java)
.join(CpMustUse::class.java, equal({ mm -> mm.slottedCp!!.id }, CpMustUse::cpId))
.groupBy({ _, cpMustUse -> cpMustUse.numRequired }, countBi())
.filter { numRequired, count -> count >= numRequired }
.penalize(HardSoftScore.ONE_HARD)
.asConstraint("cpMustUseN")
}
MealMenu is an entity:
@PlanningEntity
class MealMenu {
@PlanningId
var id = 0
@PlanningVariable(valueRangeProviderRefs = ["cpRange"])
var slottedCp: Cp? = null
}
CpMustUse is a @ProblemFactCollectionProperty on my solution class, and the class looks like this:
class CpMustUse {
var cpId = 1
var numRequired = 4
}
I want to, in this case, constrain the result such that cpId 1 is chosen at least 4 times.
There are two conceptual issues here:
groupBy() only matches if the join returns a non-zero number of matches, so you will never get a countBi() of zero; in that case, groupBy() simply never matches. Therefore you cannot use grouping to check that something does not exist.
ifNotExists() always applies to a fact from the working memory. You cannot use it to check whether the result of a previous calculation exists.
Taken together, this makes your approach infeasible. This particular requirement is a bit trickier to implement.
Start by inverting the logic of the constraint you pasted: penalize every time count < numRequired. This handles all cases where count >= 1.
Then introduce a second constraint that specifically handles the case where the count would be zero; there you should be able to use forEach(CpMustUse::class.java).ifNotExists(MealMenu::class.java, ...).
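A rough sketch of the two constraints, written here in Java against the same classes (the getter names are the ones Kotlin generates for the properties above; the joiners and penalty weights are assumptions to adjust to your model):
// Case count >= 1: the cp is used, but fewer times than required.
Constraint cpUsedTooFewTimes(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(MealMenu.class)
            .join(CpMustUse.class,
                    Joiners.equal(mm -> mm.getSlottedCp().getId(), CpMustUse::getCpId))
            .groupBy((mm, mustUse) -> mustUse, ConstraintCollectors.countBi())
            .filter((mustUse, count) -> count < mustUse.getNumRequired())
            .penalize(HardSoftScore.ONE_HARD,
                    (mustUse, count) -> mustUse.getNumRequired() - count)
            .asConstraint("cpUsedTooFewTimes");
}
// Case count == 0: the required cp is never chosen by any MealMenu.
Constraint cpNeverUsed(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CpMustUse.class)
            .ifNotExists(MealMenu.class,
                    Joiners.equal(CpMustUse::getCpId, mm -> mm.getSlottedCp().getId()))
            .penalize(HardSoftScore.ONE_HARD, mustUse -> mustUse.getNumRequired())
            .asConstraint("cpNeverUsed");
}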

How to optimize a problem using only linear constraints

I am new to AMPL. Currently I am trying to optimize a networking problem. I can only use CPLEX solver. Others like ILOG CP are forbidden.
Input:
Demands - the set of demands that have to be fulfilled (basically QoS rules); each demand consists of multiple paths between a given pair of servers. Every path must be assigned some guaranteed bandwidth, which is >= 0
demand_maxPath - number of paths for a demand d
hd - bandwidth requested for a demand d
Goal
The goal is to assign bandwidths to every path with regards to some cost function.
For notation: d denotes the demand id, x(d,1) is the first path that belongs to demand d and so on, and hd denotes the bandwidth requested for demand d.
Constraints
There are a couple of constraints:
1. The sum of the bandwidths of all paths for a demand d must equal hd.
2. x(d,p) must be equal to hd for some path p, where p is chosen by the solver.
Constraint 2, together with constraint 1, enforces that all other paths of a demand d, except for the chosen path x(d,p), must be equal to 0.
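Writing P_d for demand_maxPath[d], the two constraints can be restated as
\sum_{p=1}^{P_d} x_{d,p} = h_d \qquad \forall d \in \text{Demands}
\exists\, p^* \in \{1, \dots, P_d\} \;:\; x_{d,p^*} = h_d \qquad \forall d \in \text{Demands}
Together with x_{d,p} >= 0, these two conditions force x_{d,p} = 0 for every path p other than p^*.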
Approach
My approach: I declared a variable:
var demandPath_signalCount { d in Demands, 1..demand_maxPath[d]}, >= 0;
which holds values for each of the paths. The following constraint reflects constraint 1:
subject to demand_satisfaction_constraint { d in Demands }:
sum { dp in 1..demand_maxPath[d] } demandPath_signalCount[d,dp] = h[d];
However, I can't think of a way to write constraint 2. For example:
subject to path_value_satisfaction_constraint { d in Demands }:
max { dp in 1..demand_maxPath[d] } demandPath_signalCount[d,dp] = h[d];
doesn't work, since the max() function is nonlinear.
Other idea was to declare another variable:
var demand_chosenPath { d in Demands }, >= 0;
and to use it like so:
subject to path_value_satisfaction_constraint { d in Demands }:
demandPath_signalCount[d,demand_chosenPath[d]] = h[d];
It obviously doesn't work either since variables cannot be used as indices.
Yet another way that I tried was to constraint the values that demandPath_signalCount may be equal to like so:
set possible_values {d in Demands } = 0..demand_volume[d] by demand_volume[d];
and
subject to possible_values_satisfaction_constraint { d in Demands, dp in 1..demand_maxPath[d] }:
demandPath_signalCount[d,dp] in possible_values[d];
But then again, the error is: continuous variable in tuple
How to formulate the second constraint?

Summing measurements

I have this code:
@Name("Creating_hourly_measurement_Position_Stopper for line 2")
insert into CreateMeasurement
select
m.measurement.source as source,
current_timestamp().toDate() as time,
"Line2_Count_Position_Stopper_Measurement" as type,
{
"Line2_DoughDeposit2.Hourly_Count_Position_Stopper.value",
count(cast(getNumber(m, "Status.Sidestopper_positioning.value"), double)),
"Line2_DoughDeposit2.Hourly_Count_Position_Stopper.unit",
getString(m, "Status.Sidestopper_positioning.unit")
} as fragments
from MeasurementCreated.win:time(1 hours) m
where getNumber(m, "Status.Sidestopper_positioning.value") is not null
and cast(getNumber(m, "Status.Sidestopper_positioning.value"), int) = 1
and m.measurement.source.value = "903791"
output last every 1 hours;
but it seems to loop. I believe it's because each new measurement modifies this group, meaning it is constantly extending. This means that a recalculation will be performed each time new data becomes available.
Is there a way to count the measurements or get the total of the measurements per hour or per day?
The stream it consumes is "MeasurementCreated" (see the from clause), and that isn't produced by any EPL, so one can safely say that this EPL by itself cannot possibly loop.
If you want to improve the EPL there is some information at this link: http://esper.espertech.com/release-8.2.0/reference-esper/html_single/index.html#processingmodel_basicfilter
By moving the where-clause text into a filter you can discard events early.
Doesn't the insert into CreateMeasurement then cause an event in MeasurementCreated?

Strategy for loading data into BigQuery and Google cloud Storage from local disk

I have 2 years of combined data, around 300 GB in size, on my local disk, which I extracted from Teradata. I have to load the same data into both Google Cloud Storage and a BigQuery table.
The final data in Google Cloud Storage should be segregated day-wise in compressed format (each day's data should be a single file in gz format).
I also have to load the data into a day-wise partitioned BigQuery table, i.e. each day's data should be stored in one partition.
I loaded the combined 2 years of data into Google Cloud Storage first. Then I tried using Google Dataflow to segregate the data day-wise using Dataflow's concept of partitioning and load it into Google Cloud Storage (FYI, Dataflow partitioning is different from BigQuery partitioning). But Dataflow did not allow creating 730 partitions (for 2 years), as it hit the 413 Request Entity Too Large error ("The size of serialized JSON representation of the pipeline exceeds the allowable limit").
So I ran the Dataflow job twice, filtering the data for one year at a time.
It filtered each year's data and wrote it into separate files in Google Cloud Storage, but it could not compress them, as Dataflow currently cannot write to compressed files.
Seeing the first approach fail, I thought of filtering one year's data from the combined data using partitioning in Dataflow as explained above, writing it directly to BigQuery, and then exporting it to Google Cloud Storage in compressed format. This process would have been repeated twice.
But with this approach I could not write more than 45 days' worth of data at once, as I repeatedly hit a java.lang.OutOfMemoryError: Java heap space issue. So this strategy also failed.
Any help in figuring out a strategy for a date-wise segregated migration to Google Cloud Storage (in compressed format) and BigQuery would be of great help.
Currently, partitioning the results is the best way to produce multiple output files/tables. What you're likely running into is the fact that each write allocates a buffer for the uploads, so if you have a partition followed by N writes, there are N buffers.
There are two strategies for making this work.
You can reduce the size of the upload buffers using the uploadBufferSizeBytes option in GcsOptions (see the short sketch after this list). Note that this may slow down the uploads, since the buffers will need to be flushed more frequently.
You can apply a Reshuffle operation to each PCollection after the partition. This will limit the number of BigQuery sinks running simultaneously, so fewer buffers will be allocated.
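For the first option, a minimal sketch of tuning the buffer size when constructing the pipeline (assuming the GcsOptions setter is named setGcsUploadBufferSizeBytes, as in the Dataflow/Beam SDKs):
// Reduce each GCS upload buffer to 1 MB (the exact default varies by SDK version)
// so that many simultaneously open sinks allocate less memory overall.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.as(GcsOptions.class).setGcsUploadBufferSizeBytes(1024 * 1024);
Pipeline p = Pipeline.create(options);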
For the second approach, you could do something like:
PCollection<Data> allData = ...;
PCollectionList<Data> partitions = allData.apply(Partition.of(...));
// Assuming the partitioning function has produced numDays partitions,
// and those can be mapped back to the day in some meaningful way:
for (int i = 0; i < numDays; i++) {
String outputName = nameFor(i); // compute the output name
partitions.get(i)
.apply("Write_" + outputName, ReshuffleAndWrite(outputName));
}
That makes use of these two helper PTransforms:
private static class Reshuffle<T>
extends PTransform<PCollection<T>, PCollection<T>> {
@Override
public PCollection<T> apply(PCollection<T> in) {
return in
.apply("Random Key", WithKeys.of(
new SerializableFunction<T, Integer>() {
@Override
public Integer apply(T value) {
return ThreadLocalRandom.current().nextInt();
}
}))
.apply("Shuffle", GroupByKey.<Integer, T>create())
.apply("Remove Key", Values.create());
}
}
private static class ReshuffleAndWrite
extends PTransform<PCollection<Data>, PDone> {
private final String outputName;
public ReshuffleAndWrite(String outputName) {
this.outputName = outputName;
}
@Override
public PDone apply(PCollection<Data> in) {
return in
.apply("Reshuffle", new Reshuffle<Data>())
.apply("Write", BigQueryIO.Write.to(tableNameFor(outputName)
.withSchema(schema)
.withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
}
}
Let's see if this will help.
Steps + pseudo code
1 - Upload combined data (300GB) to BigQuery to CombinedData table
2 - Split Years (Cost 1x2x300GB = 600GB)
SELECT * FROM CombinedData WHERE year = year1 -> write to DataY1 table
SELECT * FROM CombinedData WHERE year = year2 -> write to DataY2 table
3 - Split to 6 months (Cost 2x2x150GB = 600GB)
SELECT * FROM DataY1 WHERE month in (1,2,3,4,5,6) -> write to DataY1H1 table
SELECT * FROM DataY1 WHERE month in (7,8,9,10,11,12) -> write to DataY1H2 table
SELECT * FROM DataY2 WHERE month in (1,2,3,4,5,6) -> write to DataY2H1 table
SELECT * FROM DataY2 WHERE month in (7,8,9,10,11,12) -> write to DataY2H2 table
4 - Split to 3 months (Cost 4x2x75GB = 600GB)
SELECT * FROM DataY1H1 WHERE month in (1,2,3) -> write to DataY1Q1 table
SELECT * FROM DataY1H1 WHERE month in (4,5,6) -> write to DataY1Q2 table
SELECT * FROM DataY1H2 WHERE month in (7,8,9) -> write to DataY1Q3 table
SELECT * FROM DataY1H2 WHERE month in (10,11,12) -> write to DataY1Q4 table
SELECT * FROM DataY2H1 WHERE month in (1,2,3) -> write to DataY2Q1 table
SELECT * FROM DataY2H1 WHERE month in (4,5,6) -> write to DataY2Q2 table
SELECT * FROM DataY2H2 WHERE month in (7,8,9) -> write to DataY2Q3 table
SELECT * FROM DataY2H2 WHERE month in (10,11,12) -> write to DataY2Q4 table
5 - Split each quarter into 1 and 2 months (Cost 8x2x37.5GB = 600GB)
SELECT * FROM DataY1Q1 WHERE month = 1 -> write to DataY1M01 table
SELECT * FROM DataY1Q1 WHERE month in (2,3) -> write to DataY1M02-03 table
SELECT * FROM DataY1Q2 WHERE month = 4 -> write to DataY1M04 table
SELECT * FROM DataY1Q2 WHERE month in (5,6) -> write to DataY1M05-06 table
Same for rest of Y(1/2)Q(1-4) tables
6 - Split all double months tables into separate month table (Cost 8x2x25GB = 400GB)
SELECT * FROM DataY1M02-03 WHERE month = 2 -> write to DataY1M02 table
SELECT * FROM DataY1M02-03 WHERE month = 3 -> write to DataY1M03 table
SELECT * FROM DataY1M05-06 WHERE month = 5 -> write to DataY1M05 table
SELECT * FROM DataY1M05-06 WHERE month = 6 -> write to DataY1M06 table
Same for the rest of Y(1/2)M(XX-YY) tables
7 - Finally, you have 24 monthly tables, and the limitations you were facing should now be gone, so you can proceed with your plan (your second approach, let's say) to further split them into daily tables
I think, cost-wise, this is the optimal approach, and the final querying cost is
(assuming billing tier 1)
4x600GB + 400GB = 2800GB = $14
Of course, don't forget to delete the intermediate tables.
Note: I am not happy with this plan, but if splitting your original file into daily chunks outside of BigQuery is not an option, this can help.

How to count and then compute the total average in PIG

Each line in my dataset is a sale, and my goal is to compute the average number of times a client buys during their lifetime.
I have already grouped and counted by clientId like this:
byClientId = GROUP sales BY clientId;
countByClientId = FOREACH byClientId GENERATE group, COUNT($1);
This creates a table with 2 columns: clientId, count of transactions.
Now, I am trying to get the total average of the second column (i.e. the overall average of sales to same client). I am using this code:
groupCount = GROUP countByClientId all;
avg = foreach groupCount generate AVG($1);
But I get this error message:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 18, column 31> Could not infer the matching function for org.apache.pig.builtin.AVG
as multiple or none of them fit. Please use an explicit cast.
How to get the overall average of the second column?
It would have been simpler for us with a sample of the input data, so I created my own to be sure that my solution would work. You only have one mistake: once you GROUP ... ALL, your schema becomes group:chararray,countByClientId:bag{:tuple(group:chararray,:long)}
So $1 refers to a bag, and this is why you can't compute the mean. If you want to access $1 (the second element) inside this bag, you have two choices: either $1.$1 or countByClientId.$1. So your last line should be:
avg = foreach groupCount generate AVG(countByClientId.$1);
I hope it's clear.