CPLEX: How to run subset constraints in OPL? - optimization

CPLEX; Mixed Integer Linear Programming; Constraint Formulation:
There are 150 towns in the dataset, each town having several markets (or "mandis"). Total number of mandis in the dataset = 1800. I have a binary decision variable y[mandi][days]. I want to add a constraint which states that y[mandi][days] is equal for all mandis within any given town on any given day. y[mandi][days] could be different/same for the mandis in different towns on the same day.
Sample Data
I'm inputting the data from Excel. Please see the attached image. Can you help me out with how to formulate this constraint in OPL?
One way to achieve the above is to specify individual constraints on the set of mandis within each town. However, the number of constraints, in that case, would become 150, each referring to one town. Also, we might receive from the client an additional list of mandis for some towns, which would distort the mandi-town mapping numbering, and I would have to change the mapping in CPLEX again. Is there a better way to do this, which could take the mapping directly from excel in the attached image format?

Instead of the decision variable
dvar boolean y[mandi][days]
why not use
dvar boolean y[town][days]
?
And then when you need the y for a given mandi, you first get the town of that mandi and then get its y.

Related

python - pandas - dataframe - data padding multidimensional statistics

i have a dataframe with columns accounting for different characteristics of stars and rows accounting for measurements of different stars. (something like this)
\property_______A _______A_error_______B_______B_error_______C_______C_error ...
star1
star2
star3
...
in some measurements the error for a specifc property is -1.00 which means the measurement was faulty.
in such case i want to discard the measurement.
one way to do so is by eliminating the entire row (along with other properties who's error was not -1.00)
i think it's possible to fill in the faulty measurement with a value generated by the distribution based on all the other measurements, meaning - given the other properties which are fine, this property should have this value in order to reduce the error of the entire dataset.
is there a proper name to the idea i'm referring to?
how would you apply such an algorithm?
i'm a student on a solo project so would really appreciate answers that also elaborate on theory (:
edit
after further reading, i think what i was referring to is called regression imputation.
so i guess my question is - how can i implement multidimensional linear regression in a dataframe in the most efficient way???
thanks!

Similarity matching algorithm

I have products with different details in different attributes and I need to develop an algorithm to find the most similar to the one I'm trying to find.
For example, if a product has:
Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new
Others can have different colors, weight, etc. Then I need to do a search where the most similar return first. For example, if everything matches but the color is only Black and White but not Brown, it's a better match than another product that is only Black but not White or Brown.
I'm open to suggestions as the project is just starting.
One approach, for example, I could do is restrict each attribute (weight, color, size) a limited set of option, so I can build a binary representation. So I have something like this for each product:
Colors Weight Height Condition
00011011000 10110110 10001100 01
Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).
The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.
Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.
Further question: also if I could use different weights for each attribute, it would be awesome.
You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.
Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.
You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.
What you want to do is a perfect use case for elasticsearch and other similar search oriented databases. I don't think you need to hack with bitmasks/etc.
You would typically maintain your primary data in your existing database (sql/cassandra/mongo/etc..anything works), and copy things that need searching to elasticsearch.
What are you talking about very similar to BK-trees. BK-tree constructs search tree with some metric associated with keys of this tree. Most common use of this tree is string corrections with Levenshtein or Damerau-Levenshtein distances. This is not static data structure, so it supports future insertions of elements.
When you search exact element (or insert element), you need to look through nodes of this tree and go to links with weight equal to distance between key of this node and your element. if you want to find similar objects, you need to go to several nodes simultaneously that supports your wishes of constrains of distances. (Maybe it can be even done with A* to fast finding one most similar object).
Simple example of BK-tree (from the second link)
BOOK
/ \
/(1) \(4)
/ \
BOOKS CAKE
/ / \
/(2) /(1) \(2)
/ | |
BOO CAPE CART
Your metric should be Hamming distance (count of differences between bit representations of two objects).
BUT! is it good to compare two integers as count of different bits in their representation? With Hamming distance HD(10000, 00000) == HD(10000, 10001). I.e. difference between numbers 16 and 0, and 16 and 17 is equal. Is it really what you need?
BK-tree with details:
https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html
https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/

Determining which polygon contains the majority of a line - Oracle Spatial

I have an oracle database (11g spatial) that includes a series of area polygons and water mains. I'm trying to attribute each of these mains to the area in which it is contained and for the most part this is straightforward enough (using the SDO_CONTAINS function) but I'm not sure how to deal with mains that straddle multiple polygons due to errors in digitisation.
In cases like this what I'd ideally like to do is attribute a main to an area polygon if the majority of it's length (>50%) is contained within onit. I know that I can use the SDO_RELATE function to determine every polygon that any given main interacts with, but I don't know how to then go about determining how much of it's length is contained within each area.
The principle is like this:
Correlate mains and areas. Assuming you have many mains and many areas, the most efficient approach is to use SDO_JOIN
For each couple (main/area) returned, compute their intersection (SDO_GEM.SDO_INTERSECTION) and measure the length of that intersection (SDO_GEOM.SDO_LENGTH).
From those results, retain the area for each main where the length is the maximum
If you want a full SQL example, allow me a bit of time to write that using sample data.

How to distinguish in master data and calculated interpolated data?

I'm getting a bunch of vectors with datapoints for a fixed set of values, in the example below you see an example of a vector with a value per time point
1D:2
2D:
7D:5
1M:6
6M:6.5
But alas not for all the timepoints is a value available. All vectors are stored in a database and with a trigger we calcuate the missing values by interpolation, or possibly a more advanced algorithm. Somehow I want to be able to tell which data points have been calculated and which have been original delivered to us. Of course I can add a flag column to the table with values indicating whether the value was a master value or is calculated, but I'm wondering whether there is a more sophisticated way. We probably don't need to determine on a regular basis, so cpu cycles are not an issue for determining or insertion.
The example above shows some nice looking numbers but in reality it would look more somethin like 3.1415966533.
The database for storage is called oracle 10.
cheers.
Could you deactivate the trigger temporarily?

dimensional and unit analysis in SQL database

Problem:
A relational database (Postgres) storing timeseries data of various measurement values. Each measurement value can have a specific "measurement type" (e.g. temperature, dissolved oxygen, etc) and can have specific "measurement units" (e.g. Fahrenheit/Celsius/Kelvin, percent/milligrams per liter, etc).
Question:
Has anyone built a similar database such that dimensional integrity is conserved? Have any suggestions?
I'm considering building a measurement_type and a measurement_unit table, both of these would have text two columns, ID and text. Then I would create foreign keys to these tables in the measured_value table. Text worries me somewhat because there's the possibility for non-unique duplicates (e.g. 'ug/l' vs 'µg/l' for micrograms per liter).
The purpose of this would be so that I can both convert and verify units on queries, or via programming externally. Ideally, I would have the ability later to include strict dimensional analysis (e.g. linking µg/l to the value 'M/V' (mass divided by volume)).
Is there a more elegant way to accomplish this?
I produced a database sub-schema for handling units an aeon ago (okay, I exaggerate slightly; it was about 20 years ago, though). Fortunately, it only had to deal with simple mass, length, time dimensions - not temperature, or electric current, or luminosity, etc. Rather less simple was the currency side of the game - there were a myriad different ways of converting between one currency and another depending on date, currency, and period over which conversion rate was valid. That was handled separately from the physical units.
Fundamentally, I created a table 'measures' with an 'id' column, a name for the unit, an abbreviation, and a set of dimension exponents - one each for mass, length, time. This gets populated with names such as 'volume' (length = 3, mass = 0, time = 0), 'density' (length = 3, mass = -1, time = 0) - and the like.
There was a second table of units, which identified a measure and then the actual units used by a particular measurement. For example, there were barrels, and cubic metres, and all sorts of other units of relevance.
There was a third table that defined conversion factors between specific units. This consisted of two units and the multiplicative conversion factor that converted unit 1 to unit 2. The biggest problem here was the dynamic range of the conversion factors. If the conversion from U1 to U2 is 1.234E+10, then the inverse is a rather small number (8.103727714749e-11).
The comment from S.Lott about temperatures is interesting - we didn't have to deal with those. A stored procedure would have addressed that - though integrating one stored procedure into the system might have been tricky.
The scheme I described allowed most conversions to be described once (including hypothetical units such as furlongs per fortnight, or less hypothetical but equally obscure ones - outside the USA - like acre-feet), and the conversions could be validated (for example, both units in the conversion factor table had to have the same measure). It could be extended to handle most of the other units - though the dimensionless units such as angles (or solid angles) present some interesting problems. There was supporting code that would handle arbitrary conversions - or generate an error when the conversion could not be supported. One reason for this system was that the various international affiliate companies would report their data in their locally convenient units, but the HQ system had to accept the original data and yet present the resulting aggregated data in units that suited the managers - where different managers each had their own idea (based on their national background and length of duty in the HQ) about the best units for their reports.
"Text worries me somewhat because there's the possibility for non-unique duplicates"
Right. So don't use text as a key. Use the ID as a key.
"Is there a more elegant way to accomplish this?"
Not really. It's hard. Temperature is it's own problem because temperature is itself an average, and doesn't sum like distance does; plus F to C conversion is not a multiply (as it is with every other unit conversion.)
A note about conversions: a lot of units are linearly related, and can be converted using a formula like "y = A + Bx", where A and B are constants which could be stored in the database for each pair of units that you need to convert between. For example, for Celsius to Farenheit the constants are A=32, B=1.8.
However, there are also rare exceptions. Converting between logarithmic and non-logarithmic units, for example. Or converting between mass-per-volume and molar-mass-per-volume (in which case you would need to know the molar mass of the compound being measured).
Of course, if you are sure that all the conversions required by the system are linear, then there's no need for over-engineering, just store the two constants. You can then extract standardized results from the database using straight SQL joins with calculated fields.