PuLP - COIN-CBC error: How to add constraint with double inequality and relaxation? - optimization

I want to add this set of constraints:
-M(1 - X_(i,j,k,n)) ≤ S_(i,j,k,n) - ToD_(i,j,k,n) ≤ M(1 - X_(i,j,k,n)) ∀ i,j,k,n
where M is a big number and S is an integer variable that takes values between 0 and 1440. ToD is a 4-dimensional matrix whose values are read from an Excel sheet. X is a binary variable that takes the values 0 or 1.
I tried to implement it in code as follows:
for n in range(L):
    for k in range(M):
        for i in range(N):
            for j in range(N):
                # use `and`: bitwise `&` binds tighter than `!=` and changes the test
                if (i != START_POINT_S and i != END_POINT_T and j != START_POINT_S and j != END_POINT_T):
                    prob += (-BIG_NUMBER*(1-X[i][j][k][n])) <= (S[i][j][k][n] - ToD[i][j][k][n]), ""
and another constraint as follows:
for i in range(N):
    for j in range(N):
        for k in range(M):
            for n in range(L):
                # use `and`: bitwise `&` binds tighter than `!=` and changes the test
                if (i != START_POINT_S and i != END_POINT_T and j != START_POINT_S and j != END_POINT_T):
                    prob += S[i][j][k][n] - ToD[i][j][k][n] <= BIG_NUMBER*(1-X[i][j][k][n]), ""
In my experience, these two constraints together are equivalent to the double inequality I want. The problem is that PuLP and CBC won't accept them. They produce the following errors:
PuLP:
Traceback (most recent call last):
  File "basic_JP.py", line 163, in <module>
    prob.solve()
  File "C:\Users\dimri\Desktop\Filesystem\Projects\deliverable_B4\lib\site-packages\pulp\pulp.py", line 1643, in solve
    status = solver.actualSolve(self, **kwargs)
  File "C:\Users\dimri\Desktop\Filesystem\Projects\deliverable_B4\lib\site-packages\pulp\solvers.py", line 1303, in actualSolve
    return self.solve_CBC(lp, **kwargs)
  File "C:\Users\dimri\Desktop\Filesystem\Projects\deliverable_B4\lib\site-packages\pulp\solvers.py", line 1366, in solve_CBC
    raise PulpSolverError("Pulp: Error while executing "+self.path)
pulp.solvers.PulpSolverError: Pulp: Error while executing C:\Users\dimri\Desktop\Filesystem\Projects\deliverable_B4\lib\site-packages\pulp\solverdir\cbc\win\64\cbc.exe
and CBC:
Welcome to the CBC MILP Solver
Version: 2.9.0
Build Date: Feb 12 2015
command line - C:\Users\dimri\Desktop\Filesystem\Projects\deliverable_B4\lib\site-packages\pulp\solverdir\cbc\win\64\cbc.exe 5284-pulp.mps branch printingOptions all solution 5284-pulp.sol (default strategy 1)
At line 2 NAME MODEL
At line 3 ROWS
At line 2055 COLUMNS
Duplicate row C0000019 at line 10707 < X0001454 C0000019 -1.000000000000e+00 >
Duplicate row C0002049 at line 10708 < X0001454 C0002049 -1.000000000000e+00 >
Duplicate row C0000009 at line 10709 < X0001454 C0000009 1.000000000000e+00 >
Duplicate row C0001005 at line 10710 < X0001454 C0001005 1.000000000000e+00 >
At line 14153 RHS
At line 16204 BOUNDS
Bad image at line 17659 < UP BND X0001454 1.440000000000e+03 >
At line 18231 ENDATA
Problem MODEL has 2050 rows, 2025 columns and 5968 elements
Coin0008I MODEL read with 5 errors
There were 5 errors on input
** Current model not valid
Option for printingOptions changed from normal to all
** Current model not valid
No match for 5284-pulp.sol - ? for list of commands
Total time (CPU seconds): 0.02 (Wallclock seconds): 0.02
I don't know what the problem is. Any help? I am new to this; if the information is not enough, let me know what I should add.

Alright, I searched for hours, but right after I posted this question I found the answer. Problems of this kind are mainly caused by the names of the variables or the constraints; that is what produced the duplicates. I am not used to this kind of software, which is why it took me so long to find an answer. Anyway, the problem for me was in how I defined the variables:
# define X[i,j,k,n]
lower_bound_X = 0 # lower bound for variable X
upper_bound_X = 1 # upper bound for variable X
X = LpVariable.dicts(name="X",
                     indexs=(range(N), range(N), range(M), range(L)),
                     lowBound=lower_bound_X,
                     upBound=upper_bound_X,
                     cat=LpInteger)
and
# define S[i,j,k,n]
lower_bound_S = 0 # lower bound for variable S
upper_bound_S = 1440 # upper bound for variable S
S = LpVariable.dicts(name="X",
                     indexs=(range(N), range(N), range(M), range(L)),
                     lowBound=lower_bound_S,
                     upBound=upper_bound_S,
                     cat=LpInteger)
As you see in the definition of S I obviously forgot to change the name of the variable to S because I copy-pasted it. Anyway, the right way to define S is like this:
# define S[i,j,k,n]
lower_bound_S = 0 # lower bound for variable S
upper_bound_S = 1440 # upper bound for variable S
S = LpVariable.dicts(name="S",
                     indexs=(range(N), range(N), range(M), range(L)),
                     lowBound=lower_bound_S,
                     upBound=upper_bound_S,
                     cat=LpInteger)
This is how I got my code running.
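A quick way to catch this kind of copy-paste slip before calling the solver is to check that no two variables share a name. The sketch below is only an illustration under assumptions: it does not call PuLP, and it assumes that dict-style variables are named by joining the prefix and the indices with underscores (for example X_0_1_2_3). build_names and find_duplicate_names are hypothetical helpers, not part of PuLP.

```python
from collections import Counter
from itertools import product


def build_names(prefix, *sizes):
    """Hypothetical helper: names like 'X_0_1_2_3' for every index tuple."""
    ranges = [range(size) for size in sizes]
    return [prefix + "_" + "_".join(map(str, idx)) for idx in product(*ranges)]


def find_duplicate_names(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Small illustrative sizes (assumed, not the sizes from the question).
N, M, L = 2, 2, 1

# Reproduces the slip: both families are created with the prefix "X".
clashing = build_names("X", N, N, M, L) + build_names("X", N, N, M, L)
# Corrected: the second family uses its own prefix "S".
fixed = build_names("X", N, N, M, L) + build_names("S", N, N, M, L)

print(len(find_duplicate_names(clashing)))  # every name collides
print(find_duplicate_names(fixed))          # no collisions
```

With PuLP itself, the same check could be run on the name attribute of the variable objects stored in X and S before building the constraints.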

Related

How can I delete sub-sub-list elements based on condition?

I have the following two lists:
lines = [[[0, 98]], [[64], [1,65,69]]]
stations = [[0,1], [0,3,1]]
The lines list describes the line combinations for getting from 0 to 1, and stations describes the stops visited when choosing those line combinations. For getting from 0 to 1, the following combinations are possible:
Take line 0 or 98 (direct connection)
Take line 64 and then line 1 (1 transfer at station 3)
Take line 64 and then line 65 (1 transfer at station 3)
Take line 64 and then line 69 (1 transfer at station 3)
The len of stations always equals the len of lines. I have used the following code to explode the lists in the way I described previously and store the options in a dataframe df.
import itertools
import pandas as pd

result_direct = []
df = pd.DataFrame(columns=["lines", "stations", 'transfers'])
result_transfer = []
for index,(line,station) in enumerate(zip(lines,stations)):
    #print(index,line,station)
    if len(line) == 1: #if the line store direct connections
        result_direct = [[i] for i in line[0]] #stores the direct connections in a list
        for sublist in result_direct:
            df = df.append({'lines': sublist,'stations': station, 'transfers': 0},ignore_index=True)
    else:
        result_transfer = [[[x] for x in tup] for tup in itertools.product(*line)]
        result_transfer = [[item[0] for item in sublist] for sublist in result_transfer]
        for sublist in result_transfer:
            df = df.append({'lines': sublist,'stations': station, 'transfers': len(sublist)-1},ignore_index=True)
For the sake of the example I add the following 2 columns score1, score2:
df['score1'] = [5,5,5,2,2]
df['score2'] = [2,6,4,3,3]
I want to update the values of lines and stations based on a condition. When score2 > score1, that line/station combination should be removed from the lists.
In this example the direct line 98, as well as the line combinations 64,65 and 64,69, should be removed from lines. Therefore, the expected output is the following:
lines = [[[0]], [[64], [1]]]
stations = [[0,1], [0,3,1]]
At this point, I should note that stations is not affected, since at least one combination of lines remains in each sublist. If line 0 also had to be removed, the expected output would be:
lines = [[[64], [1]]]
stations = [[0,3,1]]
To start, I tried a manual solution that works for removing a single line (e.g. line 98):
lines = [[y for y in x if y != 98] if isinstance(x, list) else x for x in [y for x in lines for y in x]]
I am having difficulty with line combinations (e.g. 64,65 and 64,69). Do you have any suggestions?
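One possible direction, sketched here under assumptions rather than offered as a definitive answer: enumerate the combinations of each entry with itertools.product, keep only the ones that pass the condition, and rebuild the nested lists from the survivors. The prune function and the removed set are hypothetical names; the removed set hard-codes the combinations with score2 > score1 from the example instead of reading them from df. Note that rebuilding per-position lists from the kept combinations is exact for this example but can be lossy in general, because the product of the rebuilt positions may contain a combination that was removed.

```python
import itertools


def prune(lines, stations, keep):
    """Drop line combinations for which keep(combo) is False.

    An entry (and its stations) disappears when no combination survives.
    """
    new_lines, new_stations = [], []
    for entry, stops in zip(lines, stations):
        kept = [combo for combo in itertools.product(*entry) if keep(combo)]
        if not kept:
            continue
        # Rebuild one list per position, preserving first-seen order.
        rebuilt = [list(dict.fromkeys(position)) for position in zip(*kept)]
        new_lines.append(rebuilt)
        new_stations.append(stops)
    return new_lines, new_stations


lines = [[[0, 98]], [[64], [1, 65, 69]]]
stations = [[0, 1], [0, 3, 1]]

# Combinations with score2 > score1 in the example (hard-coded assumption).
removed = {(98,), (64, 65), (64, 69)}

print(prune(lines, stations, lambda combo: combo not in removed))
# ([[[0]], [[64], [1]]], [[0, 1], [0, 3, 1]])
```

Adding (0,) to removed drops the first entry entirely, together with its stations.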

How do you detect blank lines in Fortran?

Given an input that looks like the following:
123
456
789

42
23
1337

3117
I want to iterate over this file in chunks separated by blank lines in Fortran (any version is fine). For example, let's say I wanted to take the average of each chunk (e.g. mean(123, 456, 789) then mean(42, 23, 1337) then mean(3117)).
I've tried iterating through the file normally (e.g. READ), reading in each line as a string and then converting to an int and doing whatever math I want to do on each chunk. The trouble here is that Fortran "helpfully" ignores blank lines in my text file - so when I try and compare against the empty string to check for the blank line, I never actually get a .True. on that comparison.
I feel like I'm missing something basic here, since this is a typical functionality in every other modern language, I'd be surprised if Fortran didn't somehow have it.
If you're using so-called "list-directed" input (format = '*'), Fortran handles spaces, commas, and blank lines specially.
For your case, one relevant feature is the BLANK specifier of read:
read(iunit,'(i10)',blank="ZERO",err=1,end=2) array
You can set:
blank="ZERO" will return a valid zero value if a blank is found;
blank="NULL" is the default behavior that skips blank/returns an error depending on the input format.
If all your input values are positive, you could use blank="ZERO" and then use the location of zero values to process your data.
EDIT: as @vladimir-f has correctly pointed out, you not only have blanks between lines, but also after the end of the numbers in most lines, so this strategy will not work.
You can instead load everything into an array, and process it afterwards:
program array_with_blanks
   integer :: ierr,num,iunit
   integer, allocatable :: array(:)
   open(newunit=iunit,file='stackoverflow',form='formatted',iostat=ierr)
   allocate(array(0))
   do
      read(iunit,'(i10)',iostat=ierr) num
      if (is_iostat_end(ierr)) then
         exit
      else
         array = [array,num]
      endif
   end do
   close(iunit)
   print *, array
end program
Just read each line as a character (but note Francescalus's comment on the format). Then read the character as an internal file.
program stuff
   implicit none
   integer io, n, value, sum
   character (len=1000) line
   n = 0
   sum = 0
   io = 0
   open( 42, file="stuff.txt" )
   do while( io == 0 )
      read( 42, "( a )", iostat = io ) line
      if ( io /= 0 .or. line == "" ) then
         if ( n > 0 ) print *, ( sum + 0.0 ) / n
         n = 0
         sum = 0
      else
         read( line, * ) value
         n = n + 1
         sum = sum + value
      end if
   end do
   close( 42 )
end program stuff
456.000000
467.333344
3117.00000

KeyError: "['green_picture', 'yellow_green_picture', 'yellow_picture'] not in index"

This is my code:
#Select your X variables based on your hypothesis : mainvariables + control variables
# Add your main x variables here - based on your hypothesis
main_x_names = ['Green_color','Yellow_green_color','Yellow_color']
# Add your control variables here. Default value is empty. You can leave it empty.
control_x_names = ['length_description','Valence_overall', 'ocr_Valence_overall', 'ocr_length_description']
# Your control variables should be your selected variables + time variables + brand variables
# e.g., datetime_features.columns.to_list() helps to get the column names of datetime_features and put them in a list
control_x_names = control_x_names + datetime_features.columns.to_list()
# Decide your X variables : main X variables + control variables
x_names = main_x_names + control_x_names
This is my error. I don't know if the formatting is wrong or what; only these columns are not recognized. When I ran it I got this.
Y=np.log(test_data1['likeCount']+1)
X =test_data1[x_names]
X=sm.add_constant(X)
model=sm.OLS(Y,X)
results=model.fit()
KeyError Traceback (most recent call last)
<ipython-input-93-a5ec2e24ee54> in <module>()
1 Y=np.log(test_data1['likeCount']+1)
----> 2 X =test_data1[x_names]
3 X=sm.add_constant(X)
4 model=sm.OLS(Y,X)
5 results=model.fit()
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
This was my mistake: I did not copy the key error into the code box. The final key error should be: KeyError: "['Green_color', 'Yellow_green_color', 'Yellow_color'] not in index"
I don't know what the problem is.
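One way to narrow this down is to compare the requested names with the columns the DataFrame actually has, since this KeyError says that the listed labels were not found among the columns. The sketch below is plain Python and works on lists of names. The column names in available are made up for illustration (the question does not show test_data1.columns), and missing_columns and suggest_matches are hypothetical helpers. It flags near-misses that differ only in case or surrounding whitespace, which is one possible cause and not a confirmed diagnosis.

```python
def missing_columns(requested, available):
    """Names in `requested` that are not exactly present in `available`."""
    present = set(available)
    return [name for name in requested if name not in present]


def suggest_matches(missing, available):
    """Map each missing name to columns equal to it after strip() and lower()."""
    def normalize(name):
        return name.strip().lower()

    suggestions = {}
    for name in missing:
        suggestions[name] = [col for col in available
                             if normalize(col) == normalize(name)]
    return suggestions


main_x_names = ['Green_color', 'Yellow_green_color', 'Yellow_color']
# Hypothetical column list, standing in for list(test_data1.columns).
available = ['likeCount', 'green_color', 'Yellow_green_color ', 'length_description']

missing = missing_columns(main_x_names, available)
print(missing)
print(suggest_matches(missing, available))
```

With a real DataFrame, available would be list(test_data1.columns); an empty suggestion list means the column is absent under any casing or spacing.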

Awk failing extraction

I have a huge file containing the xyz positions of some atoms from different molecules. The whole file contains ~10000 configurations. I have written a script that iterates over all configurations and extracts the coordinates of a specific atomic species that is systematically repeated at fixed positions in each frame. My code works perfectly, except when the atomic position coincides with the last position of the frame: that line is skipped and never printed to the output file.
Each frame contains 384 atoms. In the xyz format, we have to take into account two extra lines at the beginning: the number of atoms (in this case 384, line #1) and a blank/comment line (line #2).
The awk file with the list of atom position lines is of the form:
{n = NR%386}
n == 1 {print "24"; next}
n == 2 ||
n == 91 ||
...
n == 378 ||
n == 380 ||
n == 381 ||
n == 386
where n = NR%386 is meant to give the position of the current line within its frame (each frame has 386 lines), and in
n == 1 {print "24"; next}
the code prints the number of atoms I want to extract for each frame, in this case, 24.
The problem arises with the last value, in the last position of each frame before advancing to the next frame:
n == 386
When using the command
awk -f file.awk filename.xyz >> test.txt
the code will skip reading, extracting, and printing the last coordinate.
The filename.xyz I have to process is something like:
384
i = 3171, time = 3171.000, E = -3298.3005315786
C 6.66359796 19.29831718 16.63773520
C 6.19922671 19.83243350 15.35406226
C 7.73577004 21.24303011 16.94974860
C 7.32315891 21.77975003 15.67093925
N 5.08248005 17.55384984 15.51887635
N 7.75857672 23.00895664 15.43811018
N 8.58649028 22.07495287 17.61330368
N 7.45555304 19.97249138 17.42360101
...
...
...
N 3.62924684 23.22942656 15.38486984
N 4.52670891 22.25077226 17.55981432
N 3.17369677 20.23465407 17.45881199
N 2.28230853 21.30557433 14.86646780
S 1.48394488 18.18032187 17.21253664
S 0.70072709 19.13053602 14.60582837
S 4.67511560 23.53830074 16.57005901
Currently, just trying to extract only position 386
n == 386
produces something like:
1
i = 3171, time = 3171.000, E = -3298.3005315786
1
i = 3172, time = 3172.000, E = -3298.3023115390
1
i = 3173, time = 3173.000, E = -3298.3056102462
1
i = 3174, time = 3174.000, E = -3298.3101590395
These are just the count lines and the comment lines; awk is apparently skipping the last line of the frame or not matching it correctly.
I would like to understand why awk is not able to extract the last line properly and how to solve the problem.
This appears to be a math problem. NR%386 will never be 386 because of the way the modulus operator works (there is no remainder when you divide 386 by 386). So your n==386 will never get executed. Try using (NR-1)%386 instead of NR%386 and shift all your conditionals accordingly:
n == 0 {print "24"; next}
etc. If you need n for calculations, add one to it.
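The arithmetic behind this can be checked quickly. The following is a small Python sketch (not awk) that lists the values NR % 386 can take and shows the shifted mapping the answer suggests; FRAME_LINES and zero_based_position are names introduced here for illustration only.

```python
FRAME_LINES = 386  # 384 atom lines plus the count line and the comment line

# NR starts at 1 in awk; collect every value NR % 386 takes over three frames.
remainders = {nr % FRAME_LINES for nr in range(1, 3 * FRAME_LINES + 1)}
print(min(remainders), max(remainders))   # 0 385
print(FRAME_LINES in remainders)          # False


def zero_based_position(nr, frame_lines=FRAME_LINES):
    """Position within the frame as suggested in the answer: (NR-1) % 386."""
    return (nr - 1) % frame_lines


# First and last line of the first frame, then first line of the second frame.
print(zero_based_position(1), zero_based_position(386), zero_based_position(387))
# 0 385 0
```

With the original n = NR%386, the last line of each frame is the one where n is 0, which is why a test against 386 never fires.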

Pandas not consistently skipping input number of rows for skiprows argument?

I am using Pandas to organize CSV files to later plot with matplotlib. First I create a Pandas dataframe to find the line containing 'Pt'. This is what I search for to use as my header line.
Then I save the index of this line and pass it as the skiprows argument when creating the new dataframe which I will use.
Oddly, depending on the file format, even though the correct index is found, the wrong line shows up as the header. For example, note how in Pandas line 54 has 'Pt' right after the tab:
[screenshot: correct index on first file]
The dataframe comes out correctly here.
[screenshot: correct dataframe on first file]
For another file, line 44 is correctly recognized as containing 'Pt'.
[screenshot: correct index on second file]
But the dataframe includes line 43 as the header!
[screenshot: incorrect dataframe on second file]
I have tried setting header=0 and header=None. Am I missing something?
Here is the code:
entire_df = pd.read_csv(file_path, header=None)
print(entire_df.head(60))
header_idx = -1
for index, row in entire_df.iterrows(): # find line with desired header
    if any(row.str.contains('Pt')):
        print("Yes! I have pt!")
        print("Header index is: " + str(index))
        print("row contains:")
        print(entire_df.loc[[index]])
        header_idx = index # correct index obtained!
        break
df = pd.read_csv(file_path, delimiter='\t', skiprows=header_idx, header=0) # use line index to exclude extra information above
print(df.head())
Here are sections of the two files that give different results. They are saved as .dta files. I cannot share the entire files.
file1 (properly made dataframe)
FRAMEWORKVERSION QUANT 7.07 Framework Version
INSTRUMENTVERSION LABEL 4.32 Instrument Version
CURVE TABLE 16875
Pt T Vf Im Vu Pwr Sig Ach Temp IERange Over
# s V A V W V V deg C # bits
0 0.1 3.49916E+000 -1.40364E-002 0.00000E+000 -4.91157E-002 -4.22328E-001 0.00000E+000 1.41995E+003 11 ...........
1 0.2 3.49439E+000 -1.40305E-002 0.00000E+000 -4.90282E-002 -4.22322E-001 0.00000E+000 1.41995E+003 11 ...........
2 0.3 3.49147E+000 -1.40258E-002 0.00000E+000 -4.89705E-002 -4.22322E-001
file2 (dataframe with wrong header)
FRAMEWORKVERSION QUANT 7.07 Framework Version
INSTRUMENTVERSION LABEL 4.32 Instrument Version
CURVE TABLE 18
Pt T Vf Vm Ach Over Temp
# s V vs. Ref. V V bits deg C
0 2.00833 3.69429E+000 3.69429E+000 0.00000E+000 ........... 1419.95
1 4.01667 3.69428E+000 3.69352E+000 0.00000E+000 ........... 1419.95
2 6.025 3.69419E+000 3.69284E+000 0.00000E+000 ........... 1419.95
3 8.03333 3.69394E+000 3.69211E+000 0.00000E+000 ........... 1419.95
Help would be much appreciated.
You should pay attention to your indentation levels. The code block in which you want to set header_idx depending on your if any(row.str.contains('Pt')) condition has the same indentation level as the if statement, which means it is executed at each iteration of the for loop, and not just when the condition is met.
for index, row in entire_df.iterrows():
    if any(row.str.contains('Pt')):
        [...]
    header_idx = index
Adapt the indentation like this to put the assignment under the control of the if statement:
for index, row in entire_df.iterrows():
    if any(row.str.contains('Pt')):
        [...]
        header_idx = index
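Independently of the indentation question, the row label found in a DataFrame and the physical line number in the file are not guaranteed to coincide (for instance if the parser drops blank lines, which is an assumption here and not something the question confirms). A minimal sketch of one way to sidestep that is to locate the header by scanning the raw text, so that the number passed to skiprows counts physical lines. find_header_line is a hypothetical helper, and the sample text only imitates the layout of the files in the question.

```python
import io


def find_header_line(handle, first_field="Pt"):
    """Return the 0-based physical line number whose first tab-separated
    field (ignoring surrounding whitespace) equals `first_field`, else -1."""
    for line_number, line in enumerate(handle):
        fields = line.strip().split("\t")
        if fields and fields[0].strip() == first_field:
            return line_number
    return -1


# Made-up sample with a blank line above the header, imitating the .dta layout.
sample = (
    "FRAMEWORKVERSION\tQUANT\t7.07\tFramework Version\n"
    "\n"
    "CURVE\tTABLE\t18\n"
    "\tPt\tT\tVf\n"
    "\t#\ts\tV\n"
    "\t0\t2.00833\t3.69429E+000\n"
)

header_idx = find_header_line(io.StringIO(sample))
print(header_idx)
```

With a real file, the handle would come from open(file_path), and the result could then be tried as skiprows=header_idx together with header=0 in the second pd.read_csv call.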