Initialize a two-dimensional array from literals - ampl

I have this data file:
param: name car pro fat vit cal :=
1 'Fiddleheads' 3 1 0 3 80
2 'Fireweed Shoots' 3 0 0 4 150
3 'Prickly Pear Fruit' 2 1 1 3 190
;
and this model:
set I;
set J;
param name{I} symbolic;
param car{I} integer >= 0;
param pro{I} integer >= 0;
param fat{I} integer >= 0;
param vit{I} integer >= 0;
param cal{I} integer >= 0;
param nut{i in I, J} = (car[i], pro[i], fat[i], vit[i]);
The last line is invalid:
mod, line 10 (offset 176):
syntax error
context: param nut{i in I, J} = >>> (car[i], <<< pro[i], fat[i], vit[i]);
but I don't know how to get an equivalent working. Essentially, I want to form a {3,4} array based on a literal expression. I've tried a handful of different syntaxes both in the data and model file and haven't been able to get any working.

Model:
set names;
set components;
param nut{names,components} default 0;
Data:
set names :=
Fiddleheads
'Fireweed Shoots'
'Prickly Pear Fruit';
set components := car pro fat vit cal
;
param nut :=
[Fiddleheads,*]
car 3 pro 1 vit 3 cal 80
['Fireweed Shoots',*]
car 3 vit 4 cal 150
['Prickly Pear Fruit',*]
car 2 pro 1 fat 1 vit 3 cal 190
;
See Chapter 9 of the AMPL Book for variants.
The "default 0" option avoids the need to explicitly list zero values, which can be useful for sparse data sets.
It would be useful to have an AMPL input format that allows a 2-D parameter to be specified in a simple table layout with row and column headers, along the lines of your data step, but I'm not aware of one that does this.

Related

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from dependant on the booking status) - there are lots more no than yes so I would like to take a sample with all the yes and the same amount of no.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do this sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function will sample without replacement. Meaning, you'll receive an error by specifying a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will not pick values from this column that are zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to this in the end, here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([f_class_0_under, df_class_1], axis=0)
df_class_1 = df[df['bookingstatus'] == 1]
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

Changing names of variables using the values of another variable

I am trying to rename around 100 dummy variables with the values from a separate variable.
I have a variable products, which stores information on what products a company sells and have generated a dummy variable for each product using:
tab products, gen(productid)
However, the variables are named productid1, productid2 and so on. I would like these variables to take the values of the variable products instead.
Is there a way to do this in Stata without renaming each variable individually?
Edit:
Here is an example of the data that will be used. There will be duplications in the product column.
And then I have run the tab command to create a dummy variable for each product to produce the following table.
sort product
tab product, gen(productid)
I noticed it updates the labels to show what each variable represents.
What I would like to do is to assign the value to be the name of the variable such as commercial to replace productid1 and so on.
Using your example data:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tabulate product, generate(productid)
list, abbreviate(10)
sort product
levelsof product, local(new) clean
tokenize `new'
ds productid*
local i 0
foreach var of varlist `r(varlist)' {
local ++i
rename `var' ``i''
}
Produces the desired output:
list, abbreviate(10)
+---------------------------------------------------------------------------+
| companyid product Commercial CreditCard EMFunds P2P Retail |
|---------------------------------------------------------------------------|
1. | 3 Commercial 1 0 0 0 0 |
2. | 5 CreditCard 0 1 0 0 0 |
3. | 4 CreditCard 0 1 0 0 0 |
4. | 6 EMFunds 0 0 1 0 0 |
5. | 1 P2P 0 0 0 1 0 |
6. | 2 Retail 0 0 0 0 1 |
+---------------------------------------------------------------------------+
Arbitrary strings might not be legal Stata variable names. This will happen if they (a) are too long; (b) start with any character other than a letter or an underscore; (c) contain characters other than letters, numeric digits and underscores; or (d) are identical to existing variable names. You might be better off making the strings into variable labels, where only an 80 character limit bites.
This code loops over the variables and does its best:
gen long obs = _n
foreach v of var productid? productid?? productid??? {
su obs if `v' == 1, meanonly
local tryit = product[r(min)]
capture rename `v' `=strtoname("`tryit'")'
}
Note: code not tested.
EDIT: Here is a test. I added code for variable labels. The data example and code show that repeated values and values that could not be variable names are accommodated.
clear
input str13 products
"one"
"two"
"one"
"three"
"four"
"five"
"six something"
end
tab products, gen(productsid)
gen long obs = _n
foreach v of var productsid*{
su obs if `v' == 1, meanonly
local value = products[r(min)]
local tryit = strtoname("`value'")
capture rename `v' `tryit'
if _rc == 0 capture label var `tryit' "`value'"
else label var `v' "`value'"
}
drop obs
describe
Contains data
obs: 7
vars: 7
size: 133
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
products str13 %13s
five byte %8.0g five
four byte %8.0g four
one byte %8.0g one
six_something byte %8.0g six something
three byte %8.0g three
two byte %8.0g two
-------------------------------------------------------------------------------
Another solution is to use the extended macro function
local varlabel:variable label
The tested code is:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tab product, gen(product_id)
* get the list of product id variables
ds product_id*
* loop through the product id variables and change the
variable name to its label
foreach var of varlist `r(varlist)' {
local varlabel: variable label `var'
display "`varlabel'"
local pos = strpos("`varlabel'","==")+2
local varlabel = substr("`varlabel'",`pos',.)
display "`varlabel'"
rename `var' `varlabel'
}

Calculate amount of combinations with conditions

I'd like to calculate how many different variations of a certain amount of numbers are possible. The number of elements is variable.
Example:
I have 5 elements and each element can vary between 0 and 8. Only the first element is a bit more defined and can only vary between 1 and 8. So far I'd say I have 8*9^4 possibilities. But I have some more conditions. As soon as one of the elements gets zero the next elements should be automatically zero as well.
E.G:
6 5 4 7 8 is ok
6 3 6 8 0 is ok
3 6 7 0 5 is not possible and would turn to 3 6 7 0 0
Would somebody show me how to calculate the amount of combinations for this case and also in general, because I'd like to be able to calculate it also for 4 or 8 or 9 etc. elements. Later on I'd like to calculate this number in VBA to be able give the user a forecast how long my calculations will take.
Since once a 0 is present in the sequence, all remaining numbers in the sequence will also be 0, these are all of the possibilities: (where # below represents any digit from 1 to 8):
##### (accounts for 8^5 combinations)
####0 (accounts for 8^4 combinations)
...
#0000 (accounts for 8^1 combinations)
Therefore, the answer is (in pseudocode):
int sum = 0;
for (int x = 1; x <= 5; x++)
{
sum = sum + 8^x;
}
Or equivalently,
int prod = 0;
for (int x = 1; x <= 5; x++)
{
prod = 8*(prod+1);
}
great thank you.
Sub test()
Dim sum As Single
Dim x As Integer
For x = 1 To 6
sum = sum + 8 ^ x
Next
Debug.Print sum
End Sub
With this code I get exactly 37488. I tried also with e.g. 6 elements and it worked as well. Now I can try to estimate the calculation time

Getting a positional slice using a Range variable as a subscript

my #numbers = <4 8 15 16 23 42>;
this works:
.say for #numbers[0..2]
# 4
# 8
# 15
but this doesn't:
my $range = 0..2;
.say for #numbers[$range];
# 16
the subscript seems to be interpreting $range as the number of elements in the range (3). what gives?
Working as intended. Flatten the range object into a list with #numbers[|$range] or use binding on Range objects to hand them around. https://docs.perl6.org will be updated shortly.
On Fri Jul 22 15:34:02 2016, gfldex wrote:
> my #numbers = <4 8 15 16 23 42>; my $range = 0..2; .say for
> #numbers[$range];
> # OUTPUT«16␤»
> # expected:
> # OUTPUT«4␤8␤15␤»
>
This is correct, and part of the "Scalar container implies item" rule.
Changing it would break things like the second evaluation here:
> my #x = 1..10; my #y := 1..3; #x[#y]
(2 3 4)
> #x[item #y]
4
Noting that since a range can bind to #y in a signature, then Range being a
special case would make an expression like #x[$(#arr-param)]
unpredictable in its semantics.
> # also binding to $range provides the expected result
> my #numbers = <4 8 15 16 23 42>; my $range := 0..2; .say for
> #numbers[$range];
> # OUTPUT«4␤8␤15␤»
> y
This is also expected, since with binding there is no Scalar container to
enforce treatment as an item.
So, all here is working as designed.
A symbol bound to a Scalar container yields one thing
Options for getting what you want include:
Prefix with # to get a plural view of the single thing: numbers[#$range]; OR
declare the range variable differently so it works directly
For the latter option, consider the following:
# Bind the symbol `numbers` to the value 1..10:
my \numbers = [0,1,2,3,4,5,6,7,8,9,10];
# Bind the symbol `rangeA` to the value 1..10:
my \rangeA := 1..10;
# Bind the symbol `rangeB` to the value 1..10:
my \rangeB = 1..10;
# Bind the symbol `$rangeC` to the value 1..10:
my $rangeC := 1..10;
# Bind the symbol `$rangeD` to a Scalar container
# and then store the value 1..10 in it:`
my $rangeD = 1..10;
# Bind the symbol `#rangeE` to the value 1..10:
my #rangeE := 1..10;
# Bind the symbol `#rangeF` to an Array container and then
# store 1 thru 10 in the Scalar containers 1 thru 10 inside the Array
my #rangeF = 1..10;
say numbers[rangeA]; # (1 2 3 4 5 6 7 8 9 10)
say numbers[rangeB]; # (1 2 3 4 5 6 7 8 9 10)
say numbers[$rangeC]; # (1 2 3 4 5 6 7 8 9 10)
say numbers[$rangeD]; # 10
say numbers[#rangeE]; # (1 2 3 4 5 6 7 8 9 10)
say numbers[#rangeF]; # (1 2 3 4 5 6 7 8 9 10)
A symbol that's bound to a Scalar container ($rangeD) always yields a single value. In a [...] subscript that single value must be a number. And a range, treated as a single number, yields the length of that range.

Hash function to iterate through a matrix

Given a NxN matrix and a (row,column) position, what is a method to select a different position in a random (or pseudo-random) order, trying to avoid collisions as much as possible?
For example: consider a 5x5 matrix and start from (1,2)
0 0 0 0 0
0 0 X 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
I'm looking for a method like
(x,y) hash (x,y);
to jump to a different position in the matrix, avoiding collisions as much as possible
(do not care how to return two different values, it doesn't matter, just think of an array).
Of course, I can simply use
row = rand()%N;
column = rand()%N;
but it's not that good to avoid collisions.
I thought I could apply twice a simple hash method for both row and column and use the results as new coordinates, but I'm not sure this is a good solution.
Any ideas?
Can you determine the order of the walk before you start iterating? If your matrices are large, this approach isn't space-efficient, but it is straightforward and collision-free. I would do something like:
Generate an array of all of the coordinates. Remove the starting position from the list.
Shuffle the list (there's sample code for a Fisher-Yates shuffle here)
Use the shuffled list for your walk order.
Edit 2 & 3: A modular approach: Given s array elements, choose a prime p of form 2+3*n, p>s. For i=1 to p, use cells (iii)%p when that value is in range 1...s-1. (For row-length r, cell #c subscripts are c%r, c/r.)
Effectively, this method uses H(i) = (iii) mod p as a hash function. The reference shows that as i ranges from 1 to p, H(i) takes on each of the values from 0 to p-1, exactly one time each.
For example, with s=25 and p=29 or 47, this uses cells in following order:
p=29: 1 8 6 9 13 24 19 4 14 17 22 18 11 7 12 3 15 10 5 16 20 23 2 21 0
p=47: 1 8 17 14 24 13 15 18 7 4 10 2 6 21 3 22 9 12 11 23 5 19 16 20 0
according to bc code like
s=25;p=29;for(i=1;i<=p;++i){t=(i^3)%p; if(t<s){print " ",t}}
The text above shows the suggestion I made in Edit 2 of my answer. The text below shows my first answer.
Edit 0: (This is the suggestion to which Seamus's comment applied): A simple method to go through a vector in a "random appearing" way is to repeatedly add d (d>1) to an index. This will access all elements if d and s are coprime (where s=vector length). Note, my example below is in terms of a vector; you could do the same thing independently on the other axis of your matrix, with a different delta for it, except a problem mentioned below would occur. Note, "coprime" means that gcd(d,s)=1. If s is variable, you'd need gcd() code.
Example: Say s is 10. gcd(s,x) is 1 for x in {1,3,7,9} and is not 1 for x in {2,4,5,6,8,10}. Suppose we choose d=7, and start with i=0. i will take on values 0, 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, which modulo 10 is 0, 7, 4, 1, 8, 5, 2, 9, 6, 3, 0.
Edit 1 & 3: Unfortunately this will have a problem in the two-axis case; for example, if you use d=7 for x axis, and e=3 for y-axis, while the first 21 hits will be distinct, it will then continue repeating the same 21 hits. To address this, treat the whole matrix as a vector, use d with gcd(d,s)=1, and convert cell numbers to subscripts as above.
If you just want to iterate through the matrix, what is wrong with row++; if (row == N) {row = 0; column++}?
If you iterate through the row and the column independently, and each cycles back to the beginning after N steps, then the (row, column) pair will interate through only N of the N^2 cells of the matrix.
If you want to iterate through all of the cells of the matrix in pseudo-random order, you could look at questions here on random permutations.
This is a companion answer to address a question about my previous answer: How to find an appropriate prime p >= s (where s = the number of matrix elements) to use in the hash function H(i) = (i*i*i) mod p.
We need to find a prime of form 3n+2, where n is any odd integer such that 3*n+2 >= s. Note that n odd gives 3n+2 = 3(2k+1)+2 = 6k+5 where k need not be odd. In the example code below, p = 5+6*(s/6); initializes p to be a number of form 6k+5, and p += 6; maintains p in this form.
The code below shows that half-a-dozen lines of code are enough for the calculation. Timings are shown after the code, which is reasonably fast: 12 us at s=half a million, 200 us at s=half a billion, where us denotes microseconds.
// timing how long to find primes of form 2+3*n by division
// jiw 20 Sep 2011
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
double ttime(double base) {
struct timeval tod;
gettimeofday(&tod, NULL);
return tod.tv_sec + tod.tv_usec/1e6 - base;
}
int main(int argc, char *argv[]) {
int d, s, p, par=0;
double t0=ttime(0);
++par; s=5000; if (argc > par) s = atoi(argv[par]);
p = 5+6*(s/6);
while (1) {
for (d=3; d*d<p; d+=2)
if (p%d==0) break;
if (d*d >= p) break;
p += 6;
}
printf ("p = %d after %.6f seconds\n", p, ttime(t0));
return 0;
}
Timing results on 2.5GHz Athlon 5200+:
qili ~/px > for i in 0 00 000 0000 00000 000000; do ./divide-timing 500$i; done
p = 5003 after 0.000008 seconds
p = 50021 after 0.000010 seconds
p = 500009 after 0.000012 seconds
p = 5000081 after 0.000031 seconds
p = 50000021 after 0.000072 seconds
p = 500000003 after 0.000200 seconds
qili ~/px > factor 5003 50021 500009 5000081 50000021 500000003
5003: 5003
50021: 50021
500009: 500009
5000081: 5000081
50000021: 50000021
500000003: 500000003
Update 1 Of course, timing is not determinate (ie, can vary substantially depending on the value of s, other processes on machine, etc); for example:
qili ~/px > time for i in 000 004 010 058 070 094 100 118 184; do ./divide-timing 500000$i; done
p = 500000003 after 0.000201 seconds
p = 500000009 after 0.000201 seconds
p = 500000057 after 0.000235 seconds
p = 500000069 after 0.000394 seconds
p = 500000093 after 0.000200 seconds
p = 500000099 after 0.000201 seconds
p = 500000117 after 0.000201 seconds
p = 500000183 after 0.000211 seconds
p = 500000201 after 0.000223 seconds
real 0m0.011s
user 0m0.002s
sys 0m0.004s
Consider using a double hash function to get a better distribution inside the matrix,
but given that you cannot avoid colisions, what I suggest is to use an array of sentinels
and mark the positions you visit, this way you are sure you get to visit a cell once.