Simulate data for repeated binary measures - variables

I can generate a binary variable y as follows:
clear
set more off
gen y =.
replace y = rbinomial(1, .5)
How can I generate n variables y_1, y_2, ..., y_n with a correlation of rho?

This is #pjs's solution in Stata for generating pairs of variables:
clear
set obs 100
set seed 12345
generate x = rbinomial(1, 0.7)
generate y = rbinomial(1, 0.7 + 0.2 * (1 - 0.7)) if x == 1
replace y = rbinomial(1, 0.7 * (1 - 0.2)) if x != 1
summarize x y
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
x | 100 .72 .4512609 0 1
y | 100 .67 .4725816 0 1
correlate x y
(obs=100)
| x y
-------------+------------------
x | 1.0000
y | 0.1781 1.0000
And a simulation:
set seed 12345
tempname sim1
tempfile mcresults
postfile `sim1' mu_x mu_y rho using `mcresults', replace
forvalues i = 1 / 100000 {
quietly {
clear
set obs 100
generate x = rbinomial(1, 0.7)
generate y = rbinomial(1, 0.7 + 0.2 * (1 - 0.7)) if x == 1
replace y = rbinomial(1, 0.7 * (1 - 0.2)) if x != 1
summarize x, meanonly
scalar mean_x = r(mean)
summarize y, meanonly
scalar mean_y = r(mean)
corr x y
scalar rho = r(rho)
post `sim1' (mean_x) (mean_y) (rho)
}
}
postclose `sim1'
use `mcresults', clear
summarize *
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
mu_x | 100,000 .7000379 .0459078 .47 .89
mu_y | 100,000 .6999094 .0456385 .49 .88
rho | 100,000 .1993097 .1042207 -.2578483 .6294388
Note that in this example I use p = 0.7 and rho = 0.2 instead.

This is #pjs's solution in Stata for generating a time-series:
clear
set seed 12345
set obs 1
local p = 0.7
local rho = 0.5
generate y = runiform()
if y <= `p' replace y = 1
else replace y = 0
forvalues i = 1 / 99999 {
set obs `= _N + 1'
local rnd = runiform()
if y[`i'] == 1 {
if `rnd' <= `p' + `rho' * (1 - `p') replace y = 1 in `= `i' + 1'
else replace y = 0 in `= `i' + 1'
}
else {
if `rnd' <= `p' * (1 - `rho') replace y = 1 in `= `i' + 1'
else replace y = 0 in `= `i' + 1'
}
}
Results:
summarize y
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
y | 100,000 .70078 .4579186 0 1
generate id = _n
tsset id
corrgram y, lags(5)
-1 0 1 -1 0 1
LAG AC PAC Q Prob>Q [Autocorrelation] [Partial Autocor]
-------------------------------------------------------------------------------
1 0.5036 0.5036 25366 0.0000 |---- |----
2 0.2567 0.0041 31955 0.0000 |-- |
3 0.1273 -0.0047 33576 0.0000 |- |
4 0.0572 -0.0080 33903 0.0000 | |
5 0.0277 0.0032 33980 0.0000 | |

Correlation is a pairwise measure, so I'm assuming that when you talk about binary (Bernoulli) values Y1,...,Yn having a correlation of rho you're viewing them as a time series Yi: i = 1,...,n, of Bernoulli values having a common mean p, variance p*(1-p), and a lag 1 correlation of rho.
I was able to work it out using the definition of correlation and conditional probability. Given it was a bunch of tedious algebra and stackoverflow doesn't do math gracefully, I'm jumping straight to the result, expressed in pseudocode:
if Y[i] == 1:
generate Y[i+1] as Bernoulli(p + rho * (1 - p))
else:
generate Y[i+1] as Bernoulli(p * (1 - rho))
As a sanity check you can see that if rho = 0 it just generates Bernoulli(p)'s, regardless of the prior value. As you already noted in your question, Bernoulli RVs are binomials with n = 1.
This works for all 0 <= rho, p <= 1. For negative correlations, there are constraints on the relative magnitudes of p and rho so that the parameters of the Bernoullis are always between 0 and 1.
You can analytically check the conditional probabilities to confirm correctness. I don't use Stata, but I tested this pretty thoroughly in the JMP statistical software and it works like a charm.
IMPLEMENTATION (Python)
import random
def Bernoulli(p):
return 1 if random.random() <= p else 0 # yields 1 w/ prob p, 0 otherwise
N = 100000
p = 0.7
rho = 0.5
last_y = Bernoulli(p)
for _ in range(N):
if last_y == 1:
last_y = Bernoulli(p + rho * (1 - p))
else:
last_y = Bernoulli(p * (1 - rho))
print(last_y)
I ran this and redirected the results to a file, then imported the file into JMP. Analyzing it as a time series produced:
The sample mean was 0.69834, with a standard deviation of 0.4589785 [upper right of the figure]. The lag-1 estimates for autocorrelation and partial correlation are 0.5011 [bottom left and right, respectively]. These estimated values are all excellent matches to a Bernoulli(0.7) with rho = 0.5, as specified in the demo program.
If the goal is instead to produce (X,Y) pairs with the specified correlation, revise the loop to:
for _ in range(N):
x = Bernoulli(p)
if x == 1:
y = Bernoulli(p + rho * (1 - p))
else:
y = Bernoulli(p * (1 - rho))
print(x, y)

Related

Division by Zero error in calculating series

I am trying to compute a series, and I am running into an issue that I don't know why is occurring.
"RuntimeWarning: divide by zero encountered in double_scalars"
When I checked the code, it didn't seem to have any singularities, so I am confused. Here is the code currently(log stands for natural logarithm)(edit: extending code if that helps):
from numpy import pi, log
#Create functions to calculate the sums
def phi(z: int):
k = 0
phi = 0
#Loop through 1000 times to try to approximate the series value as if it went to infinity
while k <= 100:
phi += ((1/(k+1)) - (1/(k+(2*z))))
k += 1
return phi
def psi(z: int):
psi = 0
k = 1
while k <= 101:
psi += ((log(k))/( k**(2*z)))
k += 1
return psi
def sig(z: int):
sig = 0
k = 1
while k <= 101:
sig += ((log(k))**2)/(k^(2*z))
k += 1
return sig
def beta(z: int):
beta = 0
k = 1
while k <= 101:
beta += (1/(((2*z)+k)^2))
k += 1
return beta
#Create the formula to approximate the value. For higher accuracy, either calculate more derivatives of Bernoulli numbers or increase the boundry of k.
def Bern(z :int):
#Define Euler–Mascheroni constant
c = 0.577215664901532860606512
#Begin computations (only approximation)
B = (pi/6) * (phi(1) - c - 2 * log(2 * pi) - 1) - z * ((pi/6) * ((phi(1)- c - (2 * log(2 * pi)) - 1) * (phi(1) - c) + beta(1) - 2 * psi(1)) - 2 * (psi(1) * (phi(1) - c) + sig(1) + 2 * psi(1) * log(2 * pi)))
#output
return B
A = int(input("Choose any value: "))
print("The answer is", Bern(A + 1))
Any help would be much appreciated.
are you sure you need a ^ bitwise exclusive or operator instead of **? I've tried to run your code with input parameter z = 1. And on a second iteration the result of k^(2*z) was equal to 0, so where is from zero division error come from (2^2*1 = 0).

Visual basic function for intermittent antibiotic dosing

I am a beginner in VB. I wrote a little program to simulate dosing regimens of antibiotics using some exponential equations and pharmacokinetic data.
The problem I have is that I want to display on a graph the following mathematical function:
That simulates the concentration variation at different intervals of time:
Where:
b(t) is the concentration at time t that will be plotted as Y axis, t is time (plotted on the x-axis).
b(0) is the concentration at time 0 and it is a known variable.
u(t-a1) is a function that has the value u(t-a1)=b(o) if t=a1 or 0 if t<>a1
a1 is the time at which a next dose is given.
alpha is the elimination rate constant, a variable that is known.
What I have so far:
Dim y, x As Double
For x = 0 To 24 Step 1
For n As Double = 1 To 24 / tau
y = (1 - test_condition(n * tau, x)) * css * Math.Exp(-ke * x) + test_condition(n * tau, x) * css * Math.Exp(-ke * (x - n * tau))
Chart1.Series("Concentratie1").Points.AddXY(x, y)
Next
Next
The test_condition:
if x=tau then test_condition= 1 else 0
It is close but I don't get an exponential decay after a dose ... don't know how to make that happen.
This works!! for tau (dosing interval) every 4 hours. Can it be rearranged somehow because the tau (dosing interval) will vary (sometimes 4 hours, sometimes every 6 hours)?:
Dim y, x, y2, x2, y3, x3, y4, x4, x5, x6, y5, y6 As Double
For x = 0 To tau Step 1
y = exponential_decay(css, ke, x) + test_condition(tau, x) * (css - Val(mic))
Chart1.Series("Bolus 1").Points.AddXY(x, y)
Next
For x2 = tau To 2 * tau Step 1
y2 = exponential_decay(css, ke, x2 - tau) + test_condition(2 * tau, x2) * (css - Val(mic))
Chart1.Series("Bolus 2").Points.AddXY(x2, y2)
Next
For x3 = 2 * tau To 3 * tau Step 1
y3 = exponential_decay(css, ke, x3 - 2 * tau) + test_condition(3 * tau, x3) * (css - Val(mic))
Chart1.Series("Bolus 3").Points.AddXY(x3, y3)
Next
For x4 = 3 * tau To 4 * tau Step 1
y4 = exponential_decay(css, ke, x4 - 3 * tau) + test_condition(4 * tau, x4) * (css - Val(mic))
Chart1.Series("Bolus 4").Points.AddXY(x4, y4)
Next
For x5 = 4 * tau To 5 * tau Step 1
y5 = exponential_decay(css, ke, x5 - 4 * tau) + test_condition(5 * tau, x5) * (css - Val(mic))
Chart1.Series("Bolus 4").Points.AddXY(x5, y5)
Next
For x6 = 5 * tau To 32 Step 1
y6 = exponential_decay(css, ke, x6 - 5 * tau)
Chart1.Series("Bolus 4").Points.AddXY(x6, y6)
Next
End Sub
I managed to solve the problem:
this function f relates time (t) to dosing interval (tau)
Private Function f(ByVal t As Double, ByVal tau As Double)
Dim n As Integer
For n = 0 To 24 / tau
If t = n * tau Then
f = n * tau
ElseIf t < tau Then
f = 0
ElseIf t > n * tau And t < (n + 1) * tau Then
f = n * tau
ElseIf t >= (n + 1) * tau Then
f = n * tau
End If
Next
End Function
And this is what I draw on the chart:
For x = 0 To 36 Step 0.5
y = exponential_decay(css, ke, x - f(x, tau))
Chart1.Series("Intermitent Dosage").Points.AddXY(x, y)
Next

algorithm to deal with series of values

With a series with a START, INCREMENT, and MAX:
START = 100
INCREMENT = 30
MAX = 315
e.g. 100, 130, 160, 190, 220, 250, 280, 310
Given an arbitrary number X return:
the values remaining in the series where the first value is >= X
the offset Y (catch up amount needed to get from X to first value of the series).
Example
In:
START = 100
INCREMENT = 30
MAX = 315
X = 210
Out:
Y = 10
S = 220, 250, 280, 310
UPDATE -- From MBo answer:
float max = 315.0;
float inc = 30.0;
float start = 100.0;
float x = 210.0;
float k0 = ceil( (x-start) / inc) ;
float k1 = floor( (max - start) / inc) ;
for (int i=k0; i<=k1; i++)
{
NSLog(#" output: %d: %f", i, start + i * inc);
}
output: 4: 220.000000
output: 5: 250.000000
output: 6: 280.000000
output: 7: 310.000000
MBo integer approach will be nicer.
School math:
Start + k0 * Inc >= X
k0 * Inc >= X - Start
k0 >= (X - Start) / Inc
Programming math:
k0 = Ceil(1.0 * (X - Start) / Inc)
k1 = Floor(1.0 * (Max - Start) / Inc)
for i = k0 to k1 (including both ends)
output Start + i * Inc
Integer math:
k0 = (X - Start + Inc - 1) / Inc //such integer division makes ceiling
k1 = (Max - Start) / Inc //integer division makes flooring
for i = k0 to k1 (including both ends)
output Start + i * Inc
Example:
START = 100
INCREMENT = 30
MAX = 315
X = 210
k0 = Ceil((210 - 100) / 30) = Ceil(3.7) = 4
k1 = Floor((315 - 100) / 30) = Floor(7.2) = 7
first 100 + 4 * 30 = 220
last 100 + 7 * 30 = 310
Solve the inequation
X <= S + K.I <= M
This is equivalent to
K0 = Ceil((X - S) / I) <= K <= Floor((M - S) / I) = K1
and
Y = X - (S + K0.I).
Note that it is possible to have K0 > K1, and there is no solution.

Project Euler 3

I have been trying to solve question 3 on project euler with the following vb code but I do not under stand why it is not working. Can someone please point me in the right direction?
Sub Main()
Dim p As Int64 = 600851475143
Dim y As Integer
For i As Int64 = p / 2 To 1 Step -1
If p Mod i = 0 Then
y = 0
For n As Int64 = 1 To Math.Floor(i ^ 0.5) Step 1
If i Mod n = 0 Then
y = y + 1
End If
Next
If y = 0 Then
Console.WriteLine(i)
Console.ReadLine()
End If
End If
Next
End Sub
The question is "The prime factors of 13195 are 5, 7, 13 and 29.
What is the largest prime factor of the number 600851475143 ?"
For n As Int64 = 1 To Math.Floor(i ^ 0.5) Step 1
If i Mod n = 0 Then
You start with n=1. Every number divides evenly by 1.
(so y = y+1 every time, and If y = 0 Then can never happen).

Proc Optmodel Conditional Constraint SAS

I'm fairly new to proc optmodel and having trouble sorting out some syntax.
Here is my dataset.
data opt_test;
input ID GRP $ x1 MIN MAX y z;
cards;
2 F 10 9 11 1.5 100
3 F 10 9 11 1.2 50
4 F 11 9 11 .9 20
8 G 5 4 6 1.2 300
9 G 6 4 6 .9 200
1 H 21 18 22 1.2 300
7 H 20 18 22 .8 1000
;
run;
There are a few things going on here:
The IDs within a GRP must have the same x2, which is constrained by MIN and MAX. I now wish to further constrain the increase/decrease of x2 based on the value of y. If y<1, I do not want x2 to go below .95*x1. If y>1, I do not want x2 to exceed 1.05*x1. I've looked online and tried a few things to make this happen. Here is my latest attempt, cond_1 and cond_2 are the problems of interest, as everything else works:
proc optmodel;
set<num> ID;
string GRP{ID};
set GRPS = setof{i in ID} GRP[i];
set IDperGRP{gi in GRPS} = {i in ID: GRP[i] = gi};
number x1{ID};
number MIN{ID};
number MAX{ID};
var x2{gi in GRPS} >= max{i in IDperGRP[gi]} MIN[i]
<= min{i in IDperGRP[gi]} MAX[i]
;
impvar x2byID{i in ID} = x2[GRP[i]];
number y{ID};
number z{ID};
read data opt_test into
ID=[ID]
GRP
x1
MIN
MAX
y
z
;
max maximize = sum{gi in GRPS} sum{i in IDperGRP[gi]}
(x2[gi]) * (1-(x2[gi]-x1[i])*y[i]/x1[i]) * z[i];
con cond_1 {i in ID}: x2[i] >=
if y[i]<1 then .95*x1[i] else 0;
con cond_2 {i in ID}: x2[i] <=
if y[i]>=1 then 1.05*x1[i] else 99999999;
solve;
create data results from [ID]={ID} x2=x2byID GRP x1 MIN MAX y z;
print x2 maximize;
quit;
I would calculate the global max and min in a data step outside of PROC OPTMODEL and then set the values. Like so:
data opt_test;
set opt_test;
if y < 1 then
min2 = .95*x1;
else
min2 = 0;
if y>=1 then
max2 = 1.05*x1;
else
max2 = 9999999999;
Min_old = min;
max_old = max;
MIN = max(min,min2);
MAX = min(max,max2);
run;
But you have a problem with group G. Use expand to see it.
proc optmodel;
set<num> ID;
string GRP{ID};
set GRPS = setof{i in ID} GRP[i];
set IDperGRP{gi in GRPS} = {i in ID: GRP[i] = gi};
number x1{ID};
number MIN{ID};
number MAX{ID};
var x2{gi in GRPS} >= max{i in IDperGRP[gi]} MIN[i]
<= min{i in IDperGRP[gi]} MAX[i]
;
impvar x2byID{i in ID} = x2[GRP[i]];
number y{ID};
number z{ID};
read data opt_test into
ID=[ID]
GRP
x1
MIN
MAX
y
z
;
max maximize = sum{gi in GRPS} sum{i in IDperGRP[gi]}
(x2[gi]) * (1-(x2[gi]-x1[i])*y[i]/x1[i]) * z[i];
/*con cond_1 {i in ID}: x2[i] >=
if y[i]<1 then .95*x1[i] else 0;
con cond_2 {i in ID}: x2[i] <=
if y[i]>=1 then 1.05*x1[i] else 99999999;*/
expand;
solve;
create data results from [ID]={ID} x2=x2byID GRP x1 MIN MAX y z;
print x2 maximize;
quit;
You will see that X2[G] is infeasible:
Var x2[G] >= 5.7 <= 5.25
X2[G] starts in [4,6];
For ID=8, X=5 and Y=1.2. By your logic, this sets the max to 5.25 (5*1.2).
Now X2[G] in [4,5.25]
For ID=9, X=6 and Y=0.9. By your logic this sets the min to 5.7 (0.95*6).
X2[G] in [5.7,5.25] <-- BAD!
The biggest problem with the model in the question is that the indexing of the var x2 is incorrect. You could fix that by referring to the group of the ID as such:
con cond_1 {i in ID}: x2[GRP[i]] >=
if y[i] < 1 then .95*x1[i] else 0;
con cond_2 {i in ID}: x2[GRP[i]] <=
if y[i]>=1 then 1.05*x1[i] else 99999999;
but a description of the constraint that reads closer to the business problem is to put a filter in the constraint definition itself:
con Cond_1v2 {i in ID: y[i] < 1} : x2[GRP[i]] >= .95 * x1[i];
con Cond_2v2 {i in ID: Y[i] >= 1}: x2[GRP[i]] <= 1.05 * x1[i];
In either case, the problem becomes infeasible because of the constraint Cond2v2, as you can see using expand (as #DomPazz pointed out), and in particular the expand / iis option, which will print the conflicting constraints when it can determine them:
solve with nlp / iis=on;
expand / iis;