Generate a variable for an entire household based on another variable for cross sectional data - variables

I have a dataset on various households. Each household has various individuals. I want to take one entry (individual) for a specific variable and apply it to the entire household. Finally, I want to run a regression on the household-level data and ignore the individuals. For example:
Household id | Individual id | Variable 1
H1 | A1 |a
H1 | A2 | .
H1 | A3 | a
H1 | A4 | a
H2 | A1 | B
H2 | A2 | .
H3 | A1 | B
H3 | A2 | a
H3 | A3 | .
H3 | A4 | a
H3 | A5 | .
H3 | A6 | .
I want to create a dataset where:
Household id Variable 1
H1 | a
H2 | B
H3 | B
EDIT: It is possible to do this through egen, but I was trying to approach it through nested loops:
foreach h in household_id{
    local d = 0
    local e = 0
    foreach i in individual_id{
        replace `d' = 1 if var1 == 1
        replace `e' = 2 if var1 == 2
    }
    foreach i in individual_id{
        replace a = 1 if `d' == 1
        replace a = 2 if `e' == 2
    }
}
Here a is another variable, which I will use directly to replace var1. The code gives an error:
0 invalid name
What are the corrections required in my code? Is there a fundamental flaw in my usage of local macros? Using tempvar didn't help either.

The code shown is very confused.
A loop over a single entity that is never referred to inside the loop is legal, but it is just equivalent to the statements inside, executed once.
Notice that you never refer to the local macros h or i within your loops.
Thus your code is equivalent to
local d = 0
local e = 0
replace `d' = 1 if var1 == 1
replace `e' = 2 if var1 == 2
replace a = 1 if `d' == 1
replace a = 2 if `e' == 2
That fails when the local macro d is first replaced by its contents, which results in
replace 0 = 1 if var1 == 1
and the error message is correct because 0 is not a valid variable name. The next statement would raise the same problem.
Stata never gets as far as testing
... if 0 == 1
... if 0 == 2
and those statements, although legal, would do nothing because the conditions will never be satisfied.
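The substitution step described above can be imitated in a few lines of Python. This is only a rough sketch of the idea of plain text substitution, not of how Stata is implemented; the helper name expand_macros is made up for illustration.

```python
def expand_macros(command: str, macros: dict) -> str:
    """Replace each `name' reference with the macro's contents, as plain text.

    Hypothetical helper: it only mimics the textual substitution that happens
    before a command is interpreted.
    """
    for name, value in macros.items():
        command = command.replace(f"`{name}'", value)
    return command


# local d = 0 and local e = 0 store the text "0" under the names d and e
macros = {"d": "0", "e": "0"}

expanded = expand_macros("replace `d' = 1 if var1 == 1", macros)
print(expanded)  # replace 0 = 1 if var1 == 1
```

After substitution the command asks to replace something called 0, which is why the complaint is about an invalid name rather than about the loop.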
So far, so negative. I can't suggest code positively because I don't understand what you want to do. What you have posted doesn't look like real or realistic data. I suggest that you look at http://www.statalist.org/forums/help#stata which suggests how to post Stata data examples (and so is relevant here) and also at https://stackoverflow.com/help/mcve (which does lay down the standard here).

Confusing matching behaviour of pandas extract(all)

I have a strange problem. But first, some background: I want to match hierarchy-based strings against the values of a column in a pandas data frame and count the occurrences of each node together with all of its children.
| index | hierarchystr |
| ----- | --------------------- |
| 0 | level0level00level000|
| 1 | level0level01 |
| 2 | level0level02level021|
| 3 | level0level02level021|
| 4 | level0level02level020|
| 5 | level0level02level021|
| 6 | level1level02level021|
| 7 | level1level02level021|
| 8 | level1level02level021|
| 9 | level2level02level021|
Assume that there are 300k rows. Each node can have multiple children, which can again have multiple children, and so on (represented here by the level0-2 strings). I have a separate hierarchy from which I extract the hierarchy strings. Now to the problem:
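# Example list; the real one has thousands of node names.
hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]
pat = "|".join(hstrs)
# extract() keeps only the first match found in each row
s = df.hierarchystr.str.extract('(' + pat + ')', expand=True)[0]
df1 = df.groupby(s).size().reset_index(name='Count')
df1 = df1[df1['Count'] > 200]  # keep only groups with more than 200 rows
size = len(df1)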
The number of matched substrings with more than 200 occurrences differs on every RUN! "level0" should match every row whose hierarchy string includes level0, forming a group with all of its subchildren, and the size of that group needs to be greater than 200.
Edit:// levelX is just an example; I have thousands of nodes with different names and, again, thousands of different subchildren. The hstrs strings do not include each other, except for the parent nodes. (E.g. "parent1" is included in "parent1subchild1" and "parent1subchild2".)
I traced it back to a different order of the hierarchy strings in the array hstrs. So I changed the code and compare each substring individually:
matched = []  # node names with more than 200 matching rows
for hstr in hstrs:
    s = df.hierarchystr.str.extract('(' + hstr + ')', expand=True)
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        matched.append(hstr)
This is slow as hell, but the result stays the same no matter how hstrs is ordered. For efficiency, is it possible to do the same with only one regex matching group, all at once for all hstrs?
Edit://
expected output would be:
|index| 0 | Count |
|-----|---------------------|-------|
|0 |level0 | 5 |
|1 |level1 | 3 |
|2 |level0level01 | 1 |
|3 |level0level02 | 4 |
|4 |level0level02level021| 3 |
Edit2://
It has something to do with the ordering of hstrs. I think it is the match-and-stop-after-the-first-match behavior of the extract method. If the ordering is different, the hierarchy strings in pat are matched differently, which results in different sizes for each group. If a high hierarchy level (short string) is matched first, the lower hierarchy levels in the same pat won't be matched for that row. But I don't know what to do about this behavior.
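The order dependence suspected here can be reproduced with Python's standard re module alone, without pandas. This is a minimal sketch using two of the example node names; it only illustrates how regex alternation behaves and is not a fix for the counting problem.

```python
import re

row = "level0level02level021"

# Same two alternatives, joined in a different order.
short_first = re.compile("(" + "|".join(["level0", "level0level02"]) + ")")
long_first = re.compile("(" + "|".join(["level0level02", "level0"]) + ")")

# At a given position the regex engine tries alternatives left to right and
# stops at the first one that matches, so the order inside the pattern decides
# which node name a row is attributed to.
first_hit_short = short_first.search(row).group(1)
first_hit_long = long_first.search(row).group(1)

print(first_hit_short)  # level0
print(first_hit_long)   # level0level02
```

Each row ends up in exactly one group, so a row counted under a child is never counted under its parent as well, and the group sizes shift whenever the order of hstrs changes.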
Edit3://
An alternative would be the following, but it is also slow as hell:
beforeset = []  # node names with more than 200 matching rows
for hstr in hstrs:
    s = df[df.hierarchystr.str.contains(hstr)]
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
Edit4://
I think what I am looking for is a way to do a "group_by" with "contains" or "is in" for the hstrs. I would be glad for any ideas. :)
Edit5://
Found a simple, but not satisfying alternative (but faster than the previous tries):
from collections import Counter

# one entry per (row, node name) pair where the node name occurs in the row
containing = [item for hierarchystr in df.hierarchystr for item in hstrs if item in hierarchystr]
containing = Counter(containing)
df1 = pd.DataFrame([containing]).T
nodeNamesWithOver200 = df1[df1 > 200].dropna().index.values

What is the meaning of the "Load" column in Apache balancer-manager?

I've set up the Apache (2.4) load-balancer which is working okay. To monitor its performance, I enabled the balancer-manager handler, which shows the status of the balancers.
I noticed a "Load" column, which was not present in version 2.2, with a value that may be negative, but I don't understand its meaning, nor was I able to find any documentation about it.
Can anyone explain the meaning of that value or point me to the right documentation?
I now understand how the calculation of "Load" works. Here is what I think is a simpler example than the one on the Apache documentation page.
Let's say we have 3 workers, each with a configured load factor of 1.
1) Start
a | b | c
--+---+---
0 | 0 | 0
add the load factor of 1 to all workers
a | b | c
--+---+---
1 | 1 | 1
now select the one with highest value --> a and decrease by the sum of the factor of all (=3) - this is the selected worker
a | b | c
---+---+---
-2 | 1 | 1
2) next round, add again 1 to all
a | b | c
---+---+---
-1 | 2 | 2
now select the one with highest value --> b and decrease by the sum of the factor of all (=3) - this is the selected worker
a | b | c
---+----+----
-1 | -1 | 2
3) next round, add again 1
a | b | c
---+----+----
0 | 0 | 3
now select the one with highest value --> c and decrease by the sum of the factor of all (=3) - this is the selected worker
a | b | c
---+----+----
0 | 0 | 0
and it starts over again :)
I hope this helps others.
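For anyone who wants to play with other load factors, the rounds above can be replayed with a small Python sketch. It follows the steps as described in this answer (add each worker's factor, pick the highest value, subtract the sum of all factors); the function name is made up, and the tie-breaking rule (the first worker wins a tie) is an assumption chosen to reproduce the tables above.

```python
def simulate_byrequests(factors, rounds):
    """Replay the Load bookkeeping described above.

    factors: dict mapping worker name -> load factor (insertion order matters).
    Returns a list of (selected_worker, load_values_after_selection).
    """
    status = {name: 0 for name in factors}
    total = sum(factors.values())
    history = []
    for _ in range(rounds):
        # every worker gains its own load factor
        for name in status:
            status[name] += factors[name]
        # assumption: on a tie the first worker in order is selected
        chosen = max(status, key=status.get)
        # the selected worker pays the sum of all factors
        status[chosen] -= total
        history.append((chosen, dict(status)))
    return history


history = simulate_byrequests({"a": 1, "b": 1, "c": 1}, rounds=3)
for chosen, values in history:
    print(chosen, values)
```

With three equal factors the values return to 0 | 0 | 0 after three rounds, which matches the last table.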
The Load value is populated by lbstatus based on this line of code:
ap_rprintf(r, "<td>%d</td><td>", worker->s->lbstatus);
in https://svn.apache.org/viewvc/httpd/httpd/trunk/modules/proxy/mod_proxy_balancer.c?view=markup#l1767 (the line number may change as the code is modified)
Since your method is byrequests, lbstatus is maintained by mod_lbmethod_byrequests, which defines it as:
lbstatus is how urgent this worker has to work to fulfill its quota of
work.
Details on the algorithm can be found here: https://httpd.apache.org/docs/2.4/mod/mod_lbmethod_byrequests.html
I too want to know the description of the other columns like BUSY, ELECTED, etc. My LB already has BUSY over 100. I thought BUSY should not exceed 100 (as in 100% server busyness or something).

How to set manually split long rows size on Octave's Terminal Output?

How can I manually set the width at which long rows are split in Octave's terminal output?
I am using Octave through the Sublime Text output build panel, and Octave does not correctly detect how many columns it can use to fill the screen.
Example, It is currently filling the screen like this:
octave:13> rand (2,10)
ans =
Columns 1 through 6:
0.75883 0.93290 0.40064 0.43818 0.94958 0.16467
0.75697 0.51942 0.40031 0.61784 0.92309 0.40201
Columns 7 through 10:
0.90174 0.11854 0.72313 0.73326
0.44672 0.94303 0.56564 0.82150
But I want it to show 10 columns at once (Columns 1 through 10) instead of Columns 1 through 6.
If I disable split_long_rows, it never splits.
Query or set the internal variable that controls whether rows of a
matrix may be split when displayed to a terminal window.
If the rows are split, Octave will display the matrix in a series of
smaller pieces, each of which can fit within the limits of your
terminal width and each set of rows is labeled so that you can easily
see which columns are currently being displayed.
https://www.gnu.org/software/octave/doc/v4.0.1/Matrices.html#XREFsplit_005flong_005frows
You cannot split them like that. Octave's default output is just a simple and fast way to debug your program. To print things exactly the way you want, write a function that prints them.
This is a similar example, where a table is printed:
...
for i = 2 : 7
...
% https://www.gnu.org/software/octave/doc/v4.0.1/Basic-Usage-of-Cell-Arrays.html
results(end+1).vector = { m, gaussLegendreIntegral________, gaussLegendreIntegralErroExato___ };
end
printf( "%20s | %30s | %30s\n", "m", "Gm", "Erro Exato Gm = |Gm - Ie |" )
printf( "%20s | %30s | %30s\n", "--------------------", "------------------------------", "------------------------------" )
numberToStringPrecision = 15;
for i = 1 : numel( results )
# https://www.gnu.org/software/octave/doc/v4.0.0/Processing-Data-in-Cell-Arrays.html
# https://www.gnu.org/software/octave/doc/v4.0.1/Converting-Numerical-Data-to-Strings.html#XREFnum2str
printf( "%20s | ", num2str( cell2mat( results(i).vector(1) ), numberToStringPrecision ) )
printf( "%30s | ", num2str( cell2mat( results(i).vector(2) ), numberToStringPrecision ) )
printf( "%30s\n" , num2str( cell2mat( results(i).vector(3) ), numberToStringPrecision ) )
end
It would generate an output like this:
m | Gm | Erro Exato Gm = |Gm - Ie |
-------------------- | ------------------------------ | ------------------------------
2 | -0.895879734614027 | 0.104120265385973
3 | -0.947672383858322 | 0.0523276161416784
4 | -0.968535977854582 | 0.0314640221454183
5 | -0.979000992287376 | 0.0209990077126242
6 | -0.984991210262343 | 0.0150087897376568
7 | -0.988738923004894 | 0.0112610769951058

Basic vs. compound condition coverage

I'm trying to get my head around the differences between these 2 coverage criteria and I can't work out how they differ. I think I'm failing to understand exactly what decision coverage is. My software testing textbook states that compound decision coverage can be costly (2^n combinations for n basic conditions).
I would have thought basic condition coverage would be costlier.
Consider a && b && c && d && e. My understanding is that in basic condition coverage, each of these atomic variables has to have the value TRUE and the value FALSE in some test case for the tests to have basic condition adequacy - that's 32 different test cases.
So what is the actual difference, and what is referred to as a "basic condition"? In the example above, is a a basic condition?
Thanks.
Regarding terminology, I don't have a single source handy that uses the exact terms "basic condition coverage" and "multiple condition coverage". Binder's "Testing Object-Oriented Systems" says "condition coverage" and "multiple-condition coverage". Everett & McLeod's "Software Testing" says "simple condition coverage" and "compound condition coverage". But I'm certain that the first term in each case is your "basic condition coverage" and the second is your "compound condition coverage". I'll use those terms below.
Basic condition coverage means that every basic condition in the program is true in some test and false in some test, regardless of other conditions. In the following
if a && b && c
# do stuff
else
# do other stuff
end
there is a compound condition, a && b && c, with three basic conditions, a, b and c. It takes only two test cases, one where all basic conditions are true and one where all are false, to get full basic condition coverage. It doesn't matter that the basic conditions happen to be part of a compound condition.
Note that basic condition coverage is not branch coverage. If the compound condition were a && b && !c, those two test cases above would still achieve basic condition coverage but would not achieve branch coverage.
A less aggressively optimized set of test cases for basic condition coverage would have one test case where all three basic conditions are false and three test cases with a different basic condition true in each. That would still only be four of the eight possible combinations of basic conditions in the compound condition. The uncomfortable feeling that we're ignoring the other four is why there's compound condition coverage. That requires a test for each possible combination of basic conditions in a compound condition. In the example above, you'd need eight tests, one for each possible combination of possible values of a, b and c, to get full compound condition coverage.
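To make the three criteria concrete, here is a small Python sketch that checks a set of test cases against each of them. The function names are mine, not standard terminology from any tool, and each test case is simply a tuple of truth values for (a, b, c).

```python
from itertools import product


def has_basic_condition_coverage(tests):
    """Every basic condition is true in some test and false in some test."""
    n = len(tests[0])
    return all({t[i] for t in tests} == {True, False} for i in range(n))


def has_compound_condition_coverage(tests):
    """Every combination of truth values of the basic conditions is tested."""
    n = len(tests[0])
    return set(tests) == set(product([False, True], repeat=n))


def has_branch_coverage(decision, tests):
    """The whole decision evaluates to both true and false."""
    return {decision(*t) for t in tests} == {True, False}


two_tests = [(True, True, True), (False, False, False)]
all_eight = list(product([False, True], repeat=3))

print(has_basic_condition_coverage(two_tests))      # True
print(has_compound_condition_coverage(two_tests))   # False
print(has_compound_condition_coverage(all_eight))   # True

# a && b && c: the two tests reach both branches
print(has_branch_coverage(lambda a, b, c: a and b and c, two_tests))      # True
# a && b && !c: both tests make the decision false, so one branch is missed
print(has_branch_coverage(lambda a, b, c: a and b and not c, two_tests))  # False
```

The two-test suite satisfies basic condition coverage in both cases, which is the point of the note about branch coverage above.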
First, the difference between Decision and Condition.
A Condition is an atomic boolean expression that cannot be broken down into simpler boolean expressions. For example: a (if a is boolean).
A Decision is a compound of Conditions with zero or more Boolean operators. A Decision without an operator is also a condition. For example: (a or b) and c but also a and b or just a.
Let's take a simple example:
if(decision) {
//branch 1
} else {
//branch 2
}
You need two tests to cover both branches. That's the decision coverage or branch coverage. In case the decision is a condition (i.e. just a), that is also called basic condition coverage, which is the coverage of the two branches of a single condition.
The decision can be broken down into conditions.
Let's take, for example,
decision = (a or b) and c
The decision coverage would be achieved with
a,b,c = 0
a,b,c = 1
But the permutation of all the combinations of its boolean subexpressions is the full condition coverage (or multiple condition coverage), which is the compound of the basic condition coverage:
| a | b | c |
| 0 | 0 | 1 |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
| 1 | 1 | 0 |
That would be quite a lot of tests, but some of those are redundant, as some conditions are covered by others. This is reflected in the Modified Condition/Decision Coverage (MC/DC), which is a combination of condition coverage and decision coverage.
For MC/DC it is required that each condition affects the outcome independently. With the above tests (all 0 or all 1), we ignore the fact that the value of c doesn't matter if a and b are 0, or that the value of b doesn't matter if a and c are 1.
So you should sit down, use your brain, and think about for which combinations the overall result R is 1 or 0.
# | a | b | c | a or b | c | R | eq
1 | 0 | 0 | 0 | 0 | 0 | 0 | A
2 | 0 | 0 | 1 | 0 | 1 | 0 | B
3 | 0 | 1 | 0 | 1 | 0 | 0 | A
4 | 0 | 1 | 1 | 1 | 1 | 1 | C
5 | 1 | 0 | 0 | 1 | 0 | 0 | A
6 | 1 | 0 | 1 | 1 | 1 | 1 | D
7 | 1 | 1 | 0 | 1 | 0 | 0 | A
8 | 1 | 1 | 1 | 1 | 1 | 1 | D
The last column shows the equivalence class:
A: c = 0, result is 0, neither a nor b have an influence
B: a,b = 0, result is 0, c has no influence
C: b,c = 1, result is 1, a has no influence
D: a,c = 1, result is 1, b has no influence
For B and C it's quite obvious which to pick, but not so for A and D. For each one you have to check what would happen if you replaced the operators (i.e. or -> and, and -> or) and how that would affect the result of the (sub)decision. If the result is affected, you have a candidate; if not, you don't.
A : (0 and/or 0) and/or 0 -> doesn't matter
A : (0 and 1) vs (0 or 1) -> does matter! -> Candidate
A : (1 and 0) vs (1 or 0) -> does matter! -> Candidate
A : (1 and/or 1) -> doesn't matter
D : (1 and 0) vs (1 or 0) -> does matter -> Candidate
D : (1 and 1) -> doesn't matter
So you get the final test set:
a = 0, b = 1, c = 0 -> false branch (A) OR a = 1, b = 0, c = 0
a = 0, b = 0, c = 1 -> false branch (B)
a = 0, b = 1, c = 1 -> true branch (C)
a = 1, b = 0, c = 1 -> true branch (D)
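One way to double-check a test set like this is to look, for every condition, for a pair of tests that differ only in that condition and give different results. The Python sketch below does that for (a or b) and c. It is one reading of the independence requirement (often called unique-cause MC/DC), and the helper name is made up.

```python
from itertools import combinations


def decision(a, b, c):
    return bool((a or b) and c)


def independence_pairs(tests, func):
    """For each condition index, collect pairs of tests that differ only in
    that condition and produce different decision outcomes."""
    n = len(tests[0])
    pairs = {i: [] for i in range(n)}
    for t1, t2 in combinations(tests, 2):
        differing = [i for i in range(n) if t1[i] != t2[i]]
        if len(differing) == 1 and func(*t1) != func(*t2):
            pairs[differing[0]].append((t1, t2))
    return pairs


# the four tests listed above, as (a, b, c)
final_set = [(0, 1, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1)]
pairs = independence_pairs(final_set, decision)
print(pairs)

# the two decision-coverage tests from earlier (all 0, all 1)
naive_pairs = independence_pairs([(0, 0, 0), (1, 1, 1)], decision)
print(naive_pairs)
```

With the four tests every condition gets at least one such pair, whereas the all-0/all-1 pair yields none, since those two tests differ in all three conditions at once.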
The latter test in particular (changing the operators) can be done with mutation testing tools, which do not just replace operators but can do quite a bit more, e.g. flipping operands, removing statements, changing the order of execution, replacing return values, etc. For each alteration of your code, such a tool verifies whether a test actually fails. This is a good indicator of the quality of your test suite and ensures that the code is not just covered but that your tests for the code are actually valid.
Regarding the terminology, I couldn't find the term "Compound Decision Coverage" anywhere. In my view a "compound decision" would be a compound of compounds of conditions, in other words: a compound of conditions.

FormulaArray not averaging out all the specified entries

Table 1:
G H I J K
| Lane | Bowler | Score | Score | Score | 1
|:-----------|------------:|:------------:|:------------:|:------------:|
| Lane 1 | Thomas| 100 | 100 | 100 | 2
| Lane 2 | column | 200 | 200 | 100 | 3
| Lane 3 | Mary | 300 | 300 | 100 | 4
| Lane 1 | Cool | 150 | 400 | 100 | 5
| Lane 2 | right | 160 | 500 | 100 | 6
| Lane 9 | Susan | 170 | 600 | 100 | 7
Say I want to find the average for each lane that appears in Table 2 and put it in column O:
Table 2:
N O
| Lane | Average | 1
|:-----------|------------:|
| Lane 1 | | 2
| Lane 2 | | 3
| Lane 3 | | 4
I would put
=AVERAGE(IF(N2=$G$2:$G$7, $I$2:$K$7 )) for lane 1 (put this formula on cell "O2")
=AVERAGE(IF(N3=$G$2:$G$7, $I$2:$K$7 )) for Lane 2 ("O3")
=AVERAGE(IF(N4=$G$2:$G$7, $I$2:$K$7 )) for Lane 3 ("O4")
My first question is
What if I want to find the average of ALL the lanes that appear in Table 2 together? That is, the average of Lane 1, Lane 2 and Lane 3 combined (but not other lanes, such as Lane 9).
My attempt:
=AVERAGE(IF(G2:G7 = N2:N4, I2:K7)) Why doesn't this work?
My second question is
I have done the "average of each individual Lane" using vba:
.
Dim i As Integer
For i = 2 To 4
Cells(i, 15).FormulaArray = "=AVERAGE(IF(RC[-1]=R2C7:R7C7,R2C9:R7C11))"
Next i
.
What if I had done it in VBA without the .Formula method?
For Lane 1 only:
pseudo code:
Loop x from 2 to 7
If cell(N2) = cell(Gx) then
Sum = Sum + Ix + Jx + Kx
totalEntries = totalEntries + 3
End If
End Loop
Average = Sum / totalEntries
Would this be slower than using the built-in .Formula? Is there an advantage to doing it this way instead?
The answer to the first question, about why this FormulaArray
= Average(IF(G2:G7 = N2:N4, I2:K7))
doesn't work, is implicit in how this other FormulaArray works:
= AVERAGE( IF( $G$7:$G$12 = $N7, $I$7:$K$12 ) )
Let’s see how each part of this “single-cell formula array” works:
1st part: $G$7:$G$12 = $N7
The first part of the formula generates an array with the records from range $G$7:$G$12 complying with the condition = $N7. Fig. 1 shows the first part of the FormulaArray as a “multi-cell formula array”.
2nd Part: $I$7:$K$12
The result of the first part is applied to the second part to obtain the range of scores complying with the condition = $N7 (see Fig. 2)
3rd part: AVERAGE
Finally the last part of the formula calculates the average of the scores complying with the condition = $N7
Now let’s try to apply the same analysis to the formula:
= AVERAGE( IF( G2:G7 = N2:N4, I2:K7 ) )
Unfortunately, we cannot go beyond the first part, G2:G7 = N2:N4, as it fails when trying to compare two arrays of different dimensions, resulting in #N/A (see Fig. 3).
However, even if the arrays had the same dimensions, the result would not include the duplicated values, as the members are compared one to one (see Fig. 4).
To obtain the average for Lanes 1 to 3 use this FormulaArray
=AVERAGE( IF(
( $G$7:$G$12 = $N7 ) + ( $G$7:$G$12 = $N8 ) + ( $G$7:$G$12 = $N9 ),
$I$7:$K$12 ) )
It generates an array with the records complying with the conditions = $N7 + = $N8 + = $N9 (+ equivalent to operator OR)
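The way the added conditions act as an OR can be followed step by step outside Excel. Below is a small Python sketch over the data of Table 1; it only mimics the logic of the array formula (one mask value per row, then an average over the selected scores) and says nothing about how Excel evaluates it internally.

```python
# (lane, [scores]) rows taken from Table 1
rows = [
    ("Lane 1", [100, 100, 100]),
    ("Lane 2", [200, 200, 100]),
    ("Lane 3", [300, 300, 100]),
    ("Lane 1", [150, 400, 100]),
    ("Lane 2", [160, 500, 100]),
    ("Lane 9", [170, 600, 100]),
]
wanted = ["Lane 1", "Lane 2", "Lane 3"]  # the lanes listed in Table 2

selected = []
for lane, scores in rows:
    # (lane = N2) + (lane = N3) + (lane = N4): adding the 0/1 results gives a
    # non-zero value as soon as any one comparison is true, i.e. a logical OR
    mask = sum(int(lane == w) for w in wanted)
    if mask:
        selected.extend(scores)

average = sum(selected) / len(selected)
print(average)  # 194.0
```

The Lane 9 row gets a mask of 0 and is left out, so the average is taken over the 15 scores of Lanes 1 to 3.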
As regards the second question:
Performance is intrinsically associated with maintenance and efficiency.
The sample procedure just enters a formula which is hard-coded and only works for this particular case. For example:
If the formulas need to change to expand the ranges, the macro has to be updated; with worksheet formulas you would still have to change the formula, but there is no need to open the VBA editor.
If any of the columns before column G gets deleted because it has become obsolete, the macro needs to be updated, while the formulas will not require any maintenance, as they are updated automatically.
In reference to the macro without the .Formula method:
I find this redundant, as it's like writing an algorithm to do something that can be done efficiently and accurately with an existing function; such a macro will not bring anything that is not already there.
I would consider writing such a procedure only when the workbook is very large and uses resources so heavily that its performance slows down significantly. Even then, the advantage would not come from just writing the formulas: the procedure must calculate the results and enter the resulting values instead of the formulas, thus making the workbook light, fast and smooth for the end user.
To get the average of them all, just use
=AVERAGE(I2:K7)
As to the VBA, as it is all done on the same lines, could you just use
For i = 2 To 7
Cells(i,"O").Value = Application.Sum(Range(Cells(i,"I"),Cells(i,"K")))
Next i