I am using Stata 13.
I want to create a variable that equals 0 when none of a bunch of other variables equals 0; this variable is 1 when one variable of a bunch of other variables equals 1; it is 2 when two variables of a bunch of other variables are 1; it is 3 when three variables of a bunch of other variables are 1; and so on.
Any suggestions?
Your conditions are not mutually exclusive. Two criteria need to be separated.
1. A variable that is 0 when none of a bunch of other variables equals 0.
2. A variable that is 1 when one variable of a bunch of other variables equals 1; 2 when two variables are 1; 3 when three variables are 1; etc.
Condition #2 is just counting 1s, as here:
clear
input x1 x2 x3
0 0 1
0 1 1
1 1 1
end
egen count1 = anycount(x1 x2 x3), value(1)
list
+-----------------------+
| x1 x2 x3 count1 |
|-----------------------|
1. | 0 0 1 1 |
2. | 0 1 1 2 |
3. | 1 1 1 3 |
+-----------------------+
Condition #1 could be done this way for a modest number of variables:
gen none0 = inlist(0, x1, x2, x3)
list
+-------------------------------+
| x1 x2 x3 count1 none0 |
|-------------------------------|
1. | 0 0 1 1 1 |
2. | 0 1 1 2 1 |
3. | 1 1 1 3 0 |
+-------------------------------+
The rowtotal() method of counting 1s in your comment only works for values that are only ever 1, 0 or missing, which may be true of your data but it is not a stated condition.
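The difference between counting 1s and summing a row can be illustrated outside Stata with a small Python sketch. The helper names `count_ones` and `any_zero` are hypothetical, and the fourth row (containing a 2) is an assumption added only to show where the two approaches diverge.

```python
# Three observations from the example above, plus one row containing a 2
# (the extra row is an assumption added to show where the methods diverge).
rows = [(0, 0, 1), (0, 1, 1), (1, 1, 1), (2, 0, 1)]

def count_ones(row):
    """Count how many values in the row equal 1."""
    return sum(1 for v in row if v == 1)

def any_zero(row):
    """Return 1 if any value in the row equals 0, else 0."""
    return int(0 in row)

count1 = [count_ones(r) for r in rows]
row_sum = [sum(r) for r in rows]
has_zero = [any_zero(r) for r in rows]

print(count1)    # [1, 2, 3, 1]
print(row_sum)   # [1, 2, 3, 3]
print(has_zero)  # [1, 1, 0, 1]
```

The count and the sum agree on the first three rows and differ on the last one, where a value other than 0 or 1 appears.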
I want to take matrix 1 like the one below and pad it with a padding of 1 so that it looks like matrix 2, or with a padding of 2 so that it looks like matrix 3. I want to do this without using np.pad() or any other NumPy function.
Matrix 1
| 4 4 |
| 7 2 |
Matrix 2 - with padding of 1
| 0 0 0 0 |
| 0 4 4 0 |
| 0 7 2 0 |
| 0 0 0 0 |
Matrix 3 - with padding of 2
| 0 0 0 0 0 0 |
| 0 0 0 0 0 0 |
| 0 0 4 4 0 0 |
| 0 0 7 2 0 0 |
| 0 0 0 0 0 0 |
| 0 0 0 0 0 0 |
You could create a custom pad function like so:
Very late edit: Do not use this function, use the one below it called pad2().
def pad(mat, padding):
    dim1 = len(mat)
    dim2 = len(mat[0])
    # new empty matrix of the required size
    # (the two dimensions are swapped here, which is why non-square input fails)
    new_mat = [
        [0 for i in range(dim1 + padding*2)]
        for j in range(dim2 + padding*2)
    ]
    # "insert" original matrix in the empty matrix
    for i in range(dim1):
        for j in range(dim2):
            new_mat[i+padding][j+padding] = mat[i][j]
    return new_mat
It might not be the optimal/fastest solution, but this should work fine for regular sized matrices.
Very late edit:
I tried to use this function on a non-square matrix and noticed it threw an IndexError. So for future reference, here is the corrected version that works for N x M matrices (where N != M):
def pad2(mat, padding, pad_with=0):
    n_rows = len(mat)
    n_cols = len(mat[0])
    # new empty matrix of the required size
    new_mat = [
        [pad_with for col in range(n_cols + padding * 2)]
        for row in range(n_rows + padding * 2)
    ]
    # "insert" original matrix in the empty matrix
    for row in range(n_rows):
        for col in range(n_cols):
            new_mat[row + padding][col + padding] = mat[row][col]
    return new_mat
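As a cross-check on the non-square case, here is a minimal alternative sketch that builds the padded matrix row by row with list concatenation instead of filling a pre-allocated grid. The name `pad_rows` is hypothetical and not part of the answer above; it assumes a non-empty, rectangular list of lists.

```python
def pad_rows(mat, padding, pad_with=0):
    """Hypothetical alternative: build the padded matrix row by row."""
    n_cols = len(mat[0])
    width = n_cols + 2 * padding
    side = [pad_with] * padding
    # Full-width rows of padding for the top and bottom borders.
    top = [[pad_with] * width for _ in range(padding)]
    bottom = [[pad_with] * width for _ in range(padding)]
    # Each original row gets `padding` cells on its left and right.
    middle = [side + list(row) + side for row in mat]
    return top + middle + bottom

square = pad_rows([[4, 4], [7, 2]], 1)
wide = pad_rows([[1, 2, 3]], 2)  # 1 x 3 input, i.e. non-square

print(square)                   # [[0, 0, 0, 0], [0, 4, 4, 0], [0, 7, 2, 0], [0, 0, 0, 0]]
print(len(wide), len(wide[0]))  # 5 7
```

Because the row count and column count are never mixed, a 1 x 3 input comes out as 5 x 7 without an IndexError.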
I have a pandas dataframe in the following structure:
|index | a | b | c | d | e |
| ---- | -- | -- | -- | -- | -- |
|0 | -1 | -2| 5 | 3 | 1 |
How can I get the minimum value for each row using only the positive values in columns a-e?
For the example row above, the minimum of (5,3,1) should be 1 and not (-2).
You can loop over all rows and apply your condition to each row.
For example:
import pandas as pd

df = pd.DataFrame([{"a":-2,"b":2,"c":5},{"a":3,"b":0,"c":-1}])
#    a  b  c
#0  -2  2  5
#1   3  0 -1

def my_condition(li):
    # keep the non-negative values (use i > 0 to exclude zero as well)
    li = [i for i in li if i>=0]
    return min(li)

min_cel = []
for k,r in df.iterrows():
    li = r.to_dict().values()
    min_cel.append( my_condition(li) )

df["min"] = min_cel
#    a  b  c  min
#0  -2  2  5    2
#1   3  0 -1    0
You can also write the same code on one line:
df['min'] = df.apply(lambda row: min([i for i in row.to_dict().values() if i>=0]) , axis=1)
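A vectorized variant avoids the Python-level loop. This is a minimal sketch, assuming strictly positive values are wanted (as the question states) and that the columns of interest are a to e. The column name `min_pos` and the second data row are made up for illustration.

```python
import pandas as pd

# First row is the one from the question; the second row is an assumption.
df = pd.DataFrame({
    "a": [-1, -2],
    "b": [-2, 2],
    "c": [5, 5],
    "d": [3, 0],
    "e": [1, -1],
})

cols = ["a", "b", "c", "d", "e"]
# where() keeps values that satisfy the condition and replaces the rest
# with NaN; min(axis=1) then skips the NaN entries in each row.
positive = df[cols].where(df[cols] > 0)
df["min_pos"] = positive.min(axis=1)

print(df["min_pos"].tolist())  # [1.0, 2.0]
```

The result is a float column because of the intermediate NaN values, and a row with no positive values at all yields NaN rather than raising an error.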
How can I create a dummy variable in Stata that takes the value of 1 when the variable pax is above 100 and 0 otherwise?
Missing values should be labelled as 0.
My code is the following:
generate type = 0
replace type = 1 if pax > 100
The problem is that Stata labels all missing values as 1 instead of keeping them as 0.
This occurs because Stata views missing values as large positive values. As such, your variable type is set equal to 1 when you request this for all values of pax > 100 (which includes missings).
You can avoid this by explicitly indicating that you do not want missing values replaced as 1:
generate type = 0
replace type = 1 if pax > 100 & pax != .
Consider the toy example below:
clear
input pax
20
30
40
100
110
130
150
.
.
.
end
The following syntax is in fact sufficient:
generate type1 = pax > 100 & pax < .
Alternatively, one can use the missing() function:
generate type2 = pax > 100 & !missing(pax)
Note the use of ! before the function: it negates missing(), so the condition holds only for non-missing values.
In both cases, the results are the same:
list
+---------------------+
| pax type1 type2 |
|---------------------|
1. | 20 0 0 |
2. | 30 0 0 |
3. | 40 0 0 |
4. | 100 0 0 |
5. | 110 1 1 |
|---------------------|
6. | 130 1 1 |
7. | 150 1 1 |
8. | . 0 0 |
9. | . 0 0 |
10. | . 0 0 |
+---------------------+
Let's say I have the following dataframe:
import pandas as pd
elements = [1,1,1,1,1,2,3,4,5]
df = pd.DataFrame({'elements': elements})
df.set_index(['elements'])
print(df)
elements
0 1
1 1
2 1
3 1
4 1
5 2
6 3
7 4
8 5
I have a list [1, 1, 2, 3] and I want a subset of the dataframe including those 4 elements, for example:
elements
0 1
1 1
5 2
6 3
I have been able to deal with it by building a dict counting the items occurrences in the array and building a new dataframe by appending subparts of the initial one.
Would you know some dataframe methods to help me find a more elegant solution?
After @jezrael's comment: I must add that I need to keep track of the initial index (in df).
We can see df (the first dataframe) as a repository of resources, and I need to track which rows/indices are allocated.
The use case is: among the elements in df, give me two 1s, one 2 and one 3. I would then record that rows 0 and 1 hold the 1s, row 5 holds the 2 and row 6 holds the 3.
If and only if your Series and list are sorted (otherwise, see below), then you can do:
L = [1, 1, 2, 3]
df[df.elements.apply(lambda x: x == L.pop(0) if x in L else False)]
elements
0 1
1 1
5 2
6 3
list.pop(i) removes and returns the value in the list at index i. Because both elements and L are sorted, the first element of the subset list L (i == 0) always lines up with the first matching value in elements.
So at each iteration of lambda on elements, L will become:
| element | L | Output |
|=========|==============|===========|
| 1 | [1, 1, 2, 3] | True |
| 1 | [1, 2, 3] | True |
| 1 | [2, 3] | False |
| 1 | [2, 3] | False |
| 1 | [2, 3] | False |
| 2 | [2, 3] | True |
| 3 | [3] | True |
| 4 | [] | False |
| 5 | [] | False |
As you can see, your list is empty at the end, so if that is a problem, copy it beforehand. Alternatively, the same information is available in the new dataframe you just created.
If df.elements is not sorted, create a sorted copy and apply the same lambda function to it; the resulting boolean Series is then used to index the original dataframe (rows whose value is True are kept):
df
elements
0 5
1 4
2 3
3 1
4 2
5 1
6 1
7 1
8 1
L = [1, 1, 2, 3]  # rebuild L: the earlier call emptied it
cp = df.elements.copy()
cp.sort_values(inplace=True)
tmp = df.loc[cp.apply(lambda x: x == L.pop(0) if x in L else False)]
print(tmp)
elements
2 3
3 1
4 2
5 1
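The pop-based approach consumes L. A minimal sketch of an alternative that leaves the list untouched and does not need sorted input counts the requested values with collections.Counter and claims the earliest unused position for each. The helper name `pick_indices` is hypothetical, and plain lists stand in for the dataframe column.

```python
from collections import Counter

def pick_indices(elements, wanted):
    """Return the positions in `elements` that satisfy the request in `wanted`.

    Hypothetical helper: each value in `wanted` claims the earliest unused
    position holding that value. `wanted` itself is not modified.
    """
    remaining = Counter(wanted)
    picked = []
    for idx, value in enumerate(elements):
        if remaining[value] > 0:
            remaining[value] -= 1
            picked.append(idx)
    return picked

L = [1, 1, 2, 3]
sorted_elements = [1, 1, 1, 1, 1, 2, 3, 4, 5]
unsorted_elements = [5, 4, 3, 1, 2, 1, 1, 1, 1]

print(pick_indices(sorted_elements, L))    # [0, 1, 5, 6]
print(pick_indices(unsorted_elements, L))  # [2, 3, 4, 5]
print(L)                                   # [1, 1, 2, 3]
```

With a default integer index, the returned positions could be passed to df.iloc[...] to recover the rows together with their original index labels.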
HTH
You can extract the rows by merging on a helper column created with GroupBy.cumcount:
L = [1,1,2,3]
df1 = pd.DataFrame({'elements':L})
df['g'] = df.groupby('elements')['elements'].cumcount()
df1['g'] = df1.groupby('elements')['elements'].cumcount()
print (df)
elements g
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 2 0
6 3 0
7 4 0
8 5 0
print (df1)
elements g
0 1 0
1 1 1
2 2 0
3 3 0
print (pd.merge(df,df1, on=['elements', 'g']))
elements g
0 1 0
1 1 1
2 2 0
3 3 0
print (pd.merge(df.reset_index(),df1, on=['elements', 'g'])
.drop('g', axis=1)
.set_index('index')
.rename_axis(None))
elements
0 1
1 1
5 2
6 3
Does there exist a Postgres Aggregator such that, when used on the following table:
id | value
----+-----------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 3
6 | 3
7 | 3
8 | 4
9 | 4
10 | 5
in a query such as:
select agg_function(4,value) from mytable where id>5
will return
agg_function
--------------
t
(a boolean true result) because a row or rows with value=4 were selected?
In other words, one argument specifies the value you are looking for, the other argument takes the column specifier, and it returns true if the column value was equal to the specified value for one or more rows?
I have successfully created an aggregate to do just that, but I'm wondering if I have just reinvented the wheel...
select sum(case when value = 4 then 1 else 0 end) > 0
from mytable
where id > 5
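The query above can be tried without a Postgres server. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption made purely for convenience: SQLite accepts this particular expression, but it reports the boolean as 1 or 0 rather than t or f.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption made so
# the sketch runs without a server).
conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (id integer primary key, value integer)")
values = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]
conn.executemany(
    "insert into mytable (id, value) values (?, ?)",
    list(enumerate(values, start=1)),
)

query = """
    select sum(case when value = ? then 1 else 0 end) > 0
    from mytable
    where id > ?
"""
found_4 = conn.execute(query, (4, 5)).fetchone()[0]
found_1 = conn.execute(query, (1, 5)).fetchone()[0]
conn.close()

print(found_4)  # 1
print(found_1)  # 0
```

If the where clause matches no rows at all, sum() yields NULL and the comparison is NULL as well, so the result is neither true nor false in that case.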