How to compute the sum of some variables in Stata? - variables

I have 48 variables in my dataset: first 12 concern year 2000, second 12 year 2001, third 12 year 2002 and fourth 12 year 2003.
Each single variable contains the values in such a way:
ID
var1
var2
var3
...
var12
...
var48
xx
0
0
1
...
1
...
0
yy
1
0
0
...
9
...
0
zz
3
2
1
...
0
...
0
Now, I want to collect the sum of the values of the first 12 variables in another one called, say, "tot_2000" which should contain just one number (in this example it is 18).
Then, I must repeat this passage for the 3 remaining years, thus having 4 variables ("tot_2000", "tot_2001", "tot2002", "tot2003") to be plotted in an histogram.
What I'm looking for is such a variable:
tot_2000
18

ORIGINAL QUESTION, addressed by #TheIceBear and myself.
I have a dataset that contains, say, 12 variables with values 0,1,2.... like this, for example:
ID
var1
var2
var3
...
var12
xx
0
0
1
...
1
yy
1
0
0
...
9
zz
3
2
1
...
0
and I want to create a variable that is just the sum of all the values (18 in this case), like:
tot_var
18
What is the command?
FIRST ANSWER FROM ME
Here is another way to do it, as indicated in a comment on the first answer by #TheIceBear.
* Example generated by -dataex-. For more info, type help dataex
clear
input str2 ID byte(var1 var2 var3 var4)
"xx" 0 0 1 1
"yy" 1 0 0 9
"zz" 3 2 1 0
end
mata : total = sum(st_data(., "var1 var2 var3 var4"))
mata : st_numscalar("total", total)
di scalar(total)
18
The two Mata commands could be telescoped.
SECOND ANSWER
A quite different question is emerging slowly from comments and edits. The question is still unfocused, but here is an attempt to sharpen it up.
You have monthly data for various identifiers. You want to see bar charts (not histograms) with annual totals.
The data structure or layout you have is a poor fit for handling such data in Stata. You have a so-called wide layout but a long layout is greatly preferable. Then your totals can be put in a variable for graphing.
* fake dataset
clear
set obs 3
gen id = word("xx yy zz", _n)
forval j = 1/48 {
gen var`j' = _n * `j'
}
* you start here
reshape long var, i(id) j(time)
gen mdate = ym(1999, 12) + time
format mdate %tm
gen year = year(dofm(mdate))
* not clear that you want this, but it could be useful
egen total = total(var), by(id year)
twoway bar total year, by(id) xla(2000/2003) name(G1, replace)
* this seems to be what you are asking for
egen TOTAL = total(var), by(year)
twoway bar TOTAL year, base(0) xla(2000/2003) name(G2, replace)

Here is a solution for how to do it in two steps:
* Example generated by -dataex-. For more info, type help dataex
clear
input str2 ID byte(var1 var2 var3 var4)
"xx" 0 0 1 1
"yy" 1 0 0 9
"zz" 3 2 1 0
end
egen row_sum = rowtotal(var*) //Sum each row into a var
egen tot_var = sum(row_sum ) //Sum the row_sum var
* Get the value of the first observation and store in a local macro
local total = tot_var[1]
display `total'

Related

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from dependant on the booking status) - there are lots more no than yes so I would like to take a sample with all the yes and the same amount of no.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do this sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function will sample without replacement. Meaning, you'll receive an error by specifying a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will not pick values from this column that are zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to this in the end, here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([f_class_0_under, df_class_1], axis=0)
df_class_1 = df[df['bookingstatus'] == 1]
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

Changing names of variables using the values of another variable

I am trying to rename around 100 dummy variables with the values from a separate variable.
I have a variable products, which stores information on what products a company sells and have generated a dummy variable for each product using:
tab products, gen(productid)
However, the variables are named productid1, productid2 and so on. I would like these variables to take the values of the variable products instead.
Is there a way to do this in Stata without renaming each variable individually?
Edit:
Here is an example of the data that will be used. There will be duplications in the product column.
And then I have run the tab command to create a dummy variable for each product to produce the following table.
sort product
tab product, gen(productid)
I noticed it updates the labels to show what each variable represents.
What I would like to do is to assign the value to be the name of the variable such as commercial to replace productid1 and so on.
Using your example data:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tabulate product, generate(productid)
list, abbreviate(10)
sort product
levelsof product, local(new) clean
tokenize `new'
ds productid*
local i 0
foreach var of varlist `r(varlist)' {
local ++i
rename `var' ``i''
}
Produces the desired output:
list, abbreviate(10)
+---------------------------------------------------------------------------+
| companyid product Commercial CreditCard EMFunds P2P Retail |
|---------------------------------------------------------------------------|
1. | 3 Commercial 1 0 0 0 0 |
2. | 5 CreditCard 0 1 0 0 0 |
3. | 4 CreditCard 0 1 0 0 0 |
4. | 6 EMFunds 0 0 1 0 0 |
5. | 1 P2P 0 0 0 1 0 |
6. | 2 Retail 0 0 0 0 1 |
+---------------------------------------------------------------------------+
Arbitrary strings might not be legal Stata variable names. This will happen if they (a) are too long; (b) start with any character other than a letter or an underscore; (c) contain characters other than letters, numeric digits and underscores; or (d) are identical to existing variable names. You might be better off making the strings into variable labels, where only an 80 character limit bites.
This code loops over the variables and does its best:
gen long obs = _n
foreach v of var productid? productid?? productid??? {
su obs if `v' == 1, meanonly
local tryit = product[r(min)]
capture rename `v' `=strtoname("`tryit'")'
}
Note: code not tested.
EDIT: Here is a test. I added code for variable labels. The data example and code show that repeated values and values that could not be variable names are accommodated.
clear
input str13 products
"one"
"two"
"one"
"three"
"four"
"five"
"six something"
end
tab products, gen(productsid)
gen long obs = _n
foreach v of var productsid*{
su obs if `v' == 1, meanonly
local value = products[r(min)]
local tryit = strtoname("`value'")
capture rename `v' `tryit'
if _rc == 0 capture label var `tryit' "`value'"
else label var `v' "`value'"
}
drop obs
describe
Contains data
obs: 7
vars: 7
size: 133
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
products str13 %13s
five byte %8.0g five
four byte %8.0g four
one byte %8.0g one
six_something byte %8.0g six something
three byte %8.0g three
two byte %8.0g two
-------------------------------------------------------------------------------
Another solution is to use the extended macro function
local varlabel:variable label
The tested code is:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tab product, gen(product_id)
* get the list of product id variables
ds product_id*
* loop through the product id variables and change the
variable name to its label
foreach var of varlist `r(varlist)' {
local varlabel: variable label `var'
display "`varlabel'"
local pos = strpos("`varlabel'","==")+2
local varlabel = substr("`varlabel'",`pos',.)
display "`varlabel'"
rename `var' `varlabel'
}

find closest match within a vector to fill missing values using dplyr

A dummy dataset is :
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within same group. In case of a tie, pick any.
I am using dplyr. which.closest from birk but it needs a vector and a value. How to look up within a vector without writing loops. Even if there is an SQL solution, will do.
Any pointers to the solution?
May be something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))]
Not sure how to specify Thisdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Using knn1 (nearest neighbor) from the class package (which comes with R -- don't need to install it) and dplyr define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.
library(class)
na.knn1 <- function(x, time) {
is_na <- is.na(x)
if (sum(is_na) == 0 || all(is_na)) return(x)
train <- matrix(time[!is_na])
test <- matrix(time[is_na])
cl <- x[!is_na]
x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
You can try the use of sapply to find the values closest since the x argument in `which.closest only takes a single value.
first create a vect whereby the dates with no values are replaced with NA and use it within the which.closest function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
if which.closest was to take a vector then there would be no need of sapply. But this is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data%>%mutate(vect=`is.na<-`(dates,is.na(value)),
value=value[sapply(dates,which.closest,vect)])%>%
select(-vect)

Calculate amount of combinations with conditions

I'd like to calculate how many different variations of a certain amount of numbers are possible. The number of elements is variable.
Example:
I have 5 elements and each element can vary between 0 and 8. Only the first element is a bit more defined and can only vary between 1 and 8. So far I'd say I have 8*9^4 possibilities. But I have some more conditions. As soon as one of the elements gets zero the next elements should be automatically zero as well.
E.G:
6 5 4 7 8 is ok
6 3 6 8 0 is ok
3 6 7 0 5 is not possible and would turn to 3 6 7 0 0
Would somebody show me how to calculate the amount of combinations for this case and also in general, because I'd like to be able to calculate it also for 4 or 8 or 9 etc. elements. Later on I'd like to calculate this number in VBA to be able give the user a forecast how long my calculations will take.
Since once a 0 is present in the sequence, all remaining numbers in the sequence will also be 0, these are all of the possibilities: (where # below represents any digit from 1 to 8):
##### (accounts for 8^5 combinations)
####0 (accounts for 8^4 combinations)
...
#0000 (accounts for 8^1 combinations)
Therefore, the answer is (in pseudocode):
int sum = 0;
for (int x = 1; x <= 5; x++)
{
sum = sum + 8^x;
}
Or equivalently,
int prod = 0;
for (int x = 1; x <= 5; x++)
{
prod = 8*(prod+1);
}
great thank you.
Sub test()
Dim sum As Single
Dim x As Integer
For x = 1 To 6
sum = sum + 8 ^ x
Next
Debug.Print sum
End Sub
With this code I get exactly 37488. I tried also with e.g. 6 elements and it worked as well. Now I can try to estimate the calculation time

Calculate diff() between selected rows

I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
time bit
index
0 0.24 0
1 0.245 0
2 0.47 1
3 0.471 1
4 0.479 0
5 0.58 1
... ... ...
I want to select those rows where the time difference is, let's say <0.01 s. But only those differences between rows with bit 1 and bit 0. So in the above example I would only select row 3 and 4 (or any one of them). I thought that I would calculate the diff() of the time column. But I need to somehow select on the 0/1 bit.
Coming from the future to answer this one. You can apply a function to the dataframe that finds the indices of the rows that adhere to the condition and returns the row pairs accordingly:
def filter_(x, threshold = 0.01):
indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
mask = indices | indices - 1
return x[mask]
print(df.apply(filter_, args = (0.01,)))
Output:
time bit
3 0.471 1
4 0.479 0