I am trying to rename around 100 dummy variables with the values from a separate variable.
I have a variable products, which stores information on what products a company sells and have generated a dummy variable for each product using:
tab products, gen(productid)
However, the variables are named productid1, productid2 and so on. I would like these variables to take the values of the variable products instead.
Is there a way to do this in Stata without renaming each variable individually?
Edit:
Here is an example of the data that will be used. There will be duplications in the product column.
And then I have run the tab command to create a dummy variable for each product to produce the following table.
sort product
tab product, gen(productid)
I noticed it updates the labels to show what each variable represents.
What I would like is for each variable to be named after the value it represents, so that, for example, Commercial replaces productid1, and so on.
Using your example data:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tabulate product, generate(productid)
list, abbreviate(10)
sort product
levelsof product, local(new) clean
tokenize `new'
ds productid*
local i 0
foreach var of varlist `r(varlist)' {
local ++i
rename `var' ``i''
}
Produces the desired output:
list, abbreviate(10)
+---------------------------------------------------------------------------+
| companyid product Commercial CreditCard EMFunds P2P Retail |
|---------------------------------------------------------------------------|
1. | 3 Commercial 1 0 0 0 0 |
2. | 5 CreditCard 0 1 0 0 0 |
3. | 4 CreditCard 0 1 0 0 0 |
4. | 6 EMFunds 0 0 1 0 0 |
5. | 1 P2P 0 0 0 1 0 |
6. | 2 Retail 0 0 0 0 1 |
+---------------------------------------------------------------------------+
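The loop above works because tabulate, generate() and levelsof both go through the distinct values in sorted order, so the i-th dummy lines up with the i-th level. A minimal Python sketch of that pairing, using the example data (the names `products`, `levels` and `rename_map` are illustrative only and not part of the Stata code):

```python
# Example product values from the question (duplicates allowed).
products = ["P2P", "Retail", "Commercial", "CreditCard", "CreditCard", "EMFunds"]

# Distinct values in sorted order, which is the order the Stata loop
# assumes for both the dummies and the levels.
levels = sorted(set(products))

# Pair productid1, productid2, ... with the sorted levels.
rename_map = {f"productid{i}": level for i, level in enumerate(levels, start=1)}

for old, new in rename_map.items():
    print(old, "->", new)
```

If the two orderings ever disagreed, the dummies would be given the wrong names, which is why the approaches further down read the value from the data or from the variable label instead.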
Arbitrary strings might not be legal Stata variable names. This will happen if they (a) are too long; (b) start with any character other than a letter or an underscore; (c) contain characters other than letters, numeric digits and underscores; or (d) are identical to existing variable names. You might be better off making the strings into variable labels, where only an 80 character limit bites.
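As a rough illustration of rules (a) to (d), here is a small Python sketch of a name check. It assumes Stata's 32-character limit on variable names and plain ASCII letters, and it ignores reserved words; `is_legal_name` is a hypothetical helper, not a Stata function.

```python
import re

MAX_NAME_LENGTH = 32  # assumed limit on the length of a Stata variable name


def is_legal_name(candidate, existing=()):
    """Return True if candidate passes rules (a) to (d) above (ASCII only)."""
    if not 1 <= len(candidate) <= MAX_NAME_LENGTH:
        return False  # (a) too long (or empty)
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", candidate):
        return False  # (b) bad first character or (c) illegal character
    return candidate not in existing  # (d) clash with an existing name


for name in ["Commercial", "six something", "2ndProduct", "product"]:
    print(name, is_legal_name(name, existing={"product"}))
```

In Stata itself, strtoname() does the repair work, which is what the code below relies on.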
This code loops over the variables and does its best:
gen long obs = _n
foreach v of var productid? productid?? productid??? {
su obs if `v' == 1, meanonly
local tryit = product[r(min)]
capture rename `v' `=strtoname("`tryit'")'
}
Note: code not tested.
EDIT: Here is a test. I added code for variable labels. The data example and code show that repeated values and values that could not be variable names are accommodated.
clear
input str13 products
"one"
"two"
"one"
"three"
"four"
"five"
"six something"
end
tab products, gen(productsid)
gen long obs = _n
foreach v of var productsid* {
su obs if `v' == 1, meanonly
local value = products[r(min)]
local tryit = strtoname("`value'")
capture rename `v' `tryit'
if _rc == 0 capture label var `tryit' "`value'"
else label var `v' "`value'"
}
drop obs
describe
Contains data
obs: 7
vars: 7
size: 133
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
products str13 %13s
five byte %8.0g five
four byte %8.0g four
one byte %8.0g one
six_something byte %8.0g six something
three byte %8.0g three
two byte %8.0g two
-------------------------------------------------------------------------------
Another solution is to use the extended macro function
local varlabel : variable label varname
The tested code is:
clear
input companyid str10 product
1 "P2P"
2 "Retail"
3 "Commercial"
4 "CreditCard"
5 "CreditCard"
6 "EMFunds"
end
tab product, gen(product_id)
* get the list of product id variables
ds product_id*
* loop through the product id variables and change the
* variable name to its label
foreach var of varlist `r(varlist)' {
local varlabel: variable label `var'
display "`varlabel'"
local pos = strpos("`varlabel'","==")+2
local varlabel = substr("`varlabel'",`pos',.)
display "`varlabel'"
rename `var' `varlabel'
}
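The string handling in that loop takes whatever follows the first "==" in a label of the form product==Commercial. A minimal Python sketch of the same step (`name_from_label` is a hypothetical helper used only for illustration):

```python
def name_from_label(label):
    """Return the text after the first '==' in a label such as 'product==Commercial'."""
    pos = label.find("==")
    if pos == -1:
        raise ValueError(f"no '==' in label: {label!r}")
    # Skip the two characters of '==' itself, as strpos(...) + 2 does in Stata.
    return label[pos + 2:]


print(name_from_label("product==Commercial"))
```

As with the first answer, the rename only succeeds when the extracted text is a legal variable name.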
I have 48 variables in my dataset: first 12 concern year 2000, second 12 year 2001, third 12 year 2002 and fourth 12 year 2003.
Each single variable contains the values in such a way:
ID   var1   var2   var3   ...   var12   ...   var48
xx   0      0      1      ...   1       ...   0
yy   1      0      0      ...   9       ...   0
zz   3      2      1      ...   0       ...   0
Now, I want to collect the sum of the values of the first 12 variables in another one called, say, "tot_2000" which should contain just one number (in this example it is 18).
Then, I must repeat this step for the 3 remaining years, thus having 4 variables ("tot_2000", "tot_2001", "tot_2002", "tot_2003") to be plotted in a histogram.
What I'm looking for is such a variable:
tot_2000
18
ORIGINAL QUESTION, addressed by @TheIceBear and myself.
I have a dataset that contains, say, 12 variables with values 0,1,2.... like this, for example:
ID   var1   var2   var3   ...   var12
xx   0      0      1      ...   1
yy   1      0      0      ...   9
zz   3      2      1      ...   0
and I want to create a variable that is just the sum of all the values (18 in this case), like:
tot_var
18
What is the command?
FIRST ANSWER FROM ME
Here is another way to do it, as indicated in a comment on the first answer by @TheIceBear.
* Example generated by -dataex-. For more info, type help dataex
clear
input str2 ID byte(var1 var2 var3 var4)
"xx" 0 0 1 1
"yy" 1 0 0 9
"zz" 3 2 1 0
end
mata : total = sum(st_data(., "var1 var2 var3 var4"))
mata : st_numscalar("total", total)
di scalar(total)
18
The two Mata commands could be telescoped.
SECOND ANSWER
A quite different question is emerging slowly from comments and edits. The question is still unfocused, but here is an attempt to sharpen it up.
You have monthly data for various identifiers. You want to see bar charts (not histograms) with annual totals.
The data structure or layout you have is a poor fit for handling such data in Stata. You have a so-called wide layout but a long layout is greatly preferable. Then your totals can be put in a variable for graphing.
* fake dataset
clear
set obs 3
gen id = word("xx yy zz", _n)
forval j = 1/48 {
gen var`j' = _n * `j'
}
* you start here
reshape long var, i(id) j(time)
gen mdate = ym(1999, 12) + time
format mdate %tm
gen year = year(dofm(mdate))
* not clear that you want this, but it could be useful
egen total = total(var), by(id year)
twoway bar total year, by(id) xla(2000/2003) name(G1, replace)
* this seems to be what you are asking for
egen TOTAL = total(var), by(year)
twoway bar TOTAL year, base(0) xla(2000/2003) name(G2, replace)
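For readers who want to see the reshape-then-total logic outside Stata, here is a minimal Python sketch built on the same fake data (value = observation number times month index). It assumes, as the Stata code does, that month 1 is January 2000; `wide`, `long_rows` and `year_of` are illustrative names.

```python
from collections import defaultdict

ids = ["xx", "yy", "zz"]

# Wide layout: wide[id][j] holds var`j', built like the fake dataset (_n * j).
wide = {id_: {j: n * j for j in range(1, 49)} for n, id_ in enumerate(ids, start=1)}


def year_of(j):
    """Map month index 1..48 to a calendar year, with month 1 being January 2000."""
    return 2000 + (j - 1) // 12


# Long layout: one (id, time, value) record per identifier and month.
long_rows = [(id_, j, value) for id_, months in wide.items() for j, value in months.items()]

# Annual totals over all identifiers, as in egen TOTAL = total(var), by(year).
total_by_year = defaultdict(int)
for id_, j, value in long_rows:
    total_by_year[year_of(j)] += value

for year in sorted(total_by_year):
    print(year, total_by_year[year])
```

Once the data are in long form, the grouping key (year) is just another column, which is why the totals need no loop over 48 variable names.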
Here is a solution for how to do it in two steps:
* Example generated by -dataex-. For more info, type help dataex
clear
input str2 ID byte(var1 var2 var3 var4)
"xx" 0 0 1 1
"yy" 1 0 0 9
"zz" 3 2 1 0
end
egen row_sum = rowtotal(var*) //Sum each row into a var
egen tot_var = total(row_sum) //Sum the row_sum var over all observations
* Get the value of the first observation and store in a local macro
local total = tot_var[1]
display `total'
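The same two-step arithmetic, written as a small Python sketch on the example data (the dictionary layout is just an assumption for illustration):

```python
# Example data from above: one list of var1..var4 per ID.
rows = {
    "xx": [0, 0, 1, 1],
    "yy": [1, 0, 0, 9],
    "zz": [3, 2, 1, 0],
}

# Step 1: sum each row (the rowtotal() step).
row_sums = {key: sum(values) for key, values in rows.items()}

# Step 2: sum the row sums into a single number.
total = sum(row_sums.values())
print(total)
```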
I have a DataFrame, in which I want to merge certain rows to a single one. It has the following structure (values repeat)
Index Value
1 date:xxxx
2 user:xxxx
3 time:xxxx
4 description:xxx1
5 xxx2
6 xxx3
7 billed:xxxx
...
Now the problem is that rows 5 and 6 still belong to the description and were split off by mistake (the whole string was separated on ","). I want to merge the "description" row (4) with the rows that follow it (5, 6). In my DF, there can be 1-5 additional rows which have to be merged with the description row, but the structure allows me to work with startswith, because no matter how many rows have to be merged, the end point is always the row which starts with "billed". As I am very new to Python, I haven't written any code for this problem yet.
My thought is the following (if it is even possible):
Look for a row which starts with "description" → Merge all the rows afterwards until reaching the row which starts with "billed", then stop (obviously we keep the "billed" row) → Do the same for each row starting with "description"
New DF should look like:
Index Value
1 date:xxxx
2 user:xxxx
3 time:xxxx
4 description:xxx1, xxx2, xxx3
5 billed:xxxx
...
import pandas as pd

df = pd.DataFrame.from_dict({'Value': ('date:xxxx', 'user:xxxx', 'time:xxxx', 'description:xxx', 'xxx2', 'xxx3', 'billed:xxxx')})
records = []
description = description_val = None
for rec in df.to_dict('records'):  # each rec is a dict
    # if previous description and record startswith previous description value
    if description and rec['Value'].startswith(description_val):
        description['Value'] += ', ' + rec['Value']  # add record Value into previous description
        continue
    # record with new description...
    if rec['Value'].startswith('description:'):
        description = rec
        # split on the first colon only, in case the text itself contains one
        _, description_val = rec['Value'].split(':', 1)
    elif rec['Value'].startswith('billed:'):
        # billed record - remove description value
        description = description_val = None
    records.append(rec)
print(pd.DataFrame(records))
# Value
# 0 date:xxxx
# 1 user:xxxx
# 2 time:xxxx
# 3 description:xxx, xxx2, xxx3
# 4 billed:xxxx
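The answer above recognises continuation rows by their shared prefix. A variant that follows the rule stated in the question (merge everything between a "description" row and the next "billed" row) can be sketched in plain Python; `merge_descriptions` is a hypothetical helper, and the resulting list could be passed back to pd.DataFrame({'Value': ...}).

```python
def merge_descriptions(values):
    """Fold rows between a 'description:' row and the next 'billed:' row into the description."""
    merged = []
    in_description = False
    for value in values:
        if value.startswith("description:"):
            in_description = True
            merged.append(value)
        elif value.startswith("billed:"):
            in_description = False
            merged.append(value)
        elif in_description:
            # Continuation row: append it to the description row, which is
            # the last item added while in_description is True.
            merged[-1] += ", " + value
        else:
            merged.append(value)
    return merged


values = ["date:xxxx", "user:xxxx", "time:xxxx",
          "description:xxx1", "xxx2", "xxx3", "billed:xxxx"]
for row in merge_descriptions(values):
    print(row)
```

This version does not depend on what the continuation rows look like, only on where the "billed" row sits.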
I have a string variable called country whose values look like Afghanistan2008 or Brasil2012. I would like to create two new variables, one holding the country part and one the year part.
Because there are always numbers at the end of the string, I do know the position the string should be split at from the right side but not from the left side.
Could I use something like:
gen(substr("country",-4,.))
If not, could anyone tell me how to split an entire column of such variables into a country and a year variable? I would also like to keep the original variable.
You can use a regular expression:
clear
set obs 2
generate string = ""
replace string = "Afghanistan2008" in 1
replace string = "Brasil2012" in 2
generate country = regexs(0) if regexm(string, "[a-zA-Z]+")
generate year = regexs(1) + regexs(2) if regexm(string, "(19|20)([0-9][0-9])")
list
+--------------------------------------+
| string country year |
|--------------------------------------|
1. | Afghanistan2008 Afghanistan 2008 |
2. | Brasil2012 Brasil 2012 |
+--------------------------------------+
Type help regex in Stata's command prompt for more information.
Alternatively you could do the following:
generate len = length(string) - 3
generate country2 = substr(string, 1, len - 1)
generate year2 = substr(string, len, .)
list country2 year2
+---------------------+
| country2 year2 |
|---------------------|
1. | Afghanistan 2008 |
2. | Brasil 2012 |
+---------------------+
For my specific situation the following makes a new year variable:
gen spyear = real(substr(country,-4,.))
I took the other part from @PearlySpencer:
generate len = length(country) - 3
generate spcountry = substr(country, 1, len - 1)
which creates an excess column to be removed.
EDIT (Nick Cox) This can be simplified to
gen spyear = real(substr(country, -4, 4))
gen spcountry = substr(country, 1, length(country) - 4)
showing that there is no need to create a variable containing the string length, and that the puzzling split 4 = 3 + 1 is not needed either.
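The same split-from-the-right idea, as a minimal Python sketch that assumes every value ends in a four-digit year (`split_country_year` is a hypothetical helper):

```python
def split_country_year(text):
    """Split e.g. 'Afghanistan2008' into ('Afghanistan', 2008), assuming a 4-digit year suffix."""
    year_part = text[-4:]
    if len(text) <= 4 or not year_part.isdecimal():
        raise ValueError(f"expected a name followed by a 4-digit year: {text!r}")
    # Everything except the last four characters is the country.
    return text[:-4], int(year_part)


for value in ["Afghanistan2008", "Brasil2012"]:
    print(split_country_year(value))
```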
I am trying to get the value of 'id' in the vmstat result.
However, I found out that the position of the 'id' column differs between platforms such as Linux/AIX/HP...
## Linux
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 35268 117568 158244 1849104 0 0 3 11321 5 2 9 15 73 3 0
So, I think I should find the string 'id' in the header, get its position, and then read the value at that position in the next row.
How can I do that with an awk script?
This one-liner does what you want:
vmstat | awk '{for(i=NF;i>0;i--)if($i=="id"){x=i;break}}END{print $x}'
It first finds the index of the id column in the header line, then prints the corresponding field of the last line.
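The lookup-by-header idea can also be sketched in Python, here run against the sample Linux output from the question rather than a live vmstat call (`column_value` and `sample` are illustrative names):

```python
def column_value(output, column="id"):
    """Return the field of the last line that sits under the given header column."""
    lines = [line.split() for line in output.splitlines() if line.strip()]
    # The header is the first line that contains the column name as a field.
    header = next(fields for fields in lines if column in fields)
    index = header.index(column)
    return lines[-1][index]


sample = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  35268 117568 158244 1849104    0    0     3 11321    5    2  9 15 73  3  0
"""

print(column_value(sample))
```

Because the position is read from the header each time, the same code applies on platforms where the id column sits elsewhere, as long as the header and data rows have the same number of fields.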
I have a crosstab whose rows represent different classes, with people's names across the top.
| | Required | Person 1 | Person 2 | Person 3 |
| Class 1 | 8 6 | 1 6 | 3 6 | 4 6 |
| Class 2 | 6 2 | 3 2 | 2 2 | 1 2 |
Each field contains 2 values. The first value is the number of hours spent in the class; the second is the number of hours required for certification.
The Required field is my grand total summary.
In the cross tab expert the fields are defined as follows.
Rows:
Command.descr -> a field containing the class names
Columns:
Command.fullname -> a field containing students full names
Summarized Fields:
Sum of Command.evlength -> summation of all time spent in a given course
Max of #required -> this formula returns the number of required hours based on the course name
I am trying to highlight the field Sum of Command.evlength if it is greater than or equal to the value of Max of #required.
My solution was to perform background formatting. Right-Click on the Sum of Command.evlength field, select Format Field. Click the borders tab, check Background, and enter a formula.
The formula I was using is:
if CurrentFieldValue >= {#required} then color(152, 251, 152) else crNoColor
This is not the correct formula. My crosstab has been placed in the footer, which causes {#required} to contain the last value in the grid which in the above example is 2.
From my research I thought I would have to use GridRowColumnValue(row or column name) to access the value of {#required} in the crosstab, but I could not come up with the correct string to represent it.
Does anyone have a way for me to correctly perform this comparison?
Frustratingly I don't think you can use the highlighting expert to compare to a dynamic value. You could swap the columns round then add the following formulas:
To the max_of_required background colour:
whileprintingrecords;
global numbervar required_hrs := currentfieldvalue;
crNoColor;
To the sum_of_command.evlength background colour:
whileprintingrecords;
global numbervar required_hrs;
if currentfieldvalue >= required_hrs then
crRed
else
crNoColor;
I think there are a few other ways, but I'm not as confident with those, so start here.