How to get variable labels of regression output in matrix - variables

In Stata, say that I have these data:
sysuse auto2, clear
gen name = substr(make, 1,3)
encode name, gen(name2)
I run this regression, which importantly uses i.:
reg price i.name2 trunk weight turn
The output takes the form of:
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
name2 |
Aud | 4853.048 1254.083 3.87 0.000 2331.545 7374.551
BMW | 5742.124 1560.161 3.68 0.001 2605.211 8879.037
Bui | 1351.065 946.733 1.43 0.160 -552.4696 3254.599
Cad | 7740.865 1168.332 6.63 0.000 5391.776 10089.95
Che | 62.35577 946.1153 0.07 0.948 -1839.937 1964.648
....
I then go to the estimation results:
matrix list e(b)
which produces:
e(b)[1,27]
1b. 2. 3. 4. 5. 6. 7.
name2 name2 name2 name2 name2 name2 name2
y1 0 4853.0482 5742.1237 1351.0647 7740.8653 62.355771 2676.3971
8. 9. 10. 11. 12. 13. 14.
name2 name2 name2 name2 name2 name2 name2
y1 943.4266 1964.8242 1776.4058 2711.4324 6386.7936
....
My question is: how can I retrieve the labels from the name2 variable after the regression is run? What I want is what is displayed in the initial output: Aud, BMW, Bui, etc. I do not want what is stored in the e(b) matrix: 1b.name2, 2.name2, 3.name2, etc. Is there a way to get what I want stored in e(b), or is it stored elsewhere in the estimation results? Can estout/esttab do this? I would like to get the results stored in a matrix.

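You can store e(b) in a matrix, get the levels of name with the levelsof command, and use them to rename the matrix columns. This works because encode assigns its codes in alphabetical order, which is the same order that levelsof returns.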
sysuse auto2, clear
gen name = substr(make, 1,3)
encode name, gen(name2)
reg price i.name2 trunk weight turn
mat A = e(b)
levelsof name, local(names)
local colnames "`names' trunk weight turn _cons"
matrix colnames A = `colnames'
matrix list A
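The renaming idea can also be sketched outside Stata. The following is a minimal Python illustration, not Stata code: it assumes the column names have the form shown by matrix list (for example 1b.name2 and 2.name2), and the value_labels mapping is a hypothetical stand-in for the value label that encode attaches to name2.

```python
import re

def relabel(colnames, value_labels):
    """Replace factor-variable names such as '2.name2' with their value labels.

    value_labels is a hypothetical dict: {variable: {level: label}}.
    Names that are not factor-variable terms are returned unchanged.
    """
    out = []
    for name in colnames:
        # level number, optional base/omitted marker (b or o), a dot, the variable
        m = re.fullmatch(r"(\d+)[bo]?\.(\w+)", name)
        if m and m.group(2) in value_labels:
            out.append(value_labels[m.group(2)][int(m.group(1))])
        else:
            out.append(name)
    return out

# Illustrative input only: three levels plus the continuous regressors.
colnames = ["1b.name2", "2.name2", "3.name2", "trunk", "weight", "turn", "_cons"]
value_labels = {"name2": {1: "AMC", 2: "Aud", 3: "BMW"}}

labels = relabel(colnames, value_labels)
print(labels)
```

Looking the label up by level number, rather than relying on position, keeps the mapping correct even if some levels are dropped from the model.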

Related

SQL Counting empty fields in a row

I'm new to SQL and I'm not sure whether what I want to do is possible.
Basically, I want to count the number of empty fields in each row and add that count as a column at the end, like so:
Original data:
ID    Code  Name   Age
1111  aaa   name1  23
1111  bbb   name1  23
2222  cccc
3333  fdfd         34
3333  rrrr
Result:
ID    Code  Name   Age  Empty Fields
1111  aaa   name1  23   0
1111  bbb   name1  23   0
2222  cccc              2
3333  fdfd         34   1
3333  rrrr              2
After that I want to concatenate the Code field for each duplicate ID and delete the rows with the higher number of empty fields, like so:
Result:
First:
ID    Code       Name   Age  Empty Fields
1111  aaa,bbb    name1  23   0
1111  aaa,bbb    name1  23   0
2222  cccc                   2
3333  fdfd,rrrr         34   1
3333  rrrr,fdfd              2
End Result:
ID    Code       Name   Age  Empty Fields
1111  aaa,bbb    name1  23   0
2222  cccc                   2
3333  fdfd,rrrr         34   1
This is an inefficient method/logic but since you've laid out your steps I assume there's some reason for this logic.
Count the number of missing values using CMISS, which counts across both character and numeric variables. Set the count variable to 0 first so that it is not itself included in the missing count. Use _all_ to refer to all variables in the data set.
*calculate number missing;
data step1;
set have;
*set N_Missing to 0 so it is not counted as missing;
N_Missing = 0;
*count missing;
N_Missing = cmiss(Of _all_);
run;
Combine CODE into one field, using a data step.
*combine CODE into one field;
proc sort data=step1;
by id code;
run;
data step2;
*read the sorted copy so that BY-group processing works;
set step1;
by ID code;
*declare the length before RETAIN so Code_Agg is created as character;
length Code_Agg $80;
retain Code_Agg;
if first.ID then Code_Agg = Code;
else Code_Agg = catx(", ", Code_Agg, Code);
if last.ID then output;
keep ID Code_Agg;
run;
Merge the output of the first two steps to get the middle table. You may need to sort the data ahead of time.
*merge results with prior table;
data step3;
merge step1 step2;
by ID;
run;
Keep only the record of interest using NODUPKEY in PROC SORT, which keeps one record per key. Note the double sort: the first sort puts the records in the right order so that the second one keeps the record with the fewest missing values.
proc sort data=step3;
by ID N_Missing;
run;
proc sort data=step3 out=final nodupkey;
by ID;
run;
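To check the logic of the four steps independently of SAS, here is a minimal Python sketch under stated assumptions: the rows are the sample data from the question, empty fields are represented as None, and the joined codes use a plain comma to match the output the question asks for. It is an illustration of the logic, not a translation of the SAS code.

```python
from collections import defaultdict

# Sample rows from the question; None marks an empty field (an assumption).
rows = [
    {"ID": 1111, "Code": "aaa", "Name": "name1", "Age": 23},
    {"ID": 1111, "Code": "bbb", "Name": "name1", "Age": 23},
    {"ID": 2222, "Code": "cccc", "Name": None, "Age": None},
    {"ID": 3333, "Code": "fdfd", "Name": None, "Age": 34},
    {"ID": 3333, "Code": "rrrr", "Name": None, "Age": None},
]

def n_missing(row):
    """Count fields that are None or an empty string."""
    return sum(1 for v in row.values() if v is None or v == "")

# Step 1: add the missing count to every row.
counted = [dict(r, N_Missing=n_missing(r)) for r in rows]

# Step 2: collect the codes of each ID in sorted order.
codes = defaultdict(list)
for r in sorted(counted, key=lambda r: (r["ID"], r["Code"])):
    codes[r["ID"]].append(r["Code"])

# Steps 3 and 4: per ID, keep the row with the fewest missing fields
# and replace its Code with the concatenated codes.
final = {}
for r in sorted(counted, key=lambda r: (r["ID"], r["N_Missing"])):
    if r["ID"] not in final:
        final[r["ID"]] = dict(r, Code=",".join(codes[r["ID"]]))

result = list(final.values())
for r in result:
    print(r)
```

Sorting by ID and then by the missing count before taking the first row per ID mirrors the double PROC SORT described above.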

Stata : Change name of variables with values of another Variables

I have a dataset of variables looking like this:
Screenshot of the Dataset.
I would like, if it is possible, to label the other variables with the name of the country they are related to. For example, ggdy1 is the gross debt/GDP ratio for country 1, here Austria, while ggdy2 is the Gross Debt/GDP ratio for country 2, Belgium.
To avoid the back and forth from the dataset to the results or command windows, is there a way to label the different variables (ggdy, pby,...) automatically with the name of the suitable country?
I have 28 countries in my dataset and work on Stata 15.
I have to say I think this is the wrong question. Your data structure is analogous to this
* Example generated by -dataex-. For more info, type help dataex
clear
input float year str7 country1 float(y1 x1) str7 country2 float(y2 x2)
1990 "Austria" 12 16 "Belgium" 20 24
1991 "Austria" 14 18 "Belgium" 22 26
end
which is both logical and perverse for most Stata purposes. A simple reshape gets you to a structure that is much more useful for most analyses.
. reshape long country y x , i(year) j(which)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 4
Number of variables 7 -> 5
j variable (2 values) -> which
xij variables:
country1 country2 -> country
y1 y2 -> y
x1 x2 -> x
-----------------------------------------------------------------------------
. l
+----------------------------------+
| year which country y x |
|----------------------------------|
1. | 1990 1 Austria 12 16 |
2. | 1990 2 Belgium 20 24 |
3. | 1991 1 Austria 14 18 |
4. | 1991 2 Belgium 22 26 |
+----------------------------------+
which does no harm, but is not essential.
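For readers who want to see what the reshape does to the data, here is a minimal Python sketch of the same wide-to-long restructuring. The function name reshape_long and its arguments are hypothetical, chosen to echo the Stata syntax; the rows are the two-observation example above.

```python
# The wide example from above: one row per year, one block of columns per country.
wide = [
    {"year": 1990, "country1": "Austria", "y1": 12, "x1": 16,
     "country2": "Belgium", "y2": 20, "x2": 24},
    {"year": 1991, "country1": "Austria", "y1": 14, "x1": 18,
     "country2": "Belgium", "y2": 22, "x2": 26},
]

def reshape_long(rows, stubs, i, j_values, j_name="which"):
    """Turn columns stub1, stub2, ... into one column per stub plus an index column."""
    long_rows = []
    for row in rows:
        for j in j_values:
            rec = {i: row[i], j_name: j}
            for stub in stubs:
                rec[stub] = row[f"{stub}{j}"]
            long_rows.append(rec)
    return long_rows

long_rows = reshape_long(wide, stubs=["country", "y", "x"], i="year", j_values=[1, 2])
for rec in long_rows:
    print(rec)
```

In the long layout the country name is an ordinary column, so there is no need to encode it in variable names or labels.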
P.S. What you ask for is programmable too, something like
foreach v of var ggdy* {
local suffix = substr("`v'", 5, .)
local where = country`suffix'[1]
label var `v' "ggdy `where'"
label var pby`suffix' "pby `where'"
label var cby`suffix' "cby `where'"
label var fby`suffix' "fby `where'"
}

Reading space delimited text file into SAS

I have the following .txt file:
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
....
I am trying to input it in my SAS program, so that it would look something like that:
Mark Country Type Count Price
1 Mark1 Country1 type1 1 1.50
2 Mark1 Country1 type2 5 21.00
3 Mark1 Country1 type3 NA NA
4 Mark2 Country2 type1 2 197.50
5 Mark2 Country2 type2 2 201.00
6 Mark2 Country2 type3 1 312.50
Or maybe something else, but I need to be able to print a two-way report:
Country1 Country2
Type1 ... ...
Type2 ... ...
Type3 ... ...
But the question is how to read that kind of text file:
read Mark1[Country1] and separate it into two columns, Mark and Country;
retain Mark and Country, read the information for each Type (somehow ignoring "type1=", maybe using formats) and put it into a table.
Maybe there is a way to use some kind of input template, or nested queries, to achieve that.
You have 3 name/value pairs, but the pairs are split between two rows. An unusual text file requiring creative input. The INPUT statement has a line control feature # to read relative future rows within the implicit DATA Step loop.
Example (Proc REPORT)
Read the mark and country from the current row (relative row #1), the counts from relative row #2 using #2 and the prices from relative row #3. After the name/value inputs are made for a given mark country perform an array based pivot, transposing two variables (count and price) at a time into a categorical (type) data form.
Proc REPORT produces a 'two-way' listing. The listing is actually a summary report (cells under count and price are a default SUM aggregate), but each cell has only one contributing value so the SUM is the original individual value.
data have(keep=Mark Country Type Count Price);
attrib mark country length=$10;
infile cards delimiter='[ ]' missover;
input mark country;
input #2 @'type1=' count_1 @'type2=' count_2 @'type3=' count_3;
input #3 @'type1=' price_1 @'type2=' price_2 @'type3=' price_3;
array counts count_:;
array prices price_:;
do _i_ = 1 to dim(counts);
Type = cats('type',_i_);
Count = counts(_i_);
Price = prices(_i_);
output;
end;
datalines;
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
;
ods html file='twoway.html';
proc report data=have;
column type country,(count price);
define type / group;
define country / ' ' across;
run;
ods html close;
Combined aggregation
proc means nway data=have noprint;
class type country;
var count price;
output out=stats max(price)=price_max sum(count)=count_sum;
run;
data cells;
set stats;
if not missing(price_max) then
cell = cats(price_max,'(',count_sum,')');
run;
proc transpose data=cells out=twoway(drop=_name_);
by type;
id country;
var cell;
run;
proc print noobs data=twoway;
run;
You can give the name of a variable as the value of the DLM= option on the INFILE statement. That way you can change the delimiter depending on the type of line being read.
It looks like you have three lines per group. The first one has the MARK and COUNTRY values. The second one has a list of COUNT values and the third one has a list of PRICE values. So something like this should work.
data want ;
length dlm $2 ;
length Mark $8 Country $20 rectype $8 recno 8 type $10 value1 8 value2 $8 ;
infile cards dlm=dlm truncover ;
dlm='[]';
input mark country ;
dlm='= ';
do rectype='Count','Price';
do recno=1 by 1 until(type=' ');
input type value1 @;
if rectype='Price' then input value2 @;
if type ne ' ' then output;
end;
input;
end;
cards;
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
;
Results:
Obs Mark Country rectype recno type value1 value2
1 Mark1 Country1 Count 1 type1 1.0
2 Mark1 Country1 Count 2 type2 5.0
3 Mark1 Country1 Price 1 type1 1.5 EUR
4 Mark1 Country1 Price 2 type2 21.0 EUR
5 Mark2 Country2 Count 1 type1 2.0
6 Mark2 Country2 Count 2 type2 1.0
7 Mark2 Country2 Count 3 type3 1.0
8 Mark2 Country2 Price 1 type1 197.5 EUR
9 Mark2 Country2 Price 2 type2 201.0 EUR
10 Mark2 Country2 Price 3 type3 312.5 EUR
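The parsing logic in both answers (one header line, then a line of counts, then a line of prices) can be sketched in Python as follows. This is a minimal illustration, not SAS: it assumes every group has exactly three lines, that values follow name= with optional spaces, and that the set of types is type1 to type3. The names PAIR and parse are hypothetical.

```python
import re

text = """\
Mark1[Country1]
type1=1 type2=5
type1=1.50 EUR type2=21.00 EUR
Mark2[Country2]
type1=2 type2=1 type3=1
type1=197.50 EUR type2=201.00 EUR type3= 312.50 EUR
"""

# name=value pairs; the currency code after a price is simply not captured
PAIR = re.compile(r"(\w+)=\s*([\d.]+)")

def parse(text, types=("type1", "type2", "type3")):
    """Return (mark, country, type, count, price) tuples, None where a type is absent."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    # Assumption: groups of exactly three lines (header, counts, prices).
    for k in range(0, len(lines) - 2, 3):
        header = re.fullmatch(r"(\w+)\[(\w+)\]", lines[k].strip())
        mark, country = header.group(1), header.group(2)
        counts = {name: int(v) for name, v in PAIR.findall(lines[k + 1])}
        prices = {name: float(v) for name, v in PAIR.findall(lines[k + 2])}
        for t in types:
            records.append((mark, country, t, counts.get(t), prices.get(t)))
    return records

records = parse(text)
for rec in records:
    print(rec)
```

Emitting one record per type for every group, with None for absent types, gives the rectangular shape that a two-way report needs.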

Data.frame after transpose and data.frame still gives variables as non-numeric

I find myself struggling with data import for further nMDS and Bioenv analysis with "vegan" and "ggplot2". I have a data frame "Taxa" that looks like this (the values are placeholders to show that the data are numeric):
head(Taxa)
X1 Station1 Stations1_2 Stations1_3 ...
Species1 123 456 789
Species2 123 456 789
Species3 123 456 789
...
After I transpose my data to have the stations (observations) as rows
Taxa <- t(Taxa)
X_1 Species1 Species2 Species3 ...
Station1 123 456 789
Species1_2 123 456 789
Species1_3 123 456 789
...
Now if I check how the data has been transposed I see that it has been converted into a "matrix"
class(Taxa)
[1] "matrix"
Now I can change again the matrix into a data frame
Taxa.df <- data.frame(Taxa)
And what I get then is the following:
head(Taxa.df)
X1 X2 X3
X_1 Species1 Species2 Species3 ...
Station1 123 456 789
Species1_2 123 456 789
Species1_3 123 456 789
...
Now what I would need is to get the first row to become the columns header so that I can restore the initial structure
colnames(Taxa.df)=Taxa.df[1,]
When I do this this happens to the data frame
23 10 16 ....
X_1 Species1 Species2 Species3 ...
Station1 123 456 789
Species1_2 123 456 789
Species1_3 123 456 789
...
I can't get the first row to become the header.
If I can't do this I can't run the transformation I need and all the stats analysis I still need to run. I spent the whole day simply trying to import the data from xlsx on Rstudio for Mac and solve this issue. I hope you guys can help. I did already look around a lot and mostly thought to have found these two links as useful answers, but nothing solved my exact problem.
http://r.789695.n4.nabble.com/Transposing-Data-Frame-does-not-return-numeric-entries-td852889.html
Why does the transpose function change numeric to character in R?
The first variable in your data frame was X1, with values Species1 etc., so the data frame is not purely numeric and t() returns a character matrix. You should read your data so that the species names become row names and every remaining variable is numeric, which you can achieve with the argument row.names=1 in the read.* command. Alternatively, you can transpose only the numeric data and then label the rows and columns with the original names. The following may work:
mat <- t(Taxa[,-1]) # remove col 1
colnames(mat) <- Taxa[[1]] # species names are stored in column 1, not in the row names
mat <- as.data.frame(mat)
However, I think you have not posted the actual output of your R commands, but have written by hand the things you think are essential to the structure. So your data may differ from what you display, and you may also have non-numeric rows. Just check sum(Taxa), which returns a number if your data are all numeric, sum(Taxa[,-1]), which returns a number if removing the first column is sufficient, and summary(Taxa), which gives Mean and Median for columns that are all numeric (including the first row).
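The idea of transposing only the numeric part and re-attaching the labels afterwards can be sketched in Python as follows. This is a minimal illustration with the placeholder values from the question, not R code; the variable names are hypothetical.

```python
# Placeholder table from the question: first column holds species names.
stations = ["Station1", "Stations1_2", "Stations1_3"]
taxa = [
    ["Species1", 123, 456, 789],
    ["Species2", 123, 456, 789],
    ["Species3", 123, 456, 789],
]

# Split labels from numbers before transposing, so nothing is coerced to text.
species = [row[0] for row in taxa]
values = [row[1:] for row in taxa]

# zip(*values) turns rows into columns: one tuple per station.
transposed = [list(col) for col in zip(*values)]

# Re-attach the labels: stations as rows, species as columns.
table = {st: dict(zip(species, col)) for st, col in zip(stations, transposed)}
for st, row in table.items():
    print(st, row)
```

Keeping the label column out of the transposed block is what prevents the mixed-type coercion described above.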

SAS table with percentage attached

I am trying to create a matrix with both numeric and percentage result. I was given two tables
id cc
1 2
1 5
1 40
2 55
2 2
2 130
2 177
3 20
3 55
3 40
4 30
4 100
id Description
1 Dell
1 Lenovo
1 HP
2 Sony
2 Dell
2 Acer
2 Other
3 Fujitsu
3 Sony
3 HP
4 Apple
4 Asus
I have already created a table that looks like this, using the code below:
CC   CC1  CC2  …  CC177
1    264  5    …  0
2    0    132  …  6
…
…
177  2    1    …  692
data RESULT;
set id_CC;
by id;
retain CC1-CC177; /*CC range from 1 to 177*/
array CC_List(177) CC1-CC177;
if first.id then do i=1 to 177;
CC_List(i)=0;
end;
CC_List(CC)=1;
if last.id then output;
run;
ods output sscp=coocs;
ods select sscp;
proc corr data=RESULT sscp;
var CC1-CC177;
run;
/*proc print data=coocs;*/
/*run;*/
/**/
In other words: of the ids that have CC1, how many also have CC2, ..., CC177, and so on? Now I am wondering whether it is possible to add a percentage next to each number. For instance, if CC1*CC1 = 264 (100%), then CC1*CC2 = 5/264 = 1.9%.
The other table I am trying to create shows the description of each CC in the matrix. Each CC number stands for one brand: 2=Dell, 177=Other, etc.
If I want to change CC1, CC2, ... to character descriptions, how do I modify the arrays? Eventually, I would like my table to look like this:
Description Dell Lenovo HP Sony Acer Other Fujitsu Sony
Dell 264 (100%)
Lenovo
HP 50 (10%)
Sony
Acer
Other
Fujitsu
Sony
In other words, how many people have dell also have acer, sony, other, etc?
The rename is a question that's been asked on here so I'll leave that one for now.
For the percentages you'll need to create a character variable. To calculate the percent, use the automatic variable _n_, which is the row number: in a square co-occurrence matrix, cc(_n_) is the diagonal element of the current row, so it serves as the denominator. Then use a concatenation function such as cats to create the variable in the format N(PP%).
data want;
set have;
array cc(177) cc1-cc177;
*length 20 so values such as 264(100.0%) are not truncated to the default 8 characters;
array dd(177) $ 20 dd1-dd177;
do i=1 to 177;
percent=cc(i)/cc(_n_);
dd(i)=cats(cc(i), "(", put(percent, percent8.1), ")");
end;
run;
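The calculation can be checked with a small Python sketch. It assumes, as the answer does, that the input is the square co-occurrence matrix (one row per CC), so the diagonal element of each row is the denominator. The 3x3 matrix below is hypothetical, apart from the 264 and 5 taken from the question's own example; the function name with_percent is also hypothetical.

```python
def with_percent(matrix):
    """Format each cell as N(PP.P%), using the row's diagonal element as denominator."""
    out = []
    for i, row in enumerate(matrix):
        diag = matrix[i][i]
        cells = []
        for v in row:
            # Guard against an empty category, where the diagonal is zero.
            pct = f"{v / diag:.1%}" if diag else "n/a"
            cells.append(f"{v}({pct})")
        out.append(cells)
    return out

# Hypothetical symmetric co-occurrence counts for three categories.
cooc = [
    [264, 5, 0],
    [5, 132, 6],
    [0, 6, 692],
]

formatted = with_percent(cooc)
for row in formatted:
    print(row)
```

Applied to a data set that has one row per id rather than one row per CC, the same indexing breaks down, because the row number no longer identifies a diagonal element.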
In response to Reeza's answer, I ran:
data RESULT_PRE;
set ID_CC;
by ID;
retain CC1-CC177;
array CC_LIST(177) CC1-CC177;
array DD_LIST(177) $ DD1-DD177;
if first.id then do i=1 to 177;
CC_LIST(i)=0;
end;
CC_LIST(CC)=1;
if last.id then output;
run;
data RESULT;
set RESULT_PRE;
array CC_LIST(177) CC1-CC177;
array DD_LIST(177) $ DD1-DD177;
do i=1 to 177;
percent=CC_LIST(i)/CC_LIST(_n_);
DD_LIST(i)=cats(CC_LIST(i), "(", put(percent, percent8.1), ")");
end;
run;
The log shows "Array subscript out of range at line xx column xx" and "ERROR 68-185: The function CC is unknown, or cannot be accessed."