Stata: combine multiple variables into one - variables

I have a problem in Stata. What I want to do is to combine multiple variables into one. My data looks like the following (simplified):
ID a b c
1 x . .
2 y . .
3 . z .
4 . w .
5 . . u
Now I want to generate a new variable d consisting of all values of variables a, b and c, such that d has no missing values:
ID a b c d
1 x . . x
2 y . . y
3 . z . z
4 . w . w
5 . . u u
I tried to use the command stack a b c, into(d) but then Stata gives me a warning that data will be lost and what is left of my data is only the stacked variable and nothing else. Is there another way to do it without renaming the variables a, b and c?
My dataset contains around 90 of these variables which I want to combine to a single variable, so maybe there is an efficient way to do so.

From your example, which implies numeric variables and at most one variable non-missing in each observation, egen's rowmax() function is all you need.
egen d = rowmax(a b c)

You can loop over the variables, replacing the new variable with the nonmissing values of the other variables. This assumes your variables are strings; Nick's solution works better for numeric variables.
clear
input ID str5(a b c)
1 x "" ""
2 y "" ""
3 "" z ""
4 "" w ""
5 "" "" u
end
gen d = ""
foreach v of varlist a-c {
    replace d = `v' if mi(d)
}
li
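For intuition, the same row-wise coalesce logic can be sketched in plain Python (hypothetical data mirroring the question; an empty string marks a missing value):

```python
# Row-wise coalesce: for each observation, take the first non-missing
# value across the candidate columns (here "" marks a missing string).
rows = [
    {"ID": 1, "a": "x", "b": "",  "c": ""},
    {"ID": 2, "a": "y", "b": "",  "c": ""},
    {"ID": 3, "a": "",  "b": "z", "c": ""},
    {"ID": 4, "a": "",  "b": "w", "c": ""},
    {"ID": 5, "a": "",  "b": "",  "c": "u"},
]

for row in rows:
    # next() picks the first non-empty value, mirroring the Stata loop
    row["d"] = next((row[v] for v in ("a", "b", "c") if row[v] != ""), "")

print([row["d"] for row in rows])  # ['x', 'y', 'z', 'w', 'u']
```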

You could similarly use stack as you were, while specifying the wide option:
clear
input ID str5(a b c)
1 x "" ""
2 y "" ""
3 "" z ""
4 "" w ""
5 "" "" u
end
stack a b c, into(d) wide clear
keep if !mi(d)

Related

Separation of paths in a CSV file

I am facing a problem regarding the separation of file paths in a CSV file.
The file has the following structure:
w:b/c/v/n/1/y/i
w:b/c/v/n/2/h/l
w:b/c/v/n/3/n/r
w:b/c/v/n/4/f/e
This is one column of my CSV file, the path name. Now what I need to do is to preserve my first column and create 3 more columns for:
1 y i
2 h l
3 n r
4 f e
I know that .str.split('/', expand=True) works in such cases, but my problem is how to make it leave out the "b/c/v/n" part. Could you please help me with it?
If you have a DataFrame like this:
path
0 w:b/c/v/n/1/y/i
1 w:b/c/v/n/2/h/l
2 w:b/c/v/n/3/n/r
3 w:b/c/v/n/4/f/e
Then:
df[["col1", "col2", "col3"]] = (
df["path"].str.split("/", expand=True).iloc[:, -3:]
)
print(df)
creates 3 new columns:
path col1 col2 col3
0 w:b/c/v/n/1/y/i 1 y i
1 w:b/c/v/n/2/h/l 2 h l
2 w:b/c/v/n/3/n/r 3 n r
3 w:b/c/v/n/4/f/e 4 f e
You could first take a slice of the string and then use split, like the following (this assumes the prefix you want to drop is always the same length):
.str[10:].str.split('/', expand=True)
Another try is:
s = 'w:b/c/v/n/1/y/i'
temp = s.split("/")
s = "/".join(temp[0:4]) + "," + ",".join(temp[4:7])
Result is:
w:b/c/v/n,1,y,i
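Applied to every row, the same split/join idea looks like this in plain Python, without pandas (sample paths taken from the question):

```python
paths = [
    "w:b/c/v/n/1/y/i",
    "w:b/c/v/n/2/h/l",
    "w:b/c/v/n/3/n/r",
    "w:b/c/v/n/4/f/e",
]

rows = []
for s in paths:
    parts = s.split("/")
    # keep the first four components as the original path prefix,
    # and break out the last three components as separate columns
    rows.append(["/".join(parts[:4])] + parts[4:7])

print(rows[0])  # ['w:b/c/v/n', '1', 'y', 'i']
```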
A more concise solution:
df[['col', 'col1', 'col2', 'col3']] = df['path'].str.rsplit('/', n=3, expand=True)
df
path col col1 col2 col3
0 w:b/c/v/n/1/y/i w:b/c/v/n 1 y i
1 w:b/c/v/n/2/h/l w:b/c/v/n 2 h l
2 w:b/c/v/n/3/n/r w:b/c/v/n 3 n r
3 w:b/c/v/n/4/f/e w:b/c/v/n 4 f e

Extracting a word from a string over n rows and appending that word as a new column in SQL Server

I have a data set that contains 3 columns and 15565 observations. One of the columns contains several words in the same row.
What I am looking to do is extract a particular word from each row and append it to a new column (I will have 4 columns in total).
The problem is that the words I am looking for are not all the same, and they are not always in the same position.
Here is an extract of my DS:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract all the words that start with 3C and are 10 characters long, and then append them to a new column so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and Substring methods but they do not give me the results I am looking for: they basically find the rows that contain the 3C number but do not extract it; the whole cell just gets copied into the Ref column.
SQL Server doesn't have good string functions, but this should suffice if you only want to extract one value per row:
select t.*,
       left(stuff(col, 1,
                  patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', col) - 1,
                  ''),
            10) as Ref
from t;
Note the - 1: stuff() deletes that many characters from the start, so the remaining string begins at the 3C match and left(..., 10) keeps the 10-character reference. Rows without a match come back as NULL.
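For comparison, the same extraction is easy to express with a regular expression; here is a plain-Python sketch (not SQL), using sample strings from the question:

```python
import re

values = [
    "3C00652722 (T558799A)",
    "NA >> MSP: T0578836A & 3C03024632",
    "T0579010A, 3C03051500, EAET03051496",
]

# '3C' followed by exactly 8 digits -> the 10-character reference
pattern = re.compile(r"3C[0-9]{8}")

refs = [m.group(0) if (m := pattern.search(v)) else None for v in values]
print(refs)  # ['3C00652722', '3C03024632', '3C03051500']
```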

How to load array of strings with tab delimiter in pig

I have a text file with a tab delimiter, and I am trying to print the first column as id and the remaining array of strings as a second column, names.
consider below is the file to load:
cat file.txt;
1 A B
2 C D E F
3 G
4 H I J K L M
In the above file, first column is an id and the remaining are names.
I should get the output like:
id names
1 A,B
2 C,D,E,F
3 G
4 H,I,J,K,L,M
If the names were separated by a , delimiter, then I could get the output by using the commands below:
test = load '/tmp/arr' using PigStorage('\t') as (id:int, names:chararray);
btest = FOREACH test GENERATE id, FLATTEN(TOBAG(STRSPLIT(names, ','))) as value:tuple(name:CHARARRAY);
But when the names themselves are separated by '\t', I do not get them, because the line has already been split on every tab and only the first name lands in column 2 (i.e., names).
Any solution for this?
I have a solution for this:
When you use PigStorage('\t') in the load statement, the file is split on tabs, so a line with n tabs produces n+1 columns. That is how it works.
But there is a trick: change the delimiter of the input file to some other character, such as a comma, so that only the id is split off and the names stay together in one field.
It will work for sure.
Input file sample:
1,A B
2,C D E F
3,G
4,H I J K L M
Hope this helps
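An alternative that keeps the tab-delimited file as-is is to split each line only on the first tab, which is effectively what the comma trick achieves. In plain Python the idea looks like this (sample lines from the question):

```python
lines = [
    "1\tA\tB",
    "2\tC\tD\tE\tF",
    "3\tG",
    "4\tH\tI\tJ\tK\tL\tM",
]

records = []
for line in lines:
    # split only on the first tab: id on the left, all names on the right
    id_, names = line.split("\t", 1)
    records.append((int(id_), names.replace("\t", ",")))

print(records[1])  # (2, 'C,D,E,F')
```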

SAS: checking whether a third variable is between the values of two other variables

I have been dealing with this issue that I thought was trivial, but for some reason nothing I have tried has worked so far.
I have a dataset
obs A B C
1 2 6 7
2 3 1 5
3 8 5 9
. . . .
For each observation, I want to compare the value in column A to the values in columns B and C, and assign the value 1 to a variable called within when A lies between them. My goal is to keep only the observations whose A value is within their B and C values. I have tried everything, but nothing seems to be working.
Thank you.
Here's how to do it in a data step. Let me know if that works for you.
data new;
    set old;
    if B < A < C then within = 1;
    else delete;
run;
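The same between-check reads almost identically in Python, which also supports chained comparisons (data from the question's example):

```python
rows = [
    {"A": 2, "B": 6, "C": 7},
    {"A": 3, "B": 1, "C": 5},
    {"A": 8, "B": 5, "C": 9},
]

# keep only observations where A lies strictly between B and C,
# flagging them with within = 1 (mirrors the SAS data step)
kept = []
for row in rows:
    if row["B"] < row["A"] < row["C"]:
        row["within"] = 1
        kept.append(row)

print([r["A"] for r in kept])  # [3, 8]
```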

Find all paths of at most length 2 from a set of relationships

I have a connection data set where each row marks that A connects to B, in the form A B. The direct connection between A and B appears only once, either as A B or as B A. I want to find all the connections reachable by a path of at most length 2, i.e. A and C are connected if they are directly connected, or if A connects to C through some B.
For example, I have the following direct connection data
1 2
2 4
3 7
4 5
Then the resulting data I want is
1 {2,4}
2 {1,4,5}
3 {7}
4 {1,2,5}
5 {2,4}
7 {3}
Could anybody help me to find a way as efficient as possible? Thank you.
You could do this:
myudf.py
@outputSchema('bagofnums: {(num:int)}')
def merge_distinct(b1, b2):
    # b1 and b2 are bags of (num, link) tuples; collect the distinct links
    out = []
    seen = set()
    for ignore, n in b1:
        if n not in seen:
            seen.add(n)
            out.append(n)
    for ignore, n in b2:
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
script.pig
register 'myudf.py' using jython as myudf ;
A = LOAD 'foo.in' USING PigStorage(' ') AS (num: int, link: int) ;
-- Essentially flips A
B = FOREACH A GENERATE link AS num, num AS link ;
-- We need to union the flipped A with A so that we will know:
-- 3 links to 7
-- 7 links to 3
-- Instead of just:
-- 3 links to 7
C = UNION A, B ;
-- C is in the form (num, link)
-- You can't do JOIN C BY link, C BY num ;
-- So, T just is C replicated
T = FOREACH C GENERATE * ;
D = JOIN C BY link, T BY num ;
E = FOREACH (FILTER D BY $0 != $3) GENERATE $0 AS num, $3 AS link_hopped ;
-- The output from E are (num, link) pairs where the link is one hop away. EG
-- 1 links to 2
-- 2 links to 4
-- 3 links to 7
-- The output will be:
-- 1 links to 4
F = COGROUP C BY num, E BY num ;
-- I use a UDF here to merge the bags together. Otherwise you will end
-- up with a bag for C (direct links) and E (links one hop away).
G = FOREACH F GENERATE group AS num, myudf.merge_distinct(C, E) ;
Schema and output for G using your sample input:
G: {num: int,bagofnums: {(num: int)}}
(1,{(2),(4)})
(2,{(4),(1),(5)})
(3,{(7)})
(4,{(5),(2),(1)})
(5,{(4),(2)})
(7,{(3)})
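The whole computation is small enough to check in plain Python. This sketch builds the undirected adjacency first, then adds every two-hop neighbour, using the sample edges from the question:

```python
edges = [(1, 2), (2, 4), (3, 7), (4, 5)]

# undirected adjacency: each edge is visible from both endpoints
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# add every neighbour-of-a-neighbour, excluding the node itself
result = {}
for node, direct in adj.items():
    two_hop = set(direct)
    for mid in direct:
        two_hop |= adj[mid]
    two_hop.discard(node)
    result[node] = two_hop

# matches the sample output above
print(sorted((n, sorted(s)) for n, s in result.items()))
```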