Stata: How to use column value as file name in loop - automation

I am working with 350 datasets. I want to automate naming the final datasets with values taken from each dataset.
For example, suppose ID is abc and year is 2010, and the dataset has two columns holding those values. I want to pull that information out and use it in the file name, so the name would be abc_2010.dta in this case.
So basically I want to do
foreach file in `files' {
    **calculation codes**
    ** construct the file name as three digit ID_year.dta **
}
I have already done the calculation part. I need some help with the naming of the files.

If I understand what you are trying to do, I believe you should be able to do this:
foreach file in `files' {
    **calculation codes**
    ** construct the file name as three digit ID_year.dta **
    local fname:di "`=id[1]'_`=year[1]'"
    save `fname', replace
}
Note that this assumes that, after the calculations in the current iteration of the loop over files, the value of id in the first row holds the three-digit code and the value of year in the first row holds the year.
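For readers who want to see the naming logic outside Stata, here is a minimal Python sketch of the same idea. It assumes the dataset is available as a list of row dictionaries with keys id and year; the function name build_filename and the sample rows are hypothetical and only for illustration.

```python
def build_filename(rows, id_col="id", year_col="year", ext=".dta"):
    """Build '<id>_<year><ext>' from the first row, mirroring id[1] and year[1]."""
    if not rows:
        raise ValueError("dataset has no rows")
    first = rows[0]
    return f"{first[id_col]}_{first[year_col]}{ext}"


# Made-up example rows; every row carries the same id and year.
rows = [{"id": "abc", "year": 2010}, {"id": "abc", "year": 2010}]
print(build_filename(rows))
```

As in the Stata version, only the first row is read, so the sketch relies on id and year being constant within a dataset.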

Related

Powershell how to use SQL syntax on system.data.datatable foreach variable lookup?

I am using a data table in PowerShell ISE on two large CSV files. The second file is used to look up facilities by an ID number found in File1, so File2 has one or many facilities per ID. I'm trying to do a standard foreach loop, but I'm not sure how to supply the ID as a variable, since the filter uses SQL-like syntax (as I understand it). $Script:MappingTable2 has at least one matching row per ID, but sometimes many. I have loaded both files into data tables and can't figure out how to add the PowerShell equivalent of $_ for the lookup. Where the literal 2017 pulls all the facilities that have 2017 as the ID, I would like to look these up per $Code in File1. $Script:MappingTable2.select("PRACT_ID = '$Code.ID'") won't work, but with just the number there (as below) it grabs all the rows related to the ID, or just one if there is only a one-to-one match. I'm hoping to add the resulting array as a single combined string in a new third column of a new CSV, which is supposed to be a merge of the two files. Thanks.
foreach ($Code in $File1) {
    $Script:MappingTable2.select("PRACT_ID = '2017'")
}
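I figured it out! $Code.GetType() reported Name DataRow with BaseType System.Object. When I assign the ID to a [string] variable first, it works. (Inside a double-quoted string, PowerShell expands only $Code and leaves .ID as literal text, which is presumably why the original filter failed; a subexpression such as $($Code.PRACT_ID) should avoid this as well.) So this works and is very fast.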
foreach ($Code in $File1) {
    [string]$ID = $Code.PRACT_ID
    $Script:MappingTable2.select("PRACT_ID = '$ID'")
}
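The last step described in the question, combining all matching facilities into one string per ID, is not shown above. A rough Python sketch of that merge logic follows, assuming both files are already loaded as lists of dictionaries. The column names FACILITY and FACILITIES, the separator, and the function name are hypothetical; only PRACT_ID comes from the thread.

```python
from collections import defaultdict


def combine_facilities(file1_rows, file2_rows, key="PRACT_ID",
                       value="FACILITY", sep="; "):
    """Attach to each File1 row a joined string of all File2 values sharing its key."""
    # Index File2 once so each lookup is a dictionary access, not a table scan.
    lookup = defaultdict(list)
    for row in file2_rows:
        lookup[str(row[key])].append(str(row[value]))

    merged = []
    for row in file1_rows:
        out = dict(row)  # copy so the input rows are left untouched
        out["FACILITIES"] = sep.join(lookup.get(str(row[key]), []))
        merged.append(out)
    return merged
```

An ID with no match in File2 gets an empty string here; whether that is the right behaviour depends on the data.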

Splitting a variable into separate rows

I would like to split my variables "Wellbeing_Pre" and "Wellbeing_Post" into one Wellbeing variable and one Pre/Post variable, so that I have one row each for pre and post, one column for the variable "wellbeing", and one for the variable "Pre / Post" (0=pre, 1=post).
My data are already in long format, with several rows per person, one for every measurement point. Now I would like to split the pre and post measurements into separate rows as well.
I'm grateful for ideas! :-)
To elaborate on @horace_vr's solution:
varstocases make Wellbeing from Wellbeing_Pre Wellbeing_Post /index=PPtxt(Wellbeing).
* variable PPtxt now has the values "Wellbeing_Pre" and "Wellbeing_Post".
* If you want a numeric 1/0 variable instead, do the following.
recode PPtxt ("Wellbeing_Pre"=0) ("Wellbeing_Post"=1) into Pre_Post.
exe.
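To make the reshaping concrete for readers who do not use SPSS, here is a small Python sketch of the same wide-to-long step. It assumes the rows are dictionaries; the id column in the usage note is hypothetical, and the sketch does not try to reproduce how VARSTOCASES handles missing values.

```python
def wide_to_long(rows, pre_col="Wellbeing_Pre", post_col="Wellbeing_Post"):
    """Turn each wide row into two long rows: Pre_Post 0 (pre) and 1 (post)."""
    long_rows = []
    for row in rows:
        # Keep every other column (for example an ID) on both output rows.
        base = {k: v for k, v in row.items() if k not in (pre_col, post_col)}
        long_rows.append({**base, "Wellbeing": row[pre_col], "Pre_Post": 0})
        long_rows.append({**base, "Wellbeing": row[post_col], "Pre_Post": 1})
    return long_rows
```

For example, a row with id 1, Wellbeing_Pre 3 and Wellbeing_Post 5 becomes two rows with Wellbeing 3 (Pre_Post 0) and Wellbeing 5 (Pre_Post 1).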

Pulling data from a dataframe column only if it contains a certain value

Fairly new to programming in R,
I have a dataframe from which I am trying to create a more concise table by pulling the entire row only if it contains a certain name in the "name" column. The names are all in a separate text document. Any suggestions?
I tried:
refGenestable <- dbGetQuery(con, "select row_names, name, chrom, strand, txStart, txEnd from refGene where name in c_Gene")
where c_Gene is the list of names I need to test that I have turned into a dataframe. I also tried turning into a list of strings and iterating through that but also had problems with that
Edit:
Sorry for the confusion, I'm still learning! I created the dataframe ("refGenestable") in R (yes, it comes from a SQL database), but I now want to narrow it down to only the rows whose name matches one of the names in a text file, c_Genes, where each name is separated by \n. I created a list out of this file.
You may have a few issues here. It's hard to know exactly what you need because it's unclear what the structure of your data is.
The general question is easy to answer.
Provided you have a data frame, and you want a new one with only names that are in a vector, you can use DF[DF$name %in% <some vector>, ] or with dplyr filter(DF, name %in% <some vector>). You can't use %in% to test whether something is in a data frame, though. You have to actually extract the variable from the other data frame.
If the names you want to keep are lines in a text file, then you're also asking a question about how to get the text file into R, in which case it's my_vector <- readLines("path to file"). The actual code will depend on the structure of the file, but if each element is on a new line, that will do what you want.
If the names you want to keep are in another data frame, then you need to extract them as a vector in order to use %in%, i.e., filter(DF, name %in% OTHERDF$name)
EDIT:
From your edit to the question, my answer should likely work for you. Though, again, we don't know for sure what the structure of your data is without seeing it (you can provide it by pasting the output of dput(<your object>)). Here's the answer above, using the names for objects that you've described.
gene_names <- readLines("c_Genes")
# is that really the name? No extension? Is it in your working directory?
# if not, you need to use a relative or absolute path for c_Genes
genes_you_want <- refGenestable[refGenestable$name %in% gene_names,]
# is the column with the gene name called name?
# don't forget the comma at the end
# or with dplyr
install.packages("dplyr")
library(dplyr)
genes_you_want <- filter(refGenestable, name %in% gene_names)
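The same membership test can be sketched in Python for comparison. This is only an illustration, assuming the table is a list of dictionaries with a name key and the wanted names arrive as newline-separated text; the function name and the sample values in the usage note are made up.

```python
def filter_by_names(rows, names_text, column="name"):
    """Keep rows whose `column` value appears in a newline-separated list of names."""
    # A set gives fast membership checks; blank lines are ignored.
    wanted = {line.strip() for line in names_text.splitlines() if line.strip()}
    return [row for row in rows if row[column] in wanted]
```

Calling filter_by_names(rows, "geneA\ngeneC\n") on rows named geneA, geneB and geneC keeps the geneA and geneC rows, in their original order.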

How to rename a list of variables

I generated a number of dummy variables from a variable indicating the relevant quarter, named quarter, with the following command:
tabulate quarter, generate(timeq)
This generates a set of dummy variables that range from timeq1 to timeq68.
I am trying to think about a way to rename these variables to change the names in the following way
timeq1 into 1995q1
timeq2 into 1995q2
timeq3 into 1995q3
timeq4 into 1995q4
...
timeq68 into 2011q4
As has been pointed out, this question sorely lacks a good MCVE, but an answer is possible.
Note first that the ambition to create variables with names beginning with 1 and 2 is futile, as Stata variable names may not begin with a digit. However, a leading underscore is allowed.
The problem of renaming requires, I think, at least one loop. In fact, it seems easier to back up and create the variables from scratch.
The first part of the code just creates a sandbox for play.
clear
set obs 68
gen quarter = yq(1994, 4) + _n
format quarter %tq
The second part is ad hoc code for the problem. Note that variable labels can take the form desired.
forval y = 1995/2011 {
    forval q = 1/4 {
        gen _`y'q`q' = quarter == yq(`y', `q')
        label var _`y'q`q' "`y'q`q'"
    }
}
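If renaming the existing timeq variables is still preferred, the index arithmetic behind the mapping can be checked with a short Python sketch. It assumes the 68 dummies correspond to consecutive quarters starting at 1995q1 with no gaps; the function name is hypothetical, and the underscore prefix follows the naming rule noted above.

```python
def quarter_names(first_year=1995, n=68, prefix="timeq"):
    """Map timeq1..timeqN to _<year>q<quarter>, starting at first_year q1."""
    mapping = {}
    for i in range(1, n + 1):
        year = first_year + (i - 1) // 4   # four dummies per year
        quarter = (i - 1) % 4 + 1          # cycles 1, 2, 3, 4
        mapping[f"{prefix}{i}"] = f"_{year}q{quarter}"
    return mapping


names = quarter_names()
print(names["timeq1"], names["timeq68"])
```

If some quarters are missing from the data, the dummies produced by tabulate would no longer line up with this arithmetic, which is one reason creating the variables from quarter directly is safer.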

Table aware parsing of a string field

I have a table of videos with a field, filename, and some of these videos are split in multiple parts with the starting frame number of the video part appended to the end of the filename separated by a '_'.
I want to get the integer which represents the starting frame for each filename, so for e.g.:
movie.avi : frame=0
movie_500.avi: frame=500
For the two files above, I can get it with a regular expression on my table:
SELECT coalesce(substring(filename FROM '_(\d{2,7})\.avi$')::int, 0) FROM table;
However, how to deal with the case when the filename of the video might include numbers at the end. Say I have the two files:
anothermovie_100.avi: frame = 100 (WRONG!)
anothermovie_100_500.avi: frame = 500
My select statement above will give me the wrong frame starting number. I want to know from looking at my table that anothermovie_100 has frame=0 because there exists another filename in the same table which contains anothermovie_100 and finishes in three digits at the end.
So basically for a table with the four above-mentioned rows, I would like my select statement to give me this:
movie.avi: frame=0
movie_500.avi: frame=500
anothermovie_100.avi: frame=0
anothermovie_100_500.avi: frame=500
So the query has to somehow know if the filename string is not contained entirely in another filename string of the same table, in which case it must return frame 0 and not the last digits on the filename converted to integer.
I think the issue here is modeling the data - you should keep a reference to which movie each file belongs to.
Otherwise, your data may be ambiguous. Assume you have the files movie.avi and movie_500_500.avi. How would you tell (regardless of SQL syntax, just in plain English) whether movie_500.avi is in fact frame 500 of movie.avi or frame 0 of movie_500_500.avi?
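To illustrate both the rule the question asks for and the ambiguity described here, below is a Python sketch of one possible heuristic, not a recommended solution. It assumes the filenames are available as a list and treats a file as a whole movie (frame 0) whenever another file extends its name with _<digits>; the function name start_frames is hypothetical.

```python
import re

# Greedy stem, then '_' plus 2 to 7 digits before '.avi', as in the question's regex.
SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<num>\d{2,7})\.avi$")


def start_frames(filenames):
    """Guess each file's starting frame using only the other names in the list."""
    # Collect every name that some other file extends with '_<digits>'.
    parents = set()
    for name in filenames:
        match = SUFFIX.match(name)
        if match:
            parents.add(match.group("stem") + ".avi")

    frames = {}
    for name in filenames:
        match = SUFFIX.match(name)
        if match is None or name in parents:
            frames[name] = 0  # no numeric suffix, or treated as a whole movie
        else:
            frames[name] = int(match.group("num"))
    return frames
```

On the four files from the question this gives 0, 500, 0 and 500. Given movie.avi, movie_500.avi and movie_500_500.avi, it silently assigns movie_500.avi frame 0, which is exactly the kind of guess that an explicit reference to the parent movie would make unnecessary.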