Pulling data from a dataframe column only if it contains a certain value - sql

Fairly new to programming in R.
I have a dataframe from which I am trying to create a more concise table by pulling an entire row only if it contains a certain name in the "name" column. The names are all in a separate text document. Any suggestions?
I tried:
refGenestable <- dbGetQuery(con, "select row_names, name, chrom, strand, txStart, txEnd from refGene where name in c_Gene")
where c_Gene is the list of names I need to test, which I have turned into a dataframe. I also tried turning it into a list of strings and iterating through that, but had problems with that as well.
Edit:
Sorry for the confusion, I'm still learning! I created the dataframe ("refGenestable") in R (but yes, it is from a SQL database). I now want to narrow it down to include only rows whose name matches one of the names I have in a text file, c_Genes, where each name is separated by \n. I created a list out of this file.

You may have a few issues here. It's hard to know exactly what you need because it's unclear what the structure of your data is.
The general question is easy to answer.
Provided you have a data frame and you want a new one containing only the names that are in a vector, you can use DF[DF$name %in% <some vector>, ] or, with dplyr, filter(DF, name %in% <some vector>). You can't use %in% to test whether something is in a data frame, though. You have to actually extract the variable from the other data frame.
If the names you want to keep are lines in a text file, then you're also asking a question about how to get the text file into R, in which case it's my_vector <- readLines("path to file"). The actual code will depend on the structure of the file, but if each element is on a new line, that will do what you want.
If the names you want to keep are in another data frame, then you need to extract them as a vector in order to use %in%, i.e., filter(DF, name %in% OTHERDF$name)
EDIT:
From your edit to the question, my answer should likely work for you. Though, again, we don't know for sure what the structure of your data is without seeing it (you can provide it by pasting the output of dput(<your object>)). Here's the answer above, using the object names you've described.
gene_names <- readLines("c_Genes")
# is that really the name? No extension? Is it in your working directory?
# if not, you need to use a relative or absolute path for c_Genes
genes_you_want <- refGenestable[refGenestable$name %in% gene_names,]
# is the column with the gene name called name?
# don't forget the comma at the end
# or with dplyr
install.packages("dplyr") # only needed once
library(dplyr)
genes_you_want <- filter(refGenestable, name %in% gene_names)

Related

Stata: How to use column value as file name in loop

I am working with 350 datasets. I want to automate naming the final datasets with values from the dataset.
For example, if ID is abc and year is 2010 (there are two columns in the dataset with those values), I want to pull that information out and use it in the file name, which would look like abc_2010.dta in this case.
So basically I want to do
foreach file in `files' {
    ** calculation codes **
    ** construct the file name as three digit ID_year.dta **
}
I have already done the calculation part. I need some help with the naming of the files.
If I understand what you are trying to do, I believe you should be able to do this:
foreach file in `files' {
    ** calculation codes **
    ** construct the file name as three digit ID_year.dta **
    local fname : di "`=id[1]'_`=year[1]'"
    save `fname', replace
}
Note that this assumes that, after the calculation in the current iteration of the loop through files, the value of id in the first row holds the three-digit code and the value of year in the first row holds the year.

Loop over list of DataFrame names and save to csv in PySpark

I need to save a bunch of PySpark DataFrames as csv tables. The tables should also have the same names as the DataFrames.
The code should be something like this:
for table in ['ag01', 'a5bg', 'h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + table + '.csv'
    table.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
The problem here is that in the command "table.repartition(1)..." I need the actual data frame objects, not their names as strings, so in this form the code doesn't work. But if I write "for table in [ag01, a5bg, ...]", i.e. without quotes in the list, then I cannot define the path, because I cannot concatenate a data frame and a string. How can I resolve this dilemma?
Thanks in advance!
Having a bunch of loose variable names is not considered good coding practice. You should have used a list or a dictionary in the first place. But if you're already stuck with this, you can use eval to get the dataframe stored under that variable name.
for table in ['ag01', 'a5bg', 'h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + table + '.csv'
    df = eval(table)
    df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
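For a fresh start, the dict approach mentioned above looks like this. A minimal sketch: plain strings stand in for the real Spark DataFrames so it is self-contained, the actual write call is left as a comment, and the HDFS path is copied from the question.

```python
# Keep the DataFrames in a dict keyed by table name instead of in
# separate variables -- then the name and the object travel together.
tables = {
    "ag01": "stand-in for DataFrame ag01",  # in real code: a Spark DataFrame
    "a5bg": "stand-in for DataFrame a5bg",
}

saved_paths = []
for name, df in tables.items():
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + name + '.csv'
    # Real code would write the DataFrame here:
    # df.repartition(1).write.mode("overwrite") \
    #     .format("com.databricks.spark.csv") \
    #     .option("header", "true").save(ppath)
    saved_paths.append(ppath)

print(saved_paths)
```

No eval is needed, and adding or removing a table is a one-line change to the dict.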

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
using DataFrames

function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
Appending a vector to one column of df. Note: in my full code, I append a more complicated Vector{Int} than the return value of ones.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
which means, I suppose, that this isn't efficient. I can't work out what this @code_warntype warning means. More generally, how can I understand the warnings returned by @code_warntype and fix them? This is a recurring, unclear issue for me.
EDIT: following @BogumiłKamiński's answer
Then how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things, as it will cause many functions from DataFrames.jl to stop working properly. In fact, such code will soon throw an error; see https://github.com/JuliaData/DataFrames.jl/issues/1844, which is exactly about patching this hole in the DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you have type instability is that a DataFrame can hold vectors of any type (technically, columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 in nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! matches columns by name, so column names must be consistent between the target and the appended object.
This is different from push!, which also allows you to push a row that does not specify column names (in my example above I show that a Tuple can be pushed).

Keeping table formatting in Sage with multiple tables

As the title suggests, I am trying to keep proper table formatting in Sage while displaying multiple tables (this is strictly a formatting question, so no knowledge of the math involved is necessary). Currently, I am using the following code:
my_table2 = table([column1, column2], frame = True)
my_table1 = table([in_the_cone, lengths_in_cone], frame = True)
result_table1 = my_table1.transpose()
result_table2 = my_table2.transpose()
result_table1
result_table2
With this, I receive no output for table1 and the following output for table2:
I want both tables to look this way, but having no output for the first table is no good. So I tried changing the bottom two lines to:
result_table1, result_table2
While this does display both tables, the formatting now looks like:
Is there a way I can display both tables at the same time with the first formatting?
It would have been nice for you to include a full minimal working example, but in any case it does depend a little on the output.
Basically, in a notebook or other "cell", only the last return value prints to the screen in some fashion (sometimes via a "hook", as in your case). But if you use the comma, that implicitly creates a "tuple", which is then printed as a tuple, so you lose the "hook" that displays things with math modes (since a tuple doesn't have one).
In this case, the (newish) canonical way to achieve what you want is
pretty_print(result_table1)
pretty_print(result_table2)
though you may want to put print "\n" in between so they don't end up right on top of each other.
Edit: Here is a picture in Jupyter inside of Sage.

R - find name in string that matches a lookup field using regex

I have a data frame of ad listings for pets:
ID Ad_title
1 1 year old ball python
2 Young red Blood python. - For Sale
3 1 Year Old Male Bearded Dragon - For Sale
I would like to take the common name in the Ad_title (i.e. ball python) and create a new field with the Latin name for the species. To assist, I have another data frame that has the Latin names and common names:
ID Latin_name Common_name
1 Python regius E: Ball Python, Royal Python G: Königspython
2 Python brongersmai E: Red Blood Python, Malaysian Blood Python
3 Pogona barbata E: Eastern Bearded Dragon, Bearded Dragon
How can I go about doing this? The tricky part is that the common names are hidden within surrounding text, both in the ad listing and in Common_name. If that were not the case I could just use %in%. If there were a way/function to use regex, I think that would be helpful.
The other answer does a good job outlining the general logic, so here's a few thoughts on a simple (though not optimized!!) way to do this:
First, you'll want to make a big table: two columns of all 'common names' (each name gets its own row) alongside its Latin name. You could also make a dictionary here, but I like tables.
reference_table <- data.frame(common = c("cat", "kitty", "dog"), technical = c("feline", "feline", "canine"))
common technical
1 cat feline
2 kitty feline
3 dog canine
From here, just loop through every element of "ad_title" (use apply() or a for loop, depending on your preference). Now use something like this:
apply(reference_table, 1, function(X) {
  # note: apply() passes each row as a character vector,
  # so use X["common"], not X$common
  if (length(grep(X["common"], ad_title)) > 0) {
    # the common name was found in ad_title:
    # [code to replace the string]
  }
})
For inserting the new string, play with your regular regex tools. Alternatively, play with strsplit(ad_title, X["common"]). You'll be able to rebuild the ad_title using paste() and the parts produced by strsplit.
Again, this is NOT the best way to do this, but hopefully the logic is simple.
Well, I tried to create a workable solution for your requirement. There could be better ways to execute it, though, probably using packages such as data.table and/or stringr. Anyway, this snippet could be a working starting point. Oh, and I modified the Ad_title data a bit so that the species names are in titlecase.
# Re-create data
Ad_title <- c("1 year old Ball Python", "Young Red Blood Python. - For Sale",
              "1 Year Old Male Bearded Dragon - For Sale")
df2 <- data.frame(Latin_name = c("Python regius", "Python brongersmai", "Pogona barbata"),
                  Common_name = c("E: Ball Python, Royal Python G: Königspython",
                                  "E: Red Blood Python, Malaysian Blood Python",
                                  "E: Eastern Bearded Dragon, Bearded Dragon"),
                  stringsAsFactors = FALSE)
# Aggregate common names
Common_name <- paste(df2$Common_name, collapse = ", ")
Common_name <- unlist(strsplit(Common_name, "(E: )|( G: )|(, )"))
Common_name <- Common_name[Common_name != ""]
# Data frame: Latin names vs common names
df3 <- data.frame(Common_name,
                  Latin_name = sapply(Common_name, grep, df2$Common_name),
                  row.names = NULL, stringsAsFactors = FALSE)
df3$Latin_name <- df2$Latin_name[df3$Latin_name]
# Data frame: Ad vs common names
Ad_Common_name <- unlist(sapply(Common_name, grep, Ad_title))
df4 <- data.frame(Ad_title,
                  Common_name = sapply(1:3, function(i) names(Ad_Common_name[Ad_Common_name == i])),
                  stringsAsFactors = FALSE)
Obviously you need a loop structure over your common-name lookup table, and another loop that splits this compound field on commas, before doing a simple regex. There's no sane single regex that will do it all.
In future, avoid packed/compound structures that require packing and unpacking. It looks fine for human consumption, but semantically, and for program consumption, you have multiple data values packed into a single field; i.e., it's not a "common name", it's "common names" delimited by commas, that you have there.
Sorry if I haven't provided an R-specific answer; I'm a technology veteran and use many languages/technologies depending on the problem and available resources. You will need to iterate over every record of your Latin-name lookup table, and within that, iterate over the comma-delimited packed field of common names, so that you're working with one common name at a time. With that single common name you search/replace, using regex or whatever means is available to you, over the whole input file. You need to start from that end, i.e. the lookup table. This kind of procedural logic is not part of the capability (or desired functionality) of regex itself.
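To make that loop structure concrete, here is a sketch in Python (chosen only because it is compact; the lookup rows and ad titles are copied from the question, and the split pattern on "E: ", "G: " and commas mirrors the R answer above):

```python
import re

# Lookup table from the question: Latin name -> packed common-name field.
lookup = {
    "Python regius": "E: Ball Python, Royal Python G: Königspython",
    "Python brongersmai": "E: Red Blood Python, Malaysian Blood Python",
    "Pogona barbata": "E: Eastern Bearded Dragon, Bearded Dragon",
}

ads = [
    "1 year old ball python",
    "Young red Blood python. - For Sale",
    "1 Year Old Male Bearded Dragon - For Sale",
]

def latin_for(ad, lookup):
    # Outer loop: every record of the lookup table.
    for latin, packed in lookup.items():
        # Inner loop: unpack the compound field into single common names.
        for name in re.split(r"E: |G: |, ", packed):
            name = name.strip()
            # One common name at a time: a simple case-insensitive search.
            if name and re.search(re.escape(name), ad, re.IGNORECASE):
                return latin
    return None

latin_names = [latin_for(ad, lookup) for ad in ads]
print(latin_names)
```

The same two nested loops translate directly to R with strsplit() and grepl(ignore.case = TRUE).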