Find partial matches between two data frames - SQL

I have two data frames.
head(NEexp)
       Gene   Transcript Ratio_log2      FDR
  HLHmgamma HLHmgamma-RA   3.759200 1.09e-10
        Brd       Brd-RA   3.527000 2.66e-08
     CG4080    CG4080-RE   3.378500 2.95e-50
    RpII215   RpII215-RA   3.343967 1.82e-10
head(excel$gene)
Enhancer of split mgamma, helix-loop-helix
distal antenna
CG4080 gene product from transcript CG4080-RB
As you can see, the two gene columns match partially (HLHmgamma matches "Enhancer of split mgamma, helix-loop-helix"; CG4080 matches "CG4080 gene product from transcript CG4080-RB"). Is there any way I can link these two?
Code I have tried so far:
genename <- as.character(NEexp$Gene)
query <- paste("SELECT * FROM excel WHERE excel.gene LIKE '%", genename, "%'", sep = "")
Newtable <- dbGetQuery(con, query)
dbGetQuery(con, "SELECT * FROM excel, NEexp WHERE excel.gene LIKE '%' || NEexp.Gene || '%'")

You need merge, which is basically the same as a join in SQL. But first you might want to split excel$gene to extract the part you want to match on.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html
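As an illustration of that idea, here is a minimal sketch in Python/pandas using the sample values from the question (the substring-matching rule, including stripping the HLH prefix, is an assumption made for this sample, not part of the original answer):

```python
import pandas as pd

# Short gene names from NEexp (sample values from the question).
ne_exp = pd.DataFrame({"Gene": ["HLHmgamma", "CG4080", "RpII215"]})

# Free-text descriptions from excel$gene (sample values from the question).
excel = pd.DataFrame({"gene": [
    "Enhancer of split mgamma, helix-loop-helix",
    "distal antenna",
    "CG4080 gene product from transcript CG4080-RB",
]})

def find_match(gene, descriptions):
    """Return the first description containing the gene name.

    Assumed rule: match on the full name, or on the name with the
    HLH prefix stripped (so HLHmgamma can match "... mgamma ...").
    """
    for desc in descriptions:
        if gene in desc or gene.replace("HLH", "") in desc:
            return desc
    return None

ne_exp["matched"] = ne_exp["Gene"].apply(find_match, descriptions=excel["gene"])
print(ne_exp)
```

RpII215 has no counterpart in the sample descriptions, so it stays unmatched; a real matching rule would depend on how the long descriptions are actually built.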

Related

A different merge

So I have two tables, and these are the samples:
df1:
Element        Range     Family
Ae_aag2/0013F  5-2500    Chuviridae
Ae_aag2/0014F  300-2100  Flaviviridae
df2:
Element  Range      Family
0012F    30-720     Chuviridae
0013F    23-1200    Chuviridae
0013F    1300-2610  Xinmoviridae
And I need to join the tables with the following logic:
Element_df1    Element_df2                      Family_df1  Family_df2
Ae_aag2/0013F  "0013F:23-1200,0013F:1300-2610"  Chuviridae  "Chuviridae,Xinmoviridae"
I need the rows the two data frames have in common on the Element column collapsed into one line, keeping the Element and the Family of both the first and the second data frame. If three elements are common between the two data frames, they should all be joined into one single line.
I tried using merge in pandas, but it gives me two lines, not one as I need.
I searched and didn't find how to customize the way the two data frames are merged. I tried using groupby afterwards, but that kind of made it worse :(
Unfortunately I don't have much knowledge of working with pandas. Please be kind, I'm new to the subject.
Use:
df1.drop(columns='Range').merge(
    df2.assign(group=lambda d: d['Element'],
               Element=lambda d: d['Element'] + ':' + d['Range'])
       .groupby('group')[['Element', 'Family']].agg(','.join),
    left_on=df1['Element'].str.extract('/(.*)$', expand=False),
    right_index=True, suffixes=('_df1', '_df2')
)  # .drop(columns='key_0')  # uncomment to remove the key
Output:
key_0 Element_df1 Family_df1 Element_df2 Family_df2
0 0013F Ae_aag2/0013F Chuviridae 0013F:23-1200,0013F:1300-2610 Chuviridae,Xinmoviridae
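The one-liner can also be read as two separate steps, sketched here with the sample data (the intermediate names agg and key are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Element": ["Ae_aag2/0013F", "Ae_aag2/0014F"],
    "Range": ["5-2500", "300-2100"],
    "Family": ["Chuviridae", "Flaviviridae"],
})
df2 = pd.DataFrame({
    "Element": ["0012F", "0013F", "0013F"],
    "Range": ["30-720", "23-1200", "1300-2610"],
    "Family": ["Chuviridae", "Chuviridae", "Xinmoviridae"],
})

# Step 1: collapse df2 to one row per element key, joining the
# "Element:Range" strings and the Family values with commas.
agg = (df2.assign(Element=df2["Element"] + ":" + df2["Range"])
          .groupby(df2["Element"])[["Element", "Family"]]
          .agg(",".join))

# Step 2: merge on the key extracted from the end of df1's Element.
key = df1["Element"].str.extract(r"/(.*)$", expand=False)
out = (df1.drop(columns="Range")
          .merge(agg, left_on=key, right_index=True,
                 suffixes=("_df1", "_df2")))
print(out)
```

The default inner merge drops rows whose key appears in only one frame (0012F and Ae_aag2/0014F here), so only the shared element 0013F survives.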

Compare two comma separated columns

I want to compare two columns, actual_data and pipeline_data, based on the source column, because every source has a different format.
I am trying to compute the result column from a comparison between actual_data and pipeline_data.
I am new to pandas and looking for a way to implement this.
import numpy as np

df['result'] = np.where(
    df['pipeline_data'].str.len() == df['actual_data'].str.len(), 'Match',
    np.where(df['pipeline_data'].str.len() > df['actual_data'].str.len(),
             'Length greater than actual_data',
             'Length shorter than actual_data'))
The code above should do what you want. Note that it compares the lengths of the two strings, not their contents.
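For example, applied to a small made-up frame (the question does not show its data, so these rows are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the question does not show its actual data.
df = pd.DataFrame({
    "actual_data":   ["a,b,c", "a,b", "a,b,c,d"],
    "pipeline_data": ["x,y,z", "x,y,z", "x"],
})

# Classify each row by comparing the string lengths of the two columns.
df["result"] = np.where(
    df["pipeline_data"].str.len() == df["actual_data"].str.len(), "Match",
    np.where(df["pipeline_data"].str.len() > df["actual_data"].str.len(),
             "Length greater than actual_data",
             "Length shorter than actual_data"))
print(df["result"].tolist())
```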

How to split a column containing two records separately

I have millions of observations across different columns, and one of the columns contains records of two factors together, for instance 136789. I want to split the first character (1) and the rest (36789) into separate columns for all observations.
The field looks like this
#136789
I want to see like this
#1 36789
You can make use of the sub() function.
For example:
kent$ awk 'BEGIN{x="123456";sub(/^./,"& ",x);print x}'
1 23456
In your code, you need to apply sub() to the relevant column ($x).
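If the data lives in a pandas column instead, the same split can be sketched like this (the frame and the column name record are hypothetical; the question does not name them):

```python
import pandas as pd

# Hypothetical frame and column name; the question does not name them.
df = pd.DataFrame({"record": ["136789", "245678"]})

# Split the first character and the remainder into separate columns.
df["first"] = df["record"].str[0]
df["rest"] = df["record"].str[1:]
print(df)
```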

recoding multiple variables in the same way

I am looking for the shortest way to recode many variables in the same way.
For example I have data frame where columns a,b,c are names of items of survey and rows are observations.
d <- data.frame(a=c(1,2,3), b=c(1,3,2), c=c(1,2,1))
I want to change the values of all observations for selected columns. For instance, value 1 in columns "a" and "c" should be replaced with the string "low", and values 2 and 3 in those columns should be replaced with "high".
I do this often with many columns, so I am looking for a function that can do it in a very simple way, like this:
recode2(data=d, columns=c(a, c), "1=low, 2,3=high")
Almost OK is the recode function from the car package, but if I have 10 columns to recode I have to rewrite the call 10 times, which is not as efficient as I want.
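The question asks about R, but the shape of such a helper can be sketched in Python/pandas (recode2 here is an illustrative function written for this sketch, not an existing API):

```python
import pandas as pd

def recode2(data, columns, mapping):
    """Replace values in the given columns according to mapping."""
    out = data.copy()
    out[columns] = out[columns].replace(mapping)
    return out

# The survey data frame from the question.
d = pd.DataFrame({"a": [1, 2, 3], "b": [1, 3, 2], "c": [1, 2, 1]})
res = recode2(d, ["a", "c"], {1: "low", 2: "high", 3: "high"})
print(res)
```

Columns "a" and "c" are recoded to "low"/"high" in one call; column "b" is left untouched.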

AWK: Ignore lines grouped by an unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to first group the lines by a unique value in the first field and count the occurrences of a specific value in another field within each group of lines. If the number of occurrences doesn't meet a self-defined threshold, the lines in that group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that because the groups whose first field is 222 or 444 don't have at least one P (the threshold can be any desired number) in the third field, the lines for 222 and 444 are ignored.
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. By doing this, a few lines will not end up in the resulting split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field, together with a per-group count c which we use later. If the third field contains a P, we set a key in the p array.
After processing the entire file, we loop over all the values of the first field. If a key has been set in p for a value, we print that group's lines from a.
You mention a threshold number of entries in your question. If by that you mean that there must be N occurrences of "P" for a group's lines to be printed, change {p[$1]} to {++p[$1]}, and in the END block change if(i in p) to if(p[i]>=N).
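The same grouping-and-threshold logic can be sketched in Python (input rows taken from the question; N is the self-defined threshold):

```python
from collections import defaultdict

# Input lines from the question.
rows = """111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0""".splitlines()

N = 1  # threshold: a group needs at least N lines with "P" in field 3

groups = defaultdict(list)   # lines per first-field value, in input order
p_counts = defaultdict(int)  # occurrences of "P" per group
for line in rows:
    fields = line.split(",")
    groups[fields[0]].append(line)
    if fields[2] == "P":
        p_counts[fields[0]] += 1

# Keep only the groups that meet the threshold, preserving line order.
kept = [line for key, lines in groups.items()
        if p_counts[key] >= N for line in lines]
print("\n".join(kept))
```

With N = 1 this reproduces the desired output: the 222 and 444 groups are dropped because they contain no P lines.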