How to combine FASTA files?

My question is about concatenating two gene sequences into a combined file. Here are four example file names:
IMB211_trasncripts.renamed-20175.fa
IMB211_trasncripts.renamed-20176.fa
R500_trasncripts.renamed-20175.fa
R500_trasncripts.renamed-20176.fa
Basically I want to concatenate IMB211_trasncripts.renamed-20175.fa with R500_trasncripts.renamed-20175.fa, and IMB211_trasncripts.renamed-20176.fa with R500_trasncripts.renamed-20176.fa. I know this is simple using the cat command:
cat IMB211_trasncripts.renamed-20175.fa R500_trasncripts.renamed-20175.fa > test_comb-20175.fa
However, I have around 40 thousand genes like these (one file each for IMB211 and R500), so I am wondering: is there an easy way to concatenate these files into one file per gene ID?
Thanks
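A minimal sketch of one shell loop that would do this, assuming the gene ID is the numeric suffix just before ".fa" and that every IMB211 file has a matching R500 file (the combined-<id>.fa output name is only illustrative):
for f in IMB211_trasncripts.renamed-*.fa; do
    id=${f##*-}        # strip everything up to the last "-", e.g. "20175.fa"
    id=${id%.fa}       # strip the ".fa" extension, e.g. "20175"
    cat "$f" "R500_trasncripts.renamed-${id}.fa" > "combined-${id}.fa"
done
Each iteration pulls the gene ID off the IMB211 file name and pairs it with the R500 file carrying the same ID.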

Related

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which have a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command . . .
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
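One hedged way around the quoting problem is to drop xargs entirely and loop over the strings in the shell, grepping for each one as a fixed string. Printing the search term next to each matching file also keeps the results attributable to their string; the tab-separated output and the assumption that the first column never contains a comma are mine, not from the question:
while IFS=, read -r term _; do
    grep -Frli -- "$term" ~/parentDirectory/*/source/ |
    while IFS= read -r file; do
        printf '%s\t%s\n' "$term" "$file"    # term <TAB> matching file
    done
done < <(tail -n +2 search.csv)              # skip the header row
Because each term is passed to grep as a single argument, embedded single quotes no longer break anything.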

Huge file with 55000 rows * 1800 columns - need to delete only specific columns with a partial pattern

I have a huge file (cancer gene expression data, a ~2 GB .csv file) with 55000 rows and ~1800 columns, so my table looks like this:
TCGA-4N-A93T-01A-11R-A37K-07, **TCGA-5M-AAT4-11A-11R-A41B-07**, TCGA-5M-AATE-01A-11R-A41B-07, TCGA-A6-2677-01B-02R-A277-07, **TCGA-A6-2677-11A-01R-0821-07**
For example, in column TCGA-5M-AAT4-11A-11R-A41B-07 the fourth position is -11A. My problem is that I have to delete the entire columns which have -11A at the 4th position (xx-xx-xx-11A-xx-xx-xx). This has to search all 1800 columns and keep only those columns which do not have -11A at the fourth position.
Can you please help me with what command I should use to get the required data?
I am a biologist and have limited experience in coding.
EDITED:
I have a data file collected from 1800 breast cancer patients; the table has 55000 gene names as rows and 1800 samples as columns (a 55000 * 1800 matrix file). A few samples designed by our lab were faulty and we have to remove them from our analysis. Now, I have identified those samples and I want to remove them from my file1.csv. xx-xx-xx-11A-xx-xx-xx are the faulty samples; I need to identify only those samples, i.e. the ones which show 11A in the fourth place of the column name, and remove them from the file. I can do this in R but it takes too long to process. Thanks in advance, sorry for the trouble.
Try this:
#! /usr/local/bin/gawk -f
# blacklist_columns.awk
# https://stackoverflow.com/questions/49578756
# e.g. TCGA-5M-AAT4-11A-11R-A41B-07
BEGIN {
    PATTERN = "TCGA-..-....-11A-...-....-.."
}
$0 ~ PATTERN {                       # rows containing the pattern
    for (col = 1; col <= NF; col++)
        # find column(s) in the row with the pattern
        if ($col ~ PATTERN) {
            blacklist[col]++         # note which column
        }
}
END {                                # output the list collected
    n = asorti(blacklist)
    for (i = 1; i <= n; i++)
        bl = bl "," blacklist[i]
    print substr(bl, 2)
}
# Usage, e.g.:
# BLACKLIST=$(./blacklist_columns.awk table.tab)
#
# cut --complement -f "$BLACKLIST" table.tab > table_purged.tab
You can't do it in one pass, so you might as well let an existing tool do the second pass, especially since you are more on the wet-lab side. The script should spit out a list of columns it thinks you should skip; you can feed that list as an argument to the program cut and get it to keep only the columns not mentioned.
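Since the file in the question is comma separated rather than tab separated, a hedged adaptation of that two-step usage might look like the following (file1.csv is just a stand-in name, and it assumes no field contains an embedded comma):
BLACKLIST=$(gawk -F', *' -f blacklist_columns.awk file1.csv)   # column numbers to drop, e.g. "2,5"
cut --complement -d, -f "$BLACKLIST" file1.csv > file1_purged.csv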
Edit(orial):
Thank you for your sentiment, Wojciech Kaczmarek; I could not agree more.
There is also a flip side where some biologists discount "coders", which I find annoying. The paper being worked on here may credit some water-cooler collaborator but fail to mention the technical help on a show stopper (hey, they fixed it, so it must not have been a big deal).
Not sure what you're really asking for; this script will delete, row by row, the fields which have "11A" in the 4th position (based on the - delimiter).
$ awk -F', *' -v OFS=', ' '{for(i=1;i<=NF;i++)
{split($i,a,"-");
if(a[4]=="11A") $i=""}}1' input > output
If you're asking to remove the entire column for all rows, not just the matching row, this is not it; see the sketch below for that. Also not tested, but perhaps it will give you ideas...
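For removing whole columns, a hedged sketch (not from this thread): since the faulty -11A names appear in the header row, you can decide which columns to keep from the header alone and then print only those columns for every row. The input.csv / output.csv names are placeholders.
awk -F', *' -v OFS=', ' '
NR==1 {
    for (i = 1; i <= NF; i++) {
        split($i, a, "-")
        if (a[4] != "11A") keep[++n] = i   # columns to keep, chosen from the header names
    }
}
{
    out = ""
    for (j = 1; j <= n; j++)
        out = (j == 1 ? "" : out OFS) $(keep[j])
    print out
}' input.csv > output.csv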

awk: How can I use awk to determine if lines in one file of my choosing (lines 8-12, for example) are also present anywhere in another file

I have two files, baseline.txt and results.txt. I need to be able to find whether lines in baseline.txt are also in results.txt, for example, whether lines 8-12 appear in results.txt. I need to use awk. Thanks.
Assuming the files are sorted, it looks like comm is more of what you're looking for if you want lines that are present in both files:
comm -12 baseline.txt results.txt
The -12 argument suppresses lines that are unique to baseline.txt and results.txt, respectively, leaving you with only lines that are common to both files ("suppress lines unique to file 1, suppress lines unique to file 2").
If you are dead set on using awk, then perhaps this question can help you.
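If awk really is required, the usual idiom is a two-file pass: load the chosen lines of the first file into an array, then print every line of the second file that was seen. A hedged sketch for the lines 8-12 example:
awk 'NR==FNR { if (FNR >= 8 && FNR <= 12) seen[$0]; next }
     $0 in seen' baseline.txt results.txt
Unlike comm, this compares exact whole lines and does not need the files to be sorted.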

How to find the same words in two different text files and print those lines using bash?

I have two text files. One contains just a single column of words, hundreds of them, one word per line. The second one contains several columns per row.
I need to find the words from the first text file that also appear in the second text file, and print the entire line from the second file where the word occurs, using awk, grep, or another command-line program. For example:
Text file #1:
car
house
notebook
Text file #2:
32233: FTD laptop
24342: TGD car
2424: jdj notebook
Output:
24342: TGD car
2424: jdj notebook
Try this:
grep -Fwf file1 file2
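If you would rather do it in awk, a hedged equivalent sketch is to load the words from the first file into an array and print any line of the second file that contains one of them as a whole field:
awk 'NR==FNR { words[$1]; next }
     { for (i = 1; i <= NF; i++) if ($i in words) { print; next } }' file1 file2
Note that this matches whole whitespace-delimited fields, which is slightly stricter than grep -w around punctuation.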

Reading blocks of text from a CSV file - vb.net

I need to parse a CSV file with blocks of text being processed in different ways according to certain rules, e.g.
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
customerone,columnone
customertwo,columntwo
singlevalueone
singlevaluetwo
singlevalueone_otherruleapplies
singlevaluethree_otherruleapplies
Each block of text will be grouped, so the first three rows will be parsed using certain rules, and so on. Notice that the last two groups have only a single column, but each group must be handled in a different way.
I have the chance to propose the file format to the customer, so I'm thinking of proposing the following.
[group 1]
userone,columnone,columntwo
userthirteen,columnone,columntwo
usertwenty,columnone,columntwo
[group N]
rowN
A kind of sectioning, like the INI files from some years ago. However, I'd like to hear your comments because I think there must be a better way to handle this.
I proposed using XML, but the customer prefers text files.
Any suggestions are welcome.
m0dest0.
PS: using VB.NET and VS 2008.
You can use regular expression groups, with the regex set to either a single-line mode if each record has the same one-line format, or a multi-line mode if a record is not constrained to a single line. When matching across multiple lines you can include \n in your pattern to cross line boundaries; if the match is on a single line you don't need to include \n (the newline character) in your pattern.
VB.NET, like many other modern programming languages, has extensive support for grouping operations. You can use indexed groups or named groups.
Each name, such as header1 or whatever you want to call it, is written in the pattern in this format: (?<myname>...)
See this link for more info: How do I access named capturing groups in a .NET Regex?.
Good luck.