Print the first column of multiple files using awk - awk

I have 20 files and I want to print the first column of each file into a separate file, so I need 20 output files.
I have tried the following command, but it puts all the output into a single file:
awk '{print $1}' /home/gee/SNP_data/20* > out_file
How can I write the output to a separate file for each of the 20 input files?

1st solution: Could you please try the following.
awk '
FNR==1{
if(file){
close(file)
}
file="out_file_"FILENAME".txt"
}
{
print $1 > (file)
}
' /home/gee/SNP_data/20*
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
FNR==1{ ##checking condition if FNR==1 then do following.
if(file){ ##Checking condition if variable file is NOT NULL then do following.
close(file) ##Using close to close the opened output file in backend, to avoid too many opened files error.
} ##Closing BLOCK for if condition.
file="out_file_"FILENAME".txt" ##Setting variable file value to string out_file_ then FILENAME(which is Input_file) and append .txt to it.
} ##Closing BLOCK for condition for FNR==1 here.
{
print $1 > (file) ##Printing first field to variable file here.
}
' /home/gee/SNP_data/20* ##Mentioning Input_file path here to pass files here.
2nd solution: In case you need output files named like Output_file_1.txt and so on, then try the following. I have created an awk variable named out_file whose value you can change to set a different prefix for the output file names (as per your need).
awk -v out_file="Output_file_" '
FNR==1{
if(file){
close(file)
}
++count
file=out_file count".txt"
}
{
print $1 > (file)
}
' /home/gee/SNP_data/20*

Awk has a built-in redirection operator; you can use it like this:
awk '{ print $1 > ("out_" FILENAME) }' /home/gee/SNP_data/20*
or, even better:
awk 'FNR==1 { close(f); f=("out_" FILENAME) } { print $1 > f }' /home/gee/SNP_data/20*
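The former just illustrates the redirection operator. The latter is the robust way to use it: it closes each output file before opening the next one, so awk does not run out of open file descriptors.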
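One thing to watch for in all of the commands above: FILENAME holds the input path exactly as it was given on the command line. With inputs like /home/gee/SNP_data/20*, the output name starts with out_/home/gee/SNP_data/, which cannot be opened unless a matching directory tree exists. Here is a minimal sketch, assuming you want the outputs in the current directory. The scratch directory and the sample file names 20_x and 20_y are made up for the demo.

```shell
# Demo inputs in a scratch directory (names are illustrative only)
tmp=$(mktemp -d)
mkdir "$tmp/in" "$tmp/out"
printf 'a 1\nb 2\n' > "$tmp/in/20_x"
printf 'c 3\nd 4\n' > "$tmp/in/20_y"

(
  cd "$tmp/out" &&
  awk '
  FNR == 1 {
      if (out != "") close(out)   # close the previous output file first
      name = FILENAME
      sub(/.*\//, "", name)       # strip the directory part of the path
      out = "out_" name
  }
  { print $1 > out }
  ' "$tmp"/in/20_*
)
ls "$tmp/out"
```

This should leave out_20_x and out_20_y in the output directory, each holding the first column of its input.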

Related

Search one file's lines for a partial match in another file

I have 2 files, the first one:
values.txt
test#
test1#
test3#
test4#
test6#
test7#
test8#
test9#
test10#
data.csv
"username","email"
"user","test#gmail.com"
"user1","test1#gmail.com"
"user2","test3#gmail.com"
"user4","test4#gmail.com"
"user456","loka#gmail.com"
"user789","lopa#gmail.com"
"user5","test7#gmail.com"
"user","xpos#gmail.com"
"user5","test9#gmail.com"
"user","xpx#gmail.com"
I want the output to be like this:
"user","test#gmail.com"
"user1","test1#gmail.com"
"user2","test3#gmail.com"
"user4","test4#gmail.com"
"user5","test7#gmail.com"
"user5","test9#gmail.com"
What I was able to do :
$ awk -F, -v q='"' 'NR==FNR{a[q $0 q]; next}
$2 in a' values.txt data.csv > test1.csv
This works only when values.txt holds the full email (e.g. test9#gmail.com) and not when it holds only test9#. In that case the new file test1.csv contains:
"user5","test9#gmail.com"
....
....
I couldn't figure out how to do it with a partial substring in awk.
You may use this awk:
awk -F, 'NR==FNR {a[$1]; next} {ea = $2; gsub(/^"|#.*$/, "", ea)} ea "#" in a' values.txt data.csv
"user","test#gmail.com"
"user1","test1#gmail.com"
"user2","test3#gmail.com"
"user4","test4#gmail.com"
"user5","test7#gmail.com"
"user5","test9#gmail.com"
A more readable version:
awk -F, 'NR == FNR {
a[$1] # from values.txt store each value in array a
next
}
{
ea = $2 # copy $2 into ea (email address)
gsub(/^"|#.*$/, "", ea) # strip starting " and text after #
}
ea "#" in a # check if ea + "#" exists in array a
' values.txt data.csv
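The version above assumes the values in values.txt have no trailing whitespace; a stray space after test# would make the lookup fail silently. Here is a small sketch that trims trailing blanks while loading the keys, using a cut-down copy of the sample data in a scratch directory (the names keys and addr are mine):

```shell
tmp=$(mktemp -d)
printf 'test# \ntest9#\n' > "$tmp/values.txt"   # first value has a trailing space
printf '%s\n' \
  '"username","email"' \
  '"user","test#gmail.com"' \
  '"user456","loka#gmail.com"' \
  '"user5","test9#gmail.com"' > "$tmp/data.csv"

result=$(awk -F, '
NR == FNR {
    sub(/[ \t]+$/, "")         # trim trailing blanks before storing the key
    keys[$0]
    next
}
{
    addr = $2
    gsub(/^"|#.*$/, "", addr)  # keep only the part before "#"
}
(addr "#") in keys             # print lines whose prefix is a known key
' "$tmp/values.txt" "$tmp/data.csv")
printf '%s\n' "$result"
```

With this sample it should print the "user" and "user5" rows only.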
Could you please try the following, written and tested with the shown samples in GNU awk. It looks like a few of your lines have trailing spaces; in case you want to remove them before matching the contents of both files, I have added gsub(/ +$/,"") to my solution.
awk '
{ gsub(/ +$/,"") }
FNR==NR{
arr[$0]
next
}
{
for(key in arr){
if(index($2,key)){
print
next
}
}
}' values.txt FS="," data.csv
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{ gsub(/ +$/,"") } ##Using gsub to remove spaces at last of lines.
FNR==NR{ ##Checking condition which will be TRUE when values.txt is being read.
arr[$0] ##Creating arr here with index of current line value.
next ##next will skip all further statements from here.
}
{
for(key in arr){ ##Going through arr elements from here.
if(index($2,key)){ ##Checking condition if key is present by index in 2nd field.
print ##Printing the current line.
next ##next will skip all further statements from here.
}
}
}' values.txt FS="," data.csv ##Mentioning Input_file names here.

Removing lines which match with specific pattern from another file

I've got two files (I only show the beginning of these files) :
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get an output like that :
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I want to create a new file containing the lines of myfile.txt that match the lines in patterns.txt, and I need to keep the ACTG letters associated with each matched pattern. I use:
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
It works, but it is very slow to create the new file. The files I work on are fairly large, though not huge (14M for patterns.txt and 700M for myfile.txt).
I also tried grep -v, because I have another file which contains the other patterns of myfile.txt that are not present in patterns.txt, but filling the output file is just as slow.
Is there a faster solution?
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when patterns.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found=0 ##Unsetting found here.
match($0,/.*\//) ##Using match to match a regex from start of line till the last / in current line.
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##Checking condition if sub string of matched regex is present in arr then do following.
print ##Printing current line here.
found=1 ##Setting found to 1 here.
}
next ##next will skip all further statements from here.
}
found ##Printing the line if found is set.
' patterns.txt myfile.txt ##Mentioning Input_file names here.
Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part found in hash
print # output and store found result in var tf
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns.txt myfile.txt
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns.txt myfile.txt
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
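The getline-based scripts rely on every header being followed by exactly one sequence line. If your sequences can wrap over several lines, a flag-based variant is safer. Below is a rough sketch, assuming every header ends with one extra /field such as /ccs. The filter function, its keep argument, and the shortened sample IDs are hypothetical, not from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'm1/13' 'm1/26' > "$tmp/patterns.txt"
printf '%s\n' '>m1/13/ccs' 'ACAG' 'TTTT' '>m1/16/ccs' 'CCCC' \
              '>m1/26/ccs' 'TAGA' > "$tmp/myfile.txt"

# filter 1 keeps the listed records, filter 0 drops them (hypothetical helper)
filter() {
    awk -v keep="$1" '
    NR == FNR { ids[$0]; next }        # first file: remember every pattern
    /^>/ {
        id = substr($0, 2)             # strip the leading ">"
        sub("/[^/]*$", "", id)         # strip the trailing "/ccs"
        show = ((id in ids) == (keep + 0))
    }
    show                               # header plus all of its sequence lines
    ' "$tmp/patterns.txt" "$tmp/myfile.txt"
}

kept=$(filter 1)
dropped=$(filter 0)
printf '%s\n' "$kept"
```

Both files are read once, so the run time grows with the file sizes rather than with the number of patterns times the file size, which is where the grep loop loses time.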

Awk adding a pipe at the end of the first line

I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I run this awk command I get the following result:
awk -v a="$(echo "${customerIdList[@]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me ?
Could you please try the following, written and tested with the shown samples.
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR == 1 { $0=$0"|customer_id" } NR > 1 { $0=$0"|"ListId[NR-1] } 1' file
awk cannot read a shell array directly, so pass the array as a space-separated string and use awk's split function to build the array ListId. The header (NR == 1) gets the new column name appended. On every other line (NR > 1), the line is set to itself plus a pipe and the ListId element at index NR-1.
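As for why the original command printed a trailing pipe: on the header line the script first appended customer_id as a new field, and then print $0,b[NR-1] ran for that line as well. b[0] is empty, so print emitted OFS followed by nothing. Here is a minimal sketch of the original approach with that fixed, using a plain string ids as a stand-in for "${customerIdList[*]}" and a shortened CSV (both are my assumptions for the demo):

```shell
tmp=$(mktemp -d)
printf '%s\n' 'email|id' 'abcd#google.fr|12' 'dcab#google.fr|13' > "$tmp/test.csv"
ids="20200 20201"    # stand-in for "${customerIdList[*]}"

result=$(awk -v a="$ids" '
BEGIN    { FS = OFS = "|"; split(a, b, " ") }   # split once, not on every line
FNR == 1 { print $0, "customer_id"; next }      # header gets only the column name
         { print $0, b[FNR-1] }                 # data rows get the matching id
' "$tmp/test.csv")
printf '%s\n' "$result"
```

The next on the header line is what keeps the second print from running there.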

Filter out FASTA files by specified sequence length in bash

There's a FASTA file assembly.fasta containing contig names and corresponding sequences:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
I want to get only contigs longer than 30 letters and get a new FASTA file assembly.filtered.fasta containing only those long sequences with contig names, in this format:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
Using gnu-awk, you may use this simpler version:
awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt $0} {prt = RT}' file
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
A very quick way to achieve what you are after is:
awk -v n=30 '/^>/{ if(l>n) print b; b=$0;l=0;next }
{l+=length;b=b ORS $0}END{if(l>n) print b }' file
You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '(length($seq)>30){print ">" $name ORS $seq}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
Could you please try the following, tested and written with the shown samples.
awk '
/^>/{
if(sign_val && strLen>=30){
print sign_val ORS line
}
strLen=line=""
sign_val=$0
next
}
{
strLen+=length($0)
line=(line?line ORS:"")$0
}
END{
if(sign_val && strLen>=30){
print sign_val ORS line
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^>/{ ##Checking condition if line starts from > then do following.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and strLen is at least 30 then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
strLen=line="" ##Nullify variables strLen and line here.
sign_val=$0 ##Setting sign_val to current line here.
next ##next will skip all further statements from here.
}
{
strLen+=length($0) ##Checking length of line and keep adding it here.
line=(line?line ORS:"")$0 ##Creating line variable and keep appending it to it with new line.
}
END{ ##Starting END block of this program from here.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and strLen is at least 30 then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
}
' Input_file ##mentioning Input_file name here.

Conditional transposition in awk based on column values

I'm trying to make the below transformation using awk.
Input:
status,parent,child,date
first,foo,bar,2019-01-01
NULL,foo,bar,2019-01-02
NULL,foo,bar,2019-01-03
last,foo,bar,2019-01-04
NULL,foo,bar,2019-01-05
blah,foo,bar,2019-01-06
NULL,foo,bar,2019-01-07
first,bif,baz,2019-01-02
NULL,bif,baz,2019-01-03
last,bif,baz,2019-01-04
Expected output:
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
I'm pretty stumped by this problem, and haven't got anything to show yet - any pointers would be very helpful.
Could you please try the following.
awk '
BEGIN{
FS=OFS=SUBSEP=","
print "parent,child,first,last"
}
$1=="first" || $1=="last"{
a[$1,$2,$3]=$NF
b[$2,$3]
}
END{
for(i in b){
print i,a["first",i],a["last",i]
}
}
' Input_file
Output will be as follows.
parent,child,first,last
bif,baz,2019-01-02,2019-01-04
foo,bar,2019-01-01,2019-01-04
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=SUBSEP="," ##Setting FS, OFS and SUBSEP as comma here.
print "parent,child,first,last" ##Printing header values as per OP request here.
} ##Closing BEGIN BLOCK for this program here.
$1=="first" || $1=="last"{ ##Checking condition if $1 is either string first or last then do following.
a[$1,$2,$3]=$NF ##Creating an array named a whose index is $1,$2,$3 and its value is $NF(last column of current line).
b[$2,$3] ##Creating an array named b whose index is $2,$3 from current line.
} ##Closing main BLOCK for main program here.
END{ ##Starting END BLOCK for this awk program.
for(i in b){ ##Starting a for loop to traverse through array here.
print i,a["first",i],a["last",i] ##Printing variable i, then value of array a with index of "first",i and value of array a with index of "last",i.
} ##Closing BLOCK for, for loop here.
} ##Closing BLOCK for END block for this awk program here.
' Input_file ##Mentioning Input_file name here.
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $2 OFS $3 }
FNR==1 { print key, "first", "last" }
$1=="first" { first[key] = $4 }
$1=="last" { print key, first[key], $4 }
$ awk -f tst.awk file
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
If you can have a first without a last or vice-versa or they can occur out of order then include those cases in the example in your question.
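For instance, if a group may lack a first or a last row, or the rows may arrive in any order, the values can be collected first and printed at the end. Here is a sketch under those assumptions (the arrays seen, order, and val are my own names, the sample input is made up, and missing values are left empty):

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  'status,parent,child,date' \
  'last,foo,bar,2019-01-04' \
  'first,foo,bar,2019-01-01' \
  'first,bif,baz,2019-01-02' > "$tmp/input.csv"

result=$(awk '
BEGIN { FS = OFS = "," }
FNR == 1 { print $2, $3, "first", "last"; next }
$1 == "first" || $1 == "last" {
    key = $2 OFS $3
    if (!(key in seen)) { seen[key]; order[++n] = key }   # remember first-seen order
    val[$1, key] = $4
}
END {
    for (i = 1; i <= n; i++)
        print order[i], val["first", order[i]], val["last", order[i]]
}
' "$tmp/input.csv")
printf '%s\n' "$result"
```

Here foo,bar comes out complete even though its last row precedes its first row, and bif,baz is printed with an empty last column.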
Not awk, you already have that, but here's an option in bash alone, just for kicks.
#!/usr/bin/env bash
declare -A first=()
printf 'parent,child,first,last\n'
while IFS=, read -r pos a b date; do
case "$pos" in
first) first[$a,$b]=$date ;;
last) printf "%s,%s,%s,%s\n" "$a" "$b" "${first[$a,$b]}" "$date" ;;
esac
done < input.csv
Requires bash 4+ for the associative array.