Given a folder with multiple .csv files, I want to return the Nth line from each file and write to a new file.
For a single file, I use
awk 'NR==5' file.csv
For multiple files, I figured
ls *.csv | xargs awk 'NR==5'
...however that only returns the 5th line from the first file in the list.
Thanks!
Could you please try the following and let me know if it helps (GNU awk is needed for nextfile, I believe):
awk 'FNR==5{print;nextfile}' *.csv
If you need all the output in a single file, append > output_file to the end of the above command.
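For example, combining the command above with the redirect (output_file is just a placeholder name):
awk 'FNR==5{print;nextfile}' *.csv > output_file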
Explanation:
FNR==5: checks whether the line number within the current Input_file is 5; if so, the actions after it run. Unlike NR, FNR resets to 1 at the start of each file, which is why the original NR==5 attempt only fired for the first file.
{print;}: print is awk's built-in statement that prints the current line, so here it prints only the 5th line.
nextfile: as the name makes clear, nextfile skips all the remaining lines in the current Input_file (since *.csv at the end passes all the csv files to awk one by one) and moves on to the next. This saves time, since we DO NOT want to read the entire Input_file; we needed only the 5th line and we already have it.
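If your awk is not GNU awk and lacks nextfile, a portable sketch simply drops it; every line of every file is then read, which is slower but produces the same output:
awk 'FNR==5' *.csv > output_file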
I am running Ubuntu Linux. I need to print the filenames and line numbers of all lines containing more than 7 columns. There are several hundred thousand files.
I am able to print the number of columns per file using awk. However, the output I am after is something like
file1.csv-463, which is to say that file1.csv has more than 7 fields on line 463. I am currently using awk -F"," '{print NF}' * to print the number of fields across all files.
Please could I request help?
If you have GNU awk, try the following code. It simply checks whether NF is greater than 7; if so, it prints that file's name along with the line number, and nextfile then moves on to the next Input_file, which saves time because the rest of the file need not be read.
awk -F',' 'NF>7{print FILENAME,FNR;nextfile}' *.csv
The above prints only the very first match in each file; to get all matching lines, drop nextfile:
awk -F',' 'NF>7{print FILENAME,FNR}' *.csv
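If you want the exact file1.csv-463 format shown in the question, a small tweak (a sketch) concatenates the two with a hyphen:
awk -F',' 'NF>7{print FILENAME "-" FNR}' *.csv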
This might work for you (GNU sed):
sed -Ens 's/[^,]+/&/8;T;F;=;p' *.csv | paste - - -
The substitution matches the eighth non-empty field and replaces it with itself; if there is no eighth field, T branches to the end of the script and that line is skipped.
Otherwise, F outputs the file name, = outputs the line number, and p prints the current line.
The output is fed into a paste command which joins each group of three lines into one.
N.B. The -s option resets the line numbers for each file; without it, lines are numbered across the entire input.
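For a hypothetical file1.csv whose line 463 has more than seven fields, each match becomes one tab-separated output line of the form:
file1.csv	463	<contents of line 463>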
I have two files that I am working with. The first is a master database file that I have to search through. The second is a file I create, listing the items from the master database that I would like to pull out. I have managed to write an AWK solution that searches the master database and extracts the exact line that matches an entry in the second file. However, I cannot figure out how to also copy the lines that follow each match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 are the ones I am matching against. The last number in such a row tells me how many lines there are to copy. So for the first line, 40005X/50005/60005/3/10/9/, I would be matching on the 40005X, and the 9 in that line tells me that there are 9 lines underneath it that I need to copy along with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, but I am open to any suggestion. I am working on a UNIX system, and I would like to keep this as a KSH script, since most of the other scripts I use alongside it are written that way and it is what I am most familiar with.
Thank you for your help!!
Your existing awk correctly matches the rows from the ids file; you now need a condition that keeps printing for N lines after each matching row, where N is read from the matching row's last numeric field. So we set a variable p to the number of lines to print plus one (for the current line itself) and decrement it as each row is printed.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with the last condition written more "awkishly" (per Ed Morton), which also covers the extreme case of a huge file, since p stops decrementing at zero instead of going ever more negative:
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
Here the print action is omitted, as it is the default action, and the condition p&&p-- remains true, decrementing p on each line, for as long as p is positive.
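For the KSH wrapper from the question, the one-liner drops straight in (a sketch reusing the question's own variable names):
#!/bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' "$file1" "$file2" > "$file3"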
another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
This handles the case where a content line happens to match one of the given ids: thanks to the !n guard, the script only looks for another id after the specified number of lines has been printed.
Could you please try the following, written and tested with the shown samples in GNU awk. It treats any line starting with digits followed by X as a header line. Here Input_file2 is the file containing only the ids and Input_file1 is the master file from the question.
awk '
{
  sub(/ +$/,"")                # strip trailing spaces from every line
}
FNR==NR{                       # first file (Input_file2): remember each id
  a[$0]
  next
}
/^[0-9]+X/{                    # header line, e.g. 40005X/50005/...
  match($0,/[0-9]+\/$/)        # grab the trailing count field, e.g. "9/"
  no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
  found=count=""               # reset state for the new block
}
{
  if(count==no_of_lines_to_print){ count=found="" }
  for(i in a){                 # does this line match a wanted id?
    if(match($0,i)){
      found=1                  # mark the block for printing
      print                    # print the matching header itself
      next
    }
  }
}
found{                         # we are inside a wanted block
  ++count
}
count<=no_of_lines_to_print && count!=""   # print while within the count
' Input_file2 Input_file1
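With the shown samples saved as Input_file1 (the master file) and Input_file2 (the ids), this reproduces the expected output from the question: the 40005X, 40007X and 40008X blocks.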
Is it possible to parse a .csv file and look for a particular value in the 13th field?
So, for example, the data would be
10,1,a,bhd,5,7,10,,,8,9,3,19,0
I only want to extract lines which have a value of 3 in the 13th field if that makes sense.
I tried it with a bash while loop using cut etc., but it was messy.
Not sure if there is an awk/sed method.
Thanks in advance.
This is beginner level awk.
awk -F, '$13==3' file
-F, sets the field separator to a comma; $13 is the 13th field's value. For each line, if $13==3 evaluates to true, the line is printed (printing is awk's default action).
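As a quick sanity check of what sits in field 13, here is the sample line from the question run through the same field logic (note its 13th field is actually 19, so that particular line would not be printed):
$ printf '10,1,a,bhd,5,7,10,,,8,9,3,19,0\n' | awk -F, '{print $13}'
19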
I have a .txt file with two fields per line and a separator; some lines contain only one field though, so I want to remove those.
Example lines are
Line to keep,
Iamnotyours:email#email.com
Line to remove,
Iamnotyours:
Given your posted sample input all you need is:
grep -v ':$' file
or if you insist on awk for some reason:
awk '!/:$/' file
If that's not all you need then edit your question to clarify your requirements.
awk to the rescue!
$ awk -F: 'NF==2' file
prints only the lines with two fields
$ awk -F: 'NF>1' file
prints lines with more than one field. In your case the separator is always in place, so the field count will be two either way; you need to check whether the second field is empty:
$ awk -F: '$2!=""' file
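Or, combining both checks into a single guard (a sketch along the same lines):
$ awk -F: 'NF==2 && $2!=""' file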
I am trying to work with an AWK script that was posted earlier on this forum. I am trying to split a large FASTA file containing multiple DNA sequences, into separate FASTA files. I need to separate each sequence into its own FASTA file, and the name of each of the new FASTA files needs to be the name of the DNA sequence from the original, large multifasta file (all the characters after the >).
I tried this script that I found here at stackoverflow:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
It works well, but the DNA sequence begins directly after the name of the file, with no space. The DNA sequence needs to begin on a new line (regular FASTA format).
I would appreciate any help to solve this.
Thank you!!
Do you mean something like this?
awk '/^>chr/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
where the new file that is created for each "chromosome/sequence/thing" gets a blank line at the start?
I think this should work.
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Hope this perl script helps.
#!/usr/bin/perl
# split a multi-fasta file into one file per sequence
open (INFILE, "< your_input.fa")
    or die "Can't open file";

while (<INFILE>) {
    $line = $_;
    chomp $line;
    if ($line =~ /^>/) {    # fasta header line
        close OUTFILE;      # close the previous sequence's file (no-op the first time)
        $new_file = substr($line, 1);
        $new_file .= ".fa";
        open (OUTFILE, ">$new_file")
            or die "Can't open: $new_file $!";
    }
    print OUTFILE "$line\n";
}
close OUTFILE;
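Saved as, say, split_fasta.pl (a hypothetical name; note that the input file name your_input.fa is hard-coded in the open call), it runs with:
perl split_fasta.pl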
The .fa (or .fasta) format looks like:
>ID1
SEQUENCE
>ID2
SEQUENCE
When splitting a fasta file, it is actually not desirable to insert a newline character at the top of each output file, so Pramod's answer is more appropriate. Additionally, the ID can be matched more generally, on just the > character. The complete line would then be:
awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
If you don't want to clutter your current directory with all the split files, you can also write the output into a subdirectory (subdir):
awk '/^>/ {OUT="subdir/" substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
awk to split multi-sequence fasta file into separate sequence files
This problem is best approached by considering each sequence (complete with its header) a single record, changing awk's default record separator RS (usually a line break) to the > symbol that marks the start of each header (one per record). As we will want to use the header text as a file name, and as fasta headers cannot contain line breaks, it is also convenient to reset awk's default field separator FS (usually whitespace) to a line break.
Both of these are done in an awk BEGIN block:
BEGIN{RS=">";FS="\n"}
Since the file begins with >, the first record is empty and must be ignored to prevent an error caused by trying to write to a file name extracted from an empty record. Thus, the main awk action block is filtered to only process records numbered 2 and higher (NR being the record number). This is achieved by placing a condition before the action block as follows:
NR>1{ ... }
With the record separator set to >, each record is a whole sequence including its header, and each is split into fields at line breaks (because we set the field separator to "\n"). Thus, field 1 ($1) of each record contains the text we wish to use as a filename. Note the record separator (>) is no longer part of any field, so the entire first field can be used to build the filename. In this example, ".fasta" has been appended as a file extension:
fnme=$1 ".fasta";
Next, the fasta header marker ">" followed by the entire record ($0) is printed to the file named by fnme, using awk's > redirect:
print ">" $0 > fnme;
Lastly, the file is closed, to prevent awk from exceeding the system limit on the number of open files allowed if many files are to be written (see footnote):
close(fnme);
whole procedure
awk command
awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fasta"; print ">" $0 > fnme; close(fnme);}' example.fasta
Tested on the following mock file named example.fasta:
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
results (terminal commands and output)
$ ls
'DNA sequence 1.fasta' 'DNA sequence 3.fasta'
'DNA sequence 2.fasta' example.fasta
$ cat DNA\ sequence\ 1.fasta
>DNA sequence 1
GCAAAAGAACCGCCGCCACTGGTCGTGAAAGTGGTCGATCCAGTGACATCCCAGGTGTTGTTAAATTGAT
CATGGGCAGTGGCGGTGTAGGCTTGAGTACTGGCTACAACAACACTCGCACTACCCGGAGTGATAGTAAT
GCCGGTGGCGGTACCATGTACGGTGGTGAAGT
$ cat DNA\ sequence\ 2.fasta
>DNA sequence 2
TCCCAGCCAGCAGGTAGGGTCAAAACATGCAAGCCGGTGGCGATTCCGCCGACAGCATTCTCTGTAATTA
ATTGCTACCAGCGCGATTGGCGCCGCGACCAGGATCCTTTTTAACCATTTCAGAAAACCATTTGAGTCCA
TTTGAACCTCCATCTTTGTTC
$ cat DNA\ sequence\ 3.fasta
>DNA sequence 3
AACAAAAGAATTAGAGATATTTAACTCCACATTATTAAACTTGTCAATAACTATTTTTAACTTACCAGAA
AATTTCAGAATCGTTGCGAAAAATCTTGGGTATATTCAACACTGCCTGTATAACGAAACACAATAGTACT
TTAGGCTAACTAAGAAAAAACTTT
footnote
"To write numerous files, successively, in the same awk program. If the files aren’t closed, eventually awk may exceed a system limit on the number of open files in one process. It is best to close each one when the program has finished writing it."
quoted from https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html