cat lines from X to Y of multiple files into one file - awk

I have many huge files in, say, 3 different folders, from which I would like to copy lines X to Y of the files of the same name and append them into a new file of the same name.
I tried doing
ls seed1/* | while read FILE; do
    head -n $Y | tail -n $X seed1/$FILE seed2/$FILE seed3/$FILE > combined/$FILE
done
This does the job for the first value of $FILE, but it never returns to the prompt, so the rest of the loop never executes.
For example, I have the following files in three different folders, seed1, seed2 and seed3:
seed1/foo.dat
seed1/bar.dat
seed1/qax.dat
seed2/foo.dat
seed2/bar.dat
seed2/qax.dat
seed3/foo.dat
seed3/bar.dat
seed3/qax.dat
I would like to combine lines 10 to 20 of all files into a combined folder:
combined/foo.dat
combined/bar.dat
combined/qax.dat
Each of the files in combined has 30 lines, 10 each from seed1, seed2 and seed3.

No loop required:
awk -v x=10 -v y=20 '
FNR==1 { out = gensub(/.*\//,"combined/",1,FILENAME) }
FNR>=x { print > out }
FNR==y { nextfile }
' seed*/*.dat
The above assumes the "combined" directory already exists (empty or not) before awk is called, and it uses GNU awk for gensub(), nextfile, and its internal management of open files. Solutions with other awks are less efficient, require a bit more coding, and require you to manage closing files yourself when too many would otherwise be open at once.
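For reference, a minimal sketch of a portable-awk variant; it assumes, as above, that combined/ already exists, and additionally that it starts out empty, since it appends with >> after closing each file:
awk -v x=10 -v y=20 '
FNR==1 {
    if (out) close(out)             # close the previous output file
    out = FILENAME
    sub(/.*\//, "combined/", out)   # derive combined/<name> without gensub()
}
FNR>=x && FNR<=y { print >> out }   # >> so lines from later seeds accumulate
' seed*/*.dat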

Related

Split a file into multiple gzip files in one line

Is it possible to split a file into multiple gzip files in one line?
Let's say I have a very large file data.txt containing
A somedata 1
B somedata 1
A somedata 2
C somedata 1
B somedata 2
I would like to split these into separate directories of gz files, based on the first column.
For example, if I didn't care about separating, I would do
cat data.txt | gzip -5 -c | split -d -a 3 -b 100000000 - one_dir/one_dir.gz.
This will generate gz files in 100MB chunks under the one_dir directory.
But what I want is to separate the lines based on the first column, so I would have, say, 3 different directories containing gz files in 100MB chunks for A, B and C respectively.
So the final directory will look like
A/
A.gz.000
A.gz.001
...
B/
B.gz.000
B.gz.001
...
C/
C.gz.000
C.gz.001
...
Can I do this in a one-liner using cat/awk/gzip/split? Can I also have it create the directories if they don't exist yet?
With awk:
awk '
!d[$1]++ {
    system("mkdir -p "$1)
    c[$1] = "gzip -5 -c|split -d -a 3 -b 100000000 - "$1"/"$1".gz."
}
{ print | c[$1] }
' data.txt
Assumes:
sufficiently few distinct $1 (there is an implementation-specific limit on how many pipes can be active simultaneously; e.g., popen() on my machine seems to allow 1020 pipes per process)
no problematic characters in $1
Incorporating improvements suggested by @EdMorton:
If you have a sort that supports -s (so-called "stable sort"), you can remove the first limit above as only a single pipe will need to be active.
You can remove the second limit by suitable testing and quoting before you use $1. In particular, unescaped single-quotes will interfere with quoting in the constructed command; and forward-slash is not valid in a filename. (NUL (\0) is not allowed in a filename either but should never appear in a text file.)
sort -s -k1,1 data.txt | awk '
$1 ~ "/" {
    print "Warning: unsafe character(s). Ignoring line",FNR > "/dev/stderr"
    next
}
$1 != prev {
    close(cmd)
    prev = $1
    # escape single-quote (\047) for use below
    s = $1
    gsub(/\047/,"\047\\\047\047",s)
    system("mkdir -p -- \047"s"\047")
    cmd = "gzip -5 -c|split -d -a 3 -b 100000000 -- - \047"s"/"s".gz.\047"
}
{ print | cmd }
'
Note that the code above still has gotchas:
for a path d1/d2/f:
the total length can't exceed getconf PATH_MAX d1/d2; and
the name part (f) can't exceed getconf NAME_MAX d1/d2
Hitting the NAME_MAX limit can be surprisingly easy: for example copying files onto an eCryptfs filesystem could reduce the limit from 255 to 143 characters.
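For reference, a quick way to check those limits for the directory that will hold the output (the printed values vary by filesystem; 4096 and 255 are typical on Linux ext4):
getconf PATH_MAX d1/d2    # maximum total path length, often 4096
getconf NAME_MAX d1/d2    # maximum single-component length, often 255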

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this faster, in awk or otherwise?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing the current line and go to the next line of input
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run much (if any) faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see.
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
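As a rough illustration of that approach, a sketch assuming GNU Parallel is installed (--pipepart splits each file into blocks that are grepped in parallel; adjust --block to taste):
for f in *.v7.egenes.txt; do
    # each block is fed on stdin to grep, with the OP's -w and -f flags
    parallel --pipepart -a "$f" --block 10M grep -wf allGenes.txt
done > output.txt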
You could try a while read loop:
#!/bin/bash
while read -r line; do
    grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt
Here we tell the while loop to read all the lines from allGenes.txt and, for each line, check whether there are matching lines in the egenes file. Would that do the trick?
EDIT:
New version:
#!/bin/bash
for name in $(cat allGenes.txt); do
    grep -rnw *v7.egenes.txt* -e "$name" >> output.txt
done

Print File Paths and Filenames within a Directory into a CSV

I have a directory that contains several files. For instance:
File1.bam
File2.bam
File3.bam
I want to create a .csv file that contains 2 columns and includes a header:
Name,Path
File1, /Path/To/File1.bam
File2, /Path/To/File2.bam
File3, /Path/To/File3.bam
I've managed to piece together a way of doing this in separate steps, but it involves creating a csv with the path, appending with the filename, and then appending again with the header. I would like to add both the filename and path in 1 step, so that there is no possibility of linking an incorrect filename and path.
In case it matters, I'm trying to do this within a script that is running in a batch job (SLURM), and the output CSV will be used in subsequent workflow steps.
find ~/Desktop/test -iname '*.csv' -type f >bamlist1.csv
awk '{print FILENAME (NF?",":"") $0}' *.csv > test.csv
{ echo 'Name, Path'; cat bamlist.csv; } > bamdata.csv
Untested but should be close:
find ~/Desktop/test -name '*.bam' -type f |
awk '
BEGIN { OFS=","; print "Name", "Path" }
{ fname=$0; sub(".*/","",fname); print fname, $0 }
' > bamdata.csv
Like your original script the above assumes that none of your file/directory names contain newlines or commas.
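If commas in names are a real possibility, one hedged variant is to CSV-quote both fields (this still assumes no newlines in names):
find ~/Desktop/test -name '*.bam' -type f |
awk '
BEGIN { print "\"Name\",\"Path\"" }
{
    path = $0
    fname = $0; sub(".*/", "", fname)
    gsub(/"/, "\"\"", fname)    # CSV rule: double any embedded quotes
    gsub(/"/, "\"\"", path)
    printf "\"%s\",\"%s\"\n", fname, path
}
' > bamdata.csv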
If you have GNU find you can just do:
{ echo "Name,Path"; find ~/Desktop/test -name '*.bam' -type f -printf '%f,%p\n' } > bamdata.csv

Renaming files based on internal text match - keep all content of file

Still having trouble figuring out how to preserve the contents of a given file using the following code that is attempting to rename the file based on a specific regex match within said file (i.e. within a given file there will always be one SMILE followed by 12 digits, e.g., SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code is naming the file correctly but is simply printing out everything after the match instead of the entire contents of the file.
The files to be processed don't currently have an extension (and they need one for the next step) because I was using csplit to parse out the content from a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
    awk -v f="${f%.*}" '
        !out && /SMILE[0-9]/ {      # first match: set the output name
            out = f ".txt"
            for (i=1; i<NR; ++i)    # print the file so far
                print lines[i] > out
        }
        out   { print > out }       # match has been seen: print
        ! out { lines[NR] = $0 }    # no match yet: store
    ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to find the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
Perhaps a different approach: just do each step separately.
In pseudocode:
for all files with some given text
extract text
rename file
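A minimal sketch of that idea in bash; it assumes (per the question) one SMILE id of 12 digits per file, and it uses the extracted id as the new name, which is one reading of "rename file"; adjust the target name as needed:
for f in FILENAMEX_*; do
    # extract the first SMILE id (SMILE followed by 12 digits)
    id=$(grep -oE 'SMILE[0-9]{12}' "$f" | head -n 1)
    # rename the file, keeping all of its content
    [ -n "$id" ] && mv "$f" "$id.txt"
done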

Trying to modify awk code

awk 'BEGIN{OFS=","} FNR == 1
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
{
cmd="gunzip -cd " FILENAME
cmd; close(cmd)
}
END {print fn,fnr,nl}
' /tmp/appscraps/* > /tmp/test.txt
The above scans all files in a given directory and prints each file's name, its number of lines, and the number of lines containing 'ERROR'.
I'm now trying to make the script execute a command if any file it reads isn't a regular file; i.e., if the file is a gzip file, then run a particular command.
Above is my attempt to include the gunzip command and to do it on my own. Unfortunately, it isn't working. Also, I cannot gunzip all the files in the directory beforehand, because not all files in the directory are gzip files; some are regular files.
So I need the script to treat any .gz file it finds differently, so it can read it, count and print the number of lines in it, and print the number of lines matching the supplied pattern (just as it would if the file were a regular file).
Any help?
This part of your script makes no sense:
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
Let me restructure it a bit and comment it so it's clearer what it does:
{                   # for every line of every input file, do the following:
    # If this is the 2nd or subsequent line, print the values of these variables:
    if (NR > 1) {
        print fn,fnr,nl
    }
    fn = FILENAME   # set fn to FILENAME. Since this will occur for the first line of
                    # every file, this is the value fn will have when printed above,
                    # so why not just get rid of fn and print FILENAME?
    fnr = 1         # set fnr to 1. This is immediately over-written below by
                    # setting it to FNR so this is pointless.
    nl = 0
}
{                   # for every line of every input file, also do the following
                    # (note the unnecessary "}" then "{" above):
    fnr = FNR       # set fnr to FNR. Since this will occur for the first line of
                    # every file, this is the value fnr will have when printed above,
                    # so why not just get rid of fnr and print FNR-1?
}
/ERROR/ && FILENAME ~ /\.gz$/ {
    nl++            # increment the value of nl. Since nl is always set to zero above,
                    # this will only ever set it to 1, so why not just set it to 1?
                    # I suspect the real intent is to NOT set it to zero above.
}
You also have the code above testing for a file name that ends in ".gz" but then you're running gunzip on every file in the very next block.
Beyond that, just call gunzip from shell as everyone else also suggested. awk is a tool for parsing text, it's not an environment from which to call other tools - that's what a shell is for.
For example, assuming your comment ("prints the file name, number of lines in each file and number of lines found containing 'ERROR'") accurately describes what you want your awk script to do, and assuming it makes sense to test for the word "ERROR" directly in a ".gz" file using awk:
for file in /tmp/appscraps/*.gz
do
    awk -v OFS=',' '/ERROR/{nl++} END{print FILENAME, NR+0, nl+0}' "$file"
    gunzip -cd "$file"
done > /tmp/test.txt
Much clearer and simpler, isn't it?
If it doesn't make sense to test for the word ERROR directly in a ".gz" file, then you can do this instead:
for file in /tmp/appscraps/*.gz
do
    zcat "$file" | awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
    gunzip -cd "$file"
done > /tmp/test.txt
To handle gz and non-gz files as you've now described in your comment below:
for file in /tmp/appscraps/*
do
    case $file in
        *.gz ) cmd="zcat" ;;
        *    ) cmd="cat" ;;
    esac
    "$cmd" "$file" |
    awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
done > /tmp/test.txt
I left out the gunzip since you don't need it as far as I can tell from your stated requirements. If I'm wrong, explain what you need it for.
I think it could be simpler than that.
With shell expansion you already have the file name (hence you can print it).
So you can do a loop over all the files, and for each do the following:
print the file name
zgrep -c ERROR $file (this outputs the number of lines containing 'ERROR')
zcat $file|wc -l (this outputs the number of lines)
zgrep works on both plain text files and gzipped ones; zcat needs the -f flag to pass plain text files through unchanged.
Assuming you don't have any spaces in the paths/filenames:
for f in /tmp/appscraps/*
do
    n_lines=$(zcat -f "$f" | wc -l)    # -f lets zcat pass non-gzip files through
    n_errors=$(zgrep -c ERROR "$f")
    echo "$f $n_lines $n_errors"
done
This is untested but it should work.
You can execute the following command for each file:
gunzip -t FILENAME; echo $?
It will print the exit code: 0 (for gzip files) or 1 (corrupt/other file). You can then compare the output in an if statement to execute the required processing.
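For example, a minimal sketch that branches on that exit code (paths are illustrative, following the earlier examples):
for f in /tmp/appscraps/*; do
    if gunzip -t "$f" 2>/dev/null; then
        zcat "$f"    # valid gzip data: decompress it
    else
        cat "$f"     # anything else: treat it as a regular file
    fi |
    awk -v file="$f" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
done > /tmp/test.txt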