Split a file into multiple gzip files in one line - awk

Is it possible to split a file into multiple gzip files in one line?
Let's say I have a very large file data.txt containing
A somedata 1
B somedata 1
A somedata 2
C somedata 1
B somedata 2
I would like to split this into separate directories of gz files.
For example, if I didn't care about separating, I would do
cat data.txt | gzip -5 -c | split -d -a 3 -b 100000000 - one_dir/one_dir.gz.
And this will generate gz files in 100 MB chunks under the one_dir directory.
But what I want is to separate the lines based on the first column. So I would like to have, say, 3 different directories, each containing gz files in 100 MB chunks for A, B and C respectively.
So the final directory will look like
A/
A.gz.000
A.gz.001
...
B/
B.gz.000
B.gz.001
...
C/
C.gz.000
C.gz.001
...
Can I do this in a one-liner using cat/awk/gzip/split? Can I also have it create the directories (if they don't exist yet)?

With awk:
awk '
!d[$1]++ {              # first time we see this $1: create its directory and pipeline
    system("mkdir -p "$1)
    c[$1] = "gzip -5 -c|split -d -a 3 -b 100000000 - "$1"/"$1".gz."
}
{ print | c[$1] }       # send every line to the pipeline for its $1
' data.txt
Assumes:
sufficiently few distinct $1 (there is an implementation-specific limit on how many pipes can be active simultaneously - e.g. popen() on my machine seems to allow 1020 pipes per process)
no problematic characters in $1
Incorporating improvements suggested by @EdMorton:
If you have a sort that supports -s (so-called "stable sort"), you can remove the first limit above as only a single pipe will need to be active.
You can remove the second limit by suitable testing and quoting before you use $1. In particular, unescaped single-quotes will interfere with quoting in the constructed command; and forward-slash is not valid in a filename. (NUL (\0) is not allowed in a filename either but should never appear in a text file.)
sort -s -k1,1 data.txt | awk '
$1 ~ "/" {
    print "Warning: unsafe character(s). Ignoring line",FNR >"/dev/stderr"
    next
}
$1 != prev {
    close(cmd)
    prev = $1
    # escape single-quote (\047) for use below
    s = $1
    gsub(/\047/,"\047\\\047\047",s)
    system("mkdir -p -- \047"s"\047")
    cmd = "gzip -5 -c|split -d -a 3 -b 100000000 -- - \047"s"/"s".gz.\047"
}
{ print | cmd }
'
Note that the code above still has gotchas:
for a path d1/d2/f:
the total length can't exceed the value reported by getconf PATH_MAX d1/d2; and
the name part (f) can't exceed the value reported by getconf NAME_MAX d1/d2
Hitting the NAME_MAX limit can be surprisingly easy: for example copying files onto an eCryptfs filesystem could reduce the limit from 255 to 143 characters.
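If you want to check those limits up front, you can query them with getconf before running the pipeline. A minimal sketch, assuming the A/, B/, ... output directories will be created under the current directory (so the limits of "." are the ones that matter):
getconf NAME_MAX .    # longest allowed file name component here
getconf PATH_MAX .    # longest allowed path length here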

Related

Use awk to split hundreds of files, have the filename include the original file name plus the column used to split

I have hundreds of files that look like this:
2 300 500
2 1000 1050
3 500 600
with hundreds of lines in each file. I want to split by the third column, but have the output file names include the original file name. That way I can do this to hundreds of files without overwriting and ending up with just one set.
I am splitting using:
awk '{print>$3}'
This splits the files, but each one gets named 500.txt, 1050.txt etc. If the file name is SRR3.counts, I would like the files to be SRR3.500.txt rather than 500.txt
Help?
like this:
awk '{p=FILENAME; sub(/\..*/, "", p); print>p"."$3".txt"}' SRR*
So this is what I did, and it's actually not putting the file name into the output files; I'm not sure why. I have two scripts named split_header.sh and split.sh that look like this:
split_header.sh
#!/bin/bash
#SBATCH --ntasks=3
#SBATCH --time=10000:00
#SBATCH --mem=32000mb
and split.sh
OUT_PFX="$(cut -d '.' -f 1 <<< "$OUT" )"
awk 'FILENAME ~ /^"$OUT"/{p=""$OUT_PFX"."}{print>p$3".txt"}' "$OUT"
then I made one for each file:
for i in SRR*
do
cat split_header.sh >> $i.sh
printf "\nOUT="$i"\n" >>$i.sh
cat split.sh >>$i.sh
done
... or if your files are named SRRwhatever.somethingElse:
awk '
FNR==1 {root=substr(FILENAME,1, index(FILENAME,".")-1)}
{out=(root "." $3 ".txt"); print >>out;close(out)}
' SRR*.*
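As an aside, the split.sh shown above most likely fails because shell variables such as $OUT and $OUT_PFX are never expanded inside a single-quoted awk program. A minimal sketch of one common fix, passing the prefix in with awk's -v (reusing the variable names from those scripts):
OUT_PFX="${OUT%%.*}"    # e.g. SRR3 from SRR3.counts
awk -v pfx="$OUT_PFX" '{ print > (pfx "." $3 ".txt") }' "$OUT"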

How to match a column from one file in multiple other files and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something similar?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing and go to the next line in the file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
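For reference, the grep variant being compared against would look something like this (just a sketch: -F makes grep treat the patterns as fixed strings rather than regular expressions, which is usually much faster, and -w is kept to avoid substring matches):
grep -Fwf allGenes.txt *.v7.egenes.txt > output.txt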
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try a while read loop:
#!/bin/bash
while read -r line; do
    grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt
So here tell the while loop to read all the lines from the allGenes.txt, and for each line, check whether there are matching lines in the egenes file. Would that do the trick?
EDIT:
New version:
#!/bin/bash
for name in $(cat allGenes.txt); do
    grep -rnw *.v7.egenes.txt -e "$name" >> output.txt
done

Batch renaming files with text from file as a variable

I am attempting to rename the files titled {out1.hmm, out2.hmm, ..., outn.hmm} to unique identifiers taken from the third line of each file {PF12574.hmm, PF09847.hmm, PF0024.hmm}. The script works on a single file; however, the variable does not get overwritten and only one file remains after running the command below:
for f in *.hmm;
do output="$(sed -n '3p' < $f |
awk -F ' ' '{print $2}' |
cut -f1 -d '.' | cat)" |
mv $f "${output}".hmm; done;
The first line takes all the outn.hmm files as input. The second line sets a variable to return the desired unique identifier; sed, awk, and cut are used to extract it. The variable is supposed to rename the current file to the unique identifier; however, the variable remains locked and each rename overwrites the previous file.
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm
How can I overwrite the variable to get the following file structure:
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm PF09847.hmm PF0024.hmm
You're piping the empty output of the assignment statement (to the variable named "output") into the mv command. That variable is not set yet, so what I think will happen is that you will - one after the other - rename all the files that match *.hmm to the file named ".hmm".
Try ls -a to see if that's what actually happened.
The sed, awk, cut, and (unneeded) cat are a bit much. awk can do all you need. Then do the mv as a separate command:
for f in *.hmm
do
    output=$(awk 'NR == 3 {print $2}' "$f")
    mv "$f" "${output%.*}.hmm"
done
Note that the above does not do any checking to verify that output is assigned to a reasonable value: one that is non-empty, that is a proper "identifier", etc.
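If you do want such a check, a minimal sketch might look like this; the PF* test is only an illustrative assumption about what a valid identifier looks like, based on the examples in the question:
for f in *.hmm
do
    output=$(awk 'NR == 3 {print $2}' "$f")
    id=${output%.*}
    case $id in
        PF*) mv -- "$f" "$id.hmm" ;;                              # looks like a real identifier
        *)   echo "skipping $f: unexpected identifier '$output'" >&2 ;;
    esac
done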

Finding common value across multiple files containing single column values

I have 100 text files, each containing a single column. The files look like:
file1.txt
10032
19873
18326
file2.txt
10032
19873
11254
file3.txt
15478
10032
11254
and so on.
The size of each file is different.
Kindly tell me how to find the numbers which are common in all these 100 files.
The same number appears only once in a given file.
This will work whether or not the same number can appear multiple times in 1 file:
$ awk '{a[$0][ARGIND]} END{for (i in a) if (length(a[i])==ARGIND) print i}' file[123]
10032
The above uses GNU awk for true multi-dimensional arrays and ARGIND. There are easy tweaks for other awks if necessary, e.g.:
$ awk '!seen[$0,FILENAME]++{a[$0]++} END{for (i in a) if (a[i]==ARGC-1) print i}' file[123]
10032
If the numbers are unique in each file then all you need is:
$ awk '(++c[$0])==(ARGC-1)' file*
10032
awk to the rescue!
To find the common element in all files (assuming uniqueness within the same file):
awk '{a[$1]++} END{for(k in a) if(a[k]==ARGC-1) print k}' files
Count all occurrences and print the values whose count equals the number of files.
Files with a single column?
You can sort and compare these files using the shell:
for f in file*.txt; do sort $f|uniq; done|sort|uniq -c -d
The last -c is not necessary; it's needed only if you want to count the number of occurrences.
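If you do keep the count, you can use it to print only the values that appear in every file, along these lines (a sketch; the number of files is computed from the same glob):
n=$(ls file*.txt | wc -l)    # how many input files there are
for f in file*.txt; do sort -u "$f"; done | sort | uniq -c | awk -v n="$n" '$1 == n {print $2}'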
Here's one using Bash and comm, because I needed to know if it would work. My test files were 1, 2 and 3, hence the for f in ?:
f=$(shuf -n1 -e ?) # pick one file randomly for initial comms file
sort "$f" > comms
for f in ? # this time for all files
do
    comm -1 -2 <(sort "$f") comms > tmp    # comms should be in sorted order always
    # grep -Fxf "$f" comms > tmp           # another solution, thanks @Sundeep
    mv tmp comms
done

cat lines from X to Y of multiple files into one file

I have many huge files in, say, 3 different folders, from which I would like to copy lines X to Y of the files with the same name and append them into a new file of the same name.
I tried doing
ls seed1/* | while read FILE; do
head -n $Y | tail -n $X seed1/$FILE seed2/$FILE seed3/$FILE > combined/$FILE
done
This does the job for the first value of $FILE, but it does not return to the prompt, and hence I am unable to execute the rest of the loop.
For example, I have the following files in three different folders, seed1, seed2 and seed3:
seed1/foo.dat
seed1/bar.dat
seed1/qax.dat
seed2/foo.dat
seed2/bar.dat
seed2/qax.dat
seed3/foo.dat
seed3/bar.dat
seed3/qax.dat
I would like to combine lines 10 to 20 of all files into a combined folder:
combined/foo.dat
combined/bar.dat
combined/qax.dat
Each of the files in combined has 30 lines, with 10 each from seed1, seed2 and seed3.
No loop required:
awk -v x=10 -v y=20 '
FNR==1 { out = gensub(/.*\//,"combined/",1,FILENAME) }
FNR>=x { print > out }
FNR==y { nextfile }
' seed*/*.dat
The above assumes the "combined" directory already exists (empty or not) before awk is called and uses GNU awk for gensub() and nextfile and internal file management. Solutions with other awks are less efficient, require a bit more coding, and require you to manage closing files when too many are going to be open.
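For awks without gensub() and nextfile, a rough equivalent might look like the sketch below; note it appends with >>, so any existing combined/*.dat files should be removed first, and it keeps reading each input file past line y since there is no nextfile:
awk -v x=10 -v y=20 '
FNR==1 { if (out != "") close(out); out=FILENAME; sub(/.*\//,"combined/",out) }
FNR>=x && FNR<=y { print >> out }
' seed*/*.dat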