First of all, thank you for your help. I have a problem trying to use bash conditionals with two files. I have the file letters.txt
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
And I want to use conditionals so that if a letter in letters.txt is also in number.txt, it generates the files a.txt, b.txt, c.txt, which will look like this:
a.txt
A 12
A 15
b.txt
B 10
B 14
c.txt
C 18
I know I can do it using this code:
cat number.txt | awk '{if($1=="A")print $0}' > a.txt
But I want to do it using two files.
The efficient way to approach this type of problem is to sort the input on the key field(s) first, so you don't need to have multiple output files open simultaneously (which has limits and/or can slow processing down) and you aren't opening/closing an output file with every line read (which is always slow).
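For contrast, a minimal sketch of the naive approach, which keeps one output file open per key; that's fine for the three letters here, but with many distinct keys it can exceed the open-file limit in a non-GNU awk (gawk works around the limit by juggling file descriptors, which slows it down):
awk '
NR==FNR { lets[$1]; next }                    # 1st file: remember the letters
$1 in lets { print > (tolower($1) ".txt") }   # 2nd file: route each line to its per-letter file
' letters.txt number.txt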
Using GNU sort's -s (stable sort) to retain the input order of the non-key fields, with only one output file open at a time, kept open for the whole time it's being populated:
$ sort -k1,1 -s number.txt |
awk '
NR==FNR { lets[$1]; next }
!($1 in lets) { next }
$1 != prev { close(out); out=tolower($1) ".txt"; prev=$1 }
{ print > out }
' letters.txt -
$ head ?.txt
==> a.txt <==
A 15
A 12
==> b.txt <==
B 10
B 14
==> c.txt <==
C 18
If you don't have GNU sort with -s to retain the input order of the lines for each key field, you can replace it with awk | sort | cut, e.g.:
$ sort -k1,1 -s number.txt
A 15
A 12
B 10
B 14
C 18
D 20
E 23
$ awk '{print NR, $0}' number.txt | sort -k2,2 -k1,1n | cut -d' ' -f2-
A 15
A 12
B 10
B 14
C 18
D 20
E 23
Note the change in the order of the 2nd fields for A compared to the input order when the above is not done: by default, sort doesn't guarantee to retain the relative line order for each key it sorts on:
$ sort -k1,1 number.txt
A 12
A 15
B 10
B 14
C 18
D 20
E 23
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$0]
next
}
($1 in arr){
outputFile=(tolower($1)".txt")
print >> (outputFile)
close(outputFile)
}
' letters.txt number.txt
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when letters.txt is being read.
arr[$0] ##Creating array arr with index of current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##checking condition if 1st field is present in arr.
outputFile=(tolower($1)".txt") ##Creating outputFile to print output.
print >> (outputFile) ##Printing current line into output file.
close(outputFile) ##Closing the output file after each write, to avoid keeping many files open at once.
}
' letters.txt number.txt ##Mentioning Input_file names here.
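With the sample files above, this should produce the same per-letter files as the first answer (note that >> appends, so delete any old a.txt/b.txt/c.txt before re-running):
$ head ?.txt
==> a.txt <==
A 15
A 12
==> b.txt <==
B 10
B 14
==> c.txt <==
C 18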
Related
I have multiple large comma separated CSV files in a directory. But, as a toy example:
one.csv has 3 rows, 2 columns
two.csv has 4 rows, 5 columns
This is what the files look like -
# one.csv
a b
1 1 3
2 2 2
3 3 1
# two.csv
c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4
The goal is to make a new .txt or .csv that gives the rows and columns for each:
one 3 2
two 4 5
To get the rows and columns (and dump it into a file) for a single file
$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt
But I'm not understanding the syntax to get counts for multiple files.
What I've tried
$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'
With any awk, you could try the following program.
awk '
FNR==1{
if(cols && rows){
print file,rows,cols
}
rows=cols=file=""
file=FILENAME
sub(/\..*/,"",file)
cols=NF
next
}
{
rows=(FNR-1)
}
END{
if(cols && rows){
print file,rows,cols
}
}
' one.csv two.csv
Explanation: a detailed explanation of the above solution.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is the first line of a file; if so, do the following.
if(cols && rows){ ##Checking if cols AND rows are NOT NULL then do following.
print file,rows,cols ##Printing file, rows and cols variables here.
}
rows=cols=file="" ##Nullifying rows, cols and file here.
file=FILENAME ##Setting FILENAME value to file here.
sub(/\..*/,"",file) ##Removing everything from the first dot to the end of the value in file.
cols=NF ##Setting NF values to cols here.
next ##next will skip all further statements from here.
}
{
rows=(FNR-1) ##Setting FNR-1 value to rows here.
}
END{ ##Starting END block of this program from here.
if(cols && rows){ ##Checking if cols AND rows are NOT NULL then do following.
print file,rows,cols ##Printing file, rows and cols variables here.
}
}
' one.csv two.csv ##Mentioning Input_file names here.
Using GNU awk you can do this in a single command:
awk -F, 'ENDFILE {
print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1
}' one.csv two.csv > dims.txt
cat dims.txt
one 3 2
two 4 5
You will need to iterate over all the CSVs, printing the name and dimensions for each file:
for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' "$i"; done > dims.txt
If you want to avoid awk, you can also do it with wc -l for the lines and grep -o "CSV-separator" | wc -l for the fields.
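A minimal sketch of that idea, assuming ,-separated files with no header row (subtract 1 from the row count if there is one):
for f in *.csv; do
  rows=$(( $(wc -l < "$f") ))                              # lines = rows; arithmetic strips wc's padding
  cols=$(( $(head -n 1 "$f" | grep -o ',' | wc -l) + 1 ))  # separators + 1 = fields
  printf '%s %s %s\n' "${f%.csv}" "$rows" "$cols"
done > dims.txt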
I would harness GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be
1,3
2,2
3,1
and two.csv be
4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4
then
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv
output
one.csv 3 2
two.csv 4 5
Explanation: ENDFILE is executed after processing each file. I set FS to , on the assumption that fields are ,-separated and there is no , inside a field. FILENAME, FNR and NF are built-in awk variables: FNR is the number of the current row in the file (so in ENDFILE it is the number of the last row), and NF is the number of fields (again, of the last row). If your files have headers, use FNR-1; if your rows are prepended with a row number, use NF-1.
edit: changed NR to FNR
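If your files have both a header row and a leading row-number column, as the samples in the question appear to, a sketch combining both adjustments (still assuming ,-separated fields) would be:
awk 'BEGIN{FS=","} ENDFILE{print FILENAME, FNR-1, NF-1}' one.csv two.csv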
Without GNU awk you can use the shell plus POSIX awk this way:
for fn in *.csv; do
cols=$(awk '{print NF; exit}' "$fn")
rows=$(awk 'END{print NR-1}' "$fn")
printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done
Prints:
one 3 2
two 4 5
file1
1 123 ab456 A G PASS AC=0.15;FB=1.5;BV=45; 0|0 0|0 0|1 0|0
4 789 ab123 C T PASS FB=90;AC=2.15;BV=12; 0|1 0|1 0|0 0|0
desired output
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
I used
awk '{print $1,$2,$3,$4,$5,$6,$7}' file1 > out1.txt
sed -i 's/;/\t/g' out1.txt
awk '{print $1,$2,$3,$4,$5,$6,$7,$8}' out1.txt
output generated
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS FB=90
I want to print the first 6 columns along with the value of AC=(*) from the 7th column.
With your shown samples, please try the following awk code.
awk '
{
val=""
while(match($7,/AC=[^;]*/)){
val=(val?val OFS:"")substr($7,RSTART,RLENGTH)
$7=substr($7,RSTART+RLENGTH)
}
print $1,$2,$3,$4,$5,$6,val
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
{
val="" ##Nullifying val here.
while(match($7,/AC=[^;]*/)){ ##Looping with match() over every AC=... occurrence (up to, but not including, a semicolon) in the 7th field.
val=(val?val OFS:"")substr($7,RSTART,RLENGTH) ##Appending each matched value from the 7th field to val, separated by OFS.
$7=substr($7,RSTART+RLENGTH) ##Assigning rest pending values to 7th column itself.
}
print $1,$2,$3,$4,$5,$6,val ##Printing appropriate columns required by OP along with val here.
}
' Input_file ##Mentioning Input_file name here.
$ awk '{
n=split($7,a,/;/) # split $7 on ;s
for(i=1;i<=n&&a[i]!~/^AC=/;i++); # just loop looking for AC
print $1,$2,$3,$4,$5,$6,a[i] # output
}' file
Output:
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
If AC= is not found, an empty field is output instead.
Any time you have tag=value pairs in your data I find it best to first populate an array (f[] below) to hold those tag-value mappings so you can print/test/rearrange those values by their tags (names).
Using any awk in any shell on every Unix box:
$ cat tst.awk
{
    split("",f)                              # reset the tag->value map for this line
    n = split($7,tmp,/[=;]/)                 # split the tag=value pairs of $7 into tmp[]
    for (i=1; i<n; i+=2) {                   # walk the (tag,value) pairs
        f[tmp[i]] = tmp[i] "=" tmp[i+1]
    }
    sub(/[[:space:]]*[^[:space:]]+;.*/,"")   # strip $7 and everything after it from $0
    print $0, f["AC"]
}
$ awk -f tst.awk file
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
This might work for you (GNU sed):
sed -nE 's/^((\S+\s){6})\S*;?(AC=[^;]*);.*/\1\3/p' file
Turn off implicit printing with -n and use easier (extended) regexps with -E.
Match the first six fields and their delimiters, then append the AC tag and its value from the seventh field.
With only GNU sed:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/' file1
1 123 ab456 A G PASS AC=0.15
4 789 ab123 C T PASS AC=2.15
Lines without an AC=... part in the 7th field will be printed without modification. If you prefer removing the 7th field and the end of the line in that case, use:
$ sed -r 's/(\S+;)?(AC=[^;]*).*/\2/;t;s/\S+;.*//' file1
First of all, thank you for your help. I have the file letter.txt:
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
I want to count how many times each letter in letter.txt appears in number.txt, so the output will be:
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
I know I can do it using this code, but I want to do it generically for any file.
cat number.txt | awk 'BEGIN {A=0;B=0;C=0;count=0}; {count++};{if ($1 == "A")A++};{if ($1 == "B")B++};{if ($1 == "C")C++}END{print "We have found " A " A\n" "We have found " B " B\n" "We have found " C " C\n" "Total letter found: " count}'
You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.
awk 'NR==FNR { count[$1] = 0; next }
$1 in count { ++count[$1]; ++total }
END { for(k in count)
print "We have found", count[k], k
print "Total letter found:", total+0 }' letter.txt number.txt
All of this should be easy to find in a basic Awk tutorial, but in brief: the line number within the current file, FNR, is equal to the overall line number, NR, only while you are reading the first input file. We initialize count with the keys we want to look for. If we fall through, we are reading the second file; if we see a key we want, we increment its count. When we are done, we report what we found.
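A quick way to see the NR/FNR relationship for yourself (NR keeps counting across files while FNR restarts at 1 for each file, so NR==FNR holds only while the first file is read):
$ awk '{print FILENAME, NR, FNR}' letter.txt number.txt
letter.txt 1 1
letter.txt 2 2
letter.txt 3 3
number.txt 4 1
number.txt 5 2
...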
Consider starting with:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c
2 A
2 B
1 C
Then:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c |
awk '
{ print "We have found", $1, $2; tot+=$1 }
END { print "Total letter found:", tot+0 }
'
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
although in reality I'd probably just do it all in awk; I just wanted to show an alternative.
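For completeness, a sketch of that all-awk version, essentially the same inner join as the answer above (note that awk's for (k in ...) traversal order is unspecified, so pipe the output through sort if the order matters):
awk '
NR==FNR { lets[$1]; next }        # 1st file: remember which letters we want
$1 in lets { cnt[$1]++; tot++ }   # 2nd file: count lines whose key we want
END {
    for (k in cnt) print "We have found", cnt[k], k
    print "Total letter found:", tot+0
}
' letter.txt number.txt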
Don't know if you need awk;
to me it's easier (but slower to execute, as noted in the comments) to use grep -c:
while IFS= read -r line; do
    c=$(grep -c "$line" file2)
    echo "We have found $c $line"
done < file1
It's a loop where $c is the count of matching lines reported by grep -c.
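One caveat: grep -c "$line" counts lines containing the letter anywhere on the line, not just in the first field. Anchoring the pattern to the first field is a safer sketch:
c=$(grep -c "^$line " file2)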
grep and coreutils can also do this:
grep -f letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c
Output:
2 A
2 B
1 C
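Note that the patterns from -f also match anywhere on the line, so a letter could match inside a longer token or in the second column. Adding -w, so each pattern has to match a whole word, hedges against that:
grep -wf letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c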
I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need, in one command with pipelines, the count of unique second-column values for each value in the first column. As you can see, there can be duplicates like "A 1234". I had some ideas with awk and cut, but neither of them seems to work; they just print out all unique pairs, while I need the count of unique values from the second column grouped by the value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
For a complete awk solution, could you please try the following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if the whole line (1st and 2nd columns) is NOT already present in the found array; if so, do the following.
val[$1]++ ##Incrementing val for the 1st column's index, so each unique line is counted once per key.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{ # using GNU awk and tab as delimiter
a[$1][$2] # hash to 2D array
}
END {
for(i in a) # for all values in first field
print i,length(a[i]) # output value and the size of related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
This requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
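For example, if input.txt were unsorted, this should give the same result:
$ datamash -s -g1 countunique 2 < input.txt
A 3
B 2
C 1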
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)
When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did was remove the header and sort the rest of the file, and then try to add the header back. But I am not able to perform this last part, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without replacing its actual first row.
You can do this as well:
{ head -1; sort; } < file_1.tsv
** Update **
For macOS (whose head may consume more than the first line from the shared input, so read the header explicitly instead):
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
a simpler awk
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try the following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check the condition FNR==1 to see if it's the first line; if so, save its value in a variable and move on to the next line with next.
Then keep appending each line's value to another variable, separated by newlines, until the last line.
Now come to the END block of this code, which executes once the Input_file has been read; there, print the first line's value and pipe the rest of the lines through the sort command.
This will work using any awk, sort, and cut in any shell on every UNIX box and will work whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can) and doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest, so you can sort by that number first and then by whatever other field(s) you want to sort on, and then remove the added field again with cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7