Count rows and columns for multiple CSV files and make new file - awk

I have multiple large comma-separated CSV files in a directory. As a toy example:
one.csv has 3 rows, 2 columns
two.csv has 4 rows, 5 columns
This is what the files look like -
# one.csv
a b
1 1 3
2 2 2
3 3 1
# two.csv
c d e f g
1 4 1 1 4 1
2 3 2 2 3 2
3 2 3 3 2 3
4 1 4 4 1 4
The goal is to make a new .txt or .csv that gives the rows and columns for each:
one 3 2
two 4 5
To get the rows and columns (and dump it into a file) for a single file
$ awk -F "," '{print NF}' *.csv | sort | uniq -c > dims.txt
But I'm not understanding the syntax to get counts for multiple files.
What I've tried
$ awk '{for (i=1; i<=2; i++) -F "," '{print NF}' *.csv$i | sort | uniq -c}'

With any awk, you could try the following program.
awk '
FNR==1{
if(cols && rows){
print file,rows,cols
}
rows=cols=file=""
file=FILENAME
sub(/\..*/,"",file)
cols=NF
next
}
{
rows=(FNR-1)
}
END{
if(cols && rows){
print file,rows,cols
}
}
' one.csv two.csv
Explanation: a detailed, commented version of the above solution.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is the first line of a file; if so, do the following.
if(cols && rows){ ##Checking if cols AND rows are NOT NULL then do following.
print file,rows,cols ##Printing file, rows and cols variables here.
}
rows=cols=file="" ##Nullifying rows, cols and file here.
file=FILENAME ##Setting FILENAME value to file here.
sub(/\..*/,"",file) ##Removing everything from dot to till end of value in file.
cols=NF ##Setting NF values to cols here.
next ##next will skip all further statements from here.
}
{
rows=(FNR-1) ##Setting FNR-1 value to rows here.
}
END{ ##Starting END block of this program from here.
if(cols && rows){ ##Checking if cols AND rows are NOT NULL then do following.
print file,rows,cols ##Printing file, rows and cols variables here.
}
}
' one.csv two.csv ##Mentioning Input_file names here.

Using GNU awk you can do this in a single awk call (the -1s account for the header row and the leading row-number column in your samples):
awk -F, 'ENDFILE {
print gensub(/\.[^.]+$/, "", "1", FILENAME), FNR-1, NF-1
}' one.csv two.csv > dims.txt
cat dims.txt
one 3 2
two 4 5

You will need to iterate over all CSVs and print the name and dimensions for each file:
for i in *.csv; do awk -F "," 'END{print FILENAME, NR, NF}' "$i"; done > dims.txt
If you want to avoid awk, you can also use wc -l for the line count and grep -o "CSV-separator" | wc -l on a single line for the field count.
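For instance, a minimal sketch of that non-awk route, assuming the data really is comma-separated (a line with N commas has N+1 fields; adjust for the header/row-number columns the same way the awk answers do):
for f in *.csv; do
  rows=$(( $(wc -l < "$f") - 1 ))                             # -1 to skip the header line
  commas=$(head -n 2 "$f" | tail -n 1 | grep -o ',' | wc -l)  # commas on the first data line
  printf '%s %s %s\n' "${f%.csv}" "$rows" "$((commas + 1))"
done > dims.txt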

I would harness GNU AWK's ENDFILE for this task as follows. Let the content of one.csv be
1,3
2,2
3,1
and two.csv be
4,1,1,4,1
3,2,2,3,2
2,3,3,2,3
1,4,4,1,4
then
awk 'BEGIN{FS=","}ENDFILE{print FILENAME, FNR, NF}' one.csv two.csv
output
one.csv 3 2
two.csv 4 5
Explanation: ENDFILE is executed after processing every file. I set FS to , assuming that fields are ,-separated and there is no , inside a field. FILENAME, FNR, NF are built-in AWK variables: FNR is the number of the current row in the file, i.e. in ENDFILE the number of the last row; NF is the number of fields (again, of the last row). If your files have headers, use FNR-1; if your rows are prepended with a row number, use NF-1.
edit: changed NR to FNR
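Applied to the question's original samples, which have a header row and a leading row-number column (and assuming they really are comma-separated), that adjustment would look like:
awk 'BEGIN{FS=","} ENDFILE{print FILENAME, FNR-1, NF-1}' one.csv two.csv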

Without GNU awk you can use the shell plus POSIX awk this way:
for fn in *.csv; do
cols=$(awk '{print NF; exit}' "$fn")
rows=$(awk 'END{print NR-1}' "$fn")
printf "%s %s %s\n" "${fn%.csv}" "$rows" "$cols"
done
Prints:
one 3 2
two 4 5
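If reading each file twice bothers you, the same result should be possible in one pass per file, still with POSIX awk (the extension trimming via ${fn%.csv} stays in the shell):
for fn in *.csv; do
  awk -v name="${fn%.csv}" '
    NR == 1 { cols = NF }                # column count from the first line
    END     { print name, NR - 1, cols } # NR-1: do not count the header as a data row
  ' "$fn"
done > dims.txt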

Related

awk: filter out lines with less than 3 unique characters in a column

I have a dataset in a file
1 AAA
1 AAAB
1 AAABC
2 ABCD
2 ABCDE
3 AABB
3 AABCD
3 ABCDE
3 ABBCCDE
Desired output
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
I have tried:
awk 'BEGIN{FS="\t"; OFS="\t"} {
unq=0; delete seen; for (i=1; i<=NF; i++) if (!seen[$i]++) unq++} unq > 3' file
But I can't get any results.
Your code does something quite different than what you seem to be stating in your requirements. You loop over the tokens on the line (on the first line, there are two tokens, 1 and AAA) and check how many times you have seen those. Your script then prints if you have more than three unique tokens on the line, which you never do.
The concept you seem to be looking for is the substr() function, which allows you to loop over the individual characters in a string (for example).
awk 'BEGIN{FS="\t"; OFS="\t"}
{ unq=0; delete seen; for (i=1; i<=length($2); i++) if(!seen[substr($2, i, 1)]++) unq++ }
unq >= 3' file
To start debugging your own code, try adding a print statement at various strategic places to verify that you see what you expect; when you don't, you have found a bug. For example, it should be easy to see that for (i=1; i<=NF; ++i) print $i loops over the fields of the input line.
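For instance, to watch the character loop do its work (purely a debugging aid, using the question's tab FS):
awk 'BEGIN{FS="\t"} { for (i = 1; i <= length($2); i++) print NR, i, substr($2, i, 1) }' file
Each output line shows the record number, the character position, and the character being examined.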
With an awk that splits into chars given a null field separator and supports length(array) (e.g. GNU awk but also several others):
$ cat tst.awk
{
delete chars
split($2,tmp,"")
for (i in tmp) {
chars[tmp[i]]
}
}
length(chars) >= 3
$ awk -f tst.awk file
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
With perl:
$ perl -MList::Util=uniq -lane 'print if scalar(uniq split(//,$F[1])) > 2' ip.txt
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
$F[1] has the second column, which is then split into characters using an empty regexp. From this array, duplicates are removed using uniq, and then the size of the array is obtained using scalar.
Similar solution with ruby:
ruby -lane 'print if $F[1].chars.uniq.size > 2'
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
{
count=0
delete value
num=split($2,arr,"")
for(i=1;i<=num;i++){
if(!value[arr[i]]++){ count++ }
}
}
count>=3
' Input_file
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
{
count=0 ##Setting count to 0 here.
delete value ##Deleting value array here.
num=split($2,arr,"") ##Splitting 2nd field into arr array with delimiter of NULL.
for(i=1;i<=num;i++){ ##Traversing through all values of arr array here.
if(!value[arr[i]]++){ count++ } ##Checking if any element is NOT repeated in value then increase count by 1 here.
}
}
count>=3 ##Checking if count>=3 then print that line.
' Input_file ##Mentioning Input_file name here.
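A side note: split($2,arr,"") relies on splitting with a null field separator, which POSIX leaves undefined (GNU awk and several other awks support it). A portable sketch of the same counting idea, which essentially converges on the substr() approach shown earlier:
awk '{
  count = 0
  split("", value)                            # portable way to empty an array
  for (i = 1; i <= length($2); i++)
    if (!value[substr($2, i, 1)]++) count++   # count each distinct character once
}
count >= 3' Input_file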
another approach
$ awk '{c=0; for(f=$2; f&&c<3; c++) gsub(substr(f,1,1),"",f)} c>2' file
1 AAABC
2 ABCD
2 ABCDE
3 AABCD
3 ABCDE
3 ABBCCDE
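For reference, a commented expansion of that one-liner (same logic, just spread out):
awk '{
  c = 0                               # distinct characters removed so far
  for (f = $2; f != "" && c < 3; c++)
    gsub(substr(f, 1, 1), "", f)      # delete every copy of the first remaining character
}
c > 2' file                           # the loop ran 3 times, so at least 3 unique characters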

Compare and print last column in a file

I have a file
(n34)); 1
Z(n2)); 1
(n52)); 2
(n35)); 3
(n67)); 3
(n19)); 4
(n68)); 4
(n20)); 5
(n36)); 5
(n53)); 5
(n69)); 5
N(n3)); 5
(n54)); 6
(n70)); 7
N(n4)); 7
I want output such that whenever lines have the same number after the semicolon, those lines are printed on a single line with ; as the field separator.
Output should be
(n34)); 1;Z(n2)); 1
(n52)); 2
(n35)); 3;(n67)); 3
(n19)); 4;(n68)); 4
(n20)); 5;(n36)); 5;(n53)); 5;(n69)); 5;N(n3)); 5
(n54)); 6
(n70)); 7;N(n4)); 7
I tried the code below
awk -F';' 'NR == FNR { count[$2]++;next}
With this I am not getting how to print the lines on the same line when the same numbers are present.
1st solution: Could you please try the following, written and tested with your shown samples in GNU awk, assuming that your Input_file is sorted by the 2nd column.
awk '
BEGIN{ OFS=";" }
prev!=$2{
if(val){ print val }
val=""
}
{
val=(val?val OFS:"")$0
prev=$2
}
END{
if(val){ print val }
}
' Input_file
2nd solution: Or, in case your 2nd field is not sorted, try the following.
sort -nk2 Input_file |
awk '
BEGIN{ OFS=";" }
prev!=$2{
if(val){ print val }
val=""
}
{
val=(val?val OFS:"")$0
prev=$2
}
END{
if(val){ print val }
}
'
Explanation of awk code:
awk ' ##Starting awk program from here.
BEGIN{ OFS=";" } ##Setting output field separator as semi colon here.
prev!=$2{ ##If the previous line's 2nd field is NOT equal to the current 2nd field, do the following.
if(val){ print val } ##If val is set then print value of val here.
val="" ##Nullifying val here.
}
{
val=(val?val OFS:"")$0 ##Creating val variable and keep adding values to it with OFS in between their values.
prev=$2 ##Setting current 2nd field to prev to be checked in next line.
}
END{ ##Starting END block for this program from here.
if(val){ print val } ##If val is set then print value of val here.
}
' Input_file ##Mentioning Input_file name here.
Another awk:
$ awk -F\; '{a[$2]=a[$2] (a[$2]==""?"":";") $0}END{for(i in a)print a[i]}' file
Output:
(n34)); 1;Z(n2)); 1
(n52)); 2
(n35)); 3;(n67)); 3
(n19)); 4;(n68)); 4
(n20)); 5;(n36)); 5;(n53)); 5;(n69)); 5;N(n3)); 5
(n54)); 6
(n70)); 7;N(n4)); 7
Explained:
$ awk -F\; '{ # set delimiter (probably useless)
a[$2]=a[$2] (a[$2]==""?"":";") $0 # keep appending where $2s match
}
END { # in the end
for(i in a) # output
print a[i]
}' file
Edit: for(i in a) will produce an order that appears random. If you need ordered output, sort numerically on the group number, which begins the 2nd whitespace-separated field of each output line:
$ awk '...' | sort -k2n
Perl to the rescue!
perl -ne '($x, $y) = split;
$h{$y} .= "$x $y;";
END { print $h{$_} =~ s/;$/\n/r for sort keys %h }
' -- file
It splits each line on whitespace, stores the value in a hash table %h keyed by the second column, and when the file has been read, it prints the remembered values, sorting them by the second column. We always store the semicolon at the end, so we need to replace the final one with a new line in the output.
I would harness a GNU AWK array for that task in the following way. Let file.txt's content be:
(n34)); 1
Z(n2)); 1
(n52)); 2
(n35)); 3
(n67)); 3
(n19)); 4
(n68)); 4
(n20)); 5
(n36)); 5
(n53)); 5
(n69)); 5
N(n3)); 5
(n54)); 6
(n70)); 7
N(n4)); 7
then:
awk '{data[$2]=data[$2] ";" $0}END{for(i in data){print substr(data[i],2)}}' file.txt
output is:
(n34)); 1;Z(n2)); 1
(n52)); 2
(n35)); 3;(n67)); 3
(n19)); 4;(n68)); 4
(n20)); 5;(n36)); 5;(n53)); 5;(n69)); 5;N(n3)); 5
(n54)); 6
(n70)); 7;N(n4)); 7
Explanation: I exploit the fact that GNU AWK arrays are lazy: referencing a key that was never assigned behaves like an empty string. For every line I concatenate the whole line onto whatever is stored under the $2 key in array data, using ; as glue. This leads to a ; at the beginning of every record in data, so I print each value starting at its 2nd character. Note that for (i in data) traverses keys in an unspecified order; if the output must be ordered, set gawk's PROCINFO["sorted_in"] or pipe through sort. Keep in mind this solution stores everything in data, so it might not work well for huge files.
(tested in gawk 4.2.1)
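If the groups must come out in key order regardless of the interpreter's internal hash order, GNU AWK can be told to traverse the array numerically (a gawk-only knob):
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"}   # gawk: visit keys in ascending numeric order
     {data[$2]=data[$2] ";" $0}
     END{for(i in data) print substr(data[i],2)}' file.txt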
datamash has a similar function built in:
<infile datamash -W groupby 2 collapse 1
Output:
1 (n34));,Z(n2));
2 (n52));
3 (n35));,(n67));
4 (n19));,(n68));
5 (n20));,(n36));,(n53));,(n69));,N(n3));
6 (n54));
7 (n70));,N(n4));
This might work for you (GNU sed):
sed -E ':a;N;s/( \S+)\n(.*\1)$/\1;\2/;ta;P;D' file
Append the following line to the current line.
If both lines end in the same number (string), delete the intervening newline and repeat.
Print/Delete the first line in the pattern space and repeat.

Sort a file preserving the header in the first position with bash

When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did was remove the header, sort the rest of the file, and then try to add the header back. But I am not able to perform this last part, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without overwriting that file's actual first row.
You can do this as well:
{ head -1; sort; } < file_1.tsv
** Update **
For macOS, where head may read ahead and consume more than just the first line from the shared input:
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
a simpler awk
$ awk 'NR==1{print; next} {print | "sort"}' file
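A quick check in a terminal (stdout is line-buffered there, so the header line is flushed before sort's output arrives):
$ printf 'Gene Number\nu 7\na 3\nb 9\n' | awk 'NR==1{print; next} {print | "sort"}'
Gene Number
a 3
b 9
u 7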
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try the following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check condition FNR==1 to see if it is the first line; if so, save its value in a variable and move on to the next line via next.
Then keep appending every line's value to another variable, separated by newlines, until the last line.
Now come to the END block of this code, which executes once the Input_file has been read; there, print the first line's value and pipe the rest of the lines through the sort command.
This will work using any awk, sort, and cut in any shell on every UNIX box and will work whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can) and doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest so you can sort by that number then whatever other field(s) you want to sort on and then remove the added field again with a cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7

Explanation of a specific part of an awk command

Can you explain to me what this part of the command is doing: dateA="$dateA"?
awk 'FNR>1 && dateA<=$5' FS='|' dateA="$dateA" "$infile"
awk 'FNR > 1 && dateA <= $5 ' FS='|' dateA="$dateA" "$infile"
FNR is a variable that gives you the number of records read so far from the current file. Don't confuse it with the variable NR: FNR and NR have the same value while awk reads the first file, but for the second file FNR resets, whereas NR doesn't.
This is how FNR and NR work in awk:
$ seq 1 5 >file1
$ seq 1 3 >file2
$ cat file1
1
2
3
4
5
$ cat file2
1
2
3
$ awk '{print "Current line : "$0,"File: "FILENAME,"FNR : ",FNR,"NR : ",NR}' file1 file2
Current line : 1 File: file1 FNR : 1 NR : 1
Current line : 2 File: file1 FNR : 2 NR : 2
Current line : 3 File: file1 FNR : 3 NR : 3
Current line : 4 File: file1 FNR : 4 NR : 4
Current line : 5 File: file1 FNR : 5 NR : 5
Current line : 1 File: file2 FNR : 1 NR : 6
Current line : 2 File: file2 FNR : 2 NR : 7
Current line : 3 File: file2 FNR : 3 NR : 8
FNR > 1 && dateA <= $5 If the number of records read from the current file is greater than 1 and the variable dateA is less than or equal to the 5th field/column, we get a boolean true state, so such a line will be printed
FS='|' FS is the input field separator; you can also set it like
awk -F'|' '{ .... }' OR
awk -v FS='|' '{ .... }' OR
awk 'BEGIN{FS="|"}{ .... }'
dateA="$dateA" dateA is awk variable whose value is taken from your shell variable $dateA, similarly you can set it like
awk -v dateA="$dateA" '{ .... }'
Your above command can be rewritten like below also
awk -F'|' -v dateA="$dateA" 'FNR>1 && dateA <= $5' "$infile"
and some people prefer awk 'condition{action}' for better reading, so you can also write it as
awk -F'|' -v dateA="$dateA" 'FNR>1 && dateA <= $5{ print }' "$infile"
Here FNR>1 && dateA <= $5 is the condition and { print } is the action performed when the condition is true (print and print $0 are the same thing).
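A quick toy demo of that (made-up data, just to show that a bare condition defaults to { print }):
$ printf 'a|1\nb|2\nc|3\n' | awk -F'|' '$2 >= 2'
b|2
c|3
$ printf 'a|1\nb|2\nc|3\n' | awk -F'|' '$2 >= 2 { print }'
b|2
c|3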
Please go through the following explanation and let me know if it helps you.
Explanation: kindly don't run the following awk; it is expanded for explanation purposes only.
awk '
FNR>1 && dateA<=$5 ##FNR denotes the number of the current line within the current file, so here 2 conditions joined by AND are checked:
##1st, the current line number must be greater than 1; 2nd, the value of the variable named dateA must be
##less than or equal to the 5th field ($5).
##awk works on the pattern of condition then action: if a condition is TRUE, its action runs. Here a
##condition is present but NO action is defined, so the default action applies, which is to print the
##current line.
'
FS='|' ##FS denotes the field separator; in awk we can define the field separator ourselves, so here it is set to |.
dateA="$dateA" ##creating a variable named dateA whose value is equal to the shell variable named dateA. In awk, to assign
##shell variable values to awk variables, we have to create an awk variable and then assign the shell variable's
##value to it.
"$infile" ##Mentioning the Input_file name here which awk has to go through. Point to be noted here the "$infile" means
##it is a shell variable (as we all know to print shell variable value we have to use "$infile")
AWK allows assigning internal variables in the arguments with the form var=value. Since AWK does not have direct access to shell variables, dateA="$dateA" is used to "export" dateA to the AWK script.
Note that assignment arguments occur during file processing, after BEGIN, and can be used in-between files:
$ echo >file1; echo >file2
$ awk -vx=0 '
BEGIN {
print "BEGIN", x
}
{
print FILENAME, x
}
END {
print "END", x
}' x=1 file1 x=2 file2 x=3
BEGIN 0
file1 1
file2 2
END 3
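Putting the original command together with concrete values (made-up data and a made-up date format, purely for illustration):
$ cat infile
id|name|x|y|date
1|a|foo|bar|2021-01-05
2|b|foo|bar|2021-03-09
$ dateA='2021-02-01'
$ awk 'FNR>1 && dateA<=$5' FS='|' dateA="$dateA" infile
2|b|foo|bar|2021-03-09
The comparison works here because ISO-style dates compare correctly as strings.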

awk to move a file's last line above the previous line

In the awk below I am trying to move only the last line to the position above it. The problem is that my input files vary (they do not always have 4 lines like below), so I cannot use i=3 every time, and I cannot seem to fix that. Thank you :).
file
this is line 1
this is line 2
this is line 3
this is line 4
desired output
this is line 1
this is line 2
this is line 4
this is line 3
awk (seems like the last line is being moved, but to i=2)
awk '
{lines[NR]=$0}
END{
print lines[1], lines[NR];
for (i=3; i<NR; i++) {print lines[i]}
}
' OFS=$'\n' file
this is line 1
this is line 4
this is line 3
$ seq 4 | awk 'NR>2{print p2} {p2=p1; p1=$0} END{print p1 ORS p2}'
1
2
4
3
$ seq 7 | awk 'NR>2{print p2} {p2=p1; p1=$0} END{print p1 ORS p2}'
1
2
3
4
5
7
6
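The idea: p1 and p2 always hold the last two lines seen, anything older is safe to print, and END emits the final pair swapped. The same program, commented:
awk '
NR > 2 { print p2 }         # p2 is now two lines behind the current line; safe to print
{ p2 = p1; p1 = $0 }        # slide the two-line window forward
END { print p1 ORS p2 }     # the last two lines, in swapped order
' file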
try the following awk once:
awk '{a[FNR]=$0} END{for(i=1;i<=FNR-2;i++){print a[i]};print a[FNR] ORS a[FNR-1]}' Input_file
Explanation: Creating an array named a with index FNR (the current line number) and the current line as its value. Now in the END section of awk, starting a for loop from i=1 to i<=FNR-2. Why till FNR-2? Because you need to swap only the last 2 lines here. Once it has printed all those lines, it simply prints a[FNR] (the last line) and then a[FNR-1], joined with ORS (to print a newline between them).
Solution 2nd: by counting the number of lines in the Input_file and putting that count into an awk variable.
awk -v lines="$(wc -l < Input_file)" 'FNR==(lines-1){val=$0;next} FNR==lines{print $0 ORS val;next} 1' Input_file
You nearly had it. You just have to change the order.
awk '
{lines[NR]=$0}
END{
for (i=1; i<NR-1; i++) {print lines[i]}
print lines[NR];
print lines[NR-1];
}
' OFS=$'\n' file
I'd reverse the file, swap the first two lines, then re-reverse the file:
tac file | awk 'NR==1 {getline line2; print line2} 1' | tac
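Stage by stage with seq 4 (each step's output shown as a comment):
$ seq 4 | tac                                                      # 4 3 2 1
$ seq 4 | tac | awk 'NR==1 {getline line2; print line2} 1'         # 3 4 2 1
$ seq 4 | tac | awk 'NR==1 {getline line2; print line2} 1' | tac   # 1 2 4 3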