I have a CSV file containing records like the ones below.
id,h1,h2,h3,h4,h5,h6,h7
101,zebra,1,papa,4,dog,3,apple
102,2,yahoo,5,kangaroo,7,ape
I want to sort the rows in this file, excluding the header and the first column. My output should look like this.
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
I tried the AWK below but don't know how to exclude the header and the first column.
awk -F"," ' {
s=""
for(i=1; i<=NF; i++) { a[i]=$i; }
for(i=1; i<=NF; i++)
{
for(j = i+1; j<=NF; j++)
{
if (a[i] >= a[j])
{
temp = a[j];
a[j] = a[i];
a[i] = temp;
}
}
}
for(i=1; i<=NF; i++){ s = s","a[i]; }
print s
}
' file
Thanks
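One way to keep the bubble-sort approach from the question is to print the header line untouched (NR==1) and copy only fields 2..NF into the array before sorting. A minimal, untested sketch along those lines, using the same file name as above:
awk -F, -v OFS=, '
NR == 1 { print; next }                       # header: print as is
{
    n = 0
    for (i = 2; i <= NF; i++) a[++n] = $i     # copy everything except field 1
    for (i = 1; i <= n; i++)                  # same bubble-sort idea, limited to a[1..n]
        for (j = i + 1; j <= n; j++)
            if (a[i] > a[j]) { t = a[i]; a[i] = a[j]; a[j] = t }
    s = $1
    for (i = 1; i <= n; i++) s = s OFS a[i]
    print s
}' file
Comparisons follow awk's usual rules, so purely numeric fields compare numerically and everything else lexically, which should match the desired output for the sample data.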
If perl is okay:
$ perl -F, -lane 'print join ",", $.==1 ? @F : ($F[0], sort @F[1..$#F])' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
-F, to indicate , as the input field separator; the resulting fields are saved in the @F array
See https://perldoc.perl.org/perlrun#Command-Switches for details on other options
join "," to use , as the output field separator
$.==1 ? @F for the first line, print it as is
($F[0], sort @F[1..$#F]) for other lines: the first field, followed by the remaining fields sorted
.. is the range operator; $#F gives the index of the last field
you can also use (shift @F, sort @F) instead of ($F[0], sort @F[1..$#F])
For the given header, sorting the first line would work too (its field names are already in order), so the logic can be simplified:
$ # can also use: perl -F, -lane 'print join ",", shift @F, sort @F'
$ perl -F, -lane 'print join ",", $F[0], sort @F[1..$#F]' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
$ # can also use: ruby -F, -lane 'print [$F.shift, $F.sort] * ","'
$ ruby -F, -lane 'print [$F[0], $F.drop(1).sort] * ","' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
If you have gawk, use asort():
awk -v OFS="," 'NR>1{split($0, a, ",");
$1=a[1];
delete a[1];
n = asort(a, b);
for (i = 1; i <= n; i++){ $(i+1)=b[i]}};
1' file.csv
This splits the line into array a using , as the separator, for all rows except the first one.
Then the first field of the row is set to the first value in a, and that value is deleted from a.
Now a is sorted into b, and the sorted values are assigned back starting from the second column; then the row is printed.
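With the sample data from the question, this should produce output like:
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo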
You can just use the asort() function in awk for your requirement and start sorting from the second line onwards. The solution is GNU awk specific because of the asort() and length(array) functions.
awk 'NR==1{ print; next }
NR>1 { finalStr=""
arrayLength=""
delete b
split( $0, a, "," )
for( i = 2; i <= length(a); i++ )
b[arrayLength++] = a[i]
asort( b )
for( i = 1; i <= arrayLength ; i++ )
finalStr = (finalStr)?(finalStr","b[i]):(b[i])
printf( "%s", a[1]","finalStr )
printf( "\n" );
}' file
The idea is that we first split the entire line on the , delimiter into array a, from which we copy the elements from the 2nd field onwards into a new array b. We sort the elements in this new array and put the first column's element back in front when we finally print.
I would like to merge lines into a single row based on the first column, $1, and format the output. When printing the header, I need to repeat it based on the maximum count of any unique value in the first field.
For example, Angola appears with count=3, Brazil with count=5, and Zambia with count=1.
The maximum count for field $1 is 5, so the header needs to be printed 5 times to provide a proper header for all the fields.
When printing the output, I want to keep the original input file's line order.
My actual input files vary, e.g. 10 fields, 12 fields, etc.
Input.csv
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Desired Output.csv
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Presently, I am using the two commands below to get the desired output, changing the count manually each time based on the actual input file.
Step:#1
awk 'BEGIN { while (count++<5) header=header "Country,Network,Details,Amount,"; print header }' > output.csv
Step:#2
awk -F, '
/.+/{
if (!($1 in Val)) { Key[++i] = $1; }
Val[$1] = Val[$1] $0 ",";
}
END{
for (j = 1; j <= i; j++) {
print(Val[Key[j]] );
}
}' input.csv >> output.csv
Looking for your suggestions ...
awk one-liner:
awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Input:
$ cat infile
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Output:
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
More readable version:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;
next
}
{
a[$1]=($1 in a ? a[$1] OFS:"")$0;
if(!($1 in b)){ o[++i]=$1 };
b[$1]++;
mx=mx>b[$1]?mx:b[$1]
}
END{
for(i=1; i<=mx; i++)
printf("%s%s",n,i==mx?RS:OFS);
for(i=1; i in o; i++)
print a[o[i]]
}' infile
For the comment:
Would like to know where to change the code to print "Country" in
the output only the first time, if I don't need to print the same country
name a 2nd time, 3rd time
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0); if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount
Angola,voda,xxx,10,at&t,xxx,20,mtn,xxx,30
Brazil,voda,yyy,40,voda,yyy,50,at&t,yyy,60,mtn,yyy,70,voda,yyy,80
Zambia,tcl,zzz,90
Modified code:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;
next
}
{
# this line modified:
# find the position of the first comma and append only what follows it
a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0);
if(!($1 in b)){ o[++i]=$1 };
b[$1]++;
mx=mx>b[$1]?mx:b[$1]
}
END{
for(i=1; i<=mx; i++)
# this line modified
printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS);
for(i=1; i in o; i++)
print a[o[i]]
}' infile
Explanation related to modification:
index(in, find)
Search the string in for the first occurrence of the string find, and
return the position in characters where that occurrence begins in the
string in.
substr(string, start [, length ])
Return a length-character-long substring of string, starting at
character number start.
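As a quick illustration of those two calls on the header line used here:
$ awk 'BEGIN{ n="Country,Network,Details,Amount"; print substr(n, index(n,",")+1) }'
Network,Details,Amount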
I have a file with data as follows
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0
I would like to change the delimiter for col6 through col10 to a pipe '|', with each value prefixed by its column name, dropping any of those columns whose value is zero.
Desired Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
I tried using the command
awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i; next}
{for (i=1; i<=NF; i++) if ($i>0) printf "%s%s", ((i>5)?hdr[i] ":":"") $i,
((i<NF)? ((i>5)?"|":",") : ORS)}' data.csv
but I am not getting the expected result.
Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,col6:1|col8:1|col10:1
1,1,1,1,col6:2|col7:3|5,1,1,col9:1|
When a row's last columns are zero, its output ends with a '|' and the next row's data continues on the same line!
In this example, row 2's data ends with a pipe '|' and row 3's data starts on row 2; row 4's data also ends with a pipe '|'.
Can someone help me fix this, please?
P.S.: For people looking for the reason behind all this work: I'm trying to load the data from a CSV file into a framework. The source data has 10 columns and the destination dataset will have 6 columns, the first 5 from the source as-is and the rest as a map. Also, I have to make sure that there is no map key with a value of zero before starting the data analysis on the set.
This post is to get help with making the data set ready for analysis.
$ awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next} {for (i=1; i<=5; i++) printf $i","; b=""; for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";} printf ORS}' data.csv
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
Or, written over multiple lines:
awk -F ', *' '
NR==1{
for (i=1; i<=NF; i++) hdr[i]=$i":"
next
}
{
for (i=1; i<=5; i++) printf $i","
b=""
for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
printf ORS
}
' data.csv
How it works
NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next}
For the first line, NR==1, we save each field with a trailing colon into the array hdr. Then the rest of the commands are skipped and we just go to the next line.
for (i=1; i<=5; i++) printf $i","
If we get here, we are working on the second or later lines. In this case, we print the first five fields, each followed by a comma.
b=""
We initialize the variable b to the empty string.
for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
For fields 6 to the end, if the field is nonzero, we print b followed by the hdr entry and then the field value. After we have encountered the first such nonzero field, b is set to |; a short trace of one row follows after this walkthrough.
printf ORS
After printing the last field, we print an output record separator (default is a newline).
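To make the separator logic concrete, here is roughly how the second data row (1,2,1,2,0,1,0,1,0,1) is handled:
fields 1-5          -> prints "1,2,1,2,0,"  (always printed, comma after each)
field 6 = 1, b=""   -> prints "col6:1", b becomes "|"
field 7 = 0         -> skipped
field 8 = 1         -> prints "|col8:1"
field 9 = 0         -> skipped
field 10 = 1        -> prints "|col10:1"
end of row          -> prints ORS, giving 1,2,1,2,0,col6:1|col8:1|col10:1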
The above solution is excellent and helps me with a similar issue. However, I need to cater for an all-zero case in columns 6 to 10. See the last line of your data below.
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0
5,1,1,0,0,0,0,0,0,0
This might never happen in your data; however, if it does, you are left with an inconvenient trailing comma at the end of the line:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
5,1,1,0,0,
To get around it I made a change. Here it is, somewhat spread out for clarity:
awk -F ', *' '
NR==1{
for (i=1; i<=NF; i++) hdr[i]=$i":"
next
}
{
for (i=1; i<5; i++) printf("%s,", $i);
if(i==5) printf("%s", $i);
b="";
for (i=6; i<=NF; i++) {
if ($i>0) {
if(b=="") b=","; else b="|";
printf("%s%s",b, hdr[i] $i);
}
}
printf(ORS);
}' data.csv
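Assuming the same data.csv with the extra all-zero row appended, the output should now end cleanly:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
5,1,1,0,0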
I need to use an awk script to extract some information from a file.
I have a title line which has 11 fields, and I split it into an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding a matching line, I need to print the fields, each preceded by its title. For example, if the result is:
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it in this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
Everything looks good except the result, which has 2 problems:
first, there is an extra space between each result, and second, the last field of the searched line is missing:
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
What should I do?
Is there a problem with the \n at the end of the search result?
You can correct the amount of extra whitespace between fields by correcting the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, you're missing the last value because the file you're working with probably has Windows line endings. To fix this, run dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separator so that it understands those line endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk (for the multi-character RS). Also, split() splits on FS by default so there is no need to pass it as an argument, it's common to use next rather than specifying opposite conditions, and it's common to use print "" instead of printf "\n" so that the ORS setting is used rather than hard-coding its value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt
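If GNU awk isn't available, a portable sketch (assuming the only problem is a trailing carriage return on each line) strips the \r explicitly instead of changing RS:
awk -F ":" '{ sub(/\r$/, "") }                    # drop a trailing CR if present; $0 is re-split
            NR == 1 { split($0, array); next }
            { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt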
I need to reorganize a large CSV file. The first column, which is currently a 6-digit number, needs to be split into its individual digits, using commas as the field separator.
For example, I need this:
022250,10:50 AM,274,22,50
022255,11:55 AM,275,22,55
turned into this:
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Let me know what you think!
Thanks!
It's a lot shorter in perl:
perl -F, -ane '$,=","; print split("",$F[0]), @F[1..$#F]' <file>
Since you don't know perl, a quick explanation. -F, indicates the input field separator is the comma (like awk). -a activates auto-split (into the array @F), -n implicitly wraps the code in a while (<>) { ... } loop, which reads input line by line. -e indicates the next argument is the script to run. $, is the output field separator (it gets set on every iteration of the loop this way, but oh well). split has its obvious purpose, and you can see how the array is indexed/sliced. print, when given lists as arguments like this, uses the output field separator between all of the elements it prints.
In awk:
awk -F, '{n=split($1,a,""); for (i=1;i<=n;i++) {printf("%s,",a[i])}; for (i=2;i<NF;i++) {printf("%s,",$i)}; print $NF}' <file>
I think this might work. The split function (at least in the version I am running) splits the value into individual characters if the third parameter is an empty string.
BEGIN{ FS="," }
{
n = split( $1, a, "" );
for ( i = 1; i <= n; i++ )
printf("%s,", a[i] );
sep = "";
for ( i = 2; i <= NF; i++ )
{
printf( "%s%s", sep, $i );
sep = ",";
}
printf("\n");
}
Here's another way in awk: gsub(".",",&",$1) inserts a comma before every character of the first field (& stands for the matched text), and sub("^,","",$1) then strips the leading comma:
$ awk -F"," '{gsub(".",",&",$1);sub("^,","",$1)}1' OFS="," file
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Here's a variation on a theme. One thing to note is it prints the remaining fields without using a loop. Another is that since you're looping over the characters in the first field anyway, why not just do it without using the null-delimiter feature of split() (which may not be present in some versions of AWK):
awk -F, 'BEGIN{OFS=","} {len=length($1); for (i=1;i<len; i++) {printf "%s,", substr($1,i,1)}; printf "%s", substr($1,len,1);$1=""; print $0}' filename
As a script:
BEGIN {FS = OFS = ","}
{
len = length($1);
for (i=1; i<len; i++)
{printf "%s,", substr($1, i, 1)};
printf "%s", substr($1, len, 1)
$1 = "";
print $0
}
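If the script is saved to a file (say split_first_col.awk, a name used here just for illustration), it can be run against the CSV like this:
awk -f split_first_col.awk yourfile.csv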