Split large CSV file based on column value cardinality - awk

I have a large CSV file with the following line format:
c1,c2
I would like to split the original file into two files as follows:
One file will contain the lines where the value of c1 appears exactly once in the file.
Another file will contain the lines where the value of c1 appears twice or more in the file.
Any idea how it can be done?
For example, if the original file is:
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
I would like to produce the following files:
3,foo
4,bar
and
1,foo
2,bar
2,foo
1,bar

This one-liner generates two files, o1.csv and o2.csv:
awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' file file
test:
kent$ cat f
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
kent$ awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' f f
kent$ head o*
==> o1.csv <==
3,foo
4,bar
==> o2.csv <==
1,foo
2,bar
2,foo
1,bar
Note:
awk reads your file twice instead of holding the whole file in memory.
The original order of the lines is retained.
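For readability, here is the same one-liner written out with comments (functionally identical):
awk -F, '
NR==FNR {                                  # first pass: count how often each c1 value occurs
    a[$1]++
    next
}
{                                          # second pass: route each line by its c1 count
    print > "o"(a[$1]==1 ? "1" : "2")".csv"
}' file file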

Depending on what you mean by big, this might work for you.
It has to hold the lines in the associative array done until it sees the 2nd use, or until the end of the file. When a 2nd use is seen, the remembered data is changed to "!" to avoid printing it again on a 3rd or later match. Stdout (file1) receives the lines whose c1 value repeats; file2 collects the remaining unique lines in the END block.
>file2    # truncate file2 up front, since the program below appends to it
awk -F, '
{
    if (done[$1] != "") {
        if (done[$1] != "!") {
            print done[$1]          # first repeat seen: emit the remembered line
            done[$1] = "!"          # mark it as already printed
        }
        print
    } else {
        done[$1] = $0               # remember the line until a repeat shows up
        order[++n] = $1             # record original order for the END pass
    }
}
END {
    for (i = 1; i <= n; i++) {
        out = done[order[i]]
        if (out != "!") print out >> "file2"
    }
}
' <csvfile >file1

I'd break out Perl for this job
#!/usr/bin/env perl
use strict;
use warnings;

my %count_of;
my @lines;

open ( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file
while ( <$input> ) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
    push ( @lines, [ $c1, $c2 ] );
}
close ( $input );

print "File 1:\n";
#filter any single elements
foreach my $pair ( grep { $count_of{$_->[0]} < 2 } @lines ) {
    print join ( ",", @$pair );
}

print "File 2:\n";
#filter any repeats.
foreach my $pair ( grep { $count_of{$_->[0]} > 1 } @lines ) {
    print join ( ",", @$pair );
}
This will hold the whole file in memory, but given your data, you don't save much space by processing it twice and maintaining a count.
However, you could do:
#!/usr/bin/env perl
use strict;
use warnings;

my %count_of;
open( my $input, '<', 'your_file.csv' ) or die $!;
#read the whole file counting "c1"
while (<$input>) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
}

open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe,   '>', "output_dupes.csv" )   or die $!;

seek( $input, 0, 0 );
while ( my $line = <$input> ) {
    my ($c1) = split( ",", $line );
    if ( $count_of{$c1} > 1 ) {
        print {$output_dupe} $line;
    }
    else {
        print {$output_single} $line;
    }
}
close($input);
close($output_single);
close($output_dupe);
This minimises memory use by retaining only the counts: it reads the file once to count the c1 values, then seeks back to the start, processes it a second time, and prints each line to the appropriate output.

Related

Swapping / rearranging of columns and its values based on inputs using Unix scripts

Team,
I have a requirement to change/order the columns of CSV files based on inputs.
Example: the data file (source file) will always have the standard columns and their values, for example:
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis
These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) will be standard for source files; what I am looking for is a solution to reformat this and generate a new file with only the columns we prefer.
Note: the source/data file will always be comma delimited.
Example: if I pass PRODUCTCODE,BATCHID as input, then I would like only those columns and their data extracted from the source file into a new file.
Something like script_name <output_column> <Source_File_name> <target_file_name>
Target file example:
PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1
If I pass output_column as "LV1P_DESCRIPTION,PRODUCTCODE" then the output file should be like below:
LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envois,MK3
flatsylio,MK3
bioCovis,MK3
It would be great if anyone could help with this.
I have tried an awk script (taken from another site), but it was not working as expected, and since I don't have Unix knowledge I am finding it difficult to modify.
awk code:
BEGIN {
    FS = ","
}
NR==1 {
    split(c, ca, ",")
    for (i = 1; i <= length(ca); i++) {
        gsub(/ /, "", ca[i])
        cm[ca[i]] = 1
    }
    for (i = 1; i <= NF; i++) {
        if (cm[$i] == 1) {
            cc[i] = 1
        }
    }
    if (length(cc) == 0) {
        exit 1
    }
}
{
    ci = ""
    for (i = 1; i <= NF; i++) {
        if (cc[i] == 1) {
            if (ci == "") {
                ci = $i
            } else {
                ci = ci "," $i
            }
        }
    }
    print ci
}
The above code is saved as Remove.awk and is called by another script as below:
var1="BATCHID,LV2P_DESCRIPTION"
## these are the input field values used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file
Explanation:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" '   # Pass the fields to print in the variable flds
BEGIN {
    split(flds,map,",")              # Split flds into an array map using , as the delimiter
}
NR==1 {
    for (i=1;i<=NF;i++) {
        map1[$i]=i                   # Loop through the header and create an array map1 with the column header as the index and the column number as the value
    }
}
{
    printf "%s",$map1[map[1]]        # Print the first field specified (first index of map)
    for (i=2;i<=length(map);i++) {
        printf ",%s",$map1[map[i]]   # Loop through the other fields specified, printing their contents
    }
    printf "\n"
}' file
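Note that calling length() on an array is a GNU awk extension and is not guaranteed in a strictly POSIX awk. If portability matters, split() already returns the number of elements it produced, so a variant of the same program (a sketch, otherwise identical logic) could capture that count instead:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" '
BEGIN { nf = split(flds,map,",") }         # nf holds the number of requested columns
NR==1 { for (i=1;i<=NF;i++) map1[$i]=i }
{
    printf "%s",$map1[map[1]]
    for (i=2;i<=nf;i++) printf ",%s",$map1[map[i]]
    printf "\n"
}' file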

awk | Rearrange fields of CSV file on the basis of column value

I need your help writing awk for the below problem. I have one source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not in any fixed order in the source file.
In the output file, fields are organised in a specific order: for example, all a values are in the 2nd column, then b, then c.
The value c appears several times in the second line, so in the output its values are merged with the PIPE symbol.
Please help.
This will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
    split("",val)   # or delete(val) in gawk
    for (i=1; i<NF; i+=2) {
        val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
    }
    for (i=1; i in order; i++) {
        name = order[i]
        printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
    }
    print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
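For example, to have the e values printed last, only the BEGIN line changes:
BEGIN{ FS="[,:]"; split("session,a,b,c,e",order) }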
Try with:
awk '
BEGIN {
    FS  = "[,:]"
    OFS = ","
}
{
    for ( i = 1; i <= NF; i += 2 ) {
        if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
        hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
    }
    asorti( hash, hash_orig )
    for ( i = 1; i <= length(hash); i++ ) {
        printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
    }
    printf "\n"
    delete hash
    delete hash_orig
}
' infile
It splits each line on any comma or colon and traverses the odd-numbered fields, saving them and their values in a hash that is printed at the end of each line. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9

using awk to replace a string by another string (mapping table contained in a file)

I would like to know if it's possible to do the following thing only using
awk:
I am searching for some regex in a file F, and I want to replace the string (S1) that matches the regex with another string (S2). Of course, it's easy to do that with awk. But ... my problem is that the value of S2 has to be obtained from another file that maps S1 to S2.
Example :
file F:
abcd 168.0.0.1 qsqsjsdfjsjdf
sdfsdffsd
168.0.0.2 sqqsfjqsfsdf
my associative table in another file
168.0.0.1 foo
168.0.0.2 bar
I want to get:
this result:
abcd foo qsqsjsdfjsjdf
sdfsdffsd
bar sqqsfjqsfsdf
Thanks for the help!
edit: if my input file is a bit different, like this (no space before IP address):
file F:
abcd168.0.0.1 qsqsjsdfjsjdf
sdfsdffsd
168.0.0.2 sqqsfjqsfsdf
I can't use the $1, $2 variables to search in the associative array.
I tried something like this (based on birei's proposition) but it did not work:
FNR < NR {
    sub( /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/, assoc[ & ] );
    print
}
Is there a way to look up the matched string in the associative array (assoc[ & ] seems not to be valid)?
One way. It's self-explanatory: save the data from the associative table in an array, and for each field of the second file check whether it matches a key of the array:
awk '
FNR == NR {
    assoc[ $1 ] = $2;
    next;
}
FNR < NR {
    for ( i = 1; i <= NF; i++ ) {
        if ( $i in assoc ) {
            $i = assoc[ $i ]
        }
    }
    print
}
' associative_file F
Output:
abcd foo qsqsjsdfjsjdf
sdfsdffsd
bar sqqsfjqsfsdf
EDIT: Try the following awk script for IPs that have no spaces separating them from the surrounding words. It's similar to the previous one, but now it loops over the array, trying to find each IP anywhere in the line (gsub defaults to $0) and substituting it.
awk '
FNR == NR {
    assoc[ $1 ] = $2;
    next;
}
FNR < NR {
    for ( key in assoc ) {
        gsub( key, assoc[ key ] )
    }
    print
}
' associative_file F
Assuming an input file with the content of your second example of file F, the output would be:
abcdfoo qsqsjsdfjsjdf
sdfsdffsd
bar sqqsfjqsfsdf
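One caveat: gsub() treats key as a regular expression, so the dots in an IP also match any character (168.0.0.1 would also replace, say, 168a0b0c1). If that could bite, the dots can be escaped before substituting; a sketch of the second block with that guard added:
FNR < NR {
    for ( key in assoc ) {
        re = key
        gsub( /\./, "\\\\.", re )      # turn each "." into "\." so it only matches literally
        gsub( re, assoc[ key ] )
    }
    print
}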

can awk replace fields based on separate specification file?

I have an input file like this:
SomeSection.Foo
OtherSection.Foo
OtherSection.Goo
...and there is another file describing which object(s) belong to each section:
[SomeSection]
Blah
Foo
[OtherSection]
Foo
Goo
The desired output would be:
SomeSection.2 // that's because Foo appears 2nd in SomeSection
OtherSection.1 // that's because Foo appears 1st in OtherSection
OtherSection.2 // that's because Goo appears 2nd in OtherSection
(The numbers and names of sections and objects are variable)
How would you do such a thing in awk?
Thanks in advance,
Adrian.
One possibility:
Content of script.awk (with comments):
## When 'FNR == NR', the first input file is being processed.
## If the line begins with '[', get the section string and reset the position
## of its objects.
FNR == NR && $0 ~ /^\[/ {
    object = substr( $0, 2, length($0) - 2 )
    pos = 0
    next
}
## This section processes the objects of each section. It saves them in
## an array. Variable 'pos' increments with each object processed.
FNR == NR {
    arr_obj[object, $0] = ++pos
    next
}
## This section processes the second file. It splits the line on '.' to find the
## second part in the array and prints the result.
FNR < NR {
    ret = split( $0, obj, /\./ )
    if ( ret != 2 ) {
        next
    }
    printf "%s.%d\n", obj[1], arr_obj[ obj[1] SUBSEP obj[2] ]
}
Run the script (the order of the input files is important: object.txt has the sections with their objects, and input.txt the lookups):
awk -f script.awk object.txt input.txt
Result:
SomeSection.2
OtherSection.1
OtherSection.2
EDIT to a question in comments:
I'm not an expert but I will try to explain how I understand it:
SUBSEP is the character used to separate indexes in an array when you want to use several values as the key. By default it is \034, although you can modify it like RS or FS.
In the instruction arr_obj[object, $0] = ++pos, the comma joins the values with the value of SUBSEP, so in this case it would result in:
arr_obj[SomeSection\034Blah] = 1
At the end of the script I access the index using that variable explicitly, arr_obj[ obj[1] SUBSEP obj[2] ], with the same meaning as arr_obj[object, $0] in the previous section.
You can also access each part of this index by splitting it on the SUBSEP variable, like this:
for (key in arr_obj) {                  ## Assigns 'string\034string' to the 'key' variable
    split( key, key_parts, SUBSEP )     ## Split 'key' on the content of the SUBSEP variable.
    ...
}
with a result of:
key_parts[1] -> SomeSection
key_parts[2] -> Blah
This awk line should do the job (the field separator "[\\.\\]\\[]" treats ., [ and ] all as delimiters, so a section header like [SomeSection] leaves the section name in $2):
awk 'BEGIN{FS="[\\.\\]\\[]"}
NR==FNR{ if(NF>1){ i=1; idx=$2; }else{ s[idx"."$1]=i; i++; } next; }
{ if($0 in s) print $1"."s[$0] } ' f2 input
see test below:
kent$ head input f2
==> input <==
SomeSection.Foo
OtherSection.Foo
OtherSection.Goo
==> f2 <==
[SomeSection]
Blah
Foo
[OtherSection]
Foo
Goo
kent$ awk 'BEGIN{FS="[\\.\\]\\[]"}
NR==FNR{ if(NF>1){ i=1; idx=$2; }else{ s[idx"."$1]=i; i++; } next; }
{ if($0 in s) print $1"."s[$0] } ' f2 input
SomeSection.2
OtherSection.1
OtherSection.2

Concatenating multiple lines with a discriminator

I have the input like this
Input:
a,b,c
d,e,f
g,h,i
k,l,m
n,o,p
q,r,s
I want to be able to concatenate the lines with a discriminator like "|"
Output:
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
The file has 1 million lines, and I want to be able to concatenate lines as in the example above.
Any ideas about how to approach this?
@OP, if you want to group them for every 3 records:
$ awk 'ORS=(NR%3==0)?"\n":"|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
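Written out, the one-liner uses the assignment to ORS as the pattern; both "|" and "\n" are non-empty strings (hence true), so the default print action runs for every line. An equivalent, more explicit form:
awk '{ ORS = (NR % 3 == 0) ? "\n" : "|"; print }' file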
with Perl,
$ perl -lne 'print $_ if $\ = ($. % 3 == 0) ? "\n" : "|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
Since your tags include sed, here is a way to use it:
sed 'N;N;s/\n/|/g' datafile
Each N appends the next input line to the pattern space, so three lines are held together before the s command turns the embedded newlines into pipes.
gawk:
BEGIN {
    state = 0
}
state==0 {
    line = $0
    state = 1
    next
}
state==1 {
    line = line "|" $0
    state = 2
    next
}
state==2 {
    print line "|" $0
    state = 0
    next
}
If Perl is fine, you can try:
my $line = "";
while (<>) {
    chomp;
    $line .= ( $line eq "" ? "" : "|" ) . $_;   # join lines within a group with "|"
    unless ( $. % 3 ) {                         # $. is the current input line number
        print "$line\n";
        $line = "";
    }
}
print "$line\n" if $line ne "";                 # flush a final partial group
to run:
perl perlfile.pl 1millionlinesfile.txt
$ paste -sd'|' input | sed -re 's/([^|]+\|[^|]+\|[^|]+)\|/\1\n/g'
With paste, we join the lines together, and then sed dices them up. The pattern grabs runs of 3 pipe-terminated fields and replaces their respective final pipes with newlines.
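As a side note, paste alone can also do the grouping: naming standard input three times makes it consume three lines per output row, joined with the given delimiter:
$ paste -d'|' - - - < input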
With Perl:
#! /usr/bin/perl -ln
push @a => $_;
if (@a == 3) {
    print join "|" => @a;
    @a = ();
}
END { print join "|" => @a if @a }