For some statistics research, I want to separate my data which have duplicated value in first column. I work with vim.
suppose that a part of my data is like this:
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
as you can see, some lines have equal values in first column,
i want to generate two separated files, which one of them contains just non repeated values and the other contains lines with equal first column value.
for above example i want to have these two files:
first one contains:
Item_ID Customer_ID
123 200
734 500
123 345
734 546
and second one contains:
Item_ID Customer_ID
104 134
764 347
1000 235
can anybody help me?
I think awk would be a better option here.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] > 1' input.txt input.txt > dup.txt
Prettier version of awk code:
FNR == NR {
seen[$1]++;
next
}
seen[$1] == 1
Overview
We loop over the text twice. By supplying the same file to our awk script twice we are effectively looping over the text twice. First time though the loop count the number of times we see our field's value. The second time though the loop output only the records which have a field value count of 1. For the duplicate line case we only output lines which have field value counts greater than 1.
Awk primer
awk loops over lines (or records) in a text file/input and splits each line into fields. $1 for the first field, $2 for the second field, etc. By default fields are separated by whitespaces (this can be configured).
awk runs each line through a series of rules in the form of condition { action }. Any time a condition matches then action is taken.
Example of printing the first field which line matches foo:
awk '/foo/ { print $1 }` input.txt
Glory of Details
Let's take a look at finding only the unique lines which the first field only appears once.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
Prettier version for readability:
FNR == NR {
seen[$1]++;
next
}
seen[$1] == 1
awk 'code' input > output - run code over the input file, input, and then redirect the output to file, output
awk can take more than one input. e.g. awk 'code' input1.txt input2.txt.
Use the same input file, input.txt, twice to loop over the input twice
awk 'FNR == NR { code1; next } code2' file1 file2 is a common awk idiom which will run code1 for file1 and run code2 for file2
NR is the current record (line) number. This increments after each record
FNR is the current file's record number. e.g. FNR will be 1 for the first line in each file
next will stop executing any more actions and go to the next record/line
FNR == NR will only be true for the first file
$1 is the first field's data
seen[$1]++ - seen is an array/dictionary where we use the first field, $1, as our key and increment the value so we can get a count
$0 is the entire line
print ... prints out the given fields
print $0 will print out the entire line
just print is short for print $0
condition { print $0 } can be shorted to condition { print } which can be shorted further to just condition
seen[$1] == 1 which check to see if the first field's value count is equal to 1 and print the line
Here is an awk solution:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > (a[b[i]]==1?"single":"multiple")}' file
cat single
104 134
764 347
1000 235
cat multiple
123 200
734 500
123 345
734 546
PS I skipped the first line, but it could be implemented.
This way you get one file for single hits, one for double, one for triple etc.
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > "file"a[b[i]]}'
That would require some filtering of the list of lines in the buffer. If you're really into statistics research, I'd go search for a tool that is better suited than a general-purpose text editor, though.
That said, my PatternsOnText plugin has some commands that can do the job:
:2,$DeleteUniqueLinesIgnoring /\s\+\d\+$/
:w first
:undo
:2,$DeleteAllDuplicateLinesIgnoring /\s\+\d\+$/
:w second
As you want to filter on the first column, the commands' /{pattern}/ has to filter out the second column; \s\+\d\+$ matches the final number and its preceding whitespace.
:DeleteUniqueLinesIgnoring (from the plugin) gives you just the duplicates, :DeleteAllDuplicateLinesIgnoring just the unique lines. I simply :write them to separate files and :undo in between.
Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have an CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will be typically alphanumeric strings (server names), though it may contain also "-" and/or "_".
4) Columns 6 and 7 will have exclusively numbers, with up to two decimal places (A possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of it's values in column 6 (as column 2) and last, the sum (subtotal) of it's values in column 7 (as column 3).
- If the total for a field is "0" the field must be left empty, but still must exist (it's respective comma has to be printed).
- **Note** that the strings in column 10 will be already alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one, but two AWK oneliners that PARTIALLY accomplish what it need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it does simply not print the third column at all.
Thanks in advance for your support people!
Regards,
MartÃn
something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
note that your treatment of commas is not consistent, you're adding an extra one when the last field is zero (count the commas)
Your posted expected output doesn't seem to match your posted sample input so we're guessing but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
if (NR > 1) {
print prev, sum6, sum7
}
sum6 = sum7 = ""
prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
Here is what I am doing.
The text file is comma separated and has three field,
and I want to extract all the line containing the same second field
more than three times.
Text file (filename is "text"):
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
6,keyword2,content5
6,keyword2,content5
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
My command is like below. cat the whole text file inside awk and grep with the second field of each line and count the number of the line.
If the number of the line is greater than 2, print the whole line.
The command:
awk -F "," '{ "cat text | grep "$2 " | wc -l" | getline var; if ( 2 < var ) print $0}' text
However, the command output contains only first three consecutive lines,
instead of printing also last three lines containing "keyword1" which occurs in the text for six times.
Result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
My expected result:
11,keyword1,content1
4,keyword1,content3
5,keyword1,content2
7,keyword1,content4
8,keyword1,content2
1,keyword1,content2
Can somebody tell me what I am doing wrong?
It is relatively straight-forward to make just two passes over the file. In the first pass, you count the number of occurrences of each value in column 2. In the second pass, you print out the rows where the value in column 2 occurs more than your threshold value of 3 times.
awk -F, 'FNR == NR { count[$2]++ }
FNR != NR { if (count[$2] > 3) print }' text text
The first line of code handles the first pass; it counts the occurrences of each different value of the second column.
The second line of code handles the second pass; if the value in column 2 was counted more than 3 times, print the whole line.
This doesn't work if the input is only available on a pipe rather than as a file (so you can't make two passes over the data). Then you have to work much harder.
I am beginner in AWK, so please help me to learn it. I have a text file with name snd and it values are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column value is minimu
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. If you know that the 3rd column is always positive (not negative), you can also skip the NR == 1 ... test as min's value at the beginning of the script is zero.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{line=$0} - On each line, compare the third field against our stored value, and if the condition is true, store the line. (We could make this run only on NR>1, but it doesn't matter.
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
a short answer for this would be:
sort -k3,3n temp|head -1
since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But i prefer the shorter one always.
For calculating the smallest value in any column , let say last column
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
this will only print the smallest value of the column.
In case if complete line is needed better to use sort:
sort -r -n -t [delimiter] -k[column] [file name]
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
this will print the line with smallest value which is encountered first.
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" because when 1st value of field 1 is encountered it will have null in a[$1].