I have collected the following file:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA
This is a ;-separated file with 4 columns. The combination of column 2 and 3, however, must be unique. Since this dataset has millions of rows, I'm looking for an efficient way to keep only the first occurrence of every duplicate. I therefore need to match on the combination of column 2 and 3 and then select the first one.
The expected outcome should be:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #REMOVED
I have made a few attempts myself. The first one is:
grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq
However, this selects only columns 2 and 3 and removes all other data. Furthermore, it does not take into account a match that occurs later. I can fix that by adding sort, but I prefer not to sort.
Another attempt is:
awk '!x[$0]++' test.txt
This does not require any sorting, but matches the complete line.
I think the second attempt is close, but it needs to be changed so that it only looks at the second and third columns instead of the whole line. Does anyone know how to incorporate this?
Here you go:
awk -F';' '!a[$2 FS $3]++' file
test with your data:
kent$ awk -F';' '!a[$2 FS $3]++' f
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
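For clarity, here is the same one-liner written out long-hand (just a sketch of what it does; including FS in the key keeps different field pairs, say a hypothetical 34338 + 07591 and 343380 + 7591, from collapsing into the same key):
awk -F';' '{
    key = $2 FS $3          # combination of column 2 and 3
    if (!(key in a))        # first occurrence of this combination
        print
    a[key]++
}' file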
Related
So, I have CSV files to use with hledger, and the last field of every row is the amount for that transaction.
Lines are in the following format:
date1, date2, description, amount
The amount can be anywhere between 4 and 6 digits long; now, for some reason, all amounts are missing the period before the last two digits.
Now: 1000
Should be: 10.00
Now: 25452
Should be: 254.52
How can I add a '.' before the last two digits of all lines, preferably with sed/awk?
So the input file is:
16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
18.12.2005,20.12.2005,salary,459023
desired output
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
Thanks
You could try:
awk -F, '{printf "%s,%s,%s,%.2f\n", $1, $2, $3, $4/100.0}' file
You should always add a sample of your input file and of the output you want in your question.
For the input you provide, you will have to define what should happen when the description field contains a comma, and whether an amount of less than 100 is possible as input.
Depending on your answer, I may or may not need to adapt the code.
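In the meantime, here is a sketch that only touches the last field, which should also cope with commas inside the description and with amounts below 100 (untested against your real data):
awk -F, -v OFS=, '{ $NF = sprintf("%.2f", $NF / 100) } 1' file
Rebuilding only $NF leaves the earlier fields alone, and sprintf pads small amounts, so 5 would become 0.05.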
sed 's/..$/.&/'
The regex ..$ matches the last two characters of the line, and & re-inserts whatever was matched, so the substitution places a period before the final two digits.
You can also use the cut utility to get the desired output. In your case, you always want to add '.' before the last two digits, so essentially it can be thought of as something like this:
Step 1: Get all the characters from the beginning till the last 2 characters.
Step 2: Get the last 2 characters from the end.
Step 3: Concatenate them with the character that you want ('.' in this case).
The corresponding command for each of the steps is the following:
$ a='17.12.2005,18.12.2005,utility,12523'
$ b=`echo $a | rev | cut -c3- | rev`
$ c=`echo $a | rev | cut -c1-2 | rev`
$ echo $b"."$c
This would produce the output
17.12.2005,18.12.2005,utility,125.23
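Applied to the whole file, the same idea could look like this (a sketch using a shell loop; slow on large files, but it mirrors the three steps above):
while IFS= read -r a; do
    b=$(echo "$a" | rev | cut -c3- | rev)    # everything except the last two characters
    c=$(echo "$a" | rev | cut -c1-2 | rev)   # the last two characters
    echo "$b.$c"
done < file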
Or, with awk:
awk -F, '{sub(/..$/,".&")}1' file
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
This question already has answers here:
How can I use ":" as an AWK field separator?
(9 answers)
Closed 3 years ago.
I have a file where I'd like to check the content of 4 columns. The order can be reversed between pairs of these columns: if the columns are a,b,c,d then they can also appear as c,d,a,b. So columns a,b and c,d are locked together as pairs, but the pairs can be swapped with each other.
I found a similar post here
remove redundancy in a file based on two fields, using awk
However, the solution does not work at all, even with just two columns:
a;b
d;a
b;a
r;f
r;y
a;b
a;d
If I apply the solutions that were provided and accepted as correct, I end up with duplicates:
$ awk '!seen[$1,$2]++ && !seen[$2,$1]++' file
a;b
d;a
b;a
r;f
r;y
a;d
As you can see, a;b and b;a are both still there.
Any suggestion to make this work, considering also that there would be four columns? For example:
Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Berlin;Germany;Dallas;Texas
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas
Should give
Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas
Note that the last line should not be deleted, because it is a different record: a,b and c,d are locked pairs, so a,b,c,d and c,d,a,b count as duplicates of each other, but no other orderings do.
Well, for the original example with two fields, you missed defining ; as the input field separator. The same would have worked had you run it as:
awk -F';' '!seen[$1,$2]++ && !seen[$2,$1]++' file
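With the field separator defined, your two-column sample should reduce to:
a;b
d;a
r;f
r;y
(both the a;b/b;a and the d;a/a;d pairs are now collapsed to their first occurrence)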
For multiple fields in a row on a delimiter, it is better to sort those fields into alphabetical order and build the key from the sorted result. The logic below works irrespective of the order of the elements in a line.
It needs GNU awk because of the asort() function.
Explicit input and output field separators are not needed in the below case, because on every line we split on ; only to construct the unique key, and we print the whole line when that key is new.
awk '{
    split($0, arr, ";")
    asort(arr)
    key = ""
    for (i = 1; i <= length(arr); i++) {
        key = ( key FS arr[i] )
    }
}!unique[key]++' file
In the so-called one-liner (a.k.a. unreadable) form:
awk '{ split($0, arr, ";"); asort(arr); key=""; for (i=1; i<=length(arr); i++) { key = ( key FS arr[i]) }; }!unique[key]++' file
As noted in the comments, if the only possible alternate for a,b,c,d is c,d,a,b, then the below would suffice:
awk -F';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++' file
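Running that against your four-column sample should give exactly the expected output:
Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas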
This question already has answers here:
Finding max value of a specific date awk
(3 answers)
Closed 6 years ago.
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
I have the line positions of two lines in a file, say line1 and line2. These lines may be anywhere in the file, but I can locate each one using a search keyword based on the name (the first word) in each line.
20160801 means yyyymmdd, and each date has an associated value separated by |.
I need to compare the values associated with each of the dates for the given two lines.
I am a newbie in awk and am not sure how to compare these two lines at the same time.
Your question is not at all clear. Perhaps the first step is to clearly articulate: 1) what problem am I trying to solve? 2) what tools or data do I have to solve it?
The only hints specific to your question I can offer (since your problem statement is not clearly articulated) are these:
In awk, you can compare two different files by using the test FNR==NR which is only true on the first file.
You can find the key words by using a regular expression of the form /^name1/ which means lines that start with that pattern
You can split on a delimiter in awk by setting the field separator to that delimiter; in this case it sounds like that is |, although you are also comparing whitespace-delimited fields inside of those fields.
You can compare by saving the data from the first line and comparing with the data from the second line in the other file once you can articulate what 'compare' means to you.
Wrapping that up, given:
$ cat /tmp/f1.txt
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
$ cat /tmp/f2.txt
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
You can find the lines in question like so:
$ awk -F"|" '/^name/ && FNR==NR {print $1}' f1.txt f2.txt
name1 20160801
$ awk -F"|" '/^name/ && FNR<NR {print $1}' f1.txt f2.txt
name2 20160801
(I have only printed the first field for clarity)
Then use that to compare. Save the first in an associative array and then compare the second when found.
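Putting those hints together, a minimal sketch (assuming whitespace-separated date|value fields and that "compare" simply means printing the two values side by side for each date):
awk '
    NR == FNR {                      # first file: remember the name1 value for each date
        if (/^name1/) {
            for (i = 2; i <= NF; i++) {
                split($i, p, "|")
                first[p[1]] = p[2]
            }
        }
        next
    }
    /^name2/ {                       # second file: print date, name1 value, name2 value
        for (i = 2; i <= NF; i++) {
            split($i, p, "|")
            print p[1], first[p[1]], p[2]
        }
    }
' f1.txt f2.txt
Replace the final print with whatever comparison you actually need (difference, maximum, and so on).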
I don't know how to word this well. I have an input file where the first column of each row is an index. I need to convert this input file into a multi-column output file so that all rows sharing the same index are stacked into the same output column.
I have an input file in the following format:
1 11.32 12.55
1 13.32 17.55
1 56.77 33.22
2 34.22 1.112
3 12.13 13.14
3 12.55 34.55
3 22.44 12.33
3 44.32 77.44
The expected output should be:
1 11.32 12.55 2 34.22 1.112 3 12.13 13.14
1 13.32 17.55 3 12.55 34.55
1 56.77 33.22 3 22.44 12.33
3 44.32 77.44
Is there an easy way I can do this in awk?
Something like this, in bash:
paste <(grep '^1 ' input.txt) <(grep '^2 ' input.txt) <(grep '^3 ' input.txt)
paste has an option to set the delimiter if you don't want the default tab characters used, or you could post-process the tabs with expand...
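For example, to use a single space instead of the default tab (just illustrating the -d option):
paste -d' ' <(grep '^1 ' input.txt) <(grep '^2 ' input.txt) <(grep '^3 ' input.txt)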
EDIT: For an input file with many more tags, you could take this sort of approach:
awk '{print > ("/tmp/output" $1 ".txt")}' input.txt
paste /tmp/output*.txt > final-output.txt
The awk line outputs each line to a file named after the first field of the line, then paste recombines them.
EDIT: as pointed out in a comment below, you might have issues if you end up with more than 9 intermediate files. One way around that would be something like this:
paste /tmp/output[0-9].txt /tmp/output[0-9][0-9].txt > final-output.txt
Add additional arguments as needed if you have more than 99 files... or more than 999... If that's the case, though, a python or perl solution might be a better route...
If all you need is independently running columns (without trying to line up matching items between the columns or anything like that) then the simplest solution might be something like:
awk '{print > ($1 ".OUT")}' FILE; paste 1.OUT 2.OUT 3.OUT
The only issue with that is that it won't fill in missing columns, so you will need to fill those in yourself to line up your columns.
If the column width is known in advance (and the same for every column) then using:
paste 1.OUT 2.OUT 3.OUT | sed -e 's/^\t/ \t/;s/\t\t/\t \t/'
where those spaces are the width of the column should get you what you want. I feel like there should be a way to do this in a more automated fashion but can't think of one offhand.
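One rough way to do that padding with awk (a sketch, assuming tab-separated paste output; the column width, here 16 characters, still has to be known in advance):
paste 1.OUT 2.OUT 3.OUT | awk -F'\t' -v OFS='\t' '{
    for (i = 1; i <= NF; i++)
        if ($i == "") $i = sprintf("%-16s", "")   # pad empty cells to the column width
    print
}'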
I have a (very) basic understanding of AWK. I have tried a few ways of doing this, but all of them print out far more lines than I want:
I have 10 lines in file.1:
chr10 234567
chr20 123456
...
chrX 62312
I want to convert to uppercase and match on the first 2 columns of file.2, so the first line below matches the second line above. But I don't want to get the second line below, which matches the third line above on position but not on chr, and I don't want the first line below to match the first line above.
CHR20 123456 ... 234567
CHR28 234567 ... 62312
I have:
$ cat file.1 | tr '[:lower:]' '[:upper:]' | <grep? awk?>
and would love to know how to proceed. I had used a simple grep previously, but the second column of file.1 matches much more in the searched file, so I get hundreds of lines returned. I want to match on just the first 2 columns (they correspond to the first 2 columns in file.2).
Hope that's clear enough for you; I look forward to your answers. =)
If the files are sorted by the first column you can do:
join -i file.1 file.2 | awk '$3==$2{ $3=""; print}'
If they're not sorted, sort them first.
The -i flag says to ignore case.
That won't work if there are multiple lines with the same field in the first column. To make that work, you would need something more complicated, for example:
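Something along these lines (a sketch, not tested against your real data) builds a lookup key from the first two columns of file.1, uppercased, and then prints the lines of file.2 whose first two columns match:
awk 'NR == FNR { want[toupper($1), $2]; next }   # file.1: remember uppercased col1 and col2
     (toupper($1), $2) in want' file.1 file.2
Because the key uses both columns, a matching position on the wrong chr no longer gets picked up, and repeated values in either column are handled.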