Manipulating the awk output depending on the number of occurrences - awk

I'm not sure how to word this well. I have an input file in which the first column of each row is an index. I need to convert this file into a multi-column output file, where the rows for each index form their own column and the columns all start on the same row.
I have an input file in the following format:
1 11.32 12.55
1 13.32 17.55
1 56.77 33.22
2 34.22 1.112
3 12.13 13.14
3 12.55 34.55
3 22.44 12.33
3 44.32 77.44
The expected output should be:
1 11.32 12.55 2 34.22 1.112 3 12.13 13.14
1 13.32 17.55 3 12.55 34.55
1 56.77 33.22 3 22.44 12.33
3 44.32 77.44
Is there an easy way I can do this in awk?

Something like this, in bash:
paste <(grep '^1 ' input.txt) <(grep '^2 ' input.txt) <(grep '^3 ' input.txt)
paste has an option to set the delimiter if you don't want the default tab characters used, or you could post-process the tabs with expand...
EDIT: For an input file with many more tags, you could take this sort of approach:
awk '{print > ("/tmp/output" $1 ".txt")}' input.txt
paste /tmp/output*.txt > final-output.txt
The awk line outputs each line to a file named after the first field of the line, then paste recombines them.
EDIT: as pointed out in a comment below, you might have issues if you end up with more than 9 intermediate files. One way around that would be something like this:
paste /tmp/output[0-9].txt /tmp/output[0-9][0-9].txt > final-output.txt
Add additional arguments as needed if you have more than 99 files... or more than 999... If that's the case, though, a Python or Perl solution might be a better route...
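Alternatively, if your sort is GNU sort, a version sort keeps the numeric file order without spelling out each width class by hand (a sketch, assuming the file names contain no whitespace):
paste $(printf '%s\n' /tmp/output*.txt | sort -V) > final-output.txt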

If all you need is independently running columns (without trying to line up matching items between the columns or anything like that) then the simplest solution might be something like:
awk '{print > ($1 ".OUT")}' FILE; paste 1.OUT 2.OUT 3.OUT
The only issue with that is it won't fill in missing columns so you will need to fill those in yourself to line up your columns.
If the column width is known in advance (and the same for every column) then using:
paste 1.OUT 2.OUT 3.OUT | sed -e 's/^\t/ \t/;s/\t\t/\t \t/'
where those spaces are the width of the column should get you what you want (note that \t in the pattern is a GNU sed extension; with BSD sed you would type literal tab characters). I feel like there should be a way to do this in a more automated fashion; one possibility is sketched below.
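A single-pass awk sketch that buffers each line under its index and row number, with no intermediate files (it assumes the whole input fits in memory and simply skips missing cells, matching the expected output above):
awk '{
    row[$1]++                          # per-index row counter
    cell[$1, row[$1]] = $0             # store the whole line
    if (row[$1] > maxrow) maxrow = row[$1]
    if (!seen[$1]++) order[++n] = $1   # remember first-seen index order
}
END {
    for (r = 1; r <= maxrow; r++) {
        line = ""
        for (i = 1; i <= n; i++)
            if ((order[i], r) in cell)
                line = line (line ? "\t" : "") cell[order[i], r]
        print line
    }
}' input.txt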

Related

Adding a decimal point to an integer with awk or sed

So, I have CSV files to use with hledger, and the last field of every row is the amount for that line's transaction.
Lines are in the following format:
date1, date2, description, amount
The amount can be anywhere between 4 and 6 digits long, and for some reason all amounts are missing the period before the last two digits.
Now: 1000
Should be: 10.00
Now: 25452
Should be: 254.52
How to add a '.' before the last two digits of all lines, preferably with sed/awk?
So the input file is:
16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
18.12.2005,20.12.2005,salary,459023
desired output
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
Thanks
You could try:
awk -F , '{printf "%s,%s,%s,%.2f\n", $1, $2, $3, $4/100.0}'
You should always add a sample of your input file and of the output you want in your question.
For the input you provide, you will have to define what should happen when the description field contains a ",", and whether an amount of less than 100 can occur in the input. Depending on your answers, the code may or may not need to be adapted.
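For what it's worth, here is a sketch that also copes with amounts under 100, by letting sprintf zero-pad (it assumes the amount is always the last field):
awk -F, -v OFS=, '{ $NF = sprintf("%.2f", $NF/100) } 1' file
so an input amount of 5 comes out as 0.05 rather than .05.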
sed 's/..$/.&/'
This matches the last two characters of each line and reinserts them (the & stands for the matched text) with a . in front.
You can also use the cut utility to get the desired output. In your case, you always want to add '.' before the last two digits, so essentially it can be thought of as something like this:
Step 1: Get all the characters from the beginning till the last 2 characters.
Step 2: Get the last 2 characters from the end.
Step 3: Concatenate them with the character that you want ('.' in this case).
The corresponding commands for each step are the following:
$ a='17.12.2005,18.12.2005,utility,12523'
$ b=$(echo "$a" | rev | cut -c3- | rev)
$ c=$(echo "$a" | rev | cut -c1-2 | rev)
$ echo "$b.$c"
This would produce the output
17.12.2005,18.12.2005,utility,125.23
Applying the same steps to every line of the input file gives:
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
awk -F, '{sub(/..$/,".&")}1' file
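Here sub(/..$/,".&") replaces the last two characters of the line with a period followed by the matched text (&), and the trailing 1 is an always-true pattern whose default action is to print the modified line.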

awk merging 2 columns and adding an extra column to txt file [duplicate]

This question already has answers here:
Why does my tool output overwrite itself and how do I fix it?
(3 answers)
Closed 3 years ago.
I did this in the past without problems, but I can't this time and I don't understand why.
My original files is
1002 10214
1002 10220
1002 10222
1002 10248
1002 10256
I need to make a new file where the two columns above are merged into one, and add a second column with the value 1.
Desired output should look like this
100210214 1
100210220 1
100210222 1
100210248 1
100210256 1
I tried the awk commands below to first print the two columns as one into a tmp file, then add the extra column with "1":
cat input.txt | awk '{ print ($1$2)}' > tmp1.txt
cat tmp1.txt | awk ' {print $0, (1) }' > output.txt
While the first command seems to work OK, the second does not:
tmp1.txt (OK)
100210214
100210220
100210222
100210248
100210256
output.txt (not OK)
10210214
10210220
10210222
10210248
10210256
The "1"comes in the front of the first column, not sure why, even replacing the first 2 characters. Is it because the original input file is different (may be "space" was used instead of tab)?
Could you please try the following.
awk 'BEGIN{OFS="\t"} {sub(/\r$/,"");print $1 $2,"1"}' Input_file
This happens when input file has Windows line endings (i.e. \r\n). You can fix it using this command:
dos2unix file
and then get the desired output with this one:
awk '{$1=$1$2;$2=1}1' file
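If you want to confirm the line endings before converting, GNU cat can make the carriage returns visible (a quick check; input.txt is the file name from the question):
cat -A input.txt | head -1
This would show something like 1002 10214^M$, where the ^M before the end-of-line marker $ is the \r that was being carried into $2 and printed over the start of the line.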

Comparing corresponding values of two lines in a file using awk [duplicate]

This question already has answers here:
Finding max value of a specific date awk
(3 answers)
Closed 6 years ago.
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
I have the line positions of two lines in a file, say line1 and line2. These lines may be anywhere in the file, but I can locate each one using a search keyword based on the name (the first word of each line).
20160801 means yyyymmdd and has an associated value separated by |
I need to compare the values associated with each of the date for the given two lines.
I am a newbie in awk and don't understand how to compare these two lines at the same time.
Your question is not at all clear. Perhaps the first step is to clearly articulate 1) what problem am I trying to solve, and 2) what tools or data do I have to solve it?
The only hints specific to your question I can offer (since your problem statement is not clearly articulated) are these:
In awk, you can compare two different files by using the test FNR==NR, which is only true while reading the first file.
You can find the key words by using a regular expression of the form /^name1/ which means lines that start with that pattern
You can split on a delimiter in awk by setting the field separator to that delimiter -- in this case it sounds like that is |, though the date|value pairs themselves are whitespace-delimited fields within the line.
You can compare by saving the data from the first line and comparing with the data from the second line in the other file once you can articulate what 'compare' means to you.
Wrapping that up, given:
$ cat /tmp/f1.txt
name1 20160801|76 20160802|67 20160803|49 20160804|35 20160805|55 20160806|76 20160807|77 20160808|70 2016089|50 20160810|75 20160811|97 20160812|90 20160813|87 20160814|99 20160815|113 20160816|83 20160817|57 20160818|158 20160819|61 20160820|46 20160821|1769608 20160822|2580938 20160823|436093 20160824|75 20160825|57 20160826|70 20160827|97 20160828|101 20160829|96 20160830|95 20160831|89
$ cat /tmp/f2.txt
name2 20160801|32413 20160802|37707 20160803|32230 20160804|31711 20160805|32366 20160806|35532 20160807|36961 20160808|45423 2016089|65230 20160810|111078 20160811|74357 20160812|71196 20160813|71748 20160814|77001 20160815|91687 20160816|92076 20160817|89706 20160818|126690 20160819|168587 20160820|207128 20160821|221440 20160822|234594 20160823|200963 20160824|165231 20160825|139600 20160826|145483 20160827|209013 20160828|228550 20160829|223712 20160830|217959 20160831|169106
You can find the lines in question like so:
$ awk -F"|" '/^name/ && FNR==NR {print $1}' f1.txt f2.txt
name1 20160801
$ awk -F"|" '/^name/ && FNR<NR {print $1}' f1.txt f2.txt
name2 20160801
(I have only printed the first field for clarity)
Then use that to compare: save the values from the first line in an associative array and compare against them when you reach the second line, along the lines of the sketch below.
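A minimal sketch, assuming each date|value pair is a whitespace-separated field and each file contains one relevant line:
awk '
  FNR==NR { for (i = 2; i <= NF; i++) { split($i, p, "|"); v[p[1]] = p[2] }; next }
          { for (i = 2; i <= NF; i++) { split($i, p, "|")
              if (p[1] in v) print p[1], v[p[1]], p[2] } }
' /tmp/f1.txt /tmp/f2.txt
This prints each date with the name1 value and the name2 value side by side; replace the print with whatever 'compare' means to you (difference, ratio, threshold, and so on).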

Grep only one of partial duplicates

I have collected the following file:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA
This is a ;-separated file with 4 columns. The combination of columns 2 and 3, however, must be unique. Since this dataset has millions of rows, I'm looking for an efficient way to keep only the first occurrence of each combination. I therefore need to match on the combination of columns 2 and 3 and select the first occurrence.
The expected outcome should be:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #REMOVED
I have made a few attempts myself. The first one is:
grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq
However, this selects only columns 2 and 3 and discards all other data. Furthermore, uniq only removes adjacent duplicates, so it misses a match that occurs later. I could fix that by adding sort, but I prefer not to sort.
Another attempt is:
awk '!x[$0]++' test.txt
This does not require any sorting, but matches the complete line.
I think the second attempt is close, but it needs to be changed to look only at the second and third columns instead of the whole line. Does anyone know how to do this?
Here you go:
awk -F';' '!a[$2 FS $3]++' file
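The array is keyed on fields 2 and 3 joined by the field separator; a[$2 FS $3]++ evaluates to 0 (false) the first time a key is seen, so the expression is true only on the first occurrence, and awk's default action prints the line.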
test with your data:
kent$ awk -F';' '!a[$2 FS $3]++' f
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA

Awk using input from pipe

I have a (very) basic understanding of AWK. I have tried a few ways of doing this, but all of them print out far more lines than I want.
I have 10 lines in file.1:
chr10 234567
chr20 123456
...
chrX 62312
I want to convert file.1 to uppercase and then match against the first 2 columns of file.2. For example, the first line below should match the second line above; but I don't want the second line below, which matches the third line above on position but not on chr, and I don't want the first line below to match the first line above.
CHR20 123456 ... 234567
CHR28 234567 ... 62312
I have:
$ cat file.1 | tr '[:lower:]' '[:upper:]' | <grep? awk?>
and would love to know how to proceed. I used a simple grep previously, but the second column of file.1 also matches elsewhere in the searched file, so I got hundreds of lines returned. I want to match only on the first 2 columns (they correspond to the first 2 columns of file.2).
Hope that's clear enough; looking forward to your answers =)
If the files are sorted by the first column you can do:
join -i file.1 file.2 | awk '$3==$2 { $3=""; print }'
If they're not sorted, sort them first.
The -i flag says to ignore case.
That won't work if there are multiple lines with the same field in the first column. To make that work you would need something more complicated, along the lines of the awk sketch below.
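For instance, a sketch that needs no sorting (assuming whitespace-separated columns): read file.1 first and record each uppercased "chr pos" pair, then print the lines of file.2 whose first two columns match a recorded pair:
awk 'FNR==NR { seen[toupper($1) " " $2]; next }
     ($1 " " $2) in seen' file.1 file.2
FNR==NR is true only while reading file.1, and the in test matches both columns at once, so lines that agree only on position (or only on chr) are excluded.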