Awk using input from pipe

I have a (very) basic understanding of AWK. I have tried a few ways of doing this, but all of them print far more lines than I want:
I have 10 lines in file.1:
chr10 234567
chr20 123456
...
chrX 62312
I want to convert file.1 to uppercase and match it against the first 2 columns of file.2. So the first line below matches the second line above; I don't want the second line below, which matches the third line above on position but not on chr, and I don't want the first line below to match the first line above:
CHR20 123456 ... 234567
CHR28 234567 ... 62312
I have:
$ cat file.1 | tr '[:lower:]' '[:upper:]' | <grep? awk?>
and would love to know how to proceed. I used a simple grep previously, but the second column of file.1 matches elsewhere in the searched file, so I get hundreds of lines returned. I want to match only on the first 2 columns (they correspond to the first 2 columns of file.2).
Hope that's clear enough; I look forward to your answers =)

If the files are sorted by the first column you can do:
join -i file.1 file.2 | awk '$3==$2{ $3=""; print}'
If they're not sorted, sort them first.
The -i flag says to ignore case.
That won't work if there are multiple lines with the same value in the first column; to handle that you would need something more complicated.
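If you'd rather not sort in separate steps, here's a sketch of the whole pipeline using bash process substitution (note that join -i expects its inputs sorted case-insensitively, which is what sort -f provides):
join -i <(sort -f file.1) <(sort -f file.2) | awk '$2 == $3 { $3 = ""; print }'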

Related

How can I remove a string after a specific character ONLY in a column/field in awk or bash?

I have a file with tab-delimited fields (or columns) like this one below:
cat abc_table.txt
a b c
1 11;qqw 213
2 22 222
3 333;rs2 83838
I would like to remove everything after the ";" on only the second field.
I have tried with
awk 'BEGIN{FS=OFS="\t"} NR>=1 && sub (/;[*]/,"",$2){print $0}' abc_table.txt
but it does not seem to work.
I also tried with sed:
sed 's/;.*//g' abc_table.txt
but it erases also the strings in the third field:
a b c
1 11
2 22 222
3 333
The desired output is:
a b c
1 11 213
2 22 222
3 333 83838
If someone could help me, I would be very grateful!
You simply need to correct your regex.
awk '{sub(/;.*/,"",$2)} 1' Input_file
In case you have Input_file TAB delimited then try:
awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1' Input_file
Problem in OP's regex: ;[*] looks for ; followed by a literal * character in the 2nd field, which is why it is not able to substitute everything after the ; there. We simply need ;.*, which means: match everything from the very first occurrence of ; to the end of the 2nd field, and substitute it with the empty string.
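A quick sketch to see the difference between the two patterns on a sample field:
$ echo '11;qqw' | awk '{ sub(/;[*]/, "", $0); print }'
11;qqw
$ echo '11;qqw' | awk '{ sub(/;.*/, "", $0); print }'
11
The first pattern looks for a literal ";*", which never occurs here, so nothing is removed.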
An alternative solution using gnu sed:
sed -E 's/(^[^\t]*\t+[^;]*);[^\t]*/\1/' file
a b c
1 11 213
2 22 222
3 333 83838
This might work for you (GNU sed):
sed 's/[^\t]*/&\n/2;s/;[^\t]*\n//;s/\n//' file
Append a unique marker, e.g. a newline, to the end of field 2.
Remove everything from the first ; through any run of non-tab characters up to that newline.
Remove the newline if it is still present.
N.B. This method can be extended to selected or all fields, e.g. the same removal but for the first and third fields:
sed 's/[^\t]*/&\n/1;s//&\n/3;s/;[^\t]*\n//g;s/\n//g' file
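The same selective removal can be done in awk, where you simply name the fields you want to clean; a sketch for the first and third fields of a tab-delimited file:
awk 'BEGIN{FS=OFS="\t"} { sub(/;.*/, "", $1); sub(/;.*/, "", $3) } 1' file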

Adding a decimal point to an integer with awk or sed

So, I have CSV files to use with hledger, and the last field of every row is the amount for that line's transaction.
Lines are in the following format:
date1, date2, description, amount
The amount can be anywhere between 4 and 6 digits long; for some reason all amounts are missing the period before the last two digits.
Now: 1000
Should be: 10.00
Now: 25452
Should be: 254.52
How can I add a '.' before the last two digits of every line, preferably with sed/awk?
So the input file is:
16.12.2005,18.12.2005,ATM,2000
17.12.2005,18.12.2005,utility,12523
18.12.2005,20.12.2005,salary,459023
desired output
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23
Thanks
You could try:
awk -F , '{printf "%s,%s,%s,%-6.2f\n", $1, $2, $3, $4/100.0}'
You should always add a sample of your input file, and of the output you want, to your question.
With the input you provide, you will have to define what should happen when the description field contains a ',', and whether an amount of less than 100 is possible as input.
Depending on your answers, the code may or may not need adapting.
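One way to sidestep both concerns is to always format the last field: commas inside the description then don't matter, and amounts under 100 still get two decimals. A sketch, assuming the amount is always the final field:
awk 'BEGIN{FS=OFS=","} { $NF = sprintf("%.2f", $NF/100) } 1' file
For example, an amount of 5 becomes 0.05 rather than being mangled.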
sed 's/..$/.&/'
The & in the replacement stands for whatever the regex matched, so this reinserts the last two characters of the line after a literal '.'.
You can also use the cut utility to get the desired output. In your case, you always want to add a '.' before the last two digits, so essentially it can be broken into the following steps:
Step 1: Get all the characters from the beginning up to the last 2 characters.
Step 2: Get the last 2 characters from the end.
Step 3: Concatenate them with the character you want ('.' in this case).
The corresponding command for each step is the following:
$ a='17.12.2005,18.12.2005,utility,12523'
$ b=`echo $a | rev | cut -c3- | rev`
$ c=`echo $a | rev | cut -c1-2 | rev`
$ echo $b"."$c
This would produce the output
17.12.2005,18.12.2005,utility,125.23
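To apply the same idea to a whole file without spawning rev and cut per line, a sketch using a bash loop and parameter expansion (${line%??} drops the last two characters, ${line: -2} keeps them; assumes every amount has at least two digits):
while IFS= read -r line; do
    printf '%s.%s\n' "${line%??}" "${line: -2}"
done < file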
Or, as a single awk substitution:
awk -F, '{sub(/..$/,".&")}1' file
16.12.2005,18.12.2005,ATM,20.00
17.12.2005,18.12.2005,utility,125.23
18.12.2005,20.12.2005,salary,4590.23

awk: Compare two sets of numbers (generated by random and strict rules)

I have many files containing some fixed words and numbers:
The FIRST SET of numbers has a fixed length of 7 digits: the first 3 digits are a random-looking prefix (100, 200, 300 in the example, but they can be others) that we do not need; we are interested in the remaining 4 digits.
The SECOND SET of number/s is generated from the last 4 digits of the FIRST SET (xxx7777 = 7777; xxx0066 = 66). Note that the SECOND SET canNOT have leading zeros; they are already cut off, and this is a rule.
Input
first second third 1007777 fourth 7777
...
first second third 2008341 fourth 8341
...
first second third 3000005 fourth 5
...
...
first second third 2008341 fourth 8
...
first second third 2008341 fourth 341
I found other examples here showing how to find the interesting lines using grep, but I didn't find an AWK example doing what I want; maybe I'm having problems because of the rule with the leading zeros.
My attempt to find the wrong generations:
grep -Pr 'first second third' docs/test/*.txt | awk '{ if($4=$6) print $4 " " $6}'
7777 7777
8341 8341
5 5
8 8
341 341
The correct output should look like this:
2008341 8
2008341 341
..only the problem (wrongly generated) lines, plus the filename.
Thanks! :)
$ awk '/first second third/ && (substr($4,4)+0 != $NF) {print FILENAME, $4, $NF}' file
file 2008341 8
file 2008341 341
Call it as:
awk '...' docs/test/*.txt
or:
find docs -name '*.txt' -exec awk '...' {} \;
or similar as you see fit.
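The +0 in substr($4,4)+0 is what implements the no-leading-zeros rule: it forces a numeric comparison, so "0066" and 66 compare equal. A quick sketch of the difference:
$ awk 'BEGIN { s = "0066"; print (s == 66), (s + 0 == 66) }'
0 1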
Use this gnu awk way, intended to be human readable and maintainable:
$ grep -r foobarbase . | awk '
{match($4, /[0-9]{4}$/, a); #1
a[0]=gensub(/^0+/, "", "g", a[0])} #2
$NF != a[0] #3
'
Output:
first second third 2008341 fourth 8
first second third 2008341 fourth 341
Explanations:
#1 get the last 4 digits of column 4 and capture them into array a with match()
#2 strip any leading zeros
#3 if the stripped part differs from the last column, print the line (awk's default action on a true condition)
Note that the three-argument form of match() and gensub() are gawk extensions, hence the "gnu way".

Manipulating the awk output depending on the number of occurrences

I don't know how to word it well. I have an input file where the first column of each row is an index. I need to convert this input file into a multi-column output file, one column of rows per index, so that the columns all start on the same line.
I have an input file in the following format:
1 11.32 12.55
1 13.32 17.55
1 56.77 33.22
2 34.22 1.112
3 12.13 13.14
3 12.55 34.55
3 22.44 12.33
3 44.32 77.44
The expected output should be:
1 11.32 12.55 2 34.22 1.112 3 12.13 13.14
1 13.32 17.55 3 12.55 34.55
1 56.77 33.22 3 22.44 12.33
3 44.32 77.44
Is there an easy way I can do this in awk?
Something like this, in bash:
paste <(grep '^1 ' input.txt) <(grep '^2 ' input.txt) <(grep '^3 ' input.txt)
paste has an option to set the delimiter if you don't want the default tab characters used, or you could post-process the tabs with expand...
EDIT: For an input file with many more tags, you could take this sort of approach:
awk '{print > ("/tmp/output" $1 ".txt")}' input.txt
paste /tmp/output*.txt > final-output.txt
The awk line outputs each line to a file named after the first field of the line, then paste recombines them.
EDIT: as pointed out in a comment below, you might have issues if you end up with more than 9 intermediate files. One way around that would be something like this:
paste /tmp/output[0-9].txt /tmp/output[0-9][0-9].txt > final-output.txt
Add additional arguments as needed if you have more than 99 files... or more than 999... If that's the case, though, a python or perl solution might be a better route...
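Another way around the glob-ordering problem is to zero-pad the intermediate file names so that lexical order matches numeric order; a sketch, assuming numeric indexes below 10000:
awk '{ print > (sprintf("/tmp/output%04d.txt", $1)) }' input.txt
paste /tmp/output*.txt > final-output.txt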
If all you need is independently running columns (without trying to line up matching items between the columns or anything like that) then the simplest solution might be something like:
awk '{print > ($1".OUT")}' FILE; paste 1.OUT 2.OUT 3.OUT
The only issue with that is it won't fill in missing columns so you will need to fill those in yourself to line up your columns.
If the column width is known in advance (and the same for every column) then using:
paste 1.OUT 2.OUT 3.OUT | sed -e 's/^\t/ \t/;s/\t\t/\t \t/'
where those spaces are the width of the column should get you what you want. I feel like there should be a way to do this in a more automated fashion but can't think of one offhand.
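For a more automated take, here is an awk sketch that buffers every line under its index and then prints row by row, emitting an empty cell wherever a column has run out (assumes the indexes are small positive integers):
awk -v OFS='\t' '
  { n[$1]++                          # next row slot for this index
    cell[$1, n[$1]] = $0             # remember the whole line
    if (n[$1] > maxrow) maxrow = n[$1]
    if ($1 + 0 > maxidx) maxidx = $1 + 0 }
  END {
    for (r = 1; r <= maxrow; r++) {
      out = ""
      for (i = 1; i <= maxidx; i++)
        out = out (i > 1 ? OFS : "") cell[i, r]
      print out
    }
  }' input.txt
Missing cells come out as empty fields between tabs, so the surviving columns stay aligned.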

Grep only one of partial duplicates

I have collected the following file:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA
This is a ;-separated file with 4 columns. The combination of columns 2 and 3, however, must be unique. Since this dataset has millions of rows, I'm looking for an efficient way to keep only the first occurrence of every duplicate. I therefore need to match on the combination of columns 2 and 3 and select the first occurrence.
The expected outcome should be:
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #REMOVED
I have made a few attempts myself. The first one is:
grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq
However, this selects only columns 2 and 3 and removes all other data. Furthermore, it does not take into account a match that occurs later. I could fix that by adding sort, but I prefer not to sort.
Another attempt is:
awk '!x[$0]++' test.txt
This does not require any sorting, but it matches the complete line.
I think the second attempt is close, but it needs to be changed to look only at the second and third columns instead of the whole line. Does anyone know how to incorporate this?
Here you go:
awk -F';' '!a[$2 FS $3]++' file
test with your data:
kent$ awk -F';' '!a[$2 FS $3]++' f
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
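If you also want to inspect what was filtered out, the same idiom can route the duplicates to a second file (a sketch; dups.txt is just an illustrative name):
awk -F';' '!a[$2 FS $3]++ { print; next } { print > "dups.txt" }' file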