Awk Sum skipping Special Character Row - awk

I am trying to take the sum of a particular column in a file, i.e. column 18.
I am using the awk command along with printf to display it in the proper decimal format.
SUM=`cat ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt|awk -F"" '{s+=$18}END{printf("%24.2f\n", s)}'
The above command is skipping those rows in the file which have a special character in column 5 - RÉPARATIONS. Awk skips these rows and doesn't include them in the sum. Please help me resolve this issue so that the sum covers all rows.

There is a missing backtick in your example; it should be:
SUM=`cat ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt|awk -F"" '{s+=$18}END{printf("%24.2f\n", s)}'`
But you should not use backticks; you should use command substitution with $(code).
Using cat to feed data to awk is also the wrong way to do it; pass the path to awk instead:
SUM=$(awk -F"" '{s+=$18} END {printf "%24.2f\n",s}' ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt)
This may not resolve your problem, but it gives more correct code.
If you show us your input file, it would help us understand the problem.
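One thing that may be worth checking (purely an assumption, since the input file isn't shown): if the file's encoding does not match your shell's locale, gawk's character handling can misbehave on rows containing accented characters such as É. Forcing the C locale makes awk treat the input as plain bytes:
# the -F"" and the file path are taken verbatim from the question
SUM=$(LC_ALL=C awk -F"" '{s+=$18} END {printf "%24.2f\n", s}' \
    "${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt")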

Return lines with at least n consecutive occurrences of the pattern in bash [duplicate]

This might be a naive question, but I can't find an answer.
Given a text file, I'd like to find lines with at least (defined number) of occurrences of a certain pattern, say, AT[GA]CT.
For example, with n=2, from the file:
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
Only the second line should be returned.
I know how to use grep/awk to search for at least one instance of this degenerate pattern, and for some defined number of pattern instances occurring non-consecutively. But the issue is the pattern occurrences MUST be consecutive, and I can't figure out how to achieve that.
Any help appreciated, thank you very much in advance!
I would use GNU AWK for this task in the following way. Let file.txt content be
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
then
awk 'BEGIN{p="AT[GA]CT";n=2;for(i=1;i<=n;i+=1){pat=pat p}}$0~pat' file.txt
output
TAGATGCTATACTTGA
Explanation: I use a for loop to repeat p n times, then filter lines by checking whether the line ($0) matches the pattern built earlier.
Alternatively you might use the string-formatting function sprintf as follows:
awk 'BEGIN{n=2}$0~sprintf("(AT[GA]CT){%s}",n)' file.txt
Explanation: I used the sprintf function; %s in the first argument marks where to put n. If you want to know more about what may be used in the first argument of printf and sprintf, read Format Modifiers.
(both solutions tested in GNU Awk 5.0.1)
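As an aside (not part of the original answer), plain grep -E can express the same idea, since ERE interval bounds apply to a parenthesised group and {n,} asks for at least n adjacent repeats:
n=2
grep -E "(AT[GA]CT){$n,}" file.txt
For the sample file above this prints only TAGATGCTATACTTGA.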

Huge file with 55000 rows * 1800 columns - need to delete only specific columns with a partial pattern

I have a huge file (cancer gene expression data, a ~2 GB .csv file) with 55000 rows and ~1800 columns, so my table looks like this:
TCGA-4N-A93T-01A-11R-A37K-07, **TCGA-5M-AAT4-11A-11R-A41B-07**, TCGA-5M-AATE-01A-11R-A41B-07, TCGA-A6-2677-01B-02R-A277-07, **TCGA-A6-2677-11A-01R-0821-07**
For example, in column TCGA-5M-AAT4-11A-11R-A41B-07 the fourth position is -11A. My problem is that I have to delete the entire columns which have -11A at the 4th position (xx-xx-xx-11A-xx-xx-xx). This has to search all 1800 columns and keep only those columns which do not have -11A at the fourth position.
Can you please help me with what command I should use to get the required data?
I am a biologist and have limited experience in coding.
EDITED:
I have a data file collected from 1800 breast cancer patients; the table has 55000 gene names as rows and 1800 samples as columns (a 55000 * 1800 matrix file). A few samples designed by our lab were faulty and we have to remove those from our analysis. Now, I have identified those samples and I want to remove them from my file1.csv. xx-xx-xx-11A-xx-xx-xx are the faulty samples; I need to identify only those samples, the ones which show 11A in the fourth place of the column name, and remove them from the file.csv. I can do this in R but it takes too long to process. Thanks in advance, and sorry for the trouble.
Try this
#! /usr/local/bin/gawk -f
# blacklist_columns.awk
# https://stackoverflow.com/questions/49578756
# i.e. TCGA-5M-AAT4-11A-11R-A41B-07
BEGIN {
    PATTERN = "TCGA-..-....-11A-...-....-.."
}

$0 ~ ".*" PATTERN ".*" {           # matches rows with the pattern
    for (col = 1; col <= NF; col++)
        # find column(s) in the row with the pattern
        if ($col ~ PATTERN) {
            blacklist[col]++       # note which column
        }
}

END {                              # output the list collected
    n = asorti(blacklist)
    for (i = 1; i <= n; i++)
        bl = bl "," blacklist[i]
    print substr(bl, 2)
}
# Usage try ... :
# BLACKLIST=$(./blacklist_columns.awk table.tab)
#
# cut --complement -f "$BLACKLIST" table.tab > table_purged.tab
You can't do it in one pass, so you might as well let an existing tool
do the second pass, especially since you are more on the wet side.
The script should spit out a list of columns it thinks you should skip;
you can feed that list as an argument to the program cut
and get it to keep only the columns not mentioned.
Edit(orial):
Thank you for your sentiment, Wojciech Kaczmarek; I could not agree more.
There is also a flip side where some biologists discount "coders", which I find annoying. The paper being worked on here may include some water-cooler collaborator but fail to mention the technical help on a show-stopper (hey, they fixed it, so it must not have been any big deal).
Not sure what you are really asking for; this script will delete, row by row, the fields which have "11A" in the 4th position (based on the - delimiter).
$ awk -F', *' -v OFS=', ' '{for(i=1;i<=NF;i++)
{split($i,a,"-");
if(a[4]=="11A") $i=""}}1' input > output
If you're asking to remove the entire column for all rows, not just the found row, this is not it. Also, it's not tested, but perhaps it will give you ideas...
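If whole-column removal keyed off the header is what you need, here is an untested sketch of one way it might look (assuming the file is comma separated, the first row holds the sample names, and every row has the same number of fields; file1.csv is the name mentioned in the question):
awk -F', *' -v OFS=', ' '
NR==1 {                              # inspect the header row only
    for (i = 1; i <= NF; i++) {
        split($i, a, "-")
        keep[i] = (a[4] != "11A")    # remember which columns to keep
    }
}
{
    out = ""
    for (i = 1; i <= NF; i++)
        if (keep[i])
            out = out (out == "" ? "" : OFS) $i
    print out
}' file1.csv > file1_purged.csv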

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script and get the output file, the number I am interested in appears in a different position (due to the log nature of the output file).
I have tried several awk, sed, grep commands but I can't get any to work as many of them rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the bold one:
Energy initial, next-to-last, final =
-5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
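If the three numbers sit on the line after the "Energy initial, next-to-last, final =" label and the value you want is the last of the three (both of which are assumptions on my part; output.log is a made-up file name), a pattern-anchored approach avoids depending on absolute position:
awk '/Energy initial, next-to-last, final/ { getline; print $NF; exit }' output.log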

sed/awk + regex delete duplicate lines where first field matches (ip address)

I need a solution to delete duplicate lines where the first field is an IPv4 address. For example, I have the following lines in a file:
192.168.0.1/text1/text2
192.168.0.18/text03/text7
192.168.0.15/sometext/sometext
192.168.0.1/text100/ntext
192.168.0.23/othertext/sometext
So all that should match in the previous scenario is the IP address. All I know is that the regex for an IP address is:
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
It would be nice if the solution is one line and as fast as possible.
If the file contains lines only in the format you show, i.e. the first field is always an IP address, you can get away with one line of awk:
awk '!x[$1]++' FS="/" $PATH_TO_FILE
EDIT: This removes duplicates based only on IP address. I'm not sure this is what the OP wanted when I wrote this answer.
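To illustrate (ips.txt is just a hypothetical name for the sample lines above saved to a file): !x[$1]++ is true only the first time a given first field is seen, so the first line per IP is kept and later duplicates are dropped:
$ awk '!x[$1]++' FS="/" ips.txt
192.168.0.1/text1/text2
192.168.0.18/text03/text7
192.168.0.15/sometext/sometext
192.168.0.23/othertext/sometext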
If you don't need to preserve the original ordering, one way to do this is using sort, keying on the first /-separated field so that only the IP decides uniqueness:
sort -t / -u -k1,1 <file>
The awk that ArjunShankar posted worked wonders for me.
I had a huge list of items, which had multiple copies in field 1, and a special sequential number in field 2. I needed the "newest" or highest sequential number from each unique field 1.
I had to use sort -rn first to push the highest numbers up to the "first entry" position, because the awk keeps the first entry it sees and skips later duplicates, as opposed to keeping the last/most recent one in the list.
Thanks ArjunShankar!
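A sketch of that combination, assuming a hypothetical items.txt with lines like name/42: sorting numerically descending on field 2 first means the awk one-liner's "keep the first entry" behaviour ends up keeping the highest number per unique field 1:
sort -t / -rn -k2,2 items.txt | awk -F/ '!seen[$1]++'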

How to skip records that turn on/off the range pattern?

gawk '/<Lexer>/,/<\/Lexer>/' file
this works but it prints the first and last records, which I'd like to omit. How to do so?
The gawk manual says: "The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in." But it gives no example.
I tried something like
gawk '/<Lexer>/,/<\/Lexer>/' {1,FNR-1} file
but it doesn't work.
If you have a better way to do this, without using awk, say so.
You can do it with 2 separate match statements and a variable
gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
This matches <Lexer>, sets p to 1, and skips to the next line. While p is 1 the current line is printed. When </Lexer> matches, p is set to 0 before the print rule is evaluated, so that line and everything after it is suppressed.
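For example, with a hypothetical file like the one below, only the lines strictly between the tags are printed:
$ cat file
before
<Lexer>
rule1
rule2
</Lexer>
after
$ gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
rule1
rule2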