How to extract text with fixed length in delimited file - awk

I want to extract a field from a delimited file.
Below is the content of my file -
A,B,C,"01/02/2015,01/03/2016,02/26/2017",01,56
A,B,G,"01/02/2012,01/03/2011,02/26/2010",01,56
I want to retrieve only the first date in each line and replace the entire column with that value.
Expected output:
A,B,C,01/02/2015,01,56
A,B,G,01/02/2012,01,56
I know that I can split the quoted value into comma-separated values, but I'm not sure how to keep only the first value and omit the others.
Please guide me on this.

sed 's/"\([^,]*\)[^"]*"/\1/'
I.e. find a double quote, remember what follows it up to a comma, and replace that up to the following double quote with the remembered part.
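A quick check against the sample (the file name Input_file is a placeholder):

$ sed 's/"\([^,]*\)[^"]*"/\1/' Input_file
A,B,C,01/02/2015,01,56
A,B,G,01/02/2012,01,56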
For more serious work with CSV, see Perl and Text::CSV_XS.
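For illustration, a minimal Text::CSV_XS sketch under the same assumptions (the quoted dates sit in the fourth column, index 3; the file name is a placeholder):

perl -MText::CSV_XS -e '
    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
    while (my $row = $csv->getline(*STDIN)) {
        $row->[3] = (split /,/, $row->[3])[0];  # keep only the first date
        $csv->print(*STDOUT, $row);
    }
' < Input_file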

Assuming your Input_file is the same as the sample shown, the following awk may help.
awk -F',|"' '{print $1,$2,$5,$(NF-1),$NF}' OFS=, Input_file
Output will be as follows.
A,B,01/02/2015,01,56
A,B,01/02/2012,01,56
Explanation:
-F',|"': Setting field separator as either , or " for each line of Input_file here.
print: awk's built-in statement that prints lines, variables etc.
$1,$2,$3,$5,$(NF-1),$NF: Printing $1 (first field of the current line), $2 (second field), $3 (third field), $5 (fifth field, i.e. the first date inside the quotes), $(NF-1) (second-to-last field) and $NF (last field).
OFS=,: Setting output field separator as comma here.
Input_file: Mentioning the Input_file name here.
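To see why $5 holds the first date, this debug one-liner prints every field with its number; note the empty fields produced on each side of the quoted section:

$ awk -F',|"' '{for(i=1;i<=NF;i++) printf "%d=[%s] ", i, $i; print ""}' Input_file
1=[A] 2=[B] 3=[C] 4=[] 5=[01/02/2015] 6=[01/03/2016] 7=[02/26/2017] 8=[] 9=[01] 10=[56]
1=[A] 2=[B] 3=[G] 4=[] 5=[01/02/2012] 6=[01/03/2011] 7=[02/26/2010] 8=[] 9=[01] 10=[56]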

Related

Select first and last column using regex or linux command

I have a text file that looks something like this...
("oo" (set CANDRA-E-O 0) "ऊ")
("o" (set CANDRA-E-O ?ऑ) "ओ")
("oa" "ऑ")
("au" "औ")
I need to extract the first and last columns like:
"oo", "ऊ"
"o", "ओ"
"oa", "ऑ"
"au", "औ"
I have managed to extract the first column with the regex below, but I am not sure how to also select the last column.
\ {2}\(\".+\"\
With your shown samples/attempts, please try the following awk command (written and tested with GNU awk).
awk -v FPAT='"[^"]*"' -v OFS=', ' '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file
Explanation: FPAT='"[^"]*"' defines what a field is, rather than what separates fields: each field is a double quote, any run of non-quote characters, and a closing double quote, so everything outside the quotes is ignored. The main program then loops over all fields of each line and prints them, followed by ORS (a newline) after the last field and OFS (comma and space, matching the requested output) after every other field.
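To see exactly which fields FPAT captures (GNU awk only; file name assumed), print each field with its line and field number:

$ awk -v FPAT='"[^"]*"' '{for(i=1;i<=NF;i++) print NR, i, $i}' Input_file
1 1 "oo"
1 2 "ऊ"
2 1 "o"
2 2 "ओ"
3 1 "oa"
3 2 "ऑ"
4 1 "au"
4 2 "औ"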
With this awk solution:
awk -v OFS="," '{sub(/^\(/,"",$1);sub(/\)$/,"",$NF);print $1, $NF}' file
"oo","ऊ"
"o","ओ"
"oa","ऑ"
"au","औ"
With the first sub() we remove the opening parenthesis ( from the first field.
The second sub() likewise removes the closing parenthesis ) from the last field.
We then print the two fields separated by a comma: OFS=","
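For comparison, the same print without the two sub() calls leaves the parentheses attached to the first and last fields:

$ awk -v OFS="," '{print $1, $NF}' file
("oo","ऊ")
("o","ओ")
("oa","ऑ")
("au","औ")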

Find an exact match from a patterns file for another file using awk (patterns contain regex symbols to be ignored)

I have a file which has the following patterns.
NO_MATCH
NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH||NO_MATCH
These should be matched exactly with the 5th column of the target csv. I have tried:
awk 'NR==FNR{a[$0]=$0; next;} NR>FNR{if($5==a[$0])print $0}' pattern.csv input.csv > final_out.csv
But the || in the patterns file result in bad matches. The 5th column in the target csv looks something like this:
"AAAA||AAAA"
"BBBB||BBBB"
"NO_MATCH"
"NO_MATCH||NO_MATCH||NO_MATCH"
"NO_MATCH||BBBB"
I need to extract the 3rd and 4th lines.
Edit: I need an exact match, as in lines 3 and 4; hope this clears up the issue. The columns in the csv are double-quoted as shown, and the quotes around the fifth column should be removed.
awk 'BEGIN{FS=OFS=","} NR==FNR{a["\""$0"\""];next} ($5 in a){gsub(/^"|"$/,"",$5);print}' pattern.csv input.csv > final_out.csv
Keep pattern.csv's contents in an array, enclosing each line in double quotes as the key. For each line of input.csv, if the fifth column exists in the array, remove the quotes around it and print the line.
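The key point is that in performs a literal string lookup in the array, never a regex match, so the || characters in the keys are harmless. A tiny standalone check:

$ awk 'BEGIN { a["\"NO_MATCH||NO_MATCH\""]; key = "\"NO_MATCH||NO_MATCH\""; if (key in a) print "matched literally" }'
matched literally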

check for value in csv file then print line with awk / sed

Is it possible to parse a .csv file and look for lines whose 13th entry contains a particular value?
So data for example would be
10,1,a,bhd,5,7,10,,,8,9,3,19,0
I only want to extract lines which have a value of 3 in the 13th field if that makes sense.
I tried it with a bash while loop using cut etc., but it was messy.
Not sure if there is an awk/sed method.
Thanks in advance.
This is beginner-level awk.
awk -F, '$13==3' file
-F, sets the field separator to a comma, and $13 is the value of the 13th field. For each line, if $13==3 evaluates to true, the line is printed; printing the whole line is awk's default action for a bare condition.
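If the value should come from the user rather than being hard-coded, pass it in with -v (val is just an illustrative variable name):

$ awk -F, -v val=3 '$13==val' file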

Filter fields with multiple delimiters

I've done extensive searching for a solution but can't quite find what I need. Have a file like this:
aaa|bbb|ccc|ddd~eee^fff^ggg|hhh|iii
111|222|333|444~555^666^777|888|999
AAA|BBB|CCC||EEE|FFF
What I want to do is use awk or something else to return lines from this file with a change to field 4 (pipe-delimited). Field 4 has a tilde and a caret as delimiters, which is where I'm struggling. We want the lines returned like this:
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
If field 4 is empty, it's returned as-is. But when field 4 has multiple values, we want only the first value right after the tilde.
awk -F "[|^~]" 'BEGIN{OFS="|"}NF==6{print} NF==9{print $1,$2,$3,$5,$8,$9}' tmp.txt
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
use a regular expression as your delimiter
count the fields to decide what to do (a quick check below confirms the counts)
set the output delimiter to pipe
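A quick way to confirm those field counts with the same separators:

$ awk -F "[|^~]" '{print NF}' tmp.txt
9
9
6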
$ awk -F'|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1' OFS='|' file
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
This approach makes no assumption about the contents of fields other than field 4. The other fields may, for example, contain ~ or ^ characters and that will not affect the results.
How it works
-F'|'
This sets the field delimiter on input to |.
sub(/^[^~]*~/, "", $4)
If field 4 contains a ~, this removes the first ~ and everything before the first ~.
sub(/\^.*/, "", $4)
If field 4 contains ^, this removes the first ^ and everything after it.
1
This is awk's cryptic shorthand for print-the-line.
OFS='|'
This sets the field separator on output to |. It matters here because modifying $4 with sub() makes awk rebuild the whole record, joining the fields with OFS.

Delete lines which contain a number smaller/larger than a user specified value

I need to delete lines in a large (tab-separated) file which contain a value larger than a user-specified number in the third column. For example, I'd like to get rid of lines with values larger than 5e-48 (x > 5e-48), i.e. lines with 7e-46, 7e-40, 1e-36, ... should be deleted.
Can sed, grep, awk or any other command do that?
Thank you
Markus
With awk:
awk '$3 <= 5e-48' filename
This keeps only those lines whose third field is at most 5e-48, and thereby deletes the larger ones.
If fields can contain spaces (the data appears to be tab-separated), use
awk -F '\t' '$3 <= 5e-48' filename
This sets the field separator to \t, so lines are split at tabs rather than at any whitespace. It does not appear to be necessary here, but it is good practice to be defensive about these things (thanks to @tripleee for pointing this out).
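Note that awk compares these values numerically, so scientific notation is handled correctly; a quick sanity check:

$ awk 'BEGIN { if (7e-46 > 5e-48) print "larger, would be deleted" }'
larger, would be deleted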
In Perl, for example, the solution can be
perl -ane 'print unless $F[2] > 5e-48'
Here -n loops over the input lines, -a autosplits each line into the @F array, and $F[2] is the third field (arrays are zero-indexed).