I have data in the format below in a file:
"123","XYZ","M","N","P,Q"
"345",
"987","MNO","A,B,C"
I always want to have 5 entries in each row, so if a row has only 2 fields, 3 extra empty fields ("") need to be added.
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
I looked at the solution on the page
Add Extra Strings Based on count of fields- Sed/Awk
which has a very similar requirement, but it fails when I try it because I also have commas (,) within the fields.
Thanks.
In GNU awk with your shown samples, please try following code.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v s1="\"" -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program from here setting FPAT to csv file parsing here.
BEGIN{ OFS="," } ##Starting BEGIN section of this program setting OFS to comma here.
FNR==NR{ ##Checking condition FNR==NR here, which will be true for first time file reading.
nof=(NF>nof?NF:nof) ##Create nof to get highest NF value here.
next ##next will skip all further statements from here.
}
NF<nof{ ##checking if NF is lesser than nof then do following.
val="" ##Nullify val here.
i=($0~/,$/?NF:NF+1) ##Start i at NF if the line ends with a comma, otherwise at NF+1.
for(;i<=nof;i++){ ##Running loop till value of nof matches i here.
val=(val?val OFS:"")s1 s1 ##Creating val which has value of "" in it.
}
sub(/,$/,"") ##Removing ending , here.
$0=$0 OFS val ##Concatenate val to the current line here.
}
1 ##Printing current line here.
' Input_file Input_file ##Mentioning Input_file names here.
EDIT: Adding a variant that takes a variable named nof, the minimum number of fields every line should have. Lines with fewer fields are padded up to nof; if any line in the file has more fields than nof, that larger count is used as the target instead.
awk -v s1="\"" -v nof="5" -v FPAT='[^,]*|"[^"]+"' '
BEGIN{ OFS="," }
FNR==NR{
nof=(NF>nof?NF:nof)
next
}
NF<nof{
val=""
i=($0~/,$/?NF:NF+1)
for(;i<=nof;i++){
val=(val?val OFS:"")s1 s1
}
sub(/,$/,"")
$0=$0 OFS val
}
1
' Input_file Input_file
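For comparison, the same two-pass idea (find the widest row, then pad the shorter ones) can be sketched in Python with the standard csv module. This is only a sketch, not a translation of the awk code: pad_rows and min_fields are names made up here, and it assumes every field should be written back double quoted.

```python
import csv
import io


def pad_rows(text, min_fields=0):
    """Pad every CSV row in `text` to a common field count (hypothetical helper)."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        # A trailing comma parses as a final empty field; drop it before counting.
        if len(row) > 1 and row[-1] == "":
            row = row[:-1]
        rows.append(row)

    # First pass: the target is the widest row, but never less than min_fields.
    target = max([min_fields] + [len(r) for r in rows])

    # Second pass: append empty strings and quote every field on output.
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [""] * (target - len(row)))
    return out.getvalue()


sample = '"123","XYZ","M","N","P,Q"\n"345",\n"987","MNO","A,B,C"\n'
print(pad_rows(sample, min_fields=5), end="")
```

Because the csv module does the parsing, commas inside quoted fields such as "P,Q" are not counted as separators.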
Here is one for GNU awk using FPAT, for when [you] always want to have 5 entries in the row:
$ awk '
BEGIN {
FPAT="([^,]*)|(\"[^\"]+\")"
OFS=","
}
{
NF=5 # set NF to limit too long records
for(i=1;i<=NF;i++) # iterate to NF and set empties to ""
if($i=="")
$i="\"\""
}1' file
Output:
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
Here is an awk command that would work with any version of awk:
awk -v n=5 -v ef=',""' -F '","' '
{
sub(/,+$/, "")
for (i=NF; i<n; ++i)
$0 = $0 ef
} 1' file
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
With perl, assuming every field is double quoted:
$ perl -pe 's/,$//; s/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
# if the , at the end of line isn't present
$ perl -pe 's/$/q(,"") x (4 - s|","|$&|g)/e' ip.txt
"123","XYZ","M","N","P,Q"
"345","","","",""
"987","MNO","A,B,C","",""
s|","|$&|g will search for "," and replace it back. The return value is number of replacements, which is then used to determine how many fields have to be appended.
The e flag allows you to use Perl code in the replacement section.
q operator helps to use different delimiter for single quoted string.
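The counting trick may be easier to see outside of a one-liner. Below is a rough Python sketch of the same idea, under the same assumption that every field is double quoted; pad_by_separator_count is a made-up name, not something from the perl answer.

```python
def pad_by_separator_count(line, n=5):
    """Append ,"" until the line has n fields, by counting "," separators."""
    line = line.rstrip("\n")
    if line.endswith(","):
        line = line[:-1]                  # same job as s/,$//
    separators = line.count('","')        # one separator between each pair of fields
    # A line with n fields has n - 1 separators; a negative count appends nothing.
    return line + ',""' * (n - 1 - separators)


for row in ['"123","XYZ","M","N","P,Q"', '"345",', '"987","MNO","A,B,C"']:
    print(pad_by_separator_count(row))
```

A comma inside a field (as in "P,Q") is not counted, because only the three-character sequence quote, comma, quote is treated as a separator.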
Here's an alternate solution that creates an array and then adds empty fields if necessary.
perl -lne '@f = /"[^"]+"|[^,]+/g; print join ",", @f, qw("") x (4 - $#f)'
/"[^"]+"|[^,]+/g defines fields as double quoted strings (with no double quote inside, so escaped quotes won't work with this solution) or non , characters (at least one, so , at end of line will be ignored).
qw("") x (4 - $#f) determines the extra fields to be appended. qw("") creates an array with single element of value "" which is then multiplied using the x operator.
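The array-based approach maps quite directly onto re.findall. A small sketch, with the same limitation about escaped quotes; pad_with_findall is a hypothetical name used only for this illustration.

```python
import re

# A field is either a double quoted string or a run of non-comma characters.
FIELD = re.compile(r'"[^"]+"|[^,]+')


def pad_with_findall(line, n=5):
    """Collect the fields with a regex, then append "" until there are n."""
    fields = FIELD.findall(line.rstrip("\n"))
    # A trailing comma produces no match, so it is dropped automatically.
    return ",".join(fields + ['""'] * (n - len(fields)))


for row in ['"123","XYZ","M","N","P,Q"', '"345",', '"987","MNO","A,B,C"']:
    print(pad_with_findall(row))
```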
Another perl way using -a for autosplit and -F to set the separator:
perl -lanF'/"*,*"/' -e 'print join ",", map "\"$_\"", @F[1..5]'
-F'/"*,*"/' - this uses an autosplit separator of double quote optionally preceded by commas and quotes
-a uses that separator to autosplit into @F
-l adds linebreaks to print and -n will process input in stream mode w/o printing unless explicitly told to
map "\"$_\"", @F[1..5] takes exactly 5 fields, even undefined ones, and adds double quotes
print join ",", map ... takes the results of the map above, joins into a string with commas, and prints
(Note: because each line starts with a field delimiter, I'm ignoring the empty $F[0] element)
This might work for you (GNU sed):
sed ':a;s/"[^"]*"/&/5;t;s/$/,""/;ta' file
If there are 5 fields, bail out.
Otherwise, append an empty field and repeat.
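The same check-and-append loop, written out as a Python sketch. Note one added assumption that is not part of the sed command: a trailing comma is stripped first, so the appended fields do not end up after an empty unquoted field. pad_by_loop is a made-up name.

```python
import re


def pad_by_loop(line, n=5):
    """Keep appending ,"" until the line contains n double quoted fields."""
    line = line.rstrip("\n")
    if line.endswith(","):
        line = line[:-1]          # assumption added here, not done by the sed command
    while len(re.findall(r'"[^"]*"', line)) < n:
        line += ',""'             # append one empty field and check again
    return line


for row in ['"123","XYZ","M","N","P,Q"', '"345",', '"987","MNO","A,B,C"']:
    print(pad_by_loop(row))
```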
I would like to write every line of a txt-file to a separate file and use the first column as the name of the new file. The new file should then contain everything else in the line, but column 1.
So when I have:
example_1 a b c d
example_2 e f g h
example_3 j k l m
I want 3 separate files that are named example_1.mop, example_2.mop and example_3.mop and contain everything after the first column. So example_1.mop should contain a b c d and so on.
I almost found a way with
awk '{printf "%s\n", $2>$1".mop"}' file
but this only puts the second column in the new file. How can I tell awk to use everything else but the first column?
Thanks a lot for your help!
With your shown samples, please try following. Written and tested in GNU awk, should work in any awk.
awk '
{
first=$1
$1=""
sub(/^ +/,"")
outputFile=first".mop"
print >> (outputFile)
close(outputFile)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
first=$1 ##Creating first which has 1st field in it.
$1="" ##Nullify 1st field here.
sub(/^ +/,"") ##Removing the leading space left behind after emptying the 1st field.
outputFile=first".mop" ##Creating outputFile which has output file name in it.
print >> (outputFile) ##Printing current line into output file.
close(outputFile) ##Closing output file in backend.
}
' Input_file ##Mentioning Input_file name here.
awk '{out=$1".mop"; sub(/[^[:space:]]+[[:space:]]*/,""); print > out; close(out)}' file
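If awk is not a requirement, the same split can be sketched in Python. This is just an illustration: split_by_first_column and its out_dir and ext parameters are made up here, and it appends to the output file, like the >> version above.

```python
import os
import tempfile


def split_by_first_column(path, out_dir, ext=".mop"):
    """Write the rest of each line to <first column><ext> inside out_dir."""
    written = []
    with open(path) as src:
        for line in src:
            parts = line.split(None, 1)      # first field, then the untouched remainder
            if not parts:
                continue                     # skip blank lines
            rest = parts[1].rstrip("\n") if len(parts) > 1 else ""
            target = os.path.join(out_dir, parts[0] + ext)
            with open(target, "a") as dst:   # append, so repeated names accumulate
                dst.write(rest + "\n")
            written.append(target)
    return written


# Small demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "file")
    with open(source, "w") as fh:
        fh.write("example_1 a b c d\nexample_2 e f g h\nexample_3 j k l m\n")
    for name in split_by_first_column(source, tmp):
        with open(name) as fh:
            print(os.path.basename(name), "->", fh.read().strip())
```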
table1.csv:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
33629|EEE
33630|FFF
Aims:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Using command:
awk 'BEGIN{FS="|";OFS="|"} {if($2=="AAA" || $2=="BBB" || $2=="CCC" || $2=="DDD"){print $1,$2}}' table1.csv
However, I would like this to be more automatic, since the number of categories may grow.
list1.csv:
AAA BBB CCC DDD
list=`cat list1.csv`
awk -v list=$list 'BEGIN{FS="|";OFS="|"} {if($2==list){print $1,$2}}' table1.csv
In other words, can I store the chained $2=="AAA", $2=="BBB", ... comparisons in a variable built from list1.csv?
Expected output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
So, any suggestions on storing multiple conditions in one variable?
Thanks all!
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i];next}FNR==1{FS="|";$0=$0}($2 in a)' list table
Output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Explained:
$ awk '
NR==FNR { # process list
for(i=1;i<=NF;i++) # hash all items in file
a[$i]
next # possibility for multiple lines
}
FNR==1 { # changing FS in the beginning of table file
FS="|"
$0=$0
}
($2 in a)' list table
Almost the same logic as James Brown's nice answer; this small variant sets the field separator on the command line, just before the table file name, instead of inside the program.
awk 'FNR==NR{for(i=1;i<=NF;i++){arr[$i]};next} ($2 in arr)' list FS="|" table
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when list is being read.
for(i=1;i<=NF;i++){ ##Going through all fields here.
arr[$i] ##Creating arr with index of current column value here.
}
next ##next will skip all further statements from here.
}
($2 in arr) ##Checking condition if 2nd field is present in arr then print that line from table file.
' list FS="|" table ##mentioning Input_file(s) here and setting FS as | before table file.
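The lookup idea (load the list into a hash, then test the second field for membership) looks roughly like this in Python. It is a sketch only; filter_by_list is a hypothetical name and the data is passed in as strings to keep the example self-contained.

```python
def filter_by_list(list_text, table_text, sep="|"):
    """Keep the table lines whose second field appears in the list."""
    wanted = set(list_text.split())      # whitespace separated, over any number of lines
    kept = []
    for line in table_text.splitlines():
        fields = line.split(sep)
        if len(fields) > 1 and fields[1] in wanted:
            kept.append(line)
    return kept


list1 = "AAA BBB CCC DDD\n"
table1 = "\n".join([
    "33622|AAA", "33623|AAA", "33624|BBB", "33625|CCC", "33626|DDD",
    "33627|AAA", "33628|BBB", "33629|EEE", "33630|FFF",
])
print("\n".join(filter_by_list(list1, table1)))
```

Adding a category then only means adding a word to the list; the condition itself never changes.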
I have a column data as follows:
abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|
I want to search for specific keywords within the frames and extract only the data within the separators where a keyword is found.
Specific keywords are
enst.35
enst.18
enst.98
enst.63
The expected output is
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA
If no match is found, fill the output with NA. An id can occur multiple times in the same column, but I want to consider only the first occurrence.
I tried this here, but it was not working effectively. Can we do this with a bash script?
Could you please try the following, written and tested with the shown samples. List all the values you want to search for in Input_file in the variable values_to_be_searched, separated by commas.
awk -v values_to_be_searched="enst.35,enst.18,enst.98,enst.63" '
BEGIN{
FS=","
num=split(values_to_be_searched,array,",")
for(i=1;i<=num;i++){
values[array[i]]
}
}
{
found=""
for(i=1;i<=NF;i++){
for(k in values){
if(match($i,k)){
print $i
found=1
break
}
}
}
if(found==""){
print "NA"
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk -v values_to_be_searched="enst.35,enst.18,enst.98,enst.63" ' ##Creating variable values_to_be_searched which has all the values to be searched in it.
BEGIN{ ##Starting BEGIN section of this code from here.
FS="," ##Setting field separator as comma here.
num=split(values_to_be_searched,array,",") ##Splitting variable values_to_be_searched into an array here with delimiter comma.
for(i=1;i<=num;i++){ ##Running a for loop till value of num here.
values[array[i]] ##Creating array values which has index as value of array which are the keywords to be searched in Input_file.
}
}
{
found="" ##Nullifying found here.
for(i=1;i<=NF;i++){ ##Running a for loop till NF here.
for(k in values){ ##Traversing through values array here.
if(match($i,k)){ ##If match of value k found in current field then do following.
print $i ##Printing current field here, looks like a match of keyword is found in current field.
found=1 ##Setting found as 1 here.
break ##Using break to leave the inner (keyword) loop once this field has matched.
}
}
}
if(found==""){ ##Checking condition if found is NOT SET then do following.
print "NA" ##Printing NA here.
}
}
' Input_file ##Mentioning Input_file name here.
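For readers more comfortable with Python, here is a rough sketch of the same per-line search. It differs from the awk code in two ways, both assumptions made for this example: keywords are compared exactly against the |-delimited parts rather than as regular expressions, and only the first matching block of each line is kept. first_match_per_line is a made-up name.

```python
def first_match_per_line(lines, keywords):
    """For each line, return the first comma separated block holding a keyword, else NA."""
    wanted = set(keywords)
    results = []
    for line in lines:
        hit = "NA"
        for block in line.strip().split(","):
            # Compare against the |-delimited parts, so "." is not a wildcard.
            if wanted.intersection(block.split("|")):
                hit = block
                break                    # only the first occurrence counts
        results.append(hit)
    return results


data = [
    "abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|",
    "abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|",
    "abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|",
    "abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|",
]
print("\n".join(first_match_per_line(data, ["enst.35", "enst.18", "enst.98", "enst.63"])))
```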
Since pandas is tagged, you can try str.split followed by explode, then str.contains + reindex to get NaN for the rows with no match:
keywords = ['enst.35','enst.18','enst.98','enst.63']
s = df['Column'].str.split(',').explode()
s[s.str.contains('|'.join(keywords))].reindex(df.index)
0 abc|framex|gtk4|enst.35|pxc|h5g|
1 abc|frbx|hgk4|enst.18|pif|homg|
2 abc|frame|gtk|enst.98|pc|hg|
3 NaN
Name: Column, dtype: object
Note: Replace Column in the code with original column name.
Another way:
for STRING in enst.35 enst.18 enst.98 enst.63; do
tr \, \\n < file.txt | grep "$STRING" || echo NA
done
Output results in:
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA
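The loop above works per keyword rather than per line: each keyword is searched for across the whole file. A Python sketch of that variant follows, assuming plain substring matching and keeping only the first hit per keyword (grep would print every matching block and treats "." as any character). first_match_per_keyword is a hypothetical name.

```python
def first_match_per_keyword(text, keywords):
    """For each keyword, return the first comma separated block containing it, else NA."""
    # Flatten the file into one list of blocks, like tr , \n does.
    blocks = [b for line in text.splitlines() for b in line.split(",") if b]
    return [next((b for b in blocks if k in b), "NA") for k in keywords]


file_txt = "\n".join([
    "abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|",
    "abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|",
    "abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|",
    "abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|",
])
for found in first_match_per_keyword(file_txt, ["enst.35", "enst.18", "enst.98", "enst.63"]):
    print(found)
```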