Display range of rows from first column with awk

I have a table as shown below, and I just want to print all the Emp_Name values from the first column, starting from the second row.
Emp_Name Position Experience
Cara Senior 12
Doc Junior 6
Quinn Lead 14
Cedric Manager 18
Collen Junior 8
I know that awk '{print $1}' will print the first column of the table, but how do I skip the first row (i.e. the Emp_Name header) and print all the names from the second row through the last? The number of the last row is not known in advance.
Any help would be appreciated.

The requirement is not fully clear; in case you want to skip only the first row, try the following:
awk 'FNR>1' Input_file
OR, to print only the 1st column, use:
awk 'FNR>1{print $1}' Input_file
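For the sample table above, the second command prints just the names (a quick check, assuming the table is saved as Input_file):
Cara
Doc
Quinn
Cedric
Collen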
In case you do not know in which field Emp_Name will come, and you want to look up its column number from the 1st row AND DO NOT want to print that column for the rest of the rows, then try the following:
awk '
BEGIN{
  OFS="\t"
}
FNR==1{
  for(i=1;i<=NF;i++){
    if($i=="Emp_Name"){
      val=i
      next
    }
  }
}
{
  for(i=1;i<=NF;i++){
    if(i==val){
      continue
    }
    else{
      value=(value?value OFS:"")$i
    }
  }
  print value
  value=""
}
' Input_file
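For the sample table this prints every column except Emp_Name, tab-separated; note that the header row itself is skipped by the next statement:
Senior	12
Junior	6
Lead	14
Manager	18
Junior	8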

Related

print the count of lines from one file and the total from another file

I have a directory called Stem; in that directory I have two files called result.txt and title.txt, as below:
result.txt:
Column1 Column2 Column3 Column4
----------------------------------------------------------
Setup First Second Third
Setdown Fifth Sixth Seven
setover Eight Nine Ten
Setxover Eleven Twelve Thirteen
Setdrop Fourteen Fifteen sixteen
title.txt:
Column1 Column2 Column3 Column4
----------------------------------------------------------
result 20 40 60
result1 40 80 120
Total: 60 120 180
I need to count the number of lines, excluding the first two, in the first file (result.txt), and from the second file (title.txt) I need the Column3 value of the Total: line. I need to get output like below:
Stem : 5 120
I used this script, but I am not getting the exact output.
#!/bin/bash
for d in stem; do
  echo "$d"
  File="result.txt"
  File1="title.txt"
  awk 'END{print NR - 2}' "$d"/"$File"
  awk '/Total/{print $(NF-1);exit}' "$d"/"$File1"
done
EDIT: Since the question was not clear about which value exactly is needed, a previous revision of this answer summed the 2nd columns; in case you need the 2nd-last field value of the line that has the Total: keyword in it, then try the following:
awk '
FNR==NR{
  tot=FNR
  next
}
/Total:/{
  sum=$(NF-1)
}
END{
  print "Stem : ",tot-2,sum+0
}
' result.txt title.txt
Explanation: a detailed explanation of the above.
awk '                         ##Starting awk program from here.
FNR==NR{                      ##Condition FNR==NR is TRUE while the first file, result.txt, is being read.
  tot=FNR                     ##Keeping tot equal to FNR, so it ends up as the line count of result.txt.
  next                        ##next will skip all further statements from here.
}
/Total:/{                     ##If a line contains Total: then do the following.
  sum=$(NF-1)                 ##Creating sum, which holds the 2nd-last field of the current line.
}
END{                          ##Starting END block of this program from here.
  print "Stem : ",tot-2,sum+0 ##Printing the Stem string, tot-2 and the sum value here.
}
' result.txt title.txt        ##Mentioning Input_file names here.
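For the sample files above, result.txt has 7 lines (so tot-2 is 5) and the 2nd-last field of the Total: line is 120, so the script prints:
Stem :  5 120
The double space after the colon comes from the trailing space inside the "Stem : " literal plus the default output field separator; drop that trailing space if exact spacing matters.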

search by keywords and extract the phrase within the delimiter

I have a column data as follows:
abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|
I want to search for specific keywords within each line and extract only the pipe-delimited frame that contains them.
The specific keywords are:
enst.35
enst.18
enst.98
enst.63
The expected output is
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA
If a match is not found, fill with NA in the output. There can be multiple occurrences of an id in the same column, but I want to consider only the first occurrence.
I tried this here but it was not working effectively. Can we do this with a bash script?
Could you please try the following, written and tested with the shown samples. Mention all the values you want to search for in Input_file in the variable values_to_be_searched, delimited by commas.
awk -v values_to_be_searched="enst.35,enst.18,enst.98,enst.63" '
BEGIN{
  FS=","
  num=split(values_to_be_searched,array,",")
  for(i=1;i<=num;i++){
    values[array[i]]
  }
}
{
  found=""
  for(i=1;i<=NF;i++){
    for(k in values){
      if(match($i,k)){
        print $i
        found=1
        break
      }
    }
  }
  if(found==""){
    print "NA"
  }
}
' Input_file
Explanation: a detailed explanation of the above code.
awk -v values_to_be_searched="enst.35,enst.18,enst.98,enst.63" ' ##Creating variable values_to_be_searched which has all the values to be searched in it.
BEGIN{                    ##Starting BEGIN section of this code from here.
  FS=","                  ##Setting the field separator as comma here.
  num=split(values_to_be_searched,array,",") ##Splitting variable values_to_be_searched into an array on the comma delimiter.
  for(i=1;i<=num;i++){    ##Running a for loop till the value of num here.
    values[array[i]]      ##Creating array values whose indices are the values of array, i.e. the keywords to be searched in Input_file.
  }
}
{
  found=""                ##Nullifying found here.
  for(i=1;i<=NF;i++){     ##Running a for loop till NF here.
    for(k in values){     ##Traversing through the values array here.
      if(match($i,k)){    ##If a match of value k is found in the current field then do the following.
        print $i          ##Printing the current field here, since a keyword matched in it.
        found=1           ##Setting found to 1 here.
        break             ##Using break to come out of the loop and save some cycles here.
      }
    }
  }
  if(found==""){          ##If found is NOT SET then do the following.
    print "NA"            ##Printing NA here.
  }
}
' Input_file              ##Mentioning Input_file name here.
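One caveat worth noting: match($i,k) treats k as a dynamic regex, so the dot in enst.35 matches any character (it would also match, say, enst135). If the keywords should be matched literally, a minimal tweak is to replace the match test with index, which performs a plain substring search:
if(index($i,k)){          ##index() looks for k as a literal substring, so "." is not a wildcard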
Since pandas is tagged, you can try str.split followed by explode and then str.contains + reindex to get NaN for the missing rows:
keywords = ['enst.35','enst.18','enst.98','enst.63']
s = df['Column'].str.split(',').explode()
s[s.str.contains('|'.join(keywords))].reindex(df.index)
0 abc|framex|gtk4|enst.35|pxc|h5g|
1 abc|frbx|hgk4|enst.18|pif|homg|
2 abc|frame|gtk|enst.98|pc|hg|
3 NaN
Name: Column, dtype: object
Note: Replace Column in the code with the original column name. Note also that str.contains interprets its pattern as a regex by default, so the dots in the keywords match any character; pass the keywords through re.escape first if literal matching is needed.
Another way, with tr and grep (grep -F would match the keywords literally, and GNU grep's -m 1 option would keep only the first occurrence per keyword):
for STRING in enst.35 enst.18 enst.98 enst.63; do
  tr \, \\n < file.txt | grep "$STRING" || echo NA
done
The output is:
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA

Applying an awk operation to a specific column

I have a file which lines look like this:
chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117
I now want to edit the last column to remove duplicates, so that it would only be SGIP1;MIR3117.
If I only have the last column, I can use the following awk code to remove the duplicates.
a="SGIP1;SGIP1;SGIP1;SGIP1;MIR3117"
echo "$a" | awk -F";" '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
This returns SGIP1;MIR3117;
However, I cannot figure out how to use this so that it only affects my fifth column. If I just pipe in the whole line, I get SGIP1 two times, as awk then treats everything in front of the first semicolon as one column.
Is there an elegant way to do this?
Could you please try the following.
awk '
{
  num=split($NF,array,";")
  for(i=1;i<=num;i++){
    if(!found[array[i]]++){
      val=(val?val ";":"")array[i]
    }
  }
  $NF=val
  val=""
  delete found
}
1
' Input_file
Explanation: a detailed explanation of the above code.
awk '                       ##Starting awk program from here.
{
  num=split($NF,array,";")  ##Using the split function of awk to split the last field ($NF) of the current line into an array named array on the ";" delimiter.
  for(i=1;i<=num;i++){      ##Running a loop from i=1 till the total number of elements of array here.
    if(!found[array[i]]++){ ##If this element of array is NOT yet present in the found array then do the following.
      val=(val?val ";":"")array[i] ##Creating variable val and appending each element of array that satisfies the above condition.
    }
  }
  $NF=val                   ##Setting val as the value of the last field of the current line here.
  val=""                    ##Nullifying variable val here.
  delete found              ##Clearing found so that duplicates are tracked per line rather than across the whole file.
}
1                           ##1 will print the edited/non-edited line here.
' Input_file                ##Mentioning Input_file name here.
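For the sample line this prints, as a quick check:
chr1 66999275 67216822 + SGIP1;MIR3117
Rebuilding $NF makes awk reassemble the record with the default output field separator (a single space), which matches the input spacing here.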
I don't consider it "elegant", and it works under a certain number of assumptions (for instance, that "+" occurs exactly once per line). Since for (s in a) visits elements in an unspecified order, an indexed loop is safer:
awk -F"+" '{printf("%s+ ",$1); n=split($2,a,";"); for(i=1;i<=n;i++){gsub(" ","",a[i]); if(!c[a[i]]++) printf("%s;",a[i])} print ""; delete c}' test.txt
Tested on your input, returns:
chr1 66999275 67216822 + SGIP1;MIR3117;

Add a new column with the number of times the same value was found in 2 columns

Add a new column whose value is the number of times columns 1 and 2 contain exactly the same pair of values.
Input file:
46849,39785,2,012,023,351912.29,2527104.70,174.31
46849,39785,2,012,028,351912.45,2527118.70,174.30
46849,39785,3,06,018,351912.12,2527119.51,174.33
46849,39785,3,06,020,351911.80,2527105.83,174.40
46849,39797,2,012,023,352062.45,2527118.50,173.99
46849,39797,2,012,028,352062.51,2527105.51,174.04
46849,39797,3,06,020,352063.29,2527116.71,174.13,
46849,39809,2,012,023,352211.63,2527104.81,173.74
46849,39809,2,012,028,352211.21,2527117.94,173.69
46849,39803,2,012,023,352211.63,2527104.81,173.74
46849,39803,2,012,028,352211.21,2527117.94,173.69
46849,39801,2,012,023,352211.63,2527104.81,173.74
Expected output file:
4,46849,39785,2,012,023,351912.29,2527104.70,174.31
4,46849,39785,2,012,028,351912.45,2527118.70,174.30
4,46849,39785,3,06,018,351912.12,2527119.51,174.33
4,46849,39785,3,06,020,351911.80,2527105.83,174.40
3,46849,39797,2,012,023,352062.45,2527118.50,173.99
3,46849,39797,2,012,028,352062.51,2527105.51,174.04
3,46849,39797,3,06,020,352063.29,2527116.71,174.13,
2,46849,39809,2,012,023,352211.63,2527104.81,173.74
2,46849,39809,2,012,028,352211.21,2527117.94,173.69
2,46849,39803,2,012,023,352211.63,2527104.81,173.74
1,46849,39803,2,012,028,352211.21,2527117.94,173.69
1,46849,39801,2,012,023,352211.63,2527104.81,173.74
My attempt:
awk -F, '{x[$1 $2]++}END{ for(i in x) {print i,x[i]}}' file
4684939785 4
4684939797 3
4684939801 1
4684939803 2
4684939809 2
Could you please try the following.
awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{
  a[$1,$2]++
  next
}
{
  print a[$1,$2],$0
}
' Input_file Input_file
Explanation: this reads Input_file twice. On the first pass it creates an array named a, indexed by the first and second fields, incrementing the count on each occurrence. On the second pass it prints the count for the first 2 fields and then the whole line.
One-liner version:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1,$2]++;next} {print a[$1,$2],$0}' Input_file Input_file
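For the sample input the first few output lines look like this (the pair 46849,39785 occurs 4 times and 46849,39797 occurs 3 times):
4,46849,39785,2,012,023,351912.29,2527104.70,174.31
4,46849,39785,2,012,028,351912.45,2527118.70,174.30
4,46849,39785,3,06,018,351912.12,2527119.51,174.33
4,46849,39785,3,06,020,351911.80,2527105.83,174.40
3,46849,39797,2,012,023,352062.45,2527118.50,173.99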

awk NR returns the wrong total number of lines

When awk NR was used to get the total number of lines of a file, the wrong number was returned. Could you help to find out what happened?
File 'test.txt' contents :
> 2012 09 10 30.0 8 14
fdafadf
> 2013 08 11 05.0 9 1.5
fdafa
> 2011 01 12 02.0 7 1.2
daff
The goal is to get the average of the last column of the records beginning with '>'.
Code:
awk 'BEGIN{SUM=0}/^> /{SUM=SUM+$6}END{print SUM/NR}' test.txt
With this code, the wrong mean of the last column was obtained, because NR is the total number of lines (6 here) rather than the expected 3, the count of '>' records. How can I get the right result with awk? Thanks.
Could you please try the following. It takes the sum of the last column of every '>' line, and keeps doing so until the Input_file has been read completely. It also counts the number of '>' lines, because the average is the sum divided by the count; in the END block of awk we divide the two to get the average.
awk 'BEGIN{sum=0;count=0}/^>/{sum+=$NF;count++} END{print "avg="sum/count}' Input_file
If you want the average of the 6th column instead, use $6 in place of $NF in the above code; for the sample data the 6th columns are 8, 9 and 7, so that variant would print avg=8.
Explanation: the following is added for explanation purposes only.
awk '             ##Starting awk command/script here.
/^>/{             ##If a line starts with > then do the following.
  sum+=$NF        ##Creating a variable named sum, adding the last field ($NF) of the current line into it.
  count++         ##Creating a variable named count whose value increments by 1 each time the cursor comes here.
}
END{              ##END block of awk code here.
  print "avg="sum/count ##Printing the string avg= followed by the result of dividing sum by count.
}
' Input_file      ##Mentioning Input_file name here.
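For the sample test.txt, the '>' records carry last fields 14, 1.5 and 1.2, so with awk's default numeric formatting this prints:
avg=5.56667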