Extract rows in file where a column value is included in a list? - awk

I have a huge file of data:
datatable.txt
id1 england male
id2 germany female
... ... ...
I have another list of data:
indexes.txt
id1
id3
id6
id10
id11
I want to extract all rows from datatable.txt where the id is included in indexes.txt.
Is it possible to do this with awk/sed/grep? The files are so large that using R or Python is not convenient.

You just need a simple awk for this:
awk 'FNR==NR {a[$1]; next} $1 in a' indexes.txt datatable.txt
id1 england male
FNR==NR {a[$1]; next} runs only while reading the first file, indexes.txt, and stores each id from its first column as a key of the array a.
Then, while reading datatable.txt, the condition $1 in a is true (and the row is printed) whenever the row's first column appears as a key in a, i.e. whenever that id was listed in indexes.txt.
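For reference, the same two-pass idea can be sketched in plain Python; the id3/france row below is made up so the sample has more than one match:

```python
# Plain-Python sketch of the awk two-pass filter: collect the wanted
# ids first, then keep only the data rows whose first column matches.

indexes = ["id1", "id3", "id6"]            # lines of indexes.txt
datatable = [                              # lines of datatable.txt
    "id1 england male",
    "id2 germany female",
    "id3 france male",                     # made-up third row
]

wanted = set(indexes)                      # plays the role of the awk array a
matched = [row for row in datatable if row.split()[0] in wanted]
print("\n".join(matched))
```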

Maybe I'm overlooking something, but I built two test files:
a1:
id1
id2
id3
id6
id9
id10
and
a2:
id1 a 1
id2 b 2
id3 c 3
id4 c 4
id5 e 5
id6 f 6
id7 g 7
id8 h 8
id9 i 9
id10 j 10
with
join a1 a2 2> /dev/null
I get all lines of a2 whose first column matches an id in a1. Note that join expects both inputs to be sorted lexically on the join field; the 2> /dev/null hides the out-of-order warning here, and on unsorted input join can silently miss matches.

Related

awk to filter lines in a file based on match and conditional of another file

I have a file with this format:
file1
id1 12.4
id2 21.6
id4 17.3
id6 95.5
id7 328.6
And I want to filter it based on another file with the format:
file2
id1 11.5
id2 10.4
id3 58.4
id4 24.6
id5 234.4
id6 2.5
id7 330.6
First, I would like to match ids between files. Then, I want to keep the lines in file1 in which the score (second column) is greater than the score in file2. It would output this:
id1 12.4
id2 21.6
id6 95.5
I started writing the code like awk 'FNR==NR { a[$1][$2][$0]; next } $1 in a {}' file1 file2 which I think would match the ids between files, but I don't know how to complete the code to filter by the scores.
You could have awk read file2 first and keep track of its scores by setting a[$1] = $2+0 (adding zero forces a numeric value for the comparison).
Then you can do the comparison while reading file1:
awk 'FNR==NR { a[$1] = $2+0; next } $1 in a && $2+0 > a[$1]' file2 file1
Output
id1 12.4
id2 21.6
id6 95.5
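The same lookup-then-compare logic, sketched in plain Python with the question's data:

```python
# Sketch of the same comparison: read file2 into a dict of scores,
# then keep file1 rows whose score is strictly greater.

file2_scores = {"id1": 11.5, "id2": 10.4, "id3": 58.4, "id4": 24.6,
                "id5": 234.4, "id6": 2.5, "id7": 330.6}
file1_rows = [("id1", 12.4), ("id2", 21.6), ("id4", 17.3),
              ("id6", 95.5), ("id7", 328.6)]

kept = [(ident, score) for ident, score in file1_rows
        if ident in file2_scores and score > file2_scores[ident]]
for ident, score in kept:
    print(ident, score)
```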

reshape a data frame pandas

I have:
data1=['id1','id2','id3','id1','id5']
data2=['','A','','B','']
data3=['m1','m1','m1','m2','m2']
data4=['1.22','sd','EUR','1.456','GB1234']
pd.DataFrame({'identifier':data1,'name':data2,'grp':data3,'value':data4})
identifier name grp value
0 id1 m1 1.22
1 id2 A m1 sd
2 id3 m1 EUR
3 id1 B m2 1.456
4 id5 m2 GB1234
I want:
id1 id2 id3 id5
A 1.220 sd EUR
B 1.456 GB1234
Any suggestions?
My real data has 109 identifiers, 6k names, 1k groups
Some notes:
There is the potential for identifiers to be the same but in different groups
In the end, I would like to have the identifiers as columns, the names as the index, and the values as the cell values
I tried df2=df.pivot(values='value',columns='field',index='ticker')
and got the error: ValueError: Index contains duplicate entries, cannot reshape
I tried reshaping a data frame in pandas but it is a little different
I think you need DataFrame.pivot_table with an aggregation function first; then, to replace the group labels by the first non-empty string in name, add rename:
s = df.assign(name = df['name'].replace('', np.nan)).groupby('grp')['name'].first()
df2 = df.pivot_table(values='value',
                     columns='identifier',
                     index='grp',
                     aggfunc='first').rename(s)
print (df2)
identifier id1 id2 id3 id5
grp
A 1.22 sd EUR NaN
B 1.456 NaN NaN GB1234
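Put together with the question's data, a self-contained version of this answer (imports added) looks like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "identifier": ["id1", "id2", "id3", "id1", "id5"],
    "name":       ["",    "A",   "",    "B",   ""],
    "grp":        ["m1",  "m1",  "m1",  "m2",  "m2"],
    "value":      ["1.22", "sd", "EUR", "1.456", "GB1234"],
})

# map each group to its first non-empty name: m1 -> A, m2 -> B
s = df.assign(name=df["name"].replace("", np.nan)).groupby("grp")["name"].first()

# pivot, then rename the group index with those names
df2 = df.pivot_table(values="value", columns="identifier",
                     index="grp", aggfunc="first").rename(s)
print(df2)
```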

Find symmetrical entries dataframe, if not delete entry

I have the following data.
ID1 ID2 Value
1 2 5.5
2 1 10
1 3 5
Expected output:
ID1 ID2 Value
1 2 5.5
2 1 10
I only want to keep a row when I also have a value for the symmetrical entry. If I only have an entry e.g. with ID1=1 and ID2=3, but no entry for ID1=3 and ID2=1, then I want to delete that data row. How can I do this with pandas?
If all pairs of values in columns ID1 and ID2 are unique, first create a helper DataFrame with np.sort (so that (1, 2) and (2, 1) become identical) and then return all duplicated rows with DataFrame.duplicated:
df1 = pd.DataFrame(np.sort(df[['ID1','ID2']], axis=1), index=df.index)
df = df[df1.duplicated(keep=False)]
print (df)
ID1 ID2 Value
0 1 2 5.5
1 2 1 10.0
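A self-contained sketch of this answer with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID1": [1, 2, 1], "ID2": [2, 1, 3],
                   "Value": [5.5, 10, 5]})

# sort each (ID1, ID2) pair so that (1, 2) and (2, 1) look identical,
# then keep only the rows whose sorted pair occurs more than once
df1 = pd.DataFrame(np.sort(df[["ID1", "ID2"]], axis=1), index=df.index)
df = df[df1.duplicated(keep=False)]
print(df)
```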

print consecutive lines conditional on two fields and subtract another field

I would like to print consecutive lines if they have a matching first field but opposite sign in the third field, then compute the difference between the second fields of those consecutive lines.
Input:
id1 pos1 0.19
id1 pos2 0.33
id1 pos3 -0.25
id1 pos4 -0.22
id2 pos5 0.33
id3 pos6 -0.21
id3 pos7 -0.56
id3 pos8 -0.20
id3 pos9 0.33
id3 pos10 -0.32
Intermediate output:
id1 pos2 0.33
id1 pos3 -0.25
id3 pos8 -0.20
id3 pos9 0.33
id3 pos10 -0.32
Desired output:
id1 pos3-pos2
id3 pos9-pos8
id3 pos10-pos9
I found similar questions comparing consecutive lines but none can be applied to answer my question.
So far I tried:
awk '$1==prev1{$NF=$2-prev2;print $1,$NF} {prev2=$2;prev1=$1}'
But I do not know how to add the condition that the third field must have the opposite sign.
Could you please try the following.
awk '
prev!=$1{
prev_val=prev=""
}
prev==$1{
if(($NF~/^-/ && prev_val!~/^-/) || ($NF!~/^-/ && prev_val~/^-/)){
print $1,$2,$NF-prev_val
}
}
{
prev=$1
prev_val=$NF
}
' Input_file
From your description this awk should do:
awk '{sc=$3~/^-/?0:1} $1==p1&&sp!=sc {print $1,($3-p3)} {sp=sc;p1=$1;p3=$3}' file
id1 -0.58
id3 0.53
id3 -0.65
sc=$3~/^-/?0:1 sets sc to 1 if the value is positive and to 0 if it is negative
$1==p1&&sp!=sc checks whether the current ID equals the previous ID and the value changed sign; if so,
print $1,($3-p3) prints the ID and the difference between the current and previous value.
sp=sc;p1=$1;p3=$3 saves the current line for the next one: sp to sc, p1 to $1 and p3 to $3
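The same sign-change scan, sketched in plain Python with the question's input:

```python
# Remember the previous line's id and value; report whenever the id
# repeats and the sign of the value flips (zero counts as positive,
# like the /^-/ test in the awk answer).

lines = [("id1", "pos1", 0.19), ("id1", "pos2", 0.33),
         ("id1", "pos3", -0.25), ("id1", "pos4", -0.22),
         ("id2", "pos5", 0.33), ("id3", "pos6", -0.21),
         ("id3", "pos7", -0.56), ("id3", "pos8", -0.20),
         ("id3", "pos9", 0.33), ("id3", "pos10", -0.32)]

pairs = []
prev = None
for ident, pos, val in lines:
    if prev and prev[0] == ident and (prev[2] < 0) != (val < 0):
        pairs.append((ident, val - prev[2]))
    prev = (ident, pos, val)
for ident, diff in pairs:
    print(ident, round(diff, 2))
```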
awk 'prev1!=$1{
prev3=prev2=prev1=""
}
prev1==$1{
if(($3~/^-/ && prev3!~/^-/) || ($3!~/^-/ && prev3~/^-/)){
print $1,$2-prev2
}
}
{
prev1=$1
prev2=$2
prev3=$3
}
' Input
This is the answer to my question. Thanks to all for helping me.

Looping through a file while skipping the blank rows

I have a file in the following format
value value 17 -1 1234 4567 value id1
value value 17 -1 2345 4580 value id1
value value 17 -1 2344 4654 value id1

value value 1 1 1234 4567 value id2
value value 1 1 3445 3455 value id2

value value 1 1 2341 2345 value id3
value value 1 1 1245 4567 value id3
value value 1 1 3234 5634 value id3
value value 1 1 3412 4512 value id3
I want to retrieve the following information for each group of lines between the blank rows:
for eg for id1:
17 -1 1234 4654 id1
for id2:
1 1 1234 3455 id2
i.e. for each id (last column) I would like to retrieve the 5th column of the first line in that group and the 6th column of the last line in that group (the lines are grouped by ids).
Something like this may do the work for you
$ awk '/^$/{print col3, col4, col5, col6, idval; next} $8 != idval{idval = $8; col3=$3; col4=$4; col5=$5} {col6=$6} END{print col3, col4, col5, col6, idval}' input
17 -1 1234 4654 id1
1 1 1234 3455 id2
With GNU awk
awk -vRS= -vFS='\n' '{split($1, a, /[[:blank:]]+/);
split($NF, b, /[[:blank:]]+/);
print a[3], a[4], a[5], b[6], a[8]}' file
17 -1 1234 4654 id1
1 1 1234 3455 id2
1 1 2341 4512 id3
Here is another awk
awk -vRS= '{print $3,$4,$5,$(NF-2),$8}' file
17 -1 1234 4654 id1
1 1 1234 3455 id2
1 1 2341 4512 id3
This treats every blank-separated block as one record, then prints fields 3, 4 and 5, the third field from the end, and field 8.
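For comparison, the per-group extraction can be sketched in Python with itertools.groupby on the last field (using an abridged version of the question's data):

```python
from itertools import groupby

# abridged version of the question's data, split into fields
rows = [line.split() for line in [
    "value value 17 -1 1234 4567 value id1",
    "value value 17 -1 2345 4580 value id1",
    "value value 17 -1 2344 4654 value id1",
    "value value 1 1 1234 4567 value id2",
    "value value 1 1 3445 3455 value id2",
]]

out = []
for ident, group in groupby(rows, key=lambda r: r[-1]):
    group = list(group)
    first, last = group[0], group[-1]
    # columns 3-5 from the first line, column 6 from the last line
    out.append(" ".join([first[2], first[3], first[4], last[5], ident]))
print("\n".join(out))
```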