equalize length of the temperature data - pandas

I have two text files (file1.txt, file2.txt) that contain a timestamp in Julian day in the first column and temperature data in the second column. Based on the timestamps in file1.txt, I have to extend file2.txt by filling in zeros for the missing days, so that file2.txt has the same length as file1.txt.
Input data
cat file1.txt
023 4.5
024 6.8
025 9.8
030 2.3
125 1.4
129 5.8
168 1.0
cat file2.txt
024 1.2
025 2.3
125 1.6
output
023 0.0
024 1.2
025 2.3
030 0.0
125 1.6
129 0.0
168 0.0
In my code I am unable to write the main part that does the work.
I tried:
import numpy as np
import pandas as pd
data1=np.loadtxt("data1.txt")
data2=np.loadtxt("data2.txt")
if data1==data2:
    print('same length data')
else:
    ............

You can try this in awk:
awk 'FNR==NR{
    f2[$1]=$0
    next
}
$1 in f2{print f2[$1]; next}
{printf("%s%s%1.1f\n", $1, OFS, 0.0)}
' file2.txt file1.txt
Or this in Python:
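fn1, fn2 = "file1.txt", "file2.txt"

# Index file2 by its first field (the Julian day).
f2_data={}
with open(fn2) as f2:
    for line in f2:
        line=line.strip()
        field1, field2=line.split()
        f2_data[field1]=line

# Walk file1 in order: print the file2 line if the day exists, else a 0.0 filler.
with open(fn1) as f1:
    for line in f1:
        field1, field2=line.strip().split()
        if field1 in f2_data:
            print(f2_data[field1])
        else:
            print(field1, '0.0')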
Either prints:
023 0.0
024 1.2
025 2.3
030 0.0
125 1.6
129 0.0
168 0.0
In both cases the strategy is the same:
Build an index of file2 first, keyed on the Julian day, to see what gets printed from that file;
Walk file1 and print the Julian day and 0.0 to fill in for any date not in file2.

Related

How to save different file from one file using value in specific column using bash

I want to split a file by the value in column $1: lines that share the same value in $1 should be saved in one file, and each different value should be saved in its own new file.
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
115 15.7 110.8 2.06 0.72
118 11.9 111.01 -1.04 0.98
What I want is:
file1=p004112.txt
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
file2=p004113.txt
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
file3=p004115.txt
115 15.7 110.8 2.06 0.72
file4=p004118.txt
118 11.9 111.01 -1.04 0.98
The files that have to be split like this are named p004.txt, p005.txt, and so on.
I have tried this:
for i in `ls p????.txt|sed "s/.txt//g"`;do awk '{file=${i}$1".txt" print >> file}' ${i}.txt;done
but it doesn't work. Does anyone have a solution to this problem?
Thank you
With your shown samples, please try the following. Although your samples appear to be sorted on the 1st column, I have still used sort in case the real file is not. If your whole file is already sorted on the 1st column, remove the sort command and pass Input_file as the last argument to the awk program instead.
sort -k1 Input_file | awk 'prev!=$1{close(outputFile);outputFile=("p004"$1".txt")} {print > (outputFile);prev=$1}'
OR a non-oneliner form of above solution:
sort -k1 Input_file |
awk '
prev!=$1{
    close(outputFile)
    outputFile=("p004"$1".txt")
}
{
    print > (outputFile)
    prev=$1
}
'
Explanation: First, sort Input_file on its 1st column and pipe the result into awk. In the awk program, each time the 1st column differs from the previous line's value, close the current output file (to avoid a "too many open files" error) and set outputFile to p004 followed by the 1st column value and .txt, as the OP needs. Then print each line into the current output file and set prev to that line's 1st column value. Note that the p004 prefix is hardcoded here.
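For readers who would rather do this outside awk, here is a minimal Python sketch of the same idea: group lines by their first field and write each group to its own file. The function name split_by_first_column, the prefix argument, and the temporary output directory are assumptions made for illustration; they are not part of the original answer. Unlike the awk version, it collects the groups in memory first, so it needs no sorted input.

```python
import os
import tempfile
from collections import defaultdict


def split_by_first_column(lines, prefix, out_dir):
    """Write each line to <out_dir>/<prefix><first field>.txt (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            groups[fields[0]].append(line.rstrip("\n"))
    written = []
    for key, rows in groups.items():
        path = os.path.join(out_dir, prefix + key + ".txt")
        with open(path, "w") as out:
            out.write("\n".join(rows) + "\n")
        written.append(path)
    return written


# Sample rows taken from the question.
sample = [
    "112 14.7 114.98 -0.92 -0.12",
    "112 14.8 114.02 -0.78 0.76",
    "112 14.1 114.99 -0.98 -0.11",
    "113 12.5 111.77 1.87 -1.88",
    "113 12.6 111.89 -0.98 -1.65",
    "115 15.7 110.8 2.06 0.72",
    "118 11.9 111.01 -1.04 0.98",
]
out_dir = tempfile.mkdtemp()  # assumed scratch location for the demo
paths = split_by_first_column(sample, "p004", out_dir)
names = sorted(os.path.basename(p) for p in paths)
print(names)
```

To handle p004.txt, p005.txt and so on, the prefix could be derived from each input file name instead of being passed as a constant.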

averaging multiple columns in awk excluding null value

I need to average all the columns in this file from column 3 to the last, excluding row 1:
table
jd h 3 5 8 10 11 14 15
79 1 52.0 51.0 58.0 45.0 59.0 20.0 27
79 2 52.0 51.0 58.0 45.0 59.0 20.0 -999.0
79 3 52.0 51.0 58.0 45.0 59.0 20.0 -999.0
79 4 -999.0 51.0 58.0 45.0 59.0 20.0 -999.0
Data transcribed by Chet.
This script works fine:
cat myfile.txt | awk ' NR>1{for (i=3;i<=NF;i++){a[i]+=$i;}} END {for (i=3;i<=NF;i++) {print a[i]/(NR-1)}}' > myoutput.txt
The problem is that some columns contain null values (marked as "-999.0"), which I want to exclude from the average.
Any suggestion will be helpful.
awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++; } }
END { for (i = 3; i <= NF; i++) print i, sum[i], num[i], sum[i]/num[i] }' \
myfile.txt > myoutput.txt
This sums only the valid field values and counts them for each column separately, so each column's sum is divided by its own number of valid rows. The printing at the end identifies the field, the raw data (sum, number) and the average.
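The same per-column bookkeeping can be sketched in Python. This is only an illustration under stated assumptions: the function name column_means and its parameters are made up here, and columns are reported with awk-style 1-based field numbers. A column whose values are all missing is simply left out of the result rather than dividing by zero.

```python
def column_means(rows, first_field=3, missing=-999.0):
    """Average each field from first_field onward, skipping the missing marker."""
    sums = {}
    counts = {}
    for row in rows:
        # enumerate from 1 so the keys match awk's $1, $2, ... numbering
        for field_no, text in enumerate(row.split(), start=1):
            if field_no < first_field:
                continue
            value = float(text)
            if value == missing:
                continue
            sums[field_no] = sums.get(field_no, 0.0) + value
            counts[field_no] = counts.get(field_no, 0) + 1
    return {n: sums[n] / counts[n] for n in sorted(sums)}


# Data rows from the question (header row already dropped).
rows = [
    "79 1 52.0 51.0 58.0 45.0 59.0 20.0 27",
    "79 2 52.0 51.0 58.0 45.0 59.0 20.0 -999.0",
    "79 3 52.0 51.0 58.0 45.0 59.0 20.0 -999.0",
    "79 4 -999.0 51.0 58.0 45.0 59.0 20.0 -999.0",
]
means = column_means(rows)
for field_no, mean in means.items():
    print(field_no, mean)
```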

extract specific lines based on another file

I have a folder containing text files. I need to extract specific lines from the files in this folder based on another file, input.txt. I tried the following code, but it doesn't work.
awk '
NR==FNR{
if(NF>1)f=$3
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
file1
PHE .0 10 .0 0
ASP 59.8 11 59.8 0
LEU 66.8 15 66.8 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
THR 33.2 50 33.2 0
SER 10.2 51 .9 9.3
input.txt
*file1
10
16
*file2
43
44
49
Desired output
file1
PHE 0 10 0 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
On line 3, $3 needs to be changed to $2.
Since the asterisk is the field separator for input.txt, the empty string before it is counted as $1 and the file name that comes after it is $2.
awk '
NR==FNR{
if(NF>1)f=$2
else A[f,$1]
next
}
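The field-numbering point can be checked outside awk as well. The Python sketch below is an illustration only: it splits a line on "*" to show the empty leading field, then applies the same (file name, third column) lookup to sample rows from the question. The names parse_selection and select_rows are hypothetical.

```python
def parse_selection(lines):
    """Build a set of (file name, key) pairs from input.txt-style lines."""
    wanted = set()
    current = None
    for line in lines:
        fields = line.strip().split("*")
        if len(fields) > 1:
            # fields[0] is the empty string before "*"; the name is fields[1]
            current = fields[1]
        elif fields[0]:
            wanted.add((current, fields[0]))
    return wanted


def select_rows(filename, rows, wanted):
    """Keep rows whose third whitespace-separated field is listed for filename."""
    kept = []
    for row in rows:
        fields = row.split()
        if len(fields) >= 3 and (filename, fields[2]) in wanted:
            kept.append(row)
    return kept


header_fields = "*file1".split("*")
print(header_fields)

selection = parse_selection(["*file1", "10", "16", "*file2", "43", "44", "49"])
file1_rows = [
    "PHE .0 10 .0 0",
    "ASP 59.8 11 59.8 0",
    "LEU 66.8 15 66.8 0",
    "ARG 21.0 16 21.0 0",
]
kept = select_rows("file1", file1_rows, selection)
print(kept)
```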

AWK print next line of match between matches

Let's presume I have a file test.txt with the following data:
.0
41
0.0
42
0.0
43
0.0
44
0.0
45
0.0
46
0.0
START
90
34
17
34
10
100
20
2056
30
0.0
10
53
20
2345
30
0.0
10
45
20
875
30
0.0
END
0.0
48
0.0
49
0.0
140
0.0
With AWK, how would I print the lines after 10 and 20 between START and END?
So the output would be:
100
2056
53
2345
45
875
I was able to get the lines with 10 and 20 with
awk '/START/,/END/ {if($0==10 || $0==20) print $0}' test.txt
but how would I get the next lines?
I actually got what I wanted with
awk '/^START/,/^END/ {if($0==10 || $0==20) {getline; print} }' test.txt
Range in awk works fine, but is less flexible than using flags.
awk '/^START/ {f=1} /^END/ {f=0} f && /^(1|2)0$/ {getline;print}' file
100
2056
53
2345
45
875
Don't use ranges as they make trivial things slightly briefer but require a complete rewrite or duplicate conditions when things get even slightly more complicated.
Don't use getline unless it's an appropriate application and you have read and fully understand http://awk.info/?tip/getline.
Just let awk read your lines as designed:
$ cat tst.awk
/START/ { inBlock=1 }
/END/ { inBlock=0 }
foundTgt { print; foundTgt=0 }
inBlock && /^[12]0$/ { foundTgt=1 }
$ awk -f tst.awk file
100
2056
53
2345
45
875
Feel free to use single-character variable names and cram it all onto one line if you find that useful:
awk '/START/{b=1} /END/{b=0} f{print;f=0} b&&/^[12]0$/{f=1}' file
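The flag approach carries over directly to other languages. As a rough Python sketch (the function name lines_after_targets and the shortened sample are assumptions for illustration), the same two flags can be kept while reading one line at a time, with the checks in the same order as the awk rules:

```python
def lines_after_targets(lines, targets=("10", "20")):
    """Return the line following each target line inside a START..END block."""
    in_block = False
    found_target = False
    result = []
    for raw in lines:
        line = raw.strip()
        if "START" in line:
            in_block = True
        if "END" in line:
            in_block = False
        if found_target:
            # the previous line was a target, so this is the line to keep
            result.append(line)
            found_target = False
        if in_block and line in targets:
            found_target = True
    return result


# Shortened sample: the "10" before START and the "20" after END are ignored.
sample = ["10", "41", "START", "90", "10", "100", "20", "2056",
          "30", "0.0", "10", "53", "END", "20", "48"]
picked = lines_after_targets(sample)
print(picked)
```

Because each line is read exactly once by the loop, adding another condition later only means adding another flag or check, which is the flexibility argument made above.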

Awk: Remove duplicate lines with conditions

I have a tab-delimited text file with 8 columns:
Erythropoietin Receptor Integrin Beta 4 11.7 9.7 164 195 19 3.2
Erythropoietin Receptor Receptor Tyrosine Phosphatase F 10.8 2.6 97 107 15 3.2
Erythropoietin Receptor Leukemia Inhibitory Factor Receptor 12.0 3.6 171 479 14 3.2
Erythropoietin Receptor Immunoglobulin 9 10.4 3.1 100 108 24 3.3
Erythropoietin Receptor Collagen Alpha 1 Xx 10.7 2.7 93 105 18 3.3
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 5 11.4 3.2 114 114 25 1.7
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 14 11.1 2.1 99 100 28 1.8
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 1B 10.9 4.9 133 162 29 1.9
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 11A 11.5 5.1 130 166 25 1.9
The first and second column contain protein names and the 8th column contains the "distance" score between each protein pair. I would like to remove the lines containing duplicate protein pairs and keep only the pair with the lowest distance (the lowest value in the 8th column). This means that for the pair Protein A-Protein B I would like to remove all occurrences except the one with the lowest distance score. The pair is considered duplicate even if the protein names are swapped (in different columns). This means that Protein A Protein B is the same as Protein B Protein A.
Something like this (untested):
awk -F'\t' '{
    # order the two names so that A-B and B-A share one key
    key = ($1 < $2) ? ($1 SUBSEP $2) : ($2 SUBSEP $1)
    if (!(key in min) || $NF + 0 < min[key]) {
        min[key] = $NF + 0
        rec[key] = $0
    }
}
END {
    for (k in rec) print rec[k]
}' infile
I hope this would be the final update ^_^
kent$ awk -F'\t' '{if($1$2 in a){
        if($8<a[$1$2]){
            a[$1$2]=$8;r[$1$2]=$0;
        }
    }else if ($2$1 in a){
        if($8<a[$2$1]){
            a[$2$1] = $8;r[$2$1] = $0;
        }
    }else{
        a[$1$2]=$8; r[$1$2]=$0;
    }
} END{for(x in r)print r[x]}' yourFile
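For comparison, here is a minimal Python sketch of the same rule: treat a pair as unordered and keep only the line with the lowest value in the 8th column. The function name keep_closest_pairs and the short made-up rows (proteins A, B, C with placeholder values in columns 3 to 7) are assumptions for illustration, not data from the question.

```python
def keep_closest_pairs(lines):
    """Keep one line per unordered protein pair: the one with the lowest 8th column."""
    best = {}
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split("\t")
        # sorting the two names makes (A, B) and (B, A) the same key
        key = tuple(sorted(fields[:2]))
        distance = float(fields[7])
        if key not in best or distance < best[key][0]:
            best[key] = (distance, line)
    return [record for _, record in best.values()]


# Hypothetical tab-delimited rows, 8 columns each.
rows = [
    "A\tB\t1\t1\t1\t1\t1\t3.2",
    "B\tA\t1\t1\t1\t1\t1\t1.7",
    "A\tC\t1\t1\t1\t1\t1\t2.0",
    "A\tB\t1\t1\t1\t1\t1\t2.5",
]
unique = keep_closest_pairs(rows)
for record in unique:
    print(record)
```

Using a tuple of the two sorted names as the key also avoids the ambiguity of plain string concatenation, where different name pairs could join into the same string.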