Making rows into columns and saving to separate files - pandas

I have 4 text files in a folder, and each text file contains many rows of data, as follows:
cat a.txt
10.0000 0.0000 10.0000 0.0000
11.0000 0.0000 11.0000 0.0000
cat b.txt
5.1065 3.8423 2.6375 3.5098
4.7873 5.9304 1.9943 4.7599
cat c.txt
3.5257 3.9505 3.8323 4.3359
3.3414 4.0014 4.0383 4.4803
cat d.txt
1.8982 2.0342 1.9963 2.1575
1.8392 2.0504 2.0623 2.2037
I want to turn each set of corresponding rows from the text files into columns, like this:
file001.txt
10.0000 5.1065 3.5257 1.8982
0.0000 3.8423 3.9505 2.0342
10.0000 2.6375 3.8323 1.9963
0.0000 3.5098 4.3359 2.1575
file002.txt
11.0000 4.7873 3.3414 1.8392
0.0000 5.9304 4.0014 2.0504
11.0000 1.9943 4.0383 2.0623
0.0000 4.7599 4.4803 2.2037
And finally, I want to append the values 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000 to every line, so the final output should be:
file001.txt
10.0000 5.1065 3.5257 1.8982 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.8423 3.9505 2.0342 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
10.0000 2.6375 3.8323 1.9963 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.5098 4.3359 2.1575 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
file002.txt
11.0000 4.7873 3.3414 1.8392 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 5.9304 4.0014 2.0504 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
11.0000 1.9943 4.0383 2.0623 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 4.7599 4.4803 2.2037 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
Finally, I want to prepend some comment lines at the top of every created file.
So, for example, file001.txt should be:
#
# ascertain thin
# Metamorphs
# pch
# what is that
# 5-r
# Add the thing
# liop34
# liop36
# liop45
# liop34
# M(CM) N(M) O(S) P(cc) ab cd efgh ijkl mnopq rstuv
#
10.0000 5.1065 3.5257 1.8982 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.8423 3.9505 2.0342 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
10.0000 2.6375 3.8323 1.9963 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.5098 4.3359 2.1575 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000

import numpy as np

files = ["a.txt", "b.txt", "c.txt", "d.txt"]
# get number of columns per file, i.e., 4 in sample data
n_each = np.loadtxt(files[0]).shape[1]
# concatenate transposed data
arrays = np.concatenate([np.loadtxt(file).T for file in files])
# rows are in columns now for easier reshaping; reshape and save
n_all = arrays.shape[1]
for n in range(n_all):
    np.savetxt(f"file{str(n+1).zfill(3)}.txt",
               # (len(files), n_each) keeps each file's row contiguous
               arrays[:, n].reshape(len(files), n_each).T,
               fmt="%7.4f")
To append a fixed array of values to the right of the new arrays, you can horizontally stack them after tiling the new values n_each times:
# other things same as above
new_values = np.tile([5, 6, 9, 0, 1, 1], (n_each, 1))
for n in range(n_all):
    np.savetxt(f"file{str(n+1).zfill(3)}.txt",
               np.hstack((arrays[:, n].reshape(len(files), n_each).T,
                          new_values)),
               fmt="%7.4f")
To add comments, the header and comments parameters of np.savetxt are useful. We pass the string to header, and since it already contains "# ", we suppress the extra "# " prefix from np.savetxt by passing comments="":
comment = """\
#
# ascertain thin
# Metamorphs
# pch
# what is that
# 5-r
# Add the thing
# liop34
# liop36
# liop45
# liop34
# M(CM) N(M) O(S) P(cc) ab cd efgh ijkl mnopq rstuv
#"""
# rows are in columns now for easier reshaping; reshape and save
n_all = arrays.shape[1]
new_values = np.tile([5, 6, 9, 0, 1, 1], (n_each, 1))
for n in range(n_all):
    np.savetxt(f"file{str(n+1).zfill(3)}.txt",
               np.hstack((arrays[:, n].reshape(len(files), n_each).T,
                          new_values)),
               fmt="%7.4f",
               header=comment,
               comments="")
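Since the question title mentions pandas, the same transpose-and-stack can also be sketched with DataFrames. This is an alternative, not the numpy answer itself; the sample data from the question is written out first so the snippet is self-contained:

```python
import pandas as pd

# sample data from the question, written to disk so the sketch is self-contained
samples = {
    "a.txt": "10.0000 0.0000 10.0000 0.0000\n11.0000 0.0000 11.0000 0.0000\n",
    "b.txt": "5.1065 3.8423 2.6375 3.5098\n4.7873 5.9304 1.9943 4.7599\n",
    "c.txt": "3.5257 3.9505 3.8323 4.3359\n3.3414 4.0014 4.0383 4.4803\n",
    "d.txt": "1.8982 2.0342 1.9963 2.1575\n1.8392 2.0504 2.0623 2.2037\n",
}
for name, text in samples.items():
    with open(name, "w") as fh:
        fh.write(text)

files = ["a.txt", "b.txt", "c.txt", "d.txt"]
frames = [pd.read_csv(f, sep=r"\s+", header=None) for f in files]
n_rows = len(frames[0])
for n in range(n_rows):
    # row n of every file becomes one column of output file n+1
    out = pd.concat([df.iloc[n].reset_index(drop=True) for df in frames], axis=1)
    out.to_csv(f"file{n + 1:03d}.txt", sep=" ", header=False, index=False,
               float_format="%.4f")
```

Adding the fixed trailing values and the comment header would work the same way as in the numpy version (hstack the tile, or write the header string before the data).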


How to extract every nth row from a numpy array

I have a numpy array and I want to extract every 3rd row from it.
Input:
0.00 1.0000
0.34 1.0000
0.68 1.0000
1.01 1.0000
1.35 1.0000
5.62 2.0000
I need to extract every 3rd row, so the expected output will be:
0.68 1.0000
5.62 2.0000
My code:
import numpy as np
a=np.loadtxt('input.txt')
out=a[::3]
But it gives a different result. I hope experts will guide me. Thanks.
When the start is omitted, a (positive) slice begins at the first item.
You need to start the slice at the (N-1)th item:
N = 3
out = a[N-1::N]
Output:
array([[0.68, 1. ],
[5.62, 2. ]])
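A minimal runnable check of the slice, with the question's sample data inlined instead of loading input.txt:

```python
import numpy as np

# the question's sample data, inlined instead of np.loadtxt('input.txt')
a = np.array([[0.00, 1.0],
              [0.34, 1.0],
              [0.68, 1.0],
              [1.01, 1.0],
              [1.35, 1.0],
              [5.62, 2.0]])

N = 3
out = a[N-1::N]   # start at index N-1 (the 3rd row), then step by N
print(out)        # rows 3 and 6: [0.68, 1.0] and [5.62, 2.0]
```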

Which optimization techniques can I use for maximizing the sum of minimum distance of each point to other points in a unit hypercube?

Let's say I have the following unit hypercube with 9 points.
My goal is to maximize the sum over all points of each point's minimum distance to the other points.
In the image, Figure 1 is the original data, Figure 2 is computed using the function, and Figure 3 is the optimized result.
I want to know how I can get from Figure 1 to Figure 3.
So far, I have tried using Simulated Annealing, but I am not able to do it in the correct way. Any other suggestions would be helpful!
You could model this as:
max sum(i, d[i])
d(i) ≤ sqrt( (x[i]-x[j])^2 + (y[i]-y[j])^2 ) for all j <> i
x[i],y[i] ∈ [0,1]
This is a non-convex problem and can be solved with a global solver such as Couenne or Baron. (Note: it will find good solutions quickly but proving global optimality is difficult and time-consuming).
This can also be attacked using a multi-start approach with a local solver (I used CONOPT in the test below). The algorithm would be:
bestobj = 0
for k = 1 to N (say N=50)
    (x, y) = random points in [0,1]x[0,1]
    solve NLP model
    if obj > bestobj
        save solution
        bestobj = obj
end
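A rough Python sketch of this multi-start loop, substituting scipy.optimize.minimize for CONOPT as the local solver (my substitution, not the original setup; the min in the objective is nonsmooth, so L-BFGS-B with finite-difference gradients is a heuristic rather than an exact NLP solve):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def neg_sum_min_dist(flat, n):
    """Negative of the sum over points of the min distance to the other points."""
    pts = flat.reshape(n, 2)
    d = squareform(pdist(pts))
    np.fill_diagonal(d, np.inf)  # ignore zero self-distances
    return -d.min(axis=1).sum()

rng = np.random.default_rng(0)
n = 9
bestobj, best = 0.0, None
for k in range(30):                       # multi-start loop
    x0 = rng.random(2 * n)                # random points in [0,1]x[0,1]
    res = minimize(neg_sum_min_dist, x0, args=(n,),
                   method="L-BFGS-B", bounds=[(0.0, 1.0)] * (2 * n))
    if -res.fun > bestobj:                # keep the best local solution
        bestobj, best = -res.fun, res.x.reshape(n, 2)

print(f"best objective: {bestobj:.4f}")   # global optimum for 9 points is 4.5
```

The bounds keep every coordinate inside the unit square; the restarts compensate for the local solver getting stuck at poor configurations.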
Using both approaches (global solver, multistart approach) I get for 9 points:
---- VAR x x-coordinates
LOWER LEVEL UPPER MARGINAL
i1 . 0.5000 1.0000 EPS
i2 . 1.0000 1.0000 EPS
i3 . 0.5000 1.0000 EPS
i4 . . 1.0000 EPS
i5 . 0.5000 1.0000 EPS
i6 . . 1.0000 EPS
i7 . . 1.0000 EPS
i8 . 1.0000 1.0000 EPS
i9 . 1.0000 1.0000 EPS
---- VAR y y-coordinates
LOWER LEVEL UPPER MARGINAL
i1 . . 1.0000 EPS
i2 . . 1.0000 EPS
i3 . 0.5000 1.0000 EPS
i4 . 1.0000 1.0000 EPS
i5 . 1.0000 1.0000 EPS
i6 . 0.5000 1.0000 EPS
i7 . . 1.0000 EPS
i8 . 1.0000 1.0000 EPS
i9 . 0.5000 1.0000 EPS
---- VAR d min distances from point i
LOWER LEVEL UPPER MARGINAL
i1 . 0.5000 1.4142 EPS
i2 . 0.5000 1.4142 EPS
i3 . 0.5000 1.4142 EPS
i4 . 0.5000 1.4142 EPS
i5 . 0.5000 1.4142 EPS
i6 . 0.5000 1.4142 EPS
i7 . 0.5000 1.4142 EPS
i8 . 0.5000 1.4142 EPS
i9 . 0.5000 1.4142 EPS
LOWER LEVEL UPPER MARGINAL
---- VAR z -INF 4.5000 +INF .
z objective

Comparing two columns in two files using awk with duplicates

File 1
A4gnt 0 0 0 0.3343
Aaas 2.79 2.54 1.098 0.1456
Aacs 0.94 0.88 1.063 0.6997
Aadac 0 0 0 0.3343
Aadacl2 0 0 0 0.3343
Aadat 0.01 0 1.723 0.7222
Aadat 0.06 0.03 1.585 0.2233
Aaed1 0.28 0.24 1.14 0.5337
Aaed1 1.24 1.27 0.976 0.9271
Aaed1 15.91 13.54 1.175 0.163
Aagab 1.46 1.14 1.285 0.3751
Aagab 6.12 6.3 0.972 0.6569
Aak1 0.02 0.01 1.716 0.528
Aak1 0.1 0.19 0.561 0.159
Aak1 0.14 0.19 0.756 0.5297
Aak1 0.16 0.18 0.907 0.6726
Aak1 0.21 0 0 0.066
Aak1 0.26 0.27 0.967 0.9657
Aak1 0.54 1.65 0.325 0.001
Aamdc 0.04 0 15.461 0.0875
Aamdc 1.03 1.01 1.019 0.8817
Aamdc 1.27 1.26 1.01 0.9285
Aamdc 7.21 6.94 1.039 0.7611
Aamp 0.06 0.05 1.056 0.9136
Aamp 0.11 0.11 1.044 0.9227
Aamp 0.12 0.13 0.875 0.7584
Aamp 0.22 0.2 1.072 0.7609
File 2
Adar
Ak3
Alox15b
Ampd2
Ampd3
Ankrd17
Apaf1
Aplp1
Arih1
Atg14
Aurkb
Bcl2l14
Bmp2
Brms1l
Cactin
Camta2
Cav1
Ccr5
Chfr
Clock
Cnot1
Crebrf
Crtc3
Csnk2b
Cul3
Cx3cl1
Dnaja3
Dnmt1
Dtl
Ednra
Eef1e1
Esr1
Ezr
Fam162a
Fas
Fbxo30
Fgr
Flcn
Foxp3
Frzb
Fzd6
Gdf3
Hey2
Hnf4
The desired output: wherever the first columns of the two files match, print all the columns of the first file (including duplicates).
I've tried
awk 'NR==FNR{a[$1]=$2"\t"$3"\t"$4"\t"$5; next} { if($1 in a) { print $0,a[$1] } }' File2 File1 > output
But for some reason I'm getting just a few hits. Does anyone know why?
Read the second file first and store its 1st-column values as keys of array arr; then read the first file and, if the 1st column of file1 exists in arr (built from file2), print the current row/record of file1.
awk 'FNR==NR{arr[$1];next}$1 in arr' file2 file1
Advantage:
with a[$1]=$2"\t"$3"\t"$4"\t"$5; next, a later row with the same key overwrites the previously stored value,
but with arr[$1]; next we store only the unique keys, and $1 in arr handles duplicate records correctly even when they exist
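The same key-set filter is easy to mirror in plain Python, which may help if the one-liner needs extending. A sketch on inline sample rows; note that "Aadat" is added to the file2 sample here purely so there is a match to show (it is not in the question's File 2):

```python
# keys: first-column values of file2, stored as a set (like arr[$1] in awk)
file2_lines = ["Adar", "Ak3", "Aadat"]   # sample subset; "Aadat" added for illustration
keys = {line.split()[0] for line in file2_lines}

# file1: keep every record whose first column is in the key set,
# preserving duplicates, just like `$1 in arr`
file1_lines = [
    "Aadat 0.01 0 1.723 0.7222",
    "Aadat 0.06 0.03 1.585 0.2233",
    "Aaed1 0.28 0.24 1.14 0.5337",
]
matches = [line for line in file1_lines if line.split()[0] in keys]
for line in matches:
    print(line)   # prints both Aadat records
```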

Convert rows into columns with an Informix query

I want to convert
inpvacart inpvapvta inpvapvt1 inpvapvt2 inpvapvt3 inpvapvt4
CS-279 270.4149 0.0000 0.0000 0.0000 0.0000
AAA5030 1.9300 1.9300 1.6212 0.0000 0.0000
Query
select
inpvacart,
inpvapvta,
inpvapvt1,
inpvapvt2,
inpvapvt3,
inpvapvt4
from inpva;
into this
inpvacart line value
CS-279 1 270.4149
CS-279 2 0.00000
CS-279 3 0.00000
CS-279 4 0.00000
CS-279 5 0.00000
AAA5030 1 1.9300
AAA5030 2 1.9300
AAA5030 3 1.6212
AAA5030 4 0.0000
AAA5030 5 0.0000
I have tried this
select s.inpvacart,l.lista,l.resultados
from inpva as s,
table(values(1,s.inpvapvta),
(2,s.inpvapvt1),
(3,s.inpvapvt2),
(4,s.inpvapvt3),
(5,s.inpvapvt4))
)as l(lista,resultados);
But it does not work in Informix 9.
Is there a way to transpose rows to columns?
Thank you.
I don't think Informix has an unpivot operator to transpose columns into rows like, for instance, MSSQL does, but one way to do this is to transpose the columns manually and then use union all to build a single result set, like this:
select inpvacart, 1 as line, inpvapvta as value from inpva
union all
select inpvacart, 2 as line, inpvapvt1 as value from inpva
union all
select inpvacart, 3 as line, inpvapvt2 as value from inpva
union all
select inpvacart, 4 as line, inpvapvt3 as value from inpva
union all
select inpvacart, 5 as line, inpvapvt4 as value from inpva
order by inpvacart, line;
It's not very pretty but it should work.
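The same unpivot is also easy outside the database; for instance, with pandas melt (a sketch on the two sample rows, not something that runs inside Informix):

```python
import pandas as pd

# the two sample rows from the question
df = pd.DataFrame(
    [["CS-279", 270.4149, 0.0, 0.0, 0.0, 0.0],
     ["AAA5030", 1.93, 1.93, 1.6212, 0.0, 0.0]],
    columns=["inpvacart", "inpvapvta", "inpvapvt1",
             "inpvapvt2", "inpvapvt3", "inpvapvt4"],
)

# melt turns each inpvapvt* column into its own row (the unpivot)
long = df.melt(id_vars="inpvacart", var_name="col", value_name="value")
# map each source column to its line number, mirroring the UNION ALL branches
line_no = {"inpvapvta": 1, "inpvapvt1": 2, "inpvapvt2": 3,
           "inpvapvt3": 4, "inpvapvt4": 5}
long["line"] = long["col"].map(line_no)
long = long[["inpvacart", "line", "value"]].sort_values(["inpvacart", "line"])
print(long.to_string(index=False))
```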

Group by in MATLAB to find the value at the minimum, similar to SQL

I have a dataset with columns a, b, c and d.
I want to group the dataset by (a, b) and, for each group, find the c whose d is minimum.
I can do the "group by" using grpstats as:
grpstats(M,[M(:,1) M(:,2)],{'min'});
I don't know how to find the value of M(:,3) that produced the min in d.
In SQL I suppose we would use nested queries and primary keys for that. How can I solve it in MATLAB?
Here is an example:
>> M =[4,1,7,0.3;
2,1,8,0.4;
2,1,9,0.2;
4,2,1,0.2;
2,2,2,0.6;
4,2,3,0.1;
4,3,5,0.8;
5,3,6,0.2;
4,3,4,0.5;]
>> grpstats(M,[M(:,1) M(:,2)],'min')
ans =
2.0000 1.0000 8.0000 0.2000
2.0000 2.0000 2.0000 0.6000
4.0000 1.0000 7.0000 0.3000
4.0000 2.0000 1.0000 0.1000
4.0000 3.0000 4.0000 0.5000
5.0000 3.0000 6.0000 0.2000
But ans(1,3) and ans(4,3) are wrong. The correct answer that I am looking for is:
2.0000 1.0000 9.0000 0.2000
2.0000 2.0000 2.0000 0.6000
4.0000 1.0000 7.0000 0.3000
4.0000 2.0000 3.0000 0.1000
4.0000 3.0000 4.0000 0.5000
5.0000 3.0000 6.0000 0.2000
To conclude: I don't want the minimum of the third column; I want its value in the row where the 4th column is minimal.
grpstats won't do this, and MATLAB doesn't make it as easy as you might hope.
Sometimes brute force is best, even if it doesn't feel like great MATLAB style:
[b,m,n] = unique(M(:,1:2),'rows');
for i = 1:numel(m)
    idx = find(n==i);
    [~,subidx] = min(M(idx,4));
    a(i,:) = M(idx(subidx),3:4);
end
>> [b,a]
ans =
2 1 9 0.2
2 2 2 0.6
4 1 7 0.3
4 2 3 0.1
4 3 4 0.5
5 3 6 0.2
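For comparison, the same "take the row at the group-wise minimum" pattern is a one-liner in Python/pandas with groupby(...).idxmin (a sketch on the question's sample data, not a MATLAB solution):

```python
import pandas as pd

# sample data from the question, columns a, b, c, d
M = pd.DataFrame(
    [[4, 1, 7, 0.3], [2, 1, 8, 0.4], [2, 1, 9, 0.2],
     [4, 2, 1, 0.2], [2, 2, 2, 0.6], [4, 2, 3, 0.1],
     [4, 3, 5, 0.8], [5, 3, 6, 0.2], [4, 3, 4, 0.5]],
    columns=["a", "b", "c", "d"],
)

# index of the row with minimal d within each (a, b) group,
# then pull those whole rows (so c comes along with its d)
idx = M.groupby(["a", "b"])["d"].idxmin()
result = M.loc[idx].sort_values(["a", "b"]).reset_index(drop=True)
print(result)
```

This is exactly the "argmin within group" operation that grpstats lacks: idxmin returns row labels, and indexing back into the frame keeps the columns paired.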
I believe that
temp = grpstats(M(:, [1 2 4 3]),[M(:,1) M(:,2) ],{'min'});
result = temp(:, [1 2 4 3]);
would do what you require. If it doesn't, please explain in the comments and we can figure it out...
If I understand the documentation correctly, even
temp = grpstats(M(:, [1 2 4 3]), [1 2], {'min'});
result = temp(:, [1 2 4 3]);
should work (giving column numbers rather than full contents of columns)... Can't test right now, so can't vouch for that.