Group by in MATLAB to find the value that resulted in the minimum, similar to SQL

I have a dataset with columns a, b, c and d.
I want to group the dataset by a and b, and for each group find the value of c at which d is minimum.
I can do the "group by" using grpstats as:
grpstats(M,[M(:,1) M(:,2)],{'min'});
but I don't know how to find the value of M(:,3) that produced the minimum in d.
In SQL I suppose we would use nested queries and the primary keys for that. How can I solve it in MATLAB?
Here is an example:
>> M =[4,1,7,0.3;
2,1,8,0.4;
2,1,9,0.2;
4,2,1,0.2;
2,2,2,0.6;
4,2,3,0.1;
4,3,5,0.8;
5,3,6,0.2;
4,3,4,0.5;]
>> grpstats(M,[M(:,1) M(:,2)],'min')
ans =
2.0000 1.0000 8.0000 0.2000
2.0000 2.0000 2.0000 0.6000
4.0000 1.0000 7.0000 0.3000
4.0000 2.0000 1.0000 0.1000
4.0000 3.0000 4.0000 0.5000
5.0000 3.0000 6.0000 0.2000
But ans(1,3) and ans(4,3) are wrong. The correct answer that I am looking for is:
2.0000 1.0000 9.0000 0.2000
2.0000 2.0000 2.0000 0.6000
4.0000 1.0000 7.0000 0.3000
4.0000 2.0000 3.0000 0.1000
4.0000 3.0000 4.0000 0.5000
5.0000 3.0000 6.0000 0.2000
To conclude: I don't want the minimum of the third column; I want its value corresponding to the minimum in the 4th column.

grpstats won't do this, and MATLAB doesn't make it as easy as you might hope.
Sometimes brute force is best, even if it doesn't feel like great MATLAB style:
[b, m, n] = unique(M(:,1:2), 'rows');
a = zeros(numel(m), 2);            % preallocate the result
for i = 1:numel(m)
    idx = find(n == i);            % rows belonging to group i
    [~, subidx] = min(M(idx,4));   % position of the minimal d within the group
    a(i,:) = M(idx(subidx), 3:4);  % keep the c and d of that row
end
>> [b,a]
ans =
2 1 9 0.2
2 2 2 0.6
4 1 7 0.3
4 2 3 0.1
4 3 4 0.5
5 3 6 0.2
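As a cross-check of the same logic in another language, here is a pandas sketch (not MATLAB) of "keep the row where d is minimal per (a, b) group", using the column names a-d from the question:

```python
import pandas as pd

# The question's sample matrix, with the question's column names.
M = pd.DataFrame(
    [[4, 1, 7, 0.3], [2, 1, 8, 0.4], [2, 1, 9, 0.2],
     [4, 2, 1, 0.2], [2, 2, 2, 0.6], [4, 2, 3, 0.1],
     [4, 3, 5, 0.8], [5, 3, 6, 0.2], [4, 3, 4, 0.5]],
    columns=["a", "b", "c", "d"])

# idxmin gives the index of the row with the minimal d in each group,
# so c comes along with the minimizing d instead of being minimized itself.
idx = M.groupby(["a", "b"])["d"].idxmin()
result = M.loc[idx].sort_values(["a", "b"]).reset_index(drop=True)
```

This reproduces the desired output above, e.g. c = 9 for the (2, 1) group and c = 3 for (4, 2).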

I believe that
temp = grpstats(M(:, [1 2 4 3]),[M(:,1) M(:,2) ],{'min'});
result = temp(:, [1 2 4 3]);
would do what you require. If it doesn't, please explain in the comments and we can figure it out...
If I understand the documentation correctly, even
temp = grpstats(M(:, [1 2 4 3]), [1 2], {'min'});
result = temp(:, [1 2 4 3]);
should work (giving column numbers rather than the full contents of the columns)... I can't test right now, so I can't vouch for that.


Making rows into columns and saving them in separate files

I have 4 text files in a folder, and each text file contains many rows of data, as follows:
cat a.txt
10.0000 0.0000 10.0000 0.0000
11.0000 0.0000 11.0000 0.0000
cat b.txt
5.1065 3.8423 2.6375 3.5098
4.7873 5.9304 1.9943 4.7599
cat c.txt
3.5257 3.9505 3.8323 4.3359
3.3414 4.0014 4.0383 4.4803
cat d.txt
1.8982 2.0342 1.9963 2.1575
1.8392 2.0504 2.0623 2.2037
I want to turn the corresponding rows of the text files into columns, like this:
file001.txt
10.0000 5.1065 3.5257 1.8982
0.0000 3.8423 3.9505 2.0342
10.0000 2.6375 3.8323 1.9963
0.0000 3.5098 4.3359 2.1575
file002.txt
11.0000 4.7873 3.3414 1.8392
0.0000 5.9304 4.0014 2.0504
11.0000 1.9943 4.0383 2.0623
0.0000 4.7599 4.4803 2.2037
And finally I want to append these values 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000 to every line, so the final output should be:
file001.txt
10.0000 5.1065 3.5257 1.8982 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.8423 3.9505 2.0342 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
10.0000 2.6375 3.8323 1.9963 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.5098 4.3359 2.1575 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
file002.txt
11.0000 4.7873 3.3414 1.8392
0.0000 5.9304 4.0014 2.0504 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
11.0000 1.9943 4.0383 2.0623 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 4.7599 4.4803 2.2037 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
Finally, I want to prepend some comment lines at the top of every created file.
So for example file001.txt should be
#
# ascertain thin
# Metamorphs
# pch
# what is that
# 5-r
# Add the thing
# liop34
# liop36
# liop45
# liop34
# M(CM) N(M) O(S) P(cc) ab cd efgh ijkl mnopq rstuv
#
10.0000 5.1065 3.5257 1.8982 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.8423 3.9505 2.0342 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
10.0000 2.6375 3.8323 1.9963 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.5098 4.3359 2.1575 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
import numpy as np

files = ["a.txt", "b.txt", "c.txt", "d.txt"]

# get the number of columns per file, i.e., 4 in the sample data
n_each = np.loadtxt(files[0]).shape[1]

# concatenate the transposed data
arrays = np.concatenate([np.loadtxt(file).T for file in files])

# rows are now in columns for easier reshaping; each column of `arrays`
# stacks one row from every file, so reshape to (files, values) and
# transpose to put the files side by side
n_all = arrays.shape[1]
for n in range(n_all):
    np.savetxt(f"file{str(n + 1).zfill(3)}.txt",
               arrays[:, n].reshape(len(files), n_each).T,
               fmt="%7.4f")
To append a fixed array of values to the right of the new arrays, you can horizontally stack them after tiling the new values n_each times:
# other things same as above
new_values = np.tile([5, 6, 9, 0, 1, 1], (n_each, 1))
for n in range(n_all):
    np.savetxt(f"file{str(n + 1).zfill(3)}.txt",
               np.hstack((arrays[:, n].reshape(len(files), n_each).T,
                          new_values)),
               fmt="%7.4f")
To add the comments, the header and comments parameters of np.savetxt are useful: we pass the string to header, and since it already contains "# ", we suppress the extra "# " that np.savetxt would prepend by passing comments="":
comment = """\
#
# ascertain thin
# Metamorphs
# pch
# what is that
# 5-r
# Add the thing
# liop34
# liop36
# liop45
# liop34
# M(CM) N(M) O(S) P(cc) ab cd efgh ijkl mnopq rstuv
#"""
# rows are now in columns for easier reshaping; reshape and save
n_all = arrays.shape[1]
new_values = np.tile([5, 6, 9, 0, 1, 1], (n_each, 1))
for n in range(n_all):
    np.savetxt(f"file{str(n + 1).zfill(3)}.txt",
               np.hstack((arrays[:, n].reshape(len(files), n_each).T,
                          new_values)),
               fmt="%7.4f",
               header=comment,
               comments="")

How to extract every nth row from a numpy array

I have a numpy array and I want to extract every 3rd row from it.
Input:
0.00 1.0000
0.34 1.0000
0.68 1.0000
1.01 1.0000
1.35 1.0000
5.62 2.0000
I need to extract every 3rd row, so the expected output will be
0.68 1.0000
5.62 2.0000
My code:
import numpy as np

a = np.loadtxt('input.txt')
out = a[::3]
But it gives a different result. I hope the experts will guide me. Thanks.
When the start is omitted, a (positive) slice starts at the first item.
You need to start the slice at the (N-1)th item:
N = 3
out = a[N-1::N]
Output:
array([[0.68, 1. ],
[5.62, 2. ]])
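The start/step pattern generalizes to any N; a quick sketch with a toy array (not the question's data):

```python
import numpy as np

# six rows: [0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]
a = np.arange(12).reshape(6, 2)

N = 3
# rows at 0-based indices N-1, 2N-1, ... i.e. the 3rd, 6th, ... rows
out = a[N - 1::N]
```

Here `out` contains the rows `[4, 5]` and `[10, 11]`, i.e. every 3rd row counted from the first.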

Python Pandas rolling mean with window value in another column

I am using pandas.DataFrame.rolling to calculate rolling means for a stock-index close price series. I can do this in Excel; how can I do the same thing in pandas? Thanks!
Below is my Excel formula for the moving average; the window length is in the ma window column:
date close ma window ma
2018/3/21 4061.0502
2018/3/22 4020.349
2018/3/23 3904.9355 3 =AVERAGE(INDIRECT("B"&(ROW(B4)-C4+1)):B4)
2018/3/26 3879.893 2 =AVERAGE(INDIRECT("B"&(ROW(B5)-C5+1)):B5)
2018/3/27 3913.2689 4 =AVERAGE(INDIRECT("B"&(ROW(B6)-C6+1)):B6)
2018/3/28 3842.7155 7 =AVERAGE(INDIRECT("B"&(ROW(B7)-C7+1)):B7)
2018/3/29 3894.0498 1 =AVERAGE(INDIRECT("B"&(ROW(B8)-C8+1)):B8)
2018/3/30 3898.4977 6 =AVERAGE(INDIRECT("B"&(ROW(B9)-C9+1)):B9)
2018/4/2 3886.9189 2 =AVERAGE(INDIRECT("B"&(ROW(B10)-C10+1)):B10)
2018/4/3 3862.4796 8 =AVERAGE(INDIRECT("B"&(ROW(B11)-C11+1)):B11)
2018/4/4 3854.8625 1 =AVERAGE(INDIRECT("B"&(ROW(B12)-C12+1)):B12)
2018/4/9 3852.9292 9 =AVERAGE(INDIRECT("B"&(ROW(B13)-C13+1)):B13)
2018/4/10 3927.1729 3 =AVERAGE(INDIRECT("B"&(ROW(B14)-C14+1)):B14)
2018/4/11 3938.3434 1 =AVERAGE(INDIRECT("B"&(ROW(B15)-C15+1)):B15)
2018/4/12 3898.6354 3 =AVERAGE(INDIRECT("B"&(ROW(B16)-C16+1)):B16)
2018/4/13 3871.1443 8 =AVERAGE(INDIRECT("B"&(ROW(B17)-C17+1)):B17)
2018/4/16 3808.863 2 =AVERAGE(INDIRECT("B"&(ROW(B18)-C18+1)):B18)
2018/4/17 3748.6412 2 =AVERAGE(INDIRECT("B"&(ROW(B19)-C19+1)):B19)
2018/4/18 3766.282 4 =AVERAGE(INDIRECT("B"&(ROW(B20)-C20+1)):B20)
2018/4/19 3811.843 6 =AVERAGE(INDIRECT("B"&(ROW(B21)-C21+1)):B21)
2018/4/20 3760.8543 3 =AVERAGE(INDIRECT("B"&(ROW(B22)-C22+1)):B22)
I figured it out. But I don't think it is the best solution...
import pandas as pd

data = pd.read_excel('data.xlsx', index_col='date')

def get_price_mean(x):
    # window length stored in the row at the end of the expanding slice
    win = data.loc[:, 'ma window'].iloc[x.shape[0] - 1].astype('int')
    win = max(win, 0)
    return pd.Series(x).rolling(window=win).mean().iloc[-1]

data.loc[:, 'ma'] = data.loc[:, 'close'].expanding().apply(get_price_mean)
print(data)
The result is:
close ma window ma
date
2018-03-21 4061.0502 NaN NaN
2018-03-22 4020.3490 NaN NaN
2018-03-23 3904.9355 3.0 3995.444900
2018-03-26 3879.8930 2.0 3892.414250
2018-03-27 3913.2689 4.0 3929.611600
2018-03-28 3842.7155 7.0 NaN
2018-03-29 3894.0498 1.0 3894.049800
2018-03-30 3898.4977 6.0 3888.893400
2018-04-02 3886.9189 2.0 3892.708300
2018-04-03 3862.4796 8.0 3885.344862
2018-04-04 3854.8625 1.0 3854.862500
2018-04-09 3852.9292 9.0 3876.179456
2018-04-10 3927.1729 3.0 3878.321533
2018-04-11 3938.3434 1.0 3938.343400
2018-04-12 3898.6354 3.0 3921.383900
2018-04-13 3871.1443 8.0 3886.560775
2018-04-16 3808.8630 2.0 3840.003650
2018-04-17 3748.6412 2.0 3778.752100
2018-04-18 3766.2820 4.0 3798.732625
2018-04-19 3811.8430 6.0 3817.568150
2018-04-20 3760.8543 3.0 3779.659767
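A more direct (if unvectorized) alternative is to slice each row's window explicitly instead of going through expanding().apply. A sketch with a hypothetical helper name and the first few rows of the question's data:

```python
import numpy as np
import pandas as pd

# first few rows of the question's data, enough to check against its output
df = pd.DataFrame({
    "close": [4061.0502, 4020.3490, 3904.9355, 3879.8930],
    "ma window": [np.nan, np.nan, 3, 2],
})

def variable_ma(frame):
    # For row i with window w, average close[i-w+1 .. i];
    # NaN when the window is missing or would reach before the first row.
    out = []
    for i, w in enumerate(frame["ma window"]):
        if pd.isna(w) or i - int(w) + 1 < 0:
            out.append(np.nan)
        else:
            out.append(frame["close"].iloc[i - int(w) + 1 : i + 1].mean())
    return pd.Series(out, index=frame.index)

df["ma"] = variable_ma(df)
```

On these rows it reproduces the values above (3995.4449 for window 3, 3892.41425 for window 2), and a window that reaches before the first row comes out NaN, matching the question's output.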

Convert columns into rows with an Informix query

I want to convert
inpvacart inpvapvta inpvapvt1 inpvapvt2 inpvapvt3 inpvapvt4
CS-279 270.4149 0.0000 0.0000 0.0000 0.0000
AAA5030 1.9300 1.9300 1.6212 0.0000 0.0000
Query
select
inpvacart,
inpvapvta,
inpvapvt1,
inpvapvt2,
inpvapvt3,
inpvapvt4
from inpva;
into this
inpvacart line value
CS-279 1 270.4149
CS-279 2 0.00000
CS-279 3 0.00000
CS-279 4 0.00000
CS-279 5 0.00000
AAA5030 1 1.9300
AAA5030 2 1.9300
AAA5030 3 1.6212
AAA5030 4 0.0000
AAA5030 5 0.0000
I have tried this
select s.inpvacart,l.lista,l.resultados
from inpva as s,
table(values(1,s.inpvapvta),
(2,s.inpvapvt1),
(3,s.inpvapvt2),
(4,s.inpvapvt3),
(5,s.inpvapvt4))
)as l(lista,resultados);
But it does not work in Informix 9.
Is there a way to transpose these columns to rows?
Thank you
I don't think Informix has an unpivot operator to transpose columns to rows like, for instance, MS SQL Server does, but one way to do this is to transpose the columns manually and then use UNION to combine them into a single set, like this:
select inpvacart, 1 as line, inpvapvta as value from inpva
union all
select inpvacart, 2 as line, inpvapvt1 as value from inpva
union all
select inpvacart, 3 as line, inpvapvt2 as value from inpva
union all
select inpvacart, 4 as line, inpvapvt3 as value from inpva
union all
select inpvacart, 5 as line, inpvapvt4 as value from inpva
order by inpvacart, line;
It's not very pretty but it should work.
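For comparison, outside SQL this unpivot is what pandas calls melt; a sketch with the question's sample rows (pandas, not Informix):

```python
import pandas as pd

# the question's sample rows, as a pandas frame
inpva = pd.DataFrame(
    [["CS-279", 270.4149, 0.0, 0.0, 0.0, 0.0],
     ["AAA5030", 1.93, 1.93, 1.6212, 0.0, 0.0]],
    columns=["inpvacart", "inpvapvta", "inpvapvt1",
             "inpvapvt2", "inpvapvt3", "inpvapvt4"])

# melt unpivots the five value columns into (column name, value) pairs
long = inpva.melt(id_vars="inpvacart", var_name="col", value_name="value")

# number the source columns 1..5 to get the 'line' column
line_of = {c: i + 1 for i, c in enumerate(inpva.columns[1:])}
long["line"] = long["col"].map(line_of)
long = long.sort_values(["inpvacart", "line"])[["inpvacart", "line", "value"]]
```

Each article then appears on five rows, one per original column, mirroring the UNION ALL result.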

Pandas organise delimited rows of data frame into dictionary

After reading a CSV file with pandas via:
df = pd.read_csv(file_name, names=['x', 'y', 'z'], header=None, delim_whitespace=True)
print(df)
Outputs something like:
x y z
0 ROW 1.0000 NaN
1 60.1662 30.5987 -29.2246
2 60.1680 30.5951 -29.2212
3 60.1735 30.5843 -29.2101
4 ROW 2.0000 NaN
5 60.1955 30.5410 -29.1664
6 ROW 3.0000 NaN
7 60.1955 30.5410 -29.1664
8 60.1958 30.5412 -29.1665
9 60.1965 30.5419 -29.1667
Now, ideally, I would like to organise the data under the assumption that everything below a "ROW" entry belongs to that row. Maybe I would like a dictionary of numpy arrays, so that
d = {'ROW1': [[60.1662, 30.5987, -29.2246], [60.1680, 30.5951, -29.2212], [60.1735, 30.5843, -29.2101]], 'ROW2': [[60.1955, 30.5410, -29.1664]], ...}
basically, each dictionary entry is a numpy array of the coordinates in the data frame. What would be the best way to do this?
Sounds like we need some dictionary comprehension here:
In [162]:
print(df)
x y z
0 ROW 1.0000 NaN
1 60.1662 30.5987 -29.2246
2 60.1680 30.5951 -29.2212
3 60.1735 30.5843 -29.2101
4 ROW 2.0000 NaN
5 60.1955 30.5410 -29.1664
6 ROW 3.0000 NaN
7 60.1955 30.5410 -29.1664
8 60.1958 30.5412 -29.1665
9 60.1965 30.5419 -29.1667
In [163]:
df['label'] = df.loc[df.x == 'ROW', ['x', 'y']].apply(lambda r: r['x'] + '%i' % r['y'], axis=1)
In [164]:
df['label'] = df['label'].ffill()
df = df.dropna().set_index('label')
In [165]:
{k: df.loc[k].values.tolist() for k in df.index.unique()}
Out[165]:
{'ROW1': [['60.1662', 30.5987, -29.2246],
['60.1680', 30.5951, -29.2212],
['60.1735', 30.5843, -29.2101]],
'ROW2': [['60.1955', 30.541, -29.1664]],
'ROW3': [['60.1955', 30.541, -29.1664],
['60.1958', 30.5412, -29.1665],
['60.1965', 30.5419, -29.1667]]}
Here is another way.
df['label'] = (df.x == 'ROW').astype(int).cumsum()
Out[24]:
x y z label
0 ROW 1.0000 NaN 1
1 60.1662 30.5987 -29.2246 1
2 60.1680 30.5951 -29.2212 1
3 60.1735 30.5843 -29.2101 1
4 ROW 2.0000 NaN 2
5 60.1955 30.5410 -29.1664 2
6 ROW 3.0000 NaN 3
7 60.1955 30.5410 -29.1664 3
8 60.1958 30.5412 -29.1665 3
9 60.1965 30.5419 -29.1667 3
And then, by grouping on the label column, you can process the df however you like. You keep all the column names within each group, which is very convenient to work with.
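Continuing that cumsum-label approach, the dictionary the question asks for can then be built with a groupby comprehension; a sketch on a truncated toy frame (only the first two ROW blocks):

```python
import numpy as np
import pandas as pd

# truncated version of the question's frame
df = pd.DataFrame({
    "x": ["ROW", "60.1662", "60.1680", "ROW", "60.1955"],
    "y": [1.0, 30.5987, 30.5951, 2.0, 30.5410],
    "z": [np.nan, -29.2246, -29.2212, np.nan, -29.1664]})

# every ROW marker starts a new group
df["label"] = (df["x"] == "ROW").astype(int).cumsum()

# one dict entry per group, dropping the ROW marker rows themselves
result = {
    "ROW%d" % k: g.loc[g["x"] != "ROW", ["x", "y", "z"]].values.tolist()
    for k, g in df.groupby("label")
}
```

Note that, as in the first answer's output, the x values stay strings because the 'ROW' markers force that column to object dtype; convert with pd.to_numeric on the data rows if pure floats are wanted.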