how to extract every nth row from numpy array - numpy

I have a numpy array and I want to extract every 3rd row from it.
Input:
0.00 1.0000
0.34 1.0000
0.68 1.0000
1.01 1.0000
1.35 1.0000
5.62 2.0000
I need to extract every 3rd row, so the expected output will be:
0.68 1.0000
5.62 2.0000
My code:
import numpy as np
a = np.loadtxt('input.txt')
out = a[::3]
But it gives a different result. Hope experts will guide me. Thanks.

When the start is omitted, a (positive) slice starts at the first item.
You need to start the slice at the (N-1)th item:
N = 3
out = a[N-1::N]
Output:
array([[0.68, 1.  ],
       [5.62, 2.  ]])
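To see why a[::3] gives a different result: slicing starts at index 0 by default, so it picks rows 0 and 3, while a[2::3] starts at index 2 and picks rows 2 and 5. A quick sketch with the sample data above:

import numpy as np

a = np.array([[0.00, 1.0], [0.34, 1.0], [0.68, 1.0],
              [1.01, 1.0], [1.35, 1.0], [5.62, 2.0]])

print(a[::3])   # rows 0 and 3: [[0.   1.  ], [1.01 1.  ]]
print(a[2::3])  # rows 2 and 5: [[0.68 1.  ], [5.62 2.  ]]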

Related

Making rows into columns and saving them in separate files

I have 4 text files in a folder and each text file contains many rows of data, as follows:
cat a.txt
10.0000 0.0000 10.0000 0.0000
11.0000 0.0000 11.0000 0.0000
cat b.txt
5.1065 3.8423 2.6375 3.5098
4.7873 5.9304 1.9943 4.7599
cat c.txt
3.5257 3.9505 3.8323 4.3359
3.3414 4.0014 4.0383 4.4803
cat d.txt
1.8982 2.0342 1.9963 2.1575
1.8392 2.0504 2.0623 2.2037
I want to turn each corresponding row of the text files into a column, like this:
file001.txt
10.0000 5.1065 3.5257 1.8982
0.0000 3.8423 3.9505 2.0342
10.0000 2.6375 3.8323 1.9963
0.0000 3.5098 4.3359 2.1575
file002.txt
11.0000 4.7873 3.3414 1.8392
0.0000 5.9304 4.0014 2.0504
11.0000 1.9943 4.0383 2.0623
0.0000 4.7599 4.4803 2.2037
And finally, I want to append the values 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000 to every line, so the final output should be:
file001.txt
10.0000 5.1065 3.5257 1.8982 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.8423 3.9505 2.0342 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
10.0000 2.6375 3.8323 1.9963 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.5098 4.3359 2.1575 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
file002.txt
11.0000 4.7873 3.3414 1.8392 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 5.9304 4.0014 2.0504 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
11.0000 1.9943 4.0383 2.0623 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 4.7599 4.4803 2.2037 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
Finally, I want to prepend some comments at the top of every created file.
So for example file001.txt should be
#
# ascertain thin
# Metamorphs
# pch
# what is that
# 5-r
# Add the thing
# liop34
# liop36
# liop45
# liop34
# M(CM) N(M) O(S) P(cc) ab cd efgh ijkl mnopq rstuv
#
10.0000 5.1065 3.5257 1.8982 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.8423 3.9505 2.0342 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
10.0000 2.6375 3.8323 1.9963 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
0.0000 3.5098 4.3359 2.1575 5.0000 6.0000 9.0000 0.0000 1.0000 1.0000
import numpy as np

files = ["a.txt", "b.txt", "c.txt", "d.txt"]
# get number of columns per file, i.e., 4 in sample data
n_each = np.loadtxt(files[0]).shape[1]
# concatenate transposed data
arrays = np.concatenate([np.loadtxt(file).T for file in files])
# rows are in columns now for easier reshaping; reshape and save
n_all = arrays.shape[1]
for n in range(n_all):
    np.savetxt(f"file{str(n+1).zfill(3)}.txt",
               arrays[:, n].reshape(n_each, len(files)).T,
               fmt="%7.4f")
To add a fixed array of values to the right of the new arrays, you can perform horizontal stacking after tiling the new values n_each times:
# other things same as above
new_values = np.tile([5, 6, 9, 0, 1, 1], (n_each, 1))
for n in range(n_all):
    np.savetxt(f"file{str(n+1).zfill(3)}.txt",
               np.hstack((arrays[:, n].reshape(n_each, len(files)).T,
                          new_values)),
               fmt="%7.4f")
To add the comments, the header and comments parameters of np.savetxt are useful. We pass the string to header, and since it already contains "# ", we suppress the extra "# " prefix from np.savetxt by passing comments="":
comment = """\
#
# ascertain thin
# Metamorphs
# pch
# what is that
# 5-r
# Add the thing
# liop34
# liop36
# liop45
# liop34
# M(CM) N(M) O(S) P(cc) ab cd efgh ijkl mnopq rstuv
#"""
# rows are in columns now for easier reshaping; reshape and save
n_all = arrays.shape[1]
new_values = np.tile([5, 6, 9, 0, 1, 1], (n_each, 1))
for n in range(n_all):
    np.savetxt(f"file{str(n+1).zfill(3)}.txt",
               np.hstack((arrays[:, n].reshape(n_each, len(files)).T,
                          new_values)),
               fmt="%7.4f",
               header=comment,
               comments="")

Select last row from each column of multi-index Pandas DataFrame based on time, when columns are unequal length

I have the following Pandas multi-index DataFrame with the top level index being a group ID and the second level index being when, in ISO 8601 time format (shown here without the time):
                                      value  weight
                          when
5e33c4bb4265514aab106a1a  2011-05-12   1.34    0.79
                          2011-05-07   1.22    0.83
                          2011-05-03   2.94    0.25
                          2011-04-28   1.78    0.89
                          2011-04-22   1.35    0.92
...                              ...    ...     ...
5e33c514392b77d517961f06  2009-01-31  30.75    0.12
                          2009-01-24  30.50    0.21
                          2009-01-23  29.50    0.96
                          2009-01-10  28.50    0.98
                          2008-12-08  28.50    0.65
when is currently defined as an index but this is not a requirement.
Assertions
when may be non-unique.
Columns may be of unequal length across groups.
Within groups, when, value and weight will always be of equal length (for each when there will always be a value and a weight).
Question
Using the parameter index_time, how do you retrieve:
The most recent past value and weight from each group relative to index_time, along with the difference (in seconds) between index_time and when.
index_time may be a time in the past such that only entries where when <= index_time are selected.
The result should be indexed in some way so that the group ID of each result can be deduced.
Example
From the above, if the index_time was 2011-05-10 then the result should be:
                          value  weight       age
5e33c4bb4265514aab106a1a   1.22    0.83    259200
5e33c514392b77d517961f06  30.75    0.12  72576000
Where the original DataFrame given in the question is df:
import pandas as pd

when = pd.Timestamp('2011-05-10')  # the index_time
df.sort_index(inplace=True)
# keep rows with when <= index_time, then take the most recent row per
# group; assumes the group-ID index level is named 'id'
result = df.loc[pd.IndexSlice[:, :when], :].groupby('id').tail(1)
result['age'] = when - result.index.get_level_values(level=1)
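A minimal, self-contained sketch of the same approach; the index level names 'id' and 'when' and the abbreviated data are assumptions for illustration, and the age is converted to seconds to match the expected output:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('5e33c4bb4265514aab106a1a', pd.Timestamp('2011-05-12')),
     ('5e33c4bb4265514aab106a1a', pd.Timestamp('2011-05-07')),
     ('5e33c514392b77d517961f06', pd.Timestamp('2009-01-31'))],
    names=['id', 'when'])
df = pd.DataFrame({'value': [1.34, 1.22, 30.75],
                   'weight': [0.79, 0.83, 0.12]}, index=idx)

when = pd.Timestamp('2011-05-10')
df.sort_index(inplace=True)
result = df.loc[pd.IndexSlice[:, :when], :].groupby('id').tail(1)
result['age'] = (when - result.index.get_level_values('when')).total_seconds()
print(result)  # first group: value 1.22, weight 0.83, age 259200, as expected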

pandas time-weighted average groupby in panel data

Hi, I have a panel data set that looks like:
stock  date   time   spread1  weight  spread2
VOD    01-01   9:05     0.01    0.03     ...
VOD    01-01   9.12     0.03    0.05     ...
VOD    01-01  10.04     0.02    0.30     ...
VOD    01-02  11.04     0.02    0.05
...      ...    ...      ...     ...
BAT    01-01   0.05     0.04    0.03
BAT    01-01   0.07     0.05    0.03
BAT    01-01   0.10     0.06    0.04
I want to calculate the weighted average of spread1 for each stock on each day. I can break the solution into several steps: apply groupby and agg to get the sum of spread1*weight for each stock-day in one DataFrame, then calculate the sum of weight for each stock-day in another, and finally merge the two and divide to get the weighted average of spread1.
My question is: is there any simpler way to calculate the weighted average of spread1 here? I also have spread2, spread3 and spread4, so I want to write as little code as possible. Thanks.
IIUC, you need to transform the result back onto the original rows, but using .transform with output that depends on two columns is tricky. We write our own function, to which we pass the spread series s and the original DataFrame df so we can also use the weights:
import numpy as np

def weighted_avg(s, df):
    # look up the weights for the rows in this group's series
    return np.average(s, weights=df.loc[df.index.isin(s.index), 'weight'])

df['spread1_avg'] = df.groupby(['stock', 'date']).spread1.transform(weighted_avg, df)
Output:
  stock   date   time  spread1  weight  spread1_avg
0   VOD  01-01   9:05     0.01    0.03     0.020526
1   VOD  01-01   9.12     0.03    0.05     0.020526
2   VOD  01-01  10.04     0.02    0.30     0.020526
3   VOD  01-02  11.04     0.02    0.05     0.020000
4   BAT  01-01   0.05     0.04    0.03     0.051000
5   BAT  01-01   0.07     0.05    0.03     0.051000
6   BAT  01-01   0.10     0.06    0.04     0.051000
If needed for multiple columns:
gp = df.groupby(['stock', 'date'])
for col in [f'spread{i}' for i in range(1, 5)]:
    df[f'{col}_avg'] = gp[col].transform(weighted_avg, df)
Alternatively, if you don't need to transform back and want one value per stock-date:
import pandas as pd

def my_avg2(gp):
    # weighted average of every spread column within one stock-date group
    avg = np.average(gp.filter(like='spread'), weights=gp.weight, axis=0)
    return pd.Series(avg, index=[col for col in gp.columns if col.startswith('spread')])

### Create some dummy data
df['spread2'] = df.spread1 + 1
df['spread3'] = df.spread1 + 12.1
df['spread4'] = df.spread1 + 1.13

df.groupby(['stock', 'date'])[['weight'] + [f'spread{i}' for i in range(1, 5)]].apply(my_avg2)
#              spread1   spread2    spread3   spread4
#stock date
#BAT   01-01  0.051000  1.051000  12.151000  1.181000
#VOD   01-01  0.020526  1.020526  12.120526  1.150526
#      01-02  0.020000  1.020000  12.120000  1.150000
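An alternative arithmetic sketch, assuming the same df as above: the weighted mean is sum(spread*weight) / sum(weight) per group, which two built-in 'sum' transforms can compute without a custom function:

g = df.groupby(['stock', 'date'])
w_sum = g['weight'].transform('sum')
for col in [f'spread{i}' for i in range(1, 5)]:
    # sum of spread*weight per group, aligned back to the original rows
    df[f'{col}_avg'] = (df[col] * df['weight']).groupby(
        [df['stock'], df['date']]).transform('sum') / w_sum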

Group by in Matlab to find the value that resulted in the minimum, similar to SQL

I have a dataset with columns a, b, c and d.
I want to group the dataset by a, b and find c such that d is minimum for each group.
I can do the "group by" using grpstats as:
grpstats(M,[M(:,1) M(:,2)],{'min'});
I don't know how to find the value of M(:,3) that resulted in the min of d.
In SQL I suppose we would use nested queries and the primary keys for that. How can I solve it in Matlab?
Here is an example:
>> M = [4,1,7,0.3;
        2,1,8,0.4;
        2,1,9,0.2;
        4,2,1,0.2;
        2,2,2,0.6;
        4,2,3,0.1;
        4,3,5,0.8;
        5,3,6,0.2;
        4,3,4,0.5]
>> grpstats(M,[M(:,1) M(:,2)],'min')
ans =
2.0000 1.0000 8.0000 0.2000
2.0000 2.0000 2.0000 0.6000
4.0000 1.0000 7.0000 0.3000
4.0000 2.0000 1.0000 0.1000
4.0000 3.0000 4.0000 0.5000
5.0000 3.0000 6.0000 0.2000
But ans(1,3) and ans(4,3) are wrong. The correct answer that I am looking for is:
2.0000 1.0000 9.0000 0.2000
2.0000 2.0000 2.0000 0.6000
4.0000 1.0000 7.0000 0.3000
4.0000 2.0000 3.0000 0.1000
4.0000 3.0000 4.0000 0.5000
5.0000 3.0000 6.0000 0.2000
To conclude: I don't want the minimum of the third column; I want its values corresponding to the minimum in the 4th column.
grpstats won't do this, and MATLAB doesn't make it as easy as you might hope.
Sometimes brute force is best, even if it doesn't feel like great MATLAB style:
[b,m,n] = unique(M(:,1:2),'rows');
for i = 1:numel(m)
    idx = find(n==i);
    [~,subidx] = min(M(idx,4));
    a(i,:) = M(idx(subidx),3:4);
end
>> [b,a]
ans =
2 1 9 0.2
2 2 2 0.6
4 1 7 0.3
4 2 3 0.1
4 3 4 0.5
5 3 6 0.2
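For comparison, the same grouped argmin as a pandas sketch (Python rather than MATLAB; the data is inlined from the example above):

import pandas as pd

df = pd.DataFrame([[4,1,7,0.3],[2,1,8,0.4],[2,1,9,0.2],
                   [4,2,1,0.2],[2,2,2,0.6],[4,2,3,0.1],
                   [4,3,5,0.8],[5,3,6,0.2],[4,3,4,0.5]],
                  columns=['a', 'b', 'c', 'd'])
# index of the row with the minimum d within each (a, b) group, then fetch those rows
print(df.loc[df.groupby(['a', 'b'])['d'].idxmin()].sort_values(['a', 'b']))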
I believe that
temp = grpstats(M(:, [1 2 4 3]),[M(:,1) M(:,2) ],{'min'});
result = temp(:, [1 2 4 3]);
would do what you require. If it doesn't, please explain in the comments and we can figure it out...
If I understand the documentation correctly, even
temp = grpstats(M(:, [1 2 4 3]), [1 2], {'min'});
result = temp(:, [1 2 4 3]);
should work (giving column numbers rather than full contents of columns)... Can't test right now, so can't vouch for that.

SQL linear interpolation based on lookup table

I need to build linear interpolation into an SQL query, using a joined table containing lookup values (more like lookup thresholds, in fact). As I am relatively new to SQL scripting, I have searched for example code to point me in the right direction, but most of the SQL scripts I came across were for interpolating between dates and timestamps, and I couldn't relate these to my situation.
Basically, I have a main data table with many rows of decimal values in a single column, for example:
Main_Value
0.33
0.12
0.56
0.42
0.1
Now, I need to yield interpolated data points for each of the rows above, based on a joined lookup table with 6 rows, containing non-linear threshold values and the associated linear normalized values:
Threshold_Level Normalized_Value
0 0
0.15 20
0.45 40
0.60 60
0.85 80
1 100
So for example, if the value in the Main_Value column is 0.45, the query will look up its position in (or between) the nearest Threshold_Level values and interpolate based on the adjacent value in the Normalized_Value column (which would yield a value of 40 in this example).
I really would be grateful for any insight into building a SQL query around this, especially as it has been hard to track down any SQL examples of linear interpolation using a joined table.
It has been pointed out that I could use some sort of rounding, so I have included a more detailed table below. I would like the SQL query to lookup each Main_Value (from the first table above) where it falls between the Threshold_Min and Threshold_Max values in the table below, and return the 'Normalized_%' value:
Threshold_Min Threshold_Max Normalized_%
0.00 0.15 0
0.15 0.18 5
0.18 0.22 10
0.22 0.25 15
0.25 0.28 20
0.28 0.32 25
0.32 0.35 30
0.35 0.38 35
0.38 0.42 40
0.42 0.45 45
0.45 0.60 50
0.60 0.63 55
0.63 0.66 60
0.66 0.68 65
0.68 0.71 70
0.71 0.74 75
0.74 0.77 80
0.77 0.79 85
0.79 0.82 90
0.82 0.85 95
0.85 1.00 100
For example, if the value from the Main_Value table is 0.52, it falls between Threshold_Min 0.45 and Threshold_Max 0.60, so the Normalized_% returned is 50%. The problem is that the Threshold_Min and Max values are not linear. Could anyone point me in the direction of how to script this?
Assuming you want, for each Main_Value, the Normalized_Value of the nearest threshold that is lower than or equal to it, you can do it like this:
select t1.Main_Value, max(t2.Normalized_Value) as Normalized_Value
from #t1 t1
inner join #t2 t2 on t1.Main_Value >= t2.Threshold_Level
group by t1.Main_Value
Replace #t1 and #t2 with the correct table names.
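If true linear interpolation between thresholds is still wanted (rather than the stepwise lookup above), the formula between bracketing rows (x0, y0) and (x1, y1) is y = y0 + (x - x0) * (y1 - y0) / (x1 - x0). A Python sketch of that logic against the first lookup table (an SQL version would pick the bracketing rows with a self-join):

# thresholds from the first lookup table in the question
thresholds = [(0.00, 0), (0.15, 20), (0.45, 40), (0.60, 60), (0.85, 80), (1.00, 100)]

def interpolate(x):
    # find the bracketing pair and apply the interpolation formula
    for (x0, y0), (x1, y1) in zip(thresholds, thresholds[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError('value outside threshold range')

print(interpolate(0.45))  # 40.0, matching the example in the question
print(interpolate(0.33))  # 32.0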