Selecting second-column data based on a match of the first column with another text file in Python (numpy)

I have little knowledge of numpy arrays and iterations. I have two input files. The first column of both files represents time in milliseconds. Input file 1 holds the reference (simulated) values, and input file 2 holds the obtained (test) values. I want to compare the second column of input file 2 with the second column of input file 1 (and plot second vs. first), but only where the times in the first columns of the two files match.
I have been trying to do this with iterations but have not found proper results yet. How do I find the index where there is a match?
import numpy as np

my_file = np.genfromtxt('path/input1.txt')   # input file 1: reference / simulated values
Sim_val = np.genfromtxt('path/input2.txt')   # input file 2: obtained / test values

inp1 = my_file[:, 0]      # time column of file 1
inp12 = my_file[:, 1]     # value column of file 1
inpt2 = Sim_val[:, 0]     # time column of file 2
inpt21 = Sim_val[:, 1]    # value column of file 2

ldata = np.array([inp1, inp12]).T      # file 1 as (time, value) rows
kdata = np.array([inpt2, inpt21]).T    # file 2 as (time, value) rows

i = np.searchsorted(kdata[:, 0], ldata[:, 0])
print(i)
My input file 2 (obtained values) and input file 1 (simulated values) are:

Input file 2 (obtained)    Input file 1 (simulated)
0     5                    0     5
100   6                    50    6
200   10                   200   15
300   12                   350   12
400   15                   400   15
500   20                   500   25
600   0                    650   0
700   11                   700   11
800   12                   850   8
900   19                   900   19
1000  10                   1000  3
I am having a really hard time with numpy arrays and iterations.
Could anybody please suggest how I can solve the above problem? In fact I have other columns too, but all the manipulation depends on the match of the first column (the time match).
Once again, thank you very much in advance.

Did you mean something like
import numpy as np

simulated = np.array([
    (0, 5),
    (100, 6),
    (200, 10),
    (300, 12),
    (400, 15),
    (500, 20),
    (600, 0),
    (700, 11),
    (800, 12),
    (900, 19),
    (1000, 10)
])

actual = np.array([
    (0, 5),
    (50, 6),
    (200, 15),
    (350, 12),
    (400, 15),
    (500, 25),
    (650, 0),
    (700, 11),
    (850, 8),
    (900, 19),
    (1000, 3)
])

def indexes_where_match(A, B):
    """ an iterator that goes over the indexes of wherever the entries in A's first-col and B's first-col match """
    return (i for i, (a, b) in enumerate(zip(A, B)) if a[0] == b[0])

def main():
    for i in indexes_where_match(simulated, actual):
        print(simulated[i][1], 'should be compared to', actual[i][1])

if __name__ == '__main__':
    main()
You could also use column-slicing, like this:
simulated_time, simulated_values = simulated[..., 0], simulated[..., 1:]
actual_time, actual_values = actual[..., 0], actual[..., 1:]
indexes_where_match = (i for i, (a, b) in enumerate(zip(simulated_time, actual_time)) if a == b)
for i in indexes_where_match:
    print(simulated_values[i], 'should be compared to', actual_values[i])
# outputs:
# [5] should be compared to [5]
# [10] should be compared to [15]
# [15] should be compared to [15]
# [20] should be compared to [25]
# [11] should be compared to [11]
# [19] should be compared to [19]
# [10] should be compared to [3]
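If the two files do not have the same number of rows, or the matching times do not sit at the same row positions, a vectorized alternative is np.intersect1d with return_indices=True (NumPy 1.15+). This is a minimal sketch, reusing the simulated and actual arrays defined above:

sim_time, sim_values = simulated[:, 0], simulated[:, 1]
act_time, act_values = actual[:, 0], actual[:, 1]

# common times, plus the positions where they occur in each array
common, sim_idx, act_idx = np.intersect1d(sim_time, act_time, return_indices=True)

for t, s, a in zip(common, sim_values[sim_idx], act_values[act_idx]):
    print('t =', t, ':', s, 'should be compared to', a)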

Related

Large Sampling with Replacement by index layer of a Pandas multiindexed Dataframe

Imagine a dataframe with the structure below:
>>> print(pair_df)
0 1
centre param h pair_ind
0 x1 1 (0, 1) 2.244282 2.343915
(1, 2) 2.343915 2.442202
(2, 3) 2.442202 2.538162
(3, 4) 2.538162 2.630836
(4, 5) 2.630836 2.719298
... ... ...
9 x3 7 (1, 8) 1.407902 1.417398
(2, 9) 1.407953 1.422860
8 (0, 8) 1.407896 1.417398
(1, 9) 1.407902 1.422860
9 (0, 9) 1.407896 1.422860
[1350 rows x 2 columns]
What is the most efficient way to sample this dataframe (with replacement) a large number of times (e.g., 1000 times) by the index level centre (10 values here) and put the samples all together?
I have found two solutions:
1)
import numpy as np
import pandas as pd

idx = pd.IndexSlice   # assumed here; the original snippet uses idx without defining it
bootstrap_rand = np.random.choice(list(range(0, 10)), size=10 * 1000, replace=True).tolist()
sampled_df = pd.concat([pair_df.loc[idx[i, :, :, :], :] for i in bootstrap_rand])
2)
sampled_df = pair_df.unstack(['param', 'h', 'pair_ind']).\
    sample(10 * 1000, replace=True).\
    stack(['param', 'h', 'pair_ind'])
Any more efficient ideas?
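One further idea, sketched here under the assumption that centre is a named level of pair_df's index: split the frame once with groupby and reuse the pre-split pieces, which avoids repeating the .loc lookup inside the list comprehension.

import numpy as np
import pandas as pd

# split once by the 'centre' level, then look the pieces up by key
pieces = {key: grp for key, grp in pair_df.groupby(level='centre')}
bootstrap_rand = np.random.choice(list(pieces), size=10 * 1000, replace=True)
sampled_df = pd.concat([pieces[key] for key in bootstrap_rand])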

Negative integers as the third parameter of np.r_? (numpy)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.r_.html
Negative integers specify where in the new shape tuple the last dimension of upgraded arrays should be placed, so the default is ‘-1’.
What does this sentence mean?
np.r_['0,2,-5', [1,2,3],[4,5,6] ] # ValueError: all the input array dimensions except for the concatenation axis must match exactly
np.r_['0,2,-6', [1,2,3],[4,5,6] ] # array([[1],[2],[3],[4],[5],[6]])
Both -5 and -6 exceed the second parameter "2" in '0,2,-5', so why does -5 fail to run while -6 works?
The description of this third value is a bit confusing, but with these lists and the other numbers there are two possibilities (plus error cases):
In [31]: np.r_['0,2', [1,2,3],[4,5,6] ] # or '0,2,-1'
Out[31]:
array([[1, 2, 3],
[4, 5, 6]])
In [32]: np.r_['0,2,0', [1,2,3],[4,5,6] ]
Out[32]:
array([[1],
[2],
[3],
[4],
[5],
[6]])
[1,2,3] as an array has shape (3,). The '2' means expand it to 2d, either (1,3) or (3,1). The third digit controls which. Details of how it works are a bit complicated.
You can look at the code yourself at np.lib.index_tricks.AxisConcatenator.
In my tests '0,2,1' behaves like the default, and so does '0,2,-3'. Other positive values produce an error; other negative ones behave like 0. '-5' is the same as '-6' in my tests.
In [46]: np.r_['0,2,-5', [1,2,3],[4,5,6] ].shape
Out[46]: (6, 1)
In [47]: np.r_['0,2,-6', [1,2,3],[4,5,6] ].shape
Out[47]: (6, 1)
For a 3d expansion, the 3 possibilities are:
In [48]: np.r_['0,3,-1', [1,2,3],[4,5,6] ].shape # (1,1,3)
Out[48]: (2, 1, 3)
In [49]: np.r_['0,3,0', [1,2,3],[4,5,6] ].shape # (3,1,1)
Out[49]: (6, 1, 1)
In [50]: np.r_['0,3,1', [1,2,3],[4,5,6] ].shape # (1,3,1)
Out[50]: (2, 3, 1)
In the case of a (2,3) shape array expanding to 3d, the alternatives are (2,3,1) or (1,2,3). It can't insert a new dimension in the middle.
In [60]: np.r_['0,3,0', np.ones((2,3))].shape
Out[60]: (2, 3, 1)
In [61]: np.r_['0,3,-1', np.ones((2,3))].shape
Out[61]: (1, 2, 3)
===
With ndmin, the 2nd integer, giving the desired number of dimensions, each array is expanded with:
newobj = array(item, copy=False, subok=True, ndmin=ndmin)
then the 3rd integer is applied via a transpose. The transpose parameter is calculated with an obscure piece of code:
k2 = ndmin - item_ndim
k1 = trans1d
if k1 < 0:
    k1 += k2 + 1
defaxes = list(range(ndmin))
axes = defaxes[:k1] + defaxes[k2:] + defaxes[k1:k2]
newobj = newobj.transpose(axes)
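To see how that arithmetic produces the shapes above, you can trace it by hand. The helper below is only a sketch mirroring the snippet, evaluated for the 1-D inputs from the question (ndmin=2, item_ndim=1):

def r_axes(ndmin, item_ndim, trans1d):
    # reproduce the index arithmetic above and return the transpose order
    k2 = ndmin - item_ndim
    k1 = trans1d
    if k1 < 0:
        k1 += k2 + 1
    defaxes = list(range(ndmin))
    return defaxes[:k1] + defaxes[k2:] + defaxes[k1:k2]

print(r_axes(2, 1, -1))   # [0, 1]: keep the (1, 3) shape
print(r_axes(2, 1, 0))    # [1, 0]: transpose to (3, 1)
print(r_axes(2, 1, -6))   # [1, 0]: also (3, 1), matching the '0,2,-6' result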
A couple of versions back the code did trans1d += k2 + 1, so trans1d changed from one array to the next (-5 to -3 to -1). It ended up trying to concatenate a (3,1) with a (1,3), raising the ValueError.
I found this bug fix by looking at the 'blame' mode of the https://github.com/numpy/numpy/blame/master/numpy/lib/index_tricks.py file:
https://github.com/numpy/numpy/commit/e7d571396e92b670a0e8de6e50366ba1dbee3c6e
BUG: Fix mutating state between items in np,r_

Sum of data entry with the given index in pandas dataframe

I am trying to get the sums of all possible combinations of the given data in a pandas dataframe. To do this I use itertools.combinations to get all possible index combinations, and then sum each of them in a loop.
Is there any way to do this without using the loop?
Please check the following script that I created to show what I want.
import pandas as pd
import itertools as it

A = pd.Series([50, 20, 75], index=list(range(1, 4)))
df = pd.DataFrame({'A': A})

listNew = []
for i in range(1, len(df.A) + 1):
    Temp = it.combinations(df.index.values, i)
    for data in Temp:
        listNew.append(data)
print(listNew)

for data in listNew:
    print(df.A[list(data)].sum())
The output of this script is:
[(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
50
20
75
70
125
95
145
Thank you in advance.
IIUC, using reindex:
# convert your list of tuples to a data frame and use stack to flatten it
s = pd.DataFrame([(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]).stack().to_frame('index')
# then we reindex df.A based on that order
s['Value'] = df.reindex(s['index']).A.values
# you could use groupby here, but since the index is already in place, I recommend sum with level
s = s.Value.sum(level=0)
s
Out[796]:
0 50
1 20
2 75
3 70
4 125
5 95
6 145
Name: Value, dtype: int64
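Note that in recent pandas versions Series.sum(level=0) is deprecated in favour of an explicit groupby on the index level, so the last step above would instead be written roughly as:

s = s.groupby(level=0)['Value'].sum()   # same result as s.Value.sum(level=0)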

Python, Numpy: all UNIQUE combinations of a numpy.array() vector

I want to get all unique combinations of a numpy.array vector (or a pandas.Series). I used itertools.combinations but it's very slow. For an array of size (1000,) it takes many hours. Here is my code using itertools (actually I use combination differences):
def a(array):
    temp = pd.Series([])
    for i in itertools.combinations(array, 2):
        temp = temp.append(pd.Series(np.abs(i[0] - i[1])))
    temp.index = range(len(temp))
    return temp
As you can see, there is no repetition!
sklearn.utils.extmath.cartesian is really fast and good, but it produces repetitions, which I do not want. I need help rewriting the above function without itertools, and with much more speed, for large vectors.
You could take the upper triangular part of a matrix formed on the Cartesian product with the binary operation (here subtraction, as in your example):
import numpy as np
n = 3
a = np.random.randn(n)
print(a)
print(a - a[:, np.newaxis])
print((a - a[:, np.newaxis])[np.triu_indices(n, 1)])
gives
[ 0.04248369 -0.80162228 -0.44504522]
[[ 0. -0.84410597 -0.48752891]
[ 0.84410597 0. 0.35657707]
[ 0.48752891 -0.35657707 0. ]]
[-0.84410597 -0.48752891 0.35657707]
with n=1000 (and output piped to /dev/null) this runs in 0.131s
on my relatively modest laptop.
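If you want the same interface as the a(array) function in the question, the upper-triangle trick can be wrapped up as follows; this is just a sketch, assuming the absolute pairwise differences are what you ultimately need:

import numpy as np
import pandas as pd

def pairwise_abs_diff(array):
    # absolute difference of every unordered pair, without repetition
    array = np.asarray(array)
    diffs = np.abs(array - array[:, np.newaxis])
    return pd.Series(diffs[np.triu_indices(len(array), 1)])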
For a random array of ints:
import numpy as np
import pandas as pd
import itertools as it
b = np.random.randint(0, 8, ((6,)))
# array([7, 0, 6, 7, 1, 5])
pd.Series(list(it.combinations(np.unique(b), 2)))
it returns:
0 (0, 1)
1 (0, 5)
2 (0, 6)
3 (0, 7)
4 (1, 5)
5 (1, 6)
6 (1, 7)
7 (5, 6)
8 (5, 7)
9 (6, 7)
dtype: object

numpy, sums of subsets with no iterations [duplicate]

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use bincount():
import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
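bincount assumes the ids are small non-negative integers. If they are large, negative, or not integers at all, one option (a small sketch) is to map them to positions first with np.unique:

import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

uniq, inv = np.unique(ids, return_inverse=True)   # inv maps each id to 0..len(uniq)-1
sums = np.bincount(inv, weights=data)
print(uniq)   # [1 2 3]
print(sums)   # [50. 21. 18.]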
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data with that module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default the result would be sorted by group key, therefore I use the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but it will clearly have trouble if you have a very large number of unique ids along with a large overall data table.
If you're looking only for sum you probably want to go with bincount. If you also need other grouping operations like product, mean, std etc. have a look at https://github.com/ml31415/numpy-groupies . It's the fastest python/numpy grouping operations around, see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the grouping function, in this case the ID.)
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
        for y in i:
            if isinstance(y, int):
                print(y)
            else:
                for p in y:
                    print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
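To turn the groups into the per-id sums the question asks for, you can sum the score column inside each group; a small sketch using the same (already sorted) data:

import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, key=lambda x: x[0])}
print(sums)   # {1: 50, 2: 4}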