Say I want to pad a numpy array, to make room for an extra column of values in one of the dimensions:
>>> cells.shape
(2, 3, 12, 4)
>>> padded = np.pad(cells, ((0,0),(0,0),(0,0),(0,1)))
>>> padded.shape
(2, 3, 12, 5)
If I have the values for the new column in another 1D array, what is the most efficient way to insert them into cells?
The answer I found, with help from @user3483203 in the comments:
If we start with:
>>> cells.shape
(2, 3, 12, 4)
And we pad that array in the last dimension to add another column:
>>> padded = np.pad(cells, ((0,0),(0,0),(0,0),(0,1)))
>>> padded.shape
(2, 3, 12, 5)
The nicest way I found to insert the values into the new column is:
>>> padded[..., -1] = new_values
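Put together, a minimal runnable sketch (with made-up data; new_values must match cells.shape[:-1], here (2, 3, 12)):

import numpy as np

cells = np.zeros((2, 3, 12, 4))
new_values = np.arange(2 * 3 * 12).reshape(2, 3, 12)

# pad a fifth column onto the last dimension, then fill it
padded = np.pad(cells, ((0, 0), (0, 0), (0, 0), (0, 1)))
padded[..., -1] = new_values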
Apologies if this has already been asked; I haven't found anything specific enough, although this does seem like a general question. Anyway, I have two lists of values which correspond to values in a dataframe, and I need to pull the rows which contain those values and make them into another dataframe. The code I have works, but it seems quite slow (14 seconds per 250 items). Is there a smart way to speed it up?
row_list = []
for i, x in enumerate(datetime_list):
    row_list.append(df.loc[(df["datetimes"] == x) & (df["b"] == b_list[i])])
data = pd.concat(row_list)
Edit: Sorry for the vagueness @anky, here's an example dataframe:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'datetimes': [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
                   'b': [0, 1, 2],
                   'c': [500, 600, 700]})
IIUC, try this
dfi = df.set_index(['datetimes', 'b'])
data = dfi.loc[list(zip(datetime_list, b_list)), :].reset_index()
Without test data in the question it is hard to verify whether this is correct.
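For illustration with the sample frame above (datetime_list and b_list here are made up):

datetime_list = [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 4)]
b_list = [0, 2]

dfi = df.set_index(['datetimes', 'b'])
# select the rows whose (datetimes, b) pair matches, in the requested order
data = dfi.loc[list(zip(datetime_list, b_list)), :].reset_index()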
Currently I'm trying to automate scheduling.
I'll get the requirements as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So, I want to put the value '*' as a marker meaning the end of the table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns an index (the column and row names, or index numbers).
Is there any way to find the index (or a list of indexes) of a certain value, like a coordinate?
For example, when the data frame is like the one below:

  | column_1 | column_2
--+----------+---------
1 |   'a'    |   'b'
2 |   'c'    |   'd'

how can I get 'column_2' and 2 from the value 'd'? It's something like the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
[(col, np.where(df[col] == 'd')[0].tolist()) for col in df.columns if len(np.where(df[col] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) index, not the custom (1-based) index you have. If you have a fixed offset you could just add a +1 or whatever to the output.
If I understand what you are looking for — find the (index value, column location) for a value in a dataframe — you can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution contains the row and column positional indices (two arrays).
Hope this helps!
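Note that np.where returns positional (0-based) indices; to map them back to the frame's labels, something like this should work (a sketch):

rows, cols = np.where(df_tmp == 'd')
list(zip(df_tmp.index[rows], df_tmp.columns[cols]))
# [(2, 'column_2')]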
To search single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
Also works when value occurs at multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
  column_1 column_2
1        a        b
2      NaN        d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
The result is a Series with a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
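If the input order matters, one workaround is to sort the hits by each matched value's position in the search list (a sketch):

search = ['d', 'b', 'a']
hits = df[df.isin(search)].stack()
order = {v: i for i, v in enumerate(search)}
sorted(hits.index.tolist(), key=lambda loc: order[hits[loc]])
# [(2, 'column_2'), (1, 'column_2'), (1, 'column_1')]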
I had a similar need and this worked perfectly:
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the row index
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the column index
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# Do whatever you want to do, e.g. replace the value in the cell above
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'
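For reference, a self-contained run of that pattern (raw_df and 'VALUE' are stand-ins):

import pandas as pd

raw_df = pd.DataFrame({'A': ['x', 'value', 'y'],
                       'B': ['p', 'q', 'r']})
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
df.isin(['VALUE']).any(axis=1).tolist().index(True)   # 1 (row location)
df.isin(['VALUE']).any(axis=0).tolist().index(True)   # 0 (column location)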
I'm trying to express the N-D behaviour of np.dot using only 2-D np.dot or np.tensordot.
To recap, np.dot does something like the following for N-D: It matches/broadcasts the arrays along all dimensions but the last two and performs dot products for all of them. For example, if x.shape is (2, 3, 4, 5) and y.shape is (2, 3, 5, 4), np.dot(x, y).shape is (2, 3, 4, 4) and np.dot(x, y)[i, j] is np.dot(x[i, j], y[i, j]).
Also, if x.shape is just (4, 5), it will first be broadcast to (2, 3, 4, 5).
I tried np.tensordot(x, y, axes=(-1, -2)) but it repeats along every dimension of x and y instead of matching them up.
I realise I could write a loop but I was looking for a vectorised solution.
You got the broadcasting behavior of np.dot wrong:
In [254]: x=np.ones((2,3,4,5)); y=np.ones((2,3,5,4))
In [255]: np.dot(x,y).shape
Out[255]: (2, 3, 4, 2, 3, 4)
In [256]: np.matmul(x,y).shape
Out[256]: (2, 3, 4, 4)
and for the (4,5) x:
In [257]: np.dot(x[0,0],y).shape
Out[257]: (4, 2, 3, 4)
In [258]: np.matmul(x[0,0],y).shape
Out[258]: (2, 3, 4, 4)
matmul was added precisely because np.dot does not act like it is performing np.dot(x[i,j,:,:], y[i,j,:,:]) for all i,j.
The shape in Out[255] is the shape of x minus the 5, plus the shape of y minus its 5. In effect an outer product of everything, with summing over the size-5 dimension.
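That outer pairing can be spelled out with einsum (a sketch; m and n range over y's leading dimensions):

# np.dot contracts x's last axis with y's second-to-last, pairing
# every leading index of x with every leading index of y
np.einsum('ijkl,mnlp->ijkmnp', x, y).shape   # (2, 3, 4, 2, 3, 4), cf Out[255]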
tensordot uses np.dot under the hood. It just reshapes and transposes the inputs to reduce the problem to a 2d dot, then massages the result back to the desired shape and order.
In [259]: np.tensordot(x, y, axes=(-1,-2)).shape
Out[259]: (2, 3, 4, 2, 3, 4) # cf Out[255]
In [261]: np.einsum('ijkl,ijlm->ijkm',x,y).shape
Out[261]: (2, 3, 4, 4) # cf Out[256]
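As a quick sanity check of the equivalence above (a sketch with random data):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 5))
y = rng.standard_normal((2, 3, 5, 4))

# matmul and the einsum spelling agree...
assert np.allclose(np.matmul(x, y), np.einsum('ijkl,ijlm->ijkm', x, y))

# ...and both match an explicit loop of 2-D dots
looped = np.empty((2, 3, 4, 4))
for i in range(2):
    for j in range(3):
        looped[i, j] = np.dot(x[i, j], y[i, j])
assert np.allclose(np.matmul(x, y), looped)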
Since sparse matrices are 2d to start with (and end with), I don't understand your question. If you have multiple sparse matrices, you'll have to work with them individually.
Let's take a very simple case: an array with shape (2,3,4), ignoring the values.
>>> a.shape
(2, 3, 4)
When we transpose it and print the dimensions:
>>> a.transpose([1,2,0]).shape
(3, 4, 2)
So I'm saying: take axis index 2 and make it the first, then take axis index 0 and make it the second and finally take axis index 1 and make it the third. I should get (4,2,3), right?
Well, I thought perhaps I don't understand the logic fully. So I read the documentation, and it says:
Use transpose(a, argsort(axes)) to invert the transposition of tensors
when using the axes keyword argument.
So I tried
>>> c = np.transpose(a, [1,2,0])
>>> c.shape
(3, 4, 2)
>>> np.transpose(a, np.argsort([1,2,0])).shape
(4, 2, 3)
and got yet another, completely different shape!
Could someone please explain this? Thanks.
In [259]: a = np.zeros((2,3,4))
In [260]: idx = [1,2,0]
In [261]: a.transpose(idx).shape
Out[261]: (3, 4, 2)
What this has done is take a.shape[1] dimension and put it first. a.shape[2] is 2nd, and a.shape[0] third:
In [262]: np.array(a.shape)[idx]
Out[262]: array([3, 4, 2])
transpose without a parameter is a complete reversal of the axis order. It's an extension of the familiar 2d transpose (rows become columns, columns become rows):
In [263]: a.transpose().shape
Out[263]: (4, 3, 2)
In [264]: a.transpose(2,1,0).shape
Out[264]: (4, 3, 2)
And the do-nothing transpose:
In [265]: a.transpose(0,1,2).shape
Out[265]: (2, 3, 4)
You have an initial axes order and a final one; describing the swap can be hard to visualize if you don't regularly work with lists of size 3 or larger.
Some people find it easier to use swapaxes, which interchanges just two axes. rollaxis is yet another way.
I prefer transpose since it can do anything the others can, so I only have to develop an intuition for one tool.
The argsort comment operates this way:
In [278]: a.transpose(idx).transpose(np.argsort(idx)).shape
Out[278]: (2, 3, 4)
That is, apply it to the result of one transpose to get back the original order.
np.argsort([1,2,0]) returns the array [2,0,1].
So
np.transpose(a, np.argsort([1,2,0])).shape
acts like
np.transpose(a, [2,0,1]).shape
not
np.transpose(a, [1,2,0]).shape
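A quick check of that composition (a sketch):

import numpy as np

a = np.zeros((2, 3, 4))
idx = [1, 2, 0]
inv = np.argsort(idx)                    # array([2, 0, 1])
a.transpose(idx).transpose(inv).shape    # (2, 3, 4): back to the original
a.transpose(inv).shape                   # (4, 2, 3): the shape seen in the question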
How do I remove rows from an ndarray which have the same nth-column value?
For example,
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
And I want the rows to be unique by the third column.
I want to have just the [1, 3, 5] row left.
numpy.unique does not do it: it checks for uniqueness across every column, and I can't specify the column by which to check uniqueness.
How can I do this efficiently for thousand + rows?
Thank you.
You could try a combination of bincount, nonzero and in1d:
import numpy as np

a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])

# the third-column values which occur exactly once
unique_in_column = (np.bincount(a[:, 2]) == 1).nonzero()

# boolean mask of the rows whose third-column value is unique
unique_index = np.in1d(a[:, 2], unique_in_column[0])
unique_a = a[unique_index]
This should do the trick. However, I'm not sure how this method scales with 1000+ rows.
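A variant that avoids bincount's restriction to small non-negative integers would be np.unique with return_counts (a sketch with the same semantics):

vals, counts = np.unique(a[:, 2], return_counts=True)
unique_a = a[np.in1d(a[:, 2], vals[counts == 1])]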
This is what I finally did:
repeatdict = {}
todel = []
for i, row in enumerate(kplist):
    if repeatdict.get(row[2], 0):
        todel.append(i)
    else:
        repeatdict[row[2]] = 1
kplist = np.delete(kplist, todel, axis=0)
Basically, I iterated over the list, storing the values of the third column; if in a later iteration the same value is already present in the repeatdict dict, that row is marked for deletion by storing its index in the todel list.
Then we can get rid of the unwanted rows by calling np.delete with the list of all row indexes we want to delete.
Also, I'm not accepting my own answer, because I know there's probably a better way to do this with just numpy magic.
I'll wait.
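For reference, a possible vectorized equivalent of the loop above, keeping the first occurrence of each third-column value (a sketch):

# np.unique with return_index gives the index of the first occurrence
# of each distinct value; sorting preserves the original row order
_, first_idx = np.unique(kplist[:, 2], return_index=True)
kplist = kplist[np.sort(first_idx)]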