OK, so this is confusing because of a lack of vocabulary.
A pandas series has an index and values, so each entry is effectively an (index, value) pair.
How do I get the index (in my case it is a date) out of the series by position? This is really a very simple idea... it is just obscured by the word "index." lol.
So, to rephrase,
I need the date of the first entry in my series and the last entry, when my series is indexed by date.
Just to be clear, I have a series indexed by date, so when I print it out, it prints:
12-12-2008 1.2
12-13-2008 1.3
...
and calling
df.ix[0] -> 1.2
I need:
df.something[0] -> 12-12-2008
Got it.
df.index[0]
yields the label at index 0.
You can access the elements of your index just as you would a list. So df.index[0] will be the first element of your index and df.index[-1] will be the last.
Incidentally, if a series (or dataframe) has a non-integer index, df.ix[n] returns the n-th row, i.e. the row at the n-th element of your index.
So df.ix[0] returns the first row and df.ix[-1] returns the last row. For a dataframe, an alternative way of getting the index values is therefore df.ix[0].name and df.ix[-1].name (for a series, df.ix[0] is just the value and has no .name). Note that .ix was removed in pandas 1.0; df.iloc[n] is the positional equivalent.
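A minimal sketch of the answers above, assuming a recent pandas (where .iloc stands in for .ix). The dates and values are made up to resemble the printout in the question:

```python
import pandas as pd

# Hypothetical data mirroring the printout in the question.
dates = pd.to_datetime(["2008-12-12", "2008-12-13", "2008-12-14"])
s = pd.Series([1.2, 1.3, 1.4], index=dates)

first_label = s.index[0]    # index label (date) of the first entry
last_label = s.index[-1]    # index label (date) of the last entry
first_value = s.iloc[0]     # value of the first entry, selected by position

print(first_label.date(), last_label.date(), first_value)
```

s.index behaves like a sequence here, so slicing such as s.index[:2] works as well.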
I have a data frame that is a single row of numerical values and I want to know if any of those values is greater than 2 and if so create a new column with the word 'Diff'
Col_,F_1,F_2
1,5,0
My dataframe is diff_df. Here is one thing I tried:
c = diff_df > 2
if c.any():
    diff_df['difference'] = 'Difference'
If I were to print c, it would be
Col_,F_1,F_2
False,True,False
I have tried c.all() and many variations of other things. Clearly my inexperience is holding me back, and Google is not helping in this regard. Everything I try ends in "The truth value of a Series (or DataFrame) is ambiguous. Use a.any(), a.all()....". Any help would be appreciated.
Since it is only one row, take the .max().max() of the dataframe. The first .max() gives the maximum of each column; the second .max() takes the maximum of those column maxima.
if diff_df.max().max() > 2: diff_df['difference']='Difference'
output:
Col_ F_1 F_2 difference
0 1 5 0 Difference
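For what it's worth, the error in the question comes from c.any() returning one boolean per column when c is a dataframe, and `if` cannot reduce that to a single truth value. A minimal sketch using the sample row from the question (chaining .any().any() is one possible fix, not the only one):

```python
import pandas as pd

# The single-row frame from the question.
diff_df = pd.DataFrame({"Col_": [1], "F_1": [5], "F_2": [0]})

c = diff_df > 2              # dataframe of booleans, same shape as diff_df
per_column = c.any()         # series: one boolean per column (ambiguous in `if`)
any_at_all = c.any().any()   # reduced again to a single boolean

if any_at_all:
    diff_df["difference"] = "Difference"

print(diff_df)
```

The comparison is done before the new column is added, since comparing the string column against 2 would raise an error.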
Use the .loc accessor with .gt() to select the matching rows and, at the same time, create and populate the new column:
df.loc[df.gt(2).any(axis=1), "difference"] = 'Difference'
Col_ F_1 F_2 difference
0 1 5 0 Difference
In addition to David's response, you may also try this:
if ((df > 2).astype(int)).sum(axis=1).values[0] >= 1:
    df['difference'] = 'Difference'
If I want to find the minimum value in each column of a NumPy array, I can use the numpy.amin() function. However, is there a way to find the two minimum values in each column, that is faster than sorting each column?
You can simply use np.partition along the columns to get the smallest N numbers:
N = 2
np.partition(a,kth=N-1,axis=0)[:N]
This doesn't sort the entire data; it only partitions it into two sections such that the smallest N numbers end up in the first section. This is also known as a partial sort. In general, the order within each section is not guaranteed.
Bonus (getting the top N elements): similarly, to get the top N numbers per column, simply use a negative kth value:
np.partition(a,kth=-N,axis=0)[-N:]
Along other axes and higher-dimensional arrays
To use it along other axes, change the axis value. So, along rows, it would be axis=1 for a 2D array, and it extends the same way to higher-dimensional ndarrays. Remember to slice along that same axis, e.g. [:, :N] for axis=1.
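A small sketch putting the three variants side by side. The 10x3 random array is only a stand-in for a, and the seed is there just to make the run repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
a = rng.random((10, 3))          # hypothetical 10x3 input
N = 2

# N smallest values of each column: partition along axis 0, slice the rows.
smallest_per_col = np.partition(a, kth=N - 1, axis=0)[:N]

# N smallest values of each row: partition along axis 1, slice the columns.
smallest_per_row = np.partition(a, kth=N - 1, axis=1)[:, :N]

# N largest values of each column: negative kth, slice from the end.
largest_per_col = np.partition(a, kth=-N, axis=0)[-N:]

print(smallest_per_col.shape, smallest_per_row.shape, largest_per_col.shape)
```

If the N values are needed in sorted order, sorting just the small result (np.sort on the slice) is still cheaper than sorting the full array.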
Use the min() method and specify the axis you want to take the minimum over:
a = np.random.rand(10,3)
a.min(axis=0)
gives:
array([0.04435587, 0.00163139, 0.06327353])
a.min(axis=1)
gives
array([0.01354386, 0.08996586, 0.19332211, 0.00163139, 0.55650945,
0.08409907, 0.23015718, 0.31463493, 0.49117553, 0.53646868])
I currently have a 3 dimensional array full of different values. I would like to find the indices corresponding to the "nth" smallest values in the array. For example... If the 3 smallest values were 0.1, 0.2 and 0.3 I would like to see, in order, the indices for these values. Any help would be greatly appreciated.
A possible way to approach this would be to store each element's original index alongside its value, then sort by value; the original indices of the first few entries are the ones you want. Take a look at this: VBA array sort function?
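The question does not name a language, so the following is only a sketch under the assumption that the data is a NumPy array. It follows the same idea as above (sort by value, then recover the original indices); the array shape and n are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 5, 6))   # hypothetical 3D array
n = 3

# Flat positions of the n smallest values, in ascending order of value.
flat_order = np.argsort(a, axis=None)[:n]

# Convert flat positions back into (i, j, k) index triples.
idx = np.unravel_index(flat_order, a.shape)
indices = list(zip(*(axis_idx.tolist() for axis_idx in idx)))

for triple in indices:
    print(triple, a[triple])
```

For very large arrays, np.argpartition could replace the full argsort, at the cost of sorting the n selected entries afterwards.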
I have the following dataframe:
df = pd.DataFrame(np.random.randn(4, 1), index=['mark13', 'luisgimenez', 'miguel72', 'luis34'],columns=['probability'])
probability
mark13 -1.054687
luisgimenez 0.081224
miguel72 -0.893619
luis34 -1.576941
I would like to remove the rows where the last character of the index string is not a digit.
The desired output would look something like this
(dropping the row whose index does not end with a number):
probability
mark13 -1.054687
miguel72 -0.893619
luis34 -1.576941
I am sure boolean indexing is the direction I need to go, but I do not know how to reference the last character of the index name.
# Use isdigit to check the last char of each index label; the result serves as a mask array to filter rows.
df[[e[-1].isdigit() for e in df.index]]
Out[496]:
probability
mark13 -0.111338
miguel72 0.548725
luis34 0.682949
You can use the str accessor to check if the last character is a number:
df[df.index.str[-1].str.isdigit()]
Out:
probability
mark13 -0.350466
miguel72 1.220434
luis34 -0.962123
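A self-contained sketch showing that both answers build the same boolean mask. The probability values are fixed stand-ins for the random numbers in the question:

```python
import pandas as pd

# Fixed values instead of np.random.randn so the run is repeatable.
df = pd.DataFrame(
    {"probability": [-1.05, 0.08, -0.89, -1.58]},
    index=["mark13", "luisgimenez", "miguel72", "luis34"],
)

# Variant 1: plain Python list comprehension over the index labels.
mask_list = [label[-1].isdigit() for label in df.index]

# Variant 2: vectorised string accessor on the index.
mask_str = df.index.str[-1].str.isdigit()

filtered = df[mask_str]
print(filtered)
```

Both variants assume every index label is a non-empty string.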
What is the fastest way to look up the index of a value in a sorted vector in MATLAB?
That is, is there a fast find(vector == myNumber, 1, 'first') for when vector is sorted?
I have a large matrix (200,000 x 4) of locations, each with a unique integer ID recorded in the first column. I want to find the location for a known ID, but running thousands of searches takes a while.
If you use ismembc2, the loc output should give you what you need. See this for more details:
http://www.mathworks.com/support/solutions/en/data/1-9NIE1N/index.html?product=ML&solution=1-9NIE1N
There are a number of submissions for this on FEX: http://www.mathworks.com/matlabcentral/fileexchange/?term=binary+search+vector
I do not know if it is faster, but you may want to try
result=vector(vector(:,1)==myNumber,:)
result will contain the 4-element row for which the first column of vector equals myNumber.
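To illustrate the binary-search idea behind the File Exchange submissions, here is a sketch in Python/NumPy rather than MATLAB. The table contents and the helper name find_row are hypothetical, and it assumes the IDs in the first column are sorted and unique:

```python
import numpy as np

# Hypothetical stand-in for the 200,000 x 4 matrix: sorted unique IDs in column 0.
ids = np.arange(10, 200_010, dtype=np.int64)
locations = np.column_stack([ids, ids * 0.5, ids * 2.0, ids + 1.0])

def find_row(table, wanted_id):
    """Return the row index whose first column equals wanted_id, or -1 if absent."""
    col = table[:, 0]
    pos = np.searchsorted(col, wanted_id)   # binary search, O(log n) per lookup
    if pos < len(col) and col[pos] == wanted_id:
        return int(pos)
    return -1

row = find_row(locations, 12345)
print(row, locations[row])
```

Each lookup inspects only about log2(200,000), roughly 18, elements, whereas an equality scan over the whole column touches all of them.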