pandas series multi indexing error - pandas

I'm trying to slice into a multi-indexed data frame. I'm confused about conditions that generate IndexingError: Too many indexers. I'm also skeptical because I've found some bug reports about what may be this issue.
Specifically, this generates the error:
idx1 = [str(elem) for elem in [5, 6, 7, 8]]
idx2 = [str(elem) for elem in [10, 20, 30]]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
columns = ['m1', 'm2', 'm3']
df = pd.DataFrame(index=index, columns= columns)
df['m1'].loc[:,10]
That code above is trying to index into an index of dtypes of str, with an int, it seems to me. The error threw me off, as I don't understand why it says Too many indexers.
The below code works:
idx1 = [5, 6, 7, 8]
idx2 = [10, 20, 30]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))
columns = ['m1', 'm2', 'm3']
df = pd.DataFrame(index=index, columns= columns)
df.loc[5,10] = [1,2,3]
df.loc[6,10] = [4,5,6]
df.loc[7,10] = [7,8,9]
type(df['m1'])
df['m1'].loc[:,10]
There are some references to the same error: https://github.com/pandas-dev/pandas/issues/13597 which is marked closed and https://github.com/pandas-dev/pandas/issues/14885 which is open.
Is it OK to slice a multi-indexed Series as in the lines above, assuming I get the dtype right? See also: "Too many indexers" with DataFrame.loc
My pandas version is 0.20.3.
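A minimal sketch of the dtype mismatch described above, assuming a reasonably recent pandas. The Series name `s` and the integer data are made up for illustration. It only shows that the integer 10 is not a label in a level built from strings, and that selecting with the string '10' works. It does not explain why the original call reports "Too many indexers" rather than a KeyError.

```python
import pandas as pd

idx1 = [str(elem) for elem in [5, 6, 7, 8]]
idx2 = [str(elem) for elem in [10, 20, 30]]
index = pd.MultiIndex.from_product([idx1, idx2], names=('idx1', 'idx2'))

# Hypothetical data so the selection returns something visible.
s = pd.Series(range(12), index=index, name='m1')

# The second level holds strings, so the integer 10 is not a label in it.
level_labels = s.index.get_level_values('idx2')
has_int_label = 10 in level_labels
has_str_label = '10' in level_labels

# Selecting on the second level with a key of the matching type.
picked = s.xs('10', level='idx2')
print(picked)
```

Here `xs` is used instead of `.loc[:, '10']` only because its behaviour on a named level is unambiguous. The point is that the key must have the same type as the labels.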

Related

How to use the np.where function together with the index of each element of the array?

cashflow = [0] + [10] * 7
# [0, 10, 10, 10, 10, 10, 10, 10]
for index in range(len(cashflow)):
    growth_cashflow = 1.05**index * cashflow[index]
or
growth_cashflow = [1.05**index*pmt[index] for index in range(len(pmt))]
the result is:
[10.0, 10.5, 11.025, 11.576250000000002, 12.155062500000003, 12.762815625000004, 13.400956406250003]
But is it possible to get the same result with np.where?
cf = np.array(cashflow)
s = np.where(cf >= 0, 1.05**cf.index*cf, cf)
ERROR => AttributeError: 'numpy.ndarray' object has no attribute 'index'
Is it possible to get the index of each item and use it in the above multiplication?
If not, is there another way to do this with numpy without using a for loop?
import numpy as np
cf=np.array([10, 10, 10, 10, 10, 10, 10])
s = cf*1.05**np.arange(len(cf))
print(s)
This should give you the output you are looking for. If you really want to get specific indices, you may want to use np.nonzero or np.argwhere.
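If np.where is still wanted, one possible sketch is to build the positions explicitly with np.arange and pass them in, since an ndarray has no `.index` attribute. The names `idx`, `grown` and `expected` are illustrative, and the cash flows are the seven payments of 10 from the answer above.

```python
import numpy as np

cashflow = [10] * 7
cf = np.array(cashflow, dtype=float)

# Positions 0..n-1 stand in for the "index" of each element.
idx = np.arange(cf.size)

# Grow non-negative cash flows by 5% per period; leave negative ones unchanged.
grown = np.where(cf >= 0, 1.05 ** idx * cf, cf)

# Same computation with a plain list comprehension, for comparison.
expected = [1.05 ** n * cashflow[n] for n in range(len(cashflow))]
print(grown)
```

Note that np.where evaluates both branches for every element and then selects between them, so it adds nothing over the plain product when every value satisfies the condition.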

Numpy fancy indexing with 2D array - explanation

I am (re)building up my knowledge of numpy, having used it a little while ago.
I have a question about fancy indexing with multidimensional (in this case 2D) arrays.
Given the following snippet:
>>> a = np.arange(12).reshape(3,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> i = np.array([[0, 1],   # indices for the first dim of a
...               [1, 2]])
>>> j = np.array([[2, 1],   # indices for the second dim
...               [3, 3]])
>>>
>>> a[i,j] # i and j must have equal shape
array([[ 2,  5],
       [ 7, 11]])
Could someone explain in simple English, the logic being applied to give the results produced. Ideally, the explanation would be applicable for 3D and higher rank arrays being used to index an array.
Conceptually (in terms of restrictions placed on "rows" and "columns"), what does it mean to index using a 2D array?
It means you are constructing a 2D array R such that R = A[B, C]. This means that each element satisfies $r_{ij} = a_{b_{ij},\,c_{ij}}$, i.e. R[i, j] = A[B[i, j], C[i, j]].
So it means that the item located at R[0,0] is the item in A with as row index B[0,0] and as column index C[0,0]. The item R[0,1] is the item in A with row index B[0,1] and as column index C[0,1], etc.
So in this specific case:
>>> b = a[i,j]
>>> b
array([[ 2,  5],
       [ 7, 11]])
b[0,0] = 2 since i[0,0] = 0, and j[0,0] = 2, and thus a[0,2] = 2. b[0,1] = 5 since i[0,1] = 1, and j[0,1] = 1, and thus a[1,1] = 5. b[1,0] = 7 since i[1,0] = 1, and j[1,0] = 3, and thus a[1,3] = 7. b[1,1] = 11 since i[1,1] = 2, and j[1,1] = 3, and thus a[2,3] = 11.
So you can say that i determines the "row indices" and j determines the "column indices". This concept holds in more dimensions as well: the first "indexer" determines the indices along the first axis, the second "indexer" the indices along the second axis, and so on. The result always has the (broadcast) shape of the index arrays, not the shape of a.
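The rule can be checked with a short sketch that rebuilds the result using explicit loops and compares it with a[i, j]. The 3D array `a3` and the extra index array `k` are made up here to illustrate the higher-dimensional case and are not part of the question.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
i = np.array([[0, 1], [1, 2]])   # row indices
j = np.array([[2, 1], [3, 3]])   # column indices

# Spell out the rule: result[r, c] = a[i[r, c], j[r, c]].
manual = np.empty(i.shape, dtype=a.dtype)
for r in range(i.shape[0]):
    for c in range(i.shape[1]):
        manual[r, c] = a[i[r, c], j[r, c]]

fancy = a[i, j]

# Same idea with three index arrays on a 3D array (hypothetical example):
# result3[r, c] = a3[k[r, c], i[r, c], j[r, c]].
a3 = np.arange(24).reshape(2, 3, 4)
k = np.array([[0, 1], [1, 0]])   # indices along the first axis of a3
fancy3 = a3[k, i, j]
print(fancy)
print(fancy3)
```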

Pandas - Row mask and 2d ndarray assignment

I'm having some problems with pandas. I think I'm not using it properly, and I need some help to do it right.
I have a mask for the rows of a dataframe; the mask is a simple list of Boolean values.
I would like to assign a 2D array, one row per selected dataframe row, to a new or existing column.
mask = some_row_mask()
my2darray = some_operation(dataframe.loc[mask, column])
dataframe.loc[mask, new_or_exist_column] = my2darray
# Also tried this
dataframe.loc[mask, new_or_exist_column] = [f for f in my2darray]
Example data:
dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun']=='a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
column = 'Data'
new_or_exist_column = 'NewData'
Expected output
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
dataframe[mask] and my2darray both have exactly the same number of rows, but it always ends with:
ValueError: Must have equal len keys and value when setting with ndarray.
Thanks for your help!
EDIT - In context:
To add some detail: this is used to fill the column fold by fold, step by step. At each step I compute and set some values from a sub-part of the dataframe.
Instead of this, according to Parth:
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
I changed to this:
dataframe.loc[mask, out] = pd.Series([f for f in features], index=mask[mask==True].index)
Otherwise, all values that were already set are overwritten by NaN.
I had left this detail out of the original question.
Thanks!
Try this:
dataframe[new_or_exist_column]=np.nan
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
It will give desired output:
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
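For reference, here is a self-contained sketch of the same idea using the example data from the question. It relies on index alignment: the Series carries the labels of the masked rows, so the rows not covered by the mask end up as NaN. The variable `selected_labels` is introduced for illustration and should be equivalent to mask[mask==True].index.

```python
import pandas as pd

dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun'] == 'a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]

# Labels of the rows selected by the mask (here: 0 and 2).
selected_labels = dataframe.index[mask]

# One list per selected row; assignment aligns on the index labels.
dataframe['NewData'] = pd.Series(my2darray, index=selected_labels)
print(dataframe)
```

As the edit in the question notes, assigning the whole column this way replaces any values set in an earlier step, so it only suits the first fill.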

Slicing of a Pandas Series when index elements are not default (doesn't start with 0)

I created a Pandas Series in Python 3.7, providing both 'data' and 'index'. The data is a list of 6 lists (and tuples), and the index list starts from 3 rather than 0.
I want to slice the series.
import pandas as pd
li_a = [[1,2],[3,4],[5,6],[7,8],(9,10),(11,12)]
li_c = [3,4,5,6,7,8]
ser1 = pd.Series(data=li_a,index=li_c)
So ser1[3] outputs [1,2], i.e. the first element of the Series.
I expected ser1[3:] to output the entire Series, but the output was:
6 [7, 8]
7 (9, 10)
8 (11, 12)
dtype: object
It is working that way because you are printing by row position, not using index:
print(ser1[3:])
output:
6 [7, 8]
7 (9, 10)
8 (11, 12)
If you want to print rows from specific index number you need to use loc
print(ser1.loc[3:])
output:
3 [1, 2]
4 [3, 4]
5 [5, 6]
6 [7, 8]
7 (9, 10)
8 (11, 12)
Edited: changed from iloc to loc.
loc gets rows (or columns) with particular labels from the index.
Your full code (I have also changed your if __name__ line):
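import numpy as np
import pandas as pd

def main():
    arr = np.arange(10, 16)
    index1 = np.arange(3, 9)
    ser1 = pd.Series(data=arr, index=index1)
    print(ser1)
    # Label-based slice: starts at the row labelled 3, not at position 3.
    print(ser1.loc[3:])

if __name__ == "__main__":
    main()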
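The difference between the two accessors can be seen side by side in the following sketch, which uses the same integer data as the code above. The variable names are illustrative. One detail that is easy to miss is that a label slice with loc includes its end label, while a positional slice with iloc excludes its end position.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(10, 16), index=np.arange(3, 9))

# Label-based: everything from the label 3 onwards (the whole Series here).
by_label = ser.loc[3:]

# Position-based: skips the first three rows, so it starts at label 6.
by_position = ser.iloc[3:]

# loc includes the end label 5; iloc excludes the end position 5.
label_window = ser.loc[3:5]
position_window = ser.iloc[3:5]
print(label_window)
print(position_window)
```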

Python pandas json 2D array

I'm relatively new to pandas. I have a JSON file and a Python file:
{"dataset":{
"id": 123,
"data": [["2015-10-16",1,2,3,4,5,6],
["2015-10-15",7,8,9,10,11,12],
["2015-10-14",13,14,15,16,17]]
}}
and
import pandas
x = pandas.read_json('sample.json')
y = x.dataset.data
print x.dataset
Printing x.dataset and y works fine, but when I try to access a sub-element of y, it returns a 'buffer' type. What's going on? How can I access the data inside the array? Attempting y[0][1] returns an out of bounds error, and iterating through it returns a strange series of 'nul' characters, and yet it appears to be able to return the first portion of the data when printing x.dataset...
The data attribute of a pandas Series points to the memory buffer of all the data contained in that series:
>>> df = pandas.read_json('sample.json')
>>> type(df.dataset)
pandas.core.series.Series
>>> type(df.dataset.data)
memoryview
If you have a column/row named "data", you have to access it by its string name, e.g.:
>>> type(df.dataset['data'])
list
Because of surprises like this, it's usually considered best practice to access columns through indexing rather than through attribute access. If you do this, you will get your desired result:
>>> df['dataset']['data']
[['2015-10-16', 1, 2, 3, 4, 5, 6],
['2015-10-15', 7, 8, 9, 10, 11, 12],
['2015-10-14', 13, 14, 15, 16, 17]]
>>> arr = df['dataset']['data']
>>> arr[0][0]
'2015-10-16'
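As an alternative sketch, the nested list can be pulled out with the standard json module and then handed to pandas, which avoids attribute access altogether. The JSON is inlined as a string here so the example is self-contained, and the column name "date" is an assumption about what the first field means.

```python
import json
import pandas as pd

# Same content as sample.json, inlined for the example.
raw = '''{"dataset": {
  "id": 123,
  "data": [["2015-10-16",1,2,3,4,5,6],
           ["2015-10-15",7,8,9,10,11,12],
           ["2015-10-14",13,14,15,16,17]]
}}'''

payload = json.loads(raw)
rows = payload["dataset"]["data"]   # plain list of lists

# Ordinary list indexing works on the parsed data.
first_value = rows[0][1]

# One row per inner list; the shorter third row is padded with a missing value.
frame = pd.DataFrame(rows).rename(columns={0: "date"})
print(frame)
```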