How can I output the "Name" value in pandas?

After I do an iloc (df.iloc[3]), I get an output with all the column names and their values for a given row.
What should the code be if I only want to output the "Name" value for the same row?
E.g.:
Column 1    Value 1
Column 2    Value 2
Name: Row 1, dtype: object
So, in this case, "Row 1".

>>> df = pd.DataFrame({'Name': ['Uncle', 'Sam', 'Martin', 'Jacob'], 'Salary': [1000, 2000, 3000, 1500]})
>>> df
     Name  Salary
0   Uncle    1000
1     Sam    2000
2  Martin    3000
3   Jacob    1500
df.iloc[3] gives the following:
>>> df.iloc[3]
Name      Jacob
Salary     1500
Name: 3, dtype: object
However, df.iloc[3, 'Name'] throws the following exception:
>>> df.iloc[3, 'Name']
Traceback (most recent call last):
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 235, in _has_valid_tuple
self._validate_key(k, i)
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2035, in _validate_key
"a [{types}]".format(types=self._valid_types)
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1418, in __getitem__
return self._getitem_tuple(key)
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2092, in _getitem_tuple
self._has_valid_tuple(tup)
File "/home/nikhil/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 239, in _has_valid_tuple
"[{types}] types".format(types=self._valid_types)
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Use df.loc[3, 'Name'] instead:
>>> df.loc[3, 'Name']
'Jacob'
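If you want to stay purely positional with iloc, a couple of equivalent options (a quick sketch against the same frame):
>>> df.iloc[3]['Name']
'Jacob'
>>> df.iloc[3, df.columns.get_loc('Name')]
'Jacob'
The first takes the row Series and then indexes it by label; the second translates the column label to its position so iloc accepts it.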

df.iloc[3] returns a Series.
df.iloc[3].name will return the name (the row's index label).
Example:
>>> df = pd.DataFrame({'data': [100, 200]})
>>> df = df.set_index(pd.Index(['A', 'B']))
>>> df.iloc[1]
data    200
Name: B, dtype: int64
>>> df.iloc[1].name
'B'
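Applied to the salary frame from the question, where the index is the default RangeIndex, the same attribute gives back the integer label, i.e. the 3 shown in its Name: 3 line:
>>> df.iloc[3].name
3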

Related

Exporting pandas df with column of tuples to BQ throws pyarrow error

I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({"id": [1,2,3], "items": [('a', 'b'), ('a', 'b', 'c'), tuple('d')]}
>print(df)
id items
0 1 (a, b)
1 2 (a, b, c)
2 3 (d,)
After registering my GCP/BQ credentials in the normal way...
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_my_creds.json"
... I try to export it to a BQ table:
import pandas_gbq
pandas_gbq.to_gbq(df, "my_table_name", if_exists="replace")
but I keep getting the following error:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/Users/max.epstein/opt/anaconda3/envs/rec2env/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 1205, in to_gbq
...
File "/Users/max.epstein/opt/anaconda3/envs/rec2env/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 342, in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object
I have tried converting the tuple column to string with df = df.astype({"items":str}) and adding a table_schema param to the pandas_gbq.to_gbq... line but I keep getting this same error.
I have also tried replacing the pandas_gbq.to_gbq... line with the bq_client.load_table_from_dataframe method described here but still get the same pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object error...
So I think this is a weird issue with pandas dtypes being separate from Python types, with astype converting only the Python type and not the pandas dtype. Try also converting the dtype to match the type after the astype statement, so that
df = df.astype({"items": str})
is replaced with:
df = df.astype({"items": str})
df = df.convert_dtypes()
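A quick way to check the effect locally (a sketch built on the example frame from the question; the comments show what current pandas prints):
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "items": [('a', 'b'), ('a', 'b', 'c'), tuple('d')]})

df = df.astype({"items": str})   # tuples become their str() representation
df = df.convert_dtypes()         # object columns are upgraded to the 'string' dtype

print(df.dtypes["items"])        # string
print(type(df.loc[0, "items"]))  # <class 'str'>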
Let me know if this works.

numpy - why does np.multiply.reduce([], axis=None) result in 1?

Why does multiply-reducing the non-existent elements of an empty array-like result in 1?
>>> np.multiply.reduce([], axis=None)
1.0
A ufunc may have an identity attribute:
In [200]: np.multiply.identity
Out[200]: 1
In [201]: np.multiply.reduce([])
Out[201]: 1.0
which can be overridden with the initial keyword of reduce:
In [202]: np.multiply.reduce([], initial=10)
Out[202]: 10.0
In [203]: np.multiply.reduce([1,2,3], initial=10)
Out[203]: 60
In [204]: np.multiply.reduce([1,2,3], initial=None)
Out[204]: 6
and if initial=None is passed explicitly, an empty reduce raises an error:
In [205]: np.multiply.reduce([], initial=None)
Traceback (most recent call last):
File "<ipython-input-205-1c3b1c890fd6>", line 1, in <module>
np.multiply.reduce([], initial=None)
ValueError: zero-size array to reduction operation multiply which has no identity
np.max builds on np.maximum, a ufunc without an identity, so an empty reduction needs an explicit initial:
In [211]: np.max([])
Traceback (most recent call last):
File "<ipython-input-211-93f3814168a1>", line 1, in <module>
np.max([])
File "<__array_function__ internals>", line 5, in amax
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 2733, in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
In [212]: np.max([], initial=-1)
Out[212]: -1.0
Python's own reduce behaves the same way:
In [222]: from functools import reduce
In [223]: reduce?
Docstring:
reduce(function, sequence[, initial]) -> value
Apply a function of two arguments cumulatively to the items of a sequence,
from left to right, so as to reduce the sequence to a single value.
For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5). If initial is present, it is placed before the items
of the sequence in the calculation, and serves as a default when the
sequence is empty.
Type: builtin_function_or_method
So with a multiply lambda:
In [224]: reduce(lambda x,y: x*y,[1,2,3])
Out[224]: 6
For an empty list, an error is the default behavior:
In [225]: reduce(lambda x,y: x*y,[])
Traceback (most recent call last):
File "<ipython-input-225-780706778563>", line 1, in <module>
reduce(lambda x,y: x*y,[])
TypeError: reduce() of empty sequence with no initial value
But with a supplied initial value:
In [227]: reduce(lambda x,y: x*y,[],1)
Out[227]: 1
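Each ufunc advertises its identity element, and reduce of an empty array returns that identity when one exists; a quick check (the comments show the values numpy reports):
import numpy as np

for uf in (np.multiply, np.add, np.maximum):
    print(uf.__name__, uf.identity)
# multiply 1
# add 0
# maximum None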

Querying a column with lists in it

I have a dataframe with columns with lists in them. How can I query these?
>>> df1.shape
(1812871, 7)
>>> df1.dtypes
CHROM     object
POS        int32
ID        object
REF       object
ALT       object
QUAL        int8
FILTER    object
dtype: object
>>> df1.head()
  CHROM    POS           ID REF   ALT QUAL  FILTER
0    20  60343  rs527639301   G   [A]  100  [PASS]
1    20  60419  rs538242240   A   [G]  100  [PASS]
2    20  60479  rs149529999   C   [T]  100  [PASS]
3    20  60522  rs150241001   T  [TC]  100  [PASS]
4    20  60568  rs533509214   A   [C]  100  [PASS]
>>> df2 = df1.head(30)
>>> df3 = df1.head(3000)
I found a previous question, but the solutions do not quite work for me. The accepted solution does not work:
>>> df2[df2.ALT.apply(lambda x: x == ['TC'])]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
indexer = check = labels.get_indexer(objarr)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
indexer = self._engine.get_indexer(target._ndarray_values)
File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
The reason is that the booleans get nested:
>>> df2.ALT.apply(lambda x: x == ['TC']).head()
0    [False]
1    [False]
2    [False]
3     [True]
4    [False]
Name: ALT, dtype: object
So I tried the second answer, which seemed to work:
>>> c = np.empty(1, object)
>>> c[0] = ['TC']
>>> df2[df2.ALT.values == c]
  CHROM    POS           ID REF   ALT QUAL  FILTER
3    20  60522  rs150241001   T  [TC]  100  [PASS]
But strangely, it doesn't work when I try it on the larger dataframe:
>>> df3[df3.ALT.values == c]
Traceback (most recent call last):
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
This is probably because the result of the boolean comparison is different!
>>> df3.ALT.values == c
False
>>> df2.ALT.values == c
array([False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])
This is completely baffling to me.
I found that a hacky solution of casting the lists to tuples works for me:
df = pd.DataFrame({'CHROM': [20] * 5,
                   'POS': [60343, 60419, 60479, 60522, 60568],
                   'ID': ['rs527639301', 'rs538242240', 'rs149529999', 'rs150241001', 'rs533509214'],
                   'REF': ['G', 'A', 'C', 'T', 'A'],
                   'ALT': [['A'], ['G'], ['T'], ['TC'], ['C']],
                   'QUAL': [100] * 5,
                   'FILTER': [['PASS']] * 5})
df['ALT'] = df['ALT'].apply(tuple)
df[df['ALT'] == ('C',)]
This method works because tuples are immutable and hashable, so pandas can compare each cell as a single object, instead of the elementwise, intra-list comparison that produced the nested Boolean Series above (lists are not hashable).
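If you would rather not convert the column permanently, the same idea works on a throwaway tuple view of the column (a sketch with a minimal ALT column):
import pandas as pd

df = pd.DataFrame({'ALT': [['A'], ['G'], ['T'], ['TC'], ['C']]})

# Build the mask on a temporary tuple view; the list column itself
# stays untouched.
mask = df['ALT'].map(tuple) == ('TC',)
print(df[mask])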

Pandas KeyError in simple calling of array elements

I am loading a .csv with pandas. It has columns for country, month, year, and date. Since I'm only interested in these, I overwrite the imported dataframe with a simpler version.
df = pd.read_csv('data.csv')
idx_USA = df['Country'] == 'United States'
df = df.loc[idx_USA, ['Year','Month','Date']]
print(df[:4])
This yields
   Year  Month  Date
1  2007      1     1
2  2004     10     2
4  1999     10    14
7  2000     10     5
Now, oddly, when I try to access the years in a loop, I get a KeyError! This is so simple; what is going on? Thanks.
for i in range(1, N):
    print "Yr = ", df['Year'][i]
Yr = 2007
Yr = 2004
Yr =
Traceback (most recent call last):
File "testimport.py", line 19, in <module>
print "Yr = ", df['Year'][i]
File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 491, in __getitem__
result = self.index.get_value(self, key)
File "/usr/lib/python2.7/dist-packages/pandas/core/index.py", line 1032, in get_value
return self._engine.get_value(s, k)
File "index.pyx", line 97, in pandas.index.IndexEngine.get_value (pandas/index.c:2957)
File "index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:2772)
File "index.pyx", line 149, in pandas.index.IndexEngine.get_loc (pandas/index.c:3498)
File "hashtable.pyx", line 382, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6930)
File "hashtable.pyx", line 388, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6871)
KeyError: 3
Instead of resetting the index, you can just iterate through the underlying array:
arr = df['Year'].values
for i in range(1, N):
    print arr[i]
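Two other ways around the label/position mismatch (a sketch; the filtering above left gaps such as 1, 2, 4, 7 in the index, so positional keys like 3 no longer exist as labels):
df2 = df.reset_index(drop=True)  # relabel rows 0..N-1 so df2['Year'][2] works again
print(df2['Year'][2])

print(df['Year'].iloc[2])        # or keep df as-is and look rows up by position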

Incompatible indexer with Series

Why do I get an error:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5] += 1
Output:
4    0
5    0
dtype: int64
Traceback (most recent call last):
File "temp1.py", line 9, in <module>
a.loc[4:5] += 1
File "lib\site-packages\pandas\core\indexing.py", line 88, in __setitem__
self._setitem_with_indexer(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 177, in _setitem_with_indexer
value = self._align_series(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 206, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
Pandas 0.12.
I think this is a bug; you can work around it by using a tuple index:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5,] += 1
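The trailing comma turns the key into a 1-tuple, (slice(4, 5),), which presumably takes a different path through .loc's setitem machinery and sidesteps the failing alignment check. On any modern pandas the workaround is no longer needed and the plain form works (a quick sketch in current syntax; the comments show the expected output):
import pandas as pd

a = pd.Series(index=[4, 5, 6], data=0)
a.loc[4:5] += 1   # label-based slice; the end label is included
print(a)
# 4    1
# 5    1
# 6    0
# dtype: int64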