Trying to sum columns in Pandas dataframe, issue with index it seems

Part of the dataset looks like this, for multiple years:
[snapshot of dataset]
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',columns='UNITSSTR', aggfunc='size')
The dataframe now looks like this:
[snapshots of the pivoted dataframe and its properties]
Trying to sum multi-family units so I am specifying the columns to sum
cols = ['05', '06']
CA_HousingTrend['sum_stats'] = CA_HousingTrend[cols].sum(axis=1)
This is the error I get:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/code.py", line 90, in runcode
exec(code, self.locals)
File "", line 5, in
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in getitem
indexer = self.columns._get_indexer_strict(key, "columns")1
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['05', '06'], dtype='object', name='UNITSSTR')] are in the [columns]"

Not sure if you need the index, but the pivot probably created a multi-index. Try this:
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR', columns='UNITSSTR', aggfunc='size')
# A new dataframe just so you have something new to play with.
new_df = CA_HousingTrend.reset_index()
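If resetting the index doesn't change anything, it's worth printing the column labels the pivot actually produced. A common cause of this exact KeyError (an assumption here, not something the traceback proves) is that UNITSSTR is stored as a number, so the pivoted columns are the integers 5 and 6 rather than the strings '05' and '06':
# Inspect the labels the pivot actually produced.
print(CA_HousingTrend.columns.tolist())
# If they turn out to be integers, sum using integer labels instead:
cols = [5, 6]
CA_HousingTrend['sum_stats'] = CA_HousingTrend[cols].sum(axis=1)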

Related

Renaming Pandas Column Names to String Datatype

I have a df with column names that do not appear to be of the typical string datatype. I am trying to rename these columns to give them all the same name. I have tried this and I end up with an error. Here is my df with column names:
dfap.columns
Out[169]: Index(['month', 'plant_name', 0, 1, 2, 3, 4], dtype='object')
Here is my attempt at renaming the columns at positions 2 through 6 (the slice 2:7):
dfap.columns[2:7] = [ 'Adj_Prod']
Traceback (most recent call last):
File "<ipython-input-175-ebec554a2fd1>", line 1, in <module>
dfap.columns[2:7] = [ 'Adj_Prod']
File "C:\Users\U321103\Anaconda3\envs\Maps2\lib\site-packages\pandas\core\indexes\base.py", line 4585, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
Thank you,
You can't rename only some columns using that method. You can do:
tempcols = dfap.columns.tolist()
tempcols[2:7] = newcols
dfap.columns = tempcols
The .tolist() matters: an Index is immutable (that's exactly the TypeError you hit), so copy it into a plain list before assigning to a slice. Of course you'll want newcols to be the same length as what you're replacing; in your example you're only assigning a list of length 1.
You could also do:
dfap.rename(columns=dict_of_name_changes, inplace=True)
The dict needs for the key to be the existing name and the value to be the new name. In this method you can rename as few columns as you want.
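For example (assuming, as in the question, that all five integer-named columns should become 'Adj_Prod'):
dfap.rename(columns={0: 'Adj_Prod', 1: 'Adj_Prod', 2: 'Adj_Prod', 3: 'Adj_Prod', 4: 'Adj_Prod'}, inplace=True)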
You could use rename(columns=...) with a lambda function to handle the renaming logic:
dfap.rename(columns=lambda x: x if not isinstance(x, int) else 'Adj_Prod')
Result
Columns: [month, plant_name, Adj_Prod, Adj_Prod, Adj_Prod, Adj_Prod, Adj_Prod]

Insert multiple columns in a pandas dataframe using insert method

I want to insert multiple columns at selected positions in a pandas dataframe:
import pandas as pd
df = pd.DataFrame({'product name': ['laptop', 'printer', 'printer',], 'price': [1200, 150, 1200], 'price1': [1200, 150, 1200]})
df.insert(0, 'AAA', -1)
df.insert(1, 'BBB', -2)
df
However, I am wondering if I can insert multiple columns at once. I tried
df.insert([0, 1], ['AAA', 'BBB'], [-1, -2])
and it generates this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 3762, in insert
value = self._sanitize_column(column, value, broadcast=False)
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 3899, in _sanitize_column
value = sanitize_index(value, self.index)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 751, in sanitize_index
raise ValueError(
ValueError: Length of values (2) does not match length of index (3)
Is there any way to insert multiple columns at once using insert method?
As already mentioned in the comments, this is not possible with insert.
If you have those two columns as a pd.DataFrame or pd.Series you can use pd.concat like this:
pd.concat([s1, s2, df], axis=1)
where s1 is 'AAA' and s2 is 'BBB' with its values.
Be aware that the indexes of s1, s2 and df have to be the same here.
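A minimal sketch of that approach, reusing the df from the question (the Series names and fill values mirror the insert calls above):
import pandas as pd
df = pd.DataFrame({'product name': ['laptop', 'printer', 'printer'],
                   'price': [1200, 150, 1200],
                   'price1': [1200, 150, 1200]})
# Build the new columns as Series that share df's index.
s1 = pd.Series(-1, index=df.index, name='AAA')
s2 = pd.Series(-2, index=df.index, name='BBB')
df = pd.concat([s1, s2, df], axis=1)  # AAA and BBB land in the first two positions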

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because there are two instants with this wall-clock time: one with DST and one without:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
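For what it's worth, the converted column should then carry the intended zone; the comment shows the output I'd expect rather than a captured session:
df['datetime'].iloc[0]
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')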
I accepted @Corralien's answer, and I also wanted to show the workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.

Pandas error with multidimensional key using .loc and a boolean

I've been running into this same error for two weeks, even though the code worked before. I'm not sure if I updated pandas as part of another library install, and maybe something changed there. Currently on version 0.23.4. The expected outcome is returning just the row with that identifier value.
In [42]: df.head()
Out[43]:
index Identifier ...
0 51384710 ...
1 74838J10 ...
2 80589M10 ...
3 67104410 ...
4 50241310 ...
[5 rows x 14 columns]
In [43]: df.loc[df.Identifier.isin(['51384710'])].head()
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-44-a3dbf43451ef>", line 1, in <module>
df.loc[df.Identifier.isin(['51384710'])].head()
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1899, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key
Fixed it. I'd done df.columns = [column_list] where column_list = [...], which caused df to be treated as if it had a MultiIndex, even though there was only one level. Removing the brackets from the df.columns assignment fixed it.
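A minimal sketch of what likely happened (the column names here are invented for illustration):
import pandas as pd
df = pd.DataFrame({'Identifier': ['51384710', '74838J10'], 'Value': [1, 2]})
cols = ['Identifier', 'Value']
df.columns = [cols]      # extra brackets: a list of lists becomes a MultiIndex
print(type(df.columns))  # <class 'pandas.core.indexes.multi.MultiIndex'>
df.columns = cols        # a plain list restores a regular Index
print(type(df.columns))  # <class 'pandas.core.indexes.base.Index'>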
Try changing
df.loc[df.Identifier.isin(['51384710'])].head()
to
df[df.Identifier.isin(['51384710'])].head()

df.Change[-1] producing errors.

I'm trying to slice the last value of the series Change from my dataframe df.
The dataframe looks something like this
Change
0 1.000000
1 0.917727
2 1.000000
3 0.914773
4 0.933182
5 0.936136
6 0.957500
...
14466949 1.998392
14466950 2.002413
14466951 1.998392
14466952 1.974266
14466953 1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I'm getting an output, but when I input
df.Change[-1]
I'm getting the following error
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use for indexing results in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issue with df.Change[100] because 100 is a label in its index; -1 is not. Your index just happens to coincide with ordinal positions. To index by ordinal position explicitly, use iloc:
df.Change.iloc[-1]
or
df.Change.values[-1]
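A quick illustration of the difference on a toy Series (values made up):
import pandas as pd
s = pd.Series([1.0, 0.92, 2.0], name='Change')  # default integer index: 0, 1, 2
print(s.iloc[-1])    # 2.0 -- positional, so negative indexing works
print(s.values[-1])  # 2.0 -- plain NumPy indexing on the underlying array
# s[-1]              # KeyError: -1 -- label lookup; -1 is not a label here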