Insert multiple columns in a pandas dataframe using insert method - pandas

I want to insert multiple columns in selected positions in a pandas dataframe
import pandas as pd
df = pd.DataFrame({'product name': ['laptop', 'printer', 'printer',], 'price': [1200, 150, 1200], 'price1': [1200, 150, 1200]})
df.insert(0, 'AAA', -1)
df.insert(1, 'BBB', -2)
df
However I am wondering if I can insert multiple columns at once. I tried below,
df.insert([0, 1], ['AAA', 'BBB'], [-1, -2])
This generates error as,
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 3762, in insert
value = self._sanitize_column(column, value, broadcast=False)
File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 3899, in _sanitize_column
value = sanitize_index(value, self.index)
File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 751, in sanitize_index
raise ValueError(
ValueError: Length of values (2) does not match length of index (3)
Is there any way to insert multiple columns at once using insert method?

As already mentioned in the comments, insert is not possible for this.
If you have those two columns as a pd.DataFrame or pd.Series you can use pd.concat like this:
pd.concat([s1, s2, df], axis=1)
where s1 is 'AAA' and s2 is 'BBB' with their values.
Be aware that the indexes of s1, s2 and df have to be the same here.
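For completeness, a minimal sketch of that approach using the question's data (the names s1, s2 and new_df are just illustrative):
import pandas as pd
df = pd.DataFrame({'product name': ['laptop', 'printer', 'printer'],
                   'price': [1200, 150, 1200],
                   'price1': [1200, 150, 1200]})
# Build the new columns as Series that share df's index, then put them in front.
s1 = pd.Series(-1, index=df.index, name='AAA')
s2 = pd.Series(-2, index=df.index, name='BBB')
new_df = pd.concat([s1, s2, df], axis=1)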

Related

Renaming Pandas Column Names to String Datatype

I have a df with column names that do not appear to be of the typical string datatype. I am trying to rename these column names to give them the same name. I have tried this and I end up with an error. Here is my df with column names:
dfap.columns
Out[169]: Index(['month', 'plant_name', 0, 1, 2, 3, 4], dtype='object')
Here is my attempt at renaming columns 2, 3, 4, 5, 6 (positions 2:7):
dfap.columns[2:7] = [ 'Adj_Prod']
Traceback (most recent call last):
File "<ipython-input-175-ebec554a2fd1>", line 1, in <module>
dfap.columns[2:7] = [ 'Adj_Prod']
File "C:\Users\U321103\Anaconda3\envs\Maps2\lib\site-packages\pandas\core\indexes\base.py", line 4585, in __setitem__
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
Thank you,
You can't rename only some columns using that method, because a pandas Index doesn't support item assignment.
You can do
tempcols = list(dfap.columns)
tempcols[2:7] = newcols
dfap.columns = tempcols
Of course you'll want newcols to be the same length as the slice you're replacing; in your example you're only assigning a list of length 1.
You could also do
dfap.rename(columns=dict_of_name_changes, inplace=True)
The dict's keys are the existing names and its values are the new names. With this method you can rename as few columns as you want.
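For example, a minimal sketch using the integer-named columns shown in the question (the mapping itself is just illustrative; adjust the target names as needed):
# Map each integer-named column to the desired name. Giving several columns the
# same name is allowed, but duplicate labels can be awkward to work with later.
dict_of_name_changes = {0: 'Adj_Prod', 1: 'Adj_Prod', 2: 'Adj_Prod', 3: 'Adj_Prod', 4: 'Adj_Prod'}
dfap.rename(columns=dict_of_name_changes, inplace=True)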
You could use rename(columns=...) with a lambda function to handle the renaming logic:
df.rename(columns=lambda x: x if type(x)!=int else 'Adj_Prod')
Result
Columns: [month, plant_name, Adj_Prod, Adj_Prod, Adj_Prod, Adj_Prod, Adj_Prod]

Trying to sum columns in Pandas dataframe, issue with index it seems

I'm trying to sum columns in a Pandas dataframe, but it seems there's an issue with the index.
Part of the dataset looks like this, for multiple years:
[snapshot of dataset omitted]
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',columns='UNITSSTR', aggfunc='size')
The dataframe now looks like this (screenshot omitted), and these are its properties (screenshot omitted).
I am trying to sum the multi-family units, so I am specifying the columns to sum:
cols = ['05', '06']
CA_HousingTrend['sum_stats'] = CA_HousingTrend[cols].sum(axis=1)
This is the error I get:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/code.py", line 90, in runcode
exec(code, self.locals)
File "", line 5, in
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in getitem
indexer = self.columns._get_indexer_strict(key, "columns")
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['05', '06'], dtype='object', name='UNITSSTR')] are in the [columns]"
Not sure if you need the index, but the pivot probably created a multi-index. Try this:
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR', columns='UNITSSTR', aggfunc='size')
# A new dataframe just so you have something new to play with.
new_df = CA_HousingTrend.reset_index()
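As a follow-up check (a sketch reusing the question's variable names): the KeyError says '05' and '06' are not among the columns, so it is worth looking at the labels the pivot actually produced before selecting them.
# Inspect the labels produced by the pivot; they may be numeric (5, 6) rather than
# the strings '05' and '06' used in the question.
print(CA_HousingTrend.columns.tolist())
# Select the multi-family columns by whatever labels are really there, then sum row-wise.
cols = [c for c in CA_HousingTrend.columns if c in ('05', '06', 5, 6)]
CA_HousingTrend['sum_stats'] = CA_HousingTrend[cols].sum(axis=1)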

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because there are two moments with this particular wall-clock time: one with DST and one without DST:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted Corralien's answer, and I also wanted to show the workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.
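For reference, those two helpers could look roughly like this (a sketch of what the description above implies, not the exact code used):
import numpy as np
import pandas as pd

ARRAY_TYPES = (set, list, tuple, np.ndarray, pd.Series, pd.Index)

def is_array(x):
    # True for any of the container types listed above.
    return isinstance(x, ARRAY_TYPES)

def is_scalar(x):
    # Anything that is not one of those containers is treated as a scalar.
    return not isinstance(x, ARRAY_TYPES)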

How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
['a', 2],
['a', 3],
['a', 4],
['b', 5],
['b', 6],
['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
P Q
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 b 7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
Q
P
a 10
b 18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
P Q
P
a 0.344457 0.344457
b 0.990507 0.990507
I have not been able to find anything in the documentation that would explain:
1. Why should gdf.aggregate(mysum) fail? (IOW, does this failure agree with documented behavior, or is it a bug in pandas?)
2. Why should gdf.aggregate(lambda *a, **k: rn.random()) produce a two-column output while gdf.aggregate(sum) produces a one-column output?
3. What signatures (input and output) should an aggregation function foo have so that gdf.aggregate(foo) returns a table having only column Q (like the result of gdf.aggregate(sum))?
Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and compute statistics on Q. To do that, use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's sum. Groupby.sum() is written in Cython and only acts on numeric dtypes. That's why you get the error about adding ints to strs.
Your other two questions are related to that. You're passing two columns, P and Q, into gdf.agg, so you get two columns out for gdf.aggregate(lambda *a, **k: rn.random()); gdf.sum() ignores the string column.
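To make that concrete, a short sketch reusing the question's df, mysum and rn (outputs omitted; the point is only which columns take part in the aggregation):
gdf = df.groupby('P')                          # group by the P column itself
gdf.aggregate(mysum)                           # only Q is aggregated, so the wrapper works
gdf.aggregate(lambda *a, **k: rn.random())     # likewise yields a single Q column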

Pandas HDFStore duplicate items Error

Hello All and thanks in advance.
I'm trying to do a periodic storing of financial data to a database for later querying. I am using Pandas for almost all of the data coding. I want to append a dataframe I have created into an HDF database. I read the csv into a dataframe and index it by timestamp, and the DataFrame looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 900 entries, 1378400701110 to 1378410270251
Data columns (total 23 columns):
....
...Columns with numbers of non-null values....
.....
dtypes: float64(19), int64(4)
store = pd.HDFStore('store1.h5')
store.append('df', df)
print store
<class 'pandas.io.pytables.HDFStore'>
File path: store1.h5
/df frame_table (typ->appendable,nrows->900,ncols->23,indexers->[index])
But when I then try to do anything with store,
print store['df']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 289, in __getitem__
return self.get(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 422, in get
return self._read_group(group)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 930, in _read_group
return s.read(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 3175, in read
mgr = BlockManager([block], [cols_, index_])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1007, in __init__
self._set_ref_locs(do_refs=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1117, in _set_ref_locs
"does not have _ref_locs set" % (block,labels))
AssertionError: cannot create BlockManager._ref_locs because block
[FloatBlock: [LastTrade, Bid1, Bid1Volume,....., Ask5Volume], 19 x 900, dtype float64]
with duplicate items
[Index([u'LastTrade', u'Bid1', u'Bid1Volume',..., u'Ask5Volume'], dtype=object)]
does not have _ref_locs set
I guess I am doing something wrong with the index, I'm quite new at this and have little knowhow.
EDIT:
The data frame construction looks like:
columns = ['TimeStamp', 'LastTrade', 'Bid1', 'Bid1Volume', 'Bid1', 'Bid1Volume', 'Bid2', 'Bid2Volume', 'Bid3', 'Bid3Volume', 'Bid4', 'Bid4Volume',
'Bid5', 'Bid5Volume', 'Ask1', 'Ask1Volume', 'Ask2', 'Ask2Volume', 'Ask3', 'Ask3Volume', 'Ask4', 'Ask4Volume', 'Ask5', 'Ask5Volume']
df = pd.read_csv('/20130905.csv', names=columns, index_col=[0])
df.head() looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1378400701110 to 1378400703105
Data columns (total 21 columns):
LastTrade 5 non-null values
Bid1 5 non-null values
Bid1Volume 5 non-null values
Bid1 5 non-null values
.................values
Ask4 5 non-null values
Ask4Volume 5 non-null values
dtypes: float64(17), int64(4)
There's too many columns for it to print out the contents. But for example:
print df['LastTrade'].iloc[10]
LastTrade 1.31202
Name: 1378400706093, dtype: float64
and Pandas version:
>>> pd.__version__
'0.12.0'
Any ideas would be thoroughly appreciated, thank you again.
Do you really have duplicate 'Bid1' and 'Bid1Volume' columns?
Unrelated, but you should also set the index to a datetime index:
import pandas as pd
df.index = pd.to_datetime(df.index,unit='ms')
This is a bug, because the duplicate columns cross dtypes (not a big deal, but undetected).
Easiest just to not have duplicate columns.
Will be fixed in 0.13, see here: https://github.com/pydata/pandas/pull/4768
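If you want to confirm (or temporarily work around) the duplicates before appending to the store, a minimal sketch with a reasonably recent pandas (the column handling here is illustrative; fixing the columns list passed to read_csv is the cleaner fix):
# Show which column labels are duplicated.
print(df.columns[df.columns.duplicated()])
# Keep only the first occurrence of each duplicated label before writing to the HDFStore.
df = df.loc[:, ~df.columns.duplicated()]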