DataFrame.ix in pandas - is there an option to catch situations when requested columns do not exist?

My code reads a CSV file into a pandas DataFrame and processes it.
The code relies on column names and uses df.ix[,] to get the columns.
Recently some column names in the CSV file were changed (without notice).
But the code did not complain and silently produced wrong results.
The ix[,] construct doesn't check whether a column exists; if it doesn't, it simply creates it and populates it with NaN.
Here is the main idea of what was going on.
df1 = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # columns 'a' & 'b'
df2 = df1.ix[:, ['a', 'c']]                        # trying to get 'a' & 'c'
print df2
   a   c
0  1 NaN
1  2 NaN
2  3 NaN
So it doesn't produce an error or a warning.
Is there an alternative way to select specific columns with an extra check that the columns exist?
My current workaround is to use my own small utility function, something like this:
import sys, inspect

def validate_cols_or_exit(df, cols):
    """
    Exits with an error message if the pandas DataFrame object df
    doesn't have all columns from the provided list of columns.
    Example of usage:
    validate_cols_or_exit(mydf, ['col1', 'col2'])
    """
    dfcols = list(df.columns)
    valid_flag = True
    for c in cols:
        if c not in dfcols:
            print "Error, non-existent DataFrame column found - ", c
            valid_flag = False
    if not valid_flag:
        print "Error, non-existent DataFrame column(s) found in function ", inspect.stack()[1][3]
        print "valid column names are:"
        print "\n".join(df.columns)
        sys.exit(1)
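Called right before the selection, the checker turns the silent failure into a hard stop; a minimal sketch using the df1 example above:

validate_cols_or_exit(df1, ['a', 'c'])   # exits here, since column 'c' does not exist
df2 = df1.ix[:, ['a', 'c']]              # only reached when all requested columns exist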

How about:
In [3]: df1[['a', 'c']]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-3-2349e89f1bb5> in <module>()
----> 1 df1[['a', 'c']]
/home/wesm/code/pandas/pandas/core/frame.py in __getitem__(self, key)
1582 if com._is_bool_indexer(key):
1583 key = np.asarray(key, dtype=bool)
-> 1584 return self._getitem_array(key)
1585 elif isinstance(self.columns, MultiIndex):
1586 return self._getitem_multilevel(key)
/home/wesm/code/pandas/pandas/core/frame.py in _getitem_array(self, key)
1609 mask = indexer == -1
1610 if mask.any():
-> 1611 raise KeyError("No column(s) named: %s" % str(key[mask]))
1612 result = self.reindex(columns=key)
1613 if result.columns.name is None:
KeyError: 'No column(s) named: [c]'
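(A hedged aside, since the exact behaviour depends on the pandas version: on modern releases, where .ix has been removed, selecting with a list of labels that includes a missing column also raises, so the plain-indexing approach above keeps failing loudly.)

df1[['a', 'c']]          # KeyError on current pandas as well
df1.loc[:, ['a', 'c']]   # .loc with a missing label in the list also raises a KeyError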

Not sure you can constrain a DataFrame, but your helper function could be a lot simpler, something like:
mismatch = set(cols).difference(set(dfcols))
if mismatch:
    raise SystemExit('Unknown column(s): {}'.format(','.join(mismatch)))
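Wrapped in a small guard function, that check reads as follows (a sketch; the name check_cols is just illustrative):

def check_cols(df, cols):
    mismatch = set(cols).difference(df.columns)
    if mismatch:
        raise SystemExit('Unknown column(s): {}'.format(','.join(sorted(mismatch))))

check_cols(df1, ['a', 'c'])   # SystemExit: Unknown column(s): c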

Related

Dropping the same rows in two pandas dataframes in Python

I want the uncommon rows of two pandas dataframes. The two dataframes are df1 and wildone_df. When I check their types, both are "pandas.core.frame.DataFrame", but when I use the code below to omit their intersection:
o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-36-4e158c0eeb97> in <module>
----> 1 o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
5 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
561
562 table = hash_klass(size_hint or len(values))
--> 563 uniques, codes = table.factorize(
564 values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
565 )
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
How can I solve this issue?!
Omitting the intersection of two dataframes
Either use inplace=True or re-assign your dataframe when using pandas.DataFrame.drop_duplicates or any other built-in function that has an inplace parameter. You can't use them both at the same time.
Returns (DataFrame or None)
DataFrame with duplicates removed or None if inplace=True.
Try this :
o = pd.concat([wildone_df, df1]).drop_duplicates() #keep="first" by default
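For reference, a small sketch of the two patterns the quoted docs describe (either re-assign the returned frame, or pass inplace=True and keep using the original name):

combined = pd.concat([wildone_df, df1])
deduped = combined.drop_duplicates()       # re-assign the returned frame
# or, modify in place and keep using `combined`:
combined.drop_duplicates(inplace=True)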
try this:
merged_df = merged_df.loc[:,~merged_df.columns.duplicated()].copy()
See this post for more info
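For context on that last snippet: columns.duplicated() marks the second and later occurrences of a repeated column name, so the ~ mask keeps only the first copy of each. A tiny sketch:

dup_cols = pd.DataFrame([[1, 2, 3]], columns=['x', 'x', 'y'])
print(dup_cols.columns.duplicated())                     # [False  True False]
print(dup_cols.loc[:, ~dup_cols.columns.duplicated()])   # one 'x' column and 'y' remain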

Pandas (with def and np.where): error with values in a dataframe row conditioned on another dataframe row

I have a dataframe A of shape XxY with values and a dataframe B of shape ZxY to be filled with statistics calculated from A.
As an example:
A = pd.DataFrame(np.array(range(9)).reshape((3,3)))
B = pd.DataFrame(np.array(range(6)).reshape((2,3)))
Now I need to fill row 1 of B with quantile(0.5) of A columns where row 0 of B > 1 (else: np.nan). I need to use a function of the kind:
def mydef(df0, df1):
    df1.loc[1] = np.where(df1.loc[0] > 1,
                          df0.quantile(0.5),
                          np.nan)
    pass

mydef(A, B)
Now B is:
     0    1    2
0  0.0  1.0  2.0
1  NaN  NaN  3.5
It works perfectly for these mock dataframes and all my real ones apart from one.
For that one this error is raised:
ValueError: cannot set using a list-like indexer with a different length than the value
When I run the same code without calling a function, it doesn't raise any error.
Since I need to use a function, any suggestion?
I found the error. I erroneously had the same label twice in the index. Essentially my dataframe B was something like:
B = pd.DataFrame(np.array(range(9)).reshape((3,3)), index=[0,0,1])
so that calling the def:
def mydef(df0, df1):
    df1.loc[1] = np.where(df1.loc[0] > 1,
                          df0.quantile(0.5),
                          np.nan)
    pass
would cause the condition and the if-false lines of np.where to not match their shapes, I guess.
Still not sure why it worked when run outside the def.
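A quick way to see (and catch) this pitfall, sketched with the mock frames from the question:

B = pd.DataFrame(np.array(range(9)).reshape((3, 3)), index=[0, 0, 1])
print(B.loc[0].shape)      # (2, 3): the duplicated label returns two rows, not one Series
print(B.index.is_unique)   # False -- a cheap sanity check before calling mydef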

Extracting only object type columns in a separate list from a data-frame in pandas

I am a beginner in Python. I want to extract all the column names with dtype object into a separate list, for encoding as part of data processing. What I have tried is the code below, but I am getting an error:
l=[]
for i in dataset.columns[i.dtype == 'object']:
    l.append(i)
AttributeError Traceback (most recent call last)
in
----> 1 for i in dataset.columns[dataset.dtype == 'object']:
2 print(i)
D:\Anaconda\InstallationFolder\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'dtype'
dataset.info() gives the below:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
Please help me with this error.
I want the column names with object data type in a separate list.
You could use select_dtypes to extract the object-type columns and then store them in a variable:
df.select_dtypes(include='object')
Or try selecting the columns using .select_dtypes():
col_list = df_flights.select_dtypes(include=['object']).columns.to_list()
Try this:
columns = [column for column in dataset.columns if dataset[column].dtype == 'object']
When using pandas, if you are using a for loop you are probably doing something wrong.
dataset.dtypes[dataset.dtypes == "object"].index.values
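As a plain Python list (a small sketch using the question's dataset frame), the dtypes-based one-liner above, or select_dtypes, becomes:

obj_cols = dataset.dtypes[dataset.dtypes == 'object'].index.tolist()
# equivalently:
obj_cols = dataset.select_dtypes(include='object').columns.tolist()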
You are extracting columns so this will work:
l=[]
for i in dataset.columns:
    if dataset[i].dtypes == 'object':
        l.append(i)

Adding Pandas Series values to Pandas DataFrame values [duplicate]

I have a Python Pandas DataFrame:
df = pd.DataFrame(np.random.rand(5,3),columns=list('ABC'))
print df
A B C
0 0.041761178 0.60439116 0.349372206
1 0.820455992 0.245314299 0.635568504
2 0.517482167 0.7257227 0.982969949
3 0.208934899 0.594973111 0.671030326
4 0.651299752 0.617672419 0.948121305
Question:
I would like to add the first column to the whole dataframe. I would like to get this:
A B C
0 0.083522356 0.646152338 0.391133384
1 1.640911984 1.065770291 1.456024496
2 1.034964334 1.243204867 1.500452116
3 0.417869798 0.80390801 0.879965225
4 1.302599505 1.268972171 1.599421057
For the first row:
A: 0.04176 + 0.04176 = 0.08352
B: 0.04176 + 0.60439 = 0.64615
etc
Requirements:
I cannot refer to the first column using its column name.
e.g. df.A is not acceptable; df.iloc[:,0] is acceptable.
Attempt:
I tried this using:
print df.add(df.iloc[:,0], fill_value=0)
but it is not working. It returns the error message:
Traceback (most recent call last):
File "C:test.py", line 20, in <module>
print df.add(df.iloc[:,0], fill_value=0)
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 771, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2939, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2975, in _combine_match_columns
fill_value)
NotImplementedError: fill_value 0 not supported
Is it possible to take the sum of all columns of a DataFrame with the first column?
That's what you need to do:
df.add(df.A, axis=0)
Example:
>>> df = pd.DataFrame(np.random.rand(5,3),columns=['A','B','C'])
>>> col_0 = df.columns.tolist()[0]
>>> print df
A B C
0 0.502962 0.093555 0.854267
1 0.165805 0.263960 0.353374
2 0.386777 0.143079 0.063389
3 0.639575 0.269359 0.681811
4 0.874487 0.992425 0.660696
>>> df = df.add(df[col_0], axis=0)
>>> print df
A B C
0 1.005925 0.596517 1.357229
1 0.331611 0.429766 0.519179
2 0.773553 0.529855 0.450165
3 1.279151 0.908934 1.321386
4 1.748975 1.866912 1.535183
>>>
I would try something like this:
firstcol = df.columns[0]
df2 = df.add(df[firstcol], axis=0)
I used a combination of the above two posts to answer this question.
Since I cannot refer to a specific column by its name, I cannot use df.add(df.A, axis=0). But this is along the correct lines. Since df += df[firstcol] produced a dataframe of NaNs, I could not use this approach, but the way that this solution obtains a list of columns from the dataframe was the trick I needed.
Here is how I did it:
col_0 = df.columns.tolist()[0]
print(df.add(df[col_0], axis=0))
You can use numpy and broadcasting for this:
df = pd.DataFrame(df.values + df['A'].values[:, None],
                  columns=df.columns)
I expect this to be more efficient than series-based methods.
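A quick sanity check, as a sketch using the question's column layout (assuming numpy and pandas are imported as np and pd), that the broadcasting result matches the label-free add approach:

df = pd.DataFrame(np.random.rand(5, 3), columns=list('ABC'))
first = df.columns[0]
via_add = df.add(df[first], axis=0)
via_numpy = pd.DataFrame(df.values + df.iloc[:, 0].values[:, None],
                         columns=df.columns, index=df.index)
print(via_add.equals(via_numpy))   # expected: True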

Correct way of iterating over pandas dataframe by date

I want to iterate over a dataframe's major axis date by date.
Example:
tdf = df.ix[date]
The issue I am having is that the type returned by df.ix changes, leaving me with 3 possible situations:
If the date does not exist in df, an error is thrown: KeyError: 1394755200000000000
If there is only one item in tdf: print type(tdf) returns
<class 'pandas.core.series.Series'>
If there is more than one item in tdf: print type(tdf) returns
<class 'pandas.core.frame.DataFrame'>
To avoid the first case I can simply wrap this in a try/except block, or, thanks to jxstanford, avoid the try/except by using if date in df.index:
Afterwards I run into the issue of an inconsistent API between a pandas Series and a pandas DataFrame. I could solve this by checking types, but it seems I shouldn't have to do that. I would ideally like to keep the types the same. Is there a better way of doing this?
I'm running pandas 0.13.1 and I am currently loading my data from a CSV using DataFrame.from_csv.
Here's a full example demonstrating the problem.
from pandas import DataFrame
import datetime

path_to_csv = '/home/n/Documents/port/test.csv'
df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
start_dt = df.index.min()
end_dt = df.index.max()
dt_step = datetime.timedelta(days=1)
df.sort_index(inplace=True)
cur_dt = start_dt
while cur_dt != end_dt:
    if cur_dt in df.index:
        print type(df.ix[cur_dt])
        # run some other steps using cur_dt
    cur_dt += dt_step
An example CSV that demonstrates the problem is as follows:
value1,value2,value3,Date,type
1,2,4,03/13/14,a
2,3,3,03/21/14,b
3,4,2,03/21/14,a
4,5,1,03/27/14,b
The above code prints out
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Is it possible to get the value of value1 from tdf in a consistent manner, or am I stuck writing an if statement and handling each case separately?
if type(df.ix[cur_dt]) == DataFrame:
    ....
if type(df.ix[cur_dt]) == Series:
    ....
Not sure what you're trying to do with the dataframe, but this might be better than a try/except:
tdf = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
while cur_dt != end_dt:
    if cur_dt in df.index:
        pass  # do your thing
    cur_dt += dt_step
This toy code will return DataFrames consistently.
import numpy as np

def framer(rows):
    # a single matching date comes back as a Series (ndim == 1); wrap it
    # back into a one-row DataFrame so the caller always gets a DataFrame
    if np.ndim(rows) == 1:
        return rows.to_frame().T
    else:
        return rows

for cur_date in df.index:
    print type(framer(df.ix[cur_date]))
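With that wrapper in place, value1 can be read the same way in both cases (a one-line sketch):

vals = framer(df.ix[cur_date])['value1']   # always a Series, even when only one row matches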
And this will give you the missing days:
df.resample(rule='D')
Have a look at the resample method docstring. It has its own options to fill up the missing data. And if you decide to make your multiple dates into a single one, the method you're looking at is groupby (if you want to combine values across rows) and drop_duplicates (if you want to ignore them). There is no need to reinvent the wheel.
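A hedged sketch of that suggestion (assuming pandas is imported as pd; the exact resample spelling differs between the 0.13-era API and current pandas, so the groupby/reindex route below is shown instead): collapse the duplicated dates, then reindex to a full daily range so every day is present and a lookup always returns a single row.

deduped = df.groupby(level=0).first()                                # one row per existing date
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
daily = deduped.reindex(all_days)                                    # missing days show up as NaN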
You can use the apply method of the DataFrame, using axis = 1 to work on each row of the DataFrame to build a Series with the same Index.
e.g.
def calculate_value(row):
    if row.date == pd.datetime(2014, 3, 21):
        return 0
    elif row.type == 'a':
        return row.value1 + row.value2 + row.value3
    else:
        return row.value1 * row.value2 * row.value3

df['date'] = df.index
df['NewValue'] = df.apply(calculate_value, axis=1)
modifies your example input as follows
            value1  value2  value3 type  NewValue       date
Date
2014-03-13       1       2       4    a         7 2014-03-13
2014-03-21       2       3       3    b         0 2014-03-21
2014-03-21       3       4       2    a         0 2014-03-21
2014-03-27       4       5       1    b        20 2014-03-27

[4 rows x 6 columns]