Dropping the same rows in two pandas DataFrames in Python

I want to keep only the uncommon rows between two pandas DataFrames, df1 and wildone_df. When I check their type, both are "pandas.core.frame.DataFrame", but when I use the code below to omit their intersection:
o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-36-4e158c0eeb97> in <module>
----> 1 o = pd.concat([wildone_df,df1]).drop_duplicates(subset=None, keep='first', inplace=False)
/usr/local/lib/python3.8/dist-packages/pandas/core/algorithms.py in factorize_array(values, na_sentinel, size_hint, na_value, mask)
561
562 table = hash_klass(size_hint or len(values))
--> 563 uniques, codes = table.factorize(
564 values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
565 )
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
How can I solve this issue?

Either use inplace=True or re-assign your dataframe when using pandas.DataFrame.drop_duplicates, or any other built-in function that has an inplace parameter; you can't use both at the same time. As the docs say:
Returns (DataFrame or None)
DataFrame with duplicates removed, or None if inplace=True.
Try this :
o = pd.concat([wildone_df, df1]).drop_duplicates()  # keep="first" by default

Try this, which drops duplicated columns rather than duplicated rows:
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()].copy()
See this post for more info.
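For what it's worth, this particular TypeError usually means at least one column holds numpy arrays (or lists), which drop_duplicates cannot hash. A minimal sketch of the failure and one possible workaround, converting the offending cells to tuples first (the column name bad_col is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1, 1],
                   "bad_col": [np.array([1, 2]), np.array([1, 2])]})
# df.drop_duplicates()  # raises TypeError: unhashable type: 'numpy.ndarray'

# make the array cells hashable, then de-duplicate as usual
df["bad_col"] = df["bad_col"].apply(tuple)
print(df.drop_duplicates())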

Related

Pyspark pandas TypeError when trying to concatenate two dataframes

I got the error below while trying to concatenate two pandas dataframes:
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
At first I thought it was because one of the dataframes includes lists in some columns, so I tried concatenating two dataframes whose columns contain no lists. But I got the same error. I printed the type of the dataframes to be sure; both of them are pandas.core.frame.DataFrame. Why do I get this error even though they are not lists?
import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
2 split_col = split_col.toPandas()
3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
2464 for obj in objs:
2465 if not isinstance(obj, (Series, DataFrame)):
-> 2466 raise TypeError(
2467 "cannot concatenate object of type "
2468 "'{name}"
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame
I want to concatenate the two dataframes but I'm stuck. Do you have any suggestions?
You're getting this error because you're trying to concatenate two plain pandas DataFrames using the pandas API on Spark.
Instead of converting your pyspark dataframes to pandas dataframes with the toPandas() method, try the following:
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
More documentation on this method:
https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.to_pandas_on_spark.html
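Put together, a minimal sketch of the suggested flow (assuming split_col and split_col2 start out as pyspark.sql DataFrames):

import pyspark.pandas as ps

# keep the frames on Spark instead of collecting them to plain pandas
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
dfNew = ps.concat([split_col, split_col2], axis=1, ignore_index=True)

Alternatively, if the data is small enough to collect with toPandas(), plain pandas.concat accepts the resulting pandas DataFrames directly.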

KeyError from new dataframe when plotting

KeyError: 0 after creating a new dataframe and then attempting to plot it.
Initially, the code plotted the original dataframe without problems.
A small number of rows (~5) were removed and a new dataframe was created.
The new dataframe displays without issue; however, attempting to plot it raises KeyError: 0.
I have attempted to resolve the issue without success.
The following is the script for the replacement and removal of missing data and the creation of the new dataframe:
df_pre_orderset2_t = df_pre_orderset2.replace(0, np.nan)
df_pre_orderset2_top = df_pre_orderset2_t.dropna()
pd.set_option('display.max_colwidth', None)
df_pre_orderset2_to_10 = df_pre_orderset2_top.head(10)
df_pre_orderset2_top10 = pd.DataFrame(df_pre_orderset2_to_10)
df_pre_orderset2_top10
With the plot script as follows
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(9, 7))
ax = plt.gca()
x = df_pre_orderset2_top10['warning_status']
y = df_pre_orderset2_top10['count']
n = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
t = np.arange(10)
plt.title('Warning distribution versus order sets')
plt.ylabel('Warning count by order sets')
plt.xlabel('Warning alerts')
plt.scatter(x, y, c=t, s=100, alpha=1.0, marker='^')
plt.gcf().set_size_inches(13, 8)
# scatter labels
for i, txt in enumerate(n):
    ax.annotate(txt, (x[i], y[i]))
plt.show()
This returns an incomplete outline of the proposed plot and a KeyError: 0, as below.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-161-0cc4009cf4a7> in <module>
16 #scatter labels
17 for i, txt in enumerate(n):
---> 18 ax.annotate(txt, (x[i],y[i]))
19
20 plt.show()
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
851
852 elif key_is_scalar:
--> 853 return self._get_value(key)
854
855 if is_hashable(key):
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
959
960 # Similar to Index.get_value, but we do not fall back to positional
--> 961 loc = self.index.get_loc(label)
962 return self.index._get_values_for_loc(self, loc, label)
963
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 0
The line that's failing is ax.annotate(txt, (x[i],y[i])), and it's failing when i=0. Both x and y are Series objects that are columns taken from df_pre_orderset2_top10, so I'm guessing that when you removed rows from that dataframe, the row with 0 as its index was removed. You should be able to verify this by displaying the dataframe.
If this is the case, you can reset the index of that dataframe before you extract the x and y columns. Set drop=True to make sure the old index isn't added to the dataframe as a new column.
df_pre_orderset2_top10.reset_index(drop=True, inplace=True)
That should fix the problem.
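A minimal sketch (with made-up data) of why label 0 disappears, plus a positional alternative, .iloc, that works without resetting the index:

import numpy as np
import pandas as pd

s = pd.Series([0, 10, 20]).replace(0, np.nan).dropna()
# s.index is now [1, 2], so s[0] raises KeyError: 0
print(s.iloc[0])  # 10.0 -- positional lookup still works
s = s.reset_index(drop=True)
print(s[0])       # 10.0 -- label 0 exists again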

Replace NaN values in all levels of a Pandas MultiIndex

After reading an excel sheet with a MultiIndex into pandas, I am getting np.nan in the index because some of the values are 'N/A' and pd.read_excel thinks it's a good idea to convert them. However, I want to keep them as 'N/A' to preserve the MultiIndex. I thought it would be easy to change them back using MultiIndex.fillna, but I get this error:
import numpy as np
import pandas as pd

index = pd.MultiIndex(levels=[[u'foo', u'bar'], [u'one', np.nan]],
                      codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
                      names=[u'first', u'second'])
df = pd.DataFrame(index=index, columns=['A', 'B'])
df
df.index.fillna("N/A")
Output:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-17-09e14dcdc74f> in <module>
----> 1 df.index.fillna("N/A")
/anaconda3/envs/torch/lib/python3.7/site-packages/pandas/core/indexes/multi.py in fillna(self, value, downcast)
1456 fillna is not implemented for MultiIndex
1457 """
-> 1458 raise NotImplementedError("isna is not defined for MultiIndex")
1459
1460 @Appender(_index_shared_docs["dropna"])
NotImplementedError: isna is not defined for MultiIndex
Update:
Code updated to reflect Pandas 1.0.2. Prior to version 0.24.0 the codes attribute of pd.MultiIndex was called labels. Also, the traceback details changed from isnull is not defined to isna is not defined as above.
The accepted solution did not work for me either. It still left NA values in the index even though inspecting the df.index.levels individually did not show NA values.
Jorge's solution pointed me in the right direction but also wasn't quite right for my case. Here is my approach, including handling of the single-Index case discussed in the comments of the accepted answer:
if isinstance(df.index, pd.MultiIndex):
    df.index = pd.MultiIndex.from_frame(
        df.index.to_frame().fillna(my_fillna_value)
    )
else:
    df.index = df.index.fillna(my_fillna_value)
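As a quick check, applying the same idea to the index from the question (assuming pandas >= 0.24, which introduced MultiIndex.from_frame):

import numpy as np
import pandas as pd

index = pd.MultiIndex(levels=[['foo', 'bar'], ['one', np.nan]],
                      codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
                      names=['first', 'second'])
# fillna works on the intermediate DataFrame even though it fails on the index
fixed = pd.MultiIndex.from_frame(index.to_frame().fillna('N/A'))
print(fixed.get_level_values('second').unique())  # ['one', 'N/A']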
Use set_levels
df.index.set_levels([l.fillna('N/A') for l in df.index.levels], inplace=True)
df
The current solution didn't work for me with multi-level columns. What I did, and what worked for me, was the following:
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().fillna(''))

Error in filtering groupby results in pandas

I am trying to filter groupby results in pandas using the example provided at:
http://pandas.pydata.org/pandas-docs/dev/groupby.html#filtration
but I get the following error (pandas 0.12):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d0014484ff78> in <module>()
1 grouped = my_df.groupby('userID')
----> 2 grouped.filter(lambda x: len(x) >= 5)
/Users/zz/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in filter(self, func, dropna, *args, **kwargs)
2092 res = path(group)
2093
-> 2094 if res:
2095 indexers.append(self.obj.index.get_indexer(group.index))
2096
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What does it mean and how can it be resolved?
EDIT:
Code to replicate the problem in pandas 0.12 stable:
dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123') })
dff.groupby('A').filter(lambda x: len(x) > 2)
This was a quasi-bug in 0.12 and is fixed in 0.13; res is now protected by a type check:
if isinstance(res, (bool, np.bool_)):
    if res:
        add_indices()
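For context, that ValueError is numpy's generic complaint whenever a multi-element boolean array is evaluated in an if statement, which is exactly what the unpatched if res: was doing:

import numpy as np

res = np.array([True, True, False])
try:
    if res:  # a multi-element array has no single truth value
        pass
except ValueError as err:
    print(err)  # The truth value of an array with more than one element is ambiguous...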
I'm not quite sure how you got this error, however; the docs are compiled and run against actual pandas, so make sure you're reading the docs for the correct version (in this case you were linking to dev rather than stable, although the API is largely unchanged).
The standard workaround is to do this using transform, which in this case would be something like:
In [10]: g = dff.groupby('A')

In [11]: dff[g.B.transform(lambda x: len(x) > 2)]
Out[11]:
A B C
0 2 1 1
1 2 2 2
2 2 3 3

DataFrame.ix() in pandas - is there an option to catch situations when requested columns do not exist?

My code reads a CSV file into a pandas DataFrame and processes it.
The code relies on column names, using df.ix[,] to get the columns.
Recently some column names in the CSV file were changed (without notice),
but the code did not complain and silently produced wrong results.
The ix[,] construct doesn't check whether a column exists;
if it doesn't, ix simply creates it and populates it with NaN.
Here is the main idea of what was going on.
from pandas import DataFrame

df1 = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # columns 'a' & 'b'
df2 = df1.ix[:, ['a', 'c']]  # trying to get 'a' & 'c'
print df2
a c
0 1 NaN
1 2 NaN
2 3 NaN
So it doesn't produce an error or a warning.
Is there an alternative way to select specific columns, with an extra check that the columns exist?
My current workaround is to use my own small utility function, something like this:
import sys, inspect

def validate_cols_or_exit(df, cols):
    """
    Exits with an error message if pandas DataFrame object df
    doesn't have all columns from the provided list of columns.
    Example of usage:
    validate_cols_or_exit(mydf, ['col1', 'col2'])
    """
    dfcols = list(df.columns)
    valid_flag = True
    for c in cols:
        if c not in dfcols:
            print "Error, non-existent DataFrame column found - ", c
            valid_flag = False
    if not valid_flag:
        print "Error, non-existent DataFrame column(s) found in function ", inspect.stack()[1][3]
        print "valid column names are:"
        print "\n".join(df.columns)
        sys.exit(1)
How about:
In [3]: df1[['a', 'c']]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-3-2349e89f1bb5> in <module>()
----> 1 df1[['a', 'c']]
/home/wesm/code/pandas/pandas/core/frame.py in __getitem__(self, key)
1582 if com._is_bool_indexer(key):
1583 key = np.asarray(key, dtype=bool)
-> 1584 return self._getitem_array(key)
1585 elif isinstance(self.columns, MultiIndex):
1586 return self._getitem_multilevel(key)
/home/wesm/code/pandas/pandas/core/frame.py in _getitem_array(self, key)
1609 mask = indexer == -1
1610 if mask.any():
-> 1611 raise KeyError("No column(s) named: %s" % str(key[mask]))
1612 result = self.reindex(columns=key)
1613 if result.columns.name is None:
KeyError: 'No column(s) named: [c]'
Not sure you can constrain a DataFrame, but your helper function could be a lot simpler, something like:
mismatch = set(cols).difference(set(dfcols))
if mismatch:
    raise SystemExit('Unknown column(s): {}'.format(','.join(mismatch)))
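A present-day footnote: .ix was deprecated in pandas 0.20 and removed in 1.0, and in modern pandas plain label selection raises on missing columns, so the check comes for free. A minimal sketch:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
try:
    df1[['a', 'c']]   # df1.loc[:, ['a', 'c']] behaves the same way
except KeyError as err:
    print(err)        # "['c'] not in index"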