In an attempt to make my pandas code faster I installed modin and tried to use it. A merge of two data frames that had previously worked gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
Output of event_df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
Output of other_df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the version cited in this question, 'Join two modin.pandas.DataFrame(s)', but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df,other_df, on = ["id","other_id"])
It seems there is some problem with Modin's merge functionality. Is there any workaround, such as using pandas for the merge and Modin for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem, and if so, is there a solution?
Your error reads like you were merging an instance of modin.pandas.dataframe.DataFrame with an instance of pandas.core.frame.DataFrame, which is not allowed.
If that's indeed the case, you could convert the pandas DataFrame into a Modin DataFrame first; then you should be able to merge them, I believe.
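For example, here is a minimal sketch of that conversion, using the variable names from the question and assuming event_df is the Modin frame and other_df is the plain pandas one (which frame is which is an assumption):

import modin.pandas as mpd

# Wrap the plain pandas DataFrame in a Modin DataFrame so both merge
# operands are the same type
other_df_modin = mpd.DataFrame(other_df)
joint_dataframe2 = mpd.merge(event_df, other_df_modin, on=["id", "other_id"])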
While removing zero and null values from a pandas DataFrame, the datatype of the field gets changed.
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 2 columns):
budget 10866 non-null int64
revenue 10866 non-null int64
dtypes: int64(2)
memory usage: 509.4+ KB
After running the code below to remove the zero and null values, the datatype changed.
import numpy as np

temp_list_to_check_zero_values = ['budget', 'revenue']
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].replace(0, np.nan)
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3854 entries, 0 to 10848
Data columns (total 2 columns):
budget 3854 non-null float64
revenue 3854 non-null float64
dtypes: float64(2)
memory usage: 210.8+ KB
To preserve the datatype, we used applymap:
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].applymap(np.int64)
But got error :
ModuleNotFoundError: No module named 'pandas.core.apply'
Do we need to install any specific library to use applymap()?
Upgrading your pandas should solve the issue.
Try pip install pandas --upgrade
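Once pandas is upgraded, here is a minimal sketch of the full round trip with made-up data; note that the NaN rows must be dropped before casting, since int64 cannot hold NaN (which matches the shrunken row count in the output above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [0, 100, 200], 'revenue': [50, 0, 300]})

cols = ['budget', 'revenue']
df[cols] = df[cols].replace(0, np.nan)  # zeros become NaN, dtype becomes float64
df = df.dropna(subset=cols)             # int64 cannot represent NaN, so drop those rows first
df[cols] = df[cols].applymap(np.int64)  # cast back to int64
print(df.dtypes)                        # budget int64, revenue int64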
I have a DataFrame (see the link for an image) and I've listed its info below. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the DataFrame, the y-axis values range from 0 to 2.0 instead of spanning the minimum and maximum values of the M and F columns.
To verify that it's not my environment, I created a simple DataFrame with just a few values and plotted a line graph for it, and it works as expected. Does anyone know why this is happening? Setting the values with ylim or yticks doesn't work either. Ultimately, I may have to try other graphing utilities like matplotlib directly, but I'm curious why it isn't working for such a simple DataFrame and dataset.
Visit my GitHub page for a working example: git@github.com:stevencorrea-chicago/stackoverflow_question.git
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
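For reference, here is a minimal self-contained reproduction of the pivot step; the sample data is made up, and the plot call is an assumption since the question doesn't show it:

import pandas as pd

df = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'Mark'],
    'sex':    ['F', 'F', 'M', 'M'],
    'births': [7065, 2604, 5000, 4800],
    'year':   [1880, 1881, 1880, 1881],
})

# Sum births per year, one column per sex
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc='sum')
new_df.plot()  # the y-axis should span the actual birth counts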
I have a dataframe which looks like this.
x_train.info()
Int64Index: 8330 entries, 16 to 8345
Data columns (total 4 columns):
userId 8330 non-null object
base_id 8330 non-null object
rating 8330 non-null object
dtypes: object(3)
I am trying to convert this to a sparse matrix using the following command:
from scipy import sparse

train_sparse_matrix = sparse.csc_matrix((x_train['rating'].values, (x_train['userId'].values, x_train['base_id'].values)),)
But I get the following error
<ipython-input-112-520f5e1aee89> in <module>
      4
      5 train_sparse_matrix = sparse.csc_matrix((x_train['result.courseViewCount'].values, (x_train['userId'].values,
----> 6 x_train['result.base_id'].values)),)

TypeError: 'numpy.float64' object cannot be interpreted as an integer
So I tried converting this DataFrame using .astype('int32') and the to_numeric() function, but x_train.info() still shows the columns as object.
Can you please help?
Data would be something like this:
userId base_id rating
5392.0 ABC001 6.0
5392.0 ETZ222 2.0
5392.0 XYZ095 1.0
Is it because base_id contains letters?
Can you try converting it to numpy.int64?
.astype(numpy.int64)
If you share an extract of the real data, it will be easier to answer.
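Building on that: csc_matrix needs integer row and column indices, and base_id values like 'ABC001' cannot be cast to integers at all. Here is a minimal sketch (my own suggestion, not part of the answer above) that first maps the string IDs to dense integer codes with pd.factorize, using the sample data from the question:

import numpy as np
import pandas as pd
from scipy import sparse

x_train = pd.DataFrame({
    'userId':  ['5392.0', '5392.0', '5392.0'],
    'base_id': ['ABC001', 'ETZ222', 'XYZ095'],
    'rating':  ['6.0', '2.0', '1.0'],
})

user_codes, _ = pd.factorize(x_train['userId'])   # string IDs -> 0, 1, 2, ...
item_codes, _ = pd.factorize(x_train['base_id'])
ratings = x_train['rating'].astype(np.float64).values

train_sparse_matrix = sparse.csc_matrix((ratings, (user_codes, item_codes)))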
What is the difference between an h5 and an hdf file? Should I use one over the other? I did a timeit with the following two snippets; each took about 3 minutes 29 seconds per loop with a 240 MB file. The second snippet eventually raised an error, but by then the file size on disk was above 300 MB.
# Snippet 1: explicit HDFStore, writing to a .h5 file
hdf = pd.HDFStore('combined.h5')
hdf.put('table', df, format='table', complib='blosc', complevel=5, data_columns=True)
hdf.close()

# Snippet 2: the one-call equivalent, writing to a .hdf file
df.to_hdf('combined.hdf', 'table', format='table', mode='w', complib='blosc', complevel=5)
Also, I got a warning which said:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types
This is due to string columns that are objects because of blank values. If I do .astype(str), all of the blanks are replaced with the string 'nan' (which even appears in the output files). Should I worry about the warning, fill in the blanks, and replace them with np.nan again later, or just ignore it?
Here is the df.info() to show that there are some columns with nulls. I can't remove these rows but I could temporarily fill them with something if required.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1387276 entries, 0 to 657406
Data columns (total 12 columns):
date 1387276 non-null datetime64[ns]
start_time 1387276 non-null datetime64[ns]
end_time 313190 non-null datetime64[ns]
cola 1387276 non-null object
colb 1387276 non-null object
colc 1387276 non-null object
cold 476816 non-null object
cole 1228781 non-null object
colx 1185679 non-null object
coly 313190 non-null object
colz 1387276 non-null int64
colzz 1387276 non-null int64
dtypes: datetime64[ns](3), int64(2), object(7)
memory usage: 137.6+ MB
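Here is a minimal sketch of the fill-and-restore round trip floated above (an assumption on my part that the blanks are NaN and that an empty string is an acceptable sentinel):

import numpy as np
import pandas as pd

df = pd.DataFrame({'cola': ['x', np.nan, 'y'], 'colz': [1, 2, 3]})

# Fill blanks with a sentinel string so PyTables sees uniform strings rather
# than object columns it has to pickle
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].fillna('')

df.to_hdf('combined.h5', key='table', format='table', mode='w', complib='blosc', complevel=5)

# On the way back in, turn the sentinel back into np.nan
df2 = pd.read_hdf('combined.h5', key='table')
df2[obj_cols] = df2[obj_cols].replace('', np.nan)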
I'm having a basic problem with df.head(). When the function is executed, it usually displays a nicely formatted HTML table of the top 5 rows, but now it only prints a summary of the sliced DataFrame, like below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
survived 5 non-null values
pclass 5 non-null values
name 5 non-null values
sex 5 non-null values
age 5 non-null values
sibsp 5 non-null values
parch 5 non-null values
fare 5 non-null values
embarked 5 non-null values
dtypes: float64(2), int64(4), object(3)
After looking at this thread I tried
pd.util.terminal.get_terminal_size()
and received the expected output (80, 25). Manually setting the print options with
pd.set_printoptions(max_columns=10)
yields the same sliced DataFrame summary as above.
This was confirmed after diving into the documentation here and using
get_option("display.max_rows")
get_option("display.max_columns")
which returned the correct defaults of 60 rows and 10 columns.
I've never had a problem with df.head() before, but now it's an issue in all of my IPython Notebooks.
I'm running pandas 0.11.0 and IPython 0.13.2 in Google Chrome.
In pandas 0.11.0, I think the minimum of display.height and display.max_rows (and of display.width and display.max_columns) is used, so you need to manually change those too.
I don't like this; I posted a GitHub issue about it previously.
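For example, here is a sketch of raising those paired options together; the option names are my assumption based on the 0.11-era configuration system:

import pandas as pd

# Raise both members of each pair, since the minimum of the pair wins
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 60)
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 10)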
Try using the following to display the top 10 rows:
from IPython.display import HTML
HTML(users.head(10).to_html())
I think the pandas 0.11.0 head() behavior is totally unintuitive; it should simply have remained so that calling head() gives you the HTML table.