When removing zero and null values from a pandas DataFrame, the datatype of the fields gets changed.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 2 columns):
budget 10866 non-null int64
revenue 10866 non-null int64
dtypes: int64(2)
memory usage: 509.4+ KB
After running the code below to remove the zero and null values, the datatype changed.
temp_list_to_check_zero_values = ['budget', 'revenue']
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].replace(0, np.nan)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3854 entries, 0 to 10848
Data columns (total 2 columns):
budget 3854 non-null float64
revenue 3854 non-null float64
dtypes: float64(2)
memory usage: 210.8+ KB
To preserve the datatype, we used applymap:
df[temp_list_to_check_zero_values] = df[temp_list_to_check_zero_values].applymap(np.int64)
But we got this error:
ModuleNotFoundError: No module named 'pandas.core.apply'
Do we need to install any specific library to use applymap()?
Upgrading your pandas installation should solve the issue. Try:
pip install pandas --upgrade
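As a side note, if the goal is to replace the zeros with missing values while keeping an integer dtype, pandas' nullable Int64 extension dtype can represent missing values without falling back to float64. A minimal sketch with made-up data (assumes a reasonably recent pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [100, 0, 250], 'revenue': [0, 80, 300]})
cols = ['budget', 'revenue']
# Replace 0 with NaN, then cast to the nullable Int64 extension dtype,
# which stores the missing values as pd.NA while staying integer-based.
df[cols] = df[cols].replace(0, np.nan).astype('Int64')
print(df.dtypes)  # budget: Int64, revenue: Int64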
I have a data frame:
pd.DataFrame({'A': range(1, 10000)})
Using df.info(), I can get a nice human-readable statement saying that it has a memory usage of 78.2 KB:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
I can get a statement to similar effect from df.memory_usage() (which is how pandas itself calculates its memory usage), but I would like to avoid rolling my own formatter. I've looked at the df.info source and traced the origin of the string all the way to this line.
How is this specific string generated and how can I pull that out so I can print it to a log?
N.B. I can't parse the df.info() output, because it prints directly to a buffer; the call itself returns None.
N.B. This line also does not help; what is initialised there is merely a boolean flag for whether memory usage should be printed at all.
You can create an instance of pandas.io.formats.info.DataFrameInfo and read the memory_usage_string property, which is exactly what df.info() does:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 9999 non-null int64
dtypes: int64(1)
memory usage: 78.2 KB
>>> pd.io.formats.info.DataFrameInfo(df).memory_usage_string.strip()
'78.2 KB'
If you're passing memory_usage to df.info, you can pass it directly to DataFrameInfo:
pd.io.formats.info.DataFrameInfo(df, memory_usage='deep').memory_usage_string.strip()
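Note that DataFrameInfo lives in pandas' internal pd.io.formats namespace, so its location may shift between versions. A more version-stable alternative is to format the total from df.memory_usage() yourself. The helper below is hypothetical, not a pandas function, but it mirrors the 1024-based scaling pandas uses:
def human_readable_bytes(n):
    # Step through units in factors of 1024, like pandas' own formatter.
    for unit in ('bytes', 'KB', 'MB', 'GB', 'TB'):
        if n < 1024:
            return f'{n:.1f} {unit}'
        n /= 1024
    return f'{n:.1f} PB'

total = df.memory_usage(deep=True).sum()  # deep=True also counts object contents
print(human_readable_bytes(total))  # e.g. '78.2 KB'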
I would like to display all the information of my data frame, which contains more than 100 columns, with .info() from pandas, but it won't:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85529 entries, 0 to 85528
Columns: 110 entries, ID to TARGET
dtypes: float64(40), int64(19), object(51)
memory usage: 71.8+ MB
I would like it to display like this:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
But the problem seems to be the high number of columns in my data frame. I would like to see every column listed, including its non-null count.
You can pass the optional arguments verbose=True and show_counts=True (null_counts was deprecated in pandas 1.2.0) to the .info() method to output information for all of the columns:
pandas >=1.2.0: data_train.info(verbose=True, show_counts=True)
pandas <1.2.0: data_train.info(verbose=True, null_counts=True)
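Alternatively, the column threshold at which .info() switches to the short summary view is itself configurable via the display.max_info_columns option (default 100), so raising it above your column count also restores the per-column listing. A sketch, assuming the 110-column frame above:
import pandas as pd

# .info() collapses to the summary view once a frame has more columns
# than display.max_info_columns; raise the threshold above 110.
pd.set_option('display.max_info_columns', 200)
data_train.info(show_counts=True)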
I have no idea why, but when my pandas column has dtype Float64, I can't run this command:
df['column'].round()
Following error follows:
AttributeError: 'FloatingArray' object has no attribute 'round'
If I set the dtype to float64, everything works fine. Could you please explain why?
You may need to check your pandas version or your data. With a recent pandas, rounding a Float64 column works:
df = pd.DataFrame({'column': pd.array([1.1, 2.2])})
>>> type(pd.array([1.1, 2.2]))
pandas.core.arrays.floating.FloatingArray
>>> hasattr(pd.array([1.1, 2.2]), 'round')
True
>>> df
column
0 1.1
1 2.2
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column 2 non-null Float64 # <- Float64
dtypes: Float64(1)
memory usage: 146.0 bytes
>>> df['column'].round()
0 1.0
1 2.0
Name: column, dtype: Float64
>>> df['column'].astype(float).round()
0 1.0
1 2.0
Name: column, dtype: float64
GitHub links:
https://github.com/pandas-dev/pandas/issues/38844
https://github.com/pandas-dev/pandas/pull/39751
Minimum pandas version required for the fix: 1.3.0
Try this:
df['column'].apply(lambda x: round(x))
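If upgrading pandas is not an option, another workaround is to round through the plain NumPy dtype and convert back; missing values survive the round trip because pd.NA becomes np.nan and then pd.NA again. A small sketch:
import pandas as pd

s = pd.Series([1.15, 2.65, None], dtype='Float64')
# Cast to plain float64 (whose .round works on older pandas too),
# round, then cast back to the nullable Float64 dtype.
rounded = s.astype('float64').round().astype('Float64')
print(rounded)  # 1.0, 3.0, <NA>  -- dtype: Float64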
I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of the minimum and maximum values from the M and F columns.
To verify that it's not my environment, I created a simple dataframe, with just a few values and plot the line graph for that dataframe and it works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my GitHub page for a working example: git@github.com:stevencorrea-chicago/stackoverflow_question.git
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
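For reference, a miniature version of the setup described above (with hypothetical data, since the original frame is not included) looks like this:
import pandas as pd

df = pd.DataFrame({
    'name': ['Mary', 'John', 'Mary', 'John'],
    'sex': ['F', 'M', 'F', 'M'],
    'births': [7065, 5329, 6919, 5600],
    'year': pd.array([1880, 1880, 1881, 1881], dtype='Int64'),  # nullable Int64, as in the info output
})

new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc='sum')
new_df.plot()  # the plot call whose y-axis misbehaves in the question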
In an attempt to make my pandas code faster I installed modin and tried to use it. A merge of two data frames that had previously worked gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
Printing event_df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
Printing other_df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the version cited in this question Join two modin.pandas.DataFrame(s), but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df,other_df, on = ["id","other_id"])
It seems there is some problem with modin's merge functionality. Is there any workaround, such as using pandas for the merge and modin for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem, and if so, is there a solution?
Your error reads like you were merging an instance of modin.pandas.dataframe.DataFrame with an instance of pandas.core.frame.DataFrame, which is not allowed.
If that's indeed the case, you could convert the pandas DataFrame into a modin DataFrame first; then you should be able to merge them, I believe.
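For example, assuming other_df is the operand that is still a plain pandas DataFrame, a minimal sketch of that conversion (modin's DataFrame constructor accepts a pandas DataFrame):
import modin.pandas as mpd

# Wrap the plain pandas frame so both merge operands are modin frames.
other_df = mpd.DataFrame(other_df)
joint_dataframe2 = mpd.merge(event_df, other_df, on=['id', 'other_id'])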