Pandas df.head() Error - pandas

I'm having a basic problem with df.head(). When the function is executed, it usually displays a nicely formatted HTML table of the first 5 rows, but now it only appears to slice the dataframe and prints a summary like the one below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
survived 5 non-null values
pclass 5 non-null values
name 5 non-null values
sex 5 non-null values
age 5 non-null values
sibsp 5 non-null values
parch 5 non-null values
fare 5 non-null values
embarked 5 non-null values
dtypes: float64(2), int64(4), object(3)
After looking at this thread I tried
pd.util.terminal.get_terminal_size()
and received the expected output (80, 25). Manually setting the print options with
pd.set_printoptions(max_columns=10)
yields the same sliced dataframe summary as above.
This was confirmed after diving into the documentation here and using the
get_option("display.max_rows")
get_option("display.max_columns")
and getting the correct default 60 rows and 10 columns.
I've never had a problem with df.head() before, but now it's an issue in all of my IPython Notebooks.
I'm running pandas 0.11.0 and IPython 0.13.2 in Google Chrome.

In pandas 0.11.0, I think the minimum of display.height and max_rows (and of display.width and max_columns) is used, so you need to change those manually too.
I don't like this; I posted a GitHub issue about it previously.
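As a rough sketch of adjusting the display limits, assuming a recent pandas version where the options are named display.max_rows, display.max_columns and display.width (the option set in 0.11.0 differed, so treat the names and values here as illustrative only):

```python
import pandas as pd

# Small illustrative frame; the data is made up.
df = pd.DataFrame({"a": range(100), "b": range(100)})

# Remember the current limit so we can confirm it is restored afterwards.
before = pd.get_option("display.max_rows")

# option_context changes the options only inside the with-block.
with pd.option_context(
    "display.max_rows", 200,
    "display.max_columns", 50,
    "display.width", 200,
):
    inside = (
        pd.get_option("display.max_rows"),
        pd.get_option("display.max_columns"),
    )
    print(df.head())

# Outside the block the previous setting is back in effect.
after = pd.get_option("display.max_rows")
```

Using option_context rather than set_option keeps the change local, which can be handy in a notebook where other cells rely on the defaults.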

Try using the following to display the top 10 items
from IPython.display import HTML
HTML(users.head(10).to_html())
I think the head function in pandas 0.11.0 is totally unintuitive; it should
simply have remained as head() and you get your HTML.

Related

Plotting line graph from pandas DataFrame - does not work if I do not include .mean(), .sum() or even .median(). Very confused

I have a DataFrame that lists date, city, country, and average temperature in Celsius.
Here is the .info() of this DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16500 entries, 0 to 16499
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 16500 non-null object
1 city 16500 non-null object
2 country 16500 non-null object
3 avg_temp_c 16407 non-null float64
dtypes: float64(1), object(3)
I only want to plot the avg_temp_c for the cities of Toronto and Rome, both lines on one graph. This was actually a practice problem that has the solution, so here is the code for that:
toronto = temperatures[temperatures["city"] == "Toronto"]
rome = temperatures[temperatures["city"] == "Rome"]
toronto.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="blue")
rome.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="green")
My question is: why do I need to include .mean() in lines 3 and 4? I thought the numbers were already in avg_temp_c. Also, I experimented by replacing .mean() with .sum() and .median(), and it gives the same values. However, removing .mean() altogether for both lines just gives a blank plot. I tried to figure out why, but I am very confused and I want to understand. Why doesn't it work without .mean() when the values are already listed in avg_temp_c?
I tried removing .mean(). I tried replacing .mean() with .median() and .sum(), which give the exact same values for some reason. I tried just printing toronto["avg_temp_c"] and rome["avg_temp_c"], which gives me the values, but when I plot without .mean(), .sum(), or .median(), it does not work. I am just trying to figure out why this is the case, and how all three of those methods give me the same values as if I had just printed the avg_temp_c column.
Hope my question was clear. Thank you!
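One way to see what is going on, as a minimal sketch with invented data (the dates and temperatures below are made up, not taken from the practice dataset): groupby("date")["avg_temp_c"] returns a grouped object rather than a Series of numbers, and an aggregation such as .mean() is what collapses each date's group to a single value. Assuming each date appears only once per city, every group holds exactly one value, so the mean, sum and median of that group are all the same number.

```python
import pandas as pd

# Hypothetical sample: one row per date per city.
temperatures = pd.DataFrame({
    "date": ["2013-01-01", "2013-02-01", "2013-03-01"] * 2,
    "city": ["Toronto"] * 3 + ["Rome"] * 3,
    "avg_temp_c": [-5.0, -4.0, 1.0, 8.0, 9.0, 12.0],
})

toronto = temperatures[temperatures["city"] == "Toronto"]

# This is a grouped object, not yet a Series of temperatures.
grouped = toronto.groupby("date")["avg_temp_c"]
print(type(grouped).__name__)

# Each aggregation reduces every date's group to one number.
by_mean = grouped.mean()
by_sum = grouped.sum()
by_median = grouped.median()

# With exactly one row per date, all three aggregations agree.
rows_per_date = grouped.size()
same = by_mean.equals(by_sum) and by_mean.equals(by_median)
print(rows_per_date.max(), same)
```

Under the same one-row-per-date assumption, plotting toronto.set_index("date")["avg_temp_c"] directly should presumably draw the same line without any aggregation.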

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are an array. I simply want to drop all of the rows that contain the array (object datatype I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column, if that makes sense.
df = df[~df["ORIGIN_AIRPORT_ID"].astype(str).str.contains(",")]
Next time, consider giving us a data sample as text instead of an image. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted to a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, just use .dropna:
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
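The steps above can be put together as a small self-contained sketch. It assumes the array-like cells are stored as strings such as "[10397, 10398, 10399, 10400]" (if they are real Python lists, pd.to_numeric may behave differently), and it restricts dropna and the integer cast to the one column, which is a slight variation on the answer's code:

```python
import pandas as pd

# Sample mirroring the rows shown above; the list-like cell is assumed to be a string.
df = pd.DataFrame({
    "ITIN_ID": [20194146, 20194147, 20194148, 20194149, 20194150],
    "ORIGIN_AIRPORT_ID": [
        10397, 10397, 10397, "[10397, 10398, 10399, 10400]", 10397,
    ],
})

# Cells that cannot be parsed as a number become NaN.
df["ORIGIN_AIRPORT_ID"] = pd.to_numeric(df["ORIGIN_AIRPORT_ID"], errors="coerce")

# Drop only the rows where this column is NaN, then restore an integer dtype.
clean = (
    df.dropna(subset=["ORIGIN_AIRPORT_ID"])
      .astype({"ORIGIN_AIRPORT_ID": "int64"})
)
print(clean)
```

Limiting dropna to the subset avoids discarding rows that happen to have missing values in unrelated columns.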

Pandas - Converting dataframe object to numbers

I have a dataframe which looks like this.
x_train.info()
Int64Index: 8330 entries, 16 to 8345
Data columns (total 4 columns):
userId 8330 non-null object
base_id 8330 non-null object
rating 8330 non-null object
dtypes: object(3)
I am trying to convert this to sparse matrix using the following command
train_sparse_matrix = sparse.csc_matrix((x_train['rating'].values, (x_train['userId'].values, x_train['base_id'].values)),)
But I get the following error
<ipython-input-112-520f5e1aee89> in <module>
4
5 train_sparse_matrix = sparse.csc_matrix((x_train['result.courseViewCount'].values, (x_train['userId'].values,
----> 6 **x_train['result.base_id'].values)),)**
TypeError: 'numpy.float64' object cannot be interpreted as an integer
So I tried converting this dataframe using .astype('int32') and the to_numeric() function, but x_train.info() still shows the columns as object.
Can you please help!
Data would be something like this:
userId base_id rating
5392.0 ABC001 6.0
5392.0 ETZ222 2.0
5392.0 XYZ095 1.0
Is it because base_id contains letters?
Can you try converting it to numpy.int64?
.astype(numpy.int64)
If you give an extract of the real data, it will be easier to answer.
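If base_id really does contain letters, it cannot be cast to an integer directly. One possible approach, sketched here under the assumption that the columns are strings like the sample rows above, is to parse the numeric columns with pd.to_numeric and map each distinct base_id to an integer code with pd.factorize. The variable names user_idx, item_idx, item_labels and ratings are hypothetical.

```python
import pandas as pd

# Sample rows from the question, assumed to be stored as object/strings.
x_train = pd.DataFrame({
    "userId": ["5392.0", "5392.0", "5392.0"],
    "base_id": ["ABC001", "ETZ222", "XYZ095"],
    "rating": ["6.0", "2.0", "1.0"],
}, dtype=object)

# Numeric-looking strings: parse to float, then cast the ids to integers.
user_idx = pd.to_numeric(x_train["userId"]).astype("int64")
ratings = pd.to_numeric(x_train["rating"])

# Alphanumeric ids: assign each distinct value an integer code (0, 1, 2, ...).
item_idx, item_labels = pd.factorize(x_train["base_id"])

print(user_idx.tolist(), item_idx.tolist(), list(item_labels))
```

Integer arrays like these are the kind of row and column indices a sparse matrix constructor expects. Note that using userId values directly as row indices would allocate rows up to the largest id, so factorizing userId as well may be preferable.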

merging two pandas data frames with modin.pandas gives ValueError

In an attempt to make my pandas code faster I installed modin and tried to use it. A merge of two data frames that had previously worked gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
printing event_df.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
printing other_df info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the version cited in this question Join two modin.pandas.DataFrame(s), but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df,other_df, on = ["id","other_id"])
It seems there is some problem with modin's merge functionality. Is there any workaround such as using pandas for the merge and using modin for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem and if so, is there a solution?
Your error reads as if you were merging an instance of modin.pandas.dataframe.DataFrame with an instance of pandas.core.frame.DataFrame, which is not allowed.
If that's indeed the case, you could turn the pandas DataFrame into a modin DataFrame first; then you should be able to merge them, I believe.

HDFStore error 'correct atom type -> [dtype->uint64'

I'm using read_hdf for the first time and love it. I want to use it to combine a bunch of smaller *.h5 files into one big file. I plan on calling append() on an HDFStore; later I will add chunking to conserve memory.
Example table looks like this
Int64Index: 220189 entries, 0 to 220188
Data columns (total 16 columns):
ID 220189 non-null values
duration 220189 non-null values
epochNanos 220189 non-null values
Tag 220189 non-null values
dtypes: object(1), uint64(3)
code:
import pandas as pd
print pd.__version__ # I am running 0.11.0
dest_h5f = pd.HDFStore('c:\\t3_combo.h5',complevel=9)
df = pd.read_hdf('\\t3\\t3_20130319.h5', 't3', mode = 'r')
print df
dest_h5f.append(tbl, df, data_columns=True)
dest_h5f.close()
Problem: the append traps this exception
Exception: cannot find the correct atom type -> [dtype->uint64,items->Index([InstrumentID], dtype=object)] 'module' object has no attribute 'Uint64Col'
this feels like a problem with some version of pytables or numpy
pytables = v 2.4.0 numpy = v 1.6.2
We normally represent epoch time as int64 and use datetime64[ns]. Try using datetime64[ns]; it will make your life easier. In any event, nanoseconds since 1970 is well within the range of int64 (and uint64 only buys you 2x this range), so there is no real advantage to using unsigned ints.
We use int64 because the min value (-9223372036854775808) is used to represent NaT, an integer marker for Not a Time.
In [11]: (Series([Timestamp('20130501')])-
Series([Timestamp('19700101')]))[0].astype('int64')
Out[11]: 1367366400000000000
In [12]: np.iinfo('int64').max
Out[12]: 9223372036854775807
You can then represent time from about the year 1677 to 2262 at the nanosecond level.
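A minimal sketch of that suggestion, assuming the epochNanos column holds nanoseconds since 1970 as uint64 (the sample value is the one from the session above; the frame itself is made up). It first derives the representable range from the int64 limits, then converts the column to datetime64[ns] so it no longer needs a uint64 atom:

```python
import numpy as np
import pandas as pd

# Range of datetime64[ns]: the int64 minimum itself is reserved for NaT.
int64_info = np.iinfo(np.int64)
earliest = np.datetime64(int(int64_info.min) + 1, "ns")
latest = np.datetime64(int(int64_info.max), "ns")
print(earliest, latest)

# Hypothetical frame with a uint64 nanosecond column.
df = pd.DataFrame({
    "epochNanos": np.array([1367366400000000000], dtype="uint64"),
})

# Cast to int64 (safe while values fit in the signed range), then to datetimes.
df["epochNanos"] = pd.to_datetime(df["epochNanos"].astype("int64"), unit="ns")
print(df.dtypes)
```

The other uint64 columns in the table could presumably be cast to int64 in the same way before calling append(), provided their values fit in the signed range.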