Adding two columns of a pandas DataFrame

I am trying to add two columns of a DataFrame read from a .csv:
dfMammals = pd.read_csv('MamFuncDatCSV.csv')
dfMammals['FANM'] = df['Diet-Fruit'] + df['Diet-Nect']
But when I look at the results, they don't make sense, as they are not the sums of the values:
'Diet-Fruit'= 0,0,0,0,0,20,20,20,20,20 dtype: int64
'Diet-Nect'=0,0,0,0,0,0,0,0,40,0 dtype: int64
'FANM'=0,20,0,80,80,60,30,10,10,0
Could anyone tell me what may be going on?
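A likely cause, judging only from the snippet above: the right-hand side reads from df rather than from the freshly loaded dfMammals, so the sums come from whatever other DataFrame df happens to hold. A minimal sketch of the fix, assuming the column names are as listed:

import pandas as pd

dfMammals = pd.read_csv('MamFuncDatCSV.csv')

# Sum the two diet columns of the same DataFrame that was just loaded;
# referencing a stale df here would pull values from unrelated data.
dfMammals['FANM'] = dfMammals['Diet-Fruit'] + dfMammals['Diet-Nect']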

Related

Plotting a line graph from a pandas DataFrame - it does not work unless I include .mean(), .sum(), or even .median(). Very confused

I have a DataFrame with columns for date, city, country, and average temperature in Celsius.
Here is the .info() of this DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16500 entries, 0 to 16499
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   date        16500 non-null  object
 1   city        16500 non-null  object
 2   country     16500 non-null  object
 3   avg_temp_c  16407 non-null  float64
dtypes: float64(1), object(3)
I only want to plot the avg_temp_c for the cities of Toronto and Rome, both lines on one graph. This was actually a practice problem that came with a solution, so here is that code:
toronto = temperatures[temperatures["city"] == "Toronto"]
rome = temperatures[temperatures["city"] == "Rome"]
toronto.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="blue")
rome.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="green")
My question is: why do I need to include .mean() in the last two lines? I thought the numbers were already in avg_temp_c. I also experimented by replacing .mean() with .sum() and .median(), and they give the same values. However, removing .mean() altogether from both lines just gives a blank plot. I tried to figure out why, but I am very confused and want to understand: why doesn't it work without .mean() when the values are already listed in avg_temp_c?
I tried removing .mean(). I tried replacing .mean() with .median() and .sum(), which give the exact same values for some reason. I tried just printing toronto["avg_temp_c"] and rome["avg_temp_c"], which gives me the values, but plotting without .mean(), .sum(), or .median() does not work. I am just trying to figure out why this is the case, and how all three of those methods give the same values as if I had just printed the avg_temp_c column.
Hope my question was clear. Thank you!
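A hedged explanation of what is likely happening: groupby("date")["avg_temp_c"] returns a SeriesGroupBy object, not a Series. The aggregation (.mean()) collapses each date group into a single number, yielding a date-indexed Series that .plot() can draw as one line. Because each city appears to have exactly one row per date, the mean, sum, and median of each one-row group are all that same single value, which is why the three methods agree. Without the aggregation, .plot is dispatched on the GroupBy object itself, which tries to draw each date group as its own line; with only one point per group there is nothing visible to draw, which would explain the blank plot. A small self-contained sketch (with made-up data, since the original temperatures CSV is not shown):

import pandas as pd

# Hypothetical miniature of the temperatures data: one row per (city, date)
temperatures = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-01-01", "2000-02-01"],
    "city": ["Toronto", "Toronto", "Rome", "Rome"],
    "avg_temp_c": [-5.0, -3.2, 8.1, 9.4],
})

toronto = temperatures[temperatures["city"] == "Toronto"]
grouped = toronto.groupby("date")["avg_temp_c"]

print(type(grouped))         # SeriesGroupBy: grouped data, not yet plottable values
print(type(grouped.mean()))  # Series indexed by date: this is what .plot() draws

# Each date group holds a single value, so the aggregations coincide:
print(grouped.mean().equals(grouped.sum()))     # True
print(grouped.mean().equals(grouped.median()))  # True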

Pandas selecting dataframe columns using a specific string and array/list

I have a dataframe with hundreds of columns (stocks). My issue is that I need to always pull a specific column (date) followed by an array/list of others (dynamic).
Previously I was doing something like this:
df = stocks[['date', 'AAPL', 'AMZN']]
but now, if I need to dynamically choose stocks based on a sector, I am not sure how to make these play nicely together. I am only able to pull the list, without date, like this:
print(rowData['symbol'])
3 [APA.OQ, BKR.N, COG.N, CVX.N, CXO.N, COP.N, DV...
Name: symbol, dtype: object
selection = rowData['symbol'].explode()
df = stocks[selection]
How do I also get the date values? Something like this doesn't work:
df = stocks[['date'][selection]]
Thanks
Let us try:
df = stocks[['date'] + rowData['symbol'].iloc[0]]
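Why this works (a sketch with hypothetical data, since the real stocks and rowData frames are not shown): rowData['symbol'].iloc[0] is a plain Python list of tickers, so prepending 'date' with list concatenation yields a single list of column labels:

import pandas as pd

# Hypothetical frames matching the shapes in the question
stocks = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05"],
    "AAPL": [129.4, 131.0],
    "CVX.N": [86.3, 87.1],
    "COP.N": [41.2, 42.0],
})
rowData = pd.DataFrame({"symbol": [["CVX.N", "COP.N"]]})

# iloc[0] pulls the list itself out of the 'symbol' cell, and
# ['date'] + list concatenates it into one column selection:
cols = ['date'] + rowData['symbol'].iloc[0]
df = stocks[cols]
print(df)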

Resolving error when merging dataframes on two columns

I am trying to merge two dataframes (D1 & R1) on two columns (Date & Symbol) but I'm receiving this error "You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat".
I've been using pd.merge and I've tried different dtypes. I don't want to concatenate these because I just want to add D1 to the right side of R1.
D2 = pd.merge(D1, R1, on=['Date','Symbol'])
D1.dtypes
Date object
Symbol object
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
pct_change_1D float64
Symbol_above object
NE bool
R1.dtypes
gvkey int64
datadate int64
fyearq int64
fqtr int64
indfmt object
consol object
popsrc object
datafmt object
tic object
curcdq object
datacqtr object
datafqtr object
rdq int64
costat object
ipodate float64
Report_Today int64
Symbol object
Date int64
Ideally, the R1 columns other than the merge keys (gvkey through Report_Today) will end up to the right of the D1 columns.
Any help is appreciated. Thanks.
From the dtypes listings in your question we can see that:
In the D1 DataFrame, the Date column has type "object".
In the R1 DataFrame, the Date column has type "int64".
Make the types of these columns the same and the merge will work.
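A hedged sketch of one way to align the dtypes (assuming, as is common in financial data, that R1's integer dates are encoded as YYYYMMDD; adjust the format to whatever the data actually holds):

import pandas as pd

# Hypothetical miniature frames with the mismatched Date dtypes
D1 = pd.DataFrame({"Date": ["2020-01-31", "2020-02-29"],
                   "Symbol": ["AAPL", "AAPL"],
                   "Close": [309.5, 273.4]})
R1 = pd.DataFrame({"Date": [20200131, 20200229],
                   "Symbol": ["AAPL", "AAPL"],
                   "gvkey": [1, 1]})

# Convert both Date columns to a common datetime dtype before merging:
D1["Date"] = pd.to_datetime(D1["Date"])
R1["Date"] = pd.to_datetime(R1["Date"], format="%Y%m%d")

D2 = pd.merge(D1, R1, on=["Date", "Symbol"])
print(D2.dtypes)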

How to count the number of categorical features with Pandas?

I have a pd.DataFrame that contains columns of different dtypes. I would like to have the count of columns of each type. I use pandas 0.24.2.
I tried:
dataframe.dtypes.value_counts()
It worked fine for the other dtypes (float64, object, int64), but for a weird reason it doesn't aggregate the 'category' features, and I get a separate count for each category column (as if they were counted as different dtypes).
I also tried:
dataframe.dtypes.groupby(by=dataframe.dtypes).agg(['count'])
But that raises a
TypeError: data type not understood.
Reproducible example:
import pandas as pd
df = pd.DataFrame([['A','a',1,10], ['B','b',2,20], ['C','c',3,30]], columns=['col_1','col_2','col_3','col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')
print(df.dtypes.value_counts())
Expected result:
int64 2
category 2
dtype: int64
Actual result:
int64 2
category 1
category 1
dtype: int64
Use DataFrame.get_dtype_counts:
print(df.get_dtype_counts())
category 2
int64 2
dtype: int64
But if you use the latest version of pandas, your original solution is the recommended one, since get_dtype_counts is:
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.
As @jezrael mentioned, get_dtype_counts is deprecated since 0.25.0, but dtypes.value_counts() would still give two separate category entries, so to fix it, cast the dtypes to strings first:
print(df.dtypes.astype(str).value_counts())
Output:
int64 2
category 2
dtype: int64
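Why value_counts() splits them (a hedged note on the likely cause): each categorical column carries its own CategoricalDtype, parameterized by its particular categories, so two category columns with different categories compare as unequal dtypes and get tallied separately; casting to str collapses them all to the plain string 'category':

import pandas as pd

df = pd.DataFrame([['A','a',1,10], ['B','b',2,20], ['C','c',3,30]],
                  columns=['col_1','col_2','col_3','col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')

print(repr(df['col_1'].dtype))  # CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
print(repr(df['col_2'].dtype))  # CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
print(df['col_1'].dtype == df['col_2'].dtype)            # False: different categories
print(str(df['col_1'].dtype) == str(df['col_2'].dtype))  # True: both are just 'category'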

Grouped function between 2 columns in a pandas.DataFrame?

I have a dataframe that has multiple numerical data columns, and a 'group' column. I want to get the output of various functions over two of the columns, for each group.
Example data and function:
import numpy as np
import pandas

df = pandas.DataFrame({"Dummy": [1,2]*6, "X": [1,3,7]*4,
                       "Y": [2,3,4]*4, "group": ["A","B"]*6})

def RMSE(X):
    return np.sqrt(np.sum((X.iloc[:,0] - X.iloc[:,1])**2))
I want to do something like
group_correlations = df[["X", "Y"]].groupby('group').apply(RMSE)
But if I do that, the 'group' column isn't in the dataframe. If I do it the other way around, like this:
group_correlations = df.groupby('group')[["X", "Y"]].apply(RMSE)
Then the column selection doesn't work:
df.groupby('group')[['X', 'Y']].head(1)
         Dummy  X  Y group
group
A     0      1  1  2     A
B     1      2  3  3     B
The Dummy column is still included, so the function will calculate RMSE on the wrong data.
Is there any way to do what I'm trying to do? I know I could do a for loop over the different groups, and subselect the columns manually, but I'd prefer to do it the pandas way, if there is one.
This looks like a bug (or perhaps selecting multiple columns in a groupby is not implemented?); a workaround is to pass in the groupby column directly:
In [11]: df[['X', 'Y']].groupby(df['group']).apply(RMSE)
Out[11]:
group
A 4.472136
B 4.472136
dtype: float64
To see it's the same:
In [12]: df.groupby('group')[['X', 'Y']].apply(RMSE) # wrong
Out[12]:
group
A 8.944272
B 7.348469
dtype: float64
In [13]: df.iloc[:, 1:].groupby('group')[['X', 'Y']].apply(RMSE) # correct: ignore dummy col
Out[13]:
group
A 4.472136
B 4.472136
dtype: float64
More robust implementation:
To avoid this completely, you could change RMSE to select the columns by name:
In [21]: def RMSE2(X, left_col, right_col):
    ...:     return np.sqrt(np.sum((X[left_col] - X[right_col])**2))
In [22]: df.groupby('group').apply(RMSE2, 'X', 'Y')  # equivalent to passing lambda x: RMSE2(x, 'X', 'Y')
Out[22]:
group
A 4.472136
B 4.472136
dtype: float64
Thanks to @naught101 for pointing out the sweet apply syntax that avoids the lambda.
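A closing note, hedged since behaviour varies by version: in recent pandas releases the column selection after groupby is respected, so the original expression works as intended without the workaround:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Dummy": [1,2]*6, "X": [1,3,7]*4,
                   "Y": [2,3,4]*4, "group": ["A","B"]*6})

def RMSE(X):
    return np.sqrt(np.sum((X.iloc[:,0] - X.iloc[:,1])**2))

# On a recent pandas, only X and Y reach RMSE, so both groups give 4.472136:
print(df.groupby('group')[['X', 'Y']].apply(RMSE))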