Pandas dataframes and PyCharm IntelliSense - pandas

When I create new dataframes from old ones using concat or merge, PyCharm IntelliSense stops working for the resulting dataframe unless I explicitly pass it to a DataFrame constructor:
import pandas as pd
d1 = {1: [1, 2, 3], 2: [11, 22, 33]}
d2 = {1: [4], 2: [5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df3 = pd.concat([df1, df2], axis=0)
df3_ = pd.DataFrame(pd.concat([df1, df2], axis=0))
In the above example df3 and df3_ are the "same" dataframe, but intellisense only works on df3_. Am I doing something wrong? How can I avoid always having to call the DataFrame constructor and still get intellisense out of pycharm?

The answer is to use a type hint, for example as a type comment:
df3 = pd.concat([df1, df2], axis=0)  # type: pd.DataFrame
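On Python 3.6 and later, the same hint can be written as a variable annotation instead of a comment. This is a minimal sketch that reuses the data from the question; it assumes the IDE reads standard annotations, which PyCharm does.

```python
import pandas as pd

df1 = pd.DataFrame({1: [1, 2, 3], 2: [11, 22, 33]})
df2 = pd.DataFrame({1: [4], 2: [5]})

# Variable annotation: tells static tooling that df3 is a DataFrame,
# without wrapping the result in another DataFrame constructor call.
df3: pd.DataFrame = pd.concat([df1, df2], axis=0)

print(df3.shape)
```

The annotation has no runtime effect; it only informs the IDE and type checkers.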

Related

pandas DataFrame value_counts on column that stores DataFrame

I am trying to use value_counts() on a pandas DataFrame column that stores another DataFrame.
Is there a possibility to get the value_counts() function working (or something similar), without having to transform my DataFrames to Strings or Hashes or something like that?
I've tried to count the inner DataFrames, which breaks completely, and then I tried with arrays, which do not seem to be compared correctly either:
# importing pandas
import pandas as pd
import numpy as np
# Creating arrays
ar1 = np.array([11,22])
ar2 = np.array([11,22])
ar3 = np.array([33,44])
df = pd.DataFrame([
['0', ar1],
['1', ar2],
['2', ar3]
], columns =['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas
import pandas as pd
import numpy as np
# Creating DataFrames
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})
df = pd.DataFrame([
['0', df1],
['1', df2],
['2', df3]
], columns =['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I achieve the count of complex values in a DataFrame?
I'm honestly confused how either of those managed to run without raising an exception.
Neither np.array nor pd.DataFrame is hashable, and as far as I understand, hashing is necessary for value_counts.
Case in point, neither of your examples can be translated to its DataFrame.value_counts equivalent, because underneath it's doing df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any .value_counts method working on non-hashable columns.
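One possible workaround, sketched here under the assumption that each cell holds a small one-dimensional array, is to map every array to a tuple before counting, since tuples are hashable. The variable names keys and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["0", np.array([11, 22])],
        ["1", np.array([11, 22])],
        ["2", np.array([33, 44])],
    ],
    columns=["str", "ars"],
)

# Tuples are hashable, so convert each array before counting.
# tolist() yields plain Python ints rather than NumPy scalars.
keys = df["ars"].map(lambda arr: tuple(arr.tolist()))
counts = keys.value_counts()

print(counts)
```

This still converts the values to something hashable, but it avoids string formatting and keeps the elements comparable as numbers.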

how to add dataframe columns based on value of loop

I have a dataframe in Python called df. It contains two variables, Name and Age. I want to write a loop that generates 10 new columns, called Age_1, Age_2, Age_3, ..., Age_10, each containing the values of Age.
So far I have tried:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
for i in range(1,11):
    df[Age_'i'] = df['Age']
Just use this for loop:
for x in range(1, 11):
    df['Age_' + str(x)] = df['Age']
OR
for x in range(1, 11):
    df['Age_{}'.format(x)] = df['Age']
OR
for x in range(1, 11):
    df['Age_%s' % (x)] = df['Age']
Now if you print df, you will see the new columns Age_1 through Age_10.
Alternatively, you can use .assign with ** unpacking. Note that .assign returns a new DataFrame, so assign the result back:
df = df.assign(**{f'Age_{i}': df['Age'] for i in range(1, 11)})
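For reference, here is a self-contained sketch of the .assign approach using the sample data from the question. The range starts at 1 so the new columns run from Age_1 to Age_10, as the question asks.

```python
import pandas as pd

data = [["tom", 10], ["nick", 15], ["juli", 14]]
df = pd.DataFrame(data, columns=["Name", "Age"])

# Build a dict of new column names mapped to the Age column,
# then unpack it as keyword arguments to assign().
new_columns = {f"Age_{i}": df["Age"] for i in range(1, 11)}
df = df.assign(**new_columns)

print(df.columns.tolist())
```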

Compare two pandas data frame from csv

I have 2 CSV files and I need to compare them using pandas. The values in these two files are the same, so I expect the resulting df to be empty, but it shows that they are different. Am I missing something when I read the CSV files, or is there something else I should test or fix?
df1=pd.read_csv('apc2019.csv', sep = '|', lineterminator=True)
df2=pd.read_csv('apc2020.csv', sep = '|', lineterminator=True)
df = pd.concat([df1,df2]).drop_duplicates(keep=False)
print(df)
I'd recommend finding out what the difference is first, but that is hard with DataFrame.equals since it only gives you True or False. Can you try this?
from pandas.testing import assert_frame_equal
assert_frame_equal(df1, df2)
This will tell you exactly where the difference is, and it has different levels of 'tolerance' (for example, if you don't care about the column names or the types, etc.).
Details are in the pandas documentation for assert_frame_equal.
If you want to compare with a tolerance in values:
In [20]: from pandas.testing import assert_frame_equal
...: df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [1, 9]})
...: df2 = pd.DataFrame({'a': [1, 2], 'b': [3, 5], 'c': [1.5, 8.5]})
In [21]: assert_frame_equal(df1, df2, check_less_precise=-1, check_dtype=False)
By default check_dtype is True, so it will raise an exception if you have floats vs ints.
The other parameter to change is check_less_precise; by using negative values you make the allowed error bigger. Note that check_less_precise was deprecated in pandas 1.1 and removed in pandas 2.0; on newer versions use the rtol and atol parameters instead.
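On pandas 1.1 or later, the tolerance can be expressed with atol (absolute) and rtol (relative). The following is a minimal sketch with made-up data; the helper frames_match is hypothetical and simply turns the assertion into a boolean.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
right = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.5]})


def frames_match(df_a, df_b, **kwargs):
    """Hypothetical helper: True if assert_frame_equal passes, else False."""
    try:
        assert_frame_equal(df_a, df_b, **kwargs)
    except AssertionError as err:
        # The message names the column and values that differ.
        print(err)
        return False
    return True


strict = frames_match(left, right)            # default tolerances
loose = frames_match(left, right, atol=1.0)   # allow absolute error up to 1.0

print(strict, loose)
```

The printed assertion message is usually the quickest way to see which column of the two CSV-derived frames actually differs.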

Conditional on pandas DataFrame's

Let df1, df2, and df3 be pandas DataFrames with the same structure but different numerical values. I want to perform:
res=if df1>1.0: (df2-df3)/(df1-1) else df3
res should have the same structure as df1, df2, and df3 have.
numpy.where() generates result as a flat array.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["paramter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["paramter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
x y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2 0.324390 1.222632
3 -0.138606 0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a series you could write it as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
                   columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
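Say df is your initial dataframe, with columns df1, df2, and df3, and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust the values where your condition holds. Use .loc for the assignment; chained indexing such as df[mask]['res'] = ... writes to a temporary copy and leaves df unchanged.
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)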
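A DataFrame-only alternative is DataFrame.where, which keeps values where the condition is True and takes them from another frame elsewhere, so the index and columns are preserved without going through NumPy. This is a sketch with random data; the index labels r0 to r3 are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = ["r0", "r1", "r2", "r3"]
cols = list("xy")

df1 = pd.DataFrame(rng.normal(size=(4, 2)) + 1.0, index=idx, columns=cols)
df2 = pd.DataFrame(rng.normal(size=(4, 2)), index=idx, columns=cols)
df3 = pd.DataFrame(rng.normal(size=(4, 2)), index=idx, columns=cols)

# Keep the ratio where df1 > 1, otherwise fall back to df3.
ratio = (df2 - df3) / (df1 - 1)
res = ratio.where(df1 > 1, df3)

print(res)
```

The ratio is computed for every cell, including those where df1 <= 1, but those cells are discarded by where.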

Seaborn groupby pandas Series

I want to visualize my data into box plots that are grouped by another variable shown here in my terrible drawing:
So what I do is to use a pandas series variable to tell pandas that I have grouped variables so this is what I do:
import pandas as pd
import seaborn as sns
#example data for reproducibility
a = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
])
#converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
#Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, what I would have expected to get was to have two boxplots each describing only the first column, grouped by their corresponding column in the second column (the column converted to Series), while the above plot shows each column separately which is not what I want.
A column in a Dataframe is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
#example data for reproducibility
df = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit, giving columns a label makes it a bit more clear in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your DataFrame looks like this:
    a   b  grouper
0   2   5        1
1   4   9        2
2   5   3        1
3  10   6        2
4   9   7        2
5   3  11        1
and you want boxplots for columns a and b grouped by the column grouper, you should flatten the columns and change the groupby column to contain values like a1, a2, b1, etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame. The flattening of the hierarchy after pivoting is especially hard to read; I don't like it.
This is a new answer to an old question, because seaborn and pandas have changed across version updates. Because of these changes, Rutger's answer no longer works.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the log:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1] ,[4, 2],[5, 1],
[10, 2],[9, 2],[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn with x and y as parameter
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
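To see what the long-form frame looks like before it reaches seaborn, here is a pandas-only sketch of the same reshaping, with plotting omitted. The column names variable, value, and group_label are chosen here for readability and differ from the 'a' and 'b' names reused above.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 5, 1], [4, 9, 2], [5, 3, 1], [10, 6, 2], [9, 7, 2], [3, 11, 1]],
    columns=["a", "b", "grouper"],
)

# One row per (original column, original row) pair.
df_long = pd.melt(df, id_vars="grouper", var_name="variable", value_name="value")

# Combine the original column name with the group value: a1, a2, b1, b2.
df_long["group_label"] = df_long["variable"] + df_long["grouper"].astype(str)

# Each label collects the values that would end up in one box.
print(df_long.groupby("group_label")["value"].apply(list))
```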
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data: pd.DataFrame, col: str) -> pd.DataFrame:
    '''This function takes a DataFrame, groups by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    # group the DataFrame passed in as an argument, not a global df
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        # drop the grouping column and append the group key to each column name
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take groupby.
You will probably see:
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
A better approach is to group the data first and pass the grouped DataFrame to boxplot as data.
import seaborn as sns
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg('sum').reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here column B contains numeric values and the grouping is done on column A. The values in each group are summed, and the boxplot is drawn from the result. Hope this helps.