Collapsing a pandas DataFrame into a single column of all items and their occurrences - pandas

I have a data frame consisting of a mixture of NaNs and strings, e.g.
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
ddf = pd.DataFrame(data)
I want to
1: Count the total number of occurrences of each item and put them in a new data frame, e.g.
NaN=2
tree=5
car=2
fish=1
cat=1
dog=1
2: Count the occurrences of each item in a separate, longer list (a column of another data frame), e.g.
df['compare'] =
NaN
tree
car
fish
cat
dog
rabbit
Pear
Orange
snow
rain
Thanks
Jason

For the first question:
import pandas as pd
from collections import Counter
data = {
"String1": ["NaN", "tree", "car", "tree"],
"String2": ["cat", "dog", "car", "tree"],
"String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)
a = Counter(ddf.stack().tolist())
df_result = pd.DataFrame(dict(a), index=['Count']).T
df = pd.DataFrame({'vals':['NaN', 'tree', 'car', 'fish', 'cat', 'dog', 'rabbit', 'Pear', 'Orange', 'snow', 'rain']})
df_counts = df.vals.map(df_result.to_dict()['Count'])
This should do it :)

You can use the following code to count the items across the whole data frame.
import pandas as pd
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
df = pd.DataFrame(data)
def get_counts(df: pd.DataFrame) -> dict:
    # Sum the per-column value counts into one dictionary
    res = {}
    for col in df.columns:
        vc = df[col].value_counts().to_dict()
        for k, v in vc.items():
            if k in res:
                res[k] += v
            else:
                res[k] = v
    return res
counts = get_counts(df)
Output
>>> print(counts)
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}
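Both parts of the question can also be sketched with pandas built-ins alone. This is a minimal sketch rather than either answerer's method. It assumes the 'NaN' entries are literal strings, as in the example (real missing values would be dropped by stack()), and the names counts, compare and compared are illustrative.

```python
import pandas as pd

data = {
    "String1": ["NaN", "tree", "car", "tree"],
    "String2": ["cat", "dog", "car", "tree"],
    "String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)

# Part 1: flatten all columns into one Series and count each distinct item.
counts = ddf.stack().value_counts()

# Part 2: look up those counts for a longer list of items.
# Items that never occur in ddf get a count of 0.
compare = ["NaN", "tree", "car", "fish", "cat", "dog",
           "rabbit", "Pear", "Orange", "snow", "rain"]
compared = counts.reindex(compare, fill_value=0)

print(counts)
print(compared)
```

reindex keeps the order of the comparison list, which may be convenient if the result is written back next to df['compare'].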

Related

GroupBy Function Not Applying

I am trying to groupby for the following specializations but I am not getting the expected result (or any for that matter). The data stays ungrouped even after this step. Any idea what's wrong in my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd
df = pd.DataFrame(
[
['john', 'eng', 'build'],
['john', 'math', 'build'],
['kevin', 'math', 'asp'],
['nick', 'sci', 'spi']
],
columns = ['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
This returns one row per id, with that id's spec values joined by ';'.
If you need to preserve the original number of rows, use transform, which returns a column of the same length as the input:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
The data frame keeps all four rows, and each row's spec_grouped holds the joined string for its id.
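As for why the original code appears not to group: str(x) converts the whole group Series into a single string (its printed form), so join iterates over its characters, and transform never reduces the number of rows. A minimal sketch of the agg approach on the question's column names follows. The sample rows are made up for illustration, and astype(str) is an assumption to guard against non-string values.

```python
import pandas as pd

# Hypothetical sample data using the question's column names
specials = pd.DataFrame({
    "Enterprise ID": [1, 1, 2],
    "Specialization": ["eng", "math", "sci"],
})

# Cast each element to str (not the whole Series), then join per group.
grouped = (
    specials.groupby("Enterprise ID")["Specialization"]
    .agg(lambda x: "; ".join(x.astype(str)))
    .reset_index()
)

print(grouped)
```

reset_index turns the group key back into a regular column, so the result can be written out with to_csv(index=False) as in the question.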

Conditional mapping among columns of two data frames with Pandas Data frame

I need advice on how to map columns between data frames.
I have simplified the example to make it easier to follow:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
"X": [],
"Y": [],
"Z": []
})
df2 = pd.DataFrame({
"A": ['', '', 'A1'],
"C": ['', '', 'C1'],
"D": ['D1', 'Other', 'D3'],
"F": ['', '', ''],
"G": ['G1', '', 'G3'],
"H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
For each row, column X of df1 should take its value from columns A, C, and D of df2, checked in that order. The search stops at the first non-empty value, which is selected.
2nd step:
If the selected value is "Other", column X should instead take the first non-empty value from columns F, G, and H, checked in that order.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns per row"""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]
col_x = first_non_empty(df2, ['A','C','D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F','G','H']))
df1['X'] = col_x
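A self-contained check of this approach against the question's data might look like the sketch below. The variable names primary, fallback and x are illustrative and not part of the answer.

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": ["", "", "A1"],
    "C": ["", "", "C1"],
    "D": ["D1", "Other", "D3"],
    "F": ["", "", ""],
    "G": ["G1", "", "G3"],
    "H": ["H1", "H2", "H3"],
})

def first_non_empty(df, cols):
    """First non-empty value per row, scanning cols from left to right."""
    # Empty strings become missing, then bfill pulls the first real value
    # into the leftmost column of each row.
    return df[cols].replace("", pd.NA).bfill(axis=1).iloc[:, 0]

primary = first_non_empty(df2, ["A", "C", "D"])
fallback = first_non_empty(df2, ["F", "G", "H"])

# Where the primary lookup selected "Other", use the fallback lookup instead.
x = primary.mask(primary == "Other", fallback)
print(x.tolist())
```

Both lookups are computed for every row and mask then picks between them, which keeps the logic vectorized at the cost of evaluating the fallback even where it is not needed.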

Grouping and heading pandas dataframe

I have the following dataframe of securities, with a computed 'liquidity score' in the last column, where 1 = liquid, 2 = less liquid, and 3 = illiquid. I want to group the securities (dynamically) by their liquidity. Is there a way to group them and include some kind of header for each group? How can this best be achieved? Below is the code and an example of how it is supposed to look.
import pandas as pd
df = pd.DataFrame({'ID':['XS123', 'US3312', 'DE405'], 'Currency':['EUR', 'EUR', 'USD'], 'Liquidity score':[2,3,1]})
df = df.sort_values(by=["Liquidity score"])
print(df)
# 1 = liquid, 2 = less liquid, 3 = illiquid
Add labels for liquidity score
The following replaces the numbers in Liquidity score with labels:
df['grp'] = df['Liquidity score'].replace({1:'Liquid', 2:'Less liquid', 3:'Illiquid'})
Headers for each group
As per your comment, find below a solution to do this.
Let's illustrate this with a small data example.
df = pd.DataFrame({'ID':['XS223', 'US934', 'US905', 'XS224', 'XS223'], 'Currency':['EUR', 'USD', 'USD','EUR','EUR',]})
Insert a header on specific rows using np.insert.
import numpy as np
df = pd.DataFrame(np.insert(df.values, 0, values=["Liquid", ""], axis=0))
df = pd.DataFrame(np.insert(df.values, 2, values=["Less liquid", ""], axis=0))
df.columns = ['ID', 'Currency']
Using Pandas styler, we can add a background color, change font weight to bold and align the text to the left.
# hide(axis="index") replaces hide_index(), which was removed in pandas 2.0
df.style.hide(axis="index").set_properties(subset = pd.IndexSlice[[0,2], :], **{'font-weight' : 'bold', 'background-color' : 'lightblue', 'text-align': 'left'})
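The np.insert calls above hard-code the header positions. Since the question asks for dynamic grouping, one possible variation is to build the header rows from a groupby. This is a sketch under assumptions: the labels dictionary and the names pieces and report are illustrative, and the data is the question's three-row example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["XS123", "US3312", "DE405"],
    "Currency": ["EUR", "EUR", "USD"],
    "Liquidity score": [2, 3, 1],
})
labels = {1: "Liquid", 2: "Less liquid", 3: "Illiquid"}

pieces = []
# groupby iterates over the scores in sorted order (1, 2, 3).
for score, block in df.groupby("Liquidity score"):
    # One header row per group, followed by the group's securities.
    header = pd.DataFrame({"ID": [labels[int(score)]], "Currency": [""]})
    pieces.append(header)
    pieces.append(block[["ID", "Currency"]])

report = pd.concat(pieces, ignore_index=True)
print(report)
```

The positions of the header rows can then be recovered with report["ID"].isin(labels.values()) if they are needed for styling.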
You can add a new column like this:
df['group'] = np.select(
[
df['Liquidity score'].eq(1),
df['Liquidity score'].eq(2)
],
[
'Liquid','Less liquid'
],
default='Illiquid'
)
Then set it as the index, so you can filter by group:
df.set_index(['group','ID'], inplace=True)
df.loc['Less liquid',:]

pandas: Calculate the rowwise max of categorical columns

I have a DataFrame containing 2 columns of ordered categorical data (of the same category). I want to construct another column that contains the categorical maximum of the first 2 columns. I set up the following.
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
cats = CategoricalDtype(categories=['small', 'normal', 'large'], ordered=True)
data = {
'A': ['normal', 'small', 'normal', 'large', np.nan],
'B': ['small', 'normal', 'large', np.nan, 'small'],
'desired max(A,B)': ['normal', 'normal', 'large', 'large', 'small']
}
df = pd.DataFrame(data).astype(cats)
The columns can be compared, although the np.nan items are problematic, as running the following code shows.
df['A'] > df['B']
The manual suggests that max() works on categorical data, so I try to define my new column as follows.
df[['A', 'B']].max(axis=1)
This yields a column of NaN. Why?
The following code constructs the desired column using the comparability of the categorical columns. I still don't know why max() fails here.
dfA = df['A']
dfB = df['B']
conditions = [dfA.isna(), (dfB.isna() | (dfA >= dfB)), True]
cases = [dfB, dfA, dfB]
df['maxAB'] = np.select(conditions, cases)
One workaround is to assign an integer value to each category that reflects its order, take the row-wise maximum of those integers, and map the result back to the category names.
# size string -> integer value mapping
size2int_map = {
'small': 0,
'normal': 1,
'large': 2
}
# integer value -> size string mapping
int2size_map = {
0: 'small',
1: 'normal',
2: 'large'
}
# create columns containing the integer value for each size string
for c in df:
    df['%s_int' % c] = df[c].map(size2int_map)
# apply the int2size map back to get the string sizes back
print(df[['A_int', 'B_int']].max(axis=1).map(int2size_map))
and you should get
0 normal
1 normal
2 large
3 large
4 small
dtype: object
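Because the columns are already ordered categoricals, the integer codes that pandas stores for each category can stand in for the hand-written mapping. The following is a sketch of that variation, not the answerer's code. It relies on missing values having code -1, so any real category wins the row-wise maximum, and the names codes and max_codes are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

cats = CategoricalDtype(categories=["small", "normal", "large"], ordered=True)
df = pd.DataFrame({
    "A": ["normal", "small", "normal", "large", np.nan],
    "B": ["small", "normal", "large", np.nan, "small"],
}).astype(cats)

# Integer position of each value in the category order; NaN becomes -1.
codes = df[["A", "B"]].apply(lambda s: s.cat.codes)

# Row-wise maximum of the codes, converted back to the same categorical dtype.
max_codes = codes.max(axis=1)
df["maxAB"] = pd.Categorical.from_codes(max_codes.to_numpy(), dtype=cats)

print(df)
```

The result stays an ordered categorical, and a row where both A and B are missing would come out as NaN.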

How to split every string in a list in a dataframe column

I have a dataframe with a column containing lists of strings of the form 'A:B'. I'd like to add a new column that contains, for each row, the set of first elements obtained by splitting each string on ':'.
data = [
{'Name': 'A', 'Servers':['A:s1', 'B:s2', 'C:s3', 'C:s2']},
{'Name': 'B', 'Servers':['B:s1', 'C:s2', 'B:s3', 'A:s2']},
{'Name': 'C', 'Servers':['G:s1', 'X:s2', 'Y:s3']}
]
df = pd.DataFrame(data)
df
df['Clusters'] = [
{'A', 'B', 'C'},
{'B', 'C', 'A'},
{'G', 'X', 'Y'}
]
You can use apply:
In [5]: df['Clusters'] = df['Servers'].apply(lambda x: {p.split(':')[0] for p in x})
In [6]: df
Out[6]:
  Name                   Servers   Clusters
0    A  [A:s1, B:s2, C:s3, C:s2]  {A, B, C}
1    B  [B:s1, C:s2, B:s3, A:s2]  {C, B, A}
2    C        [G:s1, X:s2, Y:s3]  {X, Y, G}
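The same idea can be written with a named helper instead of a lambda, which may be easier to reuse and test. This is a sketch: cluster_names is a hypothetical helper name, and using partition instead of split is an assumption so that an entry without a ':' is kept whole rather than needing special handling.

```python
import pandas as pd

data = [
    {"Name": "A", "Servers": ["A:s1", "B:s2", "C:s3", "C:s2"]},
    {"Name": "B", "Servers": ["B:s1", "C:s2", "B:s3", "A:s2"]},
    {"Name": "C", "Servers": ["G:s1", "X:s2", "Y:s3"]},
]
df = pd.DataFrame(data)

def cluster_names(servers):
    """Return the set of prefixes before the first ':' in each entry."""
    return {entry.partition(":")[0] for entry in servers}

# apply calls the helper once per row's list of server strings.
df["Clusters"] = df["Servers"].apply(cluster_names)
print(df)
```

Sets are unordered, so the printed order of the elements within each set can differ from run to run.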