GroupBy Function Not Applying - pandas

I am trying to group the following specializations by Enterprise ID, but I am not getting the expected result (or any change, for that matter). The data stays ungrouped even after this step. Any idea what's wrong with my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')

Please try using agg:
import pandas as pd
df = pd.DataFrame(
[
['john', 'eng', 'build'],
['john', 'math', 'build'],
['kevin', 'math', 'asp'],
['nick', 'sci', 'spi']
],
columns = ['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in one row per id, with that id's spec values joined by ';'.
If you need to preserve the original number of rows, use transform instead. transform returns one column:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in the original four rows plus a spec_grouped column that holds the joined string for each row's id.
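As a hedged illustration of why the code in the question misbehaves: str(x) converts the whole group (a Series) into one string, so join then runs over its characters rather than its values. The sketch below reuses the sample frame from this answer; the names sample, chars and grouped are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["john", "eng", "build"],
        ["john", "math", "build"],
        ["kevin", "math", "asp"],
        ["nick", "sci", "spi"],
    ],
    columns=["id", "spec", "type"],
)

# str(Series) is the printed representation of the Series, so join()
# iterates over its characters (index labels, spaces, newlines, ...).
sample = df.loc[df["id"] == "john", "spec"]
chars = "; ".join(str(sample))

# Casting each element to str keeps the values intact. agg() returns one
# row per group, and reset_index() turns the key back into a column.
grouped = (
    df.groupby("id")["spec"]
    .agg(lambda x: "; ".join(x.astype(str)))
    .reset_index()
)
print(grouped)
```

Because grouped has one row per id, it can be written out with to_csv directly; transform would instead keep one row per original record.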

Related

series.str.split(expand=True) returns error: Wrong number of items passed 2, placement implies 1

I have a series of web addresses that I want to split at the first '.'. For example, return 'google' if the web address is 'google.co.uk'.
d1 = {'id':['1', '2', '3'], 'website':['google.co.uk', 'google.com.au', 'google.com']}
df1 = pd.DataFrame(data=d1)
d2 = {'id':['4', '5', '6'], 'website':['google.co.jp', 'google.com.tw', 'google.kr']}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
I use enumerate to iterate over the dataframe list:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)
Received error: ValueError: Wrong number of items passed 2, placement implies 1
With expand=True, str.split returns a DataFrame with one column per piece (think 'google' and 'co.uk'), and two columns cannot be assigned to a single column. You just want the first piece, so select column 0:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)[0]
Another alternative is to use extract. It is also ~40% faster for your data:
for i, df in enumerate(df_list):
    # raw string avoids an invalid-escape warning; expand=False returns a Series
    df_list[i]['website_segments'] = df['website'].str.extract(r'(.*?)\.', expand=False)
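A small sketch, using only the first sample frame from the question, that shows what each approach returns. The variable names parts, first_by_split and first_by_extract are chosen for the example.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "website": ["google.co.uk", "google.com.au", "google.com"],
    }
)

# expand=True yields a DataFrame with two columns (0 and 1), which is
# why assigning it to a single column raises the ValueError.
parts = df1["website"].str.split(".", n=1, expand=True)

# Option 1: keep only the first column of the split result.
first_by_split = parts[0]

# Option 2: capture everything before the first dot; expand=False
# returns a Series instead of a one-column DataFrame.
first_by_extract = df1["website"].str.extract(r"^([^.]*)\.", expand=False)

print(parts.shape)
print(first_by_split.tolist())
print(first_by_extract.tolist())
```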

Same calculation for all combinations in DataFrame

I have a dataframe like the one below:
import pandas as pd
df = pd.DataFrame({'CITY': ['A','B','C','A','C','B'],
'MAKE_NAME': ['SO','OK','CO','LU','CO','OK'],
'USER' : ['JK','JK','MK','JK','JK','JK'],
'RESULT_CODE' : ['Y','Y','N','N','Y','Y'],
'VALID' : [1,1,1,1,1,0],
'COUNT' : [1,1,1,1,1,1] })
I want to calculate VALID/COUNT for all combinations of two, three and four columns. I also want to get each result as a dataframe.
For example result for double like below,
Also result for triple like below,
Thanks for all,
You can find the solution below.
import pandas as pd
df = pd.DataFrame({'CITY': ['A','B','C','A','C','B'],
'MAKE_NAME': ['SO','OK','CO','LU','CO','OK'],
'USER' : ['JK','JK','MK','JK','JK','JK'],
'RESULT_CODE' : ['Y','Y','N','N','Y','Y'],
'VALID' : [1,1,1,1,1,0],
'COUNT' : [1,1,1,1,1,1] })
for i in df.columns:
    for j in df.columns:
        for k in df.columns:
            for l in df.columns:
                try:
                    # sum (not count), otherwise VALID / COUNT is always 1
                    a1 = df.groupby([i,j,k,l], as_index=False, sort=True, group_keys=True)[['VALID','COUNT']].sum()
                    a1['RATE'] = a1.VALID / a1.COUNT
                    print(a1)
                except Exception:
                    # skip key combinations that pandas cannot group on
                    pass
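As an alternative sketch, itertools.combinations can enumerate the pairs, triples and the quadruple of key columns directly, which avoids repeated keys and the blanket try/except. It assumes that the rate means summed VALID divided by summed COUNT and that only the four non-numeric columns are grouping keys; key_cols and results are names invented for this example.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    {
        "CITY": ["A", "B", "C", "A", "C", "B"],
        "MAKE_NAME": ["SO", "OK", "CO", "LU", "CO", "OK"],
        "USER": ["JK", "JK", "MK", "JK", "JK", "JK"],
        "RESULT_CODE": ["Y", "Y", "N", "N", "Y", "Y"],
        "VALID": [1, 1, 1, 1, 1, 0],
        "COUNT": [1, 1, 1, 1, 1, 1],
    }
)

# Assumption: these four columns are the grouping keys.
key_cols = ["CITY", "MAKE_NAME", "USER", "RESULT_CODE"]

results = {}  # maps a tuple of key columns to its result DataFrame
for size in (2, 3, 4):
    for keys in combinations(key_cols, size):
        rates = df.groupby(list(keys), as_index=False)[["VALID", "COUNT"]].sum()
        rates["RATE"] = rates["VALID"] / rates["COUNT"]
        results[keys] = rates

print(results[("CITY", "MAKE_NAME")])
```

Each entry of results is an ordinary DataFrame, so a specific combination can be looked up by its tuple of column names.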

Grouping and heading pandas dataframe

I have the following dataframe of securities and computed a 'liquidity score' in the last column, where 1 = liquid, 2 = less liquid, and 3 = illiquid. I want to group the securities (dynamically) by their liquidity. Is there a way to group them and include some kind of header for each group? How can this best be achieved? Below is the code and an example of how it is supposed to look.
import pandas as pd
df = pd.DataFrame({'ID':['XS123', 'US3312', 'DE405'], 'Currency':['EUR', 'EUR', 'USD'], 'Liquidity score':[2,3,1]})
df = df.sort_values(by=["Liquidity score"])
print(df)
# 1 = liquid, 2 = less liquid, 3 = illiquid
Add labels for liquidity score
The following maps the numbers in Liquidity score to labels and stores them in a new column:
df['grp'] = df['Liquidity score'].replace({1:'Liquid', 2:'Less liquid', 3:'Illiquid'})
Headers for each group
As per your comment, find below a solution to do this.
Let's illustrate this with a small data example.
df = pd.DataFrame({'ID':['XS223', 'US934', 'US905', 'XS224', 'XS223'], 'Currency':['EUR', 'USD', 'USD','EUR','EUR',]})
Insert a header on specific rows using np.insert.
import numpy as np
df = pd.DataFrame(np.insert(df.values, 0, values=["Liquid", ""], axis=0))
df = pd.DataFrame(np.insert(df.values, 2, values=["Less liquid", ""], axis=0))
df.columns = ['ID', 'Currency']
Using the pandas Styler, we can add a background color, change the font weight to bold and align the text to the left. On pandas versions before 1.4, use hide_index() instead of hide(axis='index').
df.style.hide(axis='index').set_properties(subset = pd.IndexSlice[[0,2], :], **{'font-weight' : 'bold', 'background-color' : 'lightblue', 'text-align': 'left'})
You can add a new column like this:
df['group'] = np.select(
[
df['Liquidity score'].eq(1),
df['Liquidity score'].eq(2)
],
[
'Liquid','Less liquid'
],
default='Illiquid'
)
Then set it as the index, so you can filter by group using the index:
df.set_index(['group','ID'], inplace=True)
df.loc['Less liquid',:]
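If a plain-text report is enough, a header per group can also be produced by iterating over the groupby object. This is only a sketch on the question's sample data; the labels mapping and the lines and report names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["XS123", "US3312", "DE405"],
        "Currency": ["EUR", "EUR", "USD"],
        "Liquidity score": [2, 3, 1],
    }
)

# Hypothetical mapping from score to header text.
labels = {1: "Liquid", 2: "Less liquid", 3: "Illiquid"}

lines = []
# groupby sorts the keys by default, so groups arrive in score order.
for score, group in df.groupby("Liquidity score"):
    lines.append(labels[int(score)])  # header line for this group
    lines.append(group[["ID", "Currency"]].to_string(index=False, header=False))

report = "\n".join(lines)
print(report)
```

Because the groups come from the data, new scores only need an entry in labels rather than hard-coded row positions.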

Collapsing a pandas dataframe into a single column of all items and their occurrences

I have a data frame consisting of a mixture of NaN's and strings e.g
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
ddf = pd.DataFrame(data)
I want to
1:count the total number of items and put in a new data frame e.g
NaN=2
tree=5
car=2
fish=1
cat=1
dog=1
2:Count the total number of items when compared to a separate, longer list (a column of another data frame), e.g
df['compare'] =
NaN
tree
car
fish
cat
dog
rabbit
Pear
Orange
snow
rain
Thanks
Jason
For the first question (the last two lines map the counts onto the longer list for the second):
from collections import Counter
import pandas as pd
data = {
"String1": ["NaN", "tree", "car", "tree"],
"String2": ["cat", "dog", "car", "tree"],
"String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)
a = Counter(ddf.stack().tolist())
df_result = pd.DataFrame(dict(a), index=['Count']).T
df = pd.DataFrame({'vals':['NaN', 'tree', 'car', 'fish', 'cat', 'dog', 'rabbit', 'Pear', 'Orange', 'snow', 'rain']})
df_counts = df.vals.map(df_result.to_dict()['Count'])
This should do :)
You can use the following code to count the items across the whole data frame.
import pandas as pd
data = {'String1':['NaN', 'tree', 'car', 'tree'],
'String2':['cat','dog','car','tree'],
'String3':['fish','tree','NaN','tree']}
df = pd.DataFrame(data)
def get_counts(df: pd.DataFrame) -> dict:
    res = {}
    for col in df.columns:
        vc = df[col].value_counts().to_dict()
        for k,v in vc.items():
            if k in res:
                res[k] += v
            else:
                res[k] = v
    return res
counts = get_counts(df)
Output
>>> print(counts)
{'tree': 5, 'car': 2, 'NaN': 2, 'cat': 1, 'dog': 1, 'fish': 1}
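Both parts of the question can also be sketched with value_counts and reindex. The example assumes, as in the question's sample, that 'NaN' is a literal string; real missing values would be dropped by value_counts unless dropna=False is passed. The names counts, compare and compared are chosen for the example.

```python
import pandas as pd

data = {
    "String1": ["NaN", "tree", "car", "tree"],
    "String2": ["cat", "dog", "car", "tree"],
    "String3": ["fish", "tree", "NaN", "tree"],
}
ddf = pd.DataFrame(data)

# Part 1: flatten all cells into one Series and count each value.
counts = pd.Series(ddf.to_numpy().ravel()).value_counts()

# Part 2: align the counts with a longer comparison list; items that
# never occur in ddf get a count of 0.
compare = ["NaN", "tree", "car", "fish", "cat", "dog",
           "rabbit", "Pear", "Orange", "snow", "rain"]
compared = counts.reindex(compare, fill_value=0)

print(counts)
print(compared)
```

counts.to_frame("Count") turns the result into a data frame if a Series is not enough.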

Quantile across rows and down columns using selected columns only [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns. In this example you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index holding the column names, which can be iterated like a list.
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by exact name or by regular expression. Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You also can use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
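The question asks for the name as a single string rather than a list. One possible sketch is to take the first match with next(), falling back to None when nothing matches; the variable name is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"spike-2": [1, 2, 3], "hey spke": [4, 5, 6], "no": [10, 11, 12]}
)

# First column whose name contains 'spike', or None if there is none.
name = next((col for col in df.columns if "spike" in col), None)

if name is not None:
    print(name)
    print(df[name])  # the plain string works as a normal column key
```

If several columns can match, the list-based answers are the safer choice, since next() silently ignores every match after the first.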
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)