Same calculation for all combinations in DataFrame - pandas

I have a dataframe like the one below:
import pandas as pd

df = pd.DataFrame({'CITY': ['A','B','C','A','C','B'],
                   'MAKE_NAME': ['SO','OK','CO','LU','CO','OK'],
                   'USER': ['JK','JK','MK','JK','JK','JK'],
                   'RESULT_CODE': ['Y','Y','N','N','Y','Y'],
                   'VALID': [1,1,1,1,1,0],
                   'COUNT': [1,1,1,1,1,1]})
I want to calculate VALID/COUNT for every double, triple, and quadruple combination of columns, and I want the result as a dataframe.
For example, for a double I would expect one result table per pair of columns, and similarly for triples and quadruples.
Thanks in advance!

You can find the solution below.
import pandas as pd

df = pd.DataFrame({'CITY': ['A','B','C','A','C','B'],
                   'MAKE_NAME': ['SO','OK','CO','LU','CO','OK'],
                   'USER': ['JK','JK','MK','JK','JK','JK'],
                   'RESULT_CODE': ['Y','Y','N','N','Y','Y'],
                   'VALID': [1,1,1,1,1,0],
                   'COUNT': [1,1,1,1,1,1]})

for i in df.columns:
    for j in df.columns:
        for k in df.columns:
            for l in df.columns:
                try:
                    # sum VALID and COUNT per group, then take the ratio
                    # (with .count() the RATE would always be 1)
                    a1 = df.groupby([i, j, k, l], as_index=False, sort=True)[['VALID', 'COUNT']].sum()
                    a1['RATE'] = a1.VALID / a1.COUNT
                    print(a1)
                except Exception:
                    # some column combinations are invalid; ignore and move on
                    pass
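
The quadruple-nested loop revisits every ordering of the same columns and leans on the except clause to discard invalid groupings. A tidier sketch using itertools.combinations, assuming the four categorical columns are the only grouping candidates, visits each unique combination exactly once:

from itertools import combinations

# df as defined above; assume only the categorical columns are grouping keys
group_cols = ['CITY', 'MAKE_NAME', 'USER', 'RESULT_CODE']

results = {}
for r in (2, 3, 4):  # doubles, triples, quadruples
    for combo in combinations(group_cols, r):
        a1 = df.groupby(list(combo), as_index=False)[['VALID', 'COUNT']].sum()
        a1['RATE'] = a1.VALID / a1.COUNT
        results[combo] = a1

print(results[('CITY', 'USER')])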


Filtering list based on array columns to create a [dict(str:list)] result

My table (df) looks like this:

category  product_in_cat
cat1      [A, B, C]
cat2      [E, F, G]
"category" is str, and product_in_cat is list type. I have a list:product=[A,B,G]
I want to get a final [dict(str:list)] looks like:
[{cat1:[A,B]},{cat2:[G]}]
I think I can use code like the below:
list1 = []
for index, row in df.iterrows():
    list1.append({row['category']: row['product_in_cat'] in product})
I know the row['product_in_cat'] in product part is not correct, but I am not sure how to filter the list column based on the given product list. Please help, and thank you in advance!
You can use np.intersect1d to find the common part of two lists:
import numpy as np
df_ = df['product_in_cat'].apply(lambda x: np.intersect1d(x, product).tolist())
l = [{k: v} for k, v in zip(df['category'], df_)]
print(l)
[{'cat1': ['A', 'B']}, {'cat2': ['G']}]
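One caveat (an addition, not part of the original answer): np.intersect1d returns the sorted unique common values, so the original order inside each row's list is not preserved. If order matters, a plain list comprehension keeps it:

# order-preserving alternative to np.intersect1d
df_ = df['product_in_cat'].apply(lambda x: [item for item in x if item in product])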
You can convert each list in the column to a set and intersect it with the external product list:
import pandas as pd

lst = ['A','B','G']
data = {'category': ['cat 1','cat 2'],
        'product_in_cat': [['A','B','C'], ['E','F','G']]}
df = pd.DataFrame(data)
dict(zip(df['category'], df['product_in_cat'].apply(lambda x: set(x).intersection(lst))))
#output
{'cat 1': {'A', 'B'}, 'cat 2': {'G'}}
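
Note that this yields sets inside a single dict rather than the asker's list of single-key dicts with list values; if that exact shape is needed, a small adaptation (sorted() here just gives the lists a deterministic order) converts it:

result = [{cat: sorted(set(prods).intersection(lst))}
          for cat, prods in zip(df['category'], df['product_in_cat'])]
print(result)  # [{'cat 1': ['A', 'B']}, {'cat 2': ['G']}]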

GroupBy Function Not Applying

I am trying to group by the following specializations, but I am not getting the expected result (or any, for that matter). The data stays ungrouped even after this step. Any idea what's wrong with my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd

df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:

           spec
id
john   eng;math
kevin      math
nick        sci
If you need to preserve the original number of rows, use transform; it returns one value per input row:
df['spec_grouped'] = df.groupby(['id'])['spec'].transform(lambda x: ';'.join(x))
df
results in:

      id  spec   type spec_grouped
0   john   eng  build     eng;math
1   john  math  build     eng;math
2  kevin  math    asp         math
3   nick   sci    spi          sci
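
For completeness, the reason the original code seemed to do nothing: transform(lambda x: '; '.join(str(x))) stringifies the whole group Series, so join iterates over the characters of that string rather than over the group's values, and transform by design keeps every row, which is why the data looked ungrouped. Assuming the goal was one row per Enterprise ID, a sketch of a corrected version of the original line:

# join the Specialization values within each Enterprise ID group
specials = (specials.groupby('Enterprise ID')['Specialization']
                    .agg('; '.join)
                    .reset_index())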

Pandas get row if column is a substring of string

I can do the following if I want to extract rows whose column "A" contains the substring "hello".
df[df['A'].str.contains("hello")]
How can I select rows whose column value is a substring of another word? e.g.
df["hello".contains(df['A'].str)]
Here's an example dataframe:
df = pd.DataFrame.from_dict({"A":["hel"]})
df["hello".contains(df['A'].str)]
IIUC, you could apply str.find, which returns -1 when the substring is not found (hence the .ne(-1) filter below):
import pandas as pd
df = pd.DataFrame(['hell', 'world', 'hello'], columns=['A'])
res = df[df['A'].apply("hello".find).ne(-1)]
print(res)
Output

       A
0   hell
2  hello
As an alternative, use __contains__:
res = df[df['A'].apply("hello".__contains__)]
print(res)
Output

       A
0   hell
2  hello
Or simply:
res = df[df['A'].apply(lambda x: x in "hello")]
print(res)
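All three variants call Python code per row; a list comprehension (an alternative not in the original answers) does the same membership test and is usually a touch faster. Note that with any of these approaches, an empty string counts as a substring of "hello":

# boolean mask built with a list comprehension instead of Series.apply
res = df[[a in "hello" for a in df['A']]]
print(res)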

Pandas dataframe append to column containing list

I have a pandas dataframe with one column that contains an empty list in each cell.
I need to duplicate the dataframe, and append it at the bottom of the original dataframe, but with additional information in the list.
Here is a minimal code example:
df_main = pd.DataFrame([['a', []], ['b', []]], columns=['letter', 'mylist'])

> df_main
  letter mylist
0      a     []
1      b     []
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = row.mylist.append(1)
pd.concat([df_copy, df_main], ignore_index=True)
> result:
  letter mylist
0      a   None
1      b   None
2      a    [1]
3      b    [1]
As you can see, there is a problem: the empty list [] was replaced by None.
Just to make sure, this is what I would like to have:
  letter mylist
0      a     []
1      b     []
2      a    [1]
3      b    [1]
How can I achieve that?
The append method on a list returns None, which is why None appears in the final dataframe. Use the + operator for reassignment instead:
import pandas as pd

df_main = pd.DataFrame([['a', []], ['b', []]], columns=['letter', 'mylist'])
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = row.mylist + [1]
pd.concat([df_main, df_copy], ignore_index=True).head()
Output of this block of code:
  letter mylist
0      a     []
1      b     []
2      a    [1]
3      b    [1]
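A caveat worth flagging (an addition, not part of the original answer): writing to the row objects yielded by iterrows is not guaranteed to propagate back to the dataframe, and under pandas 2.x copy-on-write it generally will not. A version that does not depend on that behavior builds the new column directly:

# build fresh lists per row instead of mutating through iterrows
df_copy = df_main.copy()
df_copy['mylist'] = [lst + [1] for lst in df_copy['mylist']]
pd.concat([df_main, df_copy], ignore_index=True)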
A workaround would be to create a temporary column mylist2 filled with empty lists via np.empty((len(df), 0)).tolist(), use np.where() to change the None values of mylist to an empty list, and then drop the temporary column.
import numpy as np
import pandas as pd

df_main = pd.DataFrame([['a', []], ['b', []]], columns=['letter', 'mylist'])
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = row.mylist.append(1)
df = pd.concat([df_copy, df_main], ignore_index=True)
# len(df) only exists after the concat, so assign mylist2 here
df['mylist2'] = np.empty((len(df), 0)).tolist()
df['mylist'] = np.where(df['mylist'].isnull(), df['mylist2'], df['mylist'])
df = df.drop('mylist2', axis=1)
df
Out[1]:
  letter mylist
0      a     []
1      b     []
2      a    [1]
3      b    [1]
Not only does the list append method return None, as indicated in the first answer, but both df_main and df_copy contain pointers to the same list objects. So after:
for index, row in df_copy.iterrows():
    row.mylist.append(1)
both dataframes end up with updated one-element lists. For your code to work as expected, you can create new lists after you copy the dataframe:
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = []
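A quick way to see the sharing (a sketch, not part of the original answer; DataFrame.copy() copies the frame but not the Python objects stored in its cells):

df_copy = df_main.copy()
# both frames reference the very same list object in each cell
print(df_copy.loc[0, 'mylist'] is df_main.loc[0, 'mylist'])  # True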
This question is another great example why we should not put objects in a dataframe.

Find column whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; here is an example that ends up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of column names
[col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
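Keep in mind (a usage note beyond the original answer) that str.contains treats pat as a regular expression by default, so a search string containing metacharacters like . or + can match more than intended; pass regex=False for a literal match:

# literal (non-regex) match
colNames = df.columns[df.columns.str.contains('spike', regex=False)]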
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name or by regular expression. Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)
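
As a closing note (an addition beyond the original answers), the same start/end selections can be written without regex anchors using Index.str.startswith and Index.str.endswith:

# equivalent start/end matches without regular expressions
starts = df.columns[df.columns.str.startswith('spike')].tolist()
ends = df.columns[df.columns.str.endswith('spike')].tolist()
print(starts)  # ['spike_starts']
print(ends)    # ['ends_spike']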