Is there anything in Pandas similar to dplyr's 'list columns' - pandas

I'm currently transitioning from R to Python in my data analysis, and there's one thing I haven't seen in any tutorials out there: is there anything in Pandas similar to dplyr's 'list columns' ?
Link to reference:
https://www.rstudio.com/resources/webinars/how-to-work-with-list-columns/

pandas will accept any object type, including lists, in an object dtype column.
import pandas as pd

df = pd.DataFrame()
df['genre'] = ['drama, comedy, action', 'romance, sci-fi, drama', 'horror']
df.genre = df.genre.str.split(', ')
print(df, '\n', df.genre.dtype, '\n', type(df.genre[0]))
# Output:
                      genre
0   [drama, comedy, action]
1  [romance, sci-fi, drama]
2                  [horror]
 object
 <class 'list'>
We can see that:
genre is a column of lists.
The dtype of the genre column is object
The type of the first value of genre is list.
There are a number of str functions that work with lists.
For example:
print(df.genre.str.join(' | '))
# Output:
0     drama | comedy | action
1    romance | sci-fi | drama
2                      horror
Name: genre, dtype: object
print(df.genre.str[::2])
# Output:
0     [drama, action]
1    [romance, drama]
2            [horror]
Name: genre, dtype: object
Others can typically be done with an apply function if there isn't a built-in method:
print(df.genre.apply(lambda x: max(x)))
# Output:
0     drama
1    sci-fi
2    horror
Name: genre, dtype: object
See the pandas documentation on the str accessor functions for more.
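Incidentally, if what you are after is the tidyr::unnest side of list columns, pandas (0.25+) has DataFrame.explode, which turns each list element into its own row. A minimal sketch, continuing with the df built above:
# one row per list element; the original index is repeated
print(df.explode('genre'))
# Output:
     genre
0    drama
0   comedy
0   action
1  romance
1   sci-fi
1    drama
2   horror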
As for nesting dataframes within one another, it is possible, but I believe it's considered an anti-pattern, and pandas will fight you the whole way:
data = {'df1': df, 'df2': df}
df2 = pd.Series(list(data.values()), index=list(data.keys())).to_frame()
df2.columns = ['dfs']
print(df2)
# Output:
dfs
df1 genre
0 [drama, comedy...
df2 genre
0 [drama, comedy...
print(df2['dfs'][0])
# Output:
                      genre
0   [drama, comedy, action]
1  [romance, sci-fi, drama]
2                  [horror]
A possibly acceptable workaround would be storing them as NumPy arrays:
import numpy as np

df2 = df2.applymap(np.array)
print(df2)
print(df2['dfs'][0])
# Output:
dfs
df1 [[[drama, comedy, action]], [[romance, sci-fi,...
df2 [[[drama, comedy, action]], [[romance, sci-fi,...
array([[list(['drama', 'comedy', 'action'])],
       [list(['romance', 'sci-fi', 'drama'])],
       [list(['horror'])]], dtype=object)


Filter out rows in Spark dataframe based on condition

Example Spark dataframe:
product  type
table    Furniture
chair    Furniture
TV       Electronic
...
I want to drop all rows with type Electronic if there exists any row where type is Furniture.
The real data has millions of rows.
The easy way would be to count the rows with type Furniture and, if the count is greater than zero, drop the rows with type Electronic, but that seems inefficient.
Is there a way to do this efficiently?
Not sure if it's exposed in the PySpark API, but you can use ANY in a SQL expression:
from pyspark.sql.functions import col

chk = df.selectExpr('ANY(type = "Furniture") as chk').collect()[0]["chk"]

if chk:
    df_filtered = df.where(col("type") != "Electronic")
else:
    df_filtered = df
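If ANY isn't available in your Spark version, a cheaper variant of the count-based check mentioned in the question is to stop after the first matching row instead of counting everything (a sketch, not benchmarked):
from pyspark.sql import functions as F

# existence check: limit(1) can let Spark stop scanning once a Furniture row is found
has_furniture = df.filter(F.col('type') == 'Furniture').limit(1).count() > 0
df_filtered = df.filter(F.col('type') != 'Electronic') if has_furniture else df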
As far as I can understand, if any product is classified as Furniture, you want to remove that product's Electronic classification. E.g., if TV is classified as both Electronic and Furniture, you would like to remove the Electronic classification, so that TV is only classified as Furniture.
You will have to do some kind of aggregation. The following is a way using window functions:
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('table', 'Furniture'),
     ('chair', 'Furniture'),
     ('TV', 'Electronic'),
     ('TV', 'Furniture')],
    ['product', 'type'])

w = W.partitionBy('product').rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
df = df.withColumn('_types', F.collect_set('type').over(w))
df = df.filter((F.col('type') != 'Electronic') | F.forall('_types', lambda x: x != 'Furniture'))
df = df.drop('_types')
df.show()
# +-------+---------+
# |product| type|
# +-------+---------+
# | TV|Furniture|
# | chair|Furniture|
# | table|Furniture|
# +-------+---------+

Trying to print entire dataframe after str.replace on one column

I can't figure out why this is throwing the error:
KeyError(f"None of [{key}] are in the [{axis_name}]")
here is the code:
def get_company_name(df):
    company_name = [col for col in df if col.lower().startswith('comp')]
    return company_name

df = df[df[get_company_name(master_leads)[0]].str.replace(punc, '', regex=True)]
this is what df.head() looks like:
                                   Company / Account                       Website
0  Big Moose RV, & Boat Sales, Service, Camper Re...  https://bigmooservsales.com/
1           Holifield Pest Management of Hattiesburg                           NaN
2                            Steve Nichols Insurance                           NaN
3                                    Sandel Law Firm                 sandellaw.com
4    Duplicate - Checkered Flag FIAT of Newport News                           NaN
I have tried putting the [] in every place possible, but I must be missing something. I was under the impression that this is how you run transformations on one column of the dataframe without pulling the series out of the dataframe.
Thanks!
You can get the first company column name with:
company_name_col = [col for col in df if col.lower().startswith('comp')][0]
You can preview the cleaned-up company names with:
df[company_name_col].str.replace(punc, "", regex=True)
and apply the replacement in place with:
df[company_name_col] = df[company_name_col].str.replace(punc, "", regex=True)
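Putting the pieces together (a sketch; punc is assumed to be your own punctuation regex and df the frame shown in the question):
punc = r'[^\w\s]'   # assumed punctuation pattern; replace with your own

# pick the first column whose name starts with 'comp' (case-insensitive)
company_name_col = [col for col in df if col.lower().startswith('comp')][0]

# assign the cleaned Series back to the column instead of using it to index the frame
df[company_name_col] = df[company_name_col].str.replace(punc, '', regex=True)
print(df.head())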

Pandas dataframe replace contents based on ID from another dataframe

This is what my main dataframe looks like:
Group  IDs            New ID
1      [N23,N1,N12]   N102
2      [N134,N100]    N501
I have another dataframe that has all the required ID info in an unordered manner:
ID   Name   Age
N1   Milo   5
N23  Mark   21
N11  Jacob  22
I would like to modify the original dataframe such that all IDs are replaced with their respective names obtained from the other file, so that the dataframe has only names and no IDs and looks like this:
Group  IDs               New ID
1      [Mark,Silo,Bond]  Niki
2      [Troy,Fangio]     Kvyat
Thanks in advance
IIUC you can .explode your lists, replace values with .map and regroup them with .groupby
df['IDs'] = (df['IDs'].explode()
                      .map(df1.set_index('ID')['Name'])
                      .groupby(level=0).agg(list)
            )
If the New ID column is not a list, you only need .map():
df['New ID'] = df['New ID'].map(df1.set_index('ID')['Name'])
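A runnable sketch of that approach, using made-up frames shaped like the sample data (column names 'IDs' and 'New ID' assumed, and 'New ID' values chosen so they exist in the lookup):
import pandas as pd

df = pd.DataFrame({'Group': [1, 2],
                   'IDs': [['N23', 'N1', 'N11'], ['N11', 'N23']],
                   'New ID': ['N1', 'N23']})
df1 = pd.DataFrame({'ID': ['N1', 'N23', 'N11'],
                    'Name': ['Milo', 'Mark', 'Jacob'],
                    'Age': [5, 21, 22]})

lookup = df1.set_index('ID')['Name']
# explode the lists, map ID -> Name, then collect back into one list per row
df['IDs'] = df['IDs'].explode().map(lookup).groupby(level=0).agg(list)
df['New ID'] = df['New ID'].map(lookup)
print(df)
# Output:
   Group                  IDs New ID
0      1  [Mark, Milo, Jacob]   Milo
1      2        [Jacob, Mark]  Jacob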
You can try making a dict from your second DF and then replacing on the first using regex patterns (no need to fully understand it, check the comments below).
PS: since you didn't provide the full df with all the codes, I created one with some of them; that's why the print() won't replace all the values.
import pandas as pd

# creating dummy dfs
df1 = pd.DataFrame({"Group": [1, 2], "IDs": ["[N23,N1,N12]", "[N134,N100]"], "New ID": ["N102", "N501"]})
df2 = pd.DataFrame({"ID": ["N1", "N23", "N11", "N100"], "Name": ["Milo", "Mark", "Jacob", "Silo"], "Age": [5, 21, 22, 44]})

# create the lookup dict; regex patterns are used to force exact matches
dict_replace = df2.set_index("ID")['Name'].to_dict()

# 'f' before the string means f-string and 'r' means raw string (backslashes kept literally)
# \b is a regex word boundary marking the beginning and end of the match,
# so that if you're searching for N1, it won't match inside N11
dict_replace = {fr"\b{k}\b": v for k, v in dict_replace.items()}
# Replacing on original where you want it
df1['IDs'].replace(dict_replace, regex=True, inplace=True)
print(df1['IDs'].tolist())
# >>> ['[Mark,Milo,N12]', '[N134,Silo]']
Please note the change in my dataframes: in your sample data there are IDs in df that do not exist in df1, so I altered my df to ensure only IDs present in df1 are represented. I use the following dataframes:
print(df)
   Group           IDs   New
0      1  [N23,N1,N11]  N102
1      2     [N11,N23]  N501
print(df1)
    ID   Name  Age
0   N1   Milo    5
1  N23   Mark   21
2  N11  Jacob   22
Solution
Make a dict from df1.ID and df1.Name, map it onto an exploded df.IDs, and collect the result back into lists.
df['IDs'] = df['IDs'].str.strip('[]')  # strip the square brackets
df['IDs'] = df['IDs'].str.split(',')   # rebuild the list, because the column was stored as a string and could not be exploded directly
# explode the lists, map df1 names onto df.IDs and collect the result back into lists
df.explode('IDs').groupby('Group')['IDs'].apply(lambda x: x.map(dict(zip(df1.ID, df1.Name))).tolist()).reset_index()
   Group                  IDs
0      1  [Mark, Milo, Jacob]
1      2        [Jacob, Mark]

Creating a column based on pattern matching in pandas

I have a data frame containing two columns, 'Name' and 'Task'. I want to create a third column called 'task_category' based on matching conditions from a list. Please note the data below is only an example; I actually have hundreds of patterns to look for instead of the three shown here.
df = pd.DataFrame(
    {'Name': ["a", "b", "c"],
     'Task': ['went to trip', 'Mall Visit', 'Cinema']})
task_category = ['trip', 'Mall', 'Cinema']
  Name          Task task_category
0    a  went to trip          trip
1    b    Mall Visit          Mall
2    c        Cinema        Cinema
Use Series.str.extract():
pat = r'({})'.format('|'.join(task_category))
# '(trip|Mall|Cinema)'
df['task_category'] = df.Task.str.extract(pat)
print(df)
  Name          Task task_category
0    a  went to trip          trip
1    b    Mall Visit          Mall
2    c        Cinema        Cinema
I am using findall, since this will help you find the same keywords multiple times in a line:
df.Task.str.findall('|'.join(task_category)).str[0]
Out[1008]:
0      trip
1      Mall
2    Cinema
Name: Task, dtype: object
Sample
df = pd.DataFrame(
    {'Name': ["a", "b", "c"],
     'Task': ['went to trip Cinema', 'Mall Visit', 'Cinema']})
df.Task.str.findall('|'.join(task_category))
Out[1012]:
0    [trip, Cinema]
1            [Mall]
2          [Cinema]
Name: Task, dtype: object
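A small caveat with either approach: the keywords are joined straight into a regex pattern, so if any of them contain regex metacharacters (e.g. '+' or '('), escaping them first is safer:
import re

pat = r'({})'.format('|'.join(map(re.escape, task_category)))
# expand=False keeps a Series when there is a single capture group
df['task_category'] = df.Task.str.extract(pat, expand=False)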

One hot encoding categorical features - Sparse form only

I have a dataframe that has int and categorical features. The categorical features come in two types: numbers and strings.
I was able to one-hot encode the int columns and the categorical columns that are numbers, but I get an error when I try to one-hot encode categorical columns that are strings:
ValueError: could not convert string to float: '13367cc6'
Since the dataframe is huge and has high cardinality, I only want to convert it to a sparse form. I would prefer a solution that uses from sklearn.preprocessing import OneHotEncoder since I am familiar with it.
I checked other questions too but none of them addresses what I am asking.
data = [[623, 'dog', 4], [123, 'cat', 2],[623, 'cat', 1], [111, 'lion', 6]]
The above dataframe contains 4 rows and 3 columns
Column names - ['animal_id', 'animal_name', 'number']
Assume that animal_id and animal_name are stored in pandas as category and number as int64 dtype.
Assuming you have the following DF:
In [124]: df
Out[124]:
   animal_id animal_name  number
0        623         dog       4
1        123         cat       2
2        623         cat       1
3        111        lion       6
In [125]: df.dtypes
Out[125]:
animal_id         int64
animal_name    category
number            int64
dtype: object
First, save the animal_name column (if you need it in the future):
In [126]: animal_name = df['animal_name']
Convert the animal_name column to a categorical (memory-saving) numeric column:
In [127]: df['animal_name'] = df['animal_name'].cat.codes.astype('category')
In [128]: df
Out[128]:
   animal_id animal_name  number
0        623           1       4
1        123           0       2
2        623           0       1
3        111           2       6
In [129]: df.dtypes
Out[129]:
animal_id         int64
animal_name    category
number            int64
dtype: object
Now OneHotEncoder should work:
In [130]: enc = OneHotEncoder()
In [131]: enc.fit(df)
Out[131]:
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
In [132]: X = enc.fit(df)
In [134]: X.n_values_
Out[134]: array([624, 3, 7])
In [135]: enc.feature_indices_
Out[135]: array([ 0, 624, 627, 634], dtype=int32)
FYI, there are other powerful encoding schemes which do not add a large number of columns the way one-hot encoding does (in fact, they do not add any columns at all), such as count encoding and target encoding. For more details, see my answer here and my ipynb here.
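One more note on OneHotEncoder itself: in more recent scikit-learn releases (roughly 0.20 onward) it accepts string columns directly and returns a SciPy sparse matrix by default, so the integer-coding detour above may not be needed. A minimal sketch, assuming such a version is installed:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]],
                  columns=['animal_id', 'animal_name', 'number'])

enc = OneHotEncoder(handle_unknown='ignore')          # sparse output is the default
X = enc.fit_transform(df[['animal_id', 'animal_name']])
print(X.shape)            # (4, 6): 3 animal_id categories + 3 animal_name categories
print(enc.categories_)    # learned categories per encoded column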