Pandas column cleanup

I have a dataset with a complex column in pandas. The productInfo column has several types of content:
#Input type1
df['productInfo'][0]
#Output type1
'Salt & pepper shakers,Material: stoneware,Dimensions: H6.5cm,Dachshund designs,1x black and tan, 1x brown,Hand painted,Dishwasher safe'
#Output type2
'Pineapple string lights,Dimensions: 400x6x10cm,10 pineapple shaped LED lights,In a gold hue,3x AA batteries required (not included)'
#Output type3
''
So essentially my productInfo column contains the above three kinds of content.
What I want is to extract the Material from the productInfo column for groupby analysis, but of course only when the value exists; if it doesn't, just set it to null/None or whatever.
I have tried boolean masks but can't seem to make them work; any suggestion is highly appreciated.
Thanks in advance.
Edit:
This was my original df:
[screenshot: original df]
My df after extracting Material from productInfo:
[screenshot: df after extracting Material]
My df after extracting Material and Dimensions from productInfo:
[screenshot: df after extracting Material and Dimensions]
Hopefully you get an idea of what I'm trying to achieve. Most of my task is text extraction from complex text blobs inside each column.
If I find the relevant values in the text clumps using regex, I update the columns; otherwise I make them null. It has proven to be a big challenge, so if any of you can help me extract the useful info like Material and Dimensions from the productInfo text clump into their own columns, that would be very helpful.
Thanks in advance to anyone who tries to help, and sorry for the vague question.
Happy Panda-ing (if that's a word!!)
:)

I imported pandas and re
import pandas as pd
import re
I created a helper function that uses a simple regex to get the material and dimensions. It deletes the matched material and dimension fragments from the original string and returns a Series with the updated description, material, and dimensions.
def get_material_and_dimensions(row):
    description = row['productInfo']
    # Capture everything between 'Material: ' and the next comma
    material = re.search(r'Material: (.*?),', description)
    if material:
        material = material.group(1)
        # Remove the matched fragment from the description
        description = description.replace(f'Material: {material},', '')
    dimensions = re.search(r'Dimensions: (.*?),', description)
    if dimensions:
        dimensions = dimensions.group(1)
        description = description.replace(f'Dimensions: {dimensions},', '')
    # material/dimensions stay None (NaN in the frame) when no match is found
    return pd.Series([description, material, dimensions], index=['description', 'material', 'dimensions'])
Apply the function to the DataFrame:
myseries = df.apply(get_material_and_dimensions, axis=1)
Then add the series to the original DataFrame, replacing df['productInfo'] with the cleaned df['description']:
df = df.join(myseries)
df['productInfo'] = df['description']
df.drop('description', inplace=True, axis=1)
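As a possible alternative to the row-wise apply, Series.str.extract can do the same pulls vectorized. A sketch, assuming the fields are comma-separated as in the examples above (note that ([^,]*) also matches when the field is last and has no trailing comma, unlike the (.*?), pattern):
import pandas as pd

# Sketch: pull each labelled field straight out of productInfo.
# ([^,]*) captures up to the next comma or the end of the string;
# rows without a match come back as NaN automatically.
df['material'] = df['productInfo'].str.extract(r'Material: ([^,]*)', expand=False)
df['dimensions'] = df['productInfo'].str.extract(r'Dimensions: ([^,]*)', expand=False)
The NaN rows then drop out naturally in df.groupby('material') aggregations, which matches the null behaviour asked for.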

Related

pandas: split pandas columns of unequal length list into multiple columns

I have a dataframe with one column of unequal-length lists which I want to split into multiple columns (the item values will be the column names). An example is given below.
I have done it through iterrows, iterating through the rows and examining the list from each row. It seems workable as my dataframe has few rows; however, I wonder if there is a cleaner method.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the list breaks down as shown below.
Thanks for your help.
Can you try this code? It's built assuming venue_df.location contains the lists you have shown in the cells:
venue_df['school'] = venue_df.location.apply(lambda x: int('school' in x))
venue_df['office'] = venue_df.location.apply(lambda x: int('office' in x))
venue_df['home'] = venue_df.location.apply(lambda x: int('home' in x))
venue_df['public_area'] = venue_df.location.apply(lambda x: int('public_area' in x))
Hope this helps!
First let's explode your location column, so we can work toward the end result you want:
s = df['Location'].explode()
Then use crosstab on that series (it needs the index and the values as its two arguments):
import pandas as pd
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
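For a concrete end-to-end picture, a small self-contained sketch (the lists here are made up, since the original frame wasn't posted):
import pandas as pd

# Hypothetical stand-in for venue_df; only the list-column shape matters.
venue_df = pd.DataFrame({'location': [['school', 'home'],
                                      ['office'],
                                      ['home', 'public_area']]})

s = venue_df['location'].explode()   # one row per list item, index repeats
dummies = pd.crosstab(s.index, s)    # rows = original index, columns = items
venue_df = venue_df.join(dummies)
print(venue_df)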

Extracting a value from a pd dataframe

I have a dataframe column such as below.
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/narrative%20film"}},"color":16734574,"parent_id":11,"name":"Narrative Film","id":31,"position":13,"slug":"film & video/narrative film"}
I want to extract the info against the word 'slug' (in this instance it is film & video/narrative film) and store it as a new dataframe column.
How can I do this?
Many thanks
This is a (nested) dictionary with different kinds of entries, so it does not make much sense to treat it as a DataFrame column. You could treat it as a DataFrame row, with the dictionary keys giving the column names:
import pandas as pd
dict = {"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/narrative%20film"}},
"color":16734574, "parent_id":11, "name":"Narrative Film", "id":31, "position":13,
"slug":"film & video/narrative film"}
df = pd.DataFrame(dict, index=[0])
display(df)
Output:
urls color parent_id name id position slug
0 NaN 16734574 11 Narrative Film 31 13 film & video/narrative film
Note that the urls entry is not recognized, due to the sub-dictionary.
In any case, this does yield slug as a column, so please let me know if this answers your question.
Of course you could also extract the slug entry directly from your dictionary:
d['slug']
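If the dictionaries actually live one per row in a column, you could pull slug out directly. A sketch, assuming a hypothetical column name 'category' holding either dicts or raw JSON strings:
import json
import pandas as pd

# 'category' is a made-up column name for illustration.
def get_slug(value):
    if isinstance(value, str):      # raw JSON text, e.g. straight from a CSV
        value = json.loads(value)
    return value.get('slug')        # None when the key is absent

df['slug'] = df['category'].apply(get_slug)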

Filtering out columns with Pandas

I want to filter a pandas dataframe to only keep columns that contain a certain wildcard, and then also keep the two columns directly to the right of each match.
The dataframe is tracking pupil grades, overall total and feedback. I only want to keep the data that corresponds to Homework and not other assessments. So in the example below I would want to keep First Name, Last Name, any Homework column, and the corresponding Points and Feedback columns, which are always exported to the right of it.
First Name,Last Name,Understanding Business Homework,Points,Feedback,Past Paper Homework,Points,Feedback, Groupings/Structures Questions,Points, Feedback
import pandas as pd
import numpy as np
all_data = all_data.filter(like=('Homework') and ('First Name') and ('Second Name') and ('Points'), axis=1)
print(all_data.head())
export_csv = all_data.to_csv (r'C:\Users\Sandy\Python\Automate_the_Boring_Stuff\new.csv', index = None, header=True)
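For what it's worth, filter(like=...) takes a single substring, and chaining strings with and just evaluates to the last one ('Points' here), so the attempt above filters on the wrong thing. A sketch of one positional way to get the intended selection, assuming the column order shown in the header above (pandas mangles the duplicate Points/Feedback names to Points.1 etc., which the positional slice handles fine):
import pandas as pd

all_data = pd.read_csv('grades.csv')   # placeholder file name

keep = ['First Name', 'Last Name']
cols = list(all_data.columns)
for i, name in enumerate(cols):
    if 'Homework' in name:
        keep.extend(cols[i:i + 3])     # the Homework column plus the two to its right
homework_only = all_data[keep]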

Reading csv file with pandas

I have a csv file with 2 columns, text and boolean (y/n), and I am trying to put all the positive values in one file and the negative ones in another. Here is what I tried:
df = pd.read_csv('text_trait_with_binary_EXT.csv','rb',delimiter=',',quotechar='"')
#print(df)
df.columns = ["STATUS", "xEXT"]
positive = []
negative = []
for line in df:
    text = line[0].strip()
    if line[1].strip() == "y":
        positive.append(text)
    elif line[1].strip() == "n":
        negative.append(text)
print(positive)
print(negative)
And when I run this it just gives an empty list!
I am new to using pandas, so if any of you can help that would be great.
As others have commented, there is almost always a better approach than using iteration in Pandas. It has a lot of built-in functions to help avoid loops.
If I understand your intentions correctly, you want to take the values from column 1 (named 'STATUS'), split them according to whether the corresponding value in column 2 (named 'xEXT') is 'y' or 'n', and generate two lists containing the column 1 values. The following should work (to be used after the first two lines of code you posted):
positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()
Here is a link to the documentation on loc, which is useful for problems like this.
The above solution assumes that your data has been read in correctly. If it does not work for you, please do as others have commented and add a small sample of your data so that we are able to try out our proposed solutions.
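For a concrete picture, a tiny self-contained demo of the .loc pattern above (the data here is made up):
import pandas as pd

df = pd.DataFrame({'STATUS': ['likes pandas', 'hates bugs', 'loves numpy'],
                   'xEXT': ['y', 'n', 'y ']})
positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()
print(positive)  # ['likes pandas', 'loves numpy']
print(negative)  # ['hates bugs']
The str.strip() call is what makes the comparison robust to stray whitespace like the 'y ' in the last row.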

Unable to fit_transform data from csv file in sklearn

I am trying to learn some classification in Scikit-learn. However, I couldn't figure out what this error means.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data_frame = pd.read_csv('data.csv', header=0)
data_in_numpy = data_frame.values
c = CountVectorizer()
c.fit_transform(data_in_numpy.data)
This throws an error:
NotImplementedError: multi-dimensional sub-views are not implemented
How can I go around this issue? One record from my csv file looks like:
Time Directors Actors Rating Label
123 Abc, Def A, B,c 7.2 1
I suppose this error is due to the fact that there is more than one value under the Directors or Actors column.
Any help would be appreciated.
Thanks,
According to the docstring, sklearn.feature_extraction.text.CountVectorizer will:
Convert a collection of text documents to a matrix of token counts
So then why, I wonder, are you inputting numerical values?
Try transforming only the strings (directors and actors):
data_frame['X'] = data_frame[['Directors', 'Actors']].apply(lambda x: ' '.join(x), axis=1)
data_in_numpy = data_frame['X'].values
First though, you might want to clean the data up by removing the commas.
data_frame['Directors'] = data_frame['Directors'].str.replace(',', ' ')
data_frame['Actors'] = data_frame['Actors'].str.replace(',', ' ')
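Putting it together, a sketch of the whole flow under the same assumptions (the 'data.csv' name and column layout come from the question):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data_frame = pd.read_csv('data.csv', header=0)

# Drop the commas so 'Abc, Def' tokenizes as two names, not odd tokens
data_frame['Directors'] = data_frame['Directors'].str.replace(',', ' ')
data_frame['Actors'] = data_frame['Actors'].str.replace(',', ' ')

# Feed CountVectorizer a 1-D iterable of strings, not the whole 2-D array
text = data_frame['Directors'] + ' ' + data_frame['Actors']
c = CountVectorizer()
X = c.fit_transform(text)   # sparse document-term matrix of token counts
print(X.shape)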