Pandas DataFrame: remove � (unknown-character) from strings in rows

I have read a CSV file into Python 2.7 on a Windows machine. The Sales Price column seems to be a mixture of strings and floats, and some rows contain a euro symbol (€), which Python displays as �.
df = pd.read_csv('sales.csv', thousands=',')
print df
Gender Size Color Category Sales Price
Female 36-38 Blue Socks 25
Female 44-46 Pink Socks 13.2
Unisex 36-38 Black Socks � 19.00
Unisex 40-42 Pink Socks � 18.50
Female 38 Yellow Pants � 89,00
Female 43 Black Pants � 89,00
I assumed that a simple replace would solve it:
df=df.replace('\�','',regex=True).astype(float)
But I got an encoding error:
SyntaxError: Non-ASCII character
I would appreciate hearing your thoughts on this.

I faced a similar problem where one of the columns in my dataframe had lots of currency symbols: euro, dollar, yen, pound, etc. I tried multiple solutions, but the easiest one was to use the unicodedata module.
import unicodedata
df['Sales Price'] = df['Sales Price'].str.replace(unicodedata.lookup('EURO SIGN'), 'Euro')
The above replaces € with Euro in the Sales Price column.

I think @jezrael's comment is valid. First you need to read the file with an explicit encoding (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html under the encoding parameter):
df=pd.read_csv('sales.csv', thousands=',', encoding='utf-8')
Then, to replace the euro sign, try this (the u prefix is needed in Python 2.7 so that \u20AC is read as a Unicode escape):
df=df.replace(u'\u20AC','',regex=True).astype(float)
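As a rough illustration of the idea above, here is a minimal sketch that cleans only the Sales Price column instead of casting the whole dataframe to float. It assumes Python 3, rebuilds the sample data in memory rather than reading sales.csv, and assumes that a comma in values such as "89,00" is a decimal separator; none of this is confirmed by the question.

```python
import pandas as pd

# Sample values rebuilt from the question; "\u20ac" is the euro sign.
df = pd.DataFrame({
    "Sales Price": [25, 13.2, "\u20ac 19.00", "\u20ac 18.50",
                    "\u20ac 89,00", "\u20ac 89,00"],
})

cleaned = (
    df["Sales Price"]
    .astype(str)                             # mixed str/float -> str
    .str.replace("\u20ac", "", regex=False)  # drop the euro sign
    .str.replace(",", ".", regex=False)      # assumption: decimal comma
    .str.strip()                             # drop leftover spaces
)

# Values that still cannot be parsed become NaN instead of raising.
df["Sales Price"] = pd.to_numeric(cleaned, errors="coerce")
print(df)
```

If the commas in the real file are thousands separators instead, the second replace would have to remove them rather than turn them into dots.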

Related

Matplotlib output line chart looks "box like" (for lack of a better word) for monthly data sampled over a 30 year period

I am making a very simple chart with Matplotlib and Python. The data is 30 years' worth of monthly samples (PMI, the US Purchasing Managers' Index), around 400 monthly observations in total.
Sample monthly data:
Date PMI
1/03/2022 57.1
1/02/2022 58.6
1/01/2022 57.6
1/12/2021 58.8
1/11/2021 60.6
1/10/2021 60.8
1/09/2021 60.5
I produced a very simple line chart with Matplotlib. The dataframe is named pmi. Date is not set as the index, but the dates have been converted to pandas datetime.
plt.plot(pmi.Date, pmi.PMI, c='mediumblue', lw=0.8)
Output:
Why does the output look so box-like? It seems to me that it doesn't capture all the data available in the dataframe. I'm sure it does, though, so is this a formatting issue? How do you smooth the output line to remove the sharp, edge-like breaks?

Delete number of string characters from left in Column B depending on length of string in column A of pandas dataframe

As described in the title, I have the following problem:
The data arrives as a pandas dataframe that looks as follows:
Article Title
A0 A00183
BB2 BB2725
C2C3 C2C3945
As you can see, the "Title" column repeats the string value of the "Article" column.
I want this to be deleted, so that the table looks as follows:
Article Title
A0 0183
BB2 725
C2C3 945
I want to do this with Pandas.
I already found out how to read the length of the string in the Article column, so I know the number of characters to remove for each row:
df1['Length of Article string:'] = df1['Article:'].apply(len)
But I cannot figure out how to delete the leading characters from the Title column when their number changes from row to row.
Thanks for your help!
Kind regards
I tried the pandas documentation and found some hints about split and strip, but I do not have enough know-how to implement them.

You can pass a list built from the Article column as the patterns to replace:
df["Title"] = df["Title"].replace(df["Article"].tolist(), "", regex=True)
print(df)
Article Title
0 AA 0123
1 BBB 234
2 CCCC 345

You can use replace() with a lambda function applied row by row:
dfx = df[['Article','Title']].apply(lambda x : x['Title'].replace((x['Article']), ''), axis=1)
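Both answers above remove the Article text wherever it occurs in Title. If only the leading copy should go, a row-wise slice is one option. This is a minimal sketch on the question's sample data; the helper name strip_article_prefix is made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Article": ["A0", "BB2", "C2C3"],
    "Title": ["A00183", "BB2725", "C2C3945"],
})

def strip_article_prefix(row):
    """Drop the Article text from the start of Title, if it is there."""
    title, article = row["Title"], row["Article"]
    if title.startswith(article):
        return title[len(article):]
    return title  # leave titles that do not start with the article code

df["Title"] = df.apply(strip_article_prefix, axis=1)
print(df)
```

Slicing by len(article) uses the length the asker already computed, and the startswith check keeps rows untouched when the prefix is missing.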

Need to parse PRN file into spark dataframe

I have a PRN file that looks like the sample at the end of this question.
How can I convert it into a Spark dataframe?
The schema is not fixed; I can have different files with different headers.
As you can see, Credit Limit is a single column while Postcode and Phone are separate columns. This might need some kind of intelligent parsing solution.
dataframe = dataframe
.withColumn("Name", dataframe.col("value").substr(0, 15))
.withColumn("Address", dataframe.col("value").substr(16, 22))
.withColumn("Postcode", dataframe.col("value").substr(38, 9))
.withColumn("Phone", dataframe.col("value").substr(47, 13))
.withColumn("Credit Limit", dataframe.col("value").substr(61, 13))
.withColumn("Birthday", dataframe.col("value").substr(74, 8))
dataframe.schema.fields.foreach(f=>{
dataframe=dataframe.withColumn(f.name,trim(col(f.name)))
})
dataframe = dataframe.drop(dataframe("value"))
I used a fixed-width parser. Can someone suggest a more intelligent parsing solution?
Name Address Postcode Phone Credit Limit Birthday
Johnson, John Voorstraat 32 3122gg 020 3849381 1000000 19870101
Anderson, Paul Dorpsplein 3A 4532 AA 030 3458986 10909300 19651203
Wicket, Steve Mendelssohnstraat 54d 3423 ba 0313-398475 93400 19640603
Benetar, Pat Driehoog 3zwart 2340 CC 06-28938945 54 19640904

pandas read_csv file type with double quotes and no-double quotes

Hi, I have a CSV with this format:
Headers: SKU, Product_Name, product_id
3735,[Freebies PC] - Holyshield! Sunscreen Comfort Corrector Serum SPF 50+ PA++++ 5 mL,154674
4568,"Consumables Mika furit 500 gr #250 (16x12x11) packaging grape, orange)",202737
2403,Laurier Active Day Super Maxi 30 Pcs,8992727002714
I want to read the CSV into a dataframe. The problem is that some product names contain "," and are not read properly. I checked other sources and tried using sep, but only some product names contain commas while others don't. How can I read it properly?
I tried using
productList = pd.read_csv('products/products.csv', encoding='utf-8', engine='python')
It returns:
sku | Product_Name | product_id
3735 | [Freebies PC] - Holyshield! Sunscreen Comfort Corrector Serum SPF 50+ PA++++ 5 mL | 154674
4568,"Consumables Mika furit 500 gr #250 (16x12x11) packaging grape, orange)",202737 | nan | nan
42403 | Laurier Active Day Super Maxi 30 Pcs | 8992727002714
The expected output is:
sku | Product_Name | product_id
3735 | [Freebies PC] - Holyshield! Sunscreen Comfort Corrector Serum SPF 50+ PA++++ 5 mL | 154674
4568 | Consumables Mika furit 500 gr #250 (16x12x11) packaging grape, orange) | 202737
42403 | Laurier Active Day Super Maxi 30 Pcs | 8992727002714
How can I do so?

Content of sample.csv file:
product_id,product_name,sku_number
2168,Sanjin Watermelon Frost Obat Sariawan Powder/Bubuk,6903193004029
3798,Common Grounds Cloak & Dagger Instant Coffee 1 Sachets,313166
3799,Common Grounds Ethiopia Guji Instant Coffee 1 Sachets,175744
3580,Emina Glossy Stain Lip Tint Autumn Bell 3gr,8993137707220
"3795,""Hansaplast Kasa Steril - 7,5 x 7,5cm"",8999777016043"
"2997,""Panda GP71 2,5mm"",616920"
It seems that the export process from the database generates malformed rows for some reason. If you cannot correct that process, a possible solution is the following:
import pandas as pd
from io import StringIO
with open('sample.csv', 'r') as f:
    data = f.read().replace(',""', '","').replace('"",', '","')
df = pd.read_csv(StringIO(data))
df
It returns the dataframe with each row split into the three columns.
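An alternative to the string replacement above is to let the csv module parse each line and re-parse any line that comes back as a single field, which is what a fully quoted row looks like. This is a minimal sketch under the assumption that the broken rows always wrap the entire line in quotes, as in sample.csv; the data is inlined here so the snippet runs on its own, and every column is left as a string.

```python
import csv
import io

import pandas as pd

# A few rows from sample.csv, inlined for a self-contained example.
raw = '''product_id,product_name,sku_number
2168,Sanjin Watermelon Frost Obat Sariawan Powder/Bubuk,6903193004029
"3795,""Hansaplast Kasa Steril - 7,5 x 7,5cm"",8999777016043"
"2997,""Panda GP71 2,5mm"",616920"
'''

rows = []
for fields in csv.reader(io.StringIO(raw)):
    if not fields:
        continue  # skip blank lines
    if len(fields) == 1:
        # The whole line was quoted: parse the inner text as CSV again.
        fields = next(csv.reader([fields[0]]))
    rows.append(fields)

# First parsed row is the header; all values stay as strings.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Unlike the chained replace calls, this does not touch quotes inside correctly exported rows, at the cost of a Python-level loop over the file.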

Unable to create new features in Machine learning

I have a dataset loaded into a pandas dataframe named df.
The dataset has 50,000 rows. Here are the first 5:
Name_Restaurant cuisines_available Average cost
Food Heart Japnese, chinese 60$
Spice n Hungary Indian, American, mexican 42$
kfc, Lukestreet Thai, Japnese 29$
Brown bread shop American 11$
kfc, Hypert mall Thai, Japnese 40$
I want to create a column that contains the number of cuisines available.
I am trying this code:
df['no._of_cuisines_available']=df['cuisines_available'].str.len()
Instead of showing the number of cuisines, it shows the number of characters.
For example, for the first row the output should be 2, but it shows 17.
I also need a new column that contains the number of stores for each restaurant. For example, kfc has 2 stores here: kfc, Lukestreet and kfc, Hypert mall. I have no idea how to code this.

i)
df['cuisines_available'].str.split(',').apply(len)
ii)
df['Name_Restaurant'].str.split(',', expand=True).melt()['value'].str.strip().value_counts()
What ii) does: it splits the column at ',' and stores each resulting piece in its own column. Then melt stacks those columns into one long column, strip removes the surrounding spaces, and value_counts counts the individual entries.
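To attach both results to the dataframe as columns, something like the following could work. It is a minimal sketch on the five sample rows, and it assumes that the text before the first comma in Name_Restaurant identifies the restaurant; the column names brand and no_of_stores are made up for illustration.

```python
import pandas as pd

# First five rows from the question (spelling kept as in the data).
df = pd.DataFrame({
    "Name_Restaurant": ["Food Heart", "Spice n Hungary", "kfc, Lukestreet",
                        "Brown bread shop", "kfc, Hypert mall"],
    "cuisines_available": ["Japnese, chinese", "Indian, American, mexican",
                           "Thai, Japnese", "American", "Thai, Japnese"],
})

# Number of cuisines: split on commas, then take the length of each list.
df["no._of_cuisines_available"] = df["cuisines_available"].str.split(",").str.len()

# Assumption: the part before the first comma is the restaurant name.
df["brand"] = df["Name_Restaurant"].str.split(",").str[0].str.strip()

# Number of rows (stores) sharing the same brand, repeated on each row.
df["no_of_stores"] = df.groupby("brand")["brand"].transform("count")
print(df)
```

Using transform keeps the result aligned with the original rows, so each kfc row carries the value 2 while the other restaurants carry 1.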
