pandas read_csv file type with double quotes and no-double quotes - pandas

Hi, I have a CSV with this format:
Headers: SKU, Product_Name, product_id
3735,[Freebies PC] - Holyshield! Sunscreen Comfort Corrector Serum SPF 50+ PA++++ 5 mL,154674
4568,"Consumables Mika furit 500 gr #250 (16x12x11) packaging grape, orange)",202737
2403,Laurier Active Day Super Maxi 30 Pcs,8992727002714
I want to be able to read the CSV as a dataframe; the problem is that some product names contain "," which prevents the rows from being parsed properly. I checked other sources suggesting a custom sep, but some product names contain the separator and others don't. How can I read it properly?
I tried using
productList = pd.read_csv('products/products.csv', encoding='utf-8', engine='python')
It returns:
sku      Product_Name                                                                         product_id
3735     [Freebies PC] - Holyshield! Sunscreen Comfort Corrector Serum SPF 50+ PA++++ 5 mL    154674
4568,"Consumables Mika furit 500 gr #250 (16x12x11) packaging grape, orange)",202737    nan    nan
42403    Laurier Active Day Super Maxi 30 Pcs                                                 8992727002714
expected output is
sku      Product_Name                                                                         product_id
3735     [Freebies PC] - Holyshield! Sunscreen Comfort Corrector Serum SPF 50+ PA++++ 5 mL    154674
4568     Consumables Mika furit 500 gr #250 (16x12x11) packaging grape, orange)               202737
42403    Laurier Active Day Super Maxi 30 Pcs                                                 8992727002714
How can I do so?

Content of sample.csv file:
product_id,product_name,sku_number
2168,Sanjin Watermelon Frost Obat Sariawan Powder/Bubuk,6903193004029
3798,Common Grounds Cloak & Dagger Instant Coffee 1 Sachets,313166
3799,Common Grounds Ethiopia Guji Instant Coffee 1 Sachets,175744
3580,Emina Glossy Stain Lip Tint Autumn Bell 3gr,8993137707220
"3795,""Hansaplast Kasa Steril - 7,5 x 7,5cm"",8999777016043"
"2997,""Panda GP71 2,5mm"",616920"
It seems like the export process from the db generates malformed rows for some reason. If you are not able to correct that process, a possible solution is the following:
import pandas as pd
from io import StringIO

with open('sample.csv', 'r') as f:
    # collapse the doubled quotes back into ordinary field delimiters
    data = f.read().replace(',""', '","').replace('"",', '","')

df = pd.read_csv(StringIO(data))
df
Returns

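To see why the two replace calls fix the parsing, here is what they do to one of the malformed rows from sample.csv above; the doubled quotes collapse back into ordinary quoted fields that read_csv can handle:

line = '"3795,""Hansaplast Kasa Steril - 7,5 x 7,5cm"",8999777016043"'
fixed = line.replace(',""', '","').replace('"",', '","')
print(fixed)
# "3795","Hansaplast Kasa Steril - 7,5 x 7,5cm","8999777016043"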
Related

Need to parse PRN file into spark dataframe

I have a PRN file which looks like this (sample shown below).
How can I convert it into a Spark dataframe?
The schema is not fixed; I can have different files with different headers.
As you can see, Credit Limit is a single column while Postcode and Phone are separate columns. This might need some kind of intelligent parsing solution.
dataframe = dataframe
  .withColumn("Name", dataframe.col("value").substr(0, 15))
  .withColumn("Address", dataframe.col("value").substr(16, 22))
  .withColumn("Postcode", dataframe.col("value").substr(38, 9))
  .withColumn("Phone", dataframe.col("value").substr(47, 13))
  .withColumn("Credit Limit", dataframe.col("value").substr(61, 13))
  .withColumn("Birthday", dataframe.col("value").substr(74, 8))

dataframe.schema.fields.foreach(f => {
  dataframe = dataframe.withColumn(f.name, trim(col(f.name)))
})

dataframe = dataframe.drop(dataframe("value"))
I used a fixed-width parser; can someone suggest another, more intelligent parsing solution?
Name            Address                Postcode Phone        Credit Limit  Birthday
Johnson, John   Voorstraat 32          3122gg   020 3849381  1000000       19870101
Anderson, Paul  Dorpsplein 3A          4532 AA  030 3458986  10909300      19651203
Wicket, Steve   Mendelssohnstraat 54d  3423 ba  0313-398475  93400         19640603
Benetar, Pat    Driehoog 3zwart        2340 CC  06-28938945  54            19640904
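Not a Spark answer, but if the file really is fixed width, pandas can infer the column boundaries from the whitespace instead of hard-coding substr offsets; a minimal sketch (the file name customers.prn is an assumption):

import pandas as pd

# Sketch only: read_fwf infers fixed-width column boundaries from the first
# rows of the file, so the offsets do not have to be maintained by hand.
# 'customers.prn' is an assumed file name.
df = pd.read_fwf('customers.prn', colspecs='infer')
print(df.dtypes)
print(df.head())

If a Spark dataframe is still required, the result can then be handed over with spark.createDataFrame(df).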

Loop over pandas dataframe to create multiple networks

I have data on countries' trade with one another. I have split the main file by month and got 12 CSV files for the year 2019. A sample of the January CSV is provided below:
reporter partner year month trade
0 Albania Argentina 2019 01 515256
1 Albania Australia 2019 01 398336
2 Albania Austria 2019 01 7664503
3 Albania Bahrain 2019 01 400
4 Albania Bangladesh 2019 01 653907
5 Zimbabwe Zambia 2019 01 79569855
I want to make a complex network for every month and print the number of nodes of each network. Now I can do it the hard (stupid) way, like so:
df01 = pd.read_csv('012019.csv')
df02 = pd.read_csv('022019.csv')
df03 = pd.read_csv('032019.csv')
df1= df01[['reporter','partner', 'trade']]
df2= df02[['reporter','partner', 'trade']]
df3= df03[['reporter','partner', 'trade']]
G1 = nx.Graph()
G1 = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
G1.number_of_nodes()
and so on for the next networks.
My question is: how can I use a "for loop" to read the files, convert them to networks from dataframes, and report the number of nodes of each network?
I tried this but nothing is reported.
for f in glob.glob('.csv'):
    df = pd.read_csv(f)
    df1 = df[['reporter', 'partner', 'trade']]
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='trade')
    G.number_of_nodes()
Thanks.
Edit:
OK, so I managed to do the above using similar code like below:
for files in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(files)
    df1 = df[['reporter', 'partner', 'import']]
    G = nx.Graph()
    G = nx.from_pandas_edgelist(df1, 'reporter', 'partner', edge_attr='import')
    nx.write_graphml_lxml(G, "/home/user/VMShared/network/2nd/*.graphml")
The problem that I now face is how to write separate files. All I get from this is one file titled *.graphml. How can I get a graphml file for every input file? Also, getting the same graphml output name as the input file would be a plus.
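One way to get a separate .graphml per month is to derive the output name from each input path; a minimal sketch of that idea, reusing the directory and column names from the edit above:

import glob
import os
import networkx as nx
import pandas as pd

for path in glob.glob('/home/user/VMShared/network/2nd/*.csv'):
    df = pd.read_csv(path)
    edges = df[['reporter', 'partner', 'import']]
    G = nx.from_pandas_edgelist(edges, 'reporter', 'partner', edge_attr='import')
    print(path, G.number_of_nodes())  # report the node count per month

    # e.g. .../012019.csv -> .../012019.graphml
    out_path = os.path.splitext(path)[0] + '.graphml'
    nx.write_graphml_lxml(G, out_path)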

Averaging dataframes with many string columns and display back all the columns

I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display the data on the GUI together with the information in the non-numeric columns. The non-numeric columns have info such as name, roll number and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe, but fails when I combine two or more dataframes: it returns only the average of the numeric columns and displays that, leaving the non-numeric columns undisplayed. Below is one of the codes I've tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(df.index, level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
Working code should return averages in something like this:
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear; however, guessing the origin of the question from its content, I have modified your dataframes (which were not well formed) by adding a stream called 'CENTRAL', see:
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to merge the two dataframes and find the average:
df3 = Df2.append(Df1)
df3.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
Outcome
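A closely related variant, shown here only as a sketch (it builds on the corrected Df1 and Df2 above and is not part of the original answer): pick the group keys automatically from the non-numeric columns so the names and streams survive, and average only the marks.

import pandas as pd

# Sketch: group by every non-numeric column (plus ADM, which is numeric but is
# an identifier rather than a mark) and average the remaining numeric columns.
df = pd.concat([Df1, Df2])
group_cols = df.select_dtypes(exclude='number').columns.tolist()  # ['STREAM', 'NAME']
averaged = df.groupby(group_cols + ['ADM'], as_index=False).mean(numeric_only=True)
print(averaged)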

Pandas DataFrame: remove � (unknown-character) from strings in rows

I have read a CSV file into Python 2.7 (Windows machine). The Sales Price column seems to be a mixture of strings and floats, and some rows contain a euro symbol €, which Python displays as �.
df = pd.read_csv('sales.csv', thousands=',')
print df
Gender Size Color Category Sales Price
Female 36-38 Blue Socks 25
Female 44-46 Pink Socks 13.2
Unisex 36-38 Black Socks � 19.00
Unisex 40-42 Pink Socks � 18.50
Female 38 Yellow Pants � 89,00
Female 43 Black Pants � 89,00
I was under the assumption that a simple line with replace would solve it:
df = df.replace('\�', '', regex=True).astype(float)
But I got an encoding error:
SyntaxError: Non-ASCII character
Would appreciate hearing your thoughts on this
I faced a similar problem where one of the columns in my dataframe had lots of currency symbols: euro, dollar, yen, pound, etc. I tried multiple solutions, but the easiest one was to use the unicodedata module.
import unicodedata

df['Sales Price'] = df['Sales Price'].str.replace(unicodedata.lookup('EURO SIGN'), 'Euro')
The above will replace € with 'Euro' in the Sales Price column.
I think @jezrael's comment is valid. First you need to read the file with an encoding (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html, under the encoding section):
df=pd.read_csv('sales.csv', thousands=',', encoding='utf-8')
but for replacing the euro sign try this:
df=df.replace('\u20AC','',regex=True).astype(float)
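Putting the two pieces together, a minimal end-to-end sketch that reads the file with an explicit encoding and cleans only the Sales Price column; the assumptions are that the file is UTF-8 encoded and that the comma in values like 89,00 is a decimal separator:

import pandas as pd

df = pd.read_csv('sales.csv', encoding='utf-8')
df['Sales Price'] = (
    df['Sales Price']
    .astype(str)
    .str.replace(u'\u20ac', '', regex=False)  # strip the euro sign
    .str.replace(',', '.', regex=False)       # assumed decimal comma -> dot
    .str.strip()
    .astype(float)
)
print(df.dtypes)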

How can I do SQL like operations on a R data frame?

For example, I have a data frame with data across categories and subcategories, and I want to be able to get the row with the maximum value in a particular column, etc.
SQL is what comes to mind first, but since I am not interested in joins or indexes, Python's list comprehensions would do the same thing better with a more modern syntax.
What's best practice in R for such operations?
EDIT:
For now I think I am fine with which.max. The reason I asked the question the way I did is simply that I have come to learn that in R there are many libraries doing pretty much the same thing. Just by reading the documentation it's very hard to evaluate how popular a library is (i.e. how well it fulfills its purpose). My personal experience with Python is that the day you figure out how to use list comprehensions (with itertools as a bonus), you are pretty much covered. Over time this has evolved into best practice; you don't see lambda and filter that often in the general Python debate these days, as list comprehensions do the same thing more easily and uniformly.
If you really mean SQL, a pretty straightforward answer is the 'sqldf' package:
http://cran.at.r-project.org/web/packages/sqldf/index.html
From the help for ?sqldf
library(sqldf)
a1s <- sqldf("select * from warpbreaks limit 6")
Some additional context would help, but from the sounds of it, you may be looking for which.max() or related functions. For group-by operations, I default to the plyr family of functions, but there are certainly faster alternatives in base R if speed is of utmost importance.
library(plyr)
#Make a local copy of mycars data and add the rownames as a column since ddply
#seems to drop them. I've never encountered that before actually...
myCars <- mtcars
myCars$carname <- rownames(myCars)
#Find the max mpg
myCars[which.max(myCars$mpg) ,]
mpg cyl disp hp drat wt qsec vs am gear carb carname
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 Toyota Corolla
#Find the max mpg by cylinder category
ddply(myCars, "cyl", function(x) x[which.max(x$mpg) ,])
mpg cyl disp hp drat wt qsec vs am gear carb carname
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Pontiac Firebird