Strings and quotation-marks in csv file - pandas

I have a csv file that looks like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"
Can I use pandas to read the csv such that it gets read in the obvious way?
In other words, I want the csv file to be read as if it looked like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
Any suggestions?

pd.read_csv(data)
is the answer to your problem.
Here is the code I used for this Kaggle dataset:
training_set = pd.read_csv('train.csv')
Output (first row only):
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
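If the file on disk really does have every data row wrapped in an extra layer of quotes, as in the snippet in the question, a plain read_csv will see each row as a single field. One way to handle that, as a sketch, is to strip the outer quoting layer with the csv module first and then hand the repaired text to pandas:

```python
import csv
import io

import pandas as pd

raw = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"
'''

# First pass: csv.reader unwraps the outer quotes and un-doubles the inner
# ones. A fully wrapped line comes back as a single field; the unquoted
# header line passes through unchanged.
unwrapped = []
for row in csv.reader(io.StringIO(raw)):
    unwrapped.append(row[0] if len(row) == 1 else ','.join(row))

# Second pass: parse the repaired text normally.
df = pd.read_csv(io.StringIO('\n'.join(unwrapped)))
```

Here the file content is inlined via io.StringIO just for the sketch; with a real file you would read it once, unwrap, and re-parse the same way.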


How to replace values of a column based on another data frame?

I have a column containing symbols of chemical elements and other substances. Something like this:
Commodities
sn
sulfuric acid
cu
sodium chloride
au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
Symbol  Name
sn      tin
cu      copper
au      gold
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first dataframe (df1['Commodities']) with the corresponding names from the second one (df2['Name']), so that the output looks like the following:
Output:
Commodities
tin
sulfuric acid
copper
sodium chloride
gold
I tried using for loops and lambda but got different results than expected. I have tried many things and googled; I think it's something basic, but I just can't find an answer.
Thank you in advance!
First, convert df2 to a dictionary:
replace_dict = dict(df2[['Symbol', 'Name']].to_dict('split')['data'])
# {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace function:
df1['Commodities'] = df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
Try:
for i, row in df2.iterrows():
    df1.Commodities = df1.Commodities.str.replace(row.Symbol, row.Name)
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that. Also be aware that str.replace matches substrings, so a symbol that happens to occur inside another entry would be replaced there too.
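The zip-based shortcut mentioned in the edit could look like this, as a sketch, building the mapping straight from the two raw lists and replacing in one pass:

```python
import pandas as pd

df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'],
                   columns=['Commodities'])
symbols = ['sn', 'cu', 'au']
names = ['tin', 'copper', 'gold']

# Build the symbol -> name mapping directly from the lists; no df2 needed.
# Series.replace with a dict swaps whole cell values only, so entries like
# 'sulfuric acid' are left untouched.
df1['Commodities'] = df1['Commodities'].replace(dict(zip(symbols, names)))
```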

Need to parse PRN file into spark dataframe

I have a PRN file which looks like the sample shown below.
How can I convert it into a Spark dataframe?
The schema is not fixed; I can have different files with different headers.
As you can notice, Credit Limit is a single column while Postcode and Phone are separate columns. This might need some kind of intelligent parsing solution.
dataframe = dataframe
  .withColumn("Name", dataframe.col("value").substr(0, 15))
  .withColumn("Address", dataframe.col("value").substr(16, 22))
  .withColumn("Postcode", dataframe.col("value").substr(38, 9))
  .withColumn("Phone", dataframe.col("value").substr(47, 13))
  .withColumn("Credit Limit", dataframe.col("value").substr(61, 13))
  .withColumn("Birthday", dataframe.col("value").substr(74, 8))

dataframe.schema.fields.foreach(f => {
  dataframe = dataframe.withColumn(f.name, trim(col(f.name)))
})

dataframe = dataframe.drop(dataframe("value"))
I used a fixed-width parser; is there a more intelligent parsing solution someone can suggest?
Name            Address               Postcode  Phone        Credit Limit  Birthday
Johnson, John   Voorstraat 32         3122gg    020 3849381  1000000       19870101
Anderson, Paul  Dorpsplein 3A         4532 AA   030 3458986  10909300      19651203
Wicket, Steve   Mendelssohnstraat 54d 3423 ba   0313-398475  93400         19640603
Benetar, Pat    Driehoog 3zwart       2340 CC   06-28938945  54            19640904

How to use pandas read_csv to read csv file having backward slash and double quotation

I have a CSV file like this (comma-separated):
ID, Name,Context, Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 1"
234,"Mike","{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 2"
I want to create DataFrame like this:
ID | Name |Context |Location
123| John |{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 1
234| Mike |{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 2
Could you help and show me how to do this with pandas read_csv?
An answer - if you are willing to accept that the \ char gets stripped:
pd.read_csv(your_filepath, escapechar='\\')
ID Name Context Location
0 123 John {"Organization":{"Id":12345,"IsDefault":false}... Road 1
1 234 Mike {"Organization":{"Id":23456,"IsDefault":false}... Road 2
An answer if you actually want the backslashes in - using a custom converter:
def backslash_it(x):
    return x.replace('"', '\\"')
pd.read_csv(your_filepath, escapechar='\\', converters={'Context': backslash_it})
ID Name Context Location
0 123 John {\"Organization\":{\"Id\":12345,\"IsDefault\":... Road 1
1 234 Mike {\"Organization\":{\"Id\":23456,\"IsDefault\":... Road 2
escapechar on read_csv is used to actually read the csv; the custom converter then puts the backslashes back in.
Note that I tweaked the header row to make the column name match easier:
ID,Name,Context,Location
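Put together as a runnable sketch (with the Context JSON abridged for space):

```python
import io

import pandas as pd

# Abridged sample; the raw string keeps the backslash-quote sequences literal.
raw = r'''ID,Name,Context,Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false}}","Road 1"
234,"Mike","{\"Organization\":{\"Id\":23456,\"IsDefault\":false}}","Road 2"
'''

# escapechar consumes the backslashes, leaving plain JSON in Context ...
df = pd.read_csv(io.StringIO(raw), escapechar='\\')

# ... and the converter puts them back for the variant that keeps them.
df_kept = pd.read_csv(io.StringIO(raw), escapechar='\\',
                      converters={'Context': lambda s: s.replace('"', '\\"')})
```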

Delete abbreviations (combination of Letter+dot) from Pandas column

I'd like to delete specific parts of strings in a pandas column, such as any letter followed by a dot. For example, having a column with names:
John W. Man
Betty J. Rule
C.S. Stuart
What should remain is
John Man
Betty Rule
Stuart
So, any letter followed by a dot, which represents an abbreviation, should go.
I can't think of a way with str.replace or anything like that.
Use Series.str.replace with a regex that matches one letter followed by a . and an optional trailing space:
df['col'] = df['col'].str.replace(r'[a-zA-Z]\.\s*', '', regex=True)
print (df)
col
0 John Man
1 Betty Rule
2 Stuart
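As a runnable check on the sample names (a sketch; the df construction is assumed):

```python
import pandas as pd

df = pd.DataFrame({'col': ['John W. Man', 'Betty J. Rule', 'C.S. Stuart']})

# Each letter-plus-dot (and any trailing space) is removed; in 'C.S. Stuart'
# the pattern matches twice, leaving only the surname.
df['col'] = df['col'].str.replace(r'[a-zA-Z]\.\s*', '', regex=True)
```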

Spark Scala : How to read fixed record length File

I have a simple question: how do I read a file with a fixed record length? There are 2 fields in each record - name & state.
File Data-
John OHIO
VictorNEWYORK
Ron CALIFORNIA
File Layout-
Name String(6);
State String(10);
I just want to read it and create a DataFrame from this file. To elaborate on “fixed record length”: since “OHIO” is only 4 characters long, in the file it is padded with 6 trailing spaces, “OHIO      ”.
The record length here is 16.
Thanks,
Sid
Read your input file:
val rdd = sc.textFile("your_file_path")
Then use substring to split the fields and convert the RDD to a DataFrame using toDF() (auto-available in spark-shell; in an application you need import spark.implicits._).
val df = rdd.map(l => (l.substring(0, 6).trim(), l.substring(6, 16).trim()))
.toDF("Name","State")
df.show(false)
Result:
+------+----------+
|Name |State |
+------+----------+
|John |OHIO |
|Victor|NEWYORK |
|Ron |CALIFORNIA|
+------+----------+
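For comparison outside Spark, the same fixed offsets can be expressed with pandas' read_fwf and explicit colspecs (a sketch; the data is inlined to mirror the file above):

```python
import io

import pandas as pd

data = """\
John  OHIO
VictorNEWYORK
Ron   CALIFORNIA
"""

# Explicit (start, end) character offsets: Name = 0..6, State = 6..16.
# read_fwf strips the padding spaces automatically.
df = pd.read_fwf(io.StringIO(data), colspecs=[(0, 6), (6, 16)],
                 names=['Name', 'State'])
```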