Need to parse PRN file into a Spark DataFrame

I have a PRN file which looks like this.
How can I convert it into a Spark DataFrame?
The schema is not fixed; I can have different files with different headers.
As you can notice, Credit Limit is a single column while Postcode and Phone are different columns. This might need some kind of intelligent parsing solution.
// dataframe must be declared as a var, since it is reassigned below
dataframe = dataframe
  .withColumn("Name", dataframe.col("value").substr(0, 15))
  .withColumn("Address", dataframe.col("value").substr(16, 22))
  .withColumn("Postcode", dataframe.col("value").substr(38, 9))
  .withColumn("Phone", dataframe.col("value").substr(47, 13))
  .withColumn("Credit Limit", dataframe.col("value").substr(61, 13))
  .withColumn("Birthday", dataframe.col("value").substr(74, 8))
// trim the padding whitespace from every extracted column
dataframe.schema.fields.foreach { f =>
  dataframe = dataframe.withColumn(f.name, trim(col(f.name)))
}
dataframe = dataframe.drop(dataframe("value"))
I used a fixed-width parser; can anyone suggest a more intelligent parsing solution?
Name            Address               Postcode Phone         Credit Limit Birthday
Johnson, John   Voorstraat 32         3122gg   020 3849381        1000000 19870101
Anderson, Paul  Dorpsplein 3A         4532 AA  030 3458986       10909300 19651203
Wicket, Steve   Mendelssohnstraat 54d 3423 ba  0313-398475          93400 19640603
Benetar, Pat    Driehoog 3zwart       2340 CC  06-28938945             54 19640904
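One "intelligent" alternative to hard-coded offsets is to infer the field boundaries from the header and the data together: treat the start of every header word as a candidate boundary, and merge a word into the previous field whenever some data row has a non-space character just before it. That keeps "Credit Limit" as one column while still splitting Postcode from Phone. A minimal pandas sketch follows (the function names and the column widths used to rebuild the sample are my own, not taken from the real file); the resulting start positions could equally drive the Spark substr calls:

```python
import re

import pandas as pd

def infer_field_starts(header, data_rows):
    # Candidate field starts: the start of every header word. A word is
    # merged into the previous field (e.g. "Credit Limit") when some data
    # row has a non-space character right before it.
    starts = [m.start() for m in re.finditer(r"\S+", header)]
    keep = [starts[0]]
    for s in starts[1:]:
        if all(len(row) < s or row[s - 1] == " " for row in data_rows):
            keep.append(s)
    return keep

def parse_prn(text):
    lines = text.splitlines()
    header, data = lines[0], lines[1:]
    starts = infer_field_starts(header, data)
    bounds = list(zip(starts, starts[1:] + [None]))
    names = [header[a:b].strip() for a, b in bounds]
    rows = [[line[a:b].strip() for a, b in bounds] for line in data]
    return pd.DataFrame(rows, columns=names)

# Rebuild a small fixed-width sample (widths are assumed, not from the file).
widths = (16, 22, 9, 14, 14, 8)
records = [
    ("Name", "Address", "Postcode", "Phone", "Credit Limit", "Birthday"),
    ("Johnson, John", "Voorstraat 32", "3122gg", "020 3849381", "1000000", "19870101"),
    ("Wicket, Steve", "Mendelssohnstraat 54d", "3423 ba", "0313-398475", "93400", "19640603"),
]
sample = "\n".join(
    "".join(field.ljust(w) for field, w in zip(rec, widths)).rstrip() for rec in records
)

print(parse_prn(sample).columns.tolist())
# ['Name', 'Address', 'Postcode', 'Phone', 'Credit Limit', 'Birthday']
```

Since the heuristic only needs the header line plus a handful of sampled rows, the inferred boundaries can be computed once on the driver and then applied per line in Spark.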

Related

Delete number of string characters from left in Column B depending on length of string in column A of pandas dataframe

As described in the title, I have the following problem:
Data is prepared as a pandas dataframe incoming as follows:
Article  Title
A0       A00183
BB2      BB2725
C2C3     C2C3945
As you can see, the "Title" column is repeating the string value of the Article column.
I want this to be deleted, so that the table looks as follows:
Article  Title
A0       0183
BB2      725
C2C3     945
I want to do this with Pandas.
I already found out how to get the length of the string in column Article, so I know how many characters need to be removed:
df1['Length of Article string:'] = df1['Article:'].apply(len)
But I cannot figure out how to delete the prefix, whose length changes for every row, from the Title column.
Thanks for your help!
Kind regards
I tried the Pandas documentation and found some hints regarding split and strip, but I do not have enough know-how to implement them.
You can replace using the list derived from the Article column:
df["Title"] = df["Title"].replace(df["Article"].tolist(), "", regex=True)
print(df)

  Article Title
0      AA  0123
1     BBB   234
2    CCCC   345
You can use replace() inside a lambda function with apply():
dfx = df[['Article','Title']].apply(lambda x: x['Title'].replace(x['Article'], ''), axis=1)
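Both approaches replace the Article text wherever it appears in Title; since it is always a prefix here, slicing Title by the length of Article avoids accidental matches elsewhere in the string. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Article": ["A0", "BB2", "C2C3"],
    "Title": ["A00183", "BB2725", "C2C3945"],
})

# Drop exactly len(Article) leading characters from Title, row by row.
df["Title"] = [title[len(article):] for article, title in zip(df["Article"], df["Title"])]
print(df["Title"].tolist())  # ['0183', '725', '945']
```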

Translate DataFrame using crosswalk in Julia

I have a very large dataframe (original_df) with columns of codes
14 15
21 22
18 16
And a second dataframe (crosswalk) which maps 'old_codes' to 'new_codes':
old_codes  new_codes
14         104
15         105
16         106
18         108
21         201
22         202
Of course, the resultant df (resultant_df) that I would like would have values:
104 105
201 202
108 106
I am aware of two ways to accomplish this. First, I could iterate through each code in original_df, find the code in crosswalk, and rewrite the corresponding cell in original_df with the translated code. The faster and more natural option would be to leftjoin() each column of original_df on 'old_codes'. Unfortunately, it seems I would have to do this separately for each column and then delete each original column after its converted column has been created, which feels unnecessarily complicated. Is there a simpler way to convert all of original_df at once using the crosswalk?
You can do the following (I am using column numbers as you have not provided column names):
# build an old_code => new_code lookup, then map every column through it
d = Dict(crosswalk[!, 1] .=> crosswalk[!, 2])
resultant_df = select(original_df, [i => ByRow(x -> d[x]) for i in 1:ncol(original_df)], renamecols=false)
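For comparison, the same dict-based recoding can be sketched in pandas, where DataFrame.replace with a dict applies the old-to-new mapping to every column at once (the column names "a" and "b" below are placeholders, since the original question gives none):

```python
import pandas as pd

original_df = pd.DataFrame([[14, 15], [21, 22], [18, 16]], columns=["a", "b"])
crosswalk = {14: 104, 15: 105, 16: 106, 18: 108, 21: 201, 22: 202}

# replace() with a dict recodes matching values in every column.
resultant_df = original_df.replace(crosswalk)
print(resultant_df.values.tolist())  # [[104, 105], [201, 202], [108, 106]]
```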

Averaging dataframes with many string columns and displaying back all the columns

I have struggled with this even after looking at various past answers, to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display the data in the GUI together with the information in the non-numeric columns. The non-numeric columns hold information such as names, roll numbers, and streams, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe, but fails when I combine two or more dataframes: it returns only the averages of the numeric columns and leaves the non-numeric columns undisplayed. Below is one of the codes I have tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
A working code should return the averages in something like this:
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
    'STREAM': ['NORTH', 'SOUTH'],
    'ADM': [437, 238, 439],
    'NAME': ['JAMES', 'MARK', 'PETER'],
    'KCPE': [233, 168, 349],
    'ENG': [70, 28, 79],
    'KIS': [37, 82, 79],
    'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({
    'STREAM': ['NORTH', 'SOUTH'],
    'ADM': [437, 238, 439],
    'NAME': ['JAMES', 'MARK', 'PETER'],
    'KCPE': [233, 168, 349],
    'ENG': [40, 12, 56],
    'KIS': [33, 43, 43],
    'MAT': [22, 58, 23]})
Your question is not clear, so I am guessing the intent from the content. I have fixed your dataframes, which were not well formed, by adding a third stream called 'CENTRAL'; see:
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439],'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439],'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to merge the two dataframes and find the average:
df3 = pd.concat([Df2, Df1])
df3.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
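The key point is that string columns survive the averaging only when they are part of the grouping key. A sketch with the corrected sample data (numeric_only keeps the mean restricted to the mark columns):

```python
import pandas as pd

Df1 = pd.DataFrame({"STREAM": ["NORTH", "SOUTH", "CENTRAL"], "ADM": [437, 238, 439],
                    "NAME": ["JAMES", "MARK", "PETER"], "KCPE": [233, 168, 349],
                    "ENG": [70, 28, 79], "KIS": [37, 82, 79], "MAT": [67, 38, 29]})
Df2 = pd.DataFrame({"STREAM": ["NORTH", "SOUTH", "CENTRAL"], "ADM": [437, 238, 439],
                    "NAME": ["JAMES", "MARK", "PETER"], "KCPE": [233, 168, 349],
                    "ENG": [40, 12, 56], "KIS": [33, 43, 43], "MAT": [22, 58, 23]})

# Group on the non-numeric identifiers so they remain visible in the
# output, then average only the numeric mark columns.
avg = (pd.concat([Df1, Df2])
         .groupby(["STREAM", "ADM", "NAME"], as_index=False)
         .mean(numeric_only=True))
print(avg.loc[avg["NAME"] == "JAMES", "ENG"].iloc[0])  # 55.0
```

The same avg frame can then be written into the QTableWidget loop from the question, since STREAM and NAME are now ordinary columns rather than dropped strings.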

Strings and quotation-marks in csv file

I have a csv file that looks like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"
Can I use pandas to read the csv such that it gets read in the obvious way?
In other words, I want the csv file to be read as if it looked like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
Any suggestions?
pd.read_csv(data)
is the answer to your problem.
Here is the code I used for this Kaggle dataset:
training_set = pd.read_csv('train.csv')
Output (Just first row)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
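A plain pd.read_csv works on the normal Kaggle file, but if the rows really do arrive wrapped in an extra layer of CSV quoting, as shown in the question, read_csv will see each data row as a single field. One option (a sketch, assuming every data row is wrapped the same way) is to strip the outer quotes and un-double the inner ones before parsing:

```python
import io

import pandas as pd

raw = (
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '"1,0,3,""Braund, Mr. Owen Harris"",male,22,1,0,A/5 21171,7.25,,S"\n'
)

fixed_lines = []
for i, line in enumerate(raw.splitlines()):
    # Data rows carry one extra layer of CSV quoting: strip the outer
    # quotes and un-double the inner ones, leaving a normal CSV row.
    if i > 0 and line.startswith('"') and line.endswith('"'):
        line = line[1:-1].replace('""', '"')
    fixed_lines.append(line)

df = pd.read_csv(io.StringIO("\n".join(fixed_lines)))
print(df.loc[0, "Name"])  # Braund, Mr. Owen Harris
```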

Pandas DataFrame: remove � (unknown-character) from strings in rows

I have read a csv file into Python 2.7 (Windows machine). The Sales Price column seems to be a mixture of string and float, and some rows contain a euro symbol €, which Python displays as �.
df = pd.read_csv('sales.csv', thousands=',')
print df
Gender Size Color Category Sales Price
Female 36-38 Blue Socks 25
Female 44-46 Pink Socks 13.2
Unisex 36-38 Black Socks � 19.00
Unisex 40-42 Pink Socks � 18.50
Female 38 Yellow Pants � 89,00
Female 43 Black Pants � 89,00
I was under the assumption that a simple line with replace would solve it:
df = df.replace('\�', '', regex=True).astype(float)
But I got an encoding error:
SyntaxError: Non-ASCII character
I would appreciate hearing your thoughts on this.
I faced a similar problem where one of the columns in my dataframe had lots of currency symbols: euro, dollar, yen, pound, etc. I tried multiple solutions, but the easiest one was to use the unicodedata module:
import unicodedata
df['Sales Price'] = df['Sales Price'].str.replace(unicodedata.lookup('EURO SIGN'), 'Euro')
The above will replace € with Euro in the Sales Price column.
I think @jezrael's comment is valid. First you need to read the file with an encoding (see the encoding section of https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):
df = pd.read_csv('sales.csv', thousands=',', encoding='utf-8')
But for replacing the euro sign, use the '\u20AC' escape so the source file stays ASCII, and apply the conversion to the Sales Price column rather than the whole dataframe:
df['Sales Price'] = df['Sales Price'].replace('\u20AC', '', regex=True)
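Putting the pieces together in Python 3: both the euro sign and the European decimal comma have to go before the column can become float. A sketch over the sample values (assuming the comma in rows like "89,00" is a decimal separator, not a thousands separator):

```python
import pandas as pd

df = pd.DataFrame({"Sales Price": ["25", "13.2", "\u20ac 19.00", "\u20ac 18.50", "\u20ac 89,00"]})

cleaned = (df["Sales Price"]
           .str.replace("\u20ac", "", regex=False)  # drop the euro sign
           .str.replace(",", ".", regex=False)      # decimal comma -> dot
           .str.strip()
           .astype(float))
print(cleaned.tolist())  # [25.0, 13.2, 19.0, 18.5, 89.0]
```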