Spark Scala : How to read fixed record length File - apache-spark-sql

I have a simple question.
"How to read files with a fixed record length?" I have 2 fields in the record: name and state.
File Data-
John  OHIO
VictorNEWYORK
Ron   CALIFORNIA
File Layout-
Name String(6);
State String(10);
I just want to read it and create a DataFrame from this file. To elaborate on "fixed record length": since "OHIO" is only 4 characters long, in the file it is padded with 6 trailing spaces to fill the 10-character State field ("OHIO      ").
The record length here is 16.
Thanks,
Sid

Read your input file:
val rdd = sc.textFile("your_file_path")
Then use substring to split the fields and convert the RDD to a DataFrame using toDF() (in a standalone application you will need import spark.implicits._ for toDF to be available; the spark-shell imports it for you).
val df = rdd.map(l => (l.substring(0, 6).trim(), l.substring(6, 16).trim()))
.toDF("Name","State")
df.show(false)
Result:
+------+----------+
|Name  |State     |
+------+----------+
|John  |OHIO      |
|Victor|NEWYORK   |
|Ron   |CALIFORNIA|
+------+----------+
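For reference, the same fixed-width split can be expressed with the DataFrame API; here is a minimal PySpark sketch (the file path is a placeholder):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, substring

spark = SparkSession.builder.getOrCreate()

# Each line of the file becomes one row in a single string column named "value"
raw = spark.read.text("your_file_path")

df = raw.select(
    trim(substring(col("value"), 1, 6)).alias("Name"),    # positions 1-6
    trim(substring(col("value"), 7, 10)).alias("State"),  # positions 7-16
)
df.show(truncate=False)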

Related

PySpark - change dataframe column value based on its existence in other dataframe

I am relatively new to Spark and haven't found a solution for this. I found several similar questions, but couldn't work out how to apply them to my use case.
I have two dataframes. The first is based on a CSV and looks like this (displayed as a table):
id   | license_no
2005 | 1011
2006 | 1022
2007 | 3911
The second dataframe is based on a CSV which looks like this:
license_no | active
1011       | y
1022       | y
3911       | n
I need to check whether the license_no value exists in the second dataframe. If it exists and active = y, I need to add the prefix 99 to both id and license_no in the first dataframe; for example, license_no 1011 exists in the second dataframe and is active, so its id in the first dataframe should be changed to 992005 and its license_no to 991011. If it doesn't exist (or is not active), the prefix 88 should be added instead.
The first dataframe should look like this after the transformation:
id     | license_no
992005 | 991011
992006 | 991022
882007 | 883911
I am not sure what the best solution for this is. Can I do this transformation directly in one Spark command?
from pyspark.sql.functions import when, col, concat, lit
# Join the two dataframes on license_no
s = df.join(df1, how='left', on='license_no')
# Conditionally concat the prefix using a list comprehension over the columns
s.select(*[when(col('active') == 'y', concat(lit('99'), col(x)))
           .otherwise(concat(lit('88'), col(x))).alias(x)
           for x in s.columns if x != 'active']).show()
+----------+------+
|license_no| id|
+----------+------+
| 991011|992005|
| 991022|992006|
| 883911|882007|
+----------+------+
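A self-contained sketch of the same approach, building both dataframes from the sample data above (the cast to string is only there to make concat unambiguous for the integer id column):
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, concat, lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(2005, 1011), (2006, 1022), (2007, 3911)],
                           ['id', 'license_no'])
df1 = spark.createDataFrame([(1011, 'y'), (1022, 'y'), (3911, 'n')],
                            ['license_no', 'active'])

s = df.join(df1, how='left', on='license_no')
result = s.select(*[when(col('active') == 'y', concat(lit('99'), col(x).cast('string')))
                    .otherwise(concat(lit('88'), col(x).cast('string'))).alias(x)
                    for x in s.columns if x != 'active'])
result.show()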

How to use pandas read_csv to read csv file having backward slash and double quotation

I have a CSV file like this (comma separated)
ID, Name,Context, Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 1"
234,"Mike","{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}","Road 2"
I want to create DataFrame like this:
ID | Name |Context |Location
123| John |{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 1
234| Mike |{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1,\"NewVersionId\":\"88229ef9-e97b-4b88-8eba-31740d48fd15\",\"ApiIntegrationType\":0,\"PortalIntegrationType\":0}|Road 2
Could you show me how to do this with pandas read_csv?
An answer, if you are willing to accept that the \ character gets stripped:
pd.read_csv(your_filepath, escapechar='\\')
ID Name Context Location
0 123 John {"Organization":{"Id":12345,"IsDefault":false}... Road 1
1 234 Mike {"Organization":{"Id":23456,"IsDefault":false}... Road 2
An answer if you actually want the backslashes in - using a custom converter:
def backslash_it(x):
    return x.replace('"', '\\"')
pd.read_csv(your_filepath, escapechar='\\', converters={'Context': backslash_it})
ID Name Context Location
0 123 John {\"Organization\":{\"Id\":12345,\"IsDefault\":... Road 1
1 234 Mike {\"Organization\":{\"Id\":23456,\"IsDefault\":... Road 2
escapechar on read_csv is used to actually read the CSV; the custom converter then puts the backslashes back in.
Note that I tweaked the header row to make the column name match easier:
ID,Name,Context,Location
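A runnable sketch of both variants, using an in-memory copy of the sample rows (header tweaked as above, and the JSON values shortened for brevity):
import io
import pandas as pd

csv_text = r'''ID,Name,Context,Location
123,"John","{\"Organization\":{\"Id\":12345,\"IsDefault\":false},\"VersionNumber\":-1}","Road 1"
234,"Mike","{\"Organization\":{\"Id\":23456,\"IsDefault\":false},\"VersionNumber\":-1}","Road 2"
'''

def backslash_it(x):
    # Re-escape the double quotes that escapechar stripped during parsing
    return x.replace('"', '\\"')

# Variant 1: backslashes stripped
plain = pd.read_csv(io.StringIO(csv_text), escapechar='\\')

# Variant 2: backslashes restored in the Context column
escaped = pd.read_csv(io.StringIO(csv_text), escapechar='\\',
                      converters={'Context': backslash_it})

print(plain['Context'][0])
print(escaped['Context'][0])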

Dividing column of Dataframe by constant value

I have a DataFrame in the format below.
| Occupation | wa_rating | Genre |
| engineer | 935 | Musical |
Now I want to divide the wa_rating column of this DataFrame by totalRating,
but when I do
resultDF = joinedDF.select(col("wa_rating")/totalRating)
it gives me the error below:
Unsupported literal type class java.util.ArrayList
Most likely your totalRating variable is a list, for example [100], and you can't divide a number by a list. This throws your error:
resultDF = joinedDF.select(col("wa_rating")/[100])
but this does not
resultDF = joinedDF.select(col("wa_rating")/100)
Check that totalRating is an actual number (a float or integer). If it's a list containing a number, simply extract the number from it.
EDIT:
From your comments, we now know that your totalRating is a list. You can transform it to a number with:
totalRating = joinedDF3.groupBy().sum("Rating").collect()[0][0]
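Put together, a minimal sketch of the fix, using the column and dataframe names from the question (the alias is only for readability):
from pyspark.sql.functions import col

# collect() returns a list of Rows; [0][0] pulls out the single summed value
totalRating = joinedDF.groupBy().sum("wa_rating").collect()[0][0]

# Dividing by a plain number works; dividing by a list raises the literal-type error
resultDF = joinedDF.select((col("wa_rating") / totalRating).alias("rating_share"))
resultDF.show()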

Loading files with dynamically generated columns

I need to create an SSIS project that loads daily batches of 150 files into a SQL Server database. Each batch always contains the same 150 files, and each file in the batch has a unique name. Each file can be either a full or an incremental type; incremental files have one more column than the full files. Each batch contains a control file that states whether a file is full or incremental. See examples of the two file types below:
Full File
| SID | Name | DateOfBirth |
|:---: | :----: | :-----------: |
| 1 | Samuel | 20/05/1964 |
| 2 | Dave | 06/03/1986 |
| 3 | John | 15/09/2001 |
Incremental File
| SID | Name | DateOfBirth | DeleteRow |
|:---: | :----: | :-----------: | :----------: |
| 2 | | | 1 |
| 4 | Abil | 19/11/1993 | 0 |
| 5 | Zainab | 26/02/2006 | 0 |
I want to avoid creating 2 packages (full and incremental) for each file.
Is there a way to dynamically generate the column list in each source/destination component based on the file type in the control file? For example, when the file type is incremental, the column list should include the extra column (DeleteRow).
Let's assume my ControlFile.xlsx is:
Col1       | Col2
File1.xlsx | Full
file2.xlsx | Incremental
Flow:
1. Create a DFT in which ControlFile.xlsx is captured in an object variable. Source: Control connection; Destination: Recordset Destination.
2. Pass this object variable into a ForEach loop container. The ResultSet variable should capture Col2 of ControlFile.xlsx.
3. Create a Sequence container just as a start point. Add 2 DFTs, one for the full load and one for the incremental load, and use precedence constraints (as shown in the image below) to decide which DFT runs.
4. Inside each DFT, use an Excel source to an OLE DB destination.
5. Use the FilePath variable for the connection property of the full-load and incremental-load Excel connections to make them dynamic.
Step 1: overall image.
Step 2: In the DFT that reads the control file, you read ControlFile.xlsx and save it to the Recordset Destination, into the RecordOutput variable.
Step 3: Your precedence constraints should look like the image below ("Full" for the full load, "Incremental" for the incremental load).
Use the source and destination connections as shown in the first image. It's a bit hard to explain all the steps, but the flow is simple.
One thing to notice: the incremental files have an additional column, so you'll need a Derived Column transformation in your full load to keep the mappings identical.
Also, make sure the DelayValidation property is set to true.
The ForEach loop container uses the Foreach ADO Enumerator; the following images describe its properties.
I can think of two solutions.
1) Have a script task at the beginning of the package that looks to see if this is an incremental load or a full load. If it is a full load, have it loop through all the files and add a "DeleteRow" column with all zeros to every file. Then you can use the same column list.
2) Use BiML to dynamically generate your package at run time based on the available metadata.
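To illustrate option 1 outside of SSIS (an actual Script Task would be written in C# or VB.NET), here is a rough Python sketch of the preprocessing idea, assuming for the sketch that the batch files are CSVs in one folder and that a hypothetical control.csv lists FileName and LoadType:
import csv
import os

folder = "daily_batch"  # hypothetical folder holding the 150 files plus control.csv

with open(os.path.join(folder, "control.csv"), newline="") as f:
    load_types = {row["FileName"]: row["LoadType"] for row in csv.DictReader(f)}

for name, load_type in load_types.items():
    if load_type.lower() != "full":
        continue  # incremental files already carry the DeleteRow column
    path = os.path.join(folder, name)
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames or [])
    if "DeleteRow" in fieldnames:
        continue
    fieldnames.append("DeleteRow")
    for row in rows:
        row["DeleteRow"] = "0"  # full loads never delete rows, so default to zero
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)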

Value to table header in Pentaho

Hi, I'm quite new to Pentaho Spoon and I have a problem:
I have a table like this:
model | type | color| q
--1---| --1-- | blue | 1
--1---| --2-- | blue | 2
--1---| --1-- | red | 1
--1---| --2-- | red | 3
--2---| --1-- | blue | 4
--2---| --2-- | blue | 5
I would like to create a separate table (to export to CSV or Excel) for each model, grouped by type, with the color values as column headers and the q values as the cell values:
table-1.csv
type | blue | red
  1  |  1   |  1
  2  |  2   |  3
table-2.csv
type | blue
  1  |  4
  2  |  5
I tried the Row Denormaliser step, but with no luck.
Any suggestions?
Typically it's helpful to see what you have done in order to offer help, but I know how counterintuitive the "help" on this step is.
Make sure you sort the rows on Model and Type before sending them to the denormalizer step. Then give this a try:
As for splitting the output into files, there are a few ways to handle that. Take a look at the Switch/Case step using the Model field.
Also, if you haven't found them already, take a look at the sample files that come with the PDI download. They should be in ...pdi-ce-6.1.0.1-196\data-integration\samples. They can be more helpful than the online documentation sometimes.
The Row Denormaliser can't be used here if the number of colors is unknown; also, you can't define Text file output fields dynamically.
There are a few ways I can see to do this without using Java or JavaScript steps. One of them is based on the following idea: we can prepare rows with two columns, as shown below:
Row           | Model
type|blue|red | 1
1|1|1         | 1
2|2|3         | 1
type|blue     | 2
1|4           | 2
2|5           | 2
Then we can prepare a filename for each row using the Model field, and output all rows with a Text file output step where the file name is taken from that filename field. This way all records are exported into two files without additional effort.
Here you can find a sample transformation: copy-paste me into a new transformation
Please note that this is a sample solution that works only with CSV. It also works only if you have the same number of colors for each type within a model. It's just a hint at how to use Spoon, not a complete solution.
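Not a Spoon solution, but to make the intended reshaping concrete, here is a minimal pandas sketch of the same group-and-pivot idea (output file names follow the question's table-N.csv pattern):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "model": [1, 1, 1, 1, 2, 2],
    "type":  [1, 2, 1, 2, 1, 2],
    "color": ["blue", "blue", "red", "red", "blue", "blue"],
    "q":     [1, 2, 1, 3, 4, 5],
})

# One pivoted table per model: types as rows, colors as columns, q as values
for model, group in df.groupby("model"):
    table = group.pivot(index="type", columns="color", values="q")
    table.to_csv(f"table-{model}.csv")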