Add rows in a CSV using spark - dataframe

I have a spark dataframe which looks like :
Finaltotal,Total,3683,3271.58,289659.5,5069340.0,4188153.83,3683,3271.58,289659.5,5069340.0,4188153.83
Finaltotal,Total,3683,3271.58,289659.5,5069340.0,4188153.83,3683,3271.58,289659.5,5069340.0,4188153.83
Finaltotal,Total,3683,3271.58,289659.5,5069340.0,4188153.83,3683,3271.58,289659.5,5069340.0,4188153.83
This is a CSV file, and I need to add up all the rows for each column in this data frame and generate a final CSV file.
This file has a header, but the header can contain duplicate column names.
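A minimal PySpark sketch of one way to do this, assuming an existing SparkSession named spark; the input and output paths and the positional column names (c0, c1, ...) are placeholders, and the duplicate header names are handled by renaming every column by position before aggregating:

from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

# Hypothetical paths; adjust to your environment.
df = spark.read.csv("input.csv", header=True, inferSchema=True)

# Duplicate header names make name-based references ambiguous,
# so rename every column by position first (c0, c1, ...).
df = df.toDF(*["c%d" % i for i in range(len(df.columns))])

# Sum every numeric column; non-numeric columns (the two label
# columns in the sample rows) are skipped.
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
totals = df.agg(*[F.sum(c).alias(c) for c in numeric_cols])

# Write a single CSV file containing one row of column totals.
totals.coalesce(1).write.csv("totals_out", header=True)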

Related

Processing CSV files with Metadata in S3 bucket in Pentaho

I have a CSV file that goes something like this:
Report Name: Stackoverflow parse data
Date of Report: 31 October, 2022
Col1, Col2, Col3,...
Data, Data, Data, ...
The values before the headers (essentially data that states what the CSV is for and when it was created; it can contain multiple values, hence a dynamic number of rows) need to be removed from the CSV so I can parse it in Pentaho. The CSV files are in an S3 bucket and I am fetching them using S3 CSV Input, but I am not sure how to filter out the non-required data so I can successfully parse the CSV files.
You can read the complete file as a CSV with only one column, adding the rownumber to the output. Then you apply a filter to get rid of the first n rows, and then you use the Split fields step to separate the rows into columns.
You'll need more steps to transform numbers and dates into the correct format (using the Split fields you'll get strings), and maybe more operations to preformat some other columns.
Or you could create a temporary copy of your S3 CSV file without the first n rows, and read that file instead of the original one.
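A minimal sketch of that temporary-copy idea, assuming the file has already been fetched locally and that the real header row can be recognized (here, hypothetically, because it starts with "Col1"):

# Copy the file, dropping everything before the first real header row.
with open("report_raw.csv") as src, open("report_clean.csv", "w") as dst:
    started = False
    for line in src:
        if not started and line.startswith("Col1"):
            started = True          # first real header row found
        if started:
            dst.write(line)         # keep header and data rows only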
Step 1: In the CSV input, just add the rownumber.
Step 2: Use a filter to drop the unwanted rows.
Step 3: Add an output step such as CSV or database.

How do I load an entire file's content as text into a column of an AzureSQLDW table?

I have some files in Azure Data Lake Gen2 and I want to load each one as a single nvarchar(max) column value in AzureSQLDW. The table in AzureSQLDW is a heap. I couldn't find any way to do it. All I see are column-delimited loads that put the content into multiple rows instead of one row in a single column. How do I achieve this?
I don't guarantee this will work, but try using COPY INTO and define non-present values for row and column delimiters. Make your target a single column table.
I would create a Source Dataset with a single column. You do this by specifying "No delimiter".
Next, go to the "Schema" tab and Import the schema, which should create a single column called "Prop_0".
Now the data should come through as a single string instead of delimited columns.

How to validate one CSV's data against another CSV file using Pentaho?

I have two CSV files.
In one file I have 10 rows, and in the other a list of data.
What I want to do is check the data of one field of the first CSV and compare it with the other CSV file.
So how can I achieve this?
Any help would be great.
The step you are looking for is the Stream Lookup step.
Read your CSV and the reference file, drop the two flows into a Stream Lookup, and set it up as follows:
a) Lookup step = the step that reads the reference file.
b) Keys / Field = the name of the field in the CSV that identifies the row in the reference file.
c) Keys / Lookup field = the name of the matching field in the reference file.
d) Field to retrieve = the name of the field in the reference file to return (it may be the identifier or any other field you need).
e) Field to retrieve / Type = do not forget to set it!
This way, you add a column from the reference file to the 10 rows of the CSV file. You can then filter out the rows the lookup did not find by testing whether the value of the new column is null.
Since all of the above setup in PDI is guided by drop-down lists, it should take you about two minutes.
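For illustration only, the same lookup-then-filter logic written in plain Python with pandas (this is not a PDI step; the file names and the "key"/"value" column names are hypothetical):

import pandas as pd

main = pd.read_csv("main.csv")          # the 10-row file
ref = pd.read_csv("reference.csv")      # the reference file

# Left join: adds the reference column next to each of the 10 rows.
merged = main.merge(ref[["key", "value"]], on="key", how="left")

# Keep only the rows that were found in the reference file.
found = merged[merged["value"].notna()]
print(found)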

Python/Pandas: Length of a .csv file exported from a dataframe is different from the length of the dataframe

I imported a .csv file using df = pandas.read_csv(...). I calculated the number of rows in this dataframe df using print(len(df)), and it was some 30 rows less than in the originally imported file. BUT, when I exported df directly after import as .csv (without doing any operation on this dataframe df) using df.to_csv(...), the exported file had the same number of rows as the originally imported .csv file.
It's very hard for me to debug and explain the difference between the length of the dataframe on one hand and both the imported and exported .csv files on the other, as there are more than half a million rows in the dataset. Can anyone provide some hints as to what could cause such bizarre behavior?
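A minimal sketch of the comparison described above, assuming the original file's row count was taken as a raw line count; the file name is hypothetical:

import pandas as pd

df = pd.read_csv("data.csv")            # hypothetical input file
with open("data.csv") as f:
    physical_lines = sum(1 for _ in f)  # raw line count of the source file

print(len(df))              # logical rows parsed by pandas
print(physical_lines - 1)   # physical lines minus the header line

df.to_csv("data_out.csv", index=False)  # round-trip export, as in the question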

Import columns to existing OpenRefine project

How do I add a column from an external .csv file to an existing project?
I tried to find the solution online, but I wasn't successful.
Using the file you provided, I did this in less than one minute.
I had a project with one column.
If you know a little Python, try Jython. Go to Edit column > Add column based on this column and choose Language: Jython, like this:
import csv
# We use DictReader to turn each imported row into a dict,
# so we can later refer to the column we want by its key, i.e. the header.
rows = csv.DictReader(open('/home/yourusername/Downloads/example.csv'), delimiter=",")
for row in rows:
    return row['Comprar']  # 'Comprar' is the header of the column I want