Convert bulk .xlsx files to .csv (UTF-8) in Pentaho - pentaho

I am new to Pentaho. I am trying to build a transformation that converts a batch of .xlsx files to .csv (UTF-8).
I tried Get File Names followed by Text File Output, but that writes a single CSV file whose content is the file properties rather than the spreadsheet data.
I also tried Microsoft Excel Input with Microsoft Excel Output, and that did not work either.
Any help will be appreciated. TIA!

I have prepared a solution for you and made it fully dynamic. For that reason, the solution is a combination of 6 transformations and jobs. You only need to define the following two things:
Source folder location
Destination folder location
Everything else works dynamically.
I also learned a lot while building this solution.

Would you like to generate a separate CSV for each Excel file?
It is better to do it like this:
Using the Get File Names component, read the list of Excel files from the folder.
Then call Execute Transformation, and pass the name of the file.
Then a separate Transformation will be performed for each file, and a separate CSV will be generated in it for each Excel file.
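For readers who want to see the per-file pattern outside Pentaho, here is a minimal Python sketch of the same idea: list the Excel files, then run one conversion per file. It is not the Pentaho transformation itself. The function names convert_folder and read_xlsx_rows are hypothetical, the default reader assumes the third-party openpyxl package is installed, and only the first sheet of each workbook is read.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, List

Rows = Iterable[Iterable[object]]


def read_xlsx_rows(xlsx_path: Path) -> Rows:
    """Read the first sheet of an .xlsx file (assumes openpyxl is installed)."""
    from openpyxl import load_workbook  # third-party, imported lazily

    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    sheet = workbook.worksheets[0]
    return [list(row) for row in sheet.iter_rows(values_only=True)]


def convert_folder(
    src: Path, dst: Path, reader: Callable[[Path], Rows] = read_xlsx_rows
) -> List[Path]:
    """Write one UTF-8 CSV into dst for every .xlsx file found in src."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    # Equivalent of "Get File Names": enumerate the Excel files.
    for xlsx in sorted(src.glob("*.xlsx")):
        out = dst / (xlsx.stem + ".csv")
        # Equivalent of the per-file sub-transformation: one CSV per workbook.
        with out.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(reader(xlsx))
        written.append(out)
    return written
```

Calling convert_folder(Path("source"), Path("destination")) mirrors the two settings mentioned above, a source folder and a destination folder. The reader is passed in as a parameter so the loop can be exercised without real workbooks.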

Related

How to write dataframe to csv file with sheetname using spark scala

I am trying to write the dataframe to csv file with option sheetName but it is not working for me.
df13.coalesce(1).write.option("delimiter",",").mode(SaveMode.Overwrite).option("sheetName","Info").option("header","true").option("escape","").option("quote","").csv("path")
Can anyone help me with that?
I don't think a CSV file has a sheet name at all; in effect, the file name plays the role of the sheet name. Can you try writing to Excel format instead?
Spark can't do this directly when writing CSV. There is no sheetName option, and the output location is the path you pass to .csv("path").
Spark writes through Hadoop's file output format, which produces multiple part files under the output path (one part file in your case, because of coalesce(1)). Also, do not reduce the data to a single partition unless you really need to.
One thing you can do is write the dataframe without repartitioning and then use the Hadoop API to merge the many small part files into a single file.
There is more detail in: Write single CSV file using spark-csv
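As an illustration of the merge step, here is a small Python sketch that concatenates the part files on a local file system. It is a stand-in for the Hadoop API approach mentioned above, not that API. The function name merge_part_files is hypothetical, and the sketch assumes every part file starts with the same one-line header (as happens with option("header","true")) and that the header contains no embedded line breaks.

```python
from pathlib import Path


def merge_part_files(part_dir: Path, merged: Path, has_header: bool = True) -> int:
    """Concatenate part-* files into one CSV and return the number of lines written."""
    # "part-*" skips Spark's marker and checksum files such as _SUCCESS and .crc files.
    parts = sorted(p for p in part_dir.glob("part-*") if p.is_file())
    lines_written = 0
    header_written = False
    with merged.open("w", encoding="utf-8", newline="") as out:
        for part in parts:
            with part.open("r", encoding="utf-8", newline="") as src:
                for line_no, line in enumerate(src):
                    if has_header and line_no == 0:
                        if header_written:
                            continue  # drop repeated headers from later parts
                        header_written = True
                    out.write(line)
                    lines_written += 1
    return lines_written
```

Merging after the write lets Spark keep its parallelism during the job, which is the reason given above for avoiding a single partition.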
A CSV file can hold only one (default) sheet. If we want multiple sheets, we should write the dataframe to Excel format instead of CSV.

How to create format files using bcp from flat files

I want to use a format file to help import a comma-delimited file using bulk insert. I want to know how to generate a format file from a flat file source. The Microsoft guidance on this subject makes it seem as though you can only generate a format file from a SQL table, but I want the tool to look at a text file and tell me what the delimiters in that file are.
Surely this is possible.
Thanks
The format file can, and usually does, include more than just delimiters. It also frequently includes column data types, which is why it can only be generated automatically from the table or view that the data is loaded into or retrieved from.
If you need to find the delimiters in a flat file, I'm sure there are a number of ways to write a script that does that, and that also creates a format file.
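One possible shape for such a script is sketched below in Python, using the standard library's csv.Sniffer to guess the delimiter from a sample of the file. The function name detect_delimiter and the candidate list are assumptions made for illustration. The sketch only detects the delimiter; column data types would still have to come from the target table, as noted above.

```python
import csv
from pathlib import Path


def detect_delimiter(
    path: Path, candidates: str = ",;\t|", sample_chars: int = 64 * 1024
) -> str:
    """Guess the field delimiter of a flat file from its first characters."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        sample = handle.read(sample_chars)
    # Sniffer raises csv.Error when no consistent delimiter can be found.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter
```

Restricting the candidates to a few plausible characters makes the guess less likely to land on a letter or digit that happens to occur the same number of times on every line.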

How to extract output from a unix script to .xls/.xlsx file

Previously I extracted the output of my unix sql script to a .csv file, but that seems to cause an issue. The master script should be able to cleanly scan these spreadsheets and append them into one table, but the .csv file is getting in the way.
When I extracted the output from SQL Developer to an XLS or XLSX file there were no issues.
Is there any way that I can extract it in the same format as SQL Developer does?
Yes, it's true that sqlplus cannot extract the data as an Excel spreadsheet. But there is a workaround: if you open the csv file and save it as xls/xlsx, any ETL tool can read it as the expected file type.

Creating metadata dynamically from a flat .csv file in CC

I am having some difficulty working out how to create metadata dynamically, extracted from the header line of a flat .csv file in CC.
Usually, I define the metadata manually by selecting New Metadata --> Extract from flat file in CC. However, the metadata of the file may change when columns are added. So I do not know the metadata of the file in advance and cannot define it with this static approach.
It would be helpful if you could suggest a way to create the metadata dynamically and then use the newly created metadata to connect to other components. An example graph file for demonstration would be great.
Thanks,
Andy
I have come up with the following solution.
You just have to enter the flat .csv file name in the CSV readers and writers.
MetaDataMaster.grf - runs the graphs below.
MetaDataCreator.grf - creates metadata from the csv header and writes it into the meta_example.fmt file.
MetaDataUser.grf - reads the csv according to the generated meta_example.fmt file. You can add a reformat there and use just some predefined fields.
You can run the 2nd and 3rd graphs separately to test them.
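The underlying idea (derive the field list from the header at run time, then read the data with it) can be sketched in Python as follows. This is not the content of the .grf graphs and it does not produce a .fmt file. The function names fields_from_header and read_records are hypothetical, and every field is treated as a string.

```python
import csv
import re
from pathlib import Path
from typing import Dict, Iterator, List


def fields_from_header(csv_path: Path, delimiter: str = ",") -> List[str]:
    """Build the field list from the first line of the file."""
    with csv_path.open("r", encoding="utf-8", newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    # Normalise names so they are safe to use as field identifiers.
    return [re.sub(r"\W+", "_", name.strip()) for name in header]


def read_records(
    csv_path: Path, fields: List[str], delimiter: str = ","
) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the dynamically created fields."""
    with csv_path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader)  # skip the header line already used for the metadata
        for row in reader:
            yield dict(zip(fields, row))
```

Because the field list is rebuilt on every run, a file that gains extra columns needs no change to the reading code, which is the situation described in the question.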

Read multiple xlsx files in a folder and enter a foreach loop in SSIS

I am new to SSIS and would like to know how to read multiple XLSX files from a folder and process them in a foreach loop so that they can be inserted into the database.
I was reviewing these examples:
SSIS reading multiple xml files from folder
Read files from multiple Folders using SSIS?
But I would like to have more details about the foreach loop or if there is another way to upload data files to the database.
Any suggestions are welcome.
If I understood this correctly, you want to know how to work with the Foreach Loop.
In the Foreach Loop, select the Foreach File Enumerator as the enumerator, set Folder to the folder where the xlsx files exist, and set Files to *.xlsx so that it picks up every xlsx file in the folder. Then, in Variable Mappings, assign the value to a variable, let's say varloaction. Finally, go to Expressions on the Excel connection manager and use the varloaction variable to build the connection string.
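To make the flow concrete, here is a Python sketch of what the loop does on each iteration: pick the next file matching *.xlsx and build a connection string from its path. It is an illustration only, not SSIS expression syntax. The function names are hypothetical, and the provider and Extended Properties values are typical ACE OLE DB settings assumed here, so check them against the provider installed on your machine.

```python
from pathlib import Path
from typing import Iterator, Tuple


def excel_connection_string(xlsx_path: Path) -> str:
    """Build an OLE DB connection string for one workbook (assumed ACE provider)."""
    return (
        "Provider=Microsoft.ACE.OLEDB.12.0;"
        f"Data Source={xlsx_path};"
        'Extended Properties="Excel 12.0 Xml;HDR=YES";'
    )


def enumerate_workbooks(
    folder: Path, pattern: str = "*.xlsx"
) -> Iterator[Tuple[Path, str]]:
    """Yield (file path, connection string) pairs, one per loop iteration."""
    for path in sorted(folder.glob(pattern)):
        # The path plays the role of the mapped variable in the Foreach Loop.
        yield path, excel_connection_string(path)
```

In the package itself, the same string would be assembled in the connection manager's ConnectionString expression from the mapped variable.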
You need to set up a Foreach Loop in the SSIS package to read multiple files that share the same format.
Also, if the files are saved in different folders, you need to create a separate connection for each folder.
Refer:
For reading multiple XLS files to SQL Server tables:
https://www.mssqltips.com/sqlservertip/4165/how-to-read-data-from-multiple-excel-files-with-sql-server-integration-services/
For reading multiple XLSX files to SQL Server tables:
http://www.techbrothersit.com/2013/12/ssis-read-multiple-sheets-from-excel.html