Recursively write different DFs to an xlsx without overwriting, using Jupyter - pandas

I have 100 csv files to analyse, which I open recursively, analyse, and then I would like to write the results to an xlsx file.
I want to write all the results to the same xlsx file, without overwriting the results of the previous files, putting the data one below the other. (I cannot collect the results in one DF and then append all of them due to memory issues.)
Briefly, the idea is as follows:
for file in folder:
open the csv
analyse the data
Results into a DF
Results into Xlsx (starting from the first free row)
Any suggestions?
Thanks
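A minimal sketch of one way to do this, assuming openpyxl is installed and pandas >= 1.4 (for if_sheet_exists="overlay"); the output name results.xlsx, the folder pattern, the sheet name and the describe() call are placeholders for the real analysis. The idea is to keep track of the first free row and pass it as startrow to to_excel:

    import glob
    import os

    import pandas as pd

    out_path = "results.xlsx"   # hypothetical output file
    next_row = 0                # first free row in the "Results" sheet

    for csv_path in sorted(glob.glob("folder/*.csv")):   # hypothetical input folder
        df = pd.read_csv(csv_path)
        results = df.describe()   # stand-in for the real analysis

        if not os.path.exists(out_path):
            # first file: create the workbook
            with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
                results.to_excel(writer, sheet_name="Results", startrow=next_row)
        else:
            # later files: reopen in append mode and overlay the existing sheet,
            # writing at the first free row so nothing already written is lost
            with pd.ExcelWriter(out_path, engine="openpyxl", mode="a",
                                if_sheet_exists="overlay") as writer:
                results.to_excel(writer, sheet_name="Results", startrow=next_row)

        # each block takes len(results) data rows plus one header row
        next_row += len(results) + 1

Only one results DF is in memory at a time, so the 100 files never have to be concatenated.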

Related

Convert bulk .xlsx files to .csv (UTF-8) in Pentaho

I am new to Pentaho. I am trying to build a transformation that can convert a bunch of .xlsx files to .csv (utf-8).
I tried Get File Names and Text File Output, but it saves a single file as csv and the content of that file is the file properties.
I also tried Microsoft Excel Input and Microsoft Excel Output and that did not work either.
Any help will be appreciated. TIA!
I have prepared a solution for you. I have made the solution fully dynamic; for that reason, it is a combination of 6 pieces (transformations & jobs). You only need to define the following 2 things:
Source folder location
Destination folder location
Everything else works dynamically.
Also, I have learned a lot from building this solution.
Would you like to generate a separate CSV for each Excel file?
It is better to do it like this:
Using the Get File Names component, read the list of Excel files from the folder.
Then call Execute Transformation and pass the name of the file.
A separate Transformation will then be performed for each file, and a separate CSV will be generated for each Excel file.
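Outside Pentaho, the same per-file loop can be sketched in a few lines of pandas to make the intended flow concrete; the folder names below are placeholders for the source and destination locations, openpyxl is assumed to be installed, and read_excel only picks up the first sheet of each workbook:

    import glob
    import os

    import pandas as pd

    src_dir = "source_folder"        # placeholder: source folder location
    dst_dir = "destination_folder"   # placeholder: destination folder location
    os.makedirs(dst_dir, exist_ok=True)

    for xlsx_path in glob.glob(os.path.join(src_dir, "*.xlsx")):
        df = pd.read_excel(xlsx_path)   # first sheet only by default
        base = os.path.splitext(os.path.basename(xlsx_path))[0]
        # one CSV per Excel file, written as UTF-8
        df.to_csv(os.path.join(dst_dir, base + ".csv"), index=False, encoding="utf-8")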

How to write dataframe to csv file with sheetname using spark scala

I am trying to write the dataframe to a csv file with the option sheetName, but it is not working for me.
df13.coalesce(1).write
  .option("delimiter", ",")
  .mode(SaveMode.Overwrite)
  .option("sheetName", "Info")
  .option("header", "true")
  .option("escape", "")
  .option("quote", "")
  .csv("path")
Can anyone help me with that?
I don't think a CSV file actually has a sheet name; effectively, the filename is the sheet name of a CSV file. Can you try changing to Excel instead?
Spark can't do this directly while writing as a csv; there is no sheetName option. The output path is the path you pass to .csv("path").
Spark uses Hadoop's file format, which is partitioned into multiple part files under the output path (1 part file in your case). Also, do not repartition to 1 unless you really need it.
One thing you can do is write the dataframe without repartitioning and use the Hadoop API to merge those many small part files into a single one.
Here is more detail: Write single CSV file using spark-csv
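As a rough illustration of that merge step (a sketch for part files landing on the local filesystem; the paths are placeholders, and on HDFS you would use the Hadoop FileSystem API instead):

    import glob
    import shutil

    # After df.write.csv("output_dir") has produced part-*.csv files, concatenate
    # them into one CSV. If the write used header=true, every part file repeats
    # the header, and you would need to skip it for all but the first part.
    part_files = sorted(glob.glob("output_dir/part-*.csv"))
    with open("merged.csv", "wb") as merged:
        for part in part_files:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)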
We can only have the 1 default sheet in a csv file; if we want multiple sheets, then we should write the dataframe to Excel format instead of csv.
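For example, with pandas (placeholder data, openpyxl assumed; a Spark dataframe would first have to be brought to the driver, e.g. with toPandas() in PySpark, or written with a Spark Excel connector), several dataframes can go into one workbook as separately named sheets, which a CSV cannot represent:

    import pandas as pd

    # placeholder data for illustration
    df_info = pd.DataFrame({"col": [1, 2, 3]})
    df_summary = pd.DataFrame({"col": [4, 5]})

    # each call writes a separately named sheet into the same workbook
    with pd.ExcelWriter("report.xlsx") as writer:
        df_info.to_excel(writer, sheet_name="Info", index=False)
        df_summary.to_excel(writer, sheet_name="Summary", index=False)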

How to extract output from a unix script to .xls/.xlsx file

Previously I extracted the output from my unix SQL script to a .csv file, but it seems to cause an issue. The master script should be able to cleanly scan and append these spreadsheets into one table, but the .csv file is creating an issue.
When I extracted the output from SQL Developer to an XLS or XLSX file, there were no issues.
Is there any way that I can extract it in the same format as SQL Developer does?
Yes, it's true that sqlplus cannot export the data as an Excel spreadsheet. But I have a workaround: if you open that csv file and save it as xls/xlsx, then any ETL tool can read the file in the expected format.
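If you would rather script that open-and-save-as step instead of doing it by hand, a small pandas sketch could look like this (the file names are placeholders, and pandas plus openpyxl are assumed to be installed):

    import pandas as pd

    # placeholder file names
    df = pd.read_csv("report.csv")
    df.to_excel("report.xlsx", index=False, sheet_name="Sheet1")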

Kettle - Read multiple files from a folder

I'm trying to read multiple XML files from a folder, to compile all the data they have (all of them have the same XML structure), and then save that data in a CSV file.
I already have a 'read-files' Transformation with the steps Get File Names and Copy Rows to Result, to get all the XML files. (It's working - I print a file with all the file names.)
Then I enter a 'for-each-file' Job which has a Transformation with the Get Rows from Result step, and then another Job to process those files.
I think I'm losing information between the 'read-files' Transformation and the Transformation in the 'for-each-file' Job which gets all the rows. (I print another file with all the file names, but it is empty.)
Can you tell me if I'm thinking about this the right way? Do I have to set some variables, or is some option disabled? Thanks.
Here is an example of "How to process a Kettle transformation once per filename"
http://www.timbert.net/doku.php?id=techie:kettle:jobs:processtransonceperfile
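Purely as a comparison for what the Job is meant to produce, the compile-all-XML-into-one-CSV result can also be sketched in pandas (folder and file names are placeholders; assumes pandas >= 1.3 with lxml installed, and the xpath argument may need adjusting to the real XML structure):

    import glob

    import pandas as pd

    # read every XML file in the folder; all files share the same structure
    frames = [pd.read_xml(path) for path in sorted(glob.glob("xml_folder/*.xml"))]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("combined.csv", index=False)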

Looking for efficient methods of loading large excel (xlsx) files into SQL

I'm looking for alternate data import solutions. Currently my process is as follows:
Open a large xlsx file in excel
Replace all "|" (pipes) with a space or another unique character
Save the file as pipe-delimited CSV
Use the import wizard in SQL Server Management Studio 2008 R2 to import the CSV file
The process works; however, steps 1-3 take a long time since the files being loaded are extremely large (approx. 1 million records).
Based on some research, I've found a few potential solutions:
a) Bulk Import - This unfortunately does not eliminate steps 1-3 mentioned above since the files need to be converted to a flat (or CSV) format
b) OpenRowSet/OpenDataSource - There are 2 issues with this approach. First, it takes a long time to load (about 2 hours for a million records). Second, when I try to load many files at once (about 20 files, each containing 1 million records), I receive an "out-of-memory" error
I haven't tried SSIS; I've heard it has issues with large xlsx files
So this leads to my question. Are there any solutions/alternate options out there that will make importing of large excel files faster?
Really appreciate the help.
I love Excel as a data visualization tool, but it's pants as a data transport layer. My preference is to either query it with the JET/ACE driver or use C# for non-tabular data.
I haven't cranked it up to the millions, but I'd have to believe the first approach would be faster than your current one, simply because you do not have to perform double reads and writes of your data.
Excel Source as Lookup Transformation Connection
script task in SSIS to import excel spreadsheet
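A different route than the ACE driver, but illustrating the same idea of skipping the intermediate CSV, is to read the workbook and bulk-insert it directly; this is only a sketch, and the connection string, file name and table name are placeholders (pandas, SQLAlchemy and a SQL Server ODBC driver are assumed):

    import pandas as pd
    from sqlalchemy import create_engine

    # placeholder DSN-style connection string and table name
    engine = create_engine("mssql+pyodbc://user:password@my_dsn")

    df = pd.read_excel("large_file.xlsx")   # placeholder file name
    # insert in chunks rather than one giant statement
    df.to_sql("staging_table", engine, if_exists="append", index=False, chunksize=10000)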
Something I have done before (and I bring it up because I see your file type is XLSX, not XLS) is to open the file through WinZip, pull the XML data out, then import it. Starting with the 2007 format, an XLSX file is really a zip file with many folders/files in it. If the Excel file is simple (not a lot of macros, charts, formatting, etc.), you can just pull the data from the XML file that sits in the background. I know you can see it through WinZip; I don't know about other compression apps.
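A quick way to peek inside the archive without WinZip is a couple of lines of Python (the file name is a placeholder):

    import zipfile

    # an .xlsx is a ZIP archive: the cell data for the first sheet usually
    # lives in xl/worksheets/sheet1.xml and shared text in xl/sharedStrings.xml
    with zipfile.ZipFile("large_file.xlsx") as zf:
        print(zf.namelist())   # see what is inside
        zf.extract("xl/worksheets/sheet1.xml", path="extracted")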