I have an SQL table that holds results for each period (e.g. Jan 2013). These are the steps I follow:
I want to select records from each period.
Put the results into a CSV file.
Then copy the headers and save the CSV file to a text file with a different name.
Then take the text file and gzip that file.
Now count the records that are in the text file and create a counts file (.txt).
Now take the gzip file and the counts file and create a .tar file.
Then create another counts file that points to the .tar file.
I have to do these steps for all the periods that are in that table.
Is there an easier way to do this, like a Perl/Python script or a Batch file or something?
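A minimal Python sketch of that per-period loop, with hypothetical names throughout (a SQLite database results.db and a results table with a period column; swap in your own driver, query, and file-naming convention):

import csv
import gzip
import shutil
import sqlite3
import tarfile

conn = sqlite3.connect("results.db")   # hypothetical database and table names
cur = conn.cursor()
periods = [r[0] for r in cur.execute("SELECT DISTINCT period FROM results")]

for period in periods:
    cur.execute("SELECT * FROM results WHERE period = ?", (period,))
    rows = cur.fetchall()
    headers = [d[0] for d in cur.description]

    csv_name = f"{period}.csv"
    txt_name = f"{period}_data.txt"
    gz_name = txt_name + ".gz"
    counts_name = f"{period}_counts.txt"
    tar_name = f"{period}.tar"
    tar_counts_name = f"{period}_tar_counts.txt"

    # Select the period's records and write them (with headers) to a CSV file.
    with open(csv_name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

    # Copy the CSV, headers included, to a text file with a different name.
    shutil.copyfile(csv_name, txt_name)

    # Gzip the text file.
    with open(txt_name, "rb") as src, gzip.open(gz_name, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Write the record count of the text file to a counts file.
    with open(counts_name, "w") as f:
        f.write(f"{txt_name}\t{len(rows)}\n")

    # Bundle the gzip file and the counts file into a .tar archive.
    with tarfile.open(tar_name, "w") as tar:
        tar.add(gz_name)
        tar.add(counts_name)

    # Write a second counts file that points at the .tar file.
    with open(tar_counts_name, "w") as f:
        f.write(f"{tar_name}\t{len(rows)}\n")

conn.close()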
I am new to Pentaho. I am trying to build a transformation that can convert a bunch of .xlsx files to .csv (utf-8).
I tried Get File Names and Text File Output, but that saves a single CSV whose content is the file properties rather than the spreadsheet data.
I also tried Microsoft Excel Input and Microsoft Excel Output and that did not work either.
Any help will be appreciated. TIA!
I have prepared a solution for you and made it fully dynamic. Because of that, the solution is a combination of six transformations and jobs. You only need to define the following two things:
Source folder location
Destination folder location
Everything else works dynamically.
I also learned a lot while building this solution.
Would you like to generate a separate CSV for each Excel file?
It is better to do it like this:
Using the Get File Names component, read the list of Excel files from the folder.
Then call Execute Transformation, and pass the name of the file.
A separate transformation will then run for each file, generating one CSV per Excel file.
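If a plain script is an acceptable alternative to a Pentaho transformation, here is a minimal Python sketch of the same loop (assuming pandas and openpyxl are installed; the folder names are placeholders):

from pathlib import Path
import pandas as pd

src = Path("source_folder")        # placeholder: folder containing the .xlsx files
dst = Path("destination_folder")   # placeholder: folder for the .csv output
dst.mkdir(exist_ok=True)

for xlsx in src.glob("*.xlsx"):
    # Read the first sheet and write it back out as a UTF-8 CSV with the same base name.
    df = pd.read_excel(xlsx)
    df.to_csv(dst / (xlsx.stem + ".csv"), index=False, encoding="utf-8")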
I have 100 csv files to analyse, which I open recursively, analyse the data, and then write the results to an .xlsx file.
I want to write all the results to the same xlsx file, without overwriting the results of the previous files, placing the data one below the other. (I cannot put the results in a DataFrame and append all of them at once due to memory issues.)
Briefly, the idea is as follows:
for file in folder:
open the csv
analyse the data
Results into a DF
Results into Xlsx (starting from the first free row)
any suggestion?
thanks
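One possible sketch with pandas and openpyxl, assuming each per-file analysis produces a small DataFrame (the analyse function and the folder path are placeholders); only the result rows, not the raw CSVs, are kept in memory:

from pathlib import Path
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

def analyse(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the real analysis; returns a small result DataFrame.
    return df.describe().reset_index()

wb = Workbook(write_only=True)   # write-only mode streams rows instead of holding the sheet in memory
ws = wb.create_sheet("results")
wrote_header = False

for csv_path in sorted(Path("folder").rglob("*.csv")):   # "folder" is a placeholder
    df = pd.read_csv(csv_path)   # open the csv
    results = analyse(df)        # analyse the data
    # Append this file's results directly below the previous ones; header only once.
    for row in dataframe_to_rows(results, index=False, header=not wrote_header):
        ws.append(row)
    wrote_header = True

wb.save("results.xlsx")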
My backup table has 3 files: 2 ending with .backup_info and one folder with another folder containing 10 CSV files. What would be the format of the URL that specifies the backup file location?
I'm trying the path below, and every time I get a file-not-found error.
gs://bucket_name/name_of_the_file_which_ended_with_backup_info.info
When you look at the files from your backup in the bucket, the structure should look like this:
Buckets/app-id-999999999-backups
And the filenames should look like:
2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Therefore the path will be:
gs://app-id-999999999-backups/2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Make sure you do not include the word "Buckets"; I am guessing that is the source of the confusion.
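If you want to sanity-check the path before pointing the import at it, a quick sketch with the google-cloud-storage Python client (using the placeholder names from above) is:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("app-id-999999999-backups")
blob = bucket.blob("2017-08-20T02:05:19Z_app-id-999999999_data.json.gz")
print(blob.exists())   # True only if the gs:// path is spelled exactly right (no "Buckets" prefix)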
I have a Parquet file created from a text/.dat file using a Pig script.
Now I would like to know how many records are in the Parquet file without reading the whole file.
Is there any way? Does the Parquet file store the number of rows somewhere in its metadata?
Read from the path using parquet.pig.ParquetLoader. The Parquet file can then be treated like a normal relation, and you can count the records.
LOGS = LOAD '/X/Y/abc.parquet' USING parquet.pig.ParquetLoader ;
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);
dump LOG_COUNT;
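For what it's worth, the Parquet footer does store the row count, so if Pig is not a requirement, a small pyarrow sketch (assuming pyarrow is installed) can read it without scanning the data pages:

import pyarrow.parquet as pq

# num_rows is taken from the file's footer metadata, so the data itself is never read.
print(pq.ParquetFile("/X/Y/abc.parquet").metadata.num_rows)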
Previously I extracted the output of my Unix SQL script to a .csv file, but that seems to cause an issue. The master script should be able to cleanly scan and append these spreadsheets into one table, but the .csv file breaks that step.
When I extracted the output from SQL developer to an XLS or XLSX file there were no issues.
Is there any way I can extract it in the same format that SQL Developer does?
Yes, it's true that sqlplus cannot export the data as an Excel spreadsheet. But there is a workaround: if you open that CSV file and save it as .xls/.xlsx, then any ETL tool can read the file in the expected format.
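If the resave should be automated instead of done by hand in Excel, a minimal Python sketch (assuming pandas and openpyxl are installed; the file names are placeholders) would be:

import pandas as pd

# Re-save the CSV extract as .xlsx so the ETL tool sees the format it expects.
df = pd.read_csv("extract.csv")
df.to_excel("extract.xlsx", index=False)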