PDI: I would like to add an extra line to a CSV I'm creating - Pentaho

To keep it simple: I have a table in a database that I'm using as the input in PDI, and I want to output it as a CSV, but I would like to add a line before the data.
For example, this is what it would look like when you open the CSV:
i'm idiot
id;xxxxx;xx;xx;ddd;d
1;ddd;ddd;ddd;ddd;d
2;ds;dd;ss;dss;s
I'm using Pentaho.

I think the easiest way to achieve this would be to use two transformations: the first one creates the file with the extra line, and the second one uses the Text file output step to add the data to the file created in the first transformation. The Text file output step has an Append option you can check, so the new information is added to the existing file.
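Outside of PDI, the idea boils down to "write the extra line first, then append the rows". A minimal Python sketch of that pattern, using a made-up file name and the sample data from the question:

import csv

extra_line = "i'm idiot"  # the line to put before the data (from the example above)
rows = [
    ["id", "xxxxx", "xx", "xx", "ddd", "d"],
    ["1", "ddd", "ddd", "ddd", "ddd", "d"],
    ["2", "ds", "dd", "ss", "dss", "s"],
]

# Step 1 (first transformation): create the file containing only the extra line.
with open("output.csv", "w", newline="") as f:
    f.write(extra_line + "\n")

# Step 2 (second transformation): append the data, like Text file output with Append checked.
with open("output.csv", "a", newline="") as f:
    csv.writer(f, delimiter=";").writerows(rows)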

Related

Azure Data Factory 2: How to split a file into multiple output files

I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
Within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use a Data Flow: use the Derived Column transformation to create a filename column, and use that filename column in the sink. Details on how to implement dynamic filenames in ADF are described here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line.
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I were tackling this problem, I would create a Logic App with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.
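To make that last part concrete, the Web activity would POST the line text and the target file name to the Logic App's HTTP trigger. Here is a rough Python sketch of the payload shape only; the endpoint URL and field names are made-up placeholders, not an actual API:

import requests  # stands in for ADF's Web activity, just to show the payload

# Hypothetical "When a HTTP request is received" trigger URL from the Logic App
logic_app_url = "https://example-region.logic.azure.com/workflows/your-workflow-id/triggers/manual/paths/invoke"

payload = {
    # in ADF these would be expressions, e.g. @{item().Prop_0} and a dynamically built name
    "fileName": "record_001.txt",
    "content": "the single line of text taken from item().Prop_0",
}

resp = requests.post(logic_app_url, json=payload, timeout=30)
resp.raise_for_status()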

How to create format files using bcp from flat files

I want to use a format file to help import a comma-delimited file using BULK INSERT. I want to know how you generate format files from a flat file source. The Microsoft guidance on this subject makes it seem as though you can only generate a format file from a SQL table. But I want it to look at a text file and tell me what the delimiters are in that file.
Surely this is possible.
Thanks
The format file can, and usually does, include more than just delimiters. It also frequently includes column data types, which is why it can only be automatically generated from the table or view the data is being retrieved from.
If you need to find the delimiters in a flat file, I'm sure there are a number of ways to write a script that could accomplish that, as well as create a format file.
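For the scripting route, Python's csv.Sniffer can guess the delimiter from a sample of the flat file. This is a sketch only; the file name is a placeholder, and the sniffer can be fooled by quoted or irregular data:

import csv

# Hypothetical input file; adjust the path and the candidate delimiters as needed.
with open("data_file.txt", newline="") as f:
    sample = f.read(8192)

dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
print("Detected field delimiter:", repr(dialect.delimiter))

# Knowing the delimiter, you could then emit one bcp-style format line per column,
# but the data types would still have to come from the target table.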

How to add headers when we DUMP the data to an output file using Pig scripts?

I tried to search for this but cannot find any tips/recommendations.
Here is my situation. I have all the data lined up correctly and the output working fine using a Pig script, stored in an output directory. There are more than 100 output files, so what I have done is accumulate them into a single results file using another Pig script.
I was wondering if there is anything in Pig Latin that will help me add a header to the accumulated results file, so that business users can quickly use it since it then also has headers?
Please advise
If you are using DUMP in a Pig script and redirecting the result to a single file, you can use DESCRIBE before DUMP. Doing so will add the schema information as a header to your output file:
A = LOAD 'test' USING PigStorage() AS (col1:int, col2:chararray);
DESCRIBE A;
DUMP A;
The output will be something like:
A: {col1: int,col2: chararray}
1,test
2,test
...
Pig can store the schema in a separate file, ".pig_schema", using PigStorage:
store A into 'outputFile' using PigStorage('\t', '-schema');
This will save your data in outputFile using tabs as delimiters and also create the schema file.
You can store the header in a separate file, LOAD it and UNION it with your data. Then you need to do an ORDER BY (that might be tricky depending on your data).
Another way would be to use hadoop fs -getmerge.
In general, this is not something Pig is very good at; you might as well write a script in another language, like the sketch below.
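For the script-in-another-language route, here is a small Python sketch that prepends a header line to the merged result file. The file names and the header text are assumptions for illustration:

# Prepend a header line to the merged Pig output.
# "part_merged.txt" and the column names are placeholders; adjust to your data.
header = "col1\tcol2"

with open("part_merged.txt") as src, open("results_with_header.txt", "w") as dst:
    dst.write(header + "\n")
    for line in src:
        dst.write(line)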

Dynamically populate external tables location

I'm trying to use Oracle external tables to load flat files into a database, but I'm having a bit of an issue with the LOCATION clause. The files we receive have several pieces of information appended to their names, including the date, so I was hoping to use wildcards in the LOCATION clause, but it doesn't look like I'm able to.
I think I'm right in assuming I'm unable to use wildcards; does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify the input directory, file mask, external table, etc. When a file is found in the directory, the shell script generates a list of files matching the file mask. For each file found, it issues an ALTER TABLE command to change the location of the given external table to that file and launches the rest of the PL/SQL associated with that file. This is repeated for each file found with the file mask. I guess the benefit of this is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end, which appears to be the only way.
I have a file watcher that looks for files in a given input dir with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an ALTER TABLE on the external table with the list of new file names.
For me this wasn't much of an issue as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone searching for ages for a solution.
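As a rough illustration of the "find matching files, then re-point the external table" step, here is a Python sketch that expands a file mask and builds the ALTER TABLE statement. The directory, mask and table name are placeholders, and actually executing the statement against the database is left out:

import glob
import os

# Placeholder values that would come from the parameter table
input_dir = "/data/incoming"
file_mask = "customers_*.csv"
external_table = "customers_ext"

files = sorted(glob.glob(os.path.join(input_dir, file_mask)))
if files:
    # Oracle external tables accept a list of file names in the LOCATION clause
    location_list = ", ".join("'{}'".format(os.path.basename(f)) for f in files)
    alter_sql = "ALTER TABLE {} LOCATION ({})".format(external_table, location_list)
    print(alter_sql)  # hand this to SQL*Plus or the PL/SQL driver from the shell script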

Creating metadata dynamically from a flat .csv file in CC

I am having some difficulties with dynamically creating metadata that needs to be extracted from the header line of a flat .csv file in CC.
Usually, I manually define the metadata by selecting New Metadata --> Extract from flat file in CC. However, the metadata of the file may change with additional columns. Thus, I do not know the metadata of the file in advance and cannot define it with this static approach.
It would be helpful if you could suggest a solution to create metadata dynamically and use this newly created metadata to connect to other components. Perhaps an example graph file for demonstration would be great.
Thanks,
Andy
I have come up with this kind of solution.
You just have to fill the flat .csv filename into the CSV readers and writers.
MetaDataMaster.grf - runs the graphs below.
MetaDataCreator.grf - creates metadata according to the csv header and writes it into the meta_example.fmt file
MetaDataUser.grf - reads the csv according to the created meta_example.fmt file - you can add a Reformat there and use just some predefined fields.
You can run the 2nd and 3rd graph separately to test it.
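If you ever need the header-to-metadata step outside of the graph, the core idea is just to read the first line and turn each column name into a field definition. Here is a rough Python sketch; the ";" delimiter, the file names, and the simplified XML layout are assumptions, not the exact metadata format MetaDataCreator.grf writes:

# Read the header of a delimited file and emit a simple field list.
with open("input.csv") as f:
    header = f.readline().strip()

fields = header.split(";")
lines = ['<Record name="dynamic_record" type="delimited" fieldDelimiter=";">']
lines += ['  <Field name="{}" type="string"/>'.format(name) for name in fields]
lines.append("</Record>")

with open("meta_example.fmt", "w") as out:
    out.write("\n".join(lines) + "\n")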