I'm working with Pentaho Data Integration (Kettle) and I have a question.
I have two input files file1.txt and file2.txt with the same header:
file1.txt
NAME;AGE
alberto;22
angela;22
madelaine;23
file2.txt
NAME;AGE
carlos;56
fernando;30
ana;16
and I want to merge both files into one, files_together.txt
NAME;AGE
alberto;22
angela;22
madelaine;23
carlos;56
fernando;30
ana;16
I've tried everything (I think) and I still don't know how to do it. I've searched Google, YouTube... with no luck.
Thank you very much.
Answer: just connect the output of each file you want to merge to the input of the final one.
I personally found the "Append Stream" step more useful because it keeps the streams together. When you point two inputs into one output, both inputs run in parallel, so the rows may arrive interleaved, depending on various factors. With "Append Stream" you get the rows from file1 followed by the rows from file2 in the output.
You must use a "Select Values" step. The field names must be the same in both streams.
I was trying something similar with .csv files. I tried doing what you suggested, but it didn't work for me. Many other blogs said "It would be better to use Excel scripting than to employ Pentaho Data Integration (Kettle) for this," which is not true.
You can use the "Append Stream" step, which is under the Flow category of a transformation. It takes two inputs, merges them, and gives you the expected merged file. You can also chain this step to merge more than two files.
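For readers who want to check the expected result outside of Kettle, here is a minimal Python sketch of the same ordering the "Append Stream" answers describe: all rows of the first file, then all rows of the second, with the shared header written once. It is not a Kettle transformation; the function name `append_files` is made up for illustration, and it assumes every input starts with the same header line.

```python
def append_files(inputs, output):
    """Write the header once, then the data rows of each input, in order.

    Assumption: every input file starts with the same header line.
    """
    header_written = False
    with open(output, "w", encoding="utf-8", newline="") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                lines = src.read().splitlines()
            if not lines:
                continue  # skip empty files
            header, rows = lines[0], lines[1:]
            if not header_written:
                out.write(header + "\n")
                header_written = True
            for row in rows:
                out.write(row + "\n")
```

Calling `append_files(["file1.txt", "file2.txt"], "files_together.txt")` should reproduce the layout shown in the question, because the files are processed one after another rather than in parallel.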
I have an extremely large CSV, where each row contains customer and store IDs, along with transaction information. The current test file is around 40 GB (about 2 days' worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: When we receive a file, it contains data for multiple stores. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was:
// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on, for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts:
The first script uses GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and writes them to a file.
Then, using scripting or a tool written with our SDKs, download that output file and programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber.
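The "programmatically create a second U-SQL script" step could look roughly like the Python sketch below. It only builds text; it does not download the pairs file or submit a job. The function name `build_output_script` and the `@partN` rowset names are made up for illustration. It also assumes that the pairs file is a headerless two-column CSV, that both IDs are string columns (hence the quoted comparisons), and that the generated script already defines `@dt` with its own EXTRACT statement, which is not generated here.

```python
import csv
import io


def build_output_script(pairs_csv_text, rowset="@dt"):
    """Return U-SQL text with one explicit OUTPUT per (customer, store) pair.

    Assumptions: headerless two-column CSV input, string-typed ID columns,
    and a rowset named `rowset` defined earlier in the final script.
    """
    statements = []
    for row in csv.reader(io.StringIO(pairs_csv_text)):
        if not row:
            continue  # ignore blank lines
        customer, store = row[0], row[1]
        subset = f"@part{len(statements)}"  # hypothetical rowset name
        statements.append(
            f"{subset} =\n"
            f"    SELECT *\n"
            f"    FROM {rowset}\n"
            f'    WHERE CustomerNumber == "{customer}" AND StoreNumber == "{store}";\n'
            f"\n"
            f"OUTPUT {subset}\n"
            f'TO "/Data/{customer}/{store}/PosData.csv"\n'
            f"USING Outputters.Csv();\n"
        )
    return "\n".join(statements)
```

Because every output path is now a literal in the script text, the compiler sees all output files up front, which is the constraint described above. If the ID columns are numeric rather than strings, the quoted comparisons in the WHERE clause would need to change accordingly.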
I am new to Pentaho and need some more information from the experts. I have a transformation where I need to read a file and then change its data. The problem is that the number of columns in the file can change every time (this month the file might have 10 columns, next month 12). I have read that I must use a "Metadata Injection" step for such cases, but I cannot find a good example of it. Can someone explain, or show a simple example of, how Metadata Injection generally works?
Thank you for your help.
I'll describe my scenario so you guys understand what type of design pattern I'm looking for.
I'm making an application where I provide someone with a link that is associated with one or more files. For example, someone needs somePowerpoint.ppx, main.cpp and somevid.mp4, and I have a tool that makes kj13h1djdsja213j1hhadad9933932 associated with those 3 files so that I can give someone
mysite.com/getfiles?fid=kj13h1djdsja213j1hhadad9933932
and they'll get a list of those files that they can download individually or all at once.
Since I'm new to SQL, the only way I know of doing that is having my tool use a table like
fid                            | filename
-------------------------------+-----------------------------
kj13h1djdsja213j1hhadad9933932 | somePowerpoint.ppx
kj13h1djdsja213j1hhadad9933932 | main.cpp
kj13h1djdsja213j1hhadad9933932 | somevid.mp4
jj133823u22h248884h4h24h01h232 | someotherfile.someextension
to go along with the above example. It would be nice if I could do some equivalent of
fid                            | filename(s)
-------------------------------+------------------------------------------
kj13h1djdsja213j1hhadad9933932 | somePowerpoint.ppx, main.cpp, somevid.mp4
jj133823u22h248884h4h24h01h232 | someotherfile.someextension
but I'm not sure if that's possible or if I should be using some other design pattern altogether.
Any advice?
I believe the question "Concatenate many rows into a single text string?" can help give you a query that generates your condensed format. You'd still want to store the data in SQL as the full list (one row per file), but you could create a view that shows the condensed version using the query from that link.
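As a rough illustration of that advice (one row per file in the table, plus a view for the condensed form), here is a small Python sketch using the standard-library `sqlite3` module. SQLite and its `GROUP_CONCAT` aggregate are stand-ins chosen so the example runs anywhere; the linked question covers the equivalent query for other databases. The table name `files`, the view name `files_condensed`, and the function name are made up for illustration.

```python
import sqlite3


def condensed_listing(rows):
    """Store one (fid, filename) row per file and read back a condensed view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (fid TEXT NOT NULL, filename TEXT NOT NULL)")
    conn.executemany("INSERT INTO files (fid, filename) VALUES (?, ?)", rows)
    # The view aggregates the filenames per fid; the base table stays normalized.
    conn.execute(
        "CREATE VIEW files_condensed AS "
        "SELECT fid, GROUP_CONCAT(filename, ', ') AS filenames "
        "FROM files GROUP BY fid"
    )
    result = dict(conn.execute("SELECT fid, filenames FROM files_condensed"))
    conn.close()
    return result
```

Keeping the base table at one row per file means adding or removing a single file is a plain INSERT or DELETE, while the view gives the comma-separated display form on demand. Note that SQLite does not guarantee the order of the concatenated names.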
How many different kinds of steps in Pentaho can accept more than one input stream, such as "Merge Join" or "Stream Lookup"?
What are the typical use scenarios for them?
Can any script-related steps accept more than one stream as input, like JavaScript or UDJC? E.g., using one stream as the data source and another as the filter condition?
Thank you all.
As far as I know: all the steps under "Joins" and "Lookup". Joins work just like a table join; a lookup uses one stream as the source dataset and another as a "translation" dictionary.
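To make the "translation dictionary" idea concrete, here is a small Python sketch of the concept. It is not Kettle code: the function name `stream_lookup` and the sample field names are made up, and it only mirrors the general behaviour described above, where one stream is loaded as a dictionary and the other stream is enriched from it row by row.

```python
def stream_lookup(main_rows, lookup_rows, key, value, default=None):
    """Enrich each main row with `value` taken from the lookup rows.

    The lookup stream is read completely into a dictionary first;
    the main stream is then processed one row at a time.
    """
    table = {row[key]: row[value] for row in lookup_rows}
    for row in main_rows:
        yield {**row, value: table.get(row[key], default)}


# Hypothetical sample data: a main stream and a code-to-name dictionary stream.
people = [{"name": "ana", "code": "ES"}, {"name": "carlos", "code": "XX"}]
countries = [{"code": "ES", "country": "Spain"}]
enriched = list(stream_lookup(people, countries, key="code", value="country"))
```

Rows without a match keep flowing and simply receive the default value, which is the main practical difference from an inner join, where unmatched rows would be dropped.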
Answers to the 3 questions are below:
All the steps available in the "Joins" and "Lookup" sections accept two streams (I haven't tried with 3 streams). Some filter steps, like Java Filter, also accept more than one stream.
The typical use scenario is to take data from one or more streams and apply your business logic to it. There is no specific example I can explain at the moment.
As far as I know, you cannot use more than one stream in the JavaScript step. You might get an error like:
I am trying to stream two columns of different names. Input 1 has column "a" and Input 2 has column "b".
You can avoid this error by giving the columns of both input streams the same name.
Hope this helps :)
I am looking for a direct and efficient method to read CSV files and work with the data conveniently in Excel/VBA.
The best thing would be direct access to the data by specifying row and column. Can you tell me your preferred option? Do you know of any option in addition to the following two?
A: Use Workbooks.Open or Workbooks.OpenText to open the csv-file as a workbook. Then work with the workbook (compare this thread).
B: Use Open strFilename For Input As #1 to write the data into a string. Work with the string (compare this thread).
Thanks a lot!
==========EDIT=========
Let me add what I have learned from your posts so far: the optimal way to do the task depends too much on what exactly you want to do, so no single answer is possible. Also, there are the following additional options for reading CSV files:
C: Use a VBScript-type language with ADO (SQL-type statements). I am still figuring out how to create a minimal example that works.
D: Use FileSystemObject, see e.g. this thread
The fastest and most efficient way to add CSV data to Excel is to use Excel's Text Import Wizard.
It parses the CSV file and gives you several options to format and organize the data.
Typically, when people write their own CSV parser, they ignore the odd syntax cases, which later forces them to rework the parsing code. Using the Excel wizard covers these cases and gives you some other bonuses (like formatting options).
To load a CSV file (in Excel 2007/2010), go to the "Data" tab and pick "From Text" to start the Text Import Wizard. Note that the default delimiter is tab, so you'll need to change it to comma (or whatever character your file uses) in step 2.
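To show what one of the "odd syntax cases" mentioned above looks like, here is a short Python sketch (not VBA, and not part of the wizard) that compares a naive split on commas with a real CSV parser on a quoted field that contains a comma. The sample data and the function name `read_table` are made up for illustration.

```python
import csv
import io


def read_table(text, delimiter=","):
    """Parse CSV text into a list of rows, honouring quoted fields."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# Hypothetical sample: the second field of the data row contains a comma.
sample = 'name,comment\nana,"likes a, b"\n'

naive = sample.splitlines()[1].split(",")  # breaks the quoted field apart
table = read_table(sample)                 # keeps the quoted field whole
cell = table[1][1]                         # direct access by row and column
```

The naive split yields three pieces for a two-column row, while the parser returns two fields, which is the kind of rework a hand-written parser tends to need once real data arrives.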