Without going too deeply into our final product: I have a set of transformations that are chained together via Transformation Executor steps. Transformation 1 has a Transformation Executor step at the end that executes Transformation 2.
I now need to build transformations that handle more than one input stream (e.g. they use an Append Streams step under the covers). While it's easy enough to have each input transformation execute the same transformation in its Transformation Executor step, I'm afraid it won't do the right thing unless I can specify, within the Transformation Executor step, which input stream the current stream should be mapped to. Unfortunately, Pentaho's documentation is very weak (i.e. nonexistent) on this topic.
Is there a way to specify that?
If I'm way off and this really isn't a concern, please clarify that.
Thanks in advance as always for any help you can give me!!
Actually, in ML pipeline components we already specify inputs and outputs clearly.
For example, in TFX, StatisticsGen takes its input from ExampleGen and outputs some statistics, so the input and output are clear, and the same is true for all components. So why do we need orchestrators? If anyone knows, please help me.
In real-life projects, everything can be much more complicated:
the input data can come from different sources: databases, file systems, third-party services. So we need to do classical ETL before we can start working with the data.
you can use different technologies in one pipeline. For instance, Spark as a preprocessing tool, and afterwards you may need an instance with a GPU for model training.
last, but not least - in production you need to take care of many more things, for instance data validation, model evaluation, etc. I wrote a separate article about how to organize this part using Apache Airflow. (A small sketch of where the orchestrator fits in follows below.)
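To make the orchestrator's role concrete, here is a minimal sketch, assuming TFX 1.x with the local orchestrator (the input path, pipeline name, and pipeline root are illustrative). The components only declare how their inputs and outputs connect; the orchestrator is what actually resolves that dependency graph, schedules each step, and tracks the produced artifacts.

    # A minimal sketch, assuming TFX 1.x and the local orchestrator.
    # The input path, pipeline name, and pipeline root are illustrative.
    from tfx import v1 as tfx

    # ExampleGen produces examples; StatisticsGen declares that it consumes them.
    example_gen = tfx.components.CsvExampleGen(input_base="/data/csv")
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])

    pipeline = tfx.dsl.Pipeline(
        pipeline_name="stats_pipeline",
        pipeline_root="/tmp/tfx_root",
        components=[example_gen, statistics_gen],
    )

    # The orchestrator walks the dependency graph, runs each component in order,
    # and records the artifacts; swapping in an Airflow or Kubeflow runner changes
    # where and how the steps execute without changing the components themselves.
    tfx.orchestration.LocalDagRunner().run(pipeline)

Swapping LocalDagRunner for an Airflow or Kubeflow runner is exactly the point of the orchestrator layer: the component wiring stays the same while scheduling, retries, and execution environment change.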
I have a multi-stage ETL transform using pandas. Basically, I load almost 2 GB of data from MongoDB and then apply several functions to the columns. My question is whether there's any way to break those transformations into multiple Airflow tasks.
The options I have considered are:
Creating a temporary collection in MongoDB and loading/storing the transformed data frame between steps. I find this cumbersome and prone to significant overhead due to disk I/O.
Passing data among the tasks using XCom. I think this is a nice solution, but I worry about the sheer size of the data. The docs explicitly state:
Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
Using in-memory storage between steps, maybe saving the data in a Redis server or something, but I'm not really sure that would be any better than just using XCom.
So, does any of you have any tips on how to handle this situation? Thanks!
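For what it's worth, a common variant of options 1 and 2 is to persist the intermediate data to files (e.g. Parquet) and pass only the file path through XCom, so the DataFrame itself never goes through the metadata database. A minimal sketch, assuming Airflow 2.x with the TaskFlow API; the MongoDB read, column functions, and paths below are placeholders:

    # Minimal sketch: each stage persists its result to Parquet and returns only the
    # path; Airflow stores that small string as the XCom value, not the 2 GB frame.
    from datetime import datetime

    import pandas as pd
    from airflow.decorators import dag, task


    @dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
    def mongo_etl():

        @task
        def extract() -> str:
            # Placeholder for the MongoDB load (e.g. via pymongo).
            df = pd.DataFrame({"value": [1, 2, 3]})
            path = "/tmp/extracted.parquet"
            df.to_parquet(path)
            return path  # only this string is pushed to XCom

        @task
        def transform(path: str) -> str:
            df = pd.read_parquet(path)
            df["value"] = df["value"] * 2  # stand-in for the real column functions
            out_path = "/tmp/transformed.parquet"
            df.to_parquet(out_path)
            return out_path

        transform(extract())


    mongo_etl()

The /tmp paths assume all tasks run where they can see the same filesystem; with multiple workers the same idea works with object storage (e.g. S3/GCS) instead of local paths.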
I'm creating a Data Warehouse and I'm using Pentaho Data Integration for the ETL.
I have 2 options:
Put each dimension's ETL in its own transformation (I have to open all the transformation files to run my dimensions). I thought this might be practical if, for example, I want to run only one dimension.
Gather all the dimensions in one transformation. (Is this considered the same as creating a new job in PDI that includes all the transformations? If it is, what is the best way to link the transformations?)
I'm new to all of this, by the way.
Any suggestions?
Thank you.
I am running a custom processor on a rowset that does not seem to run in parallel. The underlying ~1GB text file is first read into a table that is partitioned via round robin. The 'Extract' runs on 200 vertices, but then (under the 'Aggregate' node) the processing [which does various complex computations] happens on only 2 vertices, even though the parallelism parameter is much higher than that. Is there a special hint that needs to be used to tell the compiler to use more vertices? Is there a function or property that needs to be overridden to set the parallelism at this phase as well?
Sorry for the late reply. But it is vacation time :).
It is good to see that the extract phase is fully scaled out.
Without seeing the script or the generated plan it is a bit difficult to say why you only see 2 vertices in some places. There are a couple of reasons why that may be the case:
you don't have enough data to scale out to more.
your aggregation needs more data and thus the plan has less parallelism.
your operation is intrinsically less parallel.
The optimizer's data cardinality estimation is off and it chooses not enough parallelism. We have some ability to hint, but I would rather see the job first.
Note that custom processors often block the optimizer from pushing optimizations through in the script (using the READONLY option, for example, helps) and can throw off the cardinality estimations.
If you send me the script, the job graph and the link to the job to mrys at Microsoft, I and the team will look into it next week after the holidays are over.
Seems like it is supported in Hadoop (reference), but I don't know how to use this.
I want to:
a.) Map - read a huge XML file, load the relevant data, and pass it on to reduce
b.) Reduce - write two .sql files for different tables
The reason I am choosing map/reduce is that I have to do this for over 100k (maybe many more) XML files residing on disk. Any better suggestions are welcome.
Any resources/tutorials explaining how to do this are appreciated.
I am using Python and would like to learn how to achieve this using streaming.
Thank you
It might not be an elegant solution, but you could create two templates to convert the output of the reduce tasks into the required format once the job is complete. Much of this could be automated with a shell script that looks for the reduce outputs and applies the templates to them. With the shell script, though, the transformation happens sequentially and doesn't take advantage of the n machines in the cluster.
Alternatively, in the reduce tasks you could write the two output formats into a single file, separated by some delimiter, and split them later using that delimiter. In this approach, since the transformation happens in the reduce, it is spread across all the nodes in the cluster.
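As a rough illustration of the second idea, here is a minimal Python streaming reducer sketch that tags every emitted line with its target table; the tags, table names, and SQL templates are made up for the example, not taken from the question above:

    #!/usr/bin/env python
    # reducer.py -- emits both kinds of SQL into the single reduce output,
    # prefixing each line with a tab-separated table tag so it can be split later.
    import sys

    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if not key:
            continue
        # Hypothetical SQL templates for two different tables.
        print("TABLE_A\tINSERT INTO table_a VALUES ('%s', '%s');" % (key, value))
        print("TABLE_B\tINSERT INTO table_b VALUES ('%s');" % key)

Once the job finishes, a single pass over the output can separate the two tables, for example hadoop fs -cat output/part-* | grep '^TABLE_A' | cut -f2- > table_a.sql (and likewise for TABLE_B).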