I was just wondering why Pentaho PDI has so many names, such as Spoon, Kettle and Pentaho PDI. What is the real name of this tool?
(I'm talking about the tool that extracts data from a data source, modifies it and migrates it to another location.)
As dirk said, PDI basically started out life like any other independent open source product, and the naming scheme of such products is typically quite jokey. You can clearly see the reason for the original naming: everything relates to things you do in the kitchen. For example, you use a spoon to mix your recipe, and Spoon is the tool you use to build your ETL. So it's actually a pretty useful naming scheme.
Obviously, when Pentaho came along they needed something more professional-sounding, hence PDI (Pentaho Data Integration) was created.
In reality everyone still uses the old names. (Even Pentaho does; look at the JIRA, GitHub and shell script names!)
I'm surprised there are actually both a kettle and a pdi tag on here; they should be merged!
I have been working on a project about data integration, analysis and reporting using Pentaho. The last step was to produce weekly reports with Pentaho Report Designer. The problem is that our data is quite big (about 4M/day), so the reporting platform is too slow and we can't run any other queries against the tables in use until we kill the process. Is there any solution to this? For example, a reporting tool or platform that we could use instead of the Pentaho tool, without having to change the whole thing and start again from the first ETL steps.
I presume you mean 4M records/day.
That’s not particularly big for reporting, but the devil is in the details. You may need to tweak your db structure and analyze the various queries at play.
As you describe it, there isn't much information to go on to give you a more detailed answer.
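Just to make the "tweak your db structure" suggestion a bit more concrete: a common approach at that volume is to pre-aggregate the raw rows into a summary table that the weekly report queries instead of the live fact table. The sketch below is purely illustrative; the table and column names are invented and the exact syntax depends on your database.

    -- Hypothetical example: roll the ~4M daily rows up into a small summary
    -- table so the weekly report no longer competes with the raw fact table.
    CREATE TABLE fact_events_daily AS
    SELECT CAST(event_time AS DATE) AS event_day,
           customer_id,
           COUNT(*)    AS event_count,
           SUM(amount) AS total_amount
    FROM   fact_events
    GROUP BY CAST(event_time AS DATE), customer_id;

    -- Refresh it nightly (e.g. from a PDI job) and index the report's filter column.
    CREATE INDEX idx_events_daily_day ON fact_events_daily (event_day);

The report then selects from the small, indexed summary table, while the heavy aggregation runs off-hours.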
I would like it if anyone could explain to me what makes Apache Pig an ETL tool and what the opposite would be. I understand that ETL means extract, transform and load the data, which Pig does, but so do other platforms like Flink, Spark and R (you get the data, perform some operations and load it somewhere else), and I could not find any information saying those tools are also considered ETL. Maybe I am missing something? Maybe I do not fully understand what ETL means? Thanks.
As you said, an ETL tool is a tool that can be used for extracting, transforming and loading data, and an ETL tool usually offers a UI for visual development, e.g. Informatica or DataStage. I am not sure whether we can include Pig as a 'tool' for ETL purposes, but it can certainly be used for the ETL process. Pig and Hive are client-side libraries for this purpose.
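To make that concrete, here is a minimal, purely illustrative Pig Latin script showing the three ETL phases; the paths, schema and field names are invented.

    -- Extract: read raw sales records from HDFS
    raw = LOAD '/data/sales.csv' USING PigStorage(',')
          AS (id:int, country:chararray, amount:double);

    -- Transform: drop bad rows and aggregate per country
    clean   = FILTER raw BY amount > 0.0;
    grouped = GROUP clean BY country;
    totals  = FOREACH grouped GENERATE group AS country, SUM(clean.amount) AS total;

    -- Load: write the result to its destination (could also be Hive via HCatalog, HBase, etc.)
    STORE totals INTO '/data/sales_by_country' USING PigStorage(',');

The main difference from a classic ETL tool is that this is code rather than a drag-and-drop UI, and it runs as jobs on the cluster rather than inside a dedicated ETL server.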
Drill looks like an interesting tool for ad-hoc drill-down queries, as opposed to the high-latency Hive.
It seems that there should be a decent integration between the two, but I couldn't find one.
Let's assume that today all of my work is done on Hive/Shark; how can I integrate it with Drill?
Do I have to switch back and forth to the Drill engine?
I'm looking for an integration similar to what Shark and Hive have.
Although there are provisions for implementing Drill-Hive integration, your question seems to be a bit ahead of its time. Drill still has a long way to go, and folks have been trying really hard to get all this done as soon as possible.
As per their roadmap, Drill will first support Hadoop FileSystem implementations and HBase. Second, Hadoop-related data formats will be supported (e.g. Apache Avro, RCFile). Third, MapReduce-based tools will be provided to produce column-based formats. Fourth, Drill tables can be registered in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.
See this for more details.
I'm looking for a tool to build a Mondrian cube XML description file to use over my star schema (one fact table and 4 dimension tables linked directly to the fact table).
I'm a bit lost; browsing tutorials doesn't help, since they seem to cover a different version than mine. Mine doesn't include the Star Model perspective; I just have a Model perspective.
My feeling is that this Model perspective is only useful for building a Mondrian cube over a flat model (with just one big fact table). I would be glad if someone could confirm that. But my main question is: how do I build a Star Model description schema (with Pentaho's tools)? If there's a missing plugin, how do I install it?
For the moment you need to download a separate tool called Schema Workbench. Get it from the Mondrian project (not the Pentaho project) on SourceForge.
This tool will be replaced by something new in the next 6 months, but it does what you need and has handy validation built in.
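For reference, what Schema Workbench produces is just Mondrian's schema XML, so for a star schema the result looks roughly like the sketch below (the cube, table and column names here are invented; adapt them to your own star).

    <Schema name="SalesSchema">
      <Cube name="Sales">
        <Table name="fact_sales"/>
        <Dimension name="Customer" foreignKey="customer_id">
          <Hierarchy hasAll="true" primaryKey="customer_id">
            <Table name="dim_customer"/>
            <Level name="Country" column="country" uniqueMembers="false"/>
            <Level name="City" column="city" uniqueMembers="false"/>
          </Hierarchy>
        </Dimension>
        <!-- ...three more <Dimension> blocks, one per dimension table... -->
        <Measure name="Sales Amount" column="amount" aggregator="sum" formatString="#,##0.00"/>
      </Cube>
    </Schema>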
I am currently developing a parameterized report using Pentaho.
Pentaho defines parameters that can be given at the time of generation or inserted by some external source via a Pentaho API.
Now comes the question. I have a scripted data source in Groovy and would like to parameterize it a bit. How can I, or what's the best way to, access the parameters (defined in Pentaho) in a scripted data source?
If you use an SQL data source you can directly write ${ParamName} and the string gets replaced; however, if you use a scripted source this doesn't seem to work.
Any and all comments are more than welcome!
P.S. Sorry for this seemingly trivial question, but we all know how badly documented Pentaho is.
Do either of these pages help?
http://www.sherito.org/2011/11/pentaho-reportings-metadata-datasources.html
http://forums.pentaho.com/archive/index.php/t-96689.html
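In case it helps: based on the approach described in those links, one way to read a report parameter from a Groovy scriptable data source looks roughly like the sketch below. This is a hedged sketch only; it assumes the reporting engine exposes a dataRow binding to the script, and the parameter name RegionParam and the returned columns are invented.

    // Hedged sketch of a Groovy scriptable data source for Pentaho Report Designer.
    // Assumption: a 'dataRow' binding is available to the script, from which
    // report parameters can be read by name.
    import javax.swing.table.DefaultTableModel

    def region = dataRow.get("RegionParam")   // report parameter defined in Pentaho

    def model = new DefaultTableModel(["id", "region", "amount"] as Object[], 0)

    // A real script would query a database here using the parameter value;
    // this dummy row just keeps the sketch self-contained.
    model.addRow([1, region, 42.0d] as Object[])

    return model   // the scriptable data source expects a TableModel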