How do I use a cache in Kettle (Pentaho)?

I am processing data where I get some information from a REST API, based on the value of a field.
Now, the value may repeat for that field, and if I have already fetched the data for that value from REST, I would like to reuse it and save an API call (the slowest operation in the transformation).
Is this possible? If yes, how?
Regards
Ajay

@RFVoltini you are right; maybe we could try to set up an H2 DB server for this purpose: http://type-exit.org/adventures-with-open-source-bi/2011/01/using-an-on-demand-in-memory-sql-database-in-pdi/
Another option is using memcached in Java: http://sacharya.com/using-memcached-with-java/

I've made an example transformation that gets country names by country code from a web service. I've used the idea that you only need to fetch the distinct country codes/names from the web service, then look them up in your main pipeline.
Take a look at this example: https://docs.google.com/open?id=0B-AwXLgq0XmaV0V0cHlfTFZlVUU and see if this method applies to you.
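Outside of PDI, the same idea in plain Python, just to show the pattern of fetching each distinct value only once (the endpoint URL and response shape below are made up for illustration):

import requests

API_URL = "https://example.com/countries/{code}"   # hypothetical lookup service

_cache = {}   # country code -> country name, filled the first time a code is seen

def country_name(code):
    # Call the REST API only for codes we have not seen yet; reuse the cached value otherwise.
    if code not in _cache:
        resp = requests.get(API_URL.format(code=code))
        resp.raise_for_status()
        _cache[code] = resp.json()["name"]   # assumed response shape
    return _cache[code]

rows = ["US", "BR", "US", "DE", "BR"]
names = [country_name(c) for c in rows]   # 5 rows, but only 3 API calls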

Related

RediSql (for redis): Get column names as well as data type?

I am using the excellent RediSql, a module for Redis, to get a powerful caching solution.
When sending a command to Redis that interacts with the SQLite DB in the background, like this:
REDISQL.EXEC db "SELECT * FROM jobcache"
I get a result like this:
I get a type for the integer column, but not for the string, and no column names are provided.
Is there a way to get column name and defined data type always? I would need this, as I need to convert the results back to a more standard sql result format.
Unfortunately, at the moment this is not possible with the EXEC command.
You can use the QUERY.INTO command (see the command reference).
QUERY.INTO adds the result of your query to a stream; it adds the columns and the values for each row. You can then consume the stream in whichever way you prefer.
When doing queries (reads) against RediSQL, it is good practice to use the .QUERY family of commands; this avoids useless replication of data if you are in a cluster setup.
Moreover, the .QUERY commands can also be used against replicas of the main Redis instance, while the .EXEC commands can only be used against the primary instance.
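For example, calling QUERY.INTO from Python with redis-py and reading the resulting stream might look roughly like this (the stream name is arbitrary, and the exact argument order and entry layout should be checked against the RediSQL docs):

import redis

r = redis.Redis()

# Assumes a RediSQL database "db" already exists (REDISQL.CREATE_DB db)
# and holds the jobcache table from the question.
# QUERY.INTO writes the query result into a Redis stream instead of returning it directly.
r.execute_command("REDISQL.QUERY.INTO", "jobcache_results", "db", "SELECT * FROM jobcache")

# Each stream entry carries the column names along with the row values,
# which is the information the plain EXEC reply does not include.
for entry_id, fields in r.xrange("jobcache_results"):
    print(entry_id, fields)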

Include Sub-Entities from relations when loading an Entity in Groovy Service

What I'm trying to achieve here is to load some fields from sub-entities.
For instance, let's suppose I want to load some features for the product list. In XML it's pretty clear:
<row-actions>
    <entity-find-one entity-name="mantle.product.feature.ProductFeature" value-field="brandList">
        <field-map field-name="productFeatureId" from="featureList.productFeatureId"/>
        <field-map field-name="productFeatureTypeEnumId" from="featureList.productFeatureId" value="PftBrand"/>
    </entity-find-one>
</row-actions>
Is there a way to do something similar in Groovy, without iterating through the whole product list and adding the desired fields manually?
Also, can somebody give me a short example of the concrete use of sqlFind (http://www.moqui.org/javadoc/org/moqui/entity/EntityFacade.html)?
I tried to solve the issue I'm asking about using a join query, but I couldn't figure out what the SQL query is supposed to look like.
a. The 'entity-find-one' element queries on the primary key and returns a single map. You need to use the 'entity-find' element.
b. Yes, you can always drop down to Groovy using the script tag, e.g. just use ec.entity.find("mantle.product.feature.ProductFeature") or whatever you need in your Groovy script.
c. In Moqui, joined tables are handled by the 'view-entity' element; you can predefine your own (place it in your 'entities' folder) or use the many existing ones provided in the framework. You don't need SQL.
EDIT - Sorry, you can also do it on the fly by using the EntityFind.makeEntityDynamicView() method.
Hope that helps.

Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help with how it gets consumed downstream by a couple of things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I can write another job to pull from blobs and break it up after the stream analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?
{date}, {time} and {partition} are the only tokens supported in the blob output prefix. {partition} is a number.
Using a column value in the blob prefix is currently not supported.
If you have a limited number of such {id}s, you could work around this by writing multiple "select --" statements with different filters writing to different outputs, and hardcoding the prefix in each output. Otherwise it is not possible with just ASA.
It should be noted that now you actually can do this. I'm not sure when it was implemented, but you can now use a single property from your message as a custom partition key, and the syntax is exactly what the OP asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition

How do you write a test for dynamic API content?

I am working on a wrapper for an API, and one of the endpoints returns data that doesn't have the same results each time.
What is a good strategy to test that the endpoint is still valid?
This is a general question, although I am mostly interested in getting this to work in Python.
You need to define what you actually expect from the result. What are the statements that always hold for the result?
Popular candidates/examples (sketched in code after this list) are:
it is valid JSON/HTML/XML
it contains certain substrings
it has certain "fields"
certain fields can be parsed as a date using a specific format, and the resulting date is within +/-1h of now.
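In Python, a test along those lines might look like this (the endpoint, field names and date format are hypothetical; substitute your wrapper's actual ones):

import datetime
import requests

def test_endpoint_invariants():
    # The data changes between calls, so assert properties that always hold,
    # not exact values.
    resp = requests.get("https://api.example.com/items")   # hypothetical endpoint
    assert resp.status_code == 200

    payload = resp.json()   # raises if the body is not valid JSON
    assert isinstance(payload, list) and payload

    for item in payload:
        # Fields that must always be present, whatever their values are.
        assert {"id", "name", "updated_at"} <= item.keys()

        # The timestamp should parse with the expected format and be close to "now".
        updated = datetime.datetime.strptime(item["updated_at"], "%Y-%m-%dT%H:%M:%SZ")
        assert abs(datetime.datetime.utcnow() - updated) < datetime.timedelta(hours=1)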

REST API Parameter precedence

I'm working on creating a REST API. Let's say the resource I'm serving is called object and it contains a number of properties.
Apart from requesting the entire set of objects like this
GET api.example.com/objects
I want to allow requesting a single object by providing either the objectid or objectname,
like this
GET api.example.com/objects?objectid=
GET api.example.com/objects?objectname=
What I'm confused about is, how should I handle a request like this?
GET api.example.com/objects?objectid=x&objectname=y
In this case, should I return a 400 Bad Request, or should one of the parameters take precedence over the other? How does REST define this behavior?
REST generally assumes there is a unique URL for a resource, so it would be:
GET api.example.com/objects/objectId
Parameters are commonly used for searching, so you would have something like:
GET api.example.com/objects?objectName=x
A better approach would be to use a generic key in the parameter string to retrieve field values of your specific resource:
GET api.example.com/objects/objectId?field=objectName,anotherField
This complements xpapad's suggestion and can add scalable structure to how you define a consistent approach in your API design.
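As a rough illustration of both suggestions (Flask here is just an assumption; the thread doesn't name a framework), the routing could look something like this:

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# Toy in-memory data so the routes are runnable.
OBJECTS = {
    "1": {"id": "1", "name": "alpha", "size": 10},
    "2": {"id": "2", "name": "beta", "size": 20},
}

@app.route("/objects/<object_id>")
def get_object(object_id):
    # Unique URL per resource: the id lives in the path, not the query string.
    obj = OBJECTS.get(object_id)
    if obj is None:
        abort(404)
    fields = request.args.get("field")   # optional ?field=name,size projection
    if fields:
        wanted = set(fields.split(","))
        obj = {k: v for k, v in obj.items() if k in wanted}
    return jsonify(obj)

@app.route("/objects")
def search_objects():
    # Query parameters are reserved for searching/filtering the collection.
    name = request.args.get("objectName")
    results = [o for o in OBJECTS.values() if name is None or o["name"] == name]
    return jsonify(results)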