Process All Rows in JavaScript Step - Pentaho

How do I read and process all the rows read from an Excel file in a JavaScript step?
I have the following transformation.
The first step reads all the rows from a spreadsheet, which I will then post to a REST service. I wait for the REST operation to complete before entering the JS script. In the JS step, I want to process all the records at once and save them as one single record to XML.
Just to give some more clarity on my requirement, here is one scenario. I have an input file with two columns: material no and quantity. After the JavaScript step, I need to post the data read from the spreadsheet to another service. This service will return the free goods associated with the input. But to get a free good, I need a combination of materials. For example, if the input is TV and DVD Player, I get something free; I won't get anything free if I pass only the TV or only the DVD player as the input. So in this case my data is:
**Material Qty**
TV 1
DVD Player 1
My REST service has the following structure.
{
  "items": [
    {
      "material": "TV",
      "quantity": "1"
    },
    {
      "material": "DVD Player",
      "quantity": "1"
    }
  ]
}
Any input on how to achieve this would be really valuable. Thank you.

One way of addressing this in a transformation is to use a grouping step. Try the Unique rows (HashSet) step or even the Group by step.
This step will replace the Blocking step and wait until all rows are processed and grouped.
Below is the sample (3 ways).
As an example, the Unique rows (HashSet) step can read all the fields coming from the JavaScript step and make them distinct after reading the entire dataset.
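Separately, once all the material/quantity rows are available in one place (whether grouped as above or accumulated inside the JavaScript step), building the request body for the REST service is only a few lines of JavaScript. This is just a rough sketch: the field names material and quantity come from the example above, buildItemsPayload is a made-up helper name, and it assumes your Kettle version's JavaScript engine provides JSON.stringify (otherwise concatenate the string by hand).

// rows is assumed to be an array of objects like { material: "TV", quantity: "1" }
function buildItemsPayload(rows) {
    var items = [];
    for (var i = 0; i < rows.length; i++) {
        items.push({ material: rows[i].material, quantity: String(rows[i].quantity) });
    }
    return JSON.stringify({ items: items });
}

// buildItemsPayload([{ material: "TV", quantity: 1 }, { material: "DVD Player", quantity: 1 }])
// produces '{"items":[{"material":"TV","quantity":"1"},{"material":"DVD Player","quantity":"1"}]}'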

Related

Checking Whether Table Data Exists, Updating / Inserting Into Two Tables & Posting End Outcome

I am working on my cron system which gathers information via an API call. For the most part this has been fairly straightforward, but now I am faced with multiple difficulties, as the API response depends on who is making the request. The system runs through each user's API key, and certain information will be visible or hidden to them, and vice versa for the public.
There are teams, and users are part of teams. A user can stealth their move; all information will still be shown to them and their team, but it will not be visible to their opponent. Both teams share the same attack id and have access to the same information, just one side can see more of it than the other.
Defendant's Point Of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": "",
"attacker_team_id": "",
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1
}
}
Attacker's Point Of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": 123,
"attacker_team_id": 2
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1,
"boosters": {
"fair_fight": 3,
"retaliation": 1,
"group_attack": 1
}
}
}
So, if the defendant's API key is used first, id 12345 will already be in the team_attacks table but will not include the attacker_id and attacker_team_id. For each insert thereafter, I need to check whether the new insert's ID already exists and whether there is any additional information to add to the row.
Here is the part of my code that loops through the API and obtains the data; it loops through all the attacks per API key:
else if ($category === "attacks") {
    $database = new Database();
    foreach ($data as $attack_id => $info) {
        $database->query('INSERT INTO team_attacks (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id) VALUES (:attack_id, :attacker_id, :attacker_team_id, :defender_id, :defender_team_id)');
        $database->bind(':attack_id', $attack_id);
        $database->bind(':attacker_id', $info["attacker_id"]);
        $database->bind(':attacker_team_id', $info["attacker_team_id"]);
        $database->bind(':defender_id', $info["defender_id"]);
        $database->bind(':defender_team_id', $info["defender_team_id"]);
        $database->execute();
    }
}
I have also been posting to the news table; typically I have simply been posting "X new entries have been added" or similar. However, I haven't a clue whether there is a way, during the loop above, to tell new entries apart from updated entries so that I can produce two news items:
2 attacks have been updated.
49 new attacks added.
For this part I was simply counting how many items are in the array, but that only works for the very first upload; I know I cannot simply count the array length on future inserts, which require additional checks.
If the attack_id does NOT already exist, I also need to insert the boosters into another table. For this I was adding them to an array during the above loop and then looping through them to insert those, but this too depends on the check above, rather than simply attempting an upload for each one without any checks. Boosters share the attack_id.
With over 1,000 teams that will potentially each have at least one member join my site, I need this to be as efficient as possible. The API gives the last 100 attacks per call, and I want this to run inside my cron, which collects any new data every 30 seconds, so I need to sort through potentially 100,000 records.
In SQL, you can check conditions when inserting new data using merge:
https://en.wikipedia.org/wiki/Merge_(SQL)
Depending on the database you are using, the name and syntax of the command might be different. Common names for the command are also upsert and replace.
But: if you are seeking high performance and near-real-time behaviour, consider using a cache holding critical aggregated data instead of doing the aggregation 100,000 times per minute.
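As a rough sketch of what such an upsert could look like (MySQL syntax is assumed here, since the exact statement depends on your database, and attack_id is assumed to be the table's primary or unique key), the attacker columns are only overwritten when the incoming value is non-empty:

// MySQL-flavoured upsert, held in a string for illustration; bind it with your existing wrapper.
const upsertAttack = `
    INSERT INTO team_attacks
        (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id)
    VALUES
        (:attack_id, :attacker_id, :attacker_team_id, :defender_id, :defender_team_id)
    ON DUPLICATE KEY UPDATE
        attacker_id      = IF(VALUES(attacker_id) <> '', VALUES(attacker_id), attacker_id),
        attacker_team_id = IF(VALUES(attacker_team_id) <> '', VALUES(attacker_team_id), attacker_team_id)
`;

On MySQL, the affected-row count for such a statement is 1 for a fresh insert and 2 for a row that was updated, so if your database wrapper exposes that count you also get the "new vs updated" totals for the news feed without an extra query.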
This may or may not be the "answer" you're looking for. The question(s) imply use of a single table for both teams. It's worth considering one table per team for writes to avoid write contention altogether. The two data sets could be combined at query time in order to return "team" results via the API. At scale, you could have another process calculating and storing combined team results in an API-specific cache table that serves the API request.

How to store and serve coupons with Google tools and JavaScript

I'll get a list of coupons by mail. They need to be stored somewhere somehow (BigQuery?) where I can request one and send it to the user. The user should only be able to get 1 unique code that was not used beforehand.
I need the ability to get a code and record that it was used, so the next request gets the next code...
I know it is a completely vague question, but I'm not sure how to implement this; does anyone have any ideas?
Thanks in advance.
There can be multiple solutions for the same requirement; one of them is given below:
Step 1. Get the coupons into a file (CSV, JSON, etc.) as per your preference/requirement.
Step 2. Load the source file to GCS (Cloud Storage).
Step 3. Write a Dataflow job which reads the data from the GCS file and loads it into a BigQuery table (tentative name: New_data).
Step 4. Create a Dataflow job to read the New_data table, compare it with History_data to identify new coupons, and write them to a file on GCS or to a BigQuery table (see the sketch after these steps).
Step 5. Schedule the entire process with an orchestrator / Cloud Scheduler / cron job.
Step 6. Once you have the data, you can send it to consumers through any communication channel.
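Whether you do the step 4 comparison inside Dataflow or as a plain BigQuery query, it boils down to a single anti-join. A minimal sketch using the Node.js BigQuery client is below; the dataset name my_dataset and the column name coupon_code are assumptions for illustration, while New_data and History_data are the tentative table names from the steps above.

// npm install @google-cloud/bigquery
const { BigQuery } = require('@google-cloud/bigquery');

async function findNewCoupons() {
    const bigquery = new BigQuery();
    // Coupons present in New_data but not yet in History_data.
    const query = `
        SELECT n.coupon_code
        FROM \`my_dataset.New_data\` AS n
        LEFT JOIN \`my_dataset.History_data\` AS h
          ON n.coupon_code = h.coupon_code
        WHERE h.coupon_code IS NULL
    `;
    const [rows] = await bigquery.query({ query });
    return rows.map(r => r.coupon_code);
}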

Randomly select DynamoDB entry

I have a DynamoDB table called URLArray that contains a list of URLs (myURL) and a unique video name (myKey).
I need to do two things:
When a user clicks the next video button, a random entry needs to be selected from this URLArray. There could be potentially tens of thousands of rows.
The user is logged into the app. Every time they finish watching a video, the video's unique name is recorded. So, when the user has seen a video, it's added to a list in a table called Users under the user's info row.
So, the random entry that gets selected when the user clicks the next video button in point 1 has to be compared with the list of videos they've already seen, to make sure it doesn't randomly appear again for that particular user.
So far I do something woefully inefficient that works, but it's not great:
By the way, I'm using AppSync + GraphQL to interact with the DynamoDB table. I first get a local copy of the URLArray:
//Gets a list of the Key/URL pairs in the UrlArrays table in GraphQL ****IN CONSTRUCTOR, so we have this URLArray data when componentDidMount()****
listUrlArrays = async () => {
    try {
        const apiData = await API.graphql(graphqlOperation(ListUrlArrays)); // GraphQL query
        // Keep the whole URLArray and its length in state so the rest of the class can use them
        this.setState({
            URLData: apiData.data.listURLArrays.items,
            urlArrayLength: apiData.data.listURLArrays.items.length // how many videos are in the database
        });
    } catch (err) {
        console.log('Error listing URLArray', err);
    }
}
As an overview, when the user clicks for the next video:
//When clicking next video
async nextVideo() {
    await this.logVideosSeen(); // add myKey to the list of videos in the Users table that the logged-in user has now seen
    await this.getURL();        // get the NEXT upcoming video's details for the video player, making sure it hasn't been seen before
}

//This updates the 'listOfVideosSeen[]' in the Users table with the unique myKey of each video the logged-in user has seen
logVideosSeen = async () => {
    .......
}

async getURL() {
    var dbIndex = this.getUniqueRandomNumber(this.state.urlArrayLength); // choose a number between 0 and N, the number of videos in URLArray
    // hasVideoBeenSeen() fetches the list of videos the user has already seen from the Users table (GraphQL getUsers) and keeps a local copy (which can get big).
    // I use JavaScript's indexOf() to check whether myKey already exists in that list.
    while (await this.hasVideoBeenSeen(this.state.URLData[dbIndex].myKey)) { // while true, i.e. the user has seen that video before
        dbIndex = this.getUniqueRandomNumber(this.state.urlArrayLength);     // get another random number to fetch a new myKey
    }
    // Once false, we exit the loop knowing we have a not-seen-before myKey; proceed to play it...
    if (dbIndex != null) {
        this.setState({ playURL: this.state.URLData[dbIndex].vidURL }); // the URL from the local URLArray that we're going to play next
    }
}
I can share a little more code if needed, but essentially I wanted to know how to:
Let a Lambda function select a random number based on the current URLArray size (I may need to keep a local copy of URLArray anyway). But I think point 2 here is where it's really inefficient...
Let a Lambda function check (the while loop) against the Users table whether myKey has already been seen, mainly to shift this computational burden to the cloud instead of the local device the app runs on.
AFTER A THINK...
Thanks for the suggestion, Seth. I have been thinking about it for some time, and while the randomness requirement still holds true, I think there is some truth in what you've suggested. The reason I need randomness is so that two users sat side by side, for example, can't predict which video is coming next; it shouldn't be a predictable sequence of videos. I'm not sure I can use the Scan function with AWS Amplify/GraphQL. So remember there are two things going on here: (1) a video upload, recorded in the URLArray sensibly for future reference; (2) users viewing a previously unseen random video and then moving on to another unseen random video.
*(1)
I like your idea of using a number to index the URLArray, and it's helped to make life a bit easier. So the first URL is at index 0, the next at 1, etc.
My thinking here (to avoid doing a ListUrlArrays() and bringing the WHOLE array locally to the phone) is to create a GSI called VideoNumber for the URLArray table. This will be a unique VideoNumber column with a number 0-N. So imagine the diagram above having another column called VideoNumber: row 1 has VideoNumber set to 0, row 2 has VideoNumber set to 1, etc. THEN all I would need to do, locally on the device, is generate a random number between 0-N, call a getURLArrayIdbyVideoNumber() query specific to that GSI with the number we just generated, and it'll unlock the information I need from the row. Voila! I think that shifts most of that heavy burden away now.
Question: Before each video is uploaded, how do I easily get the current total number of rows N in the table (or row count)? I would then increment it by one.
The other thing I can do is save this current count number in another DynamoDB table that I use for persisted data, read the number from there before upload, and write an N+1 after upload to increment it (2 DynamoDB operations per upload). It’s not ideal.
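If you do keep that running count in a separate table, DynamoDB can increment it and hand back the new value in one atomic update rather than a read followed by a write. A rough sketch with the DocumentClient; the table name Counters, its key counterName, and the attribute videoCount are all made up for illustration:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Atomically add 1 to the counter and get the new value back in one call.
async function nextVideoNumber() {
    const res = await docClient.update({
        TableName: 'Counters',                  // assumed table used for persisted data
        Key: { counterName: 'URLArrayCount' },  // assumed key
        UpdateExpression: 'ADD videoCount :one',
        ExpressionAttributeValues: { ':one': 1 },
        ReturnValues: 'UPDATED_NEW'
    }).promise();
    return res.Attributes.videoCount;           // use this as the new video's VideoNumber
}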
*(2)
When a user has finished watching a video, I can log, in a list under the user's information in DynamoDB, which videos they've already seen. So, for example, the seen list could now be [3,12,73,108,57] for the 5 videos they've seen so far. When the user clicks nextVideo() we'll generate a random newNumber and straight away compare it with the numbers in the seen list: I use seenList.indexOf(newNumber), and it will either go again or stop if newNumber doesn't exist in the list. THEN I can go through the GSI query and retrieve the relevant information to display the video from URLArray.
I think this indexOf() is the biggest computational burden on the device, and it obviously gets a little slower as the seen list grows. But it should be quicker with pure integer numbers than with the alphanumeric myKey I was using before. Any other suggestion would be welcome :)
I’ve yet to try it, but it was just an idea, as I need to keep the random element. But first, do you know how I can easily find the number of rows or table count of URLArray?
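As a purely illustrative sketch of that check (seenList is the per-user list of video numbers described above), using a Set keeps the lookup cheap even as the list grows:

// Pick a random video number in [0, totalVideos) that the user has not seen yet.
// seenList is the user's list of seen video numbers, e.g. [3, 12, 73, 108, 57].
function pickUnseenVideoNumber(totalVideos, seenList) {
    const seen = new Set(seenList);
    if (seen.size >= totalVideos) return null;   // everything has been watched
    let candidate = Math.floor(Math.random() * totalVideos);
    while (seen.has(candidate)) {
        candidate = Math.floor(Math.random() * totalVideos);
    }
    return candidate;
}

As the answer below points out, this kind of guessing gets slow once the user has seen most of the videos, which is why a pre-shuffled per-user order may be a better fit.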
I think you'll have an easier time coming up with a solution to this problem if you drop the randomness requirement. It sounds like the more important requirement is presenting the user with a video they haven't seen before.
If that's correct, it sounds like your access pattern could be stated as
Fetch previously unseen video for user
which is an easier problem to solve.
Unlike SQL databases, there are often many ways to implement a given access pattern in DynamoDB. My answer here is just one way.
Imagine your URLArray table as a giant array. The first URL is at index 0, the next URL is at index 1, the one after that at index 2, and so on. Each user of your application would start by watching the video at URL index 0, then URL index 1, etc. This would ensure the user never sees the same video twice. You would not need to store a list of all the videos they've seen. Instead, you could store the index of the last video they saw.
Your application could grab the first n videos from the table to present to your users. Once that list was exhausted, it could go grab the next n videos. And so on...
What I've described here is essentially how pagination is implemented in DynamoDB. To bring this abstraction back to the world of DynamoDB, your algorithm could look something like this:
Scan the URLArray table for the first "page" of URLs (a scan operation with no filter criteria)
Along with the results, DynamoDB will respond with a LastEvaluatedKey, which will allow you to retrieve the next page of results starting from this position
Present your user with each video you pulled back from the scan operation, making sure to record the id (the Primary Key) of the last video they saw.
When you exhaust the URLs from step 1, execute another scan operation with the ExclusiveStartKey set to the LastEvaluatedKey returned from step 2.
When users return to your application, query for the next page from the URLArray table with ExclusiveStartKey set to the id of the last video they viewed.
This effectively uses the scan operation to search through your URLArray table one page at a time. Your application would effectively be searching the table from top to bottom, keeping track of where each user is at any given time. When a user revisits your application, just start where they left off.
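A minimal sketch of that paging with the AWS SDK DocumentClient (the table name URLArray is from the question; the page size of 10 is arbitrary):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Fetch one "page" of URLs, starting just after the key this user last reached
// (or from the top of the table when lastSeenKey is undefined).
async function nextPageOfVideos(lastSeenKey) {
    const res = await docClient.scan({
        TableName: 'URLArray',
        Limit: 10,                        // page size
        ExclusiveStartKey: lastSeenKey    // undefined on the first call
    }).promise();
    // res.Items is the page to present; res.LastEvaluatedKey is what you store per user.
    return { items: res.Items, nextKey: res.LastEvaluatedKey };
}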
In response to your edit:
If your use case requires the next video to be unpredictable (e.g. no 2 users can predict what video is next), you have a few problems to solve at the same time:
Selecting an item in an unpredictable/random manner
Tracking what a user has already seen
Putting those two requirements together makes for a tricky access pattern. Let's say you have N videos in your table, and the user has viewed N-1 of these videos leaving only one video unseen. If you are fetching your next video randomly and need to ensure it has not yet been seen, how will you find the last unseen video? How many times would you need to guess before you came across the only unseen video? What query/scan operation could you perform that does this in a single request to DDB? I'm not saying it's impossible, it's just...complicated.
I think it's better to come up with a strategy that is unpredictable to the user, but predictable to you, when it comes to selecting the next unseen video.
For example, you could pre-calculate a random order of indexes from 1..N ahead of time, which would represent the order you present the videos for a given user. You could go through that list sequentially, keeping track of the last seen index. That way, you'd always know which video was next and that the video hadn't previously been seen by this user. Fetching that video would be a simple query operation to DDB.
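For instance, that pre-calculated order could simply be a shuffled copy of the indexes 0..N-1, stored with the user along with a cursor into it. A sketch of building such an order (how you persist it is up to you):

// Build a random but fixed viewing order for one user: a shuffle of [0, 1, ..., n-1].
// Advancing a per-user cursor through this list always yields an unseen video.
function buildViewingOrder(n) {
    const order = Array.from({ length: n }, (_, i) => i);
    for (let i = order.length - 1; i > 0; i--) {           // Fisher-Yates shuffle
        const j = Math.floor(Math.random() * (i + 1));
        [order[i], order[j]] = [order[j], order[i]];
    }
    return order;
}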
You also asked how to find the number of items in DynamoDB. Unfortunately, there is no DynamoDB equivalent of the SQL count operation. The answer to this question is not straightforward. For the benefit of the community (and to get a diverse set of answers), I'd suggest you make a separate question on Stackoverflow regarding the number of items in a DDB table.

How to fetch data for a news feed like system?

I have a few tables, as shown below.
Polls
PollId   Question   Option
1        What       1
2        Why        4
Updates
UpdateId   Text
1          Sleep
2          Play
Polls and Updates are just two sample tables (in reality there are more tables: photos, videos, links, etc.). But when a user visits his home page (like the Facebook news feed) he must be shown data relevant to him (no such data is included in this example). That is, I want to select data from all the tables with as few query executions as possible (i.e., I want to present a mixture of data: polls, photos, videos, etc.).
Currently, I'm fetching only the ids and the type (i.e. which table) from all of the tables and gathering further data while iterating through this result set (i.e., from C# calling another SqlQuery).
Is there a way to query the data from all the tables at once? (OUTER JOIN? UNION?)
Or simply,
How can I select different types of entities at once in a single SQL query?
You could write your query so that you have one long select list for everything you want and it all comes back in one result set but I suspect that wouldn't work too well because you might have varying numbers of different types of items per user.
If you really must have it all in one hit then you can issue multiple queries in one go and get multiple result sets back. To handle this you can use an ADO.Net DataSet. See this SO example (but not the accepted answer - see Vikram Dibyal's answer as that gives a very basic overview of what I think you're asking for).
I won't copy and paste the stuff from the linked thread, just head over and take a look.
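If a single combined result set is enough, a UNION ALL with a type column is one common way to line up different entity types. This is only a sketch, held in a string for illustration; the column aliases are made up, and each additional table (photos, videos, links) would need its own SELECT added to the union:

// The SQL is the interesting part; each SELECT must expose the same column list.
const feedQuery = `
    SELECT 'poll'   AS ItemType, PollId   AS ItemId, Question AS Body FROM Polls
    UNION ALL
    SELECT 'update' AS ItemType, UpdateId AS ItemId, Text     AS Body FROM Updates
`;
// Every row comes back tagged with ItemType, so the application can tell polls,
// updates, photos, etc. apart while still reading a single result set.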

Process Each Row in Kettle ONE AT A TIME?

I was wondering if it is possible to work on a per-row basis in Kettle?
I am trying to implement a reporting scheme which consists of a table where requests are queued for processing, and a Pentaho job that picks up the records from that table.
My job currently has 3 transformations in it:
1st is to get records from the queued requests table.
2nd is to analyze the values on each record and come up with multiple results based on that record. For example, a user would request the records of movies in the horror genre; the job should then spit out the horror movies.
3rd is to further retrieve information about the movies, such as the year, director, etc., which is to be output to an Excel file.
This is the idea, but it's a bit challenging doing it in Pentaho as it does everything all at the same time. Is there a way that I can make my job work on records one by one?
EDIT.
Just to add, I have been trying to extend the implementation of the Pentaho cookbook sample, but compared to my design it's like steps 2 and 3 only.
I can't seem to make the Table input step work one row at a time.
I just made it act like the implementation in the cookbook, with some adjustments. Instead of using two transformations to gather all the necessary fields, I retrieved all the information that I need in 1 transformation.
Then I copied that information to the next steps, ran some queries to complete the information, and it is now working.
Passing parameters between transformations is a bit confusing; there are parameters to be set on the transformation itself and also on the job where the transformations live, so I kind of went guessing for some time just to make it work.
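For what it's worth, once a parameter is defined on the transformation (Transformation settings > Parameters) and supplied by the calling job entry, it can be read inside a Modified Java Script Value step. A tiny sketch; the parameter name GENRE is made up for this example:

// Read a transformation parameter/variable; the second argument is the
// default value used when the variable is not set.
var genre = getVariable("GENRE", "");

The same getVariable() call also reads variables set earlier in the job (for example via a Set Variables step in a previous transformation), which is often the easiest way to hand values from one transformation to the next.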