I have a data grid with the following fields and one row:
Data Grid
Fields: A , B , C
Row 1: fooA, fooB,
I have another data grid with the following rows (just one field):
Intervals
Fields: C
Row 1: 10
Row 2: 20
Row 3: 35
Row 4: 40
I would like to understand which Pentaho Data Integration (Kettle) step should be used to get:
Fields: A , B , C
Row 1: fooA, fooB, 10
Row 2: fooA, fooB, 20
Row 3: fooA, fooB, 35
Row 4: fooA, fooB, 40
That is a simple Join Rows (Cartesian product) step, though I do not know exactly how you are going to use this with variable data. If the grid with one row and multiple columns will ALWAYS have exactly one row, then it is fine: the Cartesian product will replicate fooA, fooB in each of the N rows from the second grid.
First, in the second grid, split the single field into two columns: one holding the "Row X" label and another holding the numeric value.
Your KTR should look something like this:
Which outputs this:
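For readers without the screenshots, the Cartesian product that Join Rows performs can be sketched outside PDI in a few lines of Python. This is only an illustration of the logic, not PDI code; the variable names `main_rows`, `interval_rows` and `joined` are made up for the example, and the empty C field of the first grid is simply left out so the value from the Intervals grid fills it.

```python
from itertools import product

# First grid: a single row with fields A and B (the empty C is omitted here).
main_rows = [{"A": "fooA", "B": "fooB"}]

# Intervals grid: one field, C, with four rows.
interval_rows = [{"C": 10}, {"C": 20}, {"C": 35}, {"C": 40}]

# Cartesian product: pair every row of the first grid with every interval row.
joined = [{**main, **interval} for main, interval in product(main_rows, interval_rows)]

for row in joined:
    print(row["A"], row["B"], row["C"])
```

With one row in the first grid, the result has exactly as many rows as the Intervals grid; with M rows it would have M x N rows, which is why the answer stresses the single-row assumption.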
I have a dataframe with columns: A(continuous variable) and B(discrete 1 or 0). The df is initially sorted by A variable.
I need to order the dataframe so that, for each set of X rows, there are Y rows with value 1 in column B and (X-Y) rows with value 0 in column B (when possible!). Within each set, variable A should be in descending order. X and Y are input by the user.
Example:
X=4, Y=3
Rows 0-11 are OK, since the sets (0-3), (4-7) and (8-11) each have 3 rows with 1 in column B and only one row with 0, AND variable A is descending. However, rows 12-15 are not OK, since there are two rows with 1 in column B and two with 0. Row 17 would replace row 15 to make this set valid. There is no problem if the last rows have 0 in column B, since by then there are no rows with value 1 left.
The code should be general enough to run on dataframes with different number of rows.
Any ideas?
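One possible reading of the requirement is a greedy pass: keep two queues (rows with B = 1 and rows with B = 0), both sorted by A descending, and build each set of X rows from up to Y ones plus zeros for the remainder. The sketch below uses plain `(a, b)` tuples instead of a dataframe so it runs without pandas; the function name `arrange` and the sample data are hypothetical, and it assumes `1 <= x` and `0 <= y <= x`.

```python
def arrange(rows, x, y):
    """Reorder (a, b) tuples into sets of x rows with up to y rows where b == 1."""
    # Sort once by A descending, then split into two queues by B.
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)
    ones = [r for r in ordered if r[1] == 1]
    zeros = [r for r in ordered if r[1] == 0]

    result = []
    i = j = 0  # read positions in the ones and zeros queues
    while i < len(ones) or j < len(zeros):
        take_ones = ones[i:i + y]
        i += len(take_ones)
        # Fill the rest of the set with zeros (more than x - y if ones ran out).
        take_zeros = zeros[j:j + (x - len(take_ones))]
        j += len(take_zeros)
        # If zeros ran out as well, top the set up with further ones.
        short = x - len(take_ones) - len(take_zeros)
        extra = ones[i:i + short]
        i += len(extra)
        # Inside each set, keep A in descending order.
        block = sorted(take_ones + take_zeros + extra, key=lambda r: r[0], reverse=True)
        result.extend(block)
    return result


sample = [(10, 1), (9, 0), (8, 0), (7, 1), (6, 1), (5, 1), (4, 0), (3, 1)]
arranged = arrange(sample, x=4, y=3)
```

A is descending inside each set but not necessarily across sets, which matches the example where a later row is pulled forward to complete a set. With a dataframe, the tuples could be produced from and written back to columns A and B.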
Using Pentaho PDI 8.3.
After REST calls that return quite complex data structures, I was able to extract the data with one row for each data element in a REST result, e.g.:
DataCenterClusterAbstract
1
UK1
Datacenter (auto generated)
Company
29
0
39
15
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
2
UK1_Murex
Datacenter (auto generated)
Company
0
0
0
0
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
3
UK1_UNIX
Notice that there are 8 data elements that are spread out into separate rows. I would like to condense these 8 data elements into one row per iteration in Pentaho. Is this possible? And can I assign field names?
Row flattener
Condense the 8 data elements, which arrive one per row, into the columns of a single row. Each of these 8 data elements repeats for every record.
(1) Add a Row flattener step.
(2) Assign field names for the incoming rows: if you have 10 data attributes in rows, specify one field name for each row.
(3) In the table output, use a space as the separator.
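The idea behind the steps above (take a fixed number of consecutive rows and turn them into the named fields of one row) can be sketched in Python as follows. This is not PDI code; the function name `flatten` and all field names are hypothetical. The sample values are copied from the question, where each record appears to span 11 rows, so the sketch takes the record length from the number of field names rather than hard-coding 8 or 10.

```python
def flatten(values, field_names):
    """Group consecutive values into records of len(field_names) fields each."""
    n = len(field_names)
    # Step through the stream n values at a time; an incomplete tail is dropped.
    return [
        dict(zip(field_names, values[start:start + n]))
        for start in range(0, len(values) - n + 1, n)
    ]


# Hypothetical field names; the original post does not name the fields.
fields = ["type", "id", "name", "description", "owner",
          "v1", "v2", "v3", "v4", "updated_by", "updated_at"]

stream = [
    "DataCenterClusterAbstract", "1", "UK1", "Datacenter (auto generated)", "Company",
    "29", "0", "39", "15", "DATAUPDATEJOB", "2016-04-09T21:34:31.18",
    "DataCenterClusterAbstract", "2", "UK1_Murex", "Datacenter (auto generated)", "Company",
    "0", "0", "0", "0", "DATAUPDATEJOB", "2016-04-09T21:34:31.18",
    "DataCenterClusterAbstract", "3", "UK1_UNIX",  # truncated third record
]

records = flatten(stream, fields)
```

As with the Row flattener, this only works if every record has the same number of elements in the same order.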
Below are two sets of data. Each has two columns. I want matching data from the two sets to line up next to each other.
This is a manual solution with formulas and sorting.
Imagine the following data in columns A to E:
Enter the following formulas into columns G to K
Column G: =IFERROR(IF(VLOOKUP(D:D,A:B,2,FALSE)=E:E,1,2),3)
Column H: =IF(G:G<3,D:D,"")
Column I: =IFERROR(VLOOKUP(H:H,A:B,2,FALSE),"")
Column J: =D:D
Column K: =IFERROR(VLOOKUP(J:J,D:E,2,FALSE),"")
Column G (the sort-by column) now shows:
1 if part and quantity matched
2 if only part matched
3 if nothing matched
So if you now select the data in A3:K10 and sort by column G (the sort-by column), it will result in this:
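Since the screenshots are not reproduced here, the logic of the sort-by column can be illustrated with a short Python sketch. It mirrors the column G formula: 1 when part and quantity both match, 2 when only the part matches, 3 when nothing matches. The part names and quantities are invented for the example, and `match_code` is a hypothetical helper name.

```python
def match_code(part, quantity, reference):
    """Return 1, 2 or 3 like the column G formula."""
    if part not in reference:        # VLOOKUP fails, so IFERROR yields 3
        return 3
    return 1 if reference[part] == quantity else 2


# First data set (columns A:B). Like VLOOKUP, keep the first entry per part.
first_set = [("P1", 5), ("P2", 7)]
reference = {}
for part, qty in first_set:
    reference.setdefault(part, qty)

# Second data set (columns D:E).
second_set = [("P3", 1), ("P2", 9), ("P1", 5)]

# Sort the second set by match code, as the manual sort on column G does.
sorted_rows = sorted(second_set, key=lambda row: match_code(row[0], row[1], reference))
```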
To set the scene, what I define as identical rows are when the combination of destination and vehicle_brand are the same. For instance in the figure below,
SQL table name: cardriven
rows 2 and 3 are "identical" because of the Dallas-Toyota "combination." Now I want to only display the row with the higher request_id. So for example, between rows 2 and 3, row 3 would get displayed and row 2 would be hidden/removed because 169 > 100. So in the end, only rows 3, 4, 5, 7, and 8 will show and rows 1, 2, 6, and 9 would get hidden/removed.
Hopefully you understand what I am going for here but if you have any questions, please let me know. This will be written in SQL code.
Another problem: I added a new column for dates and entered some random ones for rows 2-4. Row 2 is 12/1/17, row 3 is 11/5/2016, and row 4 is 7/6/2017. Note that row 3 has the highest request_id of the Dallas-Toyota combination. I then added a new entry with request_id = 501, Dallas, Toyota, and 12/22/2017. After running the program, for Dallas-Toyota I get row 3 back but with request_id = 501! It SHOULD return the entry I just entered.
You can use Group By and the Max function to get the highest value.
SELECT MAX(request_id), destination, vehicle_brand
FROM cardriven
GROUP BY destination, vehicle_brand
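The query above returns only the grouped columns and the maximum. If further columns such as the date are needed, one common pattern is to join the table back to the grouped result so that the whole row with the highest request_id is returned; mixing a non-grouped column into the grouped SELECT is one plausible cause of the mismatch described in the question. The sketch below tries this against an in-memory SQLite database from Python. The column name `drive_date`, the Austin/Ford row and the date format are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cardriven ("
    "request_id INTEGER, destination TEXT, vehicle_brand TEXT, drive_date TEXT)"
)
conn.executemany(
    "INSERT INTO cardriven VALUES (?, ?, ?, ?)",
    [
        (100, "Dallas", "Toyota", "2017-12-01"),
        (169, "Dallas", "Toyota", "2016-11-05"),
        (501, "Dallas", "Toyota", "2017-12-22"),
        (200, "Austin", "Ford", "2017-07-06"),  # hypothetical extra row
    ],
)

# Join each row to the per-group maximum so every column comes from the same row.
rows = conn.execute(
    """
    SELECT c.request_id, c.destination, c.vehicle_brand, c.drive_date
    FROM cardriven AS c
    JOIN (
        SELECT destination, vehicle_brand, MAX(request_id) AS max_id
        FROM cardriven
        GROUP BY destination, vehicle_brand
    ) AS m
      ON c.destination = m.destination
     AND c.vehicle_brand = m.vehicle_brand
     AND c.request_id = m.max_id
    ORDER BY c.request_id
    """
).fetchall()
conn.close()
```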
Example:
column A column B
A 1
A 2
B 2
B 2
C 1
C 1
I would somehow like to get the following result:
column A column B
A 1.5
B 2
C 1
(which are the averages of 1 and 2, 2 and 2, and 1 and 1)
How do I achieve that?
Thanks
If you're using Excel 2007 or above, you can also use the shorter AVERAGEIF function:
=AVERAGEIF($A$1:$A$6,D1,$B$1:$B$6)
Less typing, easier to read.
In D1:D3, type A, B, C. Then in E1, put this formula
=SUMIF($A$1:$A$6,D1,$B$1:$B$6)/COUNTIF($A$1:$A$6,D1)
and fill down to E3. If you want to replace the existing data, copy E1:E3 and paste-special-values over itself. Then delete A:C.
Alternatively, you can add headers to your data, say "Letter" and "Number". Then create a Pivot Table from your data. Put Letter in the rows section and Number in the Data section. Change your Data section from SUM to AVERAGE and you'll get the same result.
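To check the expected numbers independently of Excel, the same per-letter average can be computed with a few lines of Python. This is just a cross-check using the sample data from the question; the variable names are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Sample data from the question: (column A, column B).
data = [("A", 1), ("A", 2), ("B", 2), ("B", 2), ("C", 1), ("C", 1)]

# Collect the numbers belonging to each letter.
groups = defaultdict(list)
for letter, number in data:
    groups[letter].append(number)

# Average each group, like AVERAGEIF or SUMIF/COUNTIF per letter.
averages = {letter: mean(numbers) for letter, numbers in groups.items()}

for letter, value in averages.items():
    print(letter, value)
```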