Any idea how to summarize data in a Pentaho transformation and then insert the summary row directly under the group being summarized.
I can use a Group By step and get a summarised result stream having one row per key field, but what I want is each sorted group written to the output and the summary row inserted underneath, thus preserving the input.
In the Group By, you can do 'Include all Rows', but this just appends the summary fields to the end of each existing row. It does not create new summary rows.
Thanks in advance
To get the summary rows to appear under the group by blocks you have to use some tricks, such as introducing a numeric "order" field, setting the value of the original data to 1 and the sub totals rows to 2.
Also in the group-by/ sub-totals stream, I am generating a sum field, say "subtotal". You have to make sure to also include this as a blank in your regular stream or else the metadata will be divergent and the final merge will not work.
Here is the best explanation I have found for this pattern:
https://www.packtpub.com/books/content/pentaho-data-integration-4-working-complex-data-flows
You will need to copy the rows too a different stream, and then merge or join them again, to make it a separate row.
Related
I have brought a table from an Authority database into Excel via power query OBDC type, that includes fields like:
Date - various
Comments - mem_txt
Sequence - seq_num
The Comments field has a length restriction, and if a longer string is entered, it returns multiple rows with the Comments field being chopped into suitable lengths and the order returned in the Sequence field as per extract below. All other parts of the records are the same.
I want to collapse the rows based and concatenate the various Comments into a single entry. There is a date/time column just outside of the screen shot above that can be used to group the rows by (it is the same for the set of rows, but unique across the data set).
For example:
I did try bring the data in by a query connection, using the GROUP_CONCAT(Comments SEPARATOR ', ') and GROUP BY date, but that command isn't available in Microsoft Query.
Assuming the date/time column you refer to is named date_time, the M code would be:
let
Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
#"Grouped Rows" = Table.Group(
Source,
{"date_time"},
{{"NewCol", each Text.Combine([mem_text])}}
)
in
#"Grouped Rows"
Amend the Source line as required.
I do not understand how to deal with duplicates when generating my output, so I ended up getting several duplicates but I want one only.
I've tried using LIMIT but that only applies when selecting I suppose. I also used DISTINCT but wrong scenario I guess.
grouped = GROUP wantedTails BY tail_number;
smmd = FOREACH grouped GENERATE wantedTails.tail_number as Tails, SUM(wantedTails.distance) AS totaldistance;
So for my grouped, I got smg like (not the whole):
({(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB)},44550)
but I expect (N983JB,44550). How can I delete those duplicates generated during grouping? Thank you!
The way I see it, there are two ways to de-duplicate data in Pig.
Less flexible but a convenient way is to apply MAX to the columns which need to be de-duplicated after performing a GROUP BY. Apply SUM only if you want to add up values across duplicates:
dataWithDuplicates = LOAD '<path_to_data>';
grouped = GROUP dataWithDuplicates BY tail_number;
dedupedData= FOREACH grouped GENERATE
--Since you have grouped on tailNumber, it is already de-duped
group AS tailNumber,
MAX(dataWithDuplicates.distance) AS dedupedDistance,
SUM(dataWithDuplicates.distance) AS totalDistance;
If you want more flexibility while de-duping, you can take help of nested-FOREACH in Pig. This question captures the gist of its usage: how to delete the rows of data which is repeating in Pig. Other references for nested-FORACH: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html
I have the following data in my excel. and my requirements is to sort a group of rows based on Forecast Age in Descending order. I have already tried the built in sort function in excel but it doesn't meet my requirements.
Let's say for example my forecast age is in Range("BO54")=17 and Range("BO88")=19, Range("BO88") value is bigger than the other so it should to be on the top. Range("BO88")'s group is Range("A78:CA91") .
In short I want 88th row and it's group of row which is Range("A78:CA91") to be at the top. Is there a way to do this without sorting the data within Range("A78:CA91")?
Well I can suggest you one way which might be lengthy to code.
First you can run a loop on all rows with a condition that if "Description" = "Forecast Age" then paste it in a new sheets. This will give you list of rows with every "house" and age. Sort it based on age and add a column which will assign rank() it based on age.
After this add a rank column in your main data and vlookup the above obtained ranks to each house and then finally, sort the entire table based on these ranks.
You should obtain the result you expect.
This is just a heads up. You can optimize the process and improvise it.
I'm using a Pivot Table in Excel 2010, and while searching posts I find that a lot of users are frustrated like me because it doesn't keep all formats.
So what I'm trying to do is run a macro that formats columns in a Pivot table, but limited to the last row and column in the table. I have the formatting info, but I just need to know how to apply it to specific columns and rows.
What I was thinking might work is finding the last column of the Values row, in this case "Stops per Rte" which is the last Values column; however, I have months listed at the top, so it repeats across months. If the user filters only certain months then the # of columns will decrease.
Same goes for the # of rows: of course, the user should be able to expand/collapse rows as needed, so I only want the column format to go to the last row or just above "Grand Total", if possible.
Hopefully, this makes sense. Thanks in advance! = )
I use a query which fetches say 50 records and passes it to a datatable. This record is then displayed in a tabular format. The display has pagination used displaying 10 records at a time. There is a facility to move to next or previous set of record or move forward or backwards by 1 record.
I have to find Min and Max of a column for the set of record currently visible. I am planning to use Compute method but I am not sure if it allows filtering on anything other than the columns in datatable.
Do I have to include row number in my query or is there a better solution (something along the line mentioned below)?
CType(dtLineup.Compute("Min(ArrivalDate)", dt.row(2) to dt.row(12)), Date)
There is nothing like your pseudo code in MSDN on DataColumn.Expression. You could include a row number in your query, as you said, but an alternative is to add a row number column to your data table and use that in the filter expression.
DataColumn col = new DataColumn("rownumber", typeof(int));
col.AutoIncrement = true;
col.AutoIncrementSeed = 1;
datatable.Columns.Add(col);
Another alternative could be to do paging by linq (Skip-Take) and compute the aggregate function over the returned rows. But that may be a major departure of your current application structure.