HANA Studio - Calculation view Calculated column not being aggregated correctly - hana

I encounter a problem while I am trying to aggregate (sum) a calculated column which was created in another Aggregation node from another Calculation view.
Calculation View : TEST2
Projection 1 (Plain projection of another query)
Projection1
Aggregation 1 Sum Amount_LC by HKONT and Unique_document_identifier. In the aggregation, a calculated column Clearing_sum is created with the following formular:
Aggregation1
[Question 1] The result of this calculation in raw data preview makes sense for me but the result in Analysis tab seems incorrect. What is the cause of this different output between Analysis and Raw Data?
Result Raw Data
Result Analysis
I thought that it might be the case that, instead of summing up, the analysis uses the formular of Clearing_sum since it is in the same node.
So I tried creating a new Calculation (TEST3) with a projection on this TEST2 (all columns included) and ran to see the output. I still get the same output (correct raw data but incorrect analysis).
Test3
Result Analysis Test3
[QUESTION 2] How can I get my desired result? (e.g. the sum of Clearing_sum for the highlighted row should be 2 according to Raw data tab). I also tried enabling the Client-side aggregation in the Calculated column, but it did not help.

Without the actual models (and not just screenshots) it is hard to tell what the cause of the problem here is.
One possible cause could be that removing the HKONT changed the grouping level of the underlying view that computed SUM(Amount_LC). In turn, this affects the calculation of Clearing_sum.
A way to avoid this is to instruct HANA to not strip those unreferenced columns and to not change the grouping level. To do that, the KEEP FLAG needs to be set for the columns that should stay part of the grouping.
For a more detailed explanation of this flag, check the documentation and/or blog posts like Usage of “Keep Flag”.

Related

MS SSAS - Need to return a measure in a calculated member based on a tuple set and a max ofunderlying ID

I require some more advanced MDX knowledge than mine.
I need to get the RepoRate_MAX for repo products, at book and instrument level, but also looking at the Java code I'm replacing that code always uses the max MurexId.
How can I perform the below (I've placed MAX in here on the dimension but this is wrong) and I need the combo of the dimensions and also the MAX MurexId:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],MAX([Deal].[MurexId])),[Measures].[RepoRate_MAX])
I'm sure it's a simple one but my mind is part way between the Java OO and MDX worlds currently haha :D
Thanks
Leigh
So after some experimenting I found out about the TAIL and Item MDX functions.
I think at one point I did get it working, but didn't make a note of what did work. I was playing around with this and variants of it..but most versions ended up in unusable query times:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],TAIL(EXISTING([Deal].[MurexId].[MurexId])).Item(0)),[Measures].[RepoRate_MAX])
So I then decided to push the RepoRate calculation back to the SQL data preparation script. Cleaner/smoother data is always better and then to have simple calculated members.
I used SQL to determine the RepoRate from tradelevel with MAX(MurexId) and GROUP BY on Book, Instrument to then update my main fact table to ensure that the correct RepoRate was set at Book, Instrument level.
Thus the calculated member is then:
[Measures].[RepoRate_VAL] = (([Deal].[Book],[Deal].[Instrument]),[Measures].[RepoRate_MAX])
Fast data prep and a fast calculated member on the Excel/Pivot/UI layer.

SAP DBTech JDBC: [2048]: column store error: search table error: [2724] Olap temporary data size exceeded 31/32 bit limit"

I have a calculation view which is based on other calculation views and joins to bring material Accounts data from different vendors (all joins have 1-1 mapping with target). in the final view I have a calculated column as Formatted_MATERIAL (Material numbers without any leading zeros, used Ltrim() to remove leading zeros.)
Now, when I'm searching Formatted_MATERIAL equal to some specific number it's showing read error (heading). If I'm searching for some range of material it's giving results.
For example, if I search for material (500098), it's present in following query results
select "Formatted_MATERIAL"
FROM "_SYS_BIC"."CA_REPORTS_001_VK"
where "Formatted_MATERIAL" between 5000000 and 6000000
order by "Formatted_MATERIAL"
but no results for
select "Formatted_MATERIAL"
FROM "_SYS_BIC"."CA_REPORTS_001_VK"
where "Formatted_MATERIAL" = 5000098
The cause of the error is that during some processing step in one of the views you're using, the intermediate result set exceeds 2 billion records.
Based on my experience with typical HANA use cases (that would mostly be use cases in relation to SAP products) I am pretty sure that the way these underlying views have been modelled is not really right. Whenever you try and join or aggregate an intermediate result set of two billion records at once, chances are that important operations like filtering, projection and aggregation should have been done much earlier in the model.
Of course, without seeing the model(s) and the execution details (use PlanViz for this) and knowing with HANA version you're using, there is nothing we can say about how to solve this issue.

Find out the amount of space each field takes in Google Big Query

I want to optimize the space of my Big Query and google storage tables. Is there a way to find out easily the cumulative space that each field in a table gets? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in Web UI by simply typing (and not running) below query changing to field of your interest
SELECT <column_name>
FROM YourTable
and looking into Validation Message that consists of respective size
Important - you do not need to run it – just check validation message for bytesProcessed and this will be a size of respective column
Validation is free and invokes so called dry-run
If you need to do such “columns profiling” for many tables or for table with many columns - you can code this with your preferred language using Tables.get API to get table schema ; then loop thru all fields and build respective SELECT statement and finally Dry Run it (within the loop for each column) and get totalBytesProcessed which as you already know is the size of respective column
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.

looping in a Kettle transformation

I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, just that those are intersecting, which prevents usage of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" would be implemented maybe implicitely by just not reentering the loop.
Here's the flow chart:
Flow of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops and they can cause real trouble in transforms. Instead you should do this by adding a step that will put a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may process in parallel and out of order and the only guarantee is that the row will pass through the steps in order. Because of this a loop in a transform does not function as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In general case the partial result cannot be produced. For example, if you have an aggregate function with GROUP BY clause, then all data should be analysed, before the 1st row is returned. LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give a concrete data and SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, if you have table create privelages and you have some mega-huge table X with key unique_id and some data data_value
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out, and you can inner join your sample_table against the other tables to improve speed of testing / query results. Thanks to the sampling your query results should be roughly representative of what you will get. Note, the number you're modding with has to be prime otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (.0987% to be exact).
Most databases also have better sampling and random number methods than just using mod. Check the documentaion to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted using "mysql-use-result" as an attribute of the database handler rather than the default "mysql-store-result". This is true for the Perl and Java interfaces: I think in the C interface, you have to use an unbuffered version of the function that executes the query.