I am trying to build the cube and I am getting the error below. What should I do to resolve it?
Internal error: The operation terminated unsuccessfully. Errors in the
OLAP storage engine: The attribute key cannot be found when
processing: Table: 'dbo.FACT1', Column: 'LoanAge', Value: '-93'. The
attribute is 'LoanAge'. Errors in the OLAP storage engine: The record
was skipped because the attribute key was not found. Attribute:
LoanAge of Dimension: LoanAge from Database: Cube_Data, Cube: Bond
Analytics OLAP, Measure Group: FACT1, Partition: Fact Combined
SUBPRIME 20180401 HPI Median, Record: 185597. Errors in the OLAP
storage engine: The process operation ended because the number of
errors encountered during processing reached the defined limit of
allowable errors for the operation. Errors in the OLAP storage engine:
An error occurred while processing the 'Fact Combined SUBPRIME
20180401 HPI Median' partition of the 'FACT1' measure group for the
'Bond Analytics OLAP' cube from the cube_Data database. Server: The
current operation was cancelled because another operation in the
transaction failed. Internal error: The operation terminated
unsuccessfully. Errors in the OLAP storage engine: An error occurred
while processing the 'Fact Combined ALTA_20180401 HPI Median'
partition of the 'FACT1' measure group for the 'Bond Analytics OLAP'
cube from the Cube_Data database.
Greg actually replied in a comment under your question. Let me expand on his explanation a bit.
Table dbo.FACT1 has a row with column LoanAge = -93.
It is record #185597 of the T-SQL query the cube runs to grab data for the Fact Combined SUBPRIME 20180401 HPI Median partition.
However, this value (-93) is not present in the LoanAge attribute of the LoanAge dimension.
To fix it you need to:
add this value to the LoanAge dimension table,
"Process Update" the LoanAge dimension,
process the Fact Combined SUBPRIME 20180401 HPI Median partition again.
Then figure out why the dimension has no -93 value.
You probably need to implement a late-arriving dimension scenario if your facts arrive earlier than the corresponding dimension values.
E.g. when an unknown value comes in from the facts, add it to the dimension and mark it with some default name (e.g. 'Unknown -93'), then update it later once the dimension reference table has that code; see the sketch below.
This is the common approach, although it doesn't apply exactly to such a simple attribute as age (a numeric value with no additional description).
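As a rough illustration of that pattern, here is a hedged T-SQL sketch; the dimension table and its columns (dbo.DimLoanAge, LoanAgeName) are assumptions, only dbo.FACT1 and LoanAge come from the error message.
-- Add any LoanAge values present in the fact table but missing from the
-- dimension, flagged with a default 'Unknown ...' name for later review.
INSERT INTO dbo.DimLoanAge (LoanAge, LoanAgeName)
SELECT DISTINCT f.LoanAge,
       'Unknown ' + CAST(f.LoanAge AS varchar(10))
FROM dbo.FACT1 AS f
LEFT JOIN dbo.DimLoanAge AS d ON d.LoanAge = f.LoanAge
WHERE d.LoanAge IS NULL;
After that, run "Process Update" on the LoanAge dimension and reprocess the failed partition.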
I am using Spark 2.3 and have written one dataframe to create a Hive partitioned table, using the DataFrame writer class method in PySpark.
newdf.coalesce(1).write.format('orc').partitionBy('veh_country').mode("overwrite").saveAsTable('emp.partition_Load_table')
Here is my table structure and partitions information.
hive> desc emp.partition_Load_table;
OK
veh_code varchar(17)
veh_flag varchar(1)
veh_model smallint
veh_country varchar(3)
# Partition Information
# col_name data_type comment
veh_country varchar(3)
hive> show partitions partition_Load_table;
OK
veh_country=CHN
veh_country=USA
veh_country=RUS
Now I am reading this table back into a dataframe in PySpark.
df2_data = spark.sql("""
SELECT *
from udb.partition_Load_table
""");
df2_data.show() --> is working
But I am not able to filter it using the partition key column:
from pyspark.sql.functions import col
newdf = df2_data.where(col("veh_country")=='CHN')
I am getting the below error message:
: java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem,
however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
Whereas when I create the dataframe by specifying the absolute HDFS path of the table, the filter and where clauses work as expected.
newdataframe = spark.read.format("orc").option("header","false").load("hdfs/path/emp.db/partition_load_table")
The below works:
newdataframe.where(col("veh_country")=='CHN').show()
My question is: why was it not able to filter the dataframe in the first place, and why does it throw the error "Filtering is supported only on partition keys of type string" even though my veh_country column is defined as a string/varchar data type?
I have stumbled on this issue as well. What helped for me was to run this line:
spark.sql("SET spark.sql.hive.manageFilesourcePartitions=False")
and then use spark.sql(query) instead of the DataFrame API.
I do not know what happens under the hood, but this solved my problem.
Although it might be too late for you (since this question was asked 8 months ago), this might help other people.
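For reference, a minimal sketch of that workaround against the table from the question; both statements are run through spark.sql(), and the performance caveat in the answer below still applies.
-- Work around the MetaException by disabling metastore partition management
-- (this degrades partition pruning, so use it with care on heavily partitioned tables).
SET spark.sql.hive.manageFilesourcePartitions=false

-- Filter inside the SQL statement instead of through the DataFrame API.
SELECT *
FROM udb.partition_Load_table
WHERE veh_country = 'CHN'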
I know the topic is quite old, but:
I received the same error, yet the actual source of the problem was hidden much deeper in the logs. If you are facing the same problem as me, go to the end of your stack trace and you might find the actual reason the job is failing. In my case:
a. org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:865)\n\t... 142 more\nCaused by: MetaException(message:Rate exceeded (Service: AWSGlue; Status Code: 400; Error Code: ThrottlingException ... which basically means I had exceeded the AWS Glue Data Catalog quota, or:
b. MetaException(message:1 validation error detected: Value '(<my filtering condition goes here>' at 'expression' failed to satisfy constraint: Member must have length less than or equal to 2048, which means the filtering condition I put in my dataframe definition was too long.
Long story short: dig deep into your logs, because the reason for your error might be really simple; the top message is just far from clear.
If you are working with tables that have a huge number of partitions (in my case hundreds of thousands), I would strongly recommend against setting spark.sql.hive.manageFilesourcePartitions=False. Yes, it will resolve the issue, but the performance degradation is enormous.
I know this is mainly a design problem. I've read that there is a workaround for this issue by customizing error handling at processing time, but I am not happy about having to ignore errors; also, the cube processing is scheduled, so ignoring errors is not an option, or at least not a good one.
This is part of my cube where the error is thrown.
DimTime
PK (int)
MyMonth (int, Example = 201501, 201502, 201503, etc.)
Other columns...
FactBudget
PK (int)
Month (int, Example = 201501, 201502, 201503, etc.)
Other columns...
The relationship in the DSV is set as follows:
DimTiempo = DimTime, FactPresupuesto = FactBudget, periodo = MyMonth, PeriodoPresupFK = Month
(names just translated for clarity).
The relationship in the cube is as follows:
The cube was built without problems, but when processing, the error "The attribute key cannot be found when processing" was thrown.
It was thrown because FactBudget has some Month values (201510, 201511, 201512, for example) which DimTime doesn't, so the integrity is broken.
As mentioned in the answer here, this can be solved in the ETL process. I think I can do nothing to fix the relationship if a fact table has foreign keys that have not been inserted into the dimensions.
Note that MyMonth can have values 201501, 201502, 201503, etc. (year and month concatenated). DimTime is loaded incrementally and that column is calculated every day, so at the moment DimTime doesn't have values for 201507 onwards.
Is there a workaround or pattern to handle this kind of relationships?
Thanks for considering my question.
I believe that the process you are following is incorrect: you should set up any time-related dimensions via a degenerate/fact dimension methodology. That is, the time dimension would not really be a true dimension; rather, it is populated through the fact table itself, which contains the time. If you look up degenerate dimensions you'll see what I mean.
Is there a reason why you're incrementally populating DimTime? It certainly isn't the standard way to do it. You need the values you're using in your fact table to already exist in the dimensions. I would simply script up a full set of data for DimTime and stop the incremental updates of it; a sketch follows below.
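A minimal sketch of such a script, assuming DimTime has a date column alongside MyMonth and an identity PK; adapt the column list to the real schema.
-- Pre-populate DimTime for a whole range of years instead of loading it day by day.
;WITH Dates AS (
    SELECT CAST('2010-01-01' AS date) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]) FROM Dates WHERE [Date] < '2030-12-31'
)
INSERT INTO dbo.DimTime ([Date], [MyMonth])
SELECT [Date],
       YEAR([Date]) * 100 + MONTH([Date])  -- 201501, 201502, ...
FROM Dates
OPTION (MAXRECURSION 0);
With the dimension fully populated up front, the 201507+ fact rows will always find their keys.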
I ran into this issue while trying to process a cube in Halo BI. It seems that some datetime conversion styles are supported by SQL Server but not by Halo BI. This statement does not cause an error:
CAST(CONVERT(char(11),table.[col name],113) AS datetime) AS [col name]
however, this one does not process without an error:
CAST(CONVERT(char(16),table.[col name],120) AS datetime) AS [col name]
However both of these work in SQL Server Management Studio 2012.
Another cause of this error is cube measures being improperly mapped to the fact table.
I have an SSAS tabular mode cube that reads data from an Actian Matrix database using ODBC. The project processes fine when I'm using a data set with 1 million rows, but when I try to use a bigger one (300 million rows), the processing runs for around 15 minutes and fails with the message:
The operation failed because the source database does not exist, the source table does not exist, or because you do not have access to the data source.
More Details:
OLE DB or ODBC error: [ParAccel][ODBC Driver][PADB]57014:ERROR: Query (25459) cancelled on user's request
DETAIL: Query (25459) cancelled on user's request
; 57014.
An error occurred while processing the partition 'XXXX' in table 'YYYY'.
The current operation was cancelled because another operation in the transaction failed.
The message says that the database doesn't exist, but that doesn't make sense, because it works perfectly fine in the first case (and the only difference is a WHERE clause to limit the number of rows).
I'm using a server that has 96 GB of FREE RAM, and I can see all the memory being consumed while the processing operation is running. When it is all consumed, the process runs for a few extra seconds and fails. Also, I know for a fact that the 300 million row dataset exported to a CSV file takes 36 GB in its raw format, so it should fit fully in memory without any compression.
I can also guarantee that the query works fine on its own on the source database so the "Query (25459) cancelled on user's request" message also doesn't make much sense.
Does anyone have any idea on what may be going on?
Memory consumption for the derivative of the input rows (the resulting cube) cannot be estimated from the byte size of the input. It is a function of the Cartesian product of all distinct values of the cube dimensions.
If you were building a cube from these 2 input rows, with 2 dimensions (State, City) and 2 measures (Population and a record count):
State|City|Population
---------------------
NY|New York|8406000
CA|Los Angeles|3884000
the resulting cube would contain:
State|City|Population|Number of records
---------------------------------------
NULL|NULL|12290000|2
NY|NULL|8406000|1
NY|New York|8406000|1
CA|NULL|3884000|1
CA|Los Angeles|3884000|1
NULL|Los Angeles|3884000|1
NULL|New York|8406000|1
You can't expect the output generated as the input data rows are processed to be equivalent in size to the input. And if the ODBC driver keeps the entire input in memory before it lets you read it, then you have to account for both the input and the output residing in memory until the cube generation is complete.
This answer is much clearer on the subject: How to calculate the likely size of an OLAP cube
I'm getting this error when I'm trying to process a data mining structure with a nested table in it.
Error 5 Errors in the metadata manager. The 'XYZZZZZ' dimension in the 'Dim XYZ' measure group has either zero or multiple granularity attributes.
Must have exactly one attribute. 0 0
Any idea why this is happening?
Can you post your mining structure's code?
I think you have to create it with the MISSING_VALUE_SUBSTITUTION parameter to get rid of zero granularities. It always solves my problem when I have a time series with a gap in it.
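For illustration only, a hedged DMX sketch of where that parameter goes; the structure, model, and column names are made up, not taken from your project.
-- Add a time series model whose gaps are filled with the previous value
-- instead of breaking processing.
ALTER MINING STRUCTURE [MyMiningStructure]
ADD MINING MODEL [MyTimeSeriesModel]
(
    [Time Index] KEY TIME,
    [Amount] PREDICT
)
USING Microsoft_Time_Series (MISSING_VALUE_SUBSTITUTION = 'Previous')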
A little bit of background: we have an OLAP system that has been happily processing its cube for a customer for a long time. Then recently it started to fail. This has coincided with the main developer accidentally getting married and making himself unavailable, so obviously I can't go pestering him.
We have a date dimension that works at the Year, Month, Day level. We have two hierarchies, for calendar and fiscal years.
It's currently throwing a message that I find pretty indecipherable (not being an OLAP dev), and the examples I've read online refer to it being caused by weeks splitting across months, which is not a problem I have.
The message is:
Rigid relationships between attributes cannot be changed during incremental processing of a dimension.
When I reprocess the cube I now get problems related to dates. When I reprocess the Date dimension I get the following:
Internal error: The operation terminated unsuccessfully.
Errors in the OLAP storage engine: Rigid relationships between attributes cannot be changed during incremental processing of a dimension.
Errors in the OLAP storage engine: An error occurred while the 'Date ID' attribute of the 'Date' dimension from the 'TMC_CUBE_TESCO' database was being processed.
Errors in the OLAP storage engine: The process operation ended because the number of errors encountered during processing reached the defined limit of allowable errors for the operation.
Server: The operation has been cancelled.
When I view the entire detail for the Date Dimension I see that it has processed a heap of SELECT statements but falls down here:
SELECT DISTINCT [dbo_dw_DIMdate].[DateTime] AS [dbo_dw_DIMdateDateTime0_0],[dbo_dw_DIMdate].[DayOfMonth] AS [dbo_dw_DIMdateDayOfMonth0_1],[dbo_dw_DIMdate].[MonthNumberCalendar] AS [dbo_dw_DIMdateMonthNumberCalendar0_2],[dbo_dw_DIMdate].[YearCalendar] AS [dbo_dw_DIMdateYearCalendar0_3]
FROM [dbo].[dw_DIMdate] AS [dbo_dw_DIMdate]
Processing Dimension Attribute 'Date ID' failed. 1 rows have been read.
Start time: 10/21/2011 10:30:35 PM; End time: 10/21/2011 10:30:35 PM; Duration: 0:00:00
SQL queries 1
SELECT DISTINCT [dbo_dw_DIMdate].[DateID] AS [dbo_dw_DIMdateDateID0_0],[dbo_dw_DIMdate].[DayOfCalendarYear] AS [dbo_dw_DIMdateDayOfCalendarYear0_1],[dbo_dw_DIMdate].[DayOfFiscalYear] AS [dbo_dw_DIMdateDayOfFiscalYear0_2],[dbo_dw_DIMdate].[DayOfWeek] AS [dbo_dw_DIMdateDayOfWeek0_3],[dbo_dw_DIMdate].[IsCalendarYearToDate] AS [dbo_dw_DIMdateIsCalendarYearToDate0_4],[dbo_dw_DIMdate].[IsFiscalYearToDate] AS [dbo_dw_DIMdateIsFiscalYearToDate0_5],[dbo_dw_DIMdate].[IsLastCalendarMonth] AS [dbo_dw_DIMdateIsLastCalendarMonth0_6],[dbo_dw_DIMdate].[IsLastWeek] AS [dbo_dw_DIMdateIsLastWeek0_7],[dbo_dw_DIMdate].[IsWeekDay] AS [dbo_dw_DIMdateIsWeekDay0_8],[dbo_dw_DIMdate].[IsYesterday] AS [dbo_dw_DIMdateIsYesterday0_9],[dbo_dw_DIMdate].[DateTime] AS [dbo_dw_DIMdateDateTime0_10],[dbo_dw_DIMdate].[DayOfWeekName_engb] AS [dbo_dw_DIMdateDayOfWeekName_engb0_11],[dbo_dw_DIMdate].[ShortDayOfWeekName_engb] AS [dbo_dw_DIMdateShortDayOfWeekName_engb0_12],[dbo_dw_DIMdate].[WeekNumberCalendar] AS [dbo_dw_DIMdateWeekNumberCalendar0_13],[dbo_dw_DIMdate].[WeekNumberFiscal] AS [dbo_dw_DIMdateWeekNumberFiscal0_14],[dbo_dw_DIMdate].[WeekCommencing] AS [dbo_dw_DIMdateWeekCommencing0_15],[dbo_dw_DIMdate].[YearFiscal] AS [dbo_dw_DIMdateYearFiscal0_16],[dbo_dw_DIMdate].[YearCalendar] AS [dbo_dw_DIMdateYearCalendar0_17],[dbo_dw_DIMdate].[IsLastCalendarWeek] AS [dbo_dw_DIMdateIsLastCalendarWeek0_18]
FROM [dbo].[dw_DIMdate] AS [dbo_dw_DIMdate]
Error Messages 1
I'm not after "sent me teh codez", but any help understanding the error message and problem would be very much appreciated.
You should process your database in Full mode, not Incremental (if the size of the database is not very big). But that is only one approach. You could also have a problem with your dictionary (the source table for your dimension). Use the query from the process window and try to obtain the same distinct count for the attribute's Id and Name fields; a sketch of that check follows below.
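A minimal sketch of that check, assuming the 'Date ID' attribute is keyed by the DateID column and named by the DateTime column (both appear in the processing queries above).
-- The two counts should match; a mismatch points at duplicate or changed rows
-- in the dictionary table, which is what breaks dimension processing.
SELECT
    COUNT(DISTINCT [DateID])   AS DistinctKeyCount,
    COUNT(DISTINCT [DateTime]) AS DistinctNameCount
FROM [dbo].[dw_DIMdate];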