Kylin Cardinality Calculation - Hive

I have a Kylin cube built on data that is partitioned by date. Whenever a new date's data is added to Hive, Kylin does not detect it. Is this normal behaviour?
Currently I am manually reloading the table in the data-sources tab, which triggers a recalculation of cardinality. The data is large, so the cardinality calculation takes a very long time.
Can anyone help me? Am I missing anything?

When new data is added to the Hive external table, you must run:
MSCK REPAIR TABLE table_name;
Kylin will then be able to read the new partitions.
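If the daily load is scripted, the repair can be run automatically right after the new date's data lands, so the table is already up to date when Kylin reloads it or a build is triggered. A minimal sketch in Python, assuming the hive CLI is available on the host and using a hypothetical table name:

    import subprocess

    def repair_hive_table(table_name):
        # Ask Hive to scan the table's storage location and register any
        # partitions that exist on storage but not yet in the metastore.
        # Assumes the `hive` CLI is on PATH; beeline with a JDBC URL
        # would work the same way.
        subprocess.run(
            ["hive", "-e", "MSCK REPAIR TABLE {};".format(table_name)],
            check=True,
        )

    # Hypothetical table name; call this right after the new date's data
    # has been written to the external table's location.
    repair_hive_table("my_db.my_partitioned_table")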

Related

How to add/reflect query changes to the existing data in Spark

I have a table created in production with incorrect data. My rec_srt_dt and rec_end_dt columns were loaded incorrectly: rec_srt_dt was set to sys_dt. I have now modified the query so that it loads the data properly. My question is: how should I handle the existing data already in the production table, and how do I apply the new changes to that data?
My source table is in Oracle, I am using Spark for the transformations, and the target table is in AWS.
Kindly help me with this.

Issue in Partition in SSAS tabular model with DirectQuery Mode

I am trying to add a sample partition to a tabular model database in DirectQuery mode, and I got the following error after setting the filter and trying to import:
"Failed to save modifications to the server: Error returned: 'A table that has partitions using DirectQuery mode and a Full DataView can have only one partition in DirectQuery mode. In this mode, table 'FactInternetSales' has invalid partition settings. You might need to merge or delete partitions so that there is only one partition in DirectQuery mode with Full Data View."
Would anyone please help me understand the issue? Thank you.
A DirectQuery model is one which doesn't cache the data in the model. Instead, as the DirectQuery model is queried, it in turn generates queries against the backend SQL data source at query time. This is in contrast to an Import model, where the source data is imported ahead of time and compressed in memory for snappy query performance. Import models require periodic refreshes so the data doesn't get stale; DirectQuery models don't require refresh, since they always reflect what's in the source system.
The error you got is self-explanatory. A DirectQuery model should have only one partition per table, and that partition's query should cover 100% of the date range your model needs for that particular table. So check the FactInternetSales partitions, remove all but one, and remove the WHERE clause from the remaining partition's query so it covers the full data set.

Error: "The table or data volume was larger than BI Engine supports at this time"

Trying to use BI Engine with a BigQuery table and Data Studio, I get the error "The table or data volume was larger than BI Engine supports at this time". My table is partitioned; what can I do to fix this?
Make sure to tell Data Studio to use the partitioning column!
If Data Studio shows that the table is not accelerated with an error like this one, go to the data source definition and mark the partitioning column explicitly: under "Partitioned Table", select "Use <column name> as partitioning column". Data Studio should then show the dashboard as being accelerated.
Track developments for this issue on https://issuetracker.google.com/issues/140507651.

AWS Athena MSCK REPAIR TABLE tablename command

Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions. When the schema for the Athena table changes, we drop the table, recreate it with, say, the new column(s) tacked onto the end, and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work whatsoever, even after letting it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but can't find documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performs S3 listings, constructs a list of partitions with their locations, and uses the Glue API BatchCreatePartition to add the partitions to your table.
When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
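A minimal boto3 sketch of the delimiter trick, assuming Hive-style single-level prefixes like s3://my-bucket/my-table/dt=2019-01-01/ (the bucket and prefix names here are made up; deeper partition hierarchies would need one listing per level):

    import boto3

    s3 = boto3.client("s3")

    def list_partition_prefixes(bucket, prefix):
        # Listing with Delimiter="/" returns only the "directory" level
        # under the prefix (as CommonPrefixes), so the data files inside
        # each partition are never enumerated.
        paginator = s3.get_paginator("list_objects_v2")
        prefixes = []
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
            for cp in page.get("CommonPrefixes", []):
                prefixes.append(cp["Prefix"])  # e.g. "my-table/dt=2019-01-01/"
        return prefixes

    partitions = list_partition_prefixes("my-bucket", "my-table/")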
Glue's BatchCreatePartition is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTITION … and waiting for query execution to finish – and ridiculously faster than MSCK REPAIR TABLE ….
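One way to cut down on that boilerplate is to fetch the table's own StorageDescriptor from Glue and reuse it for every partition, overriding only the location. A rough sketch, with made-up database/table names and a single dt partition key:

    import boto3

    glue = boto3.client("glue")
    DATABASE, TABLE = "my_db", "my_table"  # hypothetical names

    # Reuse the table's storage descriptor (columns, serde, input/output
    # formats) so each partition only needs its own values and location.
    table_sd = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]["StorageDescriptor"]

    def partition_input(dt_value, location):
        sd = dict(table_sd)
        sd["Location"] = location
        return {"Values": [dt_value], "StorageDescriptor": sd}

    def add_partitions(partitions):
        # partitions: list of (dt_value, s3_location) tuples.
        # BatchCreatePartition accepts at most 100 entries per call;
        # partitions that already exist are reported in the response's
        # Errors list rather than failing the whole request.
        for i in range(0, len(partitions), 100):
            batch = partitions[i:i + 100]
            glue.batch_create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInputList=[partition_input(v, loc) for v, loc in batch],
            )

    add_partitions([("2019-01-01", "s3://my-bucket/my-table/dt=2019-01-01/")])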
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTITION … or Glue's BatchCreatePartition can be used directly with no scripting necessary.
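If the partition-adding step is part of an orchestrated pipeline anyway, the ALTER TABLE statement can also be submitted through the Athena API; a sketch with boto3, where the table, database, and result-bucket names are all assumptions:

    import boto3

    athena = boto3.client("athena")

    def add_partition(dt_value, location):
        # IF NOT EXISTS makes the statement safe to re-run.
        sql = (
            "ALTER TABLE my_table ADD IF NOT EXISTS "
            "PARTITION (dt = '{}') LOCATION '{}'".format(dt_value, location)
        )
        athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "my_db"},
            ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
        )

    add_partition("2019-01-02", "s3://my-bucket/my-table/dt=2019-01-02/")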
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.
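A rough sketch of draining such a queue and turning the notified object keys into partition prefixes, assuming S3 publishes its event notifications straight to SQS (the queue URL and key layout are made up):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-data-events"  # hypothetical

    def drain_new_partitions():
        partitions = set()
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
            )
            messages = resp.get("Messages", [])
            if not messages:
                break
            for msg in messages:
                body = json.loads(msg["Body"])  # S3 event notification payload
                for record in body.get("Records", []):
                    # e.g. key = "my-table/dt=2019-01-01/part-00000.parquet"
                    key = record["s3"]["object"]["key"]
                    partitions.add(key.rsplit("/", 1)[0])  # keep only the partition prefix
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        return sorted(partitions)

    # Feed the result into BatchCreatePartition or ALTER TABLE as above.
    print(drain_new_partitions())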

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts through Presto. Naturally, we're running into windows of time in which users hit an incomplete data set, because Presto ignores Hive locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but time consuming to do properly.
Swap the table location in the Hive metastore once an insert-overwrite is complete. This is OK, but a little messy, because we like to store explicit locations at the database level and let the tables inherit their location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the Hive table to see it. Then we can have views on top of the data that properly reconcile the latest version of each row (see the sketch after this list).
Stop doing insert-overwrite on S3, which has a long window of copying from the Hive staging location to the target table. If we move to HDFS for all insert-overwrite, we still have the issue, but it's over the span of time it takes to do an hdfs mv, which is significantly faster. (Probably bad: there's still a window in which we can get incomplete data.)
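For option 3, the reconciling view is typically just a window-function query that keeps the newest version of each row. A minimal sketch issued from Python via PyHive, where the library choice, connection details, and the orders/orders_versions schema are all assumptions (depending on the Presto version, the same definition may need to be created as a Presto view for Presto users to see it):

    from pyhive import hive  # assumed client library; any Hive connection works

    # Hypothetical schema: orders_versions is append-only and gains a new
    # partition per load; the view exposes only the newest version of each
    # order_id, based on a load_ts column written with each load.
    CREATE_VIEW = """
    CREATE VIEW orders AS
    SELECT order_id, amount, status, load_ts
    FROM (
        SELECT order_id, amount, status, load_ts,
               ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) AS rn
        FROM orders_versions
    ) v
    WHERE rn = 1
    """

    conn = hive.Connection(host="hive-server", port=10000)  # hypothetical host
    cur = conn.cursor()
    cur.execute(CREATE_VIEW)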
My question is: how do people generally handle this? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third-party tool that can query the Hive metastore and interact with HDFS/S3 directly while not respecting Hive locks.