ClickHouse: OPTIMIZE TABLE ON CLUSTER doesn't work properly - sql

I have a 2x2 cluster on v19.15.2.2 with a replicated, partitioned table.
select * from system.parts
=> some_part_0_0_1, some_part_0_0_2, etc.
shows me some unmerged parts.
The docs say that all parts are merged during an OPTIMIZE call, but after running the following query
// current settings on each node
optimize_throw_if_noop = 1
replication_alter_partitions_sync = 2
optimize table my_table on cluster my_cluster partition my_partition final
it just generates one more part and the old parts are not merged.
What am I doing wrong? Thanks.

Filter on active parts instead:
select * from system.parts WHERE active
The merge process (initiated by OPTIMIZE) merges several old (active) parts into a new active part. The old (merged) parts become inactive and are removed after 8 minutes (the 8-minute delay is a safety margin, because ClickHouse does not use fsync for performance reasons).
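As a quick check (a sketch only - the table and partition placeholders are the ones from the question), you can list which parts are active right after the OPTIMIZE:

SELECT partition, name, active, rows
FROM system.parts
WHERE table = 'my_table' AND partition = 'my_partition'
ORDER BY active DESC, name;
-- rows with active = 1 are the merged result;
-- rows with active = 0 are the old parts waiting to be dropped (about 8 minutes later)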


Optimize joining two big tables ORACLE 19C

How can I optimize the query below:
SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
       RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A, HISTER H WHERE A.CNACT = H.CNACT;
select count(*) FROM NATACF;  --> 74794
select count(*) FROM HISTER;  --> 2100720
You will find the execution plan in the attachment.
Thank you.
As you can see, the window sort and the hash join are not optimized effectively. What is the best way to optimize this?
The screenshot below is from the prod database:
Long story short: you want ALL data from *both* tables, with no filtering in place.
Oracle reads the whole smaller (driving) table into a hash map, using the join column CNACT as the key.
Then it reads the whole bigger table and performs a lookup in the hash map for each row read.
The complexity is O(N+M); each row is read only once.
There is no way to evaluate such a query any faster (aside from dirty tricks like putting both tables into a CLUSTER, pinning the tables in the buffer cache, ...).
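To confirm the plan shape described above (a hash join of two full scans), a quick sketch using the exact query from the question:

EXPLAIN PLAN FOR
SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
       RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A, HISTER H
WHERE A.CNACT = H.CNACT;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- expect: WINDOW SORT over HASH JOIN over two full table scans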
PS: it is strange that the explain plan shows 2 seconds - both tables are not actually that big - while the prod DB says 5 hours.
Try to execute the query in SQL*Plus using set timing on, set echo off, set pagesize 0, set termout off, set feedback off, set pause off, set verify off, set heading off. Basically, read the whole result, then discard it and print only the execution time. Then you will see.
Maybe it is the app (or the network) that has a problem transferring the whole big result set. In such a case you would see the "SQL*Net message to client" wait event in AWR, i.e. the database is waiting for the application to accept more data. You are sending about 14 GB of data into the application.
For example, Java may have problems with GC, or each row may trigger a costly Java object creation.
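A minimal SQL*Plus sketch of that timing test, using exactly the settings listed above and the query from the question (run it as a script; the remaining output is the elapsed time):

set timing on
set echo off
set pagesize 0
set termout off
set feedback off
set pause off
set verify off
set heading off

SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS,
       RANK() OVER (PARTITION BY H.CNACT, H.CAECH, H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A, HISTER H
WHERE A.CNACT = H.CNACT;

If this finishes in seconds while the application still takes hours, the time is spent transferring or consuming the result set rather than in the database.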
We resolved the problem using a WITH clause:
WITH Z AS (SELECT
    X.DATRA, X.COINI, X.COINT, X.NUCPT, N.COBCN, X.CNACT, N.LCACT, N.CNACR, DECODE(X.CSOPT, NULL, X.CAECH, O.CAECR) CAECH,
    DECODE(X.CSOPT, NULL, X.CMECH, O.CMECR) CMECH, X.CSOPT, X.MTSNA, N.COTSJ, X.CSENS, X.QTCCP, D.TXCHA, R.MAINT, X.CODEV,
    D.CDVRF, C.TYEDI, C.NUSES
FROM CUMPOR X,
     MRXIDE C, .....
The WITH clause: the materialized subquery data is persistent throughout the query.

How to limit BigQuery query size for testing a query sample through the web user-interface?

I would like to know if it is possible to limit the BigQuery query size when running a query through the web user interface?
My idea is just to test the query, but instead of querying all my tables I would like to query only a part of them, for instance a certain number of rows.
LIMIT does not optimize my query cost, so the idea is to find a function similar to "row_number" or "fetch".
Sorry, I'm a marketer and not a developer, so thank you in advance for your kind help.
How to limit BigQuery query size for testing ... ?
1 - Try to minimize the number of tables involved in your testing.
In your query there are 60+ tables involved, for the dates between 2016-12-11 and today:
SELECT <fields_list> FROM
TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
TIMESTAMP('20161211'),
TIMESTAMP('20170315'))
Instead you can use the same day as the start and end of the time range, thus drastically reducing the number of involved tables (down to just one) and the overall scan size. For example:
SELECT <fields_list> FROM
TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
TIMESTAMP('20161211'),
TIMESTAMP('20161211'))
2 - Minimize the number of rows. Whether you can do so really depends on how your table is loaded with data. If the table is loaded incrementally, you can use so-called table decorators.
Note: this technique only works with data from the last 7 days.
For example, the query below will scan only the data that was in the table one hour ago (a so-called snapshot decorator):
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000]
This works well with the most recent day's table, especially at the start of the day when the table is not yet big.
To limit the scan further, you can use the version below (a so-called range decorator), which gives you the data added between one hour and half an hour ago:
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000--1800000]
Finally, #0 is a special case that references the oldest possible snapshot of the table: either 7 days in the past, or the table's creation time if the table is less than 7 days old. For example
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170210#0]
3 - Test against a sampled table. If you expect to experiment with your query again and again, you can first prepare a downsized version of your table with just as many rows as you need, applying whatever sampling logic fits your business case. To limit the number of rows you can use a LIMIT clause; to get random rows you can use the RAND function, as in the sketch below.
After the sampled table is prepared, run all your queries against it until you have the final version; after that, run the final version against your original table(s).
And by the way, to create the sampled table you need to set a destination table under the options in the Web UI.
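A sketch of such a sampling query (legacy SQL, matching the examples above; <fields_list> and the table are from the question, while the 10% fraction and the row cap are arbitrary example values). With a destination table set in the Web UI options, the result is saved as the sampled table:

SELECT <fields_list>
FROM [XXX:85801771.ga_sessions_20161211]
WHERE RAND() < 0.1    -- keep roughly 10% of rows at random
LIMIT 100000          -- hard cap on the sampled row count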

How to handle query execution time (performance issue) in Oracle

I have a situation where I need to execute a patch script for millions of rows of data. The current execution time does not meet expectations even for a small number of rows (18,000), which takes around 4 hours (this is test data before deploying to live).
The patch script selects millions of rows of data in a loop and updates them according to the specification. I just wonder how long it could take for millions of rows, since it takes around 4 hours for just 18,000 rows.
To overcome this problem I decided to create a temp table holding the entire result of the select statement and to run the patch process against the temp table, where the process could be a bit faster compared to select-and-update.
Are there any other ways I can use to handle this situation? Any suggestions on how to solve this?
(Due to company policy I am unable to post the PL/SQL script here.)
Since it seems no one here can answer my question, I am posting my own answer.
In Oracle there is Parallel Execution, which allows spreading the processing of a single SQL statement across multiple threads.
By using this method I reduced my long-running query from 4 hours to 6 minutes.
For more information:
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
http://www.oracle.com/technetwork/articles/database-performance/geist-parallel-execution-1-1872400.html
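Since the actual PL/SQL cannot be posted, the following is only a generic sketch of the idea; the table, columns and the degree of parallelism 8 are made-up placeholders, not taken from the real patch script:

-- enable parallel DML for the session, then let the UPDATE itself run in parallel
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(t, 8) */ patch_target t
SET    t.patched_flag = 'Y'
WHERE  t.status = 'PENDING';

COMMIT;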

SQL_ID, CBO, Optimizer_Mode and Plan Change History

This is a 4-part question.
1. What is the logic behind SQL_ID? Does the value change for the same SQL over time? Does it persist between DB restarts? Or does every plan change give a new SQL_ID?
2. How can I check the plan change history for a particular query? Given the SQL_ID, I tried querying the dba_hist_sqlstat table, but it does not give the time of the plan change and other details needed to match it against the v$sql_plan table.
3. I have the parameter optimizer_mode set to FIRST_ROWS. Even then, when I look at dba_hist_sqlstat, it indicates ALL_ROWS for some SQL_IDs. Can Oracle disregard the instance-level parameter and use what it deems most suitable?
4. Between 8 PM and 2 PM a query was performing badly, taking 6 seconds to execute. After 3 PM the query started responding in under 1 second. I have the AWR reports for the periods that show this. There was no difference in load on the DB between these two periods. How can I get to the root of this? I am trying to find the history of the plan change, but would appreciate more feedback on how best to analyze such issues.
The DB Version is Oracle 10.1.0.4 running on AIX 5.3 64b
1.- SQL_ID is calculated with a hash function over the actual SQL text. It shouldn't change with a restart or between databases, at least on the same version (different Oracle versions could have different hashing functions). So as long as you do not change the SQL text (this includes blanks, commas and everything), the SQL_ID will remain the same.
2.- Use DBMS_XPLAN.DISPLAY_AWR to display all plans recorded for a SQL_ID:
select * from table(dbms_xplan.display_awr(sql_id => '[your SQL_ID]'));
3.- Oracle will only do this if the query has an optimizer goal hint in it.
4.- There are many things in play for this one. I would start by looking at the top 5 timed events in AWR for both periods of time. If they are alike, I would then investigate the plan history for the statement, see whether it changed between the periods, and also how the data behaved during those periods. One of these three should give you the answer.
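For point 2, here is a sketch of a query that lists the plans AWR has recorded for one SQL_ID together with the snapshot time, so a plan change can be matched against the slow and fast periods (the dba_hist_* columns used are the common ones; their exact availability on 10.1.0.4 is an assumption):

SELECT s.snap_id,
       sn.begin_interval_time,
       s.plan_hash_value,
       s.executions_delta,
       s.elapsed_time_delta / 1e6 AS elapsed_seconds
FROM   dba_hist_sqlstat s
JOIN   dba_hist_snapshot sn
       ON  sn.snap_id = s.snap_id
       AND sn.dbid = s.dbid
       AND sn.instance_number = s.instance_number
WHERE  s.sql_id = '[your SQL_ID]'
ORDER  BY sn.begin_interval_time;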

When unloading a table from Amazon Redshift to S3, how do I make it generate only one file

When I unload a table from Amazon Redshift to S3, it always splits the table into two parts, no matter how small the table is. I have read the Redshift documentation regarding unloading, but found no answer other than that it sometimes splits the table (I've never seen it not do that). I have two questions:
Has anybody ever seen a case where only one file is created?
Is there a way to force Redshift to unload into a single file?
Amazon recently added support for unloading to a single file by using PARALLEL OFF in the UNLOAD statement. Note that you can still end up with more than one file if the data is bigger than 6.2 GB.
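A minimal sketch of such a statement (the table name, S3 prefix and credentials are placeholders, not from the question):

UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my_table_'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
PARALLEL OFF;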
By default, each slice creates one file (explanation below). There is a known workaround: adding a LIMIT to the outermost query forces the leader node to process the whole response, so it creates only one file.
SELECT * FROM (YOUR_QUERY) LIMIT 2147483647;
This only works as long as your inner query returns fewer than 2^31 - 1 records, as a LIMIT clause takes an unsigned integer argument.
How are files created? http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
Amazon Redshift splits the results of a select statement across a set of files, one or more files per node slice, to simplify parallel reloading of the data.
So now we know that at least one file per slice is created. But what is a slice? http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
The number of slices is equal to the number of processor cores on the node. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices.
It seems that the minimal number of slices is 2, and it grows larger when more nodes or more powerful nodes are added.
As of May 6, 2014, UNLOAD queries support a new PARALLEL option. Passing PARALLEL OFF will output a single file if your data is less than 6.2 GB (larger data is split into 6.2 GB chunks).