ClickHouse cluster: data not replicated

I have a test cluster with 2 nodes: 1 shard and 2 replicas.
There are 3 nodes in the ZooKeeper ensemble.
The remote_servers configuration:
<remote_servers>
    <ch_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>ch1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch2</host>
                <port>9000</port>
            </replica>
        </shard>
    </ch_cluster>
</remote_servers>
Macros on ch1:
<macros>
    <shard>shard_01</shard>
    <replica>replica-01</replica>
</macros>
Macros on ch2:
<macros>
    <shard>shard_01</shard>
    <replica>replica-02</replica>
</macros>
ZooKeeper configuration:
<zookeeper>
    <node>
        <host>zoo1</host>
        <port>2181</port>
    </node>
    <node>
        <host>zoo2</host>
        <port>2181</port>
    </node>
    <node>
        <host>zoo3</host>
        <port>2181</port>
    </node>
</zookeeper>
First I create the replicated table:
CREATE TABLE IF NOT EXISTS test.hits_local ON CLUSTER ch_cluster
(
`date` Datetime,
`user_id` String,
`pageviews` Int32
)
ENGINE = ReplicatedMergeTree('/clickhouse/ch_cluster/tables/{shard}/hits_local', '{replica}')
PARTITION BY toStartOfHour(date)
ORDER BY (date)
Then I create a Distributed table:
CREATE TABLE IF NOT EXISTS test.hits ON CLUSTER 'ch_cluster'
AS test.hits_local
(
`date` Datetime,
`user_id` String,
`pageviews` Int32
)
ENGINE = Distributed('ch_cluster', 'test', 'hits_local')
Then I insert data into the test.hits_local table on ch1.
When I select from test.hits_local on ch2, there is no data.
Then I tried to select from the Distributed table test.hits on ch2; the data appeared after 5-6 minutes,
but there is still no data in test.hits_local on ch2.
My questions: when is the data replicated to ch2?
Who is responsible for replicating data to the other node? Is it ZooKeeper, or should I insert the data into the tables on both ch1 and ch2?
Should I change <internal_replication>true</internal_replication> to false?
Is the data supposed to be replicated to test.hits_local on ch2?
Thank you.

Should I change <internal_replication>true</internal_replication> to false?
No, you should not. If you use ReplicatedMergeTree, internal_replication MUST be true.
Replication is done internally by the ReplicatedMergeTree table engine.
Replicas communicate with each other using their hostnames and the interserver port 9009.
Check the system.replication_queue table for errors.
Most probably the node "ch1" announced its own hostname in ZooKeeper as something unreachable, e.g. "localhost",
so the second node "ch2" is unable to reach localhost:9009.
You can find such issues in clickhouse-server.log or in system.replication_queue (it has a column with the last error).
Usually replication lag is less than 2 seconds, even in very highly loaded setups.
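For reference, a quick way to check this on each node is to query the replication system tables; a minimal sketch (the exact column sets may vary slightly between ClickHouse versions):

-- stuck replication tasks and their last error
SELECT database, table, type, num_tries, last_exception
FROM system.replication_queue
WHERE last_exception != '';

-- overall replica health for the test database
SELECT database, table, is_readonly, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'test';

If ch1 really did register an unreachable hostname, setting <interserver_http_host> in its server config to a hostname that ch2 can resolve is the usual fix.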

Related

Airflow: how can I automate such that a query runs for every date specified rather than hard-coding?

I am new to Airflow, so apologies if this has been asked somewhere.
I have a query I run in Hive against a table that is partitioned on year-month, e.g. 202001.
How can I run a query in Airflow that substitutes different values for a variable within the query? E.g., taking this example:
from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator

default_arg = {'owner': 'airflow', 'start_date': '2020-02-28'}

dag = DAG('simple-mysql-dag',
          default_args=default_arg,
          schedule_interval='00 11 2 * *')

mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default',
                           task_id='mysql_task',
                           sql='<path>/sample_sql.sql',
                           params={'test_user_id': -99})
where my sample_sql.hql looks like:
ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
cpd_ym = ${ym}
) PURGE;
INSERT INTO sample_df
PARTITION (
cpd_ym = ${ym}
)
SELECT
*
from sourcedf
;
ANALYZE TABLE sample_df
PARTITION (
cpd_ym = ${ym}
)
COMPUTE STATISTICS;
ANALYZE TABLE sample_df
PARTITION (
cpd_ym = ${ym}
)
COMPUTE STATISTICS FOR COLUMNS;
I want to run the above for different values of ym using Airflow, e.g. between 202001 and 202110. How can I do this?
I'm a bit confused because you are asking about Hive yet you show an example with MySqlOperator. In any case, assuming the sql/hql parameter is templated, you can use execution_date directly in your query and extract the year and month to be used for the partition value.
Example:
mysql_task = MySqlOperator(
    dag=dag,
    task_id='mysql_task',
    sql="""SELECT {{ execution_date.strftime('%Y%m') }}""",
)
So in your sample_sql.hql it will be:
ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
    cpd_ym = {{ execution_date.strftime('%Y%m') }}
) PURGE;
You mentioned that you are new to Airflow, so make sure you understand what execution_date is and how it is calculated (if you are not sure, check this answer). You can apply string manipulation to other macros as well. Choose the macro that is suitable for your needs (execution_date / prev_execution_date / next_execution_date / etc.).

BigQuery: Store semi-structured JSON data

I have data which can have varying JSON keys. I want to store all of this data in BigQuery and then explore the available fields later.
My structure will be like so:
[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]
I wanted to use a STRUCT type but it seems all the fields need to be declared?
I then want to be able to query and see how often each key appears, and basically run queries over all records treating, for example, a given key as though it were its own column.
Side note: this data is coming from URL query strings; maybe it is better to store the full URL and then use functions to run the analysis?
There are two primary methods for storing semi-structured data as you have in your example:
Option #1: Store JSON String
You can store the data field as a JSON string, and then use the JSON_EXTRACT function to pull out the values it can find, and it will return NULL for any value it cannot find.
Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:
# Creating an example table using the WITH statement, this would not be needed
# for a real table.
WITH records AS (
SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
UNION ALL
SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" as data
UNION ALL
SELECT 3333 AS id, "{\"a\":27}" as data
UNION ALL
SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
)
# Example Query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
SELECT id,
CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue # Extract & cast as an INT
FROM records
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
There are some pros and cons to this approach:
Pros
The syntax is fairly straightforward
Less error prone
Cons
Storage costs will be slightly higher, since you have to store all the characters needed to serialize the JSON.
Queries will run slower than using pure native SQL.
Option #2: Repeated Fields
BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.
Using the same example, here is how we would do that:
## Using a with to create a sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
(1111, [("a","27"),("b","62"),("c","string")]),
(2222, [("a","27"),("c","string")]),
(3333, [("a","27")]),
(4444, [("a","27"),("b","62"),("c","string")])
])),
## Using another WITH table to take records and unnest them to be joined later
recordsUnnested AS (
SELECT id, key, value
FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
FROM records R
LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
As you can see, performing a similar query is still rather complex. You also still have to store items like strings and CAST them to other types when necessary, since you cannot mix types in a repeated field.
Pros
Storage size will be smaller than with the JSON-string approach
Queries will typically execute faster.
Cons
The syntax is more complex and not as straightforward
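For the other part of the question, how often each key appears, here is a minimal sketch reusing the sample records definition from Option #2 above (the alias names are just illustrative):

WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
    (1111, [("a","27"),("b","62"),("c","string")]),
    (2222, [("a","27"),("c","string")]),
    (3333, [("a","27")]),
    (4444, [("a","27"),("b","62"),("c","string")])
]))
SELECT keyVals.key, COUNT(*) AS occurrences
FROM records, UNNEST(records.data) AS keyVals
GROUP BY keyVals.key
ORDER BY occurrences DESC
# expected results: a = 4, c = 3, b = 2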
Hope that helps, good luck.

Data ingest issues in Hive: java.lang.OutOfMemoryError: unable to create new native thread

I'm a Hive newbie having an odyssey of problems getting a large (1 TB) HDFS file into a partitioned, Hive-managed table. Can you please help me get around this? I feel like I have a bad config somewhere, because I'm not able to complete reducer jobs.
Here is my query:
DROP TABLE IF EXISTS ts_managed;
SET hive.enforce.sorting = true;
CREATE TABLE IF NOT EXISTS ts_managed (
svcpt_id VARCHAR(20),
usage_value FLOAT,
read_time SMALLINT)
PARTITIONED BY (read_date INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy","orc.create.index"="true","orc.bloom.filter.columns"="svcpt_id");
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.cbo.enable=true;
SET hive.tez.auto.reducer.parallelism=true;
SET hive.exec.reducers.max=20000;
SET yarn.nodemanager.pmem-check-enabled = true;
SET optimize.sort.dynamic.partitioning=true;
SET hive.exec.max.dynamic.partitions=10000;
INSERT OVERWRITE TABLE ts_managed
PARTITION (read_date)
SELECT svcpt_id, usage, read_time, read_date
FROM ts_raw
DISTRIBUTE BY svcpt_id
SORT BY svcpt_id;
My cluster specs are:
VM cluster
4 total nodes
4 data nodes
32 cores
140 GB RAM
Hortonworks HDP 3.0
Apache Tez as default Hive engine
I am the only user of the cluster
My yarn configs are:
yarn.nodemanager.resource.memory-mb = 32GB
yarn.scheduler.minimum-allocation-mb = 512MB
yarn.scheduler.maximum-allocation-mb = 8192MB
yarn-heapsize = 1024MB
My Hive configs are:
hive.tez.container.size = 682MB
hive.heapsize = 4096MB
hive.metastore.heapsize = 1024MB
hive.exec.reducer.bytes.per.reducer = 1GB
hive.auto.convert.join.noconditionaltask.size = 2184.5MB
hive.tez.auto.reducer.parallelism = True
hive.tez.dynamic.partition.pruning = True
My tez configs:
tez.am.resource.memory.mb = 5120MB
tez.grouping.max-size = 1073741824 Bytes
tez.grouping.min-size = 16777216 Bytes
tez.grouping.split-waves = 1.7
tez.runtime.compress = True
tez.runtime.compress.codec = org.apache.hadoop.io.compress.SnappyCodec
I've tried countless configurations including:
Partition on date
Partition on date, cluster on svcpt_id with buckets
Partition on date, bloom filter on svcpt, sort by svcpt_id
Partition on date, bloom filter on svcpt, distribute by and sort by svcpt_id
I can get my mapping vertex to run, but I have not gotten my first reducer vertex to complete. Here is my most recent example from the above query:
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1043 1043 0 0 0 0
Reducer 2 container RUNNING 9636 0 0 9636 1 0
Reducer 3 container INITED 9636 0 0 9636 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/03 [=>>-------------------------] 4% ELAPSED TIME: 6804.08 s
----------------------------------------------------------------------------------------------
The error was:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 2, vertexId=vertex_1537061583429_0010_2_01, diagnostics=[Task failed, taskId=task_1537061583429_0010_2_01_000070, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.OutOfMemoryError: unable to create new native thread
I either get this OOM error, which I cannot seem to get around, or I get datanodes going offline and failing to meet my replication-factor requirements.
At this point I've been troubleshooting for over 2 weeks. Any contacts for professional consultants I can pay to solve this problem would also be appreciated.
Thanks in advance!
I ended up solving this after speaking with a Hortonworks tech guy. It turns out I was over-partitioning my table. Instead of partitioning by day over about 4 years, I partitioned by month and it worked great.
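For illustration only, a minimal sketch of what the month-partitioned variant could look like (the read_month column and the YYYYMMDD-to-YYYYMM derivation are assumptions, not taken from the original post):

-- Same table, but partitioned by month (~48 partitions for 4 years)
-- instead of by day (~1460 partitions).
CREATE TABLE IF NOT EXISTS ts_managed (
    svcpt_id VARCHAR(20),
    usage_value FLOAT,
    read_time SMALLINT)
PARTITIONED BY (read_month INT)
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy","orc.create.index"="true","orc.bloom.filter.columns"="svcpt_id");

-- may be required for dynamic partitioning, depending on cluster defaults
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE ts_managed
PARTITION (read_month)
SELECT svcpt_id, usage, read_time,
       CAST(read_date / 100 AS INT) AS read_month  -- assumes read_date is an INT like YYYYMMDD
FROM ts_raw
DISTRIBUTE BY svcpt_id
SORT BY svcpt_id;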

WITH-constrained consecutive updates

Please assume I have built a query in MS SQL Server; it has the following structure:
WITH issues_a AS
(
SELECT a_prop
FROM ds_X x
)
, issues_b AS
(
SELECT key
, z.is_flagged as is_flagged
, some_prop
FROM ds_Z z
JOIN issues_a i_a
ON z.a_diff = i_a.a_prop
)
-- {{ run }}
UPDATE samples
SET error =
CASE
WHEN i_b.some_prop IS NULL THEN '#1 ...'
WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
END
FROM samples s
left join issues_b i_b ON s.key = i_b.key;
Now I want to enhance the whole thing, updating one more table consecutively, by enclosing parts of the query in BEGIN TRANSACTION and COMMIT, but I can't get my head around how to do it. I tried enclosing the whole expression in the transaction brackets, but that didn't get me any further.
Are there other ways to achieve the above task, even without chaining the consecutive updates in a transactional manner, though that would be preferable?
To summarize the task again: WITH <...>(...), <...>(...) UPDATE <... using data from the latter WITH> UPDATE <... using data from the latter WITH>?
Hope you don't mind my poor grammar...
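For what it is worth, one common way to structure this, sketched under the assumption that the second target table is called other_samples (a made-up name) and that the column names match the question: a WITH clause only applies to the single statement that directly follows it, so the CTE result is materialized into a temp table once and both UPDATEs then run inside one transaction.

-- materialize the CTE chain once
SELECT z.[key], z.is_flagged, z.some_prop
INTO #issues_b
FROM ds_Z z
JOIN ds_X x ON z.a_diff = x.a_prop;

BEGIN TRANSACTION;

UPDATE s
SET error =
    CASE
        WHEN i_b.some_prop IS NULL THEN '#1 ...'
        WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
    END
FROM samples s
LEFT JOIN #issues_b i_b ON s.[key] = i_b.[key];

-- second, consecutive update against the hypothetical other_samples table,
-- reusing the same materialized data
UPDATE o
SET error =
    CASE
        WHEN i_b.some_prop IS NULL THEN '#1 ...'
        WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
    END
FROM other_samples o
LEFT JOIN #issues_b i_b ON o.[key] = i_b.[key];

COMMIT;

DROP TABLE #issues_b;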

SQL: Finding a subgraph

I have a graph network stored in a SQL Server database. The graph network (a collection of labeled, undirected, connected graphs) is stored in a vertex-edge mapping scheme (i.e. there are 2 tables, one for vertices and one for edges):
Vertices ( graphID , vertexID, vertexLabel )
Edges ( graphID , sourceVertex , destinationVertex ,edgeLabel )
I am looking for a simple way of counting occurrences of a particular subgraph in this network. For example, I would like to find how many instances of "A-B-C" are present in the network "C-D-A-B-C-E-A-B-C-F". I have a few ideas on how this could be done in, say, Java or C++, but I have no clue how to approach this problem using SQL. Any ideas?
A little background: I'm not a student; this is a small project I would like to pursue. I do a lot of social media analysis (in memory) but have little experience mining graphs against a SQL database.
My idea is to create a stored procedure whose input is a string like 'A-B-C' or a pre-created table with the vertices in the proper order ('A', 'B', 'C'). You then have a loop, and step by step you walk through the path 'A-B-C'. For this you need a temp table holding the vertices reached at the current step:
1) Step 0
@currentLabel = getNextVertexLabel(...) -- need to decide how to do this
select *
into #v
from Vertices
where vertexLabel = @currentLabel

-- we need it later
select *
into #tempV
from #v
where 0 <> 0
2) Step i
@currentLabel = getNextVertexLabel(...)

insert #tempV
select vs.*
from #v v
join Edges e on
    e.SourceVertex = v.VertexID
    and e.graphID = v.graphID
join Vertices vs on
    e.destinationVertex = vs.VertexID
    and e.graphID = vs.graphID
where vs.vertexLabel = @currentLabel

truncate table #v
insert #v
select * from #tempV
truncate table #tempV
3) After the loop
Your result is stored in #v, so the number of subgraphs will be:
select count(*) from #v
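Putting the steps above together, a minimal end-to-end sketch; the @path table variable, the VARCHAR(50) label length, the WHILE loop standing in for getNextVertexLabel, and the assumption that each undirected edge is stored in both directions are all assumptions layered on top of the answer:

-- path to search for, in order
DECLARE @path TABLE (step INT IDENTITY(1,1), label VARCHAR(50));
INSERT INTO @path (label) VALUES ('A'), ('B'), ('C');

DECLARE @step INT = 1, @maxStep INT, @currentLabel VARCHAR(50);
SELECT @maxStep = MAX(step) FROM @path;
SELECT @currentLabel = label FROM @path WHERE step = 1;

-- step 0: every vertex matching the first label
SELECT graphID, vertexID, vertexLabel
INTO #v
FROM Vertices
WHERE vertexLabel = @currentLabel;

-- empty copy of #v with the same structure
SELECT graphID, vertexID, vertexLabel
INTO #tempV
FROM #v
WHERE 0 <> 0;

WHILE @step < @maxStep
BEGIN
    SET @step = @step + 1;
    SELECT @currentLabel = label FROM @path WHERE step = @step;

    -- step i: follow one edge from every vertex reached so far
    INSERT INTO #tempV
    SELECT vs.graphID, vs.vertexID, vs.vertexLabel
    FROM #v v
    JOIN Edges e
        ON e.sourceVertex = v.vertexID AND e.graphID = v.graphID
    JOIN Vertices vs
        ON e.destinationVertex = vs.vertexID AND e.graphID = vs.graphID
    WHERE vs.vertexLabel = @currentLabel;

    TRUNCATE TABLE #v;
    INSERT INTO #v SELECT * FROM #tempV;
    TRUNCATE TABLE #tempV;
END

-- one row per matched walk ending at the last label
SELECT COUNT(*) FROM #v;

DROP TABLE #v;
DROP TABLE #tempV;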