Why is inserting into a partitioned Hive table on HDFS so slow? - hive

I've created a table like this (non-partitioned):
create external table `ersin_db`.`DW_ETL`
(
`ID` INT,
`NAME` STRING
)
stored as parquet
LOCATION '/user/ers/ersyn61/'
tblproperties('parquet.compression'='SNAPPY');
When I insert into it, the insert is fast.
But when I create a partitioned table like this:
create external table `ersin_db`.`DW_ETL`
(
`ID` INT,
`NAME` STRING
)
partitioned by(partition_etldate_string string )
stored as parquet
LOCATION '/user/ers/ersyn61/'
tblproperties('parquet.compression'='SNAPPY');
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
set hive.optimize.sort.dynamic.partition=true;
the insert is slow.
How can I make it faster?
Thanks in advance.

I think I can answer this.
Your second table is a dynamically partitioned table. When inserting into a dynamically partitioned table, Hive sorts the final data and writes it into each partition one at a time (the default behaviour). Because you partitioned on partition_etldate_string, inserting into each partition one by one takes a long time. Here is a typical query summary for an insert into a table dynamically partitioned on year and month. Notice how the SORT operation takes 17 minutes on average, while the scan that reads the data takes only about 1 minute.
Operator       #Hosts  Avg Time  Max Time  #Rows   Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------
02:SORT        2       17m16s    30m50s    55.05M  -1          25.60 GB  12.00 MB
01:EXCHANGE    2       9s493ms   12s822ms  55.05M  -1          26.98 MB  2.90 MB        HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))
00:SCAN HDFS   2       51s958ms  1m10s     55.05M  -1          76.06 MB  704.00 MB      default.my_table
Your first table is not partitioned, so Hive does not sort the data or write one partition at a time; it writes all the data together.
Depending on the volume of data, a dynamic-partition insert can take a long time to load. This is the default behaviour and I am not sure how to work around it. You can use static partitions instead, but handling date-based partitions that way will be difficult.
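To illustrate the trade-off described above, here is a toy model in Python (not Hive's actual implementation). Sorting by the partition value costs time up front, but it lets a writer task keep only one partition file open at a time, whereas unsorted input may need one open file per partition. The row layout and the function name are made up for the example.

```python
from typing import Iterable, Tuple

Row = Tuple[int, str, str]  # (id, name, partition value): hypothetical layout


def max_open_writers(rows: Iterable[Row]) -> int:
    """Peak number of partition files one writer task would hold open.

    A file for partition p is opened at p's first row and can only be
    closed after p's last row has been written.
    """
    rows = list(rows)
    last_seen = {part: i for i, (_, _, part) in enumerate(rows)}
    open_now, peak = set(), 0
    for i, (_, _, part) in enumerate(rows):
        open_now.add(part)
        peak = max(peak, len(open_now))
        if last_seen[part] == i:
            open_now.discard(part)
    return peak


# Made-up rows arriving interleaved across three dates.
dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
unsorted_rows = [(i, f"name{i}", dates[i % 3]) for i in range(9)]
sorted_rows = sorted(unsorted_rows, key=lambda r: r[2])

print(max_open_writers(unsorted_rows))  # one open file per date
print(max_open_writers(sorted_rows))    # a single open file at any time
```

With many distinct dates the unsorted case multiplies open files and memory, which is what the sort is meant to avoid. If memory is not a concern, it may be worth testing the insert with hive.optimize.sort.dynamic.partition set to false.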

Related

ClickHouse - SELECT row of data is too slow

The following problem occurred in our project, and we cannot solve it.
We have a huge amount of log data, and we go to ClickHouse from MongoDB.
Our table is created like this:
CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
raw String,
ts DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
device_id String MATERIALIZED JSONExtractString(raw, 'device_id'),
level Int8 MATERIALIZED JSONExtractInt(raw, 'level'),
context String MATERIALIZED JSONExtractString(raw, 'context'),
event String MATERIALIZED JSONExtractString(raw, 'event'),
event_code String MATERIALIZED JSONExtractInt(raw, 'event_code'),
data String MATERIALIZED JSONExtractRaw(raw, 'data'),
date Date DEFAULT toDate(ts),
week Date DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week
and I'm running a query like so
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
This is the result: 10 rows in set. Elapsed: 6.23 sec.
And a second query without ORDER BY, LIMIT and OFFSET:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid'
This is the result: Elapsed: 7.994 sec. for each 500 rows of 130000+
This is too slow.
It seems that CH processes all the rows in the table. What is wrong, and what is needed to improve the speed of CH?
The same implementation on MongoDB takes 200-500 ms at most.
Egor! When you mentioned, "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or you somehow connect to ClickHouse from MongoDB to run queries you're referring to?
I'm not sure how you ingest your data, but let's focus on the reading part.
For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have a time condition as part of your WHERE clause, so that ClickHouse can determine which parts you want to read and skip most of the data you don't need. Otherwise, it will scan all the data.
I would imagine these queries will do the scan faster:
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05'
ORDER BY ts DESC
LIMIT 10
OFFSET 0;
SELECT device_id,toDateTime(ts),context,level,event,data
FROM logs
WHERE device_id = 'some_uuid' AND week = '2021-07-05';
AFAIK, unless you specified the exact partition format, CH will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'logs'
So, if you want to store data in weekly parts, I would imagine partitioning could be like
...
ORDER BY (device_id, ts)
PARTITION BY toMonday(week)
This is also a good piece of information: Using Partitions and Primary keys in queries
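The part-skipping idea from this answer can be sketched with a toy model in Python. It is only an illustration: the data is made up, the function names are hypothetical, and the model ignores the sparse primary-key index that ClickHouse also uses inside each part.

```python
from collections import defaultdict
from datetime import date, timedelta


def monday_of(d: date) -> date:
    """Monday of the week containing d (the role toMonday plays in the table)."""
    return d - timedelta(days=d.weekday())


# Made-up log rows (device_id, day), stored grouped by week like partitions.
rows = [("dev-%d" % (i % 5), date(2021, 6, 1) + timedelta(days=i)) for i in range(70)]
partitions = defaultdict(list)
for device_id, day in rows:
    partitions[monday_of(day)].append((device_id, day))


def scan(device_id, week=None):
    """Return (matching rows, number of rows read)."""
    read, hits = 0, []
    for part_week, part_rows in partitions.items():
        if week is not None and part_week != week:
            continue  # the whole partition is skipped without being read
        for row in part_rows:
            read += 1
            if row[0] == device_id:
                hits.append(row)
    return hits, read


_, read_all = scan("dev-0")
_, read_week = scan("dev-0", week=date(2021, 7, 5))
print(read_all, read_week)
```

Without the week condition every partition is visited. With it, only the rows of that one week are read.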

Calculating per second peak values after summing up individual values in clickhouse

I am currently using Clickhouse cluster (2 shards, 2 replicas) to read transaction logs from my server. The log contains fields like timestamp, bytes delivered, ttms, etc. The structure of my table is as below:
CREATE TABLE db.log_data_local ON CLUSTER '{cluster}' (
timestamp DateTime,
bytes UInt64,
/*lots of other fields */
) ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/db/tables/logs/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + INTERVAL 1 MONTH;
CREATE TABLE db.log_data ON CLUSTER '{cluster}'
AS cdn_data.http_access_data_local
ENGINE = Distributed('{cluster}','db','log_data_local',rand());
I am ingesting data from Kafka and using materialized view to populate this table. Now I need to calculate the peak throughput per second from this table. So basically I need to sum up the bytes field per second and then find the max value for a 5 minute period.
I tried using ReplicatedAggregatingMergeTree with aggregate functions for the throughput, but the peak value I get is much less compared to the value I get when I directly query the raw table.
The problem is that, when creating the materialized view to populate the peak values, querying the distributed table directly gives no results, while querying the local table considers only a partial data set. I tried using an intermediary table to compute the per-second totals and then creating the materialized view on top of it, but I faced the same issue.
This is the schema for my peaks table and the materialized view I am trying to create:
CREATE TABLE db.peak_metrics_5m_local ON CLUSTER '{cluster}'
(
timestamp DateTime,
peak_throughput AggregateFunction(max,UInt64),
)
ENGINE=ReplicatedAggregatingMergeTree('/clickhouse/{cluster}/db/tables/peak_metrics_5m_local/{shard}','{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp)
TTL timestamp + toIntervalDay(90);
CREATE TABLE db.peak_metrics_5m ON CLUSTER '{cluster}'
AS cdn_data.peak_metrics_5m_local
ENGINE = Distributed('{cluster}','db','peak_metrics_5m_local',rand());
CREATE MATERIALIZED VIEW db.peak_metrics_5m_mv ON CLUSTER '{cluster}'
TO db.peak_metrics_5m_local
AS SELECT
toStartOfFiveMinute(timestamp) as timestamp,
maxState(bytes) as peak_throughput,
FROM (
SELECT
timestamp,
sum(bytes) as bytes,
FROM db.log_data_local
GROUP BY timestamp
)
GROUP BY timestamp;
Please help me out with a solution to this.
It's impossible to implement this with a materialized view (MV). An MV is an insert trigger.
The inner query (sum(bytes) as bytes, ... GROUP BY timestamp) works against the inserted buffer, i.e. only the block being inserted, and does not read data from the log_data_local table.
https://github.com/ClickHouse/ClickHouse/issues/14266#issuecomment-684907869
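A small Python model (not ClickHouse code) shows why the peak comes out lower than on the raw table. The data is made up, the function names are hypothetical, and all rows are assumed to fall into one 5-minute window. When the rows for one second arrive in separate inserts, each insert sees only its own partial sum, and merging the max states afterwards cannot recover the true per-second total.

```python
from collections import defaultdict


def per_second_sums(rows):
    """sum(bytes) GROUP BY timestamp, over the given rows only."""
    sums = defaultdict(int)
    for ts, nbytes in rows:
        sums[ts] += nbytes
    return sums


def peak_over_whole_table(blocks):
    """Query on the raw table: the inner sum sees every row."""
    all_rows = [row for block in blocks for row in block]
    return max(per_second_sums(all_rows).values())


def peak_via_insert_trigger(blocks):
    """MV-style: the SELECT runs once per inserted block, and the
    per-block max states are merged afterwards."""
    return max(max(per_second_sums(block).values()) for block in blocks)


# Second 100 receives rows in two separate insert blocks.
blocks = [
    [(100, 400), (101, 300)],
    [(100, 500), (102, 200)],
]

print(peak_over_whole_table(blocks))    # true peak: 400 + 500 for second 100
print(peak_via_insert_trigger(blocks))  # only the larger partial sum survives
```

The same effect applies across shards, since each local table receives only part of the rows for a given second.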

More efficiently writing partitioned parquet when partitioning column is skewed

I'm working on writing a large table (approximately 1.2b rows) as partitioned parquet, using state (as in US state) as the partitioning key. The issue is that there is a large number of null state values. This table is often queried by state, so having a large partition with the null states is not an issue, but I'm having trouble generating the table efficiently.
I've tried creating the table with the non-null states, then inserting the null, but from what I can tell all the null values still just get put in one big partition and therefore sent to one worker.
It would be great if there was a way to insert into a specific partition. Like for my example, write the non-null states, then insert remaining records into the state=null or hive_default_partition in a way that would still parallelize across the cluster.
Try writing the non-null data using automatic partitioning, then repartition the null data and write it separately, e.g.:
df.where($"state".isNotNull).write.partitionBy("state").parquet("my_output_dir")
df.where($"state".isNull).repartition(100).write.parquet("my_output_dir/state=__HIVE_DEFAULT_PARTITION__")
Using the SQL API, you can use a repartitioning hint (introduced in Spark 2.4) to accomplish the same:
spark-sql> describe skew_test;
id bigint NULL
dt date NULL
state string NULL
# Partition Information
# col_name data_type comment
state string NULL
Time taken: 0.035 seconds, Fetched 6 row(s)
spark-sql> CREATE TABLE `skew_test2` (`id` BIGINT, `dt` DATE, `state` STRING)
> USING parquet
> OPTIONS (
> `serialization.format` '1'
> )
> PARTITIONED BY (state);
Time taken: 0.06 seconds
spark-sql> insert into table skew_test2 select * from skew_test where state is not null;
Time taken: 1.208 seconds
spark-sql> insert into table skew_test2 select /*+ REPARTITION(100) */ * from skew_test where state is null;
Time taken: 1.39 seconds
You should see 100 tasks created by Spark for the final statement, and your state=__HIVE_DEFAULT_PARTITION__ directory should contain 100 parquet files. For more information on Spark-SQL hints, check out https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hint-framework.html#specifying-query-hints

My data can’t be date partitioned, how do I use clustering?

Currently I am using the following query:
SELECT
ID,
Key
FROM
mydataset.mytable
where ID = 100077113 
and Key='06019'
My data has 100 million rows:
ID - unique
Key - can have ~10,000 keys
If I know the key, the lookup by ID only needs to touch ~10,000 rows, which would be much faster and process much less data.
How can I use the new clustering capabilities in BigQuery to partition on the field Key?
(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table - while creating a fake date column for fake partitioning:
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
SELECT *, DATE('1980-01-01') fake_date
FROM `fh-bigquery.github_extracts.contents_java`
)
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered2`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
What I learned here:
Clustering can work with unique ids, even for tables without a date to partition by.
Prefer using a fake date instead of a null date (but only for now - this should be improved).
Clustering made my query 99.6% cheaper when looking for rows by id!
Read more: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
You can have one field of type DATE with NULL values, so you will be able to partition by that field, and since the table is partitioned you will be able to enjoy clustering.
You need to recreate your table with an additional date column with all rows having NULL values. Then you set the partitioning to the date column. This way your table is partitioned.
After you're done with this, you add clustering based on the columns you identified in your query. Clustering will improve processing time and reduce query costs.
Now you can partition a table on an integer column, so this might be a good solution. Remember that there is a limit of 4,000 partitions for each table. Because you have ~10,000 keys, I suggest creating a sort of group_key that bundles ids together, or maybe you have another column that you can leverage as an integer with a cardinality < 4,000.
Recently BigQuery introduced support for clustering tables even if they are not partitioned. So you can simply cluster on your integer field and not use partitioning altogether. However, this solution will not be the most effective for data scan optimisation.
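One possible way to derive the group_key mentioned above, sketched in Python: hash each key into a fixed number of integer buckets below the partition limit. The CRC32 hash, the bucket count and the sample keys are assumptions made for the illustration, not something BigQuery prescribes.

```python
import zlib

MAX_PARTITIONS = 4000  # per-table partition limit quoted in the answer above


def group_key(key: str, buckets: int = MAX_PARTITIONS) -> int:
    """Map a key to a stable integer bucket in [0, buckets)."""
    return zlib.crc32(key.encode("utf-8")) % buckets


# ~10,000 made-up keys shaped like the '06019' in the question.
keys = [f"{k:05d}" for k in range(10_000)]
used_buckets = {group_key(k) for k in keys}

print(group_key("06019"), len(used_buckets))
```

Several keys share a bucket, so a query would need to filter on both the group_key (to prune partitions) and the original Key (to pick the right rows inside the partition).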

Can two Hive Partitions Share One Set of Files?

A typical question is whether a Hive partition can be made up of multiple files. My question is the inverse: can multiple Hive partitions point to the same file? I'll start with what I mean, then the use case.
What I mean:
Hive Partition   File Name
20120101         /file/location/201201/file1.tsv
20120102         /file/location/201201/file1.tsv
20120103         /file/location/201201/file1.tsv
The Use Case: Over the past many years, we've been loading data into Hive in monthly format. So it looked like this:
Hive Partition   File Name
201201           /file/location/201201/file1.tsv
201202           /file/location/201202/file1.tsv
201203           /file/location/201203/file1.tsv
But now the months are too large, so we need to partition by day. So we want the new files starting with 201204 to be daily:
Hive Partition   File Name
20120401         /file/location/20120401/file1.tsv
20120402         /file/location/20120402/file1.tsv
20120403         /file/location/20120403/file1.tsv
But we want all the existing partitions to be redone into daily as well, so we would partition it as I propose above. I suspect this would actually work no problem, except that I suspect Hive would re-read the same datafile N times for each additional partition defined against the file. For example, in the very first "What I Mean" code block above, partitions 20120101..20120103 all point to file 201201/file1.tsv. So if the query has:
and partitionName >= '20120101' and partitionName <= '20120103'
Would it read "201201/file1.tsv" three times to answer the query? Or will Hive be smart enough to know it's only necessary to scan "201201/file1.tsv" once?
It looks like Hive will only scan the file(s) once. I finally decided to just give it a shot and run a query and find out.
First, I set up my data set like this in the filesystem:
tableName/201301/splitFile-201301-xaaaa.tsv.gz
tableName/201301/splitFile-201301-xaaab.tsv.gz
...
tableName/201301/splitFile-201301-xaaaq.tsv.gz
Note that even though I have many files, this is equivalent for Hive to having one giant file for the purposes of this question. If it makes it easier, pretend I just pasted a single file above.
Then I set up my Hive table with partitions like this:
alter table tableName add partition ( dt = '20130101' ) location '/tableName/201301/' ;
alter table tableName add partition ( dt = '20130102' ) location '/tableName/201301/' ;
...
alter table tableName add partition ( dt = '20130112' ) location '/tableName/201301/' ;
The total size of my files in tableName/201301 was about 791,400,000 bytes (I just eyeballed the numbers and did basic math). I ran the job:
hive> select dt,count(*) from tableName where dt >= '20130101' and dt <= '20130112' group by dt ;
The JobTracker reported:
Counter      Map          Reduce  Total
Bytes Read   795,308,244  0       795,308,244
So it only read the data once. HOWEVER... the query output was all jacked:
20130112 392606124
So it thinks there was only one "dt", and that was the final "partition", and it had all rows. So you have to be very careful including "dt" in your queries when you do this, it would appear.
My earlier answer, that Hive would scan the file multiple times, was incorrect. Hive reads the file once, but generates "duplicate" records. The issue is that the partition columns are included in the total record, so for each record in the file, you would get multiple records in Hive, each with different partition values.
Do you have any way to recover the actual day from the earlier data? If so, the ideal way to do things would be to totally repartition all the old data. That's painful, but it's a one-time cost and would save you having a really weird Hive table.
You could also move to having two Hive tables: the "old" one partitioned by month, and the "new" one partitioned by day. Users could then do a union on the two when querying, or you could create a view that does the union automatically.
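If the day can be recovered from a timestamp inside each record, the one-time repartitioning suggested above could start from something like this Python sketch. The column position, the timestamp format and the sample lines are all assumptions. It only groups the lines of one monthly file by day; writing each group to its own daily directory and adding the partitions is left out.

```python
from collections import defaultdict
from datetime import datetime


def split_by_day(lines, ts_column=0, ts_format="%Y-%m-%d %H:%M:%S"):
    """Group tab-separated lines by the day (YYYYMMDD) of their timestamp column."""
    by_day = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        day = datetime.strptime(fields[ts_column], ts_format).strftime("%Y%m%d")
        by_day[day].append(line)
    return dict(by_day)


# Made-up contents of a monthly file such as 201201/file1.tsv.
monthly = [
    "2012-01-01 00:10:00\ta",
    "2012-01-01 23:59:59\tb",
    "2012-01-02 08:00:00\tc",
]
daily = split_by_day(monthly)

for day in sorted(daily):
    print(day, len(daily[day]))
```

Each key of the result corresponds to one daily partition value, so every record ends up in exactly one partition instead of being shared by several.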