Why is SELECT COUNT(*) slower than SELECT * in Hive?

When I run queries in the VirtualBox sandbox with Hive, SELECT COUNT(*) feels much slower than SELECT *.
Can anyone explain what is going on behind the scenes, and why this delay happens?

SELECT * FROM table
can be a map-only job, but
SELECT COUNT(*) FROM table
has to be a map-and-reduce job.
Hope this helps.
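A quick way to see this for yourself (the table name is just a placeholder) is to compare the plans Hive produces for the two statements:
-- The COUNT(*) plan contains a Reduce Operator Tree; a plain SELECT * typically does not.
EXPLAIN SELECT * FROM mytable;
EXPLAIN SELECT COUNT(*) FROM mytable;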

There are three types of operations that a Hive query can perform, listed here in order from cheapest and fastest to most expensive and slowest.
A Hive query can be a metadata-only request.
SHOW TABLES and DESCRIBE table are examples. For these queries the Hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
A Hive query can be an HDFS get request.
SELECT * FROM table would be an example. In this case Hive can return the results by performing an HDFS operation, more or less a hadoop fs -get.
A Hive query can be a MapReduce job.
Hive has to ship the jar to HDFS, the JobTracker queues the tasks, the TaskTrackers execute the tasks, and the final data is put into HDFS or shipped to the client.
The MapReduce job itself has different possibilities as well.
It can be a map-only job:
SELECT * FROM table WHERE id > 100, for example; all of that logic can be applied in the mapper.
It can be a map-and-reduce job:
SELECT MIN(id) FROM table;
SELECT * FROM table ORDER BY id;
It can also lead to multiple MapReduce passes, but I think the above summarizes the common behaviors.
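As an aside, whether a simple SELECT is served as a plain fetch rather than a job is governed by the hive.fetch.task.conversion setting (assuming a reasonably recent Hive); a quick sketch:
-- 'more' lets SELECT/FILTER/LIMIT-only queries be answered by a fetch task.
SET hive.fetch.task.conversion=more;
SELECT * FROM mytable LIMIT 10;  -- served by a fetch task, no job launched
SELECT COUNT(*) FROM mytable;    -- still requires a MapReduce job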

This is because the DB is using a clustered primary key, so the query searches each row for the key individually, row by agonizing row, rather than reading from an index. Two things to try:
Run OPTIMIZE TABLE. This ensures that the data pages are physically stored in sorted order, which could conceivably speed up a range scan on a clustered primary key.
Create an additional non-primary index on just the change_event_id column. This stores a copy of that column in index pages, which is much faster to scan. After creating it, check the explain plan to make sure it is using the new index.
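A minimal sketch of both suggestions in MySQL; the table name change_events is hypothetical, while change_event_id comes from the text above:
-- Rewrite data pages in primary-key order (can speed up clustered range scans).
OPTIMIZE TABLE change_events;
-- Secondary index holding just the one column, so range scans touch far fewer pages.
CREATE INDEX idx_change_event_id ON change_events (change_event_id);
-- Verify the optimizer actually picks the new index.
EXPLAIN SELECT change_event_id FROM change_events WHERE change_event_id > 1000000;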

Related

How to efficiently find the latest partition from a S3 dataset using Spark

I have a dataset to which data is added almost every day, and which needs to be processed every day as part of a larger ETL.
When I select the partition directly, the query is really fast:
SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`
The issue is that the event type does not generate data on some Sundays, resulting in a non-existent partition for that particular day. Because of this, I cannot use the previous statement to run my daily job.
Another attempt was to have Spark find the latest partition of the dataset, to be sure the bigger query wouldn't fail:
SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This works every time, but it is unbelievably slow.
I found numerous articles and references on how to build and maintain partitions, yet nothing about how to read them correctly.
Any idea how to have this done properly?
(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reasons for slowness:
1. The subquery needs to be executed completely before the actual query execution starts.
2. The above query lists all the S3 files to retrieve the partition information. If the folder has a large number of files, this process takes a long time; the time taken for the listing is directly proportional to the number of files.
We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme; let's say the name of the table is tbl.
You could then perform
ALTER TABLE tbl RECOVER PARTITIONS
which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing, making it faster.
Then we could fire
SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)
which gets the partition information only from the metastore, without having to list the object store, which is the expensive operation.
The cost incurred in this approach is that of recovering the partitions. After that, all queries are faster. (When data for a new partition arrives, you need to recover partitions again.)
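Putting the pieces together, a minimal Spark SQL sketch; the schema is assumed, since the real column list depends on the JSON layout:
-- One-time setup: an external table over the S3 prefix, partitioned on `partition`.
CREATE TABLE tbl (payload STRING, `partition` STRING)
USING json
PARTITIONED BY (`partition`)
LOCATION 's3://datalake/partitoned_table/';

-- One-time (and after each new partition): register partitions in the metastore.
ALTER TABLE tbl RECOVER PARTITIONS;  -- MSCK REPAIR TABLE tbl also works

-- MAX(partition) is now resolved from the metastore, with no S3 listing.
SELECT * FROM tbl
WHERE `partition` = (SELECT MAX(`partition`) FROM tbl);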
Workaround when you don't have Hive:
FileSystem.get(URI.create("s3://datalake/partitoned_table"), conf).listStatus(new Path("s3://datalake/partitoned_table/"))
The above code returns the list of partition directories, for example List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21, ...).
This is very efficient because it only fetches metadata from the S3 location.
Just take the latest partition and use it in your SQL.
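For example, if the latest entry in that listing is partition=2019-05-21, the daily job can point straight at it, just like the fast query from the question:
SELECT * FROM JSON.`s3://datalake/partitoned_table/partition=2019-05-21`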

Clustering in BigQuery using CREATE TABLE

Unsure if I am clustering correctly. Basically, I am looking at GCP billing info for, say, 50 clients. Each client has a billing_account_id, and I cluster on that billing_account_id. I use the clustered table for a Data Studio dashboard.
See the SQL query below for what I do right now:
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT
*
FROM
`datagcp`
WHERE
usage_start_time BETWEEN TIMESTAMP('2019-01-01')
AND TIMESTAMP(CURRENT_DATE)
It is successfully clustered like this; I am just not seeing a noticeable query performance increase!
I thought that by clustering on billing_account_id I would see an increase in dashboard performance.
Please consider the following points:
Cluster structure
A cluster field is composed of an array of fields, like boxes, outer to inner, as stated in the BigQuery documentation:
When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
This means, as #Gordon wrote, that the WHERE part of your query needs to start from the outer field and work toward the inner one to make the most of your cluster fields. In your case, if userId is part of the WHERE clause, you need to change your cluster fields to match it.
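As an illustration, a dashboard query along these lines should benefit from the clustering (the account ID literal and the cost column are assumed from the billing export schema):
SELECT DATE(usage_start_time) AS usage_date, SUM(cost) AS total_cost
FROM `dashboardgcp`
WHERE DATE(usage_start_time) BETWEEN '2019-05-01' AND '2019-05-31'  -- prunes partitions
  AND billing_account_id = '0123AB-CDEF45-6789GH'                   -- prunes clustered blocks
GROUP BY usage_date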
Cluster limitation
Clustering typically works better for queries that scan over 1 GB of data, so if you are not scanning that much data you won't see the improvement you are looking for.
Cluster with Ingestion tables
Assuming your data is not static and you keep adding data to your table, datagcp, you need to be aware that cluster indexing is a process which BigQuery performs offline from the insert operation, and separately from partitioning.
The side effect is that your clustering "weakens" over time. To solve this, you will need to use the MERGE command to rebuild your cluster and get the most out of it.
From the docs:
“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.
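One straightforward way to restore a fully sorted state, as an alternative to the MERGE route, is to re-materialize the table with the same DDL as in the question:
-- Rewriting the table re-sorts every partition by the cluster key.
CREATE OR REPLACE TABLE `dashboardgcp`
PARTITION BY DATE(usage_start_time)
CLUSTER BY billing_account_id
AS
SELECT * FROM `dashboardgcp`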

Querying GROUP BY on 1+ billion rows in MemSQL

I have a table with 1.3 billion rows (MemSQL, columnstore pattern). I need to run a GROUP BY query on 3 fields (id1, id2, text) and fetch the latest record for each 3-tuple. The table gets populated through a pipeline mounted on an EFS folder; currently it holds about 200k CSV files of 2 MB each.
I need help writing an optimized query for this case or if it can be done some other way.
Edit: I am not able to find any blog/help online; most of them discuss solutions involving the creation of an extra table, which is not possible for me right now (very heavy memory usage in that case).
Something like the following does not work and takes my 5-node cluster down:
select max(eventTime) from table1 group by id1, id2, field1
There are a couple of considerations here.
1) What is your shard key for the columnstore table?
2) Are you using MemSQL 6.5, the most recent version?
3) Have you reviewed this resource about optimizing table data structures? https://www.memsql.com/static/memsql_whitepaper_optimizing_table_data_structures.pdf
Ensure that columns common to all queries are in the columnstore key, to improve segment elimination.
If the data is inserted in an order, such as by timestamp, it's best to put that column first in the columnstore key to minimize the work of the background merger process.
If one part of the composite key has lots of distinct values, put it last; put the key part with less distinctness first to increase the likelihood that segment elimination can help with the later columns.
Also, it would help if you ran
EXPLAIN select max(eventTime) from table1 group by id1, id2, field1;
so that we could see the explain plan.
It takes a long time because the database needs a proper design for this query. You should choose the shard key to be those three columns (id1, id2, field1). I also recommend using a columnstore rather than a rowstore for this query; a DDL sketch follows.
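A minimal DDL sketch along those lines; the column types are assumed, and the columnstore key follows the earlier advice to lead with the insert-order column:
CREATE TABLE table1 (
  id1 BIGINT NOT NULL,
  id2 BIGINT NOT NULL,
  field1 VARCHAR(255) NOT NULL,
  eventTime DATETIME NOT NULL,
  -- Insert-order column first in the columnstore key to ease background merging.
  KEY (eventTime, id1, id2, field1) USING CLUSTERED COLUMNSTORE,
  -- Shard on the GROUP BY columns so each group is resolved on a single partition.
  SHARD KEY (id1, id2, field1)
);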

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many), and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
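If redistribution is a concern, CTAS lets you declare these attributes explicitly instead of accepting the defaults; a sketch with placeholder names:
-- DISTSTYLE/SORTKEY values are hypothetical; pick them to match your access patterns.
CREATE TABLE tbl
DISTSTYLE ALL
SORTKEY (event_date)
AS
SELECT * FROM some_view;  -- stand-in for query A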
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of query A. It is possible that Redshift was able to optimize the count query by reading very little data from disk, because it realized that very few columns (or perhaps none) needed to be read to compute the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (work that wasn't done as part of the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all the data returned by the query to disk, while the count query does not; that explains the difference. Writing all the rows is a much more expensive operation than just counting them.

hsqldb select query really slow

I use HSQLDB and I have constructed a database with a table that contains 10,000,000 records. It takes about 15 minutes to construct this table. Then, in another program that needs these data, I try to read them. I thought that reading them in groups of 100,000 would be faster. So I execute this query:
rs = st.executeQuery("SELECT * FROM PATIENT WHERE pid>="+start+" AND pid<="+end+" ;");
where start and end define the group I want to read each time.
I have made an index on pid, but query execution is still very slow. It has actually been running for 24 minutes and has fetched only the first 24 of 100 groups. Is this normal? What else can I do?
Thank you!
It is a good idea to select in groups of 100,000. You can verify that your query is using the index by prefixing it like this:
EXPLAIN PLAN FOR SELECT ...
If the query is indeed using the index, you can speed up selection by defining
SET TABLE PATIENT CLUSTERED ON (PID)
or by defining PID as the primary key.
You should also look at increasing the cache size of the database, increasing the nio file usage limit, or the other tweaks discussed in the Guide. Use the latest HSQLDB jars from http://hsqldb.org/support/.
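Concretely, with the table and column from the question (the range bounds are placeholders):
-- Confirm the range predicate uses the index on PID.
EXPLAIN PLAN FOR SELECT * FROM PATIENT WHERE pid >= 0 AND pid <= 99999;
-- Keep rows stored in PID order so each group is a contiguous scan.
SET TABLE PATIENT CLUSTERED ON (pid);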