I ran this SQL query in Databricks to check the distinct values of a column in a Parquet file:
SELECT distinct country
FROM parquet_table
This took 1.31 hours to run. Am I doing something wrong here that such a simple query is taking so long?
I would like to run this query about once every 5 minutes so that I can run an incremental query that MERGEs into another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even though the SELECT MAX doesn't filter on the date? Or will the columnar nature of BigQuery alone make this efficient?
Thank you.
Instead of querying your table directly, you can query the INFORMATION_SCHEMA.PARTITIONS view within your dataset (see the BigQuery documentation). For instance:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds one metadata record per partition. It is therefore much smaller than your table, which makes it an easy way to cut your query costs (it is also much faster to query).
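As a rough sketch of how this could fit the 5-minute schedule (the etl_checkpoints bookkeeping table, its columns, and the project/dataset names are illustrative, not from the thread), the metadata lookup can act as a cheap change check in front of the incremental MERGE:
DECLARE last_change TIMESTAMP DEFAULT (
  SELECT MAX(LAST_MODIFIED_TIME)
  FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
  WHERE TABLE_NAME = "myTable"
);

-- Only run the expensive MERGE when the source actually changed since the
-- previous run; otherwise the scheduled job touches metadata only.
IF last_change > (
  SELECT checkpoint_ts
  FROM `project.dataset.etl_checkpoints`  -- hypothetical bookkeeping table
  WHERE table_name = "myTable"
) THEN
  -- run the usual incremental MERGE into the target table here,
  -- then advance the checkpoint:
  UPDATE `project.dataset.etl_checkpoints`
  SET checkpoint_ts = last_change
  WHERE table_name = "myTable";
END IF;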
I ran two queries to get the count of records for two different dates from a Hive managed table partitioned on the process date field.
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01' --returned 2 million
SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02' --returned 3 million
But when I run the query below with a UNION ALL clause, the counts returned are different from those of the individual queries above.
SELECT '2018-01-01', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01'
UNION ALL
SELECT '2018-01-02', COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-02'
What can be the root cause for this difference?
One of our teammates helped us identify the issue.
When we run a single COUNT query, it is not physically executed against the table; instead, the count is taken from the table statistics.
One remedy is to collect the statistics on the table again; then the COUNT on a single table will reflect the actual count.
Regards,
Anoop
I too faced a similar issue with count(*) returning an incorrect count. I added the statements below to my code, and the counts are consistent now.
For a non-partitioned table, use:
ANALYZE TABLE your_table_name COMPUTE STATISTICS;
For a partitioned table, analyze the recently added partition by specifying the partition value:
ANALYZE TABLE your_table_name
PARTITION(your_partition_name=your_partition_value)
COMPUTE STATISTICS;
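If recomputing statistics after every load is impractical, another option worth checking (assuming a reasonably recent Hive, where the hive.compute.query.using.stats setting controls this behaviour) is to stop simple aggregates from being answered out of metastore statistics at all, e.g. at the session level:
-- Make COUNT queries scan the data instead of reading metastore statistics
-- (session-level setting).
SET hive.compute.query.using.stats=false;

SELECT COUNT(1) FROM prd_fct.mktng WHERE process_dt='2018-01-01';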
I have a problem with Apache Derby. I imported the data from www.geonames.org and want to get the DISTINCT names.
The query SELECT name FROM GEONAME returns the results instantly.
The query SELECT DISTINCT name FROM GEONAME takes almost 20 minutes to complete.
How can I speed this up? There is already an index on the NAME column (CREATE INDEX GEONAME_NAME_index ON GEONAME(NAME)).
I use JDBC from Scala to get data from Hive. In Hive I have a simple table with 20 rows in the following format:
user_id, movie_title, rating, date
To group users by movie, I do 3 nested SELECT queries:
1) select distinct user_id
2) for each user_id:
select distinct movie_title // select all the movies this user saw
3) for each movie_title:
select distinct user_id // select all the users who saw this movie
On a local Hive table with 20 rows, these nested queries take 26 minutes! Hive returns the first user_id after a minute! Questions:
1) Why is Hive so slow?
2) Is there any way to optimize the 3 nested selects?
Hive uses the MapReduce framework to process queries, and there is a decent amount of constant overhead attached to every MapReduce job you run. Each of your queries (and there are a fair number of them because of your nesting) has to spin up a MapReduce job, and that takes time regardless of how much data you have.
Newer versions of Hive are much more responsive, but still not ideal for this type of selection.
Your best bet is to minimize the number of queries by using GROUP BY or something similar.
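As a rough sketch of what that could look like (collect_set is a built-in Hive aggregate; the ratings table name is assumed from the question's column list), two GROUP BY queries replace the per-user and per-movie loops:
-- All the movies each user saw, one row per user:
SELECT user_id, collect_set(movie_title) AS movies_seen
FROM ratings
GROUP BY user_id;

-- All the users who saw each movie, one row per movie:
SELECT movie_title, collect_set(user_id) AS users_who_saw_it
FROM ratings
GROUP BY movie_title;
That way the startup overhead is paid for two jobs rather than once per user and once per movie.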
Another option is to create two tables populated from SELECT DISTINCT queries: the first containing each distinct user with the movies they rated, the second containing each distinct movie with the users who rated it. These two tables can then be joined to get the desired grouped result.
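A rough sketch of that approach, again with the hypothetical ratings source table and illustrative names for the two new tables:
CREATE TABLE user_movies AS
SELECT DISTINCT user_id, movie_title FROM ratings;

CREATE TABLE movie_users AS
SELECT DISTINCT movie_title, user_id FROM ratings;

-- Example join: all users who saw at least one movie in common with a given user.
SELECT DISTINCT mu.user_id
FROM user_movies um
JOIN movie_users mu
  ON um.movie_title = mu.movie_title
WHERE um.user_id = 'some_user';  -- placeholder value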
I have a SQL query that is scheduled to run every week and pulls data from a different database. The query runs for around 2 hours because of the amount of data it selects, and at the same time it drives up CPU utilization on the source SQL Server where database abc resides. The query is given below:
select a.* from abc.art_si a inner join abc.article b
on a.ARTICLEID = b.ARTICLEID where b.TYPE_IND='B'
I would like to know the following:
Will running this query utilize more CPU? If so,
is there any way to optimize the above query?
Your advice would be very helpful to me.
Thank you.