I have been reading through the docs but can't seem to find anything similar to Prometheus' topk function.
Is there maybe a sort-then-limit?
As an example, let's say I wanted the top 10 hosts by CPU. Is that even possible?
You can use a combination of ORDER BY and LIMIT:
SELECT *
FROM hosts
ORDER BY cpu DESC
LIMIT 10
I want to sample 100 rows from a big_big_table (millions and millions of rows), and run some query on these 100 rows. Mainly for testing purposes.
The way I wrote it runs for a really long time, as if it reads the whole big_big_table and only then takes the LIMIT 100:
WITH sample_table AS (
SELECT *
FROM big_big_table
LIMIT 100
)
SELECT name
FROM sample_table
ORDER BY name
;
Question: What's the correct/fast way of doing this?
Check hive.fetch.task.* configuration properties
set hive.fetch.task.conversion=more;
set hive.fetch.task.aggr=true;
set hive.fetch.task.conversion.threshold=1073741824; -- 1 GiB
Set these properties before your query, and if you are lucky, it will run without MapReduce. Also consider limiting the query to a single partition.
This may not work depending on the storage type/SerDe and the file sizes. If the files are small or splittable and the table is native, it may run fast without MapReduce being started.
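For illustration, after the three set commands above, the sampling query itself should stay simple enough for fetch-task conversion (with 'more' it covers plain SELECT/FILTER/LIMIT queries). The partition column dt and its value are assumptions for the example; drop the WHERE clause if the table is not partitioned:

-- Sample without ORDER BY so the fetch task can apply; sort the 100 rows afterwards if needed.
SELECT name
FROM big_big_table
WHERE dt = '2023-01-01' -- hypothetical filter: read a single partition only
LIMIT 100;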
I have a very special kind of query to write. In PostGIS / BigQuery, I have a point. I can buffer this point by increments and perform an aggregation query, such as count(distinct()), on the unique records that fall within the buffer. Once the count reaches a certain threshold, I would like to return the input value of the geographic object, i.e. its radius or diameter. This problem can be phrased as "how far do I have to keep going out until I hit 'n' [ids]?".
Finely incrementing the value of the buffer or radius would be insufferably slow and expensive. Can anyone think of a nice way to shortcut this and offer a solution that provides a nice answer quickly (in BQ or PSQL terms!)?
Available GIS functions:
st_buffer()
st_dwithin()
Thank you!
You would have to order by distance and keep the N closest points. The <-> operator will use the spatial index.
SELECT *
FROM pointLayer p
ORDER BY p.geometry <-> st_makePoint(...)
LIMIT 10; --N
You don't need to increment the radius finely; I would rather double it, or maybe even increase it 10x, and once you have enough distinct records, take the N nearest ones.
I've used BigQuery scripting to solve a similar problem (with N=1, but it is easy to modify for any N: just use LIMIT N instead of LIMIT 1 and adjust the stopping condition):
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5
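As a rough sketch of that doubling loop in BigQuery scripting (the table points, the columns id and geom, the target point, and N = 10 are all assumptions for the example):

-- Keep doubling the search radius until at least N = 10 distinct ids fall inside it.
DECLARE target GEOGRAPHY DEFAULT ST_GEOGPOINT(-0.12, 51.5);
DECLARE radius FLOAT64 DEFAULT 100.0; -- metres
DECLARE hits INT64 DEFAULT 0;

WHILE hits < 10 DO
  SET hits = (SELECT COUNT(DISTINCT id) FROM points WHERE ST_DWITHIN(geom, target, radius));
  IF hits < 10 THEN SET radius = radius * 2; END IF;
END WHILE;

-- radius now bounds the answer; take the N nearest points within it.
SELECT id, ST_DISTANCE(geom, target) AS d
FROM points
WHERE ST_DWITHIN(geom, target, radius)
ORDER BY d
LIMIT 10;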
So, here is my problem.
I've got a database that imports data from a huge CSV. It contains only around 32,000 entries, but has around 200 columns, hence the standard select is slow.
When I do:
MyModel.all or MyModel.eager_load.all, it takes anywhere from 45 seconds to a minute to load all the entries.
The idea was to use limit to pull maybe 1000 entries like:
my_model = MyModel.limit(1000)
This way I can get the last id like:
last_id = my_model.last.id
To load the next 1000 records I literally use
my_model = MyModel.where('id > ?', last_id).limit(1000)
# then I set last_id again and keep repeating the process
last_id = my_model.last.id
But this seems like overkill, and doesn't feel right.
Is there any better or easier way to do this?
Thank you in advance.
Ruby on Rails has the find_each method, which does exactly what you are trying to do manually. It loads records from the database in batches of 1000 (the default batch size).
MyModel.find_each do |instance|
# do something with this instance, for example, write it into the CSV file
end
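Under the hood, find_each issues batched queries very much like the ones you wrote by hand, keyed on the primary key rather than an offset. Roughly (table name assumed):

SELECT "my_models".* FROM "my_models" ORDER BY "my_models"."id" ASC LIMIT 1000;
-- then, where 1234 is the last id of the previous batch:
SELECT "my_models".* FROM "my_models" WHERE "my_models"."id" > 1234 ORDER BY "my_models"."id" ASC LIMIT 1000;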
Rails has an offset method that you can combine with limit.
my_model = MyModel.limit(1000).offset(1000)
You can see the API documentation here: https://apidock.com/rails/v6.0.0/ActiveRecord/QueryMethods/offset
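For reference, that call generates SQL along these lines (table name assumed):

SELECT "my_models".* FROM "my_models" LIMIT 1000 OFFSET 1000;

Note that the database still has to scan past the skipped rows to honour the OFFSET, so on large tables keyset pagination (which is what find_each does) tends to scale better.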
Hope that helps :)
We are currently facing performance issues when an ORDER BY clause is provided as part of the query.
Current Specs:
We are running two Geode servers with a capacity of 20 GB (max heap size) each. Geode has around 3.1 million records, and the table has 1.48 million.
Query:
query --query="SELECT DISTINCT cashFlowId,upstreamSystem,upstreamSystemTxnDate,valueDate,amount,status FROM WHERE AND account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual' ORDER BY amount DESC LIMIT 100"
The above query returns its output in 13-15 seconds after 2-3 runs.
Actual Result Set: 666553
No of Records in the table: 1.49 million
What have we tried/observed so far?
We found that the index (type: range) is being picked correctly.
No improvement even after allocating more memory to the JVM.
Verified that the IN operator has no impact on the query performance; we tried the same query using the OR operator.
On removing the ORDER BY clause, the query completes in 2 seconds, so we figured that sorting is eating most of the time.
Could you please offer some guidance or information on improving the query performance?
Server Metrics:
Category | Metric | Value
--------- | --------------------- | ------------
cluster | totalHeapSize | 47135
cache | totalRegionEntryCount | 3100429
Like Urizen said, check the number of GCs going on, but there is more. Here is the code, and it looks fairly tight: Geode Order By Comparator. There is another factor related to the nature of distributed sort order that has little to do with Geode as a product. Each node does its own ordering, but when the results get returned from each node, they need to be merged with the results from the other nodes. In other words, given a set of {2,4,3,1,6,5}, node 1 can sort {2,5,6} and node 2 can sort {1,3,4}, but the controlling node needs to do a merge for you to get {1,2,3,4,5,6}. I suspect there is some of that going on as well. This has nothing to do with Geode per se, just with distributed ORDER BYs. In database performance optimization theory, the database is the worst place to do an ORDER BY.
I'm wondering here if the better way to do this is to return two answer sets: 1) the answer set you want, but unsorted, and 2) a small KV collection of items where K is the amount and V is the key. Then on the client you sort the small KV collection and iterate over it, reading your larger answer set in that order.
If you didn't want to write a function to do that, you could do one additional query up front to select amount, key FROM ..., wrap that in a sorted collection, and then do your full unsorted query. This should be really quick, since your 2 seconds is partially being consumed by the network on such a large answer set.
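For illustration, a rough OQL sketch of that up-front query (the region name and field paths are assumptions); sort the returned pairs client-side, then run the full query without the ORDER BY:

SELECT e.key, e.value.amount
FROM /cashFlowRegion.entries e
WHERE e.value.account IN SET('XYZ','ABC')
  AND e.value.status = 'Booked'
  AND e.value.isActive = true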
Jason may have some technical insights, but removing that load from the server may be the answer if you have large answer sets like you do.
When using something like SELECT * FROM Object ORDER BY RANDOM() LIMIT 200 to randomly sample 200 objects out of a table, is the sampling done with or without replacement? I am speculating it is with replacement, but I don't know for sure, and I have not found any documentation about this. I am using SQLite, but I don't think the implementation there differs from other databases.
First a random value is assigned to each row, then the 200 rows that sort first are selected. So it is done without replacement: each row appears exactly once in the ordering, so it is impossible for the same row to be selected twice.
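You can make that mechanism explicit by materializing the random keys yourself. Assuming Object is an ordinary rowid table (an assumption for the example), this is equivalent, and each row gets exactly one key, so no row can win twice:

-- Assign one random key per row, then keep the 200 smallest keys.
SELECT o.*
FROM (SELECT rowid AS rid, RANDOM() AS r FROM Object) k
JOIN Object o ON o.rowid = k.rid
ORDER BY k.r
LIMIT 200;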