Table's mean_partition_size shows wrong, extremely high values - Scylla

I have a table in Scylla, call it 'tablex', in keyspace 'keyspacey'. When I load the data from it into Spark, I observe a very large number of partitions. Digging into the code, I saw that it uses mean_partition_size, which can be inspected with the query:
SELECT range_start, range_end, partitions_count, mean_partition_size FROM system.size_estimates WHERE keyspace_name = 'keyspacey' AND table_name = 'tablex';
tablex has 586 rows; each one consists of a timestamp, text, text, and bigint.
Running the query above, I got 256 rows, all having partitions_count=1 and mean_partition_size=5960319812.
What could be the cause of the problem and how to solve it?

Looks like you hit this bug: https://github.com/scylladb/scylla/issues/3916
Fixed in Scylla 3.0 - we would recommend an upgrade. Upgrade guide is at https://docs.scylladb.com/upgrade/upgrade-opensource/upgrade-guide-from-2.3-to-3.0/
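Once upgraded, re-running the same estimate query from the question should confirm the fix; a quick sanity check:
SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'keyspacey' AND table_name = 'tablex';
-- With only 586 small rows, mean_partition_size should come back in the
-- bytes-to-kilobytes range rather than the ~5.9 GB reported above.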

Improve performance of deducting values of same table in SQL

For a metering project I use a simple SQL table in the following format:
ID
Timestamp: dat_Time
Metervalue: int_Counts
Meterpoint: fk_MetPoint
While this works nicely in general, I have not found an efficient solution for one specific problem: there is one Meterpoint which is a submeter of another Meterpoint. I'd be interested in the delta of those two Meterpoints to get the remaining consumption. As the registration of counts is done by one device, I get datapoints for the various Meterpoints at the same Timestamp.
I think I found a solution using a subquery, but it appears to be not very efficient.
SELECT A.dat_Time,
       (A.int_Counts - (SELECT B.int_Counts
                        FROM tbl_Metering AS B
                        WHERE B.fk_MetPoint = 2 AND B.dat_Time = A.dat_Time)) AS Delta
FROM tbl_Metering AS A
WHERE fk_MetPoint = 1
How could I improve this query?
Thanks in advance
You can try using a window function instead:
SELECT m.dat_Time,
       (m.int_Counts - m.int_Counts_2) AS Delta
FROM (SELECT m.*,
             MAX(CASE WHEN fk_MetPoint = 2 THEN int_Counts END) OVER (PARTITION BY dat_Time) AS int_Counts_2
      FROM tbl_Metering m
     ) m
WHERE fk_MetPoint = 1
From a query point of view, you should at a minimum switch to a set-based approach instead of an inline sub-query for each row (a GROUP BY at a minimum), but this is a good candidate for a windowing query, just as suggested by the "Great" Gordon Linoff.
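A minimal sketch of the set-based alternative, assuming (as the question states) that every dat_Time recorded for Meterpoint 1 also has a matching row for Meterpoint 2:
-- Join the two Meterpoints on their shared timestamp instead of running a correlated sub-query per row.
SELECT A.dat_Time,
       A.int_Counts - B.int_Counts AS Delta
FROM tbl_Metering AS A
JOIN tbl_Metering AS B
  ON B.dat_Time = A.dat_Time
 AND B.fk_MetPoint = 2
WHERE A.fk_MetPoint = 1;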
However, if this is a metering project, then we can expect a high volume of records, if not now then certainly over time.
I would recommend you look into altering the input process so that the delta is stored as its own first-class column. This moves much of the performance hit to the write process, which presumably only ever occurs once for each record, whereas your SELECT will be executed many times.
This can be performed using an INSTEAD OF trigger (a sketch follows at the end of this answer), or you could write it into the business logic. In a recent IoT project we computed and stored these additional properties with each inserted reading to greatly simplify many types of aggregate and analysis queries:
Id of the Previous sequential reading
Timestamp of the Previous sequential reading
Value Delta
Time Delta
Number of readings between this and the previous reading
The last one sounds close to your scenario; we were deliberately batching multiple sequential readings into a single record.
You could also process the received data into a separate table that includes this level of aggregation information, so as not to pollute the raw feed and to allow you to re-process it on demand.
You could redirect your analysis queries to this second table, which is now effectively a data warehouse of sorts.
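A minimal sketch of the INSTEAD OF trigger approach mentioned above, assuming SQL Server syntax and a hypothetical int_Delta column added to tbl_Metering (only the value delta is shown; the other listed properties would be computed the same way):
-- Hypothetical: compute the delta at write time so that reads stay cheap.
CREATE TRIGGER trg_Metering_Insert
ON tbl_Metering
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO tbl_Metering (dat_Time, int_Counts, fk_MetPoint, int_Delta)
    SELECT i.dat_Time,
           i.int_Counts,
           i.fk_MetPoint,
           i.int_Counts - prev.int_Counts
    FROM inserted AS i
    OUTER APPLY (SELECT TOP (1) m.int_Counts
                 FROM tbl_Metering AS m
                 WHERE m.fk_MetPoint = i.fk_MetPoint
                   AND m.dat_Time < i.dat_Time
                 ORDER BY m.dat_Time DESC) AS prev;
END;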

distribute by sort by /cluster by clear cut idea

I'm not getting a clear-cut idea about distribute by / sort by / cluster by in Hive.
According to my understanding, multiple reducers are used when we use distribute by / sort by / cluster by in Hive, so that sorting happens faster.
But why are reducers needed to sort columns? Sorting can be done by a map, and it does not involve any aggregate function.
Does it have any relation to the clustered by / sorted by that we use when we create a table?
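For reference, here is my current understanding sketched against the table I'm querying; as I understand it, CLUSTER BY x is just shorthand for DISTRIBUTE BY x plus SORT BY x, and the sort itself runs in the reduce phase:
-- Rows with the same order_item_order_id go to the same reducer (DISTRIBUTE BY),
-- and each reducer sorts its own share of the rows (SORT BY).
SELECT * FROM order_items
DISTRIBUTE BY order_item_order_id
SORT BY order_item_order_id;

-- Equivalent shorthand:
SELECT * FROM order_items
CLUSTER BY order_item_order_id;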
The problem I'm facing is this:
select * from order_items cluster by order_item_order_id limit 10;
For the above query, the reducer count does not change even though I use the command
set mapreduce.job.reduce=4
It still remains 1.
As you can see, even though I change the reducer count, it still remains 1.
Though there is a post related to this, the answer given there does not clear my doubts.
Thanks in advance.

tableUnavailable dependent upon size of search

I'm experiencing something rather strange with some queries that I'm performing in BigQuery.
Firstly, I'm using an externally backed table (csv.gz) with about 35 columns. The total data in the location is around 5 GB, with an average file size of 350 MB. The reason I'm doing this is that I continually add data to and remove data from the table on a rolling basis, to give me a view of the last 7 days of our activity.
When querying, if perform something simple like:
select * from X limit 10
everything works fine. It continues to work fine if you increase the limit up to 1 million rows. As soon as you up the limit to ten million:
select * from X limit 10000000
I end up with a tableUnavailable error "Something went wrong with the table you queried. Contact the table owner for assistance. (error code: tableUnavailable)"
Now, according to any literature on this, this usually results from using some externally owned table (I'm not). I can't find any other enlightening information for this error code.
Basically, if I do anything slightly complex on the data, I get the same result. There's a column called eventType that has maybe a couple hundred different values in the entire dataset. If I perform the following:
select eventType, count(1) from X group by eventType
I get the same error.
I'm getting the feeling that this might be related to limits on external tables? Can anybody clarify or shed any light on this?
Thanks in advance!
Doug

Cassandra secondary index get_indexed_slices timing out

I am using Cassandra 0.8 with two secondary indexes for columns like "DeviceID" and "DayOfYear". I have these two indexes in order to retrieve data for a device within a range of dates. Whenever I get a date filter, I convert it into a DayOfYear and search using indexed slices via the .NET Thrift API. Currently I cannot upgrade the DB either.
My problem is that I usually have no issues retrieving rows with the get_indexed_slices query for the current date (using the current day of year). But whenever I query for yesterday's day of year (which is one of the indexed columns), I get a timeout the first time I make the query. Most of the time it returns when I query the second time, and it always returns by the third time.
Both these columns are created as double data type in the column family, and I generally get 1 record per minute. I have 3 nodes in the cluster, and the nodetool reports suggest that the nodes are up and running, though the load distribution report from nodetool looks like this.
Starting NodeTool
Address DC Rack Status State Load Owns
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 7.59 GB 51.39%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 394.24 MB 3.81%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 4.42 GB 44.80%
and my configuration in YAML is as below.
hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50
partitioner: org.apache.cassandra.dht.RandomPartitioner
commitlog_sync: periodic
commitlog_sync_period_in_ms: 120000
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 24
sliced_buffer_size_in_kb: 64
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 50000
index_interval: 128
Is there something I may be missing? Are there any problems in the config?
Duplicate your data in another column family where the key is your search data. Row slices are much faster.
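A sketch, in newer CQL syntax (not available on 0.8, but it shows the shape of the denormalized column family I mean), with hypothetical names:
-- Hypothetical "manual index": the partition key is the search data, so one
-- date's readings for a device become a single fast row slice instead of a
-- secondary-index lookup.
CREATE TABLE readings_by_device_day (
    device_id    text,
    day_of_year  int,
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((device_id, day_of_year), reading_time)
);

SELECT * FROM readings_by_device_day
WHERE device_id = 'device-42' AND day_of_year = 150;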
Personally, I never managed to use secondary indexes in production environments. Either I had problems with timeouts, or the speed of data retrieval via the secondary index was lower than the rate at which data was inserted. I think this is related to non-sequential data reads and HDD seek time.
If you come from a relational model, playOrm is just as fast and lets you be relational on a NoSQL store, BUT you need to partition your extremely large tables. If you do that, you can then use "scalable JQL" to do your stuff:
@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t(:partId) select t FROM TABLE as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
It also has the @ManyToOne, @OneToMany, etc. annotations for a basic ORM, though some things work differently in NoSQL, but a lot is similar.
I finally solved my problem in a different way. In fact, I realized the problem was with my data model.
The problem arises because we come from an RDBMS background. I restructured the data model a little, and now I get responses faster.

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long-running queries on a MySQL database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck waiting until the query completes to view results that were generated in its very first seconds?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can share the concrete data and SQL query?
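For example, a hypothetical query like the one below cannot emit its first row until every group has been aggregated, so LIMIT cannot shortcut the work:
-- Hypothetical table: the per-customer MAX is only known after scanning all rows,
-- so no meaningful partial result exists and LIMIT only trims the finished output.
SELECT customer_id, MAX(order_total) AS max_total
FROM orders
GROUP BY customer_id
LIMIT 10;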
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table create privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work out your queries on, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you will get. Note that the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
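For instance, a sketch (with a hypothetical second table) of pointing a heavier test query at the sample:
-- Hypothetical: run the analysis against the ~0.1% sample instead of the full table.
SELECT s.unique_id, s.data_value, o.other_value
FROM sample_table AS s
INNER JOIN other_table AS o
        ON o.unique_id = s.unique_id;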
Most databases also have better sampling and random number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted using "mysql_use_result" as an attribute of the database handle rather than the default "mysql_store_result". This is true for the Perl and Java interfaces; I think in the C interface, you have to use an unbuffered version of the function that executes the query.