Cassandra secondary index get_indexed_slices timing out - indexing

I am using Cassandra 0.8 with two secondary indexes on columns like "DeviceID" and "DayOfYear". I have these two indexes so that I can retrieve data for a device within a range of dates. Whenever I get a date filter, I convert it into a DayOfYear and search with indexed slices via the .NET Thrift API. Upgrading the database is not an option for me at the moment.
My problem: I usually have no issues retrieving rows with the get_indexed_slices query for the current date (using the current day of year). But whenever I query for yesterday's day of year (which is one of the indexed columns), the first query times out. The second attempt usually returns, and the third attempt always does.
Both of these columns are created as double data type in the column family, and I generally get one record per minute. I have 3 nodes in the cluster, and nodetool reports that all nodes are up and running, though the load distribution report from nodetool looks like this:
Starting NodeTool
Address DC Rack Status State Load Owns
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 7.59 GB 51.39%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 394.24 MB 3.81%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 4.42 GB 44.80%
and my cassandra.yaml configuration is as below:
hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50
partitioner: org.apache.cassandra.dht.RandomPartitioner
commitlog_sync: periodic
commitlog_sync_period_in_ms: 120000
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 24
sliced_buffer_size_in_kb: 64
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 50000
index_interval: 128
Is there something I may be missing? Are there any problems in the config?

Duplicate your data in another column family where the row key is your search data. Row slices are much faster.
Personally, I never got secondary indexes to work well in production environments. Either I had timeout problems, or the rate at which data could be retrieved through the secondary index was lower than the rate at which data was being inserted. I think it is related to non-sequential reads and HDD seek time.
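If it helps, here is a minimal sketch of that denormalization, shown with the Python pycassa Thrift client for brevity (the same layout works from the .NET Thrift API). The column family name, key format and column layout below are assumptions, not your schema:
# Sketch only: one wide row per device per day, so a date filter becomes part of
# the row key and a time range becomes a column slice. CF name and key format are made up.
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace', ['127.0.0.1:9160'])
readings = ColumnFamily(pool, 'ReadingsByDeviceDay')   # hypothetical CF

def row_key(device_id, day_of_year):
    return '%s:%s' % (device_id, day_of_year)           # e.g. "dev42:178"

# Write: column name = time of day, column value = the reading.
readings.insert(row_key('dev42', 178), {'093000': '21.5'})

# Read: a contiguous column slice within one row - no secondary index involved.
result = readings.get(row_key('dev42', 178),
                      column_start='090000',
                      column_finish='100000',
                      column_count=1000)
With this layout, DayOfYear is part of the row key, so a date range for one device becomes a handful of row/column slices instead of a cluster-wide get_indexed_slices scan.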

If you come from a relational model, playOrm is just as fast and you can be relational on a NoSQL store, BUT you need to partition your extremely large tables. If you do that, you can then use "scalable JQL" to do your queries:
@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t(:partId) select t FROM TABLE as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
It also has the @ManyToOne, @OneToMany, etc. annotations for a basic ORM, though some things work differently in NoSQL; a lot is similar, however.

I finally solved my problem in a different way. In fact, I realized the problem was with my data model.
The problem arose because we come from an RDBMS background. I restructured the data model a little, and now I get responses faster.

Related

How do databases store live second data?

So what I mean by live second data is something like the stock market, where every second new data is recorded against the specific stock item.
How would the data look in the database? Does it have a timestamp for each second? If so, wouldn't that cause the database to quickly fill up? Are there specific databases that manage this type of thing?
Thank you!
Given the sheer amount of money that gets thrown around in fintech, I'd be surprised if trading platforms even use traditional RDBMS databases to store their trading data, but I digress...
How would the data look in the database?
(Again, assuming they're even using a relation-based model in the first place) then something like this in SQL:
CREATE TABLE SymbolPrices (
Symbol char(4) NOT NULL, -- 4 bytes, or even 3 bytes given an A-Z symbol char only needs ~5 bits per char.
Utc datetime NOT NULL, -- 8-byte timestamp (nanosecond precision)
Price int NOT NULL -- Assuming integer cents (not 4 decimal places), that's 4 bytes
)
...which has a fixed row length of 16 bytes.
Does it have a timestamp of each second?
It can do, but not per second - you'd need far greater granularity than that: I wouldn't be surprised if they were using at least 100-nanosecond resolution, which is a common unit for computer system clock "ticks" (e.g. .NET's DateTime.Ticks is a 64-bit integer value of 100-nanosecond units). Java and JavaScript both use milliseconds, though this resolution might be too coarse.
Storage space requirements for changing numeric values can always be significantly reduced if you store deltas instead of absolute values: I reckon it could come down to 8 bytes per record:
I reason that 3 bytes is sufficient to store trade timestamp deltas at ~1.5ms resolution, assuming 100,000 trades per day per stock: that's 16.7m values to represent a 7-hour (25,200s) trading window.
Price deltas can likely also be reduced to a 2-byte value (-$327.68 to +$327.67).
And assuming symbols never exceed 4 uppercase Latin characters (A-Z), then that can be represented in 3 bytes.
Giving an improved fixed row length of 8 bytes (3 + 3 + 2).
Though you would now need to store "keyframe" data every few thousand rows to prevent needing to re-play every trade from the very beginning to get the current price.
If data is physically partitioned by symbol (i.e. using a separate file on disk for each symbol) then you don't need to include the symbol in the record at all, bringing the row length down to merely 5 bytes (see the sketch below).
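A rough sketch of that 5-byte delta record in pure Python; the ~1.5ms time unit and cent-sized price deltas are the assumptions from above:
# 3-byte unsigned time delta (in ~1.5ms units) + 2-byte signed price delta in cents.
import struct

def pack_record(time_delta_units, price_delta_cents):
    assert 0 <= time_delta_units < 2**24          # 16.7m values covers a 7h session
    assert -32768 <= price_delta_cents <= 32767   # -$327.68 .. +$327.67
    return time_delta_units.to_bytes(3, 'big') + struct.pack('>h', price_delta_cents)

def unpack_record(buf):
    return int.from_bytes(buf[:3], 'big'), struct.unpack('>h', buf[3:5])[0]

record = pack_record(12345, -250)                 # ~18.5s later, price down $2.50
assert len(record) == 5
assert unpack_record(record) == (12345, -250)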
If so, wouldn't that cause the database to quickly fill up?
No, not really (at least assuming you're using HDDs made since the early 2000s); consider that:
Major stock-exchanges really don't have that many stocks, e.g. NASDAQ only has a few thousand stocks (5,015 apparently).
While high-profile stocks (AAPL, AMD, MSFT, etc.) typically have 30-day sales volumes on the order of 20-130m, that applies only to the ~50 most popular stocks; most stocks have 30-day volumes far below that.
Let's just assume all 5,000 stocks all have a 30-day volume of 3m.
That's ~100,000 trades per day, per stock on average.
That would require 100,000 * 16 bytes per day per stock.
That's 1,600,000 bytes per day per stock.
Or 1.5MiB per day per stock.
556MiB per year per stock.
For the entire exchange (of 5,000 stocks) that's 7.5GiB/day.
Or 2.7TB/year.
When using deltas instead of absolute values, then the storage space requirements are halved to ~278MiB/year per stock, or 1.39TB/year for the entire exchange.
In practice, historical information would likely be archived and compressed (likely using a column-major layout to make it more amenable to good compression with general-purpose compression schemes, and if data is grouped by symbol then that shaves off another 4 bytes).
Even without compression, partitioning by symbol and using deltas means needing around only 870GB/year for the entire exchange.
That's small enough to fit on a $40 HDD from Amazon.
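For what it's worth, the figures above reproduced as a quick script (same assumptions: 5,000 stocks, ~100,000 trades/day each):
# Back-of-the-envelope storage estimates, matching the figures above.
STOCKS = 5_000
TRADES_PER_DAY = 100_000        # per stock
FULL_ROW = 16                   # bytes: symbol + timestamp + price
DELTA_ROW = 5                   # bytes: per-symbol file, delta-encoded

per_stock_day  = TRADES_PER_DAY * FULL_ROW                          # 1,600,000 B (~1.5 MiB)
per_stock_year = per_stock_day * 365 / 2**20                        # ~557 MiB
exchange_day   = per_stock_day * STOCKS / 2**30                     # ~7.5 GiB
exchange_year  = per_stock_day * 365 * STOCKS / 2**40               # ~2.7 TiB
exchange_delta = TRADES_PER_DAY * DELTA_ROW * 365 * STOCKS / 2**30  # ~850 GiB

print(f"{per_stock_year:.0f} MiB/stock/year, {exchange_day:.1f} GiB/day, "
      f"{exchange_year:.2f} TiB/year, delta-encoded: {exchange_delta:.0f} GiB/year")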
Are there specific Databases that manage this type of stuff?
Undoubtedly, but I don't think they'd need to optimize for storage-space specifically - more likely write-performance and security.
They use big data architectures like Kappa and Lambda, where data is processed in both near-real-time and batch pipelines. In this case the live per-second data is "stored" in a messaging engine like Apache Kafka and then retrieved, processed and ingested into databases by stream-processing engines like Apache Spark Streaming.
They often don't use RDBMS databases like MySQL, SQL Server and so forth to store the data; instead they use NoSQL data stores, or formats like Apache Avro or Apache Parquet stored in buckets like AWS S3 or Google Cloud Storage, properly partitioned to improve performance.
A full example can be found here: Streaming Architecture with Apache Spark and Kafka
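To make the Kafka part concrete, here is a minimal sketch of the ingestion side using the kafka-python client; the broker address, topic name and tick format are all assumptions:
# Publish per-second ticks to Kafka; a streaming job (e.g. Spark Structured
# Streaming) would then consume the topic and write to the downstream store.
import json, time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

tick = {'symbol': 'AAPL', 'ts': time.time(), 'price_cents': 17534}
# Keying by symbol keeps each symbol's ticks ordered within one partition.
producer.send('ticks', key=tick['symbol'], value=tick)
producer.flush()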

Storing large amount of data in Redis / NoSQL or Relational db?

I need to store and access financial market candle stick information.
The number of candlesticks that I will need to store is beginning to look staggering (huge). There are thousands of markets, each market has many trading pairs, each pair has many time frames, and each time frame is an array of candles like the one below. The array below could be hourly price data or daily price data, for example.
I need to make this information available to multiple users at any given time, so I need to store it and make it available somehow.
The data looks something like this:
[
  {
    time: 1528761600,
    openPrice: 100,
    closePrice: 20,
    highestPrice: 120,
    lowestPrice: 10
  },
  {
    time: 1528761610,
    openPrice: 100,
    closePrice: 20,
    highestPrice: 120,
    lowestPrice: 10
  },
  {
    time: 1528761630,
    openPrice: 100,
    closePrice: 20,
    highestPrice: 120,
    lowestPrice: 10
  }
]
Consumers of the data will mostly be a complex Javascript based charting app, but other consumers will be node code, and perhaps other backend code.
My current best idea is to save the candlesticks in Redis, though I have also considered a NoSQL database. I'm not super experienced with either, so I'm not 100% sure Redis is the right choice. It seems to be the most performant option, but perhaps harder to work with, since I'm having to learn a lot, and I'm not convinced that the way Redis saves and retrieves data is going to make this easy, since I will need to continually add candles to each array.
I'm currently thinking something like:
Do an initial fetch from the candle stick api and either:
Create a Redis hash with a suitable label and stringify the whole array of candles into the hash, so that it can be parsed back by JavaScript etc.
Drawbacks of this approach:
Every time a new candle is created, I have to parse the JSON, add any new candlesticks, then stringify and save it again.
Pros of this approach:
I can use JavaScript to manage the array and make sure it's sorted, etc.
Create a Redis list of timestamps, which allows me to just push new candles onto the list and trust it to be in the right order. I can then do a Redis SCAN (?) to return timestamps between specific dates and use those timestamps to pull the data out of a Redis hash. After retrieving all of this, I'd build a JSON object similar to the one above to pass to JavaScript.
I have to say that both of these approaches feel way more painful to me than putting the data in a relational database. I imagine a NoSQL database could also be way easier, but I'm not experienced with them, so I can't say for sure.
I'm a bit lost and out of my experience here, as you can tell, and would love any advice anyone can give me.
Thanks :)
Your data is very regular - each candlestick is essentially one 64-bit timestamp and four 32-bit numbers for the prices. This makes it very amenable to Redis's BITFIELD.
Storing the data
Here is how I would store it -
stock-symbol:daily_prices = a bitfield holding 30 candles * 4 values, assuming you are storing data for the past 30 days
stock-symbol:hourly_prices = a bitfield holding 24 candles * 4 values
This way, your memory is (30 + 24) candles * 16 bytes = 864 bytes per symbol, plus a constant overhead per key.
You don't need to store the timestamp (see below), so each candle is just the four prices. I have assumed 4 bytes to store each price; you can store it as a whole number by eliminating the decimal.
Writing the data
To insert hourly prices, find the current hour (say 07:00 hours). If you treat the bitfield as an array of 4-byte integers, you have to skip 7 * 4 = 28 integers. You then insert the prices at positions 28, 29, 30 and 31 (0-based indexes).
So, to store price for AAPL at 07:00 hours, you would run the command
BITFIELD AAPL:hourly_prices SET i32 #28 <open price> SET i32 #29 <close price> SET i32 #30 <highest price> SET i32 #31 <lowest price>
(the # prefix tells BITFIELD to count the offset in i32-sized slots rather than in bits, and each value needs its own SET subcommand)
You would do something similar for daily prices as well.
Reading Data
If you are building a charting library, most likely you would want to return data for multiple symbols for a given time range. Let's say you want to pull out daily prices for past 7 days, your logic will be -
For each symbol:
Get start and end range within the array
Invoke the Get Range command.
If you run this in a pipeline, it will be very fast.
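Here is a rough redis-py sketch of the write and the pipelined range read described above; prices in cents and the key naming are assumptions:
# 4 x i32 per hourly candle; the '#' prefix makes BITFIELD offsets count in
# i32-sized slots rather than bits.
import redis

r = redis.Redis()

def write_hourly(symbol, hour, open_c, close_c, high_c, low_c):
    base = hour * 4                       # skip 4 integers per earlier hour
    r.execute_command(
        'BITFIELD', f'{symbol}:hourly_prices',
        'SET', 'i32', f'#{base}',     open_c,
        'SET', 'i32', f'#{base + 1}', close_c,
        'SET', 'i32', f'#{base + 2}', high_c,
        'SET', 'i32', f'#{base + 3}', low_c)

def read_hours(symbols, start_hour, end_hour):
    pipe = r.pipeline()
    for sym in symbols:
        args = ['BITFIELD', f'{sym}:hourly_prices']
        for h in range(start_hour, end_hour + 1):
            for field in range(4):        # open, close, high, low
                args += ['GET', 'i32', f'#{h * 4 + field}']
        pipe.execute_command(*args)
    return dict(zip(symbols, pipe.execute()))   # symbol -> flat list of ints

write_hourly('AAPL', 7, 10000, 9900, 10200, 9800)   # prices in cents (assumption)
print(read_hours(['AAPL'], 7, 7))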
Other tips
Usually, you would want to filter by some property of the symbol. For example, "show me graphs of the top 10 tech companies for the last 5 days".
A symbol itself is relational data. I would recommend storing that in a relational database. Just get the symbol names as a list from the relational database, and then fetch the stock prices from redis.
Redis has its limits, like anything, but they're pretty high, and if you're clever about it, you can get amazing performance out of redis. If you outgrow one instance you can start thinking about clustering, which should scale relatively linearly to a level where budget is a bigger concern than performance.
Without having a really great grasp of the data you're describing and its relations, it sounds like what you're looking for is a sorted set, perhaps sorted by date. You can ZSCAN a sorted set to move through it sequentially, or you can do lots of other great things with one as well. You might have data that requires a few different structures - e.g. a hash for some data and an entry in an index for the hash itself, or even in a few different indexes. A simple Redis list might also do the job for you, since it's inherently ordered by insertion order (this may or may not work for your case; it depends on whether your input is inherently temporally ordered).
At the end of the day, redis performance is generally dictated by how "well" the data is stored in redis - in other words, how well the native redis capabilities have been mapped into your problem domain. It's pretty easy to use and to program against. I'd highly recommend you look into it.
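For example, a hedged redis-py sketch of the sorted-set idea, with the candle's open time as the score and the candle itself (as JSON) as the member; the key naming is an assumption:
import json
import redis

r = redis.Redis()

def add_candle(market, pair, timeframe, candle):
    key = f'candles:{market}:{pair}:{timeframe}'
    # Score = timestamp, so the set stays time-ordered regardless of insert order.
    r.zadd(key, {json.dumps(candle, sort_keys=True): candle['time']})

def candles_between(market, pair, timeframe, start_ts, end_ts):
    key = f'candles:{market}:{pair}:{timeframe}'
    return [json.loads(m) for m in r.zrangebyscore(key, start_ts, end_ts)]

add_candle('binance', 'BTC-USD', '1h',
           {'time': 1528761600, 'openPrice': 100, 'closePrice': 20,
            'highestPrice': 120, 'lowestPrice': 10})
print(candles_between('binance', 'BTC-USD', '1h', 1528761600, 1528765200))
Unlike the single stringified blob, appending a new candle is one ZADD, and a date-range read is one ZRANGEBYSCORE per key.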

Apache Geode : Improve query performance when order by clause is provided in the query

We are currently facing performance issues when an ORDER BY clause is provided as part of the query.
Current Specs:
We are running two Geode servers with a capacity of 20 GB (max heap size) each. Geode has around 3.1 million records overall, and the table in question has 1.48 million.
Query:
query --query="SELECT DISTINCT cashFlowId,upstreamSystem,upstreamSystemTxnDate,valueDate,amount,status FROM WHERE AND account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual' ORDER BY amount DESC LIMIT 100"
The above query retrieves the output in 13-15 seconds, and only after being run 2-3 times.
Actual Result Set: 666553
No of Records in the table: 1.49 million
What have we tried/observed so far?
We found that the index (type: range) is being picked correctly.
No improvement even after allocating more memory to the JVM.
Verified that the IN operator has no impact on query performance; we tried the same query using the OR operator.
On removing the ORDER BY clause, the query completes in 2 seconds. We figured that sorting is eating most of the time.
Could you please guide or shed some information in improving the query performance?
Server Metrics:
Category | Metric | Value
--------- | --------------------- | ------------
cluster | totalHeapSize | 47135
cache | totalRegionEntryCount | 3100429
Like Urizen said, check the number of GCs going on, but there is more. Here is the code, and it looks fairly tight: Geode Order By Comparator. There is another factor related to the nature of distributed sort order that has little to do with Geode as a product. Each node does its own ordering, but when the results come back from each node, they need to be merged with the results from the other nodes. In other words, given the set {2,4,3,1,6,5}, node 1 can sort {2,5,6} and node 2 can sort {1,3,4}, but the controlling node still needs to do a merge for you to get {1,2,3,4,5,6}. I suspect some of that is going on as well. This has nothing to do with Geode per se, just with distributed ORDER BYs. In database performance optimization theory, the database is the worst place to do an ORDER BY.
I'm wondering here if the better way to do this is to return two answer sets: 1) the answer set you want, but unsorted, and 2) a small KV collection of items where K is the amount and V is the key. Then on the client you sort the small KV collection and iterate over it, reading your larger answer set in that order.
If you didn't want to write a function to do that, you could do one additional query up front to select amount, key FROM ..., wrap that in a sorted collection, and then do your full unsorted query. This should be really quick, since part of your 2 seconds is being consumed by the network on such a large answer set.
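A plain-Python illustration of that client-side merge (the two query results are passed in as ordinary collections; these are not Geode API calls):
def top_n_by_amount(amount_key_pairs, rows_by_key, n=100):
    """amount_key_pairs: result of the cheap 'select amount, key' query (unsorted).
    rows_by_key: the full, unsorted answer set keyed by entry key."""
    top_keys = [k for _, k in sorted(amount_key_pairs,
                                     key=lambda pair: pair[0], reverse=True)[:n]]
    return [rows_by_key[k] for k in top_keys]

# Toy data standing in for the two result sets:
amounts = [(250.0, 'cf-1'), (990.5, 'cf-2'), (120.0, 'cf-3')]
rows = {'cf-1': {'cashFlowId': 'cf-1', 'amount': 250.0},
        'cf-2': {'cashFlowId': 'cf-2', 'amount': 990.5},
        'cf-3': {'cashFlowId': 'cf-3', 'amount': 120.0}}
print(top_n_by_amount(amounts, rows, n=2))   # the two largest amounts, in order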
Jason may have some technical insights but removing the load from the server may be the answer if you have large answer sets like you do.

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
We perform a transformation on 2 of the fields and write each row to BigQuery.
The transformation involves performing 3 regex operations (due to increase to around 50) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for one day (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirement is to process months and years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), since we think 1 vCPU per worker is adequate given that the transformation is not overly complex nor CPU-intensive, i.e. just a mapping of Strings to Strings.
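For reference, the rough shape of the pipeline, sketched here with the Apache Beam Python SDK; the bucket path, field positions, regexes and BigQuery schema below are placeholders, not our real ones:
import re
import apache_beam as beam

# Stand-ins for the regex operations applied to the two fields.
PATTERNS = [(re.compile(r'example-pattern-1'), 'LABEL_1'),
            (re.compile(r'example-pattern-2'), 'LABEL_2')]

def to_table_row(line):
    fields = line.split(',')
    derived = fields[1]
    for pattern, label in PATTERNS:
        if pattern.search(derived):
            derived = label
            break
    return {'raw_field': fields[1], 'derived_field': derived}

with beam.Pipeline() as p:
    (p
     | 'ReadLogs'  >> beam.io.ReadFromText('gs://my-bucket/dfp-logs/2015-10-01-*.csv.gz')
     | 'Transform' >> beam.Map(to_table_row)
     | 'WriteBQ'   >> beam.io.WriteToBigQuery(
           'my-project:adserver.logs',
           schema='raw_field:STRING,derived_field:STRING'))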
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You could also write the output to files and then load them into BigQuery via the command line/console. You'd probably save some dollars of instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead to bringing instances up and tearing them down (it could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M per minute. At 31M rows of input, that would take ~5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time (read from GCS -> dispatch -> ...) of the graph, this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the mean time while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
Make sense?

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data, i.e. the values corresponding to each specific second?
1 day = 86,400 seconds
1 month = 2,592,000 seconds
There are around 1,000 values to keep track of every second.
Currently there are 50 tables, each grouping the trend data into 20 columns. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDtool? It provides a round-robin database, or circular buffer, for time-series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function (for example sum, min, max, avg) for a given period: 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to keep all the raw data points once they've been aggregated.
Ganglia and Cacti use this under the covers, and it's quite easy to use from many languages.
If you do need all the raw data points, consider using RRDtool just for the aggregation (see the sketch below).
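For example, an RRD that keeps a day of raw 1-second samples plus a year of hourly min/max/avg consolidations might look roughly like this (driven here through the rrdtool CLI; the DS/RRA layout is just an illustration):
import subprocess

subprocess.run([
    'rrdtool', 'create', 'trend.rrd',
    '--step', '1',                  # one primary data point per second
    'DS:value:GAUGE:2:U:U',         # one gauge, 2s heartbeat, no min/max bounds
    'RRA:AVERAGE:0.5:1:86400',      # raw 1-second points kept for 1 day
    'RRA:AVERAGE:0.5:3600:8760',    # hourly averages kept for 1 year
    'RRA:MIN:0.5:3600:8760',
    'RRA:MAX:0.5:3600:8760'], check=True)

subprocess.run(['rrdtool', 'update', 'trend.rrd', 'N:42.0'], check=True)  # N = "now"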
I would change the data-saving approach: instead of saving 'raw' values one by one, I would collect 5-20 minutes of data in an array (in memory, on the BL side), compress that array using an LZ-based algorithm, and then store it in the database as binary data. It would also be nice to save Max/Min/Avg/etc. info for each binary chunk.
When you want to process the data, you can process it chunk after chunk, which keeps a low memory profile for your application. This approach is a little more complex but very scalable in terms of memory and processing.
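A small standard-library sketch of that chunking idea (the chunk size and the table layout for the blob and its summary columns are up to you):
import struct, zlib

def make_chunk(samples):
    """samples: e.g. 5-20 minutes of float readings for one trend value."""
    raw = struct.pack(f'<{len(samples)}f', *samples)
    return {'blob': zlib.compress(raw),            # store in a BLOB/VARBINARY column
            'count': len(samples),
            'min': min(samples), 'max': max(samples),
            'avg': sum(samples) / len(samples)}

def read_chunk(chunk):
    raw = zlib.decompress(chunk['blob'])
    return list(struct.unpack(f'<{chunk["count"]}f', raw))

chunk = make_chunk([20.1, 20.3, 19.8, 20.0])
print(chunk['min'], chunk['max'], round(chunk['avg'], 2), len(read_chunk(chunk)))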
Hope this helps.
Is the problem the database schema?
A one-second-to-many-trends relationship obviously suggests, at first, a separate table with a foreign key to the seconds table. Alternatively, if the "many trend values" are represented by columns rather than rows, you can always append the columns to the seconds table and accept the null values.
Have you tried that? Was performance poor?