I am thinking of using Cassandra for storing my data. I have a server_id, start_time, end_time, and messages_blob.
CREATE TABLE messages (
    server_id uuid,
    start bigint,
    end bigint,
    messages_blob blob,
    PRIMARY KEY ((server_id), start, end)
) WITH CLUSTERING ORDER BY (start ASC, end ASC);
I have two types of queries:
get all server_ids and messages_blobs where start time > 100 and start time < 300.
get all messages_blobs for a set of server_ids at a time.
Can the above schema help me do this? I need to put billions of records into this table very quickly and do the reads after all inserts have happened. The read queries are few compared to the writes, but I need the data back as quickly as possible.
With this table structure you can only execute the 2nd query - you'll just need to execute a query for every single server_id separately, preferably via the async API.
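For example, fetching one server's data is a single-partition query; the numeric bounds below are only illustrative:

SELECT messages_blob FROM messages WHERE server_id = ?;
-- or, restricted to a time window within that partition
SELECT messages_blob FROM messages WHERE server_id = ? AND start > 100 AND start < 300;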
For the 1st query, this table structure won't work, as Cassandra needs to know the partition key (server_id) to perform the query - otherwise it would require a full scan, which will time out once you have enough data in the table.
To execute this query you have several choices.
Add another table that has start as the partition key, and store in it the primary keys of the records in the first table. Something like this:
create table lookup (
    start bigint,
    server_id uuid,
    end bigint,
    primary key (start, server_id, end)
);
This will require that you write data into 2 tables (see the sketch below), or you may be able to use a materialized view for this task (although that could be problematic if you use OSS Cassandra, as it still has plenty of bugs in that area). You'll also need to be careful with the partition size of that lookup table.
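The dual write could look something like this; a logged batch keeps the two inserts atomic at the cost of extra coordination, while two separate async inserts are faster if a brief mismatch between the tables is acceptable:

BEGIN BATCH
    INSERT INTO messages (server_id, start, end, messages_blob) VALUES (?, ?, ?, ?);
    INSERT INTO lookup (start, server_id, end) VALUES (?, ?, ?);
APPLY BATCH;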
Use Spark to scan the table - because you have start as the first clustering column, Spark will be able to perform predicate pushdown and the filtering will happen inside Cassandra. But it will be much slower than using a lookup table.
Also, be very careful with blobs - Cassandra doesn't work well with big blobs, so if you have blobs larger than 1 MB, you'll need to split them into multiple pieces, or (better) store them on a file system or some other storage, like S3, and keep only the metadata in Cassandra.
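If the data has to stay in Cassandra, one common workaround is a chunk table along these lines (message_chunks and its columns are hypothetical names, not part of your schema):

CREATE TABLE message_chunks (
    server_id uuid,
    start bigint,
    chunk_no int,
    chunk blob,
    PRIMARY KEY ((server_id, start), chunk_no)
);

-- read all chunks for one message and reassemble them client-side
SELECT chunk FROM message_chunks WHERE server_id = ? AND start = ?;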
Related
I am writing a new module wherein I poll a couple of thousand records from Kafka every minute throughout the day, split each of them into two tables, and then commit back to the Kafka broker. I intend to run some aggregate queries on the few million records collected the previous day.
I am splitting the records into two tables because the payload is dynamic in nature and I am only interested in a few fields of the JSON payload. My assumption is that the entire row gets loaded into memory on the DB side while running a query, even though the aggregate only has to run on two columns, so I extract the columns responsible for the counts into a separate table from the start.
Customer_Count, wherein I run aggregate queries on counts per customer type per purchase type.
Customer_Payload, wherein I plan to just archive the full payload to an object store later.
I plan to do a batch insert within one transactional block, first to the payload table and then to the counts table, assuming that a failure to insert any of the records into either table - because of an exception, an app crash, or a DB crash - causes the batch inserts to both of them to roll back.
Since I am writing a couple of thousand records per transaction into two tables, is there a possibility that a DB crash or an app crash while the commit is in progress leads to partial writes to one of the tables?
My assumption is that, as this is a synchronous transaction, a DB crash before the commit has gone through at the DB level will cause everything to be rolled back.
The same goes for any crash in the Spring Boot application: the transaction will not be committed.
I am being extra cautious because these metrics feed some revenue operations downstream, hence the question about the possibility of partial commits.
The tables look somewhat like this
Counts Table
create table customer_counts
(
    id bigserial primary key,
    customer_id text not null,
    count int,
    purchase_type text,
    process_dt date
);

create index metric_counts_idx on customer_counts (customer_id, purchase_type, process_dt);
Payload table
create table customer_payload
(
    id bigserial primary key,
    customer_id text not null,
    payload text,
    process_dt date
);

create index metric_payload_idx on customer_payload (customer_id, process_dt);
Then I run a
select sum(count), customer_id, purchase_type
from customer_counts
group by customer_id, purchase_type
on the counts table at the end of the day on a few million records.
I just use the payload table to select and push to an object store.
PS: I was also wondering if creating an index on customer_id, purchase_type, count could save me the trouble of creating an extra table just for the counts, but from what I read, indexes are only meant for lookups and the aggregate will run only after loading the entire row; also, you cannot guarantee that the query planner takes the index into account every time. Any suggestions on this approach would also help simplify my design from two tables to one, limiting the question of partial commits to just one table.
I plan to use the default settings in PostgreSQL for transactions and commits. We use Spring Boot JdbcTemplate for DB access and @Transactional at the Java app level. The size of the payload varies between 0.5 KB and 10 KB. I also index on customer_id, purchase_type and date. The Postgres version is 9.6.
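For reference, the write path per poll boils down to something like this (heavily simplified - the real code runs the inserts through JdbcTemplate batch updates inside a single @Transactional method, and the values shown are placeholders):

begin;

insert into customer_payload (customer_id, payload, process_dt)
values ('cust-1', '{"field": "value"}', current_date);

insert into customer_counts (customer_id, count, purchase_type, process_dt)
values ('cust-1', 1, 'online', current_date);

-- ...a couple of thousand such rows per poll...

commit;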
You will not see partially-committed transactions. Nothing about your setup seems worrisome.
The "entire row" thing isn't quite right. PG actually loads things a page at a time, which usually means >1 row - but a page will only contain fairly compact row data, large values get compressed and stored out-of-band (aka TOAST). If you neither select nor filter on payload, you should not end up reading most of its field data.
As to your PS, I think this should actually be amenable to an index-only scan. AIUI, you would only be INSERTing and never UPDATEing/DELETEing, which should mean the vast majority of the table is visible to all transactions - the big factor in making index-only scans worthwhile. You would want a single index on customer_id, purchase_type and count, which could then be used to satisfy your final query.
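Roughly, with an index like the one below (the name is made up), the end-of-day rollup can be answered by an index-only scan, provided autovacuum keeps the visibility map reasonably current:

create index customer_counts_rollup_idx
    on customer_counts (customer_id, purchase_type, count);

-- same daily query as before; the planner can satisfy it from the index alone
select customer_id, purchase_type, sum(count)
from customer_counts
group by customer_id, purchase_type;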
When specifying a TIMESTAMP column as the partitioning column, the data is saved on disk split by partition, which allows per-partition access.
Now, BigQuery also allows you to define up to 4 columns which will be used as cluster fields.
If I get it correctly, the partition is like a PK and the cluster fields are like indexes.
So does this mean that the cluster fields have nothing to do with how records are saved on disk?
If I get it correctly the partition is like PK
This is not correct. A partition is not used to identify a row in the table; rather, it enables BigQuery to store each partition's data in a different segment, so when you scan a table by partition you ONLY scan the specified partitions and thus reduce your scanning cost.
cluster fields are like indexes
This is correct. Cluster fields are used as pointers to records in the table and enable quick, minimal-cost access to the data regardless of the partition. This means that using cluster fields you can query a table across partitions with minimal cost.
I like @Felipe's image from his Medium post, which gives a nice visualization of how the data is stored.
Note: Partitioning happens at the time of the insert, while clustering happens as a background job performed by BigQuery.
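For completeness, both properties are declared together in Standard SQL DDL; the dataset, table, and column names below are made up for illustration:

CREATE TABLE mydataset.events
(
  event_time TIMESTAMP,
  customer_id STRING,
  purchase_type STRING,
  payload STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY customer_id, purchase_type;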
We have a clustered transactional table (10k buckets), which seems to be inefficient for the following two use cases:
merges with daily deltas
queries based on date range.
What we want to do is partition the table by date, thus creating a partitioned clustered transactional table. The daily volume suggests the number of buckets should be around 1-3, but inserting into the newly created table produces number_of_buckets reduce tasks, which is too slow and causes some issues with the merge on reducers due to limited hard drive space.
Both issues are solvable (for instance, we could split the data into several chunks and start separate jobs to insert into the target table in parallel, using n_jobs*n_buckets reduce tasks, though it would result in several reads of the source table), but I believe there should be a right way to do this, so the question is: what is that right way?
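For reference, the target table we have in mind looks roughly like this (column names are only illustrative):

CREATE TABLE target_table (
    id bigint,
    payload string
)
PARTITIONED BY (dt date)
CLUSTERED BY (id) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');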
P.S. Hive version: 1.2.1000.2.6.4.0-91
I have a database of users for a web API, but I also want to store usage history for each user, i.e. page request count, data volumes, etc. What is the best way to implement this in terms of database structure? My initial thought was to retain the main table but then create a history table for each user. That seems horribly impractical, however. My gut feeling is that I probably need one separate table for usage history, but I am unclear as to how to structure it.
I am using SQLite.
For an event logging model (which is what you want), I can recommend two options:
Option 1: a single table; let's call it activity_log:

CREATE TABLE activity_log (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    event_type VARCHAR(10),
    event_time TIMESTAMP
);
For each event in your system affecting a user, you insert a record into this table (I believe the column names are self-explanatory). Note that SQLite doesn't provide a native TIMESTAMP type, so you'll have to handle the storage format in your application code. What this design leaves you with is a table that has the potential to grow very large, but it will give you fine-grained statistics. SQLite doesn't support clustered indexes, but there are some options that will help you out with performance tuning.
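A typical write is then one INSERT per event; storing the time as the ISO-8601 text that datetime('now') produces is one common convention (the user id and event type below are placeholders):

INSERT INTO activity_log (user_id, event_type, event_time)
VALUES (42, 'page_view', datetime('now'));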
Option 2: the same table as above, only instead of inserting a new row for every event, you perform a conditional insert, i.e. update the existing row for users that are already there and insert a row for new users. This option will keep your table several times smaller than the one above, but you'll only have access to the most recent use of your API.
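On a reasonably recent SQLite (3.24+) and with a UNIQUE constraint on (user_id, event_type) - both assumptions beyond the schema above - that conditional insert can be a single upsert:

INSERT INTO activity_log (user_id, event_type, event_time)
VALUES (42, 'page_view', datetime('now'))
ON CONFLICT (user_id, event_type)
DO UPDATE SET event_time = excluded.event_time;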
If you can afford it, I'd say go with number 1.
In one of my programs, I maintain a table of module usage per user. The structure of the table is
table id
user id
prog id
date/time
history flag (0=current, 1=history)
runs (number of time user has run program on date)
About once a week, I aggregate the data in the table: if user 1 has run program 1 twice on a given date, then initially there will be two entries in the table:
1;1;1;04/10/12 08:56;0;1
2;1;1;04/10/12 09:33;0;1
After aggregation, the table becomes
3;1;1;04/10/12 00:00;1;2
Whilst the aggregation loses the time part, no other data is lost and queries against the table will be quicker.
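In SQL terms, the weekly aggregation amounts to collapsing the current rows into one history row per user/program/day and then removing the originals; a rough sketch, with the table and column names paraphrased from the description above:

-- roll current rows (history flag = 0) up into one history row per user/program/day
INSERT INTO module_usage (user_id, prog_id, run_date, history_flag, runs)
SELECT user_id, prog_id, date(run_date), 1, SUM(runs)
FROM module_usage
WHERE history_flag = 0
GROUP BY user_id, prog_id, date(run_date);

-- then drop the detail rows that were just aggregated
DELETE FROM module_usage WHERE history_flag = 0;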
I have a SQL Server DB with a table that logs every exception with details, along with 2 XMLs (1 for the request, 1 for the response).
These 2 XMLs are compressed.
Now, as the data volume is high, I need to clean the table every 3-4 months.
What optimization techniques can I use to avoid the data clean-ups?
Create indexes on all columns that require searching.
Run OPTIMIZE TABLE tablename (or the equivalent command for your RDBMS) through a cron job every day.
Probably the best thing you can look into is table partitioning, which will allow you to quickly remove data when it needs to age out. Also, make sure that you cluster your index on a monotonically increasing value (either a surrogate identity value or a datetime column, like dateofreciept); this will reduce fragmentation on the clustered index.
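A minimal sketch of what that partitioning could look like in T-SQL - the table, the monthly boundary values, and the column names are all illustrative, not taken from your actual schema. Old months can then be removed by switching the oldest partition out rather than running large DELETEs:

CREATE PARTITION FUNCTION pf_log_month (datetime)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME ps_log_month
AS PARTITION pf_log_month ALL TO ([PRIMARY]);

CREATE TABLE exception_log
(
    id bigint IDENTITY(1,1) NOT NULL,
    logged_at datetime NOT NULL,
    request_xml varbinary(max),   -- compressed request XML
    response_xml varbinary(max),  -- compressed response XML
    CONSTRAINT pk_exception_log PRIMARY KEY CLUSTERED (logged_at, id)
) ON ps_log_month (logged_at);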