Database: Oracle 11g
I am working on a greenfield project and designing a database schema. I have an audit table which, as the name suggests,
will eventually grow to hold a huge number of records. The following is the table definition (after pruning the extraneous columns).
create table ClientAudit (
id number(19,0) primary key,
clientId number(19,0) not null,
createdOn timestamp with time zone default systimestamp not null
);
id is a natural number populated by an Oracle sequence.
clientId is a unique client identifier.
For ease of querying in reports, I am also creating the following view, which gives the latest record for each client, based on createdOn:
create or replace view ClientAuditView
as
select * from (
select ca.*,max(ca.createdOn) keep (dense_rank last order by ca.createdOn)
over (partition by ca.clientId) maxCreatedOn
from ClientAudit ca
)
where createdOn=maxCreatedOn;
/
I am not sure what the partitioning key should be if I were to partition the ClientAudit table.
Should it be ClientId, or CreatedOn?
What should be the partitioning strategy?
Since the selection is on createdOn, I would suggest range partitioning on that column. The query should also filter on the date passed in, so that it reads only the matching partition.
You will not benefit from partition pruning this way, because the view has to read every partition to find the latest row per client. If you plan to store data for a very long time, the view will become very slow.
I recommend storing "lastAuditTimestamp" or "lastAuditId" in the Clients table (or another entity) and rewriting the view as follows:
create or replace view ClientAuditView
as
select ca.* from ClientAudit ca
where (clientId,createdOn) in (select clientId,lastAuditTimestamp from Clients c)
;
/
At a later stage you can optimize it further by adding a range condition for the minimum/maximum lastAuditTimestamp, in case the number of clients grows so large that a HASH SEMI JOIN is used.
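To make the pointer idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module instead of Oracle, so it illustrates only the data flow and says nothing about partition pruning. The lastAuditTimestamp column follows the view above; the record_audit helper, the TEXT timestamps and the sample rows are assumptions made for illustration, and the sketch assumes audit rows arrive in time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Clients (
        clientId INTEGER PRIMARY KEY,
        lastAuditTimestamp TEXT
    );
    CREATE TABLE ClientAudit (
        id INTEGER PRIMARY KEY,
        clientId INTEGER NOT NULL,
        createdOn TEXT NOT NULL
    );
""")

def record_audit(conn, client_id, created_on):
    """Insert an audit row and move the client's pointer to it (hypothetical helper)."""
    conn.execute(
        "INSERT INTO ClientAudit (clientId, createdOn) VALUES (?, ?)",
        (client_id, created_on),
    )
    conn.execute(
        "UPDATE Clients SET lastAuditTimestamp = ? WHERE clientId = ?",
        (created_on, client_id),
    )

conn.executemany("INSERT INTO Clients (clientId) VALUES (?)", [(1,), (2,)])
record_audit(conn, 1, "2020-01-01T10:00:00")
record_audit(conn, 1, "2020-01-02T10:00:00")
record_audit(conn, 2, "2020-01-01T12:00:00")

# Equivalent of the view, written as a join: the audit row each pointer refers to.
latest = conn.execute("""
    SELECT ca.clientId, ca.createdOn
    FROM ClientAudit ca
    JOIN Clients c
      ON c.clientId = ca.clientId
     AND c.lastAuditTimestamp = ca.createdOn
    ORDER BY ca.clientId
""").fetchall()
print(latest)
```

The read side never aggregates over the audit table; it only looks up the rows the pointers name, which is what makes the approach attractive once the audit table is large.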
Related
Creating an Oracle partition for a table for every day.
ALTER TABLE TAB_123 ADD PARTITION PART_9999 VALUES LESS THAN ('0001') TABLESPACE TS_1
Here I am getting an error because the boundary value has dropped back to 0001, which is lower than the existing upper boundary.
You can have Oracle create partitions automatically by using PARTITION BY RANGE with the INTERVAL clause (interval partitioning, available since Oracle 11g).
Sample DDL, assuming that the partition key is column my_date_column :
create table TAB_123
( ... )
partition by range(my_date_column) interval(/*numtoyminterval*/ NUMTODSINTERVAL(1,'day'))
( partition p_first values less than (to_date('2010-01-01', 'yyyy-mm-dd')) tablespace ts_1)
;
With this setup in place, Oracle will, if needed, create a partition on the fly when you insert data into the table. Interval partitioning requires at least one initial range partition, such as p_first above, which marks the point beyond which partitions are created automatically.
This naming convention (last digit of year plus day number) won't support holding more than ten years' worth of data. Maybe you think that doesn't matter, but I know databases which are well into their second decade. Be optimistic!
Also, that key is pretty much useless for querying. Most queries against partitioned tables want the benefit of partition elimination, but that only works if the query filters on the partition key in the same format. Developers really won't want to cast a date to YDDD format every time they write a select on the table.
So. Use an actual date for defining the partition key and hence range. Also for naming the partition if it matters that much.
ALTER TABLE TAB_123
ADD PARTITION P20200101 VALUES LESS THAN (date '2020-01-02') TABLESPACE TS_1
/
Note that the range is defined by less than the next day. Otherwise the date of the partition name won't align with the date of the records in the actual partition.
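If the daily partitions are added by a script, the naming rule and the "next day" bound can be generated together so they never drift apart. The following is a small sketch in Python; the daily_partition_ddl function is a hypothetical helper, and the table and tablespace names are simply the ones used above.

```python
from datetime import date, timedelta

def daily_partition_ddl(table, day, tablespace):
    """Build the ADD PARTITION statement for one calendar day (hypothetical helper)."""
    name = "P" + day.strftime("%Y%m%d")
    # The upper bound is exclusive, so it must be the following day.
    upper = day + timedelta(days=1)
    return (
        f"ALTER TABLE {table} ADD PARTITION {name} "
        f"VALUES LESS THAN (date '{upper.isoformat()}') TABLESPACE {tablespace}"
    )

ddl = daily_partition_ddl("TAB_123", date(2020, 1, 1), "TS_1")
print(ddl)
```

Because the bound is computed with real date arithmetic, the year rollover needs no special handling: the partition for 31 December simply gets 1 January of the next year as its upper bound.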
Currently I am using the following query:
SELECT
ID,
Key
FROM
mydataset.mytable
where ID = 100077113
and Key='06019'
My data has 100 million rows:
ID - unique
Key - can have ~10,000 keys
If I know the key, looking up the ID could be done on ~10,000 rows, which would be much faster and process much less data.
How can I use the new clustering capabilities in BigQuery to partition on the field Key?
(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table - while creating a fake date column for fake partitioning:
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
SELECT *, DATE('1980-01-01') fake_date
FROM `fh-bigquery.github_extracts.contents_java`
)
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered2`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
What I learned here:
Clustering can work with unique ids, even for tables without a date to partition by.
Prefer using a fake date instead of a null date (but only for now - this should be improved).
Clustering made my query 99.6% cheaper when looking for rows by id!
Read more: https://medium.com/#hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
You can have one field of type DATE with NULL values, so you will be able to partition by that field, and since the table is partitioned you will be able to enjoy clustering.
You need to recreate your table with an additional date column with all rows having NULL values, and then set the partitioning to the date column. This way your table is partitioned.
After you are done with this, add clustering on the columns you identified in your query. Clustering will improve processing time and reduce query costs.
You can now partition a table on an integer column, so this might be a good solution. Remember that there is a limit of 4,000 partitions per table. Because you have ~10,000 keys, I suggest creating a sort of group_key that bundles keys together, or perhaps you have another column that you can use as an integer with a cardinality < 4,000.
Recently BigQuery introduced support for clustering tables even if they are not partitioned. So you can simply cluster on your integer field and not use partitioning at all, although this solution will not be the most effective for data scan optimisation.
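One way to picture the group_key idea mentioned above is a stable hash of the key reduced to fewer than 4,000 buckets. The sketch below is plain Python, not BigQuery SQL; the group_key function, the use of CRC32 and the generated sample keys are all assumptions for illustration. Whatever mapping is chosen would have to be reproduced both when loading the table and when querying it, so that a filter on Key can also be expressed as a filter on group_key.

```python
import zlib

MAX_PARTITIONS = 4000  # per-table partition limit mentioned above

def group_key(key, buckets=MAX_PARTITIONS):
    """Map a string key to a stable integer bucket in [0, buckets)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % buckets

# Stand-ins for the ~10,000 distinct keys; the real key values are not known here.
keys = [f"{n:05d}" for n in range(10000)]
used = {group_key(k) for k in keys}

print(len(used) <= MAX_PARTITIONS)
print(group_key("06019") == group_key("06019"))
```

Several keys share each bucket, so a query still has to filter on Key as well; the bucket only narrows down which partition is scanned.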
My schema for the table is: A)
CREATE TABLE friend_list (
userId uuid,
friendId uuid,
accepted boolean,
ts_accepted timestamp,
PRIMARY KEY ((userId ,accepted), ts_accepted)
) with clustering order by (ts_accepted desc);
Here I am able to perform queries like:
1. SELECT * FROM friend_list WHERE userId="---" AND accepted=true;
2. SELECT * FROM friend_list WHERE userId="---" AND accepted=false;
3. SELECT * FROM friend_list WHERE userId="---" AND accepted IN (true,false);
But the 3rd query involves more reads, so I tried changing the schema like this:
B)
CREATE TABLE friend_list (
userId uuid,
friendId uuid,
accepted boolean,
ts_accepted timestamp,
PRIMARY KEY (userId , ts_accepted)
) with clustering order by (ts_accepted desc);
CREATE INDEX ON friend_list (accepted);
With this type B schema, the 1st and 2nd queries still work, and I can simplify the third query to:
3. SELECT * FROM friend_list WHERE userId="---";
I believe that the second schema gives much better performance for the third query, as it won't do the condition check on every row.
Cassandra experts, please suggest which schema is best for achieving this: A or B.
First of all, are you aware that your second schema does not work at all like the first one? In the first one the 'accepted' field was part of the key, but in the second it is not. You don't have the same uniqueness constraint, so you should check that this is not a problem for your model.
Second, if you only want to avoid including the 'accepted' field in every request, you have two possibilities:
1 - You can use 'accepted' as a clustering column:
PRIMARY KEY ((userId), accepted, ts_accepted)
This way your 3rd request can be:
SELECT * FROM friend_list WHERE userId="---";
And you will get the same result more efficiently.
But this approach has a problem: it will create larger partitions, which is not good for performance.
2 - Create two separate tables
This approach is much more in the Cassandra spirit. With Cassandra it is not unusual to duplicate data if that improves the efficiency of the requests.
So in your case you would keep your first schema for the first table, serving the first and second requests,
and you would create another table with the same data but a slightly different schema: either with the secondary index, if 'accepted' does not need to be part of the primary key (as in your second schema), or with a primary key like this:
PRIMARY KEY ((userId), accepted, ts_accepted)
I would definitely prefer the secondary index for the second table if possible, because the accepted column has low cardinality (2) and is thus well suited to a secondary index.
EDIT :
Also, you used a timestamp in your primary key. Be aware that this may be a problem if the same user can create two rows in this table, because a timestamp does not guarantee uniqueness: what happens if two rows are created in the same millisecond? (They would share a primary key, so the second write would overwrite the first.)
You should probably use a TimeUUID. This type, very commonly used in Cassandra, guarantees uniqueness by combining a timestamp and a UUID.
Furthermore, a timestamp in a primary key can create temporary hotspots on a Cassandra node, so it is definitely better avoided.
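Cassandra's timeuuid is a version 1 (time-based) UUID, and the same kind of value can be produced with Python's standard uuid module. The short sketch below only illustrates why such values stay distinct when generated almost simultaneously; it does not talk to Cassandra, and the variable names are illustrative.

```python
import uuid

# Generate a batch of version-1 (time-based) UUIDs back to back on one machine.
ids = [uuid.uuid1() for _ in range(1000)]

# Every value is version 1, and no two are equal even though they were
# created in very quick succession.
print(all(u.version == 1 for u in ids))
print(len(set(ids)) == len(ids))
```

A plain millisecond timestamp generated in the same loop would repeat many times, which is exactly the collision the answer warns about.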
I have a table of "Users"; each user has many "Projects" and each project has many "Clients", so it is many-to-many, and I keep track of client events in a different table.
The problem is that I can't figure out how to choose the key and the index so that the queries perform best.
The table with Key:
CREATE TABLE project_clients_events(
id timeuuid,
user_id int,
project_id int,
client_id text,
event text,
PRIMARY KEY ((user_id, project_id), id, client_id)
);
Now there will be more than 100K events per (user_id, project_id), so I need to be able to paginate through the results:
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
How can I group the results and paginate?
Thanks!
Let me answer your question in two parts: first the pagination, then the partition key.
The Cassandra CQL driver now supports automatic paging, so you need not worry about designing a complex WHERE clause.
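// Classes come from com.datastax.driver.core; "session" is an already connected Session
Statement stmt = new SimpleStatement("SELECT * FROM images");
stmt.setFetchSize(100); // number of rows fetched per page
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
// Process the row here; further pages are fetched transparently during iteration
}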
This link will be helpful :
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
Deciding the partition key depends on the queries you will run. For example, if most of your queries fetch results based only on user_id and project_id, then it is better to have them as the partition key, as all those results will be placed in the same Cassandra partition (on the same node) and fetched together.
Hence I would advise you to first decide on the queries and then select your partition keys accordingly, as your performance will depend on how well the queries match the way the data is stored in Cassandra.
This could help you http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (slides 45-70)
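The effect of the partition key on placement can be pictured with a toy model. The sketch below is plain Python and deliberately simplified: real Cassandra uses its own partitioner and a token ring with replication, none of which is modelled here. The node names, the node_for function and the sample events are all hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def node_for(partition_key):
    """Pick a node from a hash of the partition key (toy stand-in for the partitioner)."""
    digest = hashlib.sha256(repr(partition_key).encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

events = [
    {"user_id": 1, "project_id": 10, "client_id": "c1"},
    {"user_id": 1, "project_id": 10, "client_id": "c2"},
    {"user_id": 2, "project_id": 10, "client_id": "c1"},
]

# With PRIMARY KEY ((user_id, project_id), ...) the pair is the partition key,
# so events sharing that pair always map to the same node.
placement = [node_for((e["user_id"], e["project_id"])) for e in events]
print(placement[0] == placement[1])
```

A query that supplies both user_id and project_id can therefore be answered from one partition, while a query by client_id alone gives this model nothing to hash and would have to consult every node.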
I have a table in an Oracle database, called my_table for example. It is a kind of log table. It has an incremental column named "id" and a "registration_number" column which is unique for each registered user. Now I want to get the latest change for each registered user, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? The second is: if there are sometimes about 70,000 inserts into this table, but mostly the number of inserted rows varies between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
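Before comparing execution plans, it is worth confirming on a small sample that the candidate queries really return the same rows. The following minimal sketch does that for the two versions from the question, using Python's standard sqlite3 module rather than Oracle, so it checks only correctness and says nothing about performance. The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, registration_number INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table (id, registration_number) VALUES (?, ?)",
    [(1, 100), (2, 200), (3, 100), (4, 300), (5, 200)],
)

# First version from the question: correlated subquery.
correlated = conn.execute("""
    SELECT t.id, t.registration_number
    FROM my_table t
    WHERE t.id = (SELECT MAX(id) FROM my_table t_m
                  WHERE t_m.registration_number = t.registration_number)
    ORDER BY t.id
""").fetchall()

# Second version from the question: join against the grouped maximum.
joined = conn.execute("""
    SELECT t.id, t.registration_number
    FROM my_table t
    INNER JOIN (SELECT MAX(id) AS m_id
                FROM my_table
                GROUP BY registration_number) t_m
      ON t.id = t_m.m_id
    ORDER BY t.id
""").fetchall()

print(correlated == joined)
print(correlated)
```

Both forms pick the row with the highest id per registration_number, so the choice between them comes down to the plans Oracle produces on the real data.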
To find the faster query, check the execution plan and cost of each; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions tend to make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They will definitely help you to form faster queries.
When you insert lots of rows, indexes slow down the insertion, because the index has to be updated with each insert (so I would not recommend an index on ID). There are two solutions I can think of for this:
You can drop the index before the insertion and recreate it afterwards.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can affect your query a bit, so there is a trade-off.
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
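update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
-- first action for this user: there is no row to update yet, so create it
if sql%rowcount = 0 then
insert into last_user_action(registration_number, last_action)
values(p_reg_num, v_row_id);
end if;
end;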
/
With such a schema you can simply query the last action of every user with good performance:
select *
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
A rowid is a physical storage identifier that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backups, etc. If you need that, it is simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.