Redis to cache a Table with 2 Million Rows - redis

We have a table in Postgres with 2 million rows. For faster access we need to cache the data in Redis. How would you cache a table in Redis? For instance, take an Employee table with Name, Designation, Project and Manager columns. I need to use a Redis data structure to store 2 million Employee rows and possibly query it with something like:
find employees where Designation='Engineer' and Project='Awesome Project'. Is it possible to cache this data in Redis and run a query that includes a couple of where conditions?
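One common way to approach this is to store each row as a Redis hash and keep one Redis set of row ids per indexed column value, then intersect the sets to answer a multi-condition query. The sketch below models that layout with plain Python dicts and sets, so it runs without a Redis server. The key names (`employee:<id>`, `idx:<column>:<value>`) and the `EmployeeCache` class are assumptions made for illustration; against a real server the corresponding commands would be HSET, SADD, SINTER and HGETALL.

```python
from collections import defaultdict


class EmployeeCache:
    """In-memory stand-in for the Redis layout: one hash per row, one set per indexed value."""

    INDEXED_COLUMNS = ("Designation", "Project")

    def __init__(self):
        self.rows = {}                  # "employee:<id>" -> dict of fields (a Redis hash, HSET)
        self.index = defaultdict(set)   # "idx:<column>:<value>" -> set of ids (a Redis set, SADD)

    def add(self, emp_id, row):
        self.rows[f"employee:{emp_id}"] = dict(row)
        for column in self.INDEXED_COLUMNS:
            self.index[f"idx:{column}:{row[column]}"].add(emp_id)

    def find(self, **conditions):
        # Intersect one index set per condition (SINTER), then fetch the rows (HGETALL).
        sets = [self.index.get(f"idx:{col}:{val}", set()) for col, val in conditions.items()]
        ids = set.intersection(*sets) if sets else set()
        return [self.rows[f"employee:{i}"] for i in sorted(ids)]


cache = EmployeeCache()
cache.add(1, {"Name": "Alice", "Designation": "Engineer", "Project": "Awesome Project", "Manager": "Dana"})
cache.add(2, {"Name": "Bob", "Designation": "Engineer", "Project": "Other Project", "Manager": "Dana"})
cache.add(3, {"Name": "Carol", "Designation": "Analyst", "Project": "Awesome Project", "Manager": "Dana"})

matches = cache.find(Designation="Engineer", Project="Awesome Project")
print([e["Name"] for e in matches])
```

The trade-off of this design is that every indexed column costs extra memory and the index sets have to be updated whenever a row changes, so it suits a small, known set of where conditions rather than arbitrary queries.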

Related

SQL Server : multiple small tables vs one large table for searching

I have 25 tables with the same structure but different data. Each table has 7 million rows. To find a record I have to go through the tables one by one, i.e. search table 1; if the record is found, show it and exit, otherwise search table 2, and so on up to table 25.
The structure is:
Name, Cell Number, ID Card Number, Address
From a performance perspective:
Is this OK, or should I merge all the tables into one large table?
To what extent can I combine the tables? (How many rows is it reasonable to keep in one table before another table should be created?)
Note: I only run search queries on Cell Number and ID Card Number.
In general, it is better to store all rows in a single table rather than in multiple tables. To speed queries, you should use facilities such as indexes and partitions.
Normally, when this question comes up, the issue is small tables (think dozens of rows) versus "large" tables (think thousands or millions of rows). In that extreme case, the decision is more cut-and-dried:
There is overhead to executing searches on multiple tables. Preparing and running queries takes some effort.
There is overhead in data storage. Tables store rows on data pages and the pages are not shared with other tables. If these pages are half-filled, then the I/O time is wasted.
Any improvements on performance, such as indexes, are either wasted on small tables or need to be repeated ad infinitum.
In your case, with a handful of large tables, these considerations are weaker. There is overhead for searching tables. But then again, it takes some time to run a query against 7 million rows, and if the query requires scanning the table, the compile time is much less than the execution time. Such large tables have a minuscule amount of wasted overhead in terms of half-filled "last" pages.
What I would say instead is that storing entities across multiple tables just makes managing the database trickier, so why bother? If I had to guess, you have 25 months of history (24 months of history plus the current month). I would recommend that you store such data in a single table, perhaps partitioned by month.
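As a rough illustration of the single-table approach, here is a sketch that uses SQLite (from the Python standard library) as a stand-in for SQL Server. The table name `person`, the `source_table` column and the index names are all hypothetical, and the data is synthetic. The point is only that one table with an index on each searched column answers the lookup with a single indexed query instead of up to 25 sequential searches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        name TEXT,
        cell_number TEXT,
        id_card_number TEXT,
        address TEXT,
        source_table INTEGER  -- hypothetical: which of the 25 original tables the row came from
    )
""")
# One index per column that the search queries filter on.
conn.execute("CREATE INDEX person_cell_idx ON person (cell_number)")
conn.execute("CREATE INDEX person_id_card_idx ON person (id_card_number)")

# Synthetic rows standing in for the merged contents of the 25 tables.
rows = [
    (f"Name {i}", f"0300{i:07d}", f"ID{i:09d}", f"Street {i}", i % 25 + 1)
    for i in range(10_000)
]
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)", rows)


def find_by_cell(cell_number):
    """Single indexed lookup across what used to be 25 tables."""
    return conn.execute(
        "SELECT name, source_table FROM person WHERE cell_number = ?",
        (cell_number,),
    ).fetchall()


# Ask SQLite how it would run the lookup; the plan should mention the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM person WHERE cell_number = ?",
    ("03000000042",),
).fetchall()

print(find_by_cell("03000000042"))
print(plan[0][-1])
```

Partitioning (for example by month, if that guess about the data is right) would be layered on top of this in SQL Server itself; SQLite has no equivalent, so it is not shown here.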

How to partition 10 billion row SQL tables quickly using AWS?

I have a SQL database of data delivered in a normalized format with several tables that have several billions of rows of data. I have decided to partition the large tables into separate tables by itemId since when I query the data I only care about 1 item at a time. I would end up having 5000+ tables at the end after partitioning the data. The problem is, partitioning the data takes about 25 minutes to build a single table for 1 item.
5000 items x 25 minutes = 86.8 days
It would take over 86 days to fully partition my entire SQL database. My entire database is about 2.5TB.
Is this something I can leverage AWS for to parallelize on an item level? Can I use AWS database migration services to host the database in its current form and then use AWS process to churn through all of the 5000 queries to partition the big tables into 5000 smaller tables with 2M rows each?
If not, is this something I just have to throw more hardware at to make it run faster (CPU or RAM)?
Thanks in advance.
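The estimate in the question, and what parallel builds could do to it, can be checked with a few lines of arithmetic. The `workers` parameter is an assumption for illustration: it models perfectly parallel builds, which is optimistic because all workers would be reading from the same source tables.

```python
def total_days(items, minutes_per_item, workers=1):
    """Wall-clock days to build one table per item, assuming perfectly parallel workers."""
    batches = -(-items // workers)  # ceiling division: rounds of work needed
    return batches * minutes_per_item / (60 * 24)


serial = total_days(5000, 25)                 # one table at a time, as in the question
parallel = total_days(5000, 25, workers=50)   # hypothetical 50 concurrent builds

print(round(serial, 1), round(parallel, 1))
```

With one worker this reproduces the 86.8 days from the question; with 50 ideal workers it drops to under two days, but real throughput depends on whether the database can sustain that many concurrent scans.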
This doesn't seem like a good strategy. For one thing, simple arithmetic is that 10,000,000,000 rows with 5,000 rows per item results in 2,000,000 partitions in the table.
The limit in Redshift (by default) is 1,000,000 partitions per table:
Amazon Redshift Spectrum has the following quotas when using the
Athena or AWS Glue data catalog:
A maximum of 10,000 databases per account.
A maximum of 100,000 tables per database.
A maximum of 1,000,000 partitions per table.
A maximum of 10,000,000 partitions per account.
You should re-think your partitioning strategy. Or perhaps your problem is not suitable for Redshift. There may be other database strategies more suitable for your use-case. (This is not the forum for recommending specific software solutions, however.)
Use the itemid as both sortkey and distkey. If the table is vacuumed properly and you select one itemid, this should give good results, with access time almost as good as a single table per item. The distkey is used to distribute the data between shards, which means each itemid's blocks are stored together on the same shard, making it faster to retrieve all of them. Having the itemid also be the sortkey means that, for itemids with a small number of rows that all live on the same shard, finding the rows within the table's blocks on that shard is as fast as possible.
Creating a separate table for each item, where every other attribute of the table remains the same, doesn't seem logical. If the data format is the same, then keep the data in the same table unless there is a particular problem to overcome.
If you set the itemId as the SORTKEY on a Redshift table, then Redshift will be able to skip over the blocks that do not contain a desired value (when using WHERE itemId = 'xxx'). This will be highly efficient.
Admittedly, such a large table would probably be too hard to keep sorted with VACUUM. It would still work reasonably well without the SORTKEY since blocks can still be skipped, but not as efficiently because the data for that itemId would be spread over more blocks.
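To make the block-skipping argument concrete, here is a simplified model, not Redshift's actual implementation. It assumes each block records the minimum and maximum itemId it contains, and that a block is read only if the wanted value falls inside that range. The block size, the number of items and the fully shuffled "unsorted" layout are all assumptions chosen for illustration.

```python
import random


def blocks_to_read(values, wanted, block_size=1000):
    """Count blocks whose [min, max] range could contain `wanted`."""
    count = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        if min(block) <= wanted <= max(block):
            count += 1
    return count


random.seed(0)
item_ids = [i % 500 for i in range(100_000)]  # 500 items, 200 rows each
random.shuffle(item_ids)                      # worst case: no ordering at all

unsorted_reads = blocks_to_read(item_ids, 42)
sorted_reads = blocks_to_read(sorted(item_ids), 42)

print(unsorted_reads, sorted_reads)
```

A fully shuffled table is the worst case, where almost no block can be skipped. Data that arrives roughly in itemId order would land somewhere between the two numbers, which is consistent with the point that skipping still helps without a maintained sort, just less efficiently.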

How to increase scan speed in Hbase

I am new to Apache HBase. I am using hbase-0.98.13 and have created a table sample with column family sample_family, and I have loaded the output of a Pig script into that table. When I try to scan the table based on one of the columns in the column family, it takes more than 2 minutes.
Here is the query
scan 'sample', {FILTER=>"SingleColumnValueFilter('sample_family','id',=,'binary:1000')"}
Can anyone tell me how to bring this down to one or two seconds?
Are there any configuration changes to be made for this? Can anyone help me with this?
There's no silver bullet to make a search in HBase fast.
A scan in your example has to iterate over all the rows in a table, that's why it takes significant time on large tables. And there are no secondary indices in HBase that help to improve a search by specific columns.
The most effective way to improve scan performance is to have properly designed row keys. HBase internally keeps rows sorted by row key, and you can specify start and end rows for a scan. So it's crucial to design row keys for searching by the most frequent criteria. In your question you search by column id where the value is 1000. You could put this id into the row key (however, you have to make sure you avoid region hotspotting).
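The difference between the two access patterns can be sketched in plain Python, with a sorted list standing in for HBase's row-key ordering. This is only a model: the zero-padded key format and the helper names `full_scan` and `range_scan` are assumptions for illustration, not HBase APIs.

```python
from bisect import bisect_left

# Rows kept sorted by row key, as HBase does. Zero-padding the id makes
# lexicographic order match numeric order.
rows = [(f"{i:08d}", {"sample_family:id": str(i)}) for i in range(100_000)]
keys = [key for key, _ in rows]


def full_scan(column, value):
    """Like a SingleColumnValueFilter scan: every row is examined."""
    return [key for key, cols in rows if cols[column] == value]


def range_scan(start_row, stop_row):
    """Like a scan with start and stop rows: binary search, then read only the range."""
    lo = bisect_left(keys, start_row)
    hi = bisect_left(keys, stop_row)
    return keys[lo:hi]


by_filter = full_scan("sample_family:id", "1000")
by_key = range_scan("00001000", "00001001")  # stop row is exclusive

print(by_filter, by_key)
```

Both calls return the same row, but the first touches all 100,000 rows while the second touches one. Note that strictly increasing keys like these are exactly the kind that can cause region hotspotting on writes, so a real design might prefix the key with a salt or hash.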

Performance of inserting to big table while checking for duplicates

I have a simple table that contains a varchar(100). I am trying to populate it with 1 billion unique records. I have a stored proc that takes a table-type parameter containing 1000 records at a time and inserts them into the table while checking that no duplicate exists. After about 50 million rows the performance goes down. I tried sharding the table and using SQL table partitioning with balanced distribution, but no gain was observed.
How can I build this solution in SQL with reasonable performance?
You might want to try de-duping the data before you put it into the database, then disabling the unique key while inserting so you don't have to deal with rebuilding it as you go.
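A minimal sketch of the "de-dupe before loading" step, assuming the candidate values can be streamed in batches. The function name `dedupe_batches` is hypothetical. It also assumes the set of seen values fits in memory, which is unlikely to hold for a billion 100-character strings on most machines; at that scale an external sort or a hash-partitioned pass over files would play the same role.

```python
def dedupe_batches(batches):
    """Yield batches containing only values not seen in any earlier batch."""
    seen = set()
    for batch in batches:
        fresh = []
        for value in batch:
            if value not in seen:
                seen.add(value)
                fresh.append(value)
        if fresh:  # skip batches that contained nothing new
            yield fresh


incoming = [["a", "b", "a"], ["b", "c"], ["a"]]
clean = list(dedupe_batches(incoming))
print(clean)
```

Each yielded batch can then be bulk-inserted without a per-row duplicate check, and the unique constraint can be rebuilt once after the load.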

How big is too big for a PostgreSQL table?

I'm working on the design for a RoR project for my company, and our development team has already run into a bit of a debate about the design, specifically the database.
We have a model called Message that needs to be persisted. It's a very, very small model with only three db columns other than the id, however there will likely be A LOT of these models when we go to production. We're looking at as much as 1,000,000 insertions per day. The models will only ever be searched by two foreign keys on them which can be indexed. As well, the models never have to be deleted, but we also don't have to keep them once they're about three months old.
So, what we're wondering is if implementing this table in Postgres will present a significant performance issue? Does anyone have experience with very large SQL databases to tell us whether or not this will be a problem? And if so, what alternative should we go with?
The number of rows per table won't be an issue on its own.
So roughly speaking 1 million rows a day for 90 days is 90 million rows. I see no reason Postgres can't deal with that, without knowing all the details of what you are doing.
Depending on your data distribution you can use a mixture of indexes, filtered indexes, and table partitioning of some kind to speed things up once you see what performance issues you may or may not have. Your problem will be the same on any other RDBMS that I know of. If you only need 3 months' worth of data, design in a process to prune the data you no longer need. That way you will have a consistent volume of data in the table. You're lucky that you know how much data will exist: test it at your volume and see what you get. Testing one table with 90 million rows may be as easy as:
select x,1 as c2,2 as c3
from generate_series(1,90000000) x;
https://wiki.postgresql.org/wiki/FAQ
Limit                        Value
Maximum Database Size        Unlimited
Maximum Table Size           32 TB
Maximum Row Size             1.6 TB
Maximum Field Size           1 GB
Maximum Rows per Table       Unlimited
Maximum Columns per Table    250 - 1600 depending on column types
Maximum Indexes per Table    Unlimited
Another way to speed up your queries significantly on a table with > 100 million rows is to cluster the table on the index that is most often used in your queries. Do this in your database's "off" hours. We have a table with > 218 million rows and have found 30X improvements.
Also, for a very large table, it's a good idea to create an index on your foreign keys.
EXAMPLE:
Assume we have a table named investment in a database named ccbank.
Assume the index most used in our queries is (bankid,record_date)
Here are the steps to create and cluster an index:
psql -c "drop index investment_bankid_rec_dt_idx;" ccbank
psql -c "create index investment_bankid_rec_dt_idx on investment(bankid, record_date);" ccbank
psql -c "cluster investment using investment_bankid_rec_dt_idx;" ccbank
vacuumdb -d ccbank -z -v -t investment
In steps 1-2 we replace the old index with a new, optimized one. In step 3 we cluster the table: this physically rewrites the table in the order of the index, so that rows which are queried together sit on adjacent pages and PostgreSQL needs fewer disk reads to fetch them. In step 4 we vacuum and analyze the table to refresh the statistics for the query planner.