Querying GROUP BY on 1+ billion rows in MemSQL - sql

I have a table with 1.3 billion rows (MemSQL, columnstore). I need to run a GROUP BY on 3 fields (id1, id2, text) and fetch the latest record for each such 3-tuple. The table is populated through a pipeline mounted onto an EFS folder; currently it holds about 200k CSV files of 2 MB each.
I need help writing an optimized query for this case, or advice on whether it can be done some other way.
Edit: I am not able to find any blog/help online; most of what I found involves creating an extra table, which is not possible for me right now (very heavy memory usage in that case).
Something like the query below is not going to work and takes my 5-node cluster down:
select max(eventTime) from table1 group by id1, id2, field1

There are a couple of considerations here.
1) what is your shard key for the columnstore table?
2) are you using MemSQL 6.5, the most recent version?
3) have you reviewed this resource about Optimizing Table Data Structures? https://www.memsql.com/static/memsql_whitepaper_optimizing_table_data_structures.pdf
Ensure that columns common to all your queries are in the columnstore key, to improve segment elimination.
If the data is inserted in an order, like a timestamp, it's best to put that column first in the columnstore key to minimize the work of the background merger process.
If there are lots of distinct values in one part of the composite key, put that part last. Put the key parts with lower distinctness first, to increase the likelihood that segment elimination will help on the later columns.
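To make that ordering concrete, here is a rough sketch of the DDL; the column types are guesses, and it assumes eventTime roughly follows insert order and that id1 is the least distinct of the three keys:
CREATE TABLE table1 (
    id1 BIGINT,
    id2 BIGINT,
    field1 VARCHAR(255),
    eventTime DATETIME,
    -- insert-ordered column first, then the less distinct key parts:
    KEY (eventTime, id1, id2, field1) USING CLUSTERED COLUMNSTORE
);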
Also, what would help is if you ran
EXPLAIN select max(eventTime) from table1 group by id1, id2, field1;
so that we could see the explain plan.

It takes a long time because the database needs a proper design for this query. Choose the shard key to be those three columns (id1, id2, field1). I also recommend using a columnstore rather than a rowstore for this kind of query.
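For illustration only (column types are guesses), that suggestion would look something like the sketch below; with this shard key, all rows for a given (id1, id2, field1) land on the same partition, so the GROUP BY can be computed locally on each leaf without reshuffling data across the cluster:
CREATE TABLE table1 (
    id1 BIGINT,
    id2 BIGINT,
    field1 VARCHAR(255),
    eventTime DATETIME,
    SHARD KEY (id1, id2, field1),
    KEY (eventTime, id1, id2, field1) USING CLUSTERED COLUMNSTORE
);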

Related

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTSTYLE is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
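If the EVEN default is part of the cost, note that CTAS accepts the table attributes explicitly, so you can keep the distribution and pick a sort key yourself. A sketch only; the sort column name is hypothetical, and query A stands in for your actual query:
CREATE TABLE tbl
DISTSTYLE ALL
SORTKEY (event_date)
AS
SELECT ...;   -- query A goes here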
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the query by reading very little data from disk, because it realized that very few columns (or perhaps none) needed to be read from disk to produce the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE AS writes all the data returned by the query to disk, while the count query does not; that explains the difference. Writing all the rows is a more expensive operation than just reading the row count.

simple select is taking huge time on the table

I have a table with around 17 million rows of transaction data. It has a clustered index and non-clustered indexes on the key columns. Even a simple select takes 11 minutes to retrieve the data, and DML operations also take a good amount of time.
Simple select
Select * from TransactionTable
In case people ask what I have done from my side:
1) I have created indexes (clustered and non-clustered).
2) Using the physical stats DMV, I have checked whether the table is fragmented or not.
3) Before doing DML operations I have reorganized the indexes.
Please suggest a way forward.
I can only think of trying to reduce the size of the table by adjusting the data types to the minimum requirements. If you have a lot of NULL values, try to use sparse columns.
What might also help is keeping the data compressed.
If I remember correctly, you'll have to repopulate the table.
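To make the data-type/sparse and compression suggestions concrete, a rough sketch; the column names here are hypothetical, and data compression requires an edition of SQL Server that supports it:
-- Shrink oversized types and mark a mostly-NULL, nullable column as sparse (made-up columns):
ALTER TABLE TransactionTable ALTER COLUMN Quantity smallint;
ALTER TABLE TransactionTable ALTER COLUMN OptionalNote ADD SPARSE;
-- Compress the table and its indexes; the rebuilds rewrite the data, so plan for the downtime:
ALTER TABLE TransactionTable REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX ALL ON TransactionTable REBUILD WITH (DATA_COMPRESSION = PAGE);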
The more interesting question, however, is what you are going to do with the data.

Why is Select Count(*) slower than Select * in hive

When I run queries in a VirtualBox sandbox with Hive, Select count(*) feels much slower than Select *.
Can anyone explain what is going on behind the scenes?
And why is this delay happening?
select * from table
It can be a map-only job, but
Select Count(*) from table
can be a map-and-reduce job.
Hope this helps.
There are three types of operations that a hive query can perform.
In order from cheapest and fastest to most expensive and slowest, here they are.
A hive query can be a metadata only request.
Show tables, describe table are examples. In these queries the hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
A hive query can be an hdfs get request.
Select * from table would be an example. In this case hive can return the results by performing an hdfs operation (hadoop fs -get, more or less).
A hive query can be a Map Reduce job.
Hive has to ship the jar to hdfs, the jobtracker queues the tasks, the tasktrackers execute the tasks, and the final data is put into hdfs or shipped to the client.
The Map Reduce job has different possibilities as well.
It can be a Map only job.
Select * from table where id > 100, for example; all of that logic can be applied in the mapper.
It can be a Map and Reduce job,
Select min(id) from table;
Select * from table order by id ;
It can also lead to multiple Map Reduce passes, but I think the above summarizes the main behaviors.
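To tie this back to the question, a few illustrative examples (mytable is just a placeholder name):
-- metadata-only: answered from the metastore, no job at all
DESCRIBE mytable;
-- hdfs fetch: Hive can stream the files back without starting MapReduce
SELECT * FROM mytable;
-- map-only job: the filter runs in the mappers
SELECT * FROM mytable WHERE id > 100;
-- map + reduce job: the aggregation needs a reduce phase, hence the extra latency
SELECT count(*) FROM mytable;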
This is because the DB is using a clustered primary key, so the query searches each row for the key individually, row by agonizing row, not from an index. Two things to try:
Run optimize table. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
Create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.
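As a sketch only; the table name and the range value are hypothetical here:
OPTIMIZE TABLE change_events;
CREATE INDEX idx_change_event_id ON change_events (change_event_id);
EXPLAIN SELECT * FROM change_events WHERE change_event_id > 1000000;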

Sql Server Paging issue

Friends,
I have already implemented paging in my SP -
with MyData As (
    select ROW_NUMBER() over (order by somecolumn desc) AS [Row],
           x, y, z, ...
    from MyTable   -- source table (name not given in the question)
)
Select x, y, z, ...
From MyData
Where [Row] between ((@currentPage - 1) * @pageSize + 1) and (@currentPage * @pageSize)
The problem here is that the data is retrieved very fast when the WITH clause returns a small number of rows, but it takes a long time when there are millions of records. Sometimes it times out.
Is there any other alternative?
Thanks for sharing your valuable time.
SQL Server optimisation is a very broad subject and it is pretty much impossible to work out the issue from the limited amount of information you have posted. However, since you're in a rush for a solution: firstly, I would suggest checking your actual execution plan, posting it here, and making sure that the index is actually being used. If it is not, consider using the FASTFIRSTROW table hint to force the index to be used; be aware it can improve things, but it can also make things worse.
Next to consider is SQL parameter sniffing. It's unlikely from what you have said, but possible, so it is worth checking.
For large-scale performance gains you may need to look at architectural changes. At the very least, ensure that your transaction logs are on a different disk to your data. The reason you separate the database files from the log files is that database access is random and log access is sequential; best practice dictates that you don't mix those two I/O types on the same disk.
Also, if you've got millions of rows then you really need to consider splitting the data across multiple disks.
Finally, I would strongly consider partitioning either the table or the index.
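As a rough illustration of the partitioning idea, a sketch that assumes somecolumn is a datetime; the function, scheme, index and table names here are made up:
CREATE PARTITION FUNCTION pf_bydate (datetime)
    AS RANGE RIGHT FOR VALUES ('2017-01-01', '2018-01-01', '2019-01-01');
CREATE PARTITION SCHEME ps_bydate
    AS PARTITION pf_bydate ALL TO ([PRIMARY]);
-- If MyTable already has a clustered index, rebuild that index onto the scheme (DROP_EXISTING) instead:
CREATE CLUSTERED INDEX cix_MyTable_somecolumn
    ON MyTable (somecolumn)
    ON ps_bydate (somecolumn);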
The reason your query is slow is that you have to sort the whole table on every request. To speed it up significantly you need to avoid sorting a big chunk of data, at a cost in CPU, disk/memory, or limitations on the pagination logic.
As there is not much information about how your table is sorted and whether you insert in the middle or delete entries very often, I'll narrow down your question by making these assumptions:
I would imagine you have a table storing an archive of articles. New entries mostly land at the bottom of the table, and entries from the middle of the table are deleted rarely.
You always sort by the same column somecolumn and in the same order, e.g. descending.
You do not have any user entered filters (like article title or author).
This makes the table static in terms of the output: each article stays in the same place unless a new one is inserted, and new ones come to the top of your output. Then you can store ROW_NUMBER() OVER (ORDER BY somecolumn) as a column. A more convenient solution would be an IDENTITY column. It will speed things up if you create a clustered index on this column.
alter table MyTable add [Record_Number] int null
The new column is added as nullable so you can populate it the first time; then you can make it not null.
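A possible follow-up sketch for populating and indexing the new column; it assumes somecolumn reflects insert order and that adding a unique index is acceptable:
-- Populate the column once, in insert order:
;WITH numbered AS (
    SELECT Record_Number, ROW_NUMBER() OVER (ORDER BY somecolumn) AS rn
    FROM MyTable
)
UPDATE numbered SET Record_Number = rn;
-- Then lock it down and index it (make it the clustered index instead, as suggested above, if you can move it):
ALTER TABLE MyTable ALTER COLUMN Record_Number int NOT NULL;
CREATE UNIQUE INDEX ux_MyTable_Record_Number ON MyTable (Record_Number);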
On the other hand, you can get the last row number very quickly with
select @Max_Row = MAX(Record_Number) from MyTable
Now that you have the total number of rows, the page size and the page number, you can select the rows you need in one statement without sorting the whole lot.
Select * From MyTable
Where Record_Number between
    (@Max_Row - @Page * @Page_Size) + 1 AND
    @Max_Row - (@Page - 1) * @Page_Size
If you do have a filter in your CTE, then give some more information about how your data is structured, so we can think of a way to limit the scope of CTE.

SQL - Optimizing performance of bulk inserts and large joins?

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.
To put the question in context, here's an overview of what I do currently:
Drop all foreign key and unique constraints
Import the data (~100 million records)
Re-create the constraints and run analyze on the fact table.
Importing the data is done by loading from files. For each file:
1) Load the data from the file into a temporary table using COPY (the PostgreSQL bulk-load tool)
2) Update each of the 9 dimension tables with any new data using an insert for each such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The analyze is run at the end of the INSERT with the idea of keeping the statistics up to date over the course of tens of millions of updates (Is this advisable or necessary? At minimum it does not seem to significantly reduce performance).
3) The fact table is then updated with an unholy 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better approaches I'm overlooking?
I've tried several different approaches to normalizing data coming from a source like this, and generally I've found the approach you're using now to be my choice. It's easy to follow, and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generated far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions, and using "select distinct ... except select" works well for me. Other folks may know better, but I've found your current method to be my preferred method.
During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.
Instead I'd recommend creating one sproc to insert into your fact table, e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key of the dimension your fact table should reference. This will significantly reduce the work you need to do in stage 3, which is then reduced to a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ... etc.)
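For illustration, a hypothetical sketch of one such helper for the host dimension from the question; it assumes host has an integer id primary key, and you can adapt the naming to the getOrInsertXXXDim convention:
CREATE OR REPLACE FUNCTION get_or_insert_host_dim(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    -- Look up the existing dimension row first
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        -- Not found: insert it and capture the generated key
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
In a concurrent load you would also want to handle the race between the lookup and the insert.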
The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.