I'm using Hive on MRv2, and I'm trying to optimize Hive queries.
The database models the purchase history of a convenience store. It contains 6 tables (customers (1M rows), shops (1K rows), employees (5K rows), genres (30 rows), items (3.5K rows), purchase_histories (1G rows)), and I wrote a query that retrieves the sum of the number purchased for each item, genre and customer gender.
SELECT c.gender,
g.name,
i.name,
Sum(ph.num)
FROM purchase_histories ph
JOIN customers c
ON ( c.id = ph.cus_id
AND ph.dt < $var1
AND ph.dt > $var2 )
JOIN items i
ON ( i.id = ph.item_id )
JOIN genres g
ON ( g.id = i.gen_id )
GROUP BY c.gender,
g.name,
i.name;
I partitioned purchase_histories by (dt), items by (gen_id) and customers by (gender, byear).
I compared this database with an unpartitioned database (containing the same tables) using the query above. I supplied various values for $var1 and $var2 so that the query references 10,000,000 rows of purchase_histories.
I measured the processing time and found that the unpartitioned database is faster than (or equal to) the partitioned one. I checked the execution logs and found that the mapper count for the partitioned database is about 10-30, while the unpartitioned database uses about 150. I don't think that many mappers is necessarily good, but 10-30 mappers seems too few. So I thought I should check some configuration for the mapper count or memory size, but I don't know which settings to change, or whether my reasoning is correct.
The EXPLAIN results are no_partitions and partitioned, and the execution logs are exe_log_no_partition and exe_log_partitioned.
Thanks.
Addition
1. I looked at the EXPLAIN result for the partitioned database and think that the number of mappers is calculated by the formula below:
(table size 2619958583) / (mapreduce.input.fileinputformat.split.maxsize = 256000000)
Is it wrong?
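That formula looks like a plausible ballpark. A quick back-of-the-envelope check (this is only a sketch; the real split count also depends on file layout, the minsize setting, and the input format):

```python
import math

# Rough check of the split-count formula: with combined input splits,
# each split is capped at mapreduce.input.fileinputformat.split.maxsize,
# so table_size / max_split gives a ballpark mapper count.
table_size = 2619958583   # bytes, the partitioned table size from EXPLAIN
max_split = 256000000     # mapreduce.input.fileinputformat.split.maxsize

mappers = math.ceil(table_size / max_split)
print(mappers)  # 11 -- in line with the observed 10-30 mappers
```

If this is what is happening, lowering mapreduce.input.fileinputformat.split.maxsize should increase the mapper count for the partitioned table.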
Related
I know this type of question has been asked before, but I couldn't find one that matches my exact problem.
I'll try to give an exaggerated example.
Let's say that we want to find companies with at least one employee older than 40 and at least one customer younger than 20.
Query my colleague wrote for this problem is like this :
SELECT DISTINCT(c.NAME) FROM COMPANY c
LEFT JOIN EMPLOYEE e ON c.COMPANY_ID = e.COMPANY_ID
LEFT JOIN CUSTOMER u ON c.COMPANY_ID = u.COMPANY_ID
WHERE e.AGE > 40 and u.AGE < 20
I'm new to databases, but looking at this query (like a time-complexity problem), it will create an unnecessarily huge intermediate table: it will have employeeAmount x customerAmount rows for each company.
So, I re-wrote the query:
SELECT c.NAME FROM COMPANY c
WHERE EXISTS (SELECT * FROM EMPLOYEE e WHERE e.AGE > 40 AND c.COMPANY_ID = e.COMPANY_ID )
AND EXISTS (SELECT * FROM CUSTOMER u WHERE u.AGE < 20 AND c.COMPANY_ID = u.COMPANY_ID )
I do not know if this query will be worse since it will run 2 subqueries for each company.
I know that there can be better ways to write this. For example writing 2 different subqueries for 2 age conditions and then UNION'ing them may be better. But I really want to learn if there is something wrong with one of / both of two queries.
Note: You can increase the number of joins/subqueries. For example: "we want to find companies with at least one employee older than 40, at least one customer younger than 20, and at least one order bigger than $1000".
Thanks.
The exists version should have much better performance in general, especially if you have indexes on company_id in each of the subtables.
Why? The JOIN version creates an intermediate result with all employees over 40 and all customers under 20. That could be quite large if these groups are large for a particular company. Then, the query does additional work to remove duplicates.
There might be some edge cases where the first version performs fine. I would expect this, for instance, if either of the groups were empty - no employees ever over 40 or no customers ever under 20. Then the intermediate result set is empty and removing duplicates is not necessary. For the general case, though, I recommend exists.
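To make the comparison concrete, here is a tiny runnable sketch (SQLite via Python, made-up data; the EXISTS predicates are combined with AND to match the stated requirement). Both shapes return the same companies, but the JOIN version first builds employee x customer pairs per company and then deduplicates:

```python
import sqlite3

# Illustrative sketch: both query shapes on a tiny in-memory dataset.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE COMPANY  (COMPANY_ID INTEGER, NAME TEXT);
CREATE TABLE EMPLOYEE (COMPANY_ID INTEGER, AGE INTEGER);
CREATE TABLE CUSTOMER (COMPANY_ID INTEGER, AGE INTEGER);
INSERT INTO COMPANY  VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
INSERT INTO EMPLOYEE VALUES (1, 45), (1, 50), (2, 30), (3, 41);
INSERT INTO CUSTOMER VALUES (1, 18), (1, 19), (2, 15), (3, 25);
""")

# JOIN version: builds employee x customer pairs per company, then dedups.
join_q = """
SELECT DISTINCT c.NAME FROM COMPANY c
LEFT JOIN EMPLOYEE e ON c.COMPANY_ID = e.COMPANY_ID
LEFT JOIN CUSTOMER u ON c.COMPANY_ID = u.COMPANY_ID
WHERE e.AGE > 40 AND u.AGE < 20
"""

# EXISTS version: probes each sub-table per company, no pair blow-up.
exists_q = """
SELECT c.NAME FROM COMPANY c
WHERE EXISTS (SELECT 1 FROM EMPLOYEE e WHERE e.AGE > 40 AND c.COMPANY_ID = e.COMPANY_ID)
  AND EXISTS (SELECT 1 FROM CUSTOMER u WHERE u.AGE < 20 AND c.COMPANY_ID = u.COMPANY_ID)
"""

print(sorted(r[0] for r in con.execute(join_q)))    # ['Acme']
print(sorted(r[0] for r in con.execute(exists_q)))  # ['Acme']
```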
To know what really happens in your current environment, with your database settings and your data, you need to compare real execution plans (not just EXPLAIN PLAN, which gives only the estimated plan). Only the real execution plan can give the detailed resources used by the query, like CPU and I/O, in addition to the detailed steps used by Oracle (full table scan, joins, etc.).
Try:
ALTER SESSION SET statistics_level = ALL;
<your query>
SELECT * FROM TABLE(dbms_xplan.display_cursor(NULL, NULL, 'ALLSTATS LAST'));
Do not assume, just test.
If I have a data stream that gives me 10 million records a day (Stream A), and another that gives me 1 billion a day (Stream B) what is an efficient way to see if there is an overlap in the data?
More specifically, if there is a customer in Stream A who visits a webpage, and that same customer visits a different webpage in Stream B, how can I tell that the customer visited both webpages?
My initial thought was to put the records into a relational database and do a join, but I know that is very inefficient.
What is a more efficient way to do this? How would I be able to do this using a tool like Hadoop or Spark?
A join should be an efficient way of dealing with this. You should have both data sets ordered, or an index on the CustomerID (and the index would be ordered by CustomerID). Because of the indexing, the SQL engine would know that the sets are ordered and should be able to do the join very efficiently.
If you're only looking for instances where the CustomerID is in both, it might be a SQL query along the lines of:
Select Distinct A.CustomerID
From A
Inner Join B
on A.CustomerID = B.CustomerID
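The same idea can be sketched outside a database: build a hash set from the smaller stream (Stream A's ~10M IDs fit in memory) and probe it while scanning the larger stream once. This is roughly what a broadcast hash join in Spark does; the function name below is made up for illustration:

```python
# Hash-join sketch: one pass over each stream, O(|A| + |B|) time.
def overlapping_customers(stream_a, stream_b):
    seen_in_a = set(stream_a)          # smaller stream: build side
    return {cid for cid in stream_b    # larger stream: probe side
            if cid in seen_in_a}

a = [101, 102, 103, 104]               # customer IDs from Stream A
b = [103, 104, 105, 103]               # customer IDs from Stream B
print(sorted(overlapping_customers(a, b)))  # [103, 104]
```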
I have a table that has 14,091 rows (2 columns, let's say first name, last name). I then have a calendar table that has 553 rows of just dates (first of each month). I do a cross join in order to get every combination of first name, last name, & first of month because this is my requirement. This takes just over a minute.
Is there anything I can do about this to make it faster or can a cross join never get any faster like I suspect?
People Table
first_name varchar2(100)
last_name varchar2(1000)
Dates Table
dt DateTime
select a.first_name, a.last_name, b.dt
from people a, dates b
It will be slow as it is making all possible combinations: 14091 * 553. It is not going to be fast unless you have either an index or an inner join.
Yeah, it takes over a minute. Let's get this clear: you talk of 14091 * 553 rows - that is 7,792,323, or roughly 7.8 million rows. And you are loading them into a data table (which is not known for performance).
Want to see slow? Put them into a grid. THEN you see slow.
The requirements make no sense in a table. None. Absolutely none.
And no, there is no way to speed up the loading of 7.8 million rows into a data structure that is not meant to hold these amounts of data.
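For scale, here is a minimal sketch (SQLite via Python, toy data) of why the row count is fixed by the two table sizes - a cross join always produces |people| * |dates| rows, so the only real lever is producing fewer of them:

```python
import sqlite3

# Tiny sketch of the cross join: the output is always |people| * |dates| rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE people (first_name TEXT, last_name TEXT);
CREATE TABLE dates  (dt TEXT);
INSERT INTO people VALUES ('Ann','Lee'), ('Bob','Kim'), ('Cy','Wu');
INSERT INTO dates  VALUES ('2020-01-01'), ('2020-02-01');
""")
n, = con.execute("SELECT COUNT(*) FROM people a, dates b").fetchone()
print(n)  # 6 == 3 people * 2 dates; at 14091 * 553 that is 7,792,323 rows
```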
I have a table with close to 30 million records and just several columns. One of the columns, 'Born', has no more than 30 distinct values, and there is an index defined on it. I need to be able to filter on that column and efficiently page through the results.
For now I have (example if the year I'm searching for is '1970' - it is a parameter in my stored procedure):
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Every query of that sort (only Born parameter used) returns just over 1 million results.
I've noticed the biggest overhead is the count used to return the total number of results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause, the whole thing speeds up a lot.
Is there a way to speed up the count in that query? What I care about is having the paged results returned together with the total count.
Updated following discussion in comments
The cause of the problem here is very low cardinality of the IX_Person_Born index.
SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.
You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.
This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.
Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.
A high cardinality index (say for full date of birth), rather than year would be much quicker.
This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.
In short: there's no significantly quicker way to get the count using T-SQL and indexes.
You have a couple of options:
1. Data Aggregation
Either OLAP/Cube rollups or do it yourself:
select Born, count(*)
from Person
group by Born
The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.
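A minimal sketch of this rollup idea (SQLite via Python, made-up data; in practice the aggregate would be refreshed on a schedule or by triggers):

```python
import sqlite3

# Sketch of option 1: roll the counts up once, then look them up instantly.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (Name TEXT, Born TEXT)")
con.executemany("INSERT INTO Person VALUES (?, ?)",
                [("p%d" % i, str(1970 + i % 3)) for i in range(90)])

# Build the aggregate table (this is the part that can go stale).
con.execute("""CREATE TABLE BornCounts AS
               SELECT Born, COUNT(*) AS NumRows FROM Person GROUP BY Born""")

total, = con.execute(
    "SELECT NumRows FROM BornCounts WHERE Born = '1970'").fetchone()
print(total)  # 30 -- one tiny lookup instead of counting a million rows
```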
2. Parallel Queries
Split into two queries:
SELECT count(*)
FROM Person
WHERE Born = '1970'
SELECT TOP 30 *
FROM Person
WHERE Born = '1970'
Then run these either in parallel server side, or add it to the user interface.
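A minimal sketch of the two-query approach (SQLite via Python, made-up data; LIMIT stands in for TOP):

```python
import sqlite3

# Sketch of option 2: a cheap COUNT(*) plus a separate TOP-30 page,
# instead of recomputing the count inside the paging query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (Name TEXT, Born TEXT)")
con.executemany("INSERT INTO Person VALUES (?, ?)",
                [("p%d" % i, "1970" if i % 2 == 0 else "1971")
                 for i in range(100)])

total, = con.execute(
    "SELECT COUNT(*) FROM Person WHERE Born = '1970'").fetchone()
page = con.execute(
    "SELECT * FROM Person WHERE Born = '1970' LIMIT 30").fetchall()

print(total, len(page))  # 50 30
```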
3. No-SQL
This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.
At this point a technology change is probably out, but it may be worth investigating so I've included it.
I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging, then I'd probably go with option 2 and an AJAX call to get the count.
DECLARE @TotalPeople int
--does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.
One of the column 'Born' have not more than 30 different values and there is an index defined on it.
Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
Here is a desperate measure that will force SQL Server's hand (at the potential cost of making writes very expensive - measure that - and of blocking schema changes to the Person table while the view exists).
CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born
GO
CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)
By putting a clustered index on a view, you make it a system maintained copy. The size of this copy is much smaller than 30 Million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
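For illustration only: SQLite has no indexed views, but the same "system maintained copy" idea can be sketched with a trigger-maintained summary table (Python, made-up data):

```python
import sqlite3

# The indexed view above is SQL Server specific; this sketches the same
# "system maintained copy" idea with a trigger-maintained count table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person (Name TEXT, Born TEXT);
CREATE TABLE BornCounts (Born TEXT PRIMARY KEY, NumRows INTEGER);
CREATE TRIGGER person_ins AFTER INSERT ON Person BEGIN
  INSERT OR IGNORE INTO BornCounts VALUES (NEW.Born, 0);
  UPDATE BornCounts SET NumRows = NumRows + 1 WHERE Born = NEW.Born;
END;
""")
con.executemany("INSERT INTO Person VALUES (?, ?)",
                [("p%d" % i, str(1970 + i % 2)) for i in range(10)])
total, = con.execute(
    "SELECT NumRows FROM BornCounts WHERE Born = '1970'").fetchone()
print(total)  # 5 -- the summary stays in sync as rows are inserted
```

The trade-off is the same as with the indexed view: every write to Person now also pays for maintaining the copy.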
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT max(Row) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Why not like that?
Here is a novel approach using system DMVs, if you can get by with a "good enough" count, don't mind creating an index for every distinct value of [Born], and don't mind feeling a little bit dirty inside.
Create a filtered index for each year:
--pick a column to index, it doesn't matter which.
CREATE INDEX IX_Person_filt_1970 on Person ( id ) WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id ) WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id ) WHERE Born = '1972'
Then use the [rows] column from sys.partitions to get a rowcount.
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *,
(
SELECT sum(rows)
FROM sys.partitions p
inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id
inner join sys.tables t on t.object_id = i.object_id
WHERE t.name ='Person'
and i.name = 'IX_Person_filt_' + '1970' --or use @p1
) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close). This approach won't work if you need to filter on anything but [Born].
I have a database which is 6 GB in size, with a multitude of tables; however, the smaller queries seem to have the most problems. I want to know what can be done to optimise them. For example, there are Stock, Items and Orders tables.
The Stock table holds the items in stock; it has around 100,000 records with 25 fields storing ProductCode, Price and other stock-specific data.
The Items table stores the information about the items; there are over 2,000,000 of these, with over 50 fields storing item names and other details about the item or product in question.
The Orders table stores the orders of stock items - when the order was placed plus the price sold for - and has around 50,000 records.
Here is a query from this Database:
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
Given the information here, what could be done to improve this query? There are others like it - what alternatives are there? The nature of the data and the database cannot be detailed further, however, so general optimisation tricks and methods are fine, or anything which applies to databases generally.
The Database is MS SQL 2000 on Windows Server 2003 with the latest service packs for each.
DB Upgrade / OS Upgrade are not options for now.
Edit
Indices are Stock.SKU, Items.ProductCode and Orders.OrderID on the tables mentioned.
The execution plan shows 13-16 seconds for a query like this, with 75% of the time spent in Stock.
Thanks for all the responses so far - indexing seems to be the problem. All the different examples given have been helpful, despite a few mistakes in the query, and this has helped me a lot. Some of these queries have run quicker, and combined with the index suggestions I think I might be on the right path now. Thanks for the quick responses - they have really helped me and made me consider things I did not think of or know about before!
Indexes ARE my problem. I added one to the foreign key with Orders (Customer) and this
has improved performance, halving the execution time!
Looks like I got tunnel vision and focused on the query - I have been working with DBs for a couple of years now, but this has been very helpful. Thanks also for all the query examples; they use combinations and features I had not considered and which may be useful too!
Is your code correct? I'm sure you're missing something:
INNER JOIN Batch ON Order.OrderID = Orders.OrderID
and you have a stray ) in the code ...
you can always test some variants against the execution plan tool, like
SELECT
s.SKU, i.Name, s.ProductCode
FROM
Stock s, Orders o, Batch b, Items i
WHERE
b.OrderID = o.OrderID AND
s.ProductCode = i.ProductCode AND
s.Status IN (1, 2) AND
o.Customer = 12345
ORDER BY
o.OrderDate DESC;
and you should return just a fraction, like TOP 10... it will take some milliseconds to choose the TOP 10, but you will save plenty of time when binding the result to your application.
The most important (if not already done): define your primary keys for the tables (if not already defined) and add indexes for the foreign keys and for the columns you are using in the joins.
Did you specify indexes? On
Items.ProductCode
Stock.ProductCode
Orders.OrderID
Orders.Customer
Sometimes, IN could be faster than OR, but this is not as important as having indexes.
See balexandre's answer; your query looks wrong.
Some general pointers
Are all of the fields that you are joining on indexed?
Is the ORDER BY necessary?
What does the execution plan look like?
BTW, you don't seem to be referencing the Order table in the question query example.
Table indexes will certainly help, as Cătălin Pitiș suggested.
Another trick is to reduce the size of the joined rows by using a sub-select or, more extreme, temp tables. For example, rather than joining on the whole Orders table, join on
(SELECT * FROM Orders WHERE Customer = 12345)
Also, don't join directly on the Stock table; join on
(SELECT * FROM Stock WHERE Status = 1 OR Status = 2)
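A runnable sketch of this derived-table trick (SQLite via Python; table and column names follow the question, data is made up):

```python
import sqlite3

# Sketch of the derived-table trick: filter each side before the join.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders (OrderID INTEGER, Customer INTEGER, OrderDate TEXT);
CREATE TABLE Stock  (SKU TEXT, OrderID INTEGER, Status INTEGER, ProductCode TEXT);
CREATE TABLE Items  (ProductCode TEXT, Name TEXT);
INSERT INTO Orders VALUES (1, 12345, '2024-01-02'), (2, 99999, '2024-01-03');
INSERT INTO Stock  VALUES ('S1', 1, 1, 'P1'), ('S2', 1, 3, 'P2'), ('S3', 2, 2, 'P1');
INSERT INTO Items  VALUES ('P1', 'Widget'), ('P2', 'Gadget');
""")
rows = con.execute("""
SELECT s.SKU, i.Name, s.ProductCode
FROM (SELECT * FROM Orders WHERE Customer = 12345) o
JOIN (SELECT * FROM Stock WHERE Status IN (1, 2)) s ON o.OrderID = s.OrderID
JOIN Items i ON s.ProductCode = i.ProductCode
ORDER BY o.OrderDate DESC
""").fetchall()
print(rows)  # [('S1', 'Widget', 'P1')]
```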
Setting the correct indexes on the tables is usually what makes the biggest difference for performance.
In Management Studio (or Query Analyzer for earlier versions) you can choose to view the execution plan of the query when you run it. In the execution plan you can see what the database is really doing to get the result, and which parts take the most work. There are some things to look for there, like table scans, which are usually the most costly part of a query.
The primary key of a table normally has an index, but you should verify that it's actually so. Then you probably need indexes on the fields that you use to look up records, and fields that you use for sorting.
Once you have added an index, you can rerun the query and see in the execution plan if it's actually using the index. (You may need to wait a while after creating the index for the database to build the index before it can use it.)
Could you give it a go?
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID AND (Order.Customer = 12345) AND (Stock.Status = 1 OR Stock.Status = 2)
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
ORDER BY Order.OrderDate DESC;
Elaborating on what Cătălin Pitiș said already: in your query
SELECT Stock.SKU, Items.Name, Stock.ProductCode
FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
the criterion Order.Customer = 12345 looks very specific, whereas (Stock.Status = 1 OR Stock.Status = 2) sounds unspecific. If this is correct, an efficient query consists of
1) first finding the orders belonging to a specific customer,
2) then finding the corresponding rows of Stock (with same OrderID) filtering out those with Status in (1, 2),
3) and finally finding the items with the same ProductCode as the rows of Stock in 2)
For 1) you need an index on Customer for the table Order, for 2) an index on OrderID for the table Stock and for 3) an index on ProductCode for the table Items.
As long as your query does not become much more complicated (like being a subquery in a bigger query, or Stock, Order and Items being only views, not tables), the query optimizer should be able to find this plan from your query alone. Otherwise, you'll have to do what kuoson is suggesting (but the 2nd suggestion does not help if Status IN (1, 2) is not very selective and/or Status is not indexed on the table Stock). But also remember that keeping indexes up to date costs performance if you do many inserts/updates on the table.
To shorten my answer I gave 2 hours ago (when my cookies were switched off):
You need three indexes: Customer for table Order, OrderID for Stock and ProductCode for Items.
If you miss any of these, you'll have to wait for a complete table scan on the according table.