If I have a data stream that gives me 10 million records a day (Stream A), and another that gives me 1 billion a day (Stream B), what is an efficient way to see if there is an overlap in the data?
More specifically, if there is a customer in Stream A who visits a webpage, and that same customer visits a different webpage in Stream B, how can I tell that the customer visited both webpages?
My initial thought was to put the records into a relational database and do a join, but I know that is very inefficient.
What is a more efficient way to do this? How would I be able to do this using a tool like Hadoop or Spark?
A join should be an efficient way of dealing with this. You should have both data sets ordered, or an index on the CustomerID (and the index would be ordered by CustomerID). Because of the indexing, the SQL engine would know that the sets are ordered and should be able to do the join very efficiently.
If you're only looking for instances where the CustomerID is in both, it might be a SQL query along the lines of:
Select Distinct A.CustomerID
From A
Inner Join B
on A.CustomerID = B.CustomerID
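If the data lives in flat streams rather than a database, the same idea works outside SQL: build a hash set from the smaller stream (A, ~10M IDs fits comfortably in memory) and probe it while scanning the larger one. This is essentially what a hash join does, and what Spark performs when it broadcasts the small side. A minimal sketch with illustrative IDs:

```python
# Hash-join sketch: materialize the smaller stream's customer IDs
# as a set, then stream the larger side past it.

def overlapping_customers(stream_a_ids, stream_b_ids):
    """Return the customer IDs that appear in both streams."""
    seen_in_a = set(stream_a_ids)          # build side: the small stream
    return {cid for cid in stream_b_ids if cid in seen_in_a}  # probe side

overlap = overlapping_customers([1, 2, 3], [3, 4, 2])  # {2, 3}
```

In Spark the equivalent is a join of the two DataFrames on the customer ID, with the smaller side broadcast so the large one never shuffles.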
The query I need is simple, and may already be answered in another question, but there is a performance aspect to what I need, so:
I have a users table with 10,000 rows; the table contains id, email and more data.
Another table called orders has far more rows, around 150,000.
Each order stores the id of the user who placed it, and also a status for the order. The status can be a number from 0 to 9 (or null).
My final requirement is to get every user with their id, email, some other columns, and the number of their orders with status 3 or 7. It doesn't matter whether it's 3 or 7; I just need the count.
But I need to run this query in a low-impact (performant) way.
What is the best approach?
I need to run this in Redash against Postgres 10.
This sounds like a join and group by. But since you want every user, including those with no orders of status 3 or 7, use a left join and put the status filter in the join condition:
select u.*, count(o.user_id)
from users u left join
     orders o
     on o.user_id = u.user_id and o.status in (3, 7)
group by u.user_id;
Postgres is usually pretty good about optimizing these queries -- and the above assumes that users(user_id) is the primary key (which is what allows u.* alongside group by u.user_id) -- so this should work pretty well.
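Note that a plain inner join would silently drop users who have no qualifying orders; placing the status filter inside the LEFT JOIN's ON clause keeps those users with a count of 0. A minimal sketch using Python's sqlite3 (the data is invented; table and column names follow the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     user_id INTEGER, status INTEGER);
INSERT INTO users  VALUES (1, 'a@x.com'), (2, 'b@x.com');
INSERT INTO orders VALUES (10, 1, 3), (11, 1, 7), (12, 1, 5), (13, 2, 1);
""")

rows = con.execute("""
    SELECT u.user_id, u.email, COUNT(o.order_id) AS n_orders
    FROM users u
    LEFT JOIN orders o
           ON o.user_id = u.user_id AND o.status IN (3, 7)
    GROUP BY u.user_id, u.email
    ORDER BY u.user_id
""").fetchall()
# user 1 has two orders with status 3 or 7; user 2 still appears, with 0
```

Moving the filter into a WHERE clause instead would turn the left join back into an inner join, because rows with a NULL o.status fail the predicate.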
I'm using Hive on MRv2, and I'm trying to optimize Hive queries.
The database models the purchase history of a convenience store. It contains six tables (customers: 1M rows, shops: 1K rows, employees: 5K rows, genres: 30 rows, items: 3.5K rows, purchase_histories: 1G rows), and I wrote a query that retrieves the sum of the number purchased for each item, genre and customer gender.
SELECT c.gender,
g.name,
i.name,
Sum(ph.num)
FROM purchase_histories ph
JOIN customers c
ON ( c.id = ph.cus_id
AND ph.dt < $var1
AND ph.dt > $var2 )
JOIN items i
ON ( i.id = ph.item_id )
JOIN genres g
ON ( g.id = i.gen_id )
GROUP BY c.gender,
g.name,
i.name;
I partitioned purchase_histories by dt, items by gen_id, and customers by (gender, byear).
I compared this database with an unpartitioned database (containing the same tables) using the query above. I chose values for $var1 and $var2 so that about 10,000,000 rows of purchase_histories would be referenced.
I measured the processing time and found the unpartitioned database to be as fast or faster. I checked the execution logs and found that the partitioned database used about 10~30 mappers while the unpartitioned one used about 150. I don't think more mappers is always better, but 10~30 mappers seems too few. So I suspect I need to change some configuration for mapper count or memory size, but I don't know which setting to change, or whether my reasoning is correct.
The EXPLAIN results are no_partitions and partitioned, and the execution logs are exe_log_no_partition and exe_log_partitioned.
Thanks.
Addition
1. I looked at the EXPLAIN result for the partitioned database and think the number of mappers is calculated from the formula below:
(the table size 2619958583) / (mapreduce.input.fileinputformat.split.maxsize = 256000000)
Is that wrong?
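For what it's worth, the formula as stated does land in the observed range; a quick check of the arithmetic (numbers taken from the question):

```python
import math

# values from the question's EXPLAIN output and config
table_size = 2619958583   # bytes
split_max  = 256000000    # mapreduce.input.fileinputformat.split.maxsize

mappers = math.ceil(table_size / split_max)
print(mappers)  # 11 -- consistent with the observed 10~30 mappers
```

Lowering the split maxsize (or setting mapreduce.input.fileinputformat.split.minsize appropriately) is the usual way to get more, smaller splits and hence more mappers.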
Good morning/afternoon! I was hoping someone could help me out with something that probably should be very simple.
Admittedly, I'm not the strongest SQL query designer. That said, I've spent a couple of hours beating my head against my keyboard trying to get a seemingly simple three-way join working.
NOTE: I'm querying a Vertica DB.
Here is my query:
SELECT A.CaseOriginalProductNumber, A.CaseCreatedDate, A.CaseNumber, B.BU2_Key as BusinessUnit, C.product_number_desc as ModelNumber
FROM pps_sfdc.v_Case A
INNER JOIN reference_data.DIM_PRODUCT_LINE_HIERARCHY B
ON B.PL_Key = A.CaseOriginalProductLine
INNER JOIN reference_data.DIM_PRODUCT C
ON C.product_line_code = A.CaseOriginalProductLine
WHERE B.BU2_Key = 'XWT'
LIMIT 20
I have a view (v_Case) that I'm trying to join to two other tables so I can look up a value from each of them. The above query returns identical data in everything EXCEPT the last column (see below). It's as if it's iterating through the last column to pull out the unique entries, rather like a "GROUP BY" clause. What SHOULD happen is that I get unique rows with a specific "BusinessUnit" and "ModelNumber" for each record.
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 1
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 2
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 3
DUMEPRINT 5/2/2014 8:56:27 AM 3002845327 JJT Product 4
I modeled my solution after this post:
How to deal with multiple lookup tables for beginners of SQL?
What am I doing wrong?
Thank you for any help you can provide.
Data issue. A general rule when troubleshooting these: the column that is distinct for each record (in this case C.product_number_desc as ModelNumber) is usually where the issue is going to be...and why I pointed you towards dim_product.
If you are getting duplicates, the query below will help identify whether this table is causing your issues. Remember that key in this statement can be multiple fields...whatever you are joining the table on:
select key, count(1) from table group by key having count(1) > 1
Other advice for the future...don't assume it's your code; duplicates like this almost always point towards dirty data (the other option is that you are causing cross joins because the keys are not correct). If you comment out the 'c' table and the column referring to it in the select clause, you would have received one row...hence your dupes were coming from the 'c' table here.
Good luck with it
I am working on an application that allows users to build a "book" from a number of "pages" and then place them in any order that they'd like. It's possible that multiple people can build the same book (the same pages in the same order). The books are built by the user prior to them being processed and printed, so I need to group books together that have the same exact layout (the same pages in the same order). I've written a million queries in my life, but for some reason I can't grasp how to do this.
I could simply write a big SELECT query, and then loop through the results and build arrays of objects that have the same pages in the same sequence, but I'm trying to figure out how to do this with one query.
Here is my data layout:
dbo.Books
BookId
Quantity
dbo.BookPages
BookId
PageId
Sequence
dbo.Pages
PageId
DocName
So, I need some clarification on a few things:
Once a user orders the pages the way they want, are they saved back down to a database?
If yes, then is the question to run a query to group book orders that have the same page-numbering, so that they are sent to the printers in an optimal way?
OR, does the user lay out the pages and then send the order directly to the printer? If so, it seems more complicated/less efficient to capture requested print jobs and order them on-the-fly on the way out to the printers ...
What language/technology are you using to create this solution? .NET? Java?
With the answers to these questions, I can better gauge what you need.
With the answers to my questions, I also assume that:
You are using some type of many-to-many table to store customer page ordering. If so, then you'll need to write a query to select distinct page-orderings, and group by those page orderings. This is possible with a single SQL query.
However, if you feel you want more control over how this data is joined, then doing this programmatically may be the way to go, although you will lose performance by reading in all the data, and then outputting that data in a way that is consumable by your printers.
Two books are identical only if each book's page count equals the matched-pair count.
The question was tagged TSQL when I started; this may not be the same syntax in other SQL dialects.
;WITH BookPageCount
AS
(
    select bp1.BookId, COUNT(*) as [individualCount]
    from BookPages bp1 with (nolock)
    group by bp1.BookId
),
BookCombinedCount
AS
(
    select bp1.BookId as [book1Id], bp2.BookId as [book2Id], COUNT(*) as [combinedCount]
    from BookPages bp1 with (nolock)
    join BookPages bp2 with (nolock)
        on bp1.BookId < bp2.BookId
        and bp1.Sequence = bp2.Sequence
        and bp1.PageId = bp2.PageId
    group by bp1.BookId, bp2.BookId
)
select BookCombinedCount.book1Id, BookCombinedCount.book2Id
from BookCombinedCount
join BookPageCount as book1 on book1.BookId = BookCombinedCount.book1Id
join BookPageCount as book2 on book2.BookId = BookCombinedCount.book2Id
where BookCombinedCount.combinedCount = book1.individualCount
  and BookCombinedCount.combinedCount = book2.individualCount
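If a single query proves unwieldy, the grouping can also be done in application code: key each book by its pages in sequence order and bucket books sharing the same key. A minimal Python sketch with made-up rows shaped like dbo.BookPages:

```python
from collections import defaultdict

# (BookId, PageId, Sequence) rows, as in dbo.BookPages (data invented)
book_pages = [
    (1, 'p1', 1), (1, 'p2', 2),
    (2, 'p1', 1), (2, 'p2', 2),   # same layout as book 1
    (3, 'p2', 1), (3, 'p1', 2),   # same pages, different order
]

# collect each book's pages
pages_by_book = defaultdict(list)
for book_id, page_id, seq in book_pages:
    pages_by_book[book_id].append((seq, page_id))

# key = the page IDs in sequence order; identical keys => identical books
groups = defaultdict(list)
for book_id, pages in pages_by_book.items():
    layout = tuple(page for _, page in sorted(pages))
    groups[layout].append(book_id)

# books 1 and 2 group together; book 3 stands alone
```

This mirrors the SQL approach: the layout tuple plays the role of the pairwise sequence/page match, and the grouping replaces the count comparison.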
I have a database which is 6GB in size with a multitude of tables; however, the smaller queries seem to have the most problems, and I want to know what can be done to optimise them. For example, there are Stock, Items and Orders tables.
The Stock table holds the items in stock; it has around 100,000 records with 25 fields storing ProductCode, Price and other stock-specific data.
The Items table stores information about the items; there are over 2,000,000 of these, with over 50 fields storing item names and other details about the item or product in question.
The Orders table stores the orders of stock items, i.e. when the order was placed plus the price sold for, and has around 50,000 records.
Here is a query from this Database:
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
Given the information here, what could be done to improve this query? There are others like it; what alternatives are there? The nature of the data and the database cannot be detailed further, however, so general optimisation tricks and methods, or anything which applies generally to databases, are fine.
The Database is MS SQL 2000 on Windows Server 2003 with the latest service packs for each.
DB Upgrade / OS Upgrade are not options for now.
Edit
Indexes exist on Stock.SKU, Items.ProductCode and Orders.OrderID on the tables mentioned.
The execution plan shows 13-16 seconds for a query like this, with 75% of the time spent in Stock.
Thanks for all the responses so far -- indexing seems to be the problem. All the different examples given have been helpful (despite a few mistakes in the queries), and some of these queries have run quicker; combined with the index suggestions I think I'm on the right path now. Thanks for the quick responses -- this has really helped me and made me consider things I did not think of or know about before!
Indexes ARE my problem: I added one on the foreign key to Orders (Customer) and this has halved execution time!
Looks like I had tunnel vision and focused on the query -- I have been working with DBs for a couple of years now, but this has been very helpful. Thanks for all the query examples; they use combinations and features I had not considered, which may be useful too!
Is your code correct? I'm sure you're missing something:
INNER JOIN Batch ON Order.OrderID = Orders.OrderID
and you have a stray ) in the code ...
you can always test some variants against the execution plan tool, like
SELECT
s.SKU, i.Name, s.ProductCode
FROM
Stock s, Orders o, Batch b, Items i
WHERE
b.OrderID = o.OrderID AND
s.ProductCode = i.ProductCode AND
s.Status IN (1, 2) AND
o.Customer = 12345
ORDER BY
o.OrderDate DESC;
and you should return just a fraction of the rows, like TOP 10... it takes some milliseconds to choose just the TOP 10, but you will save plenty of time when binding the results to your application.
Most important (if not already done): define primary keys for your tables, and add indexes on the foreign keys and on the columns you are using in the joins.
Did you specify indexes? On
Items.ProductCode
Stock.ProductCode
Orders.OrderID
Orders.Customer
Sometimes IN can be faster than OR, but this is not as important as having indexes.
See balexandre's answer; your query looks wrong.
Some general pointers
Are all of the fields that you are joining on indexed?
Is the ORDER BY necessary?
What does the execution plan look like?
BTW, you don't seem to be referencing the Order table in the question query example.
A table index will certainly help, as Cătălin Pitiș suggested.
Another trick is to reduce the size of the joined rows by using a sub-select or, more extreme, temp tables. For example, rather than joining on the whole Orders table, join on
(SELECT * FROM Orders WHERE Customer = 12345)
also, don't join directly on Stock table join on
(SELECT * FROM Stock WHERE Status = 1 OR Status = 2)
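A small sqlite3 sketch of the pre-filtered (derived table) join, using made-up rows under the question's table names: each side is reduced before the join, so only the filtered rows are matched:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Customer INTEGER, OrderDate TEXT);
CREATE TABLE Stock  (SKU TEXT, OrderID INTEGER, Status INTEGER, ProductCode TEXT);
INSERT INTO Orders VALUES (1, 12345, '2009-01-02'), (2, 99999, '2009-01-03');
INSERT INTO Stock  VALUES ('S1', 1, 1, 'P1'), ('S2', 1, 5, 'P1'),
                          ('S3', 2, 2, 'P2');
""")

rows = con.execute("""
    SELECT s.SKU
    FROM (SELECT * FROM Orders WHERE Customer = 12345) o
    JOIN (SELECT * FROM Stock WHERE Status IN (1, 2)) s
      ON s.OrderID = o.OrderID
""").fetchall()
# only S1 passes both filters: customer 12345's order, status 1
```

On a modern optimizer the derived-table form and the plain WHERE form usually produce the same plan, but on older engines like SQL 2000 spelling out the reduction can help.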
Setting the correct indexes on the tables is usually what makes the biggest difference for performance.
In Management Studio (or Query Analyzer in earlier versions) you can choose to view the execution plan of the query when you run it. In the execution plan you can see what the database is really doing to get the result, and which parts take the most work. There are some things to look out for there, like table scans, which are usually the most costly parts of a query.
The primary key of a table normally has an index, but you should verify that it's actually so. Then you probably need indexes on the fields that you use to look up records, and fields that you use for sorting.
Once you have added an index, you can rerun the query and see in the execution plan if it's actually using the index. (You may need to wait a while after creating the index for the database to build the index before it can use it.)
Could you give it a go?
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID AND (Order.Customer = 12345) AND (Stock.Status = 1 OR Stock.Status = 2)
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
ORDER BY Order.OrderDate DESC;
Elaborating on what Cătălin Pitiș said already: in your query
SELECT Stock.SKU, Items.Name, Stock.ProductCode
FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
the criterion Order.Customer = 12345 looks very selective, whereas (Stock.Status = 1 OR Stock.Status = 2) sounds unselective. If this is correct, an efficient query consists of
1) first finding the orders belonging to the specific customer,
2) then finding the corresponding rows of Stock (with the same OrderID), keeping only those with Status in (1, 2),
3) and finally finding the items with the same ProductCode as the rows of Stock from 2).
For 1) you need an index on Customer for the table Order, for 2) an index on OrderID for the table Stock, and for 3) an index on ProductCode for the table Items.
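As a concrete illustration (using sqlite3 purely as a stand-in, with the question's table names; SQL Server syntax for CREATE INDEX is essentially the same), creating those three indexes and confirming the optimizer picks one up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Customer INTEGER, OrderDate TEXT);
CREATE TABLE Stock  (SKU TEXT, OrderID INTEGER, Status INTEGER, ProductCode TEXT);
CREATE TABLE Items  (ProductCode TEXT, Name TEXT);

-- the three indexes described above (names are illustrative)
CREATE INDEX ix_orders_customer   ON Orders (Customer);
CREATE INDEX ix_stock_orderid     ON Stock  (OrderID);
CREATE INDEX ix_items_productcode ON Items  (ProductCode);
""")

# EXPLAIN QUERY PLAN is sqlite's (much simpler) analogue of the
# execution plan viewer: the Customer lookup now uses the index
# instead of a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Orders WHERE Customer = 12345"
).fetchall()
uses_index = any("ix_orders_customer" in str(row) for row in plan)
```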
As long as your query does not become much more complicated (like being a subquery in a bigger query, or Stock, Order and Items being only views, not tables), the query optimizer should be able to find this plan from your query as written. Otherwise, you'll have to do what kuoson is suggesting (though the 2nd suggestion does not help if Status in (1, 2) is not very selective and/or Status is not indexed on the table Stock). But also remember that keeping indexes up to date costs performance if you do many inserts/updates on the table.
To shorten the answer I gave 2 hours ago (when my cookies were switched off):
You need three indexes: Customer for table Order, OrderID for Stock and ProductCode for Items.
If you miss any of these, you'll have to wait for a complete table scan on the corresponding table.