Query a dynamic deduplicated table - google-bigquery

I am using BigQuery to give my colleagues access to aggregated data in our system.
I have a raw_orders table where I store order data. The rows in this table can change over time. When a change occurs, I add a new row to this table. So my table looks like this:
+-----+-------+---------------------+---------------------+
| id  | total | created_at          | updated_at          |
+-----+-------+---------------------+---------------------+
| ABC | 15.76 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| ABC | 12.43 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
| DEF | 19.03 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| DEF | 12.03 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
+-----+-------+---------------------+---------------------+
To allow my collaborators to query on a deduplicated table easily, I created a view of deduplicated lines using:
CREATE OR REPLACE VIEW xxx.orders AS
SELECT ro2.*
FROM (
  -- latest update timestamp per order id
  SELECT ro.id, MAX(ro.updated_at) AS max_updated_at
  FROM xxx.raw_orders ro
  GROUP BY ro.id
) tmp
INNER JOIN xxx.raw_orders ro2
  ON ro2.id = tmp.id AND ro2.updated_at = tmp.max_updated_at
ORDER BY ro2.created_at DESC
This works great, but I feel that I am spending too much budget on simple requests like:
SELECT * FROM rubee.orders WHERE created_at > '2020-11-01 00:00:00';
If I understand correctly, because of the view step, BigQuery must read a lot of storage to deduplicate the rows before returning a single result.
Am I doing something wrong here? How do you give access to deduplicated data without spending too much storage? Would you have a better strategy for what I am trying to do?

Ideally you would use a materialized view for this, but right now BigQuery has limited support for materialized views, and you cannot create one to replace the view you are using.
It is possible to create a materialized view for the inner query, which may make the whole query less expensive, but please read on.
Cost. There is no simple answer to whether you are "spending too much budget" on the query.
If you are on the pay-per-query plan and charged by "processed bytes", then although the query is more expensive for BigQuery to process, you are charged no more than for scanning the whole table once (although technically the table is scanned more than once). In other words, deduplication is free. However, if your query pattern would allow you to cluster/partition your table to avoid scanning the whole table, then this "self-join" view does prevent you from saving budget.
If you have a slot reservation, then you will benefit from making the query faster.
Suggestions. Given that the situation differs case by case, the general suggestions are:
If possible, separate the data into "archived" and "active" so that the "archived" data stays deduplicated (and partitioned/clustered to allow efficient search), and you only need a view to dedup the "active" data.
Creating a materialized view (on the inner "GROUP BY" query) may speed up the query a bit but will not necessarily make it "cheaper"; you may be charged for the size of the base table plus the materialized view.
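The "archived" / "active" split suggested above can be sketched as follows. This is only an illustration that uses SQLite through Python as a stand-in for BigQuery: the table names orders_archived and raw_orders_active are hypothetical, and it assumes an archived order never receives further updates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical split: orders_archived holds one final row per order,
-- raw_orders_active holds every version of orders that may still change.
CREATE TABLE orders_archived (id TEXT, total REAL, created_at TEXT, updated_at TEXT);
CREATE TABLE raw_orders_active (id TEXT, total REAL, created_at TEXT, updated_at TEXT);

INSERT INTO orders_archived VALUES
  ('OLD', 9.99, '2019-06-01 08:00:00', '2019-06-03 09:30:00');
INSERT INTO raw_orders_active VALUES
  ('ABC', 15.76, '2020-01-01 12:56:32', '2020-01-02 14:58:43'),
  ('ABC', 12.43, '2020-01-01 12:56:32', '2020-01-01 12:56:32'),
  ('DEF', 19.03, '2020-01-01 12:56:32', '2020-01-02 14:58:43'),
  ('DEF', 12.03, '2020-01-01 12:56:32', '2020-01-01 12:56:32');

-- Only the (small) active table goes through the latest-row-per-id join.
CREATE VIEW orders AS
SELECT id, total, created_at, updated_at FROM orders_archived
UNION ALL
SELECT a.id, a.total, a.created_at, a.updated_at
FROM raw_orders_active a
JOIN (
  SELECT id, MAX(updated_at) AS max_updated_at
  FROM raw_orders_active
  GROUP BY id
) latest ON latest.id = a.id AND latest.max_updated_at = a.updated_at;
""")

rows = conn.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
print(rows)
```

In BigQuery the archived table would additionally be partitioned or clustered (for example on created_at) so that date filters can skip most of it; that part is not modelled here.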

Related

How can I improve KQL query for large dataset for heatmap

I have a KQL query below which provides a really nice heatmap of top access by country for Azure WAF.
The challenge here is that this query cannot go beyond 24 hours because the number of records I have is way too big. How can I improve this to display weekly and monthly stats as well?
// source: https://datahub.io/core/geoip2-ipv4
set notruncation;
let CountryDB=externaldata(Network:string, geoname_id:string, continent_code:string, continent_name:string, country_iso_code:string, country_name:string)
[@"https://datahub.io/core/geoip2-ipv4/r/geoip2-ipv4.csv"]
| extend Dummy=1;
let AppGWAccess = AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS"
| where Category == "ApplicationGatewayAccessLog"
| where userAgent_s !in ("bot")
| project TimeGenerated, clientIP_s;
AppGWAccess
| extend Dummy=1
| summarize count() by Hour=bin(TimeGenerated,6h), clientIP_s,Dummy
| partition by Hour(
lookup (CountryDB|extend Dummy=1) on Dummy
| where ipv4_is_match(clientIP_s, Network)
)
| summarize sum(count_) by country_name
What you're doing is creating hourly aggregations over all the data. Instead, you should create a Materialized View that will do the aggregations in the background for you.
Quoting the documentation:
Materialized views expose an aggregation query over a source table. Materialized views always return an up-to-date result of the aggregation query (always fresh). Querying a materialized view is more performant than running the aggregation directly over the source table, which is performed each query.
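To make the idea concrete, here is a small conceptual model in Python of what pre-aggregation buys you. It is not KQL and not how Kusto implements materialized views; the class PreAggregatedHits and the sample IP addresses are made up for illustration. Each record is folded into a (6-hour bin, client IP) counter when it arrives, so a weekly or monthly report only sums the small pre-aggregated rows instead of rescanning the raw log.

```python
from collections import Counter
from datetime import datetime


def hour_bin(ts: datetime, hours: int = 6) -> datetime:
    """Round a timestamp down to the start of its N-hour bin."""
    return ts.replace(hour=ts.hour - ts.hour % hours, minute=0, second=0, microsecond=0)


class PreAggregatedHits:
    """Toy stand-in for a materialized view keyed by (bin start, client IP)."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, ts: datetime, client_ip: str) -> None:
        # Aggregation happens once, at ingestion time.
        self.counts[(hour_bin(ts), client_ip)] += 1

    def hits_by_ip(self, start: datetime, end: datetime) -> Counter:
        # Reports only touch the aggregated rows, never the raw records.
        totals = Counter()
        for (bin_start, ip), n in self.counts.items():
            if start <= bin_start < end:
                totals[ip] += n
        return totals


view = PreAggregatedHits()
view.ingest(datetime(2024, 1, 1, 1, 15), "203.0.113.7")
view.ingest(datetime(2024, 1, 1, 7, 30), "203.0.113.7")
view.ingest(datetime(2024, 1, 2, 23, 59), "198.51.100.4")

weekly = view.hits_by_ip(datetime(2024, 1, 1), datetime(2024, 1, 8))
print(dict(weekly))
```

The country lookup would then run against the per-IP totals, which are far fewer rows than the raw access log.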

How to flatten a one-to-many relationship

While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables that look like
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME | EVENT_DATE
---------------------------------------
1 | 1 | ACC_APPLIED | 2014-01-01
2 | 1 | ACC_OPENED | 2014-01-02
3 | 1 | ACC_CLOSED | 2014-01-02
There is a one-to-many relationship between master and the events table. Since there is a limited number of event names, I am proposing that we denormalize this structure into something that looks like
ID | CUST_NAME | CUST_EMAIL | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID |ACC_CLOSE_DATE_ID
-----------------------------------------------------------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM | 20140101 | 20140102 | 20140103
The DATE_ID columns refer to entries inside the time dimension table.
First question: Is this a good idea? What are the alternatives to this scheme?
Second question: How do I implement this using Talend Open Studio? I figured out a way in which I moved the data for each event name into its own temporary table along with cust_id using the tMap component, and later linked them together using another tMap. Is there another way to do this in Talend?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yes, this is a good idea; it is called an accumulating snapshot fact table. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I'm not sure how to do this in Talend (I don't know the tool), but it would be quite easy to implement in SQL using a CASE or PIVOT statement.
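As a rough sketch of the CASE approach, the snippet below pivots the sample events into one row per customer. It uses SQLite through Python purely for illustration; the lower-case table and column names mirror the question, and deriving the date id by stripping the dashes from the date is an assumption about how the time dimension is keyed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (id INTEGER, cust_name TEXT, cust_email TEXT);
CREATE TABLE events (id INTEGER, cust_id INTEGER, event_name TEXT, event_date TEXT);
INSERT INTO master VALUES (1, 'FOO', 'FOO_BAR#EXAMPLE.COM');
INSERT INTO events VALUES
  (1, 1, 'ACC_APPLIED', '2014-01-01'),
  (2, 1, 'ACC_OPENED',  '2014-01-02'),
  (3, 1, 'ACC_CLOSED',  '2014-01-02');
""")

# One MAX(CASE ...) per event name turns event rows into columns.
# The date id is assumed to be the date written as YYYYMMDD.
pivot_sql = """
SELECT m.id, m.cust_name, m.cust_email,
       MAX(CASE WHEN e.event_name = 'ACC_APPLIED'
                THEN CAST(REPLACE(e.event_date, '-', '') AS INTEGER) END) AS acc_app_date_id,
       MAX(CASE WHEN e.event_name = 'ACC_OPENED'
                THEN CAST(REPLACE(e.event_date, '-', '') AS INTEGER) END) AS acc_open_date_id,
       MAX(CASE WHEN e.event_name = 'ACC_CLOSED'
                THEN CAST(REPLACE(e.event_date, '-', '') AS INTEGER) END) AS acc_close_date_id
FROM master m
LEFT JOIN events e ON e.cust_id = m.id
GROUP BY m.id, m.cust_name, m.cust_email
"""

flattened = conn.execute(pivot_sql).fetchall()
print(flattened)
```

A customer with no event of a given type simply gets NULL in that column, which matches how an accumulating snapshot row is filled in over time.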
Regarding only your first question, it's certainly a good idea -- unless there is any possibility of the same persons applying-opening-closing their account more than once AND you want to keep all this information in their history (so UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse, so denormalizing will certainly be a good choice in this case. The following article fits such scenarios almost perfectly:
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/

Vertica and joins

I'm adapting a web analysis tool to use Vertica as the DB. I'm having real problems optimizing joins. I tried creating pre-join projections for some of my queries, and while it did make the queries blazing fast, it slowed data loading into the fact table to a crawl.
A simple INSERT INTO ... SELECT * FROM which we use to load data into the fact table from a staging table goes from taking ~5 seconds to taking 20+ minutes.
Because of this I dropped all pre-join projections and tried using the Database Designer to design query specific projections but it's not enough. Even with those projections a simple join is taking ~14 seconds, something that takes ~1 second with a pre-join projection.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
We're running Vertica on a 5 node cluster, each node having 2 x quad core CPU and 32 GB of memory. The tables in my example query have 188,843,085 and 25,712,878 rows respectively.
The EXPLAIN output looks like this:
EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_ ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd BETWEEN '20121123' AND '20121123' AND session.site_id = '49'
GROUP BY referer_via_.url
ORDER BY visits DESC LIMIT 250;
Access Path:
+-SELECT LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
| Output Only: 250 tuples
| Execute on: Query Initiator
| +---> SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| | Order: count(DISTINCT "session".id) DESC
| | Output Only: 250 tuples
| | Execute on: All Nodes
| | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | | Aggregates: count(DISTINCT "session".id)
| | | Group By: referer_via_.url
| | | Execute on: All Nodes
| | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | | Group By: referer_via_.url, "session".id
| | | | Execute on: All Nodes
| | | | +---> JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 4) Outer (RESEGMENT)
| | | | | Join Cond: ("session".referer_id = referer_via_.id)
| | | | | Execute on: All Nodes
| | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463, Rows: 1 (STALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | | Projection: public.owa_session_projection
| | | | | | Materialize: "session".id, "session".referer_id
| | | | | | Filter: ("session".site_id = '49')
| | | | | | Filter: (("session".yyyymmdd >= 20121123) AND ("session".yyyymmdd <= 20121123))
| | | | | | Execute on: All Nodes
| | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K, Rows: 26M] (PATH ID: 6)
| | | | | | Projection: public.owa_referer_DBD_1_seg_Potency_20121122_Potency_20121122
| | | | | | Materialize: referer_via_.id, referer_via_.url
| | | | | | Execute on: All Nodes
To speed up the join:
Design the session table to be partitioned on column "yyyymmdd". This will enable partition pruning.
Add a condition on column "yyyymmdd" to referer_via_ and partition on it, if possible (most likely not).
Have column site_id as close as possible to the beginning of the ORDER BY list in the (super)projection used for session.
Have the two tables segmented on referer_id and id respectively.
And having more nodes in the cluster does help.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
I guess the amount affected would vary depending on data sets and structures you are working with. But, since this is the variable you changed, I believe it is safe to say the pre-join projection is causing the slowness. You are gaining query time at the expense of insertion time.
Someone please correct me if any of the following is wrong. I'm going by memory and by information picked up with conversations with others.
You can speed up your joins without a pre-join projection in a few ways. I believe it would help to segment the projections for both tables on the join predicate, which in this case is the referrer ID. Anything you can do to filter the data early should also help.
Looking at your explain plan, you are doing a hash join instead of a merge join, which you probably want to look at.
Lastly, I would like to know via the explain plan or through system tables if your query is actually using the projections Database Designer has recommended. If not, explicitly specify them in your query and see if that helps.
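For readers unfamiliar with the hash join versus merge join distinction mentioned above, here is a minimal sketch of the two algorithms in plain Python. It illustrates the general techniques only, not Vertica's implementation, and the session and referer rows are made-up sample data. The point is that a merge join needs both inputs already sorted on the join key, which is why the sort order of the projections matters.

```python
def hash_join(left, right):
    """Build a hash table on one input, then probe it with every row of the other."""
    table = {}
    for key, value in right:
        table.setdefault(key, []).append(value)
    return [(key, lval, rval) for key, lval in left for rval in table.get(key, [])]


def merge_join(left, right):
    """Join two inputs that are ALREADY sorted by key, moving forward through both."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Pair the current left row with the whole run of equal keys on the right.
            k = j
            while k < len(right) and right[k][0] == lkey:
                out.append((lkey, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


# (referer_id, session) and (id, url), both sorted by the join key.
sessions = [(1, "s1"), (2, "s2"), (2, "s3"), (4, "s4")]
referers = [(1, "a.example"), (2, "b.example"), (3, "c.example")]

hashed = hash_join(sessions, referers)
merged = merge_join(sessions, referers)
print(merged)
```

The hash join has to hold one whole input in memory as a hash table, while the merge join only walks two sorted streams; with a 26M-row inner table that difference can be significant.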
You seem to have a lot of STALE STATISTICS.
Responding to stale statistics is important, because that is the reason why your queries are slow. Without statistics about the underlying data, Vertica's query optimizer cannot choose the best execution plan. Note that responding to stale statistics only improves SELECT performance, not update performance.
If you update your tables regularly do remember there are additional things you have to consider in VERTICA. Please check the answer that I posted to this question.
I hope that should help improve your update speed.
Explore the AHM settings as explained in that answer. If you don't need to be able to select deleted rows in a table later, it is often a good idea not to keep them around. There are ways to keep only the latest epoch version of the data, or to manually purge deleted data.
Let me know how it goes.
I think your query could be more explicit. Also, don't use that devil BETWEEN. Try this:
EXPLAIN SELECT
referer_via_.url as referralPageUrl,
COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_
ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd >= '20121123'
AND session.yyyymmdd <= '20121123'
AND session.site_id = '49'
GROUP BY referer_via_.url
-- this `visits` column needs a table name
ORDER BY visits DESC LIMIT 250;
I'll say I'm really perplexed as to why you would use the same date on both sides of BETWEEN; you may want to look into that.
This is my view, coming from an academic background working with column databases, including Vertica (recent PhD graduate in database systems).
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
Yes, updating projections is very slow and you should ideally do it only in large batches to amortize the update cost. The fundamental reason is that each projection represents another copy of the data (of each table column that is part of the projection).
A single row insert requires adding one value (one attribute) to each column in the projection. For example, a single row insert in a table with 20 attributes requires at least 20 column updates. To make things worse, each column is sorted and compressed. This means that inserting the new value in a column requires multiple operations on large chunks of data: read data / decompress / update / sort / compress data / write data back. Vertica has several optimizations for updates but cannot completely hide the cost.
Projections can be thought of as the equivalent of multi-column indexes in a traditional row store (MySQL, PostgreSQL, Oracle, etc.). The upside of projections versus traditional B-Tree indexes is that reading them (using them to answer a query) is much faster than using traditional indexes. There are several reasons: no need to access heap data as with non-clustered indexes, smaller size due to compression, etc. The flip side is that they are much more difficult to update. Tradeoffs...

How to query huge MySQL databases?

I have 2 tables, a purchases table and a users table. Records in the purchases table looks like this:
purchase_id | product_ids | customer_id
---------------------------------------
1 | (99)(34)(2) | 3
2 | (45)(3)(74) | 75
Users table looks like this:
user_id | email | password
----------------------------------------
3 | joeShmoe#gmail.com | password
75 | nolaHue#aol.com | password
To get the purchase history of a user I use a query like this:
mysql_query(" SELECT * FROM purchases WHERE customer_id = '$users_id' ");
The problem is, what will happen when tens of thousands of records are inserted into the purchases table. I feel like this will take a performance toll.
So I was thinking about storing the purchases in an additional field directly in the user's row:
user_id | email | password | purchases
------------------------------------------------------
1 | joeShmoe#gmail.com | password | (99)(34)(2)
2 | nolaHue#aol.com | password | (45)(3)(74)
And when I query the user's table for things like username, etc. I can just as easily grab their purchase history using that one query.
Is this a good idea, will it help better performance or will the benefit be insignificant and not worth making the database look messier?
I really want to know what the pros do in these situations. For example, how does Amazon query its database for a user's purchase history, given that they have millions of customers? How come their queries don't take hours?
EDIT
Ok, so I guess keeping them separate is the way to go. Now the question is a design one:
Should I keep using the "purchases" table I illustrated earlier? In that design I separate the product ids of each purchase using parentheses, and use them as the delimiter to tell the ids apart when extracting them via PHP.
Instead should I be storing each product id separately in the "purchases" table so it looks like this?:
purchase_id | product_ids | customer_id
---------------------------------------
1 | 99 | 3
1 | 34 | 3
1 | 2 | 3
2 | 45 | 75
2 | 3 | 75
2 | 74 | 75
Nope, this is a very, very, very bad idea.
You're breaking first normal form because you don't know how to page through a large data set.
Amazon and Yahoo! and Google bring back (potentially) millions of records - but they only display them to you in chunks of 10 or 25 or 50 at a time.
They're also smart about guessing or calculating which ones are most likely to be of interest to you - they show you those first.
Which purchases in my history am I most likely to be interested in? The most recent ones, of course.
You should consider building these into your design before you violate relational database fundamentals.
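A minimal sketch of such paging, using SQLite through Python as a stand-in for MySQL. The helper purchase_page is hypothetical, and ordering by purchase_id assumes that a higher id means a more recent purchase (the question's table has no date column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (purchase_id INTEGER PRIMARY KEY, customer_id INTEGER)")
# Index that matches both the filter and the sort order of the paging query.
conn.execute("CREATE INDEX idx_purchases_customer ON purchases (customer_id, purchase_id)")
conn.executemany(
    "INSERT INTO purchases (purchase_id, customer_id) VALUES (?, ?)",
    [(i, 3) for i in range(1, 101)],
)


def purchase_page(customer_id, page, page_size=25):
    """Return one page of a customer's purchases, most recent first."""
    return conn.execute(
        "SELECT purchase_id FROM purchases "
        "WHERE customer_id = ? "
        "ORDER BY purchase_id DESC "  # assumption: higher id = more recent
        "LIMIT ? OFFSET ?",
        (customer_id, page_size, page * page_size),
    ).fetchall()


first_page = purchase_page(3, 0)
second_page = purchase_page(3, 1)
print(first_page[0], first_page[-1], second_page[0])
```

Only 25 rows ever leave the database per request, no matter how long the customer's history grows.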
Your database already looks messy, since you are storing multiple product_ids in a single field, instead of creating an "association" table like this.
_____product_purchases____
purchase_id | product_id |
--------------------------
1 | 99 |
1 | 34 |
1 | 2 |
You can still fetch it in one query:
SELECT * FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id)
WHERE p.customer_id = $user_id
But this also gives you more possibilities, like finding out how many product #99 were bought, getting a list of all customers that purchased product #34 etc.
And of course don't forget about indexes, that will make all of this much faster.
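The association-table design above can be tried out with the sample data from the question. This sketch runs on SQLite through Python rather than MySQL, and the index names are invented for the example; the queries themselves are the kind the answer describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (purchase_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE product_purchases (purchase_id INTEGER, product_id INTEGER);
-- Indexes on the columns used for filtering and joining.
CREATE INDEX idx_purchases_customer ON purchases (customer_id);
CREATE INDEX idx_pp_purchase ON product_purchases (purchase_id);
CREATE INDEX idx_pp_product ON product_purchases (product_id);

INSERT INTO purchases VALUES (1, 3), (2, 75);
INSERT INTO product_purchases VALUES
  (1, 99), (1, 34), (1, 2),
  (2, 45), (2, 3), (2, 74);
""")

# Purchase history of one customer, still a single query.
history = conn.execute(
    "SELECT p.purchase_id, pp.product_id "
    "FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id) "
    "WHERE p.customer_id = ? ORDER BY pp.product_id",
    (3,),
).fetchall()

# Questions the delimited-string design cannot answer cheaply.
buyers_of_34 = conn.execute(
    "SELECT DISTINCT p.customer_id "
    "FROM purchases p JOIN product_purchases pp USING (purchase_id) "
    "WHERE pp.product_id = ?",
    (34,),
).fetchall()
times_99_bought = conn.execute(
    "SELECT COUNT(*) FROM product_purchases WHERE product_id = ?", (99,)
).fetchone()[0]

print(history, buyers_of_34, times_99_bought)
```

Passing the customer id as a bound parameter (the ? placeholders) also avoids building SQL by string interpolation.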
By doing this with your schema, you will break the entity-relationship of your database.
You might want to look into Memcached, NoSQL, and Redis.
These are all tools that will help you improve your query performance, mostly by storing data in RAM.
For example: run the query once and store the result in Memcached. If the user refreshes the page, you get the data from Memcached, not from MySQL, which avoids querying your database a second time.
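The pattern described here is often called cache-aside. Below is a minimal sketch in which a plain Python dict stands in for a Memcached or Redis client and SQLite stands in for MySQL; the function purchase_history and the key format are illustrative assumptions, not any library's API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (purchase_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [(1, 3), (2, 75), (3, 3)])

cache = {}                   # stand-in for a Memcached/Redis client
stats = {"db_queries": 0}    # counts how often the database is really hit


def purchase_history(customer_id):
    """Return a customer's purchases, consulting the cache before the database."""
    key = f"purchases:{customer_id}"
    if key in cache:
        return cache[key]            # cache hit: no SQL executed
    stats["db_queries"] += 1
    rows = conn.execute(
        "SELECT purchase_id FROM purchases WHERE customer_id = ? ORDER BY purchase_id",
        (customer_id,),
    ).fetchall()
    cache[key] = rows                # a real cache would also set an expiry time
    return rows


first = purchase_history(3)    # goes to the database
second = purchase_history(3)   # served from the cache
print(first, stats["db_queries"])
```

The cached entry has to be deleted or refreshed whenever the customer makes a new purchase, otherwise the page shows stale history.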
Hope this helps.
First off, tens of thousands of records is nothing. Unless you're running on a teensy weensy machine with limited RAM and hard drive space, a database won't even blink at 100,000 records.
As for storing purchase details in the users table... what happens if a user makes more than one purchase?
MySQL is hugely extensible, and don't let the fact that it's free convince you otherwise. Keeping the two tables separate is probably best, not only because it keeps the database more normalized, but also because having more indexes will speed up queries. A 10,000-record database is relatively small compared to multi-hundred-million-record health record databases.
As far as Amazon and Google, they hire hundreds of developers to write specialized query languages for their specific application needs... not something developers like us have the resources to fund.

Substitute MySQL result

I'm getting the following data from a MySQL database
+----------------+------------+---------------------+----------+
| account_number | total_paid | doc_date | doc_type |
+----------------+------------+---------------------+----------+
| 18 | 54.0700 | 2009-10-22 02:37:09 | IN |
| 425 | 49.9500 | 2009-10-22 02:31:47 | PO |
+----------------+------------+---------------------+----------+
The query is fine and I'm getting the data I need except that the doc_type isn't very human readable. To fix this, I've done the following
CREATE TEMPORARY TABLE doc_type (id char(2), string varchar(60));
INSERT INTO doc_type VALUES
('IN', 'Invoice'),
('PO', 'Online payment'),
('PF', 'Offline payment'),
('CA', 'Credit adjustment'),
('DA', 'Debit adjustment'),
('OR', 'Order');
I then add a join against this temporary table so my doc_type column is easier to read which looks like this
+----------------+------------+---------------------+----------------+
| account_number | total_paid | doc_date | document_type |
+----------------+------------+---------------------+----------------+
| 18 | 54.0700 | 2009-10-22 02:37:09 | Invoice |
| 425 | 49.9500 | 2009-10-22 02:31:47 | Online payment |
+----------------+------------+---------------------+----------------+
Is this the best way to do this? Is it possible to replace the text in one query? I started looking at if statements but it doesn't seem to be what I'm after or maybe I just read it incorrectly.
// EDIT //
Thanks everyone. I suppose I'll keep doing it this way.
Unfortunately, it's not possible to change doc_type to integer as this is an existing database for a billing application I didn't write. I'd end up breaking functionality if I made any changes other than adding a table here and there.
Also appreciate the easy to understand case statement from Rahul. May come in handy later.
Your current way is the best. Arguably, document_type can be changed to an int, to save space and whatnot, but that's irrelevant.
Doing the join will be much faster and more readable than any chained ifs.
Not to mention, extensible. Should you need to add a new doc_type, it's just an insert vs. potentially several queries.
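For reference, the join being discussed looks roughly like this. The sketch uses SQLite through Python, and the name documents for the billing table is hypothetical because the question never names it; the doc_type lookup table is the one from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 'documents' is a made-up name for the billing table in the question.
CREATE TABLE documents (account_number INTEGER, total_paid REAL, doc_date TEXT, doc_type TEXT);
CREATE TABLE doc_type (id TEXT, string TEXT);

INSERT INTO documents VALUES
  (18,  54.07, '2009-10-22 02:37:09', 'IN'),
  (425, 49.95, '2009-10-22 02:31:47', 'PO');
INSERT INTO doc_type VALUES
  ('IN', 'Invoice'),
  ('PO', 'Online payment'),
  ('PF', 'Offline payment'),
  ('CA', 'Credit adjustment'),
  ('DA', 'Debit adjustment'),
  ('OR', 'Order');
""")

# LEFT JOIN plus COALESCE keeps rows whose code has no description yet.
readable = conn.execute("""
SELECT d.account_number, d.total_paid, d.doc_date,
       COALESCE(t.string, d.doc_type) AS document_type
FROM documents d
LEFT JOIN doc_type t ON t.id = d.doc_type
ORDER BY d.account_number
""").fetchall()
print(readable)
```

Supporting a new document type is then a single INSERT into doc_type, with no change to any query.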
You can use the SQL CASE statement to do this in a single query.
SELECT account_number, total_paid, doc_date,
case doc_type
when 'IN' then 'Invoice'
when 'PO' then 'Online Payment'
end as document_type
from table
It is the best way to do this :)
If doc_type could be an integer, you also can use ELT function, as in
SELECT ELT(doc_type, 'Invoice', 'Document') FROM table;
but it is still worse than a simple join, as you have to put this expression into every query and every application that uses the database, and changing a description becomes hell.
IIRC this is the correct way to achieve what you want to do. It's a normalized design.
I think you are asking about the design and not about how the data should be fetched? If so, I can tell you that I have always used this kind of design.
This design leads to a normalized database. There won't be consistency problems if you ever need to change a display name such as Invoice or Online Payment.
I would suggest changing the doc_type field to int, as it not only saves space (as Tordek said) but is also faster when you execute queries.
First, if you stored Invoice in doc_type as a string, searches would be slow, because string comparison is much slower than comparison on other data types.
Second, strings are case sensitive (which may lead to mistakes).
Third, since a string takes up more space, more space is required to store it in the main table.
Fourth, if you ever needed to change the name Invoice to, say, Billing, then searching for Invoice would take time, and each and every row containing this value would have to be updated.