Getting OLTP-like performance from BigQuery results - google-bigquery

I'm working on a project where we need to display BigQuery results in a table within a web application.
We’ve built the feature by paging, sorting and searching directly in BigQuery, but the performance isn’t what you would expect of a modern web application: it takes several seconds to apply a search term or change a page.
I can't really share much code, but this is a general question that applies to any large result set generated in BigQuery.
For a little bit of context: we create a view in BigQuery by joining a product catalog to orders.
WITH Catalog AS (
  SELECT
    productId,
    name,
    sku
  FROM `CatalogTable`
),
Orders AS (
  SELECT
    p.productId,
    p.name,
    p.sku,
    SUM(p.qty) AS qty
  FROM `OrdersView` AS o, UNNEST(o.products) AS p
  GROUP BY p.productId, p.name, p.sku
)
SELECT
  c.productId,
  IF(o.qty IS NULL, 0, o.qty) AS qty,
  ROW_NUMBER() OVER (ORDER BY o.qty DESC) AS salesRank
FROM Catalog AS c
LEFT JOIN Orders AS o
  ON CONCAT(c.name, c.sku) = CONCAT(o.name, o.sku)
And the view is queried like so:
SELECT ...
FROM `catalog` c
LEFT JOIN `catalogView` cv
  ON cv.productId = c.productId
WHERE c.name LIKE '%searchTerm%'
LIMIT 10
OFFSET 0
What are the options for making this grid-view perform as it would if it were built on a traditional SQL database (or close to it)?
I've considered clustering, but I don't believe this is an option since I'm not partitioning the table:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
NOTES:
It's acceptable for the results to be a little delayed, so streaming the results into another database is an option.
The query is called via a WebApi endpoint and displayed in an Angular grid-view.
New orders are imported every 15 minutes, so the results of this query won't be entirely static; they can change periodically.
The data-grid must support paging, sorting and searching, and the grid could contain 10,000+ results.

BigQuery should not be used if you expect OLTP behavior or performance.
In your case, if you want to keep your project on GCP and keep your data model as similar as possible to the one you already have, I would suggest taking a look at Cloud SQL and Cloud Spanner.
Both are fully managed relational databases. The main difference is that Cloud Spanner is horizontally scalable whereas Cloud SQL is not: if you need only one node, use Cloud SQL; if you need to scale out across a cluster, use Cloud Spanner.
Furthermore, both of them have their respective web APIs. You can find the Cloud Spanner web API reference here. For Cloud SQL, the reference depends on which DBMS you choose: SQL Server, MySQL or PostgreSQL.
I hope this helps.
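If the slightly delayed results mentioned in the notes are acceptable, one concrete shape this can take is a small serving table in Cloud SQL for PostgreSQL that a scheduled job refreshes from the BigQuery view every 15 minutes. A minimal sketch, with illustrative table, column and index names rather than anything from the question:
-- One-time setup: trigram support so LIKE '%term%' searches can use an index
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Flat serving table refreshed from the BigQuery view
CREATE TABLE catalog_sales (
    product_id BIGINT PRIMARY KEY,
    name       TEXT   NOT NULL,
    sku        TEXT   NOT NULL,
    qty        BIGINT NOT NULL DEFAULT 0,
    sales_rank INT    NOT NULL
);

CREATE INDEX idx_catalog_sales_rank ON catalog_sales (sales_rank);
CREATE INDEX idx_catalog_sales_name ON catalog_sales USING gin (name gin_trgm_ops);

-- The grid query the WebApi endpoint would run: fast paging, sorting and searching
SELECT product_id, name, sku, qty, sales_rank
FROM catalog_sales
WHERE name ILIKE '%searchTerm%'
ORDER BY sales_rank
LIMIT 10 OFFSET 0;
A Cloud Scheduler job that triggers a small Cloud Function (or any cron job) to re-run the BigQuery query and upsert the results would keep the grid within the 15-minute freshness window.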

Related

SQL Query optimisation for PS

I am currently creating a webshop for my local parts store using PrestaShop 1.7.8.6.
I developed the scripts myself and have successfully made the website work correctly.
But with 2 million rows of products, each with 30 columns, and multiple joins, I can't get a decent loading time on the getProducts query.
Even with indexes and cache...
I use a simple query on the PrestaShop product table and join product IDs from the car filter table to match the ps_product table.
I would like to know if it would be better to create a table for each vehicle using an id, fill it with ps_product data, and use that table alone instead of multiple joins.
I'm using InnoDB as the engine.
Thanks
PrestaShop does not perform well with such a huge amount of data/products using its native methods, as you have seen, so the best option is to strengthen your MySQL server.
Consider using one or more dedicated machines for SQL (with replication), saving your data in external tables, or storing it in a distributed system built to deal with large amounts of data (like Elasticsearch or similar) so you can scale it easily and write your own code/module to retrieve what you need.
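As a rough sketch of the "external tables" route: a flat, pre-joined table keyed by vehicle and product, filled by a background job, turns getProducts into a single indexed range scan instead of multiple joins at request time. The table and column names below are hypothetical; only ps_product is part of the real PrestaShop schema.
-- Pre-joined lookup filled periodically from ps_product and the car filter table
CREATE TABLE vehicle_product_flat (
    id_vehicle INT UNSIGNED  NOT NULL,
    id_product INT UNSIGNED  NOT NULL,
    reference  VARCHAR(64)   NOT NULL,
    name       VARCHAR(255)  NOT NULL,
    price      DECIMAL(20,6) NOT NULL,
    PRIMARY KEY (id_vehicle, id_product),
    KEY idx_vehicle_name (id_vehicle, name)
) ENGINE=InnoDB;

-- One page of products for a given vehicle, no joins needed at request time
SELECT id_product, reference, name, price
FROM vehicle_product_flat
WHERE id_vehicle = 123
ORDER BY name
LIMIT 0, 50;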

How can I query the latest version of a page in SQL, when that information is stored in a secondary table?

Consider that I'm writing a wiki¹. I may have one table that contains a row for each wiki page, and another that contains each version of that page, with a foreign key to the page that the version corresponds to. A user may request to view a list of every page, including the title of the page (which is included in the versions table since the title can be updated and thus should be tracked with versions).
I could first do a query to get a list of wiki pages, and then do a separate query to get the title of each page, but that runs many more queries than I need and is thus less performant due to server round trips and some (very minor) blocking in the SQL library.
Instead, I'd rather do something like a JOIN between the wiki pages table and the versions table, but then I'll get a separate row in the result for each version, transferring and preparing a lot more data than I need. In my query to view a page's contents, I just use ORDER BY timestamp DESC LIMIT 1, which works great there to solve this problem, but it won't work as-is for the list case since I need more than one row. Can I make the order by and limit apply separately to each set of rows that share a page id?
My next idea, and all that my research attempts point to, is to try something with subqueries: essentially my first option, but where Postgres' optimizer can see the entire operation at once and hopefully optimize it better than many separate queries, while avoiding the extra round trips and blocking. However, when I looked at Postgres' list of available subquery options, I was unable to figure out how to use any of them to solve this problem.
Lastly, I could just store the title (and other per-version data that I need in this query) in the main table, but this is duplication of data and thus a bad practice. Nonetheless, it seems like the least evil that I can figure out at present; hence, the question: How can I query the data that I need, to produce a list of wiki pages including the latest per-version data in a performant manner and without duplicating data?
1: My project isn't a wiki, but as the details of it are private for now, I need to give a slightly contrived example.
You are describing a top-1-per-group problem. Without seeing actual structures this is rather theoretical, but the logic could be implemented with distinct on in Postgres. That would look something like this:
select distinct on (p.page_id) p.*, pv.title
from pages p
inner join page_versions pv on pv.page_id = p.page_id
order by p.page_id, pv.timestamp desc
Or you could use a lateral join:
select p.*, pv.title
from pages p
cross join lateral (
select pv.*
from page_versions pv
where pv.page_id = p.page_id
order by pv.timestamp desc limit 1
) pv
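If you later need the latest N versions per page rather than just one, a window function is another common way to express the same top-n-per-group logic. A sketch against the same hypothetical tables:
select p.*, v.title
from pages p
join (
    select pv.page_id, pv.title,
           row_number() over (partition by pv.page_id order by pv.timestamp desc) as rn
    from page_versions pv
) v on v.page_id = p.page_id
where v.rn = 1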

Performance issue using Row Level Security with a lookup table

I have implemented Row Level Security on SQL Server 2016. I have a fairly complex setup, but our security requirement is complex.
This is in the context of a data warehouse. I have basic fact and dimension tables. I applied row level security to one of my dimension tables with the following setup:
Table 1 : dimDataSources (standard physical table)
Table 2 : dimDataSources_Secured (Memory Optimized table)
I created a Security Policy on dimDataSources_Secured (In-Memory) that uses a Natively Compiled function. That function reads another Memory Optimized table that contains lookup values and the Active Directory groups allowed to read each record.
The function uses is_member() to return 1 for all records that are allowed for my groups.
So the setup seems a bit complex, but so far it works.
But now when I use this in joins with the fact table, we get a performance hit. Note that I am not applying row level security directly on the fact table, only on the dimension table.
So my problem is if I run this:
SELECT SUM(Sales) FROM factSales
It returns quickly, let's say 2 seconds.
If I run the same query but with a join on the secured table (or view), it will take 5-6 times longer:
SELECT SUM(Sales) FROM factSales f
INNER JOIN dimDataSources_Secured d ON f.DataSourceKey = d.DataSourceKey
This retrieves only the source I have access to based on my AD groups.
With the join added, the execution plan changes: it seems to retrieve the fact table data quickly, but then does a nested loop lookup on the In-Memory table to get the allowed records.
Is that behavior caused by the usage of the Filter Predicate functions?
Anyone had good or bad experiences using Row Level Security?
Is it mature enough to put in production?
Is it a good candidate for data warehousing (i.e. processing big volumes of data)?
It is hard to give more details about my actual function and queries without writing a novel. I'm mostly looking for guidelines or alternatives.
Is that behavior caused by the usage of the Filter Predicate functions? Anyone had good or bad experiences using Row Level Security? Is it mature enough to put in production? Is it a good candidate for data warehousing (processing big volumes of data)?
Yea, you'll take a performance hit when using RLS. Aaron Bertrand wrote a good piece in March of 2017 on it. Ben Snaidero wrote a good one in 2016. Microsoft has also provided guidance on patterns to limit performance impact.
I've never seen RLS implemented for an OLAP schema so I can't comment on that. Without seeing your filter predicates, it's tough to say, but that's usually where the devil is.
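Without the actual predicate it is hard to be specific, but for reference, a minimal sketch of the general pattern described in the question looks like the following. It uses ordinary disk-based objects and illustrative names; the poster's real version uses memory-optimized tables and a natively compiled function.
-- Lookup table: which AD group may read which data source
CREATE TABLE dbo.DataSourceAccess
(
    DataSourceKey  int     NOT NULL,
    AllowedAdGroup sysname NOT NULL
);
GO

-- Inline predicate function: returns a row when the caller is in an allowed group
CREATE FUNCTION dbo.fn_DataSourcePredicate (@DataSourceKey int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS AccessResult
    FROM dbo.DataSourceAccess a
    WHERE a.DataSourceKey = @DataSourceKey
      AND IS_MEMBER(a.AllowedAdGroup) = 1;
GO

-- Filter predicate applied to the secured dimension
CREATE SECURITY POLICY dbo.DataSourceFilter
    ADD FILTER PREDICATE dbo.fn_DataSourcePredicate(DataSourceKey)
    ON dbo.dimDataSources_Secured
WITH (STATE = ON);
Whatever this predicate function looks like is what the optimizer has to fold into every query touching the secured table, which is why its shape is usually where the performance problems come from.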

How to write riak query in riakc

I am a noob at Riak and have been trying to test the query side of Riak using riakc in Erlang, but I cannot find any example of how to query the database in a way that matches the old SQL approach, only how to get a single value out of a single field. I think I am missing something, but all I really want is a set of standard SQL queries with the matching riakc code.
SELECT * FROM bucket;
SELECT * FROM bucket LIMIT 10, 100;
SELECT id, name FROM bucket;
SELECT * FROM bucket WHERE name="john" AND surname LIKE "Ste%";
SELECT * FROM bucket LEFT JOIN bucket2 ON bucket.id = bucket2.id2;
I assume there is no direct correlation in how you write these, but I was hoping there is a standard way. Is there somewhere that explains these queries in riakc (or even just Riak) in a simple-to-understand way?
I have looked at MapReduce but found it confusing for just simple queries.
Riak is a NoSQL database, more specifically a key-value database, and there is no query language like SQL available. When working with Riak you need to model and query your data in a completely different way compared to how you use a relational database in order to get the most from it. Trying to model and query your data in a relational manner, e.g. by extensive use of secondary indexes or by trying to use map/reduce as a real-time query language, generally results in very poor performance and scalability. A good and useful discussion about Riak development anti-patterns can be found here.

Linked server with cross join

I have two servers. One is mine and the other belongs to another company. On the second server I can't create any databases or add any functions or stored procedures, but I need to pull information back to join against my own database.
For example:
select fieldA, fieldB from localTBL l
left join linkedserver.remoteDB.dbo.remoteTBL r on l.ID = r.ID
or
select fieldA, fieldB from linkedserver.remoteDB.dbo.remoteTBL r
where r.ID in (select l.ID from localTBL l)
I did this, but the performance was horrible.
Is it possible to do this with better performance?
For better performance with linked servers, use openquery. Otherwise, you bring back all the data from the remote server first and apply the where clause afterwards.
In your situation, run the subquery first and return the list of values to a variable. Then use that variable in your openquery.
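A rough sketch of that idea, reusing the hypothetical names from the question. OPENQUERY only accepts a literal query string, so the variable has to be spliced in with dynamic SQL:
-- Build the list of local IDs first (STRING_AGG needs SQL Server 2017+; use FOR XML PATH on older versions)
DECLARE @ids nvarchar(max);
SELECT @ids = STRING_AGG(CAST(ID AS nvarchar(20)), ',') FROM localTBL;

-- Splice the list into the remote query so the filtering happens on the remote server
-- and only matching rows come back across the wire
DECLARE @sql nvarchar(max) = N'
    SELECT r.ID, r.fieldB
    FROM OPENQUERY(linkedserver,
        ''SELECT ID, fieldB FROM remoteDB.dbo.remoteTBL WHERE ID IN (' + @ids + N')'') AS r;';
EXEC sys.sp_executesql @sql;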
A CTE can be used to bring only the information you require across the wire and then perform the join against the calling server. Something like:
DECLARE @Id AS int;
SELECT @Id = 45;
WITH cte (ID, fieldB)
AS
(
    SELECT ID, fieldB
    FROM linkedserver.remoteDB.dbo.remoteTBL
    WHERE ID = @Id
)
SELECT lt.fieldA, cte.fieldB
FROM localTBL lt
INNER JOIN cte ON lt.ID = cte.ID
ORDER BY lt.ID;
Yep. Performance will be horrible. It's down to the network between you and the other company, and any authentications and authorisations that have to be done on the way.
This is why Linked Servers aren't used very much, even within a single company: Performance is usually bad. (I've never seen a Linked Server in a separate company and can only sympathise!)
Unless you can upgrade the network link between you, there's not much you can do when querying across a linked server.
This setup sounds like a short-term solution to a problem which needed a fast fix, and which has lasted longer than expected. If you can get a business case for spending money on it, two alternatives are:
Cheapest alternative: cache the data locally. Have a background service running which pulls the latest version of the data out of the Linked Server tables into a set of tables in the local database, and then run your queries against the local tables. This depends on how changeable the remote data is and how up-to-date your queries have to be. For example, if you're doing things like getting yesterday's sales data, you might be able to do an overnight pull; if you need more up-to-date data, maybe an hourly pull. You can get quite picky sometimes: if the data structures support it, only pull out data which has changed since the last pull, which makes each pull much smaller and allows more frequent ones (a rough sketch of this follows below).
More expensive, involving work by both you and the other company: re-architect it so that the other company pushes changes to you as they happen, via a WCF service (or something similar) that you expose. This can then update your local copy as the data comes in.
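For the cheaper caching alternative, a rough sketch of an incremental pull, assuming the remote table has (or can be given) a LastModified column; all object names are illustrative:
-- Local cache table that the application queries instead of the linked server
CREATE TABLE dbo.remoteTBL_cache
(
    ID           int           NOT NULL PRIMARY KEY,
    fieldB       nvarchar(100) NULL,
    LastModified datetime2     NOT NULL
);
GO

-- Scheduled job (SQL Agent, hourly or overnight): pull only rows changed since the last sync
DECLARE @lastSync datetime2 =
    (SELECT ISNULL(MAX(LastModified), '19000101') FROM dbo.remoteTBL_cache);

MERGE dbo.remoteTBL_cache AS tgt
USING (
    SELECT ID, fieldB, LastModified
    FROM linkedserver.remoteDB.dbo.remoteTBL
    WHERE LastModified > @lastSync
) AS src
    ON tgt.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET tgt.fieldB = src.fieldB, tgt.LastModified = src.LastModified
WHEN NOT MATCHED THEN
    INSERT (ID, fieldB, LastModified) VALUES (src.ID, src.fieldB, src.LastModified);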