How to reduce BigQuery data capacity usage from Data Portal (Data Studio) dashboards - google-bigquery

I want to know how to reduce my use of BigQuery data capacity by changing settings in Data Portal (Google's BI tool, formerly named Data Studio).
The reason is that if I don't reduce my BigQuery data capacity usage, I either can't execute SQL or have to pay a large cost.
I'm looking for a way that uses Data Portal settings, not changes to BigQuery settings (including changes to the SQL code).
Because the dashboard in Data Portal keeps using BigQuery data capacity, I can't solve my problem even if I change the SQL code.
My situation:
1. I made a view in my BigQuery environment.
I tried to write the query so that it doesn't use a lot of BigQuery data capacity.
For example, I didn't use "SELECT * FROM ...".
I set the view as a data source in Data Portal.
Then I made the dashboard using that data source.
If someone opens the dashboard, the view I made is executed.
So BigQuery data capacity is used every time someone opens the dashboard.

If I'm understanding correctly, you're wanting to reduce the amount of data processed in BigQuery from your Data Studio (or in Japan, Data Portal) reports.
There are a few ways to do this:
Make sure that the "Enable Cache" option is checked in the report settings.
Avoid using BigQuery views as a query source, as these aren't cached at the BigQuery level (the view query is run every time, and likely many times per report for various charts). Instead, use a Custom Query connection or pull the table data directly to allow caching. Another option (which we use heavily) is to run a scheduled query that saves the output of a view as a table and replaces it regularly (or is triggered when the underlying data is refreshed). This way your queries can be cached, but the business logic can still exist within the view.
Create a BI Engine reservation in BigQuery. This adds another level of caching to Data Studio reports, and may give you better results for things that can't be query-cached or cached in Data Studio. (While there will be a cost to the service in the future based on the size of instance you reserve, it's free during their beta period.)
Don't base your queries on tables that have a streaming buffer attached (even if it hasn't received rows recently), that are queried through wildcard tables, or that are based on an external dataset (e.g. a file in Cloud Storage or Bigtable). See Caching Exceptions for details.
Pull as little data as possible by using the new Data Source Parameters. This means you can pass the values of your date range or other filters directly to BigQuery and filter the data before it reaches your report. This is especially helpful if you have a date-partitioned table, as you can then scan only the needed partitions (which greatly reduces processing and the amount of data returned).
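To illustrate the scheduled-query idea from the list above (saving the output of a view as a table that gets replaced regularly), here is a minimal sketch that builds the statement such a scheduled query could run. The project, dataset, view, and table names are hypothetical, and the helper function is my own, not part of any Google library.

```python
def materialize_view_sql(view: str, table: str) -> str:
    """Build a statement that snapshots a view into a regular table.

    The resulting table can be used as the report's data source, while the
    business logic stays in the view.
    """
    return f"CREATE OR REPLACE TABLE `{table}` AS\nSELECT * FROM `{view}`"


# Hypothetical names, for illustration only.
statement = materialize_view_sql(
    view="my_project.reporting.sales_view",
    table="my_project.reporting.sales_snapshot",
)
print(statement)
```

You would paste the printed statement into a scheduled query and point the dashboard at the snapshot table instead of the view.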
Also, sometimes it seems like you're moving a lot of data, but that doesn't always translate into a high cost. Check your cost breakdowns, or look at the logging filtered to the user your data source authenticates as, and see how much cost it has actually incurred. Certain operations fall under a free tier, and others don't result in cost for non-egress use cases like Data Studio. All that to say that you may want to make sure there's a cost problem at the BigQuery level in the first place before killing yourself trying to optimize the usage.
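As a rough way to sanity-check whether there is a cost problem at all, you could estimate the worst case where nothing is served from cache. This is only a back-of-the-envelope sketch: every number below (bytes per query, charts per report, opens per day, price, free allowance) is a placeholder assumption, so substitute the figures from your own billing page and logs.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def monthly_scan_bytes(bytes_per_query, queries_per_open, opens_per_day, days=30):
    """Total bytes scanned in a month if every chart query misses the cache."""
    return bytes_per_query * queries_per_open * opens_per_day * days


def on_demand_cost(total_bytes, price_per_tib, free_bytes=0):
    """Cost of the scanned bytes after subtracting any free allowance."""
    billable = max(0, total_bytes - free_bytes)
    return billable / TIB * price_per_tib


# Placeholder inputs: 200 MiB per chart query, 10 charts, 50 opens per day.
total = monthly_scan_bytes(200 * 1024 ** 2, queries_per_open=10, opens_per_day=50)
# Placeholder pricing: 5.0 per TiB with 1 TiB free per month (assumed values).
cost = on_demand_cost(total, price_per_tib=5.0, free_bytes=1 * TIB)
print(f"{total / TIB:.2f} TiB scanned, estimated cost {cost:.2f}")
```

If the estimate is small even in this worst case, the optimization effort may not be worth it.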

Related

How to increase performance of Azure Data Factory Pipeline with Integration Runtime

I would like to increase the performance of our pipelines.
The pipelines currently run from an integration runtime.
I am running a single copy activity on tables held on our Source which is a SQL Database. Tables contain just under a million rows, with about 15 columns.
Currently the time it takes to copy a table from Source to Sink (ADLS) is approximately 20 minutes.
Is there a way to increase the DIU to increase performance?
My current copy settings are as follows:
I'm thinking that if I made some changes to Settings (see below), I would improve performance, but I have never experimented with these settings before. Any suggestions are most welcome.
The activity details for a pipeline run is as follows:
My link service is an Azure Synapse Link service, see below:
From the output window, we can see that almost all the wait time was "Time to first byte", which means your SQL server is slow to reply. It takes ~22 minutes for less than 90K rows. So changes on the ADF side will not help.
If your query is a simple "select * from table", then maybe your SQL server is low on resources. You can check that in your database portal in Azure. Try to add more resources and see if copy times improve.
If this is a query from a view or other complicated query, maybe it needs some improvement (indexes, improve code). You can test that by writing the query result to a table in your SQL database, use that table as the data factory source, and see if this improves copy time.
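To put the numbers quoted above in perspective, this small calculation converts them into a throughput figure. It only uses the approximate values mentioned in this answer (about 90K rows in about 22 minutes); the function name is my own.

```python
def rows_per_second(rows: int, minutes: float) -> float:
    """Average throughput of a copy that moved `rows` rows in `minutes` minutes."""
    return rows / (minutes * 60)


# Approximate figures from the pipeline run output discussed above.
observed = rows_per_second(90_000, 22)
print(f"{observed:.0f} rows/second")
```

A rate this low for a plain table read points at the source side, which is why testing against a pre-materialized table is a useful experiment.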
Quick check: are the Azure SQL database and the storage account in the same region? Also, I see that your copy activity has parallelism set to 1; you can experiment with that number and see if it helps.
To set up parallelism, please read: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#parallel-copy
Please see the snapshot below.

Use case for BigQuery as a database backend for a website: thoughts

members,
Currently we synchronise salesdata into BigQuery, and it allows us to make fast, detailed, practically realtime reports of all kinds of stats that we otherwise would not have available. We want to have a website that is able to use these reports and present this information to website-users.
Some specs:
Users are using the data as 'readonly'
We want to do the analysis 'on request', so as soon as a user opens the page, we would query BigQuery and the user would see their stats depending on the query
The stats could change because of external sources, but often the result will be the same; I assume that BigQuery would cache the query
The average query processes about 100Mb of data, it takes >2 seconds for the whole backend to respond (so user request, query, return resultset) so performance is what we want
Why I doubt:
BigQuery might not be advised for this
Could it get 'out of hand'?
The dataset will grow bigger, but we will need to keep using all historical data in any case
It would be an option to put aggregated data into another database for the main calls, but that would not give me a 'realtime' experience.
I would love to hear your thoughts.
Based on your requirements, you can consider BigQuery as an option: it is fully managed and supports analytics over petabyte-scale data, so it will be able to handle large amounts of data. BigQuery is designed for OLAP workloads, so analysis can be performed on request. BigQuery also caches query results, so a repeated identical query can return quickly from the cache.
If your dataset is very large and keeps growing, you can create partitioned tables to store and manage your data and query it efficiently. As for growth getting out of hand, BigQuery is a fully managed service and will handle that load automatically. Historical data can be stored and accessed; you can also set a table expiration time and review the storage options according to your requirements.

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV).
JOB 2 -- Use the transformed data stored in the delimited file by JOB 1 as the source, then transform it for BigQuery and send it.
I am using a delimited text/CSV file as intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary source? Will a relational database help? Are delimited files good enough? Or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a secure cloud architecture.
Please share your views on this.
In data warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this lets you trace the data lineage back to the source, reload your final model from the staging point when business rules change, and get a full picture of the transformation steps the data went through, all the way from landing to reporting.
I'd also consider changing your design so that the staging layer is persisted in its own dataset in BigQuery, rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need and you have full control over permissions. BigQuery works seamlessly with Cloud Storage and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend storing the files you're using to load the table rather than deleting them. Sooner or later there will be questions about or failures in your final report, and you'll likely need to trace back to the source for investigation.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially lets you reshape the data (convert to arrays), filter out columns/rows, and perform the transformation in a single SQL statement.
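A minimal sketch of the load-then-transform pattern, assuming nothing about your actual schema: the table and column names are made up, and Python's built-in sqlite3 stands in for BigQuery only so that the example runs anywhere. In practice the same kind of single SQL statement would run inside BigQuery.

```python
import sqlite3

# sqlite3 is a stand-in for the warehouse here (assumption for runnability).
conn = sqlite3.connect(":memory:")

# Load: store the source rows as-is in a staging table, everything as text.
conn.execute(
    "CREATE TABLE staging_orders (order_id TEXT, amount TEXT, status TEXT, note TEXT)"
)
raw_rows = [
    ("1", "10.50", "paid", "gift"),
    ("2", "3.00", "cancelled", ""),
    ("3", "7.25", "paid", ""),
]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: one SQL statement casts types, filters rows, and drops a column.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL) AS amount
    FROM staging_orders
    WHERE status = 'paid'
    """
)

orders = conn.execute("SELECT order_id, amount FROM orders ORDER BY order_id").fetchall()
print(orders)
```

The staging table is kept after the transform, so the final table can be rebuilt from it if the rules change.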

Allowing many users to view stale BigQuery data query results concurrently

If I have a BigQuery dataset with data that I would like to make available to 1000 people (where each of these people would only be allowed to view their subset of the data, and is OK to view a 24hr stale version of their data), how can I do this without exceeding the 50 concurrent queries limit?
In the BigQuery documentation there's mention of 50 concurrent queries being permitted which give on-the-spot accurate data, which I would surpass if I needed them to all be able to view on-the-spot accurate data - which I don't.
In the documentation there is mention of Batch jobs being permitted and saving of results into destination tables which I'm hoping would somehow allow a reliable solution for my scenario, but am having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and whether or not someone querying results that exist in those destination tables is in itself counting towards the 50 concurrent users limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.
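Here is a rough sketch of the splitting and serving steps described above, under the assumption that the daily batch output has already been exported and read into a list of rows. The field name user_id, the sample rows, and both helper functions are hypothetical; a local temporary directory stands in for wherever you keep the flat files.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_user(rows, out_dir):
    """Write one JSON file per user from the rows of the daily batch output."""
    per_user = defaultdict(list)
    for row in rows:
        per_user[row["user_id"]].append(row)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for user_id, user_rows in per_user.items():
        (out_dir / f"{user_id}.json").write_text(json.dumps(user_rows))
    return sorted(per_user)


def load_for_user(user_id, out_dir):
    """Serve a request from the cached file; no BigQuery query is issued.

    user_id should come from the authenticated session, never from raw
    request input, since it is used to build a file path.
    """
    path = Path(out_dir) / f"{user_id}.json"
    return json.loads(path.read_text()) if path.exists() else []


# Hypothetical daily output, grouped by user in the batch query.
daily_output = [
    {"user_id": "alice", "metric": "orders", "value": 12},
    {"user_id": "bob", "metric": "orders", "value": 3},
    {"user_id": "alice", "metric": "revenue", "value": 250},
]
cache_dir = tempfile.mkdtemp()
users = split_by_user(daily_output, cache_dir)
alice = load_for_user("alice", cache_dir)
print(users, alice)
```

With this layout, user traffic only reads files, so it does not count against any query concurrency limit; only the single daily batch job touches BigQuery.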

Tableau Data Limits

I've been hearing conflicting statements about how many records / how much data Tableau can handle.
In the last week two people have told me they have dashboards with 100 million and 600 million records. They do incremental refreshes.
If I have a dashboard with xxx million records, do clients only receive the data that is in their aggregated view?
So, say I have a source with 200 million records, and the dashboard shows the aggregated total per week per product. Let's say this is 400 cells (underneath it's millions of records). Is the client only receiving 400 data points?
If I then add filters for sub-product or user-level data, would that mean all of this data is imported because of the filters? If so, how does this affect speed?
Ultimately, Tableau can handle as much data as your datasource can handle. If you are set up so Tableau connects to a datasource directly, only the results of a query are transmitted to the user. I've got billion row datasources in BigQuery that return reasonably fast aggregated numbers to Tableau.
If your datasource is not fast then this won't give good results in Tableau.
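To make the "only the results of a query are transmitted" point concrete, here is a small sketch. It is not what Tableau literally sends; it just shows, with Python's built-in sqlite3 standing in for the datasource and made-up table and column names, how an aggregate query over many detail rows returns only one row per displayed cell.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (week INTEGER, product TEXT, amount REAL)")

# 4 weeks x 5 products x 1,000 detail records each = 20,000 source rows.
rows = [
    (week, f"product_{p}", 1.0)
    for week in range(1, 5)
    for p in range(5)
    for _ in range(1000)
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

source_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# The kind of query a "total per week per product" view boils down to.
aggregated = conn.execute(
    "SELECT week, product, SUM(amount) FROM sales GROUP BY week, product"
).fetchall()

print(source_rows, len(aggregated))
```

The datasource does the work of scanning every detail row; the client only receives the grouped result, which is why datasource speed matters more than table size.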
If you are using extracts, where, in effect, Tableau pulls all the data locally, things will usually be faster, but you will have local drive and memory limits on the size of the dataset. Each user will also need an extract, unless you are using Tableau Server, in which case the extract can be on the server.
Dashboards built on big datasources sometimes get slow when there are a lot of filters because populating each filter requires a datasource query (which may be triggered every time you use a filter). There are strategies to speed up dashboards with this problem by using partial extracts that generate all the values used for filtering (you can sometimes use parameters for a similar speed gain). Or even just designing the filters intelligently. But speed is usually the limiting factor not the size of the source table.
The only real limit on how much Tableau can handle is how many points are displayed, and that depends on RAM. In my experience a 4GB machine will choke on a chart with a couple of million points (e.g. a map plotting every postcode in the UK). But on a 16GB RAM machine I have never found a limit other than how fast the points are drawn.