Storing Data for Google Data Studio - sql

I have an application that makes reports available by http in either CSV or JSON format. I want this data to be accessible to Google Data Studio. I was considering building a connector to access the data, but the number of rows that can be accessed at any given time is quite small and there is a daily data limit. So I want to build a system to download the reports daily and store them to be accessed by Data Studio. I created a script to load the reports into a Google Cloud SQL but this is quite expensive because of the base cost of running a Google Cloud SQL machine. Any ideas how else to deal with a situation like this?

You can use firebase Realtime Database.
I used it before for storing 1G Data and 20k rows.
I have code samples for that.

Related

what is the difference between BigQuery and Storage on GCP?

Hi guys I am using GCP for the first time and while I walking through the a project's cloud function example with the mock data, I got confused about similarities/differences of each one and I would like more clarity of what makes them different because to me they seem so similar.
BigQuery is a data warehouse and a SQL Engine. You can use it to store tabular data in datasets and tables. In the tables you may as well store more complex structures like arrays and JSONs but not files for example.
Cloud Storage is a blob storage, with functionality similar to what you know in your linux/windows machine (saving files, folders, deleting, copying). Of course that in the backend it's nothing like your local file system.
BigQuery is a fully managed and serverless data warehouse. It's like Snowflake or Redshift.
Google Cloud Storage(GCS) is like Amazon S3 or Azure Storage. Storages are for storing data as the name suggests.
You usually use BigQuery to analyze & query data in order to draw some insights. BigQuery is an analytical engine.
You can store images, videos, logs, files, and etc in GCS(Google Cloud Storage), but BigQuery can't.
Google BigQuery belongs to "Big Data as a Service" category of the tech stack, while Google Cloud Storage can be primarily classified under "Cloud Storage".
Some of the features offered by Google BigQuery are:
• All behind the scenes- Your queries can execute asynchronously in the
background, and can be polled for status.
• Import data with ease- Bulk load your data using Google Cloud Storage or stream it in bursts of up to 1,000 rows per second.
• Affordable big data- The first Terabyte of data processed each month is free.
On the other hand, Google Cloud Storage provides the following key features:
• High Capacity and Scalability
• Strong Data Consistency
• Google Developers Console Projects
"High Performance" is the primary reason why developers consider Google BigQuery over the competitors, whereas "Scalable" was stated as the key factor in picking Google Cloud Storage.

Building OLTP DB from Datalake?

I'm confused and having trouble finding examples and reference architecture where someone wants to extract data from an existing data lake (S3/Lakeformation in my case) and build a OLTP datastore that serves as an applications backend. Everything I come across is an OLAP data warehousing pattern (i.e. ETL -> S3 -> Redshift -> BI Tools) where data is always coming IN to the datalake and warehouse rather than being pulled OUT. I don't necessarily have a need for 'business analytics' but I do have a need for displaying graphs in web dashboards with large amounts of time series data points underneath for my websites users.
What if I want to automate pulling extracts of a large dataset in the datalake and build a relational database with some useful data extracts from the various datasets that need to be queried by the hand full instead of performing large analytical queries against a DW?
What if I just want an extract of say, stock prices over 10 years, and just get the list of unique ticker symbols for populating a drop down on a web app? I don't want to query an OLAP data warehouse every time to get this, so I want to have my own OLTP store for more performant queries on smaller datasets that will have much higher TPS?
What if I want to build dashboards for my web app's customers that display graphs of large amounts of time series data currently sitting in the datalake/warehouse. Does my web app connect directly to the DW to display this data? Or do I pull that data out of the datalake or warehouse and into my application DB on some schedule?
My views on your 3 questions:
Why not just use the same ETL solution that is being used to load the datalake?
Presumably your DW has a Ticker dimension that has unique records for each Ticker symbol? What's the issue with querying this as it would be very fast to get the unique Ticker symbols from it?
It depends entirely on your environment/infrastructure and what you are doing with the data - so there is no generic answer anyone could provide you with. If your webapp is showing aggregations of a large volume of data then your DW is probably better at doing those aggregations and passing the aggregated data to your webapp; if the webapp is showing unaggregated data (and only a small subset of what is held in your DW, such as just the last week's data) then loading it into your application DB might make more sense
The pros/cons of any solution would also be heavily influenced by your infrastructure e.g. what's the network performance like between your DW and application?

Handling pictures, documents, etc. (Microsoft Azure)

I am currently in the process of building a SQL database in Microsoft Azure for handling pictures, documents, etc. What is the most efficient/best way of storing data? Uploading the files directly to the DB, or by sourcing the files from something like Azure BLOB? I have read numerous posts about people uploading it directly to the DB, but I am concerned about its efficiency.
Thank you in advance for any replies.
You can store in something like Azure SQL DB for example but I would not recommend it, you should definitely store in Azure Storage (BLOB) and then for reference store in a DB. Azure has multiple relational and NoSQL data stores which are offered as platform services.
I would do two things, use a NoSQL platform data store like Cosmos DB using SQL Core API to store the metadata for the images, here you can use the filename as the partition ID to do a point read (this is very fast read and it would be a very cheap option with blazing fast performance) and secondly I would use Azure CDN to make sure images are accessed via CDN so that they are faster.
Azure CDN has three options; Akamai, Verizon and Microsoft. You can test which CDN is faster from where you are from here: https://cloudharmony.com/speedtest-for-azure
Using the above URL you can also use to test which Azure region is closer to you so to use that region, or test for your end-users and choose the region closer tot them.
I would say storing in Azure BLOBs is a better idea. Imagine you have 100 GB files stored in DB.
It will slow down your query if your table is not designed properly.
Backup & Restore DB will be very slow.
Azure DB is more expensive than Azure BLOB for the same size.
If your total file size is small enough, it doesn't make much difference.

Best approach loading a text file (.txt) to bigquery table

Anyone got any pratical idea with regard to what is the best possible approach to upload a text file to a bigquery table? I have a few zipped text files I need to download from a remote SFTP server and load it into a bigquery table. Should I download it to a google cloud storage and upload it from there to bigquery for faster speed? The text files are about 5GB each and will grow further.
Thank you.
First thing to consider if you are loading files from your local data source is that there are limitations for that, according to the documentation.
Loading data from a local data source is subject to the following limitations:
Wildcards and comma separated lists are not supported when you load
files from a local data source. Files must be loaded individually.
When using the classic BigQuery web UI, files loaded from a local data
source must be 10 MB or less and must contain fewer than 16,000 rows.
Besides that, with this provided above link , there are instructions how to upload your data with Console or CLI.
Nevertheless, using the cloud storage you can take advantage of long term storage, which means that you are not charged by loading data into bigquery instead for storing the data in Cloud Storage. You can read more about it here.
Finally, I would like you to consider two points External and Natives tables in bigquery.
Native tables: tables backed by native BigQuery storage.
External tables: tables backed by storage external to BigQuery. For more
information, see Querying External Data Sources.
In other words, using Native tables you import the full data inside BigQuery. Thus, it tends to me faster when executing data analysis. Meanwhile, external tables do not store data in BigQuery, instead references the data from an external source.
The cost of storing in BigQuery is higher than in Cloud storage. Although, querying external tables is slower than querying against native tables, mainly if the files are significantly large. Lastly, since external tables are pointers to files, you do not have to wait for the data to load.

Tableau data extract refresh from Google BigQuery takes very long

We are very pleased with the combination BigQuery <-> Tableau Server with live connection. However, we now want to work with a data extract (500MB) on Tableau Server (since this datasource is not too big and is used very frequently). This takes too much time to refresh (1.5h+). We noticed that only 0.1% is query time and the rest is data export. Since the Tableau Server is on the same platform and location, latency should not be a problem.
This is similar to the slow export of a BigQuery table to a single file, which can be solved by using "daisy chain" option (wildcards). Unfortunately we can't use similar logic with a Google BigQuery data extract refresh in Tableau...
We have identified some approaches, but are not pleased with our current ideas:
Working with incremental refresh: our existing BigQuery table rows can change: these changes can only be applied in Tableau if you do a full refresh
Exporting the BigQuery table to GCS using the daisy chain option and making a Tableau data extract using the Tableau SDK: this would result in quite some overhead...
Writing a Dataflow job using a custom sink for Tableau Server (data extracts).
Experimenting with a Tableau web connector that communicates directly with the BigQuery API: I don't think this will be faster? I didn't see anything about parallelizing calls with the Tableau web connecector, but I didn't try this approach yet.
We would prefer a non-technical option, to limit maintenance... Is there a way to modify the Tableau connector to make use of the "daisy chain" option for BigQuery?
You've uploaded the data in BigQuery. Can't you just use the input for that load job (a CSV perhaps) as input for Tableau?
When we use Tableau and BigQuery we also notice that extracts are slow but we generally don't do that because you lose BigQuery's power. We start with a live data connection at first, and then (if needed) convert this into a custom query that aggregates that data into a much smaller datasets which extracts in just a few seconds.
Another way to achieve higher performance with BigQuery and Tableau is aggregating or joining tables on beforehand. JOINs on huge tables can be slow, so if you use a lot of those you might consider generating a denormalised dataset which does all of the JOIN-ing first. You will get a dataset with a lot of duplicates and a lot of columns. But if you select only what you need in Tableau (hide unused fields!) then these columns won't count in your query cost.
One recommendation I have seen is similar to your point 2 where you export the BQ table to Google Cloud Storage and then use the Tableau Extract API to create a .tde from the flat files in GCS.
This was from an article on the Google Cloud site so I'd assume it would be best practice:
https://cloud.google.com/blog/products/gcp/the-switch-to-self-service-marketing-analytics-at-zulily-best-practices-for-using-tableau-with-bigquery
There is an article here which provides a step by step guide to achieving the above.
https://community.tableau.com/docs/DOC-23161
It would be nice if Tableau optimised the BQ connector for extract refresh using the BigQuery Storage API. We too have our Tableau Server environment in the same GCP zone as our BQ datasets and experience slow refresh times.