What are data volumes?
How would you define them?
How would you calculate the data volumes for a website?
In SAP, data volumes are the spaces defined in SAP to store data or log information.
More generally, the English word volume means amount. A data volume is simply the amount of data in a file or database.
You would calculate the amount of data storage for a website by figuring out how much data comes in per month and multiplying that by the number of months over which you expect your website to grow.
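For example, here is a minimal sketch of that projection; the intake and growth figures are made-up placeholders, not measurements from any real site.

```python
# Rough storage projection. All numbers are placeholder assumptions.
monthly_intake_gb = 50        # data added per month today (assumed)
monthly_growth_rate = 0.05    # intake grows 5% month over month (assumed)
months = 24                   # planning horizon

total_gb = 0.0
intake = monthly_intake_gb
for _ in range(months):
    total_gb += intake
    intake *= 1 + monthly_growth_rate

print(f"Projected storage after {months} months: {total_gb:.0f} GB")
```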
Most websites just add disk storage as needed rather than attempt to predict how much will be needed in the future. If you're Google or Facebook, you just plan to add disk storage space constantly.
This article may help you, but I would still migrate this question to the Webmasters site on stackexchange.com if you're asking about a data volume calculation for a website.
For compliance requirements, we would like to move all our BigQuery data and GCS data from the US region to the EU region.
From my understanding, multi-region is either within the US or within the EU. There is no cross-region option as such.
Question 1: In order to move the data from the US to the EU or vice versa, our understanding is that we need to explicitly move the data using a storage transfer service. And we assume there is a cost associated with this movement even though it stays within Google Cloud?
Question 2: We are also considering maintaining copies at both locations. In this case, is there a provision for cross-region replication? If so, what would be the associated cost?
Question 1:
You are moving data from one part of the world to another, so yes, you will pay the egress cost of the source location.
Sadly, today (November 28th, 2023), I can't 100% commit on that cost. I reached out to Google Cloud about a very similar question and my Google Cloud contact told me that the cost page was out of date. The Cloud Storage egress cost should apply (instead of the Compute Engine networking egress cost that the documentation currently lists).
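For budgeting, a back-of-the-envelope estimate looks like the sketch below; the per-GB rate and data volume are placeholders precisely because the pricing documentation was in flux, so replace them with the current Cloud Storage egress rates.

```python
# Back-of-the-envelope egress cost estimate for a US -> EU move.
# Both numbers are placeholders, NOT official prices or real volumes.
data_tb = 10                   # volume to move (assumed)
egress_usd_per_gb = 0.08       # placeholder rate (assumed)

data_gb = data_tb * 1024
estimated_cost = data_gb * egress_usd_per_gb
print(f"Moving {data_tb} TB at ${egress_usd_per_gb}/GB is roughly ${estimated_cost:,.2f}")
```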
Question 2:
You copy the data, so in the end you have the volume of data duplicated across 2 datasets, and your storage cost is duplicated as well.
Every time you want to sync the data, you perform a copy. It's only a copy, not a smart delta update. So be careful if you update the data directly in the target dataset: a new copy will overwrite it!
Instead, use the target dataset as a base to query, and duplicate (again) the data into an independent dataset, where you can add your region-specific data.
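If you script that periodic copy, a minimal sketch with the BigQuery Data Transfer Service Python client (the cross_region_copy data source) could look like this; the project, dataset names, and schedule are placeholders, and you should verify the client version, quotas, and pricing for your own setup.

```python
# Minimal sketch: scheduled cross-region dataset copy via the BigQuery
# Data Transfer Service Python client. Project and dataset names are
# placeholders; verify quotas, pricing, and client version for your setup.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics_eu",       # EU dataset (assumed name)
    display_name="US to EU dataset copy",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "my-project",       # placeholder
        "source_dataset_id": "analytics_us",     # placeholder
        "overwrite_destination_table": "true",   # each run overwrites the target
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```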
According to the docs, once the dataset is created, the location cannot be changed, but you can copy the dataset to a different location, or manually move (recreate) the dataset in a different location.
The easier approach is to copy; you can learn more about the requirements, quotas and limitations here: https://cloud.google.com/bigquery/docs/copying-datasets
So:
There is no need for the transfer service, you can copy datasets to a different location.
There is no mechanism for automatic replication across regions. Even a disaster recovery policy will require cross-region dataset copies.
BigQuery does not automatically provide a backup or replica of your data in another geographic region. You can create cross-region dataset copies to enhance your disaster recovery strategy.
https://cloud.google.com/bigquery/docs/availability#:%7E:text=cross%2Dregion%20dataset%20copies
So in both cases you need to work with dataset copies and deal with data freshness in the second scenario.
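To confirm where a dataset currently lives before planning the copy, a quick check with the BigQuery Python client is enough; the project and dataset reference below is a placeholder.

```python
# Quick check of a dataset's (immutable) location before planning a copy.
# The project/dataset reference is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.analytics_us")  # placeholder reference
print(f"{dataset.dataset_id} lives in: {dataset.location}")
```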
If I pull the data from BigQuery, will Google charge me or not for sending the data to Data Studio?
That depends. BigQuery uses a consumption-based model unless you have purchased slots. What that means is, any time you run a query you're utilizing resources and getting charged at the defined rate of $5 per TB of data scanned.
There are a few caveats, however: the first TB of data scanned per month is free, and not every query will scan data, since it may be served from cache. If you are concerned about the associated cost, one option would be to use the BigQuery sandbox. It will not charge you, but it has limited functionality.
https://cloud.google.com/bigquery/docs/quickstarts/quickstart-cloud-console
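If you want to see how much a given query would scan before Data Studio runs it, a dry run with the BigQuery Python client reports the bytes processed without incurring any cost; the table name below is a placeholder and the $5/TB figure mirrors the rate mentioned above, so check current on-demand pricing for your region.

```python
# Dry-run a query to see how many bytes it would scan (no cost incurred).
# The table name is a placeholder; the $5/TB rate mirrors the figure above.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT country, COUNT(*) AS visits "
    "FROM `my-project.web.events` GROUP BY country",
    job_config=job_config,
)

tb_scanned = job.total_bytes_processed / 1024**4
print(f"Would scan {tb_scanned:.4f} TB, roughly ${tb_scanned * 5:.2f} on-demand")
```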
BigQuery runs queries that you pay for.
Data Studio runs queries on BigQuery that you pay for.
There is no cost of transfer between the two systems.
I am trying to download as much information from Bloomberg for as many securities as I can. This is for a machine learning project, and I would like to have the data reside locally, rather than querying for it each time I need it. I know how to download information for a few fields for a specified security.
Unfortunately, I am pretty new to Bloomberg. I've taken a look at the Excel add-in, and it doesn't allow me to specify that I want ALL securities and ALL their data fields.
Is there a way to blanket download data from Bloomberg via Excel? Or do I have to do this programmatically? I appreciate any help on how to do this.
Such a request is unreasonable. Bloomberg has tens of thousands of fields for each security: from fundamental fields like sales, through technical analysis like Bollinger bands, to whether the CEO is a woman and whether the company abides by Islamic law. I doubt all of these interest you.
Also, some fields come in "flavors". Bloomberg allows you to set arguments when requesting a field; these are called "overrides". For example, when asking for an analyst recommendation, you can specify whether you're interested in the yearly or quarterly recommendation, and how you want the recommendation consensus calculated. Are you interested in GAAP or IFRS reporting? What type of insider buys do you want to consider? I hope I'm making it clear: the possibilities are endless.
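To make the overrides point concrete, here is a minimal sketch of a Desktop API request with one override, using the blpapi Python package (it needs a running terminal to connect to); the security, field, and override names are purely illustrative.

```python
# Minimal sketch of a Bloomberg Desktop API request with an override.
# Requires the blpapi package and a running terminal; the security, field,
# and override names are illustrative, not a recommended selection.
import blpapi

session = blpapi.Session()              # defaults to localhost:8194
session.start()
session.openService("//blp/refdata")
service = session.getService("//blp/refdata")

request = service.createRequest("ReferenceDataRequest")
request.getElement("securities").appendValue("IBM US Equity")
request.getElement("fields").appendValue("BEST_TARGET_PRICE")

# The override changes the "flavor" of the field, e.g. the fiscal period.
override = request.getElement("overrides").appendElement()
override.setElement("fieldId", "BEST_FPERIOD_OVERRIDE")
override.setElement("value", "1FY")

session.sendRequest(request)
while True:
    event = session.nextEvent(500)      # wait up to 500 ms per event
    for msg in event:
        print(msg)
    if event.eventType() == blpapi.Event.RESPONSE:
        break
session.stop()
```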
My recommendation, when approaching a project like the one you're describing: think in advance about which aspects of the securities you want to focus on. Are you looking for value? Growth? Technical analysis? News? Then "sit down" with a Bloomberg rep and ask which fields apply to that aspect, and download those fields.
Also, try to reduce your universe of securities. Bloomberg has data for hundreds of thousands of equities, and the total number of securities (including non-equities) is probably many millions. You should reduce that universe to the securities that interest you (only EU? only US? only above a certain market capitalization?). This could make your research more applicable to real life. What I mean is that if you find out that a certain behavior indicates a stock is going to go up, but you can't buy that stock, then that's not that interesting.
I hope this helps, even if it doesn't really answer the question.
They have specific "Data Licence" products available if you or your company can fork out the (likely high) sums of money for bulk data dumps. Otherwise, as has been mentioned, there are daily and monthly restrictions on how much data (and what type of data) can be downloaded via their API. These limits are not very high at all, so by the sounds of your request this will take a long and frustrating time. I think the daily limit is something like 500,000 hits, where one hit is one item of data, e.g. a price for one stock. So if you wanted to download only share price data for the 2,500 or so US stocks, you'd only manage 200 days of prices for each stock before hitting the limit. And they also monitor your usage, so if you were consistently hitting 500,000 each day, you'd get a phone call.
One tedious way around this is to manually retrieve data via the clipboard. You can load a chart of something (GP), right-click and copy the data to the clipboard. This stores all the data points that are on display, which you can dump into Excel. This is obviously an extremely inefficient method but, crucially, it has no impact on your data limits.
Unfortunately you will find no answer to your (somewhat unreasonable) request, without getting your wallet out. Data ain't cheap. Especially not "all securities and all data".
You say you want to download "ALL securities and ALL their data fields." You can't.
You should go to WAPI on your terminal and look at the terms of service.
From the "extended rules:"
There is a daily limit to the number of hits you can make to our data servers via the Bloomberg API. A "hit" is defined as one request for a single security/field pairing. Therefore, if you request static data for 5 fields and 10 securities, that will translate into a total of 50 hits.
There is a limit to the number of unique securities you can monitor at any one time, where the number of fields is unlimited via the Bloomberg API.
There is a monthly limit that is based on the volume of unique securities being requested per category (i.e. historical, derived, intraday, pricing, descriptive) from our data servers via the Bloomberg API.
These may be a few basic questions.
When I load data into BQ tables, where exactly is the data stored (if billing is already enabled)? If it is a data center, what is the data center's capacity? Does our data co-exist with other users' data?
When we fire queries, how are they processed? What is the default compute engine used for this?
How can we increase query processing capacity?
Thanks
CP
BigQuery datacenter capacity is practically unlimited. If you plan to upload petabytes in a very short time frame you might need to contact support first just to make sure, but for normal big loads everything should be fine.
BigQuery doesn't use Compute Engine, but a series of very large clusters where all queries run. That's the secret to a low cost per query, without ongoing per-hour costs like other alternatives.
BigQuery increases the number of CPUs involved in your query elastically as the query needs. You don't need to manage storage nor processing capacity.
We have an MIS that stores all the information about customers, accounts, transactions, etc. We are building a data warehouse with BigQuery.
I am pretty new to this topic. Should we
1. extract ALL of the customers' latest information every day and append it to a BigQuery table with a timestamp,
2. or only extract the customers' information that was updated on that day?
The first solution uses a lot of storage, takes time to upload the data, and produces lots of duplicates, but it makes queries very straightforward for me. For the second solution, given a specific date, how can I get the latest record for that day?
It's similar for account data. Here is an example of a simplified Account table, with only 4 fields:
AccountId, CustomerId, AccountBalance, Date
If I need to build a report or chart of a group of customers' AccountBalance every day, I need to know the balance of each account on every specific date. So should I extract each account record every day, even if it's the same as the day before, or can I extract an account only when its balance changed?
What is the best solution, or what is your suggestion? I prefer the 2nd one because there are no duplicates, but how can I construct the query in BigQuery, and will performance be an issue?
What else should I consider? Any recommendation for me to read?
When designing a DWH you need to start from business questions and translate them into KPIs, measures, dimensions, etc.
When you have those in place, you choose a technology based on some of the following questions (and many more): Who are your users? At what frequency and resolution do they consume the data? What are your data sources? Are they structured? What are the data volumes? What is your data quality? How often does your data structure change? Etc.
When choosing the technology you need to think of the following: ETL, DB, scheduling, backup, UI, permissions management, etc.
After you have all of those defined, the data schema design is pretty straightforward and is derived from the purpose of the DWH and your technology limits.
You have pointed out some of the points to consider, but the answer is based on your needs and is not tied to a specific DB technology.
I am afraid your question is too general to be answered without a deep understanding of your needs.
Referring to your comment below:
How reliable is your source data? Are you interested in analyzing trends or just snapshots? Does your source system allow "select all" operations? What are the data volumes? What resources does your source allow for extraction (locks, bandwidth, etc.)?
If you just need a daily snapshot of the current balance, and your source system imposes no limits, it would be much simpler to run a daily snapshot. This way you don't need to manage "increments", handle data integrity issues and system discrepancies, etc. However, this approach might have an undesired impact on your source system and on your network costs...
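As a minimal sketch of the snapshot approach, the load below appends today's full extract, stamped with the snapshot date, to one date-partitioned BigQuery table; the table, field names, and hard-coded rows are illustrative stand-ins for your real extract.

```python
# Sketch of the "daily snapshot" approach: append today's full extract of
# account balances, stamped with the snapshot date, to a date-partitioned
# table. Table name, fields, and rows are illustrative stand-ins.
import datetime
from google.cloud import bigquery

client = bigquery.Client()
snapshot_date = datetime.date.today().isoformat()

# Rows would normally come from the source MIS; hard-coded here as a stand-in.
rows = [
    {"AccountId": "A1", "CustomerId": "C1", "AccountBalance": 120.5, "Date": snapshot_date},
    {"AccountId": "A2", "CustomerId": "C1", "AccountBalance": 80.0, "Date": snapshot_date},
]

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("AccountId", "STRING"),
        bigquery.SchemaField("CustomerId", "STRING"),
        bigquery.SchemaField("AccountBalance", "FLOAT"),
        bigquery.SchemaField("Date", "DATE"),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(field="Date"),
)
client.load_table_from_json(
    rows, "my-project.dwh.account_snapshots", job_config=job_config
).result()
```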
If you do have resource limits and you choose the incremental ETL approach, you can either create a "changes log" table and query it, using row_number() to find the latest record per account (see the sketch below), or construct a copy of the source accounts table, merging the changes into the existing table every day.
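For the changes-log option, a query along these lines picks the latest record per account as of a given date; the table and column names are illustrative, and partitioning or clustering the changes table by date keeps the scanned volume (and cost) down.

```python
# Sketch for the changes-log option: latest record per account as of a
# given date, using ROW_NUMBER(). Table and column names are illustrative.
import datetime
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT AccountId, CustomerId, AccountBalance, Date
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY AccountId ORDER BY Date DESC) AS rn
  FROM `my-project.dwh.account_changes`   -- illustrative changes-log table
  WHERE Date <= @as_of
)
WHERE rn = 1
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("as_of", "DATE", datetime.date(2024, 1, 31))
    ]
)
for row in client.query(sql, job_config=job_config):
    print(row.AccountId, row.AccountBalance)
```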
Each approach has its own trade-offs in simplicity, cost, and resource consumption...
Hope this helps