Why does Azure offer 3 different kinds of storage account? Is there any major difference between these accounts?
General-purpose v2 accounts
General-purpose v1 accounts
Blob Storage accounts
One of them can already do everything in Azure Storage: blobs, files, tables, and queues.
Any suggestions are appreciated.
There are several differences. Most are in the limits and performance, plus additional features that are only supported on the newer versions.
Some replication options are also not offered for specific storage account types.
Finally, the full overview covers the above and adds the supported services, performance tiers, and access tiers.
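As a rough sketch of the supported-services and access-tier part of that comparison, the lookup below may help. It is a simplified summary, not the full overview: limits, replication options, and performance tiers are deliberately left out, and the names `ACCOUNT_KINDS` and `kinds_supporting` are made up for this illustration.

```python
# Simplified, non-exhaustive summary of the three account kinds.
# Assumption: only supported services and access-tier support are modelled.
ACCOUNT_KINDS = {
    "General-purpose v2": {
        "services": {"blob", "file", "queue", "table"},
        "access_tiers": True,
    },
    "General-purpose v1": {
        "services": {"blob", "file", "queue", "table"},
        "access_tiers": False,
    },
    "Blob Storage": {
        "services": {"blob"},  # block and append blobs only
        "access_tiers": True,
    },
}


def kinds_supporting(service, need_access_tiers=False):
    """Return the account kinds that offer `service`.

    With need_access_tiers=True, only kinds that support access tiers are kept.
    """
    return sorted(
        kind
        for kind, caps in ACCOUNT_KINDS.items()
        if service in caps["services"]
        and (caps["access_tiers"] or not need_access_tiers)
    )
```

For example, `kinds_supporting("table")` leaves out Blob Storage accounts, and `kinds_supporting("blob", need_access_tiers=True)` leaves out General-purpose v1.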
We are in the process of buying a SaaS solution for a busy sales operation. We want to ensure that we can access our data and ingest it into our analytics data lake (some of it in real time). I am looking for advice on what requirements we should have or prefer for vendors and their solutions.
APIs - most vendors mention that they provide APIs for data access. However, what features do the APIs need to have to be suitable for data ingestion into an analytics data lake? For example, Salesforce has a Bulk API; does this mean that if a vendor only offers "lean APIs", they won't work for the data lake use case?
Direct SQL access - should we prefer SaaS solutions that offer single-tenant DBs so that we can obtain direct SQL access?
DB replica - should we expect the vendor to provide a DB replica (if it is single-tenant) that we use as a data store for reporting? Obviously, that means extra cost for us.
Direct SQL access via ODBC - I also read that if a SaaS app is multi-tenant, ODBC/JDBC drivers can be built to access the DB data via SQL, with proper authorization to ensure data security. Would this be a valid request or approach?
Staged tables - should we ask the vendor to stage their DB tables (as files) and load them into our (or their) data lake environment? This would then be a raw data source for analytics and a data archive. My concern is incremental updates.
Are there any other options we should consider, look for in vendor solutions, or request?
Thank you!
You need to provide your requirements (data lake architecture, data latency, etc.) to the vendors and get them to provide the solution that will work with their product.
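To make a requirement such as "incremental updates" concrete for a vendor, it can help to show the access pattern you expect. Below is a minimal sketch of watermark-based incremental extraction. It assumes the vendor API can filter records by a last-modified timestamp; `fetch_changed_since`, `fake_vendor_api`, and the `updated_at` field are hypothetical names used only for this illustration.

```python
from datetime import datetime, timezone


def incremental_pull(fetch_changed_since, watermark):
    """Pull records changed after `watermark` and return (records, new_watermark).

    `fetch_changed_since` is a hypothetical callable standing in for a vendor
    API that accepts a timestamp and returns records modified after it.
    """
    records = fetch_changed_since(watermark)
    if not records:
        # Nothing new: keep the old watermark so the next run retries from it.
        return [], watermark
    new_watermark = max(record["updated_at"] for record in records)
    return records, new_watermark


# Made-up data standing in for the vendor's records.
_ROWS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]


def fake_vendor_api(since):
    """Return the made-up rows modified strictly after `since`."""
    return [row for row in _ROWS if row["updated_at"] > since]
```

If a vendor's API cannot filter on a modification timestamp (or an equivalent change sequence), every load becomes a full extract, which is worth asking about explicitly.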
I have an Azure Synapse workspace that contains a number of pipelines and external tables in the serverless SQL pool, all associated with one particular project.
There are another 2-3 completely separate projects on the way that will require a Synapse toolset.
Should I create a new workspace, or allow them all to share this one?
What are the best criteria to use to decide?
This is probably a bit of an opinion question, and those don't tend to do that well on StackOverflow. That said, I tend to think of Synapse workspaces as similar to an instance of SQL Server, so ask yourself: historically, why would you have used the same SQL instance?
Generally this was where projects had things in common, e.g. the same data, similar permissions (AAD groups), similar HADR requirements, etc., so ask yourself those questions.
Bear in mind you can have multiple databases (dedicated and serverless) within a workspace, but cross-database queries for tables in a dedicated SQL pool are only possible via Spark pools [1]. This could work in your favour if you require separation. Also bear in mind you can connect multiple storage accounts to the workspace. There is no cost overhead to having multiple workspaces, but there is an admin overhead, and there would be a cost implication to duplicating any of your data across multiple lakes, storage accounts, and databases.
One example: we use separate workspaces as environments where there aren't separate dev, test, and UAT Azure subscriptions.
So a few things to consider.
[1] Import the two tables as dataframes, then join them in a Synapse notebook, as per this example.
Hi guys, I am using GCP for the first time. While walking through a project's Cloud Functions example with mock data, I got confused about the similarities and differences between BigQuery and Cloud Storage. I would like more clarity on what makes them different, because to me they seem so similar.
BigQuery is a data warehouse and a SQL Engine. You can use it to store tabular data in datasets and tables. In the tables you may as well store more complex structures like arrays and JSONs but not files for example.
Cloud Storage is blob storage, with functionality similar to what you know from your Linux/Windows machine (saving files, creating folders, deleting, copying). Of course, in the backend it is nothing like your local file system.
BigQuery is a fully managed and serverless data warehouse. It's like Snowflake or Redshift.
Google Cloud Storage (GCS) is like Amazon S3 or Azure Storage. As the name suggests, it is for storing data.
You usually use BigQuery to analyze and query data in order to draw some insights. BigQuery is an analytical engine.
You can store images, videos, logs, and other files in GCS, but you can't store them in BigQuery.
Google BigQuery belongs to "Big Data as a Service" category of the tech stack, while Google Cloud Storage can be primarily classified under "Cloud Storage".
Some of the features offered by Google BigQuery are:
• All behind the scenes - Your queries can execute asynchronously in the background, and can be polled for status.
• Import data with ease - Bulk load your data using Google Cloud Storage or stream it in bursts of up to 1,000 rows per second.
• Affordable big data - The first Terabyte of data processed each month is free.
On the other hand, Google Cloud Storage provides the following key features:
• High Capacity and Scalability
• Strong Data Consistency
• Google Developers Console Projects
"High Performance" is the primary reason why developers consider Google BigQuery over the competitors, whereas "Scalable" was stated as the key factor in picking Google Cloud Storage.
I would really like to use BigQuery for data analytics and developing business intelligence. The only concern is that some of our clients are not comfortable with cloud storage, so we have in-house servers storing their data for all our other processes. As far as I can tell, BigQuery offers no flexibility on storage of datasets aside from specifying which location in the cloud (US or EU) should be used. Is there any way to specify that BigQuery datasets are to be stored in local clusters?
It is not possible to point BigQuery storage to servers outside of Google Cloud. BigQuery supports federated queries over data outside of its internal storage, but that data still needs to be in Google Cloud Storage or on Google Drive (and in the future perhaps on other cloud storage systems).
An alternative to the technically correct answer above: while you cannot specify a storage location outside of Google's infrastructure for BigQuery to access, it is worth noting that BigQuery is the fully managed, externalized version of Google's Dremel, and the open source tool Apache Drill is modeled on Dremel. Drill is essentially a comparable query execution engine, entirely uncoupled from the storage layer that Google uses (Colossus), so it can query data held on your own servers.
We leverage both BigQuery and Drill heavily at my company, and are very happy with both, albeit for different uses.
I have the following set up:
Azure service
Azure SQL database
Azure Table Storage
Azure Blob Storage
I am trying to develop a backup strategy for this service.
The thing is that SQL, Tables, and Blobs must be in sync. In the backup, all three have to be of the same version (backups taken at the same moment). The main problem is that I can only afford several minutes of downtime, not more than that.
What should I do? Maybe there is an existing solution?
Windows Azure Storage supports geo-replication for Blobs, Tables and Queues. Data in the storage account is made durable by replicating transactions across different storage nodes in the same region (LRS) or a secondary region (GRS). GRS is the default redundancy option when creating a storage account. Refer to http://blogs.msdn.com/b/windowsazurestorage/archive/2013/12/11/introducing-read-access-geo-replicated-storage-ra-grs-for-windows-azure-storage.aspx for more details.
If you want to build a custom backup solution then you could use the techniques suggested in the below 2 blogs
1) http://blogs.msdn.com/b/windowsazurestorage/archive/2010/04/30/protecting-your-blobs-against-application-errors.aspx
2) http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/03/protecting-your-tables-against-application-errors.aspx
I am not sure of the exact use case for why you need to back up Azure Tables and Blobs. You can back up all of the above services without downtime; there might be a slight glitch or a performance bottleneck on the SQL database during the backup.
The single-shot answer is to write a custom script that reads the data from Azure Table storage (or the SQL database, or whichever service is required), packages it into an archive, and stores it back.
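A minimal sketch of such a packaging script is below. It only shows the archive step: each source is a hypothetical zero-argument reader function (in practice these would wrap the Azure Table, SQL, or Blob clients), and the `manifest.json` layout is an assumption made for this illustration. It does not by itself guarantee that the three stores were captured at the same moment; the manifest just records one timestamp for the whole archive.

```python
import io
import json
import zipfile
from datetime import datetime, timezone


def build_backup_archive(sources, taken_at=None):
    """Package one snapshot of every source into a single zip archive (bytes).

    `sources` maps a name to a zero-argument callable returning
    JSON-serialisable data. The callables are hypothetical stand-ins for
    readers of Azure Table storage, the SQL database, and so on.
    """
    taken_at = taken_at or datetime.now(timezone.utc)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # One manifest per archive so a restore can tell which snapshot it has.
        manifest = {"taken_at": taken_at.isoformat(), "sources": sorted(sources)}
        archive.writestr("manifest.json", json.dumps(manifest))
        for name, read in sources.items():
            archive.writestr(f"{name}.json", json.dumps(read()))
    return buffer.getvalue()
```

The returned bytes can then be uploaded wherever the archives are kept, for example as a blob named after the `taken_at` timestamp.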
The important thing to decide is where the backups will be stored. Generally speaking, the archives are stored in blob storage. In this case you have to think about where you would store them: if you store them on-premises, you need to account for local storage capacity, outbound bandwidth cost, and the latency of transferring the data from Azure.
PS: cloud storage by itself has a good level of availability and durability, and you can further improve these by enabling geo-replication.