Import data from BigQuery to Cloud Storage in different project - google-bigquery

I have two projects under the same account:
projectA with BigQuery and projectB with Cloud Storage
projectA has a BigQuery dataset and table: testDataset.testTable
projectB has a Cloud Storage bucket: testBucket
I use Python and the Google Cloud REST API
There are separate account key credentials for each project, with different permissions: the projectA key has permissions only for BigQuery; the projectB key has permissions only for Cloud Storage
What I need:
import data from projectA testDataset.testTable to projectB testBucket
Problems
Of course, I get a Permission denied error when I try to do this, because the projectA key has no permissions on projectB's storage, and so on.
Another strange issue: since I already have testBucket in projectB, I can't create a bucket with the same name in projectA, and I get:
This bucket name is already in use. Bucket names must be globally
unique. Try another name.
So it looks like all the projects are connected, which I guess means it should be possible to import data from one project to another via the API.
What can I do in this case?

You have this set up the wrong way. To work across projects, a single account needs access in both of them: there needs to be one user (or service account) that is authorized for the BigQuery work in projectA and also for the Cloud Storage bucket in projectB.
Also, bucket names must be globally unique. The namespace is global, not per project: once you have reserved a name, nobody (including your own other project) can create another bucket with it.
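Since the question mentions Python and the REST API, here is a minimal sketch of the request body for a BigQuery extract job (jobs.insert) that reads testDataset.testTable in projectA and writes to testBucket. It assumes the credentials used to submit the job already have BigQuery access in projectA and write access to the bucket in projectB. The helper name build_extract_job and the object name testTable-*.csv are made up for illustration, and the sketch only builds the JSON body; it does not authenticate or send anything.

```python
import json


def build_extract_job(project_id, dataset_id, table_id, destination_uri, fmt="CSV"):
    """Build the JSON body for a BigQuery jobs.insert extract job.

    Hypothetical helper: it only assembles the request body. Submitting it
    requires an authenticated call made with an account that can read the
    table and write to the destination bucket.
    """
    return {
        "configuration": {
            "extract": {
                "sourceTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # The bucket may live in another project, as long as the
                # caller has write access to it.
                "destinationUris": [destination_uri],
                "destinationFormat": fmt,
            }
        }
    }


body = build_extract_job(
    "projectA", "testDataset", "testTable", "gs://testBucket/testTable-*.csv"
)
print(json.dumps(body, indent=2))
```

The job itself would be submitted to projectA, so that is where the BigQuery permission is needed; the bucket only has to accept writes from the same account.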

Related

Attempting to Read parquet files on linked storage in Azure Synapse

I am attempting to give access to parquet files on a Gen2 Data Lake container. I have owner RBAC on the container but would prefer to limit access in the container for other users.
My Query is very simple:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet',
FORMAT='PARQUET'
) AS [result]
When I run this I have no problems connecting. I have attempted to add ACL rights to the files (and of course to the containing folders 'Top' and 'Sub').
I've given RWX on the 'Top' folder using Storage Explorer, and set it as default so that it cascades to the 'Sub' folder and to the parquet files as I add them.
When my colleague attempts to run the SQL script, they get this error message: Failed to execute query. Error: File 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet' cannot be opened because it does not exist or it is used by another process.
NB similar results are also experienced in Spark but with a 403 instead
SQL on-demand provides a link to the following help file after the error, it suggests:
If your query fails with the error saying 'File cannot be opened because it does not exist or it is used by another process' and you're sure both file exist and it's not used by another process it means SQL on-demand can't access the file. This problem usually happens because your Azure Active Directory identity doesn't have rights to access the file. By default, SQL on-demand is trying to access the file using your Azure Active Directory identity. To resolve this issue, you need to have proper rights to access the file. Easiest way is to grant yourself 'Storage Blob Data Contributor' role on the storage account you're trying to query.
I don't wish to grant Storage Blob Data Contributor or Storage Blob Data Reader as this gives access to every file on the container and not just those I want end users to be able to query. We have found the same experience occurs for SSMS connecting to parquet external tables.
So then in parts:
Is this the correct pattern using ACL to grant access, or should I use another method?
Are there settings on the Storage Account or within my query/notebook that I should be enabling to support ACL?*
Has ACL been implemented on Synapse Workspace to date given that we're still in preview?
*I have resisted pasting my entire settings as I really have no idea what is relevant and what entirely irrelevant to this issue but of course can supply.
It would appear that the ACL feature was not working correctly in Preview for Azure Synapse Analytics.
I have now managed to get it to work. At present I see that once Read|Execute is provided to a folder it allows access to the files contained within that folder and sub folders. Access is available even when no specific ACL access is provided on a file in a sub folder. This is not quite what I expected however it provides enough for me to proceed: only giving access to the Gold folder allows for separation of access to the files I want to let users query and the working files that I want to keep hidden.
When you assign ACL to a folder it's not propagated recursively to all files inside the folder. Only new files inherit from the folder.
You can see this here
Go to Azure Storage Explorer, change the ACL permissions on the root folder, then right-click on your storage and click "Propagate Access Control Lists".
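The inheritance behaviour described above can be illustrated with a toy model in Python. This is not the Azure SDK and it ignores details such as masks and umask; the Node class and the propagate function are hypothetical and only mimic the rule that a default ACL is copied to children at creation time, while existing children keep their old ACLs until something propagates the change.

```python
class Node:
    """A folder or file in a toy hierarchy (illustration only, not Azure code)."""

    def __init__(self, name, is_dir, parent=None):
        self.name = name
        self.is_dir = is_dir
        self.access_acl = {}   # principal -> permissions, e.g. "r-x"
        self.default_acl = {}  # only meaningful on folders
        self.children = []
        if parent is not None:
            # A new child copies the parent's default ACL at creation time.
            self.access_acl = dict(parent.default_acl)
            if is_dir:
                self.default_acl = dict(parent.default_acl)
            parent.children.append(self)


def propagate(folder):
    """Push the folder's ACLs down to every existing descendant."""
    for child in folder.children:
        child.access_acl.update(folder.access_acl)
        if child.is_dir:
            child.default_acl.update(folder.default_acl)
            propagate(child)


top = Node("Top", is_dir=True)
old_file = Node("old.parquet", is_dir=False, parent=top)  # exists before the ACL change

top.access_acl["colleague"] = "r-x"
top.default_acl["colleague"] = "r-x"

new_file = Node("new.parquet", is_dir=False, parent=top)  # created after the change

print("old file before propagate:", old_file.access_acl)
print("new file:", new_file.access_acl)
propagate(top)
print("old file after propagate:", old_file.access_acl)
```

In this model the file created before the ACL change stays inaccessible to the colleague until propagate runs, which matches the symptom in the question.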

Best practice to make S3 file accessible for Redshift through COPY operation for anyone

I want to publish a tutorial in which data from a sample TSV file in S3 is used by Redshift. Ideally, following the exercises step by step should be a simple copy-paste operation, similar to what's in Load Sample Data from Amazon S3. The problem is with the first data import task using the COPY command, as it only supports S3 or EMR-based loads.
This seems like a simple requirement, but there is no hassle-free way to do it with Redshift COPY (I can make the file available for browser download without any problem, but COPY requires the CREDENTIALS parameter…)
The variety of options for Redshift COPY authorization parameters is quite rich:
Should I ask users to Create an IAM Role for Amazon Redshift themselves?
Should I create it myself and publish the IAM role ARN? That sounds the most hassle-free (copy-paste), but security-wise it doesn't sound good…? Do I need to restrict S3 permissions to limit that role's access to only that particular file?
Should I try temporary access instead?
You are correct:
Data can be imported into Amazon Redshift from Amazon S3 via the COPY command
The COPY command requires permission to access the data stored in Amazon S3. This can be granted either via:
Credentials (Access Key + Secret Key) associated with an IAM User, or
An IAM Role
You cannot create a Role for people and let them use it, because their Amazon Redshift cluster will be running in a different AWS Account than your IAM Role. You could possibly grant trust access so that other accounts can use the role, but this is not necessarily a wise thing to do.
As for credentials, they could use either their own or ones that you supply. They can access their own Access Key + Secret Key in the IAM console.
If you wish to supply credentials for them to use, you could create an IAM User that has permission only to access the Amazon S3 files they need. It is normally unwise to publish your AWS credentials because they might expose a security hole, so you should think carefully before doing this.
At the end of the day, it's probably best to show them the correct process so they understand how to obtain their own credentials. Security is very important in the cloud, so you would also be teaching them good security practice, in addition to Amazon Redshift itself.
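If you do go the route of an IAM User restricted to the tutorial file, the policy could look roughly like the sketch below, generated here in Python. The bucket and key names are placeholders, and the s3:ListBucket statement is included on the assumption that COPY needs to list the prefix as well as read the object; check that against the current Redshift documentation before relying on it.

```python
import json


def single_object_read_policy(bucket, key):
    """Build an IAM policy document that allows reading one S3 object.

    Hypothetical helper for illustration. The ListBucket statement is an
    assumption about what COPY needs; it is limited to the exact key.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{key}",
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringEquals": {"s3:prefix": key}},
            },
        ],
    }


# Placeholder names, not taken from the question.
policy = single_object_read_policy("my-tutorial-bucket", "sample/data.tsv")
print(json.dumps(policy, indent=2))
```

Even with a policy this narrow, publishing the keys means anyone can read that file at your expense, so the advice above about thinking carefully still applies.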

Terraform Shared State

Terraform 0.9.5.
I am in the process of putting together a group of modules that our infrastructure team and automation team will use to create resources in a standard fashion and in turn create stacks to provision different envs. All working well.
As with all teams using Terraform, shared state becomes a concern. I have configured Terraform to use an S3 backend that is versioned and encrypted, and added a lock via a DynamoDB table. Perfect. All works with local accounts... Okay, the problem...
We have multiple aws accounts, 1 for IAM, 1 for billing, 1 for production, 1 for non-production, 1 for shared services etc... you get where I am going. My problem is as follows.
I authenticate as a user in our IAM account and assume the required role. This had been working like a dream until I introduced the Terraform backend configuration to use S3 for shared state. It looks like the backend config within Terraform requires default credentials to be set within ~/.aws/credentials. It also looks like these have to belong to a user that is local to the account where the S3 bucket was created.
Is there a way to set up the backend configuration so that it uses the creds and role configured within the provider? Is there a better way to configure shared state and locking? Any suggestions welcome :)
Update: Got this working. I created a new user within the account where the S3 bucket is created. Created a policy to allow that new user just s3:DeleteObject, s3:GetObject, s3:PutObject, s3:ListBucket and dynamodb:* on the specific S3 bucket and DynamoDB table. Created a custom credentials file and added a default profile with the access and secret keys assigned to that new user. Used a backend config similar to:
terraform {
  required_version = ">= 0.9.5"

  backend "s3" {
    bucket                  = "remote_state"
    key                     = "/NAME_OF_STACK/terraform.tfstate"
    region                  = "us-east-1"
    encrypt                 = "true"
    shared_credentials_file = "PATH_TO_CUSTOM_CREDENTAILS_FILE"
    lock_table              = "MY_LOCK_TABLE"
  }
}
It works but there is an initial configuration that needs to happen within your profile to get it working. If anybody knows of a better setup or can identify problems with my backend config please let me know.
Terraform expects backend configuration to be static, and does not allow it to include interpolated variables as might be true elsewhere in the config due to the need for the backend to be initialized before any other work can be done.
Due to this, applying the same config multiple times using different AWS accounts can be tricky, but is possible in one of two ways.
The lowest-friction way is to create a single S3 bucket and DynamoDB table dedicated to state storage across all environments, and use S3 permissions and/or IAM policies to impose granular access controls.
Organizations adopting this strategy will sometimes create the S3 bucket in a separate "administrative" AWS account, and then grant restrictive access to the individual state objects in the bucket to the specific roles that will run Terraform in each of the other accounts.
This solution has the advantage that, once it has been set up correctly in S3, Terraform can be used routinely without any unusual workflow: configure the single S3 bucket in the backend, and provide appropriate credentials via environment variables to allow them to vary. Once the backend is initialized, use workspaces (known as "state environments" prior to Terraform 0.10) to create a separate state for each of the target environments of a single configuration.
The disadvantage is the need to manage a more-complicated access configuration around S3, rather than simply relying on coarse access control with whole AWS accounts. It is also more challenging with DynamoDB in the mix, since the access controls on DynamoDB are not as flexible.
There is a more complete description of this option in the Terraform S3 backend documentation, under Multi-account AWS Architecture.
If a complex S3 configuration is undesirable, the complexity can instead be shifted into the Terraform workflow by using partial configuration. In this mode, only a subset of the backend settings are provided in config and additional settings are provided on the command line when running terraform init.
This allows options to vary between runs, but since it requires extra arguments to be provided most organizations adopting this approach will use a wrapper script to configure Terraform appropriately based on local conventions. This can be just a simple shell script that runs terraform init with suitable arguments.
This makes it possible to vary, for example, the custom credentials file by providing it on the command line. In this case, state environments are not used; instead, switching between environments requires re-initializing the working directory against a new backend configuration.
The advantage of this solution is that it does not impose any particular restrictions on the use of S3 and DynamoDB, as long as the differences can be represented as CLI options.
The disadvantage is the need for unusual workflow or wrapper scripts to configure Terraform.
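As a rough illustration of the wrapper-script idea, the sketch below builds a terraform init command line with one -backend-config=KEY=VALUE argument per setting. It is written in Python rather than shell, the function name terraform_init_args and the example paths are hypothetical, and it only prints the command instead of running it.

```python
import shlex


def terraform_init_args(backend_settings):
    """Return the argv for `terraform init` with per-run backend settings.

    Hypothetical helper: each setting is passed as -backend-config=KEY=VALUE,
    so the values can differ between environments or accounts.
    """
    args = ["terraform", "init"]
    for key, value in sorted(backend_settings.items()):
        args.append(f"-backend-config={key}={value}")
    return args


# Example values are placeholders; the setting names come from the backend
# block shown in the question.
args = terraform_init_args(
    {
        "key": "NAME_OF_STACK/terraform.tfstate",
        "shared_credentials_file": "/path/to/custom/credentials",
    }
)
print(" ".join(shlex.quote(a) for a in args))
```

A real wrapper would pass this list to subprocess.run and pick the settings based on whatever environment naming convention the team uses.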

Sharing data between several Google projects

A question about Google Storage:
Is it possible to give r/o access to a (not world-accessible) storage bucket to a user from another Google project?
If yes, how?
I want it to backup data to another Google project, for the case if somebody may incidentally delete all storage buckets from our project.
Yes. Access to Google Cloud Storage buckets and objects is controlled by ACLs that allow you to specify individual users, service accounts, groups, or project roles.
You can add users to any existing object through the UI, the gsutil command-line utility, or via any of the APIs.
If you want to grant one specific user the ability to write objects into project X, you need only specify the user's email:
$> gsutil acl ch -u bob.smith@gmail.com:W gs://bucket-in-project-x
If you want to say that every member of the project my-project is permitted to write into some bucket in a different project, you can do that as well:
$> gsutil acl ch -p members-my-project:W gs://bucket-in-project-x
The "-u" means user, "-p" means 'project'. User names are just email addresses. Project names are the strings "owners-", "viewers-", or "editors-" and then the project's ID. The ":W" bit at the end means "WRITE" permission. You could also use O or R or OWNER or READ or WRITE instead.
You can find out more by reading the help page: $> gsutil help acl ch
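Since the question asked for read-only access, the same command with R instead of W would apply. The Python sketch below just assembles such a gsutil acl ch command from its parts, using only the flags and permission letters mentioned above; the helper name gsutil_acl_ch is made up, and it builds the argument list without running gsutil.

```python
ALLOWED_SCOPES = ("-u", "-p")  # user or project, as described above
ALLOWED_PERMISSIONS = ("R", "W", "O", "READ", "WRITE", "OWNER")


def gsutil_acl_ch(scope_flag, entity, permission, bucket_url):
    """Build the argv for `gsutil acl ch` (hypothetical helper, does not run it)."""
    if scope_flag not in ALLOWED_SCOPES:
        raise ValueError(f"unsupported scope flag: {scope_flag}")
    if permission not in ALLOWED_PERMISSIONS:
        raise ValueError(f"unsupported permission: {permission}")
    return ["gsutil", "acl", "ch", scope_flag, f"{entity}:{permission}", bucket_url]


# Read-only access for a single user on a bucket in another project.
cmd = gsutil_acl_ch("-u", "bob.smith@gmail.com", "R", "gs://bucket-in-project-x")
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run if you want to script the grant for several users.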

Permissions to create Entities in Google Datastore via Cloud Console

I'm managing a project running in Google Cloud and have a team working on it. The team members are organized in a Google group that has permission to edit the project. Each team member can start instances, create Container Engine clusters, etc., but cannot create Datastore entities.
When I add a team member directly as an editor of the project (not via the Google group), they are able to create Datastore entities. But I like managing members via the Google group, because I can give selected team members permission to add team members without giving them the owner role on the project.
Is there anything I missed? Or is it just not possible to give project editors added via google group the permission to create entities in the datastore?