I am working on providing access to large views (sometimes spanning 10+ datasets) without having to provide access to the raw data. I have tested authorized views (resource 'google_bigquery_dataset_access') and they work brilliantly for a small view based on one dataset.
How do I achieve that for bigger ones? I tried adding multiple 'google_bigquery_dataset_access' resources, one for each dataset used by the view, but each one overwrites the previous ones. I also tried multiple dataset_id arguments, which are not allowed, as well as a list within the dataset_id argument. Also, it works only when I give access to the whole dataset and not to a particular view within this dataset.
Env: BQ with Terraform Cloud
I'm currently working on a project in Azure Data Factory, which involves collecting data from a Dataset, using this data to make API calls, and thereafter taking the output of the calls, and posting them to another dataset.
In this way I wish to end up with a dataset containing various different data, that the API call returns to me.
My current difficulty is that I do not know how to make the "Web" activity (which I use to make the API call) save its output to my dataset.
I have tried numerous different solutions found online, however none of them seem to work. I am not sure if the official documentation is outdated or if I'm misunderstanding parts of it. Below I've listed links to the solutions I've tried and failed:
Copy data from a REST source
Copy data from an HTTP source
(among others, including similar posts to mine.)
The current flow in my pipeline is that a "Lookup" activity collects a list of values named "User_ID". These user IDs are passed into a ForEach loop, which makes an API call with the "Web" activity for each User_ID. This is the point in the pipeline where I wish to add an activity (or something else) that can post each of these Web activity outputs into my new dataset.
I've tried to use the "Copy data" activity, but all it seems to do is copy data straight from one dataset to another, without being able to manipulate the data (which I wish to do with my API call).
Does anyone have a solution to how this is done?
Thanks a lot in advance.
I'm not sure why you could not achieve this by following "Copy data from a REST endpoint". I tested the steps below and they work fine. I used the schema mapping feature of the 'Copy data' activity.
For example, I used a sample API http://dummy.restapiexample.com/api/v1/employees as source and for my testing, I used CosmosDB as sink. Of course you can choose any other dataset as per your requirement.
1. Create a 'Linked Service' for the REST API. For simplicity I do not have authentication for this API. Of course, you have that option if required.
2. Create a 'Linked Service' for the target data store. In my case, it is CosmosDB.
3. Create a Dataset for the REST API and link it to the linked service created in #1.
4. Create a Dataset for the data store (in my case CosmosDB) and link it to the linked service created in #2.
5. In the pipeline, add a 'Copy data' activity with the REST dataset created in #3 as source and the dataset created in #4 as sink. In my case I also had to add a schema mapping to select the employees array from the API output and map each field to my data store.
And voila, that's it. When I run the pipeline, it calls the REST API and saves the output in my DB with my desired mapping.
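For reference, the schema mapping from the last step ends up as a translator section in the Copy activity's JSON. The snippet below is only a rough sketch of what that section might look like, built as a Python dict so it can be printed as JSON. The assumption that the employee array sits under a top-level `data` key, the source field names, and the sink column names are all guesses about the sample API and are not confirmed anywhere above.

```python
import json

# Hypothetical schema mapping for the 'Copy data' activity.
# Assumptions (not from the post): the API returns {"data": [ {...}, ... ]}
# and each element has id / employee_name / employee_salary fields.
translator = {
    "type": "TabularTranslator",
    # JSON path of the array whose elements become rows in the sink.
    "collectionReference": "$['data']",
    "mappings": [
        # Paths without a leading "$" are relative to each array element.
        {"source": {"path": "['id']"}, "sink": {"name": "id"}},
        {"source": {"path": "['employee_name']"}, "sink": {"name": "name"}},
        {"source": {"path": "['employee_salary']"}, "sink": {"name": "salary"}},
    ],
}

# Print the fragment so it can be compared with the activity's JSON view.
print(json.dumps(translator, indent=2))
```

If your response nests the array somewhere else, only `collectionReference` and the relative paths should need to change.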
I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"
I want to refer to these objects in the catalog by their S3 URI so that my team can use them. HOWEVER, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?
This is a good question. Kedro has CachedDataSet, which keeps a dataset in memory when it is loaded or used multiple times in the same run. There isn't really an equivalent that persists across runs; in general Kedro doesn't do much persistent stuff.
That said, off the top of my head, I can think of two options that (mostly) replicate or give this functionality:
Use the same catalog in the same config environment but with the TemplatedConfigLoader where your catalog datasets have their filepaths looking something like:
my_dataset:
  filepath: ${base_data}/01_raw/blah.csv
and you set base_data to s3://bucket/blah when running in "production" mode and to local_filepath/data when running locally. You can decide how exactly you do this in your overridden context method (whether it's using local/globals.yml (see the linked documentation above), environment variables, or something else).
Use separate environments, likely local (it's kind of what it was made for!) where you keep a separate copy of your catalog where the filepaths are replaced with local ones.
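To make the first option a bit more concrete, here is a small standalone illustration of what the `${base_data}` substitution amounts to. It is not Kedro's implementation and does not use TemplatedConfigLoader; the `resolve` helper is hypothetical and only uses the standard library's `string.Template` to show the idea.

```python
from string import Template


def resolve(node, values):
    """Recursively substitute ${name} placeholders in all string values."""
    if isinstance(node, dict):
        return {key: resolve(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, values) for value in node]
    if isinstance(node, str):
        # Raises KeyError if a placeholder has no value in `values`.
        return Template(node).substitute(values)
    return node


# The same catalog entry, resolved against two different sets of globals.
catalog = {"my_dataset": {"filepath": "${base_data}/01_raw/blah.csv"}}

production = resolve(catalog, {"base_data": "s3://bucket/blah"})
local = resolve(catalog, {"base_data": "local_filepath/data"})

print(production["my_dataset"]["filepath"])
print(local["my_dataset"]["filepath"])
```

The catalog stays identical in both modes; only the value of base_data differs.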
Otherwise, your next best bet is to write a PersistentCachedDataSet similar to CachedDataSet which intercepts the loading/saving for the wrapped dataset and makes a local copy when loading for the first time in a deterministic location that you look up on subsequent loads.
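A minimal sketch of that idea, under some loud assumptions: the class below is not a Kedro class and does not subclass AbstractDataSet. It wraps any object that exposes `load()` and `save(data)`, and the `cache_key` and `cache_dir` parameters are hypothetical names chosen for illustration. Data is mirrored with pickle, which will not suit every dataset type, and there is no invalidation, so a changed remote file is not noticed.

```python
import hashlib
import pickle
from pathlib import Path


class PersistentCachedDataSet:
    """Wrap a dataset and keep a local pickle copy that survives across runs."""

    def __init__(self, dataset, cache_key, cache_dir=".dataset_cache"):
        self._dataset = dataset
        # Deterministic cache location derived from the key (e.g. the S3 URI).
        digest = hashlib.sha256(cache_key.encode("utf-8")).hexdigest()
        self._cache_path = Path(cache_dir) / f"{digest}.pkl"

    def load(self):
        # Serve the local copy if a previous run already fetched the data.
        if self._cache_path.exists():
            with self._cache_path.open("rb") as handle:
                return pickle.load(handle)
        data = self._dataset.load()
        self._write_cache(data)
        return data

    def save(self, data):
        # Write through to the wrapped dataset, then refresh the local copy.
        self._dataset.save(data)
        self._write_cache(data)

    def _write_cache(self, data):
        self._cache_path.parent.mkdir(parents=True, exist_ok=True)
        with self._cache_path.open("wb") as handle:
            pickle.dump(data, handle)
```

To use something like this from the catalog you would presumably have to turn it into a proper Kedro dataset (implementing `_load`, `_save` and `_describe`) and pass the S3 filepath as the cache key.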
I accidentally overwrote a saved project query in BQ with a completely unrelated query. I can't find any documentation about retrieving overwritten queries or about any sort of version control. Has anyone done this as well and recovered their query?
Unfortunately, "Saved Query" is a UI-internal feature (see "How to access “Saved Queries” programmatically?"; there is also a related feature request, "REST API for Saved Queries"), so we really have no way to manage or control such cases.
In the meantime, you can use the query history (in the UI, via the respective API, or in Stackdriver) to locate a past run of that query and then recreate and re-save it.
My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. Any of the integrations offered by the ETL tool we have improved using our own python code, so I really just need its scheduling ability. One option we are looking at is Data Pipeline, which I am currently piloting.
My problem is thus: imagine we have two datasets to load - products and sales. Each of these datasets requires a number of steps to load (get source data, call a python script to transform, load to Redshift). However, product needs to be loaded before sales runs, as we need product cost, etc to calculate margin. Is it possible to have a "master" pipeline in Data Pipeline that calls products first, waits for its successful completion, and then calls sales? If so, how? I'm open to other product suggestions as well if Data Pipeline is not well-suited to this type of workflow. Appreciate the help
I think I can relate to this use case. Anyhow, Data Pipeline does not do this kind of dependency management on its own. It can, however, be simulated using file preconditions.
In this example, your child pipelines may depend on a file being present (as a precondition) before starting. A Master pipeline would create trigger files based on some logic executed in its activities. A child pipeline may create other trigger files that will start a subsequent pipeline downstream.
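As a rough sketch of the trigger-file idea, the fragments below show pipeline definition objects written as Python dicts and dumped to JSON. The bucket, key, script name, and object ids are all hypothetical, and the fragments are incomplete (no schedule, resources, or roles), so treat them as an illustration of the wiring and not as a deployable definition.

```python
import json

# Hypothetical trigger file that the master/product pipeline writes on success.
TRIGGER_KEY = "s3://my-bucket/triggers/product/_SUCCESS"

# Master pipeline side: a final activity that creates the trigger file.
write_trigger = {
    "id": "WriteProductTrigger",
    "type": "ShellCommandActivity",
    "command": f"echo done | aws s3 cp - {TRIGGER_KEY}",
}

# Child pipeline side: a precondition that waits for the trigger file.
product_loaded = {
    "id": "ProductLoaded",
    "type": "S3KeyExists",
    "s3Key": TRIGGER_KEY,
}

# The child activity only starts once the precondition is satisfied.
sales_transform = {
    "id": "SalesTransform",
    "type": "ShellCommandActivity",
    "command": "python transform_sales.py",  # hypothetical script
    "precondition": {"ref": product_loaded["id"]},
}

child_fragment = {"objects": [product_loaded, sales_transform]}
print(json.dumps(child_fragment, indent=2))
```

You would also need some convention for removing or dating the trigger file so that yesterday's file does not release today's run.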
Another solution is to use the Simple Workflow product. It has the features you are looking for, but would need custom coding using the Flow SDK.
This is a basic use case of Data Pipeline and should definitely be possible. You can use the graphical pipeline editor to create this pipeline. Breaking down the problem:
There are two datasets:
Product
Sales
Steps to load these datasets:
Get source data: Say from S3. For this, use S3DataNode.
Call a Python script to transform: Use ShellCommandActivity with staging. Data Pipeline stages data implicitly for S3DataNodes attached to a ShellCommandActivity. You can access the staged data through the special environment variables provided (see the documentation for details).
Load output to Redshift: Use RedshiftDatabase (typically together with a RedshiftDataNode and a RedshiftCopyActivity).
You will need to add the above components for each dataset you work with (product and sales in this case). For easy management, you can run these on an EC2 instance.
Condition: 'product' needs to be loaded before 'sales' runs
Add dependsOn relationship. Add this field on ShellCommandActivity of Sales that refers to ShellCommandActivity of Product. See dependsOn field in documentation. It says: 'One or more references to other Activities that must reach the FINISHED state before this activity will start'.
Tip: In most cases, you would not want your next day execution to start while previous day execution is still active aka RUNNING. To avoid such a scenario, use 'maxActiveInstances' field and set it to '1'.
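Putting the pieces together, the transform activities could look roughly like the sketch below, again written as Python dicts that are dumped to pipeline definition JSON. The object ids, script names, and the `Ec2Instance` resource reference are hypothetical, and the S3 data nodes, Redshift objects, and schedule are left out for brevity, so this only illustrates dependsOn, staging, and maxActiveInstances.

```python
import json


def transform_activity(dataset, depends_on=None):
    """Build a ShellCommandActivity object for one dataset (hypothetical ids)."""
    activity = {
        "id": f"{dataset}Transform",
        "type": "ShellCommandActivity",
        "input": {"ref": f"{dataset}Source"},        # an S3DataNode, not shown
        "output": {"ref": f"{dataset}Transformed"},  # an S3DataNode, not shown
        "stage": "true",
        # Staged input/output directories are exposed as environment variables.
        "command": (
            f"python transform_{dataset.lower()}.py "
            "${INPUT1_STAGING_DIR} ${OUTPUT1_STAGING_DIR}"
        ),
        "runsOn": {"ref": "Ec2Instance"},
        # Do not start the next scheduled run while one is still active.
        "maxActiveInstances": "1",
    }
    if depends_on is not None:
        # This activity waits until the referenced one reaches FINISHED.
        activity["dependsOn"] = {"ref": depends_on}
    return activity


product = transform_activity("Product")
sales = transform_activity("Sales", depends_on=product["id"])

print(json.dumps({"objects": [product, sales]}, indent=2))
```

Depending on where in the product chain the cost data becomes available, you may prefer to point dependsOn at the product Redshift load instead of the product transform.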
I'm pretty new to Core Data, and I'm trying to wrap my head around it.
You might have cases where you want different types of data stored in different places or with different behaviors. For example you might have one read only sqlite store shipped as part of your app containing some default data, an additional store for updates to that data set you have downloaded from a server, and a third for user data. Alternately you might have a case where you want some objects to be persisted while others can live in an in-memory store and do not need to be saved between uses of the app.