I'll receive a list of coupon codes by mail. That list needs to be stored somewhere (BigQuery?) from which I can fetch a code and send it to the user. Each user should only ever get one unique code that has not been handed out before.
So I need the ability to fetch a code and mark it as used, so that the next request gets the next code...
I know this is a fairly vague question, but I'm not sure how to implement it. Does anyone have any ideas?
Thanks in advance.
There can be multiple solutions for the same requirement; one of them is given below:
Step 1. Get the coupons into a file (CSV, JSON, etc.) as per your preference/requirement.
Step 2. Load the source file to GCS (Cloud Storage).
Step 3. Write a Dataflow job that reads the file from GCS and loads the data into a BigQuery table (tentative name: New_data); a sample sketch is given after this list.
Step 4. Create a Dataflow job that reads from the BigQuery table New_data, compares it with History_data, identifies the new coupons, and writes them to a file on GCS or to a BigQuery table.
Step 5. Schedule the entire process with an orchestrator/Cloud Scheduler/cron job.
Step 6. Once you have the data, you can send it to consumers through any communication channel.
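For Step 3, a minimal Apache Beam (Python SDK) sketch could look like the following; the project, bucket, dataset and column names are placeholders, not real resources, and the input is assumed to be a one-column CSV of coupon codes.

```python
# A minimal sketch for Step 3: read coupon codes from a CSV on GCS and
# append them to a BigQuery table. Project, bucket and table names below
# are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                  # assumed project id
        region="us-central1",
        temp_location="gs://my-bucket/tmp",    # assumed staging bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCoupons" >> beam.io.ReadFromText("gs://my-bucket/coupons.csv")
            | "ToRow" >> beam.Map(lambda line: {"coupon_code": line.strip()})
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:coupons.New_data",  # tentative table from Step 3
                schema="coupon_code:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```

Step 4 can follow the same pattern, reading from New_data with beam.io.ReadFromBigQuery and comparing the codes against History_data before writing out only the new ones.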
I wonder if there is a way to count the actual number of accesses to a certain GCP service by analysing the audit logs stored in BigQuery. In other words, I have the audit tables exported via a sink to BigQuery (no direct access to Stackdriver). I can see that a number of rows are generated per single access, i.e. there was one physical access to GCS, but about 10 rows were generated due to the different function calls. I'd like to be able to say how many attempts/accesses were made by a user account by looking at those X rows. This is a data example.
Thank you
Assuming that the sink works correctly and the logs are being exported to BQ, you would need to check the format in which audit logs are written to BQ and the fields that are exported/written to BQ [1].
Then filter the logs by resource type and member; these documents [2][3] can help you with that.
[1] https://cloud.google.com/logging/docs/audit/understanding-audit-logs#sample
[2] https://cloud.google.com/logging/docs/audit/understanding-audit-logs#sample
[3] https://cloud.google.com/logging/docs/audit/understanding-audit-logs#interpreting_the_sample_audit_log_entry
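As a starting point, here is a hedged sketch of such a query run from Python with the google-cloud-bigquery client. The dataset/table name and field paths are assumptions based on the documented export schema in [1], so adjust them to your sink; grouping by member and method first lets you see which function calls inflate the row count for a single physical access.

```python
# A hedged sketch: count audit-log rows per member and method name so you can
# see which API calls produce the extra rows for one physical access. Table
# name and field paths are assumptions based on the standard BigQuery export
# schema for audit logs.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS member,
  protopayload_auditlog.methodName AS method,
  COUNT(*) AS row_count
FROM `my-project.my_audit_dataset.cloudaudit_googleapis_com_data_access_*`
WHERE resource.type = 'gcs_bucket'   -- filter by resource type
GROUP BY member, method
ORDER BY row_count DESC
"""

for row in client.query(query).result():
    print(row.member, row.method, row.row_count)
```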
So I was playing around to see what could be achieved using Database Migration Service (DMS) Change Data Capture, to take data from MSSQL to S3 and also to Redshift.
The Redshift testing was fine: if I delete a record in my source DB, a second or two later the record disappears from Redshift. Same with insert/update, etc.
But S3 ...
You get the original record from the first full load.
Then if you update a record in the source, S3 receives a new copy of the record, marked with a 'U' (inserts are marked with an 'I').
If I delete a record, I get another copy of the record marked with a 'D'.
So my question is: what do I do with all of this?
How would I query my S3 bucket to see the 'current' state of my data set as it is reflected in the source database?
Do I have to script some code myself to pick up all these files and process them, applying the inserts/updates and deletes until I finally resolve back to a 'normal' data set?
Any insight welcomed!
The records containing 'I', 'D' or 'U' are CDC data (change data capture). This is sometimes called "history" or "historical data". This type of data has applications in data warehousing and can also be used in many machine learning use cases.
Now coming to the next point: in order to get the 'current' state of the data set, you have to script/code it yourself. You can use AWS Glue to perform the task. For example, this post explains something similar.
If you do not want to maintain Glue code, then a shortcut is not to use the S3 target with DMS directly, but to use the Redshift target and, once all the CDC is applied, offload the final copy to S3 using the Redshift UNLOAD command (see the sketch below).
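A minimal sketch of that final offload, using the Redshift Data API from Python, might look like the following; the cluster, database, table, bucket and IAM role names are placeholders.

```python
# A minimal sketch of the UNLOAD step mentioned above, via the Redshift Data
# API. Cluster, database, table, bucket and IAM role are assumptions.
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")
rsd.execute_statement(
    ClusterIdentifier="my-cluster",   # assumed cluster id
    Database="dev",
    DbUser="awsuser",
    Sql="""
        UNLOAD ('SELECT * FROM public.my_table')
        TO 's3://my-bucket/current-state/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
        FORMAT AS PARQUET;
    """,
)
```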
As explained here, the 'I', 'U' and 'D' values mark insert, update and delete operations.
What do we do to get the current state of the DB? One approach is to first add this Op column to the full-load files as well, i.e. the initially loaded files written before CDC should also carry this column (for the S3 target this is the includeOpForFullLoad endpoint setting).
Then query the data in Athena in such a way that records where Op is in ('D', 'U'), or AR_H_OPERATION is in ('DELETE', 'UPDATE'), are excluded. This gives you the correct count (only the COUNT, since a 'U' only appears if there is already an 'I' for that entry):
SELECT count(*) FROM "database"."table_name"
WHERE Op NOT IN ('D','U')
Also, to get the full current records (not just the count), you need a more complex SQL query in Athena: drop any key whose latest change is a 'D', and when a key appears more than once (an 'I' followed by one or more 'U' rows), keep only the latest version. A hedged sketch of such a query is given below.
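The sketch runs the query through Athena with boto3. The key column (material_id), the change-ordering column (dms_timestamp, e.g. added via the DMS timestampColumnName setting on the S3 target) and the result bucket are assumptions you will need to replace.

```python
# A hedged sketch of the "latest state" query: rank versions of each key by
# change time, keep the newest, and drop keys whose newest version is a
# delete. Key column, ordering column and result bucket are assumptions.
import boto3

QUERY = """
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY material_id          -- assumed primary key column
           ORDER BY dms_timestamp DESC       -- assumed change-time column
         ) AS rn
  FROM "database"."table_name"
)
SELECT *
FROM ranked
WHERE rn = 1 AND Op <> 'D'   -- latest version only, and not deleted
"""

athena = boto3.client("athena", region_name="us-east-1")
resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)
print("Started query:", resp["QueryExecutionId"])
```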
Is there an option to log the details of a Copy activity to a database table?
I want to log the file name and path that were generated, the pipeline ID that generated the file, how long it took to copy the file, the number of rows copied, and the size of the file created, plus a few more.
How can all of this be achieved?
You could reference these two links:
1. https://learn.microsoft.com/en-us/azure/data-factory/monitor-visually
2. https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically
The first one describes what information you can get; the second explains the different ways to retrieve that information programmatically (a sketch is given below).
Currently, though, I don't think there is a way to get the file name and path directly, but you could leverage user properties. Please see this post: https://social.msdn.microsoft.com/Forums/azure/en-US/8692cd00-307b-4204-a547-bed2030cb762/adfv2-user-property-setting?forum=AzureDataFactory
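As a hedged sketch of the programmatic route from the second link, the azure-mgmt-datafactory SDK can pull a copy activity's run details, which you can then write to your own database table. The resource group, factory name and pipeline run id below are placeholders, and the exact keys present in the activity output depend on the copy source and sink.

```python
# A hedged sketch: query activity runs for a pipeline run and print the copy
# activity's duration and output counters. Resource names are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<factory-name>", "<pipeline-run-id>",
    RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    ),
)

for run in runs.value:
    if run.activity_type == "Copy":
        # run.output typically carries counters such as rowsCopied and
        # dataWritten; persist whichever fields you need to your log table.
        print(run.activity_name, run.duration_in_ms, run.output)
```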
How do I read and process all the rows read from an Excel file in a JavaScript step?
I have the following transformation.
The first step reads all the rows from a spreadsheet, and then I post them to a REST service. I wait for the REST operation to complete before entering the JS step. In the JS step, I want to process all the records at once and save them as one single record to XML.
To give some more clarity on my requirement, here is a scenario. I have an input file with two columns: material number and quantity. After the JavaScript step, I need to post the data read from the spreadsheet to another service. This service returns the free goods associated with the input, but to get a free good I need a combination of materials. For example, if the input is a TV and a DVD player, I get something for free; I won't get anything for free if I pass only the TV or only the DVD player. So in this case my data is:
| Material   | Qty |
| ---------- | --- |
| TV         | 1   |
| DVD Player | 1   |
My REST service has the following structure.
{
  "items": [
    {
      "material": "TV",
      "quantity": "1"
    },
    {
      "material": "DVD Player",
      "quantity": "1"
    }
  ]
}
Any input on how to achieve this would be really valuable. Thank you.
One way of addressing this in a transformation is to use a grouping step: try the Unique Rows (HashSet) step or even the Group By step.
Such a step replaces the Blocking step and waits until all rows are processed and grouped.
Below is a sample (3 ways).
As an example, the Unique Rows (HashSet) step can read all the fields coming from the JavaScript step and make them distinct after reading the entire dataset.
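Whichever step you pick, the end result the grouping has to produce is a single record holding all input rows. A minimal illustration of that aggregation (sketched in Python rather than Kettle JavaScript, using the column names from the question) could be:

```python
# A minimal sketch of the aggregation the step needs to perform: fold every
# input row into a single "items" payload before calling the REST service.
# Column names follow the example above.
import json

rows = [
    {"Material": "TV", "Qty": 1},
    {"Material": "DVD Player", "Qty": 1},
]

payload = {
    "items": [
        {"material": r["Material"], "quantity": str(r["Qty"])} for r in rows
    ]
}
print(json.dumps(payload, indent=2))
```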
I am trying to write a custom outputter for U-SQL that writes rows to individual files based on the data in one column.
For example, if the column has the date "2016-01-01", the outputter writes that row to a file with that name, and the next row to a file named after the value of the same column in that row.
I am aiming to do this by using the Data Lake Store SDK within the outputter, creating a client and using the SDK functions to write to individual files.
Is this a viable and possible solution?
I have seen that the method to override for outputters is
public override void Output(IRow row, IUnstructuredWriter output)
in which the IUnstructuredWriter is cast to a StreamWriter (I saw one such example). So I assume this IUnstructuredWriter is passed to the method by the U-SQL script, which leaves me no control over what is passed here; it also remains constant for all rows and can't be changed.
This is currently not possible, but we are working on this functionality in response to this frequent customer request. For now, please add your vote to the request here: https://feedback.azure.com/forums/327234-data-lake/suggestions/10550388-support-dynamic-output-file-names-in-adla
UPDATE (Spring 2018): This feature is now in private preview. Please contact us via email (usql at microsoft dot com) if you want to try it out.