Store the output of a Pig job into a directory structure derived from the data - apache-pig

I would like to achieve the following:
My input data looks as follows
{"metadata":
{
"producerName":"capture_api",
"producerVersion":"3.0.13"
},
"payload":
{
--some payload
}
}
I would like to bucket this data using a Pig script as follows:
/finalOutputDir/producerName/producerVersion/File.txt
Is there a way I can do this? I have tried using the MultiStorage function, but that class supports only one field. I could override the functionality inside MultiStorage, but I just wanted to check if there is an easier option.

The piggybank MultiStorage can separate the data into multiple folders, but only by a single field:
STORE data INTO '$out/$producerName' USING org.apache.pig.piggybank.storage.MultiStorage('$out/$producerName', '0', 'none', ',');
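If moving off Pig is an option (the last answer on this page takes that route for a similar ES-to-HDFS job), Spark's partitionBy on write produces almost exactly this layout. Below is a minimal sketch, assuming Spark 2.x, newline-delimited JSON input under a hypothetical /inputDir, and accepting that the directories come out as producerName=capture_api/producerVersion=3.0.13 rather than the bare field values:

// Minimal sketch, not a drop-in replacement for the Pig job.
// Assumes Spark 2.x and newline-delimited JSON records like the sample above,
// read from a hypothetical /inputDir.
import org.apache.spark.sql.SparkSession

object BucketByProducer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bucket-by-producer").getOrCreate()
    import spark.implicits._

    // Each input record looks like {"metadata": {...}, "payload": {...}}
    val records = spark.read.json("/inputDir")

    // Promote the two metadata fields to top-level columns so they can drive partitioning.
    val keyed = records.select(
      $"metadata.producerName".as("producerName"),
      $"metadata.producerVersion".as("producerVersion"),
      $"payload")

    // Writes /finalOutputDir/producerName=capture_api/producerVersion=3.0.13/part-*.json
    keyed.write
      .partitionBy("producerName", "producerVersion")
      .json("/finalOutputDir")

    spark.stop()
  }
}

If the exact bare-value directory names are required, a rename pass over the producerName=... / producerVersion=... directories after the write would be needed.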

Related

How can I generate a JSON file from a FlatBuffers schema

For example, if I have the following FlatBuffers schema:
table table_1 {
  field1:uint32;
  field2:uint32;
}

table table_2 {
  field3:string;
  field4:table_1;
}

root_type table_2;
Is there a way to automatically generate the JSON file:
{
  "field3": "",
  "field4": {
    "field1": "",
    "field2": ""
  }
}
So it will be easier to fill in the JSON file and generate a bin file.
I just need to implement a reader, not a builder.
Thanks
There is no built-in way, but you could use the reflection API to do it yourself: generate the binary schema using flatc, read that binary schema into your own application, iterate over the tables and fields, and export them as JSON.
I recently wrote a Lua generator that follows this scheme, and you could do something very similar.

How to achieve generic Audit.NET JSON data processing?

I am using the Audit.NET library to log Entity Framework actions into a database (currently everything goes into one AuditEventLogs table), where the JsonData column stores the data in the following JSON format:
{
  "EventType": "MyDbContext:test_database",
  "StartDate": "2021-06-24T12:11:59.4578873Z",
  "EndDate": "2021-06-24T12:11:59.4862278Z",
  "Duration": 28,
  "EntityFrameworkEvent": {
    "Database": "test_database",
    "Entries": [
      {
        "Table": "Offices",
        "Name": "Office",
        "Action": "Update",
        "PrimaryKey": {
          "Id": "40b5egc7-46ca-429b-86cb-3b0781d360c8"
        },
        "Changes": [
          {
            "ColumnName": "Address",
            "OriginalValue": "test_address",
            "NewValue": "test_address"
          },
          {
            "ColumnName": "Contact",
            "OriginalValue": "test_contact",
            "NewValue": "test_contact"
          },
          {
            "ColumnName": "Email",
            "OriginalValue": "test_email",
            "NewValue": "test_email2"
          },
          {
            "ColumnName": "Name",
            "OriginalValue": "test_name",
            "NewValue": "test_name"
          },
          {
            "ColumnName": "OfficeSector",
            "OriginalValue": 1,
            "NewValue": 1
          },
          {
            "ColumnName": "PhoneNumber",
            "OriginalValue": "test_phoneNumber",
            "NewValue": "test_phoneNumber"
          }
        ],
        "ColumnValues": {
          "Id": "40b5egc7-46ca-429b-86cb-3b0781d360c8",
          "Address": "test_address",
          "Contact": "test_contact",
          "Email": "test_email2",
          "Name": "test_name",
          "OfficeSector": 1,
          "PhoneNumber": "test_phoneNumber"
        },
        "Valid": true
      }
    ],
    "Result": 1,
    "Success": true
  }
}
My team and I have one main goal:
Being able to create a search page where administrators can tell
who made a change
what they changed
when the change happened
They can give a time period to reduce the number of audit records, and here comes the interesting part:
There should be a text input field which lets them search in the values of the "ColumnValues" section.
The problems I encountered:
Even if I map the JSON structure into relational rows, I am unable to search in every column while keeping things generic.
If I don't map, I can search the JSON string with the MSSQL LIKE function, but with a few hundred thousand records the query takes an eternity to finish, so that is probably not the way.
Keeping things generic is important so that we don't need to modify the audit search page every time we create or modify an entity.
I only know MSSQL, but would storing the audit logs in a document-oriented database like CosmosDB (or anything else, that was just an example) solve my problem? Or can I reach the desired behaviour using a relational database like MSSQL?
Looks like you're asking for an opinion; in that case I would strongly recommend a document-oriented DB.
CosmosDB could be a great option since it supports SQL queries.
There is an extension to log to CosmosDB from Audit.NET: Audit.AzureCosmos
A sample query:
SELECT c.EventType, e.Table, e.Action, ch.ColumnName, ch.OriginalValue, ch.NewValue
FROM c
JOIN e IN c.EntityFrameworkEvent.Entries
JOIN ch IN e.Changes
WHERE ch.ColumnName = "Address" AND ch.OriginalValue = "test_address"
Here is a nice post with a lot of examples of complex SQL queries on CosmosDB.

Saving Sequelize Query Text In Database

I have a Sequelize query that I would like to save in a database and then execute on demand. In order to do this, I am first testing how it would work with the query held in a variable as a string (because in the database it will be stored as a string):
queryToRun = models.user.findAll({
  attributes: [
    ['name', 'name'],
    [Sequelize.literal("COUNT(DISTINCT(user.id))"), "user_count"]
  ],
  group: Sequelize.col("user.name")
})
I would then like to use the query like so:
Promise.all(queryToRun);
I am successfully able to save the object (the object that goes inside findAll, with the attributes etc.) as a string and then execute it, but I can't figure out how to save every part of the query. I want to save the actual "models.user.findAll()" string and evaluate it at a later time.
This is important because I want to define which model findAll runs on and save that in the database as well.
This is actually fairly simple:
Don't use a multiline string; it trips up the function call for some reason.
Use eval to evaluate the string.
Example:
let queryToRun = 'models.project.findAll({})';
Promise.all([eval(queryToRun)]);

Load data from ES and store as Avro in HDFS using Pig

I have some data in Elasticsearch that I need to send to HDFS. I'm trying to use Pig (this is the first time I'm using it), but I'm having trouble defining a correct schema for my data.
First of all, I tried loading the data as JSON using the option 'es.output.json=true' with org.elasticsearch.hadoop.pig.EsStorage. I can load/dump the data correctly, and also save it as JSON to HDFS using STORE A INTO 'hdfs://path/to/store';. Later, by defining an external table in Hive, I can query this data. This is the full example that is working fine (I removed all SSL attributes from the code):
REGISTER /path/to/commons-httpclient-3.1.jar;
REGISTER /path/to/elasticsearch-hadoop-5.3.0.jar;

A = LOAD 'my-index/log' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://addr1:port,https://addr2:port2,https://addr3:port3',
    'es.query=?q=*',
    'es.output.json=true');
STORE A INTO 'hdfs://path/to/store';
How can I store my data as Avro in HDFS? I suppose I need to use AvroStorage, but do I also need to define a schema when loading the data, or is the JSON enough? I tried to define a schema with a LOAD ... USING ... AS statement and setting es.mapping.date.rich=false instead of es.output.json=true (my data is quite complex, with maps of maps and things like that), but it doesn't work. I'm not sure if the problem is in the syntax or in the approach itself. It would be nice to have a hint on the correct direction to follow.
UPDATE
This is an example of what I tried with es.mapping.date.rich=false. My problem is that if a field is null, all the fields end up in the wrong order.
A = LOAD 'my-index/log' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=https://addr1:port,https://addr2:port2,https://addr3:port3',
    'es.query=?q=*',
    'es.mapping.date.rich=false')
    AS (
        field1:chararray,
        field2:chararray,
        field3:map[chararray,fieldMap:map[],chararray],
        field4:chararray,
        field5:map[]
    );

B = FOREACH A GENERATE field1, field2;

STORE B INTO 'hdfs://path/to/store' USING AvroStorage('
{
  "type" : "foo1",
  "name" : "foo2",
  "namespace" : "foo3",
  "fields" : [ {
    "name" : "field1",
    "type" : ["null","string"],
    "default" : null
  }, {
    "name" : "field2",
    "type" : ["null","string"],
    "default" : null
  } ]
}
');
For future readers: I decided to use Spark instead, as it is much faster than Pig. To save Avro files on HDFS, I'm using the Databricks spark-avro library.
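For anyone following that route, here is a minimal sketch of the Spark version, assuming Spark 2.x with the elasticsearch-spark and Databricks spark-avro packages on the classpath (for example via --packages); the node list and query mirror the Pig script above, and the SSL options are omitted here as well:

// Minimal spark-shell sketch; package versions must match your Spark and ES versions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-to-avro").getOrCreate()

// Read the index through the ES-Hadoop Spark SQL integration;
// the connector infers the schema from the index mapping.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "https://addr1:port,https://addr2:port2,https://addr3:port3")
  .option("es.query", "?q=*")
  .load("my-index/log")

// Write the same data to HDFS as Avro via the Databricks spark-avro library.
df.write
  .format("com.databricks.spark.avro")
  .save("hdfs://path/to/store")

Letting the connector infer the schema avoids hand-writing the Avro schema that the Pig AvroStorage attempt above required.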

Wsapi data store query

I am looking to get all projects under a selected project (i.e. the entire child project branch) using a WSAPI data store query in Rally SDK 2.0rc1. Is it possible to recursively get all child project names with a query, or will I have to write a separate recursive function to get that information? If a separate recursive function is required, how should I populate that data into, for example, a combo box? Do I need to create a separate data store, push the data from my recursive function into it, and then link the combo box's store to it?
Also, how do I get the "current workspace name" (the workspace I am working in, inside Rally) in Rally SDK 2.0rc1?
Use the 'context' config option to specify which project level to start at and add 'projectScopeDown' to make sure child projects are returned. That would look something like this:
Ext.create('Rally.data.WsapiDataStore', {
    limit: Infinity,
    model: 'Project',
    fetch: ['Name', 'ObjectID'],
    context: {
        project: '/project/' + PROJECT_OID,
        projectScopeDown: true
    }
}).load({
    callback: function(store) {
        // Use project store data here
    }
});
To get your current context data, use: this.getContext().
var workspace = this.getContext().getWorkspace();
var project = this.getContext().getProject();
If you expose this.getContext().getWorkspace() and this.getContext().getProject() with console.log, you may understand better what is returned and what is required. In one of my cases I had to use this.getContext().getProject().project.
Using console debug statements is the best way to figure out what you need based on how you use it.