How to create a view against a table that has record fields? - google-bigquery

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have this table:
For which I issue this API call:
{
'tableReference': {
'projectId': 'redacted',
'tableId': u'AccountDeletionRequest',
'datasetId': 'Latest_Production_Data'
}
'view': {
'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
},
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.

Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
schema = table_service.get(
projectId=BQ_PROJECT_ID,
datasetId=dataset,
tableId=table
).execute()['schema']
return ",\n".join([
_get_leaf_selectors("", top_field)
for top_field in schema["fields"]
])
def _get_leaf_selectors(prefix, field):
if prefix:
format = prefix + ".%s"
else:
format = "%s"
if 'fields' not in field:
# Base case
actual_name = format % field["name"]
safe_name = actual_name.replace(".", "_")
return "%s as %s" % (actual_name, safe_name)
else:
# Recursive case
return ",\n".join([
_get_leaf_selectors(format % field["name"], sub_field)
for sub_field in field["fields"]
])

We had a bug where you needed to need to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e they don't have '.' in the name).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.

Related

How to sort connection type into only 2 rows in Qlik sense

I have a column named Con_TYPE in which there are multiple types of connections such as fiberoptic, satellite, 3g etc.
And I want to sort them only into 2 rows:
fiberoptic
5
others
115
Can anybody help me?
Thanks in advance
You can use Calculated dimension or Mapping load
Lets imagine that the data, in its raw form, looks like this:
dimension: Con_TYPE
measure: Sum(value)
Calculated dimension
You can add expressions inside the dimension. If we have a simple if statement as an expression then the result is:
dimension: =if(Con_TYPE = 'fiberoptic', Con_TYPE, 'other')
measure: Sum(Value)
Mapping load
Mapping load is a script function so we'll have to change the script a bit:
// Define the mapping. In our case we want to map only one value:
// fiberoptic -> fiberoptic
// we just want "fiberoptic" to be shown the same "fiberoptic"
TypeMapping:
Mapping
Load * inline [
Old, New
fiberoptic, fiberoptic
];
RawData:
Load
Con_TYPE,
value,
// --> this is where the mapping is applied
// using the TypeMapping, defined above we are mapping the values
// in Con_TYPE field. The third parameter specifies what value
// should be given if the field value is not found in the
// mapping table. In our case we'll chose "Other"
ApplyMap('TypeMapping', Con_TYPE, 'Other') as Con_TYPE_Mapped
;
Load * inline [
Con_TYPE , value
fiberoptic, 10
satellite , 1
3g , 7
];
// No need to drop "TypeMapping" table since its defined with the
// "Mapping" prefix and Qlik will auto drop it at the end of the script
And we can use the new field Con_TYPE_Mapped in the ui. And the result is:
dimension: Con_TYPE_Mapped
measure: Sum(Value)
Pros/Cons
calculated dimension
+ easy to use
+ only UI change
- leads to performance issues on mid/large datasets
- have to be defined (copy/paste) per table/chart. Which might lead to complications if have to be changed across the whole app (it have to be changed in each object where defined)
mapping load
+ no performance issues (just another field)
+ the mapping table can be defined inline or loaded from an external source (excel, csv, db etc)
+ the new field can be used across the whole app and changing the values in the script will not require table/chart change
- requires reload if the mapping is changed
P.S. In both cases selecting Other in the tables will correctly filter the values and will show data only for 3g and satellite

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJSON, SplitJSON, and FlattenJSON processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc) are player IDs for the following stats. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID Stat ID Stat Value
5381 wind_speed 4.0
5381 tm_st_snp 26.0
5381 tm_off_snp 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe that it's possible to use jolt to transform your json into a format:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
then use PutDatabaseRecord with json reader.
Another approach is to use ExecuteGroovyScript processor.
Add new parameter to it with name SQL.mydb and link it to your DBCP controller service
And use the following script as Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
def ff=session.get()
if(!ff)return
//read flow file content and parse it
def body = ff.read().withReader("UTF-8"){reader->
new JsonSlurper().parse(reader)
}
def results = []
//use defined sql connection to create a batch
SQL.mydb.withTransaction{
def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
results = SQL.mydb.withBatch(100, cmd){statement->
//run through all keys/subkeys in flow file body
body.each{pid,keys->
keys.each{k,v->
statement.addBatch(pid,k,v)
}
}
}
}
//write results as a new flow file content
ff.write("UTF-8"){writer->
new JsonBuilder(results).writeTo(writer)
}
//transfer to success
REL_SUCCESS << ff

Alias not recognized in main script

In the following code I've loaded data from two excel documents that have the exact same column names, and have therefore given one of the tables aliases.
My problem occurs when I try to put in a not match() condition at the end of the script.
// New table
NewTable:
LOAD
[namn] as namnNy
FROM
[pglistaNy.xlsx]
(ooxml, embedded labels);
// Old table
OldTable:
LOAD
[namn]
FROM
[pglistaOld.xlsx]
(ooxml, embedded labels)
Where not match(namn, namnNy);
I get an error telling me that it does not recognize the namnNy alias, why is that and what's a better solution / method?
match function will not work in your case. You are trying to match values from field names from different tables. You should use the exists function (full documentation on Qlik's help page)
So your script will be:
// New table
NewTable:
LOAD
[namn] as namnNy
FROM
[pglistaNy.xlsx]
(ooxml, embedded labels);
// Old table
OldTable:
LOAD
[namn]
FROM
[pglistaOld.xlsx]
(ooxml, embedded labels)
Where
not Exists(namnNy, namn);
Example qvw file here

Peoplesoft CreateRowset with related display record

According to the Peoplebook here, CreateRowset function has the parameters {FIELD.fieldname, RECORD.recname} which is used to specify the related display record.
I had tried to use it like the following (just for example):
&rs1 = CreateRowset(Record.User, Field.UserId, Record.UserName);
&rs1.Fill();
For &k = 1 To &rs1.ActiveRowCount
MessageBox(0, "", 999999, 99999, &rs1(&k).UserName.Name.Value);
End-for;
(Record.User contains only UserId(key), Password.
Record.UserName contains UserId(key), Name.)
I cannot get the Value of UserName.Name, do I misunderstand the usage of this parameter?
Fill is the problem. From the doco:
Note: Fill reads only the primary database record. It does not read
any related records, nor any subordinate rowset records.
Having said that, it is the only way I know to bulk-populate a standalone rowset from the database, so I can't easily see a use for the field in the rowset.
Simplest solution is just to create a view, but that gets old very soon if you have to do it a lot. Alternative is to just loop through the rowset yourself loading the related fields. Something like:
For &k = 1 To &rs1.ActiveRowCount
&rs1(&k).UserName.UserId.value = &rs1(&k).User.UserId.value;
&rs1(&k).UserName.SelectByKey();
End-for;

Django ORM Cross Product

I have three models:
class Customer(models.Model):
pass
class IssueType(models.Model):
pass
class IssueTypeConfigPerCustomer(models.Model):
customer=models.ForeignKey(Customer)
issue_type=models.ForeignKey(IssueType)
class Meta:
unique_together=[('customer', 'issue_type')]
How can I find all tuples of (custmer, issue_type) where there is no IssueTypeConfigPerCustomer object?
I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
Background: for every customer and for every issue-type, there should be a config in the DB.
If you can afford to make one database trip for each issue type, try something like this untested snippet:
def lacking_configs():
for issue_type in IssueType.objects.all():
for customer in Customer.objects.filter(
issuetypeconfigpercustomer__issue_type=None
):
yield customer, issue_type
missing = list(lacking_configs())
This is probably OK unless you have a lot of issue types or if you are doing this several times per second, but you may also consider having a sensible default instead of making a config object mandatory for each combination of issue type and customer (IMHO it is a bit of a design-smell).
[update]
I updated the question: I want to avoid a loop in Python. A solution which solves this in the DB would be preferred.
In Django, every Queryset is either a list of Model instances or a dict (values querysets), so it is impossible to return the format you want (a list of tuples of Model) without some Python (and possibly multiple trips to the database).
The closest thing to a cross product would be using the "extra" method without a where parameter, but it involves raw SQL and knowing the underlying table name for the other model:
missing = Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype']
)
As a result, each Customer object will have an extra attribute, "issue_type_id", containing the id of one IssueType. You can use the where parameter to filter based on NOT EXISTS (SELECT 1 FROM appname_issuetypeconfigpercustomer WHERE issuetype_id=appname_issuetype.id AND customer_id=appname_customer.id). Using the values method you can have something close to what you want - this is probably enough information to verify the rule and create the missing records. If you need other fields from IssueType just include them in the select argument.
In order to assemble a list of (Customer, IssueType) you need something like:
cross_product = [
(customer, IssueType.objects.get(pk=customer.issue_type_id))
for customer in
Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype'],
where=["""
NOT EXISTS (
SELECT 1
FROM appname_issuetypeconfigpercustomer
WHERE issuetype_id=appname_issuetype.id
AND customer_id=appname_customer.id
)
"""]
)
]
Not only this requires the same number of trips to the database as the "generator" based version but IMHO it is also less portable, less readable and violates DRY. I guess you can lower the number of database queries to a couple using something like this:
missing = Customer.objects.extra(
select={"issue_type_id": 'appname_issuetype.id'},
tables=['appname_issuetype'],
where=["""
NOT EXISTS (
SELECT 1
FROM appname_issuetypeconfigpercustomer
WHERE issuetype_id=appname_issuetype.id
AND customer_id=appname_customer.id
)
"""]
)
issue_list = dict(
(issue.id, issue)
for issue in
IssueType.objects.filter(
pk__in=set(m.issue_type_id for m in missing)
)
)
cross_product = [(c, issue_list[c.issue_type_id]) for c in missing]
Bottom line: in the best case you make two queries at the cost of legibility and portability. Having sensible defaults is probably a better design compared to mandatory config for each combination of Customer and IssueType.
This is all untested, sorry if some homework was left for you.