DB2 sql: How to generate unique ids of a certain length - sql

I'm trying to use Python to generate a list of unique ids that can be used as indexes in a table on our DB2 database. My starting input is a list of ids that come from a separate table. I need to take this list of ids and generate a list of other ids (placed inside the formlist variable here). These other ids must be unique and must not already exist in the target database table (the table name is shown below as FORM_RPT).
So far what I have tried is the following:
import ibm_db_dbi
import ibm_db
import numpy as np
import pandas as pd

class Gen_IDs():
    def __init__(self, mycon, opt_ids):
        """Create an ID Generator object, requires an opt_id list as argument"""
        self.mycon = mycon
        self.opt_ids = opt_ids

    def gen_form(self):
        """generates unique form ids based off an option list"""
        sql = """SELECT *
                 FROM FORM_RPT"""
        df = pd.read_sql(sql, self.mycon)
        formlist = list(df["FORM_RPT_ID"])
        stack = 0
        opt_list = []
        while(stack < len(self.opt_ids)):
            f = np.random.randint(1000, 9999)
            #if f in df['FORM_RPT_ID'].values:
            if formlist.count(f) > 0:
                pass
            if f in opt_list:
                pass
            else:
                opt_list.append(f)
                stack += 1
        return opt_list
This code is generating just fine, but to my confusion, a small portion of the generated ids are still showing as existing in the target database. The generated ids need to be 4 digits ints.
Here is an example of how it would work:
optionList = [1001, 1002, 1003, 1004, 1005]
formlist = [2001, 2002, 2003, 2004, 2005]
gm = Gen_IDs(mycon, optionList)
new_form_list = gm.gen_form()
Currently I'm getting a returned list, but the new list sometimes will have ids that exist in my formList variable.

You can generate the ids by using row_number():
SELECT *,row_number() over( order by (select null)) as id
FROM FORM_RPT

Generating unique IDs is something databases provide. There is no need to use extra coding for that.
In Db2 you can use an identity column if it is only for a single table, or a database sequence if you want to have it as a stand-alone database object.
Why does it need to be certain length?
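For the Python approach in the question, the duplicates most likely come from the two separate if statements: the first check against formlist only executes pass, so a value that already exists in the table still falls through to the second if/else and gets appended. Below is a minimal sketch of one way to avoid this, assuming the existing ids have already been loaded into a Python collection. The function name gen_unique_ids and its parameters are made up for illustration, and the standard-library random module is swapped in for NumPy.

```python
import random

def gen_unique_ids(existing_ids, count, low=1000, high=9999, rng=None):
    """Return `count` distinct ints in [low, high] that are not in existing_ids.

    Hypothetical helper: existing_ids stands in for the FORM_RPT_ID values
    read from the table.
    """
    rng = rng or random.Random()
    taken = set(existing_ids)
    # Build the pool of free ids once instead of retrying random draws.
    free = [i for i in range(low, high + 1) if i not in taken]
    if count > len(free):
        raise ValueError("not enough free ids in the requested range")
    # sample() draws without replacement, so the result has no duplicates.
    return rng.sample(free, count)

option_list = [1001, 1002, 1003, 1004, 1005]
form_list = [2001, 2002, 2003, 2004, 2005]
new_form_list = gen_unique_ids(form_list, len(option_list))
```

Note that np.random.randint(1000, 9999) never returns 9999 because its upper bound is exclusive. Ids generated on the client can also still collide with rows inserted by another session between the read and the insert, which is one reason the identity column or sequence suggested above is the safer choice.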

Related

LINQ Select Distinct objects from a List(Of T) on multiple properties does not work

My goal is to select the objects based on three properties: CreatedBy, GrantedTo, PatientID.
The following is the data I'm having in my List(Of Access)
So from the above example I should be getting two objects only:
1) CreatedBy:1, GrantedTo: 65, PatientID: 48
2) CreatedBy:1, GrantedTo: 66, PatientID: 48
but I'm getting eight.
Here is my buggy code:
Dim distinctList As List(Of Access) = t.GroupBy(Function(p) New With {p.GrantedTo, p.PatientID, p.CreatedBy}).[Select](Function(g) g.First()).ToList()
I believe the code should select the first from each unique group of objects.
Any ideas?
You have a big flaw in your logic:
So from the above example I should be getting two objects only:
1) CreatedBy:1, GrantedTo: 65, PatientID: 48
2) CreatedBy:1, GrantedTo: 66, PatientID: 48
[...]
I believe the code should select the first from each unique group of objects.
Yes, you can get your data grouped by CreatedBy, GrantedTo and PatientID.
But what you get back cannot be (reasonable) objects of type Access and it will not be the first from each group.
Why? Because when you want to select all data from your object - and thus AccessID, PermissionID etc. - what values should those attributes have?
In your example 1):
Should AccessID be 238 or 240?
Should PermissionID be 15 or 18?
...
I guess you got the point.
So what you actually should do is select the grouped data only, not as Access objects but either as an anonymous type or as a defined type.
Anonymous version:
Dim distinct = From acc In access
               Group By g = New With {Key .CreatedBy = acc.CreatedBy,
                                      Key .GrantedTo = acc.GrantedTo,
                                      Key .PatientID = acc.PatientID} Into Group
(The Key keyword is mandatory here!)
Now if you want to pass these distinct values to another object/function you can pass them as single parameters:
For Each value In distinct
foo.bar(value.g.CreatedBy, value.g.GrantedTo, value.g.PatientID)
Next
Or you can create your own small class which contains only the three properties.
(I leave this point out since I'm running out of time, but it should be straightforward.)
EDIT
Try the following (typed blindly and untested):
Dim distinct = (From acc In access
                Group By g = New With {Key .CreatedBy = acc.CreatedBy,
                                       Key .GrantedTo = acc.GrantedTo,
                                       Key .PatientID = acc.PatientID} Into Group
                Select New Access With {.CreatedBy = g.CreatedBy,
                                        .GrantedTo = g.GrantedTo,
                                        .PatientID = g.PatientID}).ToList()
This should give you a List(Of Access) with new Access objects with only the three properties set.
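The role of the Key keyword can be illustrated in Python: a VB anonymous type without Key properties is only equal to itself, so every row lands in its own group, whereas a Python tuple compares by value and therefore behaves like a keyed anonymous type. The following is a rough sketch only. The Access dataclass and the sample rows are made up for illustration and are not the data from the question.

```python
from dataclasses import dataclass

@dataclass
class Access:
    access_id: int
    created_by: int
    granted_to: int
    patient_id: int

# Made-up sample rows, not the data from the question.
rows = [
    Access(238, 1, 65, 48),
    Access(240, 1, 65, 48),
    Access(241, 1, 66, 48),
    Access(242, 1, 66, 48),
]

# Keep the first row seen for each (created_by, granted_to, patient_id) key.
# Tuples compare by value, which is what Key provides in VB.
first_per_key = {}
for row in rows:
    key = (row.created_by, row.granted_to, row.patient_id)
    first_per_key.setdefault(key, row)

distinct = list(first_per_key.values())
print([(r.created_by, r.granted_to, r.patient_id) for r in distinct])
# prints [(1, 65, 48), (1, 66, 48)]
```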

Comparing Elements across collections

I have the following models:
class Collection(models.Model):
    ...

class Record(models.Model):
    collection = models.ForeignKey(Collection, related_name='records')
    filename = models.CharField(max_length=256)
    checksum = models.CharField(max_length=64)

    class Meta:
        unique_together = (('filename', 'collection'),)
I want to perform the following query:
For each filename of Record I want to know the Collections that:
Do not provide a Record with that filename
or that provide such a Record but has a differing checksum
I have in mind an output like that:
| C1 C2 C3 <- collections
-----------+------------
file-1.txt | x
file-2.txt | x
file-3.txt | ! ! !
file-4.txt | x ! !
file-5.txt | ! ! x
x = missing
! = different checksum
What I've come up with so far is a query per Collection that selects the filenames found in the other collections and excludes those already present in this one.
for collection in collections:
    other_collections = [c for c in collections if c is not collection]
    results[collection] = qs.filter(collection__in=other_collections).exclude(
        filename__in=qs.filter(
            collection=collection
        ).values_list('filename', flat=True)
    ).order_by('filename').values_list('filename', flat=True)
This somewhat solves the first part of my question, but is rather quirky and requires post-processing to get to the format I desire. And, more importantly, it does not address the checksum comparison.
Is it possible to perform the two queries in one combined step to get the results in the format I described above?
The solution would not necessarily have to use the QuerySet APIs, a fallback to raw SQL is fine by me too.
It is not possible to write a SQL query that returns a variable number of columns, although you can achieve that effect if you wrap everything in an array or JSON object.
If you know the collections, you could write SQL like this:
SELECT r.filename,
       (SELECT r.checksum = r2.checksum FROM records r2 WHERE r.filename = r2.filename AND r2.collection_id = 1) AS c1,
       (SELECT r.checksum = r2.checksum FROM records r2 WHERE r.filename = r2.filename AND r2.collection_id = 2) AS c2,
       ...
FROM records r
WHERE r.collection_id = 1
GROUP BY r.filename, r.checksum
For each filename/collection pair, you will get NULL if the collection doesn't have the record, true if the collection has it with the right checksum, or false if the collection has it with a different checksum.
I include WHERE r.collection_id = 1 because otherwise for the checksum comparison, you have to answer "different from what?"
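If post-processing in Python is acceptable, the same three-way result (missing, different, same) can be built from plain (collection, filename, checksum) rows, for example the output of a values_list() call. This is only a sketch under that assumption: the function name compare_collections and the sample data are made up, and, like the SQL above, it compares everything against one reference collection.

```python
def compare_collections(records, reference):
    """records: list of (collection, filename, checksum) tuples.

    Returns {filename: {collection: marker}} where marker is 'x' when the
    collection has no such file, '!' when its checksum differs from the
    reference collection's, and '' when it matches.
    """
    collections = sorted({c for c, _, _ in records})
    by_key = {(c, f): s for c, f, s in records}
    # Only files present in the reference collection are listed.
    ref_files = sorted(f for c, f, _ in records if c == reference)
    table = {}
    for filename in ref_files:
        ref_sum = by_key[(reference, filename)]
        row = {}
        for c in collections:
            other = by_key.get((c, filename))
            if other is None:
                row[c] = "x"
            elif other != ref_sum:
                row[c] = "!"
            else:
                row[c] = ""
        table[filename] = row
    return table

# Made-up sample data for illustration only.
sample = [
    ("C1", "file-1.txt", "aaa"),
    ("C2", "file-1.txt", "aaa"),
    ("C1", "file-2.txt", "bbb"),
    ("C2", "file-2.txt", "zzz"),
    ("C3", "file-2.txt", "bbb"),
]
result = compare_collections(sample, reference="C1")
```

With this sample, file-1.txt is marked missing in C3 and file-2.txt is marked different in C2.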

How to combine multiple rows in a relation into a tuple to perform calculations in PIG Latin

I have the following code:
pitcher_res = UNION pitcher_total_salary,pitcher_total_appearances;
dump pitcher_res;
The output is:
(8965000.0)
(22.0)
However, I want to calculate 8965000.0/22.0, so I need something like:
res = FOREACH some_relation GENERATE $0/$1;
Therefore I need to have some_relation = (8965000.0,22.0). How can I perform such a conversion?
You can do a CROSS.
Computes the cross product of two or more relations.
https://pig.apache.org/docs/r0.11.1/basic.html#cross
Ideally you would have a unique identifier for each entry in your source relations. Then you can perform a join based on this identifier which results in the kind of relation you want to have.
Salary relation
salaries: pitcher_id, pitcher_total_salary
Total appearances relation
appearances: pitcher_id, pitcher_total_appearances
Join
pitcher_relation = join salaries by pitcher_id, appearances by pitcher_id;
Calculation
res = FOREACH pitcher_relation GENERATE pitcher_total_salary/pitcher_total_appearances;
The below pig latin scripts will surely come to your rescue:
load the salary file
salary = load '/home/abhishek/Work/pigInput/pitcher_total_salary' as (salary:long);
load the appearances file
appearances = load '/home/abhishek/Work/pigInput/pitcher_total_appearances' as (appearances:long);
Now, use the CROSS command
C = cross salary, appearances;
Then, the final output
res = foreach C generate salary/appearances;
Output
dump res;
407500
Hope this helps
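What CROSS does here can be sketched in Python with itertools.product. Each single-row relation is modelled as a list of one-element tuples, which is an assumption made for illustration, and the cross product pairs them into one two-field row that can then be divided.

```python
from itertools import product

# Each relation holds a single one-field row, as in the dump in the question.
pitcher_total_salary = [(8965000.0,)]
pitcher_total_appearances = [(22.0,)]

# Cross product: concatenate every row of one relation with every row of the other.
crossed = [s + a for s, a in product(pitcher_total_salary, pitcher_total_appearances)]
res = [row[0] / row[1] for row in crossed]
print(res)  # prints [407500.0]
```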

pandas groupby by different key and merge

I have transaction data with three variables: user_id, item_id and type_id. One user_id can have more than one item_id and type_id, and type_id is one of (1, 2, 3, 4).
data=DataFrame({'user_id':['a','a','a','b','b','c'],'item_id':['1','3','3','2','4','1'],'type_id':['1','2','2','3','4','4']})
ui=data.groupby(['user_id','item_id','type_id']).size()
u=data.groupby(['user_id','type_id']).size()
What I want in the end is the number of distinct type_id values for every user_id, and also the number of distinct type_id values for every (user_id, item_id) pair, merged on user_id.
Your question is difficult to answer but here is one solution:
import pandas as pd
data = pd.DataFrame({'user_id':['a','a','a','b','b','c'],'item_id':['1','3','3','2','4','1'],'type_id':['1','2','2','3','4','4']})
# number of distinct type_id values per user
ui = data.groupby('user_id').type_id.nunique().reset_index()
# number of distinct type_id values per (user, item) pair
u = data.groupby(['user_id','item_id']).type_id.nunique().reset_index()
final = ui.merge(u, on='user_id', how='inner').set_index('user_id')
# give the two counts different names so the columns stay unambiguous
final.columns = ['user_distinct_type_id', 'item_id', 'item_distinct_type_id']
print(final)

Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
It seems like I should be able to do this by grouping the fixed fields and flattening a bag of the KV Pairs to generate the cross product
I've tried reading this in with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates records with a schema as follows:
A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}
While this gets me the schema that I want, there are several problems that I can't seem to solve:
By using the TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all or nothing query.
The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in pig. If I dump some data from statement B, I definitely see data there that would match those expressions.
So I've tried approaching the problem differently, by loading the data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 4 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach init_parse generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain each of the individual kvpairs as a tuple of one field, so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
I'm open to other ways of attacking this problem as well, but I can't seem to find a good way to transform my tuple of multiple fields into a bag of tuples, with each tuple having one field.
I'm using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.
Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.
You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.
The UDF is pretty simple. In Java, you can just do something like this in your exec method:
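DataBag b = new DefaultDataBag();
// the first argument is the tuple produced by STRSPLIT
Tuple t = (Tuple) input.get(0);
for (int i = 0; i < t.size(); i++) {
    Object o = t.get(i);
    // wrap each field in its own single-field tuple
    Tuple e = TupleFactory.getInstance().newTuple(o);
    b.add(e);
}
return b;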
Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.
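The overall flow (split each line, turn every key-value pair into its own row, then count distinct fixed-field combinations per pair) can be sketched outside Pig in plain Python. This only illustrates the logic and is not a replacement for the UDF. The function name count_fixed_combos is made up, and the input lines are the two examples from the question.

```python
from collections import defaultdict

def count_fixed_combos(lines, n_fixed=4):
    """Count distinct fixed-field combinations seen with each key=value pair."""
    combos = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        fixed, kv_pairs = tuple(fields[:n_fixed]), fields[n_fixed:]
        # One entry per pair: the Python analogue of flattening the bag.
        for kv in kv_pairs:
            combos[kv].add(fixed)
    return {kv: len(seen) for kv, seen in combos.items()}

lines = [
    "1|2|3|4|key1=value1|key2=value2",
    "2|3|4|5|key1=value7|key2=value2|key3=value3",
]
for kv, n in sorted(count_fixed_combos(lines).items()):
    print(kv, ":", n)
# prints:
# key1=value1 : 1
# key1=value7 : 1
# key2=value2 : 2
# key3=value3 : 1
```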