pandas groupby by different key and merge - pandas

I have a transaction DataFrame containing three variables: user_id, item_id, and type_id. One user_id can have more than one item_id and type_id, and type_id takes values in (1,2,3,4).
data = DataFrame({'user_id': ['a','a','a','b','b','c'],
                  'item_id': ['1','3','3','2','4','1'],
                  'type_id': ['1','2','2','3','4','4']})
ui = data.groupby(['user_id','item_id','type_id']).size()
u = data.groupby(['user_id','type_id']).size()
What I want to get in the end is, for every user_id, the number of distinct type_id values, and also, for every (user_id, item_id) pair, the number of distinct type_id values, and then to merge the two by user_id.

Your question is difficult to answer but here is one solution:
import pandas as pd

data = pd.DataFrame({'user_id': ['a','a','a','b','b','c'],
                     'item_id': ['1','3','3','2','4','1'],
                     'type_id': ['1','2','2','3','4','4']})

# number of distinct type_id values per user_id
ui = data.groupby('user_id').type_id.nunique().reset_index()
# number of distinct type_id values per (user_id, item_id) pair
u = data.groupby(['user_id','item_id']).type_id.nunique().reset_index()

final = ui.merge(u, on='user_id', how='inner').set_index('user_id')
final.columns = ['user_distinct_type_id', 'item_id', 'item_distinct_type_id']
print(final)
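With the sample data above, the merged frame comes out roughly as follows (values worked out by hand from the six rows in the question; exact spacing depends on your pandas version):

         user_distinct_type_id item_id  item_distinct_type_id
user_id
a                            2       1                      1
a                            2       3                      1
b                            2       2                      1
b                            2       4                      1
c                            1       1                      1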

Related

DB2 sql: How to generate unique ids of a certain length

I'm trying to use Python to generate a list of unique IDs that can be used as indexes in a table in our DB2 database. My starting input is a list of IDs that comes from a separate table. I need to take this list of IDs and generate a list of other IDs. These other IDs must be unique and must not already exist in the target database table (shown below as FORM_RPT, whose existing IDs are read into the formlist variable).
So far what I have tried is the following:
import ibm_db_dbi
import ibm_db
import numpy as np
import pandas as pd

class Gen_IDs():
    def __init__(self, mycon, opt_ids):
        """Create an ID Generator object, requires an opt_id list as argument"""
        self.mycon = mycon
        self.opt_ids = opt_ids

    def gen_form(self):
        """generates unique form ids based off an option list"""
        sql = """SELECT *
                 FROM FORM_RPT"""
        df = pd.read_sql(sql, self.mycon)
        formlist = list(df["FORM_RPT_ID"])
        stack = 0
        opt_list = []
        while stack < len(self.opt_ids):
            f = np.random.randint(1000, 9999)
            #if f in df['FORM_RPT_ID'].values:
            if formlist.count(f) > 0:
                pass
            if f in opt_list:
                pass
            else:
                opt_list.append(f)
                stack += 1
        return opt_list
This code is generating just fine, but to my confusion, a small portion of the generated IDs still show up as existing in the target database. The generated IDs need to be 4-digit ints.
Here is an example of how it would work:
optionList = [1001, 1002, 1003, 1004, 1005]
formlist = [2001, 2002, 2003, 2004, 2005]
gm = Gen_IDs(mycon, optionList)
new_form_list = gm.gen_form()
Currently I'm getting a returned list, but the new list sometimes has IDs that already exist in my formlist variable.
You can generate IDs by using row_number():
SELECT *, row_number() OVER (ORDER BY (SELECT NULL)) AS id
FROM FORM_RPT
Generating unique IDs is something databases provide. There is no need for extra coding for that.
In Db2 you can use an identity column if it is only for a single table, or a database sequence if you want it as a stand-alone database object.
Why does it need to be a certain length?
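As for why the Python loop in the question lets some existing IDs through: the pass statement does not skip the rest of the iteration, so a candidate found in formlist still falls through to the second if/else and gets appended as long as it is not already in opt_list. Below is a minimal sketch of the loop with that fall-through fixed; the helper name and the use of a set are my own, and it assumes formlist has already been read from FORM_RPT as in the question.

import numpy as np

def gen_unique_ids(existing_ids, count):
    """Return count 4-digit IDs that collide neither with existing_ids nor with each other."""
    existing = set(existing_ids)                  # O(1) membership checks
    new_ids = []
    while len(new_ids) < count:
        f = int(np.random.randint(1000, 10000))   # numpy's upper bound is exclusive
        if f in existing or f in new_ids:
            continue                              # reject the candidate and draw again
        new_ids.append(f)
    return new_ids

# e.g. new_form_list = gen_unique_ids(formlist, len(optionList))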

Adjust R Data Frame with Element Wise Function

I am using RODBC to pull down server data into a data frame using this statement:
df <- data.frame(
  sqlQuery(
    channel = ODBC_channel_string,
    query = SQLquery_string
  )
)
The resultant data set has the following grouping attributes of interest:
Scenario
Other Group By 1
Other Group By 2
...
Other Group By K
With key variables:
[Time] = Future Year(i)
[Spot] = Projected Effective Discount Rate For Year(i-1) to Year(i)
Abbreviated Table Snip
What I would like to do is transform the [Spot] column into a discount factor that is dependent on consistent preceding values:
the same grouping attributes ([Scenario], [Other Group By 1] ... [Other Group By K]) and key variables ([Time], [Spot]) as above, plus a new column:
[Disc_0] = prod([Value]), for [All Grouping] = [This Grouping] and [Time] <= [This Time]
Excel Version of Abbreviated Goal Snip
I could code the solution using a for loop, but I suspect that will be very inefficient in R if there are significant row counts in the original data frame.
What I am hoping is to use some creative implementation of dplyr's mutate:
df %>% mutate(Disc_0 = objective_function{?})
I think that R should be able to do this kind of data wrangling quickly, but I am not sure that is the case. I am more familiar with SQL and may attempt to produce the necessary variable there instead.

how to store grouped data into json in pyspark

I am new to pyspark.
I have a dataset which looks like this (just a snapshot of a few columns):
I want to group my data by key. My key is
CONCAT(a.div_nbr,a.cust_nbr)
My ultimate goal is to convert the data into JSON, formatted like this:
k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....
e.g.
248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840) , PROD_BRND:Molly's Kitchen,PACK_SIZE:4/2.5 LB, QTY_UOM:CA } ,
{ PRECIMA_ID:SCP 00248 0000138339 , PROD_NBR:6659079 , PROD_DESC:Beef Chuck Short Rib Slices, PROD_BRND:Stockyards , PACK_SIZE:12 LBA , QTY_UOM:CA} ,{...,...,} ],
1384611034793[{},{},{}],....
I have created a dataframe (I am joining two tables basically to get some more fields)
joinstmt = sqlContext.sql("""
    SELECT a.precima_id, CONCAT(a.div_nbr, a.cust_nbr) AS key,
           a.prod_nbr, a.prod_desc, a.prod_brnd, a.pack_size, a.qty_uom,
           a.sales_opp, a.prc_guidance, a.pim_mrch_ctgry_desc, a.pim_mrch_ctgry_id,
           b.start_date, b.end_date
    FROM scoop_dtl a JOIN scoop_hdr b ON (a.precima_id = b.precima_id)
""")
Now, in order to get the above result I need to group by the result based on key, I did the following
groupbydf = joinstmt.groupBy("key")
This results in grouped data, and after some reading I learned that I cannot use it directly and need to convert it back into a dataframe to store it.
I am new to this and need some help converting it back into a dataframe, or I would appreciate any other approaches as well.
If your joined dataframe looks like this:
gender age
M 5
F 50
M 10
M 10
F 10
You can then use the code below to get the desired output:
from pyspark.sql.functions import collect_list

joinedDF.groupBy("gender") \
    .agg(collect_list("age").alias("ages")) \
    .write.json("jsonOutput.txt")
Output would look like below:
{"gender":"F","ages":[50,10]}
{"gender":"M","ages":[5,10,10]}
In case you have multiple columns, like name and salary, you can add them like below:
df.groupBy("gender") \
    .agg(collect_list("age").alias("ages"), collect_list("name").alias("names"))
Your output would look like:
{"gender":"F","ages":[50,10],"names":["ankit","abhay"]}
{"gender":"M","ages":[5,10,10],"names":["snchit","mohit","rohit"]}
You cannot use GroupedData directly; it has to be aggregated first. This could be partially covered by aggregation with built-in functions like collect_list, but it is simply not possible to achieve the desired output, with values used to represent keys, using DataFrameWriter.
You can try something like this instead:
import json

from pyspark.sql import Row
from pyspark.sql.functions import struct

def make_json(kvs):
    k, vs = kvs
    return json.dumps({k[0]: list(vs)})

# keys is a list of the key column names and values is a struct column
# holding the remaining fields
(df.select(struct(*keys), values)
    .rdd
    .mapValues(Row.asDict)
    .groupByKey()
    .map(make_json))
and then call saveAsTextFile on the result to write it out.
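For the schema in the question, a DataFrame-only sketch in the spirit of the collect_list answer above might look like this (the column names come from the question's SELECT; everything else is an assumption rather than tested code, and collecting structs with collect_list needs a reasonably recent Spark version):

from pyspark.sql.functions import struct, collect_list

# pack the per-row fields into one struct, then collect a list of structs per key
grouped = (joinstmt
           .withColumn("value", struct("prod_nbr", "prod_desc", "prod_brnd",
                                       "pack_size", "qty_uom"))
           .groupBy("key")
           .agg(collect_list("value").alias("values")))

grouped.write.json("grouped_output")

Each output line is then a JSON object with the key and its list of value structs, which is close to the k[{...},{...}] layout described in the question.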

Pig script to get top 3 data in a single record

I have sample data with the fields:
user_id, date, accessed url, session time
I want the top 3 interests of each user based on session time.
I got the data below using this code:
top3 = FOREACH DataSet {
    sorted = ORDER DataSet BY sessiontime DESC;
    lim = LIMIT sorted 3;
    GENERATE flatten(group), flatten(lim);
};
Output:
(1,20,url1,2484)
(1,20,url2,1863)
(1,20,url3,1242)
(2,22,url4,484)
(2,22,url5,63)
(2,22,url6,42)
(3,25,url7,500)
(3,25,url8,350)
(3,25,url9,242)
But I want my output to be like this:
(1,20,url1,url2,url3)
(2,22,url4,url5,url6)
(3,25,url7,url8,url9)
Please help.
You are close. The problem is that you FLATTEN the bag of URLs when you really want to keep them all in one record. So do this instead:
top3 = FOREACH DataSet {
    sorted = ORDER DataSet BY sessiontime DESC;
    lim = LIMIT sorted 3;
    GENERATE flatten(group), lim.url;
};
Based on the output you got, you will now get
(1,20,{(url1),(url2),(url3)})
(2,22,{(url4),(url5),(url6)})
(3,25,{(url7),(url8),(url9)})
Note that the URLs are contained inside a bag. If you want to have them as three top-level fields, you will need to use a UDF to convert a bag into a tuple, and then FLATTEN that.

Remove duplicates in a Django query

Is there a simple way to remove duplicates in the following basic query:
email_list = Emails.objects.order_by('email')
I tried using duplicate() but it was not working. What is the exact syntax for doing this query without duplicates?
This query will not give you duplicates - ie, it will give you all the rows in the database, ordered by email.
However, I presume what you mean is that you have duplicate data within your database. Adding distinct() here won't help, because even if you have only one field, you also have an automatic id field - so the combination of id+email is not unique.
Assuming you only need one field, email, de-duplicated, you can do this:
email_list = Email.objects.values_list('email', flat=True).distinct()
However, you should really fix the root problem, and remove the duplicate data from your database.
Example, deleting duplicate Emails by email field:
for email in Email.objects.values_list('email', flat=True).distinct():
    Email.objects.filter(pk__in=Email.objects.filter(email=email).values_list('id', flat=True)[1:]).delete()
Or books by name:
for name in Book.objects.values_list('name', flat=True).distinct():
    Book.objects.filter(pk__in=Book.objects.filter(name=name).values_list('id', flat=True)[3:]).delete()
To check for duplicates you can do a GROUP BY and HAVING in Django as below, using Django annotations:
from django.db.models import Count
from app.models import Email
duplicate_emails = Email.objects.values('email').annotate(email_count=Count('email')).filter(email_count__gt=1)
Now loop through the above data and delete all emails except the first one (or whatever your requirement is):
for data in duplicate_emails:
    email = data['email']
    # .delete() cannot be called on a sliced queryset, so collect the surplus pks first
    surplus_pks = Email.objects.filter(email=email).order_by('pk').values_list('pk', flat=True)[1:]
    Email.objects.filter(pk__in=list(surplus_pks)).delete()
You can chain .distinct() on the end of your queryset to filter duplicates. Check out: http://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct
You may be able to use the distinct() function, depending on your model. If you only want to retrieve a single field from the model, you could do something like:
email_list = Emails.objects.values_list('email').order_by('email').distinct()
which should give you an ordered list of emails.
You can also use set()
email_list = set(Emails.objects.values_list('email', flat=True))
You can also annotate the queryset with a Subquery on the same model:
from django.db.models import Subquery, OuterRef

email_list = Emails.objects.filter(
    pk__in=Emails.objects.values('email').distinct().annotate(
        pk=Subquery(
            Emails.objects.filter(
                email=OuterRef('email')
            )
            .order_by('pk')
            .values('pk')[:1]
        )
    ).values_list('pk', flat=True)
)
This queryset generates the following query:
SELECT `email`.`id`,
       `email`.`title`,
       `email`.`body`,
       ...
FROM `email`
WHERE `email`.`id` IN (
    SELECT DISTINCT (
        SELECT U0.`id`
        FROM `email` U0
        WHERE U0.`email` = V0.`email`
        ORDER BY U0.`id` ASC
        LIMIT 1
    ) AS `pk`
    FROM `email` V0
)
Cheat sheet:
from django.db.models import Subquery, OuterRef

group_by_duplicate_col_queryset = Models.objects.filter(
    pk__in=Models.objects.values('duplicate_col').distinct().annotate(
        pk=Subquery(
            Models.objects.filter(
                duplicate_col=OuterRef('duplicate_col')
            )
            .order_by('pk')
            .values('pk')[:1]
        )
    ).values_list('pk', flat=True)
)
I used the following to actually remove the duplicate entries from the database; hopefully this helps someone else.
adds = Address.objects.all()
# note: distinct() with field names is only supported on PostgreSQL
d = adds.distinct('latitude', 'longitude')
for address in adds:
    if address not in d:
        address.delete()
You can use this raw query:
your_model.objects.raw("select * from appname_Your_model group by column_name")