Pandas - split columns and count occurrences - pandas

I've got a dataset of purchases made by clients on phone accessories. A row of my real data would look something like this:
Abstract Model 1 ~Samsung S6 | Sold: 4
A simplified version of the data looks like this:
item sold
Design1 ~Model1 1
Design2 ~Model1 2
Design1 ~Model2 3
Design2 ~Model2 1
I want to split the item column into two columns, design and model, and count how many times each design and each model has been sold individually, based on the sold counts of the design+model combinations in the input.
My expected output, based on the dataset above, would look something like this:
design design_sold model model_sold
Design1 4 Model1 3
Design2 3 Model2 4

Try this:
df[['Design', 'Model']] = df['item'].str.split(' ~', expand=True)
print(pd.concat([df.groupby('Design', as_index=False)['sold'].sum().rename(columns={'sold': 'Design Sold'}),
                 df.groupby('Model', as_index=False)['sold'].sum().rename(columns={'sold': 'Model Sold'})], axis=1))
Output:
Design Design Sold Model Model Sold
0 Design1 4 Model1 3
1 Design2 3 Model2 4
Explanation:
1. .str.split() is used to split the item column into the Design and Model columns.
2. groupby on Design and on Model and sum the sold column on each grouped object.
3. Rename the columns and concat the two dataframes.
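For reference, a self-contained version of the above (a minimal sketch, built on the sample data from the question):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'item': ['Design1 ~Model1', 'Design2 ~Model1', 'Design1 ~Model2', 'Design2 ~Model2'],
    'sold': [1, 2, 3, 1],
})

# Split "item" on the " ~" separator into two new columns
df[['Design', 'Model']] = df['item'].str.split(' ~', expand=True)

# Sum sold per design and per model, then place the two summaries side by side
design_sold = df.groupby('Design', as_index=False)['sold'].sum().rename(columns={'sold': 'Design Sold'})
model_sold = df.groupby('Model', as_index=False)['sold'].sum().rename(columns={'sold': 'Model Sold'})
print(pd.concat([design_sold, model_sold], axis=1))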

Related

Balancing a multilabel dataset using Julia

I have a dataframe like this:
id text feat_1 feat_2 feat_3 feat_n
1 random coments 0 0 1 0
2 random coments2 1 0 1 0
1 random coments3 1 1 1 1
The feat columns go from feat_1 to feat_100 and they are the labels of a multilabel dataset. The values are 1 and 0 (boolean).
The dataset has over 50k records and the labels are unbalanced. I am looking for a way to balance it, and I was working on this approach:
Sum the values in each feat column and then use the lowest of these sums as a threshold to filter the dataset.
I need to keep all the feature columns, so the only thing I can exclude to achieve this is comments (rows).
The main idea boils down to: I need a balanced dataset to use in a multilabel classification problem, i.e. roughly the same amount of data for each feat column, since they are my labels.
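A rough pandas sketch of the approach described above (the feat_ column prefix, the per-label cap equal to the rarest label's positive count, and the greedy row selection are assumptions on my part, not a definitive solution):
import pandas as pd

# Tiny stand-in for the real dataframe, built from the sample in the question
df = pd.DataFrame({
    'id': [1, 2, 1],
    'text': ['random coments', 'random coments2', 'random coments3'],
    'feat_1': [0, 1, 1], 'feat_2': [0, 0, 1], 'feat_3': [1, 1, 1], 'feat_n': [0, 0, 1],
})

feat_cols = [c for c in df.columns if c.startswith('feat_')]

# Lowest column sum = positive count of the rarest label, used as the cap
cap = df[feat_cols].sum().min()

# Greedily keep rows (comments) as long as none of their labels exceeds the cap
kept = []
counts = {c: 0 for c in feat_cols}
for idx, row in df.sample(frac=1, random_state=0).iterrows():  # shuffle first
    labels = [c for c in feat_cols if row[c] == 1]
    if labels and all(counts[c] < cap for c in labels):
        kept.append(idx)
        for c in labels:
            counts[c] += 1

balanced = df.loc[kept]
print(balanced[feat_cols].sum())  # per-label positive counts after balancing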

pandas create Cross-Validation based on specific columns

I have a dataframe of a few hundred rows that can be grouped by id as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows from the same Id will always be in the same set.
So either all the rows of a are in the test set, or all of them are in the train set, and likewise for all the other Ids.
I want to have 5 folds, so 80% of the Ids will go to the train set and 20% to the test set.
I understand that it can't guarantee that all folds will have exactly the same number of rows, since one Id might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can pass the result to GridSearchCV() as the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation first in the future.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. over multiple splits, an Id may appear in more than one test set). If you want each Id to appear in exactly one test fold, you can use GroupKFold(). This is also available in sklearn.model_selection, and it directly extends KFold to take group labels into account.
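A minimal sketch of wiring this into GridSearchCV (the classifier, the parameter grid and the placeholder target y are illustrative assumptions; only the Val/Id columns come from the question):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit

# Data from the question (including the rows shown after the ellipsis)
df = pd.DataFrame({
    'Val1': [2, 1, 5, 5, 0, 3, 2, 7, 6, 5, 4, 1],
    'Val2': [2, 2, 7, 1, 9, 1, 7, 2, 5, 1, 9, 8],
    'Val3': [8, 3, 8, 4, 0, 3, 5, 8, 5, 8, 0, 2],
    'Id':   ['b', 'a', 'z', 'a', 'c', 'b', 'z', 'c', 'd', 'a', 'z', 'z'],
})
X = df[['Val1', 'Val2', 'Val3']]
y = (df['Val3'] > 4).astype(int)  # placeholder target, not part of the question
groups = df['Id']

# Option 1: 5 random 80/20 splits of the Ids (an Id may land in several test sets)
cv = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Option 2: each Id appears in exactly one test fold
# cv = GroupKFold(n_splits=5)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [10, 50]},
                    cv=cv)
grid.fit(X, y, groups=groups)  # groups is forwarded to the group-aware splitter
print(grid.best_params_)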

SSRS: Summing Lookupset not working

I'm currently working on a report where I'm given 3 different datasets. The report essentially calculates the input, output and losses of a given food production process.
In dataset "Spices", contains the quantity of spices used under a field named "Qty_Spice". In dataset "Meat", contains the quantity of meat used under a field named "Qty_Meat". In dataset "Finished", contains the quantity of finished product used under a field "Qty_Finished".
I'm currently trying to create a table where the amount of input (spice+meat) is compared against output (finished product), such that the table looks like this:
Sum of Inputs (kg) | Finished Product (kg) | Losses (kg)
10 | 8 | 2
8 | 5 | 3
Total:
18 | 13 | 5
What I'm currently doing is using LookupSet to get all inputs of both spices and meats (using LookupSet instead of Lookup because there are many different types of meats and spices used), and then using custom code named "SumLookup" to sum the quantities LookupSet returned.
The problem I'm having is that when I want to get the total sum of all inputs and all finished products (the bottom row of the table) using "SumLookup", the table only returns the first weight it finds. In the example above, it would return 10, 8 and 2 as the totals for inputs, finished product and losses respectively.
Does anyone know how I should approach solving this?
Really appreciate any help
Here is the custom code I used for SumLookUp:
Public Function SumLookup(ByVal items As Object()) As Decimal
    ' Adds up all the values returned by LookupSet
    Dim suma As Decimal = 0
    For Each item As Decimal In items
        suma += item
    Next
    Return suma
End Function

How can I find correlation between very few items in a dataframe - pandas

Hi, I am new to dataframes, please help me resolve this.
My dataframe1 looks like this (it has itemID and ItemName); I only have 7 items:
itemID ItemName
1 abc
2 fds
3 btbtr
4 gerhet
5 dfhkwjfn
6 adaf
7 jdkj
My Dataframe2 looks like this:
It has userId and itemID; here I have 20k users and each user has an itemID in front of it (there can be multiple rows per user):
userId itemID
23213 2
31267 3
52144 1
52144 2
87467 6
How can I find item-item correlation between the items?
I want results like: item1 is highly correlated with item3 and item6.
I tried corrwith() but all I get is NaN.
Please help me with this, thanks in advance.
Here is the approach I can think of. It might be crude, but here we go.
1. Remove all users which have only 1 item in front of them; now you only have users with multiple items.
2. Count the co-occurrence of items, i.e. make a data frame of the sort
item-item : count
1-2 : 50
3-5 : 35
and so on.
3. After getting all the pairwise counts, normalize the count values between 0 and 1, and you have your correlation between all items.
Hope it helps!
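A rough pandas sketch of that idea, using the sample of Dataframe2 from the question (normalizing by the maximum pair count is one simple choice, not the only one):
import pandas as pd
from itertools import combinations
from collections import Counter

# Sample of Dataframe2 from the question (userId -> itemID pairs)
df2 = pd.DataFrame({
    'userId': [23213, 31267, 52144, 52144, 87467],
    'itemID': [2, 3, 1, 2, 6],
})

# 1. Keep only users with more than one item
multi = df2.groupby('userId').filter(lambda g: g['itemID'].nunique() > 1)

# 2. Count how often each item pair co-occurs for the same user
pair_counts = Counter()
for _, g in multi.groupby('userId'):
    for a, b in combinations(sorted(g['itemID'].unique()), 2):
        pair_counts[(a, b)] += 1

cooc = pd.DataFrame(
    [(a, b, c) for (a, b), c in pair_counts.items()],
    columns=['item_a', 'item_b', 'count'],
)

# 3. Normalize the counts to 0-1 to get a crude correlation-like score
cooc['score'] = cooc['count'] / cooc['count'].max()
print(cooc.sort_values('score', ascending=False))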

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
df = df[['tenant_id', 'user_id', 'text']]
This gives the following output:
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids per row. This data is just a sample and has only 10 ids to start with, so in general the number selected would be much less than the total number of user_ids; in this case, say 1 less than the total number of user_ids that belong to a tenant.
I first tried figuring out how to select a random subset of varying length with df.sample:
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kind of lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
Per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column and apply the cell computation to it:
df['new_column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())  # .tolist() stores each sample as a list in the cell
It doesn't really matter which column you use for the apply, since the variable is not used in the computation.
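For completeness, a small self-contained sketch that instead samples within each tenant group, which is what the expected output in the question seems to show (this goes beyond the answer above; the cap of at most min(group size - 1, 10) ids per row is my reading of the question, and it assumes every tenant has at least two users, as in the sample):
import numpy as np
import pandas as pd

df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})

def sample_within_tenant(row):
    # All user_ids belonging to the same tenant (the row's own id may be drawn too)
    pool = df.loc[df['tenant_id'] == row['tenant_id'], 'user_id']
    # Draw between 1 and min(len(pool) - 1, 10) ids
    k = np.random.randint(1, min(len(pool), 11))
    return pool.sample(n=k).tolist()

df['new_column'] = df.apply(sample_within_tenant, axis=1)
print(df)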