pandas/python: Merge/concatenate related data of duplicate rows and add a new column to an existing data frame

I am new to Pandas and wanted your help with data slicing.
I have a dump of 10 million rows with duplicates. Please refer to this image for a sample of the rows and the steps I am looking to perform.
As you can see in the image, the criteria "ABC" from source 'UK' has 2 duplicate entries in the Trg column. I need help with:
Adding a concatenated new column "All Targets", as shown in the image
Removing duplicates from the above table so that only unique values appear, as shown in step 2 in the image
Any help in this regard will be highly appreciated.

I would do it like this:
PART 1:
First define a function that does what you want, then use the apply method:
def my_func(grouped):
    # collect the unique targets within this group
    all_target = grouped["Trg"].unique()
    # write the joined string back to every row of the group
    grouped["target"] = ", ".join(all_target)
    return grouped

df1 = df.groupby("Criteria").apply(my_func)
# output: example with the first 4 rows
  Criteria Trg  target
0      ABC  DE  DE, FR
1      ABC  FR  DE, FR
2      DEF  UK  UK, FR
3      DEF  FR  UK, FR
PART 2:
df2 = df1.drop_duplicates(subset=["Criteria"])
I tried it only on the first 4 rows, so let me know if it works.
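For a 10-million-row dump, a transform-based variant of the same idea can be faster, since it avoids rebuilding each group inside a Python function; a minimal sketch, assuming the same column names as above:
import pandas as pd

df = pd.DataFrame({
    "Criteria": ["ABC", "ABC", "DEF", "DEF"],
    "Trg": ["DE", "FR", "UK", "FR"],
})

# broadcast the joined unique targets back to every row of each group
df["target"] = df.groupby("Criteria")["Trg"].transform(lambda s: ", ".join(s.unique()))

# PART 2 then works the same way
df2 = df.drop_duplicates(subset=["Criteria"])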

Related

How to sum rows in my Pandas dataframe with a specific condition?

Could anyone help me?
I want to sum the values with the format:
print (...+....+)
for example:
      a   b
 France   2
 Italie  15
Croatie   7
I want to sum the values for France and Croatie.
Thank you for your help !
One possible solution:
set column a as the index,
using loc, select the rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note the double square brackets: the outer pair is the "container" of index values passed to loc, and the inner pair, with what is inside, is the list of values to select.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()
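A quick end-to-end check with the sample data from the question:
import pandas as pd

df = pd.DataFrame({"a": ["France", "Italie", "Croatie"], "b": [2, 15, 7]})

result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
print(result)  # 2 + 7 = 9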

Pandas - data per row instead of all in one cell

I am having problems getting the data into separate rows. At the moment all my data per column is in one cell. I would really appreciate your support!
The column header is "Dealer" and it shows one value below it, like this:
|Dealer|
|:---- |
|['Automobiles', 'Garage Benz', 'Cencini SA']|
I would like to get three rows out of this:
| Row | Dealer |
| :-- | :---- |
| 1 | 'Automobiles' |
| 2 | 'Garage Benz' |
| 3 | 'Cencini SA' |
| 4 | .... |
| 5 | .... |
| ... | ... |
what would be the easiest way to achieve this?
Thanks for your support, as I am totally new to pandas!
The easiest way is to put your data into a dict first:
x = {'Dealer':['Automobiles', 'Garage Benz', 'Cencini SA']}
Then
x = pd.DataFrame(x)
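If the list is already sitting in a single cell of an existing DataFrame, as the question shows, pandas' explode method produces one row per list element directly; a minimal sketch, assuming the cell holds an actual Python list rather than a string:
import pandas as pd

df = pd.DataFrame({"Dealer": [["Automobiles", "Garage Benz", "Cencini SA"]]})

# one list element per row; reset_index gives a clean 0..n-1 row number
out = df.explode("Dealer").reset_index(drop=True)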

groupby 2 columns and count into separate columns based on one columns cases

I'm trying to group by 2 columns, the first of which has 5 distinct values and the second 2.
My data looks like this:
and using
df_counted = (
    df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT')
)
I was able to transform it into the cases I want:
However, I don't want a column for RESULT, just columns for the counts.
It's supposed to look like this:
          COUNT_TRUE  COUNT_FALSE
FORWARD           21          182
BACKWARD          34          170
RIGHT             24          298
LEFT              20          242
NEUTRAL           16           82
The best I could do there was this. How do I get there?
Pandas can build a pivot table from a DataFrame; your task can be done by making one:
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result: a table with one row per TYPE and one count column per RESULT value.
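A runnable sketch of the whole pipeline, assuming RESULT holds booleans (if it holds the strings 'TRUE'/'FALSE' instead, adjust the rename mapping accordingly):
import pandas as pd

df_analysis = pd.DataFrame({
    "TYPE": ["FORWARD", "FORWARD", "BACKWARD", "BACKWARD"],
    "RESULT": [True, False, True, False],
})

df_counted = (
    df_analysis
    .groupby(["TYPE", "RESULT"])
    .size()
    .reset_index(name="COUNT")
)

# spread RESULT into one column per value, filling missing combinations with 0
wide = (
    df_counted
    .pivot_table(index="TYPE", columns="RESULT", values="COUNT", fill_value=0)
    .rename(columns={True: "COUNT_TRUE", False: "COUNT_FALSE"})
)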
Solved it and went kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question, with the NaN values.
# drop duplicates, keeping the first row per TYPE
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates, keeping the last row per TYPE
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({
    'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
    'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
    'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop'],
})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id  user_id  text
        1  ab1      apple
        1  avc1     ball
        1  bc2      card
        2  iuyt     toy
        2  fvg      sleep
        2  fbh      happy
        3  bcv      sad
        3  bcb      be
        7  yth      u
        7  ytn      pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id  user_id  text   new_column
        1  ab1      apple  [ab1, bc2]
        1  avc1     ball   [ab1]
        1  bc2      card   [avc1]
        2  iuyt     toy    [fvg, fbh]
        2  fvg      sleep  [fbh]
        2  fbh      happy  [fvg]
        3  bcv      sad    [bcb]
        3  bcb      be     [bcv]
        7  yth      u      [pop]
        7  ytn      pop    [u]
Here, random ids from the user_id column have been selected; these ids can repeat, as 'fvg' is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the number selected would be much less than the total number of user_ids; in this case, say, 1 less than the total user_ids that belong to a tenant.
I tried first figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
The .tolist() keeps each sample as a single list-valued cell; assigning the raw Series is what produced the NaNs. It doesn't really matter which column you use for the apply, since the lambda's argument is not used in the computation.
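The above samples from the whole user_id column; if the selection should be restricted to each tenant's own users, as the desired output suggests, a groupby-based sketch can do it (the helper name pick_ids and the exact per-tenant cap are my own assumptions):
import numpy as np
import pandas as pd

def pick_ids(group):
    # for each row, draw a random-size subset of this tenant's user ids,
    # capped at 10 and at one less than the tenant's user count
    upper = min(10, max(len(group) - 1, 1))
    return group['user_id'].apply(
        lambda _: group['user_id'].sample(n=np.random.randint(1, upper + 1)).tolist()
    )

df['new_column'] = df.groupby('tenant_id', group_keys=False).apply(pick_ids)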

How to facet multiple columns in Google Refine

I have a data set with 30 columns and multiple rows (some cells have no data). I would like to be able to facet the columns in groups.
      1  2  3  4...
Row1  A  B  C  D
Row2  E  A  D  F
Row3  Q  A  B  H
Given the above data, I would like the facet to return the number of instances in a group of columns. For the first three columns I need the facet to return:
A - 3
B - 2
C - 1
D - 1
E - 1
Q - 1
I tried combining the columns when I loaded the data, but the individual values were grouped together as well. This is not the desired outcome. For example:
ABC - 1
EAD - 1
QAB - 1
Thanks in advance.
I can't think of a more efficient way to do this off the top of my head, but you can do a custom facet with something like:
[ cells["1"].value, cells["2"].value, cells["3"].value ]
where "1", "2", and "3" are the names of your columns. If your column names are single words, like "V1", "V2", "V3", and so on, you can also change the custom facet to something like:
[ cells.V1.value, cells.V2.value, cells.V3.value ]
With a lot of columns, this solution might be somewhat tedious though...
Did you try transposing all your columns into one and faceting on that 'master column'?
When transposing, add the column name so you know where each value comes from. Then you can split the master column into a 'source column' and 'data'.
You can find the JSON code to transpose a large number of columns here: http://googlerefine.blogspot.ca/2011/09/json-code-to-transpose-important-number.html
It should work for your project with a limited number of edits.
Hope it helps!