I have a dataframe that collects readings from a device. Sometimes there are multiple readings for the same sample, and each is stored under a separate ID in my dataframe. Is there a way for me to detect the duplicated IDs by using the columns that have the same values?
Sample dataframe:
import pandas as pd

test_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                        'Age': [18, 18, 19, 19, 20, 21],
                        'Sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
                        'Values': [1200, 200, 300, 400, 500, 600]})
I want the result to return ID's 1,2,3,4 since they are duplicated when we compare Age and Sex column values.
Expected Output:
ID Age Sex Values
1 18 Male 1200
2 18 Male 200
3 19 Female 300
4 19 Female 400
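For what it's worth, a minimal sketch of one way to do this with the sample frame above: DataFrame.duplicated with a subset and keep=False flags every row of a duplicated (Age, Sex) group, rather than only the later repeats:
test_df[test_df.duplicated(subset=['Age', 'Sex'], keep=False)]
   ID  Age     Sex  Values
0   1   18    Male    1200
1   2   18    Male     200
2   3   19  Female     300
3   4   19  Female     400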
I'm trying to figure out how to subtract a constant from a column based on the presence of a value in another DataFrame. For example, say I have the DataFrame a below, which contains a person column and a count column:
a = pd.DataFrame({
    "person": ["Bob", "Kate", "Joe", "Mark"],
    "count": [3, 4, 5, 4],
})

  person  count
0    Bob      3
1   Kate      4
2    Joe      5
3   Mark      4
And a second DataFrame b that contains person and whatever other arbitrary columns:
b = pd.DataFrame({
    "person": ["Bob", "Joe"],
    "foo": ['a', 'b'],
})

  person foo
0    Bob   a
1    Joe   b
My hope is that I can change the first DataFrame to look like the below. Specifically, I want to decrease count by one for any instance of person that appears in DataFrame b. It is safe to assume that DataFrame b will always be a subset of DataFrame a and that person will be unique.
  person  count
0    Bob      2
1   Kate      4
2    Joe      4
3   Mark      4
Many thanks in advance!
a["count"] -= a.person.isin(b.person)
With isin we get a boolean mask that is True for each person that appears in the other frame and False otherwise. Treating the booleans as integers (True is 1, False is 0), we can subtract the mask from the count column to get:
>>> a
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
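For reference, the intermediate mask looks like this (each True subtracts 1, each False subtracts 0):
>>> a.person.isin(b.person)
0     True
1    False
2     True
3    False
Name: person, dtype: bool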
This answer assumes that df2 (b above) can have multiple instances of a name, so each occurrence subtracts one. If it is just one instance, you can subtract by just iterating through and checking whether the person is named in the second data frame. In df2, count the occurrences per name:
df2_counts = df2['person'].value_counts()
In df1, map this data over and then subtract the counts, filling 0 for people who never appear in df2:
df1['subtracts'] = df1['person'].map(df2_counts).fillna(0)
df1['count_new'] = df1['count'] - df1['subtracts']
Create a list of the person names from DataFrame b:
listDFB = b['person'].tolist()
Then loop through a to fill the count column accordingly:
for i, rw in a.iterrows():
    if rw['person'] in listDFB:
        # iterrows yields copies, so write back through .at
        a.at[i, 'count'] = rw['count'] - 1
I currently have the following dataframe:
SN Gender Purchase
Name 1 Female 1.14
Name 2 Female 2.50
Name 3 Male 7.77
Name 1 Female 2.74
Name 3 Male 4.58
Name 3 Male 9.99
Name 1 Female 5.55
Name 2 Female 1.20
I am trying to figure out how to get just a count, not a DataFrame, from a table like this. The count must be based on gender (so, how many males are there?), but must be unique by name (SN). So, in this instance, I would have 1 male and 2 females. I have tried multiple ways... value_counts on the DataFrame, unique on the DataFrame, etc., but I keep getting syntax errors.
There are a few ways you can achieve this.
The simplest one would be to use pd.crosstab to get a cross tabulation (count) of the values:
pd.crosstab(df["SN"], df["Gender"])
Gender Female Male
SN
Name 1 3 0
Name 2 2 0
Name 3 0 3
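The crosstab counts rows, not unique names, so to get to "1 male and 2 females" you can reduce it further by counting, per gender, how many SNs have a nonzero count (a small follow-on sketch):
ct = pd.crosstab(df["SN"], df["Gender"])
(ct > 0).sum()
Gender
Female    2
Male      1
dtype: int64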
Another way is to use DataFrame.value_counts(), which came about in pandas version >= 1.1.0. Instead of a cross tabulation, this returns a Series whose values are the counts of rows per unique index combination. The index is a MultiIndex referring to unique combinations of "SN" and "Gender":
df.value_counts(["SN", "Gender"])
SN Gender
Name 3 Male 3
Name 1 Female 3
Name 2 Female 2
dtype: int64
If you're operating with a pandas version older than 1.1.0, you can use a combination of groupby and value_counts. This performs a functionally equivalent operation to DataFrame.value_counts, so we get the same counts (just ordered by the index instead):
df.groupby("SN")["Gender"].value_counts()
SN Gender
Name 1 Female 3
Name 2 Female 2
Name 3 Male 3
Name: Gender, dtype: int64
Edit: If you want to only count the number of unique "SN" for each gender, you can use nunique() instead of value_counts:
unique_genders = df.groupby(["Gender"])["SN"].nunique()
print(unique_genders)
Gender
Female 2
Male 1
Name: SN, dtype: int64
Then you can extract each:
>>> unique_genders["Female"]
2
>>> unique_genders["Male"]
1
I am working on Heart Disease Prediction data and I want to know the unique values for each column.
First I took the total number of unique values per feature in my data:
framinghamDF.nunique()
Output:
male 2
age 39
education 4
currentSmoker 2
cigsPerDay 33
BPMeds 2
prevalentStroke 2
prevalentHyp 2
diabetes 2
totChol 248
sysBP 234
diaBP 146
BMI 1364
heartRate 73
glucose 143
TenYearCHD 2
dtype: int64
Then I took out an individual feature's unique values:
print(framinghamDF["education"].unique().tolist())
Output:
[4.0, 2.0, 1.0, 3.0, nan]
But I want to get all the unique values of the features which have fewer than 4 unique values.
Filter the index values of the Series with boolean indexing:
s = framinghamDF.nunique()
out = s.index[s < 4].tolist()
#alternative
out = s[s < 4].index.tolist()
Last, for all the unique values use a dict comprehension:
d = {x: framinghamDF[x].unique() for x in out}
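With the nunique output from the question, out would be ['male', 'currentSmoker', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'TenYearCHD']. Note that education (exactly 4 unique values) is excluded by the strict s < 4; use s <= 4 if it should be included. The resulting d then maps each of those column names to its array of unique values.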
I have a dataframe which has various entries of customers. These customers, which have different customer numbers, belong to certain customer groups (contract, wholesaler, tender, etc.). I have to sum some of the values of the dataframe into a Series with one entry per customer group (e.g., the total sales of contract customers would be a single entry in the Series).
I've tried using .isin() but I got an attribute error (float object has no attribute 'isin'). It works if I use the or operator, but then I have to manually enter all customer numbers for all customer groups. I'm sure there must be a much simpler and more efficient way of doing it. Many thanks in advance.
for i in range(len(grouped_sales)):
    if df.iloc[i,1]==value1 or df.iloc[i,1]==value2 or df.iloc[i,1]==...:
        series[1]=series[1]+df.iloc[i,3]
    elif df.iloc[i,1]==valueN or df.iloc[i,1]==value(N+1)...:
        series[2]=series[2]+df.iloc[i,3]
    elif:
        ...
If you want to sum the sales for every group, you may want to look into pandas' df.groupby(). Trying to reproduce what you want, it would look like this:
>>> df = pd.DataFrame()
>>> df['cust_numb']=[1,2,3,4,5]
>>> df['group']=['group1','group2','group3','group3','group1']
>>> df['sales']=[50,30,50,40,20]
>>> df
cust_numb group sales
0 1 group1 50
1 2 group2 30
2 3 group3 50
3 4 group3 40
4 5 group1 20
>>> df.groupby('group').sum()['sales']
group
group1 70
group2 30
group3 90
Name: sales, dtype: int64
You'll have a Series with the groups as index and the sum of the sales as values.
EDIT: Based on your comment, you have the group data in a separate dictionary; the implementation would look like this:
>>> sales_data = {'CustomerName': ['cust1', 'cust2', 'cust3', 'cust4'],'CustomerCode': [1,2,3,4], 'Sales': [10,10,15,25], 'Risk':[55,55,45,79]}
>>> sdf = pd.DataFrame.from_dict(sales_data)
>>> group_data ={'group1': [1,3], 'group2': [2,4]}
You want to map your customer number to the groups so you need an inverted dictionary:
>>> dc = {v: k for k in group_data.keys() for v in group_data[k]}
>>> dc
{1: 'group1', 3: 'group1', 2: 'group2', 4: 'group2'}
Then you replace the customer number column with the group mapping in a new column and reproduce what I did above:
>>> sdf['groups'] = sdf.replace({'CustomerCode': dc})['CustomerCode']
>>> sdf
CustomerName CustomerCode Sales Risk groups
0 cust1 1 10 55 group1
1 cust2 2 10 55 group2
2 cust3 3 15 45 group1
3 cust4 4 25 79 group2
>>> sdf.groupby('groups').sum()['Sales']
groups
group1 25
group2 35
Name: Sales, dtype: int64
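An equivalent and arguably more direct way to build the groups column is Series.map, which looks each customer code up in the inverted dictionary and gives the same result as the replace above:
>>> sdf['groups'] = sdf['CustomerCode'].map(dc)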
I have a large data frame that I would like to develop a summation table from. In other words, column 1 would be the columns of the first data frame, column 2 would be each unique value of each column, and columns three onward would be a summation of the different variables I choose. Like the below:
Variable Level Summed_Column
Here is some sample code:
data = {"name": ['bob', 'john', 'mary', 'timmy']
, "age": [32, 32, 29, 28]
, "location": ['philly', 'philly', 'philly', 'ny']
, "amt": [100, 2000, 300, 40]}
df = pd.DataFrame(data)
df.head()
So the output in the above example would be as follows:
Variable  Level   Summed_Column
name      bob     100
name      john    2000
name      mary    300
name      timmy   40
age       32      2100
age       29      300
age       28      40
location  philly  2400
location  ny      40
I'm not even sure where to start. The actual dataframe has 32 columns, of which 4 will be summed and 28 put into the Variable and Level format.
You don't need a loop and concatenation for this; you can do it in one go by combining melt with groupby and using the agg method:
final = df.melt(value_vars=['name', 'age', 'location'], id_vars='amt')\
.groupby(['variable', 'value']).agg({'amt':'sum'})\
.reset_index()
Which yields:
print(final)
variable value amt
0 age 28 40
1 age 29 300
2 age 32 2100
3 location ny 40
4 location philly 2400
5 name bob 100
6 name john 2000
7 name mary 300
8 name timmy 40
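For reference, the intermediate frame that melt produces before the groupby stacks the three columns into variable/value pairs, keeping amt alongside each row:
>>> df.melt(value_vars=['name', 'age', 'location'], id_vars='amt').head()
    amt variable value
0   100     name   bob
1  2000     name  john
2   300     name  mary
3    40     name timmy
4   100      age    32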
OK @Datanovice, I figured out how to do this using a for loop with pd.melt.
id = ['name', 'age', 'location']
final = pd.DataFrame(columns=['variable', 'value', 'amt'])
for i in id:
    table = df.groupby(i).agg({'amt': 'sum'}).reset_index()
    table2 = pd.melt(table, value_vars=i, id_vars=['amt'])
    final = pd.concat([final, table2])
print(final)
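This builds the same variable/value/amt table as the melt one-liner above, just one column at a time: each pass sums amt for a single column, then melts that summary into the variable/value shape before concatenating it onto final.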