merging to find only unique values - pandas

I have a simple requirement: using a Lambda script, insert data into MongoDB only if the data is not already present, based on a primary key (my primary key is a combination of 3 attributes: item_id, process_name and actor). Explanation of what I am doing currently:
1). df is the dataframe loaded from the input CSV file; it will have only the headers relevant to the process. This df data has to be inserted into my db.
df = wr.s3.read_csv(path=part_file_url, low_memory=False)
df = df[header]
NOTE: header will have the primary key attributes along with additional default attributes for insertion.
2). I query my collection based on process name and id. Further, I use project_keys to bring in only the primary key attributes of all records from the db, as follows:
project_keys ={'_id': 0, 'item_id': 1, 'process_name': 1, 'actor': 1}
"queryString": [ "item_id", "process_name","actor"]
curr_data_df = pd.DataFrame(list(audit_collection.find({"process_name":process_name, "processId":process_id}, project_keys)), columns=queryString)
3). Now to find the unique values, I do the following:
result = df.merge(curr_data_df,how='left', indicator=True) #step 3.a
result = result[result['_merge']=='left_only'] # step 3.b
df = result.drop(columns='_merge') # step 3.c
Once I fetch the unique values into df, I transform every item in df and do insert_many into my collection.
Queries:
Is my above solution of finding unique values in the input df, by left merging on the primary key, correct?
In my merge function I don't specify any column names, since from my database I bring in only the primary key columns and my df already has those 3 keys. So will it by default use the common columns "item_id", "process_name" and "actor" as the merge keys, or do I have to specify them explicitly? During my testing with small data in a Python shell the above worked, but I wanted this confirmed.
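For what it's worth, here is a minimal sketch with made-up rows (extra_col is an assumed stand-in for the non-key columns) showing that merge with no on= argument joins on all columns the two frames have in common, which here are the three key columns:
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 2, 3],
    "process_name": ["p", "p", "p"],
    "actor": ["a", "b", "c"],
    "extra_col": [10, 20, 30],   # stands in for the non-key columns of the real df
})
curr_data_df = pd.DataFrame({
    "item_id": [1, 2],
    "process_name": ["p", "p"],
    "actor": ["a", "b"],
})

# With no on= argument, merge joins on the intersection of the column names,
# which here is item_id, process_name and actor.
result = df.merge(curr_data_df, how="left", indicator=True)
new_rows = result[result["_merge"] == "left_only"].drop(columns="_merge")
print(new_rows)  # only the row with item_id 3 is left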
I saw one weird case today and am unable to figure it out:
I uploaded the same file a second time to check whether this data-duplication check works. See the logs:
Number of records in original input df: 148 (df from step 1)
Original MONGODB source data-> Number of records: 684, Number of columns: 3 (curr_data_df from step 2)
Merge input*mongodb -> Number of records: 4166, Number of columns: 53 (step 3.a)
Unique new records -> Number of records after removing duplicates: 0, Number of columns: 53 (step 3.b)
Input df Unique records -> Number of records in final df: 0, Number of columns: 53 (step 3.c)
So although the df correctly identified in the last step that there are 0 unique new records, why was the merged record count in step 3.a 4166, when the input had only 148 records and the database data has only 684 records? Any insights here?
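I can't tell from the logs alone, but one thing a small sketch can illustrate (assuming there are repeated key combinations in the data): when the same item_id/process_name/actor combination appears several times on both sides, merge emits one row per matching pair, so the result can have far more rows than either input.
import pandas as pd

left = pd.DataFrame({"item_id": [1, 1],
                     "process_name": ["p", "p"],
                     "actor": ["a", "a"]})
right = pd.DataFrame({"item_id": [1, 1, 1],
                      "process_name": ["p", "p", "p"],
                      "actor": ["a", "a", "a"]})

# 2 matching rows on the left x 3 matching rows on the right -> 6 rows in the result,
# more than either input on its own.
merged = left.merge(right, how="left", indicator=True)
print(len(merged))  # 6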

Related

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

So I am new to using Python Pandas dataframes.
I have a dataframe with one column representing customer ids and the other holding flavors and satisfaction scores that looks something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that will only print out customers who have 6 rows?
I tried doing: df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it
If you post the data as code (preferably) or as text, I would be able to share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning the group count to it using transform
# finally, use loc to select rows that have a count equal to 6
df.loc[
    df.assign(
        c=df.groupby('Customer No')['Customer No'].transform('count')
    )['c'].eq(6)
]
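For reference, a minimal reproducible sketch with made-up data (the customer numbers and the dummy 'flavor' column are assumptions) that exercises the selection above:
import pandas as pd

# made-up data: Customer 1 has only 5 rows, Customers 2 and 3 have 6 each
df = pd.DataFrame({
    'Customer No': [1] * 5 + [2] * 6 + [3] * 6,
    'flavor': list('abcde') + list('abcdef') * 2,
})

out = df.loc[
    df.assign(
        c=df.groupby('Customer No')['Customer No'].transform('count')
    )['c'].eq(6)
]
print(out['Customer No'].unique())  # [2 3]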

Update Column Value for previous entries of row value, based on new entry

I have a dataframe that is updated monthly as such, with a new row for each employee.
If an employee decides to change their gender (for example here, employee 20215 changed from M to F in April 2022), I want all previous entries for that employee number 20215 to be switched to F as well.
This is for a database with roughly 15 million entries, with multiple such changes every month, so I was hoping for a scalable solution (I cannot simply set df['Gender'] = 'F', for example).
Since we didn't receive a df from you or any code, I needed to generate something myself in order to test it. Please provide enough code and give us a sample next time as well.
Here is the generated df, in case someone comes up with a better answer:
import pandas as pd, numpy as np

length = 100
df = pd.DataFrame({'ID': np.random.randint(1001, 1020, length),
                   'Ticket': np.random.randint(length),
                   'salary_grade': np.random.randint(0, 10, size=length),
                   'date': np.arange(length),
                   'genre': 'M'})
df['date'] = pd.to_numeric(df['date'])
df['date'] = pd.to_datetime(df['date'], dayfirst=True, unit='D', origin='15.04.2022')
That is the base DF; now I needed to simulate some gender changes:
test_id = df.groupby(['ID'])['genre'].count().idxmax()  # gives me the employee with the most entries
test_id
df[df['ID'] == test_id].loc[:, 'genre']  # getting all indexes from test_id, for a test change / later for checking
df[df['ID'] == test_id]  # getting indexes of test_id for the gender change
id_lst = []
for idx in df[df['ID'] == test_id].index:
    if idx > 28:  # <-- change this value for your generated df, middle of the list
        id_lst.append(idx)  # returns a list of indexes where the gender change will happen
df.loc[id_lst, 'genre'] = 'F'  # applying a gender change
Answer:
Finally, to your answer:
finder = df.groupby(['ID']).agg({'genre': lambda x: len(list(pd.unique(x))) > 1, 'date': 'min'})  # will return True for every ID with more than one genre
finder[finder['genre']]  # will return the IDs matching the above condition
Next steps...
Now, with the ID, you just need to discover whether it is M-->F or F-->M, set new_genre accordingly, and assign the new genre for the ID_found (int or list).
df.loc[ID_found,'genre']=new_genre
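If it helps with those next steps, here is a minimal sketch against the generated df above; it uses a generic sort plus groupby/transform('last') to broadcast each ID's most recent genre to all of that ID's rows, rather than the exact ID_found assignment:
# take the genre from each ID's latest row (by date) and broadcast it to all rows of that ID
latest_genre = (df.sort_values('date')
                  .groupby('ID')['genre']
                  .transform('last'))
df['genre'] = latest_genre  # assignment aligns on the index, so the sort order does not matter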

Sum pandas columns, excluding some rows based on other column values

I'm attempting to determine the number of widget failures from a test population.
Each widget can fail in 0, 1, or multiple ways. I'd like to calculate the number of failures for each failure mode, but once a widget is known to have failed, it should be excluded from future sums. In other words, the failure modes are known and ordered. If a widget fails via mode 1 and mode 3, I don't care about mode 3: I just want to count mode 1.
I have a dataframe with one row per item, and one column per failure mode. If the widget fails in that mode, the column value is 1, else it is 0.
d = {"item_1":
{"failure_1":0, "failure_2":0},
"item_2":
{"failure_1":1, "failure_2":0},
"item_3":
{"failure_1":0, "failure_2":1},
"item_4":
{"failure_1":1, "failure_2":1}}
df = pd.DataFrame(d).T
display(df)
Output:
failure_1 failure_2
item_1 0 0
item_2 1 0
item_3 0 1
item_4 1 1
If I just want to sum the columns, that's easy: df.sum(). And if I want to calculate percentage failures, easy too: df.sum()/len(df). But this counts widgets that fail in multiple ways, multiple times. For the problem stated, the best I can come up with is this:
# create empty df to store results
df2 = pd.DataFrame(columns=["total_failures"])
for col in df.columns:
    # create a row, named after the column, and assign it the value of the sum
    df2.loc[col] = df[col].sum()
    # drop rows in the df column that are equal to 1
    df = df.loc[df[col] != 1]
display(df2)
Output:
total_failures
failure_1 2
failure_2 1
This requires creating another dataframe (that's fine), but it also requires iterating over the existing dataframe's columns and deleting a couple of rows from it at each step. If the dataframe takes a while to generate, or is needed for future calculations, this is not workable. I can deal with iterating over the columns.
Is there a way to do this without deleting the original df, or making a temporary copy? (Not workable with large data sets.)
You can do a cumsum on axis=1 and, wherever the value is greater than 1, mask it as 0, then take the sum:
out = df.mask(df.cumsum(axis=1).gt(1), 0).sum().to_frame('total_failures')
print(out)
total_failures
failure_1 2
failure_2 1
This way the original df is retained too.
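For anyone who wants to see the intermediate steps, a small sketch using the df built above:
cum = df.cumsum(axis=1)             # running failure count across modes, per item
first_only = df.mask(cum.gt(1), 0)  # zero out any failure after the first one in each row
print(cum)
print(first_only)
print(first_only.sum())             # matches total_failures above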

Replace a subset of pandas data frame with another data frame

I have a data frame (DF1) with 100 columns (one of the columns is ID).
I have one more data frame (DF2) with 30 columns (one column is ID).
I have to update the first 30 columns of DF1 with the values in DF2, keeping the values in the remaining columns of DF1 intact.
That is, update the first 30 column values in DF1 (out of the 100 columns) when the ID in DF2 is present in DF1.
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = (df1.reset_index()
             [['index', 'ID']]
             .merge(df2, on='ID'))
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index, or the merge will fail complaining about a duplicate key with a suffix.
Here is a slow-motion walkthrough of its inner workings:
df1.reset_index() returns a dataframe the same as df1 but with an additional column: index
[['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1
.merge(...) merges with df2, matching on ID. The result (joined) is a dataframe with 32 columns: index, ID and the original 30 columns of df2.
df1.loc[<row_indexes>, <column_names>] = <another_dataframe> means you want to replace those particular cells with data from another_dataframe. Since joined has 32 columns, we need to drop the extra 2 (index and ID).

How to preallocate memory for a large pandas dataframe?

I need to create a large dataframe to save my data. It has 30001 columns and 1000 rows. For the data types, 30000 columns are int64, and the last column holds a hash value.
So I first create an empty dataframe:
df = pd.DataFrame(columns=columnNames, data=np.empty(shape=(1000, 30001)))
And then I create a Series based on the dataframe's columns:
record = pd.Series(index=df.columns)
Then in a loop I populate the record and assign it to the dataframe:
loop:
    record[0:30000] = values   # fill record with values
    record['hash'] = hash_value
    df.loc[index] = record     # <==== this is slow
    index += 1
When I debugged my code, I found that the step above, which assigns the record to a row, is horribly slow.
My guess is that if I could create a dataframe with exactly the right size preallocated, then assigning the record to each row would be much faster.
So can I create the dataframe with full size preallocated?
(Note: my original dataframe does not have the 'hash' column, and it runs without any performance issue. Recently I found I need this additional hash column, which is a string value, and the performance issue appeared right after the new column was added.)
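Not a definitive answer, but a common workaround sketched below: instead of growing the DataFrame row by row, preallocate a NumPy int64 block for the numeric columns and a plain list for the hash strings, then build the DataFrame once after the loop. The column names and the placeholder values/hash_value are assumptions:
import numpy as np
import pandas as pd

n_rows, n_ints = 1000, 30000

int_block = np.empty((n_rows, n_ints), dtype=np.int64)  # preallocated numeric storage
hashes = [None] * n_rows                                # object column kept separately

for i in range(n_rows):
    values = np.arange(n_ints)   # placeholder for the real row values
    hash_value = f"hash_{i}"     # placeholder for the real hash string
    int_block[i, :] = values
    hashes[i] = hash_value

# build the DataFrame once, instead of row-by-row df.loc[index] = record
df = pd.DataFrame(int_block, columns=[f"col_{j}" for j in range(n_ints)])
df["hash"] = hashes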