Removing duplicates in pandas without drop_duplicates

Please be informed that I already looked through various posts before turning to you.
In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition
My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows whose values are contained in that list.
#My list
list1 = ['s1' , 's2']
print(len(list1))
#My dataframe
data1 = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6']})
print(len(data1))
#Remove all the rows that hold a value contained in list1 matched against the 'id' column
data2 = data1[~data1.id.isin(list1)]
print(len(data2))
Now, let's see the output:
Len list1 = 135
Len data1 = 8942
Len data2 = 8672
So, I came to the conclusion that my code is somehow doubling the number of rows to be removed and then removing them.
However, when I follow the drop_duplicates approach, my code works just fine and removes the 135 rows.
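For reference, here is a minimal self-contained sketch (with made-up data) that reproduces the doubling I am seeing:
import pandas as pd
data1 = pd.DataFrame({'id': ['s1', 's1', 's2', 's2', 's3', 's4']})
list1 = ['s1', 's2']                          # the two duplicated ids
print(len(data1[~data1.id.isin(list1)]))      # 2 -> four rows removed, twice the length of list1
print(len(data1.drop_duplicates('id')))       # 4 -> drop_duplicates removes only two rows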
Could any of you help me understand why that is happening? I tried to simplify the issue as far as possible.
Thanks a lot!

This is an extraordinarily painful way to do what you're asking. Maybe someone will see this and post a less painful way. I specifically stayed away from groupby('id').first() as a means to remove duplicates because you mentioned needing to first create a list of duplicates. But that would be my next best recommendation.
Anyway, I added duplicates of s1 and s2 to your example:
df = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6', 's1' , 's2', 's2']})
Finding IDs with more than 1 entry (assuming these are duplicates). Here I do use groupby to get counts, keep those > 1, and send the unique values to a list:
dup_list = df[df.groupby('id')['id'].transform('count') > 1]['id'].unique().tolist()
print(dup_list)
['s1', 's2']
Then iterate over the list, finding the indices that are duplicated and removing all but the first:
for id in dup_list:
    # every occurrence of this id except the first
    drp = df[df['id'] == id].index[1:].to_list()
    df.drop(drp, inplace=True)
df
id
0 s1
1 s2
2 s3
3 s4
4 s5
5 s6
Indices 6, 7, and 8 were dropped
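For comparison, duplicated() collects the same indices in one shot; a sketch against the original df above (before the loop modified it):
drp = df[df.duplicated('id')].index.to_list()
print(drp)          # [6, 7, 8]
df = df.drop(drp)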

Related

Creating batches based on city in pandas

I have two different dataframes that I want to fuzzy match against each other to find and remove duplicates. To make the process faster and more accurate, I want to fuzzy match only records from both dataframes that are in the same cities. That makes it necessary to create batches based on cities in the one dataframe, and then run the fuzzy matcher between each batch and a subset of the other dataframe with like cities. I can't find another post that does this and I am stuck. Here is what I have so far. Thanks!
df = pd.DataFrame({'A':[1,1,2,2,2,2,3,3],'B':['Q','Q','R','R','R','P','L','L'],'origin':['file1','file2','file3','file4','file5','file6','file7','file8']})
cols = ['B']
df1 = df[df.duplicated(subset=cols,keep=False)].copy()
df1 = df1.sort_values(cols)
df1['group'] = 'g' + (df1.groupby(cols).ngroup() + 1).astype(str)
df1['duplicate_count'] = df1.groupby(cols)['origin'].transform('size')
df1_g1 = df1.loc[df1['group'] == 'g1']
print(df1_g1)
which will not factor in anything that isn't duplicated, so if a value only appears once it will be skipped, as is the case with 'P' in column B. It also requires me to go in and hard-code the group name each time, which is not ideal. I haven't been able to figure out a for loop or any other method to solve this. Thanks!
You can write the grouped frames into locals():
variables = locals()
for i, j in df1.groupby('group'):
    variables["df1_{0}".format(i)] = j
df1_g1
Out[314]:
A B origin group duplicate_count
6 3 L file7 g1 2
7 3 L file8 g1 2
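A dict keyed by group name avoids writing into locals() and is easier to loop over later; a sketch reusing df1 from above:
groups = {name: g for name, g in df1.groupby('group')}
groups['g1']    # same frame as df1_g1 above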

Advanced condition lookup in pandas(numpy)

given:
a list of elements 'ls' and a big dataframe 'df'; all the elements of 'ls' are in 'df'.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
['a0','b0','c1'],
['a0','b0','c2'],
...
['a_i','b_j','c_k']]
goal:
I want to collect the rows of 'df' that contain the most elements of 'ls'; for example, ['a0','b0','c0'] would be the best one. But some rows contain only 2 of the elements at most.
tried:
I tried enumerating 3 or 2 elements of 'ls', but it was too expensive and could return None, since some rows contain only 2 of the elements.
I tried to use a dictionary to count, but it didn't work either.
I've been puzzling over this problem all day, any help will be greatly appreciated.
I would go like this:
counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)
This gives, for each row, the number of its values that appear in the list.
The desired row (the one with the most matches) can then be obtained with:
df.loc[counts.idxmax()]
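A self-contained sketch with made-up data showing both steps together:
import pandas as pd

ls = ['a0', 'b0', 'c0']
df = pd.DataFrame([['a0', 'b0', 'c0'],
                   ['a0', 'b0', 'c1'],
                   ['a1', 'b2', 'c2']], columns=['x', 'y', 'z'])

counts = df.apply(lambda row: row.isin(ls).sum(), axis=1)   # 3, 2, 0 matches per row
print(df.loc[counts.idxmax()])                              # row 0 holds the most list elements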

How to split a column into multiple columns and then count the null values in the new column in SQL or Pandas?

I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem I have is that some metadata values are incomplete or partial, that is, they are missing the string after the ":". I want to get a count of how many of these have the part after the colon missing.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing the part after ":") for the other 2 entries. Ideally, I also want some statistics on SomeValue (count, max, min, etc.).
How do I do it in an SQL query or in Python Pandas?
It might turn out to be simple with some built-in function; however, I am not getting it right.
Data:
Group MetaData SomeValue
A AB:xxx 20
A AB: 5
A PQ:yyy 30
A PQ: 2
Expected Output result:
Group MetaDataComplete Count
A Yes 2
A No 2
No reason to use split functions (unless the value can contain a colon character). I'm just going to assume that the "null" values (not technically the right word) end with :.
select
"Group",
case when MetaData like '%:' then 'Yes' else 'No' end as MetaDataComplete,
count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'Yes' else 'No' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
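On the pandas side of the question, a rough equivalent of the grouping above, sketched with the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'A'],
                   'MetaData': ['AB:xxx', 'AB:', 'PQ:yyy', 'PQ:'],
                   'SomeValue': [20, 5, 30, 2]})

complete = df['MetaData'].str.endswith(':').map({True: 'No', False: 'Yes'})
out = (df.groupby(['Group', complete.rename('MetaDataComplete')])
         .size().reset_index(name='Count'))
print(out)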
Here is an example:
# 1- Create Dataframe
In [1]:
import pandas as pd
import numpy as np
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
['A', 'AB:', 5],
['A', 'PQ:yyy', 30],
['A', 'PQ:', 2]
]
df = pd.DataFrame(columns=cols, data=data)
# 2- New data frame with split value columns
new = df["MetaData"].str.split(":", n = 1, expand = True)
df["MetaData_1"]= new[0]
df["MetaData_2"]= new[1]
# 3- Dropping old MetaData columns
df.drop(columns =["MetaData"], inplace = True)
# 4- Replacing empty strings with NaN and counting them
df.replace('',np.NaN, inplace=True)
df.isnull().sum()
Out [1]:
Group 0
SomeValue 0
MetaData_1 0
MetaData_2 2
dtype: int64
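To go from those null counts to the expected output table (plus the SomeValue statistics mentioned in the question), one possible follow-up on the df built above:
complete = df['MetaData_2'].notna().map({True: 'Yes', False: 'No'}).rename('MetaDataComplete')
summary = (df.groupby(['Group', complete])['SomeValue']
             .agg(['count', 'min', 'max'])
             .reset_index())
print(summary)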
From a SQL perspective, performing a split is painful, not to mention that using the split results means performing the query first and then querying those results:
SELECT
Results.[Group],
Results.MetaData,
Results.MetaValue,
COUNT(Results.MetaValue)
FROM (SELECT
[Group],
MetaData,
SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
Results.MetaData,
Results.MetaValue
If you're just after a count, you could also try the algorithmic approach: just loop over the data and use a regular expression with a negative lookahead.
import pandas as pd
import re

pattern = '.*:(?!.)'  # matches strings where nothing follows the final ':'
missing = 0
not_missing = 0
for i in df['MetaData'].tolist():  # df is the dataframe holding the question's data
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
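The same counts could also be obtained without the explicit loop, via the vectorised string methods (a sketch reusing pattern and df from above):
missing = df['MetaData'].str.contains(pattern, regex=True).sum()
not_missing = len(df) - missing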

How to read a very messy .txt file using pd.read_csv() with multiple conditions

I have a very messy .txt file that I'm attempting to read in using pd.read_csv(). The file has multiple challenges to overcome:
1) The first 12 lines are not needed and therefore need to be skipped; the next 50 rows are needed, the next 14 rows need to be skipped, the next 50 rows are needed, the next 14 skipped, and so on.
2) Each normal row of data actually exists across 2 rows of this report, meaning that we need to lift the 2nd row of data up to the 1st row and place it to the right in new columns. (This would halve the total number of rows and double the number of columns of the desired dataframe.)
3) The first row of data has 8 spaces of separation between values, while the 2nd row of data has anywhere from 8 to 17 spaces of separation between values.
I thought the best way to approach this would be to first remove the rows that I don't need. I would then find a way to merge row 1 with row 2, row 3 with row 4, row 5 with row 6, and so on until all rows are correctly consolidated. I would then use the 'sep' parameter to separate the values of each row on anything that has 8 spaces or more. This would hopefully get me to my desired output. Has anyone ever had a similar challenge that they have overcome?
First picture is an image of the raw data
Second picture is my ideal output
Ok, so the error_bad_lines=False combined with sep = '\s+|\^+' worked a treat.
I then solved the problem of bad lines by removing them one by one.
I then solved the '1 row over 2 rows' problem by splitting the dataframe into two dfs (df8,df9) and recombined them on axis=1. Looks perfect now.
import pandas as pd #importing Pandas Package to wrangle data
boltcogs = 'ABAPlist.txt'
df = pd.read_csv(boltcogs,skiprows=12,error_bad_lines=False,header = None ,sep = '\s+|\^+')
df1 = df[df.iloc[:,0] != 'Production' ] ## removing verbose lines
df2 = df1[df1.iloc[:,0] != '----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------' ]
df3 = df2[df2.iloc[:,0] != 'Kuala' ] ## removing bad rows
df4 = df3[df3.iloc[:,0] != 'Operating' ] ## removing bad rows
df5 = df4[df4.iloc[:,0] != 'Plant:' ] ## removing bad rows
df6 = df5[df5.iloc[:,0] != 'Costing' ] ## removing bad rows
df7 = df6[df6.iloc[:,0] != 'Currency:' ] ## removing bad rows
df8 = df7.iloc[0::2, :].reset_index() # Selecting every second row to get second half of row
df9 = df7.iloc[1::2, :].reset_index()# Selecting remainder to to get first half of row
df10 = pd.concat([df8, df9], axis=1, ignore_index=True) # joining them together
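As a possible alternative to removing bad lines one by one, the repeating "skip 12, keep 50, skip 14, ..." pattern from the question could be handled directly with a callable skiprows; a sketch assuming those block sizes hold throughout the file:
import pandas as pd

def skip_row(i):
    if i < 12:                      # the 12 header lines at the top
        return True
    return (i - 12) % 64 >= 50      # skip the 14 separator lines after every 50 data lines

df = pd.read_csv('ABAPlist.txt', skiprows=skip_row, header=None,
                 sep=r'\s+|\^+', engine='python')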

isin pandas doesn't show all values in dataframe

I am using the Amazon database for my research, where I want to select the 100 most-rated items. So first I counted the occurrences of the item IDs (asin):
data = amazon_data_parse('data/reviews_Movies_and_TV_5.json.gz')
unique, counts = np.unique(data['asin'], return_counts=True)
test = np.asarray((unique, counts)).T
test.sort(axis=1)
which gives:
array([[5, '0005019281'],
[5, '0005119367'],
[5, '0307141985'],
...,
[1974, 'B00LG7VVPO'],
[2110, 'B00LH9ROKM'],
[2213, 'B00LT1JHLW']], dtype=object)
It is clear to see that there must be at least 6,000 rows selected. But if I run:
a= test[49952:50054,1]
a = a.tolist()
test2 = data[data.asin.isin(a)]
It only selects 2000 rows from the dataset. I have already tried multiple things, like filtering on only one asin, but it just doesn't seem to work. Can someone please help? If there is a better option to get a dataframe with the rows of the 100 most frequent values in the asin column, I would be glad too.
I found the solution; I had to change the sorting line to:
test = test[test[:,1].argsort()]
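A possibly simpler route to the 100 most-rated items, sketched on the same dataframe: value_counts() already returns the counts sorted in descending order.
top100 = data['asin'].value_counts().head(100).index
test2 = data[data['asin'].isin(top100)]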