isin in pandas doesn't select all matching rows in the dataframe

I am using the Amazon reviews dataset for my research, and I want to select the 100 most-rated items. So first I counted the occurrences of each item ID (asin):
import numpy as np
data = amazon_data_parse('data/reviews_Movies_and_TV_5.json.gz')
unique, counts = np.unique(data['asin'], return_counts=True)
test = np.asarray((unique, counts)).T
test.sort(axis=1)
which gives:
array([[5, '0005019281'],
       [5, '0005119367'],
       [5, '0307141985'],
       ...,
       [1974, 'B00LG7VVPO'],
       [2110, 'B00LH9ROKM'],
       [2213, 'B00LT1JHLW']], dtype=object)
It is clear to see that at least 6,000 rows should be selected. But if I run:
a = test[49952:50054, 1]
a = a.tolist()
test2 = data[data.asin.isin(a)]
it only selects 2000 rows from the dataset. I have already tried multiple things, like filtering on a single asin, but it just doesn't seem to work. Can someone please help? If there is a better way to get a dataframe with the rows of the 100 most frequent values in the asin column, I would be glad to hear it too.

I found the solution; I had to change the sorting line to:
test = test[test[:,1].argsort()]
(test.sort(axis=1) sorts each row independently, which scrambles the asin/count pairs, whereas argsort on the count column reorders whole rows and keeps each pair intact.)
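For reference, a more direct pandas route to "the rows of the 100 most frequent asin values" is value_counts plus isin; a minimal sketch, assuming data is the parsed DataFrame with an asin column:
# the 100 most frequent asin values
top100 = data['asin'].value_counts().head(100).index
# all review rows belonging to those 100 items
test2 = data[data['asin'].isin(top100)]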

Removing duplicates Pandas without drop_duplicates

Please note that I already looked through various posts before turning to you.
In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition.
My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows whose values are contained in that list.
import pandas as pd
# My list of duplicate IDs
list1 = ['s1', 's2']
print(len(list1))
# My dataframe
data1 = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6']})
print(len(data1))
# Remove all rows whose 'id' value is contained in list1
data2 = data1[~data1.id.isin(list1)]
print(len(data2))
Now, let's see the output (these numbers come from my real, much larger data, not the toy example above):
Len list1 = 135
Len data1 = 8942
Len data2 = 8672
So I came to the conclusion that my code is somehow doubling the number of rows to be removed and removing them.
However, when I follow the drop_duplicates approach, my code works just fine and removes the 135 rows.
Could any of you help me understand why that is happening? I tried to simplify the issue as far as possible.
Thanks a lot!
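For illustration, a minimal sketch (with hypothetical toy data) of the effect being described: isin matches every row whose id is in the list, so both the original and its duplicate are removed, which is why roughly 2 x 135 = 270 rows disappear instead of 135:
import pandas as pd

# hypothetical toy data: 's1' and 's2' each appear twice
df = pd.DataFrame({'id': ['s1', 's2', 's3', 's1', 's2']})
dups = ['s1', 's2']

# ~isin drops EVERY row whose id is in dups, originals included
print(len(df[~df['id'].isin(dups)]))   # 1  (only 's3' survives)

# drop_duplicates keeps the first occurrence of each id
print(len(df.drop_duplicates('id')))   # 3  ('s1', 's2', 's3')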
This is an extraordinarily painful way to do what you're asking; maybe someone will see this and come up with a less painful one. I specifically stayed away from groupby('id').first() as a means to remove duplicates, because you mentioned needing to first create a list of duplicates. But that would be my next-best recommendation.
Anyway, I added duplicates of s1 and s2 to your example:
df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2', 's2']})
Find IDs with more than one entry (i.e. duplicates). Here I do use groupby to get counts, keep those > 1, and send the unique values to a list:
dup_list = df[df.groupby('id')['id'].transform('count') > 1]['id'].unique().tolist()
print(dup_list)
['s1', 's2']
Then iterate over the list, finding the indices of each duplicated ID and removing all but the first:
for id in dup_list:
    # print(df[df['id'] == id].index[1:].to_list())
    drp = df[df['id'] == id].index[1:].to_list()
    df.drop(drp, inplace=True)
df
   id
0  s1
1  s2
2  s3
3  s4
4  s5
5  s6
Indices 6, 7, and 8 were dropped.
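As a loop-free variant of the same idea (a sketch; it starts from the dataframe that still contains the duplicates and still uses dup_list, to satisfy the "build the list first" requirement):
# rows to drop: id is in dup_list and this is not its first occurrence
mask = df['id'].isin(dup_list) & df.duplicated('id')
df = df[~mask]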

Sort column names in specific order

Imagine I have a pyspark dataframe whose columns are 'id' plus names of the form number_number (0_0, 0_1, 1_0, 1_1, ...).
Naturally pyspark is ordering them by 0, 1, 2, etc. However, I wanted the following: 0_0; 0_1; 1_0; 1_1; 2_0; 2_1 OR INSTEAD 0_0; 1_0; 2_0; 3_0; 4_0; (...); 0_1; 1_1; 2_1; 3_1; 4_1 (both solutions would be fine by me).
Can anyone help me with this?
You can sort the column names according to the numbers before and after the underscore:
df2 = df.select(
    'id',
    *sorted(
        df.columns[1:],
        key=lambda c: (int(c.split('_')[0]), int(c.split('_')[1]))
    )
)
To get the other desired output, just swap 0 with 1 in the code above.
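That is, for the second ordering (all the _0 columns first, then the _1 columns), the sort key would look like this; a sketch of the same select with the key tuple reversed:
df2 = df.select(
    'id',
    *sorted(
        df.columns[1:],
        # sort by the number after the underscore first, then the one before it
        key=lambda c: (int(c.split('_')[1]), int(c.split('_')[0]))
    )
)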

Advanced condition lookup in pandas (numpy)

given:
a list of elements 'ls' and a big dataframe 'df'; all the elements of 'ls' appear in 'df'.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
      ['a0','b0','c1'],
      ['a0','b0','c2'],
      ...
      ['a_i','b_j','c_k']]
goal:
I want to collect the rows of 'df' that contain the most elements of 'ls'; for example, ['a0','b0','c0'] would be the best one. But in practice a row may contain only 2 of the elements.
tried:
I tried enumerating combinations of 3 or 2 elements of 'ls', but it was too expensive, and it can come up empty since some rows contain only 2 of the elements.
I tried to use a dictionary to count, but it didn't work either.
I've been puzzling over this problem all day, any help will be greatly appreciated.
I would go like this:
counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)
row_id = counts.argmax()
This counts, for each row, how many of its values are in the list; row_id is the position of the row with the most matches.
The desired row can then be obtained with:
df.iloc[row_id, :]
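A vectorized variant of the same idea (a sketch; DataFrame.isin accepts a list and tests every cell at once):
# per-row count of cells whose value appears in ls
match_counts = df.isin(ls).sum(axis=1)
# all rows achieving the maximum count (there may be ties)
best_rows = df[match_counts == match_counts.max()]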

Insert rows in between existing rows and interpolate the values in the dataframe

Hello.
I would like to ask how to insert new rows (60 new rows) between every pair of existing rows.
What I am thinking of is shown in the picture.
I want to put new rows between every pair of existing rows and interpolate the values.
Can you guide me on how to do this?
I am using the pandas and numpy libraries.
Thank you.
You can multiply the index values by 3, then use DataFrame.reindex to add the missing rows, and finally use DataFrame.interpolate:
# solution working with the default index
df = df.reset_index(drop=True)
df = df.rename(lambda x: x * 3).reindex(np.arange(df.index.max() * 3 + 1)).interpolate()
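Applied to the question's case (60 new rows between every pair of existing rows), the factor becomes 61 rather than 3; a minimal sketch with made-up numeric data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0, 4.0]})   # toy data

factor = 61   # 60 new rows between each pair of existing rows
df = df.reset_index(drop=True)
df = (df.rename(lambda x: x * factor)
        .reindex(np.arange(df.index.max() * factor + 1))
        .interpolate())

print(len(df))   # 123 rows: 3 original + 2 * 60 interpolated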

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) for the top 10 values of a specific column, for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while also keeping the values of the other columns that correspond to the top 10 values of 'totaldemand' per year?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
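If the extra year level that apply prepends to the index is unwanted, group_keys=False keeps the rows under their original index; a sketch of the same call, here with the asker's value of 10:
df.groupby(df.index.year, group_keys=False).apply(lambda grp: grp.nlargest(10, 'totaldemand'))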
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data)
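For completeness, a self-contained sketch (with made-up hourly data and a hypothetical 'temperature' column) showing the nlargest-per-year approach while keeping all columns:
import numpy as np
import pandas as pd

# made-up hourly data spanning two years
idx = pd.date_range('2019-01-01', '2020-12-31 23:00', freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'totaldemand': rng.random(len(idx)), 'temperature': rng.random(len(idx))},
    index=idx,
)

# top 10 rows per year by 'totaldemand', all columns preserved
top10 = (df.groupby(df.index.year, group_keys=False)
           .apply(lambda grp: grp.nlargest(10, 'totaldemand')))

print(top10.shape)   # (20, 2): 10 rows for each of the two years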