isin in pandas doesn't select all matching rows in the dataframe

I am using the Amazon reviews dataset for my research, and I want to select the 100 most-rated items. So first I counted the occurrences of each item ID (asin):
import numpy as np
data = amazon_data_parse('data/reviews_Movies_and_TV_5.json.gz')
unique, counts = np.unique(data['asin'], return_counts=True)
test = np.asarray((unique, counts)).T
test.sort(axis=1)
which gives:
array([[5, '0005019281'],
       [5, '0005119367'],
       [5, '0307141985'],
       ...,
       [1974, 'B00LG7VVPO'],
       [2110, 'B00LH9ROKM'],
       [2213, 'B00LT1JHLW']], dtype=object)
It is clear to see that at least 6,000 rows should be selected. But if I run:
a = test[49952:50054, 1]
a = a.tolist()
test2 = data[data.asin.isin(a)]
it only selects 2000 rows from the dataset. I have already tried multiple things, like filtering on a single asin, but it just doesn't seem to work. Can someone please help? If there is a better way to get a dataframe with the rows of the 100 most frequent values in the asin column, I would be glad to hear it too.

I found the solution; I had to change the sorting line to:
test = test[test[:,1].argsort()]
(test.sort(axis=1) sorts each row independently, which scrambles the asin/count pairs, whereas argsort on the count column reorders whole rows and keeps each pair intact.)
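For reference, a more direct pandas route to "the rows of the 100 most frequent asin values" is value_counts plus isin; a minimal sketch, assuming data is the parsed DataFrame with an asin column:
# the 100 most frequent asin values
top100 = data['asin'].value_counts().head(100).index
# all review rows belonging to those 100 items
test2 = data[data['asin'].isin(top100)]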

Removing duplicates Pandas without drop_duplicates

Please note that I already looked through various posts before turning to you.
In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition.
My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows whose values are contained in that list.
import pandas as pd
# My list of duplicate IDs
list1 = ['s1', 's2']
print(len(list1))
# My dataframe
data1 = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6']})
print(len(data1))
# Remove all rows whose 'id' value is contained in list1
data2 = data1[~data1.id.isin(list1)]
print(len(data2))
Now, let's see the output (these numbers come from my real, much larger data, not the toy example above):
Len list1 = 135
Len data1 = 8942
Len data2 = 8672
So I came to the conclusion that my code is somehow doubling the number of rows to be removed and removing them.
However, when I follow the drop_duplicates approach, my code works just fine and removes the 135 rows.
Could any of you help me understand why that is happening? I tried to simplify the issue as far as possible.
Thanks a lot!
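For illustration, a minimal sketch (with hypothetical toy data) of the effect being described: isin matches every row whose id is in the list, so both the original and its duplicate are removed, which is why roughly 2 x 135 = 270 rows disappear instead of 135:
import pandas as pd

# hypothetical toy data: 's1' and 's2' each appear twice
df = pd.DataFrame({'id': ['s1', 's2', 's3', 's1', 's2']})
dups = ['s1', 's2']

# ~isin drops EVERY row whose id is in dups, originals included
print(len(df[~df['id'].isin(dups)]))   # 1  (only 's3' survives)

# drop_duplicates keeps the first occurrence of each id
print(len(df.drop_duplicates('id')))   # 3  ('s1', 's2', 's3')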
This is an extraordinarily painful way to do what you're asking; maybe someone will see this and come up with a less painful one. I specifically stayed away from groupby('id').first() as a means to remove duplicates, because you mentioned needing to first create a list of duplicates. But that would be my next-best recommendation.
Anyway, I added duplicates of s1 and s2 to your example:
df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2', 's2']})
Find IDs with more than one entry (i.e. duplicates). Here I do use groupby to get counts, keep those > 1, and send the unique values to a list:
dup_list = df[df.groupby('id')['id'].transform('count') > 1]['id'].unique().tolist()
print(dup_list)
['s1', 's2']
Then iterate over the list, finding the indices of each duplicated ID and removing all but the first:
for id in dup_list:
    # print(df[df['id'] == id].index[1:].to_list())
    drp = df[df['id'] == id].index[1:].to_list()
    df.drop(drp, inplace=True)
df
   id
0  s1
1  s2
2  s3
3  s4
4  s5
5  s6
Indices 6, 7, and 8 were dropped.
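As a loop-free variant of the same idea (a sketch; it starts from the dataframe that still contains the duplicates and still uses dup_list, to satisfy the "build the list first" requirement):
# rows to drop: id is in dup_list and this is not its first occurrence
mask = df['id'].isin(dup_list) & df.duplicated('id')
df = df[~mask]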

Sort column names in specific order

Imagine I have a pyspark dataframe whose columns are 'id' plus names of the form number_number (0_0, 0_1, 1_0, 1_1, ...).
Naturally pyspark is ordering them by 0, 1, 2, etc. However, I wanted the following: 0_0; 0_1; 1_0; 1_1; 2_0; 2_1 OR INSTEAD 0_0; 1_0; 2_0; 3_0; 4_0; (...); 0_1; 1_1; 2_1; 3_1; 4_1 (both solutions would be fine by me).
Can anyone help me with this?
You can sort the column names according to the numbers before and after the underscore:
df2 = df.select(
    'id',
    *sorted(
        df.columns[1:],
        key=lambda c: (int(c.split('_')[0]), int(c.split('_')[1]))
    )
)
To get the other desired output, just swap 0 with 1 in the code above.
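That is, for the second ordering (all the _0 columns first, then the _1 columns), the sort key would look like this; a sketch of the same select with the key tuple reversed:
df2 = df.select(
    'id',
    *sorted(
        df.columns[1:],
        # sort by the number after the underscore first, then the one before it
        key=lambda c: (int(c.split('_')[1]), int(c.split('_')[0]))
    )
)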

Advanced condition lookup in pandas (numpy)

given:
a list of elements 'ls' and a big dataframe 'df'; all the elements of 'ls' appear in 'df'.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
      ['a0','b0','c1'],
      ['a0','b0','c2'],
      ...
      ['a_i','b_j','c_k']]
goal:
I want to collect the rows of 'df' that contain the most elements of 'ls'; for example, ['a0','b0','c0'] would be the best one. But in practice a row may contain only 2 of the elements.
tried:
I tried enumerating combinations of 3 or 2 elements of 'ls', but it was too expensive, and it can come up empty since some rows contain only 2 of the elements.
I tried to use a dictionary to count, but it didn't work either.
I've been puzzling over this problem all day, any help will be greatly appreciated.
I would go like this:
counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)
row_id = counts.argmax()
This counts, for each row, how many of its values are in the list; row_id is the position of the row with the most matches.
The desired row can then be obtained with:
df.iloc[row_id, :]
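A vectorized variant of the same idea (a sketch; DataFrame.isin accepts a list and tests every cell at once):
# per-row count of cells whose value appears in ls
match_counts = df.isin(ls).sum(axis=1)
# all rows achieving the maximum count (there may be ties)
best_rows = df[match_counts == match_counts.max()]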

Insert rows in between existing rows and interpolate the values in the dataframe

Hello.
I would like to ask how to insert new rows (60 new rows) between every pair of existing rows.
What I am thinking of is shown in the picture.
I want to put new rows between every pair of existing rows and interpolate the values.
Can you guide me on how to do this?
I am using the pandas and numpy libraries.
Thank you.
You can multiply the index values by 3, then use DataFrame.reindex to add the missing rows, and finally use DataFrame.interpolate:
# solution working with the default index
df = df.reset_index(drop=True)
df = df.rename(lambda x: x * 3).reindex(np.arange(df.index.max() * 3 + 1)).interpolate()
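Applied to the question's case (60 new rows between every pair of existing rows), the factor becomes 61 rather than 3; a minimal sketch with made-up numeric data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0, 4.0]})   # toy data

factor = 61   # 60 new rows between each pair of existing rows
df = df.reset_index(drop=True)
df = (df.rename(lambda x: x * factor)
        .reindex(np.arange(df.index.max() * factor + 1))
        .interpolate())

print(len(df))   # 123 rows: 3 original + 2 * 60 interpolated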

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) for the top 10 values of a specific column, for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while also keeping the values of the other columns that correspond to the top 10 values of 'totaldemand' per year?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
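If the extra year level that apply prepends to the index is unwanted, group_keys=False keeps the rows under their original index; a sketch of the same call, here with the asker's value of 10:
df.groupby(df.index.year, group_keys=False).apply(lambda grp: grp.nlargest(10, 'totaldemand'))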
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data)
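For completeness, a self-contained sketch (with made-up hourly data and a hypothetical 'temperature' column) showing the nlargest-per-year approach while keeping all columns:
import numpy as np
import pandas as pd

# made-up hourly data spanning two years
idx = pd.date_range('2019-01-01', '2020-12-31 23:00', freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'totaldemand': rng.random(len(idx)), 'temperature': rng.random(len(idx))},
    index=idx,
)

# top 10 rows per year by 'totaldemand', all columns preserved
top10 = (df.groupby(df.index.year, group_keys=False)
           .apply(lambda grp: grp.nlargest(10, 'totaldemand')))

print(top10.shape)   # (20, 2): 10 rows for each of the two years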