MATLAB: intersect(A,B) returns the data with no repetitions

I was using "intersect" in my MATLAB code where I want the following:
A = [4 1 1 2 3];
[B] = sort(A, 'ascend'); % B is A sorted in ascending order, so B = [1 1 2 3 4]
[same, a] = intersect(B, A);
I want same = [1 1 2 3 4], but intersect gives me same = [1 2 3 4], omitting the repeated '1'.
I understand that intersect returns data with no repetitions; the documentation says:
C = intersect(A,B) returns the data common to both A and B with no repetitions.
I want it to show the complete data, including the repetitions. What alternatives can I use instead of the function "intersect"?
For example, with the same A and B as above:
[same, a] = intersect(B, A);
I want same = [1 1 2 3 4] and a = [2 3 4 5 1].
I need to access 'a', where 'a' gives the original indices prior to sorting, so I can use it for further processing.
Thank you very much.

Why do you need the intersect of A and B, given that B contains the same values as A?
From what you said, I think you have all the needed results in B.
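A minimal MATLAB sketch of that point (my addition, using only the values quoted in the question): the second output of sort already gives the original indices the question asks for, so intersect is not needed at all.
A = [4 1 1 2 3];
[B, a] = sort(A, 'ascend');  % B = [1 1 2 3 4], a = [2 3 4 5 1], and B == A(a)
% 'a' holds the positions of the sorted values in the original A, repetitions included.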

Related

String to Columns

I have a string column in my df.
col
a: 1, b: 2, c: 3
b: 1, c: 3, a: 4
c: 2, b: 4, a: 3
I wish to convert this into multiple columns as:
a b c
1 2 3
4 1 3
3 4 2
Need help regarding this.
I am trying to convert this into a dict and then sort the dict. After that, I may build a pivot table. I am not sure whether that will work, but any help or a better method would be appreciated.
Use a nested list comprehension with a double split (by ', ' and ': ') to build a list of dictionaries, then pass it to the DataFrame constructor:
df = pd.DataFrame([dict(y.split(': ') for y in x.split(', ')) for x in df['col']],
                  index=df.index)
print (df)
a b c
0 1 2 3
1 4 1 3
2 3 4 2
You can use str.extractall and unstack:
(df['col'].str.extractall(r'(\w+):\s*([^,]+)')
   .set_index(0, append=True).droplevel('match')[1]
   .unstack(0)
)
Output:
a b c
0 1 2 3
1 4 1 3
2 3 4 2
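One small caveat for both answers above (my note, not part of either answer): the split/extract produces string values ('1', '2', ...). If numeric columns are needed, a cast such as the following can be applied to the resulting frame:
# convert the parsed string values to integers (optional extra step)
df = df.astype(int)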

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates which rows to keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort the values and then use groupby with pd.DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column (see the sketch after this answer)
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
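To illustrate the first bullet above, here is a hedged sketch (my addition, using the same df as in the question) of selecting the two smallest-B rows per group simply by changing n:
# two rows with the smallest B per A-group, instead of one
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=2).reset_index(drop=True)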
I found an answer that is a little more wordy, but a lot more efficient.
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group's minimum value:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
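One difference worth noting (my addition, shown with a small hypothetical frame): unlike idxmin, which keeps a single row per group, the transform-based mask keeps every row that ties for the minimum.
# hypothetical data with a tied minimum in group A == 1
tied = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 2, 4], 'C': [10, 3, 4]})
print(tied[tied['B'] == tied.groupby('A')['B'].transform('min')])
#    A  B   C
# 0  1  2  10
# 1  1  2   3
# 2  2  4   4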

Check if list cell contains value

Having a dataframe like this:
month transactions_ids
0 1 [0, 5, 1]
1 2 [7, 4]
2 3 [8, 10, 9, 11]
3 6 [2]
4 9 [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done with a loop, checking month by month whether the related transactions_ids contain the given transaction_id, but I'm wondering whether there is a more efficient way.
Cheers
The best way in my opinion is to explode your data frame and avoid having Python lists in your cells.
df = df.explode('transactions_ids')
which outputs
month transactions_ids
0 1 0
0 1 5
0 1 1
1 2 7
1 2 4
2 3 8
2 3 10
2 3 9
2 3 11
3 6 2
4 9 3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S.: be aware of the duplicated indexes that explode outputs. In general, it is better to do explode(...).reset_index(drop=True) to avoid unwanted behavior in most cases.
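A minimal sketch of the pattern the P.S. recommends (my addition, assuming the df from the question; the .iloc[0] at the end is just to pull the single month out as a scalar):
exploded = df.explode('transactions_ids').reset_index(drop=True)
month = exploded.loc[exploded['transactions_ids'] == 4, 'month'].iloc[0]  # -> 2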
You can use pandas string methods to find the id in the "list" (when read in from text, e.g. via StringIO as below, it is really just a string as far as pandas is concerned):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains('4'), 'month']
In case your transactions_ids are real lists, then you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)
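Note that the map call above only builds a boolean mask; to recover the month itself (again assuming the column holds real lists), index with it, e.g.:
df.loc[df['transactions_ids'].map(lambda x: 3 in x), 'month']  # -> 9 for id 3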

Finding values in postgres array (some must be in, some must not be in at the same query)

I have an ids integer[] column.
I want to find rows that contain 1 but do not contain 2, 3, or 4.
Arrays like [1], [1, 5], or [1, 6, 7] are OK; arrays containing 2, 3, or 4 are not.
So I tried this:
SELECT *
FROM table_test
WHERE 1 = ANY(ids) AND 2 <> ANY(ids) AND 3 <> ANY(ids) AND 4 <> ANY(ids)
but it returns every row matching the 1 = ANY(ids) part:
[1 2 3]
[1 3 4]
[1]
[1 5]
[1 6 7]
I want this data
[1]
[1 5]
[1 6 7]
How can I solve this problem?
Thanks a lot!
You should use ALL together with <>.
The expression 2 <> ANY(ids) is true if at least one element is not equal to 2 - which is always the case because you require at least one element to be 1 (which is not 2) with the first condition.
SELECT *
FROM table_test
WHERE 1 = ANY(ids)
AND 2 <> ALL(ids)
AND 3 <> ALL(ids)
AND 4 <> ALL(ids)
Another option is to use the overlap operator && ("have elements in common") and negate it:
SELECT *
FROM table_test
WHERE 1 = ANY(ids)
AND NOT ids && array[2,3,4]
Your query is very close, but what it actually does is:
check if any array element is 1 (this is OK)
check if any array element is not 2, 3, or 4 (this means [1,3,4] is considered valid because 1 is not 2, 3, or 4, so the condition is fulfilled)
What you really have to check in case #2 is that ALL elements are not 2, 3, or 4.
Your updated query is now:
SELECT * FROM table_test WHERE 1 = ANY(ids) AND 2 <> ALL(ids) AND 3 <> ALL(ids) AND 4 <> ALL(ids);

Generate list of values summing to 1 - within groupby?

In the spirit of "Generating a list of random numbers, summing to 1" from several years ago, is there a way to apply the NumPy array returned by np.random.dirichlet to a groupby on the dataframe?
For example, I can loop through the unique values of the letter column and apply one at a time:
df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
but there's got to be a better way than iterating the unique values and filtering the dataframe for each. This is small but I'll have potentially tens of thousands of groupings of varying sizes of ~50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe, and finally merging the results, though that seems even more convoluted. I have not found a solution where I can apply an array sized to each group directly to the groupby, but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC, do a transform():
def dirichlet(x, size=1):
    return np.array(np.random.dirichlet(np.ones(len(x)), size=size)[0])
df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
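A quick sanity check (my addition): the generated proportions should sum to 1 within each letter group, up to floating-point rounding.
print(df.groupby('letter')['prop_of_grp'].sum())
# letter
# a    1.0
# b    1.0
# Name: prop_of_grp, dtype: float64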