Finding values in a Postgres array (some must be in, some must not be, in the same query) - sql

I have an ids integer[] column.
I want to find rows whose array contains 1 but does not contain 2, 3, or 4,
so [1], [1, 5], or [1, 6, 7] is OK; [2,3,4] is not.
So I tried this way
SELECT *
FROM table_test
WHERE 1 = ANY(ids) AND 2 <> ANY(ids) AND 3 <> ANY(ids) AND 4 <> ANY(ids)
but it returns every row matching the 1 = ANY(ids) part:
[1 2 3]
[1 3 4]
[1]
[1 5]
[1 6 7]
I want this data
[1]
[1 5]
[1 6 7]
How can I solve this problem?
Thanks a lot!

You should use ALL together with <>.
The expression 2 <> ANY(ids) is true if at least one element is not equal to 2 - which is always the case because you require at least one element to be 1 (which is not 2) with the first condition.
SELECT *
FROM table_test
WHERE 1 = ANY(ids)
AND 2 <> ALL(ids)
AND 3 <> ALL(ids)
AND 4 <> ALL(ids)
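To make the ANY/ALL difference concrete, here is the same logic restated in plain Python (not part of the original answer, just an illustration):
ids = [1, 3, 4]
# "2 <> ANY(ids)": true if at least one element differs from 2 - almost always true.
print(any(x != 2 for x in ids))   # True, so [1, 3, 4] would wrongly pass
# "2 <> ALL(ids)": true only if every element differs from 2 - what we actually want.
print(all(x != 2 for x in ids))   # False, so [1, 3, 4] is correctly rejected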
Another option is to use the overlaps operator && ("have elements in common") and negate it:
SELECT *
FROM table_test
WHERE 1 = ANY(ids)
AND NOT ids && array[2,3,4]

Your query is very close, but what it actually does is:
check if any array element equals 1 (this is ok)
check if any array element is not 2, 3, or 4 (this means [1,3,4] passes, because 1 is not 2, 3, or 4, so the condition is fulfilled)
What you really have to check in case #2 is that ALL elements are not 2, 3, or 4.
Your updated query is now:
SELECT * FROM table_test WHERE 1 = ANY(ids) AND 2 <> ALL(ids) AND 3 <> ALL(ids) AND 4 <> ALL(ids);

ValueError, couldn't convert string object to float in this dataset [duplicate]

I have a dataframe with a column that has lists of numbers as strings:
C1 C2 l
1 3 ['5','9','1']
7 1 ['7','1','6']
What is the best way to convert it to lists of ints?
C1 C2 l
1 3 [5,9,1]
7 1 [7,1,6]
Thanks
You can try
df['l'] = df['l'].apply(lambda lst: list(map(int, lst)))
print(df)
C1 C2 l
0 1 3 [5, 9, 1]
1 7 1 [7, 1, 6]
Pandas DataFrames are not designed to work with nested structures such as lists, so there is no vectorized method for this task.
You need to loop. The most efficient way is a list comprehension (apply would also work, but much less efficiently).
df['l'] = [[int(x) for x in l] for l in df['l']]
NB. There is no check. If you have anything that cannot be converted to integers, this will trigger an error!
Output:
C1 C2 l
0 1 3 [5, 9, 1]
1 7 1 [7, 1, 6]
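If the lists may contain values that are not valid integers, a defensive variant could look like the sketch below (the to_int_safe helper and the bad 'x' value are made up for illustration; invalid entries become None instead of raising):
import pandas as pd

df = pd.DataFrame({'C1': [1, 7], 'C2': [3, 1],
                   'l': [['5', '9', '1'], ['7', '1', 'x']]})

def to_int_safe(lst):
    # Convert each element to int, keeping None for values that cannot be converted.
    out = []
    for x in lst:
        try:
            out.append(int(x))
        except (TypeError, ValueError):
            out.append(None)
    return out

df['l'] = [to_int_safe(l) for l in df['l']]
print(df)
#    C1  C2             l
# 0   1   3     [5, 9, 1]
# 1   7   1  [7, 1, None]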

Python: obtaining the first observation according to its date [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
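For reference, here is that snippet run end-to-end on the question's data (assuming import pandas as pd; newer pandas versions may warn about the grouping column being included in apply, but the result is the same):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

result = (df.sort_values('B')
            .groupby('A')
            .apply(pd.DataFrame.head, n=1)
            .reset_index(drop=True))
print(result)
#    A  B   C
# 0  1  2  10
# 1  2  4   4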
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written above:
df.loc[df.groupby('A')['B'].idxmin()]
If you then get an error like
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably because there are NaN values in column B. In my case, dropping them with dropna() made it work:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals the per-group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

Pandas Dataframe: split column into multiple columns

I need to break up a column in a DataFrame that currently collects multiple values (someone else's Excel sheet, unfortunately) for a categorical data field that can take multiple values.
As you can see below, the column has 15 category codes listed in the column header.
[Screenshot: original DataFrame]
I want to split the column based on the category codes seen in the column header ['Pamphlet'], and then transform the values collected for each record in the original column so they map to their respective new columns as (1) for checked and (0) for unchecked, instead of the raw values [1,2,4,5].
This is the code to split on the , between values, but I still need to put the results into the new columns I have to set up by splitting the ['Pamphlet'] column header by its values [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].
df_old['Pamphlets'].str.split(pat=',', n=-1, expand=True)
[Screenshot: shape of the desired DataFrame]
If I could just get an outline of what the best approach is, and whether it is even possible to do this within Pandas. Thanks.
You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First let's define the dataframe. (It would be much appreciated if, in future questions, you supplied a reproducible dataframe and any other data.)
import pandas as pd

data = {
"1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
"1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now, let's build it manually for the first column; afterwards we'll turn this part into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0 2
1 1
2 3
3 2
4 3
5 1
6 3
7 2
8 3
9 1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
1) Mail 2) Email 3) At PAC/TPAC
0 2 2 2
1 1 1 1
2 3 3 3
3 2 2 2
4 3 3 3
5 1 1 1
6 3 3 3
7 2 2 2
8 3 3 3
9 1 1 1
Create a copy to make changes to the values.
df_res = df.copy()
Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
value = pd.to_numeric(col[0])
df_res.loc[df[col] == value, col] = 1
df_res.loc[df[col] != value, col] = 0
print(df_res)
1) Mail 2) Email 3) At PAC/TPAC
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 1 0
8 0 0 1
9 1 0 0
Now we have split a column into its components and assigned a bool value.
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
# Split the header into individual pieces.
col = s_col.name.split("\n")
# Clean up any leading or trailing white space.
col = [x.strip() for x in col]
# Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
# Create a copy to make changes to the values.
df_res = df.copy()
# Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
value = pd.to_numeric(col[0])
df_res.loc[df[col] == value, col] = 1
df_res.loc[df[col] != value, col] = 0
return df_res
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
# Call the function to get the split columns in a new dataframe.
df_split = split_column(df_full[c])
# Join it with the original full dataframe but drop the current column.
df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
1) Mail 2) Email 3) At PAC/TPAC 1) ACC 2) IM 3) PT 4) Smoking, 5) Cessation
0 0 1 0 0 0 0 0 1
1 1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0
4 0 0 1 0 1 0 0 0
5 1 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
7 0 1 0 0 0 0 1 0
8 0 0 1 0 0 1 0 0
9 1 0 0 0 1 0 0 0
Here is the full code...
import pandas as pd

data = {
"1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
"1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
def split_column(s_col):
# Split the header into individual pieces.
col = s_col.name.split("\n")
# Clean up any leading or trailing white space.
col = [x.strip() for x in col]
# Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
# Create a copy to make changes to the values.
df_res = df.copy()
# Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
value = pd.to_numeric(col[0])
df_res.loc[df[col] == value, col] = 1
df_res.loc[df[col] != value, col] = 0
return df_res
for c in df_full.columns:
# Call the function to get the split columns in a new dataframe.
df_split = split_column(df_full[c])
# Join it with the original full dataframe but drop the current column.
df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
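If the raw cells really contain comma-separated codes (as the str.split attempt in the question suggests), a shorter option might be Series.str.get_dummies. This is only a sketch with made-up data, and mapping the resulting code-named columns back to the header labels would still need the header-splitting logic above:
import pandas as pd

# Hypothetical column whose cells hold comma-separated category codes.
s = pd.Series(["1,2", "4", "1,5", "2,4,5"], name="Pamphlets")

# One indicator column per code: 1 if the code appears in the cell, 0 otherwise.
dummies = s.str.get_dummies(sep=",")
print(dummies)
#    1  2  4  5
# 0  1  1  0  0
# 1  0  0  1  0
# 2  1  0  0  1
# 3  0  1  1  1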

Numpy indexing in 3 dimensions

In [93]: a = np.arange(24).reshape(2, 3, 4)
In [94]: a[0, 1, ::2]
Out[94]: array([4, 6])
Can someone explain what '::2' means here?
Thanks!
::2 means: in this dimension, take all the "layers" with an even index (starting from 0, counting by 2).
It means: get the elements at a[0, 1, 0] and a[0, 1, 2] and put them into the same array.
Each index position (you have 3 in this example) can be indexed and "sliced". Perhaps you have seen slices like [this:slice] before in normal arrays. Well, slices can also take a third value, which is the "step".
So [a:b:c] means [startPosition:endPosition:step], where endPosition is not included.
So ::2 means start=0, end=the end of that dimension, step=2.
You have at most 4 elements in that dimension (see your reshape line), so the indices it takes are 0 and 2 (1 and 3 are skipped; 3 is the last index).
0 0 0 => 0
0 0 1 => 1
0 0 2 => 2
0 0 3 => 3
0 1 0 => 4 -> (0, 1, 0) is selected by the slice
0 1 1 => 5
0 1 2 => 6 -> (0, 1, 2) is selected by the slice
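A small sketch (not part of the original answer) that makes the equivalence explicit:
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# a[0, 1] is the 1-D row [4, 5, 6, 7]; ::2 takes every second element of it.
print(a[0, 1, ::2])              # [4 6]
print(a[0, 1, slice(0, 4, 2)])   # same thing: start=0, stop=4, step=2
print(a[0, 1, [0, 2]])           # equivalent fancy indexing with explicit positions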

intersect(A,B) returns the data with no repetitions

I was using "intersect" in my Matlab code where I want the following:
A = [ 4 1 1 2 3];
[B] = sort(A, 'ascend'); % so that B is sorting A in ascending order, so I got B = [1 1 2 3 4]
[same,a] = intersect(B,A);
I want same = [1 1 2 3 4] but the simulation gives me same = [1 2 3 4] by omitting the repeated '1'.
I understand that by using intersect it will return the data with no repetitions:
C = intersect(A,B) returns the data common to both A and B with no repetitions.
I want it to show the complete data including those repetitions. What alternatives can I use rather than the function "intersect"?
For example:
A = [ 4 1 1 2 3];
[B] = sort(A, 'ascend'); % so that B is sorting A in ascending order, so I got B = [1 1 2 3 4]
[same,a] = intersect(B,A);
So now I want it to be like this same =[1 1 2 3 4] and a=[2 3 4 5 1].
I need to access ‘a’ where ‘a’ shows the original index prior to sorting so I can use it for further processing.
Thank you very much.
Why do you need the intersect of A and B, given that B contains the same values as A?
From what you said, I think you already have everything you need: sort itself can return the original indices as a second output, [B, a] = sort(A, 'ascend'), which gives exactly the a you are after (here a = [2 3 4 5 1]).