How do you create a "count" matrix from a series? - pandas

I have a Pandas series of lists of categorical variables. For example:
df = pd.Series([["A", "A", "B"], ["A", "C"]])
Note that in my case the series is quite long (50K elements) and the number of possible distinct elements in the lists is also large (20K elements).
I would like to obtain a matrix having a column for each distinct feature and its count as value. For the previous example, this means:
[[2, 1, 0], [1, 0, 1]]
This is the same output as the one obtained with one-hot encoding, except that it contains counts instead of just 1s.
What is the best way to achieve this?

Let's try explode:
df.explode().groupby(level=0).value_counts().unstack(fill_value=0)
Output:
   A  B  C
0  2  1  0
1  1  0  1
To get the nested-list output, chain the above with .values (and .tolist() if you need actual Python lists rather than an array):
array([[2, 1, 0],
       [1, 0, 1]])
Note that you will end up with a 50K x 20K array.
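Since the dense 50K x 20K result may not fit comfortably in memory, a sparse alternative is worth considering. A minimal sketch, assuming scikit-learn is available (my addition, not part of the answer above): CountVectorizer with a pass-through analyzer treats each list as already tokenized and returns a scipy.sparse matrix.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

s = pd.Series([["A", "A", "B"], ["A", "C"]])

# The identity analyzer tells CountVectorizer the lists are pre-tokenized
vec = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vec.fit_transform(s)  # scipy.sparse CSR matrix, shape (2, 3)

print(vec.get_feature_names_out())  # ['A' 'B' 'C']
print(counts.toarray())             # [[2 1 0], [1 0 1]]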

Related

Identifying indices where any one of a plurality of columns has a certain value?

I need to determine which indices in a dataframe have any one of a set of columns having a specified value. The dataframe has several hundred columns, and a few dozen I need to use for the filtering, so it's impractical to write them all out. My strategy is as follows, to determine indices where any column having 'temp' in its name is equal to 1:
columns = [col for col in df.columns if 'temp' in col]
indices = list(np.where(df[columns]==1)[0])
However, this is returning an unexpected result - it seems to return a value for every single index in the df. Any clues where this is going wrong?
You could try this:
import pandas as pd

# Toy dataframe: two columns have "temp" in their name
# and rows 0 and 3 have a value of 1
df = pd.DataFrame(
    {"SJDRtemp": [0, 0, 0, 1], "TR": [0, 0, 2, 1], "LDtemp": [1, 3, 0, 0]}
)

# Select columns whose name contains "temp"
columns = [col for col in df.columns if "temp" in col]

# Get indices of rows where the "temp" columns have a value of 1
indices = list(df[df[columns] == 1].dropna(how="all").index)
print(indices)
# Output: [0, 3]
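A slightly more direct variant (a sketch, my addition, not from the answer above) builds a row-wise boolean mask with any(axis=1); it assumes the df and columns defined above:
# True for rows where any "temp" column equals 1
mask = (df[columns] == 1).any(axis=1)
indices = list(df.index[mask])
print(indices)  # [0, 3]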

Numpy, how to retrieve sub-array of array (specific indices)?

I have an array:
>>> arr1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
I want to retrieve a list (or 1d-array) of elements of this array by giving a list of their indices, like so:
indices = [[0,0], [0,2], [2,0]]
print(arr1[indices])
# result
[1,6,7]
But it does not work. I have been looking for a solution for a while, but I only found ways to select per row and/or per column, not by specific index pairs.
Does anyone have an idea?
Cheers
Aymeric
First make indices an array instead of a nested list:
indices = np.array([[0,0], [0,2], [2,0]])
Then, index the first dimension of arr1 using the first column of indices, and likewise the second dimension using the second column:
arr1[indices[:,0], indices[:,1]]
It gives array([1, 3, 7]) (which is correct, your [1, 6, 7] example output is probably a typo).
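For completeness, a self-contained run of this approach (my addition); the tuple(indices.T) form at the end is an equivalent shorthand:
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([[0, 0], [0, 2], [2, 0]])

# Fancy indexing with separate row and column index arrays
print(arr1[indices[:, 0], indices[:, 1]])  # [1 3 7]

# Equivalent: transpose the pairs and convert to a tuple
print(arr1[tuple(indices.T)])              # [1 3 7]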

How can I select the rows which contain some specific value in a dataframe using python?

I am quite new to python and coding, so sorry in advance if I may not be so clear.
I have a dataframe where the rows correspond to IDs (f.ied) and the columns to several values (ICD10 codes). I want to select the rows which contain specific ICD10 codes.
However, I could not find the right way to do so. I tried with loc and set, but no luck. Any help, please?
The dataframe is like this: each row corresponds to an f.ied (ID). I want to know which f.ied have the specific codes I20, I21, I22, I23, I24, I25.
df = pd.DataFrame({'feid': [2, 4, 8, 0],
                   'f42002': [2, 0, 0, 0],
                   'f42003': [10, 'I21', 1, 'J10']})
df = df.set_index('feid')
df
DataFrame:
      f42002 f42003
feid
2          2     10
4          0    I21
8          0      1
0          0    J10
Desired items:
mylist = ['I21', 'J10']
for i in mylist:
    print(df[(df['f42002']==i) | (df['f42003']==i)].index.values)
Result:
[4]
[0]
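With hundreds of columns, writing each one out gets unwieldy. A sketch of a more general approach (my addition, not from the answer above), using isin() over the whole frame with any(axis=1); codes stands in for the ICD10 list from the question, and df is the toy frame defined above:
codes = ['I20', 'I21', 'I22', 'I23', 'I24', 'I25']

# True for rows where any column holds one of the codes
mask = df.isin(codes).any(axis=1)
print(df.index[mask].values)  # [4]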

How to compute how many elements in three arrays in Python are equal to some value in the same position across the arrays?

I have three numpy arrays
a = [0, 1, 2, 3, 4]
b = [5, 1, 7, 3, 9]
c = [10, 1, 3, 3, 1]
and I want to compute how many elements in a, b, c are equal to 3 in the same position; for this example the answer would be 3.
An elegant solution is to use NumPy functions:
np.count_nonzero(np.vstack([a, b, c]) == 3, axis=0).max()
Details:
np.vstack([a, b, c]) - generates an array with 3 rows, composed of your 3 source arrays.
np.count_nonzero(... == 3, axis=0) - counts how many values of 3 occur in each column. For your data the result is array([0, 0, 1, 3, 0], dtype=int64).
max() - takes the greatest value, in your case 3.
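A runnable version of the above on the question's data (my addition):
import numpy as np

a = np.array([0, 1, 2, 3, 4])
b = np.array([5, 1, 7, 3, 9])
c = np.array([10, 1, 3, 3, 1])

stacked = np.vstack([a, b, c])                       # shape (3, 5)
per_column = np.count_nonzero(stacked == 3, axis=0)  # [0 0 1 3 0]
print(per_column.max())                              # 3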

Numpy remove rows with same column values

How do I remove rows from an ndarray which have the same nth column value?
For example,
a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])
And I want the rows to be unique by the third column.
I want to have just the [1, 3, 5] row left.
numpy.unique does not do it: it checks for uniqueness across every column, and I can't specify the column by which to check uniqueness.
How can I do this efficiently for a thousand-plus rows?
Thank you.
You could try a combination of bincount, nonzero and in1d:
import numpy as np

a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])

# A tuple containing the values which occur exactly once in column 3
unique_in_column = (np.bincount(a[:, 2]) == 1).nonzero()

# Boolean mask of rows whose third-column value is unique
unique_index = np.in1d(a[:, 2], unique_in_column[0])
unique_a = a[unique_index]
This should do the trick, although note that np.bincount only works on non-negative integers. I'm also not sure how this method scales with 1000+ rows.
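A sketch of an alternative (my addition, not part of the original answer) using np.unique with return_counts=True; it keeps only rows whose third-column value occurs exactly once and, unlike bincount, also works for non-integer values:
import numpy as np

a = np.array([[1, 3, 4],
              [1, 3, 4],
              [1, 3, 5]])

# Third-column values and how often each occurs
values, counts = np.unique(a[:, 2], return_counts=True)
singletons = values[counts == 1]

# Keep only rows whose third-column value occurs exactly once
unique_a = a[np.isin(a[:, 2], singletons)]
print(unique_a)  # [[1 3 5]]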
I had done this finally:
repeatdict = {}
todel = []
for i, row in enumerate(kplist):
    if repeatdict.get(row[2], 0):
        todel.append(i)
    else:
        repeatdict[row[2]] = 1
kplist = np.delete(kplist, todel, axis=0)
Basically, I iterated over the list, storing the values of the third column in repeatdict; if in a later iteration the same value is already found there, that row is marked for deletion by storing its index in the todel list.
Then we can get rid of the unwanted rows by calling np.delete with the list of all the row indexes we want to delete.
Also, I'm not picking my answer as the accepted answer, because I know there's probably a better way to do this with just numpy magic.
I'll wait.
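For reference, a vectorized sketch (my addition) that matches the loop above by keeping the first row for each distinct third-column value:
import numpy as np

kplist = np.array([[1, 3, 4],
                   [1, 3, 4],
                   [1, 3, 5]])

# Index of the first occurrence of each distinct third-column value;
# sorting preserves the original row order.
_, first_idx = np.unique(kplist[:, 2], return_index=True)
kplist = kplist[np.sort(first_idx)]
print(kplist)  # [[1 3 4], [1 3 5]]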