Extract rows from numpy and add rows to specific indexes - numpy

I have a really big 3D array (1103546x2504x3). These are genotype data imported from a VCF file. First I want to filter it against my own data; after that, I would like to extract the needed rows and add the missing ones, keeping everything sorted. Right now my code is:
chr_pos holds the positions from my "reference" file, pos holds the positions of the rows in the big array, and needed_index is the list of row indices I need.
needed_index = []
for i in range(len(chr_pos)):
    for k in range(len(pos)):
        if chr_pos[i] == pos[k]:
            needed_index.append(k)
After the extraction I check whether any row from the reference is missing:
list_difference = [item for item in chr_pos if item not in needed_pos]
needed_pos was made with the same code, but with .append(pos[k]).
My questions would be:
How do I extract specific array rows according to the list needed_index or needed_pos?
How do I add the missing items to the array? They would have the shape indexes x 2504 x [0,0], where indexes comes from list_difference, 2504 is the number of columns (samples), and [0,0] is the value for every position I want to add.
Edit 1: So basically I want to find the rows I need in the array (based on a reference file), and if some of the positions are not in the main array, insert them at the right position with 2504 columns and [0,0] as the value along the third dimension.
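A minimal sketch of one way to do both steps with NumPy, assuming the big array is called geno, that chr_pos is sorted, and that the last axis has length 2 so the fill value [0,0] fits (the toy shapes below stand in for the real 1103546x2504 data):
import numpy as np

# toy stand-ins for the real data
pos = np.array([100, 250, 400, 700, 900])       # positions of the rows in the big array
chr_pos = np.array([100, 250, 300, 400, 900])   # positions from the reference file (sorted)
geno = np.random.randint(0, 3, size=(len(pos), 4, 2))  # 4 samples instead of 2504

# rows of geno whose position appears in the reference
keep = np.isin(pos, chr_pos)
extracted = geno[keep]
extracted_pos = pos[keep]

# output with one row per reference position, pre-filled with [0, 0]
out = np.zeros((len(chr_pos), geno.shape[1], geno.shape[2]), dtype=geno.dtype)

# place every extracted row at the index of its position within chr_pos
out[np.searchsorted(chr_pos, extracted_pos)] = extracted

# reference positions that had to be filled with [0, 0]
missing = np.setdiff1d(chr_pos, pos)
np.isin replaces the double loop, and np.searchsorted only works here because chr_pos is sorted; otherwise build a position-to-index dictionary first.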


Error when filtering pandas dataframe by column value

I am having a problem with filtering a pandas dataframe. I am trying to filter a dataframe based on column values being equal to a specific list but I am getting a length error.
I tried every possible way of filtering a dataframe but got nowhere. Any help would be appreciated, thanks in advance.
Here is my code :
for ind in df_hourly.index:
    timeslot = df_hourly['date_parsed'][ind][0:4]  # list value to filter by
    filtered_df = df.loc[df['timeslot'] == timeslot]
Error : ValueError: ('Lengths must match to compare', (5696,), (4,))
Above image: df; below image: df_hourly.
In the above image, the dataframe I want to filter is shown. Specifically, I want to filter according to the "timeslot" column.
And the below image shows the dataframe which includes the value I want to filter by, specifically the "date_parsed" column. In the first line of my code, I iterate through every row in this dataframe and assign the first 4 elements of the list value in df_hourly["date_parsed"] to a variable, and later in the code I try to filter the above dataframe by that variable.
When comparing columns using ==, pandas tries to compare value by value - i.e. does the first item equal the first item, the second the second, and so on. This is why you receive this error - pandas expects two columns of the same shape.
If you want to check whether a value is inside a list, you can use .isin (see the documentation):
df.loc[df['timeslot'].isin(timeslot)]
Depending on what exactly timeslot is, you might need to use timeslot.values or something like that (hard to say exactly without an example of your dataframe).
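As a rough illustration of the difference, here is a small self-contained sketch with made-up data (the column names follow the question, the values are invented):
import pandas as pd

# made-up stand-ins for df and df_hourly['date_parsed'][ind][0:4]
df = pd.DataFrame({"timeslot": ["00:00", "01:00", "02:00", "03:00", "04:00"],
                   "value": [10, 20, 30, 40, 50]})
timeslot = ["00:00", "02:00", "04:00", "06:00"]   # a 4-element list

# df["timeslot"] == timeslot would compare element by element and raise
# "Lengths must match to compare" because 5 rows != 4 list items
filtered_df = df.loc[df["timeslot"].isin(timeslot)]   # membership test instead
print(filtered_df)   # keeps the rows with 00:00, 02:00 and 04:00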

Encoding feature array column from training set and applying to test set later

I have input columns that contain arrays of features. A feature is listed if present and absent if not; order is not guaranteed, e.g.:
features = pd.DataFrame({"cat_features":[['cuddly','short haired'],['short haired','bitey'],['short haired','orange','fat']]})
This works:
feature_table = pd.get_dummies(features['cat_features'].explode()).add_prefix("cat_features_").groupby(level=0).sum()
Problem:
It's not trivial to ensure the same output columns on my test set when features are missing. My real dataset has multiple such array columns, but I can't explode them all at once because of ValueError: columns must have matching element counts, which forces looping over each array column.
One option: make a dtype and save it for later ("skinny" is added as an example of something not in our input set):
from pandas.api.types import CategoricalDtype
cat_feature_type = CategoricalDtype([x.replace("cat_features_","") for x in feature_table.columns.to_list()]+ ["skinny"])
pd.get_dummies(features["cat_features"].explode().astype(cat_feature_type)).add_prefix("cat_features_").groupby(level=0).sum()
Is there a smarter way of doing this?
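One way to make the saved dtype pay off is to cast the test column to it before get_dummies, so the output always has exactly the training columns; a hedged sketch, where test_features is a hypothetical hold-out set containing a category ('grumpy') never seen in training:
import pandas as pd
from pandas.api.types import CategoricalDtype

# training step, as in the question
features = pd.DataFrame({"cat_features": [['cuddly', 'short haired'],
                                          ['short haired', 'bitey'],
                                          ['short haired', 'orange', 'fat']]})
feature_table = (pd.get_dummies(features['cat_features'].explode())
                   .add_prefix("cat_features_").groupby(level=0).sum())
cat_feature_type = CategoricalDtype(
    [c.replace("cat_features_", "") for c in feature_table.columns] + ["skinny"])

# hypothetical test set: 'grumpy' was never seen in training
test_features = pd.DataFrame({"cat_features": [['bitey'], ['grumpy', 'skinny']]})
test_table = (pd.get_dummies(test_features['cat_features'].explode().astype(cat_feature_type))
                .add_prefix("cat_features_").groupby(level=0).sum())
# test_table has exactly the training columns (plus 'skinny'), in the same order;
# the unseen 'grumpy' is silently dropped because it maps to NaN under the dtype
If scikit-learn is already in the pipeline, sklearn.preprocessing.MultiLabelBinarizer fit on the training lists and reused on the test lists achieves much the same thing without the manual dtype bookkeeping.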

is there a way to subset an AnnData object after reading it in?

I read in the Excel file like so:
data = sc.read_excel('/Users/user/Desktop/CSVB.xlsx',sheet= 'Sheet1', dtype= object)
There are 3 columns in this data set that I need to work with as .obs but it looks like everything is in the .X data matrix.
Anyone successfully subset after reading in the file or is there something I need to do beforehand?
Okay, so assuming sc stands for the scanpy package, read_excel just takes the first row as .var and the first column as .obs of the AnnData object.
The data returned by read_excel can be tweaked a bit to get what you want.
Let's say the indices of the three columns you want in .obs are stored in the idx variable.
idx = [1,2,4]
Now, .obs is just a pandas DataFrame and data.X is just a NumPy array, so the job is simple.
# assign some names to the new columns
new_col_names = ['C1', 'C2', 'C3']
# add the columns to data.obs
data.obs[new_col_names] = data.X[:,idx]
If you wish to remove the idx columns from data.X, I suggest making a new AnnData object for this.
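A minimal sketch of that last suggestion, continuing from the data and idx variables above and assuming the anndata package is importable as ad (the name new_data is illustrative):
import numpy as np
import anndata as ad

# keep every column of data.X except the three that were moved into .obs
keep = np.setdiff1d(np.arange(data.X.shape[1]), idx)

new_data = ad.AnnData(
    X=data.X[:, keep],
    obs=data.obs.copy(),             # includes the new 'C1', 'C2', 'C3' columns
    var=data.var.iloc[keep].copy(),  # variable annotations for the kept columns
)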

pandas: concatenate part of the index into a row according to a condition

I have a table with invoices as columns and, as rows, the kind of repair and also the power of the engine.
The power of the engine always contains the symbol '/', so I can pick out those rows by filtering the index.
What I want is a new row containing, for every invoice, a list of the different powers.
For instance, for 'inv123' the new row should contain ['400/HP', '500/kwh'].
So far I have the following code:
from itertools import compress

boolean_filter = DF.index.str.contains('/') & DF['inv123']
indexlist = list(DF.index)
mylist = list(compress(indexlist, boolean_filter))
# or, generated in a one-liner
mylist = list(compress(DF.index, DF.index.str.contains('/') & DF['inv123']))
print(mylist)
Result
['400/HP', '500/kwh']
This is the value I have to add at row='concatenate', column='inv123'.
I encounter a number of problems:
a) I am not able to do that in a pythonic way (without loops).
b) When adding an empty row with
DF.append(pd.Series(name='concatenate'))
the dtype of the 0s and 1s (integers) changes to float, which makes the code not reusable (it is no longer boolean).
Some idea how to approach the problem?
But still I would have to loop over every column
I came up with this solution:
from itertools import compress
lc=[list(compress(DF.index,DF.index.str.contains('/') & DF.iloc[:,i])) for i in range(len(DF.columns))]
The first step is to compress the list of the index with the boolean mask of every column (DF.iloc[:,i]).
As a result I obtain a list in which every element is a list of the wanted values.
The solution is not at all elegant.
It took me a few hours.
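For what it's worth, the per-column work can also be pushed into a single apply call; a small sketch with invented data (the row and column labels only mimic the ones in the question):
import pandas as pd

# invented stand-in for DF: rows are repair kinds / engine powers, columns are invoices
DF = pd.DataFrame({"inv123": [1, 1, 0, 1],
                   "inv456": [0, 1, 1, 0]},
                  index=["400/HP", "500/kwh", "oil change", "paint"])

powers = DF.loc[DF.index.str.contains('/')]    # rows describing an engine power

# one pass over all invoice columns: list the power labels marked with 1
concat_row = powers.apply(lambda col: list(col[col != 0].index))
print(concat_row["inv123"])                    # ['400/HP', '500/kwh']

# keep it as a separate one-row frame so the 0/1 columns of DF stay integer
concatenate = concat_row.to_frame("concatenate").T
Keeping the concatenated row in its own frame also sidesteps the int-to-float problem from point b), since DF itself is never appended to.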

Neighbours list extracted out of polygon regions

I've got a SQL database which contains some coded polygon structures. Those can be extracted as follows
poly <- data.frame(sqldf("SELECT ST_astext(geometry) FROM table"))
The data.frame 'poly' contains strings that can now be converted to real 'SpatialPolygons' objects as follows (for the first string):
readWKT(poly[1,1])
I can do the previous for each string and save the results in a list:
list <- c()
for (i in 1:100){
  list <- c(list, readWKT(poly[i,1]))
}
The last thing I want to do is to create a neighbourhood list based on all the SpatialPolygons, making use of the following function:
poly2nb(list)
But sadly, this command results in the following error
Error: extends(class(pl), "SpatialPolygons") is not TRUE
I know that the problem has something to do with the class of the list, but I really don't see a way out. Any help will be appreciated!
Edit
As suggested, some parts of the output. Keep in mind that the rows of 'poly' are really long strings of coordinates
> poly[1,1]
[1] "POLYGON((4.155976 50.78233,...,4.153225 50.76121,4.152384 50.761191,4.151878 50.761194,4.151319 50.761163,4.150872 50.761126))"
> poly[2,1]
[1] "POLYGON((5.139526 50.914059,...,5.140994 50.913612,5.156976 50.895945))"
This seems to work:
list <- lapply(1:2,function(i)readWKT(poly[i,1],id=i))
sp <- SpatialPolygons(lapply(list, function(sp) sp@polygons[[1]]))
library(spdep)
poly2nb(sp)
The internal structure of SpatialPolygons is rather complex. A SpatialPolygons object is a collection (list) of Polygons objects (which represent geographies), and each of these is a list of Polygon objects, which represent geometric shapes. So, for example, a SpatialPolygons object that represents US states has 50 or so Polygons objects (one for each state), and each of those can have multiple Polygon objects (if the state is not contiguous, e.g. has islands, etc.).
It looks like poly2nb(...) takes a single SpatialPolygons object and calculates neighborhood structure based on the contained list of Polygons objects. You were passing a list of SpatialPolygons objects.
So the challenge is to convert the result of your SQL query to a single SpatialPolygons object. readWKT(...) converts each row to a SpatialPolygons object, each of which contains exactly one Polygons object. So you have to extract those and re-assemble them into a single SpatialPolygons object. The line:
sp <- SpatialPolygons(lapply(list, function(sp) sp@polygons[[1]]))
does that. The line before:
list <- lapply(1:2,function(i)readWKT(poly[i,1],id=i))
replaces your for (...) loop and also adds a polygon id to each polygon, which is necessary for the call to SpatialPolygons(...).