I need to pick out a few rows in my sframe by index. Is there an equivalent graphlab command to pandas df.irow()?
There is no direct equivalent in graphlab to DataFrame.iloc (previously irow). One way to achieve the same thing is to add a column of row numbers and use the filter_by method. Suppose I want to get only the 1st and 3rd rows:
import graphlab
sf = graphlab.SFrame({'x': ['a', 'b', 'a', 'c']})
sf = sf.add_row_number('row_id')
new_sf = sf.filter_by(values=[0, 2], column_name='row_id')
Related
I'd like to ask for help fixing the missing values in a pandas DataFrame (Python).
here is the dataset
In this dataset I found missing values in the ['Item_Weight'] column.
I don't want to drop the rows with missing values: after sorting the data, I found that the missing values appear to be entry mistakes ("mistypes") by whoever encoded it.
here is the sorted dataset
Now I have created a lookup dataset so I can merge the two and fill the missing (NaN) values.
How can I merge them or join them only to fill the missing values (Nan) using the lookup table I made? Or is there any other way without using a lookup table?
Looking at this, you will probably want to use something along the lines of map instead of join/merge. Here is an example of how to use map with your data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Column1' : ['A', 'B', 'C'],
'Column2' : [1, np.nan, 3]
})
df
df_map = pd.DataFrame({
'Column1' : ['A', 'B', 'C'],
'Column2' : [1, 2, 3]
})
df_map
# Where Column2 is null, look up the Column1 value in df_map and take its Column2; otherwise keep the existing value
df['Column2'] = np.where(df['Column2'].isna(), df['Column1'].map(df_map.set_index('Column1')['Column2']), df['Column2'])
I had to create my own dataframes since you used screenshots. In the future, please post your data as text rather than screenshots so that others can reproduce the problem.
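The same fill can also be written with fillna instead of np.where. This is only a sketch on the toy frames above; the intermediate name lookup is mine, not something from your data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column1': ['A', 'B', 'C'],
    'Column2': [1, np.nan, 3]
})
df_map = pd.DataFrame({
    'Column1': ['A', 'B', 'C'],
    'Column2': [1, 2, 3]
})

# Turn the lookup table into a Series indexed by the key column
# ('lookup' is an illustrative name)
lookup = df_map.set_index('Column1')['Column2']

# fillna only touches the NaN entries; existing values are kept as they are
df['Column2'] = df['Column2'].fillna(df['Column1'].map(lookup))
```

Note that set_index followed by map assumes each key appears only once in the lookup table.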
This will probably work:
df = df.sort_values(['Item_Identifier', 'Item_Weight']).ffill()
But I can't test it since you didn't give us anything to work with.
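One caveat with a plain ffill: it fills every column, and an item with no known weight at all would pick up the weight of the previous item. A possible variant that avoids a lookup table and fills only within each Item_Identifier is sketched below. The sample rows are made up for illustration (only the two column names come from the question), so treat it as untested against the real data:

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names are taken from the question
df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDA15', 'DRC01', 'NCD19'],
    'Item_Weight': [9.3, np.nan, np.nan, 5.92, np.nan],
})

# For every row, the first non-null weight recorded for the same item
known_weight = df.groupby('Item_Identifier')['Item_Weight'].transform('first')

# Fill only the missing weights; items with no known weight stay NaN
df['Item_Weight'] = df['Item_Weight'].fillna(known_weight)
```

This assumes that every row of a given item should carry the same weight.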
I have the following Pandas Dataframe
Which can also be generated using this list of dictionaries:
list_of_dictionaries = [
{'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
{'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
{'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
{'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented kind of what I need, but adding columns vertically:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
This gives me a separate column for each people_id, but only up to the second element: when I add the extraction of a 3rd element into a third column, it crashes because there is no 3rd element to extract from the first row.
What I am actually trying to do is extract every people_id from the people_ids column, each with its associated values from the Project and Hours columns, so I get a dataset like this one:
Any idea how I could get this output?
I think what you are looking for is explode on the 'people_ids' column (the ignore_index argument requires pandas 1.1 or later).
df = df.explode('people_ids', ignore_index=True)
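For completeness, here is a small sketch of explode applied to the list of dictionaries from the question. The result name exploded is just an illustrative choice:

```python
import pandas as pd

list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
    {'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
    {'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]

df = pd.DataFrame(list_of_dictionaries)

# One output row per element of each list; Project and Hours are repeated
exploded = df.explode('people_ids', ignore_index=True)
print(exploded)
```

The exploded column keeps the object dtype, so you may want to cast it afterwards, e.g. with exploded['people_ids'].astype('int64').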
Suppose I have a dataframe df, and it has columns with names 'a', 'b', 'c', 'd', 'e'. Then I make all combinations of length three (order doesn't matter) from this list to generate the following list of lists:
Combinations_of_3 = [['a','b','c'], ['a','b','d'], ..., ['c','d','e']]
Now I wish to create a for loop to populate a second data frame, and to do this I want to loop over Combinations_of_3 and use the current entry to select the corresponding columns of df.
For example, if I wanted to select only the 'a', 'b' and 'e' columns of df, I would normally write df[['a','b','e']]; but now I would like to do this in a for loop using Combinations_of_3. I'm writing this code using pandas / python. Thank you.
Just do as you described, using a variable:
Combinations_of_3 = [['a','b','c'], ['a','b','d'], ['c','d','e']]
for cols in Combinations_of_3:
    #do something
    print(df[cols])
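NB. To create Combinations_of_3 you could use:
from itertools import combinations
# combinations() yields tuples; convert each to a list, because df[('a', 'b', 'c')] would be looked up as a single column key
Combinations_of_3 = [list(c) for c in combinations(df.columns, r=3)]
#or lazily, using an iterator
Combinations_of_3 = map(list, combinations(df.columns, r=3))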
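Putting the pieces together, a minimal sketch of populating a second data frame could look like this. The sample values and the choice of a row-wise sum as the "do something" step are assumptions on my part, since the question does not say what should be computed:

```python
from itertools import combinations

import pandas as pd

# Made-up sample data with the five column names from the question
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6],
                   'd': [7, 8], 'e': [9, 10]})

results = {}
for cols in combinations(df.columns, 3):
    cols = list(cols)  # a list selects several columns; a tuple would not
    # Placeholder computation: row-wise sum of the three selected columns
    results['+'.join(cols)] = df[cols].sum(axis=1)

# Second data frame with one column per combination
summary = pd.DataFrame(results)
print(summary)
```

Collecting the pieces in a dict and building the data frame once at the end is usually simpler than adding columns to it inside the loop.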
I am following this program: https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance_multicollinear.html
since I have a problem with highly correlated features in my model (different from the one shown in the example). In this step
selected_features = [v[0] for v in cluster_id_to_feature_ids.values()]
I can get information on the features that I will need to remove from my classifier. They are given as numbers ([0, 3, 5, 6, 8, 9, 10, 17]). How can I get names of these features?
Ok, there are two different elements to this problem I think.
First, you need to get a list of the column names. In the example code you linked, it looks like the list of feature names is stored like this:
data.feature_names
Once you have the feature names, you'd need a way to loop through them and grab only the ones you want. Something like this should work:
columns = ['a', 'b', 'c', 'd']
keep_index = [0, 3]
new_columns = [columns[i] for i in keep_index]
new_columns
['a', 'd']
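If the feature names are held in a NumPy array (which I believe is the case for data.feature_names in the linked example), you can also index the array directly with the list of positions. The names below are placeholders, not the real feature names:

```python
import numpy as np

# Placeholder names standing in for data.feature_names
feature_names = np.array(['feat_0', 'feat_1', 'feat_2', 'feat_3', 'feat_4'])

# Positions as returned by the clustering step (example values)
selected_features = [0, 3]

# Indexing a NumPy array with a list of integers picks those positions
selected_names = feature_names[selected_features].tolist()
print(selected_names)
```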
I have a dataframe with a very large number of columns of different types. I want to encode the categorical variables in my dataframe using get_dummies(). The question is: is there a way to get the column headers of the encoded categorical columns created by get_dummies()?
The hard way to do this would be to extract a list of all categorical variables in the dataframe, then append the different text labels associated to each categorical variable to the corresponding column headers. I wonder if there is an easier way to achieve the same end.
I think the way that should work with all the different uses of get_dummies would be:
#example data
import pandas as pd
df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
'R': [2, 3, 4]})
dummies = pd.get_dummies(df)
#get column names that were not in the original dataframe
new_cols = dummies.columns[~dummies.columns.isin(df.columns)]
new_cols gives:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
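get_dummies leaves the columns it does not encode (here only the numeric column R) in place and puts them before the new dummy columns. With a single preserved column, as in this test data, you could also just take the column names after the first column: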
dummies.columns[1:]
which on this test data gives the same result:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
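If you also want to know which dummy columns came from which original column, something along these lines might help. It relies on the default prefix and '_' separator of get_dummies, and the names encoded and mapping are mine:

```python
import pandas as pd

df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
                   'R': [2, 3, 4]})
dummies = pd.get_dummies(df)

# Original columns that disappeared were the ones that got encoded
encoded = [c for c in df.columns if c not in dummies.columns]

# Group the new headers by the original column name used as prefix
mapping = {c: [d for d in dummies.columns if d.startswith(c + '_')]
           for c in encoded}
print(mapping)
```

The prefix match can be ambiguous when one column name is itself a prefix of another (for example 'P' and 'P_x'), so check the result on wide frames.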