Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks, with the stock ids as column names.
[screenshot: monthly returns dataframe]
I need to select only the columns whose names match the values in another dataframe, which holds the ids I want.
[screenshot: permno list dataframe]
I'm sure this is quite simple, but I have been struggling with it for two days; if someone has an easy solution it would be very much appreciated. Thank you.

You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list().
Then you can use that list to slice all_rets, with .copy() to make a fresh copy of it as a new dataframe.
The Python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos has duplicates, you'll want to filter those out first:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()

As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permnos list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe, which has string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
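Putting the two pieces together, a minimal end-to-end sketch (assuming the column labels of all_rets are strings and the permnos live in column '0'):
# cast the integer permnos to strings so they match the string column labels
keep = osr_curr_permnos['0'].astype(str).drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()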

Related

Check multiple columns for multiple values and return a dataframe

I have a list of strings, and my dataframe has several columns that I need to search (each of type object).
I need to return all rows where any of the selected columns contains any of the string items, whether whole or as part of the string.
How do I check whether 4 columns in my dataframe contain any one of the items in the list of strings? The string inside the column may contain part of a string provided in the list, but probably won't contain all of it.
I've tried the list both as a tuple and as a Python list:
list = ("25110", "25910", "25990", "30110", "33110", "43999")
new_df = df.loc[(df['column1'].isin(list))
| (df['column2'].isin(list))
| (df['column3'].isin(list))
| (df['column4'].isin(list))]
When I run new_df.shape, I get (0, 12).
I'm new to pandas, I've got a mountain of analysis to do for an intense uni project, and I can't get this to work. Do I need to convert each column to a string datatype first? (I've actually already tried that as well, but each dtype is still stubbornly an 'object'.)
IIUC:
try:
lst = ["25110", "25910", "25990", "30110", "33110", "43999"]
cols = ['column1', 'column2', 'column3', 'column4']
Finally:
m = df[cols].astype(str).agg(lambda x: x.str.contains('|'.join(lst)), axis=1).any(axis=1)
# you can also use apply() in place of agg()
df[m]
# OR
df.loc[m]
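As a quick sanity check, here is a minimal runnable sketch with made-up data (the column names and values are hypothetical, and apply() is used, as the comment above suggests):
import pandas as pd

df = pd.DataFrame({'column1': ['25110-X', 'foo'],
                   'column2': ['bar', 'baz'],
                   'column3': ['qux', 'code 30110'],
                   'column4': ['n/a', 'n/a']})
lst = ["25110", "25910", "25990", "30110", "33110", "43999"]
cols = ['column1', 'column2', 'column3', 'column4']

# True for each row where any of the four columns contains any list item as a substring
m = df[cols].astype(str).apply(lambda x: x.str.contains('|'.join(lst)), axis=1).any(axis=1)
print(df[m])  # both rows match, even though the codes are only part of the strings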

Is there a pandas function to get the variable names in a column?

I'm just thinking of a hypothetical dataframe (df) with around 50 columns and 30,000 rows, and one hypothetical column like e.g. Toy = ['Ball','Doll','Horse',...,'Sheriff', etc.].
Now I only have the name of the column (Toy), and I want to know what values are inside the column, without duplicates.
I'm thinking of an output like the .describe() function
df['Toy'].describe()
but with more info, because right now I'm only getting this output:
count 30904
unique 7
top "Doll"
freq 16562
Name: Toy, dtype: object
In other words, how do I get the 7 values in this column? I was thinking of something like copying the column and deleting the duplicated values, but I'm pretty sure there is a shorter way. Do you know the right code, or should I use another library?
Thank you so much!
You can use the unique() function to list all the unique values in a column. In your case, to list the unique values of the column Toy in the dataframe df, the syntax would look like:
df["toys"].unique()
You can also use .drop_duplicates(), which returns a pandas Series:
df['Toy'].drop_duplicates()
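A tiny sketch of the difference between the two (made-up data):
import pandas as pd

df = pd.DataFrame({'Toy': ['Ball', 'Doll', 'Ball', 'Horse']})
print(df['Toy'].unique())           # numpy array: ['Ball' 'Doll' 'Horse']
print(df['Toy'].drop_duplicates())  # pandas Series, keeping the original index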

How do you split ALL columns in a large pandas dataframe?

I have a very large dataframe, and I want to split ALL of the columns except the first two on a comma delimiter. So I need to reference the column names programmatically, in a loop or some other way, to split all the columns in one swoop.
In my testing of the split method:
I have been able to explicitly refer to (i.e. HARD CODE) a single column name (rs145629793) as one of the required parameters, and the result was 2 new columns, as I wanted.
See the Python code below.
HARDCODED COLUMN NAME --
df[['rs1','rs2']] = df.rs145629793.str.split(",", expand = True)
The problem:
It is not feasible to refer to each actual column name by hand and repeat the code.
I then replaced the actual column name rs145629793 with columns[2] in the split method's parameter list.
It results in an error:
'str' object has no attribute 'str'
You can index columns by position rather than name using iloc. For example, to get the third column:
df.iloc[:, 2]
Thus you can easily loop over the columns you need.
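For example, a minimal sketch of that loop (assuming df is your frame and every value splits into exactly two comma-separated parts, as in the hard-coded case):
# columns to split: everything after the first two
to_split = df.columns[2:]
for col in to_split:
    # e.g. "0,1" -> two new columns named <col>_1 and <col>_2
    df[[col + '_1', col + '_2']] = df[col].str.split(',', expand=True)
df = df.drop(columns=to_split)  # drop the originals once split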
I know what you are asking, but it's still helpful to provide some input data and expected output data. I have included random input data in my code below, so you can just copy and paste this to run, and try to apply it to your dataframe:
import pandas as pd

your_dataframe = pd.DataFrame({'a': ['1,2,3', '9,8,7'],
                               'b': ['4,5,6', '6,5,4'],
                               'c': ['7,8,9', '3,2,1']})

def split_cols(df):
    cols = df.columns.to_list()
    for col in cols:
        # split the column on commas into new columns prefixed with its name
        var = df[col].str.split(',', expand=True).add_prefix(col)
        # merge the new columns back on the index and drop the original
        df = pd.merge(df, var, how='left', left_index=True, right_index=True).drop(col, axis=1)
    return df

split_cols(your_dataframe)  # -> columns a0, a1, a2, b0, b1, b2, c0, c1, c2
Essentially, in this solution you create a list of the columns that you want to loop through. Then you loop through that list and, for each column, create a new dataframe of the split values. Then you merge everything back together on the index. I also:
included the column name as a prefix, so the new columns do not have duplicate names and can be more easily identified
dropped the old column that we did the split on.
Just use the split_cols() function that I have created and pass it your dataframe.

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: ['company'], which is a string, and ['publication_datetime'], which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
My group-by does not appear to work.
I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to cast the result back into the original dimensions of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your groupby() was returning a multi-indexed result, which is why it complains about 3 items (the first group key, the second group key, and then the values). And even if you extracted just the values, groupby combines like groups together, so you would have fewer values than rows.
The transform() method returns the group result for each row of the dataframe, in a way that makes it easy to create a new column. The returned values are a Series whose index matches the original index of the issuers dataframe.
Hope this helps! See the pandas documentation for transform().
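A small runnable illustration of the broadcast (made-up data):
import pandas as pd

news = pd.DataFrame({'company': ['A', 'A', 'B'],
                     'publication_datetime': pd.to_datetime(['2020-01-01 09:00',
                                                             '2020-01-01 17:00',
                                                             '2020-01-02 12:00'])})
issuers = news[['company', 'publication_datetime']].copy()
issuers['publication_date'] = issuers['publication_datetime'].dt.date

# one max per (company, date) group, broadcast back to every row
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
# both 'A' rows on 2020-01-01 now carry 17:00 in 'max'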
The thing is, by doing what you are doing, you are trying to assign a whole DataFrame to a single column.
Also note that the grouped result has fewer rows than issuers, so extracting just the values is not enough; merge the per-group maxima back on the group keys instead:
maxes = issuers.groupby(['company', 'publication_date'], as_index=False)['publication_datetime'].max()
maxes = maxes.rename(columns={'publication_datetime': 'publication_datetime_max'})
issuers = issuers.merge(maxes, on=['company', 'publication_date'])
Hope this helps!

Working with dataframe / matrix to create an input for sklearn & TensorFlow

I am working with pandas / Python / NumPy / Datalab / BigQuery to generate an input table for machine learning processing. The data is genomic, and right now I am working with a small subset of 174 rows and 12,430 columns.
The column names are extracted from BigQuery (df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard', use_cache=True)).
In the same way, the row names are extracted: samples_rows = bq.Query('SELECT sample_id FROM `speedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3` GROUP BY sample_id')
What would be the easiest way to create a dataframe / matrix with the row and column names that were extracted?
I explored the dataframes in pandas and could not find a way to pass the names as parameters.
For an empty array, I was able to find the following (NumPy), with no names:
a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns
I know R very well (if there is no other way, I hope I can use it with Datalab).
Any ideas?
Many thanks!
If you have your column names and row names stored in lists then you can just use .loc to select the exact rows and columns you desire. Just make sure that the row names are in the index. You might need to do df.set_index('sample_id') to put the correct row name in the index.
Assuming the rows and columns are in variables row_names and col_names, do this.
df.loc[row_names, col_names]
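And if the frame does not exist yet, pandas lets you pass the names directly at construction time. A minimal sketch, where row_names and col_names stand in for the lists pulled from BigQuery:
import numpy as np
import pandas as pd

row_names = ['sample_1', 'sample_2']        # hypothetical sample ids
col_names = ['feat_a', 'feat_b', 'feat_c']  # hypothetical feature names

# an empty (NaN-filled) frame with named rows and columns
df = pd.DataFrame(np.full((len(row_names), len(col_names)), np.nan),
                  index=row_names, columns=col_names)

df.loc[row_names, col_names]  # select by name, as above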