Reverse the order of rows in a single column according to groups of multiple factor columns

I am trying to reverse the rows of one column of a dataframe by group, where the grouping is driven by multiple factor columns.
I would like to reverse the numeric values of Col5 within the selected groups; all the other columns are factors.
I tried this code, but it failed:
reord <- a %>% dplyr::group_by(Col1, Col2, Col4) %>% dplyr::mutate(reve = rev(Col5))
Do you have any suggestions? Any help is highly appreciated.
Thanks in advance

Made it.
Here is my solution (I also used some code from forums, so thanks anyway).
library(dplyr)

split <- a %>%                    # a is the original dataframe
  group_split(Col1, Col2, Col4)   # split the dataframe into a list of tibbles, one per group
split1 <- lapply(split, function(x) mutate(x, reve = rev(Col5)))  # rev() applied to the column of interest in each group
reord1 <- bind_rows(split1, .id = 'id')  # the list elements are rebound into one dataframe, with the original values in Col5 and the reversed values in the new column "reve"
It worked well.


Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks, with the stock ids as column names.
I need to select only the columns that match the values in another dataframe (a permno list) holding the ids I want.
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permno list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
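Putting both answers together, a minimal end-to-end sketch (the frame names come from the question; the toy values are invented):
import pandas as pd

# toy stand-ins for the real data: string column headers, integer permnos
all_rets = pd.DataFrame({'10001': [0.01, -0.02], '10002': [0.03, 0.00]})
osr_curr_permnos = pd.DataFrame({'0': [10001, 10002, 10001]})

# cast the ids to str so they match the headers, drop duplicates, then slice
keep = osr_curr_permnos['0'].astype(str).drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()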

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: 'company', which is a string, and 'publication_datetime', which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
My group by does not appear to work.
I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to cast the result back into the original dimensions of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your original groupby() was returning the two group keys plus the values, three items in all, which is why it complains about 3 items being passed where 1 fits. And even if you extracted just the values, like groups are combined, so you would have fewer values than rows in the dataframe.
The transform() method instead returns the group result for every row of the dataframe, as a Series indexed like the original issuers dataframe, which makes it easy to create a new column.
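For instance, a minimal sketch on made-up data (the column names follow the question):
import pandas as pd

news = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'publication_datetime': pd.to_datetime(['2020-01-01 09:00', '2020-01-01 17:00', '2020-01-02 12:00']),
})
issuers = news[['company', 'publication_datetime']].copy()
issuers['publication_date'] = issuers['publication_datetime'].dt.date

# transform('max') returns one value per original row, aligned on the index,
# so both company-A rows on 2020-01-01 get the 17:00 timestamp as their max
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')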
Hope this helps! The transform() method is covered in the pandas groupby documentation.
The thing is, by doing what you are doing you are trying to assign a whole DataFrame to a single column.
Dropping as_index=False and taking only the values, without the two index columns, looks like this:
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'])['publication_datetime'].max().tolist()
Note that this only lines up if every (company, publication_date) pair occurs exactly once; otherwise there are fewer group maxima than rows, and transform(), as in the other answer, is the right tool.
Hope this helps!

Finding duplicate records and subsetting to a clean dataset

I have a dataset in which every record has a duplicate row, and the second (duplicate) row has missing values in it.
How can I write a code in python to find the duplicate records in a dataset?
Original Dataset
Required Output
First sort_values, including the column that contains the null values, so the complete row in each pair sorts first;
then use drop_duplicates on column FileNo:
df.sort_values(by=['FileNo', 'Coverage'], ascending=[True, True], inplace=True, na_position='last')
df.drop_duplicates(subset=['FileNo'], inplace=True)
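As a quick check on toy data (FileNo and Coverage come from the question; the values are invented):
import pandas as pd

df = pd.DataFrame({
    'FileNo': [1, 1, 2, 2],
    'Coverage': ['full', None, 'partial', None],
})
# complete rows sort before their null-Coverage duplicates within each FileNo...
df.sort_values(by=['FileNo', 'Coverage'], ascending=[True, True], inplace=True, na_position='last')
# ...so drop_duplicates keeps the complete row of each pair
df.drop_duplicates(subset=['FileNo'], inplace=True)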
Pandas drop_duplicates() method helps in removing duplicates from the data frame.
Syntax:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Refer for example:
https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/
And the pandas dropna() method allows the user to analyze and drop rows/columns with null values in different ways.
Syntax:
DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Refer for example:
https://www.geeksforgeeks.org/python-pandas-dataframe-dropna/
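A minimal illustration of both calls on invented data:
import pandas as pd

df = pd.DataFrame({'FileNo': [1, 1, 2], 'Coverage': ['full', None, 'partial']})
deduped = df.drop_duplicates(subset='FileNo', keep='first')  # first row per FileNo
complete = df.dropna(subset=['Coverage'])                    # rows with non-null Coverage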
df.drop_duplicates(subset='FileNo')

How to group by and sum several columns?

I have a big dataframe with several columns containing strings, numbers, etc. I am trying to group by SCENARIO and then sum only the columns between 2020 and 2050. The only thing I have got so far is summing one column, as shown below, but I need to replace this '2050' with the range of columns between 2020 and 2050, for instance.
df1 = df.groupby(["SCENARIO"])['2050'].sum().sum(axis=0)
You are creating a subset of the df with only that single column. I can't tell what your dataset looks like from the information provided, but try:
df.groupby(["SCENARIO"]).sum()
This should sum up all the rows of every summable column, per group.
Alternatively, select the columns you want to perform the summation on:
df.groupby(["SCENARIO"])[["column1","column2"]].sum()

Working with dataframe / matrix to create an input for sklearn & Tensorflow

I am working with pandas / Python / numpy / Datalab / BigQuery to generate an input table for machine learning processing. The data is genomic, and right now I am working with a small subset of 174 rows and 12,430 columns.
The column names are extracted from BigQuery:
df_pik3ca_features = bq.Query(std_sql_features).to_dataframe(dialect='standard', use_cache=True)
In the same way, the row names are extracted:
samples_rows = bq.Query('SELECT sample_id FROM `speedy-emissary-167213.pgp_orielresearch.pgp_PIK3CA_all_features_values_step_3` GROUP BY sample_id')
What would be the easiest way to create a dataframe/matrix with the row and column names that were extracted?
I explored the dataframes in pandas and could not find a way to pass the names as parameters.
For an empty array, I was able to find the following (numpy), with no names:
a = np.full([num_of_rows, num_of_columns], np.nan)
a.columns
I know R very well (if there is no other way, I hope I can use it with Datalab).
Any ideas?
Many thanks!
If you have your column names and row names stored in lists then you can just use .loc to select the exact rows and columns you desire. Just make sure that the row names are in the index. You might need to do df.set_index('sample_id') to put the correct row name in the index.
Assuming the rows and columns are in variables row_names and col_names, do this.
df.loc[row_names, col_names]
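For the empty-array case in the question, pandas takes the names directly at construction time (a sketch; the row_names and col_names lists here stand in for the ids pulled from BigQuery):
import numpy as np
import pandas as pd

row_names = ['sample_1', 'sample_2']        # stand-in for the BigQuery sample ids
col_names = ['feat_a', 'feat_b', 'feat_c']  # stand-in for the feature names

a = np.full([len(row_names), len(col_names)], np.nan)
df = pd.DataFrame(a, index=row_names, columns=col_names)  # NaN-filled, with named rows and columns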