Pandas: Extracting data from sorted dataframe

Consider a dataframe with two columns: 'Name', a string, and 'Score', an int. There are many duplicate names, and the rows are sorted so that all 'Name1' rows are consecutive, followed by all 'Name2' rows, and so on. Each row may contain a different score. The number of duplicates may also differ for each unique name.
I wish to extract data from this dataframe and put it in a new dataframe such that there are no duplicate names in the name column, and each name's score is the average of that name's scores in the original dataframe.
I've provided a picture for a better visualization:

First, use the groupby() method, as mentioned by #QuangHong:
result=df.groupby('Name', as_index=False)['Score'].mean()
Then use the rename() method to give the averaged column a clearer name:
result=result.rename(columns={'Score':'Avg Score'})
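To see the two steps together, here is a minimal sketch on made-up data. The names and scores below are hypothetical, chosen only so the averages are easy to check by hand.

```python
import pandas as pd

# Hypothetical sample data: names are sorted, scores vary per row.
df = pd.DataFrame({
    'Name': ['Name1', 'Name1', 'Name1', 'Name2', 'Name2', 'Name3'],
    'Score': [10, 20, 30, 5, 15, 8],
})

# One row per name, holding the mean of that name's scores.
# as_index=False keeps 'Name' as a regular column instead of the index.
result = df.groupby('Name', as_index=False)['Score'].mean()

# Rename the aggregated column so it is clear it holds an average.
result = result.rename(columns={'Score': 'Avg Score'})
print(result)
```

Note that groupby() does not require the rows to be sorted beforehand; it would give the same result on unsorted data.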

Related

How to handle columns of the same name in pandas

I have a dataframe which happens to have some columns with the same column name.
df_raw[column_name] # [141 rows x 2 columns]
I have code that extracts the unique values, but it does not work when the selection has more than one dimension:
ipdb> df_raw[column_name].unique()
*** AttributeError: 'DataFrame' object has no attribute 'unique'
I would rather not rename df_raw.columns to make all column names unique before processing. Is there a good way to handle this?
I also tried the code below, which raises an error:
ipdb> df_raw[column_name][0]
*** KeyError: 0
Questions:
How to know how many columns have the same name. In the example above, I am expecting 2.
How to individually refer to a column (for example, updating purposes).
To get the number of columns with column_name, you can do df_raw[column_name].shape[1]. You can access a dataframe by actual location, rather than name, with the iloc syntax: df_raw.iloc[:,n] will return the nth column of the dataframe, and df_raw[column_name].iloc[:,n] will return the nth column named "column_name" (keep in mind that it's zero-indexed).
Also, if you want the unique column names, you can do set(df_raw.columns).
I got the answer. Thank you for viewing.
How to know how many columns have the same name. In the example above, I am expecting 2.
len(df_raw[column_name].columns)
How to individually refer to a column (for example, updating purposes).
df_raw[column_name].iloc[:,0] #first column (.ix was removed in pandas 1.0; use .iloc)
df_raw[column_name].iloc[:,1] #2nd column, etc
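The points from both answers can be sketched together as follows. The frame, the column names 'x' and 'y', and the replacement values are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical frame with two columns that share the name 'x'.
df_raw = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x', 'x', 'y'])
column_name = 'x'

# Selecting a duplicated name returns a DataFrame, not a Series.
dupes = df_raw[column_name]
n_same = dupes.shape[1]      # number of columns named 'x'

first = dupes.iloc[:, 0]     # first 'x' column, as a Series
second = dupes.iloc[:, 1]    # second 'x' column, as a Series

# unique() works once a single column (a Series) has been picked out.
first_unique = first.unique()

# To update one of the duplicates, find its position in the original
# frame and assign by position rather than by name.
positions = [i for i, name in enumerate(df_raw.columns) if name == column_name]
df_raw.iloc[:, positions[1]] = [20, 50]
```

Assigning through positions in the original frame avoids writing to the intermediate selection df_raw[column_name], which may be a copy.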

Is there a pandas function for get variables names in a column?

I'm thinking of a hypothetical dataframe (df) with around 50 columns and 30000 rows, and one column such as Toy = ['Ball','Doll','Horse',...,'Sheriff',etc].
I only have the name of the column (Toy), and I want to know which distinct values the column contains, without duplicates.
I'm thinking of an output like that of the .describe() function:
df['Toy'].describe()
but with more info, because right now I only get this output:
count 30904
unique 7
top "Doll"
freq 16562
Name: Toy, dtype: object
In other words, how do I get the 7 values in this column? I was thinking of something like copying the column and deleting the duplicated values, but I'm pretty sure there is a shorter way. Do you know the right code, or whether I should use another library?
Thank you so much!
You can use the unique() function to list all the unique values in a column. In your case, to list the unique values in the Toy column of the dataframe df, the syntax would be:
df["Toy"].unique()
You can also use .drop_duplicates(), which returns a pandas Series:
df['Toy'].drop_duplicates()
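A minimal sketch comparing the two calls on a small, made-up Toy column. The value_counts() call is an addition not mentioned in the answers above; it may be useful if you also want the frequency of each value.

```python
import pandas as pd

# Hypothetical stand-in for the 'Toy' column.
df = pd.DataFrame({'Toy': ['Ball', 'Doll', 'Doll', 'Horse', 'Ball', 'Sheriff']})

# Array of distinct values, in order of first appearance.
values = df['Toy'].unique()

# Series of distinct values that keeps the original index labels.
as_series = df['Toy'].drop_duplicates()

# Addition: how many times each distinct value occurs.
counts = df['Toy'].value_counts()

print(values)
print(as_series)
print(counts)
```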

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
[image: monthly returns]
I need to select only the columns that match the values in another dataframe which includes the ids I want.
[image: permno list]
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permnos list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
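The int vs. string mismatch can be reproduced with a small sketch. The ids and return values below are hypothetical; only the frame names and the column name '0' come from the posts above.

```python
import pandas as pd

# Hypothetical returns frame whose column headers are string ids.
all_rets = pd.DataFrame(
    [[0.01, 0.02, -0.01], [0.03, -0.02, 0.00]],
    columns=['10001', '10002', '10003'],
)

# Hypothetical id list stored as integers in a column named '0'.
osr_curr_permnos = pd.DataFrame({'0': [10001, 10003, 10001]})

# Drop repeated ids, then cast to str so they match the column headers.
keep = osr_curr_permnos['0'].drop_duplicates().astype(str).to_list()

# Select the matching columns into a fresh dataframe.
selected_rets = all_rets[keep].copy()
print(selected_rets)
```

Without the astype(str) step, all_rets[keep] would raise a KeyError here, because the integer 10001 is not equal to the string '10001'.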

Can we sort multiple data frames comparing values of each element in column

I have two csv files containing some data, and I would like to combine and sort the data based on one common column.
Here are the data1.csv and data2.csv files:
data3.csv is the output file, where I need the data to be combined and sorted as below:
How can I achieve this?
Here's what I think you want to do here:
I created two dataframes with simple types. Assume the first column is like your timestamp:
import pandas as pd
df1 = pd.DataFrame([[1,1],[2,2], [7,10], [8,15]], columns=['timestamp', 'A'])
df2 = pd.DataFrame([[1,5],[4,7],[6,9], [7,11]], columns=['timestamp', 'B'])
c = df1.merge(df2, how='outer', on='timestamp')
print(c)
The outer merge causes each contributing DataFrame to be fully present in the output even if not matched to the other DataFrame.
The result is that you end up with a DataFrame with a timestamp column and the dependent data from each of the source DataFrames.
Caveats:
You have repeating timestamps in your second sample, which I assume may be due to the fact you do not show enough resolution. You would not want true duplicate records for this merge solution, as we assume timestamps are unique.
I have not repeated the timestamp column here a second time, but it is easy to add in another timestamp column based on whether column A or B is notnull() if you really need to have two timestamp columns. Pandas merge() has an indicator option which would show you the source of the timestamp if you did not want to rely on columns A and B.
In the post you have two output columns named "timestamp". Generally you would not output two columns with same name since they are only distinguished by position (or color) which are not properties you should rely upon.
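The indicator option mentioned in the caveats can be sketched as below, reusing the same two small frames. The explicit sort_values() call is an addition, so that the ordering by timestamp does not depend on the default behaviour of merge().

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1], [2, 2], [7, 10], [8, 15]], columns=['timestamp', 'A'])
df2 = pd.DataFrame([[1, 5], [4, 7], [6, 9], [7, 11]], columns=['timestamp', 'B'])

# indicator=True adds a '_merge' column saying whether each row came
# from the left frame only, the right frame only, or both.
combined = (
    df1.merge(df2, how='outer', on='timestamp', indicator=True)
       .sort_values('timestamp')
       .reset_index(drop=True)
)
print(combined)

# Write the combined result out if a file is needed, for example:
# combined.to_csv('data3.csv', index=False)
```

Rows that exist in only one source have NaN in the other source's column, which is how missing matches show up after an outer merge.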

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: 'company', which is a string, and 'publication_datetime', which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
My group by does not appear to work.
I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to broadcast the group result back to the original shape of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your groupby() call with as_index=False returns a DataFrame with three columns (the two group keys and the aggregated value), which is why pandas complains that 3 items were passed where 1 was expected. Even if you kept only the values, the aggregation collapses each group into a single row, so you would have fewer values than the dataframe has rows.
The transform() method returns the group results for each row of the dataframe in a way that makes it easy to create a new column. The returned values are an indexed Series with the indices being the original ones from the issuers dataframe.
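A minimal sketch of the transform() approach on made-up data. The contents of the news frame below are hypothetical; the column names follow the question. The .copy() call is an addition that avoids a SettingWithCopyWarning when new columns are added to the selection.

```python
import pandas as pd

# Hypothetical news data: company A has two articles on the same day.
news = pd.DataFrame({
    'company': ['A', 'A', 'A', 'B'],
    'publication_datetime': pd.to_datetime([
        '2021-01-01 09:00', '2021-01-01 17:30',
        '2021-01-02 08:00', '2021-01-01 12:00',
    ]),
})

issuers = news[['company', 'publication_datetime']].copy()
issuers['publication_date'] = issuers['publication_datetime'].dt.date

# transform('max') returns one value per original row, aligned on the
# original index, so it can be assigned directly as a new column.
issuers['publication_datetime_max'] = (
    issuers.groupby(['company', 'publication_date'])['publication_datetime']
           .transform('max')
)
print(issuers)
```

Both of company A's rows for 2021-01-01 receive the later of the two timestamps, while the other rows keep their own value because each is alone in its group.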
Hope this helps! See the pandas documentation for transform() for details.
The problem is that you are trying to assign a DataFrame to a single column.
Dropping as_index=False and calling tolist() extracts only the values, without the two index columns. Note that this yields one value per group, so the assignment only works when every group contains exactly one row:
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'])['publication_datetime'].max().tolist()
Hope this helps!