How can I query a column of a dataframe on a specific value and get the values of two other columns corresponding to that value - pandas

I have a dataframe where the first column contains various countries' ISO codes, while the other two columns contain dataset numbers and LinkedIn profile links.
I need to query the dataframe's first column, "FBC", for the value "IND" and get the corresponding values of the "no" and "Linkedin" columns.
Can somebody please suggest a solution?

Using query():
If you want just the no and Linkedin values:
df = df.query("FBC == 'IND'")[["no", "Linkedin"]]
If you want all three columns:
df = df.query("FBC == 'IND'")
(The FBC.eq('IND') form works too, but method calls inside query() may require engine="python" on some pandas versions, so the plain comparison is the safer choice.)
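A minimal runnable sketch of both variants (the column names come from the question; the data values are invented):
import pandas as pd

# Toy data mirroring the question's layout; the values are made up
df = pd.DataFrame({
    "FBC": ["USA", "IND", "GBR", "IND"],
    "no": [101, 102, 103, 104],
    "Linkedin": ["url_a", "url_b", "url_c", "url_d"],
})

# Just the "no" and "Linkedin" values for rows where FBC is "IND"
print(df.query("FBC == 'IND'")[["no", "Linkedin"]])

# All three columns for the matching rows
print(df.query("FBC == 'IND'"))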

Related

How can I create a pandas column based on another pandas column whose values are lists?

I am working with a dataframe where one of the columns holds a list of strings in each row. Each list contains a number of links (each list can have a different number of links). I want to create a new column, based on this column of lists, that keeps only the links containing the keyword "uploads".
In my example, the first entry of the column looks like this:
['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily',
'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png',
'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']
And I want to keep only
['https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png']
And put the clean version in a new column of the same dataframe.
Can you please suggest a way to do it?
I just found a way: I create a function that looks within a list for a specific pattern (in my case the keyword "uploads"):
def clean_alt_list(list_):
    # keep only the links that contain "uploads"
    return [s for s in list_ if "uploads" in s]
And then I apply this function to the column I am interested in:
df['clean_links'] = df['links'].apply(clean_alt_list)
IIUC, this should work for you:
import pandas as pd

df = pd.DataFrame({'url': [[
    'https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily',
    'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report',
    'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
    'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
    'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png',
    'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell',
]]})
df = df.explode('url').reset_index(drop=True)
df[df['url'].str.contains('uploads')]
Result:
url
2 https://static.seekingalpha.com/uploads/2017/1...
3 https://static.seekingalpha.com/uploads/2017/1...
4 https://static.seekingalpha.com/uploads/2017/1...
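If the goal is a new clean_links column on the original dataframe (one cleaned list per row) rather than a filtered, exploded frame, here is a sketch of one way to get there, starting again from the one-row df built above (before the explode/reset_index step): explode without resetting the index, filter, then group back by the original index.
exploded = df.explode('url')   # keep the original index for alignment
matches = exploded[exploded['url'].str.contains('uploads')]
# One cleaned list per original row; rows with no matches come back as NaN
df['clean_links'] = matches.groupby(level=0)['url'].agg(list)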

Pandas: Extracting data from sorted dataframe

Consider I have a dataframe with two columns: the first is 'Name', a string, and the second is 'Score', an int. There are many duplicate names, sorted so that all the 'Name1' rows are consecutive, followed by 'Name2', and so on. Each row may contain a different score, and the number of duplicates may differ for each unique name.
I wish to extract data from this dataframe into a new dataframe such that there are no duplicate names in the name column, and each name's corresponding score is the average of its scores in the original dataframe.
First, make use of the groupby() method, as mentioned by @QuangHong:
result = df.groupby('Name', as_index=False)['Score'].mean()
Then make use of the rename() method:
result = result.rename(columns={'Score': 'Avg Score'})
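A quick end-to-end sketch with invented scores, just to show the shape of the result:
import pandas as pd

df = pd.DataFrame({'Name': ['Name1', 'Name1', 'Name2', 'Name2', 'Name2'],
                   'Score': [10, 20, 30, 40, 50]})
result = df.groupby('Name', as_index=False)['Score'].mean()
result = result.rename(columns={'Score': 'Avg Score'})
print(result)
#     Name  Avg Score
# 0  Name1       15.0
# 1  Name2       40.0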

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
(image: monthly returns)
I need to select only the columns that match the values in another dataframe which includes the ids I want.
(image: permno list)
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permno list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
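A small sketch of that int-vs-string mismatch and the fix (the dataframe names follow the question; the data is invented):
import pandas as pd

all_rets = pd.DataFrame([[0.01, 0.02], [0.03, 0.04]],
                        columns=['10001', '10002'])   # string column headers
osr_curr_permnos = pd.DataFrame({'0': [10001]})       # integer permnos

# .astype(str) is the idiomatic equivalent of .apply(str) here
keep = osr_curr_permnos['0'].astype(str).to_list()
selected_rets = all_rets[keep].copy()
print(selected_rets)   # only column '10001'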

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: ['company'], which is a string, and ['publication_datetime'], which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
My groupby does not appear to work; I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to broadcast the result back to the original shape of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your original groupby() (with as_index=False) returns one row per group with three columns (company, publication_date, and the max), which is why pandas complains about 3 items being passed where 1 was expected. And even if you extracted just the values, like groups are combined together, so you would have fewer values than rows.
The transform() method instead returns the group result for each row of the dataframe, as a Series indexed like the original issuers dataframe, which makes it easy to create a new column.
Hope this helps! See the pandas documentation for transform for details.
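A compact sketch with made-up timestamps showing how transform() broadcasts each group's maximum back to every row:
import pandas as pd

issuers = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'publication_datetime': pd.to_datetime(
        ['2021-01-01 09:00', '2021-01-01 17:00', '2021-01-02 12:00']),
})
issuers['publication_date'] = issuers['publication_datetime'].dt.date
# Both 'A' rows get 2021-01-01 17:00; the 'B' row keeps its own timestamp
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
print(issuers)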
The thing is, by doing what you are doing you are trying to assign a whole DataFrame to a single column. Dropping as_index=False and extracting just the values works:
issuers['publication_datetime_max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].max().tolist()
Note that this only lines up when every (company, publication_date) group has exactly one row; otherwise the list is shorter than the dataframe and the assignment fails, so the transform() approach above is the robust one.
Hope this helps!

Pandas Data Frame Select column last value and replace with blank

I have a dataframe with lots of columns, and I want to replace the last value of a few of them with a blank. What is the best way to do that?
The dataframe is dynamic, in both its values and its length.
How can I blank out the last values of columns F and G?
Please excuse any typos, as I'm new here.
import numpy as np  # np.nan needs numpy imported

df.loc[df.index[-1], ['G', 'F']] = np.nan
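A self-contained sketch (the column names follow the question; the data is invented). Use an empty string instead of np.nan if you literally need a blank, and note that assigning NaN to an integer column upcasts it to float:
import numpy as np
import pandas as pd

df = pd.DataFrame({'F': [1.0, 2.0, 3.0],
                   'G': [4.0, 5.0, 6.0],
                   'H': [7.0, 8.0, 9.0]})

# df.index[-1] targets the last row whatever the dataframe's length is
df.loc[df.index[-1], ['F', 'G']] = np.nan
print(df)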