Is there a pandas function to get the variable names in a column? - pandas

I'm just thinking of a hypothetical dataframe (df) with around 50 columns and 30,000 rows, and one hypothetical column, e.g. Toy = ['Ball', 'Doll', 'Horse', ..., 'Sheriff', etc.].
Now I only have the name of the column (Toy), and I want to know which values are inside the column, without duplicates.
I'm thinking of an output like the .describe() function:
df['Toy'].describe()
but with more info, because right now I'm getting only this output:
count     30904
unique        7
top        Doll
freq      16562
Name: Toy, dtype: object
In other words, how do I get the 7 values in this column? I was thinking of something like copying the column and deleting the duplicated values, but I'm pretty sure there is a shorter way. Do you know the right code, or should I use another library?
Thank you so much!

You can use the unique() function to list all the unique values in a column. In your case, to list the unique values in the column named Toy in the dataframe df, the syntax would look like:
df["Toy"].unique()

You can also use .drop_duplicates(), which returns a pandas Series:
df['Toy'].drop_duplicates()
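A minimal sketch contrasting the two, using a made-up Toy column:
import pandas as pd

df = pd.DataFrame({'Toy': ['Ball', 'Doll', 'Horse', 'Doll', 'Ball']})

print(df['Toy'].unique())           # numpy array: ['Ball' 'Doll' 'Horse']
print(df['Toy'].drop_duplicates())  # pandas Series, keeping the original index
print(df['Toy'].nunique())          # 3 -- the count reported as "unique" by describe()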

Related

SQL fill null values with another column

I have a dataframe with null values, and these null values should be filled from another column of the same dataframe. I'd like to know how to take that other column and use its information to fill in the missing data. I'm using Deepnote
link:
https://deepnote.com
For example:
Column A | Column B
-------- | --------
Cell 1   | Cell 2
NULL     | Cell 4
My desired output:
Column A
--------
Cell 1
Cell 4
I think it could be done with subqueries and some WHERE clauses; any ideas?
Thanks for the question, and welcome to Stack Overflow.
It is not 100% clear which direction you need your solution to go, so I am offering two alternatives which should get you going.
Pandas way
You seem to be working with Pandas dataframes. The usual way to work with Pandas dataframes is to use Pandas' built-in functions. In this case, there is literally a function for filling null values: fillna. We can use it to fill values from another column like this:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})
# copy the original dataframe to a clean one
df_clean = df_raw.copy()
# apply fillna to fill null values in Column A from Column B
df_clean['Column A'] = df_clean['Column A'].fillna(df_clean['Column B'])
This will make df_clean look the way you need:
Column A
--------
Cell 1
Cell 4
Dataframe SQL way
You mentioned "queries" and "WHERE" in your question, which suggests you might be working with some combination of the Python and SQL worlds. Enter DuckDB, which supports exactly this; in Deepnote we call these Dataframe SQL blocks.
You can query e.g. CSV files directly from these Dataframe SQL blocks, but you can also use a previously defined dataframe:
SELECT * FROM df_raw
To fill the null values as you are requesting, we can use standard SQL and a function called coalesce, as Paul correctly pointed out:
SELECT COALESCE("Column A", "Column B") AS "Column A" FROM df_raw
This will also create what you need in the SQL world. In Deepnote specifically, it will also give you a dataframe:
Column A
--------
Cell 1
Cell 4
Feel free to check out my project in Deepnote with these examples, and go ahead and duplicate it if you want to iterate on the code a bit. There are also plenty more alternatives: if you're in a real SQL database and want to update existing columns, you would use an UPDATE statement, and in pure Python this is of course also possible with a loop or a lambda function.
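For completeness, a minimal sketch of the pure-Python route mentioned above, assuming the same df_raw as in the Pandas example; the row-wise lambda is just one way to express the fallback:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})
# row-wise fallback: keep Column A unless it is null, otherwise take Column B
df_raw['Column A'] = df_raw.apply(
    lambda row: row['Column B'] if pd.isna(row['Column A']) else row['Column A'],
    axis=1,
)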

How to handle columns of the same name in pandas

I have a dataframe which happens to have some columns with the same column name.
df_raw[column_name] # [141 rows x 2 columns]
I have code that extracts the unique values, but it does not work when the selection has more than one dimension.
ipdb> df_raw[column_name].unique()
*** AttributeError: 'DataFrame' object has no attribute 'unique'
I would prefer not to rewrite df_raw.columns to make all column names unique before processing. Is there a good way to handle this?
I have tried the code below, which gives an error:
ipdb> df_raw[column_name][0]
*** KeyError: 0
Questions:
1. How do I know how many columns have the same name? In the example above, I am expecting 2.
2. How do I refer to an individual column (for example, for updating purposes)?
To get the number of columns named column_name, you can use df_raw[column_name].shape[1]. You can access a dataframe by position rather than by name with the iloc syntax: df_raw.iloc[:, n] returns the nth column of the dataframe, and df_raw[column_name].iloc[:, n] returns the nth column named "column_name" (keep in mind that it's zero-indexed).
Also, if you want the unique column names, you can do set(df_raw.columns).
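A small runnable sketch, assuming a hypothetical frame with two columns both named 'value':
import pandas as pd

df_raw = pd.DataFrame([[1, 2], [1, 3], [4, 3]], columns=['value', 'value'])

print(df_raw['value'].shape[1])             # 2 -- how many columns share the name
print(df_raw['value'].iloc[:, 0].unique())  # unique() works once a single column is selected
print(df_raw['value'].iloc[:, 1].unique())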
I got the answer. Thank you for viewing.
1. How do I know how many columns have the same name? In the example above, I am expecting 2.
len(df_raw[column_name].columns)
2. How do I refer to an individual column (for example, for updating purposes)?
df_raw[column_name].iloc[:, 0]  # first column
df_raw[column_name].iloc[:, 1]  # second column, etc.

Pandas: Extracting data from sorted dataframe

Consider a dataframe with 2 columns: the first column is 'Name' (a string) and the second is 'Score' (an int). There are many duplicate names, and they are sorted so that all 'Name1' rows are consecutive, followed by 'Name2', and so on. Each row may contain a different score, and the number of duplicate names may differ for each unique string.
I wish to extract data from this dataframe and put it into a new dataframe such that there are no duplicate names in the Name column, and each name's corresponding score is the average of its scores in the original dataframe.
I've provided a picture for better visualization.
First, make use of the groupby() method, as mentioned by @QuangHong:
result = df.groupby('Name', as_index=False)['Score'].mean()
Finally, make use of the rename() method:
result = result.rename(columns={'Score': 'Avg Score'})
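Putting it together on a small made-up frame (the names and scores are just for illustration):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Name1', 'Name1', 'Name2', 'Name2', 'Name2'],
    'Score': [10, 20, 5, 7, 9],
})

# one row per name, with the mean of that name's scores
result = df.groupby('Name', as_index=False)['Score'].mean()
result = result.rename(columns={'Score': 'Avg Score'})
print(result)
#     Name  Avg Score
# 0  Name1       15.0
# 1  Name2        7.0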

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
[image: monthly returns]
I need to select only the columns that match the values in another dataframe which includes the ids I want.
[image: permno list]
I'm sure this is really quite simple, but I have been struggling for 2 days, and if someone has an easy solution it would be very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list()
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The Python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
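A minimal sketch of this, with hypothetical permno ids and returns:
import pandas as pd

all_rets = pd.DataFrame({'10001': [0.01, 0.02], '10002': [0.03, -0.01]})
osr_curr_permnos = pd.DataFrame({'0': ['10001']})

keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()  # new frame with just the wanted columns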
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permno list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
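A minimal sketch of the int-vs-string mismatch, again with made-up permno ids:
import pandas as pd

# the returns frame has string ids as column names
all_rets = pd.DataFrame({'10001': [0.01, 0.02], '10002': [0.03, -0.01], '10003': [0.00, 0.05]})
# the permno list holds the same ids as integers
osr_curr_permnos = pd.DataFrame({'0': [10001, 10003]})

# all_rets[osr_curr_permnos['0'].values] would raise a KeyError (int keys vs string columns)
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
print(all_rets[keep])  # now selects columns '10001' and '10003'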

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: ['company'], which is a string, and ['publication_datetime'], which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
My groupby does not appear to work. I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to cast the result back to the original dimensions of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your groupby() was returning a multi-indexed group object, which is why it complains about 3 values (the two group keys plus the values). But even if you returned just the values, groupby combines like groups together, so you would have fewer values than rows.
The transform() method returns the group result for each row of the dataframe, which makes it easy to create a new column. The returned values are a Series whose index matches the original index of the issuers dataframe.
Hope this helps! The pandas documentation for DataFrame.transform has more detail.
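A small runnable sketch of the transform() approach, with made-up companies and timestamps:
import pandas as pd

news = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'publication_datetime': pd.to_datetime(
        ['2021-01-01 09:00', '2021-01-01 17:30', '2021-01-02 08:00']
    ),
})

issuers = news[['company', 'publication_datetime']].copy()  # .copy() avoids SettingWithCopyWarning
issuers['publication_date'] = issuers['publication_datetime'].dt.date
# transform('max') broadcasts each group's maximum back onto every row
issuers['publication_datetime_max'] = (
    issuers.groupby(['company', 'publication_date'])['publication_datetime']
           .transform('max')
)
print(issuers)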
The thing is, by doing what you are doing, you are trying to assign a whole dataframe (the group keys plus the aggregated values) to a single column. An alternative to transform() is to compute the group maxima separately and merge them back onto the original frame:
max_per_group = issuers.groupby(['company', 'publication_date'], as_index=False)['publication_datetime'].max().rename(columns={'publication_datetime': 'publication_datetime_max'})
issuers = issuers.merge(max_per_group, on=['company', 'publication_date'])
Hope this helps!