pandas dataframe prevent inserting duplicate records

I'm trying to accomplish something quite simple, but my searches so far have been unsuccessful.
I have an existing dataframe generated from a multi-sheet spreadsheet.
From each sheet I loaded a number of records; each sheet corresponds to a value of the 'Match' column in the df.
Now I need to insert multiple records from OUTSIDE the spreadsheet, so it's easy to prepare a dict like this:
for m in range(from_match, to_match):
    d = {'Minuto': 0, 'Azione': event, 'Giocatore': player, 'Match': m, 'Extra': 1}
The 'Extra' field indicates that these records are NOT loaded from the spreadsheet.
The problem is that df.append(d, ignore_index=True) always inserts the record, whether or not it already exists.
That is not the result I want.
The ideal (for me) solution would be something like this:
for m in range(from_match, to_match):
    d = {'Minuto': 0, 'Azione': event, 'Giocatore': player, 'Match': m, 'Extra': 1}
    # insert some check here
    if record d does not exist:
        df = df.append(d, ignore_index=True)
I've played with df.isin, and I've seen solutions based on merging dataframes, but it seems to me that something like this shouldn't be complicated at all.
Any suggestions?
Thanks
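One way to implement that check, as a minimal sketch: the stand-in values for df, event, player, and the match range below are hypothetical, the duplicate test assumes a record is identified by the four spreadsheet columns, and pd.concat replaces df.append, which is deprecated in recent pandas.
import pandas as pd

# hypothetical stand-ins for the question's variables
df = pd.DataFrame(columns=['Minuto', 'Azione', 'Giocatore', 'Match', 'Extra'])
event, player, from_match, to_match = 'goal', 'Rossi', 1, 4

key_cols = ['Minuto', 'Azione', 'Giocatore', 'Match']
for m in range(from_match, to_match):
    d = {'Minuto': 0, 'Azione': event, 'Giocatore': player, 'Match': m, 'Extra': 1}
    # a row already exists if some existing row matches d on every key column
    exists = df[key_cols].eq(pd.Series({k: d[k] for k in key_cols})).all(axis=1).any()
    if not exists:
        # df.append is deprecated in recent pandas; pd.concat is the current idiom
        df = pd.concat([df, pd.DataFrame([d])], ignore_index=True)
An equivalent alternative is to append unconditionally and call df.drop_duplicates(subset=key_cols) afterwards.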

Related

SQL fill null values with another column

I have a dataframe with null values, and those null values should be filled from another column of the same dataframe. I would like to know how to take that other column and use its information to fill in the missing data. I'm using Deepnote.
link:
https://deepnote.com
For example:
Column A    Column B
Cell 1      Cell 2
NULL        Cell 4
My desired output:
Column A
Cell 1
Cell 4
I think it should be done with subqueries and some WHERE clauses. Any ideas?
Thanks for the question and welcome to StackOverflow.
It is not 100% clear which direction you need your solution to go, so I am offering two alternatives which I think should get you going.
Pandas way
You seem to be working with Pandas dataframes. The usual way to work with Pandas dataframes is to use Pandas' built-in functions. In this case there is literally a function for filling null values, called fillna. We can use it to fill values from another column like this:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})
# copying the original dataframe to a clean one
df_clean = df_raw.copy()
# applying fillna to fill null values in 'Column A' from 'Column B'
df_clean['Column A'] = df_clean['Column A'].fillna(df_clean['Column B'])
This will make your df_clean look like you need:
Column A
Cell 1
Cell 4
Dataframe SQL way
You mentioned "queries" and "where" in your question, which suggests you might be combining the Python and SQL worlds. Enter DuckDB, which supports exactly this; in Deepnote we call these Dataframe SQL blocks.
You can query e.g. CSV files directly from these Dataframe SQL blocks, but you can also use a previously defined Dataframe.
select * from df_raw
In order to fill the null values as you are requesting, we can use standard SQL and a function called coalesce, as Paul correctly pointed out.
select coalesce("Column A", "Column B") as "Column A" from df_raw
This will also create what you need in the SQL world. In Deepnote, specifically, this will also give you a Dataframe.
Column A
Cell 1
Cell 4
Feel free to check out my project in Deepnote with these examples, and go ahead and duplicate it if you want to iterate on the code a bit. There are also plenty more alternatives: if you're in a real SQL database and want to update existing columns, you would use an UPDATE statement, and if you're in pure Python, this is of course also possible with a loop or a lambda function.
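For completeness, a sketch of that pure-Python variant with a lambda, reusing the df_raw definition from the Pandas example above:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})

# row-wise fallback: keep Column A where present, otherwise take Column B
df_raw['Column A'] = df_raw.apply(
    lambda row: row['Column A'] if pd.notna(row['Column A']) else row['Column B'],
    axis=1,
)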

Adding columns from every other row in a pandas dataframe

In the linked picture you can see the start of my data frame. I would like to make two new columns consisting of the values (confirmed_cases and deaths) and to get rid of the 'Type' column. Essentially I want one row of data for each county, with confirmed_cases and deaths columns added using the values already in the data. I tried the code below, but obviously the length of the values does not match the length of the index.
Any suggestions?
apidata['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata['Deaths'] = apidata['values'].iloc[1::2].values
(Sorry about the link to the photo, I am too new to the site to be able to just include the photo in the post)
Maybe there's a way to double how many times each value appears in the new column? Then the first five deaths would be [5, 5, 26, 26, 0] and I could just delete every other row?
I ended up figuring it out by creating a second dataframe that dropped half of the rows (every other one) from the first dataframe, and then adding the values from the original dataframe into the new one.
apidata2 = apidata.iloc[::2].copy()  # keep every other row; .copy() avoids SettingWithCopyWarning
apidata2['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata2['Deaths'] = apidata['values'].iloc[1::2].values
apidata2.head()
Finished output
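For reference, the same reshape can be done without slicing alternate rows by using pivot. A sketch with a hypothetical 'county' column name and made-up numbers ('Type' and 'values' come from the question):
import pandas as pd

# hypothetical long-format data shaped like apidata in the question
apidata = pd.DataFrame({
    'county': ['Adams', 'Adams', 'Bexar', 'Bexar'],
    'Type': ['confirmed_cases', 'deaths', 'confirmed_cases', 'deaths'],
    'values': [100, 5, 260, 26],
})

# one row per county, one column per Type; no need to delete alternate rows
wide = apidata.pivot(index='county', columns='Type', values='values').reset_index()
print(wide)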

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
monthly returns
I need to select only the columns that match the values in another dataframe which includes the ids I want.
permno list
I'm sure this is really quite simple, but I have been struggling for 2 days, and if someone has an easy solution it would be very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list().
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The python code might look something like:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permnos list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe, which has string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
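The cast itself is usually written with astype. A minimal sketch, with hypothetical stand-ins for the two frames:
import pandas as pd

# hypothetical stand-ins shaped like the question's dataframes
all_rets = pd.DataFrame({'10001': [0.01, -0.02], '10002': [0.03, 0.00], '10003': [0.02, 0.01]})
osr_curr_permnos = pd.DataFrame({'0': [10001, 10003]})

# cast the integer permnos to strings so they match the column headers
keep = osr_curr_permnos['0'].astype(str).drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
print(selected_rets.columns.tolist())  # ['10001', '10003']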

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values; I will only be interested in NodeID and GSE.
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, uses pivot, and gets the following result:
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This gives me an index with two levels (the header is blank for the first one, and 'Layer' for the second), e.g. (1, 'L1').
What I need help on is:
I cannot find a way to rename that first blank level in the index to 'NodeID'.
I want to name it that so I can do a lookup using NodeID in both dataframes and bring the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first column in the second dataframe and cannot seem to find a solution. Any ideas help at this point. I think my pivot function might be wrong...
This is a picture of dataframe #2 before the pivot. The numbers 1-4 are the Node IDs.
When I export it to csv to see what the dataframe looks like, I get this:
Try
df = df.rename(columns={"Index": "your preferred name"})
If it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})
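If the blank name is actually an index level rather than a column (which is what the pivot().T in the question produces), rename_axis can name it directly. A sketch with hypothetical data:
import pandas as pd

# minimal stand-in for the pivoted frame with a two-level row index
sim = pd.DataFrame(
    {'value': [10.5, 9.8]},
    index=pd.MultiIndex.from_tuples([(1, 'L1'), (1, 'L2')]),
)

# name both index levels so 'NodeID' can be used for lookups and merges
sim = sim.rename_axis(index=['NodeID', 'Layer'])
print(sim.index.names)  # ['NodeID', 'Layer']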

pandas merge produce duplicate columns

import pandas as pd

n1 = pd.DataFrame({'zhanghui': [1, 2, 3, 4], 'wudi': [17, 'gx', 356, 23], 'sas': [234, 51, 354, 123]})
n2 = pd.DataFrame({'zhanghui_x': [1, 2, 3, 5], 'wudi': [17, 23, 'sd', 23], 'wudi_x': [17, 23, 'x356', 23], 'wudi_y': [17, 23, 'y356', 23], 'ddd': [234, 51, 354, 123]})
The code above defines two DataFrame objects. I want to use the 'zhanghui' field from n1 and the 'zhanghui_x' field from n2 as the "on" fields to merge n1 and n2, so my code looks like this:
n1.merge(n2,how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
and the resulting columns come out like this:
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appear, such as 'wudi_x' and 'wudi_y'.
So is this an internal pandas problem, or am I using pd.merge incorrectly?
From the pandas documentation, the merge() function has the following signature:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
where suffixes denotes the default suffix strings attached to overlapping columns, with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
if the first dataFrame has column 'column_name_x' and the second dataFrame has column 'column_name', then there are no overlapping columns and therefore no suffixes are attached.
#case2
if the first dataFrame has columns 'column_name' and 'column_name_x', and the second dataFrame also has column 'column_name', the default suffixes attach to the overlapping columns, and therefore the first frame's 'column_name' becomes 'column_name_x', resulting in a duplicate of an already existing column.
You can, however, pass a None value to one (not all) of the suffixes to ensure that the column names of a certain dataFrame remain as-is.
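A minimal sketch reproducing case 2 with hypothetical column names (behavior as observed in the question):
import pandas as pd

left = pd.DataFrame({'key': [1], 'wudi': ['a']})
right = pd.DataFrame({'key': [1], 'wudi': ['b'], 'wudi_x': ['c']})

# 'wudi' overlaps, so it gets suffixed to 'wudi_x'/'wudi_y', colliding
# with the pre-existing 'wudi_x' column from the right frame
merged = left.merge(right, on='key')
print(merged.columns.tolist())  # ['key', 'wudi_x', 'wudi_y', 'wudi_x']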
Your approach is right; after merging, pandas automatically appends suffixes such as _x, _y, etc. to columns that collide with the original headers.
You can first select which columns to merge and then proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use],how = 'inner',left_on = 'zhanghui',right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns, it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class 'pandas.core.indexes.base.Index'>
Then I tried the code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns)]
It worked fine; the resulting columns came out like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So @S Ringne's method really resolved my problem.
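For reference, on recent pandas versions the same set difference can also be written with Index.difference, which avoids the TypeError:
# note: difference returns a sorted Index of the columns unique to n2
cols_to_use = n2.columns.difference(n1.columns)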
=============================================
Pandas simply adds a suffix such as '_x' to resolve duplicate column names when merging two DataFrame objects.
But what happens if a name of the form 'a-column-name' + '_x' already exists in one of the DataFrames? I used to think pandas would check whether such a name already appears, but apparently it doesn't have this check?