How do I change the index of a pandas dataframe object so as not to get null values in the dataframe entries?

I am not sure what the issue is: I combined three dataframes into one and tried changing the index of the combined dataframe. The following is what I have done:
1) I first combined (concatenated) three dataframes into a 'combo' dataframe. Below is an excerpt ('TSP_JuMP_Obtained_Solu') of one of the three. The index goes from 0-9 for all three dataframes as well as the combined one.
2) I then used the following line of code to combine them:
df_solu_tsp = pd.concat([list_TSP, list_Scenario1, list_Scenario2],
                        axis=1, sort=True)
3) I subsequently used the following line of code to change the index of the combined dataframe (df_solu_tsp):
df_solu_tsp = df_solu_tsp.reindex(proTy_uniq_list)
NB: 'proTy_uniq_list' is a list with membership as shown below:
[u'lau15', u'gr17', u'fri26', u'bays29', u'dantzig42', u'KATRINA_38',
u'HARVEY_50', u'HARVEY_100', u'HARVEY_200', u'HARVEY_415']
Below is the result of the combined dataframe (df_solu_tsp):
Thank you in advance for the help.

Without having an example DataFrame, I will try to answer as well as possible:
Solution 1
As Peter Leimbigler mentioned in the comments:
df_solu_tsp = df_solu_tsp.set_index(proTy_uniq_list)
This replaces your original index with the new one, which in this case is a list of equal length. Unlike reindex, set_index does not try to align the rows to the old labels; reindex looks each new label up in the existing index and fills the rows it cannot find with NaN, which is exactly why you were getting null values.
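A minimal sketch of this approach, with hypothetical data standing in for your combined frame:
import pandas as pd

# Stand-in for the combined dataframe, with the default 0-2 RangeIndex.
df = pd.DataFrame({'TSP': [1, 2, 3], 'Scenario1': [4, 5, 6]})

new_idx = ['lau15', 'gr17', 'fri26']  # same length as the frame

# Wrapping the list in pd.Index makes set_index treat it as new labels
# rather than as a list of column names; the data itself is untouched,
# so no NaN values appear.
df = df.set_index(pd.Index(new_idx))
print(df)
#        TSP  Scenario1
# lau15    1          4
# gr17     2          5
# fri26    3          6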
Solution 2
As mentioned in the pandas docs, set_index also accepts a mix of arrays and existing column names, which produces a MultiIndex:
df_solu_tsp.set_index([pd.Index(proTy_uniq_list), 'proTy'])
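A tiny sketch of what that produces (hypothetical data; 'proTy' stands in for your existing column):
import pandas as pd

df = pd.DataFrame({'proTy': ['a', 'b'], 'val': [1, 2]})

# The passed list becomes level 0 of a MultiIndex and the existing
# 'proTy' column becomes level 1.
df = df.set_index([pd.Index(['lau15', 'gr17']), 'proTy'])
print(df.index)
# MultiIndex([('lau15', 'a'),
#             ( 'gr17', 'b')],
#            names=[None, 'proTy'])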
Solution 3
I see that you're creating a dataframe from three lists, so we can go a step back and create your data in one go:
df_solu_tsp = pd.DataFrame({'TSP_JuMP_Obtained_Solu': list_TSP,
                            'Scenario1': list_Scenario1,
                            'Scenario2': list_Scenario2},
                           index=proTy_uniq_list)
Example solution 3
data1 = ['hi', 'goodbye']
data2 = ['hello', 'bye']
idx = ['arriving', 'leaving']
df = pd.DataFrame({'column1': data1,
                   'column2': data2}, index=idx)
print(df)

          column1 column2
arriving       hi   hello
leaving   goodbye     bye

Related

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B rather than 0 and 1. I also need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an AttributeError:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd

# Recreate the column above.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# First, make sure the column is of str type; second, split it on the
# '\n' separator; third, pass expand=True so the two halves of each
# value become two new columns.
test = df['A\nB'].astype('str').str.split('\n', expand=True)

# Rename the resulting 0/1 columns.
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side. The issue is that df[colNew] is still a DataFrame, because colNew is an Index of labels rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using .iloc[:, 0].
Then another line to split the column headers:
df2 = df[colNew].iloc[:, 0].str.split('\n', n=2, expand=True)
df2.columns = colNew[0].split('\n')  # reuse the original header: ['A', 'B']
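Putting the two answers together, here is a sketch that generalizes to any number of newline-containing headers. The data is hypothetical, and it assumes each affected header splits into the same number of names as its values do:
import pandas as pd

# Hypothetical frame mimicking the merged Camelot output.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# Find every header containing a newline (there may be several).
for col in df.columns[df.columns.str.contains('\n')]:
    parts = df[col].astype(str).str.split('\n', expand=True)
    parts.columns = col.split('\n')  # reuse the header itself: 'A', 'B'
    df = df.drop(columns=col).join(parts)

print(df)
#    A  B
# 0  1  2
# 1  2  3
# 2  3  4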

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values--Will only be interested in NodeID AND GSE
sta = pd.read_csv(filename)
Dataframe #2
Reads the file, uses pivot, and gets the following result:
sim = pd.read_csv(headout, index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index=None, columns='Layer').T
This leaves the rows with a two-level index (the header is blank for the first level and 'Layer' for the second), e.g. 1, L1.
What I need help on is:
I cannot find a way to rename that first blank level of the index to 'NodeID'.
I want to name it that so that I can do the lookup and use NodeID in both dataframes, bringing the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first column in the second dataframe and I cannot seem to find a solution. Any ideas help at this point. I think my pivot function might be wrong...
This is a picture of dataframe #2 before pivot. The number 1-4 are the Node ID.
When I export it to CSV to see what the dataframe looks like, I get this:
Try
df = df.rename(columns={"Index": "your preferred name"})
If it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})
(rename returns a new DataFrame, so assign the result back or pass inplace=True.)
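Since the pivot + .T leaves a two-level row index whose first level is unnamed, a sketch using rename_axis may be closer to what you need (assuming 'NodeID' is the name you want for that unnamed level):
# rename(columns=...) only touches column labels; rename_axis names
# the index levels themselves.
sim = sim.rename_axis(index=['NodeID', 'Layer'])

# Optionally turn both levels into ordinary columns for the
# lookup/merge against the first dataframe.
sim = sim.reset_index()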

Search and delete items in a list of dataframes

Let's say I create a list of dataframes by:
import pandas as pd

lDfs = []
for i in range(0, 3):
    lDfs.append(pd.read_csv('SomeTable.csv'))
then I have a list of 3 dataframes:
lDfs[0]
lDfs[1]
lDfs[2]
Let's say each dataframe has the following structure:
Date,Open,High,Low,Close,Volume
0 2020-03-02,3355.330078,3406.399902,3257.989990,3338.830078,90017600
1 2020-03-03,3355.520020,3448.239990,3354.300049,3371.969971,79445600
Now I want to search each dataframe in that list for a string pattern:
search = 'null'
and drop every row that includes it from that specific dataframe. How can I do that?
Thank you!
It turned out that 'null' was interpreted by pandas as NaN, so DataFrame.dropna does the trick pretty easily:
for i in range(0, len(lDfs)):
    lDfs[i].dropna(inplace=True)
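If the pattern were an actual string rather than something pandas parses as NaN, a sketch along these lines would work (assuming you want to match the pattern anywhere in any column):
# Keep only the rows in which no cell contains the search string.
search = 'null'
for i, df in enumerate(lDfs):
    mask = df.astype(str).apply(
        lambda col: col.str.contains(search, regex=False)).any(axis=1)
    lDfs[i] = df[~mask]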

How to name columns?

I have a pandas Data Frame where some of the id's are repeated a few times. I've written this code:
df = df["id"].value_counts()
and got this output
What should I do to get something like in the following image?
Thanks
As Quang Hoang answered, value_counts sets the values you are counting as the index. Therefore, in order to get the id and the count as columns, you need to do two things:
Make the counts a column: to_frame(name='B')
Reset the index to make the ids another column, which we'll rename to the desired name: .reset_index().rename(columns={'index': 'A'})
So in one line it'll be:
df = df["id"].value_counts().to_frame(name='B').reset_index().rename(columns={'index': 'A'})
Another possible way is:
col = ["A", "B"]
df.columns = col
(This assumes df is already a two-column DataFrame, e.g. after the to_frame().reset_index() step above.)

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1.
I want to extract an equal number of rows that have 0s and 1s in the "target" column. I was thinking of using the pandas sampling function, but I am not sure how to request an equal number of samples from both classes based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randn(12000),
                   'target': np.random.randint(low=0, high=2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop=True)

new_df.target.value_counts()
1    5000
0    5000
Edit: Use GroupBy.sample
You get similar results using DataFrameGroupBy.sample (available since pandas 1.1):
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
Also found this to be a good method:
counts = df['target'].value_counts()
df['weights'] = np.where(df['target'] == 1, 1 / counts[1], 1 / counts[0])
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percent of data you want back from the original dataframe. Note that the weights must be inversely proportional to each class's frequency; giving both classes the same weight would simply reproduce the original class ratio.
You will have to run a df0.sample(n=5000) and df1.sample(n=5000) and then combine df0 and df1 into a dfsample dataframe. You can create df0 and df1 with boolean masks, e.g. df[df['target'] == 0] (note that df.filter() selects labels, not row values). If you provide sample data I can help you construct that logic.
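A sketch of that split-and-recombine approach, using the same kind of hypothetical data as the earlier answer (it assumes each class has at least 5000 rows):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randn(12000),
                   'target': np.random.randint(low=0, high=2, size=12000)})

# Boolean masks split the classes; sample each half, then stitch them
# back together and shuffle.
df0 = df[df['target'] == 0].sample(n=5000, random_state=1)
df1 = df[df['target'] == 1].sample(n=5000, random_state=1)
dfsample = pd.concat([df0, df1]).sample(frac=1, random_state=1)

print(dfsample['target'].value_counts())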