Subtract a specific row from a csv using python - pandas

I have two csv files: one containing data, the other one containing a single row with the same columns as the first file. I am trying to subtract the one row from the second file from all the rows from the first file using pandas.
I have tried the following, but to no avail.
df = df.subtract(row, axis=1)

You're looking for the "drop" method. From pandas docs:
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
drop by index:
df.drop([0, 1])
A B C D
2 8 9 10 11
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
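If the question is read arithmetically (subtract the single row's values from every row of the first file), the following is a minimal sketch under stated assumptions. The two frames are hypothetical stand-ins for the CSV files, and the values in row_df are invented for illustration. It assumes the attempt in the question failed because row was a one-row DataFrame, which pandas aligns on both index and columns; taking the row as a Series avoids that.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (normally loaded with pd.read_csv).
df = pd.DataFrame({"A": [0, 4, 8], "B": [1, 5, 9], "C": [2, 6, 10], "D": [3, 7, 11]})
row_df = pd.DataFrame({"A": [1], "B": [1], "C": [2], "D": [3]})

# Take the single row as a Series whose index holds the column names.
row = row_df.iloc[0]

# axis=1 matches the Series index against df's columns,
# so the same row is subtracted from every row of df.
result = df.subtract(row, axis=1)
print(result)
```

With a Series, df - row gives the same result, because DataFrame and Series arithmetic aligns on columns by default.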

Merge and interleave rows of two dataframes

Suppose we have:
>>> df1
A B
0 1 a
1 2 a
2 3 a
3 4 a
>>> df2
A B
0 1 b
1 2 b
2 3 b
3 5 b
I would like to merge them on "A" and then list them by interleaving rows like:
A B
0 1 a
0 1 b
1 2 a
1 2 b
2 3 a
2 3 b
I tried merge, but it lists them column by column. For example, if I have 3 or more data frames, merge can merge them on some columns, but my problem would then be to interleave them.
If you need to match on A, filter the rows with Series.isin in boolean indexing, then pass both results to concat and sort with DataFrame.sort_index:
df = pd.concat([df1[df1.A.isin(df2.A)],
df2[df2.A.isin(df1.A)]]).sort_index(kind='stable')
print (df)
A B
0 1 a
0 1 b
1 2 a
1 2 b
2 3 a
2 3 b
EDIT:
For general data, it is possible to sort by A and create a default index so that the rows interleave correctly:
df = (pd.concat([df1[df1.A.isin(df2.A)].sort_values('A', kind='stable').reset_index(drop=True),
df2[df2.A.isin(df1.A)].sort_values('A', kind='stable').reset_index(drop=True)])
.sort_index(kind='stable'))
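As a self-contained sketch of the first approach, the snippet below rebuilds the two sample frames from the question and interleaves the matching rows. It relies on both frames sharing the same default index for matching rows, which holds for this sample but is an assumption for other data.

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["a"] * 4})
df2 = pd.DataFrame({"A": [1, 2, 3, 5], "B": ["b"] * 4})

# Keep only the rows whose A value appears in the other frame.
left = df1[df1["A"].isin(df2["A"])]
right = df2[df2["A"].isin(df1["A"])]

# Stack the frames, then sort by index. A stable sort keeps the
# df1 row ahead of the df2 row for each repeated index label.
result = pd.concat([left, right]).sort_index(kind="stable")
print(result)
```

The stable sort is what guarantees the a-then-b order within each pair; an unstable sort could swap rows that share an index label.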

renaming multiple cells below a specific cell with pandas

I am trying to merge two Excel tables, but the rows don't line up because in one column information is split over several rows whereas in the other table it is contained in a single cell.
Is there a way with pandas to rename the cells in Table A so that they line up with the rows in Table B?
df_jobs = pd.read_excel(r"jobs.xlsx", usecols="Jobs")
df_positions = pd.read_excel(r"orders.xlsx", usecols="Orders")
Sample files:
https://drive.google.com/file/d/1PEG3nZc0183Gh-8A2xbIs9kEZIWLzLSa/view?usp=sharing
https://drive.google.com/file/d/1HfQ4q7pjba0TKNJAHBqcGeoqdY3Yr3DB/view?usp=sharing
I suppose your input data looks like:
>>> df1
A i j
0 O-20-003049 NaN NaN
1 1 0.643284 0.834937
2 2 0.056463 0.394168
3 3 0.773379 0.057465
4 4 0.081585 0.178991
5 5 0.667667 0.004370
6 6 0.672313 0.587615
7 O-20-003104 NaN NaN
8 1 0.916426 0.739700
9 O-20-003117 NaN NaN
10 1 0.800776 0.614192
11 2 0.925186 0.980913
12 3 0.503419 0.775606
>>> df2
A x y
0 O-20-003049.01 0.593312 0.666600
1 O-20-003049.02 0.554129 0.435650
2 O-20-003049.03 0.900707 0.623963
3 O-20-003049.04 0.023075 0.445153
4 O-20-003049.05 0.307908 0.503038
5 O-20-003049.06 0.844624 0.710027
6 O-20-003104.01 0.026914 0.091458
7 O-20-003117.01 0.275906 0.398993
8 O-20-003117.02 0.101117 0.691897
9 O-20-003117.03 0.739183 0.213401
We start by renaming the rows in column A:
# create a boolean mask
mask = df1["A"].str.startswith("O-")
# rename all rows
df1["A"] = df1.loc[mask, "A"].reindex(df1.index).ffill() \
+ "." + df1["A"].str.pad(2, fillchar="0")
# remove unwanted rows (where mask==True)
df1 = df1[~mask].reset_index(drop=True)
>>> df1
A i j
0 O-20-003049.01 0.643284 0.834937
1 O-20-003049.02 0.056463 0.394168
2 O-20-003049.03 0.773379 0.057465
3 O-20-003049.04 0.081585 0.178991
4 O-20-003049.05 0.667667 0.004370
5 O-20-003049.06 0.672313 0.587615
6 O-20-003104.01 0.916426 0.739700
7 O-20-003117.01 0.800776 0.614192
8 O-20-003117.02 0.925186 0.980913
9 O-20-003117.03 0.503419 0.775606
Now, we are able to merge data on column A:
>>> pd.merge(df1, df2, on="A")
A i j x y
0 O-20-003049.01 0.643284 0.834937 0.593312 0.666600
1 O-20-003049.02 0.056463 0.394168 0.554129 0.435650
2 O-20-003049.03 0.773379 0.057465 0.900707 0.623963
3 O-20-003049.04 0.081585 0.178991 0.023075 0.445153
4 O-20-003049.05 0.667667 0.004370 0.307908 0.503038
5 O-20-003049.06 0.672313 0.587615 0.844624 0.710027
6 O-20-003104.01 0.916426 0.739700 0.026914 0.091458
7 O-20-003117.01 0.800776 0.614192 0.275906 0.398993
8 O-20-003117.02 0.925186 0.980913 0.101117 0.691897
9 O-20-003117.03 0.503419 0.775606 0.739183 0.213401
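The renaming step can be tried in isolation with a reduced, made-up frame. This sketch assumes column A holds strings; values read from Excel may arrive as integers, in which case an astype(str) conversion would be needed first. It uses Series.where and str.zfill as an equivalent variant of the reindex and pad calls above.

```python
import pandas as pd

# Reduced, hypothetical version of df1: order numbers followed by position rows.
df1 = pd.DataFrame({
    "A": ["O-20-003049", "1", "2", "O-20-003104", "1"],
    "i": [None, 0.1, 0.2, None, 0.3],
})

# True for the rows that hold an order number.
mask = df1["A"].str.startswith("O-")

# Keep order numbers only, then carry each one down over its position rows.
prefix = df1["A"].where(mask).ffill()

# Zero-pad the position to two digits ("1" -> "01").
suffix = df1["A"].str.zfill(2)

df1["A"] = prefix + "." + suffix

# Drop the order-number rows themselves and renumber the index.
df1 = df1[~mask].reset_index(drop=True)
print(df1)
```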

the column in csv that comes from the index of DataFrame does not have a header name

Here is a pandas DataFrame:
>>> print(df)
A B C
0 0 1 2
1 3 4 5
2 6 7 8
With df.to_csv('df.csv') I got a file in which the column that comes from the index of the DataFrame does not have a header name. Is it possible to specify a name for that column with pandas?
Try rename_axis:
df.rename_axis('index').to_csv('df.csv')
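A short sketch of what this produces, using the frame from the question and writing to a string instead of a file so the header can be inspected. The index_label parameter of to_csv is shown as an alternative that is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 3, 6], "B": [1, 4, 7], "C": [2, 5, 8]})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_a = df.rename_axis("index").to_csv()

# Alternative: name the index column at write time, leaving df unchanged.
csv_b = df.to_csv(index_label="index")

print(csv_a.splitlines()[0])  # header line of the CSV
```

rename_axis returns a new frame with a named index, while index_label only affects the written header.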

pandas read csv is returning extra unknown column

I am creating a csv file from pandas dataframe by combining two lists using:
df= pd.DataFrame(list(zip(patients_full, labels)),
columns=['id','cancer'])
df.to_csv("labels.csv")
But when I read the CSV back, an unknown column named "Unnamed: 0" shows up. How do I remove it?
Unnamed: 0 id cancer
0 0 HF0953.npy 1
1 1 HF1058.npy 3
2 2 HF1071.npy 3
3 3 HF1122.npy 3
4 4 HF1235.npy 1
5 5 HF1280.npy 2
6 6 HF1344.npy 1
7 7 HF1463.npy 1
8 8 HF1489.npy 1
9 9 HF1490.npy 2
10 10 HF1587.npy 2
11 11 HF1613.npy 2
This happens because to_csv("labels.csv") saves the index column by default. The index of the data frame you saved had no name, so read_csv("labels.csv") treats it like any other column, and its blank header becomes Unnamed: 0. To avoid this you have 2 options:
Option 1 - read the saved index back as the index:
read_csv("labels.csv", index_col=0)
Option 2 - not save the index:
to_csv("labels.csv", index=False)
That column in your output is the index of the dataframe. To leave it out of the output: df.to_csv('labels.csv', index=False). More information is available in the pandas docs for DataFrame.to_csv.
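The round trip can be reproduced without touching the disk. In this sketch the two-row frame is a cut-down, hypothetical version of the one in the question, and io.StringIO stands in for labels.csv.

```python
import io

import pandas as pd

df = pd.DataFrame({"id": ["HF0953.npy", "HF1058.npy"], "cancer": [1, 3]})

# Default behaviour: the unnamed index is written as a first column with a blank header.
with_index = df.to_csv()
default_read = pd.read_csv(io.StringIO(with_index))

# Reading side: treat the first column as the index again.
index_read = pd.read_csv(io.StringIO(with_index), index_col=0)

# Writing side: do not write the index at all.
no_index = df.to_csv(index=False)
clean_read = pd.read_csv(io.StringIO(no_index))

print(list(default_read.columns))
print(list(index_read.columns))
print(list(clean_read.columns))
```

Writing with index=False is usually the simpler choice when the index carries no information, as with the default 0..n-1 index here.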

Pandas dropping columns by index drops all columns with same name

Consider the following dataframe, which has columns with the same name (apparently this does happen; I currently have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
>>> df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index, only the column at that index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1], axis=1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I chose misleading values for the first column, fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0 10
1 11
2 12
3 13
4 14
So this selects all rows and then applies the column mask generated by duplicated, inverted with ~, so that only the first occurrence of each column name is kept.
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.0, and removed in v1.0), you should use either of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to @DavideFiocco for alerting me
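A runnable sketch of the mask approach on the question's data. The second half, which drops a column purely by position with iloc, is an addition not taken from the answer; it is included because drop works on labels and therefore removes every column that shares the name.

```python
import pandas as pd

# Rebuild the frame from the question: two columns that are both named "a".
df = pd.DataFrame({"a": range(10, 15), "b": range(5, 10)})
df.columns = ["a", "a"]

# duplicated() marks the second and later occurrences of each name;
# inverting the mask keeps only the first occurrence.
keep_first = df.loc[:, ~df.columns.duplicated()]

# Position-based alternative: keep every column except the last one.
last = df.shape[1] - 1
positions = [i for i in range(df.shape[1]) if i != last]
by_position = df.iloc[:, positions]

print(keep_first)
```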