I have a large dataframe, df_vol.
It has about 20 columns and 500k rows.
In the column named "FTID", three of the values are "###"; every other value in that column is unique.
I want to search for and change each instance of "###" to be unique.
Either of these two options would be acceptable:
"###1", "###2", "###3", or
"###" + str(row_index) for each i.e. concatenate "###" with the row index
The code I have tried is:
df_vol["FTID"] = df_vol["FTID"].apply(lambda x: "###" if x == "###" else None)
I know the above code doesn't actually change anything, but I don't know how to make it pull only the row index or use an incremental number. I have tried so many different things but I'm a noob, and I'm stabbing in the dark.
It seems to me it should look like:
df_vol["FTID"] = df_vol["FTID"].apply(lambda x: "###" + df_vol.index.astype(str) if x == "###" else None)
but what little success I have had just returns something like this for the new values:
Int64Index([ 423, 424, 425, 426, 427, 428, 429, 430,
Going to go collect up all my hair now and see if I can glue it back to my head ;)
You can access the index with x.name. I think you need something like:
df_vol["FTID"] = df_vol["FTID"].apply(lambda x: f"###{x.name}" if x == "###" else x)
(I didn't get why you would otherwise set the value to None; since the other values are unique, I think they should be left unchanged when not equal to "###".)
Edit: apply works slightly differently when used on a Series versus a DataFrame. On a Series, the lambda receives each scalar value, so x.name is not available there.
In your case it would be best to create a function and apply it to your whole dataframe:
def myfunc(row):
    if row['FTID'] == "###":
        row['FTID'] = f"###{row.name}"
    return row

df_vol = df_vol.apply(myfunc, axis=1)
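For a 500k-row frame, a vectorized boolean mask should also be considerably faster than a row-wise apply. A minimal sketch (my addition, assuming the default integer index):

mask = df_vol["FTID"] == "###"
# Build the replacement strings straight from the index labels of the
# matching rows, then write them back through a single .loc assignment.
df_vol.loc[mask, "FTID"] = "###" + df_vol.index[mask].astype(str)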
I've scraped a PDF table and it came with an annoying formatting feature.
The table has two columns. In some cases, one row ended up with what should be the column A value and the next row with what should be the column B value. Like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['names'] = ['John', 'Mary', np.nan, 'George']
df['numbers'] = ['1', np.nan, '2', '3']
I want to reformat that dataframe so that wherever there is an empty cell in df['numbers'], it is filled with the value from the next row. Then I apply .dropna() to eliminate the still-wrong cells.
I tried this:
for i in range(len(df)):
    if df['numbers'][i] == np.nan:
        df['numbers'][i] = df['numbers'][i+1]
The dataframe doesn't change, though, and there is no error message either.
What am I missing?
While I don't think this solves all your problems, the reason you are not updating the dataframe is the line
if df['numbers'][i] == np.nan:, since this always evaluates to False (NaN compares unequal to everything, including itself).
To implement a valid test for NaN in this case you must use
if pd.isnull(df['numbers'][i]): — this will evaluate to True or False depending on the cell contents.
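A quick illustration of the point (my sketch):

import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False: NaN is unequal to everything, itself included
print(pd.isnull(np.nan))  # True: the correct way to test for missing values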
This is the solution I found:
df[['numbers']] = df[['numbers']].fillna(method='bfill')
df = df[~df['names'].isna()]
It's probably not the most elegant, but it worked.
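For reference, a runnable version of that fix on the question's example data (my sketch; the printed result is what I'd expect, not taken from the original post):

import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['John', 'Mary', np.nan, 'George'],
                   'numbers': ['1', np.nan, '2', '3']})

df[['numbers']] = df[['numbers']].fillna(method='bfill')  # pull each missing number up from the next row
df = df[~df['names'].isna()]                              # drop the leftover misaligned rows
print(df)
#     names numbers
# 0    John       1
# 1    Mary       2
# 3  George       3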
I have a problem which ought to be trivial but seems to have been massively over-complicated by the column-based nature of FITS BinTableHDU.
The script I'm writing should be trivial: iterate through a FITS file and write a subset of rows to an identically formatted FITS file, reducing the row count from c700k/3.6GB to about 350 rows. I have processed the input file and have each row that I want to save in a Python list of FITS records:
outarray = []
self._indata = Table.read(self.infile, hdu=1)
for r in self._indata:
    RecPassesFilter = FilterProc(r, self)
    #
    # Add to output array only if passes all filters...
    #
    if RecPassesFilter:
        outarray.append(r)
Now, I've created an empty BintableHDU with exactly the same columns and formats and I want to add the filtered data:
[...much omitted code later...]
mycols = []
for inputcol in self._coldefs:
    mycols.append(fits.Column(name=inputcol.name, format=inputcol.format))

# Next line should produce an empty BinTableHDU in the identical format to the output data
SaveData = fits.BinTableHDU.from_columns(mycols)
for s in self._outdata:
    SaveData.data.append(s)
Now that last line not only fails, but every variant of it (SaveData.append() or .add_row() or whatever) also fails with a "no such method" error. There seems to be a singular lack of documentation on how to do the trivial task of adding a record. Clearly I am missing something, but two days later I'm still drawing a blank.
Can anyone point me in the right direction here?
OK, I managed to resolve this with some brute force and nested iterations, essentially creating the column data arrays on the fly. It's not much code, and I don't care that it's inefficient as I won't need to run it too often. Example code here:
with fits.open(self._infile) as HDUSet:
    tableHDU = HDUSet[1]
    self._coldefs = tableHDU.columns

FITScols = []
for inputcol in self._coldefs:
    NewColData = []
    for r in self._outdata:
        NewColData.append(r[inputcol.name])
    FITScols.append(fits.Column(name=inputcol.name, format=inputcol.format,
                                array=NewColData))
SaveData = fits.BinTableHDU.from_columns(FITScols)
SaveData.writeto(fname)
This solves my problem for a 350 row subset. I haven't yet dared try it for the 250K row subset that I need for the next part of my project!
I just recalled that BinTableHDU.from_columns takes an nrows argument. If you pass that along with the columns of an existing table HDU, it will copy the column structure but initialize subsequent rows with empty data:
>>> hdul = fits.open('astropy/io/fits/tests/data/table.fits')
>>> table = hdul[1]
>>> table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '>f4')]))
>>> new_table = fits.BinTableHDU.from_columns(table.columns, nrows=5)
>>> new_table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> new_table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2),
('', 0. ), ('', 0. )],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
As you can see, this still copies the data from the original columns. I think the idea behind this originally was for adding new rows to an existing table. However, you can also initialize a completely empty new table by passing fill=True:
>>> new_table_zeroed = fits.BinTableHDU.from_columns(table.columns, nrows=5, fill=True)
>>> new_table_zeroed.data
FITS_rec([('', 0.), ('', 0.), ('', 0.), ('', 0.), ('', 0.)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
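Putting this together with the filtering loop from the question, copying the saved rows into a pre-sized empty table might look like the sketch below. It reuses table.columns from the example above and outarray from the question; the per-record tuple assignment and the output filename are my assumptions, not tested against the original data:

from astropy.io import fits

# Size the new HDU to the filtered row count, then copy records across.
new_hdu = fits.BinTableHDU.from_columns(table.columns,
                                        nrows=len(outarray), fill=True)
for i, row in enumerate(outarray):
    new_hdu.data[i] = tuple(row)   # assign one whole record at a time
new_hdu.writeto('filtered.fits', overwrite=True)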
I have a dataframe, one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL, and creates an HTML link.
The column newsSource has the Link name, and url has the URL. For each row in the dataframe, I want to create a column that has:
<a href="[url]">[newsSource name]</a>
Trying the below throws the error
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in <module>
    df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers

df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x['source']))
But I've used x[colName] before? The below line works fine, it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, will only have access to the elements of df['url'].
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)

  source                  url                               sourceURL
0    BBC  http://www.bbc.o.uk  <a href="http://www.bbc.o.uk">BBC</a>
With zip and old-school % string formatting:

df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]

And the f-string version:

df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]
Normally, a relatively long dataframe like
df = pd.DataFrame(np.random.randint(0,10,(100,2)))
df
will display a truncated form in the Jupyter notebook, with the head and tail rows shown, an ellipsis in between, and the row/column count at the end.
However, after style.apply
def highlight_max(x):
    return ['background-color: yellow' if v == x.max() else '' for v in x]

df.style.apply(highlight_max)
we get all rows displayed.
Is it possible to still display the truncated form of dataframe after style.apply?
Something simple like this?
def display_df(dataframe, function):
    display(dataframe.head().style.apply(function))
    display(dataframe.tail().style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')

display_df(df, highlight_max)
Output:
**** EDIT ****
def display_df(dataframe, function):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns),
                       dataframe.iloc[-5:, :]]).style.apply(function))
    print(f'{dataframe.shape[0]} rows x {dataframe.shape[1]} columns')

display_df(df, highlight_max)
Output:
The jupyter preview is basically something like this:
def display_df(dataframe):
    display(pd.concat([dataframe.iloc[:5, :],
                       pd.DataFrame(index=['...'], columns=dataframe.columns,
                                    data={0: '...', 1: '...'}),
                       dataframe.iloc[-5:, :]]))
but if you try to apply the style you get an error (TypeError: '>=' not supported between instances of 'int' and 'str') because it tries to compare and highlight the string values '...'.
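One way around that error (my sketch, not from the original answer) is to coerce each column to numeric inside the style function, so the '...' separator row is ignored:

import pandas as pd

def highlight_max_safe(x):
    # '...' (and any other non-numeric value) becomes NaN, which max() skips
    numeric = pd.to_numeric(x, errors='coerce')
    return ['background-color: yellow' if v == numeric.max() else ''
            for v in numeric]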
You can slice the dataframe first and then apply the style to the slice (the Styler object returned by df.style has no head or tail of its own). This gives you more control over what you display every time:

output = df.head(10).style.apply(highlight_max)  # 10 -> number of rows to display
output

If you want to see more varied data you can also use sample, which styles random rows:

output = df.sample(10).style.apply(highlight_max)
I am indexing a subset of cells from a DataFrame column and attempting to assign a boolean True to said subset:
df['mycolumn'][df['myothercolumn'] == val][idx: idx + 25] = True
However, when I slice df['mycolumn'][df['myothercolumn'] == val][idx: idx + 25] again, the initial values are still there. In other words, the changes were not applied!
I'm about to rip my hair out. What am I doing wrong?
Try this:
df.loc[df['myothercolumn']==val, some_column_name] = True
some_column_name should be the name of the column you want to add or change.
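For what it's worth, the original line fails because chained indexing (df[...][...][...] = ...) writes to an intermediate copy rather than to df itself. If you also need the positional idx: idx + 25 window from the question, one sketch (my assumption: idx is a position within the filtered subset) is to resolve the index labels first and assign through a single .loc call:

# Labels of the matching rows, narrowed to the positional window,
# then one .loc assignment so the write hits the original frame.
subset_index = df.index[df['myothercolumn'] == val][idx: idx + 25]
df.loc[subset_index, 'mycolumn'] = True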