Adding a row to a FITS table with astropy - astropy

I have a problem which ought to be trivial but seems to have been massively over-complicated by the column-based nature of FITS BinTableHDU.
The script I'm writing should be trivial: iterate through a FITS file and write a subset of rows to an identically formatted FITS file, reducing the row count from c700k/3.6GB to about 350 rows. I have processed the input file and have each row that I want to save in a python array of FITS records:
outarray = []
self._indata = Table.read(self.infile, hdu=1)
for r in self._indata:
    RecPassesFilter = FilterProc(r, self)
    #
    # Add to output array only if passes all filters...
    #
    if RecPassesFilter:
        outarray.append(r)
Now, I've created an empty BinTableHDU with exactly the same columns and formats, and I want to add the filtered data:
[...much omitted code later...]
mycols = []
for inputcol in self._coldefs:
    mycols.append(fits.Column(name=inputcol.name, format=inputcol.format))
# Next line should produce an empty BinTableHDU in the identical format to the output data
SaveData = fits.BinTableHDU.from_columns(mycols)
for s in self._outdata:
    SaveData.data.append(s)
Now that last line not only fails, but every variant of it (SaveData.append() or .add_row() or whatever) also fails with a "no such method" error. There seems to be a singular lack of documentation on how to do the trivial task of adding a record. Clearly I am missing something, but two days later I'm still drawing a blank.
Can anyone point me in the right direction here?

OK, I managed to resolve this with some brute force and nested iterations, essentially creating the column data arrays on the fly. It's not much in terms of code and I don't care that it's inefficient as I won't need to run it too often. Example code here:
with fits.open(self._infile) as HDUSet:
    tableHDU = HDUSet[1]
    self._coldefs = tableHDU.columns

FITScols = []
for inputcol in self._coldefs:
    NewColData = []
    for r in self._outdata:
        NewColData.append(r[inputcol.name])
    FITScols.append(fits.Column(name=inputcol.name, format=inputcol.format, array=NewColData))

SaveData = fits.BinTableHDU.from_columns(FITScols)
SaveData.writeto(fname)
This solves my problem for a 350 row subset. I haven't yet dared try it for the 250K row subset that I need for the next part of my project!

I just recalled that BinTableHDU.from_columns takes an nrows argument. If you pass that along with the columns of an existing table HDU, it will copy the column structure but initialize subsequent rows with empty data:
>>> hdul = fits.open('astropy/io/fits/tests/data/table.fits')
>>> table = hdul[1]
>>> table.columns
ColDefs(
    name = 'target'; format = '20A'
    name = 'V_mag'; format = 'E'
)
>>> table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
         dtype=(numpy.record, [('target', 'S20'), ('V_mag', '>f4')]))
>>> new_table = fits.BinTableHDU.from_columns(table.columns, nrows=5)
>>> new_table.columns
ColDefs(
    name = 'target'; format = '20A'
    name = 'V_mag'; format = 'E'
)
>>> new_table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2),
          ('', 0. ), ('', 0. )],
         dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
As you can see, this still copies the data from the original columns. I think the idea behind this originally was for adding new rows to an existing table. However, you can also initialize a completely empty new table by passing fill=True:
>>> new_table_zeroed = fits.BinTableHDU.from_columns(table.columns, nrows=5, fill=True)
>>> new_table_zeroed.data
FITS_rec([('', 0.), ('', 0.), ('', 0.), ('', 0.), ('', 0.)],
         dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
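Tying this back to the original filtering problem, here is a minimal sketch of how the nrows/fill approach could be used to write a filtered subset. The file names and the keep() predicate are placeholders standing in for the question's FilterProc; the column copy follows the documented pattern of assigning column slices on the pre-allocated FITS_rec:
from astropy.io import fits

# Hypothetical filter predicate standing in for the question's FilterProc.
def keep(row):
    return row['V_mag'] < 12.0

with fits.open('table.fits') as hdul:          # placeholder input file
    table = hdul[1]
    filtered = [row for row in table.data if keep(row)]

    # Pre-allocate an HDU with the same column structure and exactly
    # as many (empty) rows as survived the filter.
    out = fits.BinTableHDU.from_columns(table.columns,
                                        nrows=len(filtered), fill=True)

    # Copy the surviving values across, column by column.
    for name in table.columns.names:
        out.data[name][:] = [row[name] for row in filtered]

    out.writeto('subset.fits', overwrite=True)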

Related

Alternatives to iloc for searching dataframes

I have a simple piece of code that iterates through a list of ids; if an id is in a particular data frame column (in this case, the column is called uniqueid), it uses iloc to get the value from another column on the matching row and then adds it as a value in a dictionary with the id as the key:
union_cols = ['uniqueid', 'FLD_ZONE', 'FLD_ZONE_1', 'ST_FIPS', 'CO_FIPS', 'CID']
union_df = gpd.GeoDataFrame.from_features(records(union_gdb, union_cols))
pop_df = pd.read_csv(pop_csv, low_memory=False) # Example dataframe
uniqueid_inin = ['', 'FL1234', 'F54323', ....] # Just an example
isin_dict = dict()
for id in uniqueid_inin:
    if (id != '') and (id in pop_df.uniqueid.values):
        v = pop_df.loc[pop_df['uniqueid'] == id, 'Pop_By_Area'].iloc[0]
        isin_dict.setdefault(id, v)
This works, but it is very slow. Is there a quicker way to do this?
To resolve this issue (and make the process more efficient) I had to think about the process in a different way that took advantage of Pandas and didn't rely on a generic Python solution. I first had to get a list of only the uniqueids from my union_df that were absolutely in pop_df. If they were not, applying the .isin() method would throw an indexing error.
#Get list of uniqueids in pop_df
pop_uniqueids = pop_df['uniqueid'].unique()
#Get only the union_df rows where the uniqueid matches pop_uniqueid
union_df = union_df.loc[(union_df['uniqueid'].isin(pop_uniqueids))]
union_df = union_df.reset_index()
union_df = union_df.drop(columns='index')
When the uniqueid_inin list is created from union_df (by just getting the uniqueids from rows where my zone_status column is equal to 'in-in'), it will only contain a subset of items that are definitely in pop_df, and empty values are no longer an issue. Then I simply create a subset dataframe using the list and zip the desired column values together in a dictionary:
inin_subset = pop_df[pop_df['uniqueid'].isin(uniqueid_inin)]
inin_pop_dict = dict(zip(inin_subset.uniqueid, inin_subset.Pop_By_Area))
I hope this technique helps.
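For illustration, here is a minimal, self-contained sketch of the same isin/zip pattern. The column names match the question, but the values are invented:
import pandas as pd

# Toy stand-ins for pop_df and the uniqueid_inin list from the question.
pop_df = pd.DataFrame({
    'uniqueid':    ['FL1234', 'F54323', 'X99999'],
    'Pop_By_Area': [120, 45, 300],
})
uniqueid_inin = ['', 'FL1234', 'F54323']

# Keep only the rows whose uniqueid appears in the list, then zip the two
# columns into a dict in one vectorised pass instead of a Python loop.
inin_subset = pop_df[pop_df['uniqueid'].isin(uniqueid_inin)]
inin_pop_dict = dict(zip(inin_subset.uniqueid, inin_subset.Pop_By_Area))

print(inin_pop_dict)   # {'FL1234': 120, 'F54323': 45}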

Creating a line plot after every 48 rows in Dataframe

So I am given thousands of lines of data, which I inserted into a data frame using pandas. I would like to create plots that each include only 48 rows of data, and after every 48 rows create a new plot with the next 48 rows, and so on. I'm confused as to how to do that. I would also like to know how to graph only certain rows of my data frame in my line plot. P.S. This is my first question so I apologize for any formatting errors.
I isolated a certain column of my data frame, "HP", and assigned it to the variable hp by doing hp = df.HP. I also made a basic plot of the whole data set already by doing hp.plot(x = '#', y = None, kind = 'line'). I've looked up my issue and tried using
hpnew = hp[seq(1, nrow(hp), 48), ]
hpnew.plot(x = '#', y = None, kind = 'line')
where hpnew would be every 48th row. It didn't work and I was left with the error message
NameError: name 'seq' is not defined
Initially I was told to use
for i to range(hp):
hp(i)
But I was left with a syntax error and was confused about what to do from there.
You can use the answer by Roman Pekar here to bin your dataframe into groups of 48:
df.groupby(df.index // 48)
Then if you have some plotting function you can apply it to the grouped data:
def plot_function(df):
    df.plot( ... )

df.groupby(df.index // 48)['HP'].apply(plot_function)
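Put together, here is a minimal sketch of the whole idea with synthetic data. The column name 'HP' matches the question; the values and the number of rows are made up:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic frame standing in for the real data: 4 * 48 rows of an 'HP' column.
df = pd.DataFrame({'HP': np.random.rand(192)})

# Integer-divide the row index by 48 so rows 0-47 fall in group 0,
# rows 48-95 in group 1, and so on; each group gets its own line plot.
for group_id, chunk in df.groupby(df.index // 48):
    chunk['HP'].plot(kind='line', title=f'Rows {group_id * 48}-{group_id * 48 + 47}')
    plt.show()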

How to insert a column in a julia DataFrame at specific position (without referring to existing column names)

I have a DataFrame in Julia with hundreds of columns, and I would like to insert a column after the first one.
For example in this DataFrame:
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
I would like to insert a column area after colour, but without referring specifically to shape and border (that in my real case are hundreds of different columns).
df[:area] = [1,2]
In this example I can use (but referring specifically to shape and border):
df = df[[:colour, :area, :shape, :border]] # with specific reference to shape and border names
Update: This function has changed. See #DiegoJavierZea's comment.
Well, congratulations on finding a workaround yourself, but there is a built-in function that is semantically clearer and possibly a little bit faster:
using DataFrames
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
insert!(df, 3, [1,2], :area)
Where 3 is the expected index for the new column after the insertion, [1,2] is its content, and :area is the name. You can find a more detailed document by typing ?insert! in REPL after loading the DataFrames package.
It is worth noting that the ! is a part of the function name. It's a Julia convention to indicate that the function will mutate its argument.
rows = size(df)[1]      # tuple gives you (rows, columns) of the DataFrame
insertcols!(df,         # DataFrame to be changed
    1,                  # insert as column 1
    :Day => 1:rows,     # populate as "Day" with 1,2,3,..
    makeunique=true)    # if the name of the column exists, make it Day_1
While writing the question I also found a solution (as often happens).
I still post it here to keep it on record (for myself) and for others.
It is enough to save the column names before "adding" the new column:
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
dfnames = names(df)
df[:area] = [1,2]
df = df[vcat(dfnames[1:1],:area,dfnames[2:end])]

Reading csv file with pandas

I have a csv file with 2 columns, text and boolean (y/n), and I am trying to put all the positive values in one file and the negative ones in another. Here is what I tried:
df = pd.read_csv('text_trait_with_binary_EXT.csv','rb',delimiter=',',quotechar='"')
#print(df)
df.columns = ["STATUS", "xEXT"]
positive = []
negative = []
for line in df:
    text = line[0].strip()
    if line[1].strip() == "y":
        positive.append(text)
    elif line[1].strip() == "n":
        negative.append(text)
print(positive)
print(negative)
And when I run this it just gives empty lists!
I am new in using pandas so if any of you can help that would be great.
As others have commented, there is almost always a better approach than using iteration in Pandas. It has a lot of built-in functions to help avoid loops.
If I understand your intentions correctly, you want to take the values from column 1 (named 'STATUS'), split them according to whether the corresponding value in column 2 (named 'xEXT') is 'y' or 'n', and generate two lists containing the column 1 values. The following should work (to be used after the first two lines of code you posted):
positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()
Here is a link to the documentation on loc, which is useful for problems like this.
The above solution assumes that your data has been read in correctly. If it does not work for you, please do as others have commented and add a small sample of your data so that we are able to try out our proposed solutions.
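To illustrate, here is a tiny end-to-end sketch with invented data (the values are placeholders shaped like the file described in the question):
import pandas as pd

# Invented two-column data resembling the described csv contents.
df = pd.DataFrame({
    'STATUS': ['likes hiking ', 'hates mornings', 'enjoys jazz '],
    'xEXT':   ['y', 'n', ' y'],
})

# Boolean indexing with .loc splits the STATUS column by the y/n flag
# without any explicit Python loop.
positive = df.loc[df['xEXT'].str.strip() == 'y', 'STATUS'].tolist()
negative = df.loc[df['xEXT'].str.strip() == 'n', 'STATUS'].tolist()

print(positive)   # ['likes hiking ', 'enjoys jazz ']
print(negative)   # ['hates mornings']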

Data slicing a pandas frame - I'm having problems with unique

I am facing issues trying to select a subset of columns and running unique on it.
Source Data:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape
Produces:
(10000, 86)
Process Data:
df = df_raw[['A','B','C']]
df.shape
Produces:
(10000, 3)
Furthermore, doing:
df_raw.head()
df.head()
produces a correct list of rows and columns.
However,
print('RAW:',sorted(df_raw['A'].unique()))
works perfectly
Whilst:
print('PROCESSED:',sorted(df['A'].unique()))
produces:
AttributeError: 'DataFrame' object has no attribute 'unique'
What am I doing wrong? If the shape and head output are exactly what I want, I'm confused why my processed dataset is throwing errors. I did read Pandas 'DataFrame' object has no attribute 'unique' on SO, which correctly states that unique needs to be applied to columns, which is what I am doing.
This was a case of a duplicate column. Given this is proprietary data, I abstracted it as 'A', 'B', 'C' in this question and therefore masked the problem. (The real data set had 86 columns and I had duplicated one of those columns twice in my subset, and was trying to do a unique on that)
My problem was this:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']] # <-- I did not realize I had duplicated A later.
This was causing problems when doing a unique on 'A'.
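For anyone hitting the same thing, here is a small reproduction of why the duplicated label breaks .unique(); the data is invented:
import pandas as pd

df_raw = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5], 'C': [6, 7, 8]})

df = df_raw[['A', 'B', 'C', 'A']]      # 'A' accidentally selected twice

# With a duplicated label, df['A'] returns a two-column DataFrame,
# and DataFrame has no .unique() method, hence the AttributeError.
print(type(df_raw['A']))   # <class 'pandas.core.series.Series'>
print(type(df['A']))       # <class 'pandas.core.frame.DataFrame'>

print(sorted(df_raw['A'].unique()))    # works: [1, 2]
# sorted(df['A'].unique())             # AttributeError: 'DataFrame' object has no attribute 'unique'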
To extract a subset of data from the entire dataframe based on a column ID, this works:
df = df.drop_duplicates(subset=['Id'])  # where 'Id' is the column used to filter
print(df)