OrderedDict containing lists of DataFrames - pandas

I don't understand why method 1 is fine but the second one is not...
Method 1:
import pandas as pd
import collections
d = collections.OrderedDict([('key', []), ('key2', [])])
df = pd.DataFrame({'id': [1], 'test': ['ok']})
d['key'].append(df)
d
OrderedDict([('key', [ id test
0 1 ok]), ('key2', [])])
Method 2:
l = ['key', 'key2']
dl = collections.OrderedDict(zip(l, [[]]*len(l)))
dl
OrderedDict([('key', []), ('key2', [])])
dl['key'].append(df)
dl
OrderedDict([('key', [ id test
0 1 ok]), ('key2', [ id test
0 1 ok])])
dl == d
True

The issue stems from creating the empty lists with [[]] * len(l). What this actually does is copy the reference to a single empty list multiple times, so you end up with a list of "empty lists" that all point to the same underlying object. Any in-place change you make to that underlying list (such as append) is then visible through every reference to it.
The same type of issue comes about when assigning one variable to another:
a = []
b = a
# `a` and `b` both point to the same underlying object.
b.append(1)  # in-place operation changes the underlying object
print(a, b)
[1] [1]
To circumvent the issue, instead of using [[]] * len(l) you can use a generator expression or list comprehension to ensure a new empty list is created for each element of l:
collections.OrderedDict(zip(l, ([] for _ in l)))
Using the generator expression ([] for _ in l) creates a new empty list for every element in l instead of copying references to a single empty list. The easiest way to check this is to compare the ids of the objects with the id function. Here we compare your original method to the new one:
# The ids come out the same, indicating that both elements are references to the same underlying list
>>> [id(x) for x in [[]] * len(l)]
[2746221080960, 2746221080960]
# The ids come out different, indicating that they point to different underlying lists
>>> [id(x) for x in ([] for _ in l)]
[2746259049600, 2746259213760]
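Putting it together, a minimal sketch of the corrected method 2, showing that an append now affects only the intended key:

import collections
import pandas as pd

l = ['key', 'key2']
# a fresh empty list per key, so the values are independent objects
dl = collections.OrderedDict(zip(l, ([] for _ in l)))

df = pd.DataFrame({'id': [1], 'test': ['ok']})
dl['key'].append(df)

print(len(dl['key']), len(dl['key2']))  # 1 0 -> only 'key' was modified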


Adding a row to a FITS table with astropy

I have a problem which ought to be trivial but seems to have been massively over-complicated by the column-based nature of FITS BinTableHDU.
The script I'm writing should be trivial: iterate through a FITS file and write a subset of rows to an identically formatted FITS file, reducing the row count from roughly 700k rows (3.6 GB) to about 350 rows. I have processed the input file and hold each row that I want to save in a Python list of FITS records:
outarray = []
self._indata = Table.read(self.infile, hdu=1)
for r in self._indata:
    RecPassesFilter = FilterProc(r, self)
    #
    # Add to output array only if it passes all filters...
    #
    if RecPassesFilter:
        outarray.append(r)
Now I've created an empty BinTableHDU with exactly the same columns and formats, and I want to add the filtered data:
[...much omitted code later...]
mycols = []
for inputcol in self._coldefs:
    mycols.append(fits.Column(name=inputcol.name, format=inputcol.format))

# Next line should produce an empty BinTableHDU in the identical format to the output data
SaveData = fits.BinTableHDU.from_columns(mycols)
for s in self._outdata:
    SaveData.data.append(s)
Now that last line not only fails, but every variant of it (SaveData.append() or .add_row() or whatever) also fails with a "no such method" error. There seems to be a singular lack of documentation on how to do the trivial task of adding a record. Clearly I am missing something, but two days later I'm still drawing a blank.
Can anyone point me in the right direction here?
OK, I managed to resolve this with some brute force and nested iteration, essentially creating the column data arrays on the fly. It's not much code, and I don't care that it's inefficient since I won't need to run it often. Example code here:
with fits.open(self._infile) as HDUSet:
    tableHDU = HDUSet[1]
    self._coldefs = tableHDU.columns

FITScols = []
for inputcol in self._coldefs:
    NewColData = []
    for r in self._outdata:
        NewColData.append(r[inputcol.name])
    FITScols.append(fits.Column(name=inputcol.name, format=inputcol.format,
                                array=NewColData))
SaveData = fits.BinTableHDU.from_columns(FITScols)
SaveData.writeto(fname)
This solves my problem for a 350 row subset. I haven't yet dared try it for the 250K row subset that I need for the next part of my project!
I just recalled that BinTableHDU.from_columns takes an nrows argument. If you pass that along with the columns of an existing table HDU, it will copy the column structure but initialize subsequent rows with empty data:
>>> hdul = fits.open('astropy/io/fits/tests/data/table.fits')
>>> table = hdul[1]
>>> table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '>f4')]))
>>> new_table = fits.BinTableHDU.from_columns(table.columns, nrows=5)
>>> new_table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> new_table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2),
('', 0. ), ('', 0. )],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
As you can see, this still copies the data from the original columns. I think the idea behind this originally was for adding new rows to an existing table. However, you can also initialize a completely empty new table by passing fill=True:
>>> new_table_zeroed = fits.BinTableHDU.from_columns(table.columns, nrows=5, fill=True)
>>> new_table_zeroed.data
FITS_rec([('', 0.), ('', 0.), ('', 0.), ('', 0.), ('', 0.)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
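From there, one way to actually fill those pre-allocated rows with the filtered records is to assign column by column; a minimal sketch, assuming outarray holds the filtered rows (as in the question) and the column structure matches:

from astropy.io import fits

# allocate an empty table with one slot per filtered row
new_table = fits.BinTableHDU.from_columns(table.columns, nrows=len(outarray), fill=True)

# copy the kept rows in, one column at a time
for colname in new_table.columns.names:
    new_table.data[colname][:] = [row[colname] for row in outarray]

new_table.writeto('filtered.fits')  # hypothetical output filename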

Got TypeError: string indices must be integers with .apply [duplicate]

I have a dataframe; one column is a URL and the other is a name. I'm simply trying to add a third column that takes the URL and creates an HTML link.
The column newsSource has the link name, and url has the URL. For each row in the dataframe, I want to create a column that contains:
<a href="[url]">[newsSource name]</a>
Trying the line below throws the error:
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x['source']))
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in <module>
TypeError: string indices must be integers
But I've used x[colName] before! The line below works fine; it simply creates a column with the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
pd.Series.apply only has access to a single series, i.e. the series on which you call the method. In other words, the function you supply, whether it is named or an anonymous lambda, only sees the elements of that one series.
To access multiple series by row, you need pd.DataFrame.apply with axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
import pandas as pd

df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)

  source                  url                              sourceURL
0    BBC  http://www.bbc.o.uk  <a href="http://www.bbc.o.uk">BBC</a>
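For comparison, a quick sketch applying the list-comprehension version to the same demo frame (still assuming the anchor-tag format above) gives the identical column:

df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(u, s)
                   for u, s in zip(df['url'], df['source'])]
print(df.loc[0, 'sourceURL'])  # <a href="http://www.bbc.o.uk">BBC</a>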
With zip and old-school % string formatting:
df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]
And with an f-string:
df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]

How to obtain a matrix by adding one more new row vector within an iteration?

I am generating arrays (technically row vectors) with a for-loop; a, b, c, ... are the outputs.
Can I stack each new array onto the previous ones to form a matrix?
import numpy as np
# just for example:
a = np.asarray([2,5,8,10])
b = np.asarray([1,2,3,4])
c = np.asarray([2,3,4,5])
... ... ... ... ...
I have tried ab = np.stack((a, b)), and that works. But my idea is to keep adding a new row to the existing matrix on each loop iteration, and with abc = np.stack((ab, c)) I get the error ValueError: all input arrays must have the same shape.
Can anyone tell me how I can add another vector to an already existing matrix? I couldn't find a satisfying answer in this forum.
np.stack won't work here; you can only stack arrays with the same shape.
One possible solution is to use np.concatenate((original, to_append), axis=0) each time. Check the docs.
You can also try using np.append.
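A minimal sketch of that np.concatenate approach, growing the matrix one row vector at a time (each vector reshaped to 2-D so the shapes match):

import numpy as np

mat = np.asarray([2, 5, 8, 10]).reshape(1, -1)    # start as a 1x4 matrix
for new_row in ([1, 2, 3, 4], [2, 3, 4, 5]):
    row = np.asarray(new_row).reshape(1, -1)      # make the new vector 2-D
    mat = np.concatenate((mat, row), axis=0)      # append it as a new row
print(mat.shape)  # (3, 4)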
Thanks to everyone's ideas; the best solution to this problem is to append the arrays (or lists) to a plain Python list during the iteration and convert that list to a matrix with np.asarray at the end.
a = np.asarray([2,5,8,10])   # or a = [2,5,8,10]
b = np.asarray([1,2,3,4])    # b = [1,2,3,4]
c = np.asarray([2,3,4,5])    # c = [2,3,4,5]
... ...
l1 = []
l1.append(a)
l1.append(b)
l1.append(c)
... ...
m = np.asarray(l1)           # convert the accumulated rows to a matrix at the end
l1 doesn't have to start out empty; however, any elements it already contains should be of the same type as a, b, c.
For example, the difference between l1 = [1,1,1,1] and l1 = [[1,1,1,1]] is huge in this case.

Creating a CategoricalDtype from an int column in Dask

dask.__version__ = 2.5.0
I have a table with many uint16 columns containing codes in the range 0, ..., n, plus a bunch of lookup tables containing the mappings from these 'codes' to their 'categories'.
My question: is there a way to make these integer columns 'categorical' without parsing the data or first replacing the codes with the categories?
Ideally, I want Dask to keep the values as they are, treat them as category codes, and accept the categories that I tell it belong to those codes.
import numpy as np
import pandas as pd
import dask.dataframe as dd

dfp = pd.DataFrame({'c01': np.random.choice(np.arange(3), size=10),
                    'v02': np.random.randn(10)})
dfd = dd.from_pandas(dfp, npartitions=2)
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
dfd.c01 = dfd.c01.map_partitions(lambda s: pd.Categorical.from_codes(s, dtype=mdt), meta='category')
dfd.dtypes
The above does not work; the dtype is 'O' (it seems to have replaced the ints with strings). I can subsequently do the following, which seems to do the trick:
dfd.c01 = dfd.c01.astype('category')
But that seems inefficient for big data sets.
Any pointers are much appreciated.
Some context: I have a big dataset (>500M rows) with many columns containing a limited number of strings, the perfect use case for the categorical dtype. The data gets extracted from a Teradata DW using Parallel Transporter, which produces a delimited UTF-8 file. To make this process faster, I categorize the data on the Teradata side, so I just need to create the category dtype from the codes on the Dask side of the fence.
As long as you have an upper bound on the largest integer, which you call n (here equal to 3), the following will work.
In [33]: dfd.c01.astype('category').cat.set_categories(np.arange(len(mdt.categories))).cat.rename_categories(list(mdt.categories))
Out[33]:
Dask Series Structure:
npartitions=2
0 category[known]
5 ...
9 ...
Name: c01, dtype: category
Dask Name: cat, 10 tasks
which gives the following when computed:
Out[34]:
0 b
1 b
2 c
3 c
4 a
5 c
6 a
7 a
8 a
9 a
Name: c01, dtype: category
Categories (3, object): [a, b, c]
The basic idea is to make an intermediate Categorical whose categories are the codes (0, 1, ..., n) and then move from those numerical categories to the actual ones (a, b, c).
We have an open issue for making this nicer: https://github.com/dask/dask/issues/2829
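For reference, the pandas-level operation the question is after (reusing existing integer codes directly, without touching the values) can be sketched like this, assuming a reasonably recent pandas:

import numpy as np
import pandas as pd

codes = pd.Series(np.random.choice(np.arange(3), size=10))     # existing integer codes
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
cats = pd.Series(pd.Categorical.from_codes(codes, dtype=mdt))  # codes reused as-is, labels from mdt
print(cats.dtype)  # category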

Numpy index array of unknown dimensions?

I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I do not know the shapes of a and b in advance, beyond knowing that
len(shape(a)) == len(shape(b)) - 1
and I also do not know which dimension of b to skip. I'd like to use np.index_exp, but that does not seem to help me...
def compare_arrays(a, b, skip_row):
    u = np.index_exp[ ... ]
    assert(a[:] == b[u])
Edit
Or to put it otherwise: I want to construct the slicing when I know the shape of the array and the dimension I want to drop. How do I dynamically create the np.index_exp if I know the number of dimensions and the positions where to put ":" and where to put "0"?
I was just looking at the code for apply_along_axis and apply_over_axes, studying how they construct their indexing objects.
Let's make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list replication with *):
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[ind].shape # same as b[:,:,:,:]
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2 # replace one slice with index
In [359]: b[ind].shape # a slice, indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[ind]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
I don't quite follow your question about the use of :. Note that when building an indexing list I use slice(None); the interpreter translates every indexing : into a slice object: [start:stop:step] => slice(start, stop, step).
Usually you don't need a[:] == b[0]; a == b[0] is sufficient. With lists, alist[:] makes a copy; with arrays it does nothing extra (unless used on the left-hand side, as in a[:] = ...).
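Putting the slice-list idea into the helper the question sketches (compare_arrays here is a hypothetical version of it; tuple() is used because newer NumPy requires a tuple rather than a list as a multidimensional index):

import numpy as np

def compare_arrays(a, b, skip_dim, index=0):
    # full slice for every dimension of b, then fix the skipped dimension
    # at a concrete position (generalising the question's skip_row idea)
    ind = [slice(None)] * b.ndim
    ind[skip_dim] = index
    return np.array_equal(a, b[tuple(ind)])   # tuple index keeps newer NumPy happy

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])
print(compare_arrays(a, b, skip_dim=0))       # True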