How to assign increment values to pandas column names? - pandas

For any columns without column names, I want to arbitrarily assign incrementing numbers as the column names. That is, if a column name is NaN, assign 1, 2, 3, ...; if a column name exists, leave it unchanged.
Here, column 28 onwards do not have column names.
My code below did not change the column names.
import pandas as pd
import numpy as np
# Arbitrarily assign the NaN column names with numbers (i.e., column 28 onwards)
df.iloc[:, 27:].columns = range(1, df.iloc[:, 27:].shape[1] + 1)
df.columns
Original column names
df.columns
Index([ 'strand', 'start',
'stop', 'total_probes',
'gene_assignment', 'mrna_assignment',
'swissprot', 'unigene',
'GO_biological_process', 'GO_cellular_component',
'GO_molecular_function', 'pathway',
'protein_domains', 'crosshyb_type',
'category', 'seqname',
'Gene Title', 'Cytoband',
'Entrez Gene', 'Swiss-Prot',
'UniGene', 'GO Biological Process',
'GO Cellular Component', 'GO Molecular Function',
'Pathway', 'Protein Domains',
'Probe ID', nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan,
nan, nan],
dtype='object', name=0)
Expected output:
Index([ 'strand', 'start',
'stop', 'total_probes',
'gene_assignment', 'mrna_assignment',
'swissprot', 'unigene',
'GO_biological_process', 'GO_cellular_component',
'GO_molecular_function', 'pathway',
'protein_domains', 'crosshyb_type',
'category', 'seqname',
'Gene Title', 'Cytoband',
'Entrez Gene', 'Swiss-Prot',
'UniGene', 'GO Biological Process',
'GO Cellular Component', 'GO Molecular Function',
'Pathway', 'Protein Domains',
'Probe ID', 1,
2, 3,
4, 5,
6, 7,
8, 9,
10, 11,
12, 13,
14, 15,
16, 17,
18, 19,
20, 21,
22, 23,
24, 25,
26, 27,
28, 29],
dtype='object', name=0)

This will do it.
temp_columns_name = []
nan_count = 1
for i in df.columns:
    if pd.isnull(i):
        # Unnamed (NaN) column: give it the next running number
        temp_columns_name.append(nan_count)
        nan_count += 1
    else:
        # Named column: keep the existing name
        temp_columns_name.append(i)
df.columns = temp_columns_name
print(df.columns)
Output:
['strand',
'start',
'stop',
'total_probes',
'gene_assignment',
'mrna_assignment',
'swissprot',
'unigene',
'GO_biological_process',
'GO_cellular_component',
'GO_molecular_function',
'pathway',
'protein_domains',
'crosshyb_type',
'category',
'seqname',
'Gene Title',
'Cytoband',
'Entrez Gene',
'Swiss-Prot',
'UniGene',
'GO Biological Process',
'GO Cellular Component',
'GO Molecular Function',
'Pathway',
'Protein Domains',
'Probe ID',
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29]
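As for why the original attempt had no effect: df.iloc[:, 27:] builds a new DataFrame, so assigning to its .columns never touches df itself. The loop above can also be written more compactly. This is only a minimal sketch on a small made-up frame; the sample data and the name `counter` are illustrative assumptions, not taken from the question.

```python
from itertools import count

import numpy as np
import pandas as pd

# Hypothetical small frame: two named columns, two unnamed (NaN) columns
df = pd.DataFrame([[1, 2, 3, 4]], columns=["strand", "start", np.nan, np.nan])

counter = count(1)  # yields 1, 2, 3, ... on demand
# Keep existing names; replace each NaN name with the next number
df.columns = [next(counter) if pd.isnull(c) else c for c in df.columns]

print(list(df.columns))  # ['strand', 'start', 1, 2]
```

Assigning the whole list back to df.columns is what makes the change stick, exactly as in the loop version.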

Related

numpy - Subtract array between actual value and previous value (only not null)

I have the following situation:
Suppose I have an array, and I want the absolute difference between each non-null value and the previous non-null value.
[np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7]
Expected output:
[nan, nan, nan, nan, nan, 5, nan, 2, 3, nan, nan, 1]
What is a good approach to get this result using numpy without for loops?
I only solved it using for loop:
x = [np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7]
idx = np.where(~np.isnan(x))[0]
output = np.full(len(x), np.nan)
for i, j in enumerate(idx):
    if i > 0:
        output[j] = abs(x[idx[i]] - x[idx[i - 1]])
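You're most of the way there already. With x converted to a NumPy array (x = np.asarray(x)), so that x[idx] works, the loop collapses to:
output[idx[1:]] = np.abs(np.diff(x[idx]))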
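Put together as a self-contained sketch, using the sample data from the question (the only change assumed here is that x is a NumPy array rather than a list):

```python
import numpy as np

x = np.array([np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7])

# Positions of the non-NaN entries
idx = np.where(~np.isnan(x))[0]

# Start from all-NaN, then fill every valid position except the first
# with the absolute difference to the previous valid value
output = np.full(len(x), np.nan)
output[idx[1:]] = np.abs(np.diff(x[idx]))

print(output)
```

The first valid value (10) has no predecessor, so its slot stays NaN, which matches the expected output.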

2 different ways to index 3D array in Numpy?

I have a 3-D array with dimension of (14,3,5), which correspond to (event, color, taste).
I want to select the second color option and third taste option for all events.
Could someone tell me which is the correct format?
[:,2,3] vs [:,2][:,3]
Are they the same, or different?
If they are different, how are they different?
Do a test:
In [256]: arr = np.arange(2*3*5).reshape(2,3,5)
In [257]: arr
Out[257]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
One way:
In [258]: arr[:,2,3]
Out[258]: array([13, 28])
The other is evaluated in 2 steps:
In [259]: arr[:,2]
Out[259]:
array([[10, 11, 12, 13, 14],
[25, 26, 27, 28, 29]])
In [260]: arr[:,2][:,3]
Out[260]: array([13, 28])
The [:,3] is applied to the result of the [:,2]. Each [] is translated by the interpreter into a __getitem__() call (or a __setitem__() call if followed by an =). [:,2,3] is just one call, __getitem__((slice(None),2,3)).
With scalar indices like this, they are the same.
But what if one (or both) index is a list or array?
In [261]: arr[:,[1,2],3]
Out[261]:
array([[ 8, 13],
[23, 28]])
In [262]: arr[:,[1,2]]
Out[262]:
array([[[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
In [263]: arr[:,[1,2]][:,3]
Traceback (most recent call last):
Input In [263] in <cell line: 1>
arr[:,[1,2]][:,3]
IndexError: index 3 is out of bounds for axis 1 with size 2
In [264]: arr[:,[1,2]][:,:,3]
Out[264]:
array([[ 8, 13],
[23, 28]])
At least you aren't making the common novice mistake of attempting:
In [265]: arr[:][2][3]
Traceback (most recent call last):
Input In [265] in <cell line: 1>
arr[:][2][3]
IndexError: index 2 is out of bounds for axis 0 with size 2
In the long run you need to read and understand (most of)
https://numpy.org/doc/stable/user/basics.indexing.html
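The equivalences shown in the session above can be checked in a few lines. This is just a sketch that reuses the same small test array; the variable names are made up for illustration.

```python
import numpy as np

arr = np.arange(2 * 3 * 5).reshape(2, 3, 5)

# Scalar indices: one call and two chained calls give the same values
one_call = arr[:, 2, 3]
explicit = arr.__getitem__((slice(None), 2, 3))  # what arr[:, 2, 3] turns into
two_calls = arr[:, 2][:, 3]
assert np.array_equal(one_call, explicit)
assert np.array_equal(one_call, two_calls)

# List index: the intermediate result keeps that axis, so the second
# step needs an extra slice to reach the last axis
with_list = arr[:, [1, 2], 3]
chained = arr[:, [1, 2]][:, :, 3]
assert np.array_equal(with_list, chained)
```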

Numpy array changes shape when accessing with indices

I have a small matrix A with dimensions MxNxO
I have a large matrix B with dimensions KxMxNxP, with P>O
I have a vector ind of indices of dimension Ox1
I want to do:
B[1,:,:,ind] = A
But, the lefthand of my equation
B[1,:,:,ind].shape
is of dimension Ox1xMxN and therefore I can not broadcast A (MxNxO) into it.
Why does accessing B in this way change the dimensions of the left side?
How can I easily achieve my goal?
Thanks
There's a feature, if not a bug, that when slices are mixed in the middle of advanced indexing, the sliced dimensions are put at the end.
Thus for example:
In [204]: B = np.zeros((2,3,4,5),int)
In [205]: ind=[0,1,2,3,4]
In [206]: B[1,:,:,ind].shape
Out[206]: (5, 3, 4)
The 3,4 dimensions have been placed after the ind, 5.
We can get around that by indexing first with 1, and then the rest:
In [207]: B[1][:,:,ind].shape
Out[207]: (3, 4, 5)
In [208]: B[1][:,:,ind] = np.arange(3*4*5).reshape(3,4,5)
In [209]: B[1]
Out[209]:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],
[[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])
This only works when that first index is a scalar. If it too were a list (or array), we'd get an intermediate copy, and couldn't set the value like this.
https://docs.scipy.org/doc/numpy-1.15.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
It's come up in other SO questions, though not recently.
weird result when using both slice indexing and boolean indexing on a 3d array
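Here is a sketch of both ways to do the assignment with P > O, as in the question. The concrete sizes and index values are made up for illustration. The second variant keeps the single indexing expression and instead reorders the axes of A to match the (O, M, N) shape that B[1,:,:,ind] expects.

```python
import numpy as np

K, M, N, P = 2, 3, 4, 8
ind = [0, 2, 3, 5, 7]  # O = 5 positions along the last axis, O < P
A = np.arange(M * N * len(ind)).reshape(M, N, len(ind))

# Variant 1: index with the scalar first; B1[1] is a view, so this writes into B1
B1 = np.zeros((K, M, N, P), dtype=int)
B1[1][:, :, ind] = A

# Variant 2: keep one indexing expression and move A's last axis to the front
B2 = np.zeros((K, M, N, P), dtype=int)
print(B2[1, :, :, ind].shape)  # (5, 3, 4)
B2[1, :, :, ind] = A.transpose(2, 0, 1)

# Both variants fill B identically
assert np.array_equal(B1, B2)
```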

numpy UnicodeDecodeError am I using the right approach with genfromtxt

I am stuck. I want to read a simple csv file into a Numpy array and seem to have dug myself into a hole. I am new to Numpy and I am SURE I have messed this up somehow as usually I can read CSV files easily in Python 3.4. I don't want to use Pandas so I thought I would use Numpy to increase my skillset but I really am not getting this at all. If someone could tell me if I am on the right track using genfromtxt OR is there an easier way and give me a nudge in the right direction I would be grateful.
I want to read in the CSV file manipulate the datetime column to 8/4/2014 then put it in a numpy array together with the remaining columns. Here is what I have so far and the error which I am having trouble coding around. I can get the date part way there but don't see how to add the date.strftime("%Y-%m-%d") to the datefunc. Also I don't see how to format the string for SYM to get round the error. Any help would be appreciated.
the data
2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99
2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75
2015-08-04 02:14:05.415193, AIG, 0.0080808151, 0.0073296055, 0.0076213535, 12.8278962785, 11.635388035, 12.0985236788, -9.2962105215, 3.980405659, -142.8175077335, 71, 42, 33
2015-08-04 02:14:05.486185, AMZN, 0.0235649449, 0.0305828226, 0.0092703502, 37.4081902773, 48.5487257749, 14.7162247572, 29.7810062852, -69.6877219282, -334.0005615016, 2, 92, 10
the "code" sorry still learning
import numpy as np
from datetime import datetime
from datetime import date,time
datefunc = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d %H:%M:%S.%f')
a = np.genfromtxt('/home/dave/Desktop/development/hvanal2016.csv',delimiter = ',',
converters = {0:datefunc},dtype='object,str,float,float,float,float,float,float,float,float,float,float,float,float',
names = ["date","sym","20sd","10sd","5sd","hv20","hv10","hv5","2010hv","105hv","abshv","2010rank","105rank","absrank"])
print(a["date"])
print(a["sym"])
print(a["20sd"])
print(a["hv20"])
print(a["absrank"])
the error
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "copyright", "credits" or "license()" for more information.
>>>
============================================================================== RESTART: /home/dave/3 9 15 my slope.py ===============================================================================
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
datetime.datetime(2015, 8, 4, 2, 14, 5, 325113)
datetime.datetime(2015, 8, 4, 2, 14, 5, 415193) ...,
datetime.datetime(2016, 3, 18, 1, 0, 25, 925754)
datetime.datetime(2016, 3, 18, 1, 0, 26, 26400)
datetime.datetime(2016, 3, 18, 1, 0, 26, 114828)]
Traceback (most recent call last):
File "/home/dave/3 9 15 my slope.py", line 19, in <module>
print(a["sym"])
File "/usr/lib/python3/dist-packages/numpy/core/numeric.py", line 1615, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 454, in array2string
separator, prefix, formatter=formatter)
File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 328, in _array2string
_summaryEdgeItems, summary_insert)[:-1]
File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 490, in _formatArray
word = format_function(a[i]) + separator
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
So part of your text is
b'2015-08-04 02:14:05.249392 AA 0.0193103612 ...'
(I'm using b because Py3 genfromtxt opens the file as bytestrings).
But you specify a , delimiter. I don't see any commas.
Let's just try a basic load, no fancy business.
In [97]: txt=b"""2015-08-04 02:14:05.249392 AA 0.0193103612 0.0193515212 0.0249713335 30.6542480634 30.7195875454 39.640763021 0.2131498442 29.0406746589 13524.5347810182 89 57 99
2015-08-04 02:14:05.325113 AAPL 0.0170506271 0.0137941891 0.0105915637 27.0670313481 21.8975963326 16.8135861893 -19.0986405157 -23.2172064279 21.5647072302 33 26 75
"""
In [98]: txt=txt.splitlines()
In [99]: data=np.genfromtxt(txt,dtype=None)
In [100]: data
Out[100]:
array([ (b'2015-08-04', b'02:14:05.249392', b'AA', 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99),
(b'2015-08-04', b'02:14:05.325113', b'AAPL', 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75)],
dtype=[('f0', 'S10'), ('f1', 'S15'), ('f2', 'S4'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')])
The datetime information is in 2 fields:
In [101]: data[['f0','f1']]
Out[101]:
array([(b'2015-08-04', b'02:14:05.249392'),
(b'2015-08-04', b'02:14:05.325113')],
dtype=[('f0', 'S10'), ('f1', 'S15')])
Your datefunction does work with a byte substring
In [102]: datefunc(b'2015-08-04 02:14:05.249392')
Out[102]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
But it requires 2 fields (as defined by the ' ' delimiter). So we need to figure out a way of parsing these 2 substrings as one, rather than split into two fields.
Maybe I'll try changing the sample txt to really use , delimiter (but not between date and time) and see what works.
With the , delimited text I get:
In [117]: data=np.genfromtxt(txt,delimiter=',',dtype=None,usecols=[0,1,2,3])
In [118]: data.dtype
Out[118]: dtype([('f0', 'S26'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
In [119]: data['f0']
Out[119]:
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'],
dtype='|S26')
In [120]: [datefunc(d) for d in data['f0']]
Out[120]:
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392),
datetime.datetime(2015, 8, 4, 2, 14, 5, 325113),
datetime.datetime(2015, 8, 4, 2, 14, 5, 415193),
datetime.datetime(2015, 8, 4, 2, 14, 5, 486185)]
I used usecols because the full text has 14 fields in the 1st line, and 13 in the others.
If I specify the dtype (instead of the easy None), I can replace the strings in the 1st field with these datetime objects:
In [122]: data=np.genfromtxt(txt,delimiter=',',dtype='O,S5,f,f',usecols=[0,1,2,3])
In [123]: data
Out[123]:
array([ (b'2015-08-04 02:14:05.249392', b' AA', 0.01931036077439785, 0.019351521506905556),
(b'2015-08-04 02:14:05.325113', b' AAPL', 0.01705062761902809, 0.01379418931901455),....],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
In [124]: data['f0']
Out[124]:
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'], dtype=object)
....
In [126]: data['f0']=[datefunc(d) for d in data['f0']]
In [127]: data
Out[127]:
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.01931036077439785, 0.019351521506905556),
(datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.01705062761902809, 0.01379418931901455),...],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
and with the converter, your call works (more or less)
In [133]: data=np.genfromtxt(txt,dtype='object,S5,float,float',
converters = {0:datefunc},delimiter=',',usecols=[0,1,2,3])
In [134]: data
Out[134]:
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
(datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
The numpy datetime64 type also works with this string. These types can be used as numpy numbers.
In [154]: datefunc(b'2015-08-04 02:14:05.249392')
Out[154]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
In [155]: np.datetime64(b'2015-08-04 02:14:05.249392')
Out[155]: numpy.datetime64('2015-08-04T02:14:05.249392-0700')
From this Importing csv into Numpy datetime64 I got this to work:
In [175]: data=np.genfromtxt(txt,dtype='M8[us],S5,float,float',
delimiter=',',usecols=[0,1,2,3])
In [176]: data
Out[176]:
array([ (datetime.datetime(2015, 8, 4, 9, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
(datetime.datetime(2015, 8, 4, 9, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...],
dtype=[('f0', '<M8[us]'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
See for datetime units: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units
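On the strftime part of the question: one possible way to reduce the timestamp to a plain date is sketched below. The helper name `date_only` is hypothetical, and whether genfromtxt accepts it as a converter depends on the dtype given to that field, which is not tested here. The second half shows the datetime64 route, where changing the unit to days drops the time of day.

```python
from datetime import datetime

import numpy as np


def date_only(raw: bytes) -> str:
    """Parse a bytes timestamp and return just the date as text."""
    parsed = datetime.strptime(raw.decode("utf-8"), "%Y-%m-%d %H:%M:%S.%f")
    return parsed.strftime("%Y-%m-%d")


print(date_only(b"2015-08-04 02:14:05.249392"))  # 2015-08-04

# datetime64 alternative: cast from microsecond to day resolution
stamp = np.datetime64("2015-08-04T02:14:05.249392")
day = stamp.astype("datetime64[D]")
print(day)  # 2015-08-04
```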

data Frame to dictionary

I can create a new dataframe based on the list of dicts. But how do I get the same list back from dataframe?
mylist=[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
import pandas as pd
df = pd.DataFrame(mylist)
The following will return the dictionary as per column and not row as shown in the example above.
In [18]: df.to_dict()
Out[18]:
{'month': {0: nan, 1: 'february', 2: 'january', 3: 'june'},
'points': {0: 50.0, 1: 25.0, 2: 90.0, 3: nan},
'points_h1': {0: nan, 1: nan, 2: nan, 3: 20.0},
'time': {0: '5:00', 1: '6:00', 2: '9:00', 3: nan},
'year': {0: 2010.0, 1: nan, 2: nan, 3: nan}}
df.to_dict(orient='records')
(Older pandas versions called this parameter outtype.)
Answer is from: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
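Note that the records come back with NaN for every key a row did not have. A minimal sketch of a round trip that drops those entries again (the name `cleaned` is just illustrative):

```python
import pandas as pd

mylist = [{'points': 50, 'time': '5:00', 'year': 2010},
          {'points': 25, 'time': '6:00', 'month': "february"},
          {'points': 90, 'time': '9:00', 'month': 'january'},
          {'points_h1': 20, 'month': 'june'}]

df = pd.DataFrame(mylist)

# One dict per row; missing keys show up as NaN
records = df.to_dict(orient='records')

# Drop the NaN entries to get back the keys each row originally had
cleaned = [{k: v for k, v in row.items() if pd.notnull(v)} for row in records]

print(cleaned)
```

The integers come back as floats (50 becomes 50.0) because columns containing NaN are stored as float; the values still compare equal to the originals.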