Reading text file with multiple blocks separated by # - pandas

I have a text file with multiple blocks separated by #. The number of rows in each block is different. I want to compute an integral of a variable for each block. The text file looks like the following:
# a b c
### grid 1
1 2 3
2 3 4
3 4 5
### grid 2
11 12 13
12 13 14
13 14 15
### grid 3
21 22 23
22 23 24
23 24 25
24 25 26
I want to integrate a*c for each block. Using block one as an example, the result should be 1*3 + 2*4 + 3*5. Any ideas how to implement this using numpy or pandas?

Once you have loaded a block into memory, you'll get an array like:
In [115]: arr = np.arange(1,4)+np.arange(0,3)[:,None]
In [116]: arr
Out[116]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])
The sum of products is then easy:
In [117]: np.dot(arr[:,0], arr[:,2])
Out[117]: 26
In [118]: 1*3+2*4+3*5
Out[118]: 26

I found an answer from @Fred Foo, which reads the file quite nicely.
from itertools import groupby
import numpy as np

def contains_data(ln):
    # just an example; there are smarter ways to do this
    return ln[0] not in "#\n"

with open("example") as f:
    datasets = [[ln.split() for ln in group]
                for has_data, group in groupby(f, contains_data)
                if has_data]

dim1 = len(datasets)
cooling_intgrl = np.zeros(dim1)
for i in range(dim1):
    block = np.array(datasets[i]).astype(float)
    length = block[:, 0]
    cooling = block[:, 2]
    result = np.dot(length, cooling)
    cooling_intgrl[i] = result
This works very well for me.
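For the sample file above (saved as "example"), this should give one value per block:
print(cooling_intgrl)   # expected: 26.0, 506.0 and 2210.0 for grids 1-3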

Related

Why this inconsistency between a Dataframe and a column of it?

When debugging a nasty error in my code I came across something that looks like an inconsistency in the way DataFrames work (using pandas 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a DataFrame):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
    d   k  c1  c2
0   0  11  22  33
1  10  11  22  33
2  20  11  22  33
3  30  11  22  33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
There are a few things at work here:
A pandas series can store objects of arbitrary types.
y['d'] = _ adds a new object to the series y with name 'd'.
Thus, y['d'] = df['d'] adds a new entry to the series y with name 'd' whose value is the series df['d'].
So you have added a series as the last entry of the series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], and df.loc[:, 'k'] all return the series 'view' of column k, so adding an entry to that series appends it directly to this view. However, df.k shows the entire series, whereas df only shows the series up to a maximum length of df.shape[0]. Hence the inconsistent behavior.
I agree that this behavior is prone to bugs and should be fixed. View vs. copy is a common cause of many issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
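If the goal is simply to avoid mutating df through that view, a minimal sketch (not part of the original post) is to take an explicit copy of the column first:
y = df['k'].copy()   # an independent Series, no longer a view into df
y['d'] = df['d']     # y grows, but df.shape stays (4, 4)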

How to assign multiple values to a key in a dictionary, multiple times using pandas [duplicate]

This question already has answers here:
GroupBy results to dictionary of lists
(2 answers)
Closed 3 years ago.
I have read an Excel file with pandas.
I want to loop over 4 values in 4 consecutive rows and assign those 4 values as a list to a certain key in a dictionary. When the first 4 values are assigned to the first key, I want to assign the next 4 values to the second key, and so on.
s t values_I_need
AT 1 123
AT 2 21
AT 3 1
AT 4 34
BT 1 34
BT 2 34
BT 3 213
BT 4 12
CE 1 23
CE 2 45
CE 3 234
CE 4 23
#and so on...
The output I want to see is a dictionary y = {'AT': [123, 21, 1, 34], 'BT': [34, 34, 213, 12], 'CE': [23, 45, 234, 23]}.
I tried the following, but it only returns empty lists assigned to the keys, and the dictionary doesn't even contain all the keys from the Excel sheet.
import pandas as pd

df = pd.read_excel("Inputdaten.xlsx", sheetname="dummy")
y = {}
lst = []
t = 0
z = 4
for row in df.itertuples():
    for i in df.iloc[t:z, 2]:
        lst.append(i)
        if len(lst) == 4:
            t = t + 4
            z = z + 4
            y[row.s] = lst
            lst[:] = []
            break
What am I missing? Or is there a smarter way to code it without for-loops?
Thanks for any help in advance :)
You could simply do the following:
result = df.groupby("s")["values_I_need"].apply(list).to_dict()
Output:
{'AT': [123, 21, 1, 34], 'BT': [34, 34, 213, 12], 'CE': [23, 45, 234, 23]}
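A minimal self-contained check, rebuilding the sample data in a DataFrame rather than reading the Excel file (column values copied from the question):
import pandas as pd

df = pd.DataFrame({
    "s": ["AT"] * 4 + ["BT"] * 4 + ["CE"] * 4,
    "t": [1, 2, 3, 4] * 3,
    "values_I_need": [123, 21, 1, 34, 34, 34, 213, 12, 23, 45, 234, 23],
})
y = df.groupby("s")["values_I_need"].apply(list).to_dict()
# {'AT': [123, 21, 1, 34], 'BT': [34, 34, 213, 12], 'CE': [23, 45, 234, 23]}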

how to convert a pandas column containing list into dataframe

I have a pandas dataframe.
One of its columns contains a list of 60 elements, constant across its rows.
How do I convert each of these lists into a row of a new dataframe?
Just to be clearer: say A is the original dataframe with n rows. One of its columns contains a list of 60 elements.
I need to create a new dataframe of shape n x 60.
My tentative attempt:
def expand(x):
    return pd.DataFrame(np.array(x).reshape(-1, len(x)))

df["col"].apply(lambda x: expand(x))
It gives funny results...
The weird thing is that if I call the function "expand" on a single row, it does exactly what I expect from it:
expand(df["col"][0])
To ChootsMagoots: This is the result when I try to apply your suggestion. It does not work.
Sample data
df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df
Output:
col
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11, 12, 13, 14]
3 [15, 16, 17, 18, 19]
Now extract a DataFrame from col:
df.col.apply(pd.Series)
Output:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
Try this:
new_df = pd.DataFrame(df["col"].tolist())
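With the sample df built above, a quick sketch of what this returns (an n x 5 frame here; n x 60 in the original case):
new_df = pd.DataFrame(df["col"].tolist())
#     0   1   2   3   4
# 0   0   1   2   3   4
# 1   5   6   7   8   9
# 2  10  11  12  13  14
# 3  15  16  17  18  19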
This is a little frankensteinish, but you could also try:
import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv', header=None)  # header=None keeps the first row as data
You can try this as well:
newCol = pd.Series(yourList)
df['colD'] = newCol.values
The above code:
1. Creates a pandas Series.
2. Assigns the Series values as a new column of the original dataframe.

Saving with numpy savetxt. Array elements as columns

I am pretty new to Python and trying to kick my Matlab addiction. I am converting a lot of my lab's machine vision code over to Python, but I am just stuck on one aspect of the saving. At each iteration of the tracking loop we save 6 variables in an array. I'd like these to be entered as 6 columns of a txt file with numpy.savetxt. Each iteration of the tracking loop would then add similar variables for that given frame as the next row of the txt file.
But I keep getting a single column that just grows with every loop. I've attached a simple example to show my problem. As it loops through, a variable called output is generated. I would like this to be the three columns of the txt file and each iteration of the loop to be a new row. Is there an easy way to do this?
import numpy as np

dataFile_Path = "dataFile.txt"
dataFile_id = open(dataFile_Path, 'w+')
for x in range(0, 9):
    variable = np.array([2, 3, 4])
    output = x*variable + 1
    output.astype(float)
    print(output)
    np.savetxt(dataFile_id, output, fmt="%d")
dataFile_id.close()
In [160]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     output.astype(float)
     ...:     print(output)
     ...:
[1 1 1]
[3 4 5]
[5 7 9]
[ 7 10 13]
[ 9 13 17]
[11 16 21]
[13 19 25]
[15 22 29]
[17 25 33]
So you are writing one row at a time. savetxt normally is used to write a 2d array.
Notice that the print still shows integers - astype returns a new array, it does not change things in place.
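To actually keep the float version you would have to rebind the result, e.g. (not in the original code):
output = output.astype(float)   # reassign; astype does not modify the array in place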
But because you are giving it 1d arrays it writes those as columns:
In [177]: f = open('txt','bw+')
In [178]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     np.savetxt(f, output, fmt='%d')
     ...:
In [179]: f.close()
In [180]: cat txt
1
1
1
3
4
5
5
7
9
If instead I give savetxt a 2d array ((1,3) shape), it writes:
In [181]: f = open('txt','bw+')
In [182]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     np.savetxt(f, [output], fmt='%d')
     ...:
     ...:
In [183]: f.close()
In [184]: cat txt
1 1 1
3 4 5
5 7 9
7 10 13
9 13 17
11 16 21
13 19 25
15 22 29
17 25 33
But a better approach is to construct the 2d array, and write that with one savetxt call:
In [185]: output = np.array([2,3,4])*np.arange(9)[:,None]+1
In [186]: output
Out[186]:
array([[ 1,  1,  1],
       [ 3,  4,  5],
       [ 5,  7,  9],
       [ 7, 10, 13],
       [ 9, 13, 17],
       [11, 16, 21],
       [13, 19, 25],
       [15, 22, 29],
       [17, 25, 33]])
In [187]: np.savetxt('txt', output, fmt='%10d')
In [188]: cat txt
         1          1          1
         3          4          5
         5          7          9
         7         10         13
         9         13         17
        11         16         21
        13         19         25
        15         22         29
        17         25         33
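For the original tracking-loop use case, a small sketch along the same lines (variable names are illustrative, not from the question): collect one row per frame in a list, stack once, and write with a single savetxt call:
import numpy as np

rows = []
for x in range(0, 9):              # stand-in for the tracking loop
    variable = np.array([2, 3, 4])
    output = x * variable + 1      # the per-frame values (one row)
    rows.append(output)

data = np.vstack(rows)             # shape (9, 3): one row per frame
np.savetxt("dataFile.txt", data, fmt="%d")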

sorting within the keys of group by

I have a groupby table as follows. I want to sort by index within the keys ['CPUCore', 'Offline_RetetionAge'] (I need to keep the structure of ['CPUCore', 'Offline_RetetionAge']). How should I do this?
I think the problem is that the dtype of your second level is object, which is obviously string, so if you use sort_index it sorts alphanumerically:
import pandas as pd

df = pd.DataFrame({'CPUCore': [2, 2, 2, 3, 3],
                   'Offline_RetetionAge': ['100', '1', '12', '120', '15'],
                   'index': [11, 16, 5, 4, 3]}).set_index(['CPUCore', 'Offline_RetetionAge'])
print (df)
                             index
CPUCore Offline_RetetionAge
2       100                     11
        1                       16
        12                       5
3       120                      4
        15                       3
print (df.index.get_level_values('Offline_RetetionAge').dtype)
object
print (df.sort_index())
                             index
CPUCore Offline_RetetionAge
2       1                       16
        100                     11
        12                       5
3       120                      4
        15                       3
# change the multiindex - cast level Offline_RetetionAge to int
new_index = list(zip(df.index.get_level_values('CPUCore'),
                     df.index.get_level_values('Offline_RetetionAge').astype(int)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
print (df.sort_index())
                             index
CPUCore Offline_RetetionAge
2       1                       16
        12                       5
        100                     11
3       15                       3
        120                      4
EDIT by comment:
print (df.reset_index()
         .sort_values(['CPUCore','index'])
         .set_index(['CPUCore','Offline_RetetionAge']))
                             index
CPUCore Offline_RetetionAge
2       12                       5
        100                     11
        1                       16
3       15                       3
        120                      4
I think what you mean is this:
import pandas as pd
from pandas import Series, DataFrame

# create what I believe you tried to ask
df = DataFrame([[11, 'reproducible'], [16, 'example'], [5, 'a'],
                [4, 'create'], [9, '!']])
df.columns = ['index', 'bla']
df.index = pd.MultiIndex.from_arrays([[2]*4 + [3], [10, 100, 1000, 11, 512]],
                                     names=['CPUCore', 'Offline_RetentionAge'])

# sort by values and afterwards by index; sort_remaining=False preserves
# the order of the index within each group
df = df.sort_values('index').sort_index(level=0, sort_remaining=False)
print(df)
The sort_values call sorts the rows by the 'index' column, and sort_index then restores the grouping by the multiindex without changing the order of 'index' among rows with the same CPUCore.
I don't know what a "group by table" is supposed to be. If you have a pd.GroupBy object, you won't be able to use sort_values() like that.
You might have to rethink what you group by, or use functools.partial and DataFrame.apply.
Output:
                              index           bla
CPUCore Offline_RetentionAge
2       11                        4        create
        1000                      5             a
        10                       11  reproducible
        100                      16       example
3       512                       9             !