`y` argument is not supported when using dataset as input - KerasTuner BatchDataset (TensorFlow 2.0)

Are KerasTuner and BatchDataset compatible?
Hi all, I am trying to combine a batched tf.data dataset with KerasTuner's BayesianOptimization in TensorFlow 2.
X is
<BatchDataset shapes: ({Sales: (None,), Quantity: (None,), Discount: (None,), Profit: (None,)}, (None,)), types: ({Sales: tf.float64, Quantity: tf.int64, Discount: tf.float64, Profit: tf.float64}, tf.int32)>
Y is
<BatchDataset shapes: (None,), types: tf.int32>
The tensors are batched with a batch size of 32. When passing these into tuner.search() I receive the error:
`y` argument is not supported when using dataset as input.
Value of Y:
tf.Tensor([1 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 2 2 3 3 1 3 3 3 3 0 1 3 1 0 3 1], shape=(32,), dtype=int32)
tf.Tensor([3 1 3 3 3 3 3 1 1 3 3 0 3 1 0 3 3 1 0 2 1 3 3 1 1 0 3 3 3 3 3 1], shape=(32,), dtype=int32)
tf.Tensor([3 3 0 1 3 3 3 1 3 0 3 3 3 1 3 3 3 3 0 3 1 3 3 0 3 3 3 3 1 3 0 3], shape=(32,), dtype=int32)
tf.Tensor([1 3 1 0 3 3 3 3 3 1 3 3 3 1 3 3 1 3 3 3 3 3 3 3 3 3 3 0 1 3 3 3], shape=(32,), dtype=int32)
tf.Tensor([1 3 1 3 3 3 1 3 1 1 3 1 1 1 3 3 1 3 0 3 3 1 1 3 0 1 3 3 1 3 3 3], shape=(32,), dtype=int32)
Actual Function
import numpy as np
import tensorflow as tf

def df_to_training_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop(COLUMN_LABEL)
    # This dataset already yields (features, labels) tuples
    dataset = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    dataset = dataset.batch(batch_size)
    # Rebuild a separate labels-only dataset from the batched pairs
    labels = np.concatenate([y for x, y in dataset], axis=0)
    labels = tf.data.Dataset.from_tensor_slices(labels)
    labels = labels.batch(batch_size)
    return dataset, labels
Calling Function
training_batch_size = training_dataframe.shape[0]
training_dataset,training_labels = df_to_training_dataset(training_dataframe, training_batch_size)
validation_batch_size = validation_dataframe.shape[0]
validation_dataset,validation_labels = df_to_training_dataset(validation_dataframe, validation_batch_size)
Failing Call
tuner.search(x=training_dataset, y=training_labels, validation_data=(validation_dataset, validation_labels), epochs=2)
Getting Error
`y` argument is not supported when using dataset as input.
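Keras raises this error whenever a y is passed together with a tf.data.Dataset: the dataset returned by df_to_training_dataset above already yields (features, labels) tuples, so the separate labels dataset is redundant. A minimal sketch of the call shape Keras expects in that case (my suggestion, assuming the tuner and datasets defined above). Note also that in the calling code, training_batch_size is passed positionally and therefore lands on the shuffle parameter rather than batch_size, which may not be intended.

tuner.search(
    x=training_dataset,                   # already yields (features, labels) pairs
    validation_data=validation_dataset,   # likewise a (features, labels) dataset
    epochs=2,
)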

Related

splitting the file and rearrangement

My text file contains data on temperature variation:
#
1 2
2 4
3 4
#
6 1
3 2
1 7
I want the column values to be split at each # and new files generated by appending the split blocks column-wise.
expected output1
1 6
2 3
3 1
expected output2
2 1
4 2
4 7
This is a more complex problem than it seems at first glance; I had to look carefully at the example output to see what was going on.
Simulated text file:
import io
import pandas as pd

sim_txt = io.StringIO('''
#
1 2
2 4
3 4
#
6 1
3 2
1 7
''')
df = pd.read_csv(sim_txt, sep=r'\s+', header=None, names=[0, 1])
df_out = (
    df.assign(out=df[0].str.contains('#').cumsum())  # label each '#'-delimited block
      .pivot(columns=['out'])                        # one column group per block
      .apply(lambda x: x.shift(-len(x)//2) if x.name[1] == 2 else x)  # move block 2 up beside block 1
      .dropna().astype(int)
)
print(df_out)
     0     1
out  1  2  1  2
1    1  6  2  1
2    2  3  4  2
3    3  1  4  7
Then save to individual files:
for c in df_out.columns.get_level_values(0).unique():
    df_out.loc[1:, c].to_csv(fr'd:\jchtempnew\SO\output{c+1}.txt',
                             index=None, header=None, sep=' ')
output1
1 6
2 3
3 1
output2
2 1
4 2
4 7
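For comparison, here is a minimal plain-Python sketch of the same split-and-append idea; this is my addition rather than part of the original answer, and the file names input.txt/output1.txt are assumptions:

# Split the file into blocks at '#', then join the blocks column-wise.
blocks, current = [], []
with open('input.txt') as f:
    for line in f:
        line = line.strip()
        if line == '#':
            if current:
                blocks.append(current)
            current = []
        elif line:
            current.append(line.split())
if current:
    blocks.append(current)

# Output file k takes column k from every block, row by row.
for k in range(len(blocks[0][0])):
    with open(f'output{k+1}.txt', 'w') as out:
        for rows in zip(*blocks):
            out.write(' '.join(r[k] for r in rows) + '\n')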

Insert a level 0 in the existing data frame such that 4 columns are grouped as one

I want to add a MultiIndex to my data frame such that MAE, MSE, RMSE, MPE are grouped together under a new index level. Similarly, each remaining set of four columns should be grouped at the same level but under a different name.
mux3 = pd.MultiIndex.from_product([list('ABCD'), list('1234')],
                                  names=['one', 'two'])  # dummy data
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)  # dummy data frame
print(df3)  # intended output required for the data frame in the picture given below
Assuming the column groups are already in the appropriate order, we can simply create an np.arange over the length of the columns, floor-divide by 4 to get the groups, and build a MultiIndex.from_arrays.
Sample Input and Output:
import numpy as np
import pandas as pd
initial_index = [1, 2, 3, 4] * 3
np.random.seed(5)
df3 = pd.DataFrame(
np.random.choice(10, (3, len(initial_index))), columns=initial_index
)
   1  2  3  4  1  2  3  4  1  2  3  4   # Column headers are in repeating order
0  3  6  6  0  9  8  4  7  0  0  7  1
1  5  7  0  1  4  6  2  9  9  9  9  1
2  2  7  0  5  0  0  4  4  9  3  2  4
# Create new columns
df3.columns = pd.MultiIndex.from_arrays([
    np.arange(len(df3.columns)) // 4,  # Group each set of 4 columns together
    df3.columns                        # Keep level 1 the same as the current columns
], names=['one', 'two'])  # Set names (optional)
df3
df3
one  0           1           2
two  1  2  3  4  1  2  3  4  1  2  3  4
0    3  6  6  0  9  8  4  7  0  0  7  1
1    5  7  0  1  4  6  2  9  9  9  9  1
2    2  7  0  5  0  0  4  4  9  3  2  4
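With the MultiIndex in place, each four-column block can be selected by its level-0 key; a small usage sketch (my addition, using the frame built above):

df3[0]   # first block of four columns
# two  1  2  3  4
# 0    3  6  6  0
# 1    5  7  0  1
# 2    2  7  0  5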
If columns are in mixed order:
np.random.seed(5)
df3 = pd.DataFrame(
np.random.choice(10, (3, 8)), columns=[1, 1, 3, 2, 4, 3, 2, 4]
)
df3
   1  1  3  2  4  3  2  4   # Cannot select groups positionally
0  3  6  6  0  9  8  4  7
1  0  0  7  1  5  7  0  1
2  4  6  2  9  9  9  9  1
We can convert the columns with Index.to_series, then enumerate them using groupby cumcount, then sort_index if needed to get them in order:
df3.columns = pd.MultiIndex.from_arrays([
    # Enumerate groups to create the new level-0 index
    df3.columns.to_series().groupby(df3.columns).cumcount(),
    df3.columns
], names=['one', 'two'])  # Set names (optional)

# Sort to order correctly
# (do not sort before setting the columns; it would break alignment with the data)
df3 = df3.sort_index(axis=1)
df3
one  0           1
two  1  2  3  4  1  2  3  4   # Notice the data has moved with the headers
0    3  0  6  9  6  4  8  7
1    0  1  7  5  0  0  7  1
2    4  9  2  9  6  9  9  1

Convert multichannel image into pixelwise pandas dataframe

If you have a multiband image of, say, dimensions 1024 * 1024 * 200 (columns * lines * bands) and want to convert that to a pandas dataframe of the form:
           Band  Value
1             1   0.14
2             1   1.18
3             1   2.56
...
209715198   200   1.01
209715199   200   1.15
209715200   200   2.00
So basically all pixels in sequential form, with the band number (or wavelength) and the pixel value as columns.
Is there a clever and efficient way of doing this without a lot of loops, appending to arrays and so on?
Answer
You can do it with numpy. I'll try my best to walk you through it below. First you need the input images in a 3D numpy array. I'm just going to use a randomly generated small one for illustration. This is the full code, with an explanation below.
import numpy as np
import pandas as pd

images = np.random.randint(0, 9, (2, 5, 5))
z, y, x = images.shape  # 2, 5, 5 (200, 1024, 1024 for your example)
arr = np.column_stack((np.repeat(np.arange(z), y*x), images.ravel()))
df = pd.DataFrame(arr, columns=['Bands', 'Value'])
Explanation
The images output array looks like this (basically 2 images at 5x5 pixels):
[[[5 2 3 6 2]
[6 1 6 3 2]
[8 3 2 2 1]
[5 1 2 6 0]
[3 4 7 0 2]]
[[1 7 0 7 3]
[7 4 5 4 3]
[1 5 4 7 4]
[2 0 2 7 2]
[7 0 1 6 7]]]
The next step is to use np.ravel() to flatten it, which outputs your required Value column:
# images.ravel()
[5 2 3 6 2 6 1 6 3 2 8 3 2 2 1 5 1 2 6 0 3 4 7 0 2 1 7 0 7 3 7 4 5 4 3 1 5
 4 7 4 2 0 2 7 2 7 0 1 6 7]
To create the Band column, you need to repeat each z value x*y times. You can do this with np.repeat() and np.arange(), which gives you a 1D array:
# np.repeat(np.arange(z), y*x)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
This is the required Band column. To combine them, use np.column_stack() and then turn the result into a dataframe. All of the above steps combined would be:
arr = np.column_stack((np.repeat(np.arange(z), y*x), images.ravel()))
df = pd.DataFrame(arr, columns=['Bands', 'Value'])
Which will output:
    Bands  Value
0       0      5
1       0      2
2       0      3
3       0      6
4       0      2
5       0      6
6       0      1
7       0      6
8       0      3
9       0      2
10      0      8
11      0      3
12      0      2
13      0      2
14      0      1
15      0      5
16      0      1
17      0      2
18      0      6
19      0      0
20      0      3
21      0      4
22      0      7
23      0      0
24      0      2
25      1      1
26      1      7
27      1      0
...
As required. I hope this at least gets you moving in the right direction.
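One small mismatch with the question's target table (my observation): the construction above numbers bands and rows from 0, while the question's example starts both at 1. A one-line adjustment, assuming the df built above:

# Shift to the 1-based numbering shown in the question's example
df['Bands'] = df['Bands'] + 1
df.index = df.index + 1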

How to take each pair of values in a column and transpose them into rows inside groups (groupby)

I have data to which I have already applied group-by-user and sort-by-time (data.groupby('id').apply(lambda x: x.sort_values('time'))):
user time point_id
1 00:00 1
1 00:01 3
1 00:02 4
1 00:03 2
2 00:00 1
2 00:05 3
2 00:15 1
3 00:00 1
3 01:00 2
3 02:00 3
From that, inside each group, I need to turn each pair of consecutive values into a link row. It should look like this for the example above:
user start_point end_point
1 1 3
1 3 4
1 4 2
2 1 3
2 3 1
3 1 2
3 2 3
My final goal is to get matrix which will show how many links come into each point:
point_id |  1 |  2 |  3 |  4
---------+----+----+----+----
       1 |  0 |  1 |  3 |  0
       2 |  1 |  0 |  0 |  1
       3 |  3 |  0 |  0 |  1
       4 |  0 |  1 |  1 |  0
So this matrix means that one link goes from point 2 to point 1, three links go from point 3 to point 1, and so on.
First, you can use shift() to pair each point_id with the next one in a row.
df = (df.assign(end_point=df['point_id'].shift(-1))   # pair each point with the next one
        [df['user'] == df['user'].shift(-1)]          # keep only pairs within the same user
        .drop(columns='time')                         # drop the non-numeric time column before astype
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
print(df)
   user  start_point  end_point
0     1            1          3
1     1            3          4
2     1            4          2
4     2            1          3
5     2            3          1
7     3            1          2
8     3            2          3
Then you can use pd.crosstab to count the directed links.
u = pd.crosstab(df.start_point, df.end_point)
print(u)
end_point    1  2  3  4
start_point
1            0  1  2  0
2            0  0  1  0
3            1  0  0  1
4            0  1  0  0
According to your results, what you need is undirected graph counting, so all we need to do is transpose and add.
result = u + u.T
print(result)
end_point    1  2  3  4
start_point
1            0  1  3  0
2            1  0  1  1
3            3  1  0  1
4            0  1  1  0
Final code as follows:
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .drop(columns='time')
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
u = pd.crosstab(df.start_point, df.end_point)
result = u + u.T
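One caveat worth adding (my note, not from the original answer): u + u.T only lines up when every point occurs both as a start and as an end, so that the crosstab is square. A hedged guard that reindexes over the union of labels first:

# Make the crosstab square over all point labels before adding its transpose
labels = sorted(set(df['start_point']) | set(df['end_point']))
u = pd.crosstab(df['start_point'], df['end_point'])
u = u.reindex(index=labels, columns=labels, fill_value=0)
result = u + u.T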
I believe this works for your example, taking df = data.groupby('id').apply(lambda x: x.sort_values('time')) as the starting point:
groups = [(k, df.loc[v, 'point_id'].values) for k, v in df.groupby('user').groups.items()]
res = []
for g in groups:
    res.append([(g[0], i) for i in zip(g[1], g[1][1:])])
df1 = pd.DataFrame([item for sublist in res for item in sublist])
df2 = df1.copy()
df2.iloc[:, -1] = df2.iloc[:, -1].apply(lambda x: (x[1], x[0]))  # df2 swaps the points around
df_ = pd.concat([df1, df2]).sort_values(by=0)
df_['1'], df_['2'] = df_.iloc[:, -1].apply(lambda x: x[0]), df_.iloc[:, -1].apply(lambda x: x[1])
df_ = df_.drop(columns=1)
df_.columns = ['user', 'start_point', 'end_point']  # your intermediate table (both directions)
df_.pivot_table(index='start_point', columns='end_point', aggfunc='count').fillna(0)
Output:
            user
end_point      1    2    3    4
start_point
1            0.0  1.0  3.0  0.0
2            1.0  0.0  1.0  1.0
3            3.0  1.0  0.0  1.0
4            0.0  1.0  1.0  0.0

pandas convert lists in multiple columns within DataFrame to separate columns

I am trying to convert a list within multiple columns of a pandas DataFrame into separate columns.
Say, I have a dataframe like this:
0 1
0 [1, 2, 3] [4, 5, 6]
1 [1, 2, 3] [4, 5, 6]
2 [1, 2, 3] [4, 5, 6]
And would like to convert it to something like this:
0 1 2 0 1 2
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
I have managed to do this in a loop. However, I would like to do this in fewer lines.
My code snippet so far is as follows:
import pandas as pd
df = pd.DataFrame([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])
output1 = df[0].apply(pd.Series)
output2 = df[1].apply(pd.Series)
output = pd.concat([output1, output2], axis=1)
If you don't care about the column names you could do:
>>> import numpy as np
>>> df.apply(np.hstack, axis=1).apply(pd.Series)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
Using sum
pd.DataFrame(df.sum(1).tolist())
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
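If the duplicate column labels from the desired output matter, a hedged alternative (my sketch, not from the original answers) expands each list column separately and concatenates, so each block keeps its own 0/1/2 sub-column labels:

# Expand each list column on its own, then concatenate side by side;
# pd.DataFrame(df[c].tolist()) labels the expanded columns 0, 1, 2.
out = pd.concat([pd.DataFrame(df[c].tolist()) for c in df.columns], axis=1)
print(out)
#    0  1  2  0  1  2
# 0  1  2  3  4  5  6
# 1  1  2  3  4  5  6
# 2  1  2  3  4  5  6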