reshape numpy 2d array into pandas 1d - pandas

I have a numpy array as follows:
a.shape = (100, 500)
I would like to transform it into a pandas DataFrame such that
df.shape = (100 * 500, 1)
df[500*i+j, 0] = a[i, j]
without a loop.

I'm sure I'm missing something, but isn't it a simple flattening?
df = pd.DataFrame(a.flatten())
If I misunderstood what you mean by i and j, a transpose should do:
df = pd.DataFrame(a.T.flatten())
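As a quick sanity check of the index mapping (a toy-sized sketch; C-order flattening places a[i, j] at flat position i * ncols + j, which is exactly the 500*i + j from the question):
import numpy as np
import pandas as pd

a = np.arange(6).reshape(2, 3)   # small stand-in for the (100, 500) array
df = pd.DataFrame(a.flatten())   # C order: a[i, j] lands at row i * a.shape[1] + j

i, j = 1, 2
assert df.iloc[a.shape[1] * i + j, 0] == a[i, j]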

Related

How to apply scipy.signal.decimate to a dataframe to prevent aliasing before downsampling the timeseries

I have made a few modifications to your question to make it more appropriate for StackOverflow. Here's the updated version:
I have two pandas DataFrames, dfb and dfv, where dfb has a higher sampling rate than dfv. I want to downsample dfb to align it with dfv. However, I am aware that I need to apply a low-pass filter to avoid aliasing. Can you suggest any improvements to the following function? What is the best way to apply a low-pass filter to a DataFrame before downsampling a time series?
import pandas as pd
from scipy.signal import decimate

# Generate example dataframes: dfb is sampled every 2 s, dfv every 10 s, over the same span
dfb = pd.DataFrame({'a': range(0, 200, 2), 'b': range(0, 200, 2)},
                   index=pd.date_range('2022-01-01', periods=100, freq='2s'))
dfv = pd.DataFrame({'c': range(0, 100, 5), 'd': range(0, 100, 5)},
                   index=pd.date_range('2022-01-01', periods=20, freq='10s'))

def downsample_dataframe(dfb, dfv, filter_order=4):
    freq_dfb = pd.infer_freq(dfb.index)
    freq_dfv = pd.infer_freq(dfv.index)
    # Decimation factor: the slower (target) period divided by the faster (source) period
    q = int(pd.to_timedelta(freq_dfv).total_seconds() / pd.to_timedelta(freq_dfb).total_seconds())
    dfb_downsampled = pd.DataFrame()
    for column_name in dfb.columns:
        signal = dfb[column_name]
        signal = decimate(signal, q, zero_phase=True, axis=0, n=filter_order)
        dfb_downsampled[column_name] = signal
    dfb_downsampled.index = dfv.index
    return dfb_downsampled
Are there any suggestions on how to improve this function? Is there a recommended method for aligning two time series by first applying a low-pass filter?
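For what it's worth, one possible simplification (a sketch, assuming both indices have inferable, evenly spaced frequencies whose ratio is an integer): scipy.signal.decimate accepts 2-D input, so the per-column loop can be replaced with a single call. The function name here is made up for illustration.
import pandas as pd
from scipy.signal import decimate

def downsample_dataframe_vectorized(dfb, dfv, filter_order=4):
    # Decimation factor: target (slower) period divided by source (faster) period
    q = int(pd.to_timedelta(pd.infer_freq(dfv.index)).total_seconds()
            / pd.to_timedelta(pd.infer_freq(dfb.index)).total_seconds())
    # decimate low-pass filters and downsamples all columns at once along axis 0
    data = decimate(dfb.to_numpy(), q, n=filter_order, zero_phase=True, axis=0)
    return pd.DataFrame(data, columns=dfb.columns, index=dfv.index)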

Slice dataframe and put slices into new columns

I have a big dataframe with 1 million rows of time series data. I want to slice it into smaller chunks of 1000 rows each, which gives me 1000 chunks, and I need every chunk to be copied into a column of a new dataframe.
I am currently doing this, which does the job but might be inefficient. I would still be happy if people could help:
chunk_size = 1000
i = 0
df_all = pd.DataFrame()  # collects one column per chunk
for df_split in np.array_split(df, len(df) // chunk_size):
    # print(df_split['random_nos'].mean())
    i = i + 1
    df_split = df_split.reset_index(drop=True)
    df_split = df_split.rename({'random_nos': 'String' + str(i)}, axis=1)
    df_all = pd.concat([df_all, df_split], axis=1)
Will do. Now that I can slice my dataframe, I run into the next problem. In my time series, I want to slice around 16000 events, each with a duration of 2144.14 samples. If I slice at 2144 or 2145, the slices become displaced more and more. I tried the following, but it didn't work:
def slice_df_into_chunks(df_size, chunk_size):
    df_list = []
    chunk_size2 = chunk_size
    for i, df_split in enumerate(np.array_split(df3, chunk_size2)):
        df_split = df_split.rename(columns={'ChannelA01': f'Slice{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
        if i % 6 == 0:
            chunk_size2 = chunk_size - 1
            print(i)
        else:
            chunk_size2 = chunk_size
    return pd.concat(df_list, axis=1)

df4 = slice_df_into_chunks(len(df3), np.floor(EventsPerLog))
I thought about solving this issue with something like the following (it takes ages), so that every now and then the chunk size is smaller. After I have defined the groups, I can cast them into dataframe columns:
for i in range(40):
    df.loc[i * SamplesEvent:(i + 1) * SamplesEvent2, 'Group'] = i
    if i % 6 == 0:
        SamplesEvent2 = SamplesEvent2 - 1
        print(i)
    else:
        SamplesEvent2 = 5
You could use numpy.array_split to achieve this:
import pandas as pd
import numpy as np

def create_df_with_slices_in_cols(df, no_of_cols):
    # df = pd.DataFrame(np.random.rand(df_size), columns=['random_nos'])
    df_list = []
    for i, df_split in enumerate(np.array_split(df, no_of_cols)):
        df_split = df_split.rename(columns={'random_nos': f'Slice{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
    return pd.concat(df_list, axis=1)

create_df_with_slices_in_cols(pd.DataFrame(np.random.rand(10**6), columns=['random_nos']),
                              no_of_cols=10**3)
Note that if len(df) is not exactly divisible by no_of_cols (e.g. 10 and 3), then one column will have extra numbers:
create_df_with_slices_in_cols(pd.DataFrame(np.random.rand(10), columns=['random_nos']), no_of_cols=3)
Slice0 Slice1 Slice2
0 0.955620 0.543234 0.509360
1 0.755157 0.174576 0.267600
2 0.816509 0.776549 0.455464
3 0.990282 NaN NaN
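As an aside, when len(df) is an exact multiple of the chunk size, the concat loop can be skipped entirely by reshaping the underlying array (a minimal sketch, assuming the single 'random_nos' column from above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10**6), columns=['random_nos'])
chunk_size = 1000
values = df['random_nos'].to_numpy()
# Each row of the reshaped array is one chunk; transposing puts the chunks into columns
chunked = pd.DataFrame(values.reshape(-1, chunk_size).T,
                       columns=[f'Slice{i}' for i in range(len(values) // chunk_size)])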
Update
To minimize the displacement of data when using a column size (say n) that doesn't divide len(df) exactly, you can consider only the first n-1 columns, then create one final column with the remaining rows:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(16000), columns=['random_nos'])
df_size = len(df)
slice_size = 2144.14
no_of_cols = int(df_size // slice_size + 1)

def create_df_with_slices_in_cols(df, no_of_cols):
    df_list = []
    for i, df_split in enumerate(np.array_split(df, no_of_cols)):
        df_split = df_split.rename(columns={'random_nos': f'Slice{i}'})
        df_split.reset_index(drop=True, inplace=True)
        df_list.append(df_split)
    return pd.concat(df_list, axis=1)

# fill out the first 7 (n-1) columns first
rows_first_pass = int((no_of_cols - 1) * np.floor(slice_size))
df_combined = create_df_with_slices_in_cols(df[:rows_first_pass], no_of_cols - 1)

# fill out the 8th (nth) column using the remaining rows
rem_rows = df_size - rows_first_pass
last_df = df[-rem_rows:].rename(columns={'random_nos': f'Slice{no_of_cols-1}'})
last_df.reset_index(drop=True, inplace=True)
df_combined = pd.concat([df_combined, last_df], axis=1)
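To avoid the cumulative drift from the fractional slice length entirely, another option is to round each ideal boundary k * 2144.14 to the nearest integer row, so no cut ends up more than half a sample from where it should be. A minimal sketch (the helper name is made up; the column name follows the example above):
import numpy as np
import pandas as pd

def slice_at_fractional_period(df, period, col='random_nos'):
    n_slices = int(len(df) // period)
    # Ideal fractional boundaries, rounded to integer row positions
    bounds = np.round(np.arange(1, n_slices) * period).astype(int)
    pieces = np.split(df[col].to_numpy(), bounds)
    # Pieces of unequal length are padded with NaN by the DataFrame constructor
    return pd.DataFrame({f'Slice{i}': pd.Series(p) for i, p in enumerate(pieces)})

sliced = slice_at_fractional_period(pd.DataFrame(np.random.rand(16000),
                                                 columns=['random_nos']), 2144.14)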

Wrong result of np.dot

I'm new to Python and I'm trying to multiply a 2D matrix by a 1D vector. I use np.dot to do it, but it gives me the wrong output. I'm trying to do this:
# X_train.shape = (60000, 784)
w = np.zeros([784, 1])
lista = range(0, len(X_train))
for i in lista:
    score = np.dot(X_train[i, :], w)
print score.shape
Out-> (1L,)
The output should be (60000, 1). Any idea how I can resolve the problem?
You should avoid the for loop altogether. Indeed, np.dot is supposed to work on N-dim arrays and does the looping internally. See for example
In [1]: import numpy as np
In [2]: a = np.random.rand(1,2) # a.shape = (1,2)
In [3]: b = np.random.rand(2,3) # b.shape = (2,3)
In [4]: np.dot(a,b)
Out[4]: array([[ 0.33735571, 0.29272468, 0.09361096]])
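Applied to the question's setup, the whole loop collapses to a single matrix product (shapes assumed from the snippet above; random stand-in data):
import numpy as np

X_train = np.random.rand(60000, 784)  # stand-in data; shape taken from the question
w = np.zeros([784, 1])

score = np.dot(X_train, w)  # one (60000, 784) x (784, 1) product
print(score.shape)          # (60000, 1)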

Why doesn't the shape of my numpy array change?

I have made a numpy array out of data from an image. I want to convert the numpy array into a one-dimensional one.
import numpy as np
import matplotlib.image as img
if __name__ == '__main__':
    my_image = img.imread("zebra.jpg")[:,:,0]
    width, height = my_image.shape
    my_image = np.array(my_image)
    img_buffer = my_image.copy()
    img_buffer.reshape(width * height)
    print str(img_buffer.shape)
The image is 128x128.
However, this program prints out (128, 128). I want img_buffer to be a one-dimensional array, though. How do I reshape this array? Why won't numpy actually reshape the array into a one-dimensional one?
.reshape returns a new array, rather than reshaping in place.
By the way, you appear to be trying to get a bytestring of the image - you probably want to use my_image.tostring() (called tobytes() in modern NumPy) instead.
reshape doesn't work in place. Your code isn't working because you aren't assigning the value returned by reshape back to img_buffer.
If you want to flatten the array to one dimension, ravel or flatten might be easier options.
>>> img_buffer = img_buffer.ravel()
>>> img_buffer.shape
(16384,)
Otherwise, you'd want to do:
>>> img_buffer = img_buffer.reshape(np.prod(img_buffer.shape))
>>> img_buffer.shape
(16384,)
Or, more succinctly:
>>> img_buffer = img_buffer.reshape(-1)
>>> img_buffer.shape
(16384,)
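If you really do want to change the shape in place rather than rebind the name, assigning to the shape attribute also works (a small sketch; it raises an error instead of silently copying when the memory layout doesn't allow a view):
import numpy as np

img_buffer = np.zeros((128, 128))
img_buffer.shape = (128 * 128,)  # in-place: no new array object is created
print(img_buffer.shape)          # (16384,)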

Turn 2D NumPy array into 1D array for plotting a histogram

I'm trying to plot a histogram with matplotlib.
I need to convert my one-line 2D Array
[[1,2,3,4]] # shape is (1,4)
into a 1D Array
[1,2,3,4] # shape is (4,)
How can I do this?
Adding ravel as another alternative for future searchers. From the docs,
It is equivalent to reshape(-1, order=order).
Since the array is 1xN, all of the following are equivalent:
arr1d = np.ravel(arr2d)
arr1d = arr2d.ravel()
arr1d = arr2d.flatten()
arr1d = np.reshape(arr2d, -1)
arr1d = arr2d.reshape(-1)
arr1d = arr2d[0, :]
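One practical difference between them: ravel returns a view of the original data whenever it can, while flatten always makes a copy, which np.shares_memory makes easy to see (a quick sketch):
import numpy as np

arr2d = np.array([[1, 2, 3, 4]])
print(np.shares_memory(arr2d, arr2d.ravel()))    # True: a view, no data copied
print(np.shares_memory(arr2d, arr2d.flatten()))  # False: flatten always copies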
You can directly index the column:
>>> import numpy as np
>>> x2 = np.array([[1,2,3,4]])
>>> x2.shape
(1, 4)
>>> x1 = x2[0,:]
>>> x1
array([1, 2, 3, 4])
>>> x1.shape
(4,)
Or you can use squeeze:
>>> xs = np.squeeze(x2)
>>> xs
array([1, 2, 3, 4])
>>> xs.shape
(4,)
reshape will do the trick.
There's also a more specific method, flatten, that appears to do exactly what you want.
The answer provided by mtrw works for an array that actually has only one row, like this one. However, if you have a general 2D array with values along both dimensions, you can convert it as follows:
a = np.array([[1,2,3],[4,5,6]])
From here you can find the shape of the array with a.shape and take the product of that with np.prod, which gives the total number of elements. If you now use np.reshape() to reshape the array to that total length, you have a solution that always works:
np.reshape(a, np.prod(a.shape))
array([1, 2, 3, 4, 5, 6])
Use numpy.flat
import numpy as np
import matplotlib.pyplot as plt
a = np.array([[1, 0, 0, 1],
              [2, 0, 1, 0]])
plt.hist(a.flat, bins=[0, 1, 2, 3])
plt.show()
The flat property returns a 1D iterator over your 2D array. This method generalizes to any number of rows (or dimensions). For large arrays it can be much more efficient than making a flattened copy.