Pandas dataframe - pandas

I have a dataframe df with two columns, 'voltage' (V) and 'current' (I). I want to randomly select 5 values of 'voltage' from the file, save them in a 1D array like [v1,v2,v3,v4,v5], and save the corresponding current values in another 1D array like [I1,I2,...,I5]. Here is what I tried:
df=pd.read_csv(file,sep=",",header=None,usecols=[0,1],names=['voltage','current'])
#pick 5 random values of voltage and save it in np array
V= np.array( df['voltage'].sample(n=5))
How can I do the same for the corresponding values of I at the selected values of V?

I think you need:
arr = df.sample(n=5).values
a = arr[:, 0]
b = arr[:, 1]

While jezrael's answer does provide the desired output, the answer to your question would be:
V= df['voltage'].sample(n=5)
I = df.loc[V.index,'current']
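A minimal sketch of both answers on a small made-up frame (the values are illustrative, and `random_state` is set only to make the run repeatable). Either way, the point is to sample once and read both columns from the same rows so the pairs stay matched.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the CSV in the question.
df = pd.DataFrame({
    "voltage": np.linspace(0.0, 9.0, 10),
    "current": np.linspace(0.0, 0.9, 10),
})

# Route 1: sample whole rows once, then split into two 1D arrays.
picked = df.sample(n=5, random_state=0)
V = picked["voltage"].to_numpy()
I = picked["current"].to_numpy()

# Route 2: sample one column, then look up the other by index label.
V_series = df["voltage"].sample(n=5, random_state=0)
I_series = df.loc[V_series.index, "current"]
```

Sampling each column separately would draw two unrelated sets of rows, which is why the lookup goes through the sampled index.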

Related

Reshape a DataFrame based on column value, and pad missing slices with zeros

I have a Pandas DataFrame which looks like:
ID  order  other_column_1  other_column_x
A   0      10              20
A   1      11              21
A   2      12              22
B   0      31              41
B   2      33              43
I want to reshape it to a 3D matrix with shape (#IDs, #order, #other columns). For the example above, it should be of shape (2, 3, 2).
The order column holds the order of the 2nd dimension, so slice ['A', 0, :] should be [10, 20] and ['A', 1, :] [11, 21] etc. The values of order are identical for all ID (0, 1, 2 in this case).
Trouble is, sometimes a slice is missing e.g. for 'B', the slice (order) '1' is missing, which I want to make it a slice pad with all 0's, to keep the shape consistent.
I think of pre-sorting the whole DataFrame by ID and order, loop over each ID , insert missing slices, and stack them together. However, the DataFrame is huge so I try to avoid global sort and loop if possible.
I came up with a way to do it (if you have enough memory to allocate) that does not loop over the whole dataframe, although I couldn't test it with 10M rows because of memory allocation. I tested it with 5M rows by 300 columns and I will show the results at the end of the answer.
The idea is to get all the combinations of the unique values of the first 2 columns as an index to build the first 2 dimensions of the 3D array.
After that you can merge the original dataframe with the dataframe containing index combinations to then fill all the missing values with 0.
Once the data is complete you can pass it to numpy and reshape it in 3 dimensions.
Code without comments:
# df = original dataframe
d1 = df.ID.unique()
d2 = df.order.unique()
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left')\
.fillna(0)
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Code with comments:
# df = original dataframe
# Get unique IDs for the 1st dimension
d1 = df.ID.unique()
# Get unique order values for the 2nd dimension
d2 = df.order.unique()
# Get the complete DataFrame (parentheses are used so each chained call can carry a comment)
df3 = (
    pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])  # every (ID, order) combination, including the missing ones
    .to_frame().reset_index(drop=True)  # turn the MultiIndex into a DataFrame and reset the index
    .merge(df, on=['ID', 'order'], how='left')  # merge the complete combinations with the original values
    .fillna(0)  # fill missing values with 0
)
# Get the complete data as a 2D array and reshape it to a 3D array
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
Test:
First I tried to test with 10M rows but I could not allocate the memory needed for that.
To test the code I created a dataframe with 6M rows x 300 columns (random float numbers) and dropped 1M rows to simulate the missing values.
Here is the code I used to test and the results.
Test code:
import random
import time
import pandas as pd
import numpy as np
# 100000 diff. ID and 60 diff. order
df_test = (
    pd.MultiIndex.from_product((range(100000), range(60)), names=['ID', 'order'])
    .to_frame().reset_index(drop=True)
    .drop(random.sample(range(6_000_000), k=1_000_000))  # Drop 1M rows to simulate missing rows
    .reset_index(drop=True)
)
# 5M rows random data by 298 columns
df_test2 = pd.DataFrame(np.random.random(size=(5_000_000, 298)))
df = df_test.merge(df_test2, left_index=True, right_index=True)
start = time.time()
d1 = df.ID.unique()
print(f'time 1st Dimension: {round(time.time()-start, 3)}')
d2 = df.order.unique()
print(f'time 2nd Dimension: {round(time.time()-start, 3)}')
df3 = pd.MultiIndex.from_product((d1, d2), names=['ID', 'order'])\
.to_frame().reset_index(drop=True)\
.merge(df, on=['ID', 'order'], how='left').fillna(0)
print(f'time merge: {round(time.time()-start, 3)}')
np_3d_array = df3[df3.columns[2:]].to_numpy().reshape(d1.shape[0], d2.shape[0], df.columns[2:].shape[0])
print(f'time ndarray: {round(time.time()-start, 3)}')
print(f'array shape: {np_3d_array.shape}')
print(f'array type: {type(np_3d_array)}')
Test Results:
time 1st Dimension: 0.035
time 2nd Dimension: 0.063
time merge: 47.202
time ndarray: 49.441
array shape: (100000, 60, 298)
array type: <class 'numpy.ndarray'>
ids = df.ID.unique()
orders = df.order.unique()
ar = (df.set_index(['ID','order'])
.reindex(pd.MultiIndex.from_product((ids, orders)))
.fillna(0)
.to_numpy()
.reshape(len(ids), len(orders), len(df.columns[2:])))
print(ar)
print(ar.shape)
Output:
[[[10. 20.]
  [11. 21.]
  [12. 22.]]

 [[31. 41.]
  [ 0.  0.]
  [33. 43.]]]
(2, 3, 2)
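One caveat with both answers: `unique()` returns values in order of first appearance, so if the rows are not already ordered, the second axis may not follow `order`. A sketch that sorts only the unique keys (cheap compared with sorting every row), using the example table from the question with the rows deliberately shuffled. It assumes each (ID, order) pair appears at most once.

```python
import pandas as pd

# Example table from the question, deliberately out of order.
df = pd.DataFrame({
    "ID": ["B", "A", "A", "B", "A"],
    "order": [2, 1, 0, 0, 2],
    "other_column_1": [33, 11, 10, 31, 12],
    "other_column_x": [43, 21, 20, 41, 22],
})

# Sort the unique keys only, not the whole frame.
ids = sorted(df["ID"].unique())
orders = sorted(df["order"].unique())

full_index = pd.MultiIndex.from_product([ids, orders], names=["ID", "order"])
ar = (
    df.set_index(["ID", "order"])
      .reindex(index=full_index, fill_value=0)  # missing (ID, order) pairs become 0
      .to_numpy()
      .reshape(len(ids), len(orders), -1)
)
```

Passing `fill_value=0` to `reindex` fills the new rows directly, so integer columns are not converted to float along the way.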

Looping through a dictionary of dataframes and counting a column

I am wondering if anyone can help. I have a number of dataframes stored in a dictionary. I simply want to access each of these dataframes and count the values in one column, which holds letters. The first dataframe has 10 letters, 5 b's and 5 a's, so I would expect the count to give a = 5 and b = 5. The counts differ for each dataframe, so I would like to store the results either in another dictionary or in separate variables.
The dictionary is called Dict and the column name in all the dataframes is called letters. I have tried to do this by accessing the keys in the dictionary but can not get it to work. A section of what I have tried is shown below.
import pandas as pd
for key in Dict:
    Count=pd.value_counts(key['letters'])
Count here would ideally change with each new count output to store into a new variable
A simplified example (the actual dataframe sizes are max 5000,63) of the one of the 14 dataframes in the dictionary would be
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
df = pd.DataFrame(data=d)
The other dataframes are named df2, df3, df4, etc.
I hope that makes sense. Any help would be much appreciated.
Thanks
If you want to access both key and values when iterating over a dictionary, you should use the items function.
You could use another dictionary to store the results:
letter_counts = {}
for key, value in Dict.items():
    letter_counts[key] = value["letters"].value_counts()
You could also use dictionary comprehension to do this in 1 line:
letter_counts = {key: value["letters"].value_counts() for key, value in Dict.items()}
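As a quick check of what this produces, here is a small sketch with two made-up frames (the dictionary name `frames` and its keys are illustrative, not from the question). The first frame uses the same ten letters as the question's example.

```python
import pandas as pd

# Two small frames standing in for the 14 in the question.
frames = {
    "df": pd.DataFrame({"letters": list("aaabbababb")}),
    "df2": pd.DataFrame({"letters": list("aaabbababba")}),
}

# One Series of per-letter counts for each DataFrame.
letter_counts = {name: frame["letters"].value_counts() for name, frame in frames.items()}

# Plain nested dicts are sometimes easier to pass around or print.
as_dicts = {name: counts.to_dict() for name, counts in letter_counts.items()}
```

Individual counts can then be read as `letter_counts["df"]["a"]`.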
The easiest thing is probably dictionary comprehension:
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
d2 = {'col1': [1, 2,3,4,5,6,7,8,9,10,11], 'letters': ['a','a','a','b','b','a','b','a','b','b','a']}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(d2)
df_dict = {'d': df, 'd2': df2}
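# note: count() returns the number of non-null entries per dataframe; use value_counts() for per-letter counts
new_dict = {k: v['letters'].count() for k,v in df_dict.items()}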
# out
{'d': 10, 'd2': 11}

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame, which contains 610 rows, and every row contains a nested list of coordinate pairs, it looks like that:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Please any ideas? Many thanks in advance!
Here my new answer to your problem. I used "reduce" to flatten your nested array and then I used "itertools chain" to turn everything into a 1d list. After that I reshaped the list into a 2d array which allows you to convert it to the dataframe that you need. I tried to be as generic as possible. Please let me know if there are any problems.
#libraries
import operator
from functools import reduce
from itertools import chain
import numpy as np
import pandas as pd
#geometry_list is assumed to be the list of nested coordinate lists, one entry per row
#flatten lists of lists using reduce. Then turn everything into a 1d list using
#itertools chain.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))
#reshape the 1d coordinate list to a 2d array and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
import pandas as pd
#your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
#convert list to array
coordinate_array = np.array(coordinate_list)
#print shape of array
print(coordinate_array.shape)
#reshape array into rows of (x, y) pairs
reshaped_array = np.reshape(coordinate_array, (-1, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
    new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]
   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
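For the case in the question, where each row holds a list of several coordinate pairs, `Series.explode` (available in pandas 0.25 and later) may be a shorter route. A sketch on a made-up frame; the column name `geometry` and the values are assumptions.

```python
import pandas as pd

# Hypothetical frame: each row holds a list of [x, y] pairs.
df = pd.DataFrame({
    "geometry": [
        [[1377778.48, 6682395.38], [1377780.00, 6682400.00]],
        [[1377790.00, 6682410.00]],
    ]
})

# One row per coordinate pair; the original row label is repeated in the index.
pairs = df["geometry"].explode()

# Split each pair into two numeric columns.
coords = pd.DataFrame(pairs.tolist(), columns=["X", "Y"], index=pairs.index)
```

Keeping the repeated index makes it possible to tell which original row each pair came from.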

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my Tfidf data, which was generated from a column in my dataframe called "merged"
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
To clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the work.
df1 = pd.DataFrame(X.toarray())  # convert the sparse matrix to a dense array
# df2 = YOUR_DF of size 57k x 40
newDf = pd.concat([df1, df2], axis=1)  # newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    # list(...) is needed on Python 3, where map returns an iterator
    row2 = list(map(float, row))
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original tfidf sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
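A small self-contained sketch of the same stacking step, with toy matrices standing in for the real TF-IDF output and the 40 extra features (shapes and values are made up). It wraps the dense block in a sparse matrix and asks for CSR output, since `hstack` returns COO format by default.

```python
import numpy as np
import scipy.sparse as sp

# Stand-ins for the real data: a 4 x 6 sparse matrix and 4 rows of 3 extra features.
X_tfidf = sp.csr_matrix(np.eye(4, 6))
extra = np.arange(12, dtype=float).reshape(4, 3)

# Stack column-wise; both blocks must have the same number of rows.
X_combined = sp.hstack([X_tfidf, sp.csr_matrix(extra)], format="csr")
```

The result stays sparse, unlike the `X.toarray()` route, which can matter at 57,629 x 11,947.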
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.

Pandas fill cells in a column with NaN values, derive the value from other cells in the row

I have a dataframe:
a b c
0 1 2 3
1 1 1 1
2 3 7 NaN
3 2 3 5
...
I want to fill column "c" in place (update the values) where the values are NaN, using a machine learning algorithm.
I don't know how to do it inplace. Sample code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df=pd.DataFrame([range(3), [1, 5, np.nan], [2, 2, np.nan], [4,5,9], [2,5,7]],columns=['a','b','c'])
x=[]
y=[]
for row in df.iterrows():
    index,data = row
    if(not pd.isnull(data['c'])):
        x.append(data[['a','b']].tolist())
        y.append(data['c'])
model = LinearRegression()
model.fit(x,y)
#this line does not do it in place.
df[~df.c.notnull()].assign(c = lambda x:model.predict(x[['a','b']]))
But this gives me a copy of the dataframe. Only option I have left is using a for loop however, I don't want to do that. I think there should be more pythonic way of doing it using pandas. Can someone please help? Or is there any other way of doing this?
You'll have to do something like:
df.loc[pd.isnull(df['c']), 'c'] = _result of model_
This modifies the dataframe df directly.
This way you first filter the dataframe to keep the rows you want to modify (pd.isnull(df['c'])), then from those rows you select the column you want to modify (c).
The right-hand side of the assignment must be an array / list / series with the same number of rows as the filtered dataframe (in your example, one row).
You may have to adjust depending on what your model returns exactly.
EDIT
You probably need to do something like this:
# predict for every row, then copy the predictions only into the rows where c is NaN
df['pred'] = model.predict(df[['a', 'b']])
df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
Note that a significant part of the issue comes from the way you are using scikit learn in your example. You need to pass the whole dataset to the model when you predict.
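An alternative sketch that skips the helper column and predicts only for the rows that need it. To keep it dependency-free, it fits an ordinary least-squares plane with `np.linalg.lstsq` as a stand-in for `LinearRegression`; the helper `design` is a hypothetical name, and the data is the frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2], [1, 5, np.nan], [2, 2, np.nan], [4, 5, 9], [2, 5, 7]],
    columns=["a", "b", "c"],
)

missing = df["c"].isna()

def design(frame):
    # Feature matrix [a, b, 1]; the column of ones plays the role of the intercept.
    features = frame[["a", "b"]].to_numpy(dtype=float)
    return np.column_stack([features, np.ones(len(frame))])

# Fit on the rows where c is known (least squares stands in for LinearRegression).
known_c = df.loc[~missing, "c"].to_numpy()
coef, *_ = np.linalg.lstsq(design(df[~missing]), known_c, rcond=None)

# Predict for the missing rows only and write the values back in place.
df.loc[missing, "c"] = design(df[missing]) @ coef
```

With a fitted scikit-learn model, the last line would presumably become `df.loc[missing, "c"] = model.predict(df.loc[missing, ["a", "b"]])`.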
The simplest way is to transpose first, then forward fill/backward fill at your convenience.
df.T.ffill().bfill().T
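Note that this copies a neighbouring cell from the same row rather than fitting a model. A sketch of the same idea using `axis=1`, which should avoid the double transpose; the frame mirrors the question's table, with all columns as floats to keep dtypes simple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 1.0, 3.0, 2.0],
    "b": [2.0, 1.0, 7.0, 3.0],
    "c": [3.0, 1.0, np.nan, 5.0],
})

# Fill along each row: take the value to the left, or the one to the right if none exists.
filled = df.ffill(axis=1).bfill(axis=1)

# The call returns a new frame; assign it back to replace the original.
df = filled
```

Here the NaN in row 2 takes the value 7 from column `b`.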