Dask - concatenating two dataframes with the same columns doesn't work

I have two CSV files without a header line, both with the same comma-separated columns.
I tried to read them into one dataframe with
dfoutputs = dd.read_csv(['outputsfile.csv', 'outputsfile2.csv'], names=colnames, header=None,
                        dtype={'firstnr': 'Int64', 'secondnr': 'Int64',
                               'thirdnr': 'Int64', 'fourthnr': 'Int64'})
but this dataframe only contained the rows of outputsfile.csv.
I hit a similar problem when reading the files separately and concatenating:
colnames = ['firstnr', 'secondnr', 'thirdnr', 'fourthnr']
dfoutputs = dd.read_csv('outputsfile.csv', names=colnames, header=None,
                        dtype={'firstnr': 'Int64', 'secondnr': 'Int64',
                               'thirdnr': 'Int64', 'fourthnr': 'Int64'})
print(dfoutputs.head(10))
dfoutputs2 = dd.read_csv('outputsfile2.csv', names=colnames, header=None,
                         dtype={'firstnr': 'Int64', 'secondnr': 'Int64',
                                'thirdnr': 'Int64', 'fourthnr': 'Int64'})
print(dfoutputs2.head(10))
dfnew = dd.concat([dfoutputs, dfoutputs2])
print(dfnew.head(10))
Output:
firstnr secondnr thirdnr fourthnr
0 0 0 0 5000000000
1 1 0 0 5000000000
2 2 0 0 5000000000
3 3 0 0 5000000000
4 4 0 0 5000000000
5 5 0 0 5000000000
firstnr secondnr thirdnr fourthnr
0 11 0 0 5000000000
1 12 0 0 5000000000
firstnr secondnr thirdnr fourthnr
0 0 0 0 5000000000
1 1 0 0 5000000000
2 2 0 0 5000000000
3 3 0 0 5000000000
4 4 0 0 5000000000
5 5 0 0 5000000000
How can I combine both CSVs into the same Dask dataframe?

As suggested by TennisTechBoy in the comments:
f = open("outputsfile.csv", "a")
f2 = open("outputsfile2.csv", "r")
f2content = f2.readlines()
for line in f2content:
    f.write(line)
f.close()
f2.close()
From a memory perspective, though, a way to do this within Dask itself might still be needed.
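For completeness, a minimal Dask-only sketch (assuming the same file names and dtypes as above). Note that Dask's head() only inspects the first partition by default, which can make a correct concat look as if it dropped the second file; asking head() to scan all partitions shows rows from both inputs:

import dask.dataframe as dd

colnames = ['firstnr', 'secondnr', 'thirdnr', 'fourthnr']
dtypes = {name: 'Int64' for name in colnames}

# Each file becomes (at least) one partition of the combined dataframe.
dfoutputs = dd.read_csv('outputsfile.csv', names=colnames, header=None, dtype=dtypes)
dfoutputs2 = dd.read_csv('outputsfile2.csv', names=colnames, header=None, dtype=dtypes)
dfnew = dd.concat([dfoutputs, dfoutputs2])

# head() reads only the first partition by default; npartitions=-1 scans
# all partitions, so rows from outputsfile2.csv show up as well.
print(dfnew.head(10, npartitions=-1))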

Related

Pandas Dataframe Manipulation logic

Can you please help with the problem below?
Given two dataframes df1 and df2, I need to produce something like the result dataframe.
import pandas as pd
import numpy as np
feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({'value': [0, 3, 0, 4, 2, 5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)
Expected dataframe (resultDf): the result needs to be driven by comparing the values in df1 with the column names (features) in df2; where they match, we put a 1 in resultDf.
I think you need:
(pd.get_dummies(df1['value'])
   .rename(columns=str)
   .reindex(columns=df2.columns,
            index=df2.index,
            fill_value=0))
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 1 0 0 0
5 0 0 0 0 0 1
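For reference, a minimal self-contained version of this answer using the df1 and df2 definitions from the question. The trailing astype(int) is my addition: on newer pandas versions, get_dummies returns booleans, and the cast restores the 0/1 output shown above:

import numpy as np
import pandas as pd

feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({'value': [0, 3, 0, 4, 2, 5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)

# One-hot encode df1['value'], align the dummy columns to df2's string
# labels and index, and fill anything missing with 0.
result = (pd.get_dummies(df1['value'])
            .rename(columns=str)
            .reindex(columns=df2.columns, index=df2.index, fill_value=0)
            .astype(int))
print(result)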

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have sensor data from an intelligent floor in column "CAPACITANCE" (split by ","), and that data comes from the device indicated in column "DEVICE". Each device has 8 sensors, so I want one column per sensor (devices × 8 columns), each holding the data from exactly that sensor.
But my code seems to be super slow, since I have about 90,000 rows in that dataframe! Does anyone have a suggestion on how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
                "02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
                "03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
                "04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
                "05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
    for sensor in range(8):
        floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)],
                               ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
    obj = row["DEVICE"]
    sensor_data = row["CAPACITANCE"].split(',')
    for idx, val in enumerate(sensor_data):
        col = obj + "-" + str(idx + 1)
        floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
As noted in the comments, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
    dev_name = x.DEVICE.iloc[0]
    ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
    ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
    return ret_df

new_df = df.groupby('DEVICE').apply(explode).fillna(0)
and then you can merge this with the old data frame:
df = df.join(new_df)
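One version-dependent caveat (my addition, not part of the original answer): on pandas 2.x, groupby(...).apply(...) prepends the group key to the result's index, which would break the join above. Passing group_keys=False keeps each group's original row index so new_df aligns with df:

# Keep the original row index so new_df lines up with df in the join.
new_df = df.groupby('DEVICE', group_keys=False).apply(explode).fillna(0)
df = df.join(new_df)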

Python Pandas Dataframe cell value split

I am lost on how to split the binary values such that each (0, 1) value takes up a column of the dataframe.
You can use concat with apply and list:
df = pd.DataFrame({0:[1,2,3], 1:['1010','1100','0101']})
print (df)
0 1
0 1 1010
1 2 1100
2 3 0101
df = pd.concat([df[0],
                df[1].apply(lambda x: pd.Series(list(x))).astype(int)],
               axis=1, ignore_index=True)
print (df)
0 1 2 3 4
0 1 1 0 1 0
1 2 1 1 0 0
2 3 0 1 0 1
Another solution with DataFrame constructor:
df = pd.concat([df[0],
                pd.DataFrame(df[1].apply(list).values.tolist()).astype(int)],
               axis=1, ignore_index=True)
print (df)
0 1 2 3 4
0 1 1 0 1 0
1 2 1 1 0 0
2 3 0 1 0 1
EDIT:
df = pd.DataFrame({0:['1010','1100','0101']})
df1 = pd.DataFrame(df[0].apply(list).values.tolist()).astype(int)
print (df1)
0 1 2 3
0 1 0 1 0
1 1 1 0 0
2 0 1 0 1
But if you need lists:
df[0] = df[0].apply(lambda x: [int(y) for y in list(x)])
print (df)
0
0 [1, 0, 1, 0]
1 [1, 1, 0, 0]
2 [0, 1, 0, 1]
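Going back to the original two-column df, a closely related sketch that avoids apply entirely by feeding a list comprehension straight to the DataFrame constructor (same sample data as above):

import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: ['1010', '1100', '0101']})

# list(x) splits each binary string into characters; the constructor
# stacks them into columns, and astype(int) turns '0'/'1' into numbers.
digits = pd.DataFrame([list(x) for x in df[1]], index=df.index).astype(int)
out = pd.concat([df[0], digits], axis=1, ignore_index=True)
print(out)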

copy_blanks(df, column) should copy the value in the original column to the last column for all values where the original column is blank

def copy_blanks(df, column):
    ...

I need something like this; please suggest an approach.
Input:
e-mail,number
n@gmail.com,0
p@gmail.com,1
h@gmail.com,0
s@gmail.com,0
l@gmail.com,1
v@gmail.com,0
,0
There is also a default_value option, which can take any value; when it is used, that value is inserted for the blanks, like below:
e-mail,number
n@gmail.com,0
p@gmail.com,1
h@gmail.com,0
s@gmail.com,0
l@gmail.com,1
v@gmail.com,0
NA,0
The function should take both default_value and skip_blank options: when skip_blank is true, default_value should not be applied; when skip_blank is false, default_value should be applied.
My expected output:
e-mail,number,e-mail_clean
n@gmail.com,0,n@gmail.com
p@gmail.com,1,p@gmail.com
h@gmail.com,0,h@gmail.com
s@gmail.com,0,s@gmail.com
l@gmail.com,1,l@gmail.com
v@gmail.com,0,v@gmail.com
,0,
Consider your sample df:
df = pd.DataFrame([
    ['n@gmail.com', 0],
    ['p@gmail.com', 1],
    ['h@gmail.com', 0],
    ['s@gmail.com', 0],
    ['l@gmail.com', 1],
    ['v@gmail.com', 0],
    ['', 0]
], columns=['e-mail', 'number'])
print(df)
e-mail number
0 n@gmail.com 0
1 p@gmail.com 1
2 h@gmail.com 0
3 s@gmail.com 0
4 l@gmail.com 1
5 v@gmail.com 0
6 0
If I understand you correctly:
def copy_blanks(df, column, skip_blanks=False, default_value='NA'):
    df = df.copy()
    s = df[column]
    if not skip_blanks:
        s = s.replace('', default_value)
    df['{}_clean'.format(column)] = s
    return df
copy_blanks(df, 'e-mail', skip_blanks=False)
e-mail number e-mail_clean
0 n@gmail.com 0 n@gmail.com
1 p@gmail.com 1 p@gmail.com
2 h@gmail.com 0 h@gmail.com
3 s@gmail.com 0 s@gmail.com
4 l@gmail.com 1 l@gmail.com
5 v@gmail.com 0 v@gmail.com
6 0 NA
copy_blanks(df, 'e-mail', skip_blanks=True)
e-mail number e-mail_clean
0 n@gmail.com 0 n@gmail.com
1 p@gmail.com 1 p@gmail.com
2 h@gmail.com 0 h@gmail.com
3 s@gmail.com 0 s@gmail.com
4 l@gmail.com 1 l@gmail.com
5 v@gmail.com 0 v@gmail.com
6 0
copy_blanks(df, 'e-mail', skip_blanks=False, default_value='new@gmail.com')
e-mail number e-mail_clean
0 n@gmail.com 0 n@gmail.com
1 p@gmail.com 1 p@gmail.com
2 h@gmail.com 0 h@gmail.com
3 s@gmail.com 0 s@gmail.com
4 l@gmail.com 1 l@gmail.com
5 v@gmail.com 0 v@gmail.com
6 0 new@gmail.com
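One practical note (my assumption, not part of the question): if the blanks come from read_csv, they will usually be NaN rather than empty strings, in which case fillna is the analogue of the replace above:

import numpy as np

# replace('') won't catch NaN blanks; normalize to NaN, then fillna.
s = df['e-mail'].replace('', np.nan)
df['e-mail_clean'] = s.fillna('NA')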

How to set (1) to max elements in pandas dataframe and (0) to everything else?

Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
How do I set 1 for the row-wise max and 0 for the other elements?
I came up with:
>>> for i in range(len(df)):
...     df.loc[i][df.loc[i].idxmax(axis=1)] = 1
...     df.loc[i][df.loc[i] != 1] = 0
which generates:
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone have a better way of doing it? Maybe by getting rid of the for loop or applying a lambda?
Use max and check for equality using eq, then cast the boolean df to int using astype; this converts True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
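One behavioral difference worth noting (my addition, not covered in the answer): eq against the row max marks every cell that ties for the maximum, while the idxmax loop flags only the first one:

# With ties, eq/max marks all maximal cells; idxmax would pick only 'a'.
tie = pd.DataFrame({'a': [1.0], 'b': [1.0]})
print(tie.eq(tie.max(axis=1), axis=0).astype(int))
#    a  b
# 0  1  1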
Timings
In [24]:
# @Raihan Masud's method
%timeit df.apply(lambda x: np.where(x == x.max(), 1, 0), axis=1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# @Nader Hisham's method
%%timeit
def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than @Raihan's method.
In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower.
import numpy as np

def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
Following Nader's pattern, this is a shorter version:
df.apply(lambda x: np.where(x == x.max(), 1, 0), axis=1)