I want to multiply roughly 50,000 columns by one other column in a large dask dataframe (6_500_000 x 50_002). A solution using a for loop works but is painfully slow. Below I tried two other approaches that failed. Any advice is appreciated.
Pandas
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df[['a','b']].multiply(df['c'], axis="index")
Dask
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)
# works but very slow for large datasets:
for column in ['a', 'b']:
ddf[column] = ddf[column] * ddf['c']
# these don't work:
ddf[['a','b']].multiply(ddf['c'], axis="index")
ddf[['a', 'b']].map_partitions(pd.DataFrame.mul, other=ddf['c']).compute()
Use .mul for dask:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
ddf = dd.from_pandas(df, npartitions=1)
ddf[['a','b']] = ddf[['a','b']].mul(ddf['c'], axis='index') # or axis=0
ddf.compute()
Out[1]:
    a   b  c
0   7  28  7
1  16  40  8
2  27  54  9
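If you'd rather not spell out all 50,000 column names, a rough sketch using map_partitions on the whole frame may also help; it applies a plain pandas operation to each partition and assumes the scaling column is literally named 'c', as in the example:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
ddf = dd.from_pandas(df, npartitions=1)

def scale_by_c(pdf):
    # pdf is an ordinary pandas DataFrame holding one partition (whole rows),
    # so 'c' is always available next to the columns being scaled.
    pdf = pdf.copy()
    cols = pdf.columns.difference(['c'])
    pdf[cols] = pdf[cols].mul(pdf['c'], axis=0)
    return pdf

ddf = ddf.map_partitions(scale_by_c)
print(ddf.compute())

Because everything happens inside one task per partition, the task graph does not grow with the number of columns the way the column-by-column for loop does.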
You basically had it for pandas; it's just that multiply() isn't in-place. I also changed to selecting all but one column with .loc so you don't have to type 50,000 column names :)
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df.loc[:, df.columns != 'c'] = df.loc[:, df.columns != 'c'].multiply(df['c'], axis="index")
Output:
    a   b  c
0   7  28  7
1  16  40  8
2  27  54  9
NOTE: I am not familiar with Dask, but I imagine that it is the same issue for that attempt.
Related
I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing. Since dropna() defaults to how='any' (drop a row if any value is missing), the mask only matches dropna() when it is built with any.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN

model_data
   a    b    c
0  1  1.0  1.0
Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
                  index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get back only the first result (based on the index), like this in pandas:
df.loc[df.col_1 > 3].iloc[[0]]
   col_1 col_2
2      4     d
I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result like in SQL?
Got it, but I'm not sure about the efficiency here:
tmp = df.loc[df.col_1 > 3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()
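Another option that avoids the extra pass for the index minimum: head(n) takes the first n rows in partition order, and since from_pandas keeps the index sorted, that is also index order here. npartitions=-1 tells dask to look across all partitions in case the first one holds no match. A sketch, not benchmarked:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6, 7], 'col_2': list('abcdefg')},
                  index=pd.Index([0, 0, 1, 2, 3, 4, 5]))
ddf = dd.from_pandas(df, npartitions=2)

# Filter, then take a single row -- roughly the dask analogue of LIMIT 1.
print(ddf[ddf.col_1 > 3].head(1, npartitions=-1))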
I have a data set like this: {'IT': [1, 20, 35, 44, 51, ..., 1000]}
I want to convert this into a pandas DataFrame and see output in the format below. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but this is not an efficient way for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension, since pandas will automatically broadcast the scalar 'IT' to every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
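For the sample data this gives (roughly) the frame below; pandas broadcasts the single string 'IT' to the length of the list:

print(df)
  dept  count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5   IT   1000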
If the data has multiple keys, use a list comprehension to build (key, value) tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
  Dept  Count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5  NEW   1000
You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
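melt turns every column into (variable, value) pairs, so here the result is 100,000 rows that all look like the first few:

print(df.head(3))
  Dept  Count
0   IT     10
1   IT     10
2   IT     10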
I am trying to use dask instead of pandas since I have a 2.6 GB CSV file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
   x
0  1
1  2
2  3
Before that, you could also achieve the same by slicing with column names, though of course this is less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
   x
0  1
1  2
2  3
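A hedged aside, not from the original answer: recent pandas and dask releases also accept the columns= keyword for drop, which avoids spelling out axis=1:

In [7]: ddf.drop(columns='y').compute()
Out[7]:
   x
0  1
1  2
2  3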
This should work, where columns is a list of the column labels you want to drop:
print(ddf.shape)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
This is simple code that creates a dataframe with a two-level column index.
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame(np.random.randn(6,4), columns=dates)
df1 = pd.DataFrame({'E' : ["foo1","foo2","foo3","foo4","foo5","foo6"]})
df2 = pd.concat(dict(L0=df, L1=df1), axis=1)
df2 output:
                    L0                                           \
   2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00
0             0.530496            -1.536075            -0.592824
1             0.614626             0.146761             1.799287
2            -0.398504            -0.863021            -0.208724
3             0.901720             0.717144             1.504012
4            -0.570248            -0.967722            -0.478540
5             2.225644             2.452121            -0.131774

                         L1
   2013-01-04 00:00:00    E
0             1.293738  foo1
1             1.469431  foo2
2            -2.084461  foo3
3            -0.199157  foo4
4            -1.627641  foo5
5            -1.970185  foo6
I have these three questions. Kindly help:
1) How can I reorder the columns such that the dates are in descending order?
2) How can I show only the date (not the time stamp) in the column header?
3) If I write df2 to CSV, it creates a blank row. I read some Q&A indicating a bug with multi-level output. Has that been fixed? If not, what's the best way to remove it?
Assuming you can attack the problem during the construction of df2, it can be solved by sorting the columns of df and then turning the column labels into strings:
df = df.sort_index(ascending=False, axis=1)
df.columns = df.columns.format()
Using the current version of pandas, 0.21.0 (dev),
df2.to_csv('/tmp/test.csv')
creates a CSV with no blank row. If you try it with the latest stable version, 0.20.3, I think you'll get the same result (see below).
For example,
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame(np.random.randn(6,4), columns=dates)
df1 = pd.DataFrame({'E' : ["foo1","foo2","foo3","foo4","foo5","foo6"]})
df = df.sort_index(ascending=False, axis=1)
df.columns = df.columns.format()
df2 = pd.concat(dict(L0=df, L1=df1),axis=1)
df2.to_csv('/tmp/test.csv')
creates /tmp/test.csv with the content
,L0,L0,L0,L0,L1
,2013-01-04,2013-01-03,2013-01-02,2013-01-01,E
0,0.02140012949846106,0.26277798576234707,0.3417048534674754,-0.2415864990096712,foo1
1,1.5529608360704856,0.04473119120484416,0.2563552549068564,-0.7234609815350183,foo2
2,0.3197702495146119,-0.4796536804964018,-1.0049744963838612,0.039249748655535384,foo3
3,-1.5129389373140296,-0.2528463527601262,-0.22930219559242235,-0.6661663277403631,foo4
4,0.03756426242171489,0.20880577998533037,1.0229358239647364,0.6539470866256701,foo5
5,-1.8477638391042324,-0.8315712350681457,-0.0743680147471108,0.8503850287138673,foo6
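One caveat worth noting: on much newer pandas releases Index.format() is deprecated, so if it is unavailable in your version, rendering the datetime labels with strftime is a rough equivalent (this assumes df.columns is a DatetimeIndex, as it is here):

df = df.sort_index(ascending=False, axis=1)
df.columns = df.columns.strftime('%Y-%m-%d')  # plain string labels, e.g. '2013-01-04'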
By the way, you might also want to consider this format, which seems a bit more compact:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame(np.random.randn(6,4), columns=dates)
df1 = pd.DataFrame({'E' : ["foo1","foo2","foo3","foo4","foo5","foo6"]})
df = df.T
df.columns = df1['E']
print(df)
yields
E foo1 foo2 foo3 foo4 foo5 foo6
2013-01-01 0.166074 0.398726 -0.410202 0.397486 -0.811873 0.462652
2013-01-02 0.406810 -0.313234 0.062569 -0.140924 -1.087162 1.600549
2013-01-03 -0.573118 1.331461 -0.115200 -1.934654 -1.427441 -0.889541
2013-01-04 -0.919885 -1.197192 -0.476039 1.186531 1.013803 0.400977