How to call unique() on dask DataFrame - pandas

How do I call unique on a dask DataFrame?
I get the following error if I try to call it the same way as for a regular pandas DataFrame:
In [27]: len(np.unique(ddf[['col1','col2']].values))
AttributeError Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))
/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924 return self._constructor_sliced(merge(self.dask, dsk), name,
1925 meta, self.divisions)
-> 1926 raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute 'values'

For both pandas and dask.dataframe, you should use the drop_duplicates method:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})
In [3]: df.drop_duplicates()
Out[3]:
x y
0 1 10
2 2 20
In [4]: import dask.dataframe as dd
In [5]: ddf = dd.from_pandas(df, npartitions=2)
In [6]: ddf.drop_duplicates().compute()
Out[6]:
x y
0 1 10
2 2 20

This works with dask==2022.11.1 (symbol is a column name here, so unique() is called on a single column):
ddf.symbol.unique().compute()

I'm not too familiar with Dask, but they appear to have a subset of Pandas functionality, and that subset doesn't seem to include the DataFrame.values attribute.
http://dask.pydata.org/en/latest/dataframe-api.html
You could try this:
sum(ddf[['col1','col2']].apply(pd.Series.nunique, axis=0))
I don't know how it fares performance-wise, but it should provide you with the value (total number of distinct values in col1 and col2 from the ddf DataFrame).
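The approaches in this thread count different things, which is easy to miss. The following is a minimal pandas-only sketch on made-up data (the column values are assumptions chosen for illustration). It compares the number of distinct scalar values across both columns, the sum of the per-column distinct counts, and the number of distinct rows. With dask, the drop_duplicates result would additionally need .compute(), as shown in the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row (1, 1) appears twice, and the values 1 and 2
# occur in both columns.
df = pd.DataFrame({'col1': [1, 2, 1, 1], 'col2': [1, 1, 2, 1]})
pair = df[['col1', 'col2']]

# Distinct scalar values across both columns combined (what np.unique counts).
n_values = len(np.unique(pair.to_numpy()))

# Sum of per-column distinct counts; a value present in both columns counts twice.
n_per_column = int(pair.nunique().sum())

# Distinct (col1, col2) rows, which is what drop_duplicates keeps.
n_rows = len(pair.drop_duplicates())

print(n_values, n_per_column, n_rows)
```

Which of the three numbers is the right one depends on whether "unique" means unique values or unique rows.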

Related

Multiply many columns by one column in dask

I want to multiply roughly 50,000 columns by one other column in a large dask dataframe (6_500_000 x 50_002). The solution using a for loop works but is painfully slow. Below are two other approaches I tried that failed. Any advice is appreciated.
Pandas
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df[['a','b']].multiply(df['c'], axis="index")
Dask
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)
# works but very slow for large datasets:
for column in ['a', 'b']:
    ddf[column] = ddf[column] * ddf['c']
# don't work:
ddf[['a','b']].multiply(ddf['c'], axis="index")
ddf[['a', 'b']].map_partitions(pd.DataFrame.mul, other=ddf['c'] ).compute()
Use .mul for dask:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
ddf = dd.from_pandas(df, npartitions=1)
ddf[['a','b']] = ddf[['a','b']].mul(ddf['c'], axis='index') # or axis=0
ddf.compute()
Out[1]:
a b c
0 7 28 7
1 16 40 8
2 27 54 9
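You basically had it for pandas; the only problem is that multiply() returns a new DataFrame instead of modifying the original in place, so the result has to be assigned back. I also switched to .loc to select all but one column so you don't have to type 50,000 column names :)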
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df.loc[:, df.columns != 'c']=df.loc[:, df.columns != 'c'].multiply(df['c'], axis="index")
Output:
a b c
0 7 28 7
1 16 40 8
2 27 54 9
NOTE: I am not familiar with Dask, but I imagine that it is the same issue for that attempt.
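As a minimal pandas-only sketch of the point made in both answers (the list name `targets` is hypothetical), the block below builds the list of columns to scale without typing their names and assigns the result of mul back, since mul does not modify the frame in place. The dask answer above uses the same mul call on a dask DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Every column except the multiplier, so no column names are typed by hand.
targets = [name for name in df.columns if name != 'c']

# mul returns a new frame; axis=0 aligns df['c'] with the rows.
# Assign the result back to keep it.
df[targets] = df[targets].mul(df['c'], axis=0)

print(df)
```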

Adding Pandas series values to Pandas dataframe values [duplicate]

I have a Python Pandas DataFrame:
df = pd.DataFrame(np.random.rand(5,3),columns=list('ABC'))
print df
A B C
0 0.041761178 0.60439116 0.349372206
1 0.820455992 0.245314299 0.635568504
2 0.517482167 0.7257227 0.982969949
3 0.208934899 0.594973111 0.671030326
4 0.651299752 0.617672419 0.948121305
Question:
I would like to add the first column to the whole dataframe. I would like to get this:
A B C
0 0.083522356 0.646152338 0.391133384
1 1.640911984 1.065770291 1.456024496
2 1.034964334 1.243204867 1.500452116
3 0.417869798 0.80390801 0.879965225
4 1.302599505 1.268972171 1.599421057
For the first row:
A: 0.04176 + 0.04176 = 0.08352
B: 0.04176 + 0.60439 = 0.64615
etc
Requirements:
I cannot refer to the first column using its column name.
eg.: df.A is not acceptable; df.iloc[:,0] is acceptable.
Attempt:
I tried this using:
print df.add(df.iloc[:,0], fill_value=0)
but it is not working. It returns the error message:
Traceback (most recent call last):
File "C:test.py", line 20, in <module>
print df.add(df.iloc[:,0], fill_value=0)
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 771, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2939, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2975, in _combine_match_columns
fill_value)
NotImplementedError: fill_value 0 not supported
Is it possible to take the sum of all columns of a DataFrame with the first column?
That's what you need to do:
df.add(df.A, axis=0)
Example:
>>> df = pd.DataFrame(np.random.rand(5,3),columns=['A','B','C'])
>>> col_0 = df.columns.tolist()[0]
>>> print df
A B C
0 0.502962 0.093555 0.854267
1 0.165805 0.263960 0.353374
2 0.386777 0.143079 0.063389
3 0.639575 0.269359 0.681811
4 0.874487 0.992425 0.660696
>>> df = df.add(df[col_0], axis=0)
>>> print df
A B C
0 1.005925 0.596517 1.357229
1 0.331611 0.429766 0.519179
2 0.773553 0.529855 0.450165
3 1.279151 0.908934 1.321386
4 1.748975 1.866912 1.535183
>>>
I would try something like this:
firstcol = df.columns[0]
df2 = df.add(df[firstcol], axis=0)
I used a combination of the above two posts to answer this question.
Since I cannot refer to a specific column by its name, I cannot use df.add(df.A, axis=0). But this is along the correct lines. Since df += df[firstcol] produced a dataframe of NaNs, I could not use this approach, but the way that this solution obtains a list of columns from the dataframe was the trick I needed.
Here is how I did it:
col_0 = df.columns.tolist()[0]
print(df.add(df[col_0], axis=0))
You can use numpy and broadcasting for this:
df = pd.DataFrame(df.values + df['A'].values[:, None],
columns=df.columns)
I expect this to be more efficient than series-based methods.
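To tie the answers together, here is a small sketch that selects the first column by position (as the question requires) and checks that the add-based and broadcasting-based approaches agree. The random seed and the variable names are assumptions for illustration. No claim is made here about relative speed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=list('ABC'))

# First column selected by position, not by name.
first = df.iloc[:, 0]

# axis=0 aligns the Series with the row index instead of the column labels.
by_add = df.add(first, axis=0)

# [:, None] reshapes (5,) to (5, 1) so NumPy broadcasts it across all columns.
by_numpy = pd.DataFrame(df.to_numpy() + first.to_numpy()[:, None],
                        columns=df.columns, index=df.index)

print(np.allclose(by_add.to_numpy(), by_numpy.to_numpy()))
```

Without axis=0, add tries to match the Series index against the column labels, which is why df += df[firstcol] produced NaNs.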

Drop categories from Dask DataFrame? [duplicate]

I am trying to use dask instead of pandas since I have a 2.6 GB CSV file.
I load it and want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[ : , :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
x
0 1
1 2
2 3
Previously one could also have used slicing with column names; though of course this can be less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
x
0 1
1 2
2 3
This should work:
print(ddf.shape)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
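For the "everything but the last column" case from the question, one option is to build the list of labels to keep from .columns and select with it. The sketch below uses plain pandas and a made-up three-column frame; it assumes the same label-list selection carries over to a dask DataFrame, as the ddf[['x']] example above suggests.

```python
import pandas as pd

# Hypothetical frame; 'z' plays the role of the unwanted last column.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1], 'z': [0, 0, 0]})

# Labels of every column except the last one.
keep = list(df.columns[:-1])

trimmed = df[keep]                  # select by a list of labels
dropped = df.drop(columns=['z'])    # same result by dropping the label

print(list(trimmed.columns))
print(trimmed.equals(dropped))
```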

ValueError: total size of new array must be unchanged (numpy for reshape)

I want to reshape my data vector, but when I run this code:
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get this error:
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
My data:
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
This isn't giving you a column of floats in a dataframe with the dates as the index; it is putting each whole line, date and value together, into the index. I think you want to add delimiter=' ' so that it splits the lines properly:
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
Secondly, the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape, you can't reshape to a layout that has a different number of elements from the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
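One more thing worth checking: np.reshape takes the array as its first argument, so np.reshape(10, 10) in the question reshapes the scalar 10 rather than A. The sketch below shows a size check before reshaping and the use of -1 to let NumPy infer one dimension. The helper name can_reshape and the 12-element stand-in array are assumptions for illustration, not the question's real data.

```python
import numpy as np

# Stand-in for np.array(series); the real data length is unknown here.
values = np.arange(12.0)

def can_reshape(arr, shape):
    """Return True when `shape` holds exactly as many elements as `arr`."""
    return arr.size == int(np.prod(shape))

print(can_reshape(values, (10, 10)))   # 12 elements cannot fill 100 slots
print(can_reshape(values, (3, 4)))     # 3 * 4 matches the 12 elements

# Pass the array itself; -1 asks NumPy to infer that dimension from the size.
grid = np.reshape(values, (3, -1))
print(grid.shape)
```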

Error in filtering groupby results in pandas

I am trying to filter groupby results in pandas using the example provided at:
http://pandas.pydata.org/pandas-docs/dev/groupby.html#filtration
but getting the following error (pandas 0.12):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d0014484ff78> in <module>()
1 grouped = my_df.groupby('userID')
----> 2 grouped.filter(lambda x: len(x) >= 5)
/Users/zz/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in filter(self, func, dropna, *args, **kwargs)
2092 res = path(group)
2093
-> 2094 if res:
2095 indexers.append(self.obj.index.get_indexer(group.index))
2096
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What does it mean and how can it be resolved?
EDIT:
code to replicate the problem in pandas 0.12 stable
dff = pd.DataFrame({'A': list('222'), 'B': list('123'), 'C': list('123') })
dff.groupby('A').filter(lambda x: len(x) > 2)
This was a quasi-bug in 0.12 and will be fixed in 0.13; the res is now protected by a type check:
if isinstance(res,(bool,np.bool_)):
    if res:
        add_indices()
I'm not quite sure how you got this error, however; the docs are actually compiled and run with actual pandas. You should ensure you're reading the docs for the correct version (in this case you were linking to dev rather than stable, although the API is largely unchanged).
The standard workaround is to do this using transform, which in this case would be something like the following (where g = dff.groupby('A')):
In [11]: dff[g.B.transform(lambda x: len(x) > 2)]
Out[11]:
A B C
0 2 1 1
1 2 2 2
2 2 3 3
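For comparison, here is a small sketch of both approaches on a recent pandas version, where filter works as documented. The data is made up so that some groups are actually removed, and transform('size') is used here as one way to build the mask; it is a substitution for the lambda in the answer above, not the answer's exact code.

```python
import pandas as pd

# Hypothetical data: group '2' has three rows, groups '1' and '3' have one each.
dff = pd.DataFrame({'A': list('22213'), 'B': list('12345')})

# filter keeps all rows of every group for which the function returns True.
via_filter = dff.groupby('A').filter(lambda group: len(group) > 2)

# Same selection as a boolean mask: each row gets the size of its own group.
mask = dff.groupby('A')['B'].transform('size') > 2
via_transform = dff[mask]

print(via_filter)
print(via_transform)
```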