Dask bag from multiple files into Dask dataframe with columns

I am given a list of filenames, files, containing comma-delimited data that has to be cleaned as well as extended by additional columns derived from the filenames. I therefore implemented a small read_file function that handles both the initial cleaning and the computation of the additional columns. Using db.from_sequence(files).map(read_file), I map the read function over all of the files, getting a list of dictionaries for each one.
However, rather than a list of dictionaries per file, I want my bag to contain each individual line of the input files as an entry. Subsequently, I want to map the keys of the dictionaries to column names in a Dask dataframe.
from dask import bag as db

def read_file(filename):
    ret = []
    with open(filename, 'r') as fp:
        ...  # read each line of the file and store the result in a dict
        ret.append({'a': val_a, 'b': val_b, 'c': val_c})
    return ret

files = ['a.txt', 'b.txt', 'c.txt']
my_bag = db.from_sequence(files).map(read_file)

# a, b, c are the keys of the dictionaries returned by read_file
my_df = my_bag.to_dataframe(columns=['a', 'b', 'c'])
Could someone let me know what I have to change to get this code running? Are there different approaches that would be more suitable?
Edit:
I have created three test files a_20160101.txt, a_20160102.txt, a_20160103.txt, each containing just a few lines with a single string per line.
asdf
sadfsadf
sadf
fsadff
asdf
sadfasd
fa
sf
ads
f
Previously I had a small error in read_file, but now calling my_bag.take(10) after mapping the reader works fine:
([{'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'asdf', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sadfsadf', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sadf', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'fsadff', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'asdf', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sadfasd', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'fa', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'sf', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'ads', 'c': 'XY'},
  {'b': datetime.datetime(2016, 2, 1, 0, 0), 'a': 'f', 'c': 'XY'}],)
However, my_df = my_bag.to_dataframe(columns=['a', 'b', 'c']) and a subsequent
my_df.head(10) still raise dask.async.AssertionError: 3 columns passed, passed data had 10 columns

You probably need to call flatten
Your bag of filenames looks like this:
['a.txt',
 'b.txt',
 'c.txt']
After you call map your bag looks like this:
[[{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}],
 [{'a': 1, 'b': 2, 'c': 3}],
 [{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]]
Each file was turned into a list of dicts, so your bag is now kind of like a list-of-lists-of-dicts.
The .to_dataframe method wants a list-of-dicts, so let's flatten our bag into a single collection of dicts:
my_bag = db.from_sequence(files).map(read_file).flatten()
[{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30},
 {'a': 1, 'b': 2, 'c': 3},
 {'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]
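With the flattened bag, the original to_dataframe call should then work. A minimal end-to-end sketch (assuming the read_file from the question, which returns a list of dicts with keys a, b, c per file):

from dask import bag as db

files = ['a.txt', 'b.txt', 'c.txt']

# flatten() turns the list-of-lists-of-dicts into a flat bag of dicts,
# so each bag element becomes one row of the resulting dataframe
my_bag = db.from_sequence(files).map(read_file).flatten()

my_df = my_bag.to_dataframe(columns=['a', 'b', 'c'])
my_df.head(10)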

Related

Merging multiple dataframes within a list into one dataframe

I have several dataframes, all with the same columns, within one list, that I would like to combine into one dataframe.
For instance, I have these three dataframes:
df1 = pd.DataFrame(np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
within one list:
dfList = [df1, df2, df3]
I know I can use the following, which gives me exactly what I'm looking for:
df_merge = pd.concat([dfList[0], dfList[1], dfList[2]])
However, in my actual data I have hundreds of dataframes within a list, so I'm trying to find a way to loop through and concat:
dfList_all = pd.DataFrame()
for i in range(len(dfList)):
    dfList_all = pd.concat(dfList[i])
I tried the approach above, but it gives me the following error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any ideas would be wonderful. Thanks.
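The TypeError comes from passing a single DataFrame (dfList[i]) to pd.concat, which expects an iterable of pandas objects. Since pd.concat already accepts the whole list, no loop is needed; a minimal sketch using the frames from the question:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
dfList = [df1, df2, df3]

# pd.concat takes the entire iterable at once; ignore_index=True
# renumbers the rows 0..n-1 instead of repeating each frame's index
df_merge = pd.concat(dfList, ignore_index=True)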

average of dataframe rows based on condition

data = {'a': [1, -1, -1, 1, -1, 1, -1, 1],
        'b': [-2, 2, -2, 2, -2, 2, 2, -2],
        'c': [-1, 1, -1, -1, -1, 1, -1, 1],
        'Price': [138, 186, 124, 200, 4, 6, 5, 5]}
df1 = pd.DataFrame(data)
print(df1)
I have a huge dataset of 5000 rows like this. I would like to find the average of rows having (a,b,c), (-a,b,-c), (-a,-b,-c), (a,-b,c); those are considered equivalent rows, and I want their average price. The output should be in this form:
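One hedged reading of the four listed sign patterns: they all share the same sign of the product a*c, while the sign of b is free, so that sign identifies each equivalence class. Under that assumption, a possible sketch is to group on the sign of a*c and average Price:

import numpy as np
import pandas as pd

data = {'a': [1, -1, -1, 1, -1, 1, -1, 1],
        'b': [-2, 2, -2, 2, -2, 2, 2, -2],
        'c': [-1, 1, -1, -1, -1, 1, -1, 1],
        'Price': [138, 186, 124, 200, 4, 6, 5, 5]}
df1 = pd.DataFrame(data)

# sign(a*c) is invariant under flipping b, flipping a and c together,
# or flipping all three, so it labels each equivalence class
key = np.sign(df1['a'] * df1['c'])
avg = df1.groupby(key)['Price'].mean()
print(avg)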

Merge similar columns and add extracted values to dict

Given this input:
pd.DataFrame({'C1': [6, np.NaN, 16, np.NaN], 'C2': [17, np.NaN, 1, np.NaN],
              'D1': [8, np.NaN, np.NaN, 6], 'D2': [15, np.NaN, np.NaN, 12]},
             index=[1, 1, 2, 2])
I'd like to combine columns beginning with the same letter (the Cs and Ds), as well as rows with the same index (1 and 2), and extract the non-null values into the simplest representation without duplicates, which I think is something like:
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
Using stack or groupby gets me part of the way there, but I feel like there is a more efficient way to do it.
You can rename the columns to their first letters with a lambda function, aggregate lists after DataFrame.stack, and then create the nested dictionary in a dict comprehension:
s = df.rename(columns=lambda x: x[0]).stack().groupby(level=[0,1]).agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
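A slightly more step-by-step variant of the same idea (a sketch, reusing s from the answer above; it iterates over the groups directly instead of indexing into s.index.levels):

# droplevel(0) removes the outer group key, so each sub-series
# maps column letter -> list of values for that index
d = {k: v.droplevel(0).to_dict() for k, v in s.groupby(level=0)}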

How to return a list into a dataframe based on matching index of other column

I have two data frames: one made up of a column of numpy arrays, and the other with two columns. I am trying to match the elements in the first dataframe (df) against the index of df2 to pull in the two columns o1 and o2. I was wondering if I could get some input. Please note that the string 'A1' in column o1 appears twice in df2, and, as you can see in my desired output dataframe, the duplicates are removed in column o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)  # ragged lists need object dtype
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({'A': array_1,
                          'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1', 'A1', 'C1', 'F1']],
                          'o2': [[15, 18, 19], [19, 20, 8], [17, 18, 19, 8]]})
# note: in row 0 of df, positions 0 and 2 both map to 'A1' in df2;
# the output keeps only one 'A1', removing the duplicate
I believe you can explode df and use that to extract information from df2, then finally join back to df:
s = df['A'].explode()
df_output = df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
              A                o1               o2
0     [0, 2, 3]          [C1, A1]     [18, 19, 15]
1     [3, 4, 6]      [F1, D1, C1]      [8, 19, 20]
2  [1, 2, 3, 6]  [F1, B1, C1, A1]  [8, 17, 18, 19]
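One caveat: list(set(x)) does not preserve the original order (visible above, where o1 holds [C1, A1] rather than the desired [A1, C1]). If order matters, a possible tweak is an order-preserving dedupe via dict.fromkeys:

s = df['A'].explode()

# dict.fromkeys keeps the first occurrence of each value, unlike set
df_output = df.join(
    df2.loc[s].groupby(s.index).agg(lambda x: list(dict.fromkeys(x))))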

Pandas to dict conversion with condition

My dataframe:
data_part = [{'Part': 'A', 'Engine': True, 'TurboCharger': True, 'Restricted': True},
             {'Part': 'B', 'Engine': False, 'TurboCharger': True, 'Restricted': False}]
My expected output is this:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
 'B': {'TurboCharger': 1}}
This is what I am doing:
df_part = pd.DataFrame(data_part).set_index('Part').astype(int).to_dict('index')
This is what it gives:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
 'B': {'Engine': 0, 'TurboCharger': 1, 'Restricted': 0}}
Is there anything that can be done to reach the expected output?
We can fix your output:
d = (pd.DataFrame(data_part).set_index('Part').astype(int)
       .stack().loc[lambda x: x != 0]
       .reset_index('Part').groupby('Part').agg(dict)[0].to_dict())
Out[192]:
{'A': {'Engine': 1, 'TurboCharger': 1, 'Restricted': 1},
 'B': {'TurboCharger': 1}}
You may call agg before to_dict:
df_part = (pd.DataFrame(data_part).set_index('Part')
             .agg(lambda x: dict(x[x].astype(int)), axis=1)
             .to_dict())
Out[60]:
{'A': {'Engine': 1, 'Restricted': 1, 'TurboCharger': 1},
 'B': {'TurboCharger': 1}}
Here's a way to convert the list to a dict without pandas:
from pprint import pprint

data_2 = dict()
for dp in data_part:
    ts = list(dp.items())                          # relies on 'Part' being the first key
    key = ts[0][1]                                 # e.g. 'A' or 'B'
    values = {k: int(v) for k, v in ts[1:] if v}   # keep only truthy flags, as 1
    data_2[key] = values
pprint(data_2)
{'A': {'Engine': 1, 'Restricted': 1, 'TurboCharger': 1},
 'B': {'TurboCharger': 1}}