Get partition ID in Dask for a DataFrame

Is it possible to get the partition ID in Dask after splitting a pandas DataFrame?
For example:
import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])
df_parts = dd.from_pandas(df, npartitions=2)
part1 = df_parts.get_partition(0)
Of the two parts, part1 is the first partition. So is it possible to do something like the following:
part1.get_partition_id() => which will return 0 or 1
Or is it possible to get the partition ID by iterating through df_parts?

Not sure about built-in functions, but you can achieve what you want with enumerate(df_parts.to_delayed()).
to_delayed will produce a list of delayed objects, one per partition, so you can iterate over them, keeping track of the sequential number with enumerate.
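A minimal sketch of that approach, reusing the example above; the partition id is simply the position that enumerate yields:
import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])
df_parts = dd.from_pandas(df, npartitions=2)

# to_delayed() returns one delayed object per partition, in partition order,
# so enumerate() supplies the partition id.
for partition_id, delayed_part in enumerate(df_parts.to_delayed()):
    part = delayed_part.compute()  # materialise this partition as a pandas DataFrame
    print(partition_id, len(part))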

Related

How do you split a pandas multiindex dataframe into train/test sets?

I have a multi-index pandas dataframe consisting of a date element and an index representing store locations. I want to split into training and test sets based on the time index. So, everything before a certain time being my training data set and after being my testing dataset. Below is some code for a sample dataset.
import pandas as pd
from scipy import stats
data = stats.poisson(mu=[5,2,1,7,2]).rvs([60, 5]).T.ravel()
dates = pd.date_range('2017-01-01', freq='M', periods=60)
locations = [f'location_{i}' for i in range(5)]
df_train = pd.DataFrame(data, index=pd.MultiIndex.from_product([dates, locations]), columns=['eaches'])
df_train.index.names = ['date', 'location']
I would like df_train to represent everything before 2021-01 and df_test to represent everything after.
I've tried using df[df.loc['dates'] > '2020-12-31'] but that yielded errors.
You have 'date' as an index level; that's why your selection doesn't work. For an index you can use:
df_train.loc['2020-12-31':]
That selects all rows where the date is >= '2020-12-31'. If you want only rows where the date is strictly after '2020-12-31', use df_train.loc['2021-01-01':].
You can't do df.loc['dates'] > '2020-12-31': the index level is named 'date', not 'dates', and df.loc[...] returns your numerical data, which can't be compared to a string.
You can use query, which can reference index levels by name:
df.query('date>"2020-12-31"')
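Putting the two answers together, a minimal sketch of the split, assuming df_train is the full MultiIndex frame built above (levels 'date' and 'location'):
# Slicing a MultiIndex level by label works best on a sorted index.
df_sorted = df_train.sort_index()
train = df_sorted.loc[:'2020-12-31']   # everything up to and including 2020-12
test = df_sorted.loc['2021-01-01':]    # everything from 2021-01 onwards

# Equivalent with query(), which can refer to the 'date' index level by name:
test_q = df_sorted.query('date > "2020-12-31"')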

Read JSON files into a pandas DataFrame

I have a large pandas DataFrame (17,000 rows) with a file path in each row pointing to a specific JSON file. For each row I want to read the JSON file's content and extract it into a new DataFrame.
The dataframe looks something like this:
0 /home/user/processed/config1.json
1 /home/user/processed/config2.json
2 /home/user/processed/config3.json
3 /home/user/processed/config4.json
4 /home/user/processed/config5.json
... ...
16995 /home/user/processed/config16995.json
16996 /home/user/processed/config16996.json
16997 /home/user/processed/config16997.json
16998 /home/user/processed/config16998.json
16999 /home/user/processed/config16999.json
What is the most efficient way to do this?
I believe a simple for-loop might be best suited here?
import json
import pandas as pd

json_content = []
for path in df.iloc[:, 0]:           # iterate over the file paths in the first column
    with open(path) as file:
        json_content.append(json.load(file))

result = pd.DataFrame(json_content)
Generally, I'd first try the iterrows() function.
An implementation could look like this:
import json
import pandas as pd

json_content = []
for idx, row in df.iterrows():       # iterrows() yields (index, Series) pairs
    with open(row.iloc[0]) as file:  # the first column holds the file path
        json_content.append(json.load(file))

result = pd.Series(json_content)
A possible solution is the following:
# pip install pandas
import pandas as pd

# convert the column with paths to a list (iloc[:, 0] means all rows, first column)
paths = df.iloc[:, 0].tolist()

all_dfs = []
for path in paths:
    json_df = pd.read_json(path, encoding='utf-8')
    all_dfs.append(json_df)
Each dataframe in all_dfs can be accessed individually or in a loop by index, e.g. all_dfs[0], all_dfs[1], etc.
If you wish, you can merge all_dfs into a single dataframe:
dfs = pd.concat(all_dfs, axis=1)
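If every JSON file produces rows with the same columns (an assumption, since the question doesn't show the file contents), concatenating row-wise may be closer to what you want:
# Stack the per-file frames on top of each other and reset the row index.
combined = pd.concat(all_dfs, axis=0, ignore_index=True)
print(combined.shape)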

Pandas - transform series of dictionaries into series of dictionary values

After applying pd.Series to one dataframe column, like so:
df_pos = df_matches.col.apply(pd.Series)
I ended up with:
0 {'macro': 'GOL', 'macro_position': 'Goalkeeper'}
1 {'macro': 'DEF', 'macro_position': 'Defender'}
Now I need to turn it into this dataframe:
macro macro_position
0 GOL Goalkeeper
1 DEF Defender
EDIT:
None of the answers below work. If I do:
out = list(df_pos.values)
I get a list of strings of dictionary syntax:
...
array(["{'macro': 'ATA', 'macro_posicao': 'Ataque'}"],dtype=object),
...
Try with
import ast
out = pd.DataFrame(df.col.apply(ast.literal_eval).tolist())
Out[71]:
macro macro_position
0 GOL Goalkeeper
1 DEF Defender
Two approaches:
Using .apply
As mentioned by sammywemmy in the comments, by far the easiest approach is just to use .apply:
import pandas as pd
sf = pd.Series([{'macro': 'GOL', 'macro_position': 'Goalkeeper'},
{'macro': 'DEF', 'macro_position': 'Defender'}])
df = sf.apply(pd.Series)
This worked on my Python installation; try executing the code above verbatim. Note that you do not need to write .col or anything like that: .apply is a method on pd.Series itself.
Using pd.DataFrame and dictionaries
pd.DataFrame can take a dictionary of dictionaries as an argument. So if you turn your Series into a dictionary, then you can just use pd.DataFrame, passing the dictionary as the data argument.
The one complication is that when converting a dict of dicts, it will interpret the inner dictionaries as the rows and the outer dictionaries as the columns. In your case, the rows of the series correspond to the columns, so if you just used .to_dict() naively, you would have the inner dictionaries as the columns, which is the wrong way around. The easiest way to fix this is just to transpose the DataFrame at the end, swapping rows and columns.
The result is as follows:
import pandas as pd
sf = pd.Series([{'macro': 'GOL', 'macro_position': 'Goalkeeper'},
{'macro': 'DEF', 'macro_position': 'Defender'}])
df = pd.DataFrame(sf.to_dict()).transpose()
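If you'd rather skip the transpose, pd.DataFrame.from_dict with orient='index' treats the outer keys as row labels; a small sketch of that variant:
import pandas as pd

sf = pd.Series([{'macro': 'GOL', 'macro_position': 'Goalkeeper'},
                {'macro': 'DEF', 'macro_position': 'Defender'}])

# orient='index' uses the outer dict keys (here the Series index) as row labels,
# so no transpose is needed.
df = pd.DataFrame.from_dict(sf.to_dict(), orient='index')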

How to create a DataFrame with index names different from `row` and write data into (`index`, `column`) pairs in Julia?

How can I create a DataFrame in Julia with index names that are different from Row, and write values into an (index, column) pair?
I do the following in Python with pandas:
import pandas as pd
df = pd.DataFrame(index = ['Maria', 'John'], columns = ['consumption','age'])
df.loc['Maria', 'age'] = 52
I would like to do the same in Julia. How can I do this? The documentation shows a DataFrame similar to the one I would like to construct but I cannot figure out how.

pandas HDFStore select rows with non-null values in the data column

In a pandas DataFrame/Series there's an .isnull() method. Is there something similar in the syntax of the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta node of a data column inside the HDF5 file can be used as a hacky solution:
import pandas as pd
store = pd.HDFStore('store.h5')
print(store.groups())
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
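An alternative sketch based on pandas' documented coordinate-selection pattern: read just that column back with select_column, find the non-null row positions in pandas, and pass those positions to select as the where argument. The key 'df' and column 'my_data_column' are taken from the question and assumed to have been stored with data_columns enabled.
import pandas as pd

store = pd.HDFStore('store.h5')

# Read only the data column back into memory as a Series indexed by row number.
col = store.select_column('df', 'my_data_column')

# Row coordinates (integer positions) where the column is not null.
coords = col[col.notnull()].index

# where= also accepts row coordinates, so this pulls back only the non-null rows.
df = store.select('df', where=coords)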