Julia: best way to reshape a multi-dimensional array into a DataFrame

I have a multi-dimensional array:
julia> sim1.value[1:5,:,:]
5x3x3 Array{Float64,3}:
[:, :, 1] =
0.201974 0.881742 0.497407
0.0751914 0.921308 0.732588
-0.109084 1.06304 1.15962
-0.0149133 0.896267 1.22897
0.717094 0.72558 0.456043
[:, :, 2] =
1.28742 0.760712 1.61112
2.21436 0.229947 1.87528
-1.66456 1.46374 1.94794
-2.4864 1.84093 2.34668
-2.79278 1.61191 2.22896
[:, :, 3] =
0.649675 0.899028 0.628103
0.718837 0.665043 0.153844
0.914646 0.807048 0.207743
0.612839 0.790611 0.293676
0.759457 0.758115 0.280334
I have names for the 2nd dimension in
julia> sim1.names
3-element Array{String,1}:
"beta[1]"
"beta[2]"
"s2"
What's the best way to reshape this multi-dimensional array so that I have a data frame like:
beta[1]    | beta[2]  | s2       | chain
0.201974   | 0.881742 | 0.497407 | 1
0.0751914  | 0.921308 | 0.732588 | 1
-0.109084  | 1.06304  | 1.15962  | 1
-0.0149133 | 0.896267 | 1.22897  | 1
...        | ...      | ...      | ...
1.28742    | 0.760712 | 1.61112  | 2
2.21436    | 0.229947 | 1.87528  | 2
-1.66456   | 1.46374  | 1.94794  | 2
-2.4864    | 1.84093  | 2.34668  | 2
-2.79278   | 1.61191  | 2.22896  | 2
...        | ...      | ...      | ...

At the moment, I think the best way to do this would be a mixture of loops and calls to reshape:
using DataFrames
A = randn(5, 3, 3)
df = DataFrame()
for j in 1:3
    # flatten the j-th 5x3 slice into a single column of length 15
    df[j] = reshape(A[:, :, j], 5 * 3)
end
names!(df, [:beta1, :beta2, :s2])

Looking at your data, it seems you wanted to basically stack the three matrices output by sim1.value[1:5,:,:] on top of each other vertically, plus add another column with the index of the matrix. The accepted answer of the brilliant and venerable John Myles White seems to put the entire contents of each of those matrices into its own column.
The code below matches your desired output, using vcat for the stacking and hcat plus fill to add the extra column. JMW, I'm sure, will know if there's a better way :)
using DataFrames
A = randn(5, 3, 3)
names = ["beta[1]","beta[2]","s2"]
push!(names, "chain")
# stack the slices vertically, appending a column that holds the slice (chain) index
newA = vcat([hcat(A[:,:,i], fill(i, size(A,1))) for i in 1:size(A,3)]...)
df = DataFrame(newA, Symbol[names...])
Note also that you can do this slightly more concisely without the explicit calls to hcat and vcat:
newA = [[[A[:,:,i] fill(i,size(A,1))] for i in 1:size(A,3)]...]

Related

Confusing matching behaviour of pandas extract(all)

I have a strange problem. But first, I want to match a hierarchy-based string against the value of a column in a pandas data frame and count the occurrences of the current node and all of its children.
| index | hierarchystr          |
| ----- | --------------------- |
| 0     | level0level00level000 |
| 1     | level0level01         |
| 2     | level0level02level021 |
| 3     | level0level02level021 |
| 4     | level0level02level020 |
| 5     | level0level02level021 |
| 6     | level1level02level021 |
| 7     | level1level02level021 |
| 8     | level1level02level021 |
| 9     | level2level02level021 |
Assume that there are 300k lines. Each node can have multiple children, which again have multiple children, and so on (represented here by the level0-2 strings). I have a separate hierarchy from which I extract the hierarchy strings. Now to the problem:
# hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]
pat = "|".join(hstrs)
s = df.hierarchystr.str.extract('(' + pat + ')', expand=True)[0]
df1 = df.groupby(s).size().reset_index(name='Count')
df1 = df1[df1['Count'] > 200]
size = len(df1)
The number of matched substrings with more than 200 occurrences differs on every run! "level0" should match every row where the hierarchy string level0 is included and should build a group with all of its sub-children, and that group size needs to be greater than 200.
Edit:// levelX is just an example; I have thousands of nodes with different names and again thousands of different sub-children. The hstrs strings do not include each other, apart from the parent nodes. (E.g. "parent1" is included in "parent1subchild1" and "parent1subchild2".)
I traced it back to a different order of the hierarchy strings in the array hstrs, so I changed the code to compare each substring individually:
matches = []  # collect the hierarchy strings with more than 200 occurrences
for hstr in hstrs:
    s = df.hierarchystr.str.extract('(' + hstr + ')', expand=True)
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        matches.append(hstr)
This is slow as hell, but the result stays the same no matter which order hstrs has. For efficiency, though: is it possible to do the same with only one regex matching group, all at once for all hstrs?
Edit://
expected output would be:
| index | 0                     | Count |
| ----- | --------------------- | ----- |
| 0     | level0                | 5     |
| 1     | level1                | 3     |
| 2     | level0level01         | 1     |
| 3     | level0level02         | 4     |
| 4     | level0level02level021 | 3     |
Edit2://
It has something to do with the ordering of hstrs; I think it is the match-and-stop-after-the-first-alternative behaviour of the extract method. If the ordering is different, the hierarchy strings in pat are matched differently, which results in different sizes for each group. When a short, high-level hierarchy string comes first in the alternation, it is matched first, and the longer, lower-level hierarchy strings in the same pat are never matched. But I don't know what to do about this behaviour.
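One way to make at least the grouping deterministic (a sketch, not from the original post, assuming the df and hstrs above) is to sort the alternatives by length, longest first, so the most specific hierarchy string always wins the alternation. Note that extract still assigns each row to only one group, so this alone does not produce the parent-plus-children counts shown above:
# Sketch: Python's regex alternation prefers the earliest alternative that
# matches, so putting longer (more specific) strings first makes the grouping
# independent of the original ordering of hstrs.
import re
pat = "|".join(sorted(map(re.escape, hstrs), key=len, reverse=True))
s = df.hierarchystr.str.extract('(' + pat + ')', expand=True)[0]
df1 = df.groupby(s).size().reset_index(name='Count')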
Edit3://
An alternative would be the following, but it is also slow as hell:
beforeset = []
for hstr in hstrs:
    s = df[df.hierarchystr.str.contains(hstr)]
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
Edit4://
I think what I am searching for is a way to do a "group_by" with "contains" or "is in" for the hstrs. I am glad for every idea. :)
Edit5://
Found a simple but not satisfying alternative (though faster than the previous tries):
from collections import Counter

containing = [item for hierarchystr in df.hierarchystr for item in hstrs if item in hierarchystr]
containing = Counter(containing)
df1 = pd.DataFrame([containing]).T
nodeNamesWithOver200 = df1[df1 > 200].dropna().index.values
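For reference, a vectorized variant of that idea (a sketch, not from the original post, assuming the df and hstrs above): one str.contains pass per hierarchy string replaces the Python double loop, and each row can still count towards several parent nodes.
# Sketch: count, for each hierarchy string, how many rows contain it.
# regex=False treats each hierarchy string as a plain substring, not a pattern.
import pandas as pd

counts = pd.Series(
    {h: df.hierarchystr.str.contains(h, regex=False).sum() for h in hstrs},
    name="Count",
)
nodes_with_over_200 = counts[counts > 200].index.tolist()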

Trying to iterate through a column to populate another column

I am trying to populate the column num_crimes. Since the zipcode repeats in the houses data frame, I just want to add the number of crimes related to that zipcode from the dictionary containing all the crimes per zipcode.
The houses dataframe contains 5000 entries and the dictionary contains only 67, so I cannot just merge them.
This is the houses dataframe:
sold_price | zipcode | fireplaces | num_crimes
5300000 | 85637 | 6 | NaN
4200000 | 85646 | 5 | NaN
4200000 | 85646 | 5 | NaN
4500000 | 85646 | 6 | NaN
3411450 | 85750 | 4 | NaN
and this is the dictionary:
{85141: 1,85601: 2, 85607: 1, 85614: 4, 85622: 2, 85629: 4, 85634: 1....}
Problem: this is the code I used for that, but it is not changing the values in num_crimes:
def populate(df1):
    for row, rows in df1.iterrows():
        if rows[1] in my_dict:
            rows[3] = my_dict[rows[1]]
        else:
            rows[3] = 0
You can just do something like:
df["num_crimes"] = df["zipcode"].apply(lambda z: my_dict[z])
If there are zipcodes in df that are not in my_dict, you need to handle that as well:
df["num_crimes"] = df["zipcode"].apply(lambda z: my_dict[z] if z in my_dict else -1)
It's a lot easier to answer your questions if you post your data as text rather than images. Anyway, you could make the dict into a dataframe and then join it with the original dataframe, something like this:
houses.set_index("zipcode").join(pd.DataFrame.from_dict(my_dict, orient='index', columns=["Crimes from dict"]))
Would that work?
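As a fuller sketch of that join idea (assuming the houses frame and my_dict from the question; the crimes name is just illustrative), keeping zipcode as a column and filling unmatched zipcodes with 0:
# Sketch: turn the dict into a one-column frame indexed by zipcode, join it
# onto houses via the zipcode column, and fill zipcodes that have no entry.
import pandas as pd

crimes = pd.DataFrame.from_dict(my_dict, orient="index", columns=["num_crimes"])
houses = (
    houses.drop(columns="num_crimes")   # drop the all-NaN placeholder column
          .join(crimes, on="zipcode")
          .fillna({"num_crimes": 0})
)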

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
+-------------+------------+----------+
| hardware_id | model_name | data_v |
+-------------+------------+----------+
| a | 1 | 0.595150 |
+-------------+------------+----------+
| b | 1 | 0.285757 |
+-------------+------------+----------+
| c | 1 | 0.278061 |
+-------------+------------+----------+
| d | 1 | 0.578061 |
+-------------+------------+----------+
| a | 2 | 0.246565 |
+-------------+------------+----------+
| b | 2 | 0.942299 |
+-------------+------------+----------+
| c | 2 | 0.658126 |
+-------------+------------+----------+
| a | 3 | 0.160283 |
+-------------+------------+----------+
| b | 3 | 0.180021 |
+-------------+------------+----------+
| c | 3 | 0.093628 |
+-------------+------------+----------+
| d | 3 | 0.033813 |
+-------------+------------+----------+
What I'm trying to get would be a DataFrame with all elements except the rows that contain a hardware_id of d, since d does not occur at least once for every model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import dask.dataframe as dd
import numpy as np
import pandas as pd

models = ['1','1','1','2','2','2','3','3','3','3']
frame_1 = dd.from_pandas(
    pd.DataFrame({'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
                  'model_name': models,
                  'data_v': np.random.rand(len(models))}),
    npartitions=2,
)
model_splits = []
for i in range(1, 4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])
aggregate_list = []
while len(model_splits) > 0:
    data = model_splits.pop()
    for other_models in aggregate_list:
        data = data[data.hardware_id.isin(other_models.hardware_id.to_bag())]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
    {'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
     'model_name': models,
     'data_v': np.random.rand(len(models))
    }
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First we do a groupby aggregation to get the unique model_names per hardware_id. This returns a list, but we want it as a tuple for efficiency so it works in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Next, we check whether that tuple exactly matches our unique model names, using isin. If it doesn't, we know the condition should be False (exactly what we get).
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd

ddf = dd.from_pandas(df, 2)
unique_model_names = ddf.model_name.unique()

agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])

relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute())  # can't pass a dask Series to `ddf.isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant calculation.
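A minimal sketch of that suggestion, assuming the agged and relevant_ids variables from the Dask code above:
# Sketch: keep the intermediate results in distributed memory so later
# .compute() calls do not redo the groupby work.
agged = agged.persist()
relevant_ids = relevant_ids.persist()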

Pandas - how to get the minimum value for each row from values across several rows

I have a pandas dataframe in the following structure:
|index | a | b | c | d | e |
| ---- | -- | -- | -- | -- | -- |
|0 | -1 | -2| 5 | 3 | 1 |
How can I get the minimum value for each row using only the positive values in columns a-e?
For the example row above, the minimum of (5,3,1) should be 1 and not (-2).
You can loop over all rows and apply your condition to each row. For example:
import pandas as pd

df = pd.DataFrame([{"a":-2,"b":2,"c":5},{"a":3,"b":0,"c":-1}])
#    a  b  c
# 0 -2  2  5
# 1  3  0 -1

def my_condition(li):
    li = [i for i in li if i >= 0]
    return min(li)

min_cel = []
for k, r in df.iterrows():
    li = r.to_dict().values()
    min_cel.append(my_condition(li))
df["min"] = min_cel
#    a  b  c  min
# 0 -2  2  5    2
# 1  3  0 -1    0
You can also write the same code in one line:
df['min'] = df.apply(lambda row: min([i for i in row.to_dict().values() if i >= 0]), axis=1)
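A vectorized alternative (a sketch, not from the original answer) masks out the non-positive values and lets pandas take the row-wise minimum, avoiding the Python-level loop; it uses the question's strictly-positive condition, so use >= 0 instead if you want to reproduce the loop's output above:
# Sketch: non-positive values become NaN and are then ignored by min(axis=1).
vals = df[["a", "b", "c"]]
df["min"] = vals.where(vals > 0).min(axis=1)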

Convert multi-dimensional array to records

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
1 | a
2 | b
3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
       from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
     ) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
1 | a
2 | b
3 | c
Docs
Not sure what exactly you mean by "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before its execution (so before it begins to read the string).
But I would like to propose converting the string into a normal relational representation of the matrix:
select i, j, arr[i][j] a_i_j from (
  select i, generate_subscripts(arr,2) as j, arr from (
    select generate_subscripts(arr,1) as i, arr
    from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
  ) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be rather usable in further data processing, I think.
Of course, such a query can handle only an array with a predefined number of dimensions, but the array sizes along each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
ADDITION: Yes, using with recursive one can build a similar query capable of handling arrays with an arbitrary number of dimensions. Nonetheless, there is no way to overcome the limitation coming from the relational data model: the exact set of columns must be defined at query parse time, and there is no way to delay this until execution time. So we are forced to store all indices in one column, using another array.
Here is a query that extracts all elements from an arbitrary multi-dimensional array along with their zero-based indices (stored in a separate one-dimensional array):
with recursive extract_index(k, idx, elem, arr, n) as (
  select (row_number() over()) - 1 k, idx, elem, arr, n from (
    select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
    from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
  ) plain_indexed
  union all
  select k/array_length(arr,n)::bigint k, array_prepend(k%array_length(arr,2), idx) idx, elem, arr, n-1 n
  from extract_index
  where n != 1
)
select array_prepend(k, idx) idx, elem from extract_index where n=1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what real practical use one could make of it :)