Elixir best data structure for testing comparison

I have two array outputs where I need to iterate over each struct and compare the counts where the sources match. The comparison needs to be less than or equal to. My output sources look like this:
output_1: [%{source: "facebook", count: 3}, %{count: 1, source: "linkedin"}]
output_2: [%{source: "facebook", count: 2}, %{count: 1, source: "linkedin"}]
What's the best data structure to implement in order to make the Enumerables easiest and most efficient to compare?

If your order isn't guaranteed, my preferred way is to turn the reference list into a map and compare things by source.
iex> output_1 = [%{source: "facebook", count: 3}, %{count: 1, source: "linkedin"}]
[%{count: 3, source: "facebook"}, %{count: 1, source: "linkedin"}]
iex> output_2 = [%{source: "facebook", count: 2}, %{count: 1, source: "linkedin"}]
[%{count: 2, source: "facebook"}, %{count: 1, source: "linkedin"}]
iex> limits = Map.new(output_1, &{&1.source, &1.count})
%{"facebook" => 3, "linkedin" => 1}
iex> Enum.all?(output_2, & &1.count <= limits[&1.source])
true
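If the same check is needed in several places, the idea above can be wrapped in a small helper. A minimal sketch (the CountCheck module and within_limits?/2 name are made up for illustration; it also assumes every source in the second list appears in the first):
defmodule CountCheck do
  # Returns true when every entry in `candidate` has a count less than
  # or equal to the count of the matching source in `reference`.
  def within_limits?(reference, candidate) do
    # Build a source => count lookup from the reference list.
    limits = Map.new(reference, &{&1.source, &1.count})

    Enum.all?(candidate, &(&1.count <= limits[&1.source]))
  end
end
With the sample data, CountCheck.within_limits?(output_1, output_2) returns true, since 2 <= 3 and 1 <= 1.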

Your current output format can be handled efficiently with the following code. You didn't say what you expect the output to be, nor in which direction the comparison should be done (output2 <= output1 or output1 <= output2), so I'm assuming a list of booleans and output1 <= output2:
defmodule A do
  def compare([%{count: count1} | maps1], [%{count: count2} | maps2]) do
    [count1 <= count2 | compare(maps1, maps2)]
  end

  def compare([], []), do: []
end
The following does the same thing and is easier to come up with and understand:
defmodule A do
  def compare(list1, list2), do: _compare(list1, list2, [])

  defp _compare([%{count: count1} | maps1], [%{count: count2} | maps2], acc) do
    _compare(maps1, maps2, [count1 <= count2 | acc])
  end

  defp _compare([], [], acc) do
    Enum.reverse(acc)
  end
end
In iex:
~/elixir_programs$ iex a.ex
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Interactive Elixir (1.8.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> out1 = [%{source: "facebook", count: 3}, %{count: 1, source: "linkedin"}]
[
%{count: 3, source: "facebook"},
%{count: 1, source: "linkedin"}
]
iex(2)> out2 = [%{source: "facebook", count: 2}, %{count: 1, source: "linkedin"}]
[
%{count: 2, source: "facebook"},
%{count: 1, source: "linkedin"}
]
iex(3)> A.compare(out1, out2)
[false, true]
If instead you need the result to be a single boolean (i.e. the facebook count is less than or equal AND the linkedin count is less than or equal), you can change the accumulator:
defmodule A do
  def compare(list1, list2), do: _compare(list1, list2, true)

  defp _compare([%{count: count1} | maps1], [%{count: count2} | maps2], true) do
    _compare(maps1, maps2, count1 <= count2)
  end

  # If you find a false comparison, stop and return false.
  defp _compare(_, _, false), do: false

  defp _compare([], [], _), do: true
end
In iex:
iex(22)> c "a.ex"
warning: redefining module A (current version defined in memory)
  a.ex:1
[A]
iex(23)> A.compare(out1, out2)
false
This also works, though only because your lists have exactly two elements, since it checks just the first and last:
defmodule A do
  def compare(list1, list2) do
    List.first(list1)[:count] <= List.first(list2)[:count] and
      List.last(list1)[:count] <= List.last(list2)[:count]
  end
end
What's the best data structure to implement in order to make the Enumerables easiest and most efficient to compare?
Otherwise (if the order is guaranteed), I would nominate a keyword list like this:
[facebook: 3, linkedin: 1]
[facebook: 2, linkedin: 1]
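Assuming that shape (and that both lists cover the same sources), the comparison stays short, because keyword lists support the kw[key] access syntax. A sketch:
limits = [facebook: 3, linkedin: 1]
actual = [facebook: 2, linkedin: 1]

# Each keyword entry is a {source, count} tuple; look the source up in
# the limits and require count <= limit.
Enum.all?(actual, fn {source, count} -> count <= limits[source] end)
#=> true
Note that limits[source] does a linear scan per element, which is fine for lists this small.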

The easiest would probably be to use Enum.zip/2 with Enum.all?/2. Something like the following should work:
output_1 = Enum.sort(output_1, fn a, b -> a.source <= b.source end)
output_2 = Enum.sort(output_2, fn a, b -> a.source <= b.source end)

output_1
|> Enum.zip(output_2)
|> Enum.all?(fn {a, b} -> a.count <= b.count end)
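With the sample data from the question, this returns false, because the facebook pair fails the check (3 <= 2 is false). That assumes output_1 <= output_2 is the intended direction; flip the comparison to b.count <= a.count otherwise:
iex> output_1 |> Enum.zip(output_2) |> Enum.all?(fn {a, b} -> a.count <= b.count end)
false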

Related

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
import pandas as pd

index = ['Person 1', 'Person 2', 'Person 3']
columns = [
    ["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
    ["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, index=index, columns=columns)
df
Now I would like to highlight the cells adjacent to SD if SD > 10. This is how it should look:
I found this answer but couldn't make it work for multiindices.
Thanks for any help.
Use Styler.apply with a custom function: select the SD columns with DataFrame.xs, then repeat the boolean mask across all columns with DataFrame.reindex:
def highlight(x):
    c1 = 'background-color: red'
    # boolean mask per condition group: True where SD > 10
    mask = x.xs('SD', axis=1, level=1).gt(10)
    # DataFrame with the same index and columns as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # repeat the boolean mask across level 0 and set the style where it is True
    return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)

df.style.apply(highlight, axis=None)

Set index for aggregated dataframe

I did some calculations on a list of dataframes. I'd like the result dataframe to use a RangeIndex. However, it uses one of the column names as the index, even though I set index=None:
import pandas as pd

d1 = {'id': [1, 2, 3, 4, 5], 'is_free': [True, False, False, True, True], 'level': ['Top', 'Mid', 'Top', 'Top', 'Low']}
d2 = {'id': [1, 3, 4, 5, 7], 'is_free': [True, True, False, False, False], 'level': ['Top', 'High', 'Top', 'Top', 'Low']}
d1 = pd.DataFrame(data=d1)
d2 = pd.DataFrame(data=d2)
df_list = [d1, d2]
dfs = []
for i, df in enumerate(df_list):
    df = df.groupby('is_free')['id'].count()
    dfs.append(df)
df = pd.DataFrame(data=dfs, index=None)
It returns
is_free  False  True
id           2     3
id           3     2
df.index returns
Index(['id', 'id'], dtype='object')
From your code:
df = pd.DataFrame(data=dfs, index=None).reset_index(drop=True)
However, in general, I would avoid appending iteratively. Try concat:
pd.concat({i: d.groupby('is_free')['id'].count()
           for i, d in enumerate(df_list)},
          axis=1).T
Or use pd.DataFrame:
pd.DataFrame({i: d.groupby('is_free')['id'].count()
              for i, d in enumerate(df_list)}).T
Output:
is_free  False  True
0            2     3
1            3     2

numpy unique over multiple arrays

numpy.unique expects a 1-D array. If the input is not a 1-D array, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, and we are unique-ing pairs of elements across the 2 arrays.
For example, say I have 2 numpy arrays as inputs
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, so against these 4 pairs: (1, 10), (2, 20), (3, 30), and (3, 31). These 4 are all unique, so I want my result to say
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could use the unique_indices value returned by numpy.unique():
In [242]: import numpy as np

In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
You may also find this thread helpful.

Efficient column MultiIndex ordering

I have this dataframe:
import pandas

df = pandas.DataFrame({'A': [2000, 2000, 2000, 2000, 2000, 2000],
                       'B': ["A+", "B+", "A+", "B+", "A+", "B+"],
                       'C': ["M", "M", "M", "F", "F", "F"],
                       'D': [1, 5, 3, 4, 2, 6],
                       'Value': [11, 12, 13, 14, 15, 16]}).set_index(['A', 'B', 'C', 'D'])
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering if there is a more elegant way to order the column MultiIndex than the following code:
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
In particular, I feel like the last line, with its two transposes, is far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both rows and columns).
You can use sort_index on both levels:
out = df.sort_index(level=[0,1],axis=1,ascending=[True, False])
I can use axis=1, and therefore the last line becomes:
df = df.sort_values(axis=1, by=["C", "D"], ascending=[True, False])

How to count array elements occurrences in Presto?

I have an array in Presto and I'd like to count how many times each element occurs in it. For example, I have
[a, a, a, b, b]
and I'd like to get something like
{a: 3, b: 2}
We do not have a direct function for this, but you can combine UNNEST with histogram:
presto> SELECT histogram(x)
     -> FROM UNNEST(ARRAY[1111, 1111, 22, 22, 1111]) t(x);
      _col0
----------------
 {22=2, 1111=3}
You may want to file a new issue for a direct function for this.
SELECT
  TRANSFORM_VALUES(
    MULTIMAP_FROM_ENTRIES(
      TRANSFORM(ARRAY['a', 'a', 'a', 'b', 'b'], x -> ROW(x, 1))
    ),
    (k, v) -> ARRAY_SUM(v)
  )
Output:
{
  "a": 3,
  "b": 2
}
You can use REDUCE if ARRAY_SUM is not supported:
SELECT
  TRANSFORM_VALUES(
    MULTIMAP_FROM_ENTRIES(
      TRANSFORM(ARRAY['a', 'a', 'a', 'b', 'b'], x -> ROW(x, 1))
    ),
    (k, v) -> REDUCE(v, 0, (s, x) -> s + x, s -> s)
  )
In Presto 0.279, you now have a direct function for this purpose. You can simply use array_frequency: the input is your ARRAY, and the output is a MAP where the keys are the elements of the given array and the values are their frequencies. For example, if you run this SQL:
SELECT array_frequency(ARRAY[1,4,1,3,5,4,7,3,1])
The result will be
{
  "1": 3,
  "3": 2,
  "4": 2,
  "5": 1,
  "7": 1
}