Sort a map by value first and then by key?

I am hoping to find an elegant way of sorting a map by value first and then by the key.
For example:
B | 50
A | 50
C | 50
E | 10
D | 100
F | 99
I have the following code:
// Making the map into a list first
List<Map.Entry<String, Integer>> sortingList = new LinkedList<>(processMap.entrySet());
// Create a comparator that would compare the values of the map
Comparator<Map.Entry<String, Integer>> c = Comparator.comparingInt(entry -> entry.getValue());
// Sort the list in descending order
sortingList.sort(c.reversed());
I don't need the result to be a map again, so this is sufficient. However, my result is:
D | 100
F | 99
B | 50
A | 50
C | 50
E | 10
I would like to sort not just by value, but also by the key, so the result becomes:
D | 100
F | 99
A | 50
B | 50
C | 50
E | 10
I have researched some possible solutions, but the problem is that my values need to be in descending order while my keys have to be in ascending order...
Hoping someone can help me.

Try this:
Comparator<Map.Entry<String, Integer>> c = Comparator.comparing(Map.Entry<String, Integer>::getValue)
.reversed()
.thenComparing(Map.Entry::getKey);
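For completeness, a minimal, self-contained sketch of how that comparator slots into the code from the question (the class name and the hard-coded sample data are just for illustration):
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class SortByValueThenKey {
    public static void main(String[] args) {
        Map<String, Integer> processMap = Map.of(
                "A", 50, "B", 50, "C", 50, "D", 100, "E", 10, "F", 99);

        List<Map.Entry<String, Integer>> sortingList = new LinkedList<>(processMap.entrySet());

        // Values descending, keys ascending for equal values
        Comparator<Map.Entry<String, Integer>> c = Comparator
                .comparing(Map.Entry<String, Integer>::getValue)
                .reversed()
                .thenComparing(Map.Entry::getKey);

        sortingList.sort(c);
        sortingList.forEach(e -> System.out.println(e.getKey() + " | " + e.getValue()));
        // D | 100, F | 99, A | 50, B | 50, C | 50, E | 10
    }
}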


Recursive query to produce edges of a path?

I have a table paths:
CREATE TABLE paths (
id_travel INT,
point INT,
visited INT
);
Sample rows:
id_travel | point | visited
-----------+-------+---------
       10 |    35 |       0
       10 |    16 |       1
       10 |    93 |       2
        5 |    15 |       0
        5 |    26 |       1
        5 |   193 |       2
        5 |    31 |       3
And another table distances:
CREATE TABLE distances (
id_port1 INT,
id_port2 INT,
distance INT CHECK (distance > 0),
PRIMARY KEY (id_port1, id_port2)
);
I need to make a view:
id_travel | point1 | point2 | distance
-----------+--------+--------+---------
       10 |     35 |     16 |     1568
       10 |     16 |     93 |      987
        5 |     15 |     26 |      251
        5 |     26 |    193 |       87
        5 |    193 |     31 |      356
I don't know how to construct dist_trips as a recursive query here:
CREATE VIEW dist_view AS
WITH RECURSIVE dist_trips (id_travel, point1, point2) AS
(SELECT ????)
SELECT dt.id_travel, dt.point1, dt.point2, d.distance
FROM dist_trips dt
NATURAL JOIN distances d;
dist_trips is a recursive CTE which should return three columns: id_travel, point1, and point2, derived from table paths.
You don't need recursion. This can be done with plain joins:
SELECT id_travel, p1.point AS point1, p2.point AS point2, d.distance
FROM paths p1
JOIN paths p2 USING (id_travel)
LEFT JOIN distances d ON d.id_port1 = p1.point
AND d.id_port2 = p2.point
WHERE p2.visited = p1.visited + 1
ORDER BY id_travel, p1.visited;
db<>fiddle here
Your paths seem to have gapless ascending numbers. Just join each point with the next.
I threw in a LEFT JOIN to keep all edges of each path in the result, even if the distances table should not have a matching entry. Probably prudent.
Your NATURAL JOIN didn't go anywhere. Generally, NATURAL is rarely useful and breaks easily. The manual warns:
USING is reasonably safe from column changes in the joined relations
since only the listed columns are combined. NATURAL is considerably
more risky since any schema changes to either relation that cause a
new matching column name to be present will cause the join to combine
that new column as well.
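If you still want the result exposed as the dist_view from the question, the same join can simply be wrapped in the view definition. A sketch (the ORDER BY is left out, since ordering is usually applied by the query that reads the view):
CREATE VIEW dist_view AS
SELECT id_travel, p1.point AS point1, p2.point AS point2, d.distance
FROM   paths p1
JOIN   paths p2 USING (id_travel)
LEFT   JOIN distances d ON d.id_port1 = p1.point
                       AND d.id_port2 = p2.point
WHERE  p2.visited = p1.visited + 1;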

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
+-------------+------------+----------+
| hardware_id | model_name | data_v |
+-------------+------------+----------+
| a | 1 | 0.595150 |
+-------------+------------+----------+
| b | 1 | 0.285757 |
+-------------+------------+----------+
| c | 1 | 0.278061 |
+-------------+------------+----------+
| d | 1 | 0.578061 |
+-------------+------------+----------+
| a | 2 | 0.246565 |
+-------------+------------+----------+
| b | 2 | 0.942299 |
+-------------+------------+----------+
| c | 2 | 0.658126 |
+-------------+------------+----------+
| a | 3 | 0.160283 |
+-------------+------------+----------+
| b | 3 | 0.180021 |
+-------------+------------+----------+
| c | 3 | 0.093628 |
+-------------+------------+----------+
| d | 3 | 0.033813 |
+-------------+------------+----------+
What I'm trying to get is a DataFrame with all rows except those with a hardware_id of d, since d does not occur at least once for every model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import numpy as np
import pandas as pd
import dask.dataframe as dd

models = ['1','1','1','2','2','2','3','3','3','3']
frame_1 = dd.from_pandas(
    pd.DataFrame({'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
                  'model_name': models,
                  'data_v': np.random.rand(len(models))}),
    npartitions=2)

# split into one dataframe per model_name
model_splits = []
for i in range(1, 4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])

# keep only hardware_ids that also appear in the other splits
aggregate_list = []
while len(model_splits) > 0:
    data = model_splits.pop()
    for other_models in model_splits:
        data = data[data.hardware_id.isin(other_models.hardware_id.compute())]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
    {'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
     'model_name': models,
     'data_v': np.random.rand(len(models))
    }
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First, we do a groupby aggregation to get the unique model_names per hardware_id. This returns an array, but we want it as a tuple so it works in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Next, we check whether that tuple exactly matches our unique model names, using isin. If it doesn't, we know the condition should be False (exactly what we get).
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd

ddf = dd.from_pandas(df, 2)
unique_model_names = ddf.model_name.unique()

agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])

relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute())  # can't pass a dask Series to `isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant computation.
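For example, a minimal sketch of that (assuming enough memory is available; agged and relevant_ids are the variables from the code above):
# keep the small intermediate results in memory so the groupby and the
# mask are not recomputed for every downstream use
agged = agged.persist()
relevant_ids = relevant_ids.persist()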

Find referenced value of multiple columns

I have a table Setpoints which contains three columns, Base, Effective and Actual, each holding an id that refers to an item in the io table.
I would like to make a query that will return the io_value found in the io table for the id found in Setpoints.
Currently my query returns multiple ids, and then I query the io table to find the io_value for each id.
Example: query returning the ids in each row
row # | base | effective | actual
    1 |   24 |        30 |     40
    2 |   25 |        31 |     41
    3 |   26 |        32 |     42
But I want it to return the values instead of the ids.
Example: returning the values for the ids instead
row # | base | effective | actual
    1 |  2.3 |       4.5 |   3.44
    2 |  4.2 |       7.7 |   4.41
    3 |  3.9 |      8.12 |   5.42
Here are the table fields
IO
io_value
io_id
Setpoints
stpt_base
stpt_effective
stpt_actual
Using Postgres 9.5.
What I'm using now:
SELECT * FROM setpoints;

-- then, for each returned row:
SELECT io_id, io_value
FROM   io
WHERE  io_id IN (stpt_effective, stpt_actual, stpt_base);
-- these values come from the previous query
You can solve this by joining the io table to the setpoints table three times, once for each of the three columns:
SELECT a.io_value AS base,
b.io_value AS effective,
c.io_value AS actual
FROM setpoints s
JOIN io a ON a.io_id = s.stpt_base
JOIN io b ON b.io_id = s.stpt_effective
JOIN io c ON c.io_id = s.stpt_actual;

Convert multi-dimensional array to records

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
  1 | a
  2 | b
  3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
  1 | a
  2 | b
  3 | c
Docs
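One thing to be aware of: json_each_text returns both key and value as text, so if foo should be an integer you can cast it in the same query, for example:
select key::int as foo, value as bar
from json_each_text(
       json_object('{{1,"a"},{2,"b"},{3,"c"}}')
     );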
Not sure what exactly you mean by saying "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before its execution (so before it begins to read the string).
But I would like to propose converting the string into a normal relational representation of the matrix:
select i, j, arr[i][j] a_i_j from (
select i, generate_subscripts(arr,2) as j, arr from (
select generate_subscripts(arr,1) as i, arr
from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be rather usable in further data processing, I think.
Of course, such a query can only handle arrays with a predefined number of dimensions, but the array sizes in each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
ADDITION: Yes, using WITH RECURSIVE one can build a similar query capable of handling arrays with an arbitrary number of dimensions. Nonetheless, there is no way to overcome the limitation coming from the relational data model: the exact set of columns must be defined at query parse time, and there is no way to delay this until execution time. So we are forced to store all indices in one column, using another array.
Here is a query that extracts all elements from an arbitrary multi-dimensional array along with their zero-based indices (stored in another one-dimensional array):
with recursive extract_index(k, idx, elem, arr, n) as (
  select (row_number() over()) - 1 k, idx, elem, arr, n from (
    select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
    from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
  ) plain_indexed
  union all
  -- peel off the innermost remaining dimension: its length drives both the
  -- modulo (index within that dimension) and the division (carry to the next level)
  select k / array_length(arr, n)::bigint k,
         array_prepend(k % array_length(arr, n), idx) idx,
         elem, arr, n - 1 n
  from extract_index
  where n != 1
)
select array_prepend(k, idx) idx, elem from extract_index where n = 1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what a real practical use one could make out of it :)

Dynamically select a column from a generic list

I have a table that is 200 columns wide and need to return the data of a specific row and column but I won't know the column until runtime. I can easily get the row I want into either a list, an individual strongly typed object, or an Array through LINQ but I can't for the life of me figure out how to find the column I need.
So, for instance (on a smaller scale), my table looks like this:
GrowerKey | day1 | day2 | day3 | day4 |
-----------------------------------------
        3 |    1 |    3 |    2 |    2 |
        4 |    6 |    1 |    9 |    1 |
        5 |    8 |    8 |    2 |    4 |
and I can get the row I want with something simple like this
Dim CleanRecord As List(Of Grower_Clean_Schedule) = (From key In eng.Grower_Clean_Schedules
Where key.Grower_Key = Grower_Key).ToList
How do I then return only the value of a specific column of that row (say, the value stored in "day2") when I won't know which column until runtime?
Something like this (starting with CleanRecord which you defined in your question):
' Requires Imports System.Reflection (for BindingFlags) and Imports System.Linq
Dim matchingRow = CleanRecord.First()

' Look the property (column) up by name via reflection
Dim props = matchingRow.GetType().GetProperties( _
    BindingFlags.Instance Or BindingFlags.Public)

Dim myReturnVal = (From prop In props _
                   Where prop.Name = "day2" _
                   Select prop.GetValue(matchingRow, Nothing)).FirstOrDefault()
Return myReturnVal
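Since the column is not known until runtime, you would normally compare against a variable rather than the literal "day2" used above. A minimal sketch, where columnName is a hypothetical variable holding the runtime column name:
Dim columnName As String = "day2"   ' supplied at runtime
Dim myReturnVal = (From prop In props _
                   Where prop.Name = columnName _
                   Select prop.GetValue(matchingRow, Nothing)).FirstOrDefault()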