count null values and group results - sql

Problem
I want to count NULL values and group the results, but the query gives me wrong values.
String jpql = "select c.commande.user.login, "
    + "(select count(*) from Designation c"
    + " WHERE c.commande.commandeTms IS NOT EMPTY"
    + " AND c.etatComde = 0) AS count1, "
    + "(select count(*) from Designation c"
    + " WHERE c.commande.commandeTms IS EMPTY) AS count2 "
    + "from Designation c GROUP BY c.commande.user.login";
I got these results:

login   count1  count2
user1   10      10
user2   0       0

But I should have these:

login   count1  count2
user1   4       2
user2   3       1
Sample data:

table Commande
idComdeComm  commandeTms_idComndeTms  user_idUser
6            17                       2
8            NULL                     2
10           28                       2
12           NULL                     2
14           NULL                     2
16           NULL                     2
21           NULL                     19
23           NULL                     19
25           26                       19
31           NULL                     19

table designation
idDesignation  designation  etatComde  commande_idComdeComm
5              'fef'        0          6
7              'ferf'       0          8
9              'hrhrh'      0          10
11             'ujujuju'    0          12
13             'kikolol'    0          14
15             'ololo'      0          16
20             'gdfgfd'     0          21
22             'gdfgfdd'    0          23
24             'nhfn'       0          25
30             'momo'       0          31

table user
idUser  login    password  profil
1       'moez'   '***'     'admin'
2       'user1'  '**'      'comm'
3       'log'    '**'      'log'
4       'mo'     '*'       'comm'
19      'user2'  '*'       'comm'

table Commande TMS
idComndeTms  etatOperationNumCMD  numeroComndeTms
17           ''                   3131
26           ''                   2525
28           ''                   3333
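A likely cause of the wrong numbers is that the two scalar subqueries count over all Designation rows, uncorrelated with the outer GROUP BY, so every login gets the same totals. One common fix is conditional aggregation in a single grouped query. Below is a sketch of that SQL idea, run against the sample data with sqlite3; the table and column names follow the sample tables above, and the real JPQL/entity mapping may differ:

```python
import sqlite3

# Minimal sketch: conditional aggregation instead of uncorrelated subqueries.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user (idUser INTEGER PRIMARY KEY, login TEXT);
CREATE TABLE commande (idComdeComm INTEGER PRIMARY KEY,
                       commandeTms_idComndeTms INTEGER,
                       user_idUser INTEGER);
CREATE TABLE designation (idDesignation INTEGER PRIMARY KEY,
                          etatComde INTEGER,
                          commande_idComdeComm INTEGER);
INSERT INTO user VALUES (2,'user1'),(19,'user2');
INSERT INTO commande VALUES (6,17,2),(8,NULL,2),(10,28,2),(12,NULL,2),
    (14,NULL,2),(16,NULL,2),(21,NULL,19),(23,NULL,19),(25,26,19),(31,NULL,19);
INSERT INTO designation VALUES (5,0,6),(7,0,8),(9,0,10),(11,0,12),(13,0,14),
    (15,0,16),(20,0,21),(22,0,23),(24,0,25),(30,0,31);
""")
rows = con.execute("""
    SELECT u.login,
           SUM(CASE WHEN c.commandeTms_idComndeTms IS NOT NULL
                     AND d.etatComde = 0 THEN 1 ELSE 0 END) AS count1,
           SUM(CASE WHEN c.commandeTms_idComndeTms IS NULL THEN 1 ELSE 0 END) AS count2
    FROM designation d
    JOIN commande c ON c.idComdeComm = d.commande_idComdeComm
    JOIN user u ON u.idUser = c.user_idUser
    GROUP BY u.login
    ORDER BY u.login
""").fetchall()
print(rows)  # [('user1', 2, 4), ('user2', 1, 3)]
```

On the sample data this yields, per login, 2 designations with a non-NULL commandeTms (and etatComde = 0) and 4 with a NULL one for user1, and 1 and 3 for user2.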

Related

Obtain corresponding column based on another column that matches another dataframe

I want to find matching values from two data frames and return a third value.
For example, if cpg_symbol["Gene_Symbol"] corresponds with diff_meth_kirp_symbol.index, I want to assign cpg_symbol.loc["Composite_Element_REF"] as index.
My code returned an empty dataframe.
diff_meth_kirp.index = diff_meth_kirp.merge(cpg_symbol, left_on=diff_meth_kirp.index, right_on="Gene_Symbol")[["Composite_Element_REF"]]
Example:
diff_meth_kirp

        Hello  My  name  is
First       0   1     2   3
Second      4   5     6   7
Third       8   9    10  11
Fourth     12  13    14  15
Fifth      16  17    18  19
Sixth      20  21    22  23
cpg_symbol

  Composite_Element_REF Gene_Symbol
0                   cg1       First
1                   cg2       Third
2                   cg3       Fifth
3                   cg4     Seventh
4                   cg5       Ninth
5                   cg6       First
Expected output:

     Hello  My  name  is
cg1      0   1     2   3
cg2      8   9    10  11
cg3     16  17    18  19
cg6      0   1     2   3
Your code works well for me but you can try this version:
out = (diff_meth_kirp.merge(cpg_symbol.set_index('Gene_Symbol'),
left_index=True, right_index=True)
.set_index('Composite_Element_REF')
.rename_axis(None).sort_index())
print(out)
# Output
Hello My name is
cg1 0 1 2 3
cg2 8 9 10 11
cg3 16 17 18 19
cg6 0 1 2 3
Input dataframes:
import pandas as pd
data1 = {'Hello': {'First': 0, 'Second': 4, 'Third': 8, 'Fourth': 12, 'Fifth': 16, 'Sixth': 20},
'My': {'First': 1, 'Second': 5, 'Third': 9, 'Fourth': 13, 'Fifth': 17, 'Sixth': 21},
'name': {'First': 2, 'Second': 6, 'Third': 10, 'Fourth': 14, 'Fifth': 18, 'Sixth': 22},
'is': {'First': 3, 'Second': 7, 'Third': 11, 'Fourth': 15, 'Fifth': 19, 'Sixth': 23}}
diff_meth_kirp = pd.DataFrame(data1)
data2 = {'Composite_Element_REF': {0: 'cg1', 1: 'cg2', 2: 'cg3', 3: 'cg4', 4: 'cg5', 5: 'cg6'},
'Gene_Symbol': {0: 'First', 1: 'Third', 2: 'Fifth', 3: 'Seventh', 4: 'Ninth', 5: 'First'}}
cpg_symbol = pd.DataFrame(data2)

Average with same id

INSERT INTO `iot_athlete_ind_cont_non_cur_chlng_consp`
(`aicicc_id`, `aicid_id`, `aicidl_id`, `aica_id`, `at_id`, `aicicc_type`, `aicicc_tp`, `aicicc_attempt`, `aicicc_lastposition`, `aicicc_status`, `pan_percentile`, `age_percentile`, `created_at`, `updated_at`) VALUES
(270, 3, 14, 17, 7, 'Time', 50, 1, 5, 'Active', NULL, NULL, '2022-11-15 08:34:40', '2022-11-15 08:34:40'),
(271, 3, 14, 20, 7, 'Time', 60, 1, 231, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
(272, 3, 14, 21, 7, 'Time', 70, 1, 20, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
(273, 3, 14, 17, 7, 'Time', 90, 2, 5, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:13:42'),
(274, 3, 14, 20, 7, 'Time', 40, 2, 231, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
(275, 3, 14, 21, 7, 'Time', 70, 2, 20, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
(276, 3, 10, 17, 3, 'Time', 80, 1, 5, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:10:25'),
(277, 3, 10, 20, 3, 'Time', 60, 1, 231, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:10:43'),
(278, 3, 10, 21, 3, 'Time', 60, 1, 20, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:11:03');
I need 3 rows from this table, with the average, like this:

at_id  aicicc_attempt  average
7      1               60
7      2               66.66
3      1               66.66
My query is:
SELECT DISTINCT at_id, AVG(aicicc_tp) OVER (PARTITION BY aicicc_attempt) as average
FROM iot_athlete_ind_cont_non_cur_chlng_consp
WHERE aicid_id = '3';
but it is not working properly; the average calculation is wrong.
Let's first say that this is basically just a combination of AVG and GROUP BY:
SELECT at_id, aicicc_attempt,
AVG(aicicc_tp) AS average
FROM iot_athlete_ind_cont_non_cur_chlng_consp
GROUP BY at_id, aicicc_attempt
ORDER BY at_id DESC;
(I don't know if the ORDER BY clause is necessary; without it, the last row appears as the first row.)
The "problem" might be that the average column is not shown exactly as you want it. For example, let's assume the column aicicc_tp has been declared as data type int and you are using a SQL Server DB. In this case, your outcome will also show the average as an integer:
at_id  aicicc_attempt  average
7      1               60
7      2               66
3      1               66
You will need to cast the column as float and format the outcome to generate exactly the desired result (of course, the correct average for your sample data is 66.67, not 66.66):
SELECT at_id, aicicc_attempt,
FORMAT(AVG(CAST(aicicc_tp AS float)), '##.##') AS average
FROM iot_athlete_ind_cont_non_cur_chlng_consp
GROUP BY at_id, aicicc_attempt
ORDER BY at_id DESC;
at_id  aicicc_attempt  average
7      1               60
7      2               66.67
3      1               66.67
If you are using another DB type, the concrete query will differ. That's why it's recommended to tag your DB type.
EDIT because you changed the question:
Adding the WHERE clause that is now mentioned in your question does not change the result:
SELECT at_id, aicicc_attempt,
FORMAT(AVG(CAST(aicicc_tp AS float)), '##.##') AS average
FROM iot_athlete_ind_cont_non_cur_chlng_consp
WHERE aicid_id = 3
GROUP BY at_id, aicicc_attempt
ORDER BY at_id DESC;
The outcome is still the same and correct:
at_id  aicicc_attempt  average
7      1               60
7      2               66.67
3      1               66.67
And a last note: If you want to keep your PARTITION BY idea, your query must be like this:
SELECT DISTINCT at_id, aicicc_attempt,
AVG(aicicc_tp)
OVER (PARTITION BY at_id, aicicc_attempt) AS average
FROM iot_athlete_ind_cont_non_cur_chlng_consp
WHERE aicid_id = 3
ORDER BY at_id DESC;
But I don't recommend this because of some disadvantages:
The usage of DISTINCT often slows down the query.
It does not solve the issue that the average may be computed as an integer, which is not what you want according to your description.
Window functions often differ depending on the DB type; GROUP BY works the same on every DB type.
You can verify all these things here: db<>fiddle
For the average, you may use AVG().
The solution below is for SQL Server, as you didn't tag your RDBMS.
SELECT at_id, aicicc_attempt, AVG(aicicc_tp) AS average
FROM iot_athlete_ind_cont_non_cur_chlng_consp
GROUP BY at_id, aicicc_attempt;
A demo against the sample data, inlined via a CTE (in SQL Server the VALUES list must be wrapped in a derived table, and GROUP BY takes column names rather than ordinal positions):
WITH cte AS (
    SELECT * FROM (VALUES
        (270, 3, 14, 17, 7, 'Time', 50, 1, 5, 'Active', NULL, NULL, '2022-11-15 08:34:40', '2022-11-15 08:34:40'),
        (271, 3, 14, 20, 7, 'Time', 60, 1, 231, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
        (272, 3, 14, 21, 7, 'Time', 70, 1, 20, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
        (273, 3, 14, 17, 7, 'Time', 90, 2, 5, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:13:42'),
        (274, 3, 14, 20, 7, 'Time', 40, 2, 231, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
        (275, 3, 14, 21, 7, 'Time', 70, 2, 20, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 08:35:21'),
        (276, 3, 10, 17, 3, 'Time', 80, 1, 5, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:10:25'),
        (277, 3, 10, 20, 3, 'Time', 60, 1, 231, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:10:43'),
        (278, 3, 10, 21, 3, 'Time', 60, 1, 20, 'Active', NULL, NULL, '2022-11-15 08:34:45', '2022-11-15 12:11:03')
    ) AS v (aicicc_id, aicid_id, aicidl_id, aica_id, at_id, aicicc_type, aicicc_tp, aicicc_attempt,
            aicicc_lastposition, aicicc_status, pan_percentile, age_percentile, created_at, updated_at)
)
SELECT at_id, aicicc_attempt, AVG(aicicc_tp) AS average
FROM cte
GROUP BY at_id, aicicc_attempt;
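To double-check the grouped averages against the sample rows, here is a small sqlite3 sketch; only the three relevant columns plus the filter column are kept, and the CAST mirrors the integer-average caveat discussed above (SQLite's AVG is already floating-point, so the CAST is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (aicid_id INT, at_id INT, aicicc_tp INT, aicicc_attempt INT)")
con.executemany("INSERT INTO t VALUES (?,?,?,?)", [
    (3, 7, 50, 1), (3, 7, 60, 1), (3, 7, 70, 1),   # at_id 7, attempt 1 -> (50+60+70)/3 = 60
    (3, 7, 90, 2), (3, 7, 40, 2), (3, 7, 70, 2),   # at_id 7, attempt 2 -> (90+40+70)/3 = 66.67
    (3, 3, 80, 1), (3, 3, 60, 1), (3, 3, 60, 1),   # at_id 3, attempt 1 -> (80+60+60)/3 = 66.67
])
rows = con.execute("""
    SELECT at_id, aicicc_attempt,
           ROUND(AVG(CAST(aicicc_tp AS REAL)), 2) AS average
    FROM t
    WHERE aicid_id = 3
    GROUP BY at_id, aicicc_attempt
    ORDER BY at_id DESC, aicicc_attempt
""").fetchall()
print(rows)  # [(7, 1, 60.0), (7, 2, 66.67), (3, 1, 66.67)]
```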

Extract information to work with in pandas

I have this dataframe:
 #   Column                 Non-Null Count  Dtype
 0   nombre                 74 non-null     object
 1   fabricante             74 non-null     object
 2   calorias               74 non-null     int64
 3   proteina               74 non-null     int64
 4   grasa                  74 non-null     int64
 5   sodio                  74 non-null     int64
 6   fibra dietaria         74 non-null     float64
 7   carbohidratos          74 non-null     float64
 8   azúcar                 74 non-null     int64
 9   potasio                74 non-null     int64
 10  vitaminas y minerales  74 non-null     int64
I am trying to extract information like this:
cereal_df.loc[cereal_df['fabricante'] == 'Kelloggs', 'sodio']
The output is what I want to extract in this case:
2 260
3 140
6 125
16 290
17 90
19 140
21 220
24 125
25 200
26 0
27 240
37 170
38 170
43 150
45 190
46 220
47 170
50 320
55 210
57 0
59 290
63 70
64 230
Name: sodio, dtype: int64
That is what I need so far, but when I try to write a function like this (in order to get the confidence):
def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, cereal_df[variable]]
    inicio, final = sm.stats.DescrStatsW(subconjunto[variable]).zconfint_mean(alpha = 1 - confianza)
    return inicio, final
Then I run the function:
valor_medio_intervalo('Kelloggs', 'azúcar', 0.95)
And the output is:
KeyError Traceback (most recent call last)
<ipython-input-57-11420ac4d15f> in <module>()
1 #TEST_CELL
----> 2 valor_medio_intervalo('Kelloggs', 'azúcar', 0.95)
7 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1296 if missing == len(indexer):
1297 axis_name = self.obj._get_axis_name(axis)
-> 1298 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1299
1300 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Int64Index([ 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2,\n 12, 13, 7, 0, 3, 10, 5, 13, 11, 7, 12, 12, 15, 9, 5, 3, 4,\n 11, 10, 11, 6, 9, 3, 6, 12, 3, 13, 6, 9, 7, 2, 10, 14, 3,\n 0, 0, 6, -1, 12, 8, 6, 2, 3, 0, 0, 0, 15, 3, 5, 3, 14,\n 3, 3, 12, 3, 3, 8],\n dtype='int64')] are in the [columns]"
I do not understand what is going on.
I appreciate your help or any hint.
Thanks in advance
Just got the answer by examining the code. In the line

inicio, final = sm.stats.DescrStatsW(subconjunto[variable]).zconfint_mean(alpha = 1 - confianza)

subconjunto[variable] should be just subconjunto, because subconjunto is already the selected column once .loc is given the column name instead of cereal_df[variable]:

def valor_medio_intervalo(fabricante, variable, confianza):
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, variable]
    inicio, final = sm.stats.DescrStatsW(subconjunto).zconfint_mean(alpha = 1 - confianza)
    return inicio, final
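For reference, a runnable sketch of the fixed selection on a tiny made-up frame (the numbers are invented for the demo; statsmodels is not imported here, the z-interval is computed directly with the same mean ± z·std/√n formula that DescrStatsW.zconfint_mean uses):

```python
from statistics import NormalDist
import pandas as pd

# Hypothetical stand-in for cereal_df (values invented for the demo).
cereal_df = pd.DataFrame({
    'fabricante': ['Kelloggs', 'Kelloggs', 'Kelloggs', 'Otro'],
    'azucar':     [6.0, 8.0, 10.0, 3.0],
})

def valor_medio_intervalo(fabricante, variable, confianza):
    # The fix: pass the column NAME to .loc, not cereal_df[variable].
    # Passing the Series made .loc treat its values as column labels,
    # which is what raised the KeyError in the question.
    subconjunto = cereal_df.loc[cereal_df['fabricante'] == fabricante, variable]
    z = NormalDist().inv_cdf(1 - (1 - confianza) / 2)
    media = subconjunto.mean()
    error = subconjunto.std(ddof=1) / len(subconjunto) ** 0.5
    return media - z * error, media + z * error

inicio, final = valor_medio_intervalo('Kelloggs', 'azucar', 0.95)
print(round(inicio, 2), round(final, 2))  # 5.74 10.26
```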

SQL help for selecting most recent non-Null value for a unique plant

I have a SQL Server table with data on various factories (plants), with rows identified by a root plant ID, and a sub plant ID. The root ID is the same for the facility for its entire life. And the sub ID is added each time the plant data is changed with the regulatory agency.
Sometimes when the plant data was re-filed with the regulator, only the changed data was submitted, and other fields were left blank (Null).
I'm looking for an elegant way to write a query that will return all of the data from the most recent sub ID record, except that for Capacity, it will pull the most recent sub for which a non-Null Capacity was actually specified.
Assume that these are the fields in the Plant table:
RecordId (primary key)
RootId
SubId
Fuel
Capacity
Here is the SQL for selecting the data for the most recent SubId:
SELECT p1.* FROM Plant as p1
WHERE
p1.SubId = (
SELECT TOP 1 p2.SubId FROM Plant as p2
WHERE p1.RootId = p2.RootId
ORDER BY p2.SubId DESC)
I've been thinking about this for a while, but haven't come up with an approach. Even just a push in the right direction would be appreciated. Here is some SQL code to generate sample data:
CREATE TABLE Plant (
RecordId INTEGER PRIMARY KEY,
RootId VARCHAR(12) not null,
SubID INTEGER not null,
Fuel INTEGER not null,
Capacity DECIMAL(10,4)
);
INSERT INTO Plant
VALUES
(451, 'PLT03-39', 3, 1, 4399.67),
(471, 'PLT03-39', 4, 1, 4399.67),
(1809, 'PLT03-39', 5, 1, 4399.67),
(4888, 'PLT03-39', 6, 1, Null),
(6111, 'PLT03-39', 7, 1, Null),
(450, 'PLT03-40', 3, 1, 15531.67),
(472, 'PLT03-40', 4, 1, Null),
(1810, 'PLT03-40', 5, 1, 14767.61),
(4882, 'PLT03-40', 6, 1, Null),
(6113, 'PLT03-40', 7, 1, Null),
(454, 'PLT03-41', 5, 1, 23726.34),
(455, 'PLT03-41', 6, 1, 23726.34),
(469, 'PLT03-41', 7, 1, 23726.34),
(1807, 'PLT03-41', 8, 1, 22850.96),
(4884, 'PLT03-41', 9, 1, 22850.96),
(6110, 'PLT03-41', 10, 1, 22850.96),
(452, 'PLT03-42', 3, 1, 9120.65),
(470, 'PLT03-42', 4, 1, Null),
(1808, 'PLT03-42', 5, 1, 9120.65),
(4883, 'PLT03-42', 6, 1, 9120.65),
(6109, 'PLT03-42', 7, 1, Null),
(449, 'PLT03-43', 4, 1, 7923.96),
(474, 'PLT03-43', 5, 1, 7923.96),
(1811, 'PLT03-43', 6, 1, 7357.24),
(4881, 'PLT03-43', 7, 1, Null),
(5107, 'PLT03-43', 7, 1, 7711.44),
(5133, 'PLT03-43', 7, 1, Null),
(6112, 'PLT03-43', 8, 1, 7711.44),
(98, 'PLT05-25', 2, 18, 26.565),
(528, 'PLT05-25', 2, 18, 26033.7),
(139, 'PLT05-25', 2, 18, 26565),
(380, 'PLT05-25', 2, 18, Null),
(381, 'PLT05-25', 2, 18, 51854.88),
(7398, 'PLT06-143', 0, 18, 4091.01),
(4112, 'PLT06-143', 1, 18, 4091.01),
(5309, 'PLT06-143', 2, 18, 4091.01),
(73982, 'PLT06-143', 2, 18, 4091.01),
(73981, 'PLT06-143', 3, 18, Null),
(7397, 'PLT06-145', 0, 18, 4091.01),
(73971, 'PLT06-145', 1, 18, 4091.01),
(4109, 'PLT06-145', 1, 18, Null),
(5314, 'PLT06-145', 2, 18, 4091.01),
(73972, 'PLT06-145', 2, 18, Null),
(73973, 'PLT06-145', 3, 18, 4091.01),
(177, 'PLT06-342', 2, 1, 35420),
(1307, 'PLT06-342', 3, 1, 30360),
(5946, 'PLT06-342', 4, 1, 30360),
(6220, 'PLT06-342', 5, 1, Null),
(13264, 'PLT06-342', 6, 1, Null),
(1312, 'PLT06-344', 2, 1, 15180),
(5106, 'PLT06-344', 3, 1, 15180),
(5945, 'PLT06-344', 4, 1, 15180),
(6218, 'PLT06-344', 5, 1, Null),
(10550, 'PLT06-344', 6, 1, 10120),
(13271, 'PLT06-344', 7, 1, 10120),
(2724, 'PLT06-87', 2, 6, 143.451),
(5039, 'PLT06-87', 3, 6, 143.451),
(5886, 'PLT06-87', 4, 6, Null),
(10586, 'PLT06-87', 5, 6, 143.451),
(22759, 'PLT06-87', 6, 6, Null),
(158, 'PLT07-234', 1, 18, 21274.77),
(341, 'PLT07-234', 2, 18, 21274.77),
(7813, 'PLT07-234', 3, 18, 21274.77),
(24562, 'PLT07-234', 4, 18, Null),
(24584, 'PLT07-234', 4, 18, 2488.508),
(5965, 'PLT07-328', 2, 1, 19607.5),
(6073, 'PLT07-328', 2, 1, 19607.5),
(5996, 'PLT07-328', 2, 1, 19607.5),
(6644, 'PLT07-328', 3, 1, 19607.5),
(6701, 'PLT07-328', 3, 1, Null),
(7664, 'PLT07-328', 4, 1, Null),
(227, 'PLT07-39', 2, 18, 50347),
(1269, 'PLT07-39', 3, 18, 50258.45),
(1821, 'PLT07-39', 4, 18, 50258.45),
(1976, 'PLT07-39', 4, 18, 50258.45),
(5282, 'PLT07-39', 5, 18, Null),
(374, 'PLT08-25', 2, 18, 55331.1),
(135, 'PLT08-25', 2, 18, 30.36),
(134, 'PLT08-25', 2, 18, 56.925),
(533, 'PLT08-25', 2, 18, 55.7865),
(93, 'PLT08-25', 2, 18, 56.925),
(4081, 'PLT08-437', 1, 18, 5206.74),
(4241, 'PLT08-437', 2, 18, 5206.74),
(4242, 'PLT08-437', 3, 18, 5206.74),
(4532, 'PLT08-437', 4, 18, 4946.656),
(24344, 'PLT08-437', 5, 18, Null),
(460, 'PLT10-574', 0, 18, 198207.284),
(943, 'PLT10-574', 2, 18, 198207.284),
(1248, 'PLT10-574', 3, 18, 198207.284),
(2371, 'PLT10-574', 4, 18, 198207.284),
(6173, 'PLT10-574', 5, 18, 198207.284),
(17787, 'PLT10-574', 6, 18, 198207.284),
(23533, 'PLT10-574', 7, 18, 198207.284)
;
And here is the expected result of the query I'm seeking:
RecordId RootId SubId Fuel Capacity
6111 PLT03-39 7 1 4399.67
6113 PLT03-40 7 1 14767.61
6110 PLT03-41 10 1 22850.96
6109 PLT03-42 7 1 9120.65
6112 PLT03-43 8 1 7711.44
381 PLT05-25 2 18 51854.88
7398 PLT06-143 3 18 4091.01
7397 PLT06-145 3 18 4091.01
13264 PLT06-342 6 1 30360
13271 PLT06-344 7 1 10120
22759 PLT06-87 6 6 143.451
24584 PLT07-234 4 18 2488.508
7664 PLT07-328 4 1 19607.5
5282 PLT07-39 5 18 50258.45
93 PLT08-25 2 18 56.925
24344 PLT08-437 5 18 4946.656
23533 PLT10-574 7 18 198207.284
Below is one solution to this problem. I used a CTE and a MAX aggregate to determine the latest RecordId for each RootId. After joining that back to the Plant table, I used an OUTER APPLY to retrieve the most recent capacity.
WITH LATEST AS
(
SELECT RootId, MAX(RecordId) AS RecordId
FROM Plant
GROUP BY RootId
)
SELECT
P.RecordId
, P.RootId
, P.SubID
, P.Fuel
, CAP.Capacity
FROM
LATEST AS L
JOIN Plant AS P
ON L.RecordId = P.RecordId
OUTER APPLY
(
SELECT TOP 1 Capacity
FROM Plant
WHERE RootId = P.RootId AND Capacity IS NOT NULL
ORDER BY SubID DESC
) AS CAP
ORDER BY
L.RootId
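The answer above is T-SQL; SQLite has no OUTER APPLY, but the same latest-row-plus-last-non-NULL idea can be expressed with correlated subqueries. A sketch on a subset of the sample rows (the first two plants only, to keep it short):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Plant (RecordId INTEGER PRIMARY KEY,
    RootId TEXT NOT NULL, SubID INTEGER NOT NULL,
    Fuel INTEGER NOT NULL, Capacity REAL)""")
con.executemany("INSERT INTO Plant VALUES (?,?,?,?,?)", [
    (451, 'PLT03-39', 3, 1, 4399.67),
    (471, 'PLT03-39', 4, 1, 4399.67),
    (1809, 'PLT03-39', 5, 1, 4399.67),
    (4888, 'PLT03-39', 6, 1, None),
    (6111, 'PLT03-39', 7, 1, None),
    (450, 'PLT03-40', 3, 1, 15531.67),
    (472, 'PLT03-40', 4, 1, None),
    (1810, 'PLT03-40', 5, 1, 14767.61),
    (4882, 'PLT03-40', 6, 1, None),
    (6113, 'PLT03-40', 7, 1, None),
])
rows = con.execute("""
    SELECT p.RecordId, p.RootId, p.SubID, p.Fuel,
           -- last non-NULL capacity for this root, by descending SubID
           (SELECT p2.Capacity FROM Plant p2
             WHERE p2.RootId = p.RootId AND p2.Capacity IS NOT NULL
             ORDER BY p2.SubID DESC LIMIT 1) AS Capacity
    FROM Plant p
    -- latest record per root, same MAX(RecordId) rule as the CTE above
    WHERE p.RecordId = (SELECT MAX(p3.RecordId) FROM Plant p3
                         WHERE p3.RootId = p.RootId)
    ORDER BY p.RootId
""").fetchall()
print(rows)  # [(6111, 'PLT03-39', 7, 1, 4399.67), (6113, 'PLT03-40', 7, 1, 14767.61)]
```

This reproduces the first two rows of the expected result table.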

Deleting Chained Duplicates

Let's say I have a list:
lits = [1, 1, 1, 2, 0, 0, 0, 0, 3, 3, 1, 4, 5, 2, 2, 2, 0, 0, 0]
and I need this to become [1, 1, 2, 0, 0, 3, 3, 1, 4, 5, 2, 2, 0, 0]
(delete duplicates, but only within a chain of duplicates). I'm going to do this on a huge HDF5 file, with pandas/numpy, and would rather not use a for loop iterating through all elements.
table = table.drop_duplicates(cols='[SPEED OVER GROUND [kts]]', take_last=True)
Is there a modification I can do to this code?
In pandas you can do a boolean mask, selecting a row only if it differs from either the preceding or succeeding value:
>>> df=pd.DataFrame({ 'lits':lits })
>>> df[ (df.lits != df.lits.shift(1)) | (df.lits != df.lits.shift(-1)) ]
lits
0 1
2 1
3 2
4 0
7 0
8 3
9 3
10 1
11 4
12 5
13 2
15 2
16 0
18 0
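Since the question mentions numpy as well, the same keep-first-and-last-of-each-run mask can be built without pandas; the shift comparisons become array slices (a sketch, not from the original thread):

```python
import numpy as np

lits = [1, 1, 1, 2, 0, 0, 0, 0, 3, 3, 1, 4, 5, 2, 2, 2, 0, 0, 0]
x = np.asarray(lits)

# Keep an element if it differs from its predecessor OR its successor,
# i.e. keep only the first and last element of every run of equal values.
differs_prev = np.r_[True, x[1:] != x[:-1]]   # first element is always kept
differs_next = np.r_[x[:-1] != x[1:], True]   # last element is always kept
result = x[differs_prev | differs_next]
print(result.tolist())  # [1, 1, 2, 0, 0, 3, 3, 1, 4, 5, 2, 2, 0, 0]
```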