Finding number of NaT - pandas

I'm trying to find the number of NaT (Not a Time) values in each group:
ID Date eVal
ddd 2014-02-12 2
ddd 2014-02-13 2
ddd NaT 2
aaa 2014-02-12 3
aaa 2014-02-13 3
aaa 2014-02-14 3
I basically need to add a new column that holds the number of NaT occurrences for that ID.
How do I find that number?

Something like this:
df['NaT'] = df['Date'].isnull() * 1
df.groupby('ID')['NaT'].sum()
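The groupby above returns one count per ID. Since the question asks for a new column, a minimal sketch of broadcasting that count back onto every row with transform might look like the following. The column name NaT_count is made up for illustration, and the frame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data from the question; None becomes NaT after to_datetime.
df = pd.DataFrame({
    "ID": ["ddd", "ddd", "ddd", "aaa", "aaa", "aaa"],
    "Date": pd.to_datetime([
        "2014-02-12", "2014-02-13", None,
        "2014-02-12", "2014-02-13", "2014-02-14",
    ]),
    "eVal": [2, 2, 2, 3, 3, 3],
})

# 1 where Date is NaT, 0 otherwise.
is_nat = df["Date"].isna().astype(int)

# One value per ID (same result as the groupby/sum shown above).
nat_per_id = is_nat.groupby(df["ID"]).sum()

# Same count repeated on every row of the group
# ("NaT_count" is a hypothetical column name).
df["NaT_count"] = is_nat.groupby(df["ID"]).transform("sum")

print(df)
```

transform keeps the original row index, so the result can be assigned straight back as a column.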

Related

How to merge two pandas dataframes/ series

I'm trying to merge multiple pandas data frames into one. I have 1 main frame with the locations of measurements. The other data frames contain multiple measurements for one location. Like below:
df 1: Location ID | X | Y | Z
1 | 1 | 2 | 3
2 | 3 | 2 | 1
n
df 2: Location ID | Date | Measurement
1 | January 1 12:30 | 1
1 | January 16 12:30 | 4
1 ...
df 2: Location ID | Date | Measurement
2 | January 1 12:30 | 3
2 | January 16 12:30 | 9
2 ...
df n: Location ID | Date | Measurement
n | January 1 12:30 | 4
n | January 16 12:30 | 6
n | January 20 11:30 | 7 ...
I'm trying to create a data frame like this:
df_final: Location ID | X | Y | Z | January 1 12:30 | January 16 12:30 | January 20 11:30 etc.
1 | 1 | 2 | 3 | 1 | 4 | NaN
2 | 3 | 2 | 1 | 3 | 9 | NaN
n | 2 | 5 | 7 | 4 | 6 | 7
The dates are already datetime objects and the Location ID is the index of both dataframes.
I tried the append, merge and concat functions, both on two frames directly and after converting a frame to a list with List = frame['measurements'] before adding it.
The problem is that either the rows are added below the first data frame, whereas the measured values should go into new columns on the existing row (for the respective Location ID), or the dates end up as new rows while new columns with location IDs are created.
I'm sorry my question layout is not so nice, but I'm new to this forum.
Found it myself.
I used frame.pivot to reshape df2-n and then used concat to add it to the locations df.
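A minimal sketch of that pivot-then-concat approach, assuming the measurement frames are indexed by Location ID as described in the question. The variable names and the year 2014 are made up for illustration, and only two locations are shown.

```python
import pandas as pd

# Locations frame, indexed by Location ID (values from the question).
locations = pd.DataFrame(
    {"X": [1, 3], "Y": [2, 2], "Z": [3, 1]},
    index=pd.Index([1, 2], name="Location ID"),
)

# One measurement frame per location, also indexed by Location ID.
# The year is an assumption; the question gives only day and time.
df_loc1 = pd.DataFrame(
    {"Date": pd.to_datetime(["2014-01-01 12:30", "2014-01-16 12:30"]),
     "Measurement": [1, 4]},
    index=pd.Index([1, 1], name="Location ID"),
)
df_loc2 = pd.DataFrame(
    {"Date": pd.to_datetime(["2014-01-01 12:30", "2014-01-16 12:30"]),
     "Measurement": [3, 9]},
    index=pd.Index([2, 2], name="Location ID"),
)

# Stack the long frames, then pivot: one row per location, one column per date.
long_frame = pd.concat([df_loc1, df_loc2]).reset_index()
wide = long_frame.pivot(index="Location ID", columns="Date", values="Measurement")

# Join the date columns onto the locations frame by index.
df_final = pd.concat([locations, wide], axis=1)
print(df_final)
```

pivot raises an error if a location has two measurements with the same timestamp; pivot_table with an aggregation function is the usual alternative in that case.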

Mariadb Building the best INDEX for a given SELECT - GROUP BY

I do not have much knowledge of databases.
To learn, I am reading MariaDB's documentation on indexes.
But there are parts that I do not understand.
Document
Algorithm, step 2b (GROUP BY)
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
My understanding is that what matters is that aaa and bbb come first in the index, regardless of their order in the WHERE clause. The index is therefore used for the aaa and bbb conditions in the WHERE clause, and ccc is sorted within the matched aaa and bbb values.
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
Does (no WHERE) mean the query has no WHERE clause?
What if I use it like this?
WHERE x > 1 GROUP BY x, y
My thinking:
(1) from table
(2) where x > 1 -> using index
(3) group by x, y -> using index..? because (2) already sorted..? or sort again?
(4) having -> if i did not enter this keyword, is it not used?
(5) select -> print data(?)
(6) order by -> group by already order by(?)
Algorithm, step 2b (GROUP BY)
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
Suppose there is a table like this, stored in INDEX(aaa, bbb, ccc) order:
aaa | bbb | ccc
------------------
123 | 1 | 30
------------------
123 | 1 | 48
------------------
123 | 2 | 27
------------------
125 | 1 | 11
------------------
125 | 3 | 29
------------------
125 | 3 | 40
------------------
The result of the WHERE aaa = 123 AND bbb = 1 clause is:
aaa | bbb | ccc
------------------
123 | 1 | 30
------------------
123 | 1 | 48
Look at the ccc column.
Within the rows that share the same aaa and bbb values, ccc is already sorted.
So the GROUP BY can be done quickly, because the ccc values arrive in order.
**CAUTION**
Now consider the clause WHERE aaa >= 123 AND bbb = 1 GROUP BY ccc.
aaa | bbb | ccc
------------------
123 | 1 | 30
------------------
123 | 1 | 48
------------------
125 | 1 | 11
------------------
Here the ccc column is no longer sorted.
The ordering of ccc is meaningful only among rows where aaa and bbb have the same values.
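To make the point concrete, here is a small Python sketch that models INDEX(aaa, bbb, ccc) as a sorted list of tuples built from the sample table above. It is only an illustration of the ordering argument, not how MariaDB works internally; the helper name ccc_values is made up.

```python
# Rows of the sample table as (aaa, bbb, ccc) tuples.
rows = [
    (123, 1, 30), (123, 1, 48), (123, 2, 27),
    (125, 1, 11), (125, 3, 29), (125, 3, 40),
]

# A composite index keeps entries sorted by aaa, then bbb, then ccc.
index = sorted(rows)

def ccc_values(predicate):
    """Walk the index in order and collect ccc for the matching rows."""
    return [ccc for (aaa, bbb, ccc) in index if predicate(aaa, bbb)]

# Equality on both leading columns: ccc comes out already sorted.
equality = ccc_values(lambda aaa, bbb: aaa == 123 and bbb == 1)

# Range on aaa: ccc values from different aaa values interleave.
range_scan = ccc_values(lambda aaa, bbb: aaa >= 123 and bbb == 1)

print(equality, equality == sorted(equality))        # [30, 48] True
print(range_scan, range_scan == sorted(range_scan))  # [30, 48, 11] False
```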
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
The same reasoning applies here.
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
should probably say "(if there is no WHERE)". If there is a WHERE, then that index may or may not be useful. You should (usually) build the INDEX based on the WHERE, and only if you get past it, consider the GROUP BY.
WHERE x > 1 GROUP BY x, y
OK, that can use INDEX(x,y), in that order. First, it will filter, and that leaves the rest of the index still in a good order for the grouping. Similarly:
WHERE x > 1 ORDER BY x, y
WHERE x > 1 GROUP BY x, y ORDER BY x, y
No sorting should be necessary.
So, here are the steps I might take:
1. WHERE x > 1 ... --> INDEX(x) (or any index _starting_ with `x`)
2. ... GROUP BY x, y --> INDEX(x,y)
3. recheck that I did not mess up the WHERE.
This has no really good index:
WHERE x > 1 AND y = 4 GROUP BY x,y
1. WHERE x > 1 AND y = 4 ... --> INDEX(y,x) in this order!
2. ... GROUP BY x,y --> can use that index
However, flipping to GROUP BY y,x has the same effect (ignoring the order of display).
(4) having -> if i did not enter this keyword, is it not used?
HAVING, if present, is applied after things for which INDEXes are useful. Having no HAVING does mean there is no HAVING.
(6) order by -> group by already order by(?)
That has become a tricky question. Until very recently (MySQL 8.0; don't know when or if MariaDB changed), GROUP BY implied the equivalent ORDER BY. That was non-standard and potentially interfered with optimization. With 8.0, GROUP BY does not imply any order; you must explicitly request the order (if you care).
(I updated the source document in response to this discussion.)

Remove group rows

How do I remove group rows based on other columns for a particular ID such that:
ID Att Comp Att. Inc. Att
aaa 2 0 2
aaa 2 0 2
bbb 3 1 2
bbb 3 1 2
bbb 3 0 2
becomes:
ID Att Comp Att. Inc. Att
aaa 2 0 2
bbb 3 1 2
I need to discard rows that are not just exact duplicates, but that also convey the same data based on the other columns.
Use drop_duplicates -- check out the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
I can't tell for sure from your description which columns you want to consider when looking for duplicates, but you can tell drop_duplicates which column(s) to look at.
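A minimal sketch of drop_duplicates with a column subset follows. The header is read as four columns (ID, Att, Comp Att., Inc. Att), and the choice of which columns define a duplicate is an assumption picked so that the output matches the desired result in the question.

```python
import pandas as pd

# Sample data from the question, reading the header as four columns.
df = pd.DataFrame({
    "ID": ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "Att": [2, 2, 3, 3, 3],
    "Comp Att.": [0, 0, 1, 1, 0],
    "Inc. Att": [2, 2, 2, 2, 2],
})

# Treat rows as duplicates when these columns match (assumed subset);
# keep="first" retains the first row of each duplicate group.
deduped = df.drop_duplicates(subset=["ID", "Att", "Inc. Att"], keep="first")
print(deduped)
```

With subset omitted, drop_duplicates compares all columns, which would keep the third bbb row because its Comp Att. value differs.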

Summing numbers in a file

I have a file which looks like this:
aaa 15
aaa 12
bbb 131
bbb 12
ccc 123
ddddd 1
ddddd 2
ddddd 3
I would like to get a sum for each unique element on the left side, and also a count of how many values were summed for each element, like this:
aaa 27 - 2
bbb 143 - 2
ccc 123 - 1
ddddd 6 - 3
How would I accomplish this with AWK or something similar?
You can do it in awk by collecting the sums into two arrays, using column 1 as the key to both arrays (then pipe to sort if desired):
awk '{sums[$1] += $2; counts[$1] += 1}
END {for (key in sums) {print key, sums[key], "-", counts[key]}}' file | sort
Output:
aaa 27 - 2
bbb 143 - 2
ccc 123 - 1
ddddd 6 - 3
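For the "or something similar" part of the question, the same two-array idea can be sketched in Python with two dictionaries keyed on the first column. The function name summarize is made up for illustration, and lines that do not have exactly two fields are skipped as an assumption about the input.

```python
from collections import defaultdict

def summarize(lines):
    """Return 'key sum - count' strings, sorted by key."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines (assumption)
        key, value = parts
        sums[key] += int(value)
        counts[key] += 1
    return [f"{key} {sums[key]} - {counts[key]}" for key in sorted(sums)]

sample = ["aaa 15", "aaa 12", "bbb 131", "bbb 12",
          "ccc 123", "ddddd 1", "ddddd 2", "ddddd 3"]
for row in summarize(sample):
    print(row)
```

To run it against a file, pass the open file object instead of the sample list, since iterating a file yields its lines.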

MDX query to get employees under given supervisor with parent child relationship

I have an employee dimension in my cube where each employee has a supervisor who is also an employee. A sample data set:
Employee ID | Supervisor ID | Name
1 0 ABC
2 1 AAA
3 1 BBB
4 2 CCC
5 2 DDD
6 4 EEE
7 3 FFF
I want to get all the employees under a given supervisor. E.g. if the supervisor is 2, then the result should be
CCC
DDD
EEE
Using the query below, I can get all the employees:
SELECT {AddCalculatedMembers({[Employee].[EmployeeName].Children})} ON COLUMNS FROM [MY_CUBE]
I am new to MDX, so please tell me how to write an MDX query for the above requirement.
@mmarie
I already have a cube. But not sure whether I implemented it correctly. My schema is as below,
The dimension "dimEmployee" has columns "EmployeeID, EmployeeName, Dept".
I have also used a bridge table, "BridgeEmployee", which has the columns "ParentEmployeeID, ChildEmployeeID, Distance".
Sample data in the bridge table:
ParentEmployeeID | ChildEmployeeID | Distance
1 1 0
2 2 0
1 2 1
3 3 0
1 3 1
4 4 0
2 4 1
1 4 2
I am using SSAS and I have implemented the bridge table as Measure Group.