Pandas: Subset of subset with multiple conditions - pandas

I need to grab a subset of the following using multiple conditions:
Event Type must contain the string 'Outreach'
AND any other field can contain the string 'STEM' - case insensitive.
Data Sample:
Title Event Type Presenter Description Tags
STEM event STEM Gloria Bubbles Craft
Robots Outreach STEM - John EV3 Bots
School STEM Outreach Billy Robots Craft
Code:
cond = df['Event Type'].str.contains('Outreach')
stemA = df[cond]
This gets me all the outreach events.
cond = df['Event Type'].str.contains('Outreach') & (df['Presenter'].str.contains('STEM') | df['Tags'].str.contains('STEM') | df['Description'].str.contains('STEM') | df['Title'].str.contains('STEM'))
stem[cond]
I was hoping for a grep-like solution. The above gets me less than grep does on the command line and I know this result is wrong from looking at the data.

IIUC, this should work for you
cols_to_include = df.columns[df.columns != 'Event Type']
a = df[cols_to_include].astype(str).sum(axis=1)
df[df['Event Type'].str.contains('Outreach') & a.str.contains('STEM', case=False)]
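As a hedged alternative sketch, the same filter can be written by testing each column on its own, which avoids accidental matches that span two concatenated cells and tolerates missing values. The sample frame below is a made-up reconstruction of the question's data (the column split is an assumption), and filter_outreach_stem is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample data; the exact column split of the original rows is an assumption.
df = pd.DataFrame({
    "Title": ["STEM event", "Robots", "School STEM", "Art day"],
    "Event Type": ["STEM", "Outreach", "Outreach", "Outreach"],
    "Presenter": ["Gloria", "stem - John", "Billy", None],
    "Description": ["Bubbles", "EV3", "Robots", "Painting"],
    "Tags": ["Craft", "Bots", "Craft", "Craft"],
})


def filter_outreach_stem(frame):
    """Keep rows whose Event Type contains 'Outreach' and where any
    other column contains 'stem' (case-insensitive)."""
    other_cols = frame.columns.drop("Event Type")
    is_outreach = frame["Event Type"].str.contains("Outreach", na=False)
    # Cast every cell to str so missing values cannot turn the mask into NaN.
    has_stem = frame[other_cols].apply(
        lambda col: col.astype(str).str.contains("stem", case=False, regex=False)
    ).any(axis=1)
    return frame[is_outreach & has_stem]


result = filter_outreach_stem(df)
print(result)
```

Here the "Robots" row matches through its Presenter cell and the "School STEM" row through its Title, while "Art day" is dropped because no other column mentions STEM.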

Related

How to filter a date-field with a swift vapor-fluent query

To avoid multiple inserts of the same person in a database, I wrote the following function:
func anzahlDoubletten(_ req: Request, nname: String, vname: String, gebTag: Date)
    async throws -> Int {
    try await
        Teilnehmer.query(on: req.db)
            .filter(\.$nname == nname)
            .filter(\.$vname == vname)
            .filter(\.$gebTag == gebTag)
            .count()
}
The function always returns 0, even if there are multiple records with the same surname, first name and birthday in the database.
Here is the resulting sql-query:
[ DEBUG ] SELECT COUNT("teilnehmer"."id") AS "aggregate" FROM "teilnehmer" WHERE "teilnehmer"."nname" = $1 AND "teilnehmer"."vname" = $2 AND "teilnehmer"."geburtstag" = $3 ["neumann", "alfred e.", 1999-09-09 00:00:00 +0000] [database-id: psql, request-id: 1AC70C41-EADE-43C2-A12A-99C19462EDE3] (FluentPostgresDriver/FluentPostgresDatabase.swift:29)
[ INFO ] anzahlDoubletten=0 [request-id: 1AC70C41-EADE-43C2-A12A-99C19462EDE3] (App/Controllers/TeilnehmerController.swift:49)
If I query the database directly, I get:
lwm=# select nname, vname, geburtstag from teilnehmer;
nname | vname | geburtstag
---------+-----------+------------
neumann | alfred e. | 1999-09-09
neumann | alfred e. | 1999-09-09
neumann | alfred e. | 1999-09-09
neumann | alfred e. | 1999-09-09
so count() should return 4 not 0:
lwm=# select count(*) from teilnehmer where nname = 'neumann' and vname = 'alfred e.' and geburtstag = '1999-09-09';
count
-------
4
My DateFormatter is defined like so:
let dateFormatter = ISO8601DateFormatter()
dateFormatter.formatOptions = [.withFullDate, .withDashSeparatorInDate]
And finally the attribute "birthday" in my model:
...
@Field(key: "geburtstag")
var gebTag: Date
...
I inserted the 4 alfreds in my database using the model and fluent, passing the birthday "1999-09-09" as a String and fluent inserted all records correctly.
But .filter(\.$gebTag == gebTag) seems to always evaluate to false.
Is it at all possible to use .filter() with data types other than String?
And if so, what am I doing wrong?
Many thanks for your help
Michael
The problem you've hit is that you're storing only dates whereas you're filtering on dates with times. Unfortunately there's no native way to store just a date. However there are a few options.
The easiest way is to change the date field to a String and then use your date formatter (make sure you remove the time part) to convert the query option to a String.
I am guessing slightly here, but I suspect that your table was not created by a Migration? If it had been, your geburtstag field would include a time component as this is the default and you would have spotted the problem quickly.
In any event, the filter is actually filtering on the time component of gebTag as well as the date. This is why it is returning zero.
I suggest converting the geburtstag to a type that includes the time and ensuring that the time component is set to 0:00:00 when you store it. You can reset the time component to 'midnight' using something like this:
extension Date {
var midnight: Date { return Calendar.current.date(bySettingHour: 0, minute: 0, second: 0, of: self)! }
}
Then change your filter to:
.filter(\.$gebTag == gebTag.midnight)
Alternatively, just use the startOfDay(for:) method on a Calendar instance:
.filter(\.$gebTag == Calendar.current.startOfDay(for: gebTag))
I think this is the most straightforward way of doing it.
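To illustrate the pitfall outside of Swift, here is a minimal Python sketch of the same idea: two timestamps on the same calendar day compare unequal until the time component is reset. The helper name start_of_day_utc and the example values are assumptions for illustration only, and the sketch assumes everything is handled in UTC.

```python
from datetime import datetime, timezone


def start_of_day_utc(moment):
    """Return midnight (UTC) of the day that `moment` falls on in UTC."""
    utc = moment.astimezone(timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)


# What a date-only value looks like once it is compared as a timestamp.
stored = datetime(1999, 9, 9, tzinfo=timezone.utc)
# Same birthday, but the value carries a time of day.
incoming = datetime(1999, 9, 9, 14, 30, tzinfo=timezone.utc)

naive_match = incoming == stored
normalized_match = start_of_day_utc(incoming) == stored
print(naive_match, normalized_match)
```

The comparison only succeeds after normalizing, which is what the midnight extension and startOfDay(for:) do on the Swift side.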

Process fields with nested arrays into strings with strcat_array for output in Kusto

I would like to process Azure AD audit Logs into HTML tables/csv files. The data contains nested sets of arrays that I would like to summarise into a comma separated string.
For example, data that looks like this:
{
"TargetResources": [{"displayName": "Policy",
"modifiedProperties": [{"displayname": "PolicySetting1"},
{"displayname": "PolicySetting2"}]
}]
}
would be processed into:
TargetResource | Policy
modifiedProps | PolicySetting1, PolicySetting2
mv-expand doesn't seem to work, because rows that have no modifiedProperties get eliminated.
The only solution I have been able to find that gets close to what I am trying to do looks like this:
AuditLogs
| extend TargetResource = tostring(TargetResources[0].displayName)
| extend ModifiedProperty0 = tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[0].displayName)
| extend ModifiedProperty1 = tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[1].displayName)
| extend ModifiedProperty2 = tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[2].displayName)
| extend ModifiedProperties = strcat(ModifiedProperty0,", ",ModifiedProperty1,", ",ModifiedProperty2)
This solution is limited: it only works properly for exactly 3 modifiedProperty values, not for an arbitrary number. For my purposes, the solution needs to work when modifiedProperties does not exist and when it holds anywhere from 0 to 15 values.
Thank you for any help you can provide
If I understood your description correctly, you could use mv-apply (twice) to achieve that:
datatable(d: dynamic)
[
dynamic({"TargetResources":[{"displayName": "Policy0","someOtherProperty":"hello world"}]}),
dynamic({"TargetResources":[{"displayName": "Policy1","modifiedProperties":[{"displayname":"PolicySetting1"},{"displayname":"PolicySetting2"}]}]}),
dynamic({"TargetResources":[{"displayName": "Policy2","modifiedProperties":[{"displayname":"PolicySetting3"},{"displayname":"PolicySetting4"}]}, {"displayName":"Policy3","modifiedProperties":[{"displayname":"PolicySetting5"},{"displayname":"PolicySetting6"}]}]}),
]
| mv-apply tr = d.TargetResources on (
    extend TargetResource = tr.displayName
    | mv-apply mp = tr.modifiedProperties on (
        extend propertyName = mp.displayname
        | summarize modifiedProps = strcat_array(make_set(propertyName), ", ")
    )
)
| project TargetResource, modifiedProps
TargetResource | modifiedProps
Policy0        |
Policy1        | PolicySetting1, PolicySetting2
Policy2        | PolicySetting3, PolicySetting4
Policy3        | PolicySetting5, PolicySetting6
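If the logs are exported as JSON and post-processed outside Kusto, the same flattening could be sketched in Python roughly as follows. The function name summarize_targets is hypothetical, and the key names (displayName, modifiedProperties, displayname) are taken from the sample data above rather than from the real AuditLogs schema.

```python
def summarize_targets(record):
    """Return one (target name, comma-separated property names) pair
    per entry in TargetResources, tolerating missing modifiedProperties."""
    rows = []
    for target in record.get("TargetResources", []):
        names = []
        for prop in target.get("modifiedProperties") or []:
            name = prop.get("displayname")
            # Skip missing names and duplicates, like make_set does.
            if name is not None and name not in names:
                names.append(name)
        rows.append((target.get("displayName"), ", ".join(names)))
    return rows


sample = {
    "TargetResources": [
        {"displayName": "Policy0", "someOtherProperty": "hello world"},
        {"displayName": "Policy1",
         "modifiedProperties": [{"displayname": "PolicySetting1"},
                                {"displayname": "PolicySetting2"}]},
    ]
}

rows = summarize_targets(sample)
for target_name, props in rows:
    print(target_name, "|", props)
```

A target without modifiedProperties still produces a row, just with an empty property string.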

How to track SLA of VM availability set (or availability zone) through heartbeats with Log Analytics (KQL)

I want to track the SLAs of our VMs in a Monitor Workbook using a Log Analytics query.
For this, I use the 'Heartbeat' table, which gives the heartbeats of each VM.
However, some of our VMs are in an availability set/zone and as such, the SLA is only broken if, in an interval of 1 minute, both heartbeats are missing.
As such I need to be able to group the heartbeats by availability set/zone in the query, but there doesn't seem to be such a property on the heartbeat.
I can use a separate Azure Resource Graph query to search for which VMs are in an availability set/zone, but when I merge this query with my Log Analytics query, I can't do any further Kusto Query Language processing on the query (I can only merge the tables).
For information, these are my Log Analytics Heartbeat query and my Resource Graph SLA query:
let timeRangeStart = {TimeRange:start};
let timeRangeEnd = {TimeRange:end};
Heartbeat
| where ResourceType == "virtualMachines"
| extend ResourceGroup = case(ResourceGroup <> "", ResourceGroup, "On-Prem")
| where TimeGenerated > timeRangeStart and TimeGenerated < timeRangeEnd and Computer in ({Servers})
| extend Resource=tolower(iff(isempty(_ResourceId), Resource, _ResourceId))
| summarize heartbeat_tot = count() by Resource,ResourceGroup, SubscriptionId
| extend total_number_of_buckets=round((timeRangeEnd-timeRangeStart)/1m)
| extend availability_rate=round(heartbeat_tot*100/total_number_of_buckets,2)
| extend availability_rate = min_of(availability_rate, 100)
| order by availability_rate asc
Resources // VMs
| where type == 'microsoft.compute/virtualmachines'
| extend AvSet = properties.availabilitySet.id
| extend AvZone = properties.availabilityZone.id
| extend VMname_SLA = iff(isnotempty(AvZone), AvZone, iff(isnotempty(AvSet), AvSet, id))
| extend SLA_VM = iff(isnotnull(AvZone), '99.99%', iff(isnotnull(AvSet), '99.95%', ''))
| extend managedBy = tolower(id)
| join kind = leftouter (
Resources // Disks
| where type == 'microsoft.compute/disks'
| where isnotempty(managedBy)
| extend managedBy = tolower(managedBy)
// What do Standard HDD disks have as SKU tag??? I used StandardHDD for the time being
| extend Tier_disk = sku.tier
| extend SLA_disk = iff(Tier_disk == 'StandardHDD', '95%', iff(Tier_disk == 'Standard', '99.5%', '99.9%'))
) on managedBy
| extend SLA_tot = iff(isnotempty(SLA_VM), SLA_VM, SLA_disk)
| project managedBy, VMname_SLA, SLA_tot
| order by managedBy asc
How many resources are there?
If it is not a large number of resources, a workaround would be:
Run your ARG query in a text parameter, and format the results of the query to effectively generate a JSON array of objects with the id, location, etc. that you need. Then mark this parameter as hidden.
In your Logs query, reference that parameter's JSON text before the query, and use KQL operators to turn that JSON structure into a table. Then you can join/filter on that table in the query.
It isn't optimal, and won't work well if there are large numbers of resources, since every time you run your query you're effectively "uploading" a JSON blob and then immediately parsing it apart again.
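Once the VM-to-group mapping is available next to the heartbeats, the grouping logic itself is small. Here is a rough Python sketch of it, assuming heartbeats have already been bucketed into 1-minute indexes; group_availability, the VM ids and the group ids are all hypothetical names, not part of the Heartbeat table or Resource Graph.

```python
from collections import defaultdict


def group_availability(heartbeats, vm_to_group, total_minutes):
    """Percentage of 1-minute buckets in which at least one VM of each
    group sent a heartbeat.

    heartbeats:  iterable of (vm_id, minute_index) pairs
    vm_to_group: dict mapping vm_id to its availability set/zone id
    """
    covered = defaultdict(set)
    for vm_id, minute in heartbeats:
        # A VM outside any set/zone forms its own group.
        group = vm_to_group.get(vm_id, vm_id)
        covered[group].add(minute)
    return {
        group: round(100.0 * len(minutes) / total_minutes, 2)
        for group, minutes in covered.items()
    }


heartbeats = [("vm-a", 0), ("vm-a", 1), ("vm-b", 1), ("vm-b", 2), ("vm-c", 0)]
vm_to_group = {"vm-a": "avset-1", "vm-b": "avset-1"}
rates = group_availability(heartbeats, vm_to_group, total_minutes=4)
print(rates)
```

A minute counts as available for a set as soon as any member VM reported in it, which matches the rule that the SLA is only broken when all heartbeats in the set are missing.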

Searching on pubmed using biopython

I am trying to input over 200 entries into pubmed in order to record the number of articles published by an author and to refine the search by including his/her mentor and institution. I have tried to do this using biopython and xlrd (the code is below), but I am consistently getting 0 results for all three formats of inquiries (1. by name, 2. by name and institution name, and 3. by name and mentor's name). Are there steps of troubleshooting that I can do, or should I use a different format when using the keywords indicated below to search on pubmed?
Example output of the input queries; search_terms is a list of lists holding the input queries.
print(*search_terms[8:15], sep='\n')
[text:'Andrew Bland', 'Weill Cornell Medical College', text:'David Cutler MD']
[text:'Andy Price', 'University of Alabama at Birmingham School of Medicine', text:'Jason Warem, PhD']
[text:'Bah Chamin', 'University of Texas Southwestern Medical School', text:'Dr. Timothy Hillar']
[text:'Eduo Cera', 'University of Colorado School of Medicine', text:'Dr. Tim']
Code used to generate the input queries above and to search on Pubmed:
Entrez.email = "mollyzhaoe#college.harvard.edu"
for search_term in search_terms[8:55]:
handle = Entrez.egquery(term="{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
handle_1 = Entrez.egquery(term = "{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[2]))
handle_2 = Entrez.egquery(term = "{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[1]))
record = Entrez.read(handle)
record_1 = Entrez.read(handle_1)
record_2 = Entrez.read(handle_2)
pubmed_count = ['','','']
for row in record["eGQueryResult"]:
if row["DbName"] == "pubmed":
pubmed_count[0] = row["Count"]
for row in record_1["eGQueryResult"]:
if row["DbName"] == "pubmed":
pubmed_count[1] = row["Count"]
for row in record_2["eGQueryResult"]:
if row["DbName"] == "pubmed":
pubmed_count[2] = row["Count"]
Check your indentation, it is difficult to know which part belongs to which loop.
If you want to troubleshoot, try printing your egquery, e.g.
print("{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
and paste the output to pubmed and see what you get. Perhaps modify it a bit and see which search term causes the problems.
Your input format is a little bit hard to guess. Print the query and make sure you are getting the right search values.
For the author names, try to get rid of the academic titles. PubMed might confuse them with initials, e.g. "House MD" might be read as Mark David House.
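A small sketch of both suggestions, stripping titles and printing the query before sending it, might look like this. The helper names clean_name and build_query and the list of titles are assumptions; the query text itself follows the format used in the question, and nothing here calls Entrez.

```python
import re

# Titles to strip; extend this list as needed (an assumption, not exhaustive).
TITLE_PATTERN = re.compile(r"\b(?:Dr|MD|PhD|Prof)\b\.?,?", flags=re.IGNORECASE)


def clean_name(raw):
    """Remove academic titles and stray commas from a name."""
    without_titles = TITLE_PATTERN.sub(" ", raw)
    return " ".join(without_titles.replace(",", " ").split())


def build_query(author, extra=None):
    """Build the search string in the same shape as the question's code."""
    query = ("{0} AND ((2010[Date - Publication] : "
             "2017[Date - Publication]))").format(clean_name(author))
    if extra:
        query += " AND {0}".format(clean_name(extra))
    return query


# Print the query first and paste it into PubMed to check it by hand.
print(build_query("Andy Price", "Jason Warem, PhD"))
```

Note that a blunt pattern like this would also strip a real given name such as "Md", so the cleaned names are worth checking by eye.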

How to store smart-list rules in a relational database

The system I'm building has smart groups. By smart groups, I mean groups that update automatically based on these rules:
Include all people that are associated with a given client.
Include all people that are associated with a given client and have these occupations.
Include a specific person (i.e., by ID)
Each smart group can combine any number of these rules. So, for example, a specific smart group might have these specific rules:
Include all people that are associated with client 1
Include all people that are associated with client 5
Include person 6
Include all people associated with client 10, and who have occupations 2, 6, and 9
These rules are OR'ed together to form the group. I'm trying to think about how to best store this in the database given that, in addition to supporting these rules, I'd like to be able to add other rules in the future without too much pain.
The solution I have in mind is to have a separate model for each rule type. The model would have a method on it that returns a queryset that can be combined with other rules' querysets to, ultimately, come up with a list of people. The one downside of this that I can see is that each rule would have its own database table. Should I be concerned about this? Is there, perhaps, a better way to store this information?
Why not use Q objects?
rule1 = Q(client = 1)
rule2 = Q(client = 5)
rule3 = Q(id = 6)
rule4 = Q(client = 10) & (Q(occupation = 2) | Q(occupation = 6) | Q(occupation = 9))
people = Person.objects.filter(rule1 | rule2 | rule3 | rule4)
and then store their pickled strings into the database.
rule = rule1 | rule2 | rule3 | rule4
pickled_rule_string = pickle.dumps(rule)
Rule.objects.create(pickled_rule_string=pickled_rule_string)
Here are the models we implemented to deal with this scenario.
class ConsortiumRule(OrganizationModel):
    BY_EMPLOYEE = 1
    BY_CLIENT = 2
    BY_OCCUPATION = 3
    BY_CLASSIFICATION = 4
    TYPES = (
        (BY_EMPLOYEE, 'Include a specific employee'),
        (BY_CLIENT, 'Include all employees of a specific client'),
        (BY_OCCUPATION, 'Include all employees of a specified client ' + \
            'that have the specified occupation'),
        (BY_CLASSIFICATION, 'Include all employees of a specified client ' + \
            'that have the specified classifications'))
    consortium = models.ForeignKey(Consortium, related_name='rules')
    type = models.PositiveIntegerField(choices=TYPES, default=BY_CLIENT)
    negate_rule = models.BooleanField(default=False,
        help_text='Exclude people who match this rule')

class ConsortiumRuleParameter(OrganizationModel):
    """ example usage: two of these objects one with "occupation=5" one
    with "occupation=6" - both FK linked to a single Rule
    """
    rule = models.ForeignKey(ConsortiumRule, related_name='parameters')
    key = models.CharField(max_length=100, blank=False)
    value = models.CharField(max_length=100, blank=False)
At first I was resistant to this solution as I didn't like the idea of storing references to other objects in a CharField (CharField was selected, because it is the most versatile. Later on, we might have a rule that matches any person whose first name starts with 'Jo'). However, I think this is the best solution for storing this kind of mapping in a relational database. One reason this is a good approach is that it's relatively easy to clean hanging references. For example, if a company is deleted, we only have to do:
ConsortiumRuleParameter.objects.filter(key='company', value=str(pk)).delete()
If the parameters were stored as serialized objects (e.g., Q objects as suggested in a comment), this would be a lot more difficult and time consuming.
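To show how such key/value parameters could be evaluated, here is a plain-Python sketch without Django. The semantics are an assumption on my part: values sharing a key are OR'ed, different keys within one rule are AND'ed, and rules are OR'ed together. The names rule_matches and group_members and the person dicts are hypothetical, and negate_rule is left out.

```python
def rule_matches(params, person):
    """params: list of (key, value) string pairs belonging to one rule."""
    by_key = {}
    for key, value in params:
        by_key.setdefault(key, set()).add(value)
    # A rule without parameters matches nobody.
    return bool(by_key) and all(
        str(person.get(key)) in values for key, values in by_key.items()
    )


def group_members(rules, people):
    """Rules are OR'ed together: one matching rule is enough."""
    return [p for p in people if any(rule_matches(r, p) for r in rules)]


rules = [
    [("client", "1")],
    [("client", "5")],
    [("id", "6")],
    [("client", "10"), ("occupation", "2"),
     ("occupation", "6"), ("occupation", "9")],
]
people = [
    {"id": 1, "client": 1, "occupation": 3},
    {"id": 6, "client": 7, "occupation": 1},
    {"id": 8, "client": 10, "occupation": 6},
    {"id": 9, "client": 10, "occupation": 4},
    {"id": 10, "client": 5, "occupation": 2},
]

member_ids = [p["id"] for p in group_members(rules, people)]
print(member_ids)
```

Values are compared as strings because the parameters are stored in a CharField; in a real implementation each rule type would more likely be translated into a queryset filter.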