I have a dataset like this and I need to extract its conference calls using pandas.
This would have been very easy in SQL by creating two tables.
I created two tables:
conference1 = conference
My approach:
conference.loc[
    (conference['Call date'] < conference1['Call date'])
    & (conference['Cell Id'] == conference1['Cell Id'])
    & (conference['Called (B) Party Number'] != conference1['Called (B) Party Number'])
]
So I tried roughly the same query approach here in Python, but it returns no rows.
Now, to make it clear, a conference call would be:
1. conference['Call date'] < conference1['Call date'], where the user is the same (i.e. the cell ids will be the same)
2. conference['Cell Id'] != conference1['Cell Id']
3. Also, the persons called by the same person should be different, therefore
   conference['Called (B) Party Number'] != conference1['Called (B) Party Number']
The output should look like this file. In this file:
The call date in the 2nd row is greater (meaning that user must have been added to the conference a little later).
The called (B) party numbers are different (meaning the users are different).
The end date doesn't count for much in the analysis, since a participant might end the call at the same time as the others or leave at any point.
Can somebody help me out with this? A reference link or an idea would also work
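In case it helps, here is a minimal sketch of the self-join idea in pandas, assuming 'Call date' is already a datetime column and using the column names from the question. Condition 2 above conflicts with condition 1, so this follows the attempted .loc query (same Cell Id); adjust the rule if your data needs something else.

# conference is assumed to be loaded already, with 'Call date' parsed as datetime
pairs = conference.merge(conference, on='Cell Id', suffixes=('', '_other'))  # self-join on the same user

# keep pairs where the second leg starts later and goes to a different called party
conference_calls = pairs[
    (pairs['Call date'] < pairs['Call date_other'])
    & (pairs['Called (B) Party Number'] != pairs['Called (B) Party Number_other'])
]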
Related
I want to create a new DataFrame from another for rows that meet a condition such as:
uk_cities_df['location'] = cities_df['El Tarter'].where(cities_df['AD'] == 'GB')
uk_cities_df[:5]
but the resulting uk_cities_df is returning NaN.
The CSV file that I am extracting from has no headers, so pandas used the first row's values as column names. I need uk_cities_df to include only rows with the ISO code "GB"; "El Tarter" denotes the values for location and "AD" the ISO code.
Could you please provide a visual of what uk_cities_df and cities_df look like?
From what I can gather, I think you might be looking for the .loc operator.
You could try, for example:
uk_cities_df['location'] = cities_df.loc[cities_df['AD'] == 'GB']['location']
Also, I did not really get what role 'El Tarter' plays here; maybe you could give more details?
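If the real problem is the missing header row, here is a rough sketch of what I mean; for illustration it assumes a two-column file, and the file name and column names are placeholders, not your real ones:

import pandas as pd

# header=None stops pandas from using the first data row ('El Tarter', 'AD', ...) as column names;
# 'cities.csv', 'location' and 'iso_code' are placeholder names
cities_df = pd.read_csv('cities.csv', header=None, names=['location', 'iso_code'])

# keep only the rows whose ISO code is 'GB'
uk_cities_df = cities_df.loc[cities_df['iso_code'] == 'GB', ['location']]
print(uk_cities_df[:5])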
I have an algorithm that converts the writeback of a frontend app into a cleaned dataset.
In the frontend the user can either add a new record or modify/delete an existing one. The modification and deletion are performed by tracking the key of the original row and creating a new one with the new status.
Here is an example of the writeback of the frontend app
key                      | date          | status   | source_key
-------------------------|---------------|----------|-------------------------
10277_left_1605483676378 | 1605483676378 | created  | null
10277_left_1605559717253 | 1605559717253 | modified | 10277_left_1605483676378
10277_left_1627550679123 | 1627550679123 | deleted  | 10277_left_1605559717253
10277_left_1605560105840 | 1605560105840 | modified | 10277_left_1605483676378
10277_left_1605560105900 | 1605560105900 | modified | 10277_left_1605560105840
and here is the result after applying the algorithm that creates the cleaned dataset
key                      | date          | status
-------------------------|---------------|---------
10277_left_1605560105900 | 1605560105900 | modified
As you can see we branched from the first version of the data (1605483676378), created two modified versions and deleted one of those, before making a final modification on the remaining one, so the resulting data only contains one row.
              ┌──────► 1605559717253 ──────► 1627550679123 ─────► no output row
created       │        modified              deleted
1605483676378 │
              │                              ┌───────────────┐
              └──────► 1605560105840 ──────► │ 1605560105900 │ ─────► row visible in
                       modified              │ modified      │        cleaned dataset
                                             └───────────────┘
This works because every update is treated individually. However, I would like to be able to inspect the origin of a given record. That is, I want to know the original date when the record was created, something like this:
key                      | date          | status   | date_added
-------------------------|---------------|----------|--------------
10277_left_1605560105900 | 1605560105900 | modified | 1605483676378
I'm thinking about how to do this. I would like to avoid looping through the entire history of a record, as that would be inefficient.
As the algorithm currently runs in PySpark I would like to find a solution that works there, but hints in pandas are also accepted.
IIUC you want to find the root node of a child node. I assume all your keys are unique in the below:
import pandas as pd

# df is your original df, df2 the one after you apply your algo
d = df.set_index("key")["source_key"].to_dict()  # maps each key to its source_key

def find_root(node):
    # follow source_key links upwards until a key has no (non-null) parent
    cur = d.get(node)
    return node if cur is None or pd.isna(cur) else find_root(cur)

df2["root"] = df2["key"].map(find_root)
print(df2)
                        key           date    status                      root
0  10277_left_1605560105900  1605560105900  modified  10277_left_1605483676378
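Since you mentioned the algorithm runs in PySpark, here is a rough sketch of the same idea there (an assumption on my part, not a drop-in solution): it repeatedly joins back onto the key -> source_key lineage, assuming source_key is an actual null (not the string 'null') for created rows and that chains are at most max_depth links long.

from pyspark.sql import functions as F

lineage = df.select("key", "source_key")                    # the raw writeback table
roots = df2.select("key").withColumn("root", F.col("key"))  # start each cleaned key at itself

max_depth = 10  # assumed upper bound on how long a modification chain can get
for _ in range(max_depth):
    roots = (
        roots.alias("r")
        .join(lineage.alias("l"), F.col("r.root") == F.col("l.key"), "left")
        .select(
            F.col("r.key").alias("key"),
            # step up to the parent when one is recorded, otherwise stay put
            F.coalesce(F.col("l.source_key"), F.col("r.root")).alias("root"),
        )
    )

df2_with_root = df2.join(roots, on="key", how="left")

Joining root back onto the original table by key would then give you the creation date for date_added.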
I have columns taken from Excel as a DataFrame; the columns are as follows:
HolidayTourProvider|Packages|Meals|Accommodation|LocalTravelVehicle|Cancellationfee
Holiday Tour Provider has a couple of company names.
The features provided in each package (Meals, Accommodation, etc.) are mostly the same, even though one company may call its package "Saver" while another calls it "Budget". Most feature columns are Yes/No, except Local Travel Vehicle, which holds car names like Ford Taurus or Jeep Cherokee, and Cancellation fee, which holds integers.
I need to write a function like
match(HolidayTP,Package)
where the user can give input like
match(AdventureLife, Luxury)
then I need to return all the packages from other Holiday Tour Providers that have features similar to Luxury, no matter what name they give the package, like 'Semi Lux', 'Comfort', etc.
I want to keep a counter for every matching feature and display all the packages whose counter exceeds 3 or 4.
This is my first Python code. I am stuck here.
fb is the DataFrame I exported the data to:
def mapHol(HTP, PACKAGE):
    mfb = (fb['HTP'] == HTP) & (fb['package'] == PACKAGE)
    B = fb[mfb]
    count = 0
    for i in fb[i]:
        for j in B[j]:
            if fb[i] == B[j]:
                count += 1
I don't know how to proceed. Please help me; this is my first major project, which I started on my own.
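One way to get started, as a rough sketch rather than a full solution: take the reference package's feature columns, compare them row by row against every other provider's packages, and keep the ones with at least a threshold number of matches. The column names follow the header you listed; the function shape and the default threshold are assumptions.

def match(fb, provider, package, threshold=3):
    # feature columns taken from the header you listed
    feature_cols = ['Meals', 'Accommodation', 'LocalTravelVehicle', 'Cancellationfee']

    # the reference package offered by the given provider (assumes exactly one such row)
    ref = fb[(fb['HolidayTourProvider'] == provider) &
             (fb['Packages'] == package)][feature_cols].iloc[0]

    # compare every other provider's packages against it, column by column
    others = fb[fb['HolidayTourProvider'] != provider].copy()
    others['match_count'] = (others[feature_cols] == ref).sum(axis=1)

    # keep packages with at least `threshold` matching features
    return others[others['match_count'] >= threshold]

# e.g. match(fb, 'AdventureLife', 'Luxury')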
I know some event happens to these tools in a certain timeframe; unfortunately, these events happen at different timeframes for different tools. I am working on filtering the data down to those 2-3 hour time frames so that I can quantify the improvements and compare before and after the fix. I know we can filter the data based on time using pandas between_time; however, I am not sure how to go about filtering the data by Eqp_ID and also by different time frames. What I am doing is a little crude, and I would appreciate it if any of you has a better and more efficient solution for my problem.
dmv2361 = report1[report1['Eqp_ID'] == 'dmv2361']
df_2361 = dmv2361.between_time('01:30', '04:30')
dmv2362 = report1[report1['Eqp_ID'] == 'dmv2362']
df_2362 = dmv2362.between_time('03:30', '06:30')
dmv2363 = report1[report1['Eqp_ID'] == 'dmv2363']
df_2363 = dmv2363.between_time('05:30', '08:30')
I am expecting something like this, or a better way:
alc= report1[report1["Eqp_ID"].isin(['dmv2360', 'dmv2361', 'dmv2362', 'dmv2363', 'dmv2370', 'dmv2371', 'dmv2372', 'dmv2373', 'dmv2374'])].sort_values(by='Start_Date', ascending=True). between_time('23:30-02:30', '01:30-04:30', 'so on')
You can use indexer_between_time to do this within a single loc:
report1.loc[(report1['Eqp_ID'] == 'dmv2361')
            & report1.index.isin(report1.index[report1.index.indexer_between_time('01:30', '04:30')])]
Note: This is what's called under the hood in between_time.
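Building on that, one way to handle a different window per tool (a sketch, assuming report1 has a DatetimeIndex and a 'Start_Date' column as in your question; the windows below are just examples) is to keep the Eqp_ID-to-window mapping in a dict, filter each group with between_time, and concatenate:

import pandas as pd

# example windows only; extend the dict with the rest of your tools
windows = {
    'dmv2361': ('01:30', '04:30'),
    'dmv2362': ('03:30', '06:30'),
    'dmv2363': ('05:30', '08:30'),
}

alc = pd.concat(
    report1[report1['Eqp_ID'] == eqp].between_time(start, end)
    for eqp, (start, end) in windows.items()
).sort_values(by='Start_Date')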
This is a bit of a specific question, but somebody must have done this before. I would like to get the latest papers from PubMed. Not papers about a certain subject, but all of them. I thought to query by modification date (mdat). I use Biopython and my code looks like this:
from Bio import Entrez

handle = Entrez.egquery(mindate='2015/01/10', maxdate='2017/02/19', datetype='mdat')
results = Entrez.read(handle)
for row in results["eGQueryResult"]:
    if row["DbName"] == "nuccore":
        print(row["Count"])
However, this results in zero papers. If I add term='cancer' I get heaps of papers. So the query seems to need the term keyword... but I want all papers, not papers on a certain subject. Any ideas how to do this?
thanks
carl
term is a required parameter, so you can't omit it in your call to Entrez.egquery.
If you need all the papers within a specified timeframe, you will probably need a local copy of MEDLINE and PubMed Central:
For MEDLINE, this involves getting a license. For PubMed Central, you
can download the Open Access subset without a license by ftp.
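If a local copy is overkill, one possible workaround (my assumption, not something guaranteed by the above) is to make the date range itself the search term, so esearch does not need a subject keyword; the email address below is a placeholder:

from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

# the date range is the whole query, so no subject term is needed
term = '("2015/01/10"[MDAT] : "2017/02/19"[MDAT])'
handle = Entrez.esearch(db="pubmed", term=term, retmax=20)
results = Entrez.read(handle)

print(results["Count"])   # total number of matching records
print(results["IdList"])  # the first 20 PMIDs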
EDIT for Python 3. The idea is that the latest PubMed ID corresponds to the latest paper (which I'm not sure is true). Basically, this does a binary search for the latest PMID, then gives a list of the n most recent. It does not look at dates and only returns PMIDs.
There is an issue, however: not all PMIDs exist. For example, https://pubmed.ncbi.nlm.nih.gov/34078719/ exists, https://pubmed.ncbi.nlm.nih.gov/34078720/ does not (a retraction?), and https://pubmed.ncbi.nlm.nih.gov/34078721/ exists. This breaks the binary search, since it can't tell whether it has found a PMID that hasn't been used yet or one that previously existed and was removed.
CODE:
import urllib.request
import urllib.error

def pmid_exists(pmid):
    # a PMID exists if its PubMed page does not return an HTTP error
    url_stem = 'https://www.ncbi.nlm.nih.gov/pubmed/'
    query = url_stem + str(pmid)
    try:
        urllib.request.urlopen(query)
        return True
    except urllib.error.HTTPError:
        return False

def get_latest_pmid(guess=27239557, _min_guess=None, _max_guess=None):
    #print(_min_guess, '<=', guess, '<=', _max_guess)
    if _min_guess and _max_guess and _max_guess - _min_guess <= 1:
        # recursive base case, this guess must be the largest PMID
        return guess
    elif pmid_exists(guess):
        # guess PMID exists, search for larger ids
        _min_guess = guess
        next_guess = (_min_guess + _max_guess) // 2 if _max_guess else guess * 2
    else:
        # guess PMID does not exist, search for smaller ids
        _max_guess = guess
        next_guess = (_min_guess + _max_guess) // 2 if _min_guess else guess // 2
    return get_latest_pmid(next_guess, _min_guess, _max_guess)

# Start of program
n = 5
latest_pmid = get_latest_pmid()
most_recent_n_pmids = list(range(latest_pmid - n, latest_pmid))
print(most_recent_n_pmids)
OUTPUT:
[28245638, 28245639, 28245640, 28245641, 28245642]
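If you then want the records behind those PMIDs rather than just the ids, something along these lines should work with Biopython (the email address is a placeholder):

from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder contact address required by NCBI

# fetch the MEDLINE-formatted records for the PMIDs found above
handle = Entrez.efetch(db="pubmed",
                       id=",".join(str(p) for p in most_recent_n_pmids),
                       rettype="medline", retmode="text")
print(handle.read())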