Variables don't hold value for very long - pymongo, ipython

Perhaps I should just restart my computer, but it seems my variables are losing their values. A simple aggregation only seems to hold the contents of my database for a short period of time. Note: I'm doing this in an IPython notebook.
MONGODB_URI = 'mongodb://username:password@***.mongolab.****/***'
client = MongoClient(MONGODB_URI)
db = client.get_default_database()
collectn = db.collection_name
pipe = [
    {"$unwind": "$predictions"},
    {"$match": {"predictions.t_obj": datetime.datetime(2015, 10, 29, 11, 0)}}
]
should_be_data = collectn.aggregate(pipe)
list(should_be_data)
# returns what we expect, i.e. data
list(should_be_data)
# returns []
Why do the contents of my variable disappear?

should_be_data isn't a list or other data container; it's a cursor that behaves like a generator.
The first time you run list(should_be_data), the cursor is consumed completely: that line pulls every element out of the cursor and stores them in a new list.
By the second time you run list(should_be_data), the cursor is already exhausted, so it returns no further elements.
If you want it to be a list to begin with, just replace
should_be_data = collectn.aggregate(pipe)
with
should_be_data = list(collectn.aggregate(pipe))
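For illustration, here is a minimal, self-contained sketch of that behaviour (the URI, database, and collection names below are placeholders, not taken from the question):
from pymongo import MongoClient
import datetime

# Placeholder connection details - substitute your own URI and names.
client = MongoClient("mongodb://localhost:27017/")
collectn = client["mydb"]["collection_name"]

pipe = [
    {"$unwind": "$predictions"},
    {"$match": {"predictions.t_obj": datetime.datetime(2015, 10, 29, 11, 0)}},
]

cursor = collectn.aggregate(pipe)  # a cursor, not a list
first = list(cursor)               # consumes the cursor and stores its elements
second = list(cursor)              # the cursor is already exhausted, so this is []

# Materialize once and reuse the resulting list as often as needed:
should_be_data = list(collectn.aggregate(pipe))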


Can we pass dataframes between different notebooks in databricks and sequentially run multiple notebooks? [duplicate]

I have a notebook which processes a file and creates a data frame in structured format.
Now I need to import the data frame created in that notebook into another notebook, but the problem is that it should only run for some scenarios, so I need to validate a condition before running the notebook.
Usually, to import all data structures, we use %run. But in my case it needs to be a combination of an if clause and a notebook run:
if "dataset" in path: %run ntbk_path
This gives the error "path not exist".
if "dataset" in path: dbutils.notebook.run(ntbk_path)
With this one I cannot get all the data structures.
Can someone help me to resolve this error?
To implement it correctly you need to understand how things work:
%run is a separate directive that should be put into a separate notebook cell; you can't mix it with Python code. Also, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything that is defined in that notebook - variables, functions, etc. - is available in the caller notebook.
dbutils.notebook.run is a function that takes a notebook path plus parameters and executes it as a separate job on the current cluster. Because it's executed as a separate job, it doesn't share the context with the current notebook, and everything that is defined in it won't be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small maximum length). One of the problems with dbutils.notebook.run is that scheduling a job takes several seconds, even if the code is very simple.
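As a small illustration of that string-return limitation, here is a minimal sketch (the notebook path and the returned value are hypothetical, not from the question):
# In the called notebook: hand a short string back to the caller.
dbutils.notebook.exit("42")
# In the caller notebook: run the called notebook and capture the returned string.
result = dbutils.notebook.run("/path/to/called_notebook", 300)  # 300-second timeout
print(result)  # prints "42"; anything larger should go through a table or a temp view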
How can you implement what you need?
If you use dbutils.notebook.run, then in the called notebook you can register a temp view, and the caller notebook can read data from it (the examples are adapted from this demo).
Called notebook (Code1 - it requires two parameters: name for the view name and n for the number of entries to generate):
name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
df = spark.range(0, n)
df.createOrReplaceTempView(name)
Caller notebook (let's call it main):
if "dataset" in "path":
view_name = "some_name"
dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
df = spark.sql(f"select * from {view_name}")
... work with data
It's even possible to do something similar with %run, but it requires a kind of "magic". The foundation of it is the fact that you can pass arguments to the called notebook by using $arg_name="value", and you can even refer to the values specified in widgets. But in any case, the check for the value will happen in the called notebook.
The called notebook could look as follows:
flag = dbutils.widgets.get("generate_data")
dataframe = None
if flag == "true":
    dataframe = ..... create dataframe
and the caller notebook could look as follows:
------ cell in python
if "dataset" in "path":
    gen_data = "true"
else:
    gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)
------- cell for %run
%run ./notebook_name $generate_data=$gen_data
------ again in python
dbutils.widgets.remove("gen_data")  # remove the widget
if dataframe is not None:  # the dataframe was created
    do something with dataframe

RStudio Error: Unused argument (by = ...) when fitting gam model, and smoothing separately for a factor

I am still a beginner in R. For a project I am trying to fit a GAM model on a simple dataset with a time variable and a year variable. I am doing it in R and I keep getting an error message that claims an argument is unused, even though I specify it in the code.
It concerns a dataset which includes a categorical variable "Year", with only two levels: 2020 and 2022. I want to investigate if there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a GAM model for this, and have the smoothing applied differently for the two years.
The following is the line of code that I tried to use
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when I try to view the output using summary(gam1), I get
Error in `[<-`(`*tmp*`, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing that I'm doing wrong, I decided to combine the question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths and as such the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load your data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off only loading {mgcv} if you want factor by smooths.
@Gavin Simpson
I did have both loaded, and I tried just using mgcv as you suggested. However, then I get the following error.
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually trying to use the gam() function, but rather it attempts to name something gam1. So I would assume I actually need the 'gam' package before I could do this.
The second line of code also doesn't work. I get the following error
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order in which I load the two packages. And if I don't load 'gam', I get the error as described above.

PyMongo checking if entry is already in array

I want to have an array of the IPs that went to x webpage. However, right now every time an IP connects it gets added, so if people connect multiple times there will be many duplicates.
Currently I'm just using
query = {"_id": paste_id, "ip": {"$ne": real_ip}
collection.update_one(query, {"$push": {"ip": ip}})
but that (logically) just pushes the entry into the array, what'd be a way to check if ip is already in the array?
UPDATED ANSWER
The updated code you posted checks for real_ip and then inserts ip, so you're not checking the item you're actually updating.
My original answer below still stands as the correct approach; however, this complete example demonstrates attempting to insert real_ip 5 times, and it shows that the value is only inserted once:
from pymongo import MongoClient
collection = MongoClient()['mydatabase'].collection
result = collection.insert_one({'ip': []})
paste_id = result.inserted_id
real_ip = '1.2.3.4'
for i in range(5):
    query = {"_id": paste_id, "ip": {"$ne": real_ip}}
    collection.update_one(query, {"$push": {"ip": real_ip}})
print(list(collection.find()))
prints:
[{'_id': ObjectId('5feefdb862e2ed3ea952a035'), 'ip': ['1.2.3.4']}]
ORIGINAL ANSWER
Add a check that the IP is not already in the ip array, e.g.
query = {'ip': {'$ne': ip}}
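A related technique, not part of the original answer, is MongoDB's $addToSet operator, which appends a value only if it is not already present in the array. A one-line sketch reusing the names from the example above:
# $addToSet only appends real_ip if it is not already in the "ip" array.
collection.update_one({"_id": paste_id}, {"$addToSet": {"ip": real_ip}})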

Issues trying to open a bi-dimensional array leaf contained in a ROOT Tree in PyROOT

I'm stuck with a problem using PyROOT. I'm not able to read a leaf on a tree which is a two-dimensional array of float values. You can see the related Tree in the following:
root [1] TTree *tr = (TTree*)g->Get("tevent_2nd_integral")
root [2] tr->Print()
*Tree :tevent_2nd_integral: Event packet tree 2nd GTUs integral *
*Entries : 57344 : Total = 548967602 bytes File Size = 412690067 *
: : Tree compression factor = 1.33 *
*Br 7 :photon_count_data : photon_count_data[1][1][48][48]/F *
*Entries : 57344 : Total Size= 530758073 bytes File Size = 411860735 *
*Baskets : 19121 : Basket Size= 32000 bytes Compression= 1.29 *
…
The array in question (branch Br 7 above) is photon_count_data[1][1][48][48]. Actually I have several ROOT files, and I tried both making a chain and using the hadd method, e.g. hadd file.root `ls /path/*.root`.
I tried several ways, as I will show below. Each time I hit a different problem: sometimes the numpy array which should contain the 48x48 values for each event was not created at all, other times it contained nothing or strange values (also negative ones, which is not possible).
My code is the following:
# calling the root file after using hadd to merge all files
rootFile = path + "merge.root"
f = XROOT.TFile(rootFile, 'read')
tree = f.Get('tevent_2nd_integral')
# making a chain
PDMchain = TChain("tevent_2nd_integral")
for filename in sorted(os.listdir(path)):
    if filename.endswith('.root') and ("CPU_RUN_MAIN" in filename):
        PDMchain.Add(filename)
pdm_counts = []
# First method: using the python pyl class
leaves = tree.GetListOfLeaves()
# define dynamically a python class containing root Leaves objects
class PyListOfLeaves(dict):
    pass
# create an instance
pyl = PyListOfLeaves()
for i in range(0, leaves.GetEntries()):
    leaf = leaves.At(i)
    name = leaf.GetName()
    # add the leaf dynamically as an attribute of the instance
    pyl.__setattr__(name, leaf)
for iev in range(0, nEntries_pixel):
    tree.GetEntry(iev)
    pdm_counts.append(pyl.photon_count_data.GetValue())
# Second method: the Draw method
count = tree.Draw("photon_count_data", "", "")
pdm_counts.append(np.array(np.frombuffer(tree.GetV1(), dtype=np.float64, count=count)))
# Third method: the ROOT buffer method
for event in PDMchain:
    pdm_data_for_this_event = event.photon_count_data
    pdm_data_for_this_event.SetSize(2304)  # ROOT buffer, 48x48 = 2304 values
    pdm_counts.append(np.array(pdm_data_for_this_event, copy=True))
With the python class method, the array pdm_counts is filled with just the first element contained in photon_count_data.
With the Draw method I get a segmentation violation or a strange kernel issue.
With the ROOT buffer method I do get back a list containing all the 2304 (48x48) values, but they are completely different from those in photon_count_data, i.e. negative values or senseless orders of magnitude.
Could you tell me where I'm going wrong, or whether there is a more elegant and quicker method to do this?
Thanks in advance
Actually I found the solution and I would like to share it in case someone needs it!
The third method explained above,
for event in PDMchain:
    pdm_data_for_this_event = event.photon_count_data
    pdm_data_for_this_event.SetSize(2304)  # ROOT buffer
    pdm_counts.append(np.array(pdm_data_for_this_event, copy=True))
works, but unfortunately I was using Spyder to visualize the data and for some reason it returned strange values which are not right! So... don't use Spyder!!!
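If you then need those 2304 buffer values as an actual 48x48 matrix, a minimal sketch (assuming the PDMchain and branch names from the question) could reshape them with numpy:
import numpy as np

pdm_counts = []
for event in PDMchain:
    buf = event.photon_count_data
    buf.SetSize(48 * 48)  # tell ROOT how many floats the flat buffer holds
    # copy the buffer and reshape it into the 48x48 pixel matrix for this event
    pdm_counts.append(np.array(buf, copy=True).reshape(48, 48))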
Moreover another method works fine:
from root_pandas import read_root
data = read_root('merge.root', 'tevent_2nd_integral', columns=['cpu_packet_time', 'photon_count_data'])
Cheers!

Pull info from datapool, increment a value and store the datapool

The application I test has some areas where it requires unique data. Specifically, the application will generate a request number that can only be used once. After my test runs I must manually update my datapool reference for this number. Is there any way, using Java, that I can get the information stored in my datapool, increase the value by one, and then save the data back to the datapool? This way I can keep RFT in sync with my application with regard to this number.
Here is an example of how to read a value from the datapool, increment it by 1, and save it back to the datapool. It is an adapted example from the book Software Test Engineering with IBM Rational Functional Tester; the original source code is from chapter 5 (and can be downloaded from the book's homepage).
// some imports
import org.eclipse.hyades.edit.datapool.IDatapoolCell;
import org.eclipse.hyades.edit.datapool.IDatapoolEquivalenceClass;
import org.eclipse.hyades.execution.runtime.datapool.IDatapool;
import org.eclipse.hyades.execution.runtime.datapool.IDatapoolRecord;

// read the current value from the datapool and increment it
int value = dpInt("value");
value++;

// load the datapool file from the project's datastore
java.io.File dpFile = new java.io.File((String) getOption(IOptionName.DATASTORE), "SomeDatapool.rftdp");
IDatapool dp = dpFactory().load(dpFile, true);

// get the first record of the default equivalence class and update its first cell
IDatapoolEquivalenceClass equivalenceClass = (IDatapoolEquivalenceClass) dp.getEquivalenceClass(dp
        .getDefaultEquivalenceClassIndex());
IDatapoolRecord record = equivalenceClass.getRecord(0);
IDatapoolCell cell = (IDatapoolCell) record.getCell(0);
cell.setCellValue(value);

// save the modified datapool back to disk
DatapoolFactory factory = DatapoolFactory.get();
factory.save((org.eclipse.hyades.edit.datapool.IDatapool) dp);
I think it is quite a lot of code to simply change one value—maybe it is easier to use some other method like writing the value to a normal text file.