Pymongo: returning data through a def only returns one row

def av():
    for row in info.aggregate([{"$project": {"firstname": 1}}]):
        list = []
        list.append(row)
        list = str(list)
        return list

print(av())
Here, if instead of writing return list I write print(list), it gives me all the data I need. But if I try to return it, the output contains only the first row of the collection.
Since I want to call the function later in the program and use if conditions to check whether a given value is present in the data, it is a must that I return the list and not print it.
Please tell me what I am missing, or whether there is a better way to do the same.

The problem is that return exits the function on the first iteration of the loop, while print does not. Write return list at the same indentation level as the for loop. Moreover, you empty the list on every iteration; define it before the loop.
def av():
    list = []
    for row in info.aggregate([{"$project": {"firstname": 1}}]):
        list.append(row)
    list = str(list)
    return list

print(av())
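As an aside, a pymongo aggregation cursor can be materialized directly with list(), which removes the manual loop entirely. A minimal sketch, assuming info is an existing pymongo collection; the value "Alice" is just a placeholder for whatever you want to check:

def av():
    # list() exhausts the cursor and returns every document
    return list(info.aggregate([{"$project": {"firstname": 1}}]))

docs = av()
# keeping the documents as dicts makes membership checks straightforward
if any(doc.get("firstname") == "Alice" for doc in docs):
    print("found it")

Returning the documents themselves, rather than str(list), also makes the later if conditions much easier to write.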

Related

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int  # column number

    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive column ranges are supported.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator to give
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

Issue adding new columns to dataframe using pyspark

Say I run this
DF1.withColumn("Is_elite", array_intersect(DF1.year, DF1.elite_years)).show()
I get the result I want: a new column called Is_elite with the correct values.
Then in the next command I run
DF1.show()
and it just shows me what DF1 looked like before the first command; my new column is missing.
Since you appended .show() to the line, the expression returns None rather than a new data frame, and the DataFrame produced by withColumn is never assigned to anything. Make the following changes and try it out:
elite_df = DF1.withColumn("Is_elite",array_intersect(DF1.year,DF1.elite_years))
elite_df.show()
In case you get confused about the object in python, try to print the type of object.
#the following must return a dataframe object.
print(type(elite_df))
DataFrames are immutable, and every transformation creates a new DataFrame reference; hence if you print the old DataFrame, you will not see the revised result.
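To make the immutability point concrete, here is a minimal runnable sketch; the sample rows are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_intersect

spark = SparkSession.builder.getOrCreate()

# two array columns so array_intersect has something to compare
DF1 = spark.createDataFrame(
    [(["2019", "2020"], ["2020", "2021"])],
    ["year", "elite_years"],
)

elite_df = DF1.withColumn("Is_elite", array_intersect(DF1.year, DF1.elite_years))
elite_df.show()  # includes the Is_elite column
DF1.show()       # unchanged: withColumn returned a new DataFrame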

Integer variable from a custom keyword in the robot framework

I have a custom keyword in the robot framework which counts the items of a list. This already works in my underlying python file and prints the number 5 when five elements exist in the list.
Then I want to bring this value to the robot framework. But instead of a number I get:
${N_groups} is <built-in method count of list object at 0x03B01D78>
The code of the robot file:
*** Test Cases ***
Count Groups
    ${N_groups}    Setup Groups Count Groups
    log to console    ${N_groups}
How to get item-count of the list as an integer value?
Here is a part of my python file:
@keyword(name="Count Groups")
def count_groups(self):
    N = self.cur_page.count_groups()
    return N
And a more low level python file:
def count_groups(self):
    ele_tc = self._wait_for_treecontainer_loaded(self._ef.get_setup_groups_treecontainer())
    children_text = self._get_sublist_filter(ele_tc, lambda ele: ele.find_element_by_tag_name('a').text,
                                             True)
    return children_text.count
Your function count_groups is returning children_text.count. children_text is a list, and you're returning the count method of that object, which explains the error that you're seeing. It's no different than if you did something like return [1,2,3].count.
Perhaps you intend to actually call the count function and return the results? Or, perhaps you are intending to return the length of the list? It's hard to see what the intent of the code is.
In either case, robot is reporting exactly what you're doing: you're returning a reference to a function, not an integer. My guess is that what you really want to do is return the number of items in the list, in which case you should change the return statement to:
return len(children_text)
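A quick self-contained illustration of the difference, in plain Python and independent of the robot files:

items = ["a", "b", "c"]

print(items.count)       # <built-in method count of list object at 0x...> -- a method reference
print(items.count("a"))  # 1 -- count, when actually called, counts occurrences of its argument
print(len(items))        # 3 -- the number of items in the list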

How can I use `apply` with a function that takes multiple inputs

I have a function that has multiple inputs, and would like to use SFrame.apply to create a new column. I can't find a way to pass two arguments into SFrame.apply.
Ideally, it would take the entry in the column as the first argument, and I would pass in a second argument. Intuitively something like...
def f(arg_1, arg_2):
    return arg_1 + arg_2

sf['new_col'] = sf.apply(f, arg_2)
Suppose the first argument of function f is one of the columns, say argcolumn1 in sf. Then
sf['new_col'] = sf['argcolumn1'].apply(lambda x: f(x, arg_2))
should work.
Try this.
sf['new_col'] = sf.apply(lambda x : f(arg_1, arg_2))
The way I understand your question (and because none of the previous answers is marked as accepted), it seems to me that you are trying to apply a transformation using two different columns of a single SFrame, so:
As specified in the online documentation, the function you pass to the SFrame.apply method will be called for every row in the SFrame.
So you should rewrite your function to receive a single argument representing the current row, as follow:
def f(row):
    return row['column_1'] + row['column_2']

sf['new_col'] = sf.apply(f)
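Putting that together, a minimal runnable sketch, assuming the turicreate package (one home of the SFrame type); the data and column names are invented:

import turicreate as tc

sf = tc.SFrame({'column_1': [1, 2, 3], 'column_2': [10, 20, 30]})

def f(row):
    # SFrame.apply passes each row to f as a dict-like object
    return row['column_1'] + row['column_2']

sf['new_col'] = sf.apply(f)
print(sf)  # new_col is [11, 22, 33]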

Pentaho PDI: Final value of previous row's calculated field

I tried to use the Analytic Query step to access a calculated field of the previous row. It turns out that the rows are all processed in parallel, so accessing the previous row's fields gives you whatever value they hold at that moment during processing, which is effectively random. It does not seem to be possible to obtain the final value of a field of a previous row. Or is there any other way than the Analytic Query step? I imagine all I need is a checkbox "Wait for previous rows to complete"...
What I need this for: I am processing time-dependent data and doing state recognition. When I am in state A, I do different things with my data than when I am in state B. So I need to know the state of the previous data row (which is not determined before the end of my transformation).
It can be done in Excel really easily, so I guess there must be some way in PDI. :-)
Thanks for any help!
If I have understood your question correctly, you may try using the Block this step until steps finish step. It waits until all the step copies specified in its dialog have finished. Read the link for more.
Hope this helps :)
I believe it can be resolved by using the User Defined Java Class (UDJC) step.
If you sort the rows before processing them, the Sort by step waits for the last row of the set by default.
Here's the most basic example of writing an output row for each input row. One important thing to keep in mind with the User Defined Java Class step is that it rewrites your whole data set, so it needs to be well thought out, especially if you do look-backs at previous rows. I hope this helps a bit.
// A class member that stores the previous row:
public Object[] previousRow;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
        throws KettleException {

    // Fetch the incoming row:
    Object[] r = getRow();

    // A null row means the previous step has no more rows:
    if (r == null) {
        setOutputDone();
        return false;
    }

    // Get some field's value:
    String someFieldValue = get(Fields.In, "someFieldName").getString(r);

    // Log the value if you want:
    logBasic("current field value is " + someFieldValue);

    // Generate an output row object:
    Object[] outputRow = RowDataUtil.createResizedCopy(r, data.outputRowMeta.size());

    // Modify the row's field values if needed:
    get(Fields.Out, "someFieldName").setValue(outputRow, "a modified value here");

    // Write the row:
    putRow(data.outputRowMeta, outputRow);

    // Remember the current row as the previous row for the next call:
    previousRow = r;
    return true;
}
EDIT:
One more important thing to note about PDI: blocking, whether via the blocking steps or via the Sort by step, operates on row sets rather than single rows.
How can this be verified?
Right click --> Transformation Settings --> Miscellaneous --> Nr of rows in rowset.
The default value is 10000 rows. PDI developers often create a deadlock by using one of the blocking steps with a row set size that doesn't fit their data volume - do keep that in mind.
Use "Identify last row in a stream" & "Filter rows" transformations. The 1st transformation checks if its the last row and returns a Boolean value and the later can be used to filter the records based on the Boolean value returned.