Is it possible to use only certain values in a column when performing a pandas query

I'm trying to use a post/zip code map to plot longitude/latitude coordinates using GeoPandas.
If the post/zip codes are sequential (see the code below), I have no issues. But there are some outlying post/zip codes I'd like to add, e.g. 5118 and 5371-3, and some redundant ones, e.g. 5354, that I'd like to remove, depending on whether or not they need to appear on the map.
df_a = df.query('code >= 5350 & code <= 5355')
ax = df_a.plot()
Can I add or remove individual post/zip codes within the single query above?
[Image: map with marker points]
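One possible approach, as a sketch (assuming code is an integer column and that 5371-3 means 5371 through 5373): query expressions support chained comparisons, boolean operators, and in tests against local variables referenced with @, so codes can be added and removed in a single expression:
extra = [5118, 5371, 5372, 5373]  # outlying codes to add
df_a = df.query('(5350 <= code <= 5355 and code != 5354) or code in @extra')
ax = df_a.plot()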

Related

Selecting Columns Based on Multiple Criteria in a Julia DataFrame

I need to select values from a single column in a Julia dataframe based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (principal component analysis), so I first split the original data into an analytical matrix and a label array. This is my code so far (it doesn't work):
### Initialize source dataframe for PCA
dfSource = DataFrame(
    colDataX=[0,5,10,15,5,20,0,5,10,30],
    colDataY=[1,2,3,4,5,6,7,8,9,0],
    colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
### Extract last column as labels
arLabels=dfSource[1:2:end,3]
### Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)
output> MethodError: no method matching...
On the last line, before the print(datGet) statement, I get a MethodError indicating a method mismatch related to the use of &. What have I done wrong?
A small example of an alternative implementation (you may find it useful to see what DataFrames.jl has built in):
# avoid materialization if dfSource is large
dfSourceHalf = @view dfSource[1:2:end, :]
lazyFilter = Iterators.filter(row -> 0.2 <= row[3] < 0.7, eachrow(dfSourceHalf))
matFiltered = mapreduce(row -> collect(row[1:2]), hcat, lazyFilter)
matFiltered[1, :]
(this is not optimized for speed but is rather a showcase of what is possible; still, it is already several times faster than your code)
This code works:
dfSource = DataFrame(
    colDataX=[0,5,10,15,5,20,0,5,10,30],
    colDataY=[1,2,3,4,5,6,7,8,9,0],
    colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
arLabels=dfSource[1:2:end,3]
datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
print(datGet)
output> [0,10,0]
Note the use of parenthetical enclosures (arLabels.>=0.2) and (arLabels.<0.7), as well as the .>= and .< syntax (which broadcasts the comparison element-wise over the collection). Finally, and most crucially (since it's the part most people miss), note the use of .& in place of plain &. The dot operator makes all the difference!

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
using DataFrames

function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
This appends a vector to one column of df. Note: in my full code, I append a more complicated Vector{Int} than the output of ones.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
which, I suppose, means this isn't efficient. I can't work out what this @code_warntype output means. More generally, how can I understand the warnings returned by @code_warntype and fix them? This is a recurring source of confusion for me.
EDIT: following @BogumiłKamiński's answer, how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a nb_simulations length vector.
You should never do this, as it will cause many DataFrames.jl functions to stop working properly. In fact, such code will soon throw an error; see https://github.com/JuliaData/DataFrames.jl/issues/1844, which is exactly about patching this hole in the DataFrames.jl design.
What you should do instead is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you see a type instability is that a DataFrame can hold vectors of any type (technically, columns are stored in a Vector{AbstractVector}), so the type of the vector stored under a given name cannot be determined at compile time.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well, and I do it almost every day (as I run a lot of simulations). As I have indicated, you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I include it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # when it is passed 1 as nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append a NamedTuple here. As I wrote earlier, you can append a DataFrame or any data frame-like object this way. A "data frame-like object" covers a broad class of things: in general, anything that you can pass to the DataFrame constructor (e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching, so column names must be consistent between the target and the appended object.
This is different from push!, which also allows you to push a row that does not specify column names (in my example above I show that a Tuple can be pushed).

Processing pandas data in declarative style

I have a pandas dataframe of vehicle coordinates (from multiple vehicles on multiple days). For each vehicle and each day, I do one of two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written out in a declarative style, as opposed to imperatively looping through the groups, with the goal of having the whole thing look something like:
df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect since .apply() and .filter() return new dataframes, and this is exactly my problem. They return all the data back in a single dataframe, and I find that I have to apply .groupby('vehicle_id', 'day') again and again.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = do_stuff1(dfg)  # perform all needed operations
    dfg = do_stuff2(dfg)
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then apply it once with a single groupby:
def all_operations(dfg):
    # do stuff
    return result_df

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
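As a minimal sketch of the second option with the empty case handled (the speed column and both processing steps are hypothetical stand-ins), one way is to return None for groups that should disappear, since groupby.apply generally drops groups for which the function returns None:
def all_operations(dfg):
    dfg = dfg[dfg['speed'] > 0]  # hypothetical filter step
    if dfg.empty:
        return None  # this group is dropped from the combined result
    return dfg.assign(distance=dfg['speed'].cumsum())  # hypothetical algorithm step

result = df.groupby(['vehicle_id', 'day'], group_keys=False).apply(all_operations)
Here group_keys=False keeps the group labels from being prepended to the result's index; drop it if you want them.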

Keeping table formatting in Sage with multiple tables

As the title suggests, I am trying to keep proper table formatting in Sage while displaying multiple tables (this is strictly a formatting question, so no knowledge of the math involved is necessary). Currently, I am using the following code:
my_table2 = table([column1, column2], frame = True)
my_table1 = table([in_the_cone, lengths_in_cone], frame = True)
result_table1 = my_table1.transpose()
result_table2 = my_table2.transpose()
result_table1
result_table2
With this, I receive no output for table1 and the following output for table2:
I want both tables to look this way, but having no output for the first table is no good. So I tried changing the bottom two lines to:
result_table1, result_table2
While this does display both tables, the formatting now looks like:
Is there a way I can display both tables at the same time with the first formatting?
It would have been nice for you to include a full minimal working example, but in any case it does depend a little on the output.
Basically, in a notebook or other "cell", only the last return value is printed to the screen in some fashion (sometimes via a "hook", as in your case). But if you use the comma, that implicitly creates a tuple, which is then printed as a tuple, so you lose the "hook" that displays things in math mode (since a tuple doesn't have one).
In this case, the (newish) canonical way to achieve what you want is
pretty_print(result_table1)
pretty_print(result_table2)
though you may want to put a print("\n") in between so they don't end up right on top of each other.
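Putting that together, the display code might look like this (the separator is just one option):
pretty_print(result_table1)
print("\n")
pretty_print(result_table2)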
Edit: Here is a picture in Jupyter inside of Sage.

Plotting using multiple variables in gnuplot

I have a datafile with multiple columns, the first two indicating the position and the others indicating other properties (such as the number of items sent from this point), e.g.:
1 1 1 57.11
2 1 2 62.40
3 4 1 31.92
What I want to do is plot points at those positions, but use values from the other columns to vary the point type and size (for example). However, I can't seem to find a way to reference columns in the plot. I know of the use of "variable", but I can't find a way to use multiple variables.
What I want is something like the following:
plot "mydata" using 1:2 notitle with points pt ($3) ps ($4/10)
so that pt and ps use the value for each point taken from the third and fourth columns respectively.
Is this even possible in gnuplot? Is there some sort of work-around?
You should be able to use the keyword variable to do something like this:
plot 'datafile' using 1:2:3:4 w points ps variable lc variable
Or possibly mapping the value to a palette:
plot 'datafile' using 1:2:3:4 w points ps variable lc palette
The keyword variable and/or palette causes gnuplot to read the property from the file, and each requires an extra column to be read via using. Of course, all the usual stuff with using applies -- you can apply transforms to the data, etc.:
plot 'datafile' using 1:2:3:($4+32.) w points ps variable lc palette
I don't remember off the top of my head whether the 3rd column will be the pointsize or the color here, and I don't have time right now to play around with it to figure it out. You can do the experimenting and post a comment, or I'll come back to this when I have time and add an update.
Some of the other properties (e.g. pointtype) can't be changed quite so easily using variable. The easiest way to do this is to use filters with the gnuplot ternary operator.
First, write a function that returns a pointtype based on the data from one column of the datafile:
my_point_type(x) = x
Here I use a simple identity function, but it could be anything. Now, you can loop over the pointtypes you want (here 1-10) making a plot for each:
plot for [PT=1:10] 'datafile' u 1:((my_point_type($3) == PT) ? $2 : NaN) with points pt PT
This assumes that the column with the pointtype information is the 3rd column and that the second column holds the position information. This can also be combined with the stuff I demonstrated above.