Possible to Stringize a Polars Expression? - sql

Is it possible to stringize a Polars expression and vice versa? For example, convert df.filter(pl.col('a')<10) to the string "df.filter(pl.col('a')<10)". Is round-tripping possible, e.g. eval("df.filter(pl.col('a')<10)"), for user input or tool automation? I know this can be done with a SQL expression, but I'm interested in a native approach. I want to show the specified filter in the title of plots.

Expressions
>>> expr = pl.col("foo") > 2
>>> print(str(expr))
[(col("foo")) > (2i32)]
LazyFrames
>>> df = pl.DataFrame({
...     "foo": [1, 2, 3]
... })
>>> json_state = df.lazy().filter(expr).write_json()
>>> query_plan = pl.LazyFrame.from_json(json_state)
>>> query_plan.collect()
shape: (1, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 3   │
└─────┘

Related

Pandera print pa.errors.SchemaErrors with pa.check_inputs

Is there a way to print SchemaErrors when using pa.check_inputs? Say I have the df below:
import pandera as pa
import pandas as pd

df = pd.DataFrame.from_dict({
    'a': [1, 2, 2, 4, 5],
    'b': [1, 2, 3, 4, 'dogs'],
})

schema = pa.DataFrameSchema({
    'a': pa.Column(
        pa.Int64,
        checks=[pa.Check.isin([1, 2, 3, 4, 5])]),
    'b': pa.Column(
        pa.Int64,
        checks=[pa.Check.isin([1, 2, 3, 4, 5])]),
})
if I were to run foo
@pa.check_input(schema, lazy=True)
def foo(df: pd.DataFrame) -> int:
    return df.b.count()

foo(df)
the output would look like so:
Error Counts
------------
- schema_component_check: 2
Schema Error Summary
--------------------
                                        failure_cases  n_failure_cases
schema_context column check
Column         b      dtype('int64')         [object]                1
                      isin({1, 2, 3, 4, 5})    [dogs]                1
Usage Tip
---------
However, what I'd really like to see is:
  schema_context column                  check check_number failure_case index
0         Column      b         dtype('int64')         None       object  None
1         Column      b  isin({1, 2, 3, 4, 5})            0         dogs     4
which we get if we use try/except:
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # dataframe of schema errors

How to get an item in a polars dataframe column and put it back into the same column at a different location

Still new to Polars and Rust, so here is a newbie question:
How do I access a value within a DataFrame at a specific location.
How do I overwrite a value within a DataFrame at a specific location.
Here is some non-working code:
use polars::prelude::*;

fn main() {
    let df = df! [
        "STOCK" => ["TSLA", "META", "AA",],
        "STRIKES" => [10, 20, 5],
    ]
    .unwrap();
    println!("df\t{:?}", df);

    // Take TSLA's STRIKE (10)
    let tsla_strike = df
        .lazy()
        .filter((col("STOCK") == lit("TSLA")))
        .with_column(col("STRIKES"))
        .first()
        .collect();

    let o_i32 = GetOutput::from_type(DataType::Int32);

    // Overwrite AA's STRIKE with tsla_strike (5 ==> 10)
    let df = df
        .lazy()
        .filter((col("STOCK") == lit("AA")).into())
        .with_column(col("STRIKES").map(|x| tsla_strike, o_i32))
        .collect()
        .unwrap();
    println!("df\t{:?}", df);
}
Here is the result I'd like to get:
RESULT:
df shape: (3, 2)
┌───────┬─────────┐
│ STOCK ┆ STRIKES │
│ --- ┆ --- │
│ str ┆ i32 │
╞═══════╪═════════╡
│ TSLA ┆ 10 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ META ┆ 20 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ AA ┆ 10 │
└───────┴─────────┘
An anti-pattern way to do it is to traverse the DataFrame and, at the same time, build a new one with the desired values.
You can use the when -> then -> otherwise construct. When STOCK=="AA" then take the STRIKE where STOCK=="TSLA", otherwise just take the STRIKE. This construct is vectorized and fast (it does not operate on the single elements).
let df2 = df
    .clone()
    .lazy()
    .select([
        col("STOCK"),
        when(col("STOCK").eq(lit("AA")))
            .then(col("STRIKES").filter(col("STOCK").eq(lit("TSLA"))))
            .otherwise(col("STRIKES")),
    ])
    .collect()?;
Another option, in case you have a lot of mappings to do, would be a mapping DataFrame and left-joining the replacement values.
let mapping = df! [
    "ORIGINAL_STOCK" => ["AA"],
    "REPLACEMENT_STOCK" => ["TSLA"]
]?;
let df2 = df
    .clone()
    .lazy()
    .join(mapping.clone().lazy(), [col("STOCK")], [col("ORIGINAL_STOCK")], JoinType::Left)
    .join(df.clone().lazy(), [col("REPLACEMENT_STOCK")], [col("STOCK")], JoinType::Left)
    .select([
        col("STOCK"),
        when(col("STRIKES_right").is_not_null())
            .then(col("STRIKES_right"))
            .otherwise(col("STRIKES"))
            .alias("STRIKES"),
    ])
    .collect()?;

Use "extern __declspec(dllimport)" in Cython

Can I use extern __declspec(dllimport) in Cython? I am trying to wrap Embree on Windows, but am not sure I can dynamically link in Cython.
I read this SO post which is great for changing C/C++ and header files directly, but I'm not sure how to implement this in a .pxd file.
For example, the Embree 2.17.7 x64 header rtcore.h defines RTCORE_API as
#ifndef RTCORE_API
#if defined(_WIN32) && !defined(EMBREE_STATIC_LIB)
# define RTCORE_API extern "C" __declspec(dllimport)
#else
# define RTCORE_API extern "C"
#endif
#endif
However, these are left off the function signatures that use them in the pyembree pxd file rtcore.pxd. This seems consistent with the Cython docs, which state to
Leave out any platform-specific extensions to C declarations such as __declspec()
However, even if I point the pyembree setup.py file to my downloaded embree DLL by changing the line
ext.libraries = ["embree"]
to
ext.libraries = ["C:/Program Files/Intel/Embree v2.17.7 x64/bin/embree"]
I still get 3 linking errors:
mesh_construction.obj : error LNK2001: unresolved external symbol __imp_rtcMapBuffer
mesh_construction.obj : error LNK2001: unresolved external symbol __imp_rtcNewTriangleMesh
mesh_construction.obj : error LNK2001: unresolved external symbol __imp_rtcUnmapBuffer
build\lib.win-amd64-3.8\pyembree\mesh_construction.cp38-win_amd64.pyd : fatal error LNK1120: 3 unresolved externals
I know from this SO post and the Microsoft docs that __imp_-prefixed linker errors are due to not finding DLLs. However, you can see that the function is defined in rtcore_geometry.h and likewise declared in rtcore_geometry.pxd; the only difference is that the .pxd file does not include RTCORE_API in the signature.
Does anyone know how I can resolve this issue so pyembree will build?
EDIT: It should also be noted that I have added
# distutils: language=c++
to all my .pyx and .pxd files. This SO post was also reviewed, but it did not solve my problem.
UPDATE: Adding the embree.lib file to my local pyembree/embree2 folder and updating setup.py to
ext.libraries = ["pyembree/embree2/*"]
permits the code to compile via
py setup.py build_ext -i
However, the packages do not load:
>>> import pyembree
>>> from pyembree import rtcore_scene
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: DLL load failed while importing rtcore_scene: The specified module could not be found.
Do I need to define the "subpackages" in my setup.py? This is my current setup.py:
from setuptools import find_packages, setup
import numpy as np
from Cython.Build import cythonize
from Cython.Distutils import build_ext

include_path = [np.get_include()]

ext_modules = cythonize(
    'pyembree/*.pyx',
    language_level=3,
    include_path=include_path)

for ext in ext_modules:
    ext.include_dirs = include_path
    ext.libraries = [
        "pyembree/embree2/*",
    ]

setup(
    name="pyembree",
    version='0.1.6',
    cmdclass={"build_ext": build_ext},
    ext_modules=ext_modules,
    zip_safe=False,
    packages=find_packages(),
    include_package_data=True
)
and the directory structure is as follows (pyembree is the top-level folder in my .venv\lib\site-packages folder of my project):
pyembree
│ .authors.yml
│ .gitignore
│ .mailmap
│ AUTHORS
│ CHANGELOG.rst
│ LICENSE
│ MANIFEST.in
│ pyproject.toml
│ README.rst
│ setup.py
│
├───build
│ └───temp.win-amd64-3.8
│ └───Release
│ └───pyembree
│ mesh_construction.cp38-win_amd64.exp
│ mesh_construction.cp38-win_amd64.lib
│ mesh_construction.obj
│ rtcore.cp38-win_amd64.exp
│ rtcore.cp38-win_amd64.lib
│ rtcore.obj
│ rtcore_scene.cp38-win_amd64.exp
│ rtcore_scene.cp38-win_amd64.lib
│ rtcore_scene.obj
│ triangles.cp38-win_amd64.exp
│ triangles.cp38-win_amd64.lib
│ triangles.obj
│
├───pyembree
│ │ mesh_construction.cp38-win_amd64.pyd
│ │ mesh_construction.cpp
│ │ mesh_construction.h
│ │ mesh_construction.pyx
│ │ rtcore.cp38-win_amd64.pyd
│ │ rtcore.cpp
│ │ rtcore.pxd
│ │ rtcore.pyx
│ │ rtcore_geometry.pxd
│ │ rtcore_geometry_user.pxd
│ │ rtcore_ray.pxd
│ │ rtcore_scene.cp38-win_amd64.pyd
│ │ rtcore_scene.cpp
│ │ rtcore_scene.pxd
│ │ rtcore_scene.pyx
│ │ triangles.cp38-win_amd64.pyd
│ │ triangles.cpp
│ │ triangles.pyx
│ │ __init__.pxd
│ │ __init__.py
│ │
│ ├───embree2
│ │ embree.lib
│ │ rtcore.h
│ │ rtcore.isph
│ │ rtcore_builder.h
│ │ rtcore_geometry.h
│ │ rtcore_geometry.isph
│ │ rtcore_geometry_user.h
│ │ rtcore_geometry_user.isph
│ │ rtcore_ray.h
│ │ rtcore_ray.isph
│ │ rtcore_scene.h
│ │ rtcore_scene.isph
│ │ rtcore_version.h
│ │ tbb.lib
│ │ tbbmalloc.lib
│ │
│ └───__pycache__
│ __init__.cpython-38.pyc
│
└───tests
test_intersection.py
The code functions properly once I hand-copy the DLLs into the generated .egg folder in my .venv\Lib\site-packages folder:
pyembree-0.1.6-py3.8-win-amd64.egg
├───EGG-INFO
│ dependency_links.txt
│ native_libs.txt
│ not-zip-safe
│ PKG-INFO
│ SOURCES.txt
│ top_level.txt
│
└───pyembree
│ embree.dll
│ freeglut.dll
│ mesh_construction.cp38-win_amd64.pyd
│ mesh_construction.cpp
│ mesh_construction.py
│ rtcore.cp38-win_amd64.pyd
│ rtcore.cpp
│ rtcore.py
│ rtcore_scene.cp38-win_amd64.pyd
│ rtcore_scene.cpp
│ rtcore_scene.py
│ tbb.dll
│ tbbmalloc.dll
│ triangles.cp38-win_amd64.pyd
│ triangles.cpp
│ triangles.py
│ __init__.py
│
└───__pycache__
mesh_construction.cpython-38.pyc
rtcore.cpython-38.pyc
rtcore_scene.cpython-38.pyc
triangles.cpython-38.pyc
__init__.cpython-38.pyc
However, how can I tell Python to copy these DLLs over? Can I put something in my setup.py file?
Edit: Per @ead's comments, the setup.py can be updated to the following to automate copying the DLLs to the right folder (thanks @ead!):
import os
from setuptools import find_packages, setup
import numpy as np
from Cython.Build import cythonize
from Cython.Distutils import build_ext

include_path = [
    np.get_include(),
]

ext_modules = cythonize("pyembree/*.pyx", language_level=3, include_path=include_path)
for ext in ext_modules:
    ext.include_dirs = include_path
    ext.libraries = [
        "pyembree/embree2/lib/embree",
        "pyembree/embree2/lib/tbb",
        "pyembree/embree2/lib/tbbmalloc",
    ]

setup(
    name="pyembree",
    version="0.1.6",
    cmdclass={"build_ext": build_ext},
    ext_modules=ext_modules,
    zip_safe=False,
    packages=find_packages(),
    include_package_data=True,
    package_data={"pyembree": ["*.cpp", "*.dll"]},
)

Create a dictionary using "if... else" in Julia?

I'm working with a sample CSV file that lists nursing home residents' DOBs and DODs. I used those fields to calculate their age at death, and now I'm trying to create a dictionary that "bins" their age at death into groups. I'd like the bins to be 1-25, 26-50, 51-75, and 76-100.
Is there a concise way to make a Dict(subject_id, age, age_bin) using "if... else" syntax?
For example: (John, 76, "76-100"), (Moira, 58, "51-75").
So far I have:
# import modules
using CSV
using DataFrames
using Dates

# Open, read, write desired files
input_file = open("../data/FILE.csv", "r")
output_file = open("FILE_output.txt", "w")

# Use to later skip header line
file_flag = 0
for line in readlines(input_file)
    if file_flag == 0
        global file_flag = 1
        continue
    end

    # Define what each field in FILE corresponds to
    line_array = split(line, ",")
    subject_id = line_array[2]
    gender = line_array[3]
    date_of_birth = line_array[4]
    date_of_death = line_array[5]

    # Get yyyy-mm-dd only (first ten characters) from fields 4 and 5:
    date_birth = date_of_birth[1:10]
    date_death = date_of_death[1:10]

    # Create DateFormat; use to calculate age
    date_format = DateFormat("y-m-d")
    age_days = Date(date_death, date_format) - Date(date_birth, date_format)
    age_years = round(Dates.value(age_days) / 365.25, digits=0)

    # Use "if else" statement to determine values
    keys = age_years
    function values()
        if age_years <= 25
            return "0-25"
        elseif age_years <= 50
            return "26-50"
        elseif age_years <= 75
            return "51-75"
        else
            return "76-100"
        end
    end
    values()

    # Create desired dictionary
    age_death_dict = Dict(zip(keys, values()))
end
Edit: or is there a better way to approach this using DataFrames?
To answer your question ("is there a concise way using if/else?"): probably not, given that you have 5 cases (age ranges) to account for. Suppose you have names and ages in two separate lists (which I assume you generate from your example code, although I can't see the input CSVs):
julia> name = ["John", "Mary", "Robert", "Cindy", "Beatrice"];

julia> ages = [24, 73, 75, 69, 90];

julia> function bin_age_ifelse(a)
           if a < 1
               return "Invalid age"
           elseif 1 <= a <= 25
               return "1-25"
           elseif 25 < a <= 50
               return "26-50"
           elseif 50 < a <= 75
               return "51-75"
           else
               return "76-100"
           end
       end
bin_age_ifelse (generic function with 1 method)

julia> binned_ifelse = Dict([n => [a, bin_age_ifelse(a)] for (n, a) in zip(name, ages)])
Dict{String, Vector{Any}} with 5 entries:
  "John"     => [24, "1-25"]
  "Mary"     => [73, "51-75"]
  "Beatrice" => [90, "76-100"]
  "Robert"   => [75, "51-75"]
  "Cindy"    => [69, "51-75"]
Here's an option for the binning function to avoid if/else syntax, although there are probably yet more elegant ways to do it:
julia> function bin_age(a)
           bins = [1:25, 26:50, 51:75, 76:100]
           for b in bins
               if a in b
                   return "$(b[1])-$(b[end])"
               end
           end
       end
bin_age (generic function with 1 method)

julia> bin_age(84)
"76-100"
I've taken some liberties with the format of the answer, using the name as the key, since your original question describes a dict format that doesn't really make sense in Julia. If you'd like to have the keys be the age ranges, you could construct the dictionary above and then invert it as described here (with some modification since the values above have two entries).
If you don't care about name, age, or age range being a key, then I would suggest using DataFrames.jl:
julia> using DataFrames
julia> d = DataFrame(name=name, age=ages, age_range=[bin_age(a) for a in ages])
5×3 DataFrame
 Row │ name      age    age_range
     │ String    Int64  String
─────┼────────────────────────────
   1 │ John         24  1-25
   2 │ Mary         73  51-75
   3 │ Robert       75  51-75
   4 │ Cindy        69  51-75
   5 │ Beatrice     90  76-100

How to render two pd.DataFrames in jupyter notebook side by side?

Is there an easy way to quickly see contents of two pd.DataFrames side-by-side in Jupyter notebooks?
df1 = pd.DataFrame([(1,2),(3,4)], columns=['a', 'b'])
df2 = pd.DataFrame([(1.1,2.1),(3.1,4.1)], columns=['a', 'b'])
df1, df2
You should try this function from @Wes_McKinney:
def side_by_side(*objs, **kwds):
    '''Print objects side by side'''
    from pandas.io.formats.printing import adjoin
    space = kwds.get('space', 4)
    reprs = [repr(obj).split('\n') for obj in objs]
    print(adjoin(space, *reprs))
# building a test case of two DataFrames
import pandas as pd
import numpy as np

n, p = (10, 3)  # dfs' shape

# dfs' index and column labels
index_rowA = [t[0] + str(t[1]) for t in zip(['rA'] * n, range(n))]
index_colA = [t[0] + str(t[1]) for t in zip(['cA'] * p, range(p))]
index_rowB = [t[0] + str(t[1]) for t in zip(['rB'] * n, range(n))]
index_colB = [t[0] + str(t[1]) for t in zip(['cB'] * p, range(p))]

# building the df A and B
dfA = pd.DataFrame(np.random.rand(n, p), index=index_rowA, columns=index_colA)
dfB = pd.DataFrame(np.random.rand(n, p), index=index_rowB, columns=index_colB)
side_by_side(dfA, dfB)
Outputs:
cA0 cA1 cA2 cB0 cB1 cB2
rA0 0.708763 0.665374 0.718613 rB0 0.320085 0.677422 0.722697
rA1 0.120551 0.277301 0.646337 rB1 0.682488 0.273689 0.871989
rA2 0.372386 0.953481 0.934957 rB2 0.015203 0.525465 0.223897
rA3 0.456871 0.170596 0.501412 rB3 0.941295 0.901428 0.329489
rA4 0.049491 0.486030 0.365886 rB4 0.597779 0.201423 0.010794
rA5 0.277720 0.436428 0.533683 rB5 0.701220 0.261684 0.502301
rA6 0.391705 0.982510 0.561823 rB6 0.182609 0.140215 0.389426
rA7 0.827597 0.105354 0.180547 rB7 0.041009 0.936011 0.613592
rA8 0.224394 0.975854 0.089130 rB8 0.697824 0.887613 0.972838
rA9 0.433850 0.489714 0.339129 rB9 0.263112 0.355122 0.447154
The closest to what you want could be:
> df1.merge(df2, right_index=True, left_index=True, suffixes=("_1", "_2"))
a_1 b_1 a_2 b_2
0 1 2 1.1 2.1
1 3 4 3.1 4.1
It's not specific to the notebook, but it will work, and it's not that complicated. Another solution would be to convert your DataFrames to images and put them side by side in subplots, but that's a bit far-fetched and complicated.
I ended up using a helper function to quickly compare two data frames:
def cmp(df1, df2, topn=10):
    n = topn
    a = df1.reset_index().head(n=n)
    b = df2.reset_index().head(n=n)
    span = pd.DataFrame(data=[('-',) for _ in range(n)], columns=['sep'])
    a = a.merge(span, right_index=True, left_index=True)
    return a.merge(b, right_index=True, left_index=True, suffixes=['_L', '_R'])
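Another notebook-native option is to render each frame's HTML inside inline-block divs and hand the combined string to IPython. A sketch (the helper name side_by_side_html is mine; rendering relies on IPython's display_html, which is available inside Jupyter):

```python
import pandas as pd

def side_by_side_html(*dfs, gap="2em"):
    """Build one HTML string with each DataFrame in an inline-block <div>."""
    blocks = [
        f'<div style="display:inline-block; vertical-align:top; '
        f'margin-right:{gap}">{df.to_html()}</div>'
        for df in dfs
    ]
    return "".join(blocks)

df1 = pd.DataFrame([(1, 2), (3, 4)], columns=["a", "b"])
df2 = pd.DataFrame([(1.1, 2.1), (3.1, 4.1)], columns=["a", "b"])

# In a notebook cell, render the combined HTML:
# from IPython.display import display_html
# display_html(side_by_side_html(df1, df2), raw=True)
```

Unlike the merge-based approaches, this keeps each frame's own index and column headers intact.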