Euclidean distance between two dataframes

I have two dataframes. For simplicity, assume they each have only one entry:
+--------------------+
|               entry|
+--------------------+
|  [0.34, 0.56, 0.87]|
+--------------------+

+--------------------+
|               entry|
+--------------------+
|  [0.12, 0.82, 0.98]|
+--------------------+
How can I compute the Euclidean distance between the entries of these two dataframes? Right now I have the following code:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.spatial import distance
inference = udf(lambda x, y: float(distance.euclidean(x, y)), DoubleType())
inference_result = inference(a, b)
but I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/udf.py", line 197, in wrapper
return self(*args)
File "/usr/lib/spark/python/pyspark/sql/udf.py", line 177, in __call__
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in _to_seq
cols = [converter(c) for c in cols]
File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in <listcomp>
cols = [converter(c) for c in cols]
File "/usr/lib/spark/python/pyspark/sql/column.py", line 56, in _to_java_column
"function.".format(col, type(col)))
TypeError: Invalid argument, not a string or column: DataFrame[embedding:
array<float>] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column
literals, use 'lit', 'array', 'struct' or 'create_map' function.
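The traceback is telling you that the UDF was given whole DataFrames (a and b) where it expects Columns (or column names). A minimal sketch of one way to fix this, assuming the array column is called entry as in the example and that combining the two single-row dataframes with a crossJoin is acceptable:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

inference = udf(lambda x, y: float(distance.euclidean(x, y)), DoubleType())

# Put both array columns into one dataframe, then pass Columns to the UDF.
joined = (a.withColumnRenamed("entry", "entry_a")
           .crossJoin(b.withColumnRenamed("entry", "entry_b")))
result = joined.withColumn("euclidean", inference("entry_a", "entry_b"))
result.show()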

Related

Replacing unique array of strings in a row using pyspark

I am trying the following code, which replaces an empty list with the set of all unique values of the column "apples_set" when the condition "all" is satisfied.
The column "apples_set" is of type Array[String].
Data frame looks like this:
apples_patterns.show()
+--------------------+-----------------+
| apples_logic_string|       apples_set|
+--------------------+-----------------+
|               "234"|      ["43","54"]|
|                "65"|           ["95"]|
|               "all"|               []|
|                "76"|      ["84","67"]|
+--------------------+-----------------+
The code:
unique_all_apples = set(apples_patterns.agg(F.flatten(F.collect_set('apples_set'))).head()[0]) # noqa
error_patterns = apples_patterns.withColumn('apples_set', F.when(F.col('apples_logic_string') == 'all',
                                                                 unique_all_apples).otherwise(F.col('apples_set')))
The Error:
Traceback (most recent call last):
File "/myproject/datasets/apples_matching.py", line 24, in compute
apples_patterns = apples_patterns.withColumn('apples_set', F.when(F.col('apples_logic_string') == 'all',
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/pyspark/sql/functions.py", line 1518, in when
jc = sc._jvm.functions.when(condition._jc, v)
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.when.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [43,54,95,84,67]
You can use the array function (see the array documentation). In your case you may use it like this:
F.array([F.lit(x) for x in unique_all_apples])
Sample code:
import pyspark.sql.functions as F
x = [("234", ["43", "54"]), ("65", ["95"]), ("all", []), ("76", ["84", "67"])]
apples_patterns = spark.createDataFrame(x, schema=["apples_logic_string", "apples_set"])
unique_all_apples = set(
    apples_patterns.agg(F.flatten(F.collect_set("apples_set"))).head()[0]
)
error_patterns = apples_patterns.withColumn(
    "apples_set",
    F.when(
        F.col("apples_logic_string") == "all",
        F.array([F.lit(x) for x in unique_all_apples]),
    ).otherwise(F.col("apples_set")),
)
And the output:
+-------------------+--------------------+
|apples_logic_string|          apples_set|
+-------------------+--------------------+
|                234|            [43, 54]|
|                 65|                [95]|
|                all|[54, 95, 43, 67, 84]|
|                 76|            [84, 67]|
+-------------------+--------------------+
The easiest solution is to create another dataframe with one row that contains all the distinct apples_set values, using explode then collect_set, and after that join it to the original dataframe:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
  ("234", Seq("43", "54")),
  ("65", Seq("95")),
  ("all", Seq()),
  ("76", Seq("84", "67"))
)
val df = spark.sparkContext.parallelize(data).toDF("apples_logic_string", "apples_set")

val allDf = df.select(explode(col("apples_set")).as("apples_set"))
  .agg(collect_set("apples_set").as("all_apples_set"))
  .withColumn("apples_logic_string", lit("all"))

df.join(broadcast(allDf), Seq("apples_logic_string"), "left")
  .withColumn("apples_set", when(col("apples_logic_string").equalTo("all"), col("all_apples_set")).otherwise(col("apples_set")))
  .drop("all_apples_set").show(false)
+-------------------+--------------------+
|apples_logic_string|apples_set          |
+-------------------+--------------------+
|234                |[43, 54]            |
|65                 |[95]                |
|all                |[84, 95, 67, 54, 43]|
|76                 |[84, 67]            |
+-------------------+--------------------+

Exporting pandas df with column of tuples to BQ throws pyarrow error

I have the following pandas dataframe:
import pandas as pd
df = pd.DataFrame({"id": [1, 2, 3], "items": [('a', 'b'), ('a', 'b', 'c'), tuple('d')]})
>>> print(df)
   id      items
0   1     (a, b)
1   2  (a, b, c)
2   3       (d,)
After registering my GCP/BQ credentials in the normal way...
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path_to_my_creds.json"
... I try to export it to a BQ table:
import pandas_gbq
pandas_gbq.to_gbq(df, "my_table_name", if_exists="replace")
but I keep getting the following error:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/Users/max.epstein/opt/anaconda3/envs/rec2env/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 1205, in to_gbq
...
File "/Users/max.epstein/opt/anaconda3/envs/rec2env/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 342, in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object
I have tried converting the tuple column to string with df = df.astype({"items":str}) and adding a table_schema param to the pandas_gbq.to_gbq... line but I keep getting this same error.
I have also tried replacing the pandas_gbq.to_gbq... line with the bq_client.load_table_from_dataframe method described here but still get the same pyarrow.lib.ArrowTypeError: Expected bytes, got a 'tuple' object error...
I think this is an issue with pandas dtypes being separate from Python types: astype converts the values, but the column's pandas dtype stays object rather than becoming a string dtype. Try also converting the dtype to match the type after the astype statement, so that
df = df.astype({"items": str})
is replaced with:
df = df.astype({"items": str})
df = df.convert_dtypes()
Let me know if this works.
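For what it's worth, here is a small self-contained check of what that change does to the dtypes (it only inspects the dataframe locally and does not call BigQuery; the column names are taken from the question):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "items": [("a", "b"), ("a", "b", "c"), tuple("d")]})
print(df.dtypes)                 # items is object, holding Python tuples

df = df.astype({"items": str})   # values become strings, dtype is still object
df = df.convert_dtypes()         # dtype becomes the pandas string dtype
print(df.dtypes)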

Can't create Dask dataframe although Pandas dataframe gets created for the same query (sqlalchemy.exc.NoSuchTableError)

Hello, I am trying to create a Dask DataFrame by pulling data from an Oracle database as follows:
import cx_Oracle
import pandas as pd
import dask
import dask.dataframe as dd

# Build connection string/URL
user = 'user'
pw = 'pw'
host = 'xxx-yyy-x000'
port = '9999'
sid = 'XXXXX000'
ora_uri = 'oracle+cx_oracle://{user}:{password}@{sid}'.format(user=user, password=pw, sid=cx_Oracle.makedsn(host, port, sid))
tstquery = "select ID from EXAMPLE where rownum <= 5"

# Create Pandas Dataframe from ORACLE Query pull
tstdf1 = pd.read_sql(tstquery,
                     con=ora_uri)
print("Dataframe tstdf1 created by pd.read_sql")
print(tstdf1.info())

# Create Dask Dataframe from ORACLE Query pull
tstdf2 = dd.read_sql_table(table=tstquery,
                           uri=ora_uri,
                           index_col='ID')
print(tstdf2.info())
As you can see, the pandas DF gets created but not the Dask DF. The following is the stdout:
Dataframe tstdf1 created by pd.read_sql
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
ID 5 non-null int64
dtypes: int64(1)
memory usage: 120.0 bytes
None
Traceback (most recent call last):
File "dk_test.py", line 40, in <module>
,index_col = 'ID'
File "---------------------------python3.6/site-packages/dask/dataframe/io/sql.py", line 103, in read_sql_table
table = sa.Table(table, m, autoload=True, autoload_with=engine, schema=schema)
File "<string>", line 2, in __new__
File "---------------------------python3.6/site-packages/sqlalchemy/util/deprecations.py", line 130, in warned
return fn(*args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 496, in __new__
metadata._remove_table(name, schema)
File "---------------------------python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "---------------------------python3.6/site-packages/sqlalchemy/util/compat.py", line 154, in reraise
raise value
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 491, in __new__
table._init(name, metadata, *args, **kw)
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 585, in _init
resolve_fks=resolve_fks,
File "---------------------------python3.6/site-packages/sqlalchemy/sql/schema.py", line 609, in _autoload
_extend_on=_extend_on,
File "---------------------------python3.6/site-packages/sqlalchemy/engine/base.py", line 2147, in run_callable
return conn.run_callable(callable_, *args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/engine/base.py", line 1604, in run_callable
return callable_(self, *args, **kwargs)
File "---------------------------python3.6/site-packages/sqlalchemy/engine/default.py", line 429, in reflecttable
table, include_columns, exclude_columns, resolve_fks, **opts
File "---------------------------python3.6/site-packages/sqlalchemy/engine/reflection.py", line 653, in reflecttable
raise exc.NoSuchTableError(table.name)
sqlalchemy.exc.NoSuchTableError: select ID from EXAMPLE where rownum <= 5
Needless to say, the table exists (as demonstrated by the creation of the pandas DF), and there is an index on the column ID as well. What is the problem?
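The traceback shows what is going on: dask's read_sql_table hands the table argument to sqlalchemy.Table(...) for reflection, so the whole query string is being looked up as a table name and SQLAlchemy raises NoSuchTableError. A minimal sketch of the usual fix, assuming it is acceptable to reflect the real table and narrow the pull with the columns argument instead of SQL text (parameter names follow the call in the question):

import dask.dataframe as dd

# Pass the table name, not a query; read_sql_table reflects the table and
# builds its own SELECT, partitioned on index_col.
tstdf2 = dd.read_sql_table(table="EXAMPLE",
                           uri=ora_uri,
                           index_col="ID",
                           columns=["ID"])
print(tstdf2.head())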

Arithmetic in pandas HDF5 queries

Why am I getting an error when I try to do simple arithmetic on constants in an HDF5 where clause? Here's an example:
>>> import pandas
>>> import numpy as np
>>> d = pandas.DataFrame({"A": np.arange(10), "B": np.random.randint(1, 100, 10)})
>>> store = pandas.HDFStore('teststore.h5', mode='w')
>>> store.append('thingy', d, format='table', data_columns=True, append=False)
>>> store.select('thingy', where="B>50")
   A   B
0  0  61
1  1  63
6  6  80
7  7  79
8  8  52
9  9  82
>>> store.select('thingy', where="B>40+10")
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
store.select('thingy', where="B>40+10")
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 682, in select
return it.get_result()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 1365, in get_result
results = self.func(self.start, self.stop, where)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 675, in func
columns=columns, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4006, in read
if not self.read_axes(where=where, **kwargs):
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 3212, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\io\pytables.py", line 4527, in __init__
self.condition, self.filter = self.terms.evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 580, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 122, in prune
res = pr(left.value, right.prune(klass))
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "c:\users\brenbarn\documents\python\extensions\pandas\pandas\computation\pytables.py", line 317, in evaluate
raise ValueError("query term is not valid [%s]" % self)
ValueError: query term is not valid [[Condition : [None]]]
Querying directly on the underlying pytables object seems to work:
>>> for row in store.get_storer('thingy').table.where("B>40+10"):
... print(row[:])
(0L, 0, 61)
(1L, 1, 63)
(6L, 6, 80)
(7L, 7, 79)
(8L, 8, 52)
(9L, 9, 82)
So what is going on here?
This is simply not supported. I suppose it could fail with a slightly better message. It is trying to combine the two nodes (the comparison and the +10) with an and, and it doesn't know how to deal with that because the second node is not a comparison operation.
I suppose it could be implemented, but IMHO it is needlessly complex.
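A simple workaround sketch, assuming the store opened in the question is still available: do the arithmetic in Python and put only the resulting constant into the where string.

# The where clause then contains a plain comparison, which the query parser
# does support.
threshold = 40 + 10
result = store.select('thingy', where="B>{}".format(threshold))
print(result)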

Incompatible indexer with Series

Why do I get an error:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5] += 1
Output:
4    0
5    0
dtype: int64
Traceback (most recent call last):
File "temp1.py", line 9, in <module>
a.loc[4:5] += 1
File "lib\site-packages\pandas\core\indexing.py", line 88, in __setitem__
self._setitem_with_indexer(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 177, in _setitem_with_indexer
value = self._align_series(indexer, value)
File "lib\site-packages\pandas\core\indexing.py", line 206, in _align_series
raise ValueError('Incompatible indexer with Series')
ValueError: Incompatible indexer with Series
Pandas 0.12.
I think this is a bug; you can work around it by using a tuple index:
import pandas as pd
a = pd.Series(index=[4,5,6], data=0)
print a.loc[4:5]
a.loc[4:5,] += 1
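As a side note (not part of the original answer): on recent pandas versions the plain slice assignment no longer raises, so the tuple trick should only be needed on old releases such as 0.12. A quick Python 3 check:

import pandas as pd

# On current pandas this runs without the "Incompatible indexer" error.
a = pd.Series(index=[4, 5, 6], data=0)
a.loc[4:5] += 1
print(a)  # labels 4 and 5 become 1, label 6 stays 0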