Replacing unique array of strings in a row using pyspark - dataframe

I am trying the following code, which replaces an empty list with the unique array of all values from the column "apples_set" when the condition "all" is satisfied.
The column "apples_logic_string" is of type Array[String].
The data frame looks like this:
apples_patterns.show()
+-------------------+-----------+
|apples_logic_string| apples_set|
+-------------------+-----------+
|              "234"|["43","54"]|
|               "65"|     ["95"]|
|              "all"|         []|
|               "76"|["84","67"]|
+-------------------+-----------+
The code:
unique_all_apples = set(apples_patterns.agg(F.flatten(F.collect_set('apples_set'))).head()[0]) # noqa
error_patterns = apples_patterns.withColumn('apples_set', F.when(F.col('apples_logic_string') == 'all',
                                                                 unique_all_apples).otherwise(F.col('apples_set')))
The Error:
Traceback (most recent call last):
File "/myproject/datasets/apples_matching.py", line 24, in compute
apples_patterns = apples_patterns.withColumn('apples_set', F.when(F.col('apples_logic_string') == 'all',
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/pyspark/sql/functions.py", line 1518, in when
jc = sc._jvm.functions.when(condition._jc, v)
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/scratch/asset-install/1c9821b4f6adc95ac4d5f15ff109001b/miniconda38/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.when.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [43,54,95,84,67]

You can use the array function (see the array documentation).
In your case you can use it like this:
F.array([F.lit(x) for x in unique_all_apples])
Sample code:
import pyspark.sql.functions as F
x = [("234", ["43", "54"]), ("65", ["95"]), ("all", []), ("76", ["84", "67"])]
apples_patterns = spark.createDataFrame(x, schema=["apples_logic_string", "apples_set"])
unique_all_apples = set(
    apples_patterns.agg(F.flatten(F.collect_set("apples_set"))).head()[0]
)
error_patterns = apples_patterns.withColumn(
    "apples_set",
    F.when(
        F.col("apples_logic_string") == "all",
        F.array([F.lit(x) for x in unique_all_apples]),
    ).otherwise(F.col("apples_set")),
)
And the output:
+-------------------+--------------------+
|apples_logic_string| apples_set|
+-------------------+--------------------+
| 234| [43, 54]|
| 65| [95]|
| all|[54, 95, 43, 67, 84]|
| 76| [84, 67]|
+-------------------+--------------------+

The easiest solution is to create another dataframe with a single row that contains all distinct apples (using explode followed by collect_set), and then join it back to the original dataframe:
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("234", Seq("43", "54")),
  ("65", Seq("95")),
  ("all", Seq()),
  ("76", Seq("84", "67"))
)
val df = spark.sparkContext.parallelize(data).toDF("apples_logic_string", "apples_set")

val allDf = df.select(explode(col("apples_set")).as("apples_set"))
  .agg(collect_set("apples_set").as("all_apples_set"))
  .withColumn("apples_logic_string", lit("all"))

df.join(broadcast(allDf), Seq("apples_logic_string"), "left")
  .withColumn("apples_set", when(col("apples_logic_string").equalTo("all"), col("all_apples_set")).otherwise(col("apples_set")))
  .drop("all_apples_set")
  .show(false)
+-------------------+--------------------+
|apples_logic_string|apples_set |
+-------------------+--------------------+
|234 |[43, 54] |
|65 |[95] |
|all |[84, 95, 67, 54, 43]|
|76 |[84, 67] |
+-------------------+--------------------+
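For reference, a rough PySpark sketch of the same join-based approach (untested, using the apples_patterns dataframe from the question) could look like this:
import pyspark.sql.functions as F

# Build a one-row dataframe holding every distinct apple, labelled "all".
all_df = (
    apples_patterns
    .select(F.explode("apples_set").alias("apple"))
    .agg(F.collect_set("apple").alias("all_apples_set"))
    .withColumn("apples_logic_string", F.lit("all"))
)

# Left-join it back and swap in the full set only for the "all" rows.
result = (
    apples_patterns
    .join(F.broadcast(all_df), on="apples_logic_string", how="left")
    .withColumn(
        "apples_set",
        F.when(F.col("apples_logic_string") == "all", F.col("all_apples_set"))
        .otherwise(F.col("apples_set")),
    )
    .drop("all_apples_set")
)
result.show(truncate=False)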

Related

euclidean distance between two dataframes

I have two dataframes. For simplicity, assume they each have only one entry:
+--------------------+
| entry |
+--------------------+
|[0.34, 0.56, 0.87] |
+--------------------+
+--------------------+
| entry |
+--------------------+
|[0.12, 0.82, 0.98] |
+--------------------+
How can I compute the euclidean distance between the entries of these two dataframes? Right now I have the following code:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.spatial import distance
inference = udf(lambda x, y: float(distance.euclidean(x, y)), DoubleType())
inference_result = inference(a, b)
but I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/udf.py", line 197, in wrapper
return self(*args)
File "/usr/lib/spark/python/pyspark/sql/udf.py", line 177, in __call__
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in _to_seq
cols = [converter(c) for c in cols]
File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in <listcomp>
cols = [converter(c) for c in cols]
File "/usr/lib/spark/python/pyspark/sql/column.py", line 56, in _to_java_column
"function.".format(col, type(col)))
TypeError: Invalid argument, not a string or column: DataFrame[embedding:
array<float>] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column
literals, use 'lit', 'array', 'struct' or 'create_map' function.
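The traceback shows the UDF being called with the dataframes a and b themselves rather than with columns. A minimal, untested sketch of one way around this, assuming both frames contain a single row and an entry column, is to combine the two rows first and then apply the UDF to the resulting columns:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

inference = udf(lambda x, y: float(distance.euclidean(x, y)), DoubleType())

# A UDF takes columns, not dataframes; a cross join is cheap here because
# each frame holds exactly one row.
joined = a.withColumnRenamed("entry", "entry_a").crossJoin(
    b.withColumnRenamed("entry", "entry_b")
)
joined.select(inference("entry_a", "entry_b").alias("euclidean_distance")).show()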

How to specify meta for Dask dataframe read_sql_table(meta=??)?

I am calling dask.dataframe.read_sql_table() on a table includes_retweets with the following types:
Table "public.includes_retweets"
Column | Type |
-------------------------+---------+
level_0 | integer |
reshared_message_id | integer |
resharing_message_id | integer |
resharing_user_id | integer |
index | integer |
original_message_id | integer |
political_link | text |
link_slant | text |
political_phrases | text[] |
phrase_slant | text |
slant_overall | text |
original_user_id | integer |
intermediate_user_id | integer |
intermediate_message_id | integer |
When I don't specify the meta parameter, Dask reads in the data as
level_0 reshared_message_id resharing_message_id resharing_user_id index original_message_id political_link link_slant political_phrases phrase_slant slant_overall original_user_id intermediate_user_id
intermediate_message_id
8135 0 209789789 8135 209780224 26 209780224 news.yahoo.com democrat None None democrat 3178208 209780224
8135 1 209789789 8135 209780224 27 209780224 news.yahoo.com democrat None None democrat 3178208 209780224
1785557 0 209829307 1785557 209828919 94 209828919 None None [{, ", T, R, U, M, P, , 2, 0, 2, 0, ", }] republican republican 237437 209828919
In what is returned by Dask, I want the 3rd row to have [{TRUMP2020}] for political_phrases.
How do I specify meta correctly?
I have tried (per https://github.com/dask/dask/issues/4723)
meta3 = {
    'level_0': [1], 'reshared_message_id': [1], 'resharing_message_id': [1],
    'resharing_user_id': [1], 'index': [1], 'original_message_id': [1],
    'political_link': ["Text"], 'link_slant': ["Text"],
    'political_phrases': [["Text", "secondText"]], 'phrase_slant': ["Text"],
    'slant_overall': ["Text"], 'original_user_id': [1],
    'intermediate_user_id': [1], 'intermediate_message_id': [1],
}
meta4 = pd.DataFrame(meta3)[0:0]
and (per Create Empty Dataframe in Pandas specifying column types)
dtypes = np.dtype([
    ('level_0', int), ('reshared_message_id', int), ('resharing_message_id', int),
    ('resharing_user_id', int), ('index', int), ('original_message_id', int),
    ('political_link', str), ('link_slant', str), ('political_phrases', np.str_),
    ('phrase_slant', str), ('slant_overall', str), ('original_user_id', int),
    ('intermediate_user_id', int), ('intermediate_message_id', int),
])
meta4 = np.empty(0, dtype=dtypes)
And then passing these values (or variations of them) in for meta:
dask.dataframe.read_sql_table(includes_retweets_table, db_uri, meta=meta4, index_col='intermediate_message_id', npartitions=12)
When I print what is returned from dask.dataframe.read_sql_table(), I get
Dask DataFrame Structure:
level_0 reshared_message_id resharing_message_id resharing_user_id index original_message_id political_link link_slant political_phrases phrase_slant slant_overall original_user_id intermediate_user_id intermediate_message_id
npartitions=12
8135.0 int64 int64 int64 int64 int64 int64 object object object object object int64 int64 int64
17494966.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
192363276.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ...
209850107.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ...
When I try to execute .compute(), I get the error below:
distributed.worker - WARNING - Compute Failed
Function: execute_task
args: ((<function check_meta at 0x7f8b040ca290>, (<function apply at 0x7f8b02de3830>, <function _read_sql_chunk at 0x7f8b041dc0e0>, [<sqlalchemy.sql.selectable.Select at 0x7f8b04353710; Select object>, 'postgresql+psycopg2://postgres:password#localhost:5432/stocktwits', Empty DataFrame
Columns: [level_0, reshared_message_id, resharing_message_id, resharing_user_id, index, original_message_id, political_link, link_slant, political_phrases, phrase_slant, slant_overall, original_user_id, intermediate_user_id, intermediate_message_id]
Index: []], (<class 'dict'>, [['engine_kwargs', (<class 'dict'>, [])], ['index_col', 'intermediate_message_id']])), Empty DataFrame
Columns: [level_0, reshared_message_id, resharing_message_id, resharing_user_id, index, original_message_id, political_link, link_slant, political_phrases, phrase_slant, slant_overall, original_user_id, intermediate_user_id, intermediate_message_id]
Index: [], 'from_delayed'))
kwargs: {}
Exception: KeyError('Only a column name can be used for the key in a dtype mappings argument.')
Traceback (most recent call last):
File "interpreterScript.py", line 359, in <module>
main()
File "interpreterScript.py", line 162, in main
print(includes_retweets.compute())
File "/opt/anaconda3/lib/python3.7/site-packages/dask/base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/dask/base.py", line 444, in compute
results = schedule(dsk, keys, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 2682, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 1982, in gather
asynchronous=asynchronous,
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 832, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py", line 339, in sync
raise exc.with_traceback(tb)
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py", line 323, in f
result[0] = yield future
File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 1841, in _gather
raise exception.with_traceback(traceback)
File "/opt/anaconda3/lib/python3.7/site-packages/dask/utils.py", line 31, in apply
return func(*args, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/sql.py", line 216, in _read_sql_chunk
return df.astype(meta.dtypes.to_dict(), copy=False)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5857, in astype
"Only a column name can be used for the "
KeyError: 'Only a column name can be used for the key in a dtype mappings argument.'
This error doesn't make sense to me because I am including all of the column names from the table includes_retweets with the correct data types.
Dask dataframes use pandas data types, and pandas uses the Python object dtype for anything complex, so you want an entry like {"political_phrases": object}.
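For example, a minimal sketch of such a meta frame (column names taken from the table above; the int64 dtypes and the set_index call, which matches the index_col argument, are assumptions):
import dask.dataframe as dd
import pandas as pd

# Empty frame that only carries dtypes; the text and text[] columns
# (including political_phrases) use the plain Python object dtype.
meta = pd.DataFrame({
    "level_0": pd.Series(dtype="int64"),
    "reshared_message_id": pd.Series(dtype="int64"),
    "resharing_message_id": pd.Series(dtype="int64"),
    "resharing_user_id": pd.Series(dtype="int64"),
    "index": pd.Series(dtype="int64"),
    "original_message_id": pd.Series(dtype="int64"),
    "political_link": pd.Series(dtype="object"),
    "link_slant": pd.Series(dtype="object"),
    "political_phrases": pd.Series(dtype="object"),
    "phrase_slant": pd.Series(dtype="object"),
    "slant_overall": pd.Series(dtype="object"),
    "original_user_id": pd.Series(dtype="int64"),
    "intermediate_user_id": pd.Series(dtype="int64"),
    "intermediate_message_id": pd.Series(dtype="int64"),
}).set_index("intermediate_message_id")  # match the index_col used below

includes_retweets = dd.read_sql_table(
    includes_retweets_table, db_uri, meta=meta,
    index_col="intermediate_message_id", npartitions=12,
)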

PySpark DataFrame (pyspark.sql.dataframe.DataFrame) To CSV

I have a transactions table like this:
transactions.show()
+---------+-----------------------+--------------------+
|person_id|collect_set(title_name)| prediction|
+---------+-----------------------+--------------------+
| 3513736| [Make or Break, S...|[Love In Island.....|
| 3516443| [The Blacklist]|[Moordvrouw, The ...|
| 3537643| [S4 - Dutch progr...|[Vamos met de Fam...|
| 3547688| [Phileine Zegt So...| []|
| 3549345| [The Wolf of Wall...| []|
| 3550565| [Achtste Groepers...| []|
| 3553669| [Mega Mindy: Reis...| []|
| 3558162| [Snitch, Philomen...| []|
| 3561387| [Automata, The Hi...|[Bella Donna's, M...|
| 3570126| [The Wolf of Wall...| []|
| 3576602| [Harry & Meghan: ...|[Weg van Jou, Moo...|
| 3586366| [Gooische Vrouwen...|[Familieweekend, ...|
| 3586560| [Hooligans 3: Nev...| []|
| 3590208| [S2 - Dutch drama...|[Love In Island.....|
+---------+-----------------------+--------------------+
The structure of the table looks like
transactions.printSchema()
root
|-- person_id: long (nullable = false)
|-- collect_set(title_name): array (nullable = true)
| |-- element: string (containsNull = true)
|-- prediction: array (nullable = true)
| |-- element: string (containsNull = true)
Now I want to write this table to CSV, keeping the content of each column. I tried the following:
transactions.repartition(1)\
.write.mode('overwrite')\
.save(path="//Users/King/Documents/my_final.csv", format='csv',sep=',',header = 'true')
However, I get the following error.
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-66-7473346bdbb1> in <module>()
----> 1 vl_assoc_rules_pred.repartition(1).write.mode('overwrite').save(path="s3a://ci-data-apps/rashid/vl-assoc-rules/vl_assoc_rules_pred.csv", format='csv',sep=',',header = 'true')
/usr/lib/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
593 self._jwrite.save()
594 else:
--> 595 self._jwrite.save(path)
596
597 #since(1.4)
/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o840.save.
Could someone please tell me how I can write this table to CSV while keeping the content of each column intact?
Thanks in advance!
Assuming that 'transactions' fits in memory on the driver, you can convert it to pandas and save it as CSV:
transactions.toPandas().to_csv(file_name, sep=',')
You can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use csv data source directly:
df.write.csv('mycsv.csv')
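Note that the truncated Py4JJavaError in the question may simply be the CSV writer rejecting the two array columns, since the CSV data source cannot serialise array<string>. If so, one untested workaround is to join each array into a delimited string before writing (the pipe separator and the renamed titles column are arbitrary choices):
import pyspark.sql.functions as F

# CSV cannot hold array<string> columns, so flatten each array to one string.
flat = (
    transactions
    .withColumnRenamed("collect_set(title_name)", "titles")
    .withColumn("titles", F.concat_ws("|", "titles"))
    .withColumn("prediction", F.concat_ws("|", "prediction"))
)

flat.repartition(1).write.mode("overwrite").csv(
    "/Users/King/Documents/my_final.csv", header=True
)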

Flask-SqlAlchemy Filtering by time section of datetime

I am looking to filter by the time section of a datetime column in Flask-Admin using Flask-SQLAlchemy.
My attempt so far is:
class BaseTimeBetweenFilter(filters.TimeBetweenFilter):
    def apply(self, query, value, alias=None):
        return query.filter(cast(Doctor.datetime, Time) >= value[0],
                            cast(Doctor.datetime, Time) <= value[1]).all()
I've got the time selector showing and if I do
print (value[0])
or
print (value[1])
it prints out the inputted times as expected. However the query is not working.
admin-panel_1 | response = self.full_dispatch_request()
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1641, in full_dispatch_request
admin-panel_1 | rv = self.handle_user_exception(e)
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1544, in handle_user_exception
admin-panel_1 | reraise(exc_type, exc_value, tb)
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 33, in reraise
admin-panel_1 | raise value
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1639, in full_dispatch_request
admin-panel_1 | rv = self.dispatch_request()
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1625, in dispatch_request
admin-panel_1 | return self.view_functions[rule.endpoint](**req.view_args)
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask_admin/base.py", line 69, in inner
admin-panel_1 | return self._run_view(f, *args, **kwargs)
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask_admin/base.py", line 368, in _run_view
admin-panel_1 | return fn(self, *args, **kwargs)
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask_admin/model/base.py", line 1818, in index_view
admin-panel_1 | view_args.search, view_args.filters)
admin-panel_1 | File "/usr/local/lib/python3.5/dist-packages/flask_admin/contrib/sqla/view.py", line 975, in get_list
admin-panel_1 | count = count_query.scalar() if count_query else None
admin-panel_1 | AttributeError: 'list' object has no attribute 'scalar'
Also, I import Time and cast from sqlalchemy; is this OK, or should I be getting them from flask_sqlalchemy?
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy import Time, cast
Flask-Admin constructs a query by joining tables, applying filters, and sorting. After all these procedures it executes the query and collects its results.
Your apply method should return a sqlalchemy.orm.query.Query instance, like the one it receives as the query argument. When you append .all(), the query is executed immediately and you get the result as a list. Remove the .all() call from the return value and your filter should work:
class BaseTimeBetweenFilter(filters.TimeBetweenFilter):
    def apply(self, query, value, alias=None):
        return query.filter(
            cast(Doctor.datetime, Time) >= value[0],
            cast(Doctor.datetime, Time) <= value[1]
        )
The import source doesn't really matter; a flask_sqlalchemy.SQLAlchemy instance exposes the same objects as sqlalchemy:
>>> from flask_sqlalchemy import SQLAlchemy
>>> db = SQLAlchemy()
>>> db.Time
<class 'sqlalchemy.sql.sqltypes.Time'>
>>> db.cast
<function cast at 0x7f60a3e89c80>
>>> from sqlalchemy import Time, cast
>>> Time
<class 'sqlalchemy.sql.sqltypes.Time'>
>>> cast
<function cast at 0x7f60a3e89c80>
>>> db.cast == cast and db.Time == Time
True

How to avoid error "Cannot compare type 'Timestamp' with type 'str'" pandas 0.16.0

I have various dataframes with this format
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-10-23, ..., 2010-06-15]
Length: 161, Freq: None, Timezone: None
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Every now and then, at the execution of this line:
zeros_idx = df[ (df.A==0) | (df.B==0) | (df.C==0) | (df.D==0) ].index
I get the following error with this stack trace:
zeros_idx = df[ (df.A==0) | (df.B==0) | (df.C==0) | (df.D==0) ].index
File "/usr/lib64/python3.4/site-packages/pandas/core/ops.py", line 811, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "/usr/lib64/python3.4/site-packages/pandas/core/frame.py", line 3158, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "/usr/lib64/python3.4/site-packages/pandas/core/frame.py", line 3191, in _combine_match_columns
left, right = self.align(other, join='outer', axis=1, level=level, copy=False)
File "/usr/lib64/python3.4/site-packages/pandas/core/generic.py", line 3143, in align
fill_axis=fill_axis)
File "/usr/lib64/python3.4/site-packages/pandas/core/generic.py", line 3225, in _align_series
return_indexers=True)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 1810, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/tseries/index.py", line 904, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 1820, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 1830, in join
return_indexers=return_indexers)
File "/usr/lib64/python3.4/site-packages/pandas/core/index.py", line 2083, in _join_monotonic
join_index, lidx, ridx = self._outer_indexer(sv, ov)
File "pandas/src/generated.pyx", line 8558, in pandas.algos.outer_join_indexer_object (pandas/algos.c:157803)
File "pandas/tslib.pyx", line 823, in pandas.tslib._Timestamp.__richcmp__ (pandas/tslib.c:15585)
TypeError: Cannot compare type 'Timestamp' with type 'str'