I am using Spark SQL to join three tables, but I get an error when the join has multiple column conditions.
test_table = (T1.join(T2, T1.dtm == T2.kids_dtm, "inner")
              .join(T3, T3.kids_dtm == T1.dtm
                        and T2.room_id == T3.room_id
                        and T2.book_id == T3.book_id, "inner"))
ERROR:
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/python/pyspark/sql/column.py", line 447, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Instead of specifying "and", I have tried "&" and "&&", but neither worked ("&&" is not a Python operator, and a bare "&" fails here because it binds more tightly than "==").
Never mind, the following works once "&" is used with parentheses around each condition:
test_table = (T1.join(T2, T1.dtm == T2.kids_dtm, "inner")
              .join(T3, (T3.kids_dtm == T1.dtm)
                        & (T2.room_id == T3.room_id)
                        & (T2.book_id == T3.book_id), "inner"))
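For reference, the on argument of join also accepts a list of Column expressions, which PySpark combines with AND; that sidesteps the operator-precedence issue entirely. A sketch of the same join in that style:

test_table = (T1.join(T2, T1.dtm == T2.kids_dtm, "inner")
              .join(T3, [T3.kids_dtm == T1.dtm,
                         T2.room_id == T3.room_id,
                         T2.book_id == T3.book_id], "inner"))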
I'm a newbie in PySpark and I want to translate the following pandas scripts into PySpark:
api_param_df = pd.DataFrame([[row[0][0], np.nan] if row[0][1] == '' else row[0] for row in http_path.values], columns=["api", "param"])
df = pd.concat([df['raw'], api_param_df], axis=1)
but I face the following error; the traceback is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-df055fb7d6a1> in <module>()
21 # Notice we also make \? and the second capture group optional so that when there are no query parameters in http path, it returns NaN.
22
---> 23 api_param_df = pd.DataFrame([[row[0][0], np.nan] if row[0][1] == '' else row[0] for row in http_path.values], columns=["api", "param"])
24 df = pd.concat([df['raw'], api_param_df], axis=1)
25
/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py in __getattr__(self, name)
1642 if name not in self.columns:
1643 raise AttributeError(
-> 1644 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
1645 jc = self._jdf.apply(name)
1646 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'values'
The full script is below; the comments explain how the regex is applied to the http_path column of df to parse out api and param and concat them back onto df.
#Extract features from http_path ["API URL", "URL parameters"]
regex = r'([^\?]+)\?*(.*)'
http_path = df.filter(df['http_path'].rlike(regex))
# http_path
#0 https://example.org/path/to/file?param=42#frag...
#1 https://example.org/path/to/file
# api param
#0 https://example.org/path/to/file param=42#fragment
#1 https://example.org/path/to/file NaN
#where in regex pattern:
#- (?:https?://[^/]+/)? optionally matches domain but doesn't capture it
#- (?P<api>[^?]+) matches everything up to ?
#- \? matches ? literally
#- (?P<param>.+) matches everything after ?
# Notice we also make \? and the second capture group optional so that when there are no query parameters in http_path, it returns NaN.
api_param_df = pd.DataFrame([[row[0][0], np.nan] if row[0][1] == '' else row[0] for row in http_path.values], columns=["api", "param"])
df = pd.concat([df['raw'], api_param_df], axis=1)
df
Any help will be appreciated.
The syntax is valid with pandas DataFrames, but that attribute doesn't exist on the DataFrames PySpark creates. You can check the PySpark DataFrame documentation.
Usually, the collect() method or the .rdd attribute will help you with these tasks.
You can use the following snippet to produce the desired result:
http_path = sdf.rdd.map(lambda row: row['http_path'].split('?'))
api_param_df = pd.DataFrame([[row[0], np.nan] if len(row) == 1 else row for row in http_path.collect()], columns=["api", "param"])
sdf = pd.concat([sdf.toPandas()['raw'], api_param_df], axis=1)
Note that I removed the comments to make it more readable, and I also substituted the regex with a simple split.
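If you would rather keep the work in Spark instead of collecting to the driver, a rough equivalent (an untested sketch using pyspark.sql.functions; param becomes null rather than NaN when there is no query string):

from pyspark.sql import functions as F

# Split http_path on the literal '?' (escaped, since split takes a regex).
# getItem(1) returns null when the split produced no second element.
split_col = F.split(sdf['http_path'], r'\?')
sdf = (sdf.withColumn('api', split_col.getItem(0))
          .withColumn('param', split_col.getItem(1)))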
I have successfully read a CSV file using pandas. When I try to print a particular column from the DataFrame, I get a KeyError. I am sharing the code with the error below.
import pandas as pd
reviews_new = pd.read_csv("D:\\aviva.csv")
reviews_new['review']
reviews_new['review']
Traceback (most recent call last):
File "<ipython-input-43-ed485b439a1c>", line 1, in <module>
reviews_new['review']
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "C:\Users\30216\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)
File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)
File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)
File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)
KeyError: 'review'
Can someone help me with this?
I think it is best to first investigate what the real column names are; converting them to a list makes stray whitespace and similar issues easier to see:
print (reviews_new.columns.tolist())
I think there can be two problems:
1. Whitespace in the column names (maybe in the data also)
One solution is to strip the whitespace from the column names:
reviews_new.columns = reviews_new.columns.str.strip()
Or add the parameter skipinitialspace to read_csv:
reviews_new = pd.read_csv("D:\\aviva.csv", skipinitialspace=True)
2. A separator different from the default ,
The solution is to add the parameter sep:
#sep is ;
reviews_new = pd.read_csv("D:\\aviva.csv", sep=';')
#sep is whitespace
reviews_new = pd.read_csv("D:\\aviva.csv", sep='\s+')
reviews_new = pd.read_csv("D:\\aviva.csv", delim_whitespace=True)
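If you are unsure which separator the file uses, pandas can also sniff it; a small sketch (engine='python' is required when sep=None):

# let pandas detect the separator via csv.Sniffer
reviews_new = pd.read_csv("D:\\aviva.csv", sep=None, engine='python')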
EDIT:
You have whitespace in your column names, so you need solution 1:
print (reviews_new.columns.tolist())
['Name', ' Date', ' review']
^ ^
import pandas as pd
df=pd.read_csv("file.txt", skipinitialspace=True)
df.head()
df['review']
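As an aside, the same strip can be done in one line with rename, which accepts a callable to map the column labels (a small sketch, reusing the file from above):

import pandas as pd

df = pd.read_csv("file.txt")
df = df.rename(columns=str.strip)  # apply str.strip to every column label
df['review']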
dfObj['Hash Key'] = (dfObj['DEAL_ID'].map(str) + dfObj['COST_CODE'].map(str) + dfObj['TRADE_ID'].map(str)).apply(hash)
# for index, row in dfObj.iterrows():
#     dfObj.loc[index, 'hash'] = hashlib.md5(str(row[['COST_CODE','TRADE_ID']].values)).hexdigest()
print(dfObj['Hash Key'])
This is my code
import wolframalpha
app_id = '876P8Q-R2PY95YEXY'
client = wolframalpha.Client(app_id)
res = client.query(input('Question: '))
print(next(res.results).text)
The question I tried was 1 + 1, and when I run it I get this error:
Traceback (most recent call last):
File "c:/Users/akshi/Desktop/Xander/Untitled.py", line 9, in <module>
print(next(res.results).text)
File "C:\Users\akshi\AppData\Local\Programs\Python\Python38\lib\site-packages\wolframalpha\__init__.py", line 166, in text
return next(iter(self.subpod)).plaintext
ValueError: dictionary update sequence element #0 has length 1; 2 is required
Please help me
I was getting the same error when I tried to run the same code.
You can refer to the "Implementing Wolfram Alpha Search" section of this website for a better understanding of how the result is extracted from the returned dictionary:
https://medium.com/@salisuwy/build-an-ai-assistant-with-wolfram-alpha-and-wikipedia-in-python-d9bc8ac838fe
I also tried the following code by referring to the above website; hope it helps :)
import wolframalpha

client = wolframalpha.Client('<your app_id>')
query = str(input('Question: '))
res = client.query(query)

if res['@success'] == 'true':
    pod0 = res['pod'][0]['subpod']['plaintext']
    print(pod0)
    pod1 = res['pod'][1]
    if (('definition' in pod1['@title'].lower())
            or ('result' in pod1['@title'].lower())
            or (pod1.get('@primary', 'false') == 'true')):
        result = pod1['subpod']['plaintext']
        print(result)
    else:
        print("No answer returned")
I am relatively new to Python, and as such I don't always understand why I get errors. I keep getting this error:
Traceback (most recent call last):
File "python", line 43, in <module>
ValueError: invalid literal for int() with base 10: 'O'
This is the line it's referring to:
np.insert(arr, [i,num], "O")
I'm trying to change a value in a numpy array.
Some code around this line for context:
hOne = [one,two,three]
hTwo = [four,five,six]
hThree = [seven, eight, nine]
arr = np.array([hOne, hTwo, hThree])
test = "O"
while a != Answer:
    Answer = input("Please Enter Ready to Start")
if a == Answer:
    while win == 0:
        for lists in arr:
            print(lists)
        place = int(input("Choose a number(Use arabic numerals 1,5 etc.)"))
        for i in range(0, len(arr)):
            for num in range(0, len(arr[i])):
                print(arr[i, num], "test")
                print(arr)
                if place == arr[i, num]:
                    if arr[i, num]:
                        np.delete(arr, [i, num])
                        np.insert(arr, [i, num], "O")
                        aiTurn = 1
                    else:
                        print(space_taken)
The number variables in the lists just hold the int versions of themselves, so one = 1, two = 2, three = 3, etc.
I've also tried holding "O" as a variable and changing it that way as well.
Can anyone tell me why I'm getting this error?
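For context, a minimal sketch of what triggers this error and one workaround (the 3x3 board of ints is an assumption based on the snippet above):

import numpy as np

# arr was built from Python ints, so NumPy fixed an integer dtype for it.
# np.insert(arr, [i, num], "O") must coerce "O" into that dtype, which
# calls int("O") and raises the ValueError seen in the traceback.
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr.dtype)  # an integer dtype, e.g. int64

# One workaround: use dtype=object so cells can hold ints or strings,
# and assign in place. Note that np.delete/np.insert also return new
# arrays rather than modifying arr, so the original calls had no effect.
board = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=object)
board[0, 1] = "O"  # mark the chosen cell
print(board)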
Using Flask and SQLAlchemy, is it possible to create a query where a column is cast from a number to a string so that .like() can be used as a filter?
The sample code below illustrates what I'm after; however, Test 3 is a broken statement (i.e., no attempt at casting is made, so the query fails; the error is below).
Test 1 - demonstrates a standard select
Test 2 - demonstrates a select using like on a string
Can 'test 3' be modified to permit a like on a number?
In PostgreSQL the SQL query would be:
SELECT * FROM mytable WHERE number::varchar like '%2%'
Any assistance gratefully appreciated.
from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy import Table, Column, Integer, String
app = Flask(__name__)
app.debug = True
app.config.from_pyfile('config.py')
db = SQLAlchemy( app )
class MyTable(db.Model):
    '''My Sample Table'''
    __tablename__ = 'mytable'

    number = db.Column( db.Integer, primary_key = True )
    text = db.Column( db.String )

    def __repr__(self):
        return( 'MyTable( ' + str( self.number ) + ', ' + self.text + ')' )
test_1 = (db.session.query(MyTable)
          .all())
print "Test 1 = " + str( test_1 )

test_2 = (db.session.query(MyTable)
          .filter( MyTable.text.like( '%orl%' ) )
          .all())
print "Test 2 = " + str( test_2 )

test_3 = (db.session.query(MyTable)
          .filter( MyTable.number.like( '%2%' ) )
          .all())
And the sample data:
=> select * from mytable;
number | text
--------+-------
100 | Hello
20 | World
And the error:
Traceback (most recent call last):
File "sample.py", line 33, in <module>
.filter( MyTable.number.like( '%2%' ) )
File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2320, in all
return list(self)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2438, in __iter__
return self._execute_and_instances(context)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2453, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 729, in execute
return meth(self, multiparams, params)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
context)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1159, in _handle_dbapi_exception
exc_info
File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
context)
File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (ProgrammingError) operator does not exist: integer ~~ unknown
LINE 3: WHERE mytable.number LIKE '%2%'
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
'SELECT mytable.number AS mytable_number, mytable.text AS mytable_text \nFROM mytable \nWHERE mytable.number LIKE %(number_1)s' {'number_1': '%2%'}
Solved. The Query method filter can take an expression, so the solution is:
from sqlalchemy import cast, String
result = (db.session.query(MyTable)
.filter( cast( MyTable.number, String ).like( '%2%' ) )
.all())
With the result:
Test 3 = [MyTable( 20, World)]
Found the information in the SQLAlchemy Query API documentation.
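As a footnote, on SQLAlchemy 1.0.7 and later the same cast can be spelled with the column's own cast() method; a small sketch of the equivalent filter:

from sqlalchemy import String

test_3 = (db.session.query(MyTable)
          .filter( MyTable.number.cast( String ).like( '%2%' ) )
          .all())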