pyspark pass multiple options in dataframe - apache-spark-sql

I am new to Python and PySpark. I would like to know
how to write the Spark DataFrame call below in PySpark:
val df = spark.read.format("jdbc").options(
Map(
"url" -> "jdbc:someDB",
"user" -> "root",
"password" -> "password",
"dbtable" -> "tableName",
"driver" -> "someDriver")).load()
I tried to write it as below in PySpark, but I get a syntax error:
df = spark.read.format("jdbc").options(
map(lambda : ("url","jdbc:someDB"), ("user","root"), ("password","password"), ("dbtable","tableName"), ("driver","someDriver"))).load()
Thanks in Advance

In PySpark, pass the options as keyword arguments:
df = spark.read\
.format("jdbc")\
.options(
url="jdbc:someDB",
user="root",
password="password",
dbtable="tableName",
driver="someDriver",
)\
.load()
Sometimes it's handy to keep them in a dict and unpack them later using the splat operator:
options = {
"url": "jdbc:someDB",
"user": "root",
"password": "password",
"dbtable": "tableName",
"driver": "someDriver",
}
df = spark.read\
.format("jdbc")\
.options(**options)\
.load()
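As a minimal sketch of what the `**` unpacking does, independent of Spark: the function `show_options` below is a hypothetical stand-in that, like `options()`, accepts arbitrary keyword arguments. Unpacking the dict should produce the same call as spelling out the keywords by hand.

```python
# Hypothetical stand-in for a method that accepts keyword options;
# it is not part of PySpark.
def show_options(**kwargs):
    # kwargs arrives as an ordinary dict of option name -> value
    return dict(kwargs)

options = {
    "url": "jdbc:someDB",
    "user": "root",
    "dbtable": "tableName",
}

# These two calls pass exactly the same keyword arguments.
unpacked = show_options(**options)
spelled_out = show_options(url="jdbc:someDB", user="root", dbtable="tableName")
print(unpacked == spelled_out)
```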
Regarding the code snippets from your question: you happened to mix up two different concepts of "map":
Map in Scala is a data structure also known as "associative array" or "dictionary", equivalent to Python's dict
map in Python is a higher-order function you can use for applying a function to an iterable, e.g.:
In [1]: def square(x: int) -> int:
   ...:     return x**2
   ...:
In [2]: list(map(square, [1, 2, 3, 4, 5]))
Out[2]: [1, 4, 9, 16, 25]
In [3]: # or just use a lambda
In [4]: list(map(lambda x: x**2, [1, 2, 3, 4, 5]))
Out[4]: [1, 4, 9, 16, 25]
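If the options already exist as key/value pairs, as in the attempt from the question, a plain `dict()` call is probably the closest Python counterpart to Scala's `Map(...)`. A small sketch (the variable names are only for illustration):

```python
# Key/value pairs, similar to the tuples in the question's attempt
pairs = [
    ("url", "jdbc:someDB"),
    ("user", "root"),
    ("password", "password"),
    ("dbtable", "tableName"),
    ("driver", "someDriver"),
]

# dict() builds the associative array that Scala's Map(...) literal creates
options = dict(pairs)
print(options["dbtable"])  # tableName
```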

Try to use option() instead:
df = spark.read \
.format("jdbc") \
.option("url","jdbc:someDB") \
.option("user","root") \
.option("password","password") \
.option("dbtable","tableName") \
.option("driver","someDriver") \
.load()

To load a CSV file with multiple parameters, pass the arguments to load():
df = spark.read.load("examples/src/main/resources/people.csv",
format="csv", sep=":", inferSchema="true", header="true")
Here's the documentation for that.

Related

tensorflow 2.8 dataset get single element error

I would like to use "get_single_element" to get a one-batch dataset as a dict whose keys are the feature names and whose values are the feature values.
My code uses TF 2.8 and runs in Jupyter on an EC2 instance.
import pandas as pd
import tensorflow as tf
df = pd.DataFrame({
'id': [1, 2, 6, 3, 5, 0],
'value_1': [10.892561, 7.210528, 1.2278101, -9.251782, 0.2118367, 6.9128551],
'value_2': ['large', 'small', 'mid','small','large', 'mid'],
'name_1': ['tyne', 'wnhp', 'ebhg','lpzhn','tyne', 'ebhg'],
'label': [0, 1, 0, 1, 1, 1]
})
dataset = tf.data.Dataset.from_tensor_slices(dict(df))
dataset = dataset.batch(2)
print(type(dataset))
print(dataset)
print(len(dataset))
class Features(object):
    feature_data = {
        "id": tf.io.FixedLenFeature((1,), dtype=tf.int8),
        "value_1": tf.io.VarLenFeature(dtype=tf.float32),
        "value_2": tf.io.FixedLenFeature((1,), dtype=tf.string),
        "name_1": tf.io.FixedLenFeature((1,), dtype=tf.string)
    }
    label_data = {"label": tf.io.FixedLenFeature((1,), dtype=tf.int8)}
def process_sample(ds):
    print(f"ds type is {type(ds)}")
    features = tf.io.parse_single_example(ds, Features.feature_data) # error !
    labels = tf.io.parse_single_example(ds, Features.label_data)['label']
    return (features, labels)
dataset = dataset.map(lambda x: process_sample(x), num_parallel_calls=tf.data.AUTOTUNE)
dataset = tf.data.Dataset.get_single_element(dataset.batch(len(dataset)))
print(f"dataset type is {type(dataset)} dataset is {dataset}")
def get_dict(input_ds):
    feature_dict_tensors = dict(input_ds)
    print(f"\feature_dict_tensors type is {type(feature_dict_tensors)}, feature_dict_tensors is {feature_dict_tensors}")
    return feature_dict_tensors
ds = dataset.map(get_dict)
print(f"ds type is {type(ds)}")
print(ds)
I got this error:
File "<ipython-input-4-e9407f42e0a7>", line 37, in None *
lambda x: process_sample(x), num_parallel_calls=tf.data.AUTOTUNE)
File "<ipython-input-4-e9407f42e0a7>", line 33, in process_sample *
features = tf.io.parse_single_example(ds, Features.feature_data)
TypeError: Expected any non-tensor type, but got a tensor instead.
Based on https://www.tensorflow.org/api_docs/python/tf/io/parse_single_example, the first argument should be
A scalar string Tensor, a single serialized Example.
Why do I get this error?
thanks

Julia "MethodError: no method matching build_tree"

I have a very simple sample script:
using Pkg
Pkg.add("DecisionTree")
Pkg.add("DataFrames")
using DataFrames
using DecisionTree
dat = DataFrame(A=[1, 2, 3, 4, 5], B=[2, 5, 1, 2, 6])
model = build_tree(dat[!, "A"], dat[!, "B"])
Which returns an error:
ERROR: LoadError: MethodError: no method matching build_tree(::Vector{Int64}, ::Vector{Int64})
Closest candidates are:
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}, ::Any) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
build_tree(::AbstractVector{T}, ::AbstractMatrix{S}, ::Any, ::Any) where {S, T} at C:\Users\**\.julia\packages\DecisionTree\iWCbW\src\classification\main.jl:74
What is going on? How do I deal with that?
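Your argument types do not match: as the candidate list in the error shows, build_tree expects a vector as its first argument and a matrix as its second, but you passed two vectors. Try this: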
C = reshape(dat[!, "B"], (1, 5))
model = DecisionTree.build_tree(dat[!, "A"], C')

numpy array of array with custom filtering

I am trying to filter a numpy array of array with given conditions, for example
input = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
output where the [0] >= 4, [1] >= 5, [2] >= 6
expected result = np.array([[4,5,6],[4,5,6]])
What would be the best way to achieve this with performance in mind?
Extended question: how can I retrieve the index in the input array that corresponds to each output element?
You can do:
a = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
a[(a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6)]
This creates a boolean mask for the condition on each column, combines the masks with logical AND (&), and uses the resulting mask to select the matching rows.
To find the indices of the rows that match the conditions, you can use NumPy's where() function:
idx = np.where((a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6))[0]
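A sketch that puts the mask and the index lookup together, using the array from the question. The `thresholds` name is introduced here for illustration; comparing against it relies on NumPy broadcasting and should be equivalent to combining the three column masks with `&`.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [0, 9, 19]])
thresholds = np.array([4, 5, 6])  # illustrative name, one bound per column

# Column-by-column masks combined with logical AND
mask = (a[:, 0] >= 4) & (a[:, 1] >= 5) & (a[:, 2] >= 6)

# Same test via broadcasting: a row matches when every column passes
mask_broadcast = np.all(a >= thresholds, axis=1)

rows = a[mask]           # matching rows
idx = np.where(mask)[0]  # their positions in the input array

print(rows.tolist())  # [[4, 5, 6], [4, 5, 6]]
print(idx.tolist())   # [1, 2]
```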
As per your request, a Numba version:
import numpy as np
import numba as nb
import sys
import timeit
target = np.random.randint(low=-100000, high=100000, size=(int(sys.argv[2]), 3), dtype=np.int64)
comp = np.array([4, 5, 6])
@nb.njit((nb.int64[:, :], nb.int64[::3]), parallel=True)
def cmp(a, b):
    c = np.empty((a.shape[0],), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        c[i] = a[i][0] >= b[0] and a[i][1] >= b[1] and a[i][2] >= b[2]
    return c
def cmp_normal(a, b):
    # return np.all(a >= b, axis=1)
    return (a[:,0] >=b[0]) & (a[:,1] >= b[1]) & (a[:,2] >=b[2])
print(timeit.timeit(lambda: eval(sys.argv[1])(target, comp), number=10))
The first timing is for sequential Numba, the second for parallel Numba.
Parallel Numba gives roughly a 5x speedup over sequential:
(base) xxx@xxx:~$ python test.py cmp 1000000
6.40756068899982
(base) xxx@xxx:~$ python test.py cmp 1000000
1.3425709140001345
Now vanilla NumPy:
(base) xxx@xxx:~$ python test.py cmp_normal 1000000
4.04174472700015
Parallel Numba is the fastest here. However, if you return a[c] instead of the mask, Numba slows down, so the result depends on what the function has to produce.
In [223]: arr =np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
In [224]: arr
Out[224]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 4, 5, 6],
[ 0, 9, 19]])
Since you are testing values, one for each column, you can do a simple numpy == test (the (3,) test broadcasts with the (4,3) arr)
In [225]: arr==[4,5,6]
Out[225]:
array([[False, False, False],
[ True, True, True],
[ True, True, True],
[False, False, False]])
and where a whole row is true:
In [226]: (arr==[4,5,6]).all(axis=1)
Out[226]: array([False, True, True, False])
This can be applied as a boolean mask to select those rows from arr:
In [227]: arr[_]
Out[227]:
array([[4, 5, 6],
[4, 5, 6]])
and the numeric indices:
In [228]: np.nonzero(__)
Out[228]: (array([1, 2]),)

How to create variable PySpark Dataframes by Dropping Null columns

I have 2 JSON files in a relative folder named 'source_data'
"source_data/data1.json"
{
"name": "John Doe",
"age": 32,
"address": "ZYZ - Heaven"
}
"source_data/data2.json"
{
"userName": "jdoe",
"password": "password",
"salary": "123456789"
}
Using the following PySpark code I have created DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.json("source_data")
print(df.head())
Output:
df.head(10)
[Row(name='John Doe', age=32, address='ZYZ - Heaven', userName=None, password=None, salary=None),
Row(name=None, age=None, address=None, userName='jdoe', password='password', salary='123456789')]
Now I want to create a variable number of DataFrames by dropping the columns whose values are None, like this:
df1.head()
[Row(name='John Doe', age=32, address='ZYZ - Heaven')]
and,
df2.head()
[Row(userName='jdoe', password='password', salary='123456789')]
So far I have only found solutions that drop entire rows based on all or any column(s).
Is there any way to achieve what I am looking for?
TIA
You can select the columns you need into separate DataFrames and filter out the rows in which all of those columns are null (shown here in Scala):
//source data
val df = spark.read.json("path")
//select and filter
val df1 = df.select("address","age","name")
.filter($"address".isNotNull || $"age".isNotNull || $"name".isNotNull)
val df2 = df.select("password","salary","userName")
.filter($"password".isNotNull || $"salary".isNotNull || $"userName".isNotNull)
//see the output as dataframe or using head as you want
println(df1.head)
df2.head

How to find the index in a list of numbers where there are repeating numbers [duplicate]

Does anyone know how I can get the index position of duplicate items in a python list?
I have tried doing this, but it keeps giving me only the index of the first occurrence of the item in the list.
List = ['A', 'B', 'A', 'C', 'E']
I want it to give me:
index 0: A
index 2: A
You want to pass in the optional second parameter to index, the location where you want index to start looking. After you find each match, reset this parameter to the location just after the match that was found.
def list_duplicates_of(seq,item):
    start_at = -1
    locs = []
    while True:
        try:
            loc = seq.index(item,start_at+1)
        except ValueError:
            break
        else:
            locs.append(loc)
            start_at = loc
    return locs
source = "ABABDBAAEDSBQEWBAFLSAFB"
print(list_duplicates_of(source, 'B'))
Prints:
[1, 3, 5, 11, 15, 22]
You can find all the duplicates at once in a single pass through source, by using a defaultdict to keep a list of all seen locations for any item, and returning those items that were seen more than once.
from collections import defaultdict
def list_duplicates(seq):
    tally = defaultdict(list)
    for i,item in enumerate(seq):
        tally[item].append(i)
    return ((key,locs) for key,locs in tally.items()
            if len(locs)>1)
for dup in sorted(list_duplicates(source)):
    print(dup)
Prints:
('A', [0, 2, 6, 7, 16, 20])
('B', [1, 3, 5, 11, 15, 22])
('D', [4, 9])
('E', [8, 13])
('F', [17, 21])
('S', [10, 19])
If you want to do repeated testing for various keys against the same source, you can use functools.partial to create a new function variable, using a "partially complete" argument list, that is, specifying the seq, but omitting the item to search for:
from functools import partial
dups_in_source = partial(list_duplicates_of, source)
for c in "ABDEFS":
    print(c, dups_in_source(c))
Prints:
A [0, 2, 6, 7, 16, 20]
B [1, 3, 5, 11, 15, 22]
D [4, 9]
E [8, 13]
F [17, 21]
S [10, 19]
>>> def indices(lst, item):
...     return [i for i, x in enumerate(lst) if x == item]
...
>>> indices(List, "A")
[0, 2]
To get all duplicates, you can use the below method, but it is not very efficient. If efficiency is important you should consider Ignacio's solution instead.
>>> dict((x, indices(List, x)) for x in set(List) if List.count(x) > 1)
{'A': [0, 2]}
As for solving it using the index method of list instead, that method takes a second optional argument indicating where to start, so you could just repeatedly call it with the previous index plus 1.
>>> List.index("A")
0
>>> List.index("A", 1)
2
I made a benchmark of all solutions suggested here and also added another solution to this problem (described in the end of the answer).
Benchmarks
First, the benchmarks. I initialize a list of n random ints in the range [1, n/2] and then call timeit on all algorithms.
The solutions of @Paul McGuire and @Ignacio Vazquez-Abrams run about twice as fast as the rest on a list of 100 ints:
Testing algorithm on the list of 100 items using 10000 loops
Algorithm: dupl_eat
Timing: 1.46247477189
####################
Algorithm: dupl_utdemir
Timing: 2.93324529055
####################
Algorithm: dupl_lthaulow
Timing: 3.89198786645
####################
Algorithm: dupl_pmcguire
Timing: 0.583058259784
####################
Algorithm: dupl_ivazques_abrams
Timing: 0.645062989076
####################
Algorithm: dupl_rbespal
Timing: 1.06523873786
####################
If you change the number of items to 1000, the difference becomes much bigger (BTW, I'd be happy if someone could explain why):
Testing algorithm on the list of 1000 items using 1000 loops
Algorithm: dupl_eat
Timing: 5.46171654555
####################
Algorithm: dupl_utdemir
Timing: 25.5582547323
####################
Algorithm: dupl_lthaulow
Timing: 39.284285326
####################
Algorithm: dupl_pmcguire
Timing: 0.56558489513
####################
Algorithm: dupl_ivazques_abrams
Timing: 0.615980005148
####################
Algorithm: dupl_rbespal
Timing: 1.21610942322
####################
On bigger lists, the solution of @Paul McGuire continues to be the most efficient, and my algorithm starts to have problems.
Testing algorithm on the list of 1000000 items using 1 loops
Algorithm: dupl_pmcguire
Timing: 1.5019953958
####################
Algorithm: dupl_ivazques_abrams
Timing: 1.70856155898
####################
Algorithm: dupl_rbespal
Timing: 3.95820421595
####################
The full code of the benchmark is here
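The linked benchmark is not reproduced here. As a rough sketch of the setup described above (a list of n random ints in [1, n/2], timed with `timeit`), the harness below compares a single-pass `defaultdict` approach with a `count`-based one. The function names, the seed and the loop count are assumptions for illustration, not the original benchmark code.

```python
import random
import timeit
from collections import defaultdict

def dupl_single_pass(seq):
    # One pass: record every index per item, keep items seen more than once
    tally = defaultdict(list)
    for i, item in enumerate(seq):
        tally[item].append(i)
    return {key: locs for key, locs in tally.items() if len(locs) > 1}

def dupl_count_based(seq):
    # Rescans the whole list for every distinct item
    return {
        x: [i for i, y in enumerate(seq) if y == x]
        for x in set(seq)
        if seq.count(x) > 1
    }

def make_data(n, seed=0):
    # n random ints in [1, n/2], as described in the answer
    rng = random.Random(seed)
    return [rng.randint(1, n // 2) for _ in range(n)]

data = make_data(1000)
for func in (dupl_single_pass, dupl_count_based):
    elapsed = timeit.timeit(lambda: func(data), number=10)
    print(f"{func.__name__}: {elapsed:.4f}s")
```

Both functions return the same mapping. The count-based one rescans the list once per distinct value, so its work grows roughly quadratically with the list length, which is a likely reason the gap widens on longer lists.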
Another algorithm
Here is my solution to the same problem:
def dupl_rbespal(c):
    alreadyAdded = False
    dupl_c = dict()
    sorted_ind_c = sorted(range(len(c)), key=lambda x: c[x]) # sort incoming list but save the indexes of sorted items
    for i in range(len(c) - 1): # loop over indexes of sorted items
        if c[sorted_ind_c[i]] == c[sorted_ind_c[i+1]]: # if two consecutive indexes point to the same value, add it to the duplicates
            if not alreadyAdded:
                dupl_c[c[sorted_ind_c[i]]] = [sorted_ind_c[i], sorted_ind_c[i+1]]
                alreadyAdded = True
            else:
                dupl_c[c[sorted_ind_c[i]]].append( sorted_ind_c[i+1] )
        else:
            alreadyAdded = False
    return dupl_c
Although it's not the best, it let me generate a slightly different structure that I needed for my problem (something like a linked list of the indexes that share a value).
import collections
dups = collections.defaultdict(list)
for i, e in enumerate(L):
    dups[e].append(i)
for k, v in sorted(dups.items()):
    if len(v) >= 2:
        print('%s: %r' % (k, v))
And extrapolate from there.
I think I found a simple solution after a lot of irritation :
if elem in string_list:
    counter = 0
    elem_pos = []
    for i in string_list:
        if i == elem:
            elem_pos.append(counter)
        counter = counter + 1
    print(elem_pos)
This prints a list giving you the indexes of a specific element ("elem")
Using the Counter class from the collections module, based on lazyr's answer:
>>> import collections
>>> def duplicates(n): #n="123123123"
...     counter=collections.Counter(n) #{'1': 3, '3': 3, '2': 3}
...     dups=[i for i in counter if counter[i]!=1] #['1','3','2']
...     result={}
...     for item in dups:
...         result[item]=[i for i,j in enumerate(n) if j==item]
...     return result
...
>>> duplicates("123123123")
{'1': [0, 3, 6], '3': [2, 5, 8], '2': [1, 4, 7]}
from collections import Counter, defaultdict
def duplicates(lst):
    cnt= Counter(lst)
    return [key for key in cnt.keys() if cnt[key]> 1]
def duplicates_indices(lst):
    dup, ind= duplicates(lst), defaultdict(list)
    for i, v in enumerate(lst):
        if v in dup: ind[v].append(i)
    return ind
lst= ['a', 'b', 'a', 'c', 'b', 'a', 'e']
print(duplicates(lst)) # ['a', 'b']
print(duplicates_indices(lst)) # ..., {'a': [0, 2, 5], 'b': [1, 4]})
A slightly more orthogonal (and thus more useful) implementation would be:
from collections import Counter, defaultdict
def duplicates(lst):
    cnt= Counter(lst)
    return [key for key in cnt.keys() if cnt[key]> 1]
def indices(lst, items= None):
    items, ind= set(lst) if items is None else items, defaultdict(list)
    for i, v in enumerate(lst):
        if v in items: ind[v].append(i)
    return ind
lst= ['a', 'b', 'a', 'c', 'b', 'a', 'e']
print(indices(lst, duplicates(lst))) # ..., {'a': [0, 2, 5], 'b': [1, 4]})
Wow, everyone's answer is so long. I simply used a pandas DataFrame, masking, and the duplicated function (keep=False marks all duplicates as True, not just the first or last):
import pandas as pd
import numpy as np
np.random.seed(42) # make results reproducible
int_df = pd.DataFrame({'int_list': np.random.randint(1, 20, size=10)})
dupes = int_df['int_list'].duplicated(keep=False)
print(int_df['int_list'][dupes].index)
This should return Int64Index([0, 2, 3, 4, 6, 7, 9], dtype='int64').
def index(arr, num):
    for i, x in enumerate(arr):
        if x == num:
            print(x, i)
#index(List, 'A')
In a single line with pandas 1.2.2 and numpy:
import numpy as np
import pandas as pd
idx = np.where(pd.DataFrame(List).duplicated(keep=False))
The argument keep=False will mark every duplicate as True and np.where() will return an array with the indices where the element in the array was True.
string_list = ['A', 'B', 'C', 'B', 'D', 'B']
pos_list = []
for i in range(len(string_list)):
    if string_list[i] == 'B':
        pos_list.append(i)
print(pos_list)
def find_duplicate(list_):
    duplicate_list=[""]
    for k in range(len(list_)):
        if duplicate_list.__contains__(list_[k]):
            continue
        for j in range(len(list_)):
            if k == j:
                continue
            if list_[k] == list_[j]:
                duplicate_list.append(list_[j])
                print("duplicate "+str(list_.index(list_[j]))+str(list_.index(list_[k])))
Here is one that works for multiple duplicates and you don't need to specify any values:
List = ['A', 'B', 'A', 'C', 'E', 'B'] # duplicate two 'A's two 'B's
ix_list = []
for i in range(len(List)):
    try:
        dup_ix = List[(i+1):].index(List[i]) + (i + 1) # dup onwards + (i + 1)
        ix_list.extend([i, dup_ix]) # if found no error, add i also
    except ValueError:
        pass
ix_list.sort()
print(ix_list)
[0, 1, 2, 5]
def dup_list(my_list, value):
    '''
    dup_list(list,value)
    This function finds the indices of values in a list including duplicated values.
    list: the list you are working on
    value: the item of the list you want to find the index of
    NB: if a value is duplicated, its indices are stored in a list
    If there is only one occurrence of the value, the index is stored as an integer.
    Therefore use isinstance to know how to handle the returned value
    '''
    value_list = []
    index_list = []
    index_of_duped = []
    if my_list.count(value) == 1:
        return my_list.index(value)
    elif my_list.count(value) < 1:
        return 'Your argument is not in the list'
    else:
        for item in my_list:
            value_list.append(item)
            length = len(value_list)
            index = length - 1
            index_list.append(index)
            if item == value:
                index_of_duped.append(max(index_list))
        return index_of_duped
# function call eg dup_list(my_list, 'john')
If you want to get index of all duplicate elements of different types you can try this solution:
# note: below list has more than one kind of duplicates
List = ['A', 'B', 'A', 'C', 'E', 'E', 'A', 'B', 'A', 'A', 'C']
d1 = {item:List.count(item) for item in List} # item and their counts
elems = list(filter(lambda x: d1[x] > 1, d1)) # get duplicate elements
d2 = dict(zip(range(0, len(List)), List)) # each item and their indices
# item and their list of duplicate indices
res = {item: list(filter(lambda x: d2[x] == item, d2)) for item in elems}
Now, if you print(res) you'll get to see this:
{'A': [0, 2, 6, 8, 9], 'B': [1, 7], 'C': [3, 10], 'E': [4, 5]}
def duplicates(list,dup):
    a=[list.index(dup)]
    for i in list:
        try:
            a.append(list.index(dup,a[-1]+1))
        except:
            for i in a:
                print(f'index {i}: '+dup)
            break
duplicates(['A', 'B', 'A', 'C', 'E'],'A')
Output:
index 0: A
index 2: A
This is a good question, and there are a lot of ways to do it.
The code below is one of them:
letters = ["a", "b", "c", "d", "e", "a", "a", "b"]
lettersIndexes = [i for i in range(len(letters))] # a list that contains the indexes of the previous list
counter = 0
for item in letters:
    if item == "a":
        print(item, lettersIndexes[counter])
    counter += 1 # the counter is increased for each item, so it tracks the index
Another way to get the indexes, this time stored in a list:
letters = ["a", "b", "c", "d", "e", "a", "a", "b"]
lettersIndexes = [i for i in range(len(letters)) if letters[i] == "a" ]
print(lettersIndexes) # as you can see we get a list of the indexes that we want.
Good day
Using a dictionary approach based on setdefault instance method.
List = ['A', 'B', 'A', 'C', 'B', 'E', 'B']
# keep track of all indices of every term
duplicates = {}
for i, key in enumerate(List):
    duplicates.setdefault(key, []).append(i)
# print only those terms with more than one index
template = 'index {}: {}'
for k, v in duplicates.items():
    if len(v) > 1:
        print(template.format(k, str(v).strip('][')))
Remark: Counter, defaultdict and OrderedDict from collections are subclasses of dict and therefore share the setdefault method as well.
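The same idea wrapped in a function, as a sketch: `duplicate_indices` is a name chosen here for illustration. The last line also checks that `Counter` and `defaultdict` derive from `dict`.

```python
from collections import Counter, defaultdict

def duplicate_indices(seq):
    """Map each item that occurs more than once to the list of its indices."""
    positions = {}
    for i, item in enumerate(seq):
        # setdefault returns the existing list, or stores and returns a new one
        positions.setdefault(item, []).append(i)
    return {item: idxs for item, idxs in positions.items() if len(idxs) > 1}

print(duplicate_indices(['A', 'B', 'A', 'C', 'B', 'E', 'B']))
# {'A': [0, 2], 'B': [1, 4, 6]}

# Both classes derive from dict, so they inherit setdefault
print(issubclass(Counter, dict), issubclass(defaultdict, dict))  # True True
```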
I'll mention the more obvious way of dealing with duplicates in lists. In terms of complexity, dictionaries are the way to go because each lookup is O(1). You can be more clever if you're only interested in duplicates...
my_list = [1,1,2,3,4,5,5]
my_dict = {}
for (ind,elem) in enumerate(my_list):
    if elem in my_dict:
        my_dict[elem].append(ind)
    else:
        my_dict.update({elem:[ind]})
for key,value in my_dict.items():
    if len(value) > 1:
        print("key(%s) has indices (%s)" %(key,value))
which prints the following:
key(1) has indices ([0, 1])
key(5) has indices ([5, 6])
a= [2,3,4,5,6,2,3,2,4,2]
search=2
pos=0
positions=[]
while (search in a):
    pos+=a.index(search)
    positions.append(pos)
    a=a[a.index(search)+1:]
    pos+=1
print("search found at:", positions)
I just kept it simple:
i = [1,2,1,3]
k = 0
for ii in i:
    if ii == 1 :
        print ("index of 1 = ", k)
    k = k+1
output:
index of 1 = 0
index of 1 = 2