Getting error while converting base64 string into image using pyspark - pandas

I want to extract and process image data (a 3D array) stored in base64 format using PySpark. I'm using a pandas_udf (with PyArrow) as the processing function. Inside the pandas_udf, the first step is to convert the base64 string into an image, but at this step I get the error "TypeError: file() argument 1 must be encoded string without null bytes, not str."
I am using base64.b64decode(imgString) to convert the base64 string to an image. I'm using Python 2.7.
...
avrodf=sqlContext.read.format("com.databricks.spark.avro").load("hdfs:///Raw_Images_201803182350.avro")
interested_cols = ["id","name","image_b64"]
indexed_avrodf = avrodf.select(interested_cols)
ctx_cols = ["id","name"]
result_sdf = indexed_avrodf.groupby(ctx_cols).apply(img_proc)
schema = StructType([
    StructField("id",StringType()),
    StructField("name",StringType()),
    StructField("image",StringType()),
    StructField("Proc_output",StringType())
])
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def img_proc(df):
    df['Proc_output'] = df['image_b64'].apply(is_processed)
    return df
def is_processed(imgString):
    import cv2
    from PIL import Image, ImageDraw, ImageChops
    import base64
    wisimg = base64.b64decode(imgString)
    image = Image.open(wisimg)
    .....
    return processed_status
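The error is consistent with Image.open receiving the decoded bytes directly: it treats a plain string argument as a file name, and in Python 2.7 base64.b64decode returns a str holding raw bytes. A minimal sketch of the usual workaround, wrapping the decoded bytes in an in-memory file object, is below. The helper name b64_to_buffer and the sample payload are made up for illustration, and the Pillow call is left as a comment so the snippet runs with the standard library alone.

```python
import base64
import io


def b64_to_buffer(img_string):
    """Decode a base64 string into a seekable, file-like buffer (hypothetical helper)."""
    raw = base64.b64decode(img_string)  # raw image bytes, not a file name
    return io.BytesIO(raw)              # in-memory object with read()/seek()


# Stand-in payload: the 8-byte PNG signature followed by filler bytes.
# A real column value would hold a complete encoded image.
sample_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16).decode("ascii")

buffer = b64_to_buffer(sample_b64)
header = buffer.read(8)
buffer.seek(0)  # rewind so an image library can read from the start

# With Pillow installed, the buffer (not the raw bytes) goes to Image.open:
#     image = Image.open(buffer)
```

Inside is_processed, this would mean passing the buffer to Image.open instead of wisimg; whether the rest of the processing then works depends on the elided code.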

Related

How to change the data type from string to bytes, keeping the contents of the string unchanged?

How can I convert a string to bytes? This is not about decode/encode: the string already contains the bytes, and I just need to change its type from str to bytes.
I want to write a NumPy array into image metadata. To save both the shape of the array and its contents I use the pickle package, but only a string can be written to the image metadata, so I convert the pickled object to a string with a simple str(). The data written to and read back from the image metadata looks the same as the bytes, but it is a string:
b'\x80\x04\x95\xaa\x00\x00\x00...
For pickle to turn this data back into a NumPy array, I need to get it back to type bytes. How can I do that? Everything I find online talks about converting strings to bytes and vice versa via decode/encode, but that's not what I need.
It would be good if you show some code.
pickle.dumps returns a bytes object; ideally you would be able to write this straight to your image metadata without the str; investigate if you can write metadata in "binary" mode. If that is not an option, I suggest looking at base64 encoding.
If you insist on using the str method, you could use ast.literal_eval to somewhat safely convert back to a bytes object.
This sample demonstrates both approaches, binary mode and str with ast.literal_eval:
import ast
import pickle
import numpy as np
a = np.array([[1.23, np.pi], [3, 4]])
# binary mode
with open('data.bin', 'wb') as outfile:
    outfile.write(pickle.dumps(a))
with open('data.bin', 'rb') as infile:
    b = pickle.loads(infile.read())
print('binary')
print(b)
print(f'{(a==b).all() = }')
# ast.literal_eval hack
with open('data.txt', 'w') as outfile:
    outfile.write(str(pickle.dumps(a)))
with open('data.txt') as infile:
    c = pickle.loads(ast.literal_eval(infile.read()))
print()
print('str & ast.literal_eval')
print(c)
print(f'{(c==b).all() = }')
On my system (Ubuntu 20.04, Python 3.9.7 via Conda), this gives:
binary
[[1.23       3.14159265]
 [3.         4.        ]]
(a==b).all() = True

str & ast.literal_eval
[[1.23       3.14159265]
 [3.         4.        ]]
(c==b).all() = True
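The base64 route suggested at the start of this answer can be sketched as follows. This is only an illustration under a few assumptions: a nested list stands in for the NumPy array so that nothing beyond the standard library is needed, and the variable names are made up. The step that writes to and reads from the image metadata is left out because it depends on the image library in use.

```python
import base64
import pickle

# A nested list stands in for the NumPy array (assumption, to avoid third-party imports).
payload = [[1.23, 3.14159265], [3.0, 4.0]]

# pickle bytes -> base64 bytes -> ASCII text that fits a string-only metadata field
text = base64.b64encode(pickle.dumps(payload)).decode("ascii")

# ... store `text` in the metadata and read it back later ...

# ASCII text -> original pickle bytes -> original object
restored = pickle.loads(base64.b64decode(text))
```

Unlike str(), base64 is designed to be reversed, so no literal parsing is needed on the way back. As with any pickled data, only load metadata you trust.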

How to make a tf.transform (Tensorflow Transform) encoded dict?

I'm trying to get a "tf.transform encoded dict" with the tfx.components.Transform component:
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_taxi_transform_module_file),
    instance_name="taxi")
context.run(transform)
I need a dict like this: "a dict of the data you load ({feature_name: feature_value})."
The Transform component above gives me a TFRecord file. How can I decode it properly?
Any help would be appreciated.
import tensorflow_transform as tft
def preprocessing_fn(inputs):
    NUMERIC_FEATURE_KEYS = ['PetalLengthCm', 'PetalWidthCm',
                            'SepalLengthCm', 'SepalWidthCm']
    TARGET_FEATURES = "Species"
    outputs = inputs.copy()
    del outputs['Id']
    for key in NUMERIC_FEATURE_KEYS:
        outputs[key] = tft.scale_to_0_1(outputs[key])
    return outputs
Write a module like this one. I wrote it for the Iris dataset; it is simple to follow, and you can adapt it to your own dataset in the same way. The transformed output will be saved as a TFRecord dataset.

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in pdf and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc on those books. So I converted them into .txt files and was trying to get tf-idf scores. Previously I performed tf-idf on a CSV file, so I made some changes in that code and tried to use it for .txt file. But I am unsuccessful in my attempt.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code works up to print('Non Zero Count : ', cvec_count.nnz) and prints:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it raises:
ZeroDivisionError: division by zero
Even if I ignore the ZeroDivisionError, the result is still wrong, because no frequencies are counted.
I don't know how to work with a .txt file. What is the proper way to handle .txt files for NLP tasks?
Thanks in advance!
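You are getting the error because data is an open file object, not the text itself. cvec.fit(data) iterates over the file line by line and leaves it at end-of-file, so cvec.transform(data) receives no lines at all, which is why the matrix has 0 rows. Just opening the text file is not enough: read its contents into a variable first and do the preprocessing on that variable. Because CountVectorizer expects an iterable of documents rather than a single string, read the file as a list of lines. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
    data = file.readlines()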
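The (0, 7132) shape in the question is what one would expect if the file handle was consumed by cvec.fit(data) and was already at end-of-file when cvec.transform(data) ran. The sketch below shows that behaviour with the standard library only; io.StringIO and its three lines of text are stand-ins for the real book file, and no scikit-learn call is made.

```python
import io

# io.StringIO stands in for the opened text file (hypothetical three-line content).
handle = io.StringIO("first line\nsecond line\nthird line\n")

first_pass = [line for line in handle]   # what a first pass such as fit() would consume
second_pass = [line for line in handle]  # what a second pass would then see: nothing

# Reading the lines into a list once gives an object that can be iterated repeatedly.
handle.seek(0)
docs = handle.read().splitlines()
pass_one = list(docs)
pass_two = list(docs)
```

A list such as docs can be passed to both fit and transform, since each call starts iterating from the beginning.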

Form Data to get a particular output from urlencode

What should the dictionary form_data look like to get the following output?
Desired output from the Python code data = parse.urlencode(form_data).encode():
"entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4"
I tried various dictionary structures, including ones with None, [], and a dictionary within a dictionary, but I am unable to get this output.
form_data = {'entry.330812148_sentinel':None,
             'entry.330812148':'Test1',
             'entry.330812148':'Test2',
             'entry.330812148':'Test3',
             'entry.330812148':'Test4'}
from urllib import request, parse
data = parse.urlencode(form_data).encode()
print("Printing Parsed Form Data........")
"entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4"
You can use parse_qs from urllib.parse to turn the query string back into a Python data structure. Note that the empty entry.330812148_sentinel value is dropped unless keep_blank_values=True is passed.
>>> import urllib.parse
>>> s = 'entry.330812148_sentinel=&entry.330812148=Test1&entry.330812148=Test2&entry.330812148=Test3&entry.330812148=Test4'
>>> d1 = urllib.parse.parse_qs(s)
>>> d1
{'entry.330812148': ['Test1', 'Test2', 'Test3', 'Test4']}
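Going the other way, a plain dict cannot hold the same key four times, which is why the attempts in the question collapse to a single entry. Two ways to produce the desired string with urlencode are sketched below: a list of (key, value) pairs, or a dict with a list value together with doseq=True. The variable names are only for illustration.

```python
from urllib import parse

# Option 1: a list of (key, value) pairs keeps repeated keys and their order.
pairs = [
    ("entry.330812148_sentinel", ""),
    ("entry.330812148", "Test1"),
    ("entry.330812148", "Test2"),
    ("entry.330812148", "Test3"),
    ("entry.330812148", "Test4"),
]
encoded_from_pairs = parse.urlencode(pairs)

# Option 2: a dict whose value is a list; doseq=True emits one key=value per item.
form_data = {
    "entry.330812148_sentinel": "",
    "entry.330812148": ["Test1", "Test2", "Test3", "Test4"],
}
encoded_from_dict = parse.urlencode(form_data, doseq=True)

data = encoded_from_dict.encode()  # bytes, as in the question's code
```

An empty string is used for the sentinel because None would be encoded as the literal text "None".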

Getting data from odo.resource(source) to odo.resource(target)

I'm trying to extend the odo library with functionality to convert a GDAL dataset (raster with spatial information) to a NetCDF file.
Reading in the GDAL dataset works fine. But when creating the NetCDF file I need some metadata from the GDAL dataset (metadata that is not known yet when calling odo.odo(source, target)). How could I achieve this?
A short version of my code so far:
import odo
from odo import resource, append
import gdal
import netCDF4 as nc4
import numpy as np
@resource.register('.+\.tif')
def resource_gdal(uri, **kwargs):
    ds = gdal.Open(uri)
    # metadata I need to transfer to netcdf
    b = ds.GetGeoTransform() #bbox, interval
    return ds
@resource.register('.+\.nc')
def resource_netcdf(uri, dshape=None, **kwargs):
    ds = nc4.Dataset(uri,'w')
    # create lat lon dimensions and variables
    ds.createDimension('lat', dshape[0].val)
    ds.createDimension('lon', dshape[1].val)
    lat = ds.createVariable('lat','f4', ('lat',))
    lon = ds.createVariable('lon','f4', ('lon',))
    # create a range from the **gdal metadata**
    lat_array = np.arange(dshape[0].val)*b[1]+b[0]
    lon_array = np.arange(dshape[1].val)*b[5]+b[3]
    # assign the range to the netcdf variable
    lat[:] = lat_array
    lon[:] = lon_array
    # create the variable which will hold the gdal data
    data = ds.createVariable('data', 'f4', ('lat', 'lon',))
    return data
@append.register(nc4.Variable, gdal.Dataset)
def append_gdal_to_nc4(tgt, src, **kwargs):
    arr = src.ReadAsArray()
    tgt[:] = arr
    return tgt
Thanks!
I don't have much experience with odo, but from browsing the source code and docs it looks like resource_netcdf() should not be involved in translating GDAL data to NetCDF. Translating should be the job of a gdal_to_netcdf() function decorated with convert.register. In that case, the gdal.Dataset object returned by resource_gdal would carry all the information needed (georeferencing, pixel size) to build the NetCDF file.