What is the best way to send Arrow data to the browser? - serialization

I have Apache Arrow data on the server (Python) and need to use it in the browser. It appears that Arrow Flight isn't implemented in JS. What are the best options for sending the data to the browser and using it there?
I don't even need it necessarily in Arrow format in the browser. This question hasn't received any responses, so I'm adding some additional criteria for what I'm looking for:
Self-describing: don't want to maintain separate schema definitions
Minimal overhead: For example, an array of float32s should transfer as something compact like a data type indicator, length value and sequence of 4-byte float values
Cross-platform: Able to be easily sent from Python and received and used in the browser in a straightforward way
Surely this is a solved problem? If it is I've been unable to find a solution. Please help!

Building off of the comments on your original post by David Li, you can implement a non-streaming version what you want without too much code using PyArrow on the server side and the Apache Arrow JS bindings on the client. The Arrow IPC format satisfies your requirements because it ships the schema with the data, is space-efficient and zero-copy, and is cross-platform.
Here's a toy example showing generating a record batch on server and receiving it on the client:
Server:
from io import BytesIO
from flask import Flask, send_file
from flask_cors import CORS
import pyarrow as pa
app = Flask(__name__)
CORS(app)
#app.get("/data")
def data():
data = [
pa.array([1, 2, 3, 4]),
pa.array(['foo', 'bar', 'baz', None]),
pa.array([True, None, False, True])
]
batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
writer.write_batch(batch)
return send_file(BytesIO(sink.getvalue().to_pybytes()), "data.arrow")
Client
const table = await tableFromIPC(fetch(URL));
// Do what you like with your data
Edit: I added a runnable example at https://github.com/amoeba/arrow-python-js-ipc-example.

Related

Can I use a .txt file as a user database in Telegram? I use Telethon

So, I created a minigame bot on telegram. The bot just contains a fishing game, and it's already running. I want if a user fishes and gets a fish, the fish will be stored in a database. So the user can see what he got while fishing. Does this is require SQL?
I haven't tried anything, because I don't understand about storing data in python. If there is a tutorial related to this, please share it in the comments. Thank you
You can use anything to store user data, including text files.
The simplest approaches to storing data can be serializing a dictionary to JSON with the builtin json module:
DATABASE = 'database.json' # (name or extension don't actually matter)
import json
# loading
with open(DATABASE, 'r', encoding='utf-8') as fd:
user_data = json.load(fd)
user_data[1234] = 5 # pretend user 1234 scored 5 points
# saving
with open(DATABASE, 'w', encoding='utf-8') as fd:
json.dump(user_data, fd)
This would only support simple data-types. If you need to store custom classes, as long as you don't upgrade your Python version, you can use the built-in pickle module:
DATABASE = 'database.pickle' # (name or extension don't actually matter)
import pickle
# loading
with open(DATABASE, 'rb') as fd:
user_data = pickle.load(fd)
user_data[1234] = 5 # pretend user 1234 scored 5 points
# saving
with open(DATABASE, 'wb') as fd:
pickle.dump(user_data, fd)
Whether this is a good idea or not depends on how many users you expect your bot to have. If it's even a hundred, these approaches will work just fine. If it's in the thousands, perhaps you could use separate files per user, and still be okay. If it's more than that, then yes, using any database, including the built-in sqlite3 module, would be a better idea. There are many modules for different database engines, but using SQLite is often enough (and there are also libraries that make using SQLite easier).
Telethon itself uses the sqlite3 module to store the authorization key, and a cache for users it has seen. It's not recommended to reuse that same file for your own needs though. It's better to create your own database file if you choose to use sqlite3.
Using a txt file as database is a terrible idea, go with SQL

How do I setup my Tiingo API key in pandas?

I am trying to learn ML models for predicting stock prices and initially, I tried using DataReader
import pandas_datareader as web
df = web.DataReader('AAPL', data_source='yahoo', start='2016-01-01', end='2021-08-01')
But I get a RemoteDataError and kept hitting a dead end trying to figure it out so I tried using tiingo
https://tiingo-python.readthedocs.io/en/latest/readme.html
I read through the documentation and tried passing a dictionary with 'api_key' as a key
into my tiingo client, ie.
from tiingo import TiingoClient
client = TiingoClient()
config = {}
config['session'] = True
config['api_key'] = 'my_api_key'
client = TiingoClient(config)
The documentation says I can now use TiingoClient to make API calls, however,
RuntimeError: Tiingo API Key not provided. Please provide via environment variable or config argument.
It is quite challenging learning the ML models and its syntax but what compounds the difficulty for me is what some data-scientests consider to be trivial as they don't typically deal with gathering or scraping data. Maybe my question is trivial but I've spent about an hour trying to figure out how to import data properly for stock prices and the only method that worked for me so far is
df = web.get_data_yahoo('stock symbol')
but I would like to grasp the other ways of importing stock prices via Tiingo and DataReader so if anyone can provide explanations/tips/suggestions I'd greatly appreciate it.
EDIT: for my tiingo account I did not buy any subscription plan for using their data as I was under the impression I can access data for free with my api-key
This is what I use, but its identical to what you are using it seems.
config = {}
config['session'] = True
config['api_key'] = "key here"
client = TiingoClient(config)
Remove this line: TiingoClient()

How can I configure a specific serialization method to use only for Celery ping?

I have a celery app which has to be pinged by another app. This other app uses json to serialize celery task parameters, but my app has a custom serialization protocol. When the other app tries to ping my app (app.control.ping), it throws the following error:
"Celery ping failed: Refusing to deserialize untrusted content of type application/x-stjson (application/x-stjson)"
My whole codebase relies on this custom encoding, so I was wondering if there is a way to configure a json serialization but only for this ping, and to continue using the custom encoding for the other tasks.
These are the relevant celery settings:
accept_content = [CUSTOM_CELERY_SERIALIZATION, "json"]
result_accept_content = [CUSTOM_CELERY_SERIALIZATION, "json"]
result_serializer = CUSTOM_CELERY_SERIALIZATION
task_serializer = CUSTOM_CELERY_SERIALIZATION
event_serializer = CUSTOM_CELERY_SERIALIZATION
Changing any of the last 3 to [CUSTOM_CELERY_SERIALIZATION, "json"] causes the app to crash, so that's not an option.
Specs: celery=5.1.2
python: 3.8
OS: Linux docker container
Any help would be much appreciated.
Changing any of the last 3 to [CUSTOM_CELERY_SERIALIZATION, "json"] causes the app to crash, so that's not an option.
Because result_serializer, task_serializer, and event_serializer doesn't accept list but just a single str value, unlike e.g. accept_content
The list for e.g. accept_content is possible because if there are 2 items, we can check if the type of an incoming request is one of the 2 items. It isn't possible for e.g. result_serializer because if there were 2 items, then what should be chosen for the result of task-A? (thus the need for a single value)
This means that if you set result_serializer = 'json', this will have a global effect where all the results of all tasks (the returned value of the tasks which can be retrieved by calling e.g. response.get()) would be serialized/deserialized using the json-serializer. Thus, it might work for the ping but it might not for the tasks that can't be directly serialized/deserialized to/from JSON which really needs the custom stjson-serializer.
Currently with Celery==5.1.2, it seems that task-specific setting of result_serializer isn't possible, thus we can't set a single task to be encoded in 'json' and not 'stjson' without setting it globally for all, I assume the same case applies to ping.
Open request to add result_serializer option for tasks
A short discussion in another question
Not the best solution but a workaround is that instead of fixing it in your app's side, you may opt to just add support to serialize/deserialize the contents of type 'application/x-stjson' in the other app.
other_app/celery.py
import ast
from celery import Celery
from kombu.serialization import register
# This is just a possible implementation. Replace with the actual serializer/deserializer for stjson in your app.
def stjson_encoder(obj):
return str(obj)
def stjson_decoder(obj):
obj = ast.literal_eval(obj)
return obj
register(
'stjson',
stjson_encoder,
stjson_decoder,
content_type='application/x-stjson',
content_encoding='utf-8',
)
app = Celery('other_app')
app.conf.update(
accept_content=['json', 'stjson'],
)
You app remains to accept and respond stjson format, but now the other app is configured to be able to parse such format.

Creating a REST Handler for any of Splunk's REST endpoints

How to create a Persistent(or any for that matter) REST HANDLER for any given(inbuilt) SPLUNK REST API Endpoint? How to use PersistentServerConnectionApplication class ?
I have gone through https://gist.github.com/LukeMurphey/238004c8976804a8e79570d22721fd99 but cant figure out where to start and how to make one.
There was a great .conf presentation about REST Handlers by James Ervin from a few years ago, https://conf.splunk.com/files/2016/slides/extending-splunks-rest-api-for-fun-and-profit.pdf
Sample code is available from https://github.com/jrervin/splunk-rest-examples
James' echo example is quite straight forward. Make sure you also pay attention to the additions that are necessary in web.conf and restmap.conf.
import os
import sys
if sys.platform == "win32":
import msvcrt
# Binary mode is required for persistent mode on Windows.
msvcrt.setmode(sys.stdin.fileno(), os.O_BINARY)
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
msvcrt.setmode(sys.stderr.fileno(), os.O_BINARY)
from splunk.persistconn.application import PersistentServerConnectionApplication
class EchoHandler(PersistentServerConnectionApplication):
def __init__(self, command_line, command_arg):
PersistentServerConnectionApplication.__init__(self)
def handle(self, in_string):
return {'payload': in_string, # Payload of the request.
'status': 200 # HTTP status code
}
Suggest you just get a copy of his app and deploy it, confirm it all works, then modify if for your particular use-case.

Multiple mongoDB related to same django rest framework project

We are having one django rest framework (DRF) project which should have multiple databases (mongoDB).Each databases should be independed. We are able to connect to one database, but when we are going to another DB for writing connection is happening but data is storing in DB which is first connected.
We changed default DB and everything but no changes.
(Note : Solution should be apt for the usage of serializer. Because we need to use DynamicDocumentSerializer in DRF-mongoengine.
Thanks in advance.
While running connect() just assign an alias for each of your databases and then for each Document specify a db_alias parameter in meta that points to a specific database alias:
settings.py:
from mongoengine import connect
connect(
alias='user-db',
db='test',
username='user',
password='12345',
host='mongodb://admin:qwerty#localhost/production'
)
connect(
alias='book-db'
db='test',
username='user',
password='12345',
host='mongodb://admin:qwerty#localhost/production'
)
models.py:
from mongoengine import Document
class User(Document):
name = StringField()
meta = {'db_alias': 'user-db'}
class Book(Document):
name = StringField()
meta = {'db_alias': 'book-db'}
I guess, I finally get what you need.
What you could do is write a really simple middleware that maps your url schema to the database:
from mongoengine import *
class DBSwitchMiddleware:
"""
This middleware is supposed to switch the database depending on request URL.
"""
def __init__(self, get_response):
# list all the mongoengine Documents in your project
import models
self.documents = [item for in dir(models) if isinstance(item, Document)]
def __call__(self, request):
# depending on the URL, switch documents to appropriate database
if request.path.startswith('/main/project1'):
for document in self.documents:
document.cls._meta['db_alias'] = 'db1'
elif request.path.startswith('/main/project2'):
for document in self.documents:
document.cls._meta['db_alias'] = 'db2'
# delegate handling the rest of response to your views
response = get_response(request)
return response
Note that this solution might be prone to race conditions. We're modifying a Documents globally here, so if one request was started and then in the middle of its execution a second request is handled by the same python interpreter, it will overwrite document.cls._meta['db_alias'] setting and first request will start writing to the same database, which will break your database horribly.
Same python interpreter is used by 2 request handlers, if you're using multithreading. So with this solution you can't start your server with multiple threads, only with multiple processes.
To address the threading issues, you can use threading.local(). If you prefer context manager approach, there's also a contextvars module.