Django local server: Atomic Database from a text file - sql

I made a web app that takes in a text file, reads each line, takes the 11th character, and saves it to a SQLite3 db. How do I lock the database, or have two or more separate tables, while multiple requests are running?
I have added 'ATOMIC_REQUESTS': True to settings.py in Django,
and I tried creating temporary tables for each request, but can't figure it out. I am pretty new to Django 2.2.
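For reference, in Django 2.2 the ATOMIC_REQUESTS flag lives inside the per-database dictionary of the DATABASES setting, not as a standalone key. A minimal sketch using the stock SQLite configuration (BASE_DIR and the file name are the defaults from a fresh project, not taken from this app):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'ATOMIC_REQUESTS': True,  # wrap every view on this connection in a transaction
    }
}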
My views.py:
def home(request):
    if request.method == 'GET':
        return render(request, 'home.html')
    if request.method == 'POST':
        form = DocumentForm(data=request.POST, files=request.FILES)
        print(form.errors)
        if form.is_valid():
            try:
                f = request.FILES['fileToUpload']
            except:
                print('\033[0;93m' + "No File uploaded, Redirecting" + '\033[0m')
                return HttpResponseRedirect('/tryagain')
            print('\033[32m' + "Success" + '\033[0m')
            print('Working...')
            line = f.readline()
            while line:
                #print(line)
                mst = message_typer.messages.create_message(str(line)[11])
                line = f.readline()
        else:
            print('\033[0;91m' + "Failed to validate Form" + '\033[0m')
        return HttpResponseRedirect('/output')
    return HttpResponse('Failure')
def output(request):
    s = message_typer.messages.filter(message='s').count()
    A = message_typer.messages.filter(message='A').count()
    d = message_typer.messages.filter(message='d').count()
    E = message_typer.messages.filter(message='E').count()
    X = message_typer.messages.filter(message='X').count()
    P = message_typer.messages.filter(message='P').count()
    r = message_typer.messages.filter(message='r').count()
    B = message_typer.messages.filter(message='B').count()
    H = message_typer.messages.filter(message='H').count()
    I = message_typer.messages.filter(message='I').count()
    J = message_typer.messages.filter(message='J').count()
    R = message_typer.messages.filter(message='R').count()
    message_types = {'s': s, 'A': A, 'd': d, 'E': E, 'X': X, 'P': P,
                     'r': r, 'B': B, 'H': H, 'I': I, 'J': J, 'R': R}
    output = {'output': message_types}
    #return HttpResponse('Received')
    message_typer.messages.all().delete()
    return render(request, 'output.html', output)
When the web page loads, it should display a simple breakdown of the characters in the 11th position of the uploaded text file.
However, if two requests run concurrently, the first page that makes the request gets an OperationalError: database is locked.
The traceback points to this line:
message_typer.messages.all().delete()
The second page then shows the combined totals of the two files that were uploaded.
I do want to wipe the table afterwards, so that the next user starts with an empty table to populate and count from.
Is there a better way?

Related

How can I create reliable flask-SQLAlchemy interactions with server-side-events?

I have a flask app that is functioning to expectations, and I am now trying to add a message notification section to my page. The difficulty I am having is that the database changes I am trying to rely upon do not seem to be updating in a timely fashion.
The HTML code is elementary:
<ul id="out" cols="85" rows="14">
</ul><br><br>
<script type="text/javascript">
var ul = document.getElementById("out");
var eventSource = new EventSource("/stream_game_channel");
eventSource.onmessage = function(e) {
ul.innerHTML += e.data + '<br>';
}
</script>
Here is the msg write code that the second user is executing. I know the code block is run because the redis trigger is properly invoked:
msg_join = Messages(game_id=game_id[0],
                    type="gameStart",
                    msg_from=current_user.username,
                    msg_to="Everyone",
                    message=f'{current_user.username} has requested to join.')
db.session.add(msg_join)
db.session.commit()

channel = str(game_id[0]).zfill(5) + 'startGame'
session['channel'] = channel
date_time = datetime.utcnow().strftime("%Y/%m/%d %H:%M:%S")
redisChannel.set(channel, date_time)
Here is the Flask stream code, which is correctly triggered by a new Redis time, but when I pull the list of messages, the new message that the second user has added is not yet accessible:
@games.route('/stream_game_channel')
def stream_game_channel():
    @stream_with_context
    def eventStream():
        channel = session.get('channel')
        game_id = int(left(channel, 5))
        cnt = 0
        while cnt < 1000:
            print(f'cnt = 0 process running from: {current_user.username}')
            time.sleep(1)
            ntime = redisChannel.get(channel)
            if cnt == 0:
                msgs = db.session.query(Messages).filter(Messages.game_id == game_id)
                msg_list = [i.message for i in msgs]
                cnt += 1
                ltime = ntime
                lmsg_list = msg_list
                for i in msg_list:
                    yield "data: {}\n\n".format(i)
            elif ntime != ltime:
                print(f'cnt > 0 process running from: {current_user.username}')
                time.sleep(3)
                msgs = db.session.query(Messages).filter(Messages.game_id == game_id)
                msg_list = [i.message for i in msgs]
                new_messages = # need to write this code still
                ltime = ntime
                cnt += 1
                yield "data: {}\n\n".format(msg_list[len(msg_list) - len(lmsg_list)])
    return Response(eventStream(), mimetype="text/event-stream")
The symptom I am running into is that msg_list stays exactly the same length (i.e. the newly pushed message is not read when I expect it to be). Strangely, the second user's session does appear to access this information, because its stream correctly reflects the addition.
I am using an Amazon RDS MySQL database.
The solution was to call db.session.commit() before my db.session.query(Messages).filter(...), even though no writes were pending. This enabled an immediate read of data committed from a different user session, and my code then reacted to the change in message-list length properly.
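A minimal sketch of that fix, reusing the db, Messages, and game_id names from the snippets above:

# No pending writes here; the commit only ends the session's stale transaction.
db.session.commit()
# The next query starts a fresh transaction and can see rows committed by other sessions.
msgs = db.session.query(Messages).filter(Messages.game_id == game_id).all()
msg_list = [m.message for m in msgs]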

insert_many in pymongo not persisting

I'm having some issues with persisting documents with pymongo when using insert_many.
I'm handing over a list of dicts to insert_many and it works fine from inside the same script that does the inserting. Less so once the script has finished.
def row_to_doc(row):
    rowdict = row.to_dict()
    for key in rowdict:
        val = rowdict[key]
        if type(val) == float or type(val) == np.float64:
            if np.isnan(val):
                # If we want a SQL style document collection
                rowdict[key] = None
                # If we want a NoSQL style document collection
                # del rowdict[key]
    return rowdict

def dataframe_to_collection(df):
    n = len(df)
    doc_list = []
    for k in range(n):
        doc_list.append(row_to_doc(df.iloc[k]))
    return doc_list

def get_mongodb_client(host="localhost", port=27017):
    return MongoClient(host, port)

def create_collection(client):
    db = client["material"]
    return db["master-data"]

def add_docs_to_mongo(collection, doc_list):
    collection.insert_many(doc_list)

def main():
    client = get_mongodb_client()
    csv_fname = "some_csv_fname.csv"
    df = get_clean_csv(csv_fname)
    doc_list = dataframe_to_collection(df)
    collection = create_collection(client)
    add_docs_to_mongo(collection, doc_list)
    test_doc = collection.find_one({"MATERIAL": "000000000000000001"})
When I open up another python REPL and start looking through the client.material.master_data collection with collection.find_one({"MATERIAL": "000000000000000001"}) or collection.count_documents({}) I get None for the find_one and 0 for the count_documents.
Is there a step where I need to call some method to persist the data to disk? db.collection.save() in the mongo client API sounds like what I need but it's just another way of inserting documents from what I have read. Any help would be greatly appreciated.
The problem was that I was getting my collection via client.db_name.collection_name, and that is not the same collection I was creating in my code. client.db_name["collection-name"] solved my issue. Weird.
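A short illustration of the mismatch, assuming the same local MongoDB instance as in the script above:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)

# Attribute access can only spell a valid Python identifier, so this reads a
# collection literally named "master_data" -- not the hyphenated one created above.
wrong = client.material.master_data

# Bracket access matches the "master-data" name used in create_collection().
right = client.material["master-data"]
print(right.count_documents({}))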

Combine two TTS outputs in a single mp3 file not working

I want to combine two requests to the Google cloud text-to-speech API in a single mp3 output. The reason I need to combine two requests is that the output should contain two different languages.
The code below works fine for many language pair combinations, but unfortunately not for all. If I request e.g. a sentence in English and one in German and combine them, everything works. If I request one in English and one in Japanese, I can't combine the two files into a single output. The output only contains the first sentence, and instead of the second sentence it outputs silence.
I tried now multiple ways to combine the two outputs but the result stays the same. The code below should show the issue.
Please run the code first with:
python synthesize_bug.py --t1 'Hallo' --code1 de-De --t2 'August' --code2 de-De
This works perfectly.
python synthesize_bug.py --t1 'Hallo' --code1 de-De --t2 'こんにちは' --code2 ja-JP
This doesn't work. The single files are ok, but the combined files contain silence instead of the Japanese part.
Also, if used with two Japanese sentences, everything works.
I already filed a bug report at Google with no response yet, but maybe it's just me who is doing something wrong here with encoding assumptions. Hope someone has an idea.
#!/usr/bin/env python
import argparse

# [START tts_synthesize_text_file]
def synthesize_text_file(text1, text2, code1, code2):
    """Synthesizes speech from the input file of text."""
    from apiclient.discovery import build
    import base64

    service = build('texttospeech', 'v1beta1')
    collection = service.text()

    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = '<speak><break time="2s"/></speak>'
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'

    request = collection.synthesize(body=data1)
    response = request.execute()
    audio_pause = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_pause = response['audioContent']

    ssmlLine = '<speak>' + text1 + '</speak>'
    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = ssmlLine
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'

    request = collection.synthesize(body=data1)
    response = request.execute()
    # The response's audio_content is binary.
    with open('output1.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent'].decode('UTF-8')))
        print('Audio content written to file "output1.mp3"')
    audio_text1 = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_text1 = response['audioContent']

    ssmlLine = '<speak>' + text2 + '</speak>'
    data2 = {}
    data2['input'] = {}
    data2['input']['ssml'] = ssmlLine
    data2['voice'] = {}
    data2['voice']['ssmlGender'] = 'MALE'
    data2['voice']['languageCode'] = code2 #'ko-KR'
    data2['audioConfig'] = {}
    data2['audioConfig']['speakingRate'] = 0.8
    data2['audioConfig']['audioEncoding'] = 'MP3'

    request = collection.synthesize(body=data2)
    response = request.execute()
    # The response's audio_content is binary.
    with open('output2.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent'].decode('UTF-8')))
        print('Audio content written to file "output2.mp3"')
    audio_text2 = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_text2 = response['audioContent']

    result = audio_text1 + audio_pause + audio_text2
    with open('result.mp3', 'wb') as out:
        out.write(result)
        print('Audio content written to file "result.mp3"')

    raw_result = raw_text1 + raw_pause + raw_text2
    with open('raw_result.mp3', 'wb') as out:
        out.write(base64.b64decode(raw_result.decode('UTF-8')))
        print('Audio content written to file "raw_result.mp3"')
# [END tts_synthesize_text_file]

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('--t1')
    parser.add_argument('--code1')
    parser.add_argument('--t2')
    parser.add_argument('--code2')
    args = parser.parse_args()

    synthesize_text_file(args.t1, args.t2, args.code1, args.code2)
You can find the answer here:
https://issuetracker.google.com/issues/120687867
Short answer: it's not clear why it is not working, but Google suggests a workaround: first write the files as .wav, combine them, and then re-encode the result to mp3.
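A sketch of that workaround in Python, assuming each snippet is requested with audioEncoding 'LINEAR16' (i.e. WAV) instead of 'MP3', and that pydub with ffmpeg is installed; the file names mirror the script above:

from pydub import AudioSegment

# Decode the per-language WAV files, insert the 2-second pause, and re-encode
# the concatenated audio to MP3 in a single step.
part1 = AudioSegment.from_wav("output1.wav")
pause = AudioSegment.silent(duration=2000)  # milliseconds
part2 = AudioSegment.from_wav("output2.wav")

(part1 + pause + part2).export("result.mp3", format="mp3")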
I have managed to do this in NodeJS with just one function (I don't know how optimal it is, but at least it works). Maybe you can take inspiration from it.
I have used the memory-streams dependency from npm:
var streams = require('memory-streams');

function mergeAudios(audios) {
    var reader = new streams.ReadableStream();
    var writer = new streams.WritableStream();
    audios.forEach(element => {
        if (element instanceof streams.ReadableStream) {
            element.pipe(writer);
        } else {
            writer.write(element);
        }
    });
    reader.append(writer.toBuffer());
    return reader;
}
The input parameter is a list containing either ReadableStream objects or response.audioContent values from the synthesizeSpeech operation. If an element is a ReadableStream, it is piped into the writer; if it is audio content, it is written with the write method. At the end, all the content is appended to a single ReadableStream.

Loop on scrapy FormRequest but only one item created

So I've tried to loop on a FormRequest that calls my function which creates, fills, and yields the item. Only problem: one and only one item is created, no matter how many times it loops, and I can't figure out why.
def access_data(self, res):
    # receive all IDs and request the infos
    res_json = (res.body).decode("utf-8")
    res_json = json.loads(res_json)
    for a in res_json['data']:
        logging.warning(a['id'])
        req = FormRequest(
            url='https://my_url',
            cookies={my_cookies},
            method='POST',
            callback=self.fill_details,
            formdata={'valeur': str(a['id'])},
            headers={'X-Requested-With': 'XMLHttpRequest'}
        )
        yield req

def fill_details(self, res):
    logging.warning("annonce")
    item = MyItem()
    item['html'] = res.xpath('//body//text()')
    item['existe'] = True
    item['ip_proxy'] = None
    item['launch_time'] = str(mySpider.init_time)
    yield item
To be sure everything is clear:
When I run this, the log "annonce" is printed only once, while my logging of a['id'] in the request loop is printed many times, and I can't find a way to fix this.
I found the way!
If anyone has the same problem: as my URL is always the same (only the formdata changes), Scrapy's duplicate filter takes over and drops the duplicate requests.
Set dont_filter=True in the FormRequest to make it work.
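Applied to the request built in access_data() above, the fix is just the extra keyword argument:

req = FormRequest(
    url='https://my_url',
    cookies={my_cookies},
    method='POST',
    callback=self.fill_details,
    formdata={'valeur': str(a['id'])},
    headers={'X-Requested-With': 'XMLHttpRequest'},
    dont_filter=True,  # skip Scrapy's duplicate-request filter for this URL
)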

Possible to speed up this algorithm?

I am trying to speed up a Ruby algorithm. I have a Rails app that uses Active Record and Nokogiri to visit a list of URLs in a database, scrape the main image from each page, and save it under the image attribute associated with that URL.
This Rails task usually takes about 2:30 to complete, and I am trying to speed it up as a learning exercise. Would it be possible to use C through RubyInline and raw SQL to achieve the desired result? My only issue is that if I use C, I lose the database connection that Active Record provided in Ruby, and I have no idea how to write SQL queries in conjunction with the C code that will properly connect to my db.
Has anyone had experience with this, or does anyone know if it's even possible? I'm doing this primarily as a learning exercise. Here is the code that I want to translate into C and SQL, if you are interested:
task :getimg => :environment do
  stories = FeedEntry.all
  stories.each do |story|
    if story.image.nil?
      url = story.url
      doc = Nokogiri::HTML(open(url))
      if doc.at_css(".full-width img")
        img = doc.at_css(".full-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-width img")
        img = doc.at_css(".body-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-narrow-width img")
        img = doc.at_css(".body-narrow-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".caption img")
        img = doc.at_css(".caption img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnnArticleGalleryPhotoContainer img")
        img = doc.at_css(".cnnArticleGalleryPhotoContainer img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_strylftcntnt div img")
        img = doc.at_css(".cnn_strylftcntnt div img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_stryimg640captioned img")
        img = doc.at_css(".cnn_stryimg640captioned img")[:src]
        story.image = img
        story.save!
      end
    else
      # do nothing
    end
  end
end
I would appreciate any and all help and insights in this matter. Thank you in advance!!
Speed of DB Saving
I've written a web crawler in Ruby, and I found that one of the bottlenecks that can affect performance is the actual creation of rows in the database. It's faster to do a single mass insert at the end of extracting all the URLs than to do many individual inserts (at least for Postgres).
So instead of calling YourModel.save! for every URL you visit, push every URL to an array that keeps track of the ones you need to save to the database. Then, once you've finished scraping all the links, do a mass insert of all the image links through a single SQL command.
to_insert = []
stories.each do |story|
  url = story.url
  doc = Nokogiri::HTML(open(url))
  img_url = doc.at_css("img")[:src]
  to_insert.push "(#{img_url})"
end
# notice the mass insert at the end
sql = "INSERT INTO your_table (img_url) VALUES #{to_insert.join(", ")}"
# CONN is a constant declared at the top of your file (CONN = ActiveRecord::Base.connection)
# that connects to the database
CONN.execute sql
"Speed Up" Downloading
The downloading of links will also be a bottleneck. Thus, the best option would be to create a thread pool, where each thread is allocated a partition of URLs from the database to scrape. This way, you will never be stuck waiting for a single page to download before you do any real processing.
Some pseudo-ish Ruby code:
number_of_workers = 10
(1..number_of_workers).each do |worker|
  Thread.new do
    begin
      urls_to_scrape_for_this_thread = [...list of urls to scrape...]
      while urls_to_scrape_for_this_thread.size > 0
        url = take_one_url_from_list
        scrape(url)
      end
    rescue => e
      puts "========================================"
      puts "Thread # #{worker} error"
      puts "#{e.message}"
      puts "#{e.backtrace}"
      puts "======================================="
      raise e
    end
  end
end
Are the URLs remote? If so, first benchmark to measure the network latency. If that's the bottleneck, there isn't much you can do in your code or with your choice of language.
How many FeedEntry records do you have in your database? I suggest using FeedEntry.find_each instead of FeedEntry.all.each, because the former loads 1000 entries into memory, processes them, then loads the next 1000 entries, and so on, while the latter loads all entries into memory and then iterates over them, which requires more memory and increases GC cycles.
If the bottleneck is neither of the above, then maybe it's the DOM node searching that is slow. You can find the (only one?) img node, then check its parent node or grandparent node if necessary, and update your entries accordingly.
image_node = doc.at_css('img')
story.update image: image_node['src'] if needed?(image_node)

def needed?(image_node)
  parent_node = image_node.parent
  parent_class = image_node.parent['class']
  return true if parent_class == 'full-width'
  return true if parent_class == 'body-width'
  return true if parent_class == 'body-narrow-width'
  return true if parent_class == 'caption'
  return true if parent_class == 'cnnArticleGalleryPhotoContainer'
  return true if parent_class == 'cnn_stryimg640captioned'
  return false unless parent_node.name == 'div'
  return true if parent_node.parent['class'] == 'cnn_strylftcntnt'
  false
end