Do the children in a Shopify bulk operation JSONL with nested connections come right after their parent? - shopify

When parsing a bulk operation JSONL file with nested items from top to bottom line by line, when I reach a new top level parent object, does that mean I've gone through all children of the previous parent?
Context
When processing a bulk operation JSONL file, I do some processing that requires having a parent and all of their children. I'd like to keep my memory requirements as small as possible, so I need to know when I'm done processing an object and all of its children.
Example for clarification
Using the documentation page's JSONL example:
{"id":"gid://shopify/Product/1921569226808"}
{"id":"gid://shopify/ProductVariant/19435458986123","title":"52","__parentId":"gid://shopify/Product/1921569226808"}
{"id":"gid://shopify/ProductVariant/19435458986040","title":"70","__parentId":"gid://shopify/Product/1921569226808"}
{"id":"gid://shopify/Product/1921569259576"}
{"id":"gid://shopify/ProductVariant/19435459018808","title":"34","__parentId":"gid://shopify/Product/1921569259576"}
{"id":"gid://shopify/Product/1921569292344"}
{"id":"gid://shopify/ProductVariant/19435459051576","title":"Default Title","__parentId":"gid://shopify/Product/1921569292344"}
{"id":"gid://shopify/Product/1921569325112"}
{"id":"gid://shopify/ProductVariant/19435459084344","title":"36","__parentId":"gid://shopify/Product/1921569325112"}
{"id":"gid://shopify/Product/1921569357880"}
{"id":"gid://shopify/ProductVariant/19435459117112","title":"47","__parentId":"gid://shopify/Product/1921569357880"}
If I'm reading the file line by line from top to bottom and I hit Product with id gid://shopify/Product/1921569259576 on line 4, does this mean that I've already seen all of the previous product's (gid://shopify/Product/1921569226808) product variants the JSONL file contains?

What a lot of people do is shell out and use tac to reverse the file. Then when you parse the file you end up nice processing the children first and then knowing when you hit the parent, you have everything and you can move.
Obviously this is nicer than getting the parent and then the children, and then wondering, have I hit all the children or are there more.
Try it! It works!
My pseudo code (which you can convert to whatever scripting language you want looks like this:
inventory_file = Tempfile.new
inventory_file.binmode
uri = URI(result.data.node.url)
IO.copy_stream(uri.open, inventory_file) # store large amount of JSON Lines data in a tempfile
inventory_file.rewind # move from EOF to beginning of file
y = "#{Rails.root}/tmp/#{shop_domain}.reversed.jsonl"
`tac #{inventory_file.path} > #{y}`
puts "Completed Reversal of file using tac, so now we can quickly iterate the inventory"
f = File.foreach(y)
variants = {}
f.each_entry do |line|
data = JSON.parse(line)
# play with my data
end

Related

GridFs read PDF

I am trying to build a financial dashboard with Flask and pymongo. The starting point is a flask form which saves data in a MongoDB database. One of the fields in the form is a FileField (wtforms) which allows the upload of a PDF, which is then stored in MongoDB with GridFS.
Now I manage to save the pdf and I can see the resulting entries within the .files and .chunks collections. Now I would like to build a function that retrieves the PDFs and analyses them with some basic NLP, however I struggle with the getting meaningful data.
When I do:
storage = gridfs.GridFS(db, collection)
data = storage.get('some id')
a = data.read()
The result is a binary file. If I continue with:
with open(data, 'rb') as f:
b = f.read()
The result is "ValueError: embedded null byte or sometimes an empty "byte string".
Any help on this?
To follow up on the above, I found a solution for myself that consists in 2 separate functions:
(1) Upon upload of the form and before uploading the files to MongoDB, I apply a function based on pdfminer that extracts the string content of the PDF and tranform it into a list of sentences using NLTK. I will then store this list in the .files via the storage.put(file, sent_list = sent_list) #sent_list being the variable name of the list of sentences.
Whenever I wish to run NLP operations on the file, I will just call the "sent_list" variable from mongodb.
(2) If I wish to display the stored pdf in its original content however, I included the following function as a separate route.
storage = GridFS(db, collection)
data = storage.get_last_version(filename)
response = make_response(data.read())
extension = data.filename.split('.')[-1]
response.headers['Content-Type'] = f'application/{extension}'
response.headers['Content-Disposition'] = f'inline; filename={data.filename}'
return response
(2) will open a new tab in my flask app showing the .pdf file in its original format.
I hope this helps anyone coming across a similar problem in the future.

Is there a way to list the directories in a using PySpark in a notebook?

I'm trying to see every file is a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextfile or sc.textfile. I wanted to just get the filenames from them, and then pull the file if needed in a different cell. I can access the files just fine using Cyberduck and it shows the names on there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, But I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all in via sc.textFile or sc.Wholetextfile. Both those functions work, so I know my keys are correct, but it takes too long for them to be loaded.
Considering that the list of files can be retrieve by one single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns a tuple (path, content) but I don't know if the file content is lazy to get only the first part of the tuple.

How does the Global Image ID work ? - Linking Dual-EELS datasets

Is there a way to script a pair of dual-EELS SIs so that DM recognises them as siblings (say, for use in the "SI->Align SI by Peak" menu option)?
I have a script which performs a transformation on a pair of dualEELS SIs where the results are given in new images with all the tags copied from the original. The new SIs do not seem to be acknowledged by the SI options as a pair, however. A MWE of how this occurs is below:
image a, b
GetTwoLabeledImagesWithPrompt("Get SI", \
"Get DualEELS SIs", \
"Low-Loss", a, \
"High-Loss", b)
image LL, HL
LL := a.ImageClone()
HL := b.ImageClone()
LL.ShowImage()
HL.ShowImage()
Assuming that the two inputs are real dualEELS SIs. Attempting afterward to run a method like "SI->Align SI by Peak" on the outputs does not recognise the second SI as a sibling.
I suspect that my issue is in properly assigning the four EELS:Dual acquire sibling:UID tags highlighted in the image provided, however I have no idea how (or if) these are accessible from the scripting language.
Thanks in advance for any help you are able to render.
Yes, "siblings" in DigitalMicrograph are recognized by tag-checks and tagging of the Unique Image Id (UID).
Depending on the exact application/plugin, there might be additional tag-checks before a sibling can be accepted (i.e. "is it EELS data?", "is it spatially compatible?", etc.) but he prime-mechanism is using the UID.
A bit of background info
The UID is a set of four long-numbers generated at random whenever a new image data is created, and then subsequently stored with the data. It is "unique" by the assumption that a set of four randomly generated 8-byte longs are "unique".
If you create an image, save it to disc, and open it, the UID will be the same. (It is stored with the data.)
If you ImageClone() an image, it gets a new UID.
If you copy a image file on the hard drive and rename it, it will retain the UID.
Creating and using UIDs
The commands to get an image's UID are described in the F1 help documentation here:
And the example section even has a script showing how one uses UID to "link" data together:

Elm project scaling: separating Messages/Updates

I am trying to separate files in an Elm project, as keeping everything in global Model, Messages, etc. would be just a mess.
Here is how I tried it so far:
So, there are some global files, and then Header has its own files. However I keep getting error, when importing Header.View into my global View:
The 1st and 2nd entries in this list are different types of values.
Which kind of makes sense:
The 1st entry has this type:
Html Header.Messages.Msg
But the 2nd is:
Html Msg
So, my question is whether all the messages (from all my modules, like Header) needs to be combined somehow in global Messages.elm? Or there is a better way of doing this?
My advice would be to keep messages and update in 1 file until that feels uncomfortable (for you to decide how many lines of code that means - see Evan's Elm Europe talk for more on the modules flow). When you want to break something out, define a new message in Main
type Msg
= HeaderMsg Header.Msg
| ....
Then use Cmd.map HeaderMsg in your update function and Html.map HeaderMsg in your view function to connect up your sub-components

Variable does not save event.response from network.request -Lua

I am trying to create a Lua program using Sublime and Corona. I want to fetch a webpage, use a pattern to extract certain text from the page, and then save the extracted text into a table. I am using the network.request method provided by Corona
The problem: The extracted text is not saving into the global variable I created. Whenever I try to reference it or print it outside of the function it returns nil. Any ideas why this is happening?
I have attached a screen shot of my event.response output. This is what I want to be saved into my Lua table
Event.response Output
Here is my code:
local restaurants = {}
yelpString = ""
--this method tells the program what to do once the website is retrieved
local function networkListener( event )
if ( event.isError ) then
print( "Network error: ", event.response )
else
yelpString = event.response
--loops through the website to find the pattern that extracts
restaurant names and prints it out
for i in string.gmatch(yelpString, "<span >(.-)<") do
table.insert(restaurants, i)
print(i)
end
end
end
-- retrieves the website
network.request( "https://www.yelp.com/search?
cflt=restaurants&find_loc=Cleveland%2C+OH%2C+US", "GET", networkListener )
This sounds like a scoping problem. From the output you give, it looks like networkListener is being called, and you are successfully adding the text into the restaurants table. Moreover, since you define restaurants as a table, it should be a table when you reference it, not nil. So by deduction, the issue must be that you are trying to access the restaurants table from somewhere where it is not in scope.
If you declare restaurants as "local" in the top level of a file (i.e. not within a function or a block), it will be accessible to the whole file, but it won't be accessible to anything outside the file. So the table.insert(restaurants, i) in your code will work, but if you try to reference restaurants from somewhere outside the file, it will be nil. I'm guessing this is the cause of the problems that you are running into.
For more details on scope, have a look at the Programming in Lua book. The book is for Lua 5.0, but the scoping rules for local variables have not changed in later versions of Lua (as of this writing, the latest is Lua 5.3).