Is there a performance difference between saveAs() and exportDocument()? - photoshop

When exporting an image in Photoshop, is there a performance difference between app.activeDocument.exportDocument() and app.activeDocument.saveAs()?

Based on some tests, it looks like saveAs() is about 1.5 to 2 times faster than exportDocument().

Related

BigQuery: is there any way to break a large result into smaller chunks for processing?

Hi, I am new to BigQuery. If I need to fetch a very large set of data, say more than 1 GB, how can I break it into smaller pieces for quicker processing? I will need to process the result and dump it into a file or Elasticsearch, so I need to find an efficient way to handle it. I tried the QueryRequest.setPageSize option, but that doesn't seem to work: I set it to 100 and the results don't break on every 100 records. I added this line to see how many records I get back before turning to a new page:
result = result.getNextPage();
It shows a seemingly random number of records: sometimes 1000, sometimes 400, etc.
Thanks
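The asker is using the BigQuery Java client above. Purely as an illustration of the paging pattern, and not the asker's code, here is a rough sketch of the same idea with the google-cloud-bigquery Python client; the query string and the process() handler are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT * FROM `my_project.my_dataset.big_table`")  # placeholder query

# Request pages of roughly 100 rows; a page may contain fewer rows than requested.
rows = query_job.result(page_size=100)
for page in rows.pages:
    batch = list(page)   # one page of Row objects
    process(batch)       # placeholder: write the batch to a file or Elasticsearch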
Not sure if this helps you, but in our project we have something similar: we process lots of data in BigQuery and need to use the final result later (it comes to roughly 15 GB for us when compressed).
What we did was first save the results to a table with AllowLargeResults set to True, and then export the result, compressed, into Cloud Storage using the Python API.
It automatically breaks the results into several files.
After that we have a Python script that downloads all the files concurrently, reads through the whole thing and builds some matrices for us.
I don't quite remember how long it takes to download all the files; I think it's around 10 minutes. I'll try to confirm this.
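A minimal sketch of that export step with the google-cloud-bigquery Python client, assuming the query results have already been saved to a table; the project, dataset, table and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

# A wildcard in the destination URI lets BigQuery split the export into several files.
extract_job = client.extract_table(
    "my_project.my_dataset.query_results",      # placeholder table holding the results
    "gs://my-bucket/exports/results-*.csv.gz",  # placeholder bucket/path
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes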

write dataframe to .xlsx too slow

I have a 40 MB dataframe 'dfScore' that I am writing to .xlsx.
The code is as follows:
writer = pandas.ExcelWriter('test.xlsx', engine='xlsxwriter')
dfScore.to_excel(writer,sheet_name='Sheet1')
writer.save()
The dfScore.to_excel call takes almost an hour, and writer.save() takes another hour. Is this normal? Is there a good way to get it under 10 minutes?
I have already searched Stack Overflow, but some of the suggestions don't seem to work for my problem.
Why don't you save it as .csv?
I have worked with heavier DataFrames on my personal laptop and I had the same problem with writing to xlsx.
your_dataframe.to_csv('my_file.csv',encoding='utf-8',columns=list_of_dataframe_columns)
Then you can simply convert it to .xlsx with MS Excel or an online converter.
The dfScore.to_excel call takes almost an hour, and writer.save() takes another hour. Is this normal?
That sounds too high. I ran an XlsxWriter test writing 1,000,000 rows x 5 columns and it took ~100 s. The time will vary with the CPU and memory of the test machine, but 1 hour is 36 times slower, which doesn't seem right.
Note that Excel, and thus XlsxWriter, only supports 1,048,576 rows per worksheet, so you are effectively throwing away three quarters of your data and wasting time doing it.
Is there a good way to get it under 10 minutes?
For pure XlsxWriter programs, PyPy gives a good speed-up. For example, rerunning my 1,000,000 rows x 5 columns test case with PyPy, the time went from 99.15 s to 16.49 s. I don't know if Pandas works with PyPy, though.
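For reference, a minimal sketch of a benchmark along those lines (1,000,000 rows x 5 columns written directly with XlsxWriter); the output filename and the cell values are arbitrary.

import time
import xlsxwriter

start = time.time()
workbook = xlsxwriter.Workbook('benchmark.xlsx')  # arbitrary output file
worksheet = workbook.add_worksheet()

# Write 1,000,000 rows x 5 columns of numbers.
for row in range(1000000):
    for col in range(5):
        worksheet.write_number(row, col, row + col)

workbook.close()
print('Elapsed: %.1f s' % (time.time() - start))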

How can I speed up my Microsoft Project VBA code?

I have a macro I use for Microsoft Project that loops through each task in the project and performs several checks to find any problems with the tasks. These checks include several If and Select Case statements. When dealing with large projects with more tasks, the macro can take a long time to run. Is there anything I can do to speed up the macro? I have already turned off screen updating and switched to manual calculation.
Turning off screen updating and setting calculation mode to Manual are the only application settings you can use to improve performance; the rest depends on your algorithm.
Your description of the problem is a bit vague: How large are your projects and how long does the macro take? If your projects are 1,000 tasks and you are making a dozen checks and your code takes more than five minutes, then yes, there is surely room for improvement. But if it's 20,000 tasks and 50 checks and the macro takes two minutes, stop trying to improve it--that's great performance.
Bottom line: it is impossible to tell if there is room for improvement without seeing your code.
If you use the same property (e.g. objTask.Start) in several different comparisons in your code, read the property into a local variable once and then perform your comparisons on the local variable.
For example:
Slow code:
If objTask.Start < TestDate1 And objTask.Start > TestDate2 Then ...
Fast code:
Dim dteStart As Date
dteStart = objTask.Start
If dteStart < TestDate1 And dteStart > TestDate2 Then ...
Calls to the COM object model are expensive. The second code example will be quite a bit faster, although, as noted by Rachel above, it really does depend on the volume of data being processed.
Also, make sure you define your variables with appropriate types as relying on the default Variant data type is very slow.
If you have variables that hold a lot of data, such as collections, think about setting them to Nothing at the end of your function:
Set TasksCollection = Nothing

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
We perform a transformation on 2 of the fields and write the row to BigQuery.
The transformation involves performing 3 regex operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow (see the sketch below this question).
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
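For concreteness, here is a rough sketch of a pipeline of the shape described above, written against the Apache Beam Python SDK (the question itself appears to use the Java SDK, so this is not the asker's code); the bucket, table, column positions and regexes are all placeholders.

import re
import apache_beam as beam

# Placeholder regexes standing in for the 3 (eventually ~50) operations.
PATTERNS = {
    'derived_a': re.compile(r'placeholder-pattern-1'),
    'derived_b': re.compile(r'placeholder-pattern-2'),
    'derived_c': re.compile(r'placeholder-pattern-3'),
}

class TransformLogLine(beam.DoFn):
    # Takes one CSV log line (str) and emits a dict destined for BigQuery.
    def process(self, line):
        fields = line.split(',')
        row = {'field_1': fields[3], 'field_2': fields[4]}  # placeholder column positions
        for name, regex in PATTERNS.items():
            match = regex.search(fields[3]) or regex.search(fields[4])
            row[name] = match.group(0) if match else None
        yield row

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/dfp-logs/2016-01-01*.csv.gz')
     | 'Transform' >> beam.ParDo(TransformLogLine())
     | 'Write' >> beam.io.WriteToBigQuery(
           'my_project:my_dataset.adserver_logs',
           schema='field_1:STRING,field_2:STRING,'
                  'derived_a:STRING,derived_b:STRING,derived_c:STRING'))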
You can also write the output to files and then load them into BigQuery using the command line/console. You'd probably save some money on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (it can be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M rows per minute. At 31M rows of input, that works out to ~5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time (read from GCS -> dispatch -> ...) of the graph, this looks about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach - in the mean time while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X number of tables. Might be a fun experiment. :-)
Make sense?

Best approach for bringing 180K records into an app: core data: yes? csv vs xml?

I've built an app with a tiny amount of test data (clues & answers) that works fine. Now I need to think about bringing in a full set of clues & answers, which is roughly 180K records (it's a word game). I am worried about speed and memory usage, of course. Looking around the intertubes and my library, I have concluded that this is probably a job for Core Data. Within that approach, however, I guess I can bring it in as CSV or as XML (I can create either one from the raw data using a scripting language). I found some resources about how to handle each case. What I don't know is anything about the overall speed and other issues one might expect in using CSV vs XML. The CSV file is about 3.6 MB and the data type is strings.
I know this is dangerously close to a non-question, but I need some advice as either approach requires a large coding commitment. So here are the questions:
For a file of this size and characteristics, would one expect CSV or XML to be a better approach? Is there some other format/protocol/strategy that would make more sense?
Am I right to focus on Core Data?
Maybe I should throw some fake code here so the system doesn't keep warning me about asking a subjective question. But I have to try! Thanks for any guidance. Links to discussions appreciated.
As for file size, CSV will always be smaller than an XML file, as it contains only the raw data in ASCII format. Consider the following 3 rows and 3 columns:
Column1, Column2, Column3
1, 2, 3
4, 5, 6
7, 8, 9
Compare that to its XML counterpart below, which does not even include schema information. It is also ASCII, but the rowX and ColumnX tags have to be repeated multiple times throughout the file. Compression could of course help with this, but I'm guessing that even with compression the CSV will still be smaller.
<root>
<row1>
<Column1>1</Column1>
<Column2>2</Column2>
<Column3>3</Column3>
</row1>
<row2>
<Column1>4</Column1>
<Column2>5</Column2>
<Column3>6</Column3>
</row2>
<row3>
<Column1>7</Column1>
<Column2>8</Column2>
<Column3>9</Column3>
</row3>
</root>
As for your other questions, sorry, I can't help there.
This is large enough that the I/O time difference will be noticeable, and since the CSV is (what, 10x?) smaller, the processing time difference (whichever format parses faster) will be negligible compared to the difference in reading it in. And CSV should be faster outside of I/O too.
Whether to use Core Data depends on which of its features you hope to exploit. I'm guessing the only one is querying, and it might be worth it for that; although if it's just a simple mapping from clue to answer, you might just want to read the whole CSV file into an NSMutableDictionary. Access will be faster.