I’m trying to decide the best way to load data into my app. It’s basically a book, but I want control over the chapter line numbers and chapter names (so I can add comments and notes under the relevant lines). Both options below allow me to do this. There are going to be about 25 large-ish chapters.
What would be best overall on the iPhone platform? The data is already on my computer; I just need to select which format is best.
I think things like memory management and perhaps other limitations of the iPhone need to be considered.
Are there any other factors which need to be taken into account?
Thanks guys,
OK, so here are the two possible options for loading the data:
XML:
<toolTipsBook>
  <chapter index="1" name="Chapter Name">
    <line index="1" text="line text here"/>
    <line index="2" text="line text here"/>
    <line index="3" text="line text here"/>
    <line index="4" text="line text here"/>
    <line index="5" text="line text here"/>
    <line index="6" text="line text here"/>
    <line index="7" text="line text here"/>
  </chapter>
</toolTipsBook>
SQL Dump
-- Chapter 1 (Chapter 1)
INSERT INTO `book_text` (`index`, `chapter`, `line`, `text`) VALUES
(1, 1, 1, ' line text here '),
(2, 1, 2, ' line text here '),
(3, 1, 3, ' line text here '),
(4, 1, 4, ' line text here '),
(5, 1, 5, ' line text here '),
(6, 1, 6, ' line text here '),
(7, 1, 7, ' line text here ');
Apple's plist format is a good choice for hierarchical data on the iPhone. It's XML, but it's supported by Foundation, so importing is as easy as [NSDictionary dictionaryWithContentsOfFile:...].
I would suggest splitting everything into chapters and only keeping one or two loaded at a time if you're worried about memory.
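For example, a minimal sketch of that approach, assuming one plist per chapter (the file name and the "name"/"lines" keys are just an illustrative layout, not anything required):
// Chapter1.plist (hypothetical layout): a dictionary with a "name" string and a "lines" array of strings.
NSString *path = [[NSBundle mainBundle] pathForResource:@"Chapter1" ofType:@"plist"];
NSDictionary *chapter = [NSDictionary dictionaryWithContentsOfFile:path];
NSString *chapterName = [chapter objectForKey:@"name"];
NSArray *lines = [chapter objectForKey:@"lines"];   // one string per line of the chapter
NSString *firstLine = [lines objectAtIndex:0];      // line numbers are just array indices
When the reader moves to another chapter, discard the dictionary and load the next file, so only one chapter's text sits in memory at a time.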
In Microsoft Access 2016 (build 16.0.8201.2200), the VBA TransferSpreadsheet method does not work properly when the number format in Windows 10 is customized; specifically, on a computer with the US region selected, if you swap the "decimal symbol" and "digit grouping symbol" so that numbers are formatted as is customary in Germany.
When I use TransferSpreadsheet to save a query and subsequently attempt to open that workbook in Excel, it says:
We found a problem with some content in '...'. Do you want us to try to recover as much as we can?
When I do, I get the following warning:
Excel was able to open the file by repairing or removing the unreadable content.
When I look at the XLSX contents, I'm not surprised it's having a problem, because the internal XML is not valid. Because I've set the decimal separator to "," in Windows, it's creating the numbers in the XML with commas instead of decimal points. But the XML standards dictate that, regardless of your regional preferences, numbers in XML use "." as the decimal symbol.
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<dimension ref="A1:K20"/>
<sheetViews>...</sheetViews>
<sheetFormatPr defaultRowHeight="15"/>
<sheetData>
<row outlineLevel="0" r="1">...</row>
<row outlineLevel="0" r="2">
...
<c r="D2" s="0">
<v>2,9328903531E+16</v>
</c>
<c r="E2" s="0">
<v>5,404939826E+16</v>
</c>
<c r="F2" s="0">
<v>2,3923963705E+16</v>
</c>
...
</row>
...
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
While the "," might be the desired format for decimal symbol in the UI, the XLSX internal format must conform to XML standard, "." decimal symbol.
How do I solve this?
Bottom line: for the TransferSpreadsheet method to work correctly, if you want to change the formatting of numbers, do not use the "Customize Format" settings to swap the symbols.
You should instead reset those values back to their defaults, and then select an appropriate region in the Region dialog, one that formats numbers as you prefer.
Having chosen a region that formats numbers as desired, you thereby avoid the TransferSpreadsheet bug, and the spreadsheet will appear correctly in Excel.
And the XLSX will be formatted properly, too:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<dimension ref="D3:F3"/>
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="F12" sqref="F12"/>
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"/>
<cols>
<col min="4" max="6" width="26.85546875" style="1" bestFit="1" customWidth="1"/>
</cols>
<sheetData>
<row r="3" spans="4:6" x14ac:dyDescent="0.25">
<c r="D3" s="1">
<v>2.9328903531E+16</v>
</c>
<c r="E3" s="1">
<v>5.40493826E+16</v>
</c>
<c r="F3" s="1">
<v>2.3923963705E+16</v>
</c>
</row>
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>
I'm trying to print from a UWP app, following this link.
When I save it as a PDF file, it prints normally, and I'm able to copy the letters as well. But when I paste them somewhere else, I get something like this: ""
I've tried different fonts as well, but it didn't help.
Here is the XAML I'm trying to print:
<Grid x:Name="PrintableArea" Background="White">
<StackPanel x:Name="TextContent">
<TextBlock TextAlignment="Center" FontFamily="Arial" FontWeight="Bold">
This is Test
</TextBlock>
</StackPanel>
</Grid>
How to fix it?
Whatever you are using to create the PDF is clearly not writing a ToUnicode CMap into the file.
PDF files usually only embed a subset of a font in order to keep the size down. This generally means that the Encoding applied to the font is non-standard (and it generally isn't ASCII anyway). So for example if you have the text "Hello World" then the character codes would be assigned so that "H" = 1, "e" = 2 and so on.
If you copy and paste that, then you get 1, 2, 3, 3, 4, 5, 6, 4, 7, 3, 8 which will appear as binary.
A PDF file may contain a ToUnicode CMap, which maps the character codes to Unicode code points, and a PDF viewer application can use that to copy the Unicode code points instead of the character codes, which permits sane copy/paste. But it's optional: the original design goal of PDF was a portable format, one that looks the same on every consumer; the designers didn't have editing or copying in mind.
This seems like such a basic thing to do in Python that it should almost be a default option. I have a text file with lines such as
123, [12, 23, 45, 67]
The second array is variable in length. How do I read this in? For whatever reason I cannot find a single piece of documentation on how to deal with '[' or ']', which one might argue are among the most basic characters in Python.
np.loadtxt was a bust; apparently it is only for the simplest of file formats.
np.genfromtxt was a bust, due to the missing columns. BTW, one would like to believe the missing_value functionality could be helpful here. It would be useful to know what, if anything, missing_value actually does (it is not explained clearly in the documentation at all).
I tried the np.fromstring route which gives me
['123', '[12', '23', '45', '67]']
Presumably I could parse this item by item to deal with the '[' and ']', but at that point I have essentially written my own Python file reader just to read in a fairly basic Python construct!
As for the desired output, at this stage I would settle for almost anything. The obvious construct would be line by line of the form
[123, [12, 23, 45, 67]]
loadtxt and genfromtxt parse a line, starting with a simple split.
In [360]: '123, [12, 23, 45, 67]'.split(',')
Out[360]: ['123', ' [12', ' 23', ' 45', ' 67]']
then they try to convert the individual strings. Some convert easily to ints or floats. The ones with [ and ] don't. Handling those is not trivial.
The csv reader that comes with Python can handle quoted text, e.g.
`one, "twenty, three", four'
I have not played with it enough to know whether it can treat [] as quotes or not.
Your bracketed text is easier to parse if you use a different delimiter outside the brackets than inside them, e.g.
In [371]: l1='123; [12, 23, 45, 67]'.split(';')
In [372]: l1
Out[372]: ['123', ' [12, 23, 45, 67]']
In [373]: l2=l1[1].strip().strip(']').strip('[').split(',')
In [374]: l2
Out[374]: ['12', ' 23', ' 45', ' 67']
As Warren commented, plain CSV is something of an industry standard, and it is used in many languages. The use of brackets and the like has not been standardized. But there are data-exchange languages like XML, JSON and YAML, as well as non-text data formats (e.g. HDF5).
JSON example:
In [377]: json.loads('[123, [12, 23, 45, 67]]')
Out[377]: [123, [12, 23, 45, 67]]
The default option is eval. It lets you evaluate Python expressions in strings. It's a security hazard, though; see e.g. this question. But ast.literal_eval should be okay. For example:
from ast import literal_eval
with open("name of file") as fh:
    data = [literal_eval(line) for line in fh]
For the sample line above, each element of data is then the tuple (123, [12, 23, 45, 67]).
I'm trying to index my Nutch-crawled data by running:
bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
At first it was working totally OK: I indexed my data, sent a few queries and received good results. But then I ran the crawl again so that it would fetch more pages, and now when I run the nutch index command, I get:
java.io.IOException: Job failed!
here is my hadoop log:
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
What I realize is that the mentioned page must contain a really long term.
So in schema.xml (in Nutch) and managed-schema (in Solr) I changed the type of "id", "content", and "text" from "strings" to "text_general".
But it didn't solve the problem.
I'm no expert, so I'm not sure how to correct the analyzer without screwing up something else. I've read that I can:
1. use (in the index analyzer) a LengthFilterFactory in order to filter out those tokens that don't fall within a requested length range;
2. use (in the index analyzer) a TruncateTokenFilterFactory to fix the maximum length of indexed tokens.
But there are so many analyzers in the schema. Should I change the analyzer defined for text_general? If so, since the content and other fields' type is text_general, isn't it going to affect all of them too?
Does anyone know how I can fix this problem? I would really appreciate any help.
BTW, I am using Nutch 1.11 and Solr 6.0.0.
Assuming that you're using the schema.xml bundled with Nutch as the base schema for your Solr installation, basically you'll just need to add either of those filters (LengthFilterFactory or TruncateTokenFilterFactory) to the text_general field type.
Starting from the initial definition of the text_general fieldType (https://github.com/apache/nutch/blob/master/conf/schema.xml#L108-L123) you'll need to add the following to the <analyzer type="index"> section:
...
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- remove long tokens -->
<filter class="solr.LengthFilterFactory" min="3" max="7"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
...
This could also be applied to the query analyzer using the same syntax. If you want to use the TruncateTokenFilterFactory filter instead, just swap the added line with:
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
Also, don't forget to adjust the parameters of each filter to your needs (min and max for the LengthFilterFactory, prefixLength for the TruncateTokenFilterFactory).
Answering your other questions: yes, this would affect all fields with the text_general type, but that is not really a problem, because if another field contained a super-long term the same error would be thrown anyway. If you still want to isolate this change to the content field, create a new fieldType with a new name (truncated_text_general, for instance; just copy and paste the entire fieldType section and change the name attribute) and then change the type of the content field (https://github.com/apache/nutch/blob/master/conf/schema.xml#L339) to match your newly created fieldType.
That being said, just select sane values for the filters to avoid missing a lot of terms from your index.
OK, so I've finally decided how to load my data: I'm going to load my book data as an XML file. The problem is that I'm not too sure where to start; I've heard terms such as 'parsing' but don't know how exactly it fits in.
I have added the XML below. If someone could give me a start in the right direction I would really appreciate it; to begin with, all I want to do is load one line, display my own comment under it, then the next line, and so on.
<myBook>
  <chapter index="1" name="Chapter Name">
    <line index="1" text="line text here"/>
    <line index="2" text="line text here"/>
    <line index="3" text="line text here"/>
    <line index="4" text="line text here"/>
    <line index="5" text="line text here"/>
    <line index="6" text="line text here"/>
    <line index="7" text="line text here"/>
  </chapter>
</myBook>
Thanks guys,
Maybe this question helps you: Navigating XML from Objective-C
There are some classes with which you can process your XML file. If you don't know about XML in general, read the Wikipedia article about XML; the most common techniques used to process XML are also described there.
Use Cocoa's XML parser?
Apple's NSXMLParser is an event-based parser built on libXML; it is a SAX parser. I have found it to be slow when parsing large files on the iPhone 3G, as it doesn't take advantage of libXML's xmlParseChunk() function.
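If you do go with NSXMLParser, a rough sketch of the event-based approach for the <chapter>/<line> format in the question might look like this (currentChapterName and currentLines are placeholder properties you would define yourself):
// Kick off parsing of the bundled XML file.
NSString *path = [[NSBundle mainBundle] pathForResource:@"myBook" ofType:@"xml"];
NSXMLParser *parser = [[NSXMLParser alloc] initWithContentsOfURL:[NSURL fileURLWithPath:path]];
[parser setDelegate:self];
[parser parse];
// Delegate callback, invoked once for every opening tag in the document.
- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict
{
    if ([elementName isEqualToString:@"chapter"]) {
        self.currentChapterName = [attributeDict objectForKey:@"name"];
    } else if ([elementName isEqualToString:@"line"]) {
        // The line's text and index arrive as attributes on the <line> element.
        [self.currentLines addObject:[attributeDict objectForKey:@"text"]];
    }
}
Because the parser reports each element as it reads it, you never hold the whole book in memory at once, which is the main argument for a SAX-style parser on the phone.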
Have you thought about using JSON as an alternative?
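If you do, the same book data could live in a JSON file with one object per chapter; here is a rough sketch using Foundation's NSJSONSerialization (available from iOS 5 onward; the file name and keys are just illustrative):
// chapters.json (hypothetical): [ { "name": "Chapter Name", "lines": ["line text here", ...] }, ... ]
NSString *path = [[NSBundle mainBundle] pathForResource:@"chapters" ofType:@"json"];
NSData *data = [NSData dataWithContentsOfFile:path];
NSError *error = nil;
NSArray *chapters = [NSJSONSerialization JSONObjectWithData:data options:0 error:&error];
NSDictionary *firstChapter = [chapters objectAtIndex:0];
NSArray *lines = [firstChapter objectForKey:@"lines"];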
Do you really mean "load into Xcode" or are you talking about reading your custom XML file in your application? And didn't you ask this as How To load XML file into iPhone project?