A simple statistical test if a set of 1000 documents is homogeneous

I have a simple statistical question and hope someone here has a quick answer.
I have a set of 200 documents, and each document should contain exactly 3 pages. My assumption is that 100% of those documents have 3 pages. I want to take a sample that would statistically confirm that the set is homogeneous, meaning that all documents have exactly 3 pages. If I find even one document in the sample with a page count other than 3, I know my set is inhomogeneous.
How many documents do I have to look at to be 80% sure my set is homogeneous? Should I have more than 200 documents in my base set, for instance 1000?

I am not sure, but I don't think that can be calculated from the given details. You would need to know the standard deviation of the base set.

You are trying to test whether all documents have 3 pages. A statistical test will not help here. In most cases what you will have is a test, at the 5% or 1% significance level, of whether the mean number of pages is 3. Such a test speaks about the average rather than about each individual document, and it still leaves a 1 in 20 or 1 in 100 chance, respectively, of rejecting a hypothesis that is actually true.
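A sample cannot prove that every document has 3 pages, but it can bound how many bad documents could have gone unnoticed. Here is a minimal sketch of that calculation, assuming simple random sampling without replacement. The parameter `bad` (how many non-conforming documents you want to be able to catch) is an assumption that the question does not supply, and the function names are made up for illustration.

```python
from fractions import Fraction
from math import comb

def prob_sample_all_good(total, bad, sample):
    """Probability that a random sample (without replacement) misses every bad document."""
    # comb() returns 0 when the sample is larger than the number of good documents.
    return Fraction(comb(total - bad, sample), comb(total, sample))

def min_sample_size(total, bad, confidence):
    """Smallest sample that catches at least one of `bad` documents with the given confidence.

    Pass `confidence` as a string or Fraction (e.g. "0.8") so the comparison is exact.
    """
    risk = 1 - Fraction(confidence)
    for n in range(total + 1):
        if prob_sample_all_good(total, bad, n) <= risk:
            return n
    return total

# Catching a single bad document with 80% confidence:
print(min_sample_size(200, 1, "0.8"))   # 160
print(min_sample_size(1000, 1, "0.8"))  # 800
```

Under these assumptions, being 80% sure of catching one stray document means reading 80% of the set, whatever its size. The sample only becomes small if you accept that a handful of bad documents may go undetected, i.e. if you choose a larger `bad`.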

Related

How to sample rows from a table with a specific probability?

I'm using BigQuery at my new position, and I'm totally new to SQL/BigQuery.
I'm testing a machine learning model and monitoring an A/B test with unequal group sizes, e.g., 3 vs. 10. To compare the A/B results, e.g., the number of page views, I want to make the group sizes equal first so that I can compare easily. For example, say we have a table with 13 records (3 are from A and 10 are from B). In addition, each row contains an id field that is identical. What I want to do is to extract only 3 samples out of the 10 for B, to match the sample count of A.
I'm trying to use the FARM_FINGERPRINT function to map fields to integers. Then I'm taking ABS and then calculating MOD to convert the integer numbers to a specific range, e.g., [0, 10). Eventually, I would like to get 3 in 10 items using the following line:
MOD(ABS(FARM_FINGERPRINT(field)), 10) < 3
However, I found that even if I run A/B with exactly the same ML model and only a different A/B ratio, the results differ between A and B (they should be the same, because A and B are running the same ML model with just a different ratio). This made me suspect that the above implementation introduces some sampling bias. I also read this post and confirmed that FARM_FINGERPRINT might not produce a randomly distributed result.
*There's a critical reason why I cannot simply multiply B by 3/10, which is confidential and which I cannot disclose here.
Is there a better way to accomplish the equally distributed sampling?
Thank you in advance. (I'm sorry if the question is vague, as I'm hiding the confidential parts.)
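To make the two sampling ideas concrete, here is a minimal Python sketch. It is not BigQuery code: `stable_hash` uses SHA-256 purely as a stand-in for FARM_FINGERPRINT, and the ids are invented. It contrasts bucket sampling (the `MOD(...) < 3` approach, which keeps roughly 3 in 10 rows) with ordering by the hash and keeping the first k rows (which keeps exactly k).

```python
import hashlib

def stable_hash(value: str) -> int:
    """Stand-in for FARM_FINGERPRINT: a deterministic 64-bit integer derived from a string."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def bucket_sample(ids, numerator=3, denominator=10):
    """Keep ids whose hash bucket is in [0, numerator): about 3 in 10 rows, not exactly."""
    return [i for i in ids if stable_hash(i) % denominator < numerator]

def exact_sample(ids, k):
    """Keep exactly k ids: order by hash value and take the first k."""
    return sorted(ids, key=stable_hash)[:k]

ids_b = [f"user-{n}" for n in range(10)]   # invented ids for group B
print(len(exact_sample(ids_b, 3)))         # 3
print(bucket_sample(ids_b))                # size varies with the ids; only ~30% on average
```

Two caveats apply to either variant. The hashed field has to differ from row to row, since rows that share a value always land in the same bucket. And with only a few rows the bucket approach can return far more or fewer than 30%, so the order-and-limit variant (in SQL terms, ordering by the fingerprint and limiting the row count) is the one that guarantees matching group sizes.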

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (one XML file per image). I have indexed those XML files in Solr. Now, when retrieving them, I want to apply a threshold on the score so that documents with a lower score do not appear in the results (because they are not of much importance). For example, I want to retrieve all documents with a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get them working. Does anyone have any idea how I can do that?

What's the motivation for wanting to do this? The reason I ask, is
score is a relative thing determined by Lucene based on your index
statistics. It is only meaningful for comparing the results of a
specific query with a specific instance of the index. In other words,
it isn't useful to filter on b/c there is no way of knowing what a
good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's a bad idea to cut off at a fixed value, because you'll never know which threshold is best. For a good query it could be score=2, for a bad query score=0.5, etc.
These two links should explain why you don't want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries so that they return results with high precision (http://en.wikipedia.org/wiki/Precision_and_recall)
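For anyone who still wants the cutoff, here is a minimal sketch of how such a request could be put together. It assumes Solr's function range parser (`{!frange}`) applied to `query($q)` as a filter query. The base URL, the core name `images`, and the field name in the example query are hypothetical, and the function only builds the URL without contacting a server.

```python
from urllib.parse import urlencode

def build_score_filtered_url(base_url, query, min_score):
    """Build a Solr select URL that only returns documents scoring at least min_score."""
    params = {
        "q": query,
        "fl": "*,score",  # return the score alongside the stored fields
        # frange keeps documents whose value for query($q), i.e. their score, is >= l
        "fq": "{!frange l=" + str(min_score) + "}query($q)",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical core name and field name:
url = build_score_filtered_url(
    "http://localhost:8983/solr/images/select", "description:castle", 2.0
)
print(url)
```

The caveat from the links above still holds: 2.0 means something different for every query and every state of the index, so a fixed `min_score` is a guess.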

SDK2 query for counting: which is more efficient?

I have an app that is displaying metrics about defects in a project.
I have the option of making one query that returns all the defects, and from that I can break out about four different metrics (How many defects escaped QA in 90 days, 180 days, and then the same metrics again but only counting sev1/sev2 defects).
I could make four queries and limit the results to one so that I just get a count for each. Or I could make one query that encompass them all (all defects that escaped QA in 180 days) and then count up the difference.
I'm figuring that the number of defects that escaped QA in the last six months will generally be less than 100, and certainly less than 500 in the worst case.
Which would you do-- four queries with one result each, or one single query that on average might return 50, perhaps 500 in the worst case?
And I guess the key question is-- where are the inflection points? Perhaps I have more metrics tomorrow (who knows, 8?) and a different average defect count. Is there a rule of thumb I could use to help choose which approach?

Well, I would probably make the series of four queries and use the result count. If you are expecting 500 defects, that will end up being three queries of up to 200 defects each anyway.
The solution where you do each individual query and use the total result count would be safe with even a very large amount of defects. Plus I usually find it to be a bad plan to think that I know the data sets that an App will be dealing with. Most of my Apps end up living much longer and being used on larger datasets than I intended.

The max page size is 200, so it sounds like you'd be requesting between 1 and 3 pages to get all the data vs. 4 queries with a page size of 1 and using the TotalResultCount...
You'd definitely have less aggregation code to write if you use the multi query approach (letting the server do the counting for you based on your supplied filters).
I'd guess the 4 independent queries might be faster but it would be interesting to hear back your experimental results...
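As a rough rule of thumb for the inflection point, one can simply count round trips. A minimal sketch, using the page size limit of 200 quoted above (the function names are made up, and this ignores payload size and client-side aggregation cost):

```python
from math import ceil

MAX_PAGE_SIZE = 200  # page size limit quoted in the answer above

def requests_for_single_query(expected_defects, page_size=MAX_PAGE_SIZE):
    """Round trips needed to page through every defect with one broad query."""
    return max(1, ceil(expected_defects / page_size))

def requests_for_count_queries(metric_count):
    """Round trips when each metric is its own query with a page size of 1."""
    return metric_count

for defects in (50, 500, 800):
    print(defects, requests_for_single_query(defects), requests_for_count_queries(4))
# 50 1 4
# 500 3 4
# 800 4 4
```

By request count alone, the single query wins until the defect count reaches about 200 times the number of metrics. Each of its responses carries up to 200 full records though, while the count queries return one record each, so measuring both is still worthwhile.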

Getting all Twitter Follows (ids) with Groovy?

I was reading an article here and it looks like he is grabbing the IDs by the 100s. I thought it was possible to grab by 5000 each time?
The reason I'm asking is because sometimes there are profiles with much larger amounts of followers and you wouldn't have enough actions to do it all in one hour if one was to grab it by 100 each time.
So is it possible to grab 5000 ids each time? If so, how would I do this?

GET statuses/followers, as shown in that article, has been deprecated, but it used to return batches of 100.
If you're trying to get follower ids, you would use GET followers/ids. This returns batches of up to 5000, and should just require you to change the URL slightly (see the example URL at the bottom of the documentation page).
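Collecting every id then comes down to a paging loop. Here is a minimal sketch in Python rather than Groovy, assuming the cursor convention I remember from the GET followers/ids documentation (start with a cursor of -1, stop when `next_cursor` is 0). `fetch_page` is a hypothetical callable that would perform the authenticated request; here it is replaced by a fake so the loop can run on its own.

```python
def collect_follower_ids(fetch_page):
    """Follow cursors until the API reports that there are no more pages."""
    ids = []
    cursor = -1            # assumed convention: -1 requests the first page
    while cursor != 0:     # assumed convention: next_cursor == 0 marks the end
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids

# Fake pages standing in for real API responses (invented data):
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 42},
    42: {"ids": [4, 5], "next_cursor": 0},
}

print(collect_follower_ids(lambda cursor: FAKE_PAGES[cursor]))  # [1, 2, 3, 4, 5]
```

With up to 5000 ids per page, a profile needs about a fiftieth of the requests that batches of 100 would take, which is what matters for the hourly limit.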

Youtube API problem - when searching for playlists, start-index does not work past 100

I have been trying to get the full list of playlists matching a certain keyword. I have discovered however that using start-index past 100 brings the same set of results as using start-index=1. It does not matter what the max-results parameter is - still the same results. The total results returned however is way above 100, thus it cannot be that the query returned only 100 results.
What might the problem be? Is it a quota of some sort or any other authentication restriction?
As an example - the queries bring the same result set, whether you use start-index=1, or start-index=101, or start-index = 201 etc:
http://gdata.youtube.com/feeds/api/playlists/snippets?q=%22Jan+Smit+Laura%22&max-results=50&start-index=1&v=2
Any idea will be much appreciated!
Regards
Christo

I made an interface for my site, and the way I avoided this problem was to query for a large number of results once and then store them. Let your web page then break up the results and present them however is needed.
For example, if someone wants to do a search of over 100 videos, do the search and collect the results, but only present them with the first group, say 10. Then when the person wants to see the next ten, you get them from the list you stored, rather than doing a new query.
Not only does this make paging faster, but it cuts down on the constant queries to the YouTube database.
Hope this makes sense and helps.
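A minimal sketch of that idea in Python: `search_fn` is a hypothetical function standing in for the one large YouTube query, and the class and method names are made up. The query runs once per search term, and every page after that is sliced from the stored list.

```python
class CachedSearch:
    """Run the remote search once per term, then serve pages from the stored results."""

    def __init__(self, search_fn):
        self._search_fn = search_fn  # hypothetical: performs the single large query
        self._cache = {}

    def page(self, term, page_number, page_size=10):
        if term not in self._cache:
            self._cache[term] = list(self._search_fn(term))
        start = (page_number - 1) * page_size
        return self._cache[term][start:start + page_size]

# Stand-in search function returning invented results:
def fake_search(term):
    return [f"{term}-{n}" for n in range(25)]

search = CachedSearch(fake_search)
print(search.page("playlist", 1)[:2])  # ['playlist-0', 'playlist-1']
print(len(search.page("playlist", 3)))  # 5
```

In a real site the cache would also need some expiry, so that stale results are not served forever.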
