I want to calculate missing-data statistics per site in my VCF file.
Using vcftools --missing-site gives wrong stats for several sites.
Is there any other way to calculate it?
Thank you!
Using plink --missing you can calculate missingness per individual or per variant from a VCF file.
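A minimal sketch of that invocation (assuming plink 1.9; the input file name and output prefix are placeholders):

```shell
# Sketch, assuming plink 1.9; input.vcf and missing_stats are placeholder names.
plink --vcf input.vcf \
      --missing \
      --allow-extra-chr \
      --out missing_stats
# missing_stats.lmiss : missingness per variant (i.e. per site)
# missing_stats.imiss : missingness per individual
```

The per-site numbers you want end up in the .lmiss file.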
I want to do ARIMA_PLUS forecasting on a series of sale records. The problem is that the sale records only contain sales. When doing the forecast we need to insert, for every product, the "non-sales": rows with the import (amount) column set to zero for every day the product has not been sold. We have two options here:
Fill the database with those zero-rows (uses a lot of space)
When doing the forecast with ARIMA_PLUS in BigQuery, tell the model to fill with zeros instead of interpolating (interpolation being the default and seemingly the only option).
I want to follow the second option, yet I don't see how. Here you can see a screenshot of the documentation: Google info about interpolation
The first option could be carried out with a MERGE; nevertheless, I would prefer to discard it, since it increases the size of the sales table.
I have scanned the documentation and haven't seen any solution.
You need to provide an input dataset covering the missing values with the right method for your use case.
In other words, the SQL query must solve the interpolation so that the input for the model already contains the expected data.
You can, for example, create a query that adds a linear interpolation for your use case.
So the first approach you mentioned can be solved using that input SQL (rather than adding the data to the source table), and the second approach is not valid in BigQuery, as far as I know.
Here you have an example: https://justrocketscience.com/post/interpolation_sql/
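As a concrete sketch of such an input query (the dataset, table, and column names here are hypothetical, as is the date range), you can cross join the distinct products with a generated calendar and zero-fill the days with no sales:

```shell
# Hypothetical schema: my_dataset.sales(product_id, sale_date, import).
bq query --use_legacy_sql=false '
WITH days AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY(DATE "2023-01-01", DATE "2023-12-31")) AS day
)
SELECT p.product_id,
       d.day AS sale_date,
       IFNULL(s.import, 0) AS import   -- zero for days with no sale
FROM (SELECT DISTINCT product_id FROM my_dataset.sales) AS p
CROSS JOIN days AS d
LEFT JOIN my_dataset.sales AS s
  ON s.product_id = p.product_id AND s.sale_date = d.day'
```

The result of this query, not the sales table itself, is what you feed to CREATE MODEL with model_type='ARIMA_PLUS', so the source table never grows.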
I have two datasets in my SSRS tool: the first table contains 12,000 records and the second 26,000, with 40 columns in each table.
While building a report, every time I hit preview it takes forever to display.
Is there any way to avoid that, so I at least don't spend so much time building this report?
Thank you in advance.
Add a dummy parameter to limit your dataset, or just change your SELECT to SELECT TOP 100 while building the report.
vercelli's answer is a good one. In addition, you can change your cache options in the designer (for all result sets, including parameters) so that the queries are not rerun each time.
This is really useful. A couple of extra tips for you:
1. I don't recommend caching until you are happy with your dataset results.
2. If you are using the cache and you want to do a quick refresh, the data is stored in a ".data" file in the same location as your .rdl. You can delete this file to query the database again if required.
Hey guys, I have a report that pulls trucks, order numbers, and charges from different tables. I am running into a problem: when it pulls more than one truck for the same order, it doubles or triples the price. My plan to resolve this is to write an expression to calculate the charge, along the lines of Fields!amount.Value / CountDistinct(truckColumn), but it is not working, and I am not sure of the best route to take from here or if this is even the best solution. Any help would be great.
Have you tried writing it as this:
=Fields!amount.Value / CountDistinct(Fields!TruckColumn.Value)
If it still misbehaves, the count may be evaluating in the wrong scope; CountDistinct accepts a scope name as a second argument, e.g. CountDistinct(Fields!TruckColumn.Value, "YourOrderGroup").
Let's say I have a simple data set in csv like:
Time,User
01/22/2014 15:23:01,Bob
01/22/2014 16:24:01,John
01/22/2014 16:27:01,Bob
01/22/2014 17:23:01,Bob
Can I generate a box plot with Time on the x-axis (quantized to the hour), user on the y-axis, and sum(*) on the z-axis? So essentially the number of 'hits' per user per hour.
No matter how I try this, LogParser appears to only allow me one category (I have two: time and user). I get an error that either the time or the user is non-numerical.
Also, I don't know the full list of users... otherwise I could split out the 'sum(*)' to sum each user separately.
Thanks in advance, any help is much appreciated!
Unfortunately, LogParser only supports a single categorical axis (the X axis), so the answer is no.
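What you can do instead is have LogParser emit the per-user, per-hour counts as a flat table and chart them elsewhere. A sketch, assuming the CSV above is saved as data.csv:

```shell
# QUANTIZE(..., 3600) buckets the parsed timestamps to the hour;
# the output is one row per (hour, user) pair with its hit count.
LogParser -i:CSV -o:CSV "SELECT QUANTIZE(TO_TIMESTAMP(Time, 'MM/dd/yyyy hh:mm:ss'), 3600) AS Hour, User, COUNT(*) AS Hits FROM data.csv GROUP BY Hour, User ORDER BY Hour"
```

This also sidesteps not knowing the full list of users, since GROUP BY discovers them for you.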
We have identified Apache Solr as a possible solution to our problem. Please bear with me; I'm new to Apache Solr. We are planning to upload several large CSV files and use Solr's REST-like API to get the results back in XML/JSON.
The problem I am thinking of: say you have two files, currency.csv and country.csv, and both contain 'GBP' as an entry. If you upload both files into Solr and query for the value 'GBP', from which file's entries will results be returned?
What I would ideally like is a query that returns 'GBP' only from entries that were uploaded from currency.csv and not from country.csv.
I hope someone can help or point me in the right direction, as we may have files with similar data and we need to be sure to retrieve the right values from the right CSV file.
Thanks in advance.
GM
UPDATE
Is it better to have multiple cores, i.e. one core per file?
You can add an additional field, data_type, which indicates the record's type, e.g. country or currency.
You can then use that field to filter the results by type, or display it to show which type a record belongs to.
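For example (the core name, URL, and field values here are illustrative, and assume a data_type field has been added to the schema), a Solr filter query restricts matches to one file's records:

```shell
# Index time: tag every row from currency.csv with data_type=currency,
# and every row from country.csv with data_type=country.
# Query time: fq= narrows the 'GBP' matches to the currency records only.
curl 'http://localhost:8983/solr/mycore/select?q=GBP&fq=data_type:currency&wt=json'
```

Regarding the update: a single core with a data_type field is usually simpler than one core per file, though separate cores are also a reasonable design when the files' schemas differ significantly.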