Custom Translator Pricing - price

The Translator Text API cost description is not clear to me on the page here:
Training a custom system is 1 character per character of training material in source text.
Is that a typo? 1 character per character of training material does not make sense to me. How should it be calculated if I opt for the price tier S1 (pay as you go)?
Hosting a deployed system is 1M characters per month per deployed system.
Does that mean if I am in the S1 tier, the monthly cost for hosting 4 deployed systems for the first 1M characters is $10x4=$40?
Many Thanks!

Training is charged per character in your source training material, with a minimum charge of 250K characters per training. For instance, if you had 500K characters in the source training material, you would be charged for 500K characters at the price for your tier; at the S1 level, that would be $5. If the source material has fewer than 250K characters, the 250K-character minimum applies ($2.50 at S1).
Deployment is a flat charge of 1M characters per deployed system per month regardless of the number of characters in the training material. Yes, if you had 4 deployed systems at the S1 level, it would be $10 per system, so $10x4=$40 total per month for the 4 systems.
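To make the arithmetic concrete, here is a small Python sketch of the calculation described above; the per-character rate is simply the S1 figure quoted in this thread, so treat the numbers as illustrative:

```python
# Illustrative only: the $10 per 1M characters rate is the S1 figure quoted above.
S1_PRICE_PER_MILLION_CHARS = 10.00
TRAINING_MINIMUM_CHARS = 250_000  # minimum billable characters per training

def training_cost(source_chars: int) -> float:
    """Training is billed per source character, with a 250K-character minimum."""
    billable = max(source_chars, TRAINING_MINIMUM_CHARS)
    return billable / 1_000_000 * S1_PRICE_PER_MILLION_CHARS

def monthly_hosting_cost(deployed_systems: int) -> float:
    """Hosting is a flat 1M characters per deployed system per month."""
    return deployed_systems * S1_PRICE_PER_MILLION_CHARS

print(training_cost(500_000))    # 5.0  -> $5 for 500K source characters
print(training_cost(100_000))    # 2.5  -> the 250K minimum applies
print(monthly_hosting_cost(4))   # 40.0 -> $10 x 4 deployed systems
```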

Related

Text classification of inconsistent regulatory material for regulatory requirements

Hi, I am new to TensorFlow (and TensorFlow Hub, for that matter) and would like to know what I should be using from the API for my text classification use case.
I want to find out the housing setback information for different municipalities. Using Python, I have web scraped the info successfully and used NLTK to classify the words, but I want to take it a step further and use ML, given that no two municipality codes are alike! As an example, one municipality may have something like this:
Setback requirements.
Minimum front setback: 25 feet.
Minimum side setback from a street right-of-way: 25 feet.
Minimum side setback from an interior lot line: five feet.
Minimum rear setback for principal uses: 25 feet.
Minimum rear setback for accessory uses: ten feet.
etc
While another may have the following text.
For all R-1 districts except 4-R-1, the minimum setbacks shall be as follows:
Front. No building or structure shall be located within fifty (50) feet of the centerline of any street nor twenty (20) feet of the property line, whichever is greater.
There shall be a side yard setback on each side of the parcel equal to ten percent (10%) of the width of the parcel. In no case shall the minimum required side yard setback be less than five (5) feet. In order to preserve architectural integrity, the side yard setback required for an addition to an existing building or structure may be permitted to utilize the established setback, provided that the established side yard setback is not less than five (5) feet.
Rear. There shall be a rear yard setback of not less than fifteen (15) feet.
For the 4-R-1 district, the minimum setbacks shall be as follows:
Front. No building or structure shall be located within forty (40) feet of the centerline of any street nor ten (10) feet of the property line, whichever is greater.
Side. Three (3) feet.
Rear. Fifteen (15) feet.
etc
How can I classify this text according to the setbacks that each municipality requires? I eventually want to use this in ArcGIS as a shapefile or similar.
Any help would be appreciated!
In general it's a challenging modelling task, given that (I assume) there is not much data from a given municipality and not a lot of standardization between municipalities.
If the amount of data available would support it, you could try fine-tuning one of the existing transformers (https://tfhub.dev/google/collections/transformer_encoders_text/1) using the same pre-training objective that was used to train them in the first place, e.g. masked language modelling (MLM).
In general, tensorflow-hub might not be the best tag for general modelling advice, since tfhub.dev is a repository of pretrained models published by the OSS community.
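If you do go the TF Hub route, the downstream classification step (separate from the MLM-style continued pre-training suggested above) could be sketched roughly as follows; the model handles and the binary "is this clause a setback requirement?" target are only illustrative choices, not something prescribed by the linked collection:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the ops the BERT preprocessor needs)

# Example handles from tfhub.dev; any encoder/preprocessor pair from the
# linked collection could be substituted here.
PREPROCESSOR = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="clause_text")
encoder_inputs = hub.KerasLayer(PREPROCESSOR)(text_input)
encoder_outputs = hub.KerasLayer(ENCODER, trainable=True)(encoder_inputs)

# Pooled sentence embedding -> binary "setback requirement" label.
pooled = encoder_outputs["pooled_output"]
label = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)

model = tf.keras.Model(text_input, label)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

You would still need a labelled set of clauses to train such a head, on top of any unsupervised MLM fine-tuning on the scraped municipal text.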

Data selection in a data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I cannot just select all the data available, because this will result in overfitting. I am currently investigating house prices in the capital of Denmark for the past 10 years, and I wanted to know which type of homes I should select in my data:
Owner-occupied flats and houses (this gives a dataset of 50,000 elements)
Or just owner-occupied flats (this gives a dataset of 43,000 elements)
So, as you can see, a lot more owner-occupied flats are sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats, because then I have the "same" kind of elements in my data and still have 43,000 elements.
Also, the taxes are a lot higher if you own a house rather than an owner-occupied flat. This might affect the price of the house and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses were selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to a lack of data.
Let me explain the selection process used to resolve this kind of problem. I will explain it using the example of credit card fraud and then relate that to your question, or to any future prediction problem with class-labelled data.
In the real world credit card fraud is not that common, so if you look at real data you will find that only about 2% of it involves fraud. If you train a model on such a dataset it will be biased, because the classes are not evenly distributed (i.e. fraud and non-fraud transactions; in your case, owner-occupied flats and houses). There are 4 ways to tackle this issue.
Suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under-sampling the majority class
Here we select only 10 data points from the 90 and train the model 10:10, so the distribution is balanced (in your case, using only 7,000 of the 43,000 flats). This is not the ideal way, as we would be throwing away a huge amount of data.
2. Over-sampling the minority class by duplication
Here we duplicate the 10 data points to make it 90, so the distribution is balanced (in your case, duplicating the 7,000 house records to make it 43,000, i.e. equal to the number of flats). While this works, there is a better way.
3. Over-sampling the minority class with SMOTE (recommended)
Synthetic Minority Over-sampling Technique (SMOTE) uses the k-nearest-neighbours algorithm to generate synthetic samples of the minority class, in your case the house data. The imbalanced-learn module (here) can be used to implement this; a short sketch follows this list.
4. Ensemble method
In this method you divide your majority-class data into multiple subsets to balance it, for example dividing the 90 into 9 sets so that each set has 10 non-fraud data points alongside the 10 fraud data points (in your case, dividing the 43,000 flats into batches of 7,000). After that, train a model on each balanced subset separately and use a majority-vote mechanism to predict.
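For point 3, a minimal sketch with imbalanced-learn might look like this; the generated stand-in data only mimics the 43,000 vs. 7,000 flats/houses imbalance and would be replaced by your own feature matrix and class labels:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Stand-in data with roughly the same imbalance as 43,000 flats vs. 7,000 houses.
X, y = make_classification(n_samples=50_000, n_features=10,
                           weights=[0.86, 0.14], random_state=42)
print(Counter(y))  # majority class ~43,000, minority class ~7,000

# SMOTE synthesises new minority-class samples from their k nearest neighbours.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # both classes now have ~43,000 samples
```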
So now I have created the following diagram. The green line shows the price per square meter of owner-occupied flats and the red line shows the price per square meter of houses (all prices in DKK). I was wondering whether this counts as imbalanced classification. The maximum deviation between the prices is at most 10% (see for example 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

I want train_test_split to train mainly on one specific number range

I am running some regression models in jupyter/python to predict the cycle time of certain projects. I used train_test_split from sklearn to randomly divide my data set.
The models tend to work pretty well for projects with high cycle times (between 150 and 300 days), but I care more about the lower cycle times, between 0 and 50 days.
I believe the model is more accurate for the higher range because most of the projects (about 60-70%) have cycle times over 100 days. I want my model to mainly get the lower cycle times right, because for the purposes of what I'm doing, a project with a cycle time of 120 days is the same as a project with 300 day cycle time.
In my mind, I need to train more on the projects with shorter cycle times; I feel like this might help.
Is there a way to split the data less randomly, i.e. train on a higher ratio of shorter-cycle-time projects?
Is there a better or different approach I should consider?
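One possible direction, sketched here with hypothetical column names: bin the cycle times and pass the bins to train_test_split's stratify argument so short projects are properly represented in both splits, and/or give short-cycle projects a larger sample_weight when fitting the regressor.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical project data: 'cycle_time' is the target, the rest are features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "budget": rng.uniform(1e4, 1e6, 1000),
    "team_size": rng.integers(2, 30, 1000),
    "cycle_time": rng.gamma(shape=2.0, scale=80.0, size=1000),
})

# Bin the target so the split preserves the mix of short/medium/long projects.
bins = pd.cut(df["cycle_time"], bins=[0, 50, 150, np.inf],
              labels=["short", "medium", "long"])

X = df.drop(columns="cycle_time")
y = df["cycle_time"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=bins, random_state=42)

# Many regressors also accept sample_weight in fit(), so the short-cycle
# projects can simply be weighted more heavily during training.
```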

Discrepancy in Azure SQL DTU reporting?

Refer to DTU graph below.
• Both graphs show DTU consumption for the same period, but captured at different times.
• Graph on the left was captured minutes after DTU-consuming event;
• Graph on the right was captured some 19 hrs after.
Why are the two graphs different?
The difference is in the granularity of the data points: both graphs show the same scale along the bottom (likely through use of the 'custom' view of the DTU percentage and other metrics), but the granularity of the underlying data has changed. This is a similar question: the granularity for the last hour of data is 5 seconds, whereas for a multiple-hour window it is 5 minutes, and the value shown for each 5-minute data point is the average of the underlying 5-second samples.
I'll verify this with the engineering team and update if it is inaccurate.
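A toy illustration of the averaging effect described above; the sampling intervals are the ones mentioned in the answer and the DTU values are random, so treat this purely as a demonstration of the aggregation, not as confirmed portal behaviour:

```python
import numpy as np

# Pretend DTU% is sampled every 5 seconds for one hour (3600 s / 5 s = 720 samples).
rng = np.random.default_rng(0)
five_second_samples = rng.uniform(0, 100, size=720)

# Zoomed out to a multi-hour window, each plotted point covers 5 minutes,
# i.e. the average of 60 five-second samples (300 s / 5 s).
five_minute_points = five_second_samples.reshape(-1, 60).mean(axis=1)

print(five_second_samples.max())  # short spikes are visible at 5 s granularity...
print(five_minute_points.max())   # ...but are smoothed away once averaged
```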

SSAS - Classification - How To Split Data into: Training Set - Validation Set - Test Set

I have a set of 300,000 records of historical customer purchase data. I have started an SSAS data mining project to identify the best customers.
The split of data:
-90% non-buyers
-10% buyers
I have used various algorithms in SSAS (decision trees and neural networks showed the best lift) to explore my data.
The goal of the project is to identify/score customers according to who is most likely to buy a product.
Currently, I have used all my records for this purpose. It feels like something is missing from the project. I am reading two books about data mining. Both of them talk about splitting the data into different sets; however, neither of them explains HOW to actually split them.
I believe I need to split my records into 3 sets and re-run the SSAS algorithms.
Main questions:
1. How do I split the data into training, validation and test sets?
1.1 What ratio of buyers and non-buyers should be in the training set?
2. How do I score my customers from most likely to buy a product to least likely to buy a product?
The division of your set could be done randomly, as your data set is big and the proportion of buyers is not too low (10%). However, if you want to be sure that your sets are representative, you could take 80% of your buyer samples and 80% of your non-buyer samples and mix them to build a training set that contains 80% of your total data set and has the same buyer/non-buyer ratio as the original data set, which makes the subsets representative. You may also want to divide your dataset not into two subsets but into three: training, cross-validation and test. If you use a neural network, as you said, you should use the cross-validation subset to tune your model (weight decay, learning rate, momentum, ...).
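Outside SSAS, the stratified 80/10/10 split described above can be sketched with scikit-learn like this; the generated data is only a stand-in for your 300,000 records with a 10% buyer rate:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the 300,000 records: label 1 = buyer (10%), label 0 = non-buyer (90%).
X, y = make_classification(n_samples=300_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# stratify=y keeps the 90/10 ratio in every subset: 80% training ...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# ... then split the remaining 20% into 10% cross-validation and 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```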
Regarding your second question, you could use a neural network, as you said, and take the output, which will be in the range [0, 1] if you use a sigmoid as the activation function in the output layer, as the probability of buying. I would also recommend taking a look at collaborative filtering for this task, because it would help you find out which products a customer may be interested in, using your knowledge of other buyers with similar preferences.
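For the scoring step, any classifier that exposes predicted probabilities can rank customers from most to least likely to buy; here is a small sketch continuing the stand-in data from the previous snippet:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Train a small neural network on the stand-in training set from the snippet above.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=42)
clf.fit(X_train, y_train)

# Probability of the "buyer" class for each held-out customer.
buy_probability = clf.predict_proba(X_test)[:, 1]

# Rank customers from most likely to least likely to buy.
ranking = np.argsort(-buy_probability)
print(buy_probability[ranking[:10]])  # scores of the 10 most promising customers
```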