When are text field lengths enforced in Salesforce? Can I get the actual field length? - api

I was importing data from Salesforce today, when my BULK INSERT failed with too-long data: longer than the field length as reported from Salesforce itself. I discovered that this field, which Salesforce describes as a TEXT(40), has values up to 255 characters long. I can only guess that the field had a 255-character limit in the past, was changed to TEXT(40), and Salesforce has not yet applied the new limit.
When are field lengths enforced? Only when new data is inserted or modified? Are they enforced at any other point, such as a weekly schedule?
Second, is there any way to know the actual field length limit? As a database guy, not being able to rely on the metadata I've been given makes me cringe. As just one random example, if we were to restore this table from backup I assume that the long values would bomb, or possibly be truncated.
I'm using the SOAP API.

Field lengths are enforced on create and update. If you later reduce the length, existing records are not truncated. I imagine this is because Salesforce stores these values as 255-character fields regardless of the declared length.
Pragmatically speaking, the "actual" field limit for any text field in Salesforce should be considered 255. This is because it's possible that at some point in the past, records were inserted when the limit was as high as 255.
And you're right that if you were to dump that table and re-insert it, you very well could have rejected records due to values that exceed the field size as currently defined.
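As a rough illustration of the advice above, here is a minimal Python sketch that checks exported records against the lengths declared in the metadata and sizes staging columns defensively. The function names, the record layout (a list of dicts), and the 255 ceiling as a parameter are assumptions for illustration; nothing here calls the Salesforce API.

```python
def find_overlong_fields(records, declared_lengths):
    """Return {field: longest observed length} for fields whose data
    exceeds the length declared in the metadata."""
    overlong = {}
    for record in records:
        for field, limit in declared_lengths.items():
            value = record.get(field)
            if value is None:
                continue
            observed = len(value)
            if observed > limit and observed > overlong.get(field, 0):
                overlong[field] = observed
    return overlong


def safe_column_length(declared_length, ceiling=255):
    """Size a staging column for the larger of the declared length and
    the assumed historical ceiling for text fields (255 per the answer)."""
    return max(declared_length, ceiling)


# Hypothetical data: "Name" is declared as TEXT(40) but holds an older, longer value.
records = [
    {"Name": "x" * 120, "Code": "A1"},
    {"Name": "short", "Code": "B2"},
]
declared = {"Name": 40, "Code": 10}
report = find_overlong_fields(records, declared)
```

Running a check like this before the BULK INSERT shows which columns need a wider staging definition instead of failing partway through the load.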

Related

Memory reserved according to the defined field size or just the size of the data inside?

In HANA, there's a column of type NVARCHAR(4000) holding the value ThisISaString. Is the RAM being used sized for 4000 characters or for 13?
If it reserves 4000, then this space could really add up when you have a lot of records.
I am trying to decide how big I should make my text fields.
What I make of your question in its current form is that you are asking how SAP HANA handles variable-length strings when presenting them to the client (I take this from your intention to reserve a buffer).
Thus, I'm not going to discuss what happens inside HANA when you enter a value into a table. This is rather complex and depends on the table type used (column, row, external, temporary...).
So, for the client application, a (N)VARCHAR will result in a string with the length of the stored value, i.e. no padding (with spaces at the end) will happen.

Change data type in table due to Disk Space/Memory Error

Attempts at changing the data type in Access have failed with the error:
"There isn't enough disk space or memory". Over 385,325 records exist in the table.
The approaches in the following links, among other Stack Overflow threads, have failed:
Can't change data type on MS Access 2007
Microsoft Access can't change the datatype. There isn't enough disk space or memory
The intention is to change data type for one column from "Text" to "Number". The aforementioned links cannot accommodate that either due to size or the desired data type fields.
Breaking out the table may not be an option due to the number of records.
Help on this would be appreciated.
I cannot tell for sure about MS Access, but in MS SQL you can avoid a table rebuild (which requires lots of time and space) by appending a new column that allows null values at the rightmost end of the table, filling it with normal update queries, and, AFAIK, even dropping the old column and renaming the new one. In the end, only the position of that column has changed.
As for your 385,325 records (I'd expect that number to be correct): even if the table had 1000 columns with 500 Unicode characters each, we'd end up with approximately 385,325*1000*500*2 ~ 385 GB of data. That should not exceed what is available nowadays, so:
If it's disk space you're running out of, how about moving the data to some other computer, changing the DB there, and moving it back?
If the DB seems to be corrupted (and standard tools didn't help; make a copy first), it will most probably help to create a new table or database using table creation queries (better: create the table manually and use append queries).
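The worst-case size estimate in this answer can be checked with a few lines of Python. The figures are the ones quoted in the answer (1000 columns, 500 characters each, 2 bytes per character); they are a deliberately exaggerated upper bound, not a measurement of the actual table.

```python
records = 385_325          # row count from the question
columns = 1000             # exaggerated column count used in the answer
chars_per_column = 500     # exaggerated text length used in the answer
bytes_per_char = 2         # assumes 2 bytes per Unicode character

total_bytes = records * columns * chars_per_column * bytes_per_char
total_gb = total_bytes / 1e9  # decimal gigabytes, roughly 385 GB

print(f"{total_gb:.0f} GB")
```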

Find out the amount of space each field takes in Google Big Query

I want to optimize the space used by my BigQuery and Google Storage tables. Is there an easy way to find out the cumulative space that each field in a table takes up? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, replacing <column_name> with the field of interest
SELECT <column_name>
FROM YourTable
and looking at the Validation Message, which contains the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed; this is the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such “column profiling” for many tables, or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, then loop through all fields, build the respective SELECT statement, and finally dry-run it (within the loop, for each column) to get totalBytesProcessed, which, as you already know, is the size of the respective column.
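The loop described above could be sketched in Python roughly as follows. This assumes the google-cloud-bigquery client library is installed and credentials are configured; the helper names and the table id are hypothetical. It only profiles top-level columns, so a nested or repeated record is reported as a single total for the whole record.

```python
def column_sizes(client, table_id, job_config):
    """Dry-run one SELECT per top-level column and collect the
    bytes each query would process."""
    table = client.get_table(table_id)  # fetches the schema (Tables.get)
    sizes = {}
    for field in table.schema:
        sql = f"SELECT `{field.name}` FROM `{table_id}`"
        job = client.query(sql, job_config=job_config)
        sizes[field.name] = job.total_bytes_processed
    return sizes


def profile_table(table_id):
    """Convenience wrapper; needs network access and credentials."""
    from google.cloud import bigquery  # imported lazily on purpose

    client = bigquery.Client()
    # dry_run=True validates the query and reports bytes without running it.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return column_sizes(client, table_id, config)


# Example call (hypothetical table id):
# profile_table("my-project.my_dataset.my_table")
```

Keeping the client and job configuration as parameters of column_sizes makes the loop easy to exercise with a stub before pointing it at a real project.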
I don't think this is exposed in any of the metadata.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 values, and use this for your storage calculations.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. Each entry is a 10-digit phone number followed by 5-6 binary fields.
Every week I will receive a delta file that contains only the changes to the data.
Problem: Given a list of items, I need to figure out whether each item (a 10-digit number) is present in the dataset.
The approach I have planned :
Parse the dataset and put it into a DB (to be done at the start of the week) such as MySQL or Postgres. The reason I want an RDBMS in the first step is that I want to keep full time-series data.
Then generate some kind of key-value store from this database with the latest valid data, supporting an operation to find out whether each item is present in the dataset (I'm thinking of some kind of NoSQL DB, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
Query this key-value store to find out whether each item is present (if possible, matching a list of values all at once instead of one item at a time). I want this to be blazing fast. I will be using this functionality as the back end of a REST API.
Sidenote: Language of my preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER command, which performs set intersection.
You might benefit from a grid structure that distributes number ranges across nodes using some hash function, such as the first digit of the phone number (there are probably better ones; you will have to experiment). With an optimal hash and 10 nodes, this would reduce the size per node to about 20 million entries.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
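The sharding and caching ideas above can be sketched in plain Python, with in-memory sets standing in for the distributed nodes. The class name, the modulo-based shard function, and the cache size are assumptions for illustration; a real deployment would replace each set with a Redis instance or similar store.

```python
from collections import OrderedDict


class ShardedLookup:
    """Membership lookup spread over several shards, with a small
    cache of the most recently requested numbers checked first."""

    def __init__(self, numbers, num_shards=10, cache_size=1000):
        self.num_shards = num_shards
        self.cache_size = cache_size
        self.cache = OrderedDict()  # number -> bool, oldest entry first
        self.shards = [set() for _ in range(num_shards)]
        for number in numbers:
            self.shards[self._shard_of(number)].add(number)

    def _shard_of(self, number):
        # Assumption: a modulo on the numeric value is used as the hash;
        # trailing digits tend to spread more evenly than the first digit.
        return int(number) % self.num_shards

    def contains_many(self, numbers):
        """Check a whole batch and return {number: present?}."""
        result = {}
        for number in numbers:
            if number in self.cache:
                self.cache.move_to_end(number)  # mark as recently used
                result[number] = self.cache[number]
                continue
            found = number in self.shards[self._shard_of(number)]
            self.cache[number] = found
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict the oldest entry
            result[number] = found
        return result


lookup = ShardedLookup(["5551234567", "5559876543"], num_shards=4, cache_size=2)
answers = lookup.contains_many(["5551234567", "0000000000"])
```

Note that this cache also remembers negative answers, which is only safe here because the store is described as read-only between weekly rebuilds.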

How do I get Average field length and Document length in Lucene?

I am trying to implement the BM25f scoring system on Lucene. I need to make a few minor changes to the original implementation given here for my needs, but I got lost at the part where he gets the average field length and the document length. Could someone guide me as to how or where I get them from?
You can get field length from TermVector instances associated with documents' fields, but that will increase your index size. This is probably the way to go unless you cannot afford a larger index. Of course you will still need to calculate the average yourself, and store it elsewhere (or perhaps in a special document with a well-known external id that you just update when the statistics change).
If you can store the data outside of the index, one thing you can do is count the tokens when documents are tokenized, and store the counts for averaging. If your document collection is static, just dump the values for each field into a file & process after indexing. If the index needs to get updated with additions only, you can store the number of documents and the average length per field, and recompute the average. If documents are going to be removed, and you need an accurate count, you will need to re-parse the document being removed to know how many terms each field contained, or get the length from the TermVector if you are using that.
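The bookkeeping described above, kept outside the index, might look like the following Python sketch. It does not use the Lucene API; the class and function names are hypothetical, and whitespace splitting is only a stand-in for whatever analyzer actually tokenizes the fields.

```python
def count_tokens(text):
    """Stand-in tokenizer: a real setup would count the tokens
    produced by the analyzer used at index time."""
    return len(text.split())


class FieldLengthStats:
    """Tracks per-field document counts and token totals so the
    average field length can be updated without re-reading the index."""

    def __init__(self):
        self.doc_counts = {}
        self.token_totals = {}

    def add_document(self, field_lengths):
        for field, length in field_lengths.items():
            self.doc_counts[field] = self.doc_counts.get(field, 0) + 1
            self.token_totals[field] = self.token_totals.get(field, 0) + length

    def remove_document(self, field_lengths):
        # The caller must supply the removed document's lengths, e.g. by
        # re-parsing it, as the answer notes.
        for field, length in field_lengths.items():
            self.doc_counts[field] -= 1
            self.token_totals[field] -= length

    def average_length(self, field):
        count = self.doc_counts.get(field, 0)
        return self.token_totals[field] / count if count else 0.0


stats = FieldLengthStats()
first = {"title": count_tokens("a short title here"), "body": 100}
stats.add_document(first)
stats.add_document({"title": 6, "body": 300})
```

Storing totals and counts rather than the averages themselves keeps additions and removals exact, since the average is only derived when it is requested.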