HiveQL is cutting zeros from fields

When I load data into Hive, it ends up trimming the zeros at both the beginning and the end; this data normally comes from a document number.
A CPF has 11 digits and a CNPJ has 14, and when the data is entered into the table the zeros get cut (depending on the document, at the beginning or at the end). How can I avoid this problem, given that both types of document can be entered in the same column?
Measuring the string length with LENGTH doesn't work, because of these cut zeros.
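Numeric Hive types (INT, BIGINT, DECIMAL) have no notion of leading zeros, so the usual fix is to declare the column STRING; if the zeros are already lost, they can be re-padded (Hive's LPAD does the same job in HiveQL). Below is a minimal sketch of the re-padding idea in Python; `restore_cpf_cnpj` and the "anything that fits in 11 digits is a CPF" rule are illustrative assumptions, not a validated business rule:

```python
def restore_cpf_cnpj(value):
    """Re-pad a document number whose leading zeros were lost.

    A CPF has 11 digits and a CNPJ has 14. The rule used here,
    'anything that fits in 11 digits is a CPF', is only an
    illustrative assumption; real data would need a separate
    column to tell the two document types apart.
    """
    digits = str(value)
    if len(digits) <= 11:
        return digits.zfill(11)   # CPF: pad back to 11 digits
    return digits.zfill(14)       # CNPJ: pad back to 14 digits

# A CPF stored in a numeric column loses its leading zero:
# 01234567890 arrives as the number 1234567890.
print(restore_cpf_cnpj(1234567890))   # 01234567890
```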

Related

Has anyone determined how to make Excel show a significant digit if it's a zero?

Excel truncates my whole-number data and does not show the tenths place (i.e., shows 4 instead of 4.0). I don't want to increase the decimal places, as not all data should show the x.0 place.
The ROUND function does not work; it won't round 4 to 4.0.
Excel help says to use this formula: =ROUND(B18,2-LEN(INT(B18))), but it doesn't work if the ending digit is 0.

Correct type of data for latitude and longitude in an SSIS ETL process

I'm trying to convert and upload latitude and longitude data into a database through an ETL process I created, where we take the source data from a .csv file and convert it to DECIMAL. Here is an example of what the two values look like:
Latitude (first column): 41.896585191199556
Longitude (second column): -87.66454238198166
I set the data types in the database as:
Latitude DECIMAL(10,8)
Longitude DECIMAL(11,8)
The main problem arises when I try to convert data from file to database and then I get the message
[Flat File Source [85]] Error: Data conversion failed. The data conversion for column "Latitude" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
View of my process:
When I try to ignore the error, the Latitude and Longitude values in the database are changed to NULL... The flat file encoding is 65001 (UTF-8).
I tried conversions to the data types float, DECIMAL, and int, and nothing helped.
My questions are:
What data type should I use for the above values in the target database?
What data type should I choose on input for the flat file?
What data type should I set for the conversion (I suspect the one we will have in the database)?
Please note that some records in the file are missing the location.
view from Data:
view from Data Conversion:
UPDATE
When FastParse is run I receive an error message as below:
What data type should I choose in this case? I set everything up as #billinkc suggested. When I set an integer type, for example DT_I4, it results in NULL and the same error as before (in this dialog there is no option to select a data type such as DECIMAL or STRING for the Latitude value).
You need DECIMAL(11,8). That has three digits before the decimal place and eight digits after.
The conversion failure is no doubt happening when you have longitudes above 100 or less than -100.
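That DECIMAL(p,s) limit can be sketched as a small check: a value fits only if it needs no more than p - s digits before the decimal point. `fits_decimal` below is a hypothetical helper written for illustration, not an SSIS or SQL Server API:

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_decimal(value, precision, scale):
    """Hypothetical check: a value fits SQL DECIMAL(precision, scale)
    only if it needs at most (precision - scale) digits before the
    decimal point once rounded to `scale` places."""
    quantized = Decimal(str(value)).quantize(
        Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    integer_digits = len(str(abs(int(quantized))))
    return integer_digits <= precision - scale

print(fits_decimal(-87.66454238198166, 11, 8))  # True: 2 digits before the point
print(fits_decimal(100.5, 10, 8))               # False: DECIMAL(10,8) tops out at 99.99999999
```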
The error reported indicates the failure point is the Flat File Source
[Flat File Source [85]] Error: Data conversion failed. The data conversion for column "Latitude" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
I'm on a US locale machine so you could be running into issues with the decimal separator. If that's the case, then in your Flat File Source, right click and select Show Advanced Editor. Go to Input and Output Properties, and under the Flat File Source Output, expand Output Columns and for each column that is a floating point number, check the FastParse option.
If that works, great, you have a valid Flat File Source.
I was able to get this working two different ways. I defined two Flat File Connection Managers in my package: FFCM Dec and FFCM String. While I prefer to minimize the number of operations and transforms I apply to my packages, declaring the data types as strings can help you get past the hurdle of "I can't even get my data flow to start because of bad data".
Source data
I created a CSV saved as UTF-8
Latitude,Longitude
41.896585191199556,-87.66454238198166
FFCM Dec
I configured a standard CSV
I defined my columns with the DataType of DT_DECIMAL
FFCM String
Front page is the same but on the columns in the Advanced section, I left the data type as DT_WSTR with a length of 50
At this point, we've defined the basic properties of how the source data is structured.
Destination
I went with consistency on the size for the destination. You're not going to save anything by using 10 vs 11 and I'm too lazy to look up the allowable domain for lat/long numbers
CREATE TABLE dbo.SO_65909630
(
[Latitude] decimal(18,15)
, [Longitude] decimal(18,15)
)
Data Flow
I need to run, but you either use the correctly typed data when you bring it in (DFT DEC) or you transform it.
The blanks I see in your source data will likely need to be dealt with (either you have a column that needed to be escaped, or there is no data, which will cause the data conversion to fail), so I'd advocate this approach.
Row counts are there just to provide a place to put a data viewer while I was building the answer
What data type should I use for lat and long
Decimal is an exact data type, so it will store the exact value you supply. When used, it takes the form decimal(precision, scale). Before my current role, I had never used any other data type for non-whole numbers.
Books Online on decimal and numeric (Transact-SQL): https://learn.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15
Precision
The maximum total number of decimal digits to be stored. This number includes both the left and the right sides of the decimal point. The precision must be a value from 1 through the maximum precision of 38. The default precision is 18.
Scale
The number of decimal digits that are stored to the right of the decimal point. This number is subtracted from p to determine the maximum number of digits to the left of the decimal point. Scale must be a value from 0 through p, and can only be specified if precision is specified. The default scale is 0 and so 0 <= s <= p. Maximum storage sizes vary, based on the precision.
Precision    Storage bytes
1 - 9        5
10 - 19      9
20 - 28      13
29 - 38      17
For the table I defined above, it will cost us 18 bytes (2 * 9) for each lat/long to store.
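The bracket table above can be sketched as a lookup; `decimal_storage_bytes` is an illustrative helper built from the documented brackets, not a SQL Server function:

```python
def decimal_storage_bytes(precision):
    """Storage for a single SQL Server decimal value, following the
    precision brackets documented above."""
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17

# The decimal(18,15) lat/long pair defined above: 9 + 9 bytes.
print(2 * decimal_storage_bytes(18))  # 18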
But let's look at the actual domain for latitude and longitude (on Earth). This magnificent answer on GIS.se is printed out and hangs from my work monitor: https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude
Pasting the relevant bits here
The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. It should be more than good enough for tracking movements of glaciers and rivers. This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.
The seventh decimal place is worth up to 11 mm: this is good for much surveying and is near the limit of what GPS-based techniques can achieve.
The eighth decimal place is worth up to 1.1 mm: this is good for charting motions of tectonic plates and movements of volcanoes. Permanent, corrected, constantly-running GPS base stations might be able to achieve this level of accuracy.
The ninth decimal place is worth up to 110 microns: we are getting into the range of microscopy. For almost any conceivable application with earth positions, this is overkill and will be more precise than the accuracy of any surveying device.
Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!
Your input values show more than 10 decimal places, so I'm guessing it's a calculated value and not a "true observation". That's good; that gives us more wiggle room to work with.
Why, we could dial that decimal declaration down to the following for half* the storage cost of the first one:
CREATE TABLE dbo.SO_65909630_alt
(
[Latitude] decimal(8,5)
, [Longitude] decimal(8,5)
);
Well, that's good: we've stored the "same" data at a lower cost. Maybe your use case is just "where are my stores", and even if you're Walmart with under 12,000 stores, who cares? That's a trivial cost. But if you need to also store the coordinates of their customers, the storage cost per record might start to matter. Or use Amazon or Alibaba or whatever very large consumer retailer exists when you read this.
In my work, I deal with meteorological data, and it comes in all shapes and sizes, but a common source for me is Stage IV data. It's just hourly rainfall amounts across the contiguous US. So 24 readings per coordinate, per day. The coordinate system is 1121 x 881 (987,601 points), so expressing hourly rainfall in the US for a day is 23,702,424 rows. The difference between 18 bytes and 10 bytes can quickly become apparent given that Stage IV data is available back to 2008.
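The arithmetic above, sketched out (the byte figures assume the decimal(18,15) pair versus a decimal(8,5) pair, and ignore all other row overhead):

```python
# Stage IV grid: 1121 x 881 points, 24 hourly readings per point per day.
points_per_grid = 1121 * 881           # 987,601 coordinates
rows_per_day = points_per_grid * 24    # 23,702,424 rows

# Cost of the lat/long pair per row: decimal(18,15) pair (2 * 9 bytes)
# versus decimal(8,5) pair (2 * 5 bytes).
wide_pair = 2 * 9      # 18 bytes
narrow_pair = 2 * 5    # 10 bytes
savings_per_day = rows_per_day * (wide_pair - narrow_pair)
print(rows_per_day)       # 23702424
print(savings_per_day)    # 189619392 bytes, roughly 181 MB per day
```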
We actually use a float (or real) to store latitude and longitude values because it saves us 2 bytes per coordinate.
CREATE TABLE dbo.SO_65909630_float
(
[Latitude] float(24)
, [Longitude] float(24)
);
INSERT INTO dbo.SO_65909630_alt
(
Latitude
, Longitude
)
SELECT * FROM dbo.SO_65909630 AS S
Now, this has caused me pain because I can't use an exact filter in queries because of the fun of floating point numbers.
My decimal typed table has this in it
41.89659 -87.66454
And my floating type table has this in it
41.89658 -87.66454
Did you notice the change to the last digit in Latitude? 8, not 9 as the decimal table has, but either way it doesn't matter:
SELECT * FROM dbo.SO_65909630_float AS S WHERE S.Latitude = 41.89658
This won't find a row because of floating point rounding exact match nonsense. Instead, your queries become very tight range queries, like
SELECT * FROM dbo.SO_65909630_float AS S WHERE S.Latitude >= (41.89658 - .00005) AND S.Latitude <= (41.89658 + .00005)
where .00005 is a value you'll have to experiment with, given your data, to find out how much you need to adjust the numbers to find them again.
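The same float behavior can be reproduced outside SQL Server. Here is a sketch in Python, using struct to round-trip a value through 32-bit storage (the same width as float(24)/real):

```python
import struct

def to_float32(x):
    """Round-trip a Python float through 32-bit storage, the same
    width as SQL Server's float(24)/real."""
    return struct.unpack('f', struct.pack('f', x))[0]

stored = to_float32(41.896585191199556)

# An exact equality test against the original value fails...
print(stored == 41.896585191199556)    # False
# ...but the tight range query from above finds it again.
eps = 0.00005
print(41.89658 - eps <= stored <= 41.89658 + eps)  # True
```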
Finally, for what it's worth, if you convert lat and long into the geography Point type, it's going to coerce the input data type to float anyway.

Print a number in decimal

Well, it is a low-level question.
Suppose I store a number (of course, the computer stores numbers in binary format).
How can I print it in decimal format? It is obvious in a high-level program: just print it, and the library does it for you.
But how about a very low-level situation where I don't have this library?
I can only tell what 'character' to output. How do I convert the number into decimal characters?
I hope you understand my question. Thank you.
There are two ways of printing decimals - on CPUs with division/remainder instructions (modern CPUs are like that) and on CPUs where division is relatively slow (8-bit CPUs of 20+ years ago).
The first method is simple: int-divide the number by ten, and store the sequence of remainders in an array. Once you have divided the number all the way down to zero, print the remainders starting from the back, adding the ASCII code of zero ('0') to each remainder.
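The first method, sketched in Python (the same divide/remainder loop works in any language with integer division):

```python
def to_decimal_string(n):
    """Convert a non-negative integer to decimal characters by
    repeated divide-by-ten, collecting remainders and emitting
    them back to front."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 10)
        digits.append(chr(ord('0') + remainder))  # remainder -> ASCII digit
    return ''.join(reversed(digits))

print(to_decimal_string(1234))  # 1234
```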
The second method relies on the lookup table of powers of ten. You define an array of numbers like this:
int pow10[] = {10000, 1000, 100, 10, 1};
Then you start with the largest power, and see if you can subtract it from the number at hand. If you can, keep subtracting it, and keep the count. Once you cannot subtract it without going negative, print the count plus the ASCII code of zero, and move on to the next smaller power of ten.
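The lookup-table method, sketched in Python using the same five-entry table; note the small amount of extra bookkeeping needed to suppress leading zeros:

```python
POW10 = [10000, 1000, 100, 10, 1]  # the same lookup table as above

def to_decimal_string_no_div(n):
    """Division-free conversion: for each power of ten, keep
    subtracting while the number stays non-negative and count how
    many times it fit. Leading zeros are suppressed, but the final
    ones digit is always emitted."""
    out = []
    for p in POW10:
        count = 0
        while n >= p:
            n -= p
            count += 1
        if count or out or p == 1:
            out.append(chr(ord('0') + count))
    return ''.join(out)

print(to_decimal_string_no_div(1234))  # 1234
```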
If integer, divide by ten, get both the result and the remainder. Repeat the process on the result until zero. The remainders will give you decimal digits from right to left. Add 48 for ASCII representation.
Basically, you want to transform a number (stored in some arbitrary internal representation) into its decimal representation. You can do this with a few simple mathematical operations. Let's assume that we have a positive number, say 1234.
number mod 10 gives you a value between 0 and 9 (4 in our example), which you can map to a character¹. This is the rightmost digit.
Divide by 10, discarding the remainder (an operation commonly called "integer division"): 1234 → 123.
number mod 10 now yields 3, the second-to-rightmost digit.
continue until number is zero.
Footnotes:
¹ This can be done with a simple switch statement with 10 cases. Of course, if your character set has the characters 0..9 in consecutive order (like ASCII), '0' + number suffices.
It doesn't matter what the number system is: decimal, binary, octal. Say I have the decimal value 123 on a decimal computer; I would still need to convert that value to three characters to display them. Let's assume ASCII format. By looking at an ASCII table we know the answer we are looking for: 0x31, 0x32, 0x33.
If you divide 123 by 10 using integer math you get 12. Multiply 12 * 10 and you get 120; the difference is 3, your least significant digit. We go back to the 12 and divide that by 10, giving 1. 1 times 10 is 10, and 12 - 10 is 2, our next digit. We take the 1 that is left over, divide by 10, and get zero, so we know we are done. The digits we found, in order, are 3, 2, 1. Reverse the order: 1, 2, 3. Add or OR 0x30 to each to convert them from integers to ASCII.
Change that to use a variable instead of 123, and use any numbering system you like, so long as it has enough digits to do this kind of work.
You can go the other way too: divide by 100...000, whatever the largest decimal you can store or intend to find, and work your way down. In this case the first non-zero comes with a divide by 100, giving 1. Save the 1. 1 times 100 = 100; 123 - 100 = 23. Now divide by 10; this gives 2. Save the 2; 2 times 10 is 20. 23 - 20 = 3. When you get to divide by 1 you are done; save that value as your ones digit.
Here is another: given a number of seconds to convert to, say, hours, minutes and seconds, you can divide by 60 and save the result as a; subtract (a*60) from the original number, giving the remainder, which is seconds; save that. Now take a and divide by 60, saving the result as b; this is your number of hours. Subtract (b*60) from a; that remainder is minutes, save that. Done: hours, minutes, seconds. You can then divide the hours by 24 to get days if you want, and then divide the days by 7 if you want weeks.
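The seconds-to-hours/minutes/seconds walkthrough can be sketched as two divide-by-60 steps; divmod performs the divide and the multiply-and-subtract in one call:

```python
def split_hms(total_seconds):
    """Split seconds into (hours, minutes, seconds) with two
    divide-by-60 steps: divmod returns both the quotient and the
    remainder that the subtraction would have produced."""
    a, seconds = divmod(total_seconds, 60)   # a = whole minutes
    hours, minutes = divmod(a, 60)
    return hours, minutes, seconds

print(split_hms(3725))  # (1, 2, 5)
```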
A comment about divide instructions was brought up. Divides are very expensive, and most processors do not have one. Expensive in that a single-clock divide costs you gates and power; if you do the divide over many clocks, you might as well just do a software divide and save the gates. It's the same reason most processors don't have an FPU: gates and power (gates mean larger chips, more expensive chips, lower yield, etc.). It is not a case of modern vs. old, or 64-bit vs. 8-bit, or anything like that; it is an engineering and business trade-off. The 8088/86 has a divide with a remainder, for example (it also has a BCD add). The gates/size, if used for something else, might be better served than by a single instruction. Multiply falls into that category too; not as bad, but it can be. If operand sizes are not done right, you can make either instruction (family) not as useful to a programmer.
Which brings up another point: I can't find the link right now, but a way to avoid divides when converting a number to a string of decimal digits is to multiply by .1 using fixed point. I also can't find the quote about real programmers not needing floating point, related to keeping track of the decimal point yourself; it's the slide rule vs. calculator thing. I believe the link to the article on dividing by 10 using a multiply is somewhere on Stack Overflow.
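On that multiply-by-.1 fixed-point trick: for unsigned 32-bit values, the constant/shift pair compilers typically emit for division by ten is 0xCCCCCCCD with a right shift of 35 (0xCCCCCCCD is approximately 2^35 / 10). A sketch verifying it, with the shift and constant stated as the standard choice rather than anything from the original answer:

```python
def div10_by_multiply(n):
    """Unsigned 32-bit divide-by-ten without a divide instruction:
    multiply by 0xCCCCCCCD (an approximation of 2**35 / 10) and
    shift right by 35."""
    assert 0 <= n < 2**32
    return (n * 0xCCCCCCCD) >> 35

print(div10_by_multiply(1234))  # 123
```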

How to simplify big numeric input from user? [Objective C]

I've been building a very basic iPhone app where the user will be able to enter or select a very large numeric cash value (usually in the thousands or millions).
At the moment I am using a simple text box entry with number pad selected.
I am going to use the example of a Football transfer fee as an analogy.
A transfer fee can be many millions, and I really do not want the user to mis-type zeros, or to get frustrated with the number of zeros they have to enter.
In addition, as the text box/numeric cash value is not displayed with any currency formatting it makes it very unintuitive to know just how much you are entering.
In this thread I have a way of displaying big numbers on the screen; you'll also notice the numbers are formatted in chunks (i.e., 2.25m, 2m, 7.25m, etc.), which makes the process more streamlined and more visually intuitive.
But what I am unsure about is how to make it easy for the user to enter big numbers without typing stupidly long zeros every time.
Possible solution 1 -- Use a UIPickerView with 3+ segments for each of the units.
Problem -- it won't handle smaller numbers properly; also you may get odd-looking numbers like 1.15k, which although correct is not what I want to display.
Possible solution 2 -- Use a +/- button to allow the user to simply increase/decrease the number by a factor of 250 or 500. This is the simplest answer, but it's not as elegant as a UIPickerView.
If there is another way to do this, a way to simplify the input of big numeric numbers from a user, I'd be interested.
You could add formatted output right above or below the text field. As they enter numbers, update the formatted field adding currency symbols, commas and decimals. Not the most elegant way to do this, but it would be simple to implement, and intuitive to the user.
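The "formatted label next to the text field" idea, sketched in Python for language-neutrality (on iOS, NSNumberFormatter with a currency style does the grouping natively); `format_fee`, `abbreviate_fee`, and the £ currency symbol are illustrative assumptions:

```python
def format_fee(value):
    """Hypothetical helper: render a raw entry with grouping
    separators for the live-formatted label."""
    return "£{:,}".format(int(value))

def abbreviate_fee(value):
    """Compact display in the 2.25m / 7.25m style from the question."""
    value = int(value)
    if value >= 1_000_000:
        return "{:g}m".format(value / 1_000_000)
    if value >= 1_000:
        return "{:g}k".format(value / 1_000)
    return str(value)

print(format_fee(2250000))      # £2,250,000
print(abbreviate_fee(2250000))  # 2.25m
```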

In SQL, how does a fixed-length data type take up space in memory?

I want to know how, in SQL, a fixed-length data type takes up space in memory. What I know is that for VARCHAR, if we specify a length of (20) and the user's input is 15 characters long, it takes 20 by padding with spaces; for VARCHAR2, if we specify a length of (20) and the user's input is 15 characters, it only takes 15 in memory. So how much space does a fixed-length data type take? I searched Google but did not find an explanation with an example. Please explain with an example. Thanks in advance.
A fixed length data field always consumes its full size.
In the old days (FORTRAN), it was padded at the end with space characters. Modern databases might do that too, but they either implicitly trim the trailing blanks off or the query has to do it explicitly.
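A fixed-length CHAR column can be modeled as pad-on-write, trim-on-read; `store_char20` and `read_char20` below are illustrative helpers, not a database API:

```python
def store_char20(value):
    """Model a CHAR(20) column: always pad to the declared length
    with spaces on write."""
    return value.ljust(20)

def read_char20(stored):
    """Readers typically trim the padding back off (an RTRIM)."""
    return stored.rstrip(' ')

stored = store_char20("hello")
print(len(stored))          # 20: the fixed-length field consumes its full size
print(read_char20(stored))  # hello
```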
Variable length fields are a relative newcomer to databases, probably in the 1970s or 1980s they made widespread appearances.
It is considerably easier to manage fixed-length record offsets and sizes than to compute the offset of each data item in a record that has variable-length fields. Furthermore, a fixed-length data record is easily addressed in a data file by computing the byte offset of its beginning: multiply the record size by the record number (and add the length of whatever fixed header data is at the beginning of the file).
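The offset arithmetic in that last sentence, as a sketch (record numbers are assumed zero-based; `record_offset` is a hypothetical helper):

```python
def record_offset(record_number, record_size, header_size=0):
    """Byte offset of a fixed-length record in a data file:
    header + record_size * record_number (zero-based records)."""
    return header_size + record_size * record_number

# Record 3 in a file of 128-byte records with a 64-byte header:
print(record_offset(3, 128, 64))  # 448
```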