SQLAlchemy MariaDB column encoding problem - dataframe

I am pulling a table from a MariaDB database using SQLAlchemy like so:
engine = db.create_engine('mariadb+mariadbconnector://username:password@127.0.0.1:3306/dbname?charset=utf8mb4')
Then I create the connection:
connection = engine.connect()
I then add a select statement and put the table in a pandas data frame:
from sqlalchemy.sql import text
objects = text('''SELECT * FROM objects o
INNER JOIN area a ON a.id = o.area_id LIMIT 10;''')
raw = pd.read_sql(objects, connection)
So far so good, but when I access the data frame, the uuid column is in the below format:
0 b'\x05\xd5\x0b\x80\xf4\x05O\xd3\x9e\x17\x88\xb5p\xca\x8d6'
1 b'Q\xddJ\xd6y\xdeOG\xad\xdc\xbc#\xb5,\xfe\x08'
2 b'z\xde\xb7\xb8\x160O\xc9\x80\x0b\x96\xbaR\x04k\r'
3 b'\xeb\x7f\xb9~\xa8\x0eO\x9f\x87\xea`#\x16)QD'
4 b'\x051\xc1\x81\xbf\xe2O!\xa3AT\xa1\xf7X\x92\xbc'
5 b'\x1c\x00x\x99\xbbQO\xc9\xbdZ\xccb(K5b'
6 b'DFg\xa7_\xfeO\xe3\x95\x95-u\xd7\xed\x90\xd8'
7 b'\x91\xba\xe0\xe2\x1c\xe7OS\xbbW\x0b\xcd\t\x85V\xf0'
8 b'`\xdb\xd7\xba~\xdeO\xb2\xa5\xcd)\x00\xa5&\xa0,'
9 b'%\x06\xf5<_\xa7O\x08\x9c\x90\n|t\xc8\x95\xdc'
Going back to the database and executing the same query, I get the below result in the uuid column:
1 Õ ô OÓ µpÊ 6
2 QÝJÖyÞOG­Ü¼#µ,þ
3 zÞ·¸ 0OÉ ºR k
4 ë ¹~¨ O ê`# )QD
5 1Á ¿âO!£AT¡÷X ¼
6 x »QOɽZÌb(K5b
7 DFg§_þOã -u×í Ø
8 ºàâ çOS»W Í Vð
9 `Û׺~ÞO²¥Í) ¥& ,
10 % õ<_§O |tÈ Ü
I understand I am having an encoding problem, and I tried decoding/encoding like below:
raw.uuid.str.encode('utf-8')
but I am stuck. Any pointers on how I can fix this at the source, or at least at the data frame level, are much appreciated.

MariaDB Connector/Python sets utf8mb4 by default and doesn't accept another character set.
Since the result is a binary object, the uuid was stored in a blob (binary) column. Instead of trying to encode it, you need to convert it to a string:
>>> import uuid
>>> uuid.UUID(bytes=b'\x05\xd5\x0b\x80\xf4\x05O\xd3\x9e\x17\x88\xb5p\xca\x8d6')
UUID('05d50b80-f405-4fd3-9e17-88b570ca8d36')
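To convert the whole data frame column at once, here is a minimal sketch (assuming the column is named uuid and each value is a 16-byte bytes object, as in the output above):

import uuid

raw['uuid'] = raw['uuid'].apply(lambda b: str(uuid.UUID(bytes=b)))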


How to process mainframe numbers where "{" is the last character

I have mainframe file data like the below:
000000720000{
I need to parse the data and load it into a Hive table like below:
72000
The above field is an income column, and the "{" sign denotes a positive amount.
The datatype used while creating the table is income decimal(11,2).
In the layout.cob copybook it is INCOME PIC S9(11)V99.
Could someone help?
The number you want is 7200000 which would be 72000.00.
The conversion you are looking for is:
Positive numbers
{ = 0
A = 1
B = 2
C = 3
D = 4
E = 5
F = 6
G = 7
H = 8
I = 9
Negative numbers (this makes the whole value negative)
} = 0
J = 1
K = 2
L = 3
M = 4
N = 5
O = 6
P = 7
Q = 8
R = 9
Let's explain why.
Based on your question, the issue you are having is with packed decimal data that has been unpacked (UNPK) into character data. The PIC S9(11)V99 field actually takes up 7 bytes of packed storage and looks like the picture below.
You'll see three lines. The top is the character representation (missing in the first picture because the hex values do not map to displayable characters) and the two lines below are the hexadecimal values, most significant nibble on top and least significant below.
Note that in the rightmost byte the sign is stored as C, which is positive; to represent a negative value you would see a D.
When it is converted to character data it will look like this:
Notice the C0, which is a consequence of the unpacking preserving the sign. Be aware that this display is on z/OS, which is EBCDIC. If the file has been transferred and converted to another code page, you will see the correct character but the hex values will be different.
The tables at the top of this answer list all the combinations you are likely to see for positive and negative numbers. To make your life easy: if you see one of the first set of characters, replace it with the corresponding digit; if you see something from the second set, the whole value is a negative number.
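For completeness, here is a minimal Python sketch of that mapping (an illustration only, assuming ASCII input and two implied decimal places per INCOME PIC S9(11)V99):

# Overpunch characters: '{' .. 'I' encode 0-9 (positive),
# '}' .. 'R' encode 0-9 (whole value negative).
POSITIVE = {c: str(i) for i, c in enumerate('{ABCDEFGHI')}
NEGATIVE = {c: str(i) for i, c in enumerate('}JKLMNOPQR')}

def decode_overpunch(field):
    last = field[-1]
    if last in NEGATIVE:
        digits, sign = field[:-1] + NEGATIVE[last], '-'
    else:
        digits, sign = field[:-1] + POSITIVE.get(last, last), ''
    # Two implied decimal places: split off the last two digits.
    return sign + (digits[:-2].lstrip('0') or '0') + '.' + digits[-2:]

print(decode_overpunch('000000720000{'))  # -> 72000.00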

Unable to identify strange whitespace character in MSSQL table

We have a process that reads an XML file into our database and inserts into another table any rows that aren't already there.
This process also has a trigger that writes to an audit table, and a nightly snapshot is also held in a further table.
In the XML holding table a field looks like 1234567890123456, but it exists in our live table as 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6. Those spaces cannot be removed by any combination of REPLACE functions. We have tried all CHAR values and it does not recognise the character. The audit table and nightly snapshot, however, contain the correct values.
Similarly, SELECT CASE WHEN '1234567890123456' = '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 ' THEN 1 ELSE 0 END returns 1, so the two values match. However, LEN('1234567890123456') is 16 and LEN('1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 ') is 32.
We have run some queries to loop through the characters in the field and output the ASCII and Unicode values for each character. The digits return the correct ASCII/Unicode values, but this stray whitespace character does not return a value.
An example of the incorrectly displayed one is 0x35000000320000003800000036000000380000003300000039000000370000003800000037000000330000003000000035000000340000003000000033000000 and a correct one is 0x3500320038003600380033003200300030003000360033003600380036003000. Both were added by the same means on the same day. One has the extra bytes, the other is fine.
How can we identify this character and get rid of it? Is there a reason this would have been inserted originally? How can we avoid this in future?
Data entry
It looks like some null (i.e. Char(0)) characters have got into the data.
If the data was entered as UTF-16 but was treated as single-byte characters when it was sent to the database, each UTF-16 code unit would be stored as two characters, the real one followed by a Char(0):
Entered character code (UTF-16LE bytes): 48 00
Sent to the database: 48 00 00 00
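That mechanism is easy to reproduce; a Python sketch (using latin-1 to stand in for the single-byte misreading):

original = '1234'.encode('utf-16-le')        # b'1\x002\x003\x004\x00'
misread = original.decode('latin-1')         # bytes taken as 8-bit characters
widened = misread.encode('utf-16-le').hex()  # every character gains a null
print(widened)  # 31000000320000003300000034000000 - the same pattern as above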
To avoid that, remove disallowed characters as the first step in processing the input, say by using a regex to replace [\x00-\x1F] with an empty string.
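For example, in Python (a sketch; raw_value stands for whatever string arrives from the XML feed):

import re

control_chars = re.compile(r'[\x00-\x1F]')
clean = control_chars.sub('', raw_value)  # drop Char(0) and other control characters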
Data repair
Search for entries with a Char(0) in them to confirm that they can be found that way.
If so, replace the Char(0) with an empty string.
If that doesn't work, you could convert the data to the format '0x35000000320000003800000036000000380000003300000039000000370000003800000037000000330000003000000035000000340000003000000033000000', replace '000000' with '00', and then convert back.
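That last transformation can be sanity-checked outside the database; a Python sketch (safe here only because every character is a plain ASCII digit, so '000000' never spans two real code units):

bad = '35000000320000003800000036000000'
fixed = bad.replace('000000', '00')
print(bytes.fromhex(fixed).decode('utf-16-le'))  # -> 5286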

gnuplot: Spurious data points in plots when using index

I'm trying to use gnuplot 4.6 patchlevel 6 to visualize some data from a file test.dat which looks like this:
#Pkg 1
type min max avg
small 1 10 5
medium 5 15 7
large 10 20 15
#Pkg 2
small 3 9 5
medium 5 13 6
large 11 17 13
(Note that the values are actually separated by tabs even though it shows as spaces here.)
My gnuplot commands are
reset
set datafile separator "\t"
plot 'test.dat' index 0 using 2:xticlabels(1) title col, '' using 3 title col, '' using 4 title col
This works fine as long as there is only a single data block in test.dat. When I add the second block spurious data points appear. Why is that and how can it be fixed?
For the record: using stats on the file yields only expected results. It reports two data blocks for the full file and correct values (for min, max and sum) when I specify one of the two blocks using index.
As mentioned in the comment to the question, one has to explicitly repeat the index 0 specification in every part of the plot command:
plot 'test.dat' index 0 using 2:xticlabels(1) title col, '' index 0 using 3 title col, '' index 0 using 4 title col
Otherwise, '' refers to all blocks in the data file.

In PostgreSQL, how to get all rows that end with 1?

Suppose I have a table as follows:
id name length
1 A 21.5
2 B 12.4
3 C 0
4 D 17
5 E 1
I wish to get:
id name length
1 A 21.5
5 E 1
Meaning, all rows whose length ends with 1.
length is a numeric column.
It's a very simple thing to do in a programming language, but it seems quite unnatural in SQL. How can I do it efficiently and simply?
My only thought is to convert the field to text, drop everything after the '.', convert it to an array, and take the character at the position given by the array length. This would probably work, but it seems like a very bad solution.
You can use FLOOR and modulo division:
SELECT *
FROM tab
WHERE FLOOR(length) % 10 = 1;
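As a quick check of the logic, here is a Python sketch mirroring the query against the sample rows:

import math

rows = [(1, 'A', 21.5), (2, 'B', 12.4), (3, 'C', 0), (4, 'D', 17), (5, 'E', 1)]
print([r for r in rows if math.floor(r[2]) % 10 == 1])
# -> [(1, 'A', 21.5), (5, 'E', 1)]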

Processing: loading table data

I have a text file "celldata.txt" containing a very simple table of data.
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
The problem comes when I try to access the data at a certain column and row.
My approach has been to load it using loadTable:
Table table;
int numCols;
int numRows;

void setup() {
  size(200, 200);
  table = loadTable("celldata.txt", "tsv");
  numRows = table.getRowCount();
  numCols = table.getColumnCount();
}

void draw() {
  background(255);
  fill(0);
  text(numRows + " " + numCols, 100, 100); // check the number of rows and columns
  println(table.getFloat(0, 0));
}
Question 1: When I do this, it says the number of rows is 5 and the number of columns is just 1. Why is it not 5 x 4?
Question 2: Why is table.getFloat(0,0) "NaN" instead of the first element of the data?
I want to use a much bigger matrix later and access certain elements (of type double) with something like getFloat(i,j) and be able to loop through all elements.
Using the same example data as mine, can someone please help me understand what is wrong with my code and how to access the text file's data? Should I be using another method than loadTable?
You've told Processing that the file contains tab separated values (by using the "tsv" option), but your file contains space separated values.
Since your file does not contain any tabs, it reads each entire row as a single value. So the 0,0 position of your table is 1 2 3 4, which isn't a number, hence the NaN. This is also why it thinks your table only has one column.
You should modify your celldata.txt file to actually be separated by tabs instead of spaces:
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
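If editing the file by hand is impractical, a small Python sketch (hypothetical file names) can rewrite it with tabs:

with open('celldata.txt') as src, open('celldata_tabs.txt', 'w') as dst:
    for line in src:
        dst.write('\t'.join(line.split()) + '\n')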
You could also separate them by commas and then use the "csv" option.
If you're still having trouble, you can see what Processing is reading in by adding saveTable(table, "data/new.csv"); to the end of your setup() function and then looking at that file. It will be a list of values separated by commas, so you can see exactly where Processing thinks the cells of the table are.