I loaded data into a Hive table. Some of the columns are empty, and when I view the table in Hive they show as NULL.
When I download that data from HDFS (under the path /apps/hive/warehouse/dbname/file-name), the downloaded file contains \N instead of null.
How can I replace those \N values with empty strings in my file?
I would also like to save the file in XLSX format.
Set the table property serialization.null.format to an empty string:
tblproperties ('serialization.null.format' = '')
Demo
hive> create table t1 (i int,j int,k int);
hive> insert into t1 values (1,null,2);
hive> select * from t1;
+------+------+------+
| t1.i | t1.j | t1.k |
+------+------+------+
| 1    | NULL | 2    |
+------+------+------+
$ hdfs dfs -cat /user/hive/warehouse/t1/* | od -Anone -tacd1x1
   1 soh   \   N soh   2  nl    # a  = named characters
   1 001   \   N 001   2  \n    # c  = ASCII characters or backslash escapes
  49   1  92  78   1  50  10    # d1 = decimal (1-byte)
  31  01  5c  4e  01  32  0a    # x1 = hexadecimal (1-byte)
hive> create table t2 (i int,j int,k int) tblproperties ('serialization.null.format' = '');
hive> insert into t2 values (1,null,2);
hive> select * from t2;
+------+------+------+
| t2.i | t2.j | t2.k |
+------+------+------+
| 1    | NULL | 2    |
+------+------+------+
$ hdfs dfs -cat /user/hive/warehouse/t2/* | od -Anone -tacd1x1
   1 soh soh   2  nl    # a  = named characters
   1 001 001   2  \n    # c  = ASCII characters or backslash escapes
  49   1   1  50  10    # d1 = decimal (1-byte)
  31  01  01  32  0a    # x1 = hexadecimal (1-byte)
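For a table that already exists, the same property can be set after the fact. A hedged sketch: files already on disk keep their \N bytes until the data is rewritten, and for string columns the old \N may be read back as a literal once the property changes, so verify the result.
hive> alter table t1 set tblproperties ('serialization.null.format' = '');
hive> -- rewrite the existing files so the new null format is applied on disk
hive> insert overwrite table t1 select * from t1;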
I have a table paths:
CREATE TABLE paths (
    id_travel INT,
    point     INT,
    visited   INT
);
Sample rows:
 id_travel | point | visited
-----------+-------+---------
        10 |    35 |       0
        10 |    16 |       1
        10 |    93 |       2
         5 |    15 |       0
         5 |    26 |       1
         5 |   193 |       2
         5 |    31 |       3
And another table distances:
CREATE TABLE distances (
    id_port1 INT,
    id_port2 INT,
    distance INT CHECK (distance > 0),
    PRIMARY KEY (id_port1, id_port2)
);
I need to make a view:
 id_travel | point1 | point2 | distance
-----------+--------+--------+---------
        10 |     35 |     16 |     1568
        10 |     16 |     93 |      987
         5 |     15 |     26 |      251
         5 |     26 |    193 |       87
         5 |    193 |     31 |      356
I don't know how to build dist_trips as a recursive query here:
CREATE VIEW dist_view AS
WITH RECURSIVE dist_trips (id_travel, point1, point2) AS
(SELECT ????)
SELECT dt.id_travel, dt.point1, dt.point2, d.distance
FROM dist_trips dt
NATURAL JOIN distances d;
dist_trips is supposed to be a recursive query that returns three columns: id_travel, point1, and point2, derived from the table paths.
You don't need recursion. This can be done with plain joins:
SELECT id_travel, p1.point AS point1, p2.point AS point2, d.distance
FROM   paths p1
JOIN   paths p2 USING (id_travel)
LEFT   JOIN distances d ON d.id_port1 = p1.point
                       AND d.id_port2 = p2.point
WHERE  p2.visited = p1.visited + 1
ORDER  BY id_travel, p1.visited;
db<>fiddle here
Your paths seem to have gapless ascending numbers. Just join each point with the next.
I threw in a LEFT JOIN to keep all edges of each path in the result, even if the distances table has no matching entry. Probably prudent.
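Since the stated goal was a view, the query can simply be wrapped; a sketch using the names from the question (the ORDER BY is left out, since ordering is better applied by whoever queries the view):
CREATE VIEW dist_view AS
SELECT id_travel, p1.point AS point1, p2.point AS point2, d.distance
FROM   paths p1
JOIN   paths p2 USING (id_travel)
LEFT   JOIN distances d ON d.id_port1 = p1.point
                       AND d.id_port2 = p2.point
WHERE  p2.visited = p1.visited + 1;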
Your NATURAL JOIN wasn't going anywhere. Generally, NATURAL is rarely useful and breaks easily. The manual warns:
USING is reasonably safe from column changes in the joined relations
since only the listed columns are combined. NATURAL is considerably
more risky since any schema changes to either relation that cause a
new matching column name to be present will cause the join to combine
that new column as well.
The following Postgres table contains some sample content where the binary data is stored as bit varying (https://www.postgresql.org/docs/10/datatype-bit.html):
 ID | Binary data
----+------------
  1 | 01110
  2 | 0111
  3 | 011
  4 | 01
  5 | 0
  6 | 00011
  7 | 0001
  8 | 000
  9 | 00
 10 | 0
 11 | 110
 12 | 11
 13 | 1
Q: Is there any query (either plain SQL or a Postgres function) that returns all rows where the binary data field equals a leading substring (prefix) of the target bit array? To make it clearer, consider the example search value 01101:
01101 -> no result
0110  -> no result
011   -> 3
01    -> 4
0     -> 5, 10
The result returned should contain the rows: 3, 4, 5 and 10.
Edit:
The working query is (thanks to Laurenz Albe):
SELECT * FROM table WHERE '01101' LIKE (table.binary_data::text || '%')
Furthermore, I found this discussion about Postgres fixed-size bit vs. bit varying helpful:
PostgreSQL Bitwise operators with bit varying "cannot AND bit strings of different sizes"
How about
WHERE '01101' LIKE (col2::text || '%')
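To see this end to end, here is a minimal reproduction; the table and column names (bits, bin_data) are mine:
CREATE TABLE bits (id int, bin_data bit varying);
INSERT INTO bits VALUES
  (1, B'01110'), (2, B'0111'),  (3, B'011'), (4, B'01'),  (5, B'0'),
  (6, B'00011'), (7, B'0001'),  (8, B'000'), (9, B'00'),  (10, B'0'),
  (11, B'110'),  (12, B'11'),   (13, B'1');

-- keep every row whose bit string is a prefix of the search value
SELECT id
FROM   bits
WHERE  '01101' LIKE (bin_data::text || '%');
-- returns 3, 4, 5, 10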
I think you are looking for bitwise and:
where col2 & B'01101' = col2
Okay, my question is similar to this but my case is different. In my PostgreSQL 9.5 database, I have a table my_table having a layout like follows:
ID   a0   a1   ..   a23   b0    b1    ..   b23   c0     c1   ..   c23
 1   23   22   ..    11   12    0.12  ..    65   0.17   12   ..    19
 2   42   52   ..    12   1.2   14    ..    42   0.35   12   ..    12
 3   11   25   ..    13   2.5   0.14  ..    15   1.1     8   ..    14
The first column ID (integer) is unique for each record, while there are 24 numeric columns for each of the variables a, b and c, summing up to 72 columns. I want to multiply each entry in these 72 columns by a fixed number, say 0.20. I am aware of the PostgreSQL UPDATE command:
UPDATE my_table set a0 = a0 * 0.20
In this case I would need to repeat the command a large number of times, which is undesirable. Is there a quicker approach (a single statement or an iteration) to multiply multiple columns by a fixed number?
Example table:
drop table if exists my_table;
create table my_table(id serial primary key, a1 dec, a2 dec, a3 dec);
insert into my_table values
(default, 1, 2, 3);
Use execute in an anonymous code block:
do $$
begin
    -- build a single UPDATE covering every column whose name starts with a, b or c,
    -- then run it dynamically
    execute concat('update my_table set ',
                   string_agg(format('%1$I = %1$I * 0.2', attname), ','))
    from pg_attribute a
    where attrelid = 'my_table'::regclass
    and attnum > 0             -- skip system columns
    and attname ~ '^[abc]+';   -- only columns named a*, b* or c*
end
$$;
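For the three-column example table, the statement that string_agg assembles and execute runs is simply:
update my_table set a1 = a1 * 0.2,a2 = a2 * 0.2,a3 = a3 * 0.2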
select * from my_table;
 id | a1  | a2  | a3
----+-----+-----+-----
  1 | 0.2 | 0.4 | 0.6
(1 row)
I would like to create a script that finds the rows of a table which have a specific mathematical difference in their ASCII sums, and adds those rows to a separate table, or even flags a different field when they have that difference.
For instance, I am looking to find when the ASCII sum of word A and the ASCII sum of word B, both stored in rows of a table, differ by 63 or 31.
I could probably use a loop to select these rows, but SQL is not my greatest virtue.
ItemID | asciiSum | ProperDiff
-------|----------|-----------
     1 |      100 |
     2 |       37 |
     3 |       69 |
     4 |       23 |
     5 |        6 |
     6 |       38 |
After running the code, the field ProperDiff should contain 'yes' for ItemID 1, 2, 3, 5 and 6, since for example the asciiSum values of ItemID 1 and 2 differ by 63 (100 - 37), and so on.
This will not be fast, but I think it does what you want:
update t
    set ProperDiff = 'yes'
    where exists (select 1
                  from t t2
                  where abs(t2.AsciiSum - t.AsciiSum) in (63, 31)
                 );
It should work okay on small tables.
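If you prefer the other option from the question, copying the qualifying rows into a separate table, the same EXISTS test can drive an INSERT. A sketch; the target table t_flagged is hypothetical:
-- hypothetical target table; the name t_flagged is mine
create table t_flagged (ItemID int, asciiSum int);

insert into t_flagged (ItemID, asciiSum)
select ItemID, asciiSum
from t
where exists (select 1
              from t t2
              where abs(t2.asciiSum - t.asciiSum) in (63, 31)
             );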
Let's say I have a SQLite database that contains a table:
sqlite> create table person (id integer, firstname varchar, lastname varchar);
Now I want to get every entry which is in the table.
sqlite> select t0.id, t0.firstname, t0.lastname from person t0;
This works fine, and it is what I would use. However, I have worked with a framework from Apple (Core Data) that generates SQL. This framework generates a slightly different SQL query:
sqlite> select 0, t0.id, t0.firstname, t0.lastname from person t0;
Every SQL query generated by this framework begins with "select 0,". Why is that?
I tried to use the explain command to see what's going on, but the output was inconclusive, at least to me.
sqlite> explain select t0.id, t0.firstname, t0.lastname from person t0;
addr  opcode      p1  p2  p3  p4      p5  comment
----  ----------  --  --  --  ------  --  -------
0     Trace       0   0   0           00  NULL
1     Goto        0   11  0           00  NULL
2     OpenRead    0   2   0   3       00  NULL
3     Rewind      0   9   0           00  NULL
4     Column      0   0   1           00  NULL
5     Column      0   1   2           00  NULL
6     Column      0   2   3           00  NULL
7     ResultRow   1   3   0           00  NULL
8     Next        0   4   0           01  NULL
9     Close       0   0   0           00  NULL
10    Halt        0   0   0           00  NULL
11    Transactio  0   0   0           00  NULL
12    VerifyCook  0   1   0           00  NULL
13    TableLock   0   2   0   person  00  NULL
14    Goto        0   2   0           00  NULL
And the table for the second query looks like this:
sqlite> explain select 0, t0.id, t0.firstname, t0.lastname from person t0;
addr  opcode      p1  p2  p3  p4      p5  comment
----  ----------  --  --  --  ------  --  -------
0     Trace       0   0   0           00  NULL
1     Goto        0   12  0           00  NULL
2     OpenRead    0   2   0   3       00  NULL
3     Rewind      0   10  0           00  NULL
4     Integer     0   1   0           00  NULL
5     Column      0   0   2           00  NULL
6     Column      0   1   3           00  NULL
7     Column      0   2   4           00  NULL
8     ResultRow   1   4   0           00  NULL
9     Next        0   4   0           01  NULL
10    Close       0   0   0           00  NULL
11    Halt        0   0   0           00  NULL
12    Transactio  0   0   0           00  NULL
13    VerifyCook  0   1   0           00  NULL
14    TableLock   0   2   0   person  00  NULL
15    Goto        0   2   0           00  NULL
Some frameworks do this in order to tell, without any doubt, whether a row from that table was returned.
Consider
  A            B
+---+     +---+------+
| a |     | a | b    |
+---+     +---+------+
| 0 |     | 0 | 1    |
+---+     +---+------+
| 1 |     | 1 | NULL |
+---+     +---+------+
| 2 |
+---+
SELECT A.a, B.b
FROM A
LEFT JOIN B
ON B.a = A.a
Results
+---+------+
| a | b    |
+---+------+
| 0 | 1    |
+---+------+
| 1 | NULL |
+---+------+
| 2 | NULL |
+---+------+
In this result set, it is not possible to see that a = 1 exists in table B while a = 2 does not. To get that information, you need to select a non-nullable expression from table B, and the simplest way to do that is to select a simple constant value.
SELECT A.a, B.x, B.b
FROM A
LEFT JOIN (SELECT 0 AS x, B.a, B.b FROM B) AS B
ON B.a = A.a
Results
+---+------+------+
| a | x    | b    |
+---+------+------+
| 0 | 0    | 1    |
+---+------+------+
| 1 | 0    | NULL |
+---+------+------+
| 2 | NULL | NULL |
+---+------+------+
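Reading the flag back out is then a simple NULL test; a sketch building on the query above (the alias found_in_b is mine):
SELECT A.a,
       (B.x IS NOT NULL) AS found_in_b  -- true exactly when a matching B row exists
FROM A
LEFT JOIN (SELECT 0 AS x, B.a, B.b FROM B) AS B
ON B.a = A.a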
There are a lot of situations where these constant values are not strictly required, for example when you have no joins, or when you could choose a non-nullable column from B instead, but they don't cause any harm either, so they can just be included unconditionally.
When I have code to dynamically generate a WHERE clause, I usually start the clause with a:
WHERE 1 = 1
Then the loop to add additional conditions always adds each condition in the same format:
AND x = y
without having to put conditional logic in place to check whether this is the first condition: "if this is the first condition then start with the WHERE keyword, else add the AND keyword".
So I can imagine a framework doing this for similar reasons. If you start the statement with SELECT 0, the code that adds subsequent columns can run in a loop without any conditional statements: just append ", colx" each time, without any checking along the lines of "if this is the first column, don't put a comma before the column name, otherwise do".
Example pseudocode:
String query = "SELECT 0";
for (Column col : columnList) {
    query += ", " + col.name;  // the leading "SELECT 0" makes the comma always safe
}
Only Apple knows … but I see two possibilities:
Inserting a dummy column ensures that the actual output columns are numbered beginning with 1, not 0. If some existing interface already assumed one-based numbering, doing it this way in the SQL backend might have been the easiest solution.
If you make a query for multiple objects using multiple subqueries, a value like this could be used to determine from which subquery a record originates:
SELECT 0, t0.firstname, ... FROM PERSON t0 WHERE t0.id = 123
UNION ALL
SELECT 1, t0.firstname, ... FROM PERSON t0 WHERE t0.id = 456
(I don't know if Core Data actually does this.)
Your EXPLAIN output shows that the only difference is (at address 4) that the second program sets the extra column to zero, so there is only a minimal performance difference.