How can I create an external table using textfile with presto? - hive

I've a csv file in hdfs directory /user/bzhang/filefortable:
123,1
And I use the following to create an external table with presto in hive:
create table hive.testschema.au1 (count bigint, matched bigint) with (format='TEXTFILE', external_location='hdfs://192.168.0.115:9000/user/bzhang/filefortable');
But when I run select * from au1, I get
presto:testschema> select * from au1;
count | matched
-------+---------
NULL | NULL
I changed the delimiter from a comma to a TAB, but it still returns NULL. But if I modify the csv to
123
with only 1 column, the select * from au1 gives me:
presto:testschema> select * from au1;
count | matched
-------+---------
123 | NULL
So is something wrong with my file format, or is it something else?

I suppose the field delimiter of the table is '\u0001' (the Hive default for text files).
You can either change the ',' in the file to '\u0001' or change the table's field delimiter to ',', and then check whether your problem is solved.
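To see why both columns come back as NULL, here is a minimal Python sketch of how a text row splits on '\u0001'. It is a simplified model, not Hive's or Presto's actual reader, and `parse_row` and `to_ctrl_a` are hypothetical helper names introduced only for illustration.

```python
CTRL_A = "\x01"  # Ctrl-A, the field delimiter assumed for the table


def parse_row(line, n_columns=2, delimiter=CTRL_A):
    """Split one text line and read each field as an integer.

    A field that is missing or not a valid integer becomes None,
    mirroring the NULLs in the query output (simplified model).
    """
    fields = line.split(delimiter)
    row = []
    for i in range(n_columns):
        try:
            row.append(int(fields[i]))
        except (IndexError, ValueError):
            row.append(None)
    return row


def to_ctrl_a(line):
    """Rewrite a comma-separated line to use the Ctrl-A delimiter."""
    return line.replace(",", CTRL_A)


print(parse_row("123,1"))             # whole line is one non-numeric field
print(parse_row("123"))               # only the first column is present
print(parse_row(to_ctrl_a("123,1")))  # both columns parse once the delimiter matches
```

With a comma in the file, the whole line is treated as a single field, which matches the two behaviours described in the question.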

Related

SQL - Replace a particular part of column string value (between second and third slash)

In my SQLServer DB, I have a table called Documents with the following columns:
ID - INT
DocLocation - NTEXT
DocLocation has values in following format:
'\\fileShare1234\storage\ab\xyz.ext'
Now it seems these documents are stored in multiple file share paths.
We're planning to migrate all documents in one single file share path called say 'newFileShare' while maintaining the internal folder structure.
So basically '\\fileShare1234\storage\ab\xyz.ext' should be updated to '\\newFileShare\storage\ab\xyz.ext'
Two questions:
How do I query my DocLocation to extract DocLocations with unique file share values? Like 'fileShare1234' and 'fileShare6789' and so on..
In a single Update query how do I update my DocLocation values to newFileShare ('\\fileShare1234\storage\ab\xyz.ext' to '\\newFileShare\storage\ab\xyz.ext')
I think the trick would be to extract and replace the text between the second and third slashes.
I've still not figured out how to achieve my first objective. I require those unique file shares for some other tasks.
As for the second objective, I've tried using replace, but it will require multiple update statements. Like I've done below:
update Documents set DocLocation = REPLACE(Cast(DocLocation as NVarchar(Max)), '\\fileShare1234\', '\\newFileShare\')
The first step is fairly easy. If all your paths begin with \\, then you can find all the DISTINCT servers using SUBSTRING. I will make a simple script with a table variable to replicate some data. The value of 3 is in the query and it is the length of \\ plus 1 since SQL Server counts from 1.
DECLARE @Documents AS TABLE(
ID INT NOT NULL,
DocLocation NTEXT NOT NULL
);
INSERT INTO @Documents(ID, DocLocation)
VALUES (1,'\\fileShare56789\storage\ab\xyz.ext'),
(2,'\\fileShare1234\storage\ab\cd\xyz.ext'),
(3,'\\share4567890\w\x\y\z\file.ext');
SELECT DISTINCT SUBSTRING(DocLocation, 3, CHARINDEX('\', DocLocation, 3) - 3) AS [Server]
FROM @Documents;
The results from this are:
Server
fileShare1234
fileShare56789
share4567890
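The index arithmetic in that SUBSTRING/CHARINDEX call can be sketched in Python, which counts from 0 rather than 1. This is only an illustration of the logic, and `share_name` is a hypothetical helper name.

```python
def share_name(path):
    """Return the server/share part of a UNC path (text after the leading double backslash)."""
    start = 2                      # 0-based index; T-SQL uses 3 because it counts from 1
    end = path.find("\\", start)   # like CHARINDEX('\', DocLocation, 3)
    return path[start:end] if end != -1 else path[start:]


paths = [
    r"\\fileShare56789\storage\ab\xyz.ext",
    r"\\fileShare1234\storage\ab\cd\xyz.ext",
    r"\\share4567890\w\x\y\z\file.ext",
]

# A set plays the role of DISTINCT; sorting is only for stable display.
servers = sorted({share_name(p) for p in paths})
print(servers)
```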
For the second part, we can just concatenate the new server name with the path that appears after the first \.
UPDATE @Documents
SET DocLocation = CONCAT('\\newfileshare\',
SUBSTRING(DocLocation, 3, LEN(CAST(DocLocation AS nvarchar(max))) - 2));
SELECT * FROM @Documents;
For some reason I cannot create a table with the results here, but the values I see are these:
\\newfileshare\fileShare56789\storage\ab\xyz.ext
\\newfileshare\fileShare1234\storage\ab\cd\xyz.ext
\\newfileshare\share4567890\w\x\y\z\file.ext
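Note that the values listed above still contain the old share name after the new one, whereas the question asks for '\\newFileShare\storage\ab\xyz.ext'. As a hedged illustration of the intended transformation (keep everything from the third backslash onward and put the new share in front), here is a small Python sketch. `replace_share` is a hypothetical helper and is not part of the T-SQL above.

```python
def replace_share(path, new_share):
    """Swap the server/share part of a UNC path, keeping the folder structure."""
    third_slash = path.find("\\", 2)   # first backslash after the leading pair
    rest = path[third_slash:] if third_slash != -1 else ""
    return "\\\\" + new_share + rest


for p in (
    r"\\fileShare56789\storage\ab\xyz.ext",
    r"\\fileShare1234\storage\ab\cd\xyz.ext",
    r"\\share4567890\w\x\y\z\file.ext",
):
    print(replace_share(p, "newFileShare"))
```

In T-SQL the same idea would presumably start the substring at CHARINDEX('\', DocLocation, 3) instead of at position 3, after casting the column to NVARCHAR(MAX).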
Please try the following solution based on XML and XQuery.
The XQuery data model is based on ordered sequences, which is exactly what we need to process a fully qualified file path: [position() ge 4]
When you are comfortable, just run the UPDATE statement by updating the DocLocation column with the calculated result.
It is better to use NVARCHAR(MAX) instead of NText data type.
SQL
-- DDL and sample data population, start
DECLARE @tbl AS TABLE(ID INT IDENTITY PRIMARY KEY, DocLocation NVARCHAR(MAX));
INSERT INTO @tbl(DocLocation) VALUES
('\\fileShare56789\storage\ab\xyz.ext'),
('\\fileShare1234\storage\ab\cd\xyz.ext'),
('\\share4567890\w\x\y\z\file.ext');
-- DDL and sample data population, end
DECLARE @separator CHAR(1) = '\'
, @newFileShare NVARCHAR(100) = 'newFileShare';
SELECT ID, DocLocation
, result = '\\' + @newFileShare + @separator +
REPLACE(c.query('data(/root/r[position() ge 4]/text())').value('text()[1]', 'NVARCHAR(MAX)'), SPACE(1), @separator)
FROM @tbl
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
REPLACE(DocLocation, @separator, ']]></r><r><![CDATA[') +
']]></r></root>' AS XML)) AS t(c);
Output
+----+---------------------------------------+--------------------------------------+
| ID | DocLocation | result |
+----+---------------------------------------+--------------------------------------+
| 1 | \\fileShare56789\storage\ab\xyz.ext | \\newFileShare\storage\ab\xyz.ext |
| 2 | \\fileShare1234\storage\ab\cd\xyz.ext | \\newFileShare\storage\ab\cd\xyz.ext |
| 3 | \\share4567890\w\x\y\z\file.ext | \\newFileShare\w\x\y\z\file.ext |
+----+---------------------------------------+--------------------------------------+
To get the unique list of shared folder paths, you can use this query:
SELECT distinct SUBSTRING(DocLocation,0,CHARINDEX('\',DocLocation,3))
from Documents
Your update command should work. Yes, you could merge a couple of REPLACE calls into one update, but it is better to run them separately:
update Documents
set DocLocation = REPLACE(DocLocation,'\\fileShare1234','\\newFileShare')
However, I recommend that you always record a relative address instead of the full path, like: '\storage\ab\xyz.ext'

SQL UPDATE specific characters in string

I have a column with the following values (there are a lot more):
20150223-001
20150224-002
20150225-003
I need to write an UPDATE statement which will change the first 2 characters after the dash to 'AB'. Result has to be the following:
20150223-AB1
20150224-AB2
20150225-AB3
Could anyone assist me with this?
Thanks in advance.
Use this,
DECLARE @MyString VARCHAR(30) = '20150223-0000000001'
SELECT STUFF(@MyString,CHARINDEX('-',@MyString)+1,2,'AB')
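For readers less familiar with STUFF, the following Python sketch mimics what that expression does: find the dash, then overwrite the two characters right after it. It is a simplified model (for example, it does not reproduce STUFF returning NULL for out-of-range positions), and both function names are hypothetical.

```python
def stuff(text, start, length, replacement):
    """Mimic T-SQL STUFF: delete `length` characters at 1-based `start`, then insert `replacement`."""
    i = start - 1
    return text[:i] + replacement + text[i + length:]


def replace_after_dash(value, replacement="AB"):
    """Overwrite the characters right after the first dash with `replacement`."""
    dash = value.find("-") + 1   # 1-based position, like CHARINDEX('-', value)
    if dash == 0:                # no dash found: leave the value unchanged
        return value
    return stuff(value, dash + 1, len(replacement), replacement)


for v in ("20150223-001", "20150224-002", "20150223-0000000001"):
    print(replace_after_dash(v))
```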
If there is a lot of data, you could consider using the .WRITE clause, but it is limited to the VARCHAR(MAX), NVARCHAR(MAX) and VARBINARY(MAX) data types.
If your column has one of these types, the .WRITE clause is easiest for this purpose; example below:
UPDATE Codes
SET val.WRITE('AB',9,2)
GO
Other possible choice could be simple REPLACE:
UPDATE Codes
SET val=REPLACE(val,SUBSTRING(val,10,2),'AB')
GO
or STUFF:
UPDATE Codes
SET val=STUFF(val,10,2,'AB')
GO
I relied on the information that the column always has 8 characters of date followed by one dash. I prepared a table and checked some of the solutions mentioned here.
CREATE TABLE Codes(val NVARCHAR(MAX))
INSERT INTO Codes
SELECT TOP 500000 CONVERT(NVARCHAR(128),GETDATE()-CHECKSUM(NEWID())%1000,112)+'-00'+CAST(ABS(CAST(CHECKSUM(NEWID())%10000 AS INT)) AS NVARCHAR(128))
FROM sys.columns s1 CROSS JOIN sys.columns s2
I ran some tests, and based on 10kk rows with an NVARCHAR(MAX) column, I got the following results:
+---------+------------+
| Method | Time |
+---------+------------+
| .WRITE | 28 seconds |
| REPLACE | 30 seconds |
| STUFF | 15 seconds |
+---------+------------+
As we can see, STUFF looks like the best option for updating part of a string. .WRITE should be considered when you insert or append new data into a string; then you can take advantage of minimal logging if the database recovery model is set to bulk-logged or simple. See the MSDN article about the UPDATE statement: Updating Large Value Data Types
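One caveat worth illustrating: REPLACE(val, SUBSTRING(val,10,2), 'AB') replaces every occurrence of the two extracted characters, not only the ones at position 10, while STUFF touches only the given position. The Python sketch below shows the difference on a made-up value whose date part also contains '00' (the value is hypothetical and not taken from the test table).

```python
value = "20001005-001"                        # hypothetical value; '00' also occurs in the date part

target = value[9:11]                          # like SUBSTRING(val, 10, 2)
by_value = value.replace(target, "AB")        # like REPLACE: every occurrence is replaced
by_position = value[:9] + "AB" + value[11:]   # like STUFF(val, 10, 2, 'AB')

print(by_value)
print(by_position)
```

If the date part can never contain the two characters that follow the dash, both approaches give the same result.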
According to the OP comment:
It's always 8 characters before the dash but the characters after the
dash can vary. It has to update the first two after the dash.
use the following simple code:
DECLARE @MyString VARCHAR(30) = '20150223-0000000001'
SELECT REPLACE(@MyString,SUBSTRING(@MyString,9,3),'-AB')
Result:-
20150223-AB00000001
try,
update table set column=stuff(column,charindex('-',column)+1,2,'AB')
Declare @Table1 TABLE (DateValue Varchar(50))
INSERT INTO @Table1
SELECT '20150223-000000001' Union all
SELECT '20150224-000000002' Union all
SELECT '20150225-000000003'
SELECT DateValue,
CONCAT(SUBSTRING(DateValue,0,CHARINDEX('-',DateValue)),
REPLACE(LEFT(SUBSTRING(DateValue,CHARINDEX('-',DateValue)+1,Len(DateValue)),2),'00','-AB'),
SUBSTRING(DateValue,CHARINDEX('-',DateValue)+1,Len(DateValue))) AS ExpectedDateValue
FROM @Table1
Output
DateValue ExpectedDateValue
---------------------------------------------
20150223-000000001 20150223-AB000000001
20150224-000000002 20150224-AB000000002
20150225-000000003 20150225-AB000000003
To Update
Update @Table1
SEt DateValue= CONCAT(SUBSTRING(DateValue,0,CHARINDEX('-',DateValue)),
REPLACE(LEFT(SUBSTRING(DateValue,CHARINDEX('-',DateValue)+1,Len(DateValue)),2),'00','-AB'),
SUBSTRING(DateValue,CHARINDEX('-',DateValue)+1,Len(DateValue)))
From @Table1
SELECT * from @Table1
Output
DateValue
-------------
20150223-AB000000001
20150224-AB000000002
20150225-AB000000003

SQL server insert encoding

Using SQL Server 2012 Management studio,
running the following command inserts the data but converts the "é" to another character that looks like a comma but is not (char code 8128):
INSERT INTO [dbo].[MyTable] VALUES(3,'City','Québec')
I tried the N prefix but it didn't work:
INSERT INTO [dbo].[MyTable] VALUES(3,'City',N'Québec')
However, if I use the "Edit" mode of Management Studio, the correct value is inserted.
The data type of the column is nvarchar(100)
I think it has something to do with encoding, but I can't find how to fix it. In my C# project, I use LinqToSql to extract the data, and I end up with the bad char (char code 8128) if the data was inserted with the command instead of the "Edit" mode.
I would appreciate a fix and a short explanation. Thx
If you want to insert these values from code, then you would use the N prefix, and use the actual unicode character like so:
create table mytable (id int, type varchar(16), name nvarchar(64))
insert into mytable values (3,'City',N'Québec')
select * from mytable
rextester demo: http://rextester.com/JUBZS75211
returns:
+----+------+--------+
| id | type | name |
+----+------+--------+
| 3 | City | Québec |
+----+------+--------+
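As a possible explanation (an assumption; the thread itself does not confirm the cause): if the script file stores 'é' in an OEM code page such as CP850 but is read as Windows-1252, that byte turns into '‚' (U+201A), which looks like a comma. The sketch below only demonstrates this byte-level effect in Python and does not involve SQL Server.

```python
original = "Québec"

raw = original.encode("cp850")    # bytes as an editor using the CP850 code page would save them
misread = raw.decode("cp1252")    # the same bytes interpreted as Windows-1252

print(misread)                    # the accented letter is now a comma-like character
print(hex(ord(misread[2])))       # code point of that character
```

If this is what happened, saving the script as Unicode (or UTF-8 with a BOM) and keeping the N prefix would be the thing to try. The question quotes char code 8128, while U+201A is 8218 in decimal; whether that is a transposition is only a guess.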

Create table with a variable name

I need to create tables on a daily basis, named after the date in the format yyMMdd. I tried this:
dbadmin=> \set table_name 'select to_char(current_date, \'yyMMdd \')'
dbadmin=> :table_name;
to_char
---------
150515
(1 row)
I then tried to create a table whose name comes from the variable :table_name, but got this:
dbadmin=> create table :table_name(col1 varchar(1));
ERROR 4856: Syntax error at or near "select" at character 14
LINE 1: create table select to_char(current_date, 'yyMMdd ')(col1 va...
Is there a way to store a value in a variable and then use that variable as the table name, or to make the inner select statement execute first so that it gives me the name I require?
Please suggest.
Try this.
For whatever reason the stored variable comes with some extra whitespace, which I had to remove. A table name also cannot start with a digit, so I had to add a prefix such as tbl_.
In short, you need to store the output of the query rather than its text, so you have to do some extra work and execute the query through vsql:
\set table_name `vsql -U dbadmin -w d -t -c "select concat('tbl_',replace(to_char(current_date, 'yyMMdd'),' ',''))"`
Create table:
create table :table_name(col1 varchar(1));
(dbadmin@:5433) [dbadmin] *> \d tbl_150515
Schema | public
Table | tbl_150515
Column | col1
Type | varchar(1)
Size | 1
Default |
Not Null | f
Primary Key | f
Foreign Key |
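The same idea (compute the name first, then splice it into the DDL text) can also be sketched outside vsql. The Python below only builds the statement string; `daily_table_name` and `create_table_sql` are hypothetical helpers, and actually running the statement against the database is left out.

```python
from datetime import date


def daily_table_name(day, prefix="tbl_"):
    """Build a table name such as tbl_150515 from a date (yyMMdd)."""
    return prefix + day.strftime("%y%m%d")


def create_table_sql(day):
    """Return the CREATE TABLE statement for the given day's table."""
    # The name consists only of the fixed prefix and digits, so plain
    # string formatting is acceptable here; do not do this with untrusted input.
    return f"create table {daily_table_name(day)}(col1 varchar(1));"


print(create_table_sql(date(2015, 5, 15)))
```

Passing `date.today()` instead of a fixed date gives the name for the current day.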

Insert a empty string on SQL Server with BULK INSERT

The example table contains the fields Id (the identity of the table, an integer) and Name (a simple string attribute that allows null values).
I'm trying a CSV that contains this:
1,
1,""
1,''
None of them gives me an empty string as the result of the bulk insertion. I'm using SQL Server 2012.
What can I do?
As far as I know, bulk insert can't insert an empty string; for an empty field it either keeps the NULL value (with the KEEPNULLS option) or uses the column's default value (without KEEPNULLS). For your 3 sample records, after the insert, the table should look like:
| id | name
| 1 | NULL
| 1 | ""
| 1 | ''
The reason is that bulk insert treats the second column of your first row as NULL; for the other 2 rows, it treats the second column value as not null and takes it as it is. Instead of having BULK INSERT insert an empty string for you, you can give your table column an empty string as its default value.
Example as following:
CREATE TABLE BulkInsertTest (id int, name varchar(10) DEFAULT '')
Bulk insert the same CSV file into the table:
BULK INSERT Adventure.dbo.BulkInsertTest
FROM '....\test.csv'
WITH
(
FIELDTERMINATOR ='\,',
ROWTERMINATOR ='\n'
)
SELECT * FROM BulkInsertTest
The result will be like following: (The first row in your CSV will get an empty string)
| id | name
| 1 |
| 1 | ""
| 1 | ''
Please bear in mind that the specified DEFAULT value will only get inserted if you are not using the option KEEPNULLS.
Using the same example as above, if you add the option KEEPNULLS to the BULK INSERT, i.e.:
BULK INSERT BulkInsertTest
FROM '....\test.csv'
WITH
(
FIELDTERMINATOR ='\,',
ROWTERMINATOR ='\n',
KEEPNULLS
)
will result in the default column value being ignored and NULLs being inserted for empty strings, i.e.:
SELECT * FROM BulkInsertTest
will now give you:
id name
1 NULL
1 ""
1 ''
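The three outcomes described in this thread can be summarised with a small Python model. It is a simplified sketch of the behaviour as described here, not of SQL Server's actual implementation, and `stored_value` is a hypothetical name.

```python
def stored_value(raw_field, column_default=None, keepnulls=False):
    """Simplified model of what lands in the column for one CSV field."""
    if raw_field != "":
        return raw_field       # quotes are not stripped: '""' is stored as two characters
    if keepnulls:
        return None            # KEEPNULLS: the empty field stays NULL, the default is ignored
    return column_default      # otherwise the column default (None if there is none)


fields = ["", '""', "''"]      # second column of the three sample rows

print([stored_value(f) for f in fields])                                     # no default
print([stored_value(f, column_default="") for f in fields])                  # DEFAULT ''
print([stored_value(f, column_default="", keepnulls=True) for f in fields])  # DEFAULT '' plus KEEPNULLS
```

Only the empty field is affected by the default and by KEEPNULLS; the quoted values are stored literally in all three cases.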
There does not seem to be a good reason to add KEEPNULLS in your example, but I came across a similar problem just now, where KEEPNULLS was required in the BULK INSERT.
My solution was to make the column [name] in the staging table BulkInsertTest NOT NULL, but remember that the DEFAULT column value gets ignored and an empty string gets inserted instead.
See more here : Keep Nulls or UseDefault Values During Bulk Import (SQL Server)