Redshift UNLOAD Select Distinct creates larger zip file vs Select *? - gzip

I'm seeing a very strange thing with Redshift UNLOAD and wondering if anyone else has seen this or has an explanation for it.
I have one UNLOAD query. When I unload using 'Select Distinct' with GZIP, the unloaded files add up to three times the size of the files produced by a 'Select *' (no distinct) with GZIP.
Here's the query:
UNLOAD ('SELECT DISTINCT <29 fields> FROM public.mytable WHERE myfield = 999')
TO 's3://myBucket/myfile.txt' CREDENTIALS 'mycreds' DELIMITER '\t'
GZIP PARALLEL TRUE MAXFILESIZE 256 MB ALLOWOVERWRITE;
The output of this query adds up to 26GB.
If I change it to 'Select *', the output is 8GB. Also, Select Distinct creates 14 zip files, while Select * creates only 5 zip files.
This seems contradictory, as one would expect the deduplicated files to be smaller.
I'm thinking GZIP Distinct is creating a much larger zip dictionary file than GZIP Select *.
The question is, why?
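One way to dig into this is to compare what each UNLOAD actually wrote. A hedged diagnostic sketch, assuming the standard STL_UNLOAD_LOG system table (column names per the Redshift docs; adjust the path filter to your prefix):
-- Compare file count, row count and bytes written by the two UNLOAD runs
SELECT query,
       COUNT(*)           AS files,
       SUM(line_count)    AS total_rows,
       SUM(transfer_size) AS total_bytes
FROM stl_unload_log
WHERE path LIKE 's3://myBucket/myfile.txt%'
GROUP BY query
ORDER BY query;
If the DISTINCT run writes no more rows yet far more bytes, the size difference comes from how well those rows compress, not from extra data being written.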

Related

SQL import from openrowset mixed type [keep leading 0, long ints and TEXT] correctly

I have a column in Excel I'm trying to import, and it includes codes like the following:
01166
1166
2354654765432
xx132
I use IMEX=1 and everything imports as text, but when I insert the result of the SELECT into a temporary table with an nvarchar column, the codes with long numbers come out wrong:
1001051 becomes 1.00105e+006
I tried casting to bigint, but that makes a code like 01166 lose its leading 0.
This is my current query:
INSERT INTO #XTEMP
SELECT DISTINCT
(CASE ISNUMERIC([item_code]) WHEN 1 THEN CAST(CAST([item_code] AS BIGINT) AS nvarchar) ELSE CAST([item_code] AS nvarchar) END)
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
'Excel 12.0 Xml;HDR=YES;IMEX=1;Database=C:\path\file.xls',
'SELECT * FROM [sheet$]')
IMEX=1 allows reading mixed columns; it does not force data values to be text.
Looking at this SO page, you can see a possible workaround if you set HDR=NO, but that has its own complications.
If you can, the best thing to do is set the format of the column in the spreadsheet to Text.
Not exactly a fix, but more of a workaround: I wrote an Excel macro that added a character (e.g. 'B') in front of the codes, then removed it when importing with OPENROWSET, like:
select RIGHT([code], LEN([code])-1)
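A rough sketch of that workaround on the import side, reusing the query from the question (it assumes the macro has already put the extra leading character on every code in the sheet):
-- Strip the leading sentinel character (e.g. 'B') that the macro added
INSERT INTO #XTEMP
SELECT DISTINCT RIGHT([item_code], LEN([item_code]) - 1)
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Excel 12.0 Xml;HDR=YES;IMEX=1;Database=C:\path\file.xls',
                'SELECT * FROM [sheet$]')
Because every cell now starts with a letter, the driver has no reason to treat the column as numeric, so both the leading zeros and the long codes survive.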

escape character on copy csv

I have a SQL script intended to export data from Postgres and spool it into a CSV file. It used to work fine until I added random sampling to the line.
Here is what the code looks like with random():
\COPY (select accountid, to_char(createtime,'YYYY-MM-DD HH24:MI:SS.ms') from accounts random()< 0.01 limit 1000) to '/home/oracle/scripts/accounts_p.csv' WITH DELIMITER ',' NULL AS ' '
ERROR MESSAGE when running this sql script:
psql:accounts_sample_p.sql:1: \copy: ERROR: syntax error at or near ")"
LINE 1: ...from accounts random ( ) < 0.01 l...
^
Apparently it did not like the (). I tried using an escape character with \ before ( and before ), but it did not help.
Can anyone give me advice on how to overcome this? Thanks.
You seem to be missing a WHERE in your SELECT statement, i.e.:
select accountid, to_char(createtime,'YYYY-MM-DD HH24:MI:SS.ms') from accounts WHERE random() < 0.01 limit 1000
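Folded back into the \copy line from the question, the corrected command would be:
\COPY (select accountid, to_char(createtime,'YYYY-MM-DD HH24:MI:SS.ms') from accounts WHERE random() < 0.01 limit 1000) to '/home/oracle/scripts/accounts_p.csv' WITH DELIMITER ',' NULL AS ' '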
Alternative methods on selecting a random set in PostgreSQL can be found here if you're struggling with performance:
Best way to select random rows PostgreSQL
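One such alternative, if the server is PostgreSQL 9.5 or later, is TABLESAMPLE, which avoids evaluating random() for every row; a hedged sketch:
-- BERNOULLI (1) keeps roughly 1% of rows, similar to random() < 0.01
select accountid, to_char(createtime,'YYYY-MM-DD HH24:MI:SS.ms')
from accounts TABLESAMPLE BERNOULLI (1)
limit 1000;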

Searching for a specific text value in a column in SQLite3

Suppose I have a table named 'Customer' with many columns, and I want to display all customers whose name ends with 'Thomas' (Lastname = 'Thomas'). The following query returns an empty result (no rows). It also didn't show any error.
SELECT * FROM Customer WHERE Lastname = 'Thomas';
Executing the following query gives me the correct result.
SELECT * FROM Customer WHERE Lastname LIKE '%Thomas%';
I would like to know what the problem with my first query is. I am using sqlite3 with npm. Below is the output of the '.show' command (just in case the problem is with the config).
sqlite> .show
echo: off
explain: off
headers: on
mode: column
nullvalue: ""
output: stdout
separator: "|"
stats: off
width:
1. Use LIKE instead of =
2. TRIM to ensure that there aren't spaces messing things up
so the query will be
SELECT * FROM Customer WHERE trim(Lastname) LIKE 'Thomas';
Depending on your types, you probably don't need point 2 since, as can be read in the MySQL manual:
All MySQL collations are of type PADSPACE. This means that all CHAR
and VARCHAR values in MySQL are compared without regard to any
trailing spaces
But point 1 could be the solution. Actually, if you want to avoid problems, you should compare strings with LIKE instead of =.
If you still have problems, you will probably have to use collations.
SELECT *
FROM t1
WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci; #using your real table collation
More information here. But specifically with 'Thomas' you shouldn't need it, since it hasn't got any special characters.
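As a minimal illustration of point 2, with hypothetical data: a single trailing space in the stored value is enough to make the = comparison miss in SQLite, while TRIM plus LIKE still matches:
CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Lastname TEXT);
INSERT INTO Customer (Lastname) VALUES ('Thomas ');          -- note the trailing space

SELECT * FROM Customer WHERE Lastname = 'Thomas';            -- returns no rows
SELECT * FROM Customer WHERE trim(Lastname) LIKE 'Thomas';   -- returns the row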

Export from sql to excel

I run a query using the SQL*Plus command-line interface. The query will fetch some 30 million records. I need to export the result to either CSV or XLS format. Can anyone let me know if this is possible?
Any help is much appreciated.
Thanks in advance.
Try spool myresults.csv before your select statement; the resulting file can easily be opened in Excel.
EDIT
Like this:
SET UNDERLINE OFF
SET COLSEP ','
--That's the separator used by excel later to parse the data to columns
SET LINES 100 PAGES 100
SET FEEDBACK off
--If you don't want column headings in CSV file
SET HEADING off
Spool ~\myresults.csv
--Now the actual query
SELECT * FROM YOUR_TABLE;
Spool OFF
EDIT 2
You might want to batch your results if you're going to query 30M records. I've never gone that far in an Excel file, but the limit is 65,536 rows per .xls sheet (that would be roughly 458 files for 30M records).
I'd go with cutting the query into blocks of about 60K rows and spooling each SELECT to a different file, maybe by looping on an integer and concatenating it to the end of each filename, as sketched below.
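A hedged sketch of that batching idea in SQL*Plus, assuming the table has a numeric key (here called ID) you can page on; the ranges and file names are placeholders:
SET COLSEP ','
SET HEADING OFF
SET FEEDBACK OFF
-- batch 1
SPOOL myresults_1.csv
SELECT * FROM YOUR_TABLE WHERE ID BETWEEN 1 AND 60000;
SPOOL OFF
-- batch 2
SPOOL myresults_2.csv
SELECT * FROM YOUR_TABLE WHERE ID BETWEEN 60001 AND 120000;
SPOOL OFF
-- ...repeat for the remaining ranges, or generate this script programmatically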
SET PAGESIZE 50000
SET FEEDBACK OFF
SET MARKUP HTML ON SPOOL ON
SET NUM 24
SPOOL sample.xls
SELECT * from users;
SPOOL OFF
SET MARKUP HTML OFF SPOOL OFF
This option will help you export directly to an Excel file:
SET PAGESIZE 40000
SET FEEDBACK OFF
SET MARKUP HTML ON
SET NUM 24
SPOOL file_name.xls
---- Execute your query
SPOOL OFF
SET MARKUP HTML OFF
SPOOL OFF
Spool sqlplus to xls format

How to read values from Excel using Openrowset Function?

I am reading an Excel sheet using the OPENROWSET function.
My Excel sheet has numeric values in a General-type column. For some reason these values are brought over as NULLs even though they have values. I am not sure why this is happening. I looked into the format of the fields and they are set to General in Excel; I tried setting them to Text and that did not help.
I tried to bring the contents from the Excel source into a text file in CSV format, and for some reason the text field containing numeric values came out blank (NULL).
Any input on getting this addressed will be much appreciated.
SET @Cmd = 'INSERT INTO Table_01
SELECT * FROM OPENROWSET(''Microsoft.Jet.OLEDB.4.0'', ''Excel 8.0;Database=' + @ExcelFilePath + ''',
''SELECT * FROM [Sheet1$]'')'
EXEC(@Cmd)
This is to do with TypeGuessRows and IMEX: the Jet provider samples only the first few rows (8 by default) to guess each column's type, and any value that doesn't fit the guessed type comes back as NULL. Adding IMEX=1 makes it treat mixed columns as text:
SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;HDR=YES;IMEX=1;Database=x.xls',
'SELECT * FROM [Sheet2$]');
TypeGuessRows can be found at:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel
A value of 0 means all rows.
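A hedged sketch of how IMEX=1 might be folded into dynamic SQL like the snippet in the question (variable names come from the question; the workbook path is just an example):
DECLARE @ExcelFilePath NVARCHAR(260) = 'C:\Temp\Book1.xls';  -- example path
DECLARE @Cmd NVARCHAR(MAX);
SET @Cmd = 'INSERT INTO Table_01
SELECT * FROM OPENROWSET(''Microsoft.Jet.OLEDB.4.0'',
    ''Excel 8.0;HDR=YES;IMEX=1;Database=' + @ExcelFilePath + ''',
    ''SELECT * FROM [Sheet1$]'')';
EXEC(@Cmd);
If values still come back NULL, setting TypeGuessRows to 0 at the registry key above (so the driver scans all rows before guessing a type) is the usual next step.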