Databricks Round Function not working when Exporting to CSV - apache-spark-sql

I have a table in Databricks with format such as:
Column_Name   Type
person_id     string
score         float
When running this SQL query in Databricks:
select person_id, round(score, 0) as score, floor(score) as score2
from table
where score is not null
The output looks like this in the Databricks UI:
person_id   score   score2
1           445     445
2           157     157
When exporting the PREVIEW, the CSV output looks the same:
person_id,score,score2
1,445,445
2,157,157
However, when exporting the FULL RESULTS (i.e., re-running and exporting all), the CSV output looks like this:
1,445.0,445
2,157.0,157
The round function is not retaining its expected result; instead, the value defaults to a float with a decimal point. The floor function appears to be working fine. Is this a Spark thing? Novice Databricks user here.
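If the goal is an integer column in the exported CSV, one workaround (a sketch, assuming the rounded values fit in an int) is to cast the rounded value explicitly. In Spark SQL, round on a float or double column keeps a floating-point type, which is where the 445.0 in the full export comes from, while floor returns a bigint; the UI and the preview export apparently just render the float without its trailing .0:
select person_id,
       cast(round(score, 0) as int) as score,
       floor(score) as score2
from table
where score is not null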

Related

How to aggregate logs by field and then by bin in AWS CloudWatch Insights?

I'm trying to write a query that will first aggregate by field count and then by bin(1h). For example, I would like to get a result like:
#   Date                        Field   Count
1   2019-01-01T10:00:00.000Z    A       123
2   2019-01-01T11:00:00.000Z    A       456
3   2019-01-01T10:00:00.000Z    B       567
4   2019-01-01T11:00:00.000Z    B       789
Not sure if it's possible though, the query should be something like:
fields Field
| stats count() by Field by bin(1h)
Any ideas how to achieve this?
Is this what you need?
fields Field | stats count() by Field, bin(1h)
If you want to create a line chart, you can do it by separately counting each value that your field could take.
fields
Field = 'A' as is_A,
Field = 'B' as is_B
| stats sum(is_A) as A, sum(is_B) as B by bin(1hour)
This solution requires your query to include a string literal of each value ('A' and 'B' in OP's example). It works as long as you know what those possible values are.
This might be what Hugo Mallet was looking for, except that the avg() function won't work here, so he'd have to calculate the average by dividing by a total.
Not able to group by a certain field and create visualizations.
fields Field
| stats count() by Field, bin(1h)
Keep getting this message
No visualization available. Try this to get started:
stats count() by bin(30s)

Amazon Redshift - Pivot Large JSON Arrays

I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
95% contain 16 or fewer elements
the longest (to date) contains 67 elements
the hard limit is 512 elements (beyond that, 64kB isn't big enough)
The task is simple, pivot each array such that each element has its own row.
 id | json
----+---------------------------------------------
 01 | [{"something":"here"}, {"fu":"bar"}]
=>
 id | element_id | json
----+------------+---------------------------------
 01 |          1 | {"something":"here"}
 01 |          2 | {"fu":"bar"}
Without any kind of table-valued functions (user-defined or otherwise), I've resorted to pivoting via a join against a numbers table.
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(
        src.json,
        pvt.element_id
    ) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_table AS pvt(element_id)
    ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
The elapsed time is horrendous, and it's not to do with distribution, sort order, or encoding. It's to do with (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB.
15MB * 512 rows in numbers = 7.5GB
If I put just 128 rows in numbers then the working memory needed reduces by 4x and the elapsed time similarly reduces (not 4x, the real query does other work, it's still writing the same amount of results data, etc, etc).
So, I wonder, what about adding this?
WHERE
pvt.element_id < (SELECT MAX(json_array_length(src.json)) FROM source_table)
No change to the working memory needed, the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers, that didn't help. I've tried making a CTE to create the list of numbers, with a WHERE clause to limit the size, that didn't help (effectively Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
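As a rough sketch, that temporary table might be built like this (numbers_trimmed is a hypothetical name; the subquery caps the range at the longest array actually present):
CREATE TEMP TABLE numbers_trimmed AS
SELECT pvt.element_id
FROM numbers_table AS pvt(element_id)
WHERE pvt.element_id < (SELECT MAX(json_array_length(json)) FROM source_table);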
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
My next effort will be to work on it in two steps (a sketch follows below):
1. As above, but with a table of only 16 rows, and only for rows with 16 or fewer elements.
2. As above, with the dynamically sized numbers table, and only for rows with more than 16 elements.
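Something along these lines, as a sketch (numbers_16 is a hypothetical 16-row table holding 0..15, and numbers_trimmed is the dynamically sized temporary table from above):
SELECT src.id,
       pvt.element_id,
       json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM source_table AS src
INNER JOIN numbers_16 AS pvt(element_id)
    ON pvt.element_id < json_array_length(src.json)
WHERE json_array_length(src.json) <= 16

UNION ALL

SELECT src.id,
       pvt.element_id,
       json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM source_table AS src
INNER JOIN numbers_trimmed AS pvt(element_id)
    ON pvt.element_id < json_array_length(src.json)
WHERE json_array_length(src.json) > 16;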
Question: Does anyone have any better ideas?
Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
,c.name.given
,c.name.family
,o.price
,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
 id | given | family | price | quantity
----+-------+--------+-------+----------
  1 | John  | Smith  | 100.5 |        9
  1 | John  | Smith  | 99.12 |        2
Neither the data format, nor the task you wish to do, is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to run queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (e.g. Distribution Keys, Sort Keys, Zone Maps, parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
 id | element_id | key       | value
----+------------+-----------+-------
 01 |          1 | something | here
 01 |          2 | fu        | bar
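As a sketch of what that layout could look like as DDL (the column names, sizes, and the DISTKEY/SORTKEY choices here are illustrative assumptions, not part of the original answer):
CREATE TABLE element_values (
    id            varchar(10),
    element_id    int,
    element_key   varchar(255),
    element_value varchar(65535)
)
DISTKEY (id)
SORTKEY (id, element_id);
With the data in plain columns like this, Redshift can use its distribution, sorting, and zone-map machinery instead of re-parsing JSON on every query.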
As to how best to convert the existing JSON data into separate rows, I would frankly recommend doing it outside of Redshift, then loading the result into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
It may work faster if you avoid parsing and interpreting the JSON as JSON and instead treat it as plain text. If you're sure about the structure of your JSON values (which I guess you are, since the original query does not produce a JSON parsing error), you might try using the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json,',',pvt.element_id)
If your elements contain commas, you might use:
split_part(src.json,'},{',pvt.element_id)
Also, the ON pvt.element_id < json_array_length(src.json) join condition still parses the JSON, so to avoid JSON parsing completely you might try a cross join and then keep only the non-empty values.
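A sketch of that idea, under the assumption that split_part returns an empty string once the position runs past the last element (table and column names as in the question; the fragments would still need their braces and brackets restored):
SELECT
    src.id,
    pvt.element_id,
    split_part(src.json, '},{', pvt.element_id + 1) AS json_fragment   -- split_part positions are 1-based; the numbers table is 0-based
FROM source_table AS src
CROSS JOIN numbers_table AS pvt(element_id)
WHERE split_part(src.json, '},{', pvt.element_id + 1) <> '';           -- keep only positions that actually exist in this array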

SQL Server 2014 equivalent to mysql's find_in_set()

I'm working with a database that has a locations table such as:
locationID | locationHierarchy
         1 | 0
         2 | 1
         3 | 1,2
         4 | 1
         5 | 1,4
         6 | 1,4,5
which makes a tree like this
1
--2
----3
--4
----5
------6
where locationHierarchy is a CSV string of the locationIDs of all its ancestors (think of a hierarchy tree). This makes it easy to determine the hierarchy when working toward the top of the tree given a starting locationID.
Now I need to write code to start with an ancestor and recursively find all descendants. MySQL has a function called 'find_in_set' which easily parses a CSV string to look for a value. It's nice because I can just say "find in set the value 4", which would give all locations that are descendants of locationID 4 (including 4 itself), roughly as sketched below.
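In MySQL that would be roughly (a sketch of what is described above):
SELECT *
FROM locations
WHERE FIND_IN_SET('4', locationHierarchy)   -- 4 appears among the ancestors
   OR locationID = 4;                       -- include location 4 itself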
Unfortunately, this is being developed on SQL Server 2014, which has no such function. The CSV string is of variable length (virtually unlimited levels allowed) and I need a way to find all descendants of a location.
A lot of what I've found on the internet to mimic the find_in_set function in SQL Server assumes a fixed depth of hierarchy (such as a 4-level maximum), which wouldn't work for me.
Does anyone have a stored procedure or anything that I could integrate into a query? I'd really rather not have to pull all records from this table to use code to individually parse the CSV string.
I would imagine searching the locationHierarchy string for locationID% or %,{locationid},% would work but be pretty slow.
I think you want LIKE -- in either database. Something like this:
select l.*
from locations l
where l.locationHierarchy like @LocationHierarchy + ',%';
If you want the original location included, then one method is:
select l.*
from locations l
where l.locationHierarchy + ',' like @LocationHierarchy + ',%';
I should also note that SQL Server has proper support for recursive queries, so it has other options for handling hierarchies apart from these CSV hierarchy paths (which are still a very reasonable solution); see the sketch below.
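For example, a minimal recursive CTE sketch, assuming the immediate parent's id is the last element of locationHierarchy and the root stores '0' (the variable name and varchar sizes are illustrative):
WITH descendants AS (
    SELECT locationID, locationHierarchy
    FROM locations
    WHERE locationID = @LocationID             -- the starting ancestor
    UNION ALL
    SELECT l.locationID, l.locationHierarchy
    FROM locations l
    JOIN descendants d
      ON l.locationHierarchy =
         CASE WHEN d.locationHierarchy = '0'   -- children of the root
              THEN CAST(d.locationID AS varchar(20))
              ELSE d.locationHierarchy + ',' + CAST(d.locationID AS varchar(20))
         END
)
SELECT *
FROM descendants;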
Finally, it worked for me:
SELECT * FROM locations
WHERE locationHierarchy like CONCAT(@param, ',%')
   OR locationHierarchy like CONCAT('%,', @param, ',%')
   OR locationHierarchy like CONCAT('%,', @param)

how to change datatype of a column in sybase query?

One of my queries to the Sybase server is returning garbage data. After some investigation I found out that one of the columns, with datatype double, is causing the issue. If I don't select that particular column then the query returns the correct result. The column in question is a double with a large number of decimal places. I tried to use the round function up to 4 decimal places but I still get corrupt data. How can I correctly specify the column in my query to get correct data?
I am using a Windows 7 box and the Sybase Adaptive Server Enterprise driver (Sybase client 15.5). I am using 32-bit drivers.
Sample results:
Incorrect result using the Sybase ASE driver on a Windows 7 box:
"select ric_code as ric, adjusted_weight as adjweight from v_temp_idx_comp where index_ric_code='.AXJO' and ric_code='AQG.AX'"
ric adjweight
1 AQG.AX NA
2 \020 NA
3 <NA> NA
Correct result on a Windows XP box using the Merant driver:
"select ric_code, adjusted_weight from v_temp_idx_comp where index_ric_code='.AXJO' and ric_code='AQG.AX'"
ric_code adjusted_weight
1 AQG.AX 0.3163873547
Regards,
Alok
You may try converting to numeric, like this:
select ric_code as ric, weight, convert(numeric(16,4), adjusted_weight) as adjweight, currency as currency
from v_temp_idx_comp
where index_ric_code='.AXJO'

how to link tables together using timestamp sql, mysql

Here is how my tables are currently set up:
Dataset
|
- Dataset_Id - Int
|
- Timestamp - Timestamp

Flowrate
|
- Flowrate_id - int
|
- Dataset_id - ALL NULL (INT)
|
- TimeStamp - TimeStamp
|
- FlowRate - Float
I want to update the Flowrate table's Dataset_id column so that its ids correspond to the Dataset table's Dataset_Ids. The Dataset table has close to 400000 rows... How can I do this so that it does not take forever? This data came from different data loggers, and that's why I need to link them by their timestamps...
UPDATE
Flowrate JOIN Dataset ON (Flowrate.TimeStamp = Dataset.Timestamp)
SET Flowrate.Dataset_id = Dataset.Dataset_Id
This is completely independent of Python, of course (what a weird tag to put here -- as if MySQL cared what language you're using to send fixed SQL statements to it?!). It will be fast if and only if the tables are properly indexed, of course.
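For instance, a sketch of the kind of indexes that would make the timestamp join fast (the index names are illustrative):
-- an index on the join column of each table lets MySQL do indexed lookups instead of full scans
CREATE INDEX idx_dataset_timestamp ON Dataset (Timestamp);
CREATE INDEX idx_flowrate_timestamp ON Flowrate (TimeStamp);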
Absolutely weird capitalization irregularities you have in your schema, BTW -- would drive me absolutely bonkers if anybody used lowercase vs uppercase at random spots of column names that are so obviously "meant to" be identical! Nevertheless I've tried to reproduce it exactly, but I hope you reconsider this absurd style choice.