df.printSchema() in Databricks has its output truncated by the notebook cell, and the cell is not expandable or scrollable to see the schema in its entirety.
Is there a builtin way to view the full schema?
For context: Python notebook, 7.4 ML runtime.
Screenshot showing truncated output
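One possible workaround sketch (assuming a PySpark DataFrame named df; this is not a dedicated builtin): turn the schema into a small DataFrame and show it with display(), which Databricks renders as a table that can be scrolled through.

# Build one row per column: name, type, nullability
fields = [(f.name, f.dataType.simpleString(), f.nullable) for f in df.schema.fields]
schema_df = spark.createDataFrame(fields, ["column", "type", "nullable"])
display(schema_df)  # table output instead of the truncated text from printSchema()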
I'm attempting to upload a CSV file (the output of a BCP command) to BigQuery using the bq load command from the gcloud CLI. I have already uploaded a custom schema file (I was having major issues with autodetect).
One resource suggested this could be a datatype mismatch. However, the table in the SQL database lists the column as a decimal, so in my schema file I listed it as FLOAT, since decimal is not a supported data type.
I couldn't find any documentation for what the error means and what I can do to resolve it.
What does this error mean? In this context, it means a value is REQUIRED for a given column index and none was found. (Note that columns are usually 0-indexed, so a fault at column index 8 most likely refers to column number 9.)
This can be caused by a myriad of different issues, of which I experienced two.
Incorrectly categorizing NULL columns as NOT NULL. After exporting the schema from SSMS as JSON, I needed to clean it up for BQ, and in doing so I assigned IS_NULLABLE:NO to MODE:NULLABLE and IS_NULLABLE:YES to MODE:REQUIRED. These values should have been reversed. This caused the error because there were NULL columns where BQ expected a REQUIRED value.
Using the wrong delimiter. The file I was outputting was not only comma-delimited but also tab-delimited. I was only able to confirm this by importing the data with Excel's Get Data tool, after which I could see the stray tabs inside the cells.
After re-exporting with a pipe (|) delimiter, I was finally able to load the file into BigQuery without any errors.
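For illustration only (the column names here are hypothetical), a corrected schema entry for a nullable column and a load command using a pipe delimiter might look like this:

[
  {"name": "OrderId", "type": "STRING", "mode": "REQUIRED"},
  {"name": "Amount", "type": "FLOAT", "mode": "NULLABLE"}
]

bq load --source_format=CSV --field_delimiter="|" mydataset.mytable data.csv ./schema.json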
I have a Databricks notebook that will run every 2-4 weeks. It reads in a small CSV, performs ETL in Python, then truncates and loads the result into a Delta table.
This is what I am currently doing to avoid failures related to data type:
python to replace all '-' with '0'
python to drop rows with NaN or nan
spark_df = spark.createDataFrame(dfnew)
spark_df.write.saveAsTable("default.test_table", index=False, header=True)
This automatically detects the datatypes and is working right now.
BUT, what if a datatype cannot be detected, or is detected wrongly? I'm mostly concerned about doubles, ints, and bigints.
I tested casting, but it doesn't work on Databricks:
spark_df = spark.createDataFrame(dfnew.select(dfnew("Year").cast(IntegerType).as("Year")))
Is there a way to feed a DDL to a Spark DataFrame in Databricks? Should I not use Spark?
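As a sketch of two possible approaches (the column names and types here are hypothetical): createDataFrame accepts an explicit schema, including a DDL-formatted string, and individual columns can be cast with PySpark's own syntax rather than the Scala-style call above.

from pyspark.sql import functions as F

# Option 1: give createDataFrame an explicit DDL-style schema instead of relying on inference
spark_df = spark.createDataFrame(dfnew, schema="Year INT, Amount DOUBLE, Name STRING")

# Option 2: cast a column after creation (PySpark syntax)
spark_df = spark_df.withColumn("Year", F.col("Year").cast("int"))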
I am creating a Spark dataframe in Databricks using createDataFrame and getting the error:
'Some of types cannot be determined after inferring'
I know I can specify the schema, but that does not help if I am creating the dataframe each time with source data from an API and they decide to restructure it.
Instead I would like to tell spark to use 'string' for any column where a data type cannot be inferred.
Is this possible?
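One sketch of that idea (assuming the API reply has first been loaded into a pandas DataFrame named pdf): force every column to string before handing the data to Spark, so there is nothing left for inference to fail on.

# Cast every pandas column to string so Spark never has to infer a type
pdf = pdf.astype(str)
spark_df = spark.createDataFrame(pdf)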
This can be easily handled with schema evolution in the Delta format. Quick ref: https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
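A minimal sketch of the schema-evolution option named above (the table name is hypothetical): enabling mergeSchema on the write lets new or changed source columns evolve the Delta table instead of failing the load.

# Append with Delta schema evolution enabled
spark_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("default.api_data")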
My actual (not properly working) setup has two pipelines:
Get API data to lake: for each row in a metadata table in SQL, call the REST API and copy the reply (JSON files) to the Blob data lake.
Copy data from the lake to SQL: for each file, auto-create a table in SQL.
The result is the correct number of tables in SQL; only the content of the tables is not what I hoped for. They all contain one column named odata.metadata and one entry, the link to the metadata.
If I manually remove the metadata from the JSON in the datalake and then run the second pipeline, the SQL table is what I want to have.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[
  {
    "Key": "12345",
    "Title": "Name",
    "Status": "Test"
  }
]
I tried adding $.['value'] to the API call. The result then was no odata.metadata line, but the array started with {value:, which resulted in an error when copying to SQL.
I also tried to use mapping (in the sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but it only goes well for datasets with the same number of columns in the array. I don't want to do the mapping manually for 170 calls...
Does anyone know how to handle this in ADF? For now I feel like the only solution is to add a Python step to the pipeline, but I hope there is a somewhat standard ADF way to do this!
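For reference, a minimal sketch of what such a Python step could do (the file paths are hypothetical): keep only the value array from the API reply and drop the odata.metadata entry before the copy to SQL.

import json

# Read the raw API reply and keep only the "value" array
with open("raw/reply.json") as f:
    reply = json.load(f)

with open("clean/reply.json", "w") as f:
    json.dump(reply["value"], f)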
You can add another pipeline with a dataflow to remove that content from the JSON file before copying the data to SQL, using the flatten formatter.
Before flattening the JSON file:
This is what I see when the JSON data is copied to the SQL database without flattening:
After flattening the JSON file:
I added a pipeline with a dataflow that flattens the JSON file and removes the 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the Input array
After selecting value object from input array, you can see only the values under value in Flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as Input to SQL.
Note: If your Input file schema is not constant, you can enable Allow schema drift to allow schema changes
Reference: Schema drift in mapping data flow
I am facing an issue which might be related to this question and others similar to it; I decided to create a separate question because I feel my problem has some additional aspects I need to consider. Here is what I am facing right now.
I have a dataframe in pandas that reads its data from SQL and shows something like the following:
The picture shows that the values have a leading '0' and that the datatype of this column is 'object'.
When I run this SQL and export to CSV on my Windows machine (Python 3.7, pandas 1.0.3), it works exactly as required and shows the correct output.
The problem occurs when I run it on my Linux machine (Python 3.5.2, pandas 0.24.2): it always removes the leading zeros while writing to CSV, and the CSV looks like the following image:
I am not sure what I should change to get the desired result in both environments. I will appreciate any help.
Edit:
Confirmed that the dataframe read from SQL on Ubuntu also has the leading zeros:
If you can use xlsx files instead of CSV, then replace df.to_csv with df.to_excel and change the file extension to xlsx.
With xlsx files you also get to store the types, so excel will not assume them to be numbers.
csv vs excel
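A minimal sketch of that swap (the file names and example column are made up); note that pandas needs openpyxl or xlsxwriter installed to write xlsx:

import pandas as pd

# Example frame with a leading-zero identifier stored as a string
df = pd.DataFrame({"account": ["00123", "04567"]})

# Instead of df.to_csv("out.csv", index=False):
df.to_excel("out.xlsx", index=False)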