I am reading data using Spark Structured Streaming as follows:
df = spark.readStream.format("cloudFiles").options(**cloudfile).schema(schema).load(filePath)
and streaming is working as expected. I can see the values coming in with the following piece of code:
from pyspark.sql.functions import input_file_name,count
filesdf = (df.withColumn("file", input_file_name()).groupBy("file").agg(count("*")))
display(filesdf)
The filesdf dataframe shows the file name and the number of rows per file.
Next I need to get the filename from the dataframe for further processing. How can I do this?
I searched the web and found the following:
filename = filesdf.first()['file']
print(filename)
but the above piece of code gives the following error:
Queries with streaming sources must be executed with writeStream.start();
Please suggest how I can read a column from a streaming dataframe for further processing.
I managed to solve the issue. The problem was that I was trying to work with the dataframe named filesdf, when I should have worked with the original df I got from streaming. Once I used that, a command as simple as the following worked for me to save the entire dataframe to a table:
df.writeStream.outputMode("append").toTable("members")
With this I am able to write the dataframe contents to a table named members.
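If you need the file names themselves for further processing (the original question) rather than just writing the stream to a table, one approach worth trying is foreachBatch: it hands you an ordinary, non-streaming DataFrame per micro-batch, so actions like first() and collect() are allowed there. A minimal sketch along those lines (the checkpoint path and the per-file processing are placeholders, not from the original post):

from pyspark.sql.functions import input_file_name, count

def process_batch(batch_df, batch_id):
    # batch_df is a normal DataFrame, so collect() works here
    files = (batch_df.withColumn("file", input_file_name())
                     .groupBy("file")
                     .agg(count("*").alias("rows"))
                     .collect())
    for row in files:
        print(batch_id, row["file"], row["rows"])  # replace with your per-file processing

(df.writeStream
   .foreachBatch(process_batch)
   .option("checkpointLocation", "/tmp/checkpoints/files")  # hypothetical path
   .start())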
Related
I was wondering if there is a way to store new columns from a dataframe to an already existing CSV file without reading the entire file first?
I am working with a very large dataset, where I read 2-5 of its columns and use them to calculate a new variable (column), and I want to store this variable back to the entire dataset. My memory cannot load the whole dataset at once, so I am looking for a way to store the new column without loading all of it.
I have tried using chunking with:
df = pd.read_csv(Path, chunksize = 10000000)
But then I am faced with the error "TypeError: 'TextFileReader' object is not subscriptable" when trying to process the data.
The data is also grouped by two variables, so chunking is not my preferred approach for these calculations.
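The immediate TypeError comes from treating the TextFileReader as a DataFrame: with chunksize, read_csv returns an iterator of chunks that you loop over rather than index into. A rough sketch of that pattern (the file, column names, and calculation are placeholders, and it does not address the cross-chunk grouping mentioned above):

import pandas as pd

reader = pd.read_csv("big_dataset.csv", usecols=["a", "b"], chunksize=10_000_000)

first = True
for chunk in reader:                               # iterate; reader["a"] would fail
    chunk["new_var"] = chunk["a"] * chunk["b"]     # placeholder calculation
    # append each processed chunk to an output file instead of holding it all in memory
    chunk.to_csv("big_dataset_with_new_var.csv",
                 mode="w" if first else "a",
                 header=first, index=False)
    first = False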
I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")
#this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader that emits records in whatever shape you deem reasonable? (Likely the best solution if you want to leave the files as they are.)
You could probably read all the records in with a simple text read and then "paint" in columns afterwards. You could also use SQL tricks, such as window functions and lag, to expand and fill rows.
You could do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with.
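One way to act on the "leave the files as is, emit records yourself" idea, assuming the payload sits under a top-level obj_array key and the ijson streaming parser is available (both assumptions, not from the original post), is to pre-split the oversized array into newline-delimited JSON chunks that Spark can then read without ever materialising a 2 GB column:

import json
import ijson  # streaming JSON parser

def split_to_ndjson(src_path, dst_dir, records_per_file=50000):
    # stream obj_array items out of one huge JSON file into NDJSON chunk files
    buf, part = [], 0

    def flush():
        nonlocal buf, part
        with open(f"{dst_dir}/part-{part:05d}.json", "w") as out:
            out.write("\n".join(buf))
        buf, part = [], part + 1

    with open(src_path, "rb") as f:
        # "obj_array.item" walks each element of the top-level obj_array field
        for record in ijson.items(f, "obj_array.item"):
            buf.append(json.dumps(record, default=str))
            if len(buf) >= records_per_file:
                flush()
    if buf:
        flush()

# Spark can then read the chunks as ordinary line-delimited JSON, e.g.:
# df = spark.read.schema(itemSchema).json("/mnt/datalake/jsonfiles/chunks/")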
I'm a beginner in Python and the Pandas library, and I'm rather confused by some basic DataFrame functionality. I was dropping rows from my data frame with inplace=True, so the data should be dropped. But why am I still seeing the data when I show it using head or iloc? I checked the data using .info() and noticed that the rows were indeed dropped, because the row count is lower than it was before I used inplace=True.
So why can I still see my dropped data? Any explanation or pointer would be great. Thanks
If you have NaN in only one column, just use df.dropna(inplace=True).
This should get you the result you want.
The reason your code is not working is that when you do df['to_address'], you are working with only that column, and the output is a Series (so inplace=True has no effect on the original DataFrame) containing the contents of the column with the NaN rows removed.
You can use df = df.dropna(subset=['to_address']) as well.
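A tiny illustration of the difference (made-up data; the to_address column name is taken from the answer above): dropna has to be called on the DataFrame, not on a single column, for the rows to disappear from df itself.

import pandas as pd
import numpy as np

df = pd.DataFrame({"to_address": ["a@x.com", np.nan, "c@x.com"], "amount": [1, 2, 3]})

# df["to_address"].dropna(...) only affects a temporary Series; df keeps the NaN row.
# To drop the row from df itself, call dropna on the DataFrame:
df.dropna(subset=["to_address"], inplace=True)
# or equivalently: df = df.dropna(subset=["to_address"])

print(df)   # head()/iloc now reflect the drop as well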
I used the MongoDB Spark connector to generate a dataframe from MongoDB:
val df1 = df.filter(df("dev.app").isNotNull).select("dev.app").limit(100)
It's a big collection, so I limit the rows to 100.
when I use
df1.show()
it works fast.
But when I use
df1.count
to count the rows of df1,
it is too slow; the result takes a long time to come back.
Can anybody give me some suggestions?
I think you should try tweaking the spark.sql.shuffle.partitions configuration. You may have very little data, but you are creating too many partitions; by default it is 200.
see this for info
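As a quick sketch of that suggestion (the question's snippet is Scala, but the setting is the same from PySpark; 8 is just an example value to tune for your data size):

# lower the shuffle-partition count before running the aggregation
spark.conf.set("spark.sql.shuffle.partitions", "8")
df1.count()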
The Spark CSV readers are not as flexible as pandas.read_csv and do not seem to be able to handle parsing dates of different formats, etc. Is there a good way of passing pandas DataFrames to Spark DataFrames in an ETL map step? Spark's createDataFrame does not appear to always work; perhaps the type system has not been mapped exhaustively? Paratext looks promising, but it is likely new and not yet heavily used.
For example here: Get CSV to Spark dataframe
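One pattern that seems to work for the hand-off is to let pandas do the awkward parsing and then give Spark an explicit schema, so createDataFrame does not have to infer types. A sketch under that assumption (the file name, column names, and schema are made up for illustration):

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# pandas copes with the messy date formats
pdf = pd.read_csv("events.csv", parse_dates=["event_time"])   # hypothetical file/columns

# an explicit schema avoids type-inference surprises in createDataFrame
schema = StructType([
    StructField("event_time", TimestampType(), True),
    StructField("user", StringType(), True),
    StructField("amount", DoubleType(), True),
])

sdf = spark.createDataFrame(pdf, schema=schema)
sdf.show()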