pandas writes NUL character (\0) when calling to_csv - pandas

In one of my scripts I call the following code on my dataframe to save the data on disc.
to_csv(input_folder / 'tmp' / section_fname, index=False, encoding='latin1', quoting=csv.QUOTE_NONNUMERIC)
When I opened the created file with Notepad++ in "show all characters" mode, it showed a lot of NUL characters (\0) inside one of the rows. In addition to this, some rows of the dataframe are not written at all.
However, if I scroll along that line, some of my dataframe's data does appear after the NUL characters.
This happens somewhere in the middle of my dataframe, so I called head and then tail to look at the specific portion of the data where it appears. As far as I can see, the data is perfectly fine: there are some integers and strings, as there should be.
I am using pandas 1.1.5.
I have looked through the data to make sure nothing unusual is being written that could end up being read this way. I have also googled whether anyone has faced the same issue, but mostly it turns out to be people who read data with pandas and get NUL characters, not people who write it.
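A minimal way to check both the in-memory data and the written file for embedded NULs (a sketch; df and out_path stand in for my actual variable names):
# look for NUL characters in any string column of the dataframe
has_nul = df.select_dtypes(include='object').apply(
    lambda col: col.astype(str).str.contains('\x00', regex=False)).any().any()
print('NUL in dataframe:', has_nul)

# count NUL bytes in the file that to_csv produced
with open(out_path, 'rb') as f:
    print('NUL bytes in file:', f.read().count(b'\x00'))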
I have spent a lot of time digging into the data and the code and have no explanation for this behaviour. Maybe someone can help me?
By the way, every time I write my dataframe this happens in a different place, and a different number of rows is dropped.
Kind regards,
Mike

Related

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, and the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath='/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf=spark.read.option("multiline","true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf=rawdf.selectExpr("file_name","timestamp_created","explode(obj_array) as data")
#this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader (presentation layer) that emits records in whatever shape you deem reasonable? That is likely the best solution if you want to leave the files as they are.
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks such as window functions and lag to expand and fill the rows.
You could also do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with; a rough sketch of that route is below.
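One way to split the payload before Spark ever sees it, assuming the ijson streaming parser is available and that obj_array is a top-level key in the JSON (the paths, the /dbfs prefix for Databricks' local view of the mount, and the chunk size are all placeholders):
import json
import ijson

chunk_size = 100_000           # objects per output file; tune to taste
chunk, part = [], 0

def flush(chunk, part):
    # write one newline-delimited JSON file per chunk
    with open(f'/dbfs/mnt/datalake/jsonfiles/part_{part:05d}.json', 'w') as out:
        # ijson yields Decimal for numbers, hence default=float
        out.write('\n'.join(json.dumps(o, default=float) for o in chunk))

with open('/dbfs/mnt/datalake/jsonfiles/filename.json', 'rb') as f:
    for obj in ijson.items(f, 'obj_array.item'):   # stream one object at a time
        chunk.append(obj)
        if len(chunk) == chunk_size:
            flush(chunk, part)
            chunk, part = [], part + 1
if chunk:                                          # flush the remainder
    flush(chunk, part)
The output files are newline-delimited, so they can be read back with a plain spark.read.json (no multiline option) and each object lands in its own row, with no explode needed.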

Filtering option causing python to hang - how to debug?

I am preprocessing large datasets to get them ready for clustering operations. I have a script that reads the data from CSV and performs various checks for missing data, erroneous values, etc. Until now, everything has worked as expected, but when I ran the script yesterday it started to hang on a simple filtering operation. The source data has not changed, yet somehow processing can't get past this line. I have isolated the problem by moving the following lines of code into another file, and the same issue is observed:
import pandas as pd
df = pd.read_csv('data1.csv',index_col=0)
# Get list of columns of interest for first check
columns = [col for col in df.columns if 'temp' in col]
# Find indices where any value of a column of interest has a value of 1
indices = list(df[df[columns]==1].dropna(how='all').index)
This previously ran fine, correctly identifying indices with this '1' flag in 'columns'. Now (and with no changes to the code or source data), it hangs on the indices line. I further broke it down to identify the specific problem: df[columns]==1 runs fine, but grabbing the df filtered on this condition (df[df[columns]==1]) is the line that hangs.
How can I troubleshoot what the problem is? Since I had not made any changes when it last worked, I am perplexed. What could possibly be the cause? Thanks in advance for any tips.
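One generic way to see where such a call spends its time is to profile just that line (a sketch, assuming the snippet above is run as a script so df and columns are module-level names); hitting Ctrl+C while it hangs and reading the traceback is another quick clue:
import cProfile

cProfile.run(
    "list(df[df[columns] == 1].dropna(how='all').index)",
    sort='cumulative',   # slowest calls (by cumulative time) first
)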
EDIT: The below approach seems to be drastically faster and solved the problem:
indices = df[(df[columns]==1).any(1)].index
When tested on a subset of the whole df, it accomplished the task in 0.015 seconds, while the prior method took 15.0 seconds.
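The difference is plausible: df[df[columns]==1] has to build a NaN-masked copy of the entire wide frame before dropna can prune it, while the boolean row mask only ever touches the columns of interest. A quick check that both select the same rows, with rough timings (a sketch against the same df and columns as above):
import time

start = time.perf_counter()
slow = list(df[df[columns] == 1].dropna(how='all').index)
print('mask whole frame + dropna:', time.perf_counter() - start, 's')

start = time.perf_counter()
fast = list(df[(df[columns] == 1).any(axis=1)].index)
print('boolean row mask:         ', time.perf_counter() - start, 's')

assert slow == fast   # both approaches should select exactly the same rows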

can i compress a pandas dataframe into one row?

I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and, with the exception of the 'ts' column, no column has more than one (non-null) value across those rows.
Is it possible to merge the four rows together to give one row, which can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a one-time record from a weather station. I will have records at 5-minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect that there is a much more succinct way to do it, but I basically manipulated the dataframe: replaced all NaNs with zero, replaced some strings with ints, and added the columns together, as shown in the code below:
import json
import pandas as pd

with open(fname,'r') as d:
    ws=json.loads(next(d))
df=pd.json_normalize(ws['sensors'], record_path='data')
df3=pd.concat([df.iloc[0],df.iloc[1], df.iloc[2],
df.iloc[3]],axis=1)
df3.rename(columns={0 :'a', 1:'b', 2 :'c' ,3 :'d'}, inplace=True)
df3=df3.fillna(0)
df3.loc['ts',['b','c','d']]=0
df3.loc[['ip_v4_gateway','ip_v4_netmask','ip_v4_address'],'c']=int(0)
df3['comb']=df3['a']+df3['b']+df3['c']+df3['d']
df3.drop(columns=['a','b','c','d'], inplace=True)
df3=df3.T
As has been said by quite a few people, the documentation on this is very patchy, so I hope this may help someone else who is struggling with this problem! (And yes, I know that one line isn't indented properly; get over it!)
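A possibly more succinct route, assuming (as above) that apart from ts no column has more than one non-null value across the four rows: take the first non-null value in each column of the df built by json_normalize and transpose back to a single row (the output file name here is just an example):
# first non-null value in each column; all-null columns become None
one_row = df.apply(lambda col: col.dropna().iloc[0] if col.notna().any() else None)
one_row = one_row.to_frame().T                       # back to a single-row dataframe
one_row.to_csv('weather_record.csv', index=False)
For ts, which appears in every row, this simply keeps the value from the first row.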

Too many errors [invalid] encountered when loading data into BigQuery

I enriched a public dataset of Reddit comments with data from LIWC (Linguistic Inquiry and Word Count). I have 60 files of about 600 MB each. The idea is now to upload them to BigQuery, put them together and analyze the results. Alas, I faced some problems.
For a first test I had a sample with 200 rows and 114 columns. Here is a link to the csv I used.
I first asked on Reddit and fhoffa provided a really good answer. The problem seems to be the newlines (\n) in the body_raw column, as redditors often include them in their text. It seems BigQuery cannot process them.
I tried to transfer the original data, which I had transferred to storage, back to BigQuery, unedited and untouched, but I got the same problem. BigQuery cannot even process the original data, which comes from BigQuery...?
Anyway, I can open the csv without problems in other programs such as R, which means that the csv itself is not damaged and the schema is not inconsistent. So fhoffa's command should get rid of the problem:
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 tt.delete_201607a myproject.newtablename gs://my_testbucket/dat2.csv body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
The output was:
Too many positional args, still have ['body_raw,score_h...]
If I take away "tt.delete_201607a" from the command, I get the same error message I have now seen many times:
BigQuery error in load operation: Error processing job 'xx': Too many errors encountered.
So I do not know what to do here. Should I get rid of the \n characters with Python? That would probably take days (although I'm not sure, I am not a programmer), as my complete data set is around 55 million rows.
Or do you have any other ideas?
I checked again, and I was able to load the file you left on dropbox without a problem.
First I made sure to download your original file:
wget https://www.dropbox.com/s/5eqrit7mx9sp3vh/dat2.csv?dl=0
Then I ran the following command:
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 \
tt.delete_201607b dat2.csv\?dl\=0 \
body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP,oops
As mentioned in reddit, you need the following options:
--allow_quoted_newlines: There are newlines inside some strings, hence the CSV is not strictly newline delimited.
--allow_jagged_rows: Not every row has the same number of columns.
,oops: There is an extra column in some rows, so I added this column to the end of the list of columns (a quick way to check how many rows carry it is sketched below).
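If you want to see how widespread the jagged rows and the extra column are before loading, a small check with Python's csv module does the job, since it understands quoted newlines just like --allow_quoted_newlines (the file name and encoding are assumptions):
import csv
from collections import Counter

with open('dat2.csv', newline='', encoding='utf-8') as f:
    field_counts = Counter(len(row) for row in csv.reader(f))

# e.g. Counter({114: 195, 115: 6}) would mean six rows carry an extra field
print(field_counts)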
When it says "too many positional arguments", it's because your command says:
tt.delete_201607a myproject.newtablename
Well, tt.delete_201607a is how I named my table. myproject.newtablename is how you named your table. Choose one, not both.
Are you sure you are not able to load the sample file you left on dropbox? Or are you getting errors from rows I can't find in that file?

How to use metadata or have sql recall column width on a fixed width .txt file import

I'll preface this by stating that I have searched and searched and have yet to find my answer.
Long story short, I have a .txt file of around 1.5 million rows and 200 columns. The columns are all fixed width, and I have a metadata file describing the widths. I get a new file about every 6 months and have been entering the column widths manually. I am trying to figure out a way for SQL to recall the widths, or how to load the metadata so I can set the widths from it. The manual process is tedious and time consuming.
It is highly possible that I am just searching the wrong keywords. Any advice would be great, but guidance on where I can read and learn about this process would be even better (I'm still sort of a beginner here).
Thanks
You have to use
BULK INSERT TableA FROM '{inputfilename}'
WITH (FORMATFILE = '{xmlformatfile}')
Look at this entry for more details.
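Since you already have a metadata file with the widths, you can generate the XML format file from it instead of typing the widths in by hand. A sketch in Python, assuming the metadata is a plain CSV of column_name,width pairs (adjust the parsing to your actual metadata layout; the \r\n terminator on the last field assumes Windows line endings):
import csv

# read the metadata: one (column_name, width) pair per line -- adjust to your layout
with open('metadata.csv', newline='') as f:
    cols = [(row[0], int(row[1])) for row in csv.reader(f) if row]

fields, columns = [], []
for i, (name, width) in enumerate(cols, start=1):
    if i < len(cols):
        fields.append(f'    <FIELD ID="{i}" xsi:type="CharFixed" LENGTH="{width}"/>')
    else:
        # last field also consumes the line break
        fields.append(f'    <FIELD ID="{i}" xsi:type="CharTerm" TERMINATOR="\\r\\n" MAX_LENGTH="{width}"/>')
    columns.append(f'    <COLUMN SOURCE="{i}" NAME="{name}" xsi:type="SQLCHAR"/>')

format_file = (
    '<?xml version="1.0"?>\n'
    '<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"\n'
    '           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    '  <RECORD>\n' + '\n'.join(fields) + '\n  </RECORD>\n'
    '  <ROW>\n' + '\n'.join(columns) + '\n  </ROW>\n'
    '</BCPFORMAT>\n'
)

with open('fixedwidth.fmt.xml', 'w') as out:
    out.write(format_file)
BULK INSERT ... WITH (FORMATFILE = 'fixedwidth.fmt.xml') will then pick the widths up from the generated file, so only the metadata needs to change every six months.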