I'm a beginner in Python and the Pandas library, and I'm rather confused by some basic DataFrame functionality. I was dropping data from my DataFrame with inplace=True set, so the data should be gone. But why am I still seeing the data when I display it with head() or iloc? I've checked the frame with .info() and can tell from the difference in the counts that the data has indeed been dropped.
So why can I still see my dropped data? Any explanation or pointer would be great. Thanks.
If you have NaN in only one column, just use df.dropna(inplace=True).
This should get you the result you want.
The reason your code is not working is that when you do df['to_address'], you are working with only that column, and the output is a Series whose contents are the column with the NaN rows removed; calling it with inplace=True there has no effect on the original dataframe.
You can use df = df.dropna(subset=['to_address']) as well.
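A minimal sketch of the difference (toy data; to_address is the column name from the question):
import pandas as pd
df = pd.DataFrame({"to_address": ["a", None, "c"], "amount": [1, 2, 3]})
# Dropping on the single column returns a new Series; df itself is untouched,
# which is why head()/iloc still show the NaN rows.
df["to_address"].dropna()
# Dropping on the frame with subset= removes those rows from df itself.
df = df.dropna(subset=["to_address"])
print(df.head())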
I want to make a new column from "TotalPrice" with the qcut function, but some values come back as NaN and I don't know why.
I tried changing the data type of the column, but nothing changed.
Edit:
You are doing a qcut on df rather than on the rfm dataframe. Make sure that is what you intend to do.
Because you did not provide data for a minimal reproducible example, I would guess that there is either not enough data or too many repeated values; the underlying quantile function may then fail to find the bin edges and return NaN.
(This second guess did not quite fit the question, since the "M" buckets were being built from "TotalPrice".)
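If the qcut really is being run on the wrong frame, index alignment explains the NaNs. A small sketch (df and rfm here are hypothetical stand-ins for the question's frames):
import pandas as pd
# rfm is an aggregate of df, so their indexes differ.
df = pd.DataFrame({"TotalPrice": [10.0, 20.0, 30.0, 40.0]})          # integer index 0-3
rfm = pd.DataFrame({"TotalPrice": [10.0, 60.0]}, index=["A", "B"])   # customer index
# qcut on the wrong frame carries df's integer index...
m = pd.qcut(df["TotalPrice"], q=2, labels=[1, 2])
# ...so assigning into rfm aligns on index, finds no matches, and yields all NaN.
rfm["M"] = m
print(rfm["M"])   # NaN, NaN
# qcut on the correct frame lines up as expected.
rfm["M"] = pd.qcut(rfm["TotalPrice"], q=2, labels=[1, 2])
print(rfm["M"])   # 1, 2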
I am reading data with Spark Structured Streaming as follows:
df = spark.readStream.format("cloudFiles").options(**cloudfile).schema(schema).load(filePath)
and the streaming is working as expected. I can see the values coming in with the following piece of code:
from pyspark.sql.functions import input_file_name, count
filesdf = (df.withColumn("file", input_file_name()).groupBy("file").agg(count("*")))
display(filesdf)
The filesdf dataframe shows the name of each file and its number of rows.
Next I need to get the filename from the dataframe for further processing. How can I do this?
I searched the web and found the following:
filename = filesdf.first()['file']
print(filename)
but the above piece of code gives the following error:
Queries with streaming sources must be executed with writeStream.start();
Please suggest how I can read a column from a streaming dataframe for further processing.
I managed to solve the issue. The problem was that I was trying to work with the derived dataframe filesdf when I should have worked with the original df that I got from the stream. With that, a command as simple as the following worked for me to save the entire dataframe to a table:
df.writeStream.outputMode("append").toTable("members")
With this I am able to write the dataframe contents to a table named members.
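If you do need the file names themselves on the driver, one pattern that works with streams is foreachBatch, which hands you a static dataframe per micro-batch where actions like collect() are allowed. A sketch, assuming the same df and Auto Loader setup as above:
from pyspark.sql.functions import input_file_name

def process_batch(batch_df, batch_id):
    # batch_df is a static dataframe, so collect() is legal here.
    files = [row["file"] for row in batch_df.select("file").distinct().collect()]
    print(batch_id, files)  # plain Python values for further processing

(df.withColumn("file", input_file_name())
   .writeStream
   .foreachBatch(process_batch)
   .start())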
I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and with the exception of the 'ts' column there is no column with more than one value.
Is it possible to merge the four rows together to give one row which can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a one-time record from a weather station; I will have records at 5-minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect there is a much more succinct way to do it, but I basically manipulated the dataframe: replaced all NaNs with zero, replaced some strings with ints, and added the columns together, as shown in the code below:
import json
import pandas as pd

with open(fname, 'r') as d:
    ws = json.loads(next(d))
df = pd.json_normalize(ws['sensors'], record_path='data')

# Stand the four rows up side by side as columns a-d.
df3 = pd.concat([df.iloc[0], df.iloc[1], df.iloc[2], df.iloc[3]], axis=1)
df3.rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}, inplace=True)

# Zero out the NaNs and the duplicated string entries so the columns can be summed.
df3 = df3.fillna(0)
df3.loc['ts', ['b', 'c', 'd']] = 0
df3.loc[['ip_v4_gateway', 'ip_v4_netmask', 'ip_v4_address'], 'c'] = 0

# Collapse the four columns into one and transpose back to a single row.
df3['comb'] = df3['a'] + df3['b'] + df3['c'] + df3['d']
df3.drop(columns=['a', 'b', 'c', 'd'], inplace=True)
df3 = df3.T
As quite a few people have said, the documentation on this is very patchy, so I hope this helps someone else who is struggling with the same problem!
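For what it's worth, a more succinct idiom that should give the same single row, assuming each column holds at most one non-null value across the four rows (untested against the actual weather-station data):
# bfill() pulls each column's single non-null value up into row 0,
# so the first row is then complete.
merged = df.bfill().iloc[[0]]
merged.to_csv('record.csv', index=False)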
I'm using elasticsearch_dsl to make queries of and searches against an Elasticsearch DB.
One of the fields I'm querying is an address, which has a structure like so:
address.first_line
address.second_line
address.city
address.code
The returned documents hold this as JSON, such that the address is held in a dict with a field for each sub-field of the address.
I would like to put this into a (pandas) dataframe, such that there is one column per sub-field of the address.
Directly putting address into the dataframe gives me a column of address dicts, and iterating over the rows to manually unpack (pd.json_normalize()) each address dict takes a long time (4 days for ~200,000 rows).
From the docs I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
Searching for a way to solve this problem, I came across my own answer, found it lacking, and am updating it with a better way.
Specifically: pd.json_normalize(df['json_column'])
In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)
Then drop the original column if required.
Original answer from last year, which does the same thing much more slowly:
df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.
pd.concat([df, new_df], axis=1) gets the new columns onto the old dataframe.
Then delete the original column_of_dicts.
pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.
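Putting the pieces together, a self-contained sketch of the faster route (the sample addresses are made up; the column names follow the question):
import pandas as pd

# Hypothetical rows as they might come back from elasticsearch_dsl.
df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"first_line": "1 Main St", "second_line": "", "city": "Leeds", "code": "LS1"},
        {"first_line": "2 High St", "second_line": "Flat 3", "city": "York", "code": "YO1"},
    ],
})

# Flatten the dicts in one vectorised call, then drop the original column.
flat = pd.concat([df.drop(columns="address"),
                  pd.json_normalize(df["address"])], axis=1)
print(flat.columns.tolist())  # ['id', 'first_line', 'second_line', 'city', 'code']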
For the mixError function in missForest, the documentation says
Usage:
mixError(ximp, xmis, xtrue)
Arguments
ximp: imputed data matrix with variables in the columns and observations in the rows. Note there should not be any missing values.
xmis: data matrix with missing values.
xtrue: complete data matrix. Note there should not be any missing values.
Then my question is:
If I already have xtrue, why do I need this function?
All the examples start with complete data, impute some NAs on purpose, use missForest to fill in the NAs, and then calculate the error by comparing the imputed data with the original data without NAs.
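In code, that benchmarking loop looks roughly like this (a Python sketch for illustration only; column-mean imputation stands in for missForest):
import numpy as np

rng = np.random.default_rng(0)
xtrue = rng.normal(size=(100, 3))        # complete data, no NAs

# Knock out 10% of the entries on purpose.
mask = rng.random(xtrue.shape) < 0.10
xmis = xtrue.copy()
xmis[mask] = np.nan

# Impute (column means stand in for missForest here).
ximp = np.where(np.isnan(xmis), np.nanmean(xmis, axis=0), xmis)

# Error of the imputation against the known truth (NRMSE-style, as mixError
# reports for continuous variables).
err = np.sqrt(np.mean((ximp[mask] - xtrue[mask]) ** 2)) / np.std(xtrue[mask])
print(err)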
But what is the sense of that, if I already have the complete data?
So the question is also:
Could xtrue be the original data with all the rows containing NAs removed?