Pandas append function adds new columns

I want to append one row to my dataframe.
Here's the code
import pandas as pd

citiesDataFrame = pd.read_csv('cities.csv')
citiesDataFrame = citiesDataFrame.append({
    'LatD': 50,
    '"LatM"': 70,
    '"LatS"': 40,
    '"NS"': '"S"',
    '"LonD"': 200,
    '"LonM"': 15,
    '"LonS"': 40,
    '"EW"': "E",
    '"City"': '"Kentucky"',
    '"State"': "KY"}, ignore_index=True)
citiesDataFrame
But when I run it, append doesn't work properly. My dataframe has 10 columns and 128 rows; when I run the code, it appends 9 new columns and 1 row (here is the modified dataframe) to the dataframe.

Notice that it works for LatD. The reason is that your other column names aren't identical to the existing names; it looks like a quoting issue, since every key except 'LatD' has double quotes embedded inside the single quotes. Make the keys match the existing column names exactly and the append will work.
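For example, a minimal sketch of the corrected call (assuming cities.csv has exactly these ten columns; note that DataFrame.append is deprecated in recent pandas and removed in 2.0, where pd.concat is the replacement):

import pandas as pd

citiesDataFrame = pd.read_csv('cities.csv')

# Keys must match the existing column names exactly -- no embedded double quotes.
new_row = {
    'LatD': 50, 'LatM': 70, 'LatS': 40, 'NS': 'S',
    'LonD': 200, 'LonM': 15, 'LonS': 40, 'EW': 'E',
    'City': 'Kentucky', 'State': 'KY'
}

citiesDataFrame = citiesDataFrame.append(new_row, ignore_index=True)
# On pandas >= 2.0, where append was removed:
# citiesDataFrame = pd.concat([citiesDataFrame, pd.DataFrame([new_row])], ignore_index=True)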

Related

Losing rows when renaming columns in pyspark (Azure databricks)

I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a Python function that upper-cases column names, strips whitespace, and replaces the characters full stop (.) and backtick (`) in them.
After this line of code, the dataframe randomly loses a bunch of rows: if I do a count before and after the line, the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
Finally, there were some pipe (|) characters in the column names; when I removed them manually beforehand, there was no issue.
As far as I know, the pipe is not actually a special character in Spark SQL column names (unlike the full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()        # Case and leading / trailing white space
    column = re.sub(r"\s+", " ", column)   # Collapse multiple white spaces
    column = re.sub(r"\.|`", "_", column)  # Replace full stops and backticks
    return column
I reproduced this in my environment and there is no loss of any rows in my dataframe.
Using the same format_column function on my dataframe, the count is identical before and after the renaming.
Please check whether something other than this function is changing your dataframe.
If you are still getting the same behaviour, try the following and check whether it loses any rows:
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also loses rows, then the issue is with something else in your dataframe or code; please recheck that.
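To isolate whether the pipe characters themselves are to blame, here is a self-contained sketch (the toy dataframe and column names are assumptions, not the real data) that counts rows before and after renaming with a sanitizer that also replaces pipes:

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy dataframe with a pipe in one column name, standing in for the real data.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name|alias"])

def format_column(column: str) -> str:
    column = column.strip().upper()
    column = re.sub(r"\s+", " ", column)
    column = re.sub(r"[.`|]", "_", column)  # replace pipes as well, as a test
    return column

print("before renaming:", df.count())
df1 = df.toDF(*[format_column(c) for c in df.columns])
df1.printSchema()
print("after renaming:", df1.count())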

Pandas splitting a column with new line separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this:
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns, but I want the new column names to be A and B rather than 0 and 1. I also need to pass a generalized column label instead of the actual column name, since I need to implement this for several documents that may have different column names. I can find such a column in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an attribute error:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd

# Recreate your column from above.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# First: make sure the column is of string type.
# Second: split the column on the separator \n.
# Third: pass expand=True so the split produces two new columns.
test = df['A\nB'].astype('str').str.split('\n', expand=True)

# Rename the resulting columns.
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side. The issue is that df[colNew] is still a DataFrame, because colNew is an index of labels rather than a single label.
But .str.split() only works on a Series, so, taking your code as an example, I would convert the DataFrame to a Series using iloc[:, 0].
Then another line splits the column header to get the new names:
df2 = df[colNew].iloc[:, 0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
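Putting both answers together, here is a minimal self-contained sketch (assuming exactly one column header contains a newline) that finds the merged column, splits the values, and names the new columns from the header itself:

import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})

# Find the label of the merged column by looking for a newline in the header.
colNew = df.columns[df.columns.str.contains('\n')][0]

# Split the values into separate columns and name them from the split header.
split_cols = df[colNew].astype(str).str.split('\n', expand=True)
split_cols.columns = colNew.split('\n')

# Replace the merged column with the two new ones.
df = df.drop(columns=[colNew]).join(split_cols)
print(df)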

comparing and removing rows in pandas

I am trying to create a new object by comparing two lists. If the rows match, the row should be removed from the splitted row_list; otherwise it should be appended to a new list containing only the differences between the two lists.
results = []
for row in splitted_row_list:
    print(row)
    for row1 in all_rows:
        if row1 == row:
            splitted_row_list.remove(row)
        else:
            results.append(row)
print(results)
However, this code just returns all the rows. Does anyone have a suggestion?
Sample data
all_rows[0]:'1390', '139080', '13980', '1380', '139080', '13080'
splitted_row_list[0]:'35335','53527','353529','242424','5222','444'
As I understand it, you want to compare two lists by index, keep the differences, and do it with pandas (because of the tag).
So here are two lists for example:
ls1=[0,10,20,30,40,50,60,70,80,90]
ls2=[0,15,20,35,40,55,60,75,80,95]
I make a pandas dataframe from these lists and build a mask to filter out the matching values:
import pandas as pd

df = pd.DataFrame(data={'ls1': ls1, 'ls2': ls2})
mask = df['ls1'] != df['ls2']
I can then get the differing values for each list using the mask:
# list 1
df[mask]['ls1'].values
out: array([10, 30, 50, 70, 90])
and
# list 2
df[mask]['ls2'].values
out: array([15, 35, 55, 75, 95])
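Applied to the sample rows from the question (a sketch assuming the two rows line up position by position):

import pandas as pd

all_row   = ['1390', '139080', '13980', '1380', '139080', '13080']
split_row = ['35335', '53527', '353529', '242424', '5222', '444']

df = pd.DataFrame({'all_rows': all_row, 'splitted_row_list': split_row})
mask = df['all_rows'] != df['splitted_row_list']

# Values from splitted_row_list that differ from all_rows at the same position.
results = df[mask]['splitted_row_list'].tolist()
print(results)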

Adding columns from every other row in a pandas dataframe

In the picture (1), you can see the start of my data frame. I would like to make two new columns out of the values (confirmed_cases and deaths) and get rid of the 'Type' column. Essentially I want one row of data for each county, with confirmed_cases and deaths columns added using the values already in the data. I tried the code below, but obviously the length of the values does not match the length of the index.
Any suggestions?
apidata['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata['Deaths'] = apidata['values'].iloc[1::2].values
(Sorry about the link to the photo, I am too new to the site to be able to just include the photo in the post)
Maybe there's a way to double how many times each value appears in the new column? Then the first five deaths would be [5, 5, 26, 26, 0], and I could just delete every other row.
I ended up figuring it out by creating a second dataframe that drops half of the rows (every other one) from the first dataframe, and then adding the values from the original dataframe into the new one.
apidata2 = apidata.iloc[::2]
apidata2['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata2['Deaths'] = apidata['values'].iloc[1::2].values
apidata2.head()
Finished output
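For reference, here is a small self-contained sketch of the same approach (the column names 'county', 'Type', and 'values' are assumptions based on the question); adding .copy() avoids pandas' chained-assignment warning when assigning the new columns:

import pandas as pd

# Toy version of the described layout: two rows per county, one for each Type.
apidata = pd.DataFrame({
    'county': ['Adams', 'Adams', 'Brown', 'Brown'],
    'Type': ['confirmed_cases', 'deaths', 'confirmed_cases', 'deaths'],
    'values': [100, 5, 250, 26],
})

# Keep every other row (the confirmed_cases rows) and copy to get an independent frame.
apidata2 = apidata.iloc[::2].copy()
apidata2['Confirmed_Cases'] = apidata['values'].iloc[::2].values
apidata2['Deaths'] = apidata['values'].iloc[1::2].values
apidata2 = apidata2.drop(columns=['Type', 'values'])
print(apidata2)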

pandas: Indexing for thousands of rows in a dataframe

I initially had 100k rows in my dataset. I read the CSV using pandas into a dataframe called data and tried to select a subset of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3, etc. I tried this command:
data = data.loc['0':'50']
But the results were weird: it took all the rows from 0 to 49999; it looks like it is taking rows until the index value starts with 50.
Similarly, I tried this command: new_data = data.loc['0':'19']
and the result was all the rows from 0 through 18999.
Could this be a bug in pandas?
You want to use .iloc in place of .loc, since you are selecting data from the dataframe via numeric indices.
For example, to select the first 51 rows:
data.iloc[:51, :]
Keep in mind that your index labels are numeric, not strings, so querying with string labels (as you have done in your post) ends up doing string-wise comparisons.
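A small sketch of the difference, assuming the default integer RangeIndex produced by read_csv:

import pandas as pd

data = pd.DataFrame({'x': range(100000)})  # default integer index 0..99999

# Position-based selection: the first 51 rows, end position excluded.
first_51 = data.iloc[:51]

# Label-based selection with integer labels: both endpoints included, so this is equivalent.
also_51 = data.loc[0:50]

print(len(first_51), len(also_51))  # 51 51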