Trying to assign string values to a column, but all I get is NaN. What should I do? - pandas

I'm trying to update a Pandas DataFrame column with a column from another DataFrame, which changes daily.
What I mean to do is transplant what's in this:
Daily schedule of workers for the whole year
into this:
Schedule of workers for today, column in red
I'd like to do this every day until July, and it used to work quite well. In 2023, though, the calendar was made with slight changes to the format, and I can't make Pandas read the data the way I'd like.
I can't actually assign a column from one DataFrame to a column of the other. Python accepts the code, but all I get is NaN, not the strings I hoped for. All the values in the source column are strings. What am I doing wrong?
Thanks!
Here's my code:
from datetime import datetime

# Today's date, formatted to match the schedule's column labels
today = datetime.today().strftime("%d/%m/%Y")
df["status"] = df_diario[today].astype('str')
df["status"]
nome
Ad NaN
Al NaN
An NaN
Ca NaN
Cl NaN
Da NaN
El NaN
Ga NaN
Hu NaN
Jo NaN
Jo NaN
Jo NaN
Jo NaN
Le NaN
Lu NaN
Lu NaN
Lui NaN
Ma NaN
Mar NaN
Mau NaN
Om NaN
Pa NaN
Pau NaN
Pe NaN
Ro inativo
Ro NaN
Ro NaN
Ron NaN
Vi NaN
Name: status, dtype: object

What is the variable today that you used as the DataFrame accessor? Also, please format your question so that others can read it clearly and help you better.
However, if you do check your output, there's one line that is not NaN: it is inativo. It could be that the two DataFrames are incompatible and have different indices. If you want to do a reassignment this way, you need identical indexes in both DataFrames.

Found the answer! It was actually in MS Excel. If you type text that looks like a date, Excel automatically converts it to a date format. For this reason, I could not transplant my column as I'd hoped.
Pandas "inherits" the date format from the Excel spreadsheet, so to speak: it imported some of my dates as datetime objects, which were unrecognizable to the code I had written. It didn't import all of them that way, because I had built the 2022 table with Python itself from Jan 24th on. But because I had manually typed 10/01/2023 into this year's table, Pandas interpreted it as a datetime and my code didn't work. To prevent the mess, I had to type an apostrophe before the date in each Excel cell.

Related

NaN values not dropping

I have a dataset that contains some NaN values. I tried this to drop them, but they are still showing:
df['string_tweet'].dropna(inplace=True)
df['string_tweet']
this is the output
113 apc started let ’ finish started
235 upon vote katsina , apc government left state ...
1796 two people contesting office , one person win ...
1798 deji said peter obi jumping church church.na d...
1850 amnesia set , lem say deleting incriminating p...
...
378726 nan
378727 nan
378728 nan
378729 nan
378730 nan
Name: string_tweet, Length: 63664, dtype: object
Please check the length and the row labels; they do not correspond.
If you have proper NaN values, use the subset argument to work on the whole DataFrame:
df.dropna(subset=['string_tweet'], inplace=True)
If your DataFrame includes "nan" strings, as suggested by #99_m4n, you may filter them out using:
df = df[df['string_tweet'] != 'nan']
I guess that the pictured nan values are of type numpy.ndarray; try converting your column before dropping the NaN:
df['string_tweet'] = df['string_tweet'].astype(float)
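Combining both suggestions, a minimal sketch, assuming df is the DataFrame from the question and the trailing nan entries are literal "nan" strings left over from earlier processing:
import numpy as np

# Treat literal "nan" strings as real missing values, then drop those
# rows from the whole DataFrame rather than from a single column
df['string_tweet'] = df['string_tweet'].replace('nan', np.nan)
df = df.dropna(subset=['string_tweet'])
This keeps the reported length and the row labels in sync, which was the mismatch visible in the output above.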

ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'

I am receiving this error when I try to load a CSV file as a GeoDataFrame. According to resolutions of other questions on this site, this method should do the trick.
Here is the code I am using to load the file as a gdf and then produce a subset DataFrame with only some of the columns present:
cp_union = gpd.read_file(r'C:\Users\User\Desktop\CPAWS\terrestrial_outputs\cp_union.csv')
cp_union.crs = 'epsg:3005'
cp_trimmed = cp_union[['COSEWIC_status','reason_for_designation','cnm_eng','iucn_cat','mgmt_e','status_e','classification','sq_m']]
As stated in the title, the error that I am receiving is this: ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'. Is there some part of the process of saving a gdf as a CSV and then reloading it as a gdf that would cause the creation of an additional geometry column?
EDIT
In another script, I loaded the same CSV file as a pandas DataFrame. Here is the first row of data within that DataFrame.
Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((1200735.4438 384059.0133999996, 1200...
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c
Name: 0, dtype: object
So my only theory here is that, when you save a gdf as a CSV, the CSV contains a column called geometry, and when you then load that CSV as a gdf, geopandas tries to create a new geometry column on top of the one that was already in the CSV. I could be completely wrong about this. Even if this is the case, I'm not sure how to go about resolving the issue.
Thanks for the help!
Using your sample data, I created a CSV (I had to replace geometry, as the sample is not a valid WKT string), reproduced your error, and then solved it by loading with pandas and converting to geopandas.
solution
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
full code
import pandas as pd
import geopandas as gpd
import io
from pathlib import Path
# fmt: off
df_q = pd.read_csv(io.StringIO("""Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((5769135.557632876 7083849.386658552, 5843426.213336911 7098018.122146672, 5852821.812968816 7081377.7312996285, 5914814.478616157 7091734.620966213, 5883751.009067913 7017032.330573363, 5902031.719573214 6983898.953064103, 5864452.659165712 6922039.030140929, 5829585.402576889 6878872.269967912, 5835906.522449658 6846685.714836724, 5800391.382286092 6827305.509709548, 5765261.646424723 6876008.057438379, 5765261.402301509 6876010.894933639, 5765264.431247815 6876008.786040769, 5760553.056402712 6927522.42488809, 5720896.599172597 6983360.181762057, 5755349.303491102 7039380.015177476, 5769135.557632876 7083849.386658552))
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c"""), sep=r"\s\s+", engine="python", header=None).set_index(0).T
# fmt: on
# generate a CSV file from sample data
f = Path.cwd().joinpath("SO_q.csv")
df_q.to_csv(f, index=False)
# replicate issue...
try:
    gpd.read_file(f)
except ValueError as e:
    print(e)
# now the actual solution
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
GeoPandas automatically adds a geometry field when you read in your CSV.
As you already have a field called "geometry", GeoPandas raises an exception. GeoPandas doesn't read the WKT strings in the "geometry" field as geometry.
Some workarounds:
The GDAL/OGR CSV driver (which geopandas uses) supports the open option GEOM_POSSIBLE_NAMES, which lets you specify a different field name to look for geometry in.
gdf = gpd.read_file('/path/to/gdf.csv',
                    GEOM_POSSIBLE_NAMES="geometry",
                    KEEP_GEOM_COLUMNS="NO")
The KEEP_GEOM_COLUMNS option is also required; otherwise GDAL/OGR will return a "geometry" column as well as the parsed geometry, and you still get the original ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'.
Alternatively, if you have control over the original writing of the CSV, the GDAL/OGR CSV driver documentation notes that if a field is named "WKT" it will be read as geometry:
When reading, a field named “WKT” is assumed to contain WKT geometry, but it is also treated as a regular field.
So one option is to write out the CSV using "WKT" as the geometry column name and then it can be read straight back in:
gdf = gdf.rename_geometry('WKT')
gdf.to_csv('/path/to/gdf.csv', index=False)
# Then in later scripts you can just read it straight back in
gdf = gpd.read_file('/path/to/gdf.csv')
A final option is to read the CSV in as a pandas DataFrame (per #RobRaymond's answer), then manually create the geometry from the WKT and convert to a GeoDataFrame. Here is an alternative way of doing so:
import geopandas as gpd
df = gpd.read_file('/path/to/gdf.csv', ignore_geometry=True)
# Create geometry objects from WKT strings
df['geometry'] = gpd.GeoSeries.from_wkt(df['geometry'])
# Convert to GDF
gdf = gpd.GeoDataFrame(df)

How to retain NaN values using pandas factorize()?

I have a Pandas DataFrame with several columns, some of which comprise categorical entries. I convert (or encode) these entries to numerical values using factorize() as follows:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize(na_sentinel=None)[0]
The columns have several NaN entries, so I set na_sentinel=None to retain the NaN entries. However, the NaN values are not retained (they get converted to numerical entries), which is not what I want. My Pandas version is 1.3.5. Is there something I am missing?
factorize() converts NaN values to -1 by default. The NaN values are retained in this way, since they can be identified by the -1. You would probably want to keep the default, which is:
na_sentinel=-1
See
https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
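If you need actual NaN back after encoding, a minimal sketch that keeps the default sentinel and maps -1 back to NaN afterwards (the toy Series is illustrative):
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'a'], dtype='category')
codes, uniques = s.factorize()  # NaN is encoded as -1 by default
decoded = pd.Series(codes, index=s.index).replace(-1, np.nan)
# decoded is now 0.0, 1.0, NaN, 0.0 -- numeric codes with NaN restored
Note that restoring NaN forces the column to a float dtype, since plain integer columns cannot hold NaN.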

Pandas: DataFrame op DataFrame Results in NaNs

Why do simple DataFrame op DataFrame operations result in a unioned DataFrame? The Pandas documentation mentions unioning because of alignment issues. I don't see any alignment issues with df1 and df2. Aren't alignment issues about different shapes, different dtypes, or different indexes?
df1 = pd.DataFrame([[1,2],[3,4]],columns=list('AB'))
df2 = pd.DataFrame([[5,6],[7,8]],columns=list('CD'))
>> df1*df2
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
Another source of alignment issues is non-matching column names. Here, alignment requires identical column names. Either make the column names the same or use .values. Using .values on just the right-hand DataFrame will retain the DataFrame type.
>> df1*df2.values
A B
0 5 12
1 21 32
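If you would rather keep both operands as DataFrames, another option is to relabel one frame's columns to match the other before operating. A minimal sketch, reusing df1 and df2 from the question:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))

# Relabel df2's columns to df1's so the multiplication aligns
result = df1 * df2.set_axis(df1.columns, axis=1)
print(result)
#     A   B
# 0   5  12
# 1  21  32
Unlike .values, this keeps alignment semantics on the (matching) row index and returns a DataFrame either way.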

What is the functionality of the filling method when reindexing?

When reindexing, say, 1-minute data to daily data (e.g. an index for daily prices at 16:00), if there happens to be no 1-minute data at the 16:00 timestamp on some day, we would want to forward-fill from the last non-null 1-minute observation. In the following case, there is no 1-minute data before 16:00 on the 13th, and the last 1-minute data comes from the 10th.
When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows that it is missing, however.
import pandas as pd
import numpy as np
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan  # .loc instead of the removed .ix accessor
hf.plot()
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')
To fill as one (or rather, I) would expect, I need to do this:
daily2 = daily1.fillna(method='ffill')
If this is the case, what is the fill method in reindex actually doing? It is not clear to me from the pandas documentation alone. It seems to me I should not have to add the line above.
I'll repeat my comment from the GitHub issue here as well:
The current behavior, in my opinion, makes more sense. NaN values can be valid "actual" values in some scenarios; the concept of an actual NaN value should be distinct from a NaN introduced by a changing index. If I have a DataFrame like this:
A B C
1 1.242 NaN 0.110
3 NaN -0.185 -0.209
5 -0.581 1.483 NaN
and I want to keep all NaN as NaN, it makes much more sense to have:
df.reindex([2, 4, 6], method='ffill')
A B C
2 1.242 NaN 0.110
4 NaN -0.185 -0.209
6 -0.581 1.483 NaN
just take whatever value is there (NaN or not) and fill it forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.
This is completely different from
df.reindex([2, 4, 6], method=None)
which produces
A B C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
Here is an example: np.nan can simply mean not applicable. Say I have hourly data, and on weekends some calculations are just not applicable, so I fill NaN into those columns during the weekends. Now if I reindex to a finer index, say every minute, the reindex will pick the last value from Friday and fill it out for the whole weekend. This is wrong.
In reindexing a DataFrame, forward fill means: just take whatever value is there (NaN or not) and fill it forward until the next available index. A NaN value can be an actual, valid observation that you want to keep as-is.
Reindexing should not enforce a mandatory fillna on the data.
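To make the distinction concrete, a minimal sketch with a toy Series (values are illustrative):
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=[1, 3, 5])

# reindex + ffill carries forward whatever row existed last, NaN included
print(s.reindex([2, 4, 6], method='ffill'))
# 2    1.0
# 4    NaN
# 6    3.0

# to also bridge over NaN observations, chain an explicit ffill afterwards
print(s.reindex([2, 4, 6], method='ffill').ffill())
# 2    1.0
# 4    1.0
# 6    3.0
This mirrors the daily1/daily2 pair in the question: reindex aligns to the new index, and the second ffill is what actually bridges the NaN stretch.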