combining CSV files from Covid-data - dataframe

I want to combine the CSV files from the Johns Hopkins Covid Data (e.g. https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-10-2020.csv & https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-23-2020.csv).
I already managed to load the files into a DataFrame and to sanitize the header (_ vs. / in some names). Now I want to pick one column (e.g. Confirmed), rename it to the day of the file, and then combine those CSV files to get the progress over time.
This merge needs to be done by state/province. A key may be missing from either frame. How can I do this? I experimented with rightjoin and outerjoin, but didn't have any success. Can someone point me in the right direction, please?
I initially didn't want to share the code that I have so far because I didn't want to guide to a specific solution - but here it is. It is copied together from several Jupyter cells.
using Dates
start = Dates.Date(2020,1,22) #begin of recording
now = Dates.Date(Dates.now()) - Dates.Day(1) #yesterday
date_range = collect(start:Dates.Day(1):now) #create a date range with 1 element per day
prefix = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
suffix = ".csv"
function create_url(date)
    return prefix * Dates.format(date, "mm-dd-YYYY") * suffix
end
function cleanup_column_names(name)
    if name == "Country/Region" || name == "Country_Region"
        return "country"
    elseif name == "Province/State" || name == "Province_State"
        return "state"
    else
        return name
    end
end
using CSV
using HTTP
using DataFrames
selected_data = "Confirmed"
date = date_range[1]
data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
DataFrames.rename!(cleanup_column_names, data)
DataFrames.select!(data,["state", "country", selected_data])
DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
Regards
Tobias

I am relatively new to Julia, so take my answer with a bit of scepticism:
First, we wrap the DataFrame creation into a function:
function prepare_date_df(date)
    data = DataFrame(CSV.File(HTTP.get(create_url(date)).body))
    DataFrames.rename!(cleanup_column_names, data)
    DataFrames.select!(data,["state", "country", selected_data])
    DataFrames.rename!(data, 3 => Dates.format(date, "YYYY-mm-dd"))
    return data
end
Let's create our first DataFrame:
df = prepare_date_df(date_range[1])
Now, let's iterate over all the other dates, create a DataFrame for each date, and merge it with our first DataFrame:
for date in date_range[2:end]
    df_new = prepare_date_df(date)
    df = outerjoin(df, df_new, on = [:state, :country])
end
This works fine for the first two months, but as the DataFrames grow, it suddenly gets very slow (and may even hang). So I would be very interested in a more performant answer!
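One possible cause of the slowdown, stated here as an assumption rather than a diagnosis: if a daily file contains several rows for the same (state, country) pair, every outer join multiplies those rows, so the frame grows much faster than the number of days. The sketch below shows the idea in pandas (Python, not the Julia used above): sum the selected column per key first, then align all days at once. The helper name daily_series and the sample numbers are made up for illustration.

```python
import pandas as pd


def daily_series(frame, label, value_col="Confirmed"):
    """Sum value_col per (state, country) and name the result after the day."""
    # A missing state becomes an empty string so that it can act as a group key.
    keyed = frame.fillna({"state": ""})
    return keyed.groupby(["state", "country"])[value_col].sum().rename(label)


# Two tiny stand-ins for downloaded daily reports (made-up numbers).
day1 = pd.DataFrame({
    "state": ["Hubei", None],
    "country": ["China", "Germany"],
    "Confirmed": [444, 0],
})
day2 = pd.DataFrame({
    "state": ["Hubei", None, "New York", "New York"],
    "country": ["China", "Germany", "US", "US"],
    "Confirmed": [549, 1, 3, 4],
})

# Aligning the series on their index acts like one outer join over all days.
combined = pd.concat(
    [daily_series(day1, "2020-01-22"), daily_series(day2, "2020-01-23")],
    axis=1,
).reset_index()
print(combined)
```

Because every key appears at most once per day after the aggregation, the number of rows in the result is bounded by the number of distinct keys, no matter how many days are combined.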

Related

Arcpy Script to loop through field and run Union Analysis

I have a polygon file in the form of a fishnet, and another feature class with polygons named Trawl_Buffers. Trawl_Buffers has a field named YEAR. I'd like to create a script that runs a selection on YEAR and then performs a union analysis with the fishnet polygon for each YEAR. The desired output would be "Trawl_Buffers_union2003", "Trawl_Buffers_union2004", etc. I have a function that collects the unique years into a list, which I called vals.
It seems I then need to run a for loop over this list of unique years, create a temporary selection, and use that as input for the union, but I am having trouble implementing the query process.
Here is where I started, but I am seriously stuck:
import arcpy
# Set the data environment
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb'
trawlBuffs = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\buffers\buffers_testing'
fishnet = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\fishnets\vms_net1k'
unionOut = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\unions\union'
# Function to get the unique values of a field in a table
def unique_values(table, field):
    with arcpy.da.SearchCursor(table, [field]) as cursor:
        return sorted({row[0] for row in cursor})
# Get the unique values of the field 'YEAR' in the trawlBuffs feature class
vals = unique_values(trawlBuffs, "YEAR")
# Field name wrapped in the delimiters this workspace expects
yearField = arcpy.AddFieldDelimiters(trawlBuffs, "YEAR")
# Loop through the years: select one year, union it with the fishnet, save the result
for year in vals:
    year_layer = "buffers_" + str(year)
    # Assumes YEAR is a numeric field; for a text field, quote the value: "{} = '{}'"
    yearSelectionClause = "{} = {}".format(yearField, year)
    # The where clause limits the layer to the buffers of this year
    arcpy.MakeFeatureLayer_management(trawlBuffs, year_layer, yearSelectionClause)
    # Union takes a list of inputs; each year gets its own output, e.g. ...\unions\union2003
    arcpy.Union_analysis([fishnet, year_layer], unionOut + str(year))

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses up to a given match, taking the previous matches into account.
The problem is that I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]), I got a result in 13 seconds. This is why I think it's a run-time problem.
I also tried to first use groupby("clubid") and then run my for loop within every group, but I got lost.
Something else bothers me: I have at least two rows with exactly the same date and hour for one match, so I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
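If running totals per row are wanted rather than one total per club, a cumulative sum within each club is one way to avoid the nested loop. The following is a minimal sketch under assumptions: the column names clubid, startdate and win_bool are taken from the question, the sample values are made up, and the helper name running_totals is hypothetical. Unlike the loop in the question, which fills NW_tot only on winning rows and NL_tot only on losing rows, it fills both columns for every row.

```python
import pandas as pd

# Made-up sample; note the two matches of club 1 that share startdate 2.
dat = pd.DataFrame({
    "clubid":    [1, 1, 1, 1, 2, 2, 2],
    "startdate": [1, 2, 2, 3, 1, 2, 3],
    "win_bool":  [1, 0, 1, 1, 0, 0, 1],
})


def running_totals(df):
    """Count wins and losses per club up to and including each start date."""
    out = df.sort_values(["clubid", "startdate"]).copy()
    out["loss_bool"] = 1 - out["win_bool"]
    by_club = out.groupby("clubid")
    wins = by_club["win_bool"].cumsum()
    losses = by_club["loss_bool"].cumsum()
    out["NW_tot"] = wins
    out["NL_tot"] = losses
    # Rows sharing a start date must count each other (the loop uses >=),
    # so every tied row receives the largest running total of its date.
    tied = out.groupby(["clubid", "startdate"])
    out["NW_tot"] = tied["NW_tot"].transform("max")
    out["NL_tot"] = tied["NL_tot"].transform("max")
    # Restore the original row order.
    return out.drop(columns="loss_bool").sort_index()


result = running_totals(dat)
print(result)
```

Sorting and cumulative sums scale roughly with n log n instead of the n squared comparisons of the double loop, which is what makes 320,000 rows feasible.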

Using to_datetime several columns names

I am working with several CSVs in which the first N columns hold general information and the next M columns (M is large) each hold the values for one date.
This is the dataframe picture
I need to convert only the names of columns N+1 through N+M-1 to date format.
I tried the following. In this case N+1 = 5, whatever M is, and I assume I can use -1 to leave the last column name untouched.
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations
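That approach is not feasible, because an Index is immutable and its elements cannot be assigned in place. Build the new labels and assign them all at once instead. Note that this converts every column name that parses as a date:
def convert(x):
    # Return the label as a datetime if it parses as one, otherwise unchanged
    try:
        return pd.to_datetime(x)
    except (ValueError, TypeError):
        return x
ContDiarios.columns = [convert(c) for c in ContDiarios.columns]
Alternatively, you can use the df.rename method to convert the names.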
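To touch only a slice of the column names, as the question asks, one option is to copy the labels into a list, replace the slice there, and assign the whole list back. This is a minimal sketch with a made-up frame; n_info stands for the position of the first date column and is an assumption of the example.

```python
import pandas as pd

# Made-up frame: two info columns, two date columns, one trailing column.
df = pd.DataFrame(
    [[1, "a", 10, 20, 30]],
    columns=["id", "name", "2020-01-22", "2020-01-23", "total"],
)

n_info = 2  # position of the first date column (assumed for this example)

# A plain list is mutable, unlike df.columns, so the slice can be replaced.
labels = list(df.columns)
labels[n_info:-1] = pd.to_datetime(labels[n_info:-1])

# Assign the complete list back in one step.
df.columns = labels
print(df.columns)
```

The info columns and the last column keep their string names, while the columns in between become Timestamp labels.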

Pyspark Schema update/alter Dataframe

I need to read a CSV file from S3. It contains string and double data, but I will read everything as strings, which gives me a dynamic frame with only string columns. I want to do the following for each row:
concatenate few columns and create new columns
Add new columns
Convert value in 3rd column from string to date
Convert values of column 4,5,6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far
import datetime
import re
from decimal import Decimal
from pyspark.sql import Row
def ConvertToDec(myString):
    pattern = re.compile("[0-9]{0,4}[\\.]?[0-9]{0,4}[-]?")
    myString = myString.strip()
    doubleVal = ""
    if myString and not pattern.match(myString):
        doubleVal = -9999.9999
    else:
        doubleVal = -Decimal(myString)
    return doubleVal
def rowwise_function(row):
    row_dict = row.asDict()
    data = 'd'
    if not row_dict['code']:
        data = row_dict['code']
    else:
        data = 'CD'
    if not row_dict['performancedata']:
        data = data + row_dict['performancedata']
    else:
        data = data + 'HJ'
    # new columns
    row_dict['LC_CODE'] = data
    row_dict['CD_CD'] = 123
    row_dict['GBL'] = 123.345
    if row_dict["created_date"]:
        row_dict["created_date"] = datetime.datetime.strptime(row_dict["created_date"], '%Y-%m-%d')
    if row_dict["performancedata"]:
        row_dict["performancedata"] = ConvertToDec(row_dict["performancedata"])
    newrow = Row(**row_dict)
    return newrow
store_df = spark.read.option("header","true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))
updatedDF = spark.createDataFrame(ratings_rdd_new)
Basically, I am creating an almost entirely new DataFrame. My questions are:
Is this the right approach?
Since I am changing most of the schema, is there another approach?
Use Spark DataFrames/SQL; why use an RDD? You don't need any low-level data operations. Everything here works at the column level, so DataFrames are easier and more efficient to use.
To create new columns, use .withColumn(<col_name>, <expression/value>).
The if branches can be expressed as column expressions with when(...).otherwise(...), or with .filter where rows should be dropped.
The whole ConvertToDec can be written better using strip and the ast module or float.
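The last point can be sketched in plain Python, independent of Spark. The sketch assumes that a trailing minus sign marks a negative number, as the sample rows suggest, and reuses the sentinel -9999.9999 from the question for values that cannot be parsed. The names convert_to_dec and SENTINEL are hypothetical.

```python
from decimal import Decimal, InvalidOperation

# Sentinel for unparseable input, taken from the question's code.
SENTINEL = Decimal("-9999.9999")


def convert_to_dec(text):
    """Parse strings such as '0015.5126-' where a trailing '-' means negative."""
    if text is None:
        return SENTINEL
    cleaned = text.strip()
    negative = cleaned.endswith("-")
    if negative:
        cleaned = cleaned[:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return SENTINEL
    # Decimal also accepts 'NaN' and 'Infinity'; treat those as invalid too.
    if not value.is_finite():
        return SENTINEL
    return -value if negative else value


print(convert_to_dec("0015.5126-"))
print(convert_to_dec("not a number"))
```

Letting Decimal do the validation avoids a regular expression that has to be kept in sync with the accepted formats.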

Why even though I sliced my original DataFrame and assigned it to another variable, my original DataFrame still changed values?

I am trying to calculate a portfolio's daily total price, by multiplying weights of each asset with the daily price of the assets.
Currently I have a DataFrame tw which is all zeros except for the dates that I want to re-balance, which holds my assets weights. What I would like to do is for each month, populate the zeros with the weights I am trying to re-balance with, till the next re-balancing date, and so on and so forth.
My code:
df_of_weights = tw.loc[dates_to_rebalance[13]:]
temp_date = dates_to_rebalance[13]
counter = 0
for date in df_of_weights.index:
if date.year == temp_date.year and date.month == temp_date.month:
if date.day == temp_date.day:
pass
else:
df_of_weights.loc[date] = df_of_weights.loc[temp_date].values
counter += 1
temp_date = dates_to_rebalance[13+counter]
My understanding was that if you slice a DataFrame and assign the slice to a variable (df_of_weights), changing the values of that variable does not affect the original DataFrame. However, the values in tw changed. I have been searching for an answer online for a while now and am really confused.
Use .copy() to fix the problem:
df_of_weights = tw.loc[dates_to_rebalance[13]:].copy()
The problem is that slicing can return a view of the original data instead of a copy, so writing to df_of_weights also writes to tw. The related pandas issue is still open:
https://github.com/pandas-dev/pandas/issues/15631
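A small self-contained check of the fix, using a made-up weights frame in place of tw: after an explicit .copy(), writes to the slice leave the original untouched.

```python
import pandas as pd

# Made-up weights: only the first date carries the rebalancing weights.
dates = pd.date_range("2021-01-01", periods=4, freq="D")
tw = pd.DataFrame(
    {"A": [0.6, 0.0, 0.0, 0.0], "B": [0.4, 0.0, 0.0, 0.0]},
    index=dates,
)

# The explicit copy gives df_of_weights its own data.
df_of_weights = tw.loc[dates[0]:].copy()

# Fill the later dates with the weights of the rebalancing date.
for date in df_of_weights.index[1:]:
    df_of_weights.loc[date] = df_of_weights.loc[dates[0]].values

print(tw)             # still zeros after the first row
print(df_of_weights)  # weights carried forward
```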