How is writing a for loop in R different from Stata?

I am working on a project in R and, coming from Stata, am having a hard time adjusting to how for loops work.
When I run the code below, I get exactly what I want: I create a new dataframe "states_1" out of an already existing dataframe "data_frame1".
states_1 <- data_frame1[["interest_over_time"]]
states_1 <- states_1[-c(1:66),]
states_1 = dcast(states_1, date ~ keyword + geo, value.var = "hits")
However, I have almost 15 dataframes, and this is a shared project, so I would rather use a for loop than repeat this for every dataframe. The code below, though, tells me that I am trying to treat a function like a dataframe. I would appreciate help with how to proceed.
for(i in 1:13) {
  states[i] <- data_frame[i][["interest_over_time"]]
  states[i] <- states[i][-c(1:66),]
  states[i] = dcast(states[i], date ~ keyword + geo, value.var = "hits")
}
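A minimal sketch of the usual R idiom, assuming the 13 frames can be gathered into a list and that dcast comes from reshape2 as in the snippet above: keep the dataframes in a list and loop over that with lapply.
library(reshape2)

# Gather the existing dataframes into a list (the names here are assumptions).
data_frames <- list(data_frame1, data_frame2)  # ... through data_frame13

# In the loop above, data_frame[i] indexes the function data_frame() rather
# than any of the numbered frames, which likely explains the "treating a
# function like a dataframe" error; a list of the actual frames avoids that.
states <- lapply(data_frames, function(df) {
  st <- df[["interest_over_time"]]
  st <- st[-c(1:66), ]
  dcast(st, date ~ keyword + geo, value.var = "hits")
})
# states[[1]] corresponds to data_frame1, and so on.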

Related

replacing df.append with pd.concat when building a new dataframe from file read

...
header = pd.DataFrame()
for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}:
    header = header.append({'col1': data1[x].split(':')[0],
                            'col2': data1[x].split(':')[1][:-1],
                            'col3': data2[x].split(':')[1][:-1],
                            'col4': data2[x] == data1[x],
                            'col5': '---'},
                           ignore_index=True)
...
I have some Jupyter Notebook code which reads two text files into data1 and data2 and, using a list of indices, picks out specific matching lines from both files into a dataframe for easy display and comparison in the notebook.
Since df.append is now deprecated in favour of pd.concat, what's the tidiest way to do this?
Is it basically to replace the inner loop code with:
...
header = pd.concat(header, {all the column code from above })
...
Additional input in response to the comment below:
Yes, sorry. For example, the next block of code does this:
for x in {4, 2, 5}:
    header = header.append({'col1': SOMENEWROWNAME,
                            'col2': data1[x].split(':')[1][:-1],
                            'col3': data2[x].split(':')[1][:-1],
                            'col4': data2[x] == data1[x],
                            'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])},
                           ignore_index=True)
This is repeated five times with different data indices in the loop, and then with a different SOMENEWROWNAME.
I inherited this notebook, and I see now that it was written this way because they only wanted a numerical float difference on the columns where numbers appear,
but there are several such blocks, with different lines in the data, where that first parameter SOMENEWROWNAME is the different text fields from the respective lines in the data.
So I was primarily just trying to fix these append-to-concat warnings, but of course if the code can be written better, then all good!
Use a list comprehension and the DataFrame constructor:
data = [{'col1': data1[x].split(':')[0],
         'col2': data1[x].split(':')[1][:-1],
         'col3': data2[x].split(':')[1][:-1],
         'col4': data2[x] == data1[x],
         'col5': '---'} for x in {0,7,8,9,10,11,12,13,14,15,18,19,21,23}]
df = pd.DataFrame(data)
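If a literal concat drop-in for the original loop is wanted instead, note that pd.concat expects a list of DataFrames rather than a dict, so each row has to be wrapped first. A minimal runnable sketch with stand-in data (the list-and-constructor version above is still preferable, since concatenating inside the loop re-copies the frame on every pass):
import pandas as pd

# pd.concat takes a *list* of DataFrame/Series objects, not a dict,
# so a row dict must be wrapped in a one-row DataFrame before concatenating.
header = pd.DataFrame()
for text in ['a:1', 'b:2']:  # stand-in for the data1/data2 lines
    row = pd.DataFrame([{'col1': text.split(':')[0], 'col5': '---'}])
    header = pd.concat([header, row], ignore_index=True)
print(header)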
EDIT:
out = []
# sample
for x in {1, 7, 30}:
    out.append({'col1': SOMENEWROWNAME,
                'col2': data1[x].split(':')[1][:-1],
                'col3': data2[x].split(':')[1][:-1],
                'col4': data2[x] == data1[x],
                'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
df1 = pd.DataFrame(out)
out1 = []
# sample
for x in {1, 7, 30}:
    out1.append({another dict})
df2 = pd.DataFrame(out1)
df = pd.concat([df1, df2])
Or:
final = []
for x in {4, 2, 5}:
    final.append({'col1': SOMENEWROWNAME,
                  'col2': data1[x].split(':')[1][:-1],
                  'col3': data2[x].split(':')[1][:-1],
                  'col4': data2[x] == data1[x],
                  'col5': float(data2[x].split(':')[1][:-1]) - float(data1[x].split(':')[1][:-1])})
for x in {4, 2, 5}:
    final.append({another dict})
df = pd.DataFrame(final)
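Either way, the point is to collect plain dicts in a list and build the DataFrame once at the end: growing a frame inside the loop with append or concat copies all the rows accumulated so far on every iteration, which is quadratic in the number of rows.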

Create and Merge Pandas Dataframes in loop

I need to read in a bunch of input dataframes based on some conditions, then merge them, and finally create dataframes named 'merge_m0', 'merge_m1', 'merge_m2' and so on.
In the actual code I need to read about 20 dataframes, but for simplicity and ease of understanding I'm creating 3 dataframes and using a for loop to read and merge them.
#INPUT: Sample input dataframes df0, df1 & df2
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})
To do this I'm using globals() to create the dataframes in a loop and to merge them, but it's not working: it throws a "'DataFrame' object has no attribute 'globals'" error.
#Code:
def comb_mths(x, y):
    globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].globals()[f'm{x}_val_mthd'].isin([1,25])]
    globals()[f"m{y}"] = globals()[f'df{y}'][(globals()[f'df{y}'].globals()[f'm{y}_val_mthd'].isin([8,10,11,12])) & (globals()[f'df{y}'].globals()[f'm{y}_orig_val_mthd'].isin([2,3,4,5]))]
    globals()[f"merge_m{x}"] = pd.merge(globals()[f"m{x}"], globals()[f"m{y}"], how='inner', on=['id'])

for i in range(0,3):
    comb_mths(i, i+1)
I've also tried the following in place of the 1st line in the above function:
#globals()[f"m{x}"] = globals()[f'df{x}'][globals()[f'df{x}'].m{x}_val_mthd.isin([1,25])]
#globals()[f"m{x}"] = globals()[f'df{x}']["[f'm{x}_val_mthd']"].isin([1,25])
I think there must be some better and easy alternative to do this and appreciate if anyone can help. Thanks!
Edit:
My updated post:
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})

df_list = []
for i in range(0,3):
    df_list.append(globals()[f'df{i}'])  # appending all the input dataframes, which were already created by another step in the code

def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])]
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"]

for i in range(0,2):
    comb_mths(i)

print(merge_m0)
print(merge_m1)
In the above function, after creating the "merge_m{i}" dataframe, I need to check one more if-else condition and calculate a variable, say 'mths'.
**The logic goes like this:
when i=0 I need to check "m1_orig_val_mthd", when i=1 I need to check "m2_orig_val_mthd", when i=2 I need to check "m3_orig_val_mthd", and so on.**
The if-else pseudocode is below. Can you please show me how to add this condition to the above function as well?
when i=0 (1st iteration):
    if m1_orig_val_mthd isin (2,4,6):
        diff = (mydate - m1_appr_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
    elif m1_orig_val_mthd isin (1,3,5):
        diff = (mydate - m1_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
when i=1 (2nd iteration):
    if m2_orig_val_mthd isin (2,4,6):
        diff = (mydate - m2_appr_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
    elif m2_orig_val_mthd isin (1,3,5):
        diff = (mydate - m2_bpo_rcvd_dt)//(np.timedelta64(1,'M'))
        mths = diff - (i-1)
and so on...
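A hedged sketch of one way that condition might be folded in after the merge. Everything date-related here is an assumption taken from the pseudocode (mydate and the m{n}_appr_rcvd_dt / m{n}_bpo_rcvd_dt columns do not appear in the sample frames), and the month arithmetic is kept verbatim from the pseudocode; recent pandas versions reject the 'M' unit, in which case Period subtraction is an alternative.
import numpy as np

def add_mths(merged, i, mydate):
    n = i + 1  # iteration i checks the m{i+1}_* columns, per the stated logic
    orig = merged[f'm{n}_orig_val_mthd']
    appr = (mydate - merged[f'm{n}_appr_rcvd_dt']) // np.timedelta64(1, 'M')
    bpo = (mydate - merged[f'm{n}_bpo_rcvd_dt']) // np.timedelta64(1, 'M')
    # vectorized if/elif: appr where orig is in (2,4,6),
    # bpo where orig is in (1,3,5), NaN otherwise
    diff = appr.where(orig.isin([2, 4, 6]), bpo.where(orig.isin([1, 3, 5])))
    merged['mths'] = diff - (i - 1)
    return merged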
I took a different approach, assuming you can create all the input dataframes first. If you can create your dataframes and put them in a list, it makes them easier to handle and the code easier to read.
df0=pd.DataFrame({'id':[100,101,102,103],'m0_val_mthd':[1,8,25,41],'name':['AAA','BBB','CCC','DDD'],'m0_orig_val_mthd':[2,3,4,5]})
df1=pd.DataFrame({'id':[100,104,102,103],'m1_val_mthd':[1,8,10,25],'name':['EEE','FFF','GGG','HHH'],'m1_orig_val_mthd':[2,3,4,5]})
df2=pd.DataFrame({'id':[100,104,102,103],'m2_val_mthd':[1,8,10,25],'name':['III','JJJ','KKK','LLL'],'m2_orig_val_mthd':[2,3,4,5]})

# add your inputs to the list
df_list = [df0, df1, df2]

# only pass in i, then pick dfa, dfb by position in the list
def comb_mths(i):
    dfa = df_list[i]
    dfb = df_list[i+1]
    # print(dfa)
    # print(dfb)
    # print('\n'*3)
    # I wasn't exactly sure what you wanted here, but I think the original
    # issue was you were calling your new dataframe before it was created.
    # As long as columns are in the same position, you don't need to call
    # them by name, just by position.
    dfma = dfa[dfa.iloc[:, 1].isin([1,25])]
    dfmb = dfb[(dfb.iloc[:, 1].isin([8,10,11,12])) & (dfb.iloc[:, 3].isin([2,3,4,5]))]
    print(dfma)
    print(dfmb)
    print('\n'*3)
    # creating the new merged dataframes; cleaned this up too
    globals()[f"merge_m{i}"] = dfma.merge(dfmb, how='inner', on=['id'])
    return globals()[f"merge_m{i}"]  # added return statement

for i in range(0,2):  # watch the range end or you'll get an error
    comb_mths(i)

print(merge_m0)
print(merge_m1)
Additional code:
# to populate df_list, do this
# you aren't actually naming the frames; I only did that in the example
# above to match your example. When you call them, you are calling the
# position in the list.
df_list = []
for i in range(0,20):
    df = 'do your code here'
    df_list.append(df)

# print the dfs to verify they were created
for df in df_list:
    print(df)
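If the merge_m{i} variables themselves are not needed as top-level names, a dict avoids globals() entirely; a sketch reusing df_list and the filters from the answer above:
merges = {}
for i in range(len(df_list) - 1):
    dfa, dfb = df_list[i], df_list[i + 1]
    dfma = dfa[dfa.iloc[:, 1].isin([1, 25])]
    dfmb = dfb[dfb.iloc[:, 1].isin([8, 10, 11, 12]) & dfb.iloc[:, 3].isin([2, 3, 4, 5])]
    merges[f'merge_m{i}'] = dfma.merge(dfmb, how='inner', on='id')

# access by the same names the question uses
print(merges['merge_m0'])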

How to make a scatter plot based on the values of a column in the data set?

I am given a data set that looks something like this,
and I am trying to graph all the points with a 1 in the first column separately from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this.
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, and DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like "Cannot convert Float64 to series data for plotting" when I try plotting the points individually, using a for loop as a filter, as shown in the code below:
filter = select(data, :1)
newData = select(data, 2:3)

# graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")

# add the additional points with the 1 in front
for i in 2:size(newData)
    if filter[i] == 1
        plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
    end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick Edit:
It might help to know that I am trying to replicate the Nonlinear dimensionality reduction section of this article https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (the code is fully reproducible):
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> @df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that the gdf variable is bound to a GroupedDataFrame object, from which you can get the groups defined by the grouping column (:c in this case).
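Closer to the filtering the question attempts, plain boolean indexing also works (a sketch reusing the df from the reproducible example above):
julia> mask = df.c;  # the class column is already Bool

julia> plot(df.x[mask], df.y[mask], seriestype=:scatter, legend=nothing)

julia> plot!(df.x[.!mask], df.y[.!mask], seriestype=:scatter)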

Julia DataFrame: Create new column sum of col values :x by :y

I have a DataFrame of x and y occurrences. I would like to count how often each combination occurs in the DataFrame and what percentage of the :y occurrences that combination represents. I have the first part down now, thanks to a previous question.
using DataFrames
mydf = DataFrame(y = rand('a':'h', 1000), x = rand('i':'p', 1000))
mydfsum = by(mydf, [:x, :y], df -> DataFrame(n = length(df[:x])))
This successfully creates a column that counts how often each value of :x occurs with each value of :y. Now I need to be able to generate a new column that counts how often each value of :y occurs. I could next create a new DataFrame using:
mydfsumy = by(mydf, [:y], df -> DataFrame(ny = length(df[:x])))
Join the DataFrames together.
mydfsum = join(mydfsum, mydfsumy, on = :y)
And create the percentage :yp column
mydfsum[:yp] = mydfsum[:n] ./ mydfsum[:ny]
But this seems like a clunky workaround for a common data management problem. In R I would do all of this in one line using dplyr:
mydf %>% group_by(x, y) %>% summarize(n = n()) %>% group_by(y) %>% mutate(yp = n/sum(n))
You can do it in one line:
mydfsum = by(mydf, :y, df -> by(df, :x, dd -> DataFrame(n = size(dd,1), yp = size(dd,1)/size(df,1))))
or, if that becomes hard to read, you can use the do notation for anonymous functions:
mydfsum = by(mydf, :y) do df
    by(df, :x) do dd
        DataFrame(n = size(dd,1), yp = size(dd,1)/size(df,1))
    end
end
What you are doing in R is actually a first by on both x and y, then mutating a column of the output. You can do that too, but you need to have created that column first. Here I first initialize the yp column with zeros and then modify it in place with another by.
mydfsum = by(mydf,[:x,:y], df -> DataFrame(n = size(df,1), yp = 0.))
by(mydfsum, :y, df -> (df[:yp] = df[:n]/sum(df[:n])))
For more advanced data manipulation you may want to take a look at Query.jl
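For reference, the by/join API shown above predates current DataFrames.jl; in recent releases the same computation would read roughly as follows (a sketch using groupby/combine/transform):
using DataFrames

mydf = DataFrame(y = rand('a':'h', 1000), x = rand('i':'p', 1000))

# count each (x, y) combination, then each count's share within its y group
mydfsum = combine(groupby(mydf, [:x, :y]), nrow => :n)
mydfsum = transform(groupby(mydfsum, :y), :n => (n -> n ./ sum(n)) => :yp)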

Selecting every Nth column using sqldf or read.csv.sql

I am rather new to using SQL statements and am having a little trouble using them to select the desired columns from a large table and pull them into R.
I want to take a csv file and read selected columns into R; in particular, every 9th and 10th column. In R, something like:
read.csv.sql("myfile.csv", sql(select * from file [EVERY 9th and 10th COLUMN])
My trawl of the internet suggests that selecting every nth row could be done with an SQL statement using MOD, something like this (please correct me if I am wrong):
"SELECT *
FROM file
WHERE (ROWID,0) IN (SELECT ROWID, MOD(ROWNUM,9) OR MOD(ROWNUM,10)"
Is there a way to make this work for columns? Thanks in advance.
read.csv
read.csv would be adequate for this:
# determine number of columns
DF1 <- read.csv(myfile, nrows = 1)
nc <- ncol(DF1)
# create a list nc long where unwanted columns are NULL and wanted are NA
colClasses <- rep(rep(list("NULL", NA), c(8, 2)), length = nc)
# read in
DF <- read.csv(myfile, colClasses = colClasses)
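With nc = 20, for example, this colClasses pattern keeps columns 9, 10, 19 and 20 and drops the rest, which is exactly an every-9th-and-10th-column selection.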
sqldf
To use sqldf, replace the last line with these:
nms <- names(DF1)
vars <- toString(nms[is.na(colClasses)])
DF <- fn$read.csv.sql(myfile, "select $vars from file")
UPDATE: switched to read.csv.sql
UPDATE 2: correction.