Read multiple files, create a data frame and add a new column containing the name of each file in R - read.table

I am new to the dplyr package and I have been trying to read multiple files in R, then create a data frame by binding all the rows, including the name of each file as a new column. This new column holds the corresponding date, which is not included in the data itself.
My list of files (for example):
01012019.aps
02012019.aps
I would like my final data frame to look like this:
x y file date
1 4 01012019 01-01-2019
2 5 01012019 01-01-2019
3 6 02012019 02-01-2019
4 7 02012019 02-01-2019
I've been trying this:
path_aps<- "C:/Users/.../.../APS"
files_aps <- list.files(path_aps, pattern = "*.aps")
data_aps <- files_aps %>%
  map(~ read.table(file.path(path_aps, .), sep = "\t")) %>%
  map(~ mutate(filename = files_aps, .)) %>%
  reduce(gtools::smartbind)
But I am getting this error:
Error: Column filename must be length 288 (the number of rows) or one, not 61
I understand that the list of files in files_aps has 61 elements, as this is the number of files in my directory, and 288 is the number of rows of each .aps file; however, I haven't been able to make it work at the level of each individual .aps file. I've been reading multiple answers to similar questions, but I am still not getting the expected result.

I've solved it with the help of this other answer and ended up with this:
library(dplyr)
library(purrr)

data_aps <- list.files(path_aps, pattern = "*.aps", full.names = TRUE) %>%
  map_df(function(x) read.table(x, sep = "\t") %>%
           mutate(filename = gsub(".aps", "", basename(x))))
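If you also want the date column from the example, one option (a sketch, assuming the filenames are always ddmmyyyy) is to parse the filename column afterwards:

data_aps <- data_aps %>%
  mutate(date = as.Date(filename, format = "%d%m%Y"))

# If you prefer the dd-mm-yyyy display from the example, store it as text instead:
# mutate(date = format(as.Date(filename, format = "%d%m%Y"), "%d-%m-%Y"))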


How can I read and parse files with varying spaces as delimiters?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care about the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also collapsing all the rows inside each file into a single value per column
How to parse these files, given that the spaces between the columns vary for almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1,2345,0.0241,0.5901|
|File3|LLKM|0.9812,0,2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code in Julia that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    filter!(x -> !startswith(x, "#"), l)    # drop the comment/header lines
    sl = split.(l)                          # split each remaining row on whitespace
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),
            SCORE=parse.(Float64, getindex.(sl, 3)))
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame.
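For readers who want to stay in R, as in the rest of this thread, here is a rough sketch of the same approach (it assumes the files are in the working directory and keeps SCORE as a comma-separated string, as in the desired output):

library(purrr)

parse_file <- function(filename) {
  l <- readLines(filename)
  l <- l[!startsWith(l, "#")]             # drop the comment/header lines
  sl <- strsplit(trimws(l), "\\s+")       # split on runs of whitespace of any length
  data.frame(
    File  = filename,
    SEQ   = paste(sapply(sl, `[`, 2), collapse = ""),
    SCORE = paste(sapply(sl, `[`, 3), collapse = ",")
  )
}

df <- map_df(c("file1.no", "file2.no", "file3.no"), parse_file)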

Importing JSON data from SQL DB to an R dataframe

I would like to know whether there is a way of importing JSON data from a MySQL DB to an R dataframe.
I have a table like this:
id created_at json
1 2020-07-01 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 2020-07-01 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 2020-07-01 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
I would like to get the columns 'id' and 'json'.
I am using the RMySQL package to get the data from the DB into an R data frame, but this gives me only the column 'id'; the column 'json' contains only NAs in each row.
Is there any way to import/load the data so that the json column is populated? And possibly to extract the "sensor" part of the json values?
The result would be a dataframe (df) like this:
id json
1 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
Or with the extracted value:
id sensor
1 "sensor":34834834
2 "sensor":12342439
3 "sensor":83287699
Thank you very much for any suggestions.
Using unnest_wider from tidyr
library(dplyr)
con <- DBI::dbConnect(RMySQL::MySQL(), 'db_name', user = 'user', password = 'pass', host = 'hostname')
t <- tbl(con, 'table_name')
t %>%
  as_tibble() %>%
  transmute(j = purrr::map(json, jsonlite::fromJSON)) %>%
  tidyr::unnest_wider(j)

DBI::dbDisconnect(con)
Result:
# A tibble: 3 x 6
name group `age (y)` `height (cm)` `wieght (kg)` sensor
<chr> <chr> <int> <int> <int> <int>
1 Dent, Arthur Green 43 187 89 34834834
2 Doe, Jane Blue 23 172 67 12342439
3 Curt, Travis Red 13 128 47 83287699
If you want to only retrieve data from the last 24 hours (as the OP requested) change the tbl(con, 'table_name') statement to:
t <- DBI::dbGetQuery(con, 'SELECT * FROM `table_name` WHERE DATE(`created_at`) > NOW() - INTERVAL 1 DAY')
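If, as in your second desired output, you only want id and sensor, a small variation of the same pipeline should work (a sketch; keeping id requires mutate() rather than transmute(), since transmute() drops the other columns):

t %>%
  as_tibble() %>%
  mutate(j = purrr::map(json, jsonlite::fromJSON)) %>%
  tidyr::unnest_wider(j) %>%
  select(id, sensor)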
Converting your JSON response to a data frame should be straightforward, but because the structure of a JSON response is essentially arbitrary and you haven't given us details of how you obtain it or of its exact content, it's impossible to give you code that will work in your specific case. However, this is the basic process that works in one of my applications, starting with the POST call to the API that provides access to the database.
library(httr)
library(jsonlite)
# Query the API
response <- POST(<your code here>)
# Extract the content of the response. Amend the format and encoding if necessary.
content <- content(response, as="text", encoding="UTF-8")
# Convert the content to an R object
content <- fromJSON(content, flatten=FALSE)
# Coerce to data.frame
df <- as.data.frame(content)
You should, of course, incorporate error and status checking throughout the process.
Note: your data contains a spelling mistake. "wieght" should be "weight".

I want to convert the values of the 2nd column into columns, with a rating under each (2-dimensional matrix)

"User-ID";"ISBN";"Book-Rating"
"276725";"034545104X";"0"
"276726";"0155061224";"5"
"276727";"0446520802";"0"
Output would be like:
         "034545104X" "0155061224" "0446520802"
"276725" "0"
"276726"              "5"
"276727"                           "0"
Solved it with the script below in R.
# load the readr library
library(readr)
# read the CSV file with ';' as the delimiter
BX_Book_Ratings <- read_delim("C:/Users/panch/Desktop/Lambton/Term_2_Fall_2017/2017F-T2 BDM 2013 - Data Collection Methods/project_03/dataset/BX-Book-Ratings.csv", ";", escape_double = FALSE, trim_ws = TRUE)
# view the data
View(BX_Book_Ratings)
# load the reshape2 library
library(reshape2)
# keep only the first few rows of the dataset
sample_data <- head(BX_Book_Ratings, 30)
# generate the matrix: one row per user, one column per ISBN, ratings as values
d <- dcast(sample_data, User_ID ~ ISBN, value.var = "Book_Rating")
# replace NA with 0
d[is.na(d)] <- 0
# load the gridExtra library to display the data in table format
library(gridExtra)
# display the table
grid.table(d)
# store the output in CSV format
write.csv(d, "C:/Users/panch/Desktop/Lambton/Term_2_Fall_2017/2017F-T2 BDM 2013 - Data Collection Methods/project_03/dataset/output.csv")
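For reference, the same reshape can also be sketched with tidyr's pivot_wider (this assumes the columns keep their original names `User-ID`, `ISBN` and `Book-Rating`, that the ratings are parsed as numbers, and the path is shortened here):

library(readr)
library(dplyr)
library(tidyr)

ratings <- read_delim("BX-Book-Ratings.csv", delim = ";",
                      escape_double = FALSE, trim_ws = TRUE)

# one row per User-ID, one column per ISBN, ratings as the cell values
d2 <- ratings %>%
  head(30) %>%
  pivot_wider(names_from = ISBN, values_from = `Book-Rating`, values_fill = 0)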

Reformat wide Excel table to more SQL-friendly structure

I have a very wide Excel sheet of survey data, running from column A to column DIE (about 2,500 columns). Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have an Excel sheet laid out in that wide format (screenshot omitted: one column per question, one row per response), and I want to convert it to a long format (screenshot omitted: one row per response/question pair, with the response ID, the question, and the answer as columns).
What options do I have to make this change, either in Excel (prior to upload) or SQL (while circumventing the 1024 column limit)?
I have had to do this quite a bit. My solution was to write a Python script that un-crosstabs a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but I don't know why it wouldn't work.
R has a function for exactly this in one of its libraries (gather, from the tidyr package). You can also connect to, read from, and write to a database from R. I would suggest downloading R and RStudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
id question_1 question_2
1 1 1 2
2 2 0 0
3 3 1 2
Code to reshape the data from wide to long (keeping id as an identifier):
library(tidyr)
df2 <- gather(df, key = question, value = values, -id)
df2
Output:
id question values
1 1 question_1 1
2 2 question_1 0
3 3 question_1 1
4 1 question_2 2
5 2 question_2 0
6 3 question_2 2
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
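Note that gather() is now superseded in tidyr; the same reshape can be sketched with pivot_longer():

library(tidyr)

# df as defined above
df2 <- pivot_longer(df, cols = -id, names_to = "question", values_to = "values")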
Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv

with open('Responses.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)  # capture current field headers
    newHeaders = ['ResponseID', 'Question', 'Response']  # establish new header names
    with open('PythonOut.csv', 'w') as outputfile:
        writer = csv.writer(outputfile, dialect='excel', lineterminator='\n')
        writer.writerow(newHeaders)  # write new headers to output
        QuestionHeaders = headers[1:len(headers)]  # slice the question headers from the original header list
        for row in reader:
            questionCount = 0  # counter to loop through each question (column) for every response (row)
            while questionCount <= len(QuestionHeaders) - 1:
                newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
                writer.writerow(newRow)
                questionCount += 1

Read only n-th column of a text file which has no header with R and sqldf

I have a problem similar to this question:
selecting every Nth column in using SQLDF or read.csv.sql
I want to read some columns of large files (a table of 150 rows and >500,000 columns, space separated, filled with numeric data, with only a 32-bit system available). The file has no header, so the code in the thread above didn't work and I decided to write a new post.
Do you have an idea of how to solve this problem?
I thought about something like this, but any solution with fread or read.table would also be fine:
MyConnection <- file("path/file.txt")
df<-sqldf("select column 1 100 1000 235612 from MyConnection",file.format = list(header=F,sep=" "))
You can use substr to specify the start and end position of the columns you want to read in if they are fixed width:
library(sqldf)

x <- tempfile()
cat("12345", "67890", "09876", "54321", sep = "\n", file = x)
myfile <- file(x)
sqldf("select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile")
# var1 var2
# 1 1 345
# 2 6 890
# 3 9 76
# 4 5 321
See this blog post for some more examples. The "select" statement can easily be constructed with paste if you know the details about the column starting positions and widths.
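For instance, rebuilding the exact query from the example above with paste (the positions and widths here are just the ones used in that example; substitute your own):

starts <- c(1, 3)   # starting positions of the columns you want (example values)
widths <- c(1, 5)   # widths of those columns (example values)

select_clause <- paste0("substr(V1, ", starts, ", ", widths, ") var",
                        seq_along(starts), collapse = ", ")
sql <- paste("select", select_clause, "from myfile")
sql
# [1] "select substr(V1, 1, 1) var1, substr(V1, 3, 5) var2 from myfile"

sqldf(sql)   # same result as above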