Reading xlsx into R and creating a new header

I'm a newbie to R.
I read in an Excel file for a survey, starting from the 3rd row of the file, because the survey download puts two header rows at the top. The first row holds the question text for every question. The second row holds the options for multiple choice questions: each option gets its own column, except the first option, which sits in the same column as its question in the first row.
So now, my dataframe starts with Row 3.
But now I need to create custom variable names, i.e. new variable names for each column, before I manipulate the data further. I'm looking for tips on how best to accomplish this.
What I am thinking:
1. Create an Excel file with the variable names, and then use this as the header. I'm not quite sure which code I would use to do this.
2. Code the names as an empty dataframe, and then somehow merge this so that the empty dataframe's column names become the column names of the file I imported.
I would appreciate some suggestions on how best to do this!

Related

Is there a quick way to subset columns in PANDAS?

I am trying to set up a PANDAS project that I can use to compare Excel and csv files and return the differences between them over time. Currently I load the Excel/csv files into pandas and assign them a "version" column. I do this because, in my last step, I want the program to create a file containing only what has changed in the "new" version file, so that I do not have to update the entire database, only the data points that have changed.
import pandas as pd
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one, then drop duplicate rows based on all the columns in the original files. I have to subset the columns because, if the program looks at my added version column, no row will be seen as a duplicate. The statement is listed below:
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to subset the data. The way I currently have it set up, I have to list each column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want to exclude 1 column. Is there a way to populate all the column titles and remove the ones I do not want looked at? Or is there a way to give drop_duplicates the columns I DO NOT want compared, instead of entering all the columns except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data that I am working with as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!
If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Pass the list as the subset argument of drop_duplicates().
If there are more columns to remove than to keep, add the wanted columns to the list by hand instead.
With a list, you won't need to write them out every time.
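How to iterate over a list:
# Avoid calling the variable "list", which would shadow the built-in type
columns = ['first', 'second', 'third']
for name in columns:
    print(name)
# Output (one per line): first, second, third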
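The steps above can also be done without editing the list by hand. Here is a minimal sketch, assuming the two sheets are stacked with pd.concat (the question does not show how df1 is built) and using small made-up frames in place of the real Excel files. The names exclude and compare_cols are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the two Excel sheets in the question
old = pd.DataFrame({"UWI": [1, 2], "Operator": ["A", "B"]})
new = pd.DataFrame({"UWI": [1, 2], "Operator": ["A", "C"]})
old["version"] = "old"
new["version"] = "new"

# Assumption: the sheets are combined by stacking them
combined = pd.concat([old, new], ignore_index=True)

# List only the columns that should NOT be compared
exclude = ["version"]
compare_cols = [c for c in combined.columns if c not in exclude]

# Same call as in the question, but the subset is built automatically
deduped = combined.drop_duplicates(subset=compare_cols)
print(deduped)
```

Only the exclude list needs to change from sheet to sheet. If you want just the rows that differ between versions, passing keep=False to drop_duplicates drops every copy of a duplicated row instead of keeping the first one.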

Change the format of the current excel

I would like to ask how to convert an Excel file from one format into another. I created the examples here manually, but I would like to know whether the conversion can be done by a background process or a button click.
Also:
1. How can I get just the id number "40002" in one cell?
2. How can I get only "ABELONON, RYAN", removing the string "Employee: " and the string " (40002)"?
This is the standard format from the biometrics, and I want to change it to the format below.
This is the format that I need. Is there an easy way to make this change, and how? Thanks in advance.
There is no immediate answer to your question, I am afraid, because any batch operation that reads data from one (or multiple) spreadsheets into a single spreadsheet has to match some pattern in the input spreadsheets.
So, I suggest you provide more information in your question about the specific patterns of your input (the worksheets you want the macro to read from):
are they in different excel source files?
are they in the same excel but different worksheets?
are those worksheets mixed with other worksheets that the macro should not read from?
given one source worksheet, what data pattern should the macro expect to find? For example:
  find the first cell in the 1st column of the worksheet whose content starts with Employee:
  the first filled-in row is a header
  the first filled-in row after the header row is the first row of data
  a block of data does not contain empty rows and ends where the first column does not contain a date
  the structure of a row of data is always:
    (1) Date (date part of a date)
    (2) empty
    (3) Time (time part of a date)
    (4) Time (time part of a date)
    where (3) and (4) could repeat in some rows
  extract the employee name and the id number, expecting them to follow this regex pattern: Employee:\s*(.*)\((\d+)
  copy the whole block of data by pairing (1) + (3) + (4) and adding the employee details as the first two columns; (3) and (4) could repeat for certain rows (this last part would be tricky)
In other words: it is not worth the effort of creating such a macro unless all your source data follows the same pattern. If it does, update your question and let's see what can be done.
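To illustrate the regex mentioned above, here is a minimal Python sketch rather than a macro. The sample cell text is assembled from the strings quoted in the question, and the function name parse_employee is hypothetical. The first capture group keeps the space before the opening parenthesis, so it is stripped afterwards.

```python
import re

# Pattern from the answer: the name, then the id digits after "("
EMPLOYEE_RE = re.compile(r"Employee:\s*(.*)\((\d+)")

def parse_employee(cell_text):
    """Return (name, id_number) or None if the cell does not match."""
    match = EMPLOYEE_RE.search(cell_text)
    if match is None:
        return None
    name = match.group(1).strip()  # drop the space before "("
    id_number = match.group(2)
    return name, id_number

# Sample text built from the strings quoted in the question
print(parse_employee("Employee: ABELONON, RYAN (40002)"))
```

The id is returned as a string, which avoids losing leading zeros if any ids have them.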

How can I merge header cells in Excel writer in Pentaho?

I am trying to merge header column cells into one cell, but when I do that my data also ends up in one column. I want my output to look like the attached screenshot. Kindly help me with this.
Are your columns variable, or do you always have the same output schema?
If it's fixed, I would use a template where the headers are fixed and start populating from row 5.
Google Spreadsheet input
If you are using the Spreadsheet input, that is not possible in the step.
What I usually do in that kind of situation is create a row with my headers and hide it, so the user doesn't get confused by two headers. Then the step will get the result perfectly, using the column names provided in the first row. (You can use a formula like =b3 there so that it changes with the real header, no problem.)
Excel input
If you are using the Excel input step, you can set the sheet to be read from row 2, column 0, and it should work fine. =)

Build a Column by applying a formula to an existing column (like in excel)

I am new to the community. Hopefully this was not answered already.
I am trying to add a column to a DataFrame that contains a formula based on the previous columns. For example, build a series of stock returns based on the stock's closing prices.
I know how to build a column by doing exactly the same thing to all elements of another, but not how to use a column's elements and a formula to create another.
Thanks for your help.
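One possible reading of the question, as a minimal sketch: a simple return is each close divided by the previous close, minus 1. The column names close and return and the prices are made up for illustration. shift(1) moves the column down one row, much like referring to the cell above in Excel.

```python
import pandas as pd

# Made-up closing prices for illustration
prices = pd.DataFrame({"close": [100.0, 110.0, 99.0]})

# Simple return: this row's close divided by the previous row's close, minus 1
prices["return"] = prices["close"] / prices["close"].shift(1) - 1

# pandas also has a built-in for the same calculation
prices["return_builtin"] = prices["close"].pct_change()

print(prices)
```

The first row has no previous close, so its return is NaN.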

Transform and load a large CSV to multiple worksheets in one Excel file

Back Story:
NEW PROJECT FROM MANAGEMENT: I have been given a soft project from my boss to evaluate one of our current ETL plans to look for room for improvement in the process, and I am looking for guidance.
MOTIVE: Excel is currently being used and crashes quite often during the process due to file size.
TASK: Every month an analyst receives a large csv file from a survey vendor, containing up to 750 columns (not all with unique names) and over 15,000 rows. The task is simply to transform this csv file into an Excel file with seven worksheets, split according to the column headings in the csv. Details of how it is split are below.
My question: is transforming one large csv into an edited Excel file with multiple worksheets any easier or quicker using VB.NET and VS2010 (or VBA, for that matter), or would continuing to use Excel be the simplest way? I am an expert Excel user, but I am still a beginner to intermediate at coding in VBA, VB.NET or any other language.
Detailed Question:
I am open to using free or open source software, but I am most familiar with VB.NET, Excel and Excel-VBA. I have played around a bit with coding a simple Windows Forms application to load the csv into a datatable using similar TextFieldParser code found here. I have thought of loading it into an array or even a 2d array to more easily edit the column headings and find the duplicate column headings. The datatable option still leaves me with more questions than answers, because I need unique column headings and I am not sure if I should bother with a datatable if I'm going to write an Excel file right away. I tried CSVreader from CodeProject, but it won't work on files with duplicate header names. I feel as though I have writer's block, as I am not sure which direction I should take to handle such a process. Any input you can provide will be much appreciated, and I apologize if this question does not have a single and clear best answer. Thanks.
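Since open source tools are an option, here is a minimal Python sketch of one way to get unique column headings from a header row that contains duplicates. The function name make_unique and the suffix scheme (_2, _3, ...) are assumptions for illustration, not something the vendor file or any library prescribes.

```python
from collections import Counter

def make_unique(headers):
    """Append _2, _3, ... to the second and later copies of a heading."""
    seen = Counter()
    unique = []
    for heading in headers:
        seen[heading] += 1
        if seen[heading] == 1:
            unique.append(heading)
        else:
            unique.append(f"{heading}_{seen[heading]}")
    return unique

# Hypothetical header row with repeated names
print(make_unique(["sid", "language", "sid", "language", "sid"]))
```

This sketch does not check whether a generated name such as sid_2 already exists in the file, so a real header list would need that check as well.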
Current Analyst tasks using excel
The current analytical plan has the analyst open the csv in Excel, insert a row above row 1, and use a VLOOKUP to replace the 'New' column names with the 'Old' column names, based on a simple two-column lookup table on a separate worksheet. For example:
New becomes Old
"org-name" becomes "org_name" or
"item_1_Vendor" becomes "item_1" or
"date-created_Survey" becomes "date_created"
etc., checking all the "New" columns sent against the list of all 750 possible columns.
Then they paste the first row as values and delete the 2nd row, which contained the New headings we wanted to change.
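The lookup-table step above can be sketched in Python as a dictionary lookup. The three pairs are the examples given in the question; the names rename_map and rename_headers are hypothetical, and a real table would hold all 750 possible columns.

```python
# "New" vendor heading -> "Old" heading, from the examples above
rename_map = {
    "org-name": "org_name",
    "item_1_Vendor": "item_1",
    "date-created_Survey": "date_created",
}

def rename_headers(headers, mapping):
    """Replace each heading found in the mapping; leave the others unchanged."""
    return [mapping.get(heading, heading) for heading in headers]

print(rename_headers(["org-name", "sid", "item_1_Vendor"], rename_map))
```

Headings missing from the table pass through unchanged here. A VLOOKUP would show #N/A for them instead, so you may prefer to collect and report them.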
Then the analyst has to fix the primary key on the file which is called "sid".
The Survey ID field (sid) should have a number for each row of the data file. Sometimes the sid shows up under the sid_HCAHPS or the sid_CGCAHPS fields instead.
The analyst would insert a column next to the "sid" field and put a formula in it like this, for example:
=IF(BE2<>"",BE2,IF(RD2<>"",RD2,IF(UH2<>"",UH2,"")))
Actual cell references would change but in the example excel formula,
"sid"=Range("BE2")
"sid_HCAHPS"=Range("RD2")
"sid_CGCAHPS"=Range("UH2")
Once the newly created primary key column is made and filled without blanks, we can delete the original "sid" column.
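The nested IF formula above takes the first non-blank value among the three fields. A minimal Python sketch of the same logic, assuming each row is a dictionary keyed by column name (as csv.DictReader would give for a file with unique headings). The function name coalesce_sid and the sample values are made up.

```python
def coalesce_sid(row, candidates=("sid", "sid_HCAHPS", "sid_CGCAHPS")):
    """Return the first non-blank value among the candidate columns, else ''."""
    for name in candidates:
        value = row.get(name, "")
        if value not in ("", None):
            return value
    return ""

# Made-up rows: the id shows up under a different field in each
rows = [
    {"sid": "1001", "sid_HCAHPS": "", "sid_CGCAHPS": ""},
    {"sid": "", "sid_HCAHPS": "1002", "sid_CGCAHPS": ""},
    {"sid": "", "sid_HCAHPS": "", "sid_CGCAHPS": "1003"},
]
print([coalesce_sid(row) for row in rows])
```

Unlike the cell references in the formula, the column names do not shift when columns are inserted or deleted.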
The next step is to check the columns, because there may be a redundant HCAHPS section of columns (due to a second survey being sent and then returned, coded as Wave 2). If so, delete the second set of columns, "sid_HCAHPS" through "language".
The next step is the largest alteration, because we have set up a system where we send this information to our database admins as a seven-worksheet Excel file. It is loaded by an MS Access query that creates a table from each sheet, which then gets loaded into our proprietary business intelligence software. All done!
Is your question "can VB.NET automate our current analyst tasks?" If so, then yes.
You could use the StreamReader class to get data from your csv
(http://msdn.microsoft.com/en-us/library/system.io.streamreader.aspx)
Then store it either in an array, as you mentioned, or use the List class
(http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx)
Once you've got all your data stored you'll need to automate Excel. This is quite straightforward, but here's a link to get you started with that as well: http://support.microsoft.com/kb/301982/en-gb
With the List class you can create a list of custom objects using either classes or structures, e.g.
We define a structure:
Structure rowOfData
    Public intPrimaryKey As Integer
    Public strIceCreamName As String
    Public decPrice As Decimal
End Structure
We can then create a rowOfData and set its fields:
Dim iceCream1 As rowOfData
iceCream1.intPrimaryKey = 1
iceCream1.strIceCreamName = "Mr Whippy"
iceCream1.decPrice = 0.99D 'the D suffix marks a Decimal literal
We create a list with:
Dim listOfIceCreams as New List(of rowOfData)
And add to it like this:
listOfIceCreams.Add(iceCream1)
listOfIceCreams.Add(iceCream2)
etc.
And access the members of the list like this:
listOfIceCreams(0).decPrice 'gives us the price of the ice Cream that was added to the list first.
Lists also have a lot of other useful methods that arrays don't. Have a look through the MSDN List class link above to see if anything jumps out at you that you might need.