Import columns to existing OpenRefine project - openrefine

How do I add a column from an external .csv file to an existing project?
I tried to find the solution online, but I wasn't successful.

Using the file you provided, I did this in less than one minute.
I had a project with one column.
If you know a little Python, try Jython. Go to Edit column > Add column based on this column and choose Language: Jython, like this:
import csv
# We use DictReader to turn each imported row into a dict,
# so we can later refer to the column we want by its key, i.e. its header.
rows = csv.DictReader(open('/home/yourusername/Downloads/example.csv'), delimiter=",")
for row in rows:
    return row['Comprar']  # 'Comprar' is the header of the column I want
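The snippet above returns on the first iteration, so each cell receives the value from the first row of the CSV. If the goal is to match each project row to a CSV row through a shared key, one possible sketch follows. It is plain Python with inline sample data in place of the real file, and it assumes a hypothetical key column named `id` next to `Comprar`. In OpenRefine the current cell is available as `value`, so the expression would end with `return lookup.get(value)`.

```python
import csv
import io

# Inline sample standing in for the external CSV file.
# The "id" key column is a hypothetical name; "Comprar" comes from the answer above.
SAMPLE_CSV = "id,Comprar\nA1,yes\nA2,no\n"


def build_lookup(csv_file, key_column, value_column):
    """Map each key in key_column to the matching value in value_column."""
    reader = csv.DictReader(csv_file, delimiter=",")
    return {row[key_column]: row[value_column] for row in reader}


lookup = build_lookup(io.StringIO(SAMPLE_CSV), "id", "Comprar")

# In OpenRefine, `value` would be the current cell; here it is simulated.
for value in ["A1", "A2", "missing"]:
    print(value, lookup.get(value))
```

Keys that do not appear in the CSV yield `None`, which leaves the new cell empty rather than raising an error.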

Related

Order of the columns in Apache Zeppelin is wrong when selecting the data from a temporary table, how to put a specific column first?

Currently our Scala DataFrame output shows the id value first (even though it is chronologically added to the DataFrame last). The other columns appear dynamically, based on the .pivot() function and the data.
When I query the data in the %sql interpreter, the order changes, so the CSV file that I download also has the id column last, which doesn't work for me. I can't simply write the select statement with the id column placed first by hand, because I can't control the other columns produced by the pivot. Is there any other way to make a specific column come first?
The Scala paragraph is:
resultMean.registerTempTable("mean")
The sql paragraph is:
%sql
select *
from mean
For anyone reading this in the future: the cause of this behavior was misuse of the DataFrame. In Scala, .show() was applied to one DataFrame, while the temp table was registered from another. If you run into the same problem, double-check that you are applying your methods to the same object.

Google BigQuery Import csv Using Console - Use first row as header

I have a CSV file with one column that I want to import into my BigQuery environment. When I use the Console to import the data, it always treats my first row as a data row rather than a column name. Is there a way in the Console to ensure the first row is always used as the column name?
E.g.
Tk Number
Tk - 0001
Tk - 0002
With CSV input, if the first row is a string and the other rows are integers, BigQuery automatically takes the first row as the header, provided you checked the auto-detect schema option while creating the table.
But since you have strings in the header as well as in the body, you will need to specify the schema manually while creating the table in BigQuery. Then, in the advanced options, you can set the number of rows to skip under the 'Header rows to skip' option.

Is there a quick way to subset columns in PANDAS?

I am trying to set up a pandas project that I can use to compare Excel and CSV files over time and return the differences. Currently I load the Excel/CSV files into pandas and assign them a "version" column, because in my last step I want the program to create a file containing only what has changed in the "new" version file, so that I do not have to update the entire database, only the data points that have changed.
import pandas as pd
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one, then drop duplicate rows based on all the columns in the original files. I have to subset the columns because if the program looks at my added version column, the rows will not be seen as duplicates. The statement is listed below:
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to subset the data. The way I currently have it set up, I have to list each column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want to exclude 1 column. Is there a way to populate all the column titles and remove the ones I do not want looked at? Or is there a way to enter the columns I DO NOT want compared in the drop_duplicates command, instead of entering all the columns except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data that I am working with as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!
If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Pass the list as the subset argument of drop_duplicates().
If there are more columns to remove than to keep, add the wanted columns to the list by hand instead.
With a list, you won't need to write the names out every time.
How to iterate over a list:
columns = ['first', 'second', 'third']  # avoid naming a variable "list", which shadows the built-in
for name in columns:
    print(name)
# Prints first, second and third, one per line
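The list-based approach can also be turned around, so that only the columns to ignore are listed. A minimal sketch, assuming a hypothetical helper named `columns_except` and a shortened header list borrowed from the question:

```python
def columns_except(all_columns, excluded):
    """Return the column names in their original order, minus the excluded ones."""
    excluded = set(excluded)
    return [name for name in all_columns if name not in excluded]


# Headers taken from the question, plus the added "version" column.
headers = ["UWI", "Operator", "Mineral Leases", "Surface Leases", "version"]

subset = columns_except(headers, ["version"])
print(subset)

# With pandas, the list would be built from the DataFrame itself:
# df2 = df1.drop_duplicates(subset=columns_except(df1.columns, ["version"]))
```

Because the list is derived from the headers that are actually present, the same statement should work for sheets with different columns, as long as the added column keeps the same name.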

Reading xlsx into R and creating new header

I'm a newbie to R.
I read in an Excel file for a survey, but I started reading observations from the 3rd row of the Excel file, because the survey download puts two header rows at the top: the first row holds the question string (for all questions), and the second row holds the multiple-choice options (each option gets its own column, except the first option, which is listed in the second row of the same column as the question in the first row).
So now, my dataframe starts with Row 3.
But now I need to create custom variable names - ie. new variable names for each column before I manipulate further. I'm looking for tips on how to best accomplish this.
What I am thinking:
Create an Excel file with the variable names, and then use this as the header. I'm not quite sure which code I would use to do this.
Code the names as an empty dataframe, and then somehow merge this so the empty dataframe column names are the column names for the file I imported.
I would appreciate some suggestions on how best to do this!

Import CSV with Dynamic Columns

My vendor is providing a CSV file where the columns (names included on the first line) are dynamic, meaning they only appear if there is data in them, and there is no guarantee about the order in which the columns are provided.
I am looking to understand the best approach to importing such a horrible file.
I considered using FileHelpers.net with optional fields, but the issue with this is that the column order can change.
You can build a FileHelpers class on the fly, then use that with the engine to import the CSV. If you import as a DataTable, you can then check whether a column exists and populate your database from it, or do whatever else you need with these columns.
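The underlying idea, reading the header first and then addressing columns by name instead of by position, can be sketched outside FileHelpers as well. The example below uses Python's `csv.DictReader` rather than the FileHelpers API, and the column names and the `read_dynamic_csv` helper are hypothetical:

```python
import csv
import io

# Columns the importer knows about; the names are hypothetical.
KNOWN_COLUMNS = ["sku", "price", "colour"]


def read_dynamic_csv(csv_file, known_columns):
    """Read rows by header name, so column order and missing columns do not matter."""
    reader = csv.DictReader(csv_file)
    present = set(reader.fieldnames or [])
    rows = []
    for row in reader:
        # Columns absent from this file are filled with None.
        rows.append({name: row[name] if name in present else None
                     for name in known_columns})
    return rows


# Two vendor files with different column sets and orders.
file_a = io.StringIO("price,sku\n9.99,X1\n")
file_b = io.StringIO("sku,colour,price\nX2,red,4.50\n")

print(read_dynamic_csv(file_a, KNOWN_COLUMNS))
print(read_dynamic_csv(file_b, KNOWN_COLUMNS))
```

Every returned row has the same keys in the same order, which makes the later database insert independent of how the vendor laid out a particular file.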