I am totally new to Python/pandas and am trying to do a simple task: splitting a comma-delimited text field in a DataFrame into multiple columns. The file is originally in Excel. The field is not left-justified, so I left-justify it using the following command:
df=df.style.set_properties(**{'text-align': 'left'})
This command does left-justify the column, but it also transforms the DataFrame into a Styler object, which then doesn't allow me to execute commands that I normally would on a DataFrame. I did use the df = df.data command after the transformation, and that does seem to change it back to a DataFrame, but the field reverts to a right-justified column. Thanks for any assistance.
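For what it's worth, the Styler only affects display, not the underlying data, so the split can be done on the DataFrame itself before any styling. A minimal sketch, assuming the comma-delimited column is named 'field' and the file is 'source.xlsx' (both placeholders):

import pandas as pd

df = pd.read_excel('source.xlsx')  # placeholder file name

# Split the comma-delimited column into new columns on the DataFrame itself
split_cols = df['field'].str.split(',', expand=True)
df = df.join(split_cols.add_prefix('part_'))

# Styling is display-only; apply it last and keep it under a separate name
styled = df.style.set_properties(**{'text-align': 'left'})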
Is there a way to have PyCharm show all the columns on one line instead of splitting them? For some reason it is only showing a few columns and then wrapping the rest below (marked with a '\'), even though there is a lot of white space that could be utilized.
I currently have the pandas max-columns option set, which does show all the columns, but I wish I could see them all on one line. Maybe there is another option I can set, but it seems this is likely a function of PyCharm.
pd.set_option('display.max_columns', None)
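A hedged suggestion: the wrapping is done by pandas itself based on its display width, so widening or disabling it may keep everything on one line regardless of the IDE:

import pandas as pd

# Show all columns
pd.set_option('display.max_columns', None)
# Don't wrap the frame repr onto continuation lines
pd.set_option('display.expand_frame_repr', False)
# Auto-detect the console width (or set a large fixed value instead of None)
pd.set_option('display.width', None)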
Here's a distilled version of what we're trying to do. The transformation step is a "Table Input":
SELECT DISTINCT ${SRCFIELD} FROM ${SRCTABLE}
We want to run that SQL with variables/parameters set from each line in our CSV:
SRCFIELD,SRCTABLE
carols_key,carols_table
mikes_ix,mikes_rec
their_field,their_table
In this case we'd want it to run the transformation three times, once for each data line in the CSV, to pull unique values from those fields in those tables. I'm hoping there's a simple way to do this.
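(To make the goal concrete, here's the per-row behavior we're after as a rough Python sketch; the database connection is a placeholder, and the plain string substitution is just illustration, since table/field names can't be bound parameters:)

import csv

def run_for_each_row(conn, csv_path):
    # conn is an assumed open DB-API connection (placeholder)
    with open(csv_path, newline='') as f:
        for params in csv.DictReader(f):
            sql = "SELECT DISTINCT {field} FROM {table}".format(
                field=params['SRCFIELD'], table=params['SRCTABLE'])
            cur = conn.cursor()
            cur.execute(sql)
            print(params['SRCTABLE'], params['SRCFIELD'], cur.fetchall())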
I think the only difficulty is that we haven't stumbled across the right step/entry and the right settings.
Poking around in a "parent" transformation, the approaches we had the highest hopes for were:
We tried chaining CSV file input to Set Variables (hoping to feed the Transformation Executor one line at a time), but that complains when we have more than one line from the CSV.
We tried piping CSV file input directly to Transformation Executor, but that only sends TE's "static input value" to the sub-transformation.
We also explored using a job with a Transformation object; we were very hopeful to stumble onto what "Execute every input row" applies to, but we haven't figured out how to pipe data to it one row at a time.
Suggestions?
Aha!
To do this, we must create a JOB with TWO TRANSFORMATIONS. The first reads "parameters" from the CSV and the second does its duty once for each row of CSV data from the first.
In the JOB, the first transformation is set up like this:
Options/Logging/Arguments/Parameters tabs are all left as default
In the transformation itself (right click, open referenced object->transformation):
Step1: CSV file input
Step2: Copy rows to result <== that's the magic part
Back in the JOB, the second transformation is set up like so:
Options: "Execute every input row" is checked
Logging/Arguments tabs are left as default
Parameters:
"Copy results to parameters" is checked
"Pass parameter values to sub transformation" is checked
Parameter: SRCFIELD; Parameter to use: SRCFIELD
Parameter: SRCTABLE; Parameter to use: SRCTABLE
In the transformation itself (right click, open referenced object->transformation):
Table input "SELECT DISTINCT ${SRCFIELD} code FROM ${SRCTABLE}"
Note: "Replace variables in script" must be checked (with the first data row of the CSV below, for example, the step would run SELECT DISTINCT country code FROM person_rec).
So the first transformation gathers the "config" data from the CSV and, one record at a time, passes those values to the second transformation (since "Execute every input row" is checked).
So now with a CSV like this:
SRCTABLE,SRCFIELD
person_rec,country
person_rec,sex
application_rec,major1
application_rec,conc1
status_rec,cur_stat
We can pull distinct values for all those specific fields, and lots more. And it's easy to maintain which tables and which fields are examined.
Expanding this idea to a data flow where the second transformation updates code fields in a datamart isn't much of a stretch:
SRCTABLE,SRCFIELD,TARGETTABLE,TARGETFIELD
person_rec,country,dim_country,country_code
person_rec,sex,dim_sex,sex_code
application_rec,major1,dim_major,major_code
application_rec,conc1,dim_concentration,concentration_code
status_rec,cur_stat,dim_current_status,cur_stat_code
We'd need to pull the unique ${TARGETTABLE}.${TARGETFIELD} values as well, use a Merge rows (diff) step, use a Filter rows step to find only the 'new' ones, and then an Execute SQL script step to update the targets.
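Outside the tool, the diff-and-update logic for one config row would look roughly like this Python sketch (the connection, the paramstyle, and the bare INSERT are assumptions, not the actual steps):

def sync_codes(conn, srctable, srcfield, targettable, targetfield):
    # conn is an assumed open DB-API connection; paramstyle (%s) varies by driver
    cur = conn.cursor()
    cur.execute("SELECT DISTINCT {f} FROM {t}".format(f=srcfield, t=srctable))
    source_vals = {r[0] for r in cur.fetchall()}
    cur.execute("SELECT DISTINCT {f} FROM {t}".format(f=targetfield, t=targettable))
    target_vals = {r[0] for r in cur.fetchall()}
    # 'Merge rows (diff)' + 'Filter rows': keep only values missing from the target
    for new_val in source_vals - target_vals:
        cur.execute(
            "INSERT INTO {t} ({f}) VALUES (%s)".format(t=targettable, f=targetfield),
            (new_val,))
    conn.commit()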
Exciting!
Currently we have the Scala DataFrame output with the id column shown first (though it is chronologically added to the DataFrame last). The other columns appear dynamically, based on the .pivot() function and the data.
When I query the data in the %sql interpreter, the order changes, so the CSV file that I download also has the id column last, which doesn't work for me. I can't simply write the select statement with the id column first manually, because I can't control the other columns due to the pivot. Is there any other way to make a specific column go first?
The Scala paragraph is:
resultMean.registerTempTable("mean")
The sql paragraph is:
%sql
select *
from mean
For anyone who reads this in the future: the reason for this behavior was misuse of the DataFrame. In Scala, .show() was applied to one DataFrame, while the export to the temp table used another one. If you face the same issue, please double-check that you are applying your methods to the same objects.
I have a CSV file which contains data with new lines within a single row, i.e. one row's data comes in two lines, and I want to get the new-line data into the respective columns. I've loaded the data into SQL, but now I want to merge the second line's data into the first row, under the respective columns.
I wouldn't recommend fixing this in SQL, because this is an issue with the CSV file. The issue is that the file contains new lines, which cause rows to split.
I strongly encourage fixing the CSV files if possible. It's going to be difficult to fix this in SQL, given that there are going to be more cases like it.
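If you can fix the file upstream, one heuristic is to re-join physical lines until each logical row has the expected number of fields. A rough Python sketch, where the expected column count and the file names are assumptions:

import csv

EXPECTED_COLS = 5  # assumption: the real number of columns in the file

with open('broken.csv', newline='') as src, open('fixed.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    pending = ''
    for line in src:
        # Accumulate physical lines until a full logical row is assembled
        pending = (pending + ' ' + line.strip()).strip()
        parsed = list(csv.reader([pending]))
        fields = parsed[0] if parsed else []
        if len(fields) >= EXPECTED_COLS:
            writer.writerow(fields)
            pending = ''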
If you're doing the import with SSIS (or have the option of doing so, if you aren't currently), the package can be configured to handle embedded carriage returns.
Define your file import connection manager with the columns you're expecting.
In the connection manager's Properties window, set the AlwaysCheckForRowDelimiters property to False. The default value is True.
By setting the property to False, SSIS will ignore mid-row carriage return/line feeds and will parse your data into the required number of columns.
Credit to Martin Smith for helping me out when I had a very similar problem some time ago.
I spent some time Googling, but couldn't find anything useful.
How do I select all the values of a single column in OpenRefine in a script? It seems that all the operations are row-wise. In particular, I want to find the highest and lowest values in a column.
By default, OpenRefine's computation functionality is limited. The Stats Extension provides basic stats per column (min, max, average, median, ...).
Facets will give you a list of all the values in a column, so the simplest way of getting the lowest/highest values is to create a facet on the column and read the highest/lowest values off the facet.
However I'm not sure if this meets your criteria for selecting the values 'in a script'. By this I assume you mean you want to be able to access the lowest/highest values in a GREL expression?
You can do this, but you have to force OpenRefine to treat all the rows in the project as part of a single record. The easiest way to do this is usually to add a column at the start of the project which is empty except for the first cell, which contains a value.
Once you've done this, you can access all the values in a column by using syntax like:
row.record.cells["Column name"].value
See also my answer to OpenRefine - Fill between cells but not at the end of the list, which uses the same technique.
Further explanation:
Create a new column at the start of your project and put a single value in the very first cell in that column
Switch to Record mode
At this point you should have a single 'Record' in your project.
At this point, using syntax like row.record.cells["Column 1"].value gives you an array of all the values in "Column 1". You can then use GREL expressions to manipulate the array, including sorting it or comparing values, so the lowest and highest values can be read from the ends of the sorted array.
A Text Facet has a nice undocumented option that gives you aggregated results for a column that you can just copy and paste.
Click on the "X choices" in the upper left corner of the Text Facet box.
This will bring up a separate dialog that contains the values along with the count of each value in that column.
(If you're looking to just get ALL the values of a single column, use Export -> Custom Tabular Exporter, select and order the columns to export by clicking the checkboxes, then click the Download tab to choose your export format and click the Download button.)