Aligning text columns of different size and content

Aligning text columns of different size and content - optimization

In a past posting, I asked about commands in Bash to align text columns against one another by row. It has become clear to me that the desired task (i.e., aligning text columns of different size and content by row) is much more complex than initially anticipated and that the proposed answer, while acceptable for the past posting, is insufficient on most empirical data sets. Thus, I would like to query the community on the following pseudocode. Specifically, I would like to know if and in what way the following pseudocode could be optimized.
Assume a file with n columns of strings. Some strings might be missing, others might be duplicated. The longest column may not be the first one listed in the file, but shall be the reference column. The order of the rows of this reference column must be maintained.
> cat file # where n=3; first row contains column headers
CL1 CL2 CL3
foo foo bar
bar baz qux
baz qux
qux foo
    bar
Pseudocode attempt 1 (totally inadequate):
Shuffle columns so that columns are ordered by size (i.e., longest column is first in matrix)
Rownames = strings of first column (i.e., of longest column)
For rownames
    For (colname among columns 2:end)
        if (string in current cell == rowname) {keep string in location}
        if (string in current cell != rowname) {
            if (string in current cell == rowname of next row) {add row to bottom of table; move each string of current column one row down}
            if (string in current cell != rowname of next row) {add row to bottom of table; move each string of all other columns one row down}
        }
Order columns by size:
> cat file_columns_ordered_by_size
CL2 CL1 CL3
foo foo bar
baz bar qux
qux baz
foo qux
bar
Sought output:
> my_code_here file_columns_ordered_by_size
CL2 CL1 CL3
foo foo
    bar bar
baz baz
qux qux qux
foo
bar

Edit: Ugh, this doesn't produce the output you wanted. I guess I don't understand the problem. Maybe it will help, anyway.
If you don't mind slurping the entire table into memory, associative arrays (hashes) would work. (Or you can use trees, maps, dictionaries, etc.) There would be one for each column, mapping strings (found in the cells of that column) to the number of times that string is found in that column. Let's name the hashes after their column headers. After slurping, they would look something like this:
CL2 = {'foo':2, 'baz':1, 'bar':1, 'qux':1}
CL1 = {'foo':1, 'baz':1, 'bar':1, 'qux':1}
CL3 = {'bar':1, 'qux':1}
# Store the columns in an array
columnCounts = [CL2, CL1, CL3]
Then write a loop that produces the output, deleting from the associative arrays at each iteration:
while (columnCounts still has at least one non-empty hash) {
    key = the hash-key that is present in most (a plurality) of the hashes
    for each hash in columnCounts {
        if the key is in the hash {
            print key
            Decrement hash[key]
        }
        else {
            print whitespace
        }
    }
    print newline
}
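The loop above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the function name align_by_plurality is made up for illustration, ties between equally common keys are broken alphabetically (the pseudocode does not say how), and a count that reaches zero is treated as a deleted key. As the edit at the top of this answer admits, it does not preserve the row order of the reference column.

```python
from collections import Counter


def align_by_plurality(columns):
    """Return aligned rows for a list of columns (each a list of strings, no header).

    Hypothetical helper: each output row holds one key, placed in every
    column that still contains it, and "" where the column lacks it.
    """
    counts = [Counter(col) for col in columns]
    rows = []
    while any(counts):  # an empty Counter is falsy
        # For every remaining key, count how many columns still contain it.
        presence = Counter(key for c in counts for key in c)
        # Assumption: most widespread key first, alphabetical order on ties.
        key = min(presence, key=lambda k: (-presence[k], k))
        row = []
        for c in counts:
            if key in c:
                row.append(key)
                c[key] -= 1
                if c[key] == 0:
                    del c[key]  # "deleting from the associative arrays"
            else:
                row.append("")
        rows.append(row)
    return rows


if __name__ == "__main__":
    cl2 = ["foo", "baz", "qux", "foo", "bar"]
    cl1 = ["foo", "bar", "baz", "qux"]
    cl3 = ["bar", "qux"]
    for row in align_by_plurality([cl2, cl1, cl3]):
        print("\t".join(row))
```

Each printed row contains a single string, so equal strings end up side by side, but the rows come out in plurality order rather than in the order of CL2.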

Related

R code for matching multiple strings in two columns and returning into a third separated by a comma

I have two data frames. The first df includes columns b and c (first and job below), each of which holds multiple strings separated by commas. The second has three columns: one that includes all strings in column b, one that includes all strings in column c, and a third that holds the resulting string I want to use.
x <- data.frame("uuid" = 1:2, "first" = c("jeff,fred,amy","tina,cat,dog"), "job" = c("bank teller,short cook, sky diver, no job, unknown job","bank clerk,short pet, ocean diver, hot job, rad job"))
x1 <- data.frame("meta" = c("ace", "king", "queen", "jack", 10, 9, 8,7,6,5,4,3), "first" = c("jeff","jeff","fred","amy","tina","cat","dog","fred","amy","tina","cat","dog"), "job" = c("bank teller","short cook", "sky diver", "no job", "unknown job","bank clerk","short pet", "ocean diver", "hot job", "rad job","bank teller","short cook"))
The result would be
result <- data.frame("uuid" = 1:2, "combined" = c("ace,king,queen,jack","5,9,8"))
Thank you in advance!
I tried to beat my head against the wall and it didn't help
Edit- This is the first half of the puzzle BUT it does not search for and then concat the strings together in a cell, only returns the first match found rather than all matches.
Is there a way to exactly match a string in one column with couple of strings in another column in R?

How to transpose columns when they encode multiple "records"?

I have a spreadsheet I have imported into OpenRefine. The creator encoded groups of information (records) in columns. I need to bring each of those groups of columns into its own row, along with all the relevant columns.
Using a simplified example, how would I go from this:
id foo1 foo2 foo3 bar1 bar2 bar3
1 4 6 a 7 9 b
2 5 5 a 8 8 b
3 6 4 a 9 7 b
To this:
id foobar1 foobar2 foobar3
1 4 6 a
1 7 9 b
2 5 5 a
2 8 8 b
3 6 4 a
3 9 7 b
I've been trying to think of a way forward with intermediate columns, but there are 6 groups of 5 columns and I'm currently stuck.
I found a solution. The steps are:
Concat each group of columns into a single column (FOO_CONCAT, BAR_CONCAT)
Delete the now unneeded columns (foo1..3, bar1..3)
Transpose your CONCAT columns into a single column, no prefix, ignoring blanks, filling down other columns
Now FOO_CONCATs and BAR_CONCATs are all in the same column
Split that column into several columns...(using the separator you used in step 1)
Rename columns
Strip out prefixes (I had foo1:4, bar2:8, etc for clarity)
Transform to numbers (Edit cells -> Common Transforms -> toNumber)
Now you're ready to transpose, facet, etc.

I think this is essentially the same as the solution you describe, but possibly with some shortcuts to avoid all the steps.
Given the example data you post I would:
On the "Id" column select Edit column->Add column based on this column from the menu
Make new column name "foobar"
Use the GREL forEach(row.columnNames,cn,if(cn.startsWith("foo"),cells[cn].value,null)).join("|")+"~"+forEach(row.columnNames,cn,if(cn.startsWith("bar"),cells[cn].value,null)).join("|")
Once new "foobar" column exists, on this column use menu option Edit cells->Split multi-valued cells using the "~" character (as used in the GREL above)
Then, also on the "foobar" column, use menu option Edit columns->Split into several columns, using the "|" character as in the GREL above
Finally on ID column use menu Edit cells->Fill down
This should result in the output you describe - if you don't need the original columns at this point you can either remove them, or (sometimes quicker) export the first X columns that have the reconfigured data using the custom tabular exporter, and then import that data into a new project.
You can modify the GREL to deal with the exact column groupings you have. In my example I've used the column naming to group the values, but if that isn't the reality of the data you are dealing with you can use GREL like:
forEach(row.columnNames.slice(1,4),cn,cells[cn].value).join("|")+"~"+forEach(row.columnNames.slice(4,8),cn,cells[cn].value).join("|")
Which uses the 'slice' function to select certain columns rather than using some aspect of the column name to select them.
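For readers who want to check the target shape outside OpenRefine, the same reshaping can be sketched in plain Python. This is an illustration only, not part of the OpenRefine workflow: the function name split_column_groups is hypothetical, and it assumes the first cell of each row is the id and that the remaining cells form consecutive groups of equal size.

```python
def split_column_groups(rows, group_size):
    """Turn each group of `group_size` cells into its own row, repeating the id.

    Assumption: rows look like [id, g1c1, g1c2, ..., g2c1, g2c2, ...].
    """
    out = []
    for row in rows:
        row_id, values = row[0], row[1:]
        for start in range(0, len(values), group_size):
            out.append([row_id] + values[start:start + group_size])
    return out


if __name__ == "__main__":
    data = [
        ["1", "4", "6", "a", "7", "9", "b"],
        ["2", "5", "5", "a", "8", "8", "b"],
        ["3", "6", "4", "a", "9", "7", "b"],
    ]
    print("id foobar1 foobar2 foobar3")
    for record in split_column_groups(data, 3):
        print(" ".join(record))
```

With the real data (6 groups of 5 columns), group_size would be 5.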

One ggplot from two data frames (1 bar each)

I was looking for an answer everywhere, but I just couldn't find one to this problem (maybe I was just too stupid to use other answers, because I'm new to R).
I have two data frames with different numbers of rows. I want to create a plot containing a single bar per data frame. Both should have the same length and the count of different variables should be stacked over each other. For example: I want to compare the proportions of gender in these two data sets.
t1<-data.frame(cbind(c(1:6), factor(c(1,2,2,1,2,2))))
t2<-data.frame(cbind(c(1:4), factor(c(1,2,2,1))))
1 represents male, 2 represents female
I want to create two bars next to each other showing that the proportion of genders is 2:4 in the first data frame and 2:2 in the second one.
My attempt looked like this:
ggplot() + geom_bar(aes(1, t1$X2, position = "fill")) + geom_bar(aes(1, t2$X2, position = "fill"))
That leads to the error: "Error: stat_count() must not be used with a y aesthetic."

First you should merge the two data frames. You need a variable that identifies the origin of the data, so add a column with an ID (like t1 and t2) to both data frames. Keep in mind that the column names are the same in both frames, so you will be able to use the function rbind.
t1$data <- "t1"
t2$data <- "t2"
t <- (rbind(t1,t2))
Now you can make the plot:
ggplot(t[order(t$X2),], aes(data, X2, fill=factor(X2))) +
geom_bar(stat="identity", position="stack")

AWK: Ignore lines grouped by an unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to firstly group lines with a unique value in the first field and accumulate the occurrences of a specific value in the other field in the underlying group of lines. If the sum of occurrences doesn't meet the self-defined threshold, the lines in the group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that "because the unique values in the first field 222 and 444 don't have at least one (which can be any desired threshold) P in the third field, lines corresponding to 222 and 444 are ignored."
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. As a result, a few lines will not appear in the resulting split files.

I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field and a count c, which we use later. If the third field contains a P, set a key in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for the value, then print the lines from a.
You mention a threshold number of entries in your question. If by that, you mean that there must be N occurrences of "P" in order for the lines to be printed, you could change {p[$1]} to {++p[$1]}, then change if(i in p) to if(p[i]>=N) in the END block.
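The same grouping and threshold logic can be sketched in Python for comparison. This is a hedged illustration rather than a replacement for the awk one-liner: the function name filter_groups and its parameters are made up here, and unlike awk's for (i in c) loop, whose iteration order is unspecified, this version emits groups in the order they first appear.

```python
from collections import defaultdict


def filter_groups(lines, value="P", threshold=1):
    """Keep only groups (by first field) with at least `threshold` lines
    whose third field equals `value`."""
    groups = defaultdict(list)  # first field -> lines, in input order
    hits = defaultdict(int)     # first field -> occurrences of `value`
    for line in lines:
        fields = line.split(",")
        groups[fields[0]].append(line)
        if fields[2] == value:
            hits[fields[0]] += 1
    kept = []
    for key, group in groups.items():
        if hits[key] >= threshold:
            kept.extend(group)
    return kept


if __name__ == "__main__":
    sample = ["111,1,P,1", "111,1,M,1", "222,1,M,1", "333,1,P,0"]
    print("\n".join(filter_groups(sample)))
```

Raising threshold corresponds to the {++p[$1]} and if(p[i]>=N) change described above.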

Matplotlib table: individual column width

Is there a way to specify the width of individual columns in a matplotlib table?
The first column in my table contains just 2-3 digit IDs, and I'd like this column to be smaller than the others, but I can't seem to get it to work.
Let's say I have a table like this:
import matplotlib.pyplot as plt
fig = plt.figure()
table_ax = fig.add_subplot(1,1,1)
table_content = [["1", "Daisy", "ill"],
                 ["2", "Topsy", "healthy"]]
table_header = ('ID', 'Name','Status')
the_table = table_ax.table(cellText=table_content, loc='center', colLabels=table_header, cellLoc='left')
fig.show()
(Never mind the weird cropping, it doesn't happen in my real table.)
What I've tried is this:
prop = the_table.properties()
cells = prop['child_artists']
for cell in cells:
    text = cell.get_text()
    if text == "ID":
        cell.set_width(0.1)
    else:
        try:
            int(text)
            cell.set_width(0.1)
        except TypeError:
            pass
The above code seems to have zero effect - the columns are still all equally wide. (cell.get_width() returns 0.3333333333, so I would think that width is indeed the cell width... so what am I doing wrong?)
Any help would be appreciated!

I've been searching the web over and over again looking for solutions to similar problems. I've found some answers and used them, but I didn't find them quite straightforward. By chance I just found the table method get_celld when simply trying different table methods.
By using it you get a dictionary where the keys are tuples corresponding to table coordinates in terms of cell position. So by writing
cellDict=the_table.get_celld()
cellDict[(0,0)].set_width(0.1)
you will simply address the upper left cell. Now looping over rows or columns will be fairly easy.
A bit late answer, but hopefully others may be helped.
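Looping over one column with get_celld might look like the sketch below. The helper name set_column_width is an assumption made for illustration, and the Agg backend is selected only so the example runs without a display; the table itself is the one from the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt


def set_column_width(table, col, width):
    """Set the width of every cell whose column index equals `col`."""
    for (row, column), cell in table.get_celld().items():
        if column == col:
            cell.set_width(width)


fig = plt.figure()
table_ax = fig.add_subplot(1, 1, 1)
the_table = table_ax.table(
    cellText=[["1", "Daisy", "ill"], ["2", "Topsy", "healthy"]],
    colLabels=("ID", "Name", "Status"),
    loc="center",
    cellLoc="left",
)
# Column 0 holds the header cell (0, 0) and the ID cells (1, 0), (2, 0).
set_column_width(the_table, 0, 0.1)
```

Selecting cells by their (row, column) key avoids having to inspect the cell text at all.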

Just for completion. The column header starts with (0,0) ... (0, n-1). The row header starts with (1,-1) ... (n,-1).
---------------------------------------------
| ColumnHeader (0,0) | ColumnHeader (0,1) |
---------------------------------------------
rowHeader (1,-1) | Value (1,0) | Value (1,1) |
--------------------------------------------
rowHeader (2,-1) | Value (2,0) | Value (2,1) |
--------------------------------------------
The code:
for key, cell in the_table.get_celld().items():
    print(str(key[0]) + ", " + str(key[1]) + "\t" + str(cell.get_text()))

Condition text=="ID" is always False, since cell.get_text() returns a Text object rather than a string:
for cell in cells:
    text = cell.get_text()
    print(text, text == "ID")  # <==== here
    if text == "ID":
        cell.set_width(0.1)
    else:
        try:
            int(text)
            cell.set_width(0.1)
        except TypeError:
            pass
On the other hand, addressing the cells directly works: try cells[0].set_width(0.5).
EDIT: Text objects have a get_text() method themselves, so getting down to the string of a cell can be done like this:
text = cell.get_text().get_text()  # yup, looks weird
if text == "ID":
    cell.set_width(0.1)