In a previous post, I asked about Bash commands for aligning text columns against one another by row. It has since become clear to me that this task (i.e., aligning text columns of different size and content by row) is much more complex than I initially anticipated, and that the proposed answer, while acceptable for that post, is insufficient for most empirical data sets. I would therefore like to ask the community about the following pseudocode. Specifically, I would like to know whether and how it could be optimized.

Assume a file with n columns of strings. Some strings might be missing, others might be duplicated. The longest column may not be the first one listed in the file, but shall be the reference column. The order of the rows of this reference column must be maintained.

> cat file  # where n=3; first row contains column headers
CL1 CL2 CL3
foo foo bar
bar baz qux
baz qux
qux foo
    bar

Pseudocode attempt 1 (totally inadequate):

Reorder columns so that they are sorted by size (i.e., longest column is first in matrix)
Rownames = strings of first column (i.e., of longest column)
For rownames
  For (colname among columns 2:end)
    if (string in current cell == rowname) {keep string in location}
    if (string in current cell != rowname) {
      if (string in current cell == rowname of next row) {add row to bottom of table; move each string of current column one row down}
      if (string in current cell != rowname of next row) {add row to bottom of table; move each string of all other columns one row down}
    }

Order columns by size:

> cat file_columns_ordered_by_size
CL2 CL1 CL3
foo foo bar
baz bar qux
qux baz 
foo qux 
bar
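
A minimal sketch of this reordering step, assuming every column is exactly four characters wide (three letters plus a separating space, as in the sample) and that "size" means the number of non-empty cells. The function name `order_columns_by_size` and the `width` parameter are hypothetical and only serve the illustration.

```python
def order_columns_by_size(text, width=4):
    """Parse fixed-width columns and return (header, cells) pairs, fullest first.

    Assumption: every column is `width` characters wide and the first
    line holds the column headers.
    """
    lines = text.splitlines()
    headers = lines[0].split()
    columns = []
    for i, header in enumerate(headers):
        cells = [line[i * width:(i + 1) * width].strip() for line in lines[1:]]
        cells = [c for c in cells if c]  # drop empty cells, keep their order
        columns.append((header, cells))
    # sorted() is stable, so equally long columns keep their file order.
    return sorted(columns, key=lambda col: len(col[1]), reverse=True)


sample = (
    "CL1 CL2 CL3\n"
    "foo foo bar\n"
    "bar baz qux\n"
    "baz qux\n"
    "qux foo\n"
    "    bar\n"
)
for header, cells in order_columns_by_size(sample):
    print(header, cells)
```

For the sample file this lists CL2 first (five strings), then CL1 (four) and CL3 (two), matching the column order shown above.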

Sought output:

> my_code_here file_columns_ordered_by_size
CL2 CL1 CL3
foo foo 
    bar bar
baz baz    
qux qux qux
foo
bar
  • I know that learning Perl and keeping the knowledge fresh is a PITA, but you should start with that (or that dreaded awk) if you have such persistent use cases. It would probably take me less than 10 min. to code that in Perl. –  Nov 11 '16 at 21:33
  • @ThomasKilian The above task is anything but trivial and cannot be coded in 10 min. It looks simple, doesn't it? I initially thought the way you did, until I dug deeper and found designing efficient code for this task astonishingly tricky. FYI, I am pretty fluent in Python, R and several bash utilities. See my [GitHub](https://github.com/michaelgruenstaeudl) as evidence. – Michael Gruenstaeudl Nov 11 '16 at 22:25
  • This is a simple matrix you can map to a hash of hashes and then scan for rows and cols. Tired now. Let's see if I'll find time tomorrow. –  Nov 11 '16 at 23:27
  • How are the column widths defined? –  Nov 12 '16 at 08:44
  • There is something wrong with your `my_code_here file_columns_ordered_by_size`. CL2 no longer contains `bar` –  Nov 12 '16 at 12:49
  • @MichaelGruenstaeudl The amount of effort (e.g., '10 mins') does depend on proficiency. However, Perl formats and control flow will handle this adroitly. The learning curve is moderate, but if, as you say, you are familiar with Python and Bash, Perl isn't qualitatively more complicated. Thomas (and I) are simply pointing to a more specialized tool to meet your need. – Kristian H Nov 24 '16 at 15:28

1 Answer

OK. It took more like an hour than 10 minutes. And your requirements were not completely specified (which is normal, but don't expect the result to be 100% complete). So here is a piece of Python code for you:

# Map each distinct string to a small integer ID; 0 is reserved for an empty cell.
tokens = {'': 0}
tokenList = ['']

def addToken(token):
    """Return the ID for token, registering it on first sight."""
    token = token.strip()  # blank or whitespace-only cells become ''
    if token not in tokens:
        tokens[token] = len(tokenList)
        tokenList.append(token)
    return tokens[token]

headers = []     # column names from the first line
widths = []      # start offset of each column, plus one end sentinel
columnKeys = []  # per column: the token IDs it contains
usage = []       # per column: number of non-empty cells
rows = []        # per row: the token ID of every cell

with open("data") as f:
    for lineNo, line in enumerate(f):
        line = line.rstrip("\n")
        if lineNo == 0:
            # Column boundaries come from the header (fixed width, single-space separated).
            pos = 0
            for token in line.split(" "):
                columnKeys.append([])
                headers.append(token)
                widths.append(pos)
                pos += len(token) + 1
                usage.append(0)
            widths.append(pos)
            continue
        column = []
        for i in range(1, len(widths)):
            token = addToken(line[widths[i-1]:widths[i]-1])
            if token != 0: usage[i-1] += 1
            column.append(token)
            columnKeys[i-1].append(token)
        rows.append(column)

# Column indices ordered by number of non-empty cells, fullest first.
# The sort is stable, so ties keep their file order.
sortedKeys = sorted(range(len(usage)), key=lambda i: usage[i], reverse=True)

print(" ".join(headers[i] for i in sortedKeys))

for row in rows:
    token = row[sortedKeys[0]]
    mainToken = tokenList[token]
    line = mainToken
    for idx in sortedKeys[1:]:
        line += " "
        # Repeat the reference string if it occurs anywhere in this column, else pad with blanks.
        if token in columnKeys[idx]: line += mainToken
        else: line += " " * len(mainToken)
    print(line)

The output is

CL2 CL1 CL3
foo foo    
baz baz    
qux qux qux
foo foo    
bar bar bar

which is hopefully a starting point for you to complete the work.
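
One possible way to get from there to the sought output is to treat the task as a progressive sequence alignment: align the second column against the reference, then align each further column against the rows built so far. The following is a minimal sketch under stated assumptions, not a definitive implementation. It uses `difflib.SequenceMatcher` from the standard library, which finds a reasonable matching but does not guarantee a longest common subsequence. The function name `align_columns` is hypothetical, the columns are assumed to be already parsed into lists of non-empty strings with the reference column first, and placing unmatched reference rows before unmatched new cells is an arbitrary choice.

```python
from difflib import SequenceMatcher


def align_columns(columns):
    """Align columns row by row; columns[0] is the reference column.

    Returns a list of rows, each holding one cell per column
    (None marks a gap).
    """
    rows = [[cell] for cell in columns[0]]
    for n, column in enumerate(columns[1:], start=1):
        # All filled cells of a row hold the same string; use it as the row's key.
        keys = [next(c for c in row if c is not None) for row in rows]
        matcher = SequenceMatcher(None, keys, column, autojunk=False)
        aligned = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                aligned += [row + [cell]
                            for row, cell in zip(rows[i1:i2], column[j1:j2])]
            else:
                # Unmatched existing rows get a gap in the new column ...
                aligned += [row + [None] for row in rows[i1:i2]]
                # ... and unmatched new cells get a row of their own.
                aligned += [[None] * n + [cell] for cell in column[j1:j2]]
        rows = aligned
    return rows


columns = [
    ["foo", "baz", "qux", "foo", "bar"],  # CL2 (reference)
    ["foo", "bar", "baz", "qux"],         # CL1
    ["bar", "qux"],                       # CL3
]
print("CL2 CL1 CL3")
for row in align_columns(columns):
    print(" ".join((cell or "").ljust(3) for cell in row).rstrip())
```

On the sample data this keeps the order of the reference column and opens a gap row for the `bar` that CL1 and CL3 share before `baz`. Real data with many repeated strings may need a proper alignment scoring scheme rather than this greedy matching.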