Total up your columns using R

We use the aggregate() function:

?aggregate

Given a table,

title                 author                 genre        price
Ascerbic Castle       Constance Marigold     fiction      $17
Variegated Voltaire   Sir Ereben Ackerman    non-fiction  $24
Numenara              Monte Cook             role-playing $19
Eclipse Phase         ?                      role-playing $15

After you read in the table, you can total up the prices like so:

> x <- read.table...
> aggregate(price ~genre, data=x, sum)

Print a table off the DOM tree with Mojolicious

I use this and redirect to a aggregate HTML file for later processing – replacing tags with commas and munging it down to CSV format:

#!C:\strawberry\bin\perl\bin

use Mojolicious::DOM;
use File::Slurp 'slurp';

my $dom = scalar slurp ...

Sorry – I did not have the complete code up. I will try to have it tomorrow!

I had a comment at the top of the script that read, “Doesn’t work — resorted to bat.”

From Excel to R: baby steps

Suppose you are given a table of data, and it is stored in Desktop\dnd\pc.txt in CSV* format:

Row Class    Name            Level Absent  Player Type
1   Fighter  Glee Thock      1     Y       Hack and Slash
2   Wizard   Mara Alstace    3     N       Hack and Slash
3   Cleric   Sarl Pruen      1     N       Hack and Slash
4   Rogue    Thief Thiever   1     N       Role-play
5   Warlock  Suzy Q          20    NPC

In R, here are some processing commands:

> setwd("C:/Users/ettis/Desktop/dnd")
> x <- read.table("pc.txt", header=TRUE, sep=",", fill=TRUE)
> x <- x[, c(2,3,4)]
> x <- x[x$Absent %in% c("Y", "N"),]
> x <- x[!(x$Player.Type %in% c("Hack and Slash"),]
> x

This would output

Class    Name            Level 
Rogue    Thief Thiever   1

I set my work directory so that I could refer to files locally. I wouldn’t use fill=TRUE unless read.table() throws an error like “scan row x only has fields for c-1 columns.”

The next statement keeps columns 2 through four, which you can also specify (2:4) or similar.

The next two statements show how to progressively filter columns based on one or more criteria. According to StackOverflow, %in% ignores “na” values; this may affect you or not. (My AutoIt script imports from Excel with an SQL query and formats #N/A as the string UNDEFINED.)

I hope this helps you jump into R!

  • For clarity, I show the table above as tab-separated.

Diagnosing mergesort

The partitioning calculation can be printed first, giving the result of q = (p+r)/2 for each division of the array into its sub-arrays:

q = floor((p+r)/2)

q is 0 when there is one element, because of integer division of a fraction 1/2; then you’re at a single element. The function returns because p \ge r.

The second place to print is during assignment of the left and right subarrays; this shows you the merge order. For terminating single elements, L and R are one-element arrays (not counting sentinel value). These get “merged” because L and R get larger toward the end:

mergesort

 

“Larger toward the end?” Wat

batch scripts, perl, R, AutoIt: an eternal golden something

When Excel was my endpoint, I could lazy evaluate any sums by concatenating numbers as long expressions in the batch script:

set total==0
for /f "delims=," %%i in ('findstr ...
  set total=!total!+%%i

echo !field1!,!field2!,!total!>> csv-file.txt

In VBA, opening the file caused Excel to evaluate the long strings of =0+4+8+12 in the cell as formulas. I didn’t have to worry about adding numbers in the batch script itself.

It’s a different story with perl and R. First, I had to strip the equal signs:

for /f "tokens=*" %%i in (file.txt) do ( 
  for /f "delims== tokens=1-4" %%j in ("!line!") do ( 
    echo %%j%%k%%l%%m>> for-R.txt
  )
)

So cmd would strip the equals signs for me while dividing the line into chunks; I could keep my “x+y+z” expressions, and I would be home free. However, instead of evaluating the field as a single number, R read them as a “factor” type:

> setwd("C:/...")
> x <- read.table(for-R.txt, sep=",", header=TRUE)
> x$total[1]
[1] 0+4+6
Levels 0+4+6 ...

What?

> eval(expression(parse(text="0+4+6")))
10

Okay, opportunity knocked: I would use perl:

#!C:/strawberry/bin/perl/bin

use strict;
use warnings;

# ... code ...
open(my $filehdl, "<", "for-R.txt"); 
while (<$filehdl>) 
  my @fields = split(',', $_);
  my field1 = $fields[0]
  # ...
  my calc = eval($field1); # 0+4+6
  # ...
  print $line . "\n";
# ...

Back in the batch script,

@echo off
...
sum-cumul.pl !file! > for-R-2.txt

Rscript %RSCRIPTS%\gen-report.R

autoitmailer.exe arg1 arg2 arg3
...
endlocal
goto :eof

Instructive disruptions

I was asked to take down the web apps today – well, most notably the reddit clone. There were legal issues and lack-of-censorship issues. A transparent suggestion box with upvoting, unapproved by the higher-ups, was unacceptable.

The top voted responses were continuing education in technology and working a larger block in the week to have the fifth day off. Just having to shut down daemon proc made me really demoralized and helpless: I had to force myself to work on a different problem.

Fortunately, I made some headway into R and perl. Banging out simple web apps with Mojolicious gave me a good in to the framework’s host language; I wrote a script to sum numbers. Usually we take it for granted that numbers get evaluated “automatically,” but this time it was almost hell to get it figured out.

I rewrote the AutoIt emailer to include body text. I would reach the batch script limit quickly if the body text were on the command line; instead, accept an argument to a file path. Then you can FileOpen() and FileRead() in the comfort of the .au3.

AutoIt: mouse click unrecognized in a text field over a wide label

When you lay down your label and text field in AutoIt, be careful of passing in a width argument to the label:

GUICtrlCreateLabel("Name:", x, y, 100) ; dreaded width specified!
;                                 ^^^
Dim $inpName = GUICtrlCreateInput("", x+50, y, 100)

You won’t be able to click into the input text field except at the very right! You could tab into the field, but it was frustratingly elusive to think about.

A user caught this. The code looked like spaghetti to me. “All bugs are shallow” with enough eyes, even if it’s one extra pair.

Since labels autofit the initialized text, I don’t type in a width. I must have just been copy-pasting or something.

How do we get a logarithmic running time?

Merge sort divides an array of size n into n/2 smaller arrays by calculating the middle index. It does this recursively until each partition is one element. Then the algorithm merges them together by selecting the smaller number from each successive pile of left and right numbers.

Describing merge sort is like drawing out an upside-down tree: at the root you have the initial running time of the function T(n), and the last nodes are single elements: an element by itself is already sorted! Sorting a single element takes constant time; there is no real processing to be done. So you have n elements which each take c time for a width of c*n.

Then you imagine the tree having a height of (lg n + 1), which is calculating the number of nodes that a single-element tree has and proving the inductive hypothesis of a 2^i …, &tc, and you get T(n) as an area equal to cn * (lg n + 1) or cn * (cn lg n).

Logarithmic algorithms take less time than linear, so why do we pick the worst-case with the “faster” running time? The more dependent factor of a recursive solution may be the number of divisions needed. Division time D(n) is a quick calculation of floor(q) = (p+r)/2 of elements from p to q, and merge time C(n) takes n steps (linear) to compare two piles, each with n/2 or half the numbers.

I thought I could explain it, but I will have to go back and reread it. It is a fun read.

An AutoIt web driver with Mojolicious

For (small) purposes, you can use AutoIt as a ADODB connection to an Access backend, an executable invoked from Mojolicious daemon proc to return either output to pipe or a return code for success or failure. The latter I can use with /add or /update URLs to inform the user, and the output can be stuffed into a stash for template population.

You can’t DROP Access tables anyway, and SQL injection is a non-issue with a proc that doesn’t know how to process those statements (plus Mojolicious escapes things). It’s certainly not scalable, but for its use case it is straightforward.

Even a database like Access, with primary key integrity, gives you a lot of great opportunities. Two of them are

  • unique constraint
  • primary key

To store concurrent submissions; insert a dummy row and pull the @@IDENTITY with a select, and then write the user submission: copypasta1.txt, copypasta2.txt, etc.

Unique constraint lets you store usernames without collisions. Test the exit code from the AutoIt process to see if the insertion was successful.

How do I remember floor and ceiling?

Imagine a vertical number line, and put x at a point in the middle. Then the next point above and below x are the “caps” for x, ceiling and floor (respectively).

^                ^
|                |
+ ceil(x)        + 6
|                |
+ x              . 5.7
|                |
+ floor(x)       + 5
|                |
v                v

This makes more sense with real numbers (decimals), when x is desired to be an integer. For example, merge sort partitions the array and we need integers for the indices. You can’t get the 5.5th index of a number, although you might get that dividing an 11-element array.