Print workbook to stdout

For well-formatted workbooks, I can skip ssconvert and go with a straight SELECT * FROM $sheetName instead. This is preferable because ssconvert emits ,, for blank cells, whereas with ADODB I can test $recordSet.Fields($i).Value = "" and replace the blank with whatever I like, such as "UNDEFINED." That lets me avoid NULLs in the database export. I also get to skip the special case of commas inside double quotes, like

Name,Bio,Age
"John Smith","Doctor, historian, food buff",28

The Excel 12.0 OLEDB connection handles both .xls and .xlsx files. Combined with a snippet for GetTableSchemas(), a general purpose workbook exporter is now possible: print to stdout for batch scripting and automation.
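A minimal sketch of that exporter follows. The file path, sheet name, and Extended Properties string are assumptions (.xls wants a slightly different Extended Properties value), and GetTableSchemas() is left out:

; sketch: dump one worksheet to stdout as CSV, replacing blanks
Dim $sFile = "C:\data\workbook.xlsx" ; assumed path
Dim $sSheet = "Sheet1$"              ; assumed sheet name
Dim $sLine, $val, $i

Dim $conn = ObjCreate("ADODB.Connection")
$conn.Open("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" & $sFile & _
    ";Extended Properties=""Excel 12.0 Xml;HDR=YES"";")

Dim $rs = ObjCreate("ADODB.Recordset")
$rs.Open("SELECT * FROM [" & $sSheet & "]", $conn)

While Not $rs.EOF
  $sLine = ""
  For $i = 0 To $rs.Fields.Count - 1
    $val = $rs.Fields($i).Value
    If String($val) = "" Then $val = "UNDEFINED" ; blank or NULL cells become UNDEFINED
    $sLine &= $val
    If $i < $rs.Fields.Count - 1 Then $sLine &= ","
  Next
  ConsoleWrite($sLine & @CRLF)
  $rs.MoveNext()
WEnd

$rs.Close()
$conn.Close()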

My naive replacement for findstr was misguided; findstr is still way faster. I copied the files to the local machine's %TEMP% because I thought network latency was the issue, but that only saved a few minutes. I suspect StringSplit() cost more than StringInStr(), though I didn't measure it; both are slow compared to even a nested for /f %%i in ('findstr ...') do ( ) loop.


Use Access for your app’s local storage

AutoIt and ADODB can be used to issue SQL statements to an Access database. Since AutoIt can be compiled into a standalone executable, you can use an .accdb as a binary replacement for one or more text files. SQLite purports to do the same thing, but I didn't know enough C to glue a Win32 GUI to it; AutoIt is much easier for prototyping these things.
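A minimal sketch, assuming a hypothetical settings.accdb with a Settings table sitting next to the compiled script:

Dim $sDb = @ScriptDir & "\settings.accdb" ; hypothetical database file

Dim $conn = ObjCreate("ADODB.Connection")
$conn.Open("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" & $sDb & ";")

; write a value
$conn.Execute("UPDATE Settings SET val = 'dark' WHERE name = 'theme'")

; read it back
Dim $rs = ObjCreate("ADODB.Recordset")
$rs.Open("SELECT val FROM Settings WHERE name = 'theme'", $conn)
If Not $rs.EOF Then ConsoleWrite($rs.Fields("val").Value & @CRLF)

$rs.Close()
$conn.Close()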

Using a local database instead of plaintext files has some advantages. With Access 2007, you can encrypt the database. It's harder for a user to casually edit a collection of tables in a binary than a text file. You also get to use SQL to manipulate data instead of ad-hoc scripts to parse CSV. And there is something elegant about specifying a data model at table creation, controlling constraints with keys and uniqueness, versus parsing split strings.

In my tests, the user needs to have Access installed; that is the most straightforward setup. Otherwise, you may be able to download an add-on from Microsoft that provides the driver. I'm fortunate that most machines here have Access, so it's a non-issue.

When you have a good data model, a lot of problems become simpler. Most of today's list of issues were cosmetic: larger fonts, tab focus. There was one legitimate input bug, which is not bad at all.

One connection, multiple queries

MySQL has a last_insert_id() function that is really useful: it returns the auto-increment identity of the last inserted record. If you are using INSERT … ON DUPLICATE KEY UPDATE on a handful of key-value tables before inserting foreign-key refs (integers) into an aggregate table, is it faster to collect each ID from the key-value tables as you go, or to wait until the aggregate-table insert and do this:

Func Upsert()

; ... ADODB connection init, recordset, etc.

$query = ... ; ignore AutoIt boilerplate for now
; (table, column, and value names below are placeholders)

INSERT INTO agg_tbl (hostid, userid, osid, source)
  SELECT Host.ID, User.ID, OpSys.ID, 'SNIFFED'
  FROM Host, User, OpSys
  WHERE Host.name = 'somehost'
    AND User.name = 'someuser'
    AND OpSys.name = 'BeOS'

$rs.Open($query, $conn)
; ...
EndFunc

That INSERT query has to do lookups on multiple tables. Instead, you can track each key-value table's insert ID yourself:

Func UpsertSimpleTable()
  Dim $id = -1
  ; ... AutoIt initialization

  ; LAST_INSERT_ID(id) in the UPDATE clause makes the later
  ; SELECT return the existing row's id on a duplicate, too
  insert into opsys (name) values ('BeOS')
    on duplicate key update id = LAST_INSERT_ID(id);

  select last_insert_id() as new_id;

  $id = $rs.Fields('new_id').Value
  ; ...

  Return $id ; always returns a useful ID
EndFunc

And you could have something like

Dim $aAggInsert[3] = [ -1, -1, -1 ] ; host, user, opsys

... ; key-value upserts, tracking IDs

InsertAggregateRecord($aAggInsert)

No lookups, just a pure insert.

The post title mentions multiple queries. I couldn't do an INSERT; SELECT pair with one $rs.Open() call. So before $conn.Close() I call $rs.Open() twice: once for the INSERT … ON DUPLICATE KEY UPDATE and a second time for SELECT LAST_INSERT_ID().
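As a sketch, assuming a MySQL ODBC DSN named "mydb" and the opsys table from above (names are placeholders, and the string concatenation skips escaping):

Func UpsertOpSys($sName)
  Local $conn = ObjCreate("ADODB.Connection")
  Local $rs = ObjCreate("ADODB.Recordset")
  $conn.Open("DSN=mydb;")

  ; first query: the upsert itself (returns no rows, so $rs stays closed)
  $rs.Open("INSERT INTO opsys (name) VALUES ('" & $sName & "') " & _
      "ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);", $conn)

  ; second query on the same connection: fetch the id
  $rs.Open("SELECT LAST_INSERT_ID() AS new_id;", $conn)
  Local $id = $rs.Fields("new_id").Value
  $rs.Close()

  $conn.Close()
  Return $id
EndFunc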


Apache and push-ups

The Apache documentation is nicely organized and overwhelming. If documentation is a project concurrent with the code, we cannot expect the former to always match the latter: you can only work on one thing at a time. When the codebase stabilizes and the community adopts it, documentation can catch up with the code. Somehow I had lumped quality and documentation together, which is an unfair expectation: early docs are requirements specs; the sole user is the developer, and the relationship between coder and code is like a marriage.

The Apache httpd (HTTP daemon, or service, or server) is all about modularity. Even the core features are grouped into a "core" component. Configuration is handled with plaintext files. You enter a pact with the server by referencing variable names, called directives. Think of them as environment variables that won't pollute the system namespace: generalizations of values that would otherwise be hardcoded early on.
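For example, two stock directives as they might appear in httpd.conf (the values here are illustrative):

ServerRoot "/usr/local/apache2"
Listen 80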

When you push a variable up, you abstract.

Func HelloWorld()
  ConsoleWrite("Hello world!" & @CRLF)
EndFunc

Push up:

Func Hello($target)
  ConsoleWrite("Hello " & $target & "!" & @CRLF)
EndFunc

One more abstraction, even though it seems like we’re just replicating ConsoleWrite() – we’re using a library function for our own purposes:

Func PrintMsg($msg)
  ConsoleWrite($msg & @CRLF)
EndFunc

Now we can start a greeting card company:

PrintMsg("Happy 15th birthday, Sarah!")

Push up:

Func BirthdayGreeting($name, $age) 
  ; TODO handle ages 1-3 
  ; TODO $age = 0
  PrintMsg("Happy " & $age & " th birthday, " & $name & "!")
EndFunc

Have you ever watched a drawstream? It's where an artist draws something from scratch, live. You watch him throw down guidelines and gray curves, roughing out a final shape. Sometimes it's hard to tell whether we're seeing a face or a body. By the end, our eyes collapse the whole into a symbol: we see it. Just as with code, all we see is the functionality.

Let’s copy Apache:

Func ParseDirectives()
  If FileExists($gHTTPDConfigPath) = 0 Then
    ConsoleWriteError("Warning: httpd config not there!" & @CRLF)
    Return 1 ; err
  EndIf
  Local $fileHdl = FileOpen($gHTTPDConfigPath)
  While 1
    Local $line = FileReadLine($fileHdl)
    If @error = -1 Then ExitLoop
    Local $aSplit = StringSplit($line, " ")
    If $aSplit[0] < 2 Then ContinueLoop ; skip blank or one-word lines
    Local $directive = $aSplit[1]
    ; or opening tag, etc.
    Select
      Case $directive = "ServerRoot"
        $gServerRoot = $aSplit[2] ; ignore further indices
    EndSelect
  WEnd
  FileClose($fileHdl)
  Return 0
EndFunc

Anyway, after the config file is parsed, we have a system specified to our needs. A general-purpose HTTP server conforms to our requirements. This would have been harder if we had to re-compile Apache every time we wanted to change something.

The magic upsert

The MySQL statement INSERT INTO … ON DUPLICATE KEY UPDATE fits both use cases, insert if new and update if existing, and it also handles the do-nothing case by updating "id=id" (see StackOverflow).

Even with that, it still takes about three hours to populate the database. Why does it take so long? Most of the tables are key-value pairs: an auto-increment column paired with a UNIQUE-constrained value column. Each value has to be looked up so that its integer ID can be added to an aggregate table of numbers, a two-dimensional array of space-saving cleanliness.

None of the data is repeated; everything is relational. However, the INSERT query is bereft of any advantage: it must perform several lookups to get the correct IDs. No matter how fast the database is internally coded, it seems like a linear search to find each value across each table.

Since the data itself was exported from a database, the most likely case is that each row is already a unique entity in the aggregate table. If so, then retrieving the @@IDENTITY between inserts (from the key-value tables, not the table of foreign keys) would give me the handful of IDs!

Currently I have something like this:

INSERT INTO kv_tbl1 ... ON DUPLICATE KEY UPDATE ... id=id
INSERT INTO kv_tbl2 ... ON DUPLICATE KEY UPDATE ... id=id
INSERT INTO kv_tbl3 ... ON DUPLICATE KEY UPDATE ... id=id

INSERT INTO aggregate_table (id1,id2,id3)
  SELECT kv_tbl1.id, kv_tbl2.id, kv_tbl3.id
  FROM kv_tbl1, kv_tbl2, kv_tbl3
  WHERE kv_tbl1.val='val1' AND kv_tbl2.val='val2' AND
  kv_tbl3.val='val3' ON DUPLICATE KEY UPDATE ...

I could skip the SELECT sub-query in the last INSERT if each key-value table yields a new or existing @@IDENTITY. I really hope so, because avoiding n lookups for n tables prior to an insert would be much faster.
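A sketch of the streamlined version, assuming the LAST_INSERT_ID(id) trick holds up (the @id user variables are just placeholders for however the IDs get carried over):

INSERT INTO kv_tbl1 (val) VALUES ('val1')
  ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SET @id1 = LAST_INSERT_ID();

INSERT INTO kv_tbl2 (val) VALUES ('val2')
  ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SET @id2 = LAST_INSERT_ID();

INSERT INTO kv_tbl3 (val) VALUES ('val3')
  ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SET @id3 = LAST_INSERT_ID();

INSERT INTO aggregate_table (id1,id2,id3) VALUES (@id1, @id2, @id3)
  ON DUPLICATE KEY UPDATE id1 = id1;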

Sunburst in the forest of PivotTables

PivotTables are also known as cross tabs; the pivot operation is a cross-tabulation. MySQL doesn't offer this natively, but there is a CPAN module that can generate the query to arrange such a miracle.

On the cusp of server-client, at a crossroads between WAMP and LAMP, inching toward JavaScript and the modern web (CGI scripts notwithstanding), I choose to forgo Ruby, Python, and PHP for something else: Perl.

The pragmatic reason is that Excel pops a dialog box about insufficient resources. This is while pivoting a 100k-row workbook across a few hundred columns, split between one group and another; the application chokes on adding row labels down the side. If I can generate the cross-tabulation from MySQL, I will be a large step closer to running a free kernel.
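For reference, a hand-rolled cross-tab in MySQL can be written with conditional aggregation; the sales table and its columns here are hypothetical:

SELECT region,
       SUM(CASE WHEN year = 2010 THEN amount ELSE 0 END) AS y2010,
       SUM(CASE WHEN year = 2011 THEN amount ELSE 0 END) AS y2011
FROM sales
GROUP BY region;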

VBA: sum a column of values

The coordinate (spreadsheet as x-y grid) approach:

Dim total As Long
Dim col As Integer ' your target column
Dim r As Range

total = 0
col = 3

Range("A1").Select
Range(ActiveCell, ActiveCell.CurrentRegion).Select

' see Excel VBA doc for .Offset()
' to exclude header (or use If r.Row > 1)

For Each r In Selection.Rows
  total = total + Cells(r.Row, col)
Next

Range("C3") = total ' or wherever you wanted it

A little more specific to the searcher:

Dim numRows As Long
Dim totalsCell As Range
Dim i As Long

Set totalsCell = Range("C3") ' arbitrary totals cell
numRows = 24284 ' also arbitrary

For i = 2 To numRows ' i = 2: skip header
  totalsCell = totalsCell + Range("A" & i)
Next

Or let Excel do it for you:

Range("C3") = "=SUM(A:A)"

Searched: “vba mycol.add.range(“a” & i).value”

Check out my GitHub gists for more spreadsheet iteration examples.

Specify multiple options to ssconvert in Windows

ssconvert is my go-to tool for converting Excel workbooks to CSV-formatted text files. Recently I found a way to specify multiple options to the command. Assuming your workbook doesn’t have commas in the cells, you can get a nice conversion in one step:

ssconvert -O "separator=, format=raw quoting-mode=never eol=windows" workbook.xls csv-format.txt

Suppose you have a collection of workbooks, each already converted to a *-csv.txt file. You can use findstr to keep only the lines that do not match the header row. Combine /V and /C:"<string>" like so:

for %%i in (*-csv.txt) do (
  findstr /V /C:"Name,Age,Gender" %%i >> final.txt
)

Another ssconvert to turn it into a consolidated workbook:

ssconvert final.txt final.xls

You can wrap it up in a .zip and mail it off:

7z a final.zip final.xls

Finally, set it all up on a schedule, driven by a cURL batch script:

schtasks /create /tn make-final /tr "C:\scripts\curl-save-customer-time.bat" /sc daily /st 14:00:00 /ru admin /rp pass

Notes

  • Excel will fold double-quoted strings into one cell; this is desired behavior.
  • Search AutoIt v3 forums for email script.
  • Experiment with the ssconvert options. Sometimes, you may want quoting. This post empowers you to specify more than one setting.
  • If the one-step conversion does not give you useful line endings, use ssconvert twice: from .xls to .csv, and then .csv to .txt.

Have portfolio, will travel

I need a portfolio. If I were doing the interviewing, I would want one from others.

A portfolio would presumably consist of finished works or discrete products, an "I did this" stamp on each delineated item. For the interviewer-as-user, there would be an interface like a web page or an executable. Even better if he is already familiar with your work, or even relies on it. Aren't we all about "show, don't tell"?

But that's the realm of apps, not tools. Your interviewer is your future colleague, a peer or potential mentor. He's not just the recruiter or manager. Now things change: you build something that is useful to integrate or to build upon; your projects precede you. "I've read your work" is a nice advantage.

If you don’t want to learn a specific domain, make your domain specific: solve hard problems in your field, even if they’re re-implementations. It’s not a dealbreaker to write to an existing spec. Just make sure you’re being challenged.

Imma get meta

Should I be blogging about programming or programming on blogs?

A GitHub code commit is a work unit, while a blog post accomplishes little. The only time I review a past post is when a majority of visitors are reading it; WordPress tells me as much. Does the world need another blog engine?

No, it doesn’t – but I could write one for myself. “Programmer, write ye out of thy box.” These cardboard walls build up in the detritus of college, intern-less, nights of anime and pizza rolls and splurts of meta while our peers rent offices, bootstrap projects, and lay into dreams to make tangible desire.

More important is the approach to developing a blog engine. What is the smallest possible abstraction of the blog domain? What is the neat record I could store thousands of in a database? That is the “post.” Everything else is layered on top as tools, which manipulate patterns of – collections of – this data model.

The set of natural numbers is infinite. Imagine infinite content. Once you SQL CREATE a table with the right constraints, by definition haven’t you established a container of infinite records? It’s only data until a person observing collapses it into information.
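A minimal sketch of such a container, with hypothetical column choices:

CREATE TABLE post (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  slug       VARCHAR(200) NOT NULL UNIQUE,
  title      VARCHAR(200) NOT NULL,
  body       TEXT NOT NULL,
  created_at DATETIME NOT NULL
);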