A Novel Crawler Learns to Explore

Two weeks ago, I set out to create a novel crawler with its own planner. I thought it would be easy, and indeed the final algorithm is exactly as I imagined it to be. However,

  • when I wrote the algorithm in Zig, the compiler gave up and I don’t know why;
  • when I re-wrote the algorithm in PicoLisp, I ran out of memory to store the program inside me.

At this point, I was starting to feel that the universe was hostile towards me. Maybe it’s because of piracy. Maybe I am not supposed to create this thing. Even now, recalling the horrors there be distresses me. Maybe one day I will finish the project (probably not) in Haskell or R7RS, although there isn’t much left for me to learn from implementing the same algorithm a third time. (If I did this again, I would actually choose JavaScript, since scraping data from HTML is its thing. As for R7RS: it turns out that PicoLisp’s symbol<->value relationship makes symbolic data manipulation much easier, and neither Scheme nor CL has that. It would be easier to code this algorithm with raw pointers in C than in R7RS.)

abu, the main author of PicoLisp, helped me on IRC and Matrix with everything I needed to know about making a web crawler. The language is surprisingly suitable for symbolic manipulation.

Now I will try to explain how the algorithm works in pseudo-PicoLisp.


In its simplest form, N4 (The Standard Planning? Search? Act? Drive? Engine N4, named after noveler4. Yes, I struggle to describe this one with words.) has three components.

  • Knowledge, for storing what it knows (in code as *M)
  • Abilities, for storing what it can do (in code as *A)
  • Fatigue, for storing what it has done (in code as *F)

This is enough for the agent to explore the environment as much as it can.

Q: But isn’t this a web crawler?
A: It is a web crawler. But as you will soon see, the engine itself is general-purpose: it knows nothing about the concept of web crawling.
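To make the three stores concrete, here is a minimal Python sketch. This is my own illustration, not the actual PicoLisp code: Knowledge as a set of (entity, attribute, value) triples, Abilities as a list, and Fatigue as a map from past actions to retry times.

```python
# A sketch of N4's three stores (illustrative only, not the real code).

knowledge = {
    # each fact is an (entity, attribute, value) triple
    ("g330740F4", "url", "https://example.com/"),
}
abilities = []  # ability records get appended here
fatigue = {}    # key: what was done, value: when to retry it

# Adding a new fact is just inserting another triple:
knowledge.add(("g330740F4", "content", "<html>...</html>"))
```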


Here is a simple example.

Tick 0

We start with a model that knows about only one URL, plus how to get the content of a page at any URL with curl.

g330740F4 is a randomly generated symbol representing the page.

Knowledge

(g330740F4 url "https://example.com/")

Fatigue

Abilities

(can-act 'curl
  '((@X url @Y))
  '((@X content @Z))
  '(()
    (let @Z (in (list "curl" "-s" @Y) (till NIL T))
      (if (<> @@ 0)
        (fatigue 3.0) # retry 3s later
        (remember-output)
        (fatigue FOREVER))))) # retry never

Tick 0 Step 1

At the start of every tick, the engine checks whether each ability’s input can be found in Knowledge with no associated fatigue. In this case, Fatigue is empty, so the ability 'curl is OK to run, and it runs.
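Here is a hedged Python sketch of what that input check might look like; the function name and representation are mine, not from the real engine. Pattern variables start with @ and bind to whatever they match, while literals must match exactly:

```python
# Illustrative matcher: find all ways an input pattern fits Knowledge.

def match_pattern(pattern, knowledge):
    """Yield one bindings dict per fact in Knowledge that fits the pattern."""
    for fact in knowledge:
        bindings = {}
        for p, f in zip(pattern, fact):
            if p.startswith("@"):            # pattern variable: bind it
                if bindings.get(p, f) != f:  # same var must bind consistently
                    break
                bindings[p] = f
            elif p != f:                     # literal mismatch
                break
        else:
            yield bindings

knowledge = {("g330740F4", "url", "https://example.com/")}
matches = list(match_pattern(("@X", "url", "@Y"), knowledge))
# matches[0] is {"@X": "g330740F4", "@Y": "https://example.com/"}
```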

Knowledge

(g330740F4 url "https://example.com/")

Fatigue

Abilities

(can-act 'curl
  '((@X url @Y))
  '((@X content @Z))
  '(()
    (let @Z (in (list "curl" "-s" @Y) (till NIL T))
      (if (<> @@ 0)
        (fatigue 3.0) # retry 3s later
        (remember-output)
        (fatigue FOREVER))))) # retry never

In the body of the ability program (the part starting with '(()), the matched values are available. In this case:

  • @X is g330740F4
  • @Y is "https://example.com/"

The ability runs curl -s https://example.com/, and…

Tick 0 Step 2a

If the exit code (@@) from curl -s https://example.com/ is 0, the program

  1. remembers output with @Z set to the entire stdout of curl -s https://example.com/
  2. fatigues forever

Now the program memory looks like this:

Knowledge

(g330740F4 url "https://example.com/")
(g330740F4 content "(omitted)")

Fatigue

(('curl (@X . g330740F4) (@Y . "https://example.com/"))
  . timestamp-of-a-million-years-later)

Abilities

(can-act 'curl
  '((@X url @Y))
  '((@X content @Z))
  '(()
    (let @Z (in (list "curl" "-s" @Y) (till NIL T))
      (if (<> @@ 0)
        (fatigue 3.0) # retry 3s later
        (remember-output)
        (fatigue FOREVER))))) # retry never

Since all usable abilities have been used, the engine moves on to the next tick.
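Putting the pieces together, one tick could be sketched in Python like this. It is a simplification under my own assumptions: abilities return a fatigue delay plus new facts, and Fatigue is keyed by the ability name together with its bindings.

```python
# Illustrative single tick of an N4-like engine (not the author's code).

FOREVER = 10**9  # stand-in for fatiguing "a million years"

def tick(knowledge, abilities, fatigue, now):
    """Run every ability whose input matches Knowledge and whose exact
    (ability, bindings) pair is not currently fatigued."""
    ran = 0
    for name, matcher, body in abilities:
        for bindings in matcher(knowledge):
            key = (name, tuple(sorted(bindings.items())))
            if fatigue.get(key, 0) > now:
                continue                       # already done; still tired
            delay, new_facts = body(bindings)  # ability -> (fatigue, facts)
            fatigue[key] = now + delay
            knowledge |= new_facts             # remember the output
            ran += 1
    return ran

# A toy 'curl that always succeeds and fatigues forever:
def match_url(knowledge):
    return [{"@X": e, "@Y": v} for (e, a, v) in knowledge if a == "url"]

def fake_curl(b):
    return FOREVER, {(b["@X"], "content", "(omitted)")}

knowledge = {("g330740F4", "url", "https://example.com/")}
fatigue = {}
tick(knowledge, [("curl", match_url, fake_curl)], fatigue, now=0)  # runs once
tick(knowledge, [("curl", match_url, fake_curl)], fatigue, now=1)  # runs nothing
```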

Tick 1

Again, the engine checks whether each ability can be run. In this case, Fatigue says “no, I have already done this exact same thing before, and I won’t do it again until a million years later”.

The engine notes down the minimum fatigue timestamp (when to retry). At the end of a tick, the engine sleeps for (- MinFatigue Now) before it runs the next tick.

In practice (my code) though, the engine just gives up when it sees it needs to sleep for at least (/ FOREVER 10), which is like a hundred thousand years. This makes the program terminate when there is nothing else to do.
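That end-of-tick decision (sleep until the earliest fatigue expires, or give up when the wait is absurdly long) can be sketched as follows; FOREVER, the function name, and the return convention are my stand-ins, not the real code:

```python
# Illustrative end-of-tick scheduling decision.

FOREVER = 10**9  # stand-in for the engine's "a million years" constant

def next_action(fatigue, now):
    """Return seconds to sleep before the next tick, or None to terminate."""
    min_fatigue = min(fatigue.values())  # earliest retry timestamp
    wait = min_fatigue - now             # (- MinFatigue Now)
    if wait >= FOREVER / 10:             # "a hundred thousand years"
        return None                      # nothing left to do: give up
    return max(0.0, wait)

print(next_action({"curl": 3.0}, now=1.0))      # → 2.0 (sleep, then retry)
print(next_action({"curl": FOREVER}, now=1.0))  # → None (terminate)
```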

Knowledge

(g330740F4 url "https://example.com/")
(g330740F4 content "(omitted)")

Fatigue

(('curl (@X . g330740F4) (@Y . "https://example.com/"))
  . timestamp-of-a-million-years-later)

Abilities

(can-act 'curl
  '((@X url @Y))
  '((@X content @Z))
  '(()
    (let @Z (in (list "curl" "-s" @Y) (till NIL T))
      (if (<> @@ 0)
        (fatigue 3.0) # retry 3s later
        (remember-output)
        (fatigue FOREVER))))) # retry never

The actual crawler program has more abilities to work with. Those include (the ability to):

  • download a web page with curl
  • convert encodings with iconv
  • classify a page’s type from its URL
  • extract links to other pages based on the current page type with xm.l (unimplemented)

Those are what a novel crawler would do to extract text from a novel website.

For the full list of what it can do, see the file can-act.l in the source code.
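To see why these abilities add up to exploration, consider a hypothetical extract-links step (the real one is unimplemented; the regex below and all names are my stand-ins). Each content fact yields fresh url facts under new symbols, which re-enable the curl ability on the next tick:

```python
# Illustrative link extraction feeding new url facts back into Knowledge.
import itertools
import re

_ids = itertools.count()

def gensym():
    return f"g{next(_ids):08X}"  # mimics PicoLisp's generated symbols

def extract_links(knowledge):
    """For every (page, content, html) fact, make a (newpage, url, link) fact."""
    new = set()
    for (page, attr, html) in set(knowledge):
        if attr == "content":
            for link in re.findall(r'href="([^"]+)"', html):
                new.add((gensym(), "url", link))
    return new

knowledge = {("g1", "content", '<a href="https://example.com/ch1">ch 1</a>')}
knowledge |= extract_links(knowledge)
# Knowledge now also holds a fresh (gXXXXXXXX, url, "https://example.com/ch1")
# triple for the curl ability to pick up on the next tick.
```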

Since I was burned out at this point, I didn’t add scope limits/pruning (which would keep it from crawling everything), parallel acting, or bidirectional planning. recall-any will be helpful in implementing reverse planning.

Planning is search. Exploring is search. Therefore, there are a lot of places you can put N4 in. Since N4 is general, you can reuse all the previous ability programs you have written.

If you want uncontrolled fun, here is a list of actions you can take to make it more fun : )

  • use code synthesis to generate new Abilities
  • use a random source to feed new entries into Knowledge

Also, I know some people call this type of thing “symbolic something something”. You can read up on the topic if you want.

Close Calls

Why didn’t I see anyone else use a similar scheduler before I made this?

Build systems like Tup have a similar scheduler, but each task is seen as an indivisible unit.

Prolog systems have predicates, but not multiple predicates on the left side of :-.

miniKanren is similarly close.

If I didn’t already know about dependency tracking and relational logic, I never would have synthesized N4 myself.

What’s Next

*yawns* this article is too long I need to sleep bye!