My attempt at archiving nfscars.net

Background

My mother once had a laptop: the Compaq Armada 1592DT. It came with Windows ME, which later got “upgraded” to Windows 98 SE after I managed to completely screw up the OS. It had a whopping 96MB of RAM and a hard drive that was probably somewhere in the 1-3GB range. It wasn’t powerful or anything, but some of my earliest memories of playing video games are of Sports Car GT and Need for Speed III Hot Pursuit running on that thing. Yes, both were demos, and the performance was probably in the 10-20FPS range, but it was still great for a younger me. That laptop eventually got replaced by a Windows 98 based desktop which wasn’t a lot better with its 64MB of RAM, but it could play RuneScape (barely) and also Need for Speed III.

Need for Speed III Hot Pursuit holds a very special place in my heart. It looked great at the time, the police chases were fun and challenging, the soundtrack was fantastic and the track featured in the demo (Rocky Pass) is still something I could probably draw up from memory, assuming that I could draw. At some point I discovered that it is possible to mod the game to include new cars and tracks, which resulted in ridiculous speeds, cars that could absolutely destroy the police cars, and lots and lots of game crashes. One of the sources for these mods was nfscars.net.

nfscars.net

Depending on when you are reading this, nfscars.net is/was a website that hosts/hosted thousands of cars, tracks and other mods for various Need for Speed titles. In 2020, I decided to check the site out again to see if it was still operational.

The last post referred to an “upcoming” NFS title that had already been released a year earlier.

The link to the forums resulted in an error page being shown.

Should the site suddenly disappear without any warning, all the content would be lost. Since I have a small datahoarding bug, I decided to take matters into my own hands and do my best to archive the contents of the site. That, along with nostalgia for the old Need for Speed titles, was what pushed me to start this archiving adventure.

Initial efforts

Before trying any fancy technical solutions to crawl the site and get all of the assets that way, I decided to start from the source and tried contacting the site owner to arrange the download of the assets. That did not work out at all, since the contact e-mails were either not active or the inboxes were full.

I signed up for the site and tried to message some user accounts which hopefully were related to someone who was in charge. No dice.

There is a small chatroom on the site named “Shoutbox”, so I tried asking about the site owners there. It seems that a couple of users are still occasionally chatting there, mainly about random topics and spammy users. Unfortunately none of the users there, not even the moderators, could assist with setting up a backup of the site.

With the domain (potentially) expiring on 2021-05-02, I felt like I had to act now, because the site was probably running on borrowed time.

Technical solutions

I looked at a selection of tools for crawling the site and decided on Heritrix, mainly because it is backed by the Internet Archive. If they cannot get archiving right, then who can?

Setting up Heritrix turned out to be quite a hassle. There does not seem to be a quick guide that you can follow to get started in 5 minutes. After stumbling around, learning the config file format and understanding that the best way to configure crawls was to do it externally with a text editor, I was set up to launch my first crawl. Even then, I hit some roadblocks:

The crawl was slow. I understand that Heritrix is just trying to be polite here, but if you have millions of URLs to crawl and the site might go down at any moment, then time really isn’t on your side.
Checkpointing is supported, but for some reason the default configuration does not enable it. If the machine that you are running this on restarts, your crawl stops and you cannot resume from where it left off. You can only start over.
Heritrix is based on Java, which means that theoretically it should run on anything that runs Java, such as an ARM CPU based SBC. However, I ran into weird java.nio related issues with symlinking on those platforms and I just could not get them resolved.
If you run too many crawls at once, then Heritrix is perfectly capable of killing your LAN by using up all the available connections. It managed to rack up 16000+ active connections, which hit the 16384 connection limit on my consumer-grade router. That was not fun to debug.
Automatically continuing crawls from a checkpoint on system startup requires you to use the Heritrix API, which isn’t too complicated, but I just wish that this was another feature that it supported out of the box.
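For the resume-on-startup part, here is a minimal sketch of what such a script could look like. It assumes the Heritrix 3 REST API is reachable at https://localhost:8443/engine with admin:admin credentials, that the job is called nfscars, and that the installed release accepts the special checkpoint value "latest" (older releases may need the exact checkpoint name instead). All of those values are placeholders, and DRY_RUN is a helper added here for illustration, not a Heritrix feature.

```shell
#!/bin/bash
# Sketch: relaunch a Heritrix 3 job from its most recent checkpoint via the REST API.
# HERITRIX_URL, HERITRIX_AUTH and the job name are assumptions; adjust to your setup.
HERITRIX_URL="${HERITRIX_URL:-https://localhost:8443/engine}"
HERITRIX_AUTH="${HERITRIX_AUTH:-admin:admin}"

# Send one action to a job. With DRY_RUN=1 the request is printed instead of sent.
heritrix_action() {
  local job="$1" data="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST $HERITRIX_URL/job/$job $data"
    return 0
  fi
  # -k accepts the self-signed certificate, --anyauth lets curl negotiate digest auth.
  curl --silent -k -u "$HERITRIX_AUTH" --anyauth --location \
    -d "$data" "$HERITRIX_URL/job/$job" > /dev/null
}

# Build the job, launch it from the latest checkpoint, then unpause it.
# A real script would probably need to wait for the job to report that it is
# paused before sending the unpause action.
resume_job() {
  local job="$1"
  heritrix_action "$job" "action=build" &&
  heritrix_action "$job" "action=launch&checkpoint=latest" &&
  heritrix_action "$job" "action=unpause"
}
```

Calling `resume_job nfscars` from a cron @reboot entry or a systemd unit, once Heritrix itself is up, would be one way to wire this into system startup. Running it with DRY_RUN=1 first shows the three requests without touching the crawler.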

After messing up a couple of times and trying again, I finally managed to complete a crawl of the site. The whole effort only took 6 months, of which the crawl itself ran for 3 months. MONTHS!

What the hell is a WARC?

Heritrix collects the results of the crawl into a special format called WARC, which it then compresses, resulting in lots of little .warc.gz files. If you extract one of these, you will end up with a .warc file.
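Since a .warc.gz is plain gzip, the standard tools are enough to unpack it. The sketch below decompresses every .warc.gz in a directory while keeping the compressed originals. The function name and directory layout are made up for illustration.

```shell
#!/bin/bash
# Sketch: decompress every .warc.gz in a directory, keeping the compressed originals.
decompress_warcs() {
  local dir="$1" f
  for f in "$dir"/*.warc.gz; do
    [ -e "$f" ] || continue        # nothing matched, the glob stayed literal
    gunzip -c "$f" > "${f%.gz}"    # a.warc.gz -> a.warc
  done
}
```

Usage would be something like `decompress_warcs /path/to/warcs`. Keep in mind that the uncompressed files can be several times larger, so check the free disk space first.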

To extract files from a WARC file, you need one of the tools listed here. After trying a couple, I decided to go with warcat since it seemed to mostly work and it supported combining smaller WARC files into bigger ones and extracting the contents.

Depending on the size of the site, the extraction process might take a while. Crawls can contain thousands or even millions of small files, and if you are storing these on hard drives, then it will be slow. I have not tested this on an SSD, but it will probably be much faster due to its superior random I/O performance.

Once I had extracted the contents, I did some checking to see if the assets that I cared about were present in the crawl. I did a file search for a modded car from Need for Speed III.

No results.

Okay, that’s odd. Tried it with another file name.

No results.

Well, what do we have? I checked the extracted contents using ncdu and found that while Heritrix managed to grab a lot of the webpages and the images from the galleries, it failed to download files, such as .zip and .rar files that contain the modifications themselves.
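ncdu is interactive, so for a quick non-interactive sanity check, something like the sketch below counts the extracted files per extension. It is not what I ran at the time. The function name is made up, and it assumes file names without newlines in them.

```shell
#!/bin/bash
# Sketch: count files per (lowercased) extension under a directory.
count_by_extension() {
  # -name '*.*' keeps only file names that contain a dot, so the last dot
  # in each printed path is guaranteed to belong to the file name itself.
  find "$1" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn
}
```

Running `count_by_extension /path/to/extracted/crawl` prints one line per extension with its count. If zip and rar are missing from that list or only show up a handful of times, the crawl did not capture the downloads.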

Getting the final pieces of the puzzle

I started investigating the download functionality of the site. On every page load, the download link would have a new identifier attached to it. Upon clicking on it, the download would start. If you waited too long, the download link would expire.

Firefox has a handy feature where you can copy the request into a curl command. I did that, added some parameters to it so that it would use the filename that comes from the response header and made it save the file to disk. It worked!

Now I had to figure out how to collect all of these download links. The site has sections for each game, which is also reflected in the URL. However, I found that when you navigate to the page of a mod and change the numerical ID at the end, you could cycle through all the cars, tracks and other tools for all games, even if the URL was technically referring to one game in the series.

After bashing around for a while, I came up with this masterpiece:

#!/bin/bash
# Walks every numeric mod ID, finds the download links on each page and saves the files.
dest=/your/download/destination
cookie='Cookie: csrftoken=XXX; sessionid=YYY'  # placeholder values, copy yours from the browser
cd "$dest" || exit 1

for index in {1..18050}; do
  echo "Index at $index, starting."
  # One download path per array element; a page can offer more than one file.
  mapfile -t filePaths < <(curl --silent "https://www.nfscars.net/need-for-speed-hot-pursuit/1/files/view/$index/" -H "$cookie" | grep -oE "files/download/send/[0-9]+/[0-9a-zA-Z]+")
  echo "File paths: ${filePaths[*]}"
  echo "Making directory $index"
  mkdir -p "$index"
  echo "Going into directory $index"
  cd "$index" || continue

  for path in "${filePaths[@]}"; do
    echo "Starting download on path https://www.nfscars.net/$path"
    # -OJ saves the file under the name given in the Content-Disposition response header.
    curl --silent "https://www.nfscars.net/$path" \
      -H 'User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0' \
      -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
      -H 'Accept-Language: en-US,en;q=0.5' --compressed \
      -H 'Connection: keep-alive' \
      -H 'Referer: https://www.nfscars.net/need-for-speed-hot-pursuit/1/files/view/1/' \
      -H "$cookie" \
      -H 'Upgrade-Insecure-Requests: 1' -OJ
    echo "Download $path done."
  done

  cd "$dest" || exit 1
  echo "Back in home dir. Index $index end."
done

Not the prettiest script around, but it did the trick. In a day or so the assets were downloaded. Since each download sits in a directory named after its index, you can use that index to connect the files to the matching page in the crawl.
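The link extraction step can be tried in isolation without hitting the site. The sketch below runs the same regular expression over a made-up HTML snippet. The markup and the abc123DEF token are hypothetical, and the sort -u de-duplication is an addition that the script above does not do.

```shell
#!/bin/bash
# Sketch: pull download paths out of a page's HTML, one per line, without duplicates.
extract_download_paths() {
  grep -oE 'files/download/send/[0-9]+/[0-9a-zA-Z]+' | sort -u
}

# Hypothetical markup; the real pages may look different.
sample_html='<a href="/files/download/send/123/abc123DEF/">Download</a>
<a href="/files/download/send/123/abc123DEF/">Mirror</a>'

printf '%s\n' "$sample_html" | extract_download_paths
# prints: files/download/send/123/abc123DEF
```

De-duplicating matters if a page links to the same file more than once, since every extra match would otherwise trigger another download of the same archive.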

Results

All this effort would be for nothing if I didn’t share my results, so I decided to pack these up and serve them for anyone who wants to keep a backup of at least one part of Need for Speed history.

The archived content can be found on the Internet Archive. Here’s a newer crawl from 2021.

Pieces of history get lost all the time with sites going offline. I’m hoping that at least with this effort I have managed to do my part in preserving a tiny piece of it.

2021-03-15 update

The site seems to be down at the moment and has been for a couple of days. Oh no.

2021-03-20 update

Somehow the site is alive again.