Tools for data hoarders

Intro

This article describes the UNIX utilities I use to collect information about handhelds and maintain my local file archive. Other collectors may find this overview interesting and useful, so I decided to share it. Note that I run all the utilities mentioned here on GNU/Linux; I can’t say whether they support other operating systems (Windows, macOS).

Attention: The author urges you to respect copyright and the legal terms of the network services you use. The author is not responsible for any unauthorized use of the tools listed below.

Why store data locally?

A few reasons why you might lose access to data stored on somebody else’s host:

  1. The remote server might become unreachable.
  2. The data might be deleted from the remote server.
  3. The data provision policy of the remote server might change: the data might become paid (e.g., subscription-only) or unavailable in your country due to political sanctions.

Downloading static websites

I use wget to download multiple pages of a single website. This utility is extremely rich in features and configuration options. Below are configurations for the most common use cases. Mind the limitation: these commands do not download dynamically loaded content, for example, content a browser loads as you scroll.

Downloading an entire website

This use case lets you download an entire website with all of its assets. Before running the command, replace example.com (two occurrences) with the domain of the website you want to download.

wget \
    --no-clobber \
    --mirror \
    --recursive \
    --convert-links \
    --backup-converted \
    --page-requisites \
    --span-hosts \
    --domains=example.com \
    --restrict-file-names=windows \
    http://example.com/

Downloading website assets recursively

This use case lets you download website assets that share a common URI hierarchy. Before running the command, replace example.com (two occurrences) with the domain of the website you need and path with the common part of the URI.

wget \
    --no-clobber \
    --mirror \
    --recursive \
    --convert-links \
    --backup-converted \
    --page-requisites \
    --span-hosts \
    --domains=example.com \
    --restrict-file-names=windows \
    --no-parent \
    http://example.com/path/

Downloading a single page

To download a single page, I use Firefox plugins, primarily Save Page WE and SingleFile.
Unlike wget, which you can also use for this purpose, these plugins capture dynamically loaded content and save an entire page, with all of its content, as a single HTML file. Such pages are very convenient for local use.
The two plugins do the job in a similar way. Each handles some websites better than the other, so I use both. After the download completes, I recommend checking the result and trying the second plugin if you’re not satisfied.
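For comparison, a single static page can also be saved with wget alone. Here is a minimal sketch (the URL is just a placeholder); it grabs the page with its requisites and rewrites links for local viewing, but it will still miss dynamically loaded content:

wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --span-hosts \
    --directory-prefix=saved-page \
    http://example.com/article.html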

If you need to download a page together with the files its links point to, use DownThemAll!, another Firefox plugin. It lets you download all the files referenced by the page’s links in one click.

Bypassing the download restrictions imposed by a host

Some sites impose download restrictions that prevent you from saving content locally once a limit is reached. I encourage you to respect and abide by these restrictions. In exceptional cases, however, you may decide, at your own responsibility, to bypass them.
Restrictions may be active or passive. Passive restrictions include robots.txt, which is quite easy to ignore.
Bypassing active restrictions is a little more complicated: install Tor and configure random delays between requests. To route the requests through Tor, use tsocks, a transparent SOCKS proxy wrapper.
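With Tor running, tsocks only needs to know where the local SOCKS port is. A minimal /etc/tsocks.conf sketch, assuming Tor listens on its default port 9050 and your LAN uses the 192.168.0.0/24 range (adjust both to your setup):

# Traffic to the local network goes directly, everything else through Tor
local = 192.168.0.0/255.255.255.0
server = 127.0.0.1
server_type = 5
server_port = 9050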

The final command to bypass common download restrictions looks as follows:

tsocks wget \
    --no-clobber \
    --mirror \
    --recursive \
    --convert-links \
    --backup-converted \
    --page-requisites \
    --span-hosts \
    --domains=example.com \
    --restrict-file-names=windows \
    --waitretry=5 \
    --wait=2 \
    --random-wait \
    -e robots=off \
    http://example.com/

YouTube

To download YouTube videos, I use youtube-dl. Make sure to request the highest available quality.

Usage example:

youtube-dl -f bestvideo+bestaudio https://www.youtube.com/watch?v=ANW6X0DLT00
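For archiving whole playlists or channels, a few more youtube-dl options keep the collection organized and resumable. A sketch, assuming ffmpeg is installed to merge the separate video and audio streams (the playlist URL and output template are just examples):

youtube-dl -f bestvideo+bestaudio \
    --download-archive archive.txt \
    --write-info-json --write-description --write-thumbnail \
    -o '%(uploader)s/%(title)s-%(id)s.%(ext)s' \
    'https://www.youtube.com/playlist?list=PLAYLIST_ID'

The --download-archive file records the IDs of videos that have already been fetched, so rerunning the command only picks up new uploads.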

Wayback Machine

The Wayback Machine, the Internet Archive’s digital archive of the web, offers free access to vast amounts of useful, hard-to-find information about handhelds. The service provides an API for building custom utilities. To download, say, an entire website (that is, all of its assets available in the Wayback Machine), a dedicated utility is much more convenient than the web interface.

I use wayback_machine_downloader to download both entire websites and specific file types or directories (see the --only option in the documentation).

Usage example:

wayback_machine_downloader www.angstrom-distribution.org
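To restrict the download to particular file types, pass a filter to --only; for example, using the // regex notation described in the tool’s documentation, something like this should fetch only PDF files:

wayback_machine_downloader www.angstrom-distribution.org --only "/\.pdf/i"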

Google Books

To download books and magazines from Google Books, I use GoBooDo, a Google Books downloader. For this utility to work correctly, make sure tesseract is installed.

The utility outputs a ready-to-use PDF file and snapshots of some pages.

Usage example:

python GoBooDo.py --id HD0EAAAAMBAJ

Here’s the settings.json I use for downloads:

{
    "country": "ru",
    "page_resolution": 1500,
    "tesseract_path": "/usr/bin/tesseract",
    "proxy_links": 0,
    "proxy_images": 0,
    "max_retry_links": 1,
    "max_retry_images": 1,
    "global_retry_time": 30
}

Cloning GitHub repositories

To clone all GitHub repositories belonging to one user, use curl and jq. Replace gemian with the user whose repositories you want to download. If the user has more than a hundred repositories, download them in several passes by incrementing the page value (see the loop sketch below).

Usage example:

curl -s "https://api.github.com/users/gemian/repos?per_page=100&page=1" | jq -r ".[].clone_url" | xargs -L1 git clone
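For users with more than a hundred repositories, a small shell loop over the page parameter does the iteration for you. A sketch, using clone_url because GitHub no longer serves the unauthenticated git:// protocol, and assuming GNU xargs (-r skips pages that come back empty) and that three pages are enough; adjust the range and user name as needed:

#!/bin/sh
# Clone every public repository of a GitHub user, 100 repositories per API page.
user=gemian
for page in 1 2 3; do
    curl -s "https://api.github.com/users/$user/repos?per_page=100&page=$page" \
        | jq -r '.[].clone_url' \
        | xargs -r -L1 git clone
done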

Downloading files from SourceForge

To download all files of a SourceForge project, I use the script below, a slightly modified version of the script from the SpiritQuaddicted repository; the original didn’t work for me.

#!/bin/sh
# Based on https://github.com/SpiritQuaddicted/sourceforge-file-download

set -e

display_usage() {
  echo "Downloads all of a SourceForge project's files."
  printf "\nUsage: ./sourceforge-file-download.sh [project name]\n\n"
}

if [ $# -eq 0 ]
then
  display_usage
  exit 1
fi

project=$1
echo "Downloading $project's files"

# download all the pages that contain the direct download links
# be nice: wait a second between requests
wget -w 1 -np -m -A download "https://sourceforge.net/projects/$project/files/"

# extract those links
grep -Rh refresh sourceforge.net/ | grep -o "https[^\\?]*" | grep -v '&use_mirror=' > urllist

# remove temporary files, unless you want to keep them for some reason
rm -r sourceforge.net/

# download each of the extracted URLs, put into $projectname/
while read url; do wget --content-disposition -x -nH --cut-dirs=1 "${url}"; done < urllist

rm urllist
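Usage example (the project name is just a placeholder; pass the name as it appears in the project’s SourceForge URL):

chmod +x sourceforge-file-download.sh
./sourceforge-file-download.sh projectname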

P.S.

To conclude, a few thoughts on collecting information about handhelds:

  1. Consider sharing your local archives to avoid losing the information in the future.
  2. Consider contributing to the Internet Archive project: uploading your materials there is another way to increase the likelihood that the information will be preserved.