{"id":1336,"date":"2021-10-31T12:45:04","date_gmt":"2021-10-31T09:45:04","guid":{"rendered":"https:\/\/handheld.computer\/?page_id=1336"},"modified":"2021-10-31T22:16:19","modified_gmt":"2021-10-31T19:16:19","slug":"tools-for-data-hoarders","status":"publish","type":"page","link":"https:\/\/handheld.computer\/?page_id=1336","title":{"rendered":"Tools for data hoarders"},"content":{"rendered":"<h1>Intro<\/h1>\n<p>This article describes UNIX utilities I use to collect information about handhelds and maintain my local file archive. Other collectors may find this overview useful, so I decided to share it. Note that I run all the mentioned utilities on GNU\/Linux; I don\u2019t know whether they work on other operating systems (Windows, macOS).<\/p>\n<p>Attention: The author of the article urges you to respect the copyright and legal provisions of the network services you use. The author is not responsible for any unauthorized use of the tools listed below.<\/p>\n<h2>Why store data locally?<\/h2>\n<p>A few reasons why you might lose access to data stored on somebody else\u2019s host:<\/p>\n<ol>\n<li>The remote server might become unreachable.<\/li>\n<li>The data might get deleted from the remote server.<\/li>\n<li>The remote server\u2019s data provision policy might change: the data might become chargeable (e.g. by subscription) or unavailable in your country due to political sanctions.<\/li>\n<\/ol>\n<h1>Downloading static websites<\/h1>\n<p>I use <a href=\"http:\/\/www.gnu.org\/software\/wget\/\">wget<\/a> to download multiple pages of one website. This utility is extremely rich in features and tuning options. Below are configurations for the most common use cases. Mind the limitation: the commands below do not download dynamically loaded content, such as the content a browser loads on scrolling.<\/p>\n<h2>Downloading an entire website<\/h2>\n<p>This use case downloads the entire website content (all assets). 
Before using, replace example.com (two matches) with the domain of the website you want to download.<\/p>\n<pre><code>wget \\\n    --no-clobber \\\n    --mirror \\\n    --recursive \\\n    --convert-links \\\n    --backup-converted \\\n    --page-requisites \\\n    --span-hosts \\\n    --domains=example.com \\\n    --restrict-file-names=windows \\\n    http:\/\/example.com\/<\/code><\/pre>\n<h2>Downloading the website assets recursively<\/h2>\n<p>This use case downloads the website assets that share a common URI hierarchy. Before using, replace example.com (two matches) with the domain of the website you need, and path with the common part of the URIs.<\/p>\n<pre><code>wget \\\n    --no-clobber \\\n    --mirror \\\n    --recursive \\\n    --convert-links \\\n    --backup-converted \\\n    --page-requisites \\\n    --span-hosts \\\n    --domains=example.com \\\n    --restrict-file-names=windows \\\n    --no-parent \\\n    http:\/\/example.com\/path\/<\/code><\/pre>\n<h2>Downloading a single page<\/h2>\n<p>To download a single page, I use Firefox plugins\u2014first of all, <a href=\"https:\/\/addons.mozilla.org\/en-US\/firefox\/addon\/save-page-we\/\">Save Page WE<\/a> and <a href=\"https:\/\/addons.mozilla.org\/en-US\/android\/addon\/single-file\/\">SingleFile<\/a>.<br \/>\nUnlike wget, which you can also use for this purpose, the plugins can download dynamically loaded content and save an entire page with all its contents as a single HTML file. Such pages are very convenient for local use.<br \/>\nThe plugins work similarly, but each handles some websites better than the other; hence I use both. After the download completes, check the result and try the other plugin if you\u2019re not satisfied.<\/p>\n<p>If you need to download a page together with the files referenced by its links, use <a href=\"https:\/\/addons.mozilla.org\/en-US\/firefox\/addon\/downthemall\/\">DownThemAll!<\/a>, another Firefox plugin. 
With this one, you can download all files referenced by the page links with one click.<\/p>\n<h2>Bypassing the download restrictions imposed by a host<\/h2>\n<p>Some sites impose download restrictions that prevent you from saving content locally once a limit is reached. I encourage you to respect and abide by these restrictions. In exceptional cases, however, you may decide, at your own responsibility, to bypass them.<br \/>\nThe restrictions may be active or passive. Among the latter is robots.txt, which is quite easy to ignore.<br \/>\nBypassing active restrictions is a little more complicated. Install Tor and configure random delays between requests. For the requests to go through Tor, use <a href=\"http:\/\/tsocks.sourceforge.net\/\">tsocks<\/a>, the proxy utility.<\/p>\n<p>The final command to bypass common download restrictions looks as follows:<\/p>\n<pre><code>tsocks wget \\\n    --no-clobber \\\n    --mirror \\\n    --recursive \\\n    --convert-links \\\n    --backup-converted \\\n    --page-requisites \\\n    --span-hosts \\\n    --domains=example.com \\\n    --restrict-file-names=windows \\\n    --waitretry=5 \\\n    --wait=2 \\\n    --random-wait \\\n    -e robots=off \\\n    http:\/\/example.com\/<\/code><\/pre>\n<h1>YouTube<\/h1>\n<p>To download YouTube videos, I use <a href=\"https:\/\/github.com\/ytdl-org\/youtube-dl\">youtube-dl<\/a>. Make sure to choose the highest download quality.<\/p>\n<p>Usage example:<\/p>\n<pre><code>youtube-dl -f bestvideo+bestaudio https:\/\/www.youtube.com\/watch?v=ANW6X0DLT00<\/code><\/pre>\n<h1>Wayback Machine<\/h1>\n<p><a href=\"https:\/\/archive.org\/web\/\">Wayback Machine<\/a>, the Internet digital archive, offers free access to vast amounts of useful, hard-to-find information about handhelds. The service provides an <a href=\"https:\/\/archive.org\/help\/wayback_api.php\">API<\/a> for building custom utilities. 
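For instance, a quick availability check can be run against the API with curl (example.com and the timestamp below are placeholders):<\/p>\n<pre><code>curl \"https:\/\/archive.org\/wayback\/available?url=example.com&timestamp=20060101\"<\/code><\/pre>\n<p>The response is JSON describing the closest archived snapshot, if any. 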
To download, say, an entire website (that is, all of its assets available on the Wayback Machine), dedicated utilities are preferable to the service\u2019s web interface.<\/p>\n<p>I use <a href=\"https:\/\/github.com\/hartator\/wayback-machine-downloader\">wayback_machine_downloader<\/a> to download both entire websites and specific file types or directories (see the <code>--only<\/code> option in the documentation).<\/p>\n<p>Usage example:<\/p>\n<pre><code>wayback_machine_downloader www.angstrom-distribution.org<\/code><\/pre>\n<h1>Google Books<\/h1>\n<p>To download books and magazines from Google Books, I use <a href=\"https:\/\/github.com\/vaibhavk97\/GoBooDo\">GoBooDo<\/a>, a Google Books downloader. For this utility to function correctly, make sure to install <a href=\"https:\/\/github.com\/tesseract-ocr\/tesseract\">tesseract<\/a>.<\/p>\n<p>The utility outputs a ready-to-use PDF file and snapshots of some pages.<\/p>\n<p>Usage example:<\/p>\n<pre><code>python GoBooDo.py --id HD0EAAAAMBAJ<\/code><\/pre>\n<p>Here\u2019s the settings.json I use for downloads:<\/p>\n<pre><code>{\n    \"country\": \"ru\",\n    \"page_resolution\": 1500,\n    \"tesseract_path\": \"\/usr\/bin\/tesseract\",\n    \"proxy_links\": 0,\n    \"proxy_images\": 0,\n    \"max_retry_links\": 1,\n    \"max_retry_images\": 1,\n    \"global_retry_time\": 30\n}<\/code><\/pre>\n<h1>Cloning GitHub repositories<\/h1>\n<p>To clone all GitHub repositories belonging to the same user, use <a href=\"https:\/\/curl.se\/\">curl<\/a> and <a href=\"https:\/\/stedolan.github.io\/jq\/\">jq<\/a>. Replace <code>gemian<\/code> with the user whose repositories you want to download. 
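Since GitHub\u2019s API returns at most a hundred repositories per request, a paginated fetch can be sketched as a loop (the page range 1\u20135 is an arbitrary assumption; adjust it to the user\u2019s repository count):<\/p>\n<pre><code>for page in 1 2 3 4 5; do\n    curl -s \"https:\/\/api.github.com\/users\/gemian\/repos?per_page=100&page=$page\" |\n        jq -r \".[].git_url\" |\n        xargs -L1 git clone\ndone<\/code><\/pre>\n<p>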
If the user has over a hundred repositories, download in several iterations by incrementing the <code>page<\/code> value.<\/p>\n<p>Usage example:<\/p>\n<pre><code>curl -s \"https:\/\/api.github.com\/users\/gemian\/repos?per_page=100&page=1\" | jq -r \".[].git_url\" | xargs -L1 git clone<\/code><\/pre>\n<h1>Downloading files from SourceForge<\/h1>\n<p>To download all files of a SourceForge project, I use the script below, a slightly modified version of the script from the <a href=\"https:\/\/github.com\/SpiritQuaddicted\/sourceforge-file-download\">SpiritQuaddicted<\/a> repository. The original script didn\u2019t work for me.<\/p>\n<pre><code>#!\/bin\/sh\n# Based on https:\/\/github.com\/SpiritQuaddicted\/sourceforge-file-download\n\nset -e\n\ndisplay_usage() {\n  echo \"Downloads all of a SourceForge project's files.\"\n  printf \"\\nUsage: .\/sourceforge-file-download.sh [project name]\\n\\n\"\n}\n\nif [ $# -eq 0 ]\nthen\n  display_usage\n  exit 1\nfi\n\nproject=$1\necho \"Downloading $project's files\"\n\n# download all the pages that contain direct download links\n# be nice, sleep a second between requests\nwget -w 1 -np -m -A download \"https:\/\/sourceforge.net\/projects\/$project\/files\/\"\n\n# extract those links\ngrep -Rh refresh sourceforge.net\/ | grep -o \"https[^\\\\?]*\" | grep -v '&amp;use_mirror=' > urllist\n\n# remove temporary files, unless you want to keep them for some reason\nrm -r sourceforge.net\/\n\n# download each of the extracted URLs into $project\/\nwhile read -r url; do wget --content-disposition -x -nH --cut-dirs=1 \"${url}\"; done < urllist\n\nrm urllist<\/code><\/pre>\n<h1>P.S.<\/h1>\n<p>In conclusion, some thoughts on collecting information about handhelds:<\/p>\n<ol>\n<li>Consider sharing your local archives to avoid losing the information in the future.<\/li>\n<li>Consider contributing to the <a href=\"https:\/\/archive.org\/\">Internet Archive<\/a> project. 
Uploading your materials to the Internet Archive is another way to improve the chances that the information survives.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Intro This article describes UNIX utilities I use to collect information about handhelds and maintain my local file archive. Other collectors may find this overview useful, so I decided to share it. Note that I run all the mentioned utilities on GNU\/Linux; I don\u2019t know whether they work on other operating systems (Windows, macOS). &hellip; <a href=\"https:\/\/handheld.computer\/?page_id=1336\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Tools for data hoarders&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":249,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"tags":[],"class_list":["post-1336","page","type-page","status-publish","hentry"],"post_mailing_queue_ids":[],"_links":{"self":[{"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/pages\/1336","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/handheld.computer\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1336"}],"version-history":[{"count":7,"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/pages\/1336\/revisions"}],"predecessor-version":[{"id":1427,"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/pages\/1336\/revisions\/1427"}],"up":[{"embeddable":true,"href":"https:\/\/handheld.computer\/index.php?rest_route=\/wp\/v2\/pages\/249"}],"wp:attachment":[{"href":"https:\/\/handheld.computer
\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1336"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/handheld.computer\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1336"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}