The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Archive for the ‘InternetArchive’ Category

Highly esteemed science: An analysis of attitudes towards and perceived attributes of science in letters to the editor in two Dutch newspapers – Stefan P.L. de Jong, Elena Ketting, Leonie van Drooge, 2020

Posted by jpluimers on 2021/10/06

All my IPv4 addresses seem to be blocked with messages like this (note the odd, but allowed, leading zero in the IPv4 address [WayBack]):

Error

The IP you are accessing the site with (037.153.243.242) has been blocked because it has triggered one of our security measures. Please see the reason below:
Block reason: This IP was identified as infiltrated and is being used by sci-hub as a proxy.
To restore access, please contact onlinesupport@sagepub.com citing this message in full.
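The leading zero is worth a note because parsers disagree on it: some read `037` as decimal 37, while classic `inet_aton`-style parsers read a leading zero as octal (so 31). Below is a minimal Python sketch of both readings; the function name `parse_dotted_quad` and its flag are my own illustration, not taken from any library.

```python
def parse_dotted_quad(text, octal_leading_zero=False):
    """Parse a dotted-quad IPv4 string into a tuple of four integers.

    With octal_leading_zero=False a leading zero is ignored (decimal reading).
    With octal_leading_zero=True an octet such as "037" is read as octal,
    mimicking inet_aton-style parsers.
    """
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four octets, got {len(parts)}")
    octets = []
    for part in parts:
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"invalid octet: {part!r}")
        is_octal = octal_leading_zero and len(part) > 1 and part.startswith("0")
        value = int(part, 8 if is_octal else 10)
        if value > 255:
            raise ValueError(f"octet out of range: {part!r}")
        octets.append(value)
    return tuple(octets)
```

Calling it on `037.153.243.242` with and without the flag shows why two tools can disagree about which host such an address refers to.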

A quick [WayBack] “This IP was identified as infiltrated and is being used by sci-hub as a proxy.” – Google Search shows they also block the Google Bot.

I am not even going to bother with companies that have bad infiltration detection.

Of course I ensured the paper has been archived:

[WayBack/Archive.is] Highly esteemed science: An analysis of attitudes towards and perceived attributes of science in letters to the editor in two Dutch newspapers – Stefan P.L. de Jong, Elena Ketting, Leonie van Drooge, 2020.

Note I do not run sci-hub, though I am tempted to. For more info: [WayBack] Sci-Hub – Wikipedia

I checked the router and web-proxy for any suspicious activity. There is none.

I do run the ArchiveBot by the ArchiveTeam to support the WayBackMachine of the InternetArchive and the great team Mark Graham has there: I provide some bandwidth and CPU/memory resources to help them archive public internet content for posterity.

If that triggers SAGE, too bad for them.

–jeroen

Posted in Development, Internet, InternetArchive, LifeHacker, Power User, Software Development, WayBack machine, Web Development | Leave a Comment »

Windows and the current state of S.M.A.R.T. tooling that understands NVMe

Posted by jpluimers on 2021/09/16

I had trouble with two Intel 600p NVMe SSD devices: read-errors.

It appeared that only a few tools understand how to get S.M.A.R.T. health information from them, and even then they did not explain the read errors.

I’m going to RMA them, but in case anyone else needs to get health information from NVMe SSD devices, here is which tools do what:

So basically, CrystalDiskInfo and HD Tune are my first line of checking for drive issues, followed by smartmontools to get text output, then by vendor specific tools to assist with the RMA.

In the past, I used another smartmontools wrapper, but it was discontinued and had an even older version than GSmartControl: Source: Closed: HDD Guardian – Home.

On Intel 600p becoming locked in read-only mode after failure:

Start of Intel RMA procedure via [Wayback] Warranty Information.

My case looks remarkably similar to [Wayback] Full Diagnostic Scan always fails during Read Scan on my SSD 600p Series 256GB – Intel Community.

A few screenshots of the tools I used for health information:

Posted in Hardware, NVMe, Power User, SSD, WayBack machine | Leave a Comment »

Overview of Client Libraries · Internet Archive

Posted by jpluimers on 2021/09/14

Besides manual upload at [Archive.is] Upload to Internet Archive, there are also automated ways of uploading content.

One day I need this to archive pages or sites into the WayBack machine: [WayBack] Overview of Client Libraries · Internet Archive (most of which are Python based):

Overview of Client Libraries


The Internet Archive and its community have developed several tools to give developers more control over the Archive’s content and services:

`internetarchive` Command Line Tool (Python, CLI)


The internetarchive tool by Jake enables programmatic access to archive.org item metadata and bulk upload of content to the Internet Archive.

Download instructions are available at [WayBack] https://github.com/jjjake/internetarchive/

Read documentation at [WayBack] https://internetarchive.readthedocs.io/en/latest/
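To get a feel for what "programmatic access to item metadata" looks like, here is a minimal sketch that uses only the Python standard library. It is not the `internetarchive` tool's API; it assumes the public `https://archive.org/metadata/<identifier>` endpoint returns JSON, and the helper names `metadata_url` and `fetch_metadata` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public archive.org metadata endpoint, one JSON document per item.
METADATA_ENDPOINT = "https://archive.org/metadata/"


def metadata_url(identifier):
    """Build the metadata URL for an archive.org item identifier."""
    return METADATA_ENDPOINT + urllib.parse.quote(identifier, safe="")


def fetch_metadata(identifier, timeout=30):
    """Download and decode the item metadata (needs network access)."""
    with urllib.request.urlopen(metadata_url(identifier), timeout=timeout) as response:
        return json.load(response)
```

For anything beyond reading metadata (uploads, searching, modifying metadata) the `internetarchive` tool itself is the better route, since it handles the credentials for you.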

`openlibrary-client` Client Library (Python, CLI)


The openlibrary-client is the equivalent of the internetarchive tool for OpenLibrary. It provides developers with programmatic access to Book edition and author metadata, as well as the ability to create new works.

Download instructions and documentation are available at [WayBack] https://github.com/internetarchive/openlibrary-client

`warc` Client Library (Python)


WARC (Web ARChive) is a file format for storing web crawls (learn more at: [WayBack] http://bibnum.bnf.fr/WARC)

This warc library makes it very easy to work with WARC files.

Download instructions and documentation are available at [WayBack] https://github.com/internetarchive/warc
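To illustrate what a WARC file looks like on the inside, here is a rough sketch that parses one uncompressed record by hand. It does not use the `warc` library. It assumes the usual record layout (a `WARC/x.y` version line, named header fields, a blank line, `Content-Length` bytes of payload, then two CRLFs) and skips details such as gzip-compressed records and folded header lines. The function name `parse_warc_record` is hypothetical.

```python
def parse_warc_record(data):
    """Parse the first uncompressed WARC record found in `data` (bytes).

    Returns (version, headers, payload, rest), where `rest` holds the bytes
    following this record so the function can be called again on it.
    """
    head, separator, remainder = data.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line terminating the WARC header block")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    payload = remainder[:length]
    if len(payload) != length:
        raise ValueError("record is truncated")
    rest = remainder[length:]
    # A record is followed by two CRLFs; drop them when present.
    if rest.startswith(b"\r\n\r\n"):
        rest = rest[4:]
    return version, headers, payload, rest
```

Header names are stored exactly as written in the record; a real reader would compare them case-insensitively.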

Related:

Via: [WayBack] Uploading to the Internet Archive – Archiveteam

Uploading to archive.org

[WayBack] Upload any content you manage to preserve! Registering takes a minute.

Tools

There are three main methods to upload items to Internet Archive programmatically:

Don’t use FTP upload, try to keep your items below 400 GiB size, add plenty of metadata.

Wayback machine save page now

Many scripts have been written to use the live proxy:

Torrent upload

Torrent upload, useful if you need to resume (for huge files or because your bandwidth is insufficient for upload in one go):

  • Just create the item, make a torrent with your files in it, name it like the item, and upload it to the item.
  • archive.org will connect to you and other peers via a Transmission daemon and keep downloading all the contents till done;
  • For a command line tool you can use e.g. mktorrent or buildtorrent, example: mktorrent -a udp://tracker.publicbt.com:80/announce -a udp://tracker.openbittorrent.com:80 -a udp://tracker.ccc.de:80 -a udp://tracker.istole.it:80 -a http://tracker.publicbt.com:80/announce -a http://tracker.openbittorrent.com/announce "DIRECTORYTOUPLOAD" ;
  • You can then seed the torrent with one of the many graphical clients (e.g. Transmission) or on the command line (Transmission and rtorrent are the most popular; btdownloadcurses reportedly doesn’t work with udp trackers.)
  • archive.org will stop the download if the torrent stalls for some time and add a file to your item called “resume.tar.gz”, which contains whatever data was downloaded. To resume, delete the empty file called IDENTIFIER_torrent.txt; then, resume the download by re-deriving the item (you can do that from the Item Manager.) Make sure that there are online peers with the data before re-deriving and don’t delete the torrent file from the item.
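When scripting this, the `mktorrent` call from the example above can be assembled as an argument list, which avoids quoting trouble with directory names. A small sketch, assuming only the `-a` option shown in that example; the helper name `mktorrent_command` is hypothetical.

```python
def mktorrent_command(directory, trackers):
    """Build the argument list for mktorrent: one -a per tracker, then the directory."""
    command = ["mktorrent"]
    for tracker in trackers:
        command += ["-a", tracker]
    command.append(directory)
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine that has `mktorrent` installed.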

Formats

Formats: anything, but:

  • Sites should be uploaded in [WayBack] WARC format;
  • Audio, video, [WayBack] books and other prints are supported from [WayBack] a number of formats;
  • For .tar and .zip files archive.org offers an online browser to search and download the specific files one needs, so you probably want to use either unless you have good reasons (e.g. if 7z or bzip2 reduce the size tenfold).

This [WayBack] unofficial documentation page explains several of the special files found in every item.

Upload speed

Quite often, it’s hard to use your full bandwidth to/from the Internet Archive, which can be frustrating. The bottleneck may be temporary (check the current [WayBack] network speed and [WayBack] s3 errors) but also persistent, especially if your network is far (e.g. transatlantic connections).

If your connection is slow or unreliable and you’re trying to upload a lot of data, it’s strongly recommended to use the bittorrent method (see above).

Some users with Gigabit upstream links or more, on common GNU/Linux operating systems (such as [WayBack] Alpine), have had some success in increasing their upload speed by using more memory on [WayBack] TCP congestion control and telling the kernel to live with higher latency and lower responsiveness, as in this example:

# sysctl net.core.rmem_default=8388608 net.core.rmem_max=8388608 net.ipv4.tcp_rmem="32768 131072 8388608" net.core.wmem_default=8388608 net.core.wmem_max=8388608 net.ipv4.tcp_wmem="32768 131072 8388608" net.core.default_qdisc=fq net.ipv4.tcp_congestion_control=bbr
# sysctl kernel.sched_min_granularity_ns=1000000000 kernel.sched_latency_ns=1000000000 kernel.sched_migration_cost_ns=2147483647 kernel.sched_rr_timeslice_ms=100 kernel.sched_wakeup_granularity_ns=1000000000
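Before applying one-liners like these, it helps to see which settings they touch. Here is a small sketch that splits such a command line into key/value pairs and maps each key to its file under `/proc/sys`. It assumes the leading `#` is a root prompt and that keys map to paths by simply replacing dots with slashes; the helper names are hypothetical.

```python
import shlex


def parse_sysctl_command(line):
    """Return {key: value} for a 'sysctl key=value ...' command line."""
    # Drop a leading root prompt ("# ") before tokenising shell-style.
    tokens = shlex.split(line.lstrip("# "))
    if not tokens or tokens[0] != "sysctl":
        raise ValueError("expected a sysctl command")
    settings = {}
    for token in tokens[1:]:
        key, separator, value = token.partition("=")
        if not separator:
            raise ValueError(f"not a key=value assignment: {token!r}")
        settings[key] = value
    return settings


def proc_path(key):
    """Map a sysctl key to its /proc/sys path (simplification: dots become slashes)."""
    return "/proc/sys/" + key.replace(".", "/")
```

Reading the file returned by `proc_path` for each key first gives you the current values, so you can restore them if the tuning does not help.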

–jeroen

Posted in Development, Internet, InternetArchive, Power User, Python, Scripting, Software Development, WayBack machine | Leave a Comment »

GitHub – jjjake/internetarchive: A Python and Command-Line Interface to Archive.org

Posted by jpluimers on 2021/06/16

On my list of things to play with: [WayBack] GitHub – jjjake/internetarchive: A Python and Command-Line Interface to Archive.org.

Via:

Related:

  • [WayBack] The Internet Archive Python Library — Internet Archive item APIs 1.8.5 documentation
  • [WayBack] Command-Line Interface — Internet Archive item APIs 1.8.5 documentation
  • [WayBack] Quickstart — Internet Archive item APIs 1.8.5 documentation, including:

    Configuring

    Certain functionality of the internetarchive Python library requires your archive.org credentials. Your IA-S3 keys are required for uploading, searching, and modifying metadata, and your archive.org logged-in cookies are required for downloading access-restricted content and viewing your task history. To automatically create a config file with your archive.org credentials, you can use the ia command-line tool:

    $ ia configure
    Enter your archive.org credentials below to configure 'ia'.
    
    Email address: user@example.com
    Password:
    
    Config saved to: /home/user/.config/ia.ini
    

    Your config file will be saved to $HOME/.config/ia.ini, or $HOME/.ia if you do not have a .config directory in $HOME. Alternatively, you can specify your own path to save the config to via ia --config-file '~/.ia-custom-config' configure.

    If you have a netrc file with your archive.org credentials in it, you can simply run ia configure --netrc. Note that Python’s netrc library does not currently support passphrases, or passwords with spaces in them, and these are therefore not currently supported here.
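To check from a script which of the two locations mentioned above is in use, something like the following sketch would do. It only mirrors the two paths from the quoted documentation (`$HOME/.config/ia.ini`, then `$HOME/.ia`); the library itself may look in other places too, and the function name `find_ia_config` is hypothetical.

```python
from pathlib import Path


def find_ia_config(home):
    """Return the first existing ia config file under `home`, or None."""
    home = Path(home)
    candidates = [home / ".config" / "ia.ini", home / ".ia"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Typical use would be `find_ia_config(Path.home())`; a result of `None` suggests `ia configure` has not been run yet for this user.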

–jeroen

Posted in Development, Internet, InternetArchive, Power User, Python, Scripting, Software Development, WayBack machine | Leave a Comment »

Check if this still happens: some Twitter content in the WayBack machine gets a slash in the URL removed during rendering on Chrome

Posted by jpluimers on 2021/06/11

From my research list; check if this still happens: [WayBack] Saving Twitter content in the WayBack archive: the fully loaded page has a wrong trailing URL (missing the second slash before the authority) · GitHub

  1. Visited https://twitter.com/MarkGraham
  2. Saved it using https://web.archive.org/save/https://twitter.com/MarkGraham
  3. Waited for the save to complete and the page to fully load and got https://web.archive.org/web/20190607081047/https:/twitter.com/MarkGraham
  4. Observed that the trailing part, https:/twitter.com/MarkGraham, is no longer a valid URL: it is missing the second slash before the authority (see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax)
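The steps above can be sketched in Python to detect and repair the mangled URL after the fact. This is only an illustration: it assumes a plain 14-digit timestamp in the WayBack URL (no modifiers), and the helper names `save_url`, `split_wayback_url` and `repair_scheme` are hypothetical.

```python
import re

SAVE_PREFIX = "https://web.archive.org/save/"

# Matches "http:/" or "https:/" that is NOT followed by a second slash.
_SINGLE_SLASH = re.compile(r"^(https?):/(?!/)")
_WAYBACK = re.compile(r"^https://web\.archive\.org/web/(\d{14})/(.+)$")


def save_url(url):
    """Build the 'save page now' URL used in step 2."""
    return SAVE_PREFIX + url


def split_wayback_url(wayback_url):
    """Split a WayBack URL into (timestamp, trailing original URL)."""
    match = _WAYBACK.match(wayback_url)
    if match is None:
        raise ValueError(f"not a recognised WayBack URL: {wayback_url!r}")
    return match.group(1), match.group(2)


def repair_scheme(url):
    """Restore the second slash before the authority when it is missing."""
    return _SINGLE_SLASH.sub(r"\1://", url)
```

Running `split_wayback_url` on the URL from step 3 and passing the trailing part through `repair_scheme` gives back the original Twitter URL; URLs that already have both slashes pass through unchanged.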

This might be a Twitter.com thing:

Notes:

  • I have only tested this with my Chrome configurations on various machines (both regular and anonymous tabs) over at least a year; I need to figure out what happens when using different browsers.
  • It does not always happen.

Via: [WayBack] Jeroen Pluimers on Twitter: “I understand that the sites themselves play a big role in this. That’s why I have the mangling of URLs that sometimes happens on my research list. I made this quick summary: …”

–jeroen

Posted in Internet, InternetArchive, Power User, SocialMedia, Twitter, WayBack machine | Leave a Comment »

 