run --rm [2020-08-11]
ok, need to start without the pdf, screenshot etc… takes too long [2020-08-05]
Release Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index · pirate/ArchiveBox [2019-04-06]
should run it after I normalise all the wereyouhere links? [2019-04-16]
ok, he’s working on django backend where we can use hashes https://github.com/pirate/ArchiveBox/issues/74 [2019-04-16]
pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… [2019-04-16]
pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… [2019-04-16]
[pirate/ArchiveBox] Bugfixes, new data integrity and invariant checks, remove title prefetching [2018-10-03]
kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager
[2019-12-20]
Web Archiving Community · pirate/ArchiveBox Wiki [[linkrot]][2019-12-11]
Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 | Hacker News [2020-05-28]
site-deaths - IndieWeb [[linkrot]][2019-04-19]
Bret Victor on Twitter: "60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade." [[linkrot]][2019-06-13]
Time Travel: Find Mementos in Internet Archive, Archive-It, British Library, archive.today, GitHub and many more! http://timetravel.mementoweb.org/ [[search]] [[linkrot]][2019-12-22]
This Page is Designed to Last [[linkrot]][2019-07-08]
Fund: On-Demand Web Archiving of Annotated Pages – Hypothesis https://web.hypothes.is/blog/fund-on-demand-web-archiving-of-annotated-pages/ [[linkrot]][2020-03-06]
Archiving URLs | Hacker News https://news.ycombinator.com/item?id=6504331 [2021-02-25]
Wikipedia:Database download - Wikipedia [[wikipedia]][2021-02-25]
Wikipedia:Database download - Wikipedia [2021-02-25]
Main Page - Kiwix [[prepping]] [[wikipedia]][2021-02-25]
jeharu comments on The full English Wikipedia on Kiwix now weighs 79Gb instead of 94Gb thanks to improvements in image compression [[kiwix]] [[prepping]][2021-02-25]
Wikipedia:Database download - Wikipedia [2021-03-26]
AdGuardHome/whotracksme.json at master · AdguardTeam/AdGuardHome [[archivebox]]
The most ambitious & total approach to local caching is to set up a proxy to do your browsing through, and record literally all your web traffic; for example, using Live Archiving Proxy (LAP) or WarcProxy which will save as WARC files every page you visit through it. (Zachary Vance explains how to set up a local HTTPS certificate to MITM your HTTPS browsing as well.)
One may be reluctant to go this far, and prefer something lighter-weight, such as periodically extracting a list of visited URLs from one’s web browser and then attempting to archive them.
[2018-11-05]
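A rough sketch of the lighter-weight option from the quote: pull visited URLs out of a browser history database and hand the list to whatever archiver is in use. It assumes a Firefox-style `places.sqlite` with a `moz_places` table (`url`, `last_visit_date` in microseconds since epoch); the function name and the idea of working on a copy of the database are my assumptions, not something from the quoted text.

```python
import sqlite3
from pathlib import Path


def visited_urls(places_db: Path, since_us: int = 0) -> list[str]:
    """Return http(s) URLs visited after `since_us`, oldest visit first.

    Assumes a Firefox-like schema: moz_places(url, last_visit_date).
    """
    # Open read-only so the history database is never modified.
    uri = places_db.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT url FROM moz_places "
            "WHERE last_visit_date > ? AND url LIKE 'http%' "
            "ORDER BY last_visit_date",
            (since_us,),
        ).fetchall()
    finally:
        conn.close()
    return [url for (url,) in rows]
```

Running it periodically with `since_us` set to the time of the previous run would give only the new URLs; the output can be written one URL per line and fed to the archiver.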
just backup everything you can find in promnesia? [[promnesia]]
The tool I’m currently using, very decent https://github.com/ArchiveBox/ArchiveBox#readme
[√] 2020-08-11 01:33:33 Update of 252 pages complete (146.68 min)
- 0 links skipped
- 228 links updated
- 24 links had errors
...
535M ./1597100812.87
609M ./1597100812.31
757M ./1597100812.221
1.1G ./1597100812.173
8.5G .
Ok, and second run the next day said it’s already added all of them to index. Nice!
run --rm
run archivebox init
run some export
run another export (potentially overlapping?, but with new urls)
it seems to fail…
[2020-08-11]
ok, need to start without the pdf, screenshot etc… takes too long
also make sure it’s possible to add pdfs as an afterthought?
[2020-08-05]
Release Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index · pirate/ArchiveBox
Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index
[2020-10-25]
would be nice to have parallel execution or something.. [2020-10-25]
hmm, if archiving is interrupted, how to carry on? apparently ‘archivebox update’? [2020-10-25]
ok, it fetches new data on config change when running update? that’s nice [2020-10-25]
media – could def download later/in parallel..
would be nice to mark different sources as well if possible?
I guess need promnesia provider. is it like my.links? [[hpi]]
move run script somewhere else; add ability to put output dir somewhere else
[2019-04-06]
should run it after I normalise all the wereyouhere links?
I guess filter out all suspicious ones, containing special characters?
[2019-04-16]
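One way the "filter out suspicious ones" idea could look. What counts as suspicious is entirely my assumption here: anything that is not http(s), has no host, or contains characters outside the RFC 3986 URL character set (spaces, angle brackets, quotes, raw non-ASCII). Note this would also flag unescaped Unicode URLs, which may be too strict.

```python
import string
from urllib.parse import urlsplit

# Characters permitted in a URL by RFC 3986 (unreserved + reserved + '%').
_ALLOWED = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")


def looks_suspicious(url: str) -> bool:
    """Heuristic check for links that should be skipped before archiving."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # e.g. malformed IPv6 host such as "http://[::1"
        return True
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return True
    # Check the raw string, since urlsplit may silently drop some characters.
    return any(ch not in _ALLOWED for ch in url)


def keep_clean(urls: list[str]) -> list[str]:
    """Return only the URLs that pass the heuristic, preserving order."""
    return [u for u in urls if not looks_suspicious(u)]
```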
ok, he’s working on django backend where we can use hashes https://github.com/pirate/ArchiveBox/issues/74 [2019-04-16]
pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… https://github.com/pirate/ArchiveBox/
Storage Requirements
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5 GB per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting FETCH_MEDIA=False to skip audio & video files.
[2019-04-16]
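Back-of-the-envelope arithmetic for the storage figure quoted above, next to my own run. The 5 GB per 1000 articles is the README's number; the 8.5G / 252 pages pair comes from the run logged earlier in these notes, and treating that whole 8.5G as belonging to those 252 pages is an assumption. Function names are made up for illustration.

```python
def estimate_gb(n_links: int, gb_per_1000: float = 5.0) -> float:
    """Projected disk usage in GB for n_links at a given rate per 1000 links."""
    return n_links * gb_per_1000 / 1000


def observed_gb_per_1000(total_gb: float, n_links: int) -> float:
    """Rate actually observed in a run, normalised to GB per 1000 links."""
    return total_gb / n_links * 1000


if __name__ == "__main__":
    # README's rule of thumb applied to a 10,000-link history export.
    print(estimate_gb(10_000))
    # Rate implied by my run with everything (pdf, screenshot, media) enabled.
    rate = observed_gb_per_1000(8.5, 252)
    print(round(rate, 1))
    # Same 10,000 links projected at my observed rate.
    print(round(estimate_gb(10_000, rate)))
```

If the assumption holds, my run is several times heavier than the README's rule of thumb, which fits with having all the extractors enabled.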
pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… https://github.com/pirate/ArchiveBox/
Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs).
[2019-04-16]
[pirate/ArchiveBox] Bugfixes, new data integrity and invariant checks, remove title prefetching
re-save index after archiving completes to update titles and urls
remove title prefetching in favor of new FETCH_TITLE archive method
e.g. wget -N -E -np -x -H -k -K -S --restrict-file-names=unix -p --user-agent="Bookmark Archiver" --no-check-certificate https://charlie-charlie.ru/breakfast
– about 150M
[2018-10-03]
kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager https://github.com/kanishka-linux/reminiscence
[2018-10-05]
wonder how it is different from my bookmark archiver?
https://twitter.com/gwern/status/1233112807253716992
@gwern: @karlicoss @thomas536 Not documented in there yet is my latest archiving tool: https://t.co/If2Ypw1T1M https://t.co/NLh23nrkrh Currently costs 20GB for 7,677 PDFs & self-contained single-file HTML mirrors.
[2019-12-20]
Web Archiving Community · pirate/ArchiveBox Wiki [[linkrot]] https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community
[2019-12-11]
Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 | Hacker News https://news.ycombinator.com/item?id=21737696
Even if we still had the Library of Alexandria, it may have shed zero light on the actual lives of citizens. Archiving content on the internet means capturing thousands of individual level perspectives and experiences. We don't know what will end up being important to historians 50 or 100 years from now. I would bet there are dozens if not hundreds of historians that would give anything for a record of their favorite time period that contains even a fraction of the amount of content today's archive efforts are storing.
[2020-05-28]
site-deaths - IndieWeb [[linkrot]][2019-04-19]
Bret Victor on Twitter: "60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade." [[linkrot]] https://twitter.com/worrydream/status/478087637031325697
[2019-06-13]
Time Travel: Find Mementos in Internet Archive, Archive-It, British Library, archive.today, GitHub and many more! http://timetravel.mementoweb.org/ [[search]] [[linkrot]][2019-12-22]
This Page is Designed to Last [[linkrot]] https://jeffhuang.com/designed_to_last/
[2019-07-08]
Fund: On-Demand Web Archiving of Annotated Pages – Hypothesis https://web.hypothes.is/blog/fund-on-demand-web-archiving-of-annotated-pages/ [[linkrot]][2020-03-06]
Archiving URLs | Hacker News https://news.ycombinator.com/item?id=6504331 [2021-02-25]
Wikipedia:Database download - Wikipedia [[wikipedia]]
pages-articles-multistream.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is approximately 18 GB compressed (expands to over 78 GB when decompressed).
[2021-02-25]
Wikipedia:Database download - Wikipedia
pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same xml contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing.
[2021-02-25]
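A small sketch of why multistream allows that: the file is many independent bz2 streams concatenated, so given a byte offset you can seek there and decompress just one stream. The demo below builds a tiny synthetic archive in memory rather than touching a real dump. As far as I understand, the real dump ships a companion index whose lines map `offset:page_id:title`, but treat that format and all names here as assumptions.

```python
import bz2
import io
from typing import BinaryIO


def read_stream(archive: BinaryIO, offset: int) -> bytes:
    """Decompress only the single bz2 stream that starts at `offset`."""
    archive.seek(offset)
    decomp = bz2.BZ2Decompressor()
    out = []
    # Feed chunks until this one stream ends; later streams are never decoded.
    while not decomp.eof:
        chunk = archive.read(64 * 1024)
        if not chunk:
            break
        out.append(decomp.decompress(chunk))
    return b"".join(out)


def build_demo() -> tuple[io.BytesIO, list[int]]:
    """Build a toy multistream archive and the start offset of each stream."""
    blocks = [b"<page>alpha</page>", b"<page>beta</page>", b"<page>gamma</page>"]
    buf = io.BytesIO()
    offsets = []
    for block in blocks:
        offsets.append(buf.tell())
        buf.write(bz2.compress(block))
    return buf, offsets


if __name__ == "__main__":
    archive, offsets = build_demo()
    print(read_stream(archive, offsets[1]))
```

Decompressing the whole file in one go still yields all the blocks back to back, which matches the note that both dump variants contain the same XML.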
Main Page - Kiwix [[prepping]] [[wikipedia]][2021-02-25]
jeharu comments on The full English Wikipedia on Kiwix now weighs 79Gb instead of 94Gb thanks to improvements in image compression [[kiwix]] [[prepping]]
jeharu (46 points, 2 months ago):
no support yet for incremental updating, right? bummer.
The_other_kiwix_guy (66 points, 2 months ago):
We've started working on a prototype but that'll take time and a lot more money than we have. Would not expect anything before another 2-3 years.
hm okay sad.. guess I can do a backup per year or smth for now
[2021-02-25]
Wikipedia:Database download - Wikipedia
The only downside to multistream is that it is marginally larger
or just have a special source for manual notes/exobrainy stuff and another one for the rest?
https://github.com/ArchiveBox/ArchiveBox/issues/660
e.g. it archives medium-like stuff? https://archive.is/20181031123930/https://howwegettonext.com/exploring-the-future-without-cyberpunks-neon-and-noir-8e23562819e3
[2021-03-26]
AdGuardHome/whotracksme.json at master · AdguardTeam/AdGuardHome [[archivebox]]
could use this to prune?