9 hours ago by MisterTea
> I've wasted many an hour combing through Google and my search history to look up a good article, blog post, or just something I've seen before.
This is the fault of web browser vendors who have yet to give a damn about bookmarks.
> Apollo is a search engine and web crawler to digest your digital footprint. What this means is that you choose what to put in it. When you come across something that looks interesting, be it an article, blog post, website, whatever, you manually add it (with built in systems to make doing so easy).
So it's a searchable database for bookmarks then.
> The first thing you might notice is that the design is reminiscent of the old digital computer age, back in the Unix days. This is intentional for many reasons. In addition to paying homage to the greats of the past, this design makes me feel like I'm searching through something that is authentically my own. When I search for stuff, I genuinely feel like I'm travelling through the past.
This does not make any sense. It's Unix-like because it feels old? It seems like the author thoroughly misses the point of the unix philosophy.
9 hours ago by chris_st
> So it's a searchable database for bookmarks then.
It appears to be that, but it also seems to pull out the content of the web page and index that too, so you can (presumably) find stuff that isn't in the "pure" bookmark, which I think of as a link with maybe a title.
8 hours ago by nextaccountic
I think browsers should download a full copy of each bookmarked page (so you can still see it when it's taken down) and make it fully searchable.
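As a rough sketch of what "download a full copy of each bookmark" could mean, here is a plain-Python version (the helper names are mine, not any browser API; the fetch step is only illustrative):

```python
import json
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download the page body (requires network access)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, "replace")


def save_bookmark(url, title, html, store="~/Bookmarks"):
    """Keep the bookmark metadata and a full local copy side by side."""
    store = Path(store).expanduser()
    store.mkdir(parents=True, exist_ok=True)
    # Derive a filesystem-safe name from the URL.
    slug = "".join(c if c.isalnum() else "-" for c in url).strip("-")
    copy = store / f"{slug}.html"
    copy.write_text(html, encoding="utf-8")
    record = {"url": url, "title": title, "copy": copy.name}
    (store / f"{slug}.json").write_text(json.dumps(record), encoding="utf-8")
    return copy
```

With the copy on disk, the page survives link rot and any local search tool can index it.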
Actually, I've been trying to find Firefox extensions that give a better interface to bookmarks, and there doesn't seem to be one. It's as if people don't use bookmarks anymore, accept that the feature might as well not exist, and use something else instead.
It's telling that Firefox has two bookmark systems built in (Pocket and regular bookmarks) and they aren't integrated with each other; I suppose that people who use Pocket never think about regular bookmarks.
edit: but my pet peeve is that it isn't easy to search my history for something I saw 10 days ago when I don't remember the exact keywords to search for.
8 hours ago by phildenhoff
The difference, to me, about Pocket is that I use it specifically as a to-read list. My list is just "sites I want to visit/read/watch later", whereas bookmarks are more of "I want to go here regularly". Also, all the bookmark systems I've ever used treat links as files that can only be in one folder, whereas Pocket at least has tags so links can associate with multiple topics.
5 hours ago by cratermoon
> I think browsers should download a full copy of each bookmark
Have you tried Zotero?
3 hours ago by throwawayboise
In Firefox, File -> Save Page As... lets me do this. Local search tools should be able to index such archives (if they can index Word documents, they should be able to index HTML). Seems a fairly solved problem if it's something you need?
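Indexing such saved pages can indeed stay simple. A minimal sketch of local search over saved HTML, using only Python's standard library (the function names are mine):

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    p = _TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())


def search_archive(root, term):
    """Yield saved pages whose visible text contains `term` (case-insensitive)."""
    for path in sorted(Path(root).rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        if re.search(re.escape(term), text, re.IGNORECASE):
            yield path
```

This is essentially grep with tag stripping; a real tool would add ranking and an on-disk index.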
7 hours ago by asdff
Pocket isn't for bookmarks. It's a reading list. Safari and Chrome have this feature too.
2 hours ago by j1elo
An eternity ago I used a fantastic Firefox extension named "ScrapBook", which was like having your own internet archive: instead of just bookmarking a site, the extension would scrape and download the whole contents (or just sections) of the page you were visiting.
Its website still seems to be up to this day! http://www.xuldev.org/scrapbook/
This extension didn't survive the days of breaking API compatibility that Firefox went through (sigh). However, it seems some replicas exist that aim to provide the same or similar functionality, such as ScrapBee https://addons.mozilla.org/en-US/firefox/addon/scrapbee/
4 hours ago by 1vuio0pswjnm7
"It seems like the author thoroughly misses the point of the unix philosophy."
Almost as if "unix philosophy" might mean different things to different people.
"The first thing you might notice ..."
First thing I notice is this project is 100% tied to Google, what with Chrome and Go (even for SNOBOL pattern matching, sheesh).
"... this design makes me feel like I'm searching through something that is authentically my own."
Except it isn't. It shuns the use of freely available, open-source UNIX-like projects in favor of software belonging to a company that Hoovers up personal data and sells online ad services. Enjoy the illusion. :)
Life can be very comfortable inside the gilded cage.1 The Talosians will take good care of you.2
9 hours ago by stevekemp
I've been thinking recently it might be interesting/useful to write a simple SOCKS proxy which could be used by my browser.
The SOCKS proxy would not just fetch the content of the page(s) requested, but would also dump them to ~/Archive/$year/$month/$day/$domain/$id.html.
Of course I'd only want to archive text/plain and text/html, but it seems like it should be a simple thing to write and might be useful. Searching would be a simple matter of grep..
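The archiving half of that proxy (leaving the SOCKS plumbing aside) really is small. A hypothetical sketch in Python, following the `~/Archive/$year/$month/$day/$domain/$id.html` layout described above:

```python
import uuid
from datetime import date
from pathlib import Path
from urllib.parse import urlparse

# Only archive textual responses, as suggested above.
ARCHIVABLE = {"text/plain", "text/html"}


def should_archive(content_type):
    """Check the media type, ignoring parameters like charset."""
    return content_type.split(";")[0].strip().lower() in ARCHIVABLE


def archive_response(url, content_type, body, root="~/Archive"):
    """Dump a proxied response to root/$year/$month/$day/$domain/$id.html."""
    if not should_archive(content_type):
        return None
    d = date.today()
    dest = (Path(root).expanduser()
            / f"{d.year}" / f"{d.month:02}" / f"{d.day:02}"
            / (urlparse(url).netloc or "unknown"))
    dest.mkdir(parents=True, exist_ok=True)
    path = dest / f"{uuid.uuid4().hex}.html"
    path.write_bytes(body)
    return path
```

Searching is then, as the comment says, a simple matter of grep over the archive tree.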
8 hours ago by habibur
Did that. But then you find your disk quickly filling up with GBs of cached content that you rarely search.
Rather, when you need that same content, you find yourself going to Google, searching for it, and the page is instantly there unless it's been removed.
There's a reason why bookmarks aren't as popular as they used to be. People now use Google + keywords instead of bookmarks.
7 hours ago by kbenson
Maybe archive.org should run a subscription service where, for a few bucks a month, you can request that your page visits be archived (in a timely manner and with some level of assurance) and leverage their system for tracking content over time. That, in conjunction with something like Google, might give fairly good assurance that what you're searching for still exists in the state you saw it. It would also mean that 30 subscribers visiting the same blog today don't use significantly more resources to store the data, and it would help archive.org fulfill its mission.
8 hours ago by berkes
It would also miss all the pages that are built from AJAX requests on the client side, which nowadays is a large share of them. The client is the one assembling all the content into the thing you read, so it is the most likely candidate to offer the copy you want indexed.
8 hours ago by simonw
My version of this is https://dogsheep.github.io/ - the idea is to pull your digital footprint from various different sources (Twitter, Foursquare, GitHub etc) into SQLite database files, then run Datasette on top to explore them.
On top of that I built a search engine called Dogsheep Beta which builds a full-text search index across all of the different sources and lets you search in one place: https://github.com/dogsheep/dogsheep-beta
You can see a live demonstration of that search engine on the Datasette website: https://datasette.io/-/beta?q=dogsheep
The key difference I see with Apollo is that Dogsheep separates fetching of data from search and indexing, and uses SQLite as the storage format. I'm using a YAML configuration to define how the search index should work: https://github.com/simonw/datasette.io/blob/main/templates/d... - it defines SQL queries that can be used to build the index from other tables, plus HTML fragments for how those results should be displayed.
8 hours ago by gizdan
Wow! That's super cool. I will have to check this out at some point. Am I correct in understanding that the Pocket tool actually imports the URLs' contents? If not, how hard would it be to include the actual content of URLs? Specifically, I'll probably end up using something else (for me, Nextcloud Bookmarks).
8 hours ago by simonw
Sadly not - I'd love it to do that, but the Pocket API doesn't make that available.
I've been contemplating building an add-on for Dogsheep that can do this for any given URL (from Pocket or other sources) by shelling out to an archive script such as https://github.com/postlight/mercury-parser - I collected some suggestions for libraries to use here: https://twitter.com/simonw/status/1401656327869394945
That way you could save a URL using Pocket or browser bookmarks or Pinboard or anything else that I can extract saved URLs from, and a separate script could then archive the full contents for you.
4 hours ago by neolog
SingleFile and SingleFileZ are Chrome extensions that export full web pages pretty effectively.
https://chrome.google.com/webstore/detail/singlefile/mpiodij...
https://chrome.google.com/webstore/detail/singlefilez/offkdf...
3 hours ago by cxr
You might be interested in checking out Perkeep or Zotero.
8 hours ago by tomcam
Holy crap, you should submit this as a Show HN.
8 hours ago by simonw
It's failed to make the homepage a few times in the past: https://hn.algolia.com/?q=dogsheep - the one time it did make it was this one about Dogsheep Photos: https://news.ycombinator.com/item?id=23271053
6 hours ago by mosselman
Simon is not an unknown on HN.
9 hours ago by yunruse
I love this idea, but the name "digital footprint" sort of implies it's what effect you've had on the Internet for helping keep your online persona under control: your tweets, comments, emails, et cetera.
But this is a great idea! Having a search engine for vaguely _anything_ you touch very much does look like it'd increase the signal:noise ratio. It'd be interesting to be able to add whole sites (using, say, DuckDuckGo as an external crawler) to be able to fetch general ideas, such as, say, "Stack Exchange posts marked with these tags".
9 hours ago by flanbiscuit
> but the name "digital footprint" sort of implies it's what effect you've had on the Internet for helping keep your online persona under control: your tweets, comments, emails, et cetera.
I had the exact same thought when I saw that in the title. That would also be a cool idea to be able to search within your own online accounts.
So this is the project's description of what "digital footprint" means:
> Apollo is a search engine and web crawler to digest your digital footprint. What this means is that you choose what to put in it. When you come across something that looks interesting, be it an article, blog post, website, whatever, you manually add it (with built in systems to make doing so easy). If you always want to pull in data from a certain data source, like your notes or something else, you can do that too. This tackles one of the biggest problems of recall in search engines returning a lot of irrelevant information because with Apollo, the signal to noise ratio is very high. You've chosen exactly what to put in it.
If I'm interpreting this correctly, this seems like an alternative way of bookmarking with advanced searching, because it scrapes the data from the source. Cool idea; it means I have to worry less about organizing my bookmarks.
9 hours ago by SahAssar
Looks very much like one of the ideas I've been thinking of building! The way I planned to do it was to use a similar approach to rga for files ( https://github.com/phiresky/ripgrep-all ) and having a webextension to pull all webpages I vist (filtered via something like https://github.com/mozilla/readability ), dump that into either sqlite with FTS5 or postgres with FTS for search.
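The SQLite FTS5 half of that plan might look roughly like this (a sketch assuming an SQLite build with FTS5 enabled; the schema and function names are illustrative):

```python
import sqlite3


def build_index(db_path=":memory:"):
    """Create an FTS5 full-text index for saved pages."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
    return db


def add_page(db, url, title, body):
    """Index one page; `body` would come from a readability-style extractor."""
    db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, title, body))


def search(db, query):
    """Return matching URLs, best first, using FTS5's built-in bm25 ranking."""
    rows = db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", (query,))
    return [url for (url,) in rows]
```

The webextension side would just POST the readability-extracted text of each visited page into `add_page`.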
A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience. Pinboard and similar sites fill some of that role, but only for things that you bookmark (and I'm not sure they do full-text search of the documents).
---
Two things I'd mention are:
1. Digital footprint usually means your info on other sites, not just the things you've accessed. If I read a blog, that is not part of my footprint; but if I leave a comment on that blog, that comment is part of it. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint
2. I don't really get what makes it UNIX-style (or what exactly you mean by that? There seem to be many definitions), and the readme does not seem to clarify much besides expecting me to notice it by myself.
9 hours ago by eddieh
I've been toying with an idea like this too. I set my browser to never delete history items years ago, so I have a huge amount of daily web use that needs to be indexed. The browser's built in history search has saved me a few times, but it is so primitive it hurts.
7 hours ago by grae_QED
>I don't really get what makes it UNIX-style
I think what they meant was that it's an entirely text based program. Perhaps they are conflating UNIX with CLI.
3 hours ago by smusamashah
This sounds similar to Monocle https://github.com/thesephist/monocle
Demo: https://monocle.surge.sh/
Blog post explaining the motivation: https://thesephist.com/posts/monocle/
4 hours ago by jll29
Microsoft Research's Dr. Susan Dumais is the expert on this kind of personal information management.
Her landmark system (and associated seminal SIGIR'03 paper) "Stuff I've Seen" tackled re-finding material: http://susandumais.com/UMAP2009-DumaisKeynote_Share.pdf
5 hours ago by etherio
This is cool! Similar to one of the goals I'm trying to accomplish with Archivy (https://archivy.github.io) with the broader goal of not just storing your digital presence but also acting as a personal knowledge base.
9 hours ago by wydfre
It seems pretty cool - but I think falcon[0] is more practical. You can install it from the Chrome Web Store[1] if you are too lazy to get it running yourself.
[0]: https://github.com/lengstrom/falcon
[1]: https://chrome.google.com/webstore/detail/falcon/mmifbbohghe...
9 hours ago by grae_QED
Are there any Firefox equivalents to Falcon? I'm very interested in something like this.
9 hours ago by news_to_me
If it's a WebExtension, it's usually not too hard to port to Firefox (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...)
9 hours ago by nathan_phoenix
In the issues, someone says that it works in FF too; you just need to change the extension of the file. Though I haven't tried it yet.
https://github.com/lengstrom/falcon/issues/73#issuecomment-6...