Self-hosted metasearch engine protects against tracking from Google, Bing, etc

Date: Wed Jul 13 2022

Tags: Docker, Self Hosting

Are you worried that Google, Bing, and similar companies know too many details about you? These companies collect data about us so they can sell advertising targeted at us. This can be avoided using search tools that protect privacy. One of them, SearXNG, is an open source metasearch engine that strips identifying data from your queries before passing them to other search engines.

Traditional centralized search engines like Google or Bing are convenient. Their size means they can pull together a massive index of all websites. They have large teams of software engineers working on recognizing good quality content, on search indexing infrastructure, on understanding a variety of human languages, and so on. It's no wonder that the main search engines are so popular: they're very useful and provide a lot of value, for free.

As they say, if you're not paying for the service, then you're the product.

But, there are a number of problems with the main search engines. Some issues I have are:

You might say "Just use DuckDuckGo", because DuckDuckGo promises it doesn't track who you are. Indeed, I've been using DDG for years as my primary search engine. I am happy with the results it gives, and mostly believe that DDG truly does not track what its users are doing.

But recently I found that DDG (and Bing) weren't indexing a couple of my websites. While working on rectifying that issue, I learned that DDG gets a portion of its search results data from Bing. DDG uses several search data sources, with Bing being its primary source. As a result, folks giving advice on getting your site indexed in DuckDuckGo say to first get it indexed in Bing, and then it will automatically show up in DDG.

I mention this because I did not know that DDG used Bing search data, and it raises a question: is Bing collecting any data about our searches via DDG? Further, Bing provides search data to several other search engines.

SearXNG - the privacy focused alternative to other search engines

While looking for alternatives, I came across a self-hosted search engine option made by people who are seriously and deeply interested in personal privacy. Namely, SearXNG: https://docs.searxng.org/

SearXNG is derived from another project, SearX. It is what's called a "metasearch engine", meaning it doesn't itself index the Internet, but instead sends search queries out to several other search engines. It then brings those results together, presenting them in a unified search result.

This looks like search engine results from other search engines, but there are differences. Notice that each search result listing contains a list of search engines in which the result was found. That's the "metasearch" part of the results. The tabs across the top are also a nice touch. For instance, instead of your video searches being limited to YouTube, they can equally give results from other video websites (Odysee etc).

What is a Metasearch Engine?

I hadn't heard of metasearch engines until finding SearXNG, and I bet you haven't either. To understand why we should install one, we should understand what metasearch engines do. We've just seen an example of a metasearch engine, so let's try to create a definition.

A metasearch engine is an online information source using the data of one or more other search engines to produce its results.

In other words, a metasearch engine sends your query to one or more other search engines, aggregates the search results, ranks them based on the rankings from those search engines, and presents the results in its own format.
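To make that definition concrete, here is a minimal Python sketch of the aggregation step. It is not SearXNG's actual ranking code. The function name merge_results, the scoring rule (each engine contributes 1/position for a result), and the sample engine names and URLs are all assumptions made up for illustration.

```python
from collections import defaultdict


def merge_results(per_engine):
    """Merge ranked URL lists from several engines into one ranked list.

    per_engine maps an engine name to its result URLs, best result first.
    Returns a list of (url, engines_that_found_it) tuples, best first.
    """
    scores = defaultdict(float)
    found_in = defaultdict(list)
    for engine, urls in per_engine.items():
        for position, url in enumerate(urls, start=1):
            # Assumed scoring rule: a higher rank in an engine contributes more.
            scores[url] += 1.0 / position
            found_in[url].append(engine)
    # Highest combined score first; ties are broken by URL for stable output.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return [(url, sorted(found_in[url])) for url in ranked]


# Hypothetical sample data standing in for real engine responses.
sample = {
    "engine_a": ["https://example.com/a", "https://example.com/b"],
    "engine_b": ["https://example.com/b", "https://example.com/c"],
}
for url, engines in merge_results(sample):
    print(url, engines)
```

A result found by both engines rises to the top, and each result carries the list of engines that returned it, which is the same kind of information SearXNG shows next to each listing.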

There are many examples of commercial metasearch engines. Two prominent ones are DogPile and Metacrawler, and I believe travel sites like Kayak are metasearch engines under the hood.

How is SearXNG better at protecting privacy?

Okay, you're thinking, SearXNG does search queries on other search engines. How does that differ from DuckDuckGo, which also uses other search engines to satisfy search queries? The answer is that SearXNG strictly anonymizes the queries it sends to other search engines.

The SearXNG documentation says this:

  • removal of private data from requests going to search services
  • not forwarding anything from a third party services through search services (e.g. advertisement)
  • removal of private data from requests going to the result pages

It does this by not sending cookies to the search engines, and generating a random browser profile for each request. This prevents the search engine from tracking the person who made the query.
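The idea can be sketched in a few lines of Python. This is only an illustration of the technique described above, not SearXNG's implementation. The function name outgoing_headers, the short list of browser profiles, and the set of headers treated as identifying are all assumptions.

```python
import random

# Hypothetical pool of browser profiles; a real list would be much longer.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Firefox/102.0",
]

# Assumed set of headers that could identify the person who typed the query.
IDENTIFYING_HEADERS = {"cookie", "authorization", "referer", "x-forwarded-for", "user-agent"}


def outgoing_headers(incoming, rng=random):
    """Build headers for the request sent on to a search engine.

    Identifying headers from the user's browser are dropped, and the
    User-Agent is replaced with one picked at random for this request.
    """
    cleaned = {
        name: value
        for name, value in incoming.items()
        if name.lower() not in IDENTIFYING_HEADERS
    }
    cleaned["User-Agent"] = rng.choice(USER_AGENTS)
    return cleaned


browser_request = {
    "Cookie": "session=abc123",
    "User-Agent": "MyRealBrowser/1.0",
    "Accept-Language": "en-US",
}
print(outgoing_headers(browser_request))
```

Because the cookie never leaves your server and the browser profile changes from request to request, the upstream search engine has little to tie one query to the next.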

Because it is open source, we can verify these claims for ourselves by studying the source code. DuckDuckGo might be making a similar claim, but we cannot independently verify that they are doing so because DuckDuckGo is not open source.

Self-Hosting the SearXNG metasearch engine

Learning about how SearXNG protects privacy is what got me interested. After a couple weeks of using it, I'm fairly happy with the results. The only problem is that it often says "timeout" on querying a given search engine.

The simplest path to self-hosting SearXNG is with Docker. The SearXNG team provides a prebaked Docker Compose file that I used as a starting point.

See: https://github.com/searxng/searxng-docker

Their preferred deployment uses:

  • Caddy - a reverse proxy server with bundled support for getting SSL certificates from Let's Encrypt (I did not use Caddy in my deployment)
  • SearXNG - the metasearch engine
  • Redis - for caching data

On my server, I use NGINX Proxy Manager as a reverse proxy that manages SSL certificates from Let's Encrypt. Therefore, I did not include the Caddy portion of their Compose file.

Start by creating a directory to hold the deployment files. In that directory create another directory, searxng, and in it a file named settings.yml containing the following:

# see https://docs.searxng.org/admin/engines/settings.html#use-default-settings
use_default_settings: true
server:
  # base_url is defined in the SEARXNG_BASE_URL environment variable, see .env and docker-compose.yml
  secret_key: "ultrasecretkey"  # change this!
  limiter: true  # can be disabled for a private instance
  image_proxy: true
  method: "GET"
search:
  autocomplete: "duckduckgo"
general:
  instance_name: 'YOUR NAME HERE'
ui:
  static_use_hash: true
redis:
  url: redis://redis:6379/0

This file comes from the searxng-docker repository, with a couple small changes.

  • Setting method to GET fixes a behavior issue
  • The autocomplete setting lets you specify where autocomplete suggestions come from.
  • The instance_name setting is a first step to updating the branding to include your name

For the super secret key, you can use any string you like. It appears I ran a command like this:

$ cat searxng/settings.yml | md5sum
f29bdcafbb3d43e84e85c851459c61e1

# Or, this
$ uuid | md5sum
020a52f6b234e33eb0d4924a7b695683

Whatever way you prefer to generate a randomized string is good.
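If you would rather not depend on hashing a file (the hash of a file on disk is the same every time you compute it), one option is Python's standard secrets module, which is designed for generating unpredictable values. The function name make_secret_key and the 32-byte length are my own choices for this sketch, not something the SearXNG documentation requires.

```python
import secrets


def make_secret_key(num_bytes=32):
    """Return a random hexadecimal string (two hex characters per byte)."""
    return secrets.token_hex(num_bytes)


# Paste the printed value into settings.yml as the secret_key.
print(make_secret_key())
```

Each run prints a different 64-character string.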

Next, create a file named docker-compose.yml containing this:

version: '3.7'

services:

  redis:
    container_name: redis
    image: "redis:alpine"
    command: redis-server --save "" --appendonly "no"
    networks:
      - searxng
    tmpfs:
      - /var/lib/redis
    cap_drop:
      - ALL
    cap_add:
      - SETGID
      - SETUID
      - DAC_OVERRIDE

  searxng:
    container_name: searxng
    image: searxng/searxng:latest
    networks:
      - searxng
    ports:
      - "8080:8080"
    volumes:
      - ./searxng:/etc/searxng:rw
    environment:
      - SEARXNG_BASE_URL=https://${SEARXNG_HOSTNAME:-localhost}/
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - SETGID
      - SETUID
      - DAC_OVERRIDE
    # logging:
    #   driver: "json-file"
    #   options:
    #     max-size: "1m"
    #     max-file: "1"

networks:
  searxng:
    external: true
#    ipam:
#      driver: default

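This also comes from the searxng-docker repository with some changes. The Caddy service is dropped completely. The searxng network is created externally because of how I manage virtual networks in my Docker installation. Because the Compose file marks it as external, the network must already exist before the services are started, for example by running docker network create searxng.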

The volume mount for the searxng directory mounts the configuration file discussed earlier into the container.

Notice that the SEARXNG_BASE_URL variable is set from an environment variable, SEARXNG_HOSTNAME. To set this as intended, create a file named .env containing:

SEARXNG_HOSTNAME=searx.DOMAIN-NAME.com
# LETSENCRYPT_EMAIL=<email>

This will inform SearXNG what to expect as its domain name. The LETSENCRYPT_EMAIL variable is used with Caddy, if you use that.
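The ${SEARXNG_HOSTNAME:-localhost} expression in the Compose file means "use SEARXNG_HOSTNAME, or localhost if it is unset or empty". The following Python sketch mimics that substitution so you can see what base URL results from a given .env file. The function name base_url is hypothetical, and a plain dictionary stands in for the environment.

```python
def base_url(env):
    """Mimic https://${SEARXNG_HOSTNAME:-localhost}/ from the Compose file."""
    # The ":-" form falls back to the default when the variable is unset or empty.
    hostname = env.get("SEARXNG_HOSTNAME") or "localhost"
    return f"https://{hostname}/"


print(base_url({"SEARXNG_HOSTNAME": "searx.DOMAIN-NAME.com"}))  # https://searx.DOMAIN-NAME.com/
print(base_url({}))                                             # https://localhost/
```

In other words, if you forget to create the .env file, SearXNG will be told its address is https://localhost/.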

I don't understand why it should use a .env file, since these values can easily be coded in the Compose file. And, for that matter, the base URL can be set in settings.yml in the server.base_url value. That is, in the settings.yml shown above, replace the comment starting with base_url with base_url: "https://searx.DOMAIN".

Deploying the SearXNG metasearch engine

That's it for preparation. The files discussed in the previous section must be deployed on a host where you have Docker installed.

Upon running docker compose up (or docker-compose up), the two services will come up and you'll be able to use the search engine at http://DOMAIN-NAME:8080.

Notice I said docker compose rather than docker-compose. The latter is the traditional way of launching a Compose file. But, sometime in the last year or so the Docker team ported the functionality of docker-compose into the docker command such that we can now run docker compose (notice - no dash) instead.

In my case, I'm using NGINX Proxy Manager, and therefore created a proxy host in its configuration. This proxy connects to http://searxng:8080 on the backend, and is visible as https://searx.DOMAIN on the front end. Provisioning SSL certificates from Let's Encrypt is trivially easy while setting up the reverse proxy.

Using the SearXNG metasearch engine

If you've used a certain search engine whose name starts with G you know how to use SearXNG.

This is the user interface. Do I really need to explain what to do?

An interesting place to look is the settings area, reached via the gear icon in the upper-right corner. You can select the precise search engines being used, and a bunch of other parameters. It's very cool.

Another cool resource is the documentation, particularly for the settings file (docs.searxng.org).

You can easily customize the list of search engines being used. Because it's open source, you can integrate other search engines by writing code, then configuring them in the settings file.

There is a tremendous amount of flexibility.

One example is searching data you have in your own SQL database. An engine entry for that in the settings file looks something like this:

- name: my_database
  engine: postgresql
  database: my_database
  username: searxng
  password: password
  query_str: 'SELECT * from my_table WHERE my_column = %(query)s'

In other words, it's possible to use this to build a highly customized search engine.
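The %(query)s in query_str is a named placeholder of the kind used by Python PostgreSQL drivers: as I understand it, the search text is handed to the database driver as a parameter rather than pasted into the SQL string. The sketch below shows the same idea using Python's built-in sqlite3 module, so that it runs without a PostgreSQL server. Note that sqlite3 spells the placeholder :query instead of %(query)s, and the table contents and the search function are made up for illustration.

```python
import sqlite3

# An in-memory database standing in for your own SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_column TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("docker", "container notes"), ("nginx", "proxy notes")],
)


def search(conn, query):
    """Run the lookup with the search text bound as a named parameter."""
    # sqlite3 writes the named placeholder as :query;
    # PostgreSQL drivers such as psycopg2 write it as %(query)s.
    cur = conn.execute(
        "SELECT * FROM my_table WHERE my_column = :query",
        {"query": query},
    )
    return cur.fetchall()


print(search(conn, "docker"))        # [('docker', 'container notes')]
print(search(conn, "x' OR '1'='1"))  # []
```

The second call shows why binding matters: text that looks like SQL is treated as an ordinary value to compare against, so it matches nothing.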

Adding your SearXNG instance as a default search engine in Chrome web browser

If you're happy with the results SearXNG gives, you can set up your web browser to use your SearXNG instance. We'll focus on doing this in Google Chrome, but you should be able to do this in other browsers like Firefox.

Open the Settings area, and navigate to Search engine. You'll see a button for Manage search engines and site search.

What's required is to add your SearXNG instance as a custom entry in this list, and then to select that as the default search engine.

There will be an Add button, so click on that.

For Search engine give it a user-friendly name, and in Shortcut give a shorter user-friendly name. For the URL use https://searx.DOMAIN/search?q=%s and then click Add.

The custom search engine will be added to the list. At the far right-hand side is a button that pops up a menu, and one of the choices is Make default. Click on that to make it your default search engine.

With that accomplished, any search performed via the browser location box will go through your SearXNG instance.
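The %s in that URL is a placeholder that the browser replaces with whatever you type. Here is a rough Python sketch of that substitution, useful for checking what URL your instance will receive. The function name build_search_url is hypothetical, and the exact encoding a given browser applies may differ slightly from quote_plus.

```python
from urllib.parse import quote_plus


def build_search_url(template, query):
    """Replace the %s placeholder with the URL-encoded search text."""
    return template.replace("%s", quote_plus(query))


print(build_search_url("https://searx.DOMAIN/search?q=%s", "self hosted search"))
# https://searx.DOMAIN/search?q=self+hosted+search
```

You can paste a URL built this way into the location box to confirm that your instance answers before making it the default.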

SearXNG timeouts

The only trouble I have is this:

Every so often, instead of search results, it gives me timeouts like this. According to the documentation, the timeout threshold can be adjusted (see the request_timeout setting in the outgoing section of the settings file). I suspect the timeouts happen because my instance is hosted at home on a machine sitting behind my DSL router. An instance hosted on a regular web server should have better bandwidth to the Internet, and better response time to search queries.

Summary

SearXNG is a worthy alternative search engine. It strictly protects you from being tracked by regular search engines, while giving you search results from those search engines.

I make great use of search engines every day, and use them to find amazing resources. I've used SearXNG exclusively for about 2 weeks, and it has helped me find some amazing things. I didn't mean to overuse the word amazing, but what I mean is that I routinely find useful, relevant resources via regular search engines, and so far I've done the same with SearXNG.

About the Author(s)

David Herron (davidherron.com): David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.