This page is best viewed on a desktop or tablet computer.
If you have a small screen device it may offer you a choice to switch to a “simple” display. If you accept that choice it will be easier to read the text, but you may miss the images and the important links.
Opening view of the MetaQuest App →
● The Action Menu has one item - Help
● The seed website is https://www.example.com. This is displayed in an EditText so you can change it to whatever website you want.
● I have chosen the Celtic Dragon as the mascot for your journey… Celtic dragons, the most powerful of all the Celtic symbols, are creatures that protect the Earth and all living things.
The Bottom Menu has two items -
● MetaQuest
● Tutor
MetaQuest is the main crawling-scraping program.
Mick MultiMIPS@gmail.com |
|
The Help files are served from my website at https://mickwebsite.com
When you select Help, it will either start your default browser or, if you have more than one browser installed on your device, offer a menu similar to this →
The typical sequence of events is like this:
|
|
This is the first page of this Help file →
If you have a small screen device it may offer you a choice to switch to a “simple” display. If you accept that choice it will be easier to read the text. But you will miss all of the images and, worse, you will not get any of the links. |
|
This → is the Tutor menu. From time to time I will include additional items. Users of MetaQuest are invited to suggest topics for this menu 🙂
Mick
|
|
This → is the opening View of the MetaQuest menu item.
Note the navigation icon on the left side of the ToolBar [left-pointing arrow]. It works better to use this icon for exiting rather than the system icon along the bottom of your device.
There is one Action Menu item - Help. There are three Bottom Menu items - HTML SCRAPE, DEEP CRAWL and STACK. |
|
This → is the first page of this HELP menu… 🙂 |
|
The Stack menu → The Stack is central to the operation of Deep Crawling the internet with MetaQuest.
The stack data structure is also known as a PushDown Stack or a LIFO [Last In First Out] queue. Each item pushed onto the stack goes on top of the previous item, so the last item pushed onto the stack is the first one to be popped off. It works like a stack of spring-loaded lunch plates in a cafeteria.
The Stack is ideal for simulating a trail of breadcrumbs, left behind during a walk in the woods, to find your way back.
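To make the breadcrumb idea concrete, here is a minimal sketch in Python. It is an illustration only, not MetaQuest's own code: the variable names are hypothetical and the URLs are simply the sites visited later on this page (the exact form of the iana.org URL is assumed).

```python
# A Python list used as a LIFO stack of visited seed URLs (illustration only).
trail = []

# Push a breadcrumb each time we move on to a new seed URL.
for url in [
    "https://www.example.com",
    "https://www.iana.org",      # assumed form of the URL
    "https://www.icann.org",
]:
    trail.append(url)            # push: the newest item goes on top

# Pop to step back: the last URL pushed is the first one returned.
current = trail.pop()
print(current)                   # https://www.icann.org

# Peek at the top of the stack without removing it.
print(trail[-1])                 # https://www.iana.org
```

Popping repeatedly walks the trail backwards until you are back at the first seed URL.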
|
|
This → is the MetaQuest/DEEPCRAWL menu. This is the engine for internet crawling and HTML scraping.
The first 7 choices are numbered because this is the sequence in which they would usually be selected for internet crawling.
As an ethical web scraper you will select and study the robots.txt file after opening the seed URL and before downloading from the website.
You may select or skip each of the 7 items, in whatever sequence you choose, with two constraints: usually you must choose Open Seed URL first, and you must get the IPs before getting the corresponding geoLocations or the links.
API_KEY: Each user of this app needs their own free, personal API_KEY in order to access geoLocations from the internet. An illustrated set of steps for getting your free, personal API_KEY is here… https://mickwebsite.com/MMWebSite/ipgeolocation.html
You get 1000 free location lookups per day. That should be plenty for using this app. You can sign up for a paid account if you want more 🙂
MetaQuest separately keeps a count of the number of accesses and displays that count in a toast message when each access is started. MetaQuest resets this count to zero whenever it is started on a new day. You can check https://ipgeolocation.io at any time to see the official count. This is easier if you keep your browser running in the background with the ipgeolocation.io website open while using MetaQuest.
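The daily count described above can be sketched as follows. This is a hedged illustration in Python, not the app's implementation: the class name `DailyLookupCounter` and its methods are hypothetical, and the sketch resets the count on the first lookup of a new day, whereas MetaQuest resets it when the app is started on a new day.

```python
from datetime import date


class DailyLookupCounter:
    """Counts lookups and starts again from zero when the calendar day changes."""

    def __init__(self, daily_limit=1000):
        self.daily_limit = daily_limit   # 1000 free lookups per day, per the text
        self.count = 0
        self.day = None                  # day on which the current count started

    def record(self, today=None):
        """Register one lookup and return the running count for that day."""
        today = today or date.today()
        if today != self.day:            # first lookup of a new day: reset
            self.day = today
            self.count = 0
        self.count += 1
        return self.count

    def remaining(self):
        """Lookups left today according to this local count only."""
        return max(self.daily_limit - self.count, 0)


counter = DailyLookupCounter()
print(counter.record(date(2024, 1, 1)))  # 1
print(counter.record(date(2024, 1, 1)))  # 2
print(counter.remaining())               # 998
print(counter.record(date(2024, 1, 2)))  # 1  (new day, count restarted)
```

A local count like this is only an estimate; the official count is the one shown at ipgeolocation.io.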
|
|
This → is basic web scraping, extracting specific data items from the website. This is a long menu with many choices.
You are encouraged to try each of these choices on various websites to determine which may be useful to you. You could also use the Find choice [near the bottom of this menu] to see if there is additional information that can be retrieved. |
|
This → is still the Basic Web Scraping menu, showing it scrolled to bring the bottom menu items into view.
With the Find and Ping choices you must select the menu item twice: the first time to provide an element of data [a token for Find, or a URL for Ping], and the second time to start the Finding or Pinging process.
You can use Find to search the content of the website for items not listed here, or for more data on the items that are listed. For this purpose it might be best to first choose Clear Display and then Get HTML Code before choosing Find.
Serious, detailed technical information on these choices and more is available at… or in a tutorial here… https://www.javatpoint.com/jsoup-tutorial You can find other resources by using Google to search for jsoup. |
|
We have started our Quest at → https://www.example.com. As good web crawlers we begin by getting the robots.txt file. Note where it says FileNotFoundException. This website has no robots.txt file so we are free to crawl and scrape at will here 🙂 |
|
Next we selected Process robots.txt to confirm that we can go ahead → |
|
Here → we have selected Get IPs/Seed. This gets all* of the IPs for the seed URL. Note that it says that there are 2 IPs. The first is… 2606:2800:220:1:248:1893:25C8:1946
This is an IPv6 address. IPv6 is written in hexadecimal notation as 8 groups of 16 bits separated by colons, with leading zeros in each group omitted. Occasionally there is a double colon :: which has a specific meaning. Please do a Google search to learn about this.
To see the 2nd IP, swipe to the right on the IPv6…
*I have noticed that newer devices may retrieve more IP addresses than older devices do. |
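The IPv6 notation described above can be explored with Python's standard `ipaddress` module. This is a small sketch for illustration, not part of MetaQuest. The first address is the one shown above; the second, `2001:db8::1`, is not from this walkthrough but an address from the range reserved for documentation, used here only to show a double colon, which stands for one run of consecutive all-zero groups.

```python
import ipaddress

# The IPv6 address retrieved for the seed URL above.
addr = ipaddress.ip_address("2606:2800:220:1:248:1893:25C8:1946")
print(addr.version)     # 6
# The exploded form restores the omitted leading zeros in every group.
print(addr.exploded)    # 2606:2800:0220:0001:0248:1893:25c8:1946

# A documentation-range address (assumption, not from this walkthrough)
# showing how :: expands into the missing all-zero groups.
short = ipaddress.ip_address("2001:db8::1")
print(short.exploded)   # 2001:0db8:0000:0000:0000:0000:0000:0001
```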
|
This → shows the second IP. This is an IPv4 address, written as 4 groups of up to 3 decimal digits separated by dots, each group between 0 and 255, with leading zeros omitted.
Question 1: What is the maximum number of IPv4 addresses?
Question 2: What is the maximum number of IPv6 addresses?
If you are a math lover you may already know the answers, or you may love the challenge of doing the calculations. Otherwise Google knows everything! And you probably know why “they” created IPv6.
|
|
We have selected Get geoLocations/IPs → then Get Offsite Links/IP. One offsite link was found:
There is often, but not always, more than one IP for each seed URL, and possibly a different geoLocation for each IP. MetaQuest retrieves all [usually] of the IPs and their corresponding geoLocations.
As you swipe right on the current IP to see the next one, the corresponding geoLocation moves into its position. |
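For readers curious how offsite links might be picked out of a page, here is a minimal sketch using only Python's standard library. It is not MetaQuest's code (the app's own references point to jsoup). The names `LinkCollector` and `offsite_links` are hypothetical, the sample HTML is invented, and "offsite" is assumed here to mean an http or https link whose host name differs from the seed URL's host name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def offsite_links(seed_url, html):
    """Return unique http(s) links whose host differs from the seed URL's host."""
    collector = LinkCollector()
    collector.feed(html)
    seed_host = urlparse(seed_url).hostname
    found = []
    for href in collector.hrefs:
        absolute = urljoin(seed_url, href)   # resolve relative links
        parsed = urlparse(absolute)
        if (parsed.scheme in ("http", "https")
                and parsed.hostname
                and parsed.hostname != seed_host
                and absolute not in found):
            found.append(absolute)
    return found


# Invented sample page: one onsite link, one offsite link, one mail link.
page = (
    '<p><a href="/about">About</a> '
    '<a href="https://www.iana.org/">More information</a> '
    '<a href="mailto:someone@example.org">Mail</a></p>'
)
print(offsite_links("https://www.example.com", page))
# ['https://www.iana.org/']
```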
|
We Selected the text of the new URL and Copied it to the system clipboard → |
|
Now we have selected the old seed URL and will Paste the new one into its place → |
|
The new seed URL has been Pasted into place →
We then did Open, Get robots.txt. This time there is a robots.txt file [scroll your eyes to the bottom of the pic at right] which applies to all internet bots, spiders, crawlers, etc., including MetaQuest and its users, and which disallows nothing. So we can once again crawl and scrape to our hearts' content 🙂 |
|
This shows an IPv6 for iana.org → |
|
This shows an IPv4 for iana.org →
Each URL may have a variety of IPv4s and/or IPv6s. There is no rule about having certain quantities of each. |
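If you want to sort a mixed list of addresses into IPv4s and IPv6s yourself, the sketch below shows one way with Python's standard `ipaddress` module. It is an illustration only: the function name `split_by_version` is hypothetical, the IPv6 address is the one shown earlier for the first seed URL, and `192.0.2.10` is a documentation-range IPv4 address chosen for the example, not one retrieved in this walkthrough.

```python
import ipaddress


def split_by_version(addresses):
    """Split address strings into (IPv4 list, IPv6 list), in normalized form."""
    v4, v6 = [], []
    for text in addresses:
        ip = ipaddress.ip_address(text)   # raises ValueError if not an IP
        (v4 if ip.version == 4 else v6).append(str(ip))
    return v4, v6


v4, v6 = split_by_version([
    "2606:2800:220:1:248:1893:25C8:1946",
    "192.0.2.10",                         # assumed example address
])
print(v4)   # ['192.0.2.10']
print(v6)   # ['2606:2800:220:1:248:1893:25c8:1946']
```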
|
This → shows the 3 offsite links retrieved from iana.org
Note that they share the same IPv6 and the same geoLocation. |
|
Here → I am selecting the third offsite link which points to https://www.icann.org… |
|
I am prepared to paste the new link here → |
|
New link to icann.org is pasted; robots.txt was not found, so nothing is disallowed 🙂 → |
|
7 offsite links were found → |
|
… and there is a robots.txt which applies to all of us crawlers →
In order to better understand these files it would be useful to search for something like “how to read robots.txt”. Here is one link… https://www.seerinteractive.com/insights/how-to-read-robots-txt |
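As a complement to reading robots.txt by eye, here is a hedged sketch that checks a robots.txt text with Python's standard `urllib.robotparser`. It is not how MetaQuest processes the file. The helper `can_fetch`, the user-agent string "MetaQuest", the two sample files and the example.org URLs are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse text we already have
    return parser.can_fetch(user_agent, url)


# Applies to every crawler and disallows nothing.
ALLOW_ALL = """User-agent: *
Disallow:
"""

# Applies to every crawler and blocks one directory (invented example).
BLOCK_PRIVATE = """User-agent: *
Disallow: /private/
"""

print(can_fetch(ALLOW_ALL, "MetaQuest", "https://www.example.org/index.html"))       # True
print(can_fetch(BLOCK_PRIVATE, "MetaQuest", "https://www.example.org/index.html"))   # True
print(can_fetch(BLOCK_PRIVATE, "MetaQuest", "https://www.example.org/private/a"))    # False
```

An empty Disallow line, as in the first sample, means nothing is disallowed for the crawlers that the User-agent line names.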
|
The 7th link goes to soundcloud.com → I have copied/pasted it into the seed URL, retrieved the 4 IPs and the 4 corresponding geoLocations, and also queried the robots.txt. It wants to allow us into anything except the two sitemap files. |
|
At soundcloud.com we find 5 offsite links. We copy/paste the URL for the first one, simply because it looks interesting → |
|
It has 1 IP, an IPv4, located in Durdevac, Croatia.
So, we have arrived at an interesting destination → |
|
→ and it has 33 offsite links 🙂 which point to
● Durdevac, Croatia
● Istanbul, Turkey
● Stockholm, Sweden
● Falkenstein/Vogtl., Saxony, Germany
● Gunzenhausen, Bavaria, Germany
● Dublin, Ireland
● and a few places within the USA.
[Continued on next pic down]
|
|
A gold mine! Where to go next!? At each stop you can use HTML SCRAPE to examine and download content from the website. You can also use the Save and Restore options at the bottom of the DEEP CRAWL menu as you exit from and return to MetaQuest, so that you can run Google Maps, your browser, or any other app you wish to use to explore that part of the world at your leisure without losing your breadcrumbs, which are in the Stack 🙂
Now, please download the MetaQuest app from the Google Play Store and…
Go where there is to go
Find what there is to be found
Mick MultiMIPS@gmail.com
|
|