This article will take you through the steps to set up log monitoring with Oncrawl:
Make sure you have access to the right log files
Check your log format
Prepare answers to parsing questions
Activate log monitoring in your project
Check access to log file storage location
Set up an upload method
Set up your parsing
Monitor log processing
Begin your log analysis
1. Make sure you have access to the right files
Logs that record bot/user interactions
Depending on how your website is set up, log files can be kept and stored in different locations and by different tools, including CDNs, load balancers, caches, and servers.
For SEO log monitoring, you will need the log files created at the point where a user first interacts with your site. This might mean you need the log files from your CDN or your load balancer, for example, rather than your server.
Logs that cover the parts of your site you want to monitor
Depending on how your website is set up, you might need multiple files to cover all of your site.
Make sure you have the files for the parts of your site you want to monitor, such as mobile pages or subdomains.
2. Check your log format
You will need to make sure Oncrawl can read your log files. To do so, you'll need a log file and a basic text editor, such as Notepad (Windows) or TextEdit (Mac).
File extension
First, check the file extension.
Prefer .txt or .json files.
Avoid .csv and .tsv files.
Log file contents
Open the file in your text editor. This is what your log file looks like. Oncrawl's job is to make sense of this. Yours is to make sure it contains all the information Oncrawl needs to extract.
Oncrawl will be looking for organic visits and for Googlebot hits. Here's what they might look like:
Sample log line for a Googlebot hit
You can see "Googlebot" in the user agent:
www.oncrawl.com:80 66.249.73.145 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 14486 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Sample log line for an organic visit
You can see the "https://www.google.es/" as the referer:
www.oncrawl.com:80 37.14.184.94 - - [07/Feb/2018:17:06:04 +0000] "GET /blog/ HTTP/1.1" 200 37073 "https://www.google.es/" "Mozilla/5.0 (Linux; Android 7.0; SM-G920F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36" "-"
You should be able to confirm the following points by looking at your log file:
Most lines look approximately the same. (They all have the same format)
You can separate each piece of information in the line, even if you don't know what it all means. For example, in the sample organic visit above, you can tell that the number 37073 (the number of bytes transferred) is one piece of information and "GET /blog/ HTTP/1.1" (the request itself, which contains the slug of the page requested: /blog/) is another.
Your log lines might look different from the examples above if your website uses a different type of server. Oncrawl regularly analyzes logs from IIS, Nginx, and Apache servers, and can support any server type that provides a log file in JSON format.
Learn how to configure the right log format when using Apache or Nginx here.
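If you want to check for yourself that your lines can be broken down this way, here is a minimal sketch in Python. The pattern is illustrative and written for an Apache-style "combined" log line; your own format may differ:

import re

# Illustrative pattern for an Apache-style "combined" log line; adjust it to your own format.
line = ('www.oncrawl.com:80 37.14.184.94 - - [07/Feb/2018:17:06:04 +0000] '
        '"GET /blog/ HTTP/1.1" 200 37073 "https://www.google.es/" "Mozilla/5.0 ..." "-"')

pattern = re.compile(
    r'(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

match = pattern.match(line)
if match:
    print(match.groupdict())  # e.g. status '200', bytes '37073', request 'GET /blog/ HTTP/1.1'

If a pattern like this one matches most of your lines, it is a good sign that the log format can be parsed automatically.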
Log line contents
The following information is required:
Query path (/blog/) or full URL (https://www.oncrawl.com/blog/). This indicates which page or resource the bot or the user wanted to view. We use this to calculate the number of hits per page, to filter or count hits by page group, to cross-analyze data for a given URL, and much more.
Date and time. This indicates when the page or resource was requested. We use this to establish crawl frequency, to create time graphs, to allow you to filter by date, and to target data for the right time period for an analysis.
User Agent. The User-Agent contains the essential information about who is making the request: the type of device and, more importantly, the bot name if the visitor is a bot.
Status code. The server returns a status code for every requested item. We use this to establish inconsistencies in status codes reported to users, to Oncrawl, and to Googlebot.
The following information is required if your site uses HTTPS:
Either the scheme (http or https), if not already present in the full URL
Or the port of the request (80 on HTTP / 443 on HTTPS)
This information is necessary to know whether the visitor requested the HTTP version or the HTTPS version of a page.
The following information is required if your log files contain information for multiple subdomains:
vhost, if not already present in the full URL
This information is necessary to distinguish between URLs on one subdomain and URLs on another.
The following information is optional but highly recommended:
Referer. This information passed on by the browser indicates the page from which the visitor is coming. It is only required if you intend to use log analysis to identify and examine organic traffic (users coming from a Google search page).
Client IP. We use this information to confirm that visitors with a Googlebot user-agent are really Googlebots by reverse lookup. This allows Oncrawl to filter out spam bots masquerading as Google. Once the check is performed, we do not keep or save this information anywhere. We do not need or use IP addresses for lines that do not have a Google user agent.
Response size. This indicates how much data the server transferred for the requested page or resource. It is extremely helpful when identifying page size and when searching for server errors such as pages that have a 200 status but 0 bytes of content.
Response time. This is the amount of time it took the server to provide the requested page or resource. This helps contribute to an accurate measure of page speed across your site.
You can download this list as a checklist that you can use yourself or provide to your dev team if necessary.
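For reference, here is what a single JSON log line containing all of the required and recommended information above might look like. The field names and the response time value are purely illustrative; your logging tool may use different ones:

{"host": "www.oncrawl.com", "scheme": "https", "path": "/blog/", "time": "2018-02-07T17:06:04+00:00", "status": 200, "bytes_sent": 14486, "response_time_ms": 120, "referer": "-", "client_ip": "66.249.73.145", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}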
Additional needs
Your company or client might not want to provide full log files to Oncrawl. If required by your in-house policies, you can filter your log files to remove lines that are not bot hits or SEO (organic) visits.
Find more on how to filter log lines here.
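If your policies require this kind of filtering, here is a minimal sketch of one way to do it in Python, assuming an Apache-style text log. The patterns and file names are illustrative and should be adapted to your own format and policy:

import re

# Keep only lines with a Googlebot user agent or a Google referer (organic visits).
keep = re.compile(r'Googlebot|"https?://(www\.)?google\.[^"]*"')

with open("access.log") as src, open("filtered-access.log", "w") as dst:
    for line in src:
        if keep.search(line):
            dst.write(line)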
3. Prepare your parsing questions
In order to prepare an automatic analysis of your log files, we'll ask you a bunch of questions during the setup. Make sure you know how to answer them. You might need to ask your IT department for help.
Download the question checklist and make sure it's completed.
4. Activate log monitoring in your project
At this point, you're ready to activate log monitoring in the application if you haven't done so already. You will need to have the log monitoring option included in your Oncrawl subscription.
From your project page, click on "ADD DATA SOURCES".
In the first tab of the data sources page, click on "ACTIVATE LOGS MONITORING".
Use the question checklist from the previous step ("3. Prepare your parsing questions") to answer all questions in steps 1 ("Configure your needs") and 2 ("Logs completeness").
Now you are ready to upload your first log files to Oncrawl.
You can leave the page in the application and come back when you're ready.
5. Check access to log file storage location
Before going any further, make sure you know where your log files are stored, and that you can access them.
You may need to negotiate this with your IT department, or ask them to set up a solution that places a copy of your log files every day in a location you can access.
Keep in mind that the location of your log files depends on how your site and server are set up. The logs you need may be stored in multiple locations.
If your logs are stored on a server, the most common location is /var/log.
If your server is used for multiple websites, make sure you are able to tell which logs are for which site. If the server logs requests for all sites in the same file, you will need to:
Either make sure that the domain or full URL appears in each log line
Or filter the file in order to provide only your site's log lines to Oncrawl (see the sketch below)
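A minimal sketch of that second option in Python, assuming the domain appears somewhere in each log line (the file names and domain are illustrative):

# Keep only the log lines for one site when a server logs several sites to the same file.
with open("access.log") as src, open("yoursite-access.log", "w") as dst:
    for line in src:
        if "www.yoursite.com" in line:
            dst.write(line)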
6. Set up an upload method
Next, set up your upload method.
There are two methods available:
Automatic upload. This is a "pull" method: Oncrawl will connect to your log file storage location and collect your log files automatically. This method works through secure connectors developed by Oncrawl and should be used if you use a third-party storage location or platform, such as Amazon S3 for your log files.
Manual upload. This is a "push" method: You send your log files to Oncrawl. This process can also be automated by writing a script or a program that periodically executes the manual steps.
Automatic upload
Using this method, you don't have to provide your files. Oncrawl will come get them using one of our available connectors.
Please contact us to obtain and set up a connector. We support connectors for:
Google Cloud Storage
Akamai
Azure Blob
You should also contact us if you prefer to set up a data stream for:
Amazon: AWS S3
Google Cloud Storage
Manual upload
To manually upload files to Oncrawl, you will need to connect to your private, protected FTP space and place your files in the folder for your project.
First, you'll need to compress your log files, or "zip" them. You can use any program that produces one of the following common formats:
.zip
.gz
.7z
Note that the 7z PPMd compression method is not supported. You may need to turn this option off when zipping the files.
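If you need to script the compression step, here is a minimal sketch using Python's standard library (file names are illustrative):

import gzip
import shutil

# Compress a raw log file to .gz before placing it on the FTP space.
with open("access.log", "rb") as src, gzip.open("access.log.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)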
Make sure that your network firewall is open for FTP connections.
Use an FTP client such as FileZilla to connect. You may need the following information:
Server: ftp://ftp.oncrawl.com (or by IP: 23.251.134.79)
Username: your Oncrawl username
Password: the password set in your workspace settings > FTP Access
Ports: 21 for the connection, and 10090 to 10999 for passive mode communications
Make sure that your connection is secured (FTPS). If your FTP client does not use TLS by default, as FileZilla does, you will need to turn on the "Explicit FTP over TLS" option.
Oncrawl does not use an authentication key for the FTPS connection.
Once connected, you will see folders for each project in your account:
Directory: Your project name
Open the folder for your project and drop the zipped file(s) there.
You're (almost) done!
Learn how to secure your FTP connection here (this makes sure you're using FTPS)
Learn how to automate the daily log file upload here
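As a starting point for that automation, here is a minimal sketch of a daily upload over explicit FTPS using Python's standard library. The username, password and project folder name are placeholders to replace with your own values:

from ftplib import FTP_TLS

ftps = FTP_TLS("ftp.oncrawl.com")
ftps.login("your-oncrawl-username", "your-ftp-password")  # password from workspace settings > FTP Access
ftps.prot_p()                       # secure the data channel (explicit FTP over TLS)
ftps.cwd("your-project-folder")     # the folder named after your project
with open("access.log.gz", "rb") as f:
    ftps.storbinary("STOR access.log.gz", f)
ftps.quit()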
7. Set up your parsing
Your log files are now available.
Return to the log setup page in the "Add Data Sources" section of the Oncrawl application. (If you've closed this page, you can return by clicking on "ADD DATA SOURCES" from the project page, and making sure you're looking at the "Log Files" tab.)
You should see a message that Oncrawl has found the files you uploaded. The number of files found by Oncrawl should be the same as the number of files you uploaded.
Click "I'VE UPLOADED ALL FILES".
This will take you to step 4 ("Check logs format"). This screen reports on how Oncrawl has broken down each line in your file into separate informational elements, and what information about your log files it has determined.
It's normal to see "Parse failed" in the first line ("Result") at first. You will not be able to continue until this says "Everything seems OK".
To change a "Parse failed" into "Everything seems OK", you will need to correct any errors in the "Issues", the "HTTP / HTTPS analysis" and the "Subdomain analysis" sections. (You can go ahead with a parse schema even if there are still warnings, as long as the interpretation by Oncrawl seems ok to you.)
Issues
The "Issues" section lists the high-level problems found.
You can have warnings even if the result is "Everything seems OK":
Warning: No SEO visits detected. Oncrawl was able to break lines down into their different elements, but even so, we couldn't find any lines with a Google referer that can be classified as organic visits from Google SERPs. It's possible but not very likely that there are no organic visits recorded in the file or files you sent to Oncrawl. Make sure that this is the case before proceeding.
Warning: No bot hits detected. Oncrawl was able to break lines down into their different elements, but even so, we couldn't find any lines with a Googlebot User Agent and Google IP address that can be classified as Googlebot hits. It's possible but very unlikely that there are no bot hits recorded in the file or files you sent to Oncrawl. Make sure that this is the case before proceeding. (If you use a cache server, you might need to turn off IP validation for Googlebots.)
Warning: High parse error rate. Oncrawl believes it found the right way to break lines down into their different elements, but we've encountered a large number of lines that don't fit this model. This is often a sign that not all of your log lines have the same format or that there is a problem with how we are parsing the lines. If lines we listed as errors refer to Googlebot hits or organic visits, your log analysis will be incomplete, and therefore incorrect. You should try to understand why there are so many errors before proceeding.
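For reference, the Googlebot IP validation mentioned above is the standard double reverse-DNS check recommended by Google: the IP must resolve to a *.googlebot.com or *.google.com hostname, and that hostname must resolve back to the same IP. A minimal sketch of the idea (not Oncrawl's actual implementation):

import socket

def looks_like_googlebot(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    return socket.gethostbyname(hostname) == ip

print(looks_like_googlebot("66.249.73.145"))  # IP taken from the sample Googlebot line above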
HTTP / HTTPS analysis
The "HTTP / HTTPS analysis" section tells you how Oncrawl will handle the difference between HTTP and HTTPS in your log files. For example, if you see: "Error: HTTP scheme is required but could not be extracted", it means:
"HTTP scheme is required": You told Oncrawl your website uses HTTPS. The log analyzer needs to make sure that incoming requests are for the HTTPS version.
"Could not be extracted": Your log files do not contain (or the automatic parser could not find) the scheme or the port that differentiates between HTTP and HTTPS.
Therefore, you must tell the parser which part of a log line contains this information before you can proceed. In the sample lines earlier in this article, for example, the port in "www.oncrawl.com:80" is what indicates that the HTTP version was requested; ":443" would indicate the HTTPS version.
Subdomain analysis
The "Subdomain analysis" section tells you how Oncrawl will handle possible different subdomains in your log files. For example, if you see: "Warning: HTTP host could not be extracted. Full URLs rebuilt from default URL: https://www.yoursite.com", it means:
"HTTP host could not be extracted": Oncrawl can't find the information about the host in your log line. The host is the domain and subdomain (everything between "https://" and the slug, path, or filename): www.yoursite.com or shop.yoursite.com
"Full URLs rebuilt from default URL": Therefore, Oncrawl will treat all lines as though they are on the same subdomain. It will create full URLs using the default URL listed here.
Because this is a warning, you don't have to do anything if you agree with Oncrawl's conclusions. If, however, your log files contain requests for multiple subdomains, you'll need to fix this issue before continuing.
Additional information
The sections "Parse sample", "OK lines", "Filtered lines" and "Error lines" are to give you an idea of what Oncrawl was able to find in your log files. If you're very familiar with your site, this information can help you confirm that files have been correctly parsed.
Configure log parser
To correct errors and warnings, move down to the "Configure logs parser" section and switch to "Manual".
For each element found by Oncrawl, use the drop-down menu at the top to choose the type of information the element represents. You probably will not need to use all of the items in the drop-down menu.
You can also click on the "Advanced" tab. This is where you can turn off IP verification for Googlebots by unticking the "Check Google IP" checkbox, or set a timezone if your server doesn't provide one.
Click "CHECK LOGS FORMAT" when you're done.
If you still see "Parse error" or warnings, click "CONFIGURATION IS NOT OK" at the bottom of the page. This will allow you to edit the parsing again.
Repeat this cycle until you have the results "Everything seems OK" and you no longer have warnings.
If everything is okay and you no longer have any warnings, click "CONFIGURATION IS OK".
Still stuck?
In case of difficulties, you can contact us by chat. Tell us you're dealing with a "parse error".
8. Monitor log processing
Oncrawl's Log Manager tool allows you to track the processing of your log files.
You can find the "LOG MANAGER TOOL" button on the project page next to the "ADD DATA SOURCES" button.
What sort of information can be monitored?
"File Processing" displays information about the different steps in processing your files. It shows the time the last file was received, the timestamp of the last useful line (either a Googlebot hit or a hit coming from a search engine page), as well as the status of files in this process queue.
Some raw information is used for live log monitoring ("Live events"), but the full information from a file is aggregated by day ("Aggregated data") to allow it to be used in cross analysis with other information on the platform.
Information from both processing queues is available.
In the second section on this page, the Log Manager Tool displays graphs to track trends in the quality of your log files and in the regularity with which they are uploaded and processed.
Finally, the tool shows an explorable data table with information including File Name, Deposit Date, File Size, SEO visits and Bot lines, Filtered Lines, and Errors.
You can click on a file name to see examples of data from that file. This gives you a better idea of how Oncrawl interprets your log data.
How to interpret the Processed Files table
Having lots of large numbers in the "File Size", "SEO visits" and "Bot hits" columns is fine
Having a few log lines in the "Errors" column should not be a problem
On the other hand, having lots of log lines in the "Errors" column and very few "Bot hits" and "SEO visits" often indicates a parsing error. In that case, contact us using the Oncrawl chat box. We are happy to help.
9. Begin your log analysis
Log monitoring is now set up correctly for your project.
Click on the "SHOW LOGS MONITORING" from your project page, and begin your logs analysis.
This will take you to the "Log Analysis" tab in the analysis sidebar. You can also access these reports from any analysis in your project.
You can also modify your crawl profiles to cross-analyze crawl data and log data in future analyses:
In the crawl settings, scroll down to "Analyses"
Click on "SEO impact report" to expand the section
Make sure you are looking at the "Logs" tab
Tick the "Enable logs cross analysis" checkbox
Remember to save your changes
You can also find this article by searching for:
step by step: how to get started with setting up monitoring
setting up log monitoring, step by step