I have a website which offers pages in the format https://www.example.com/X, where X is a sequential, unique number that increases by one every time a user creates a page and is never reused, even if the user later deletes their page. Since the site doesn't offer a quick and painless way to find out which of those pages are still up, I resorted to checking them one by one, contacting each through an HttpClient
and inspecting the HttpResponseMessage.StatusCode
for the 200 or 404 HTTP codes. My main method is as follows:
private async Task CheckIfPageExistsAsync(int PageId)
{
    string address = $"{PageId}";
    try
    {
        var result = await httpClient.GetAsync(address);
        Console.WriteLine($"{PageId} - {result.StatusCode}");
        if (result.StatusCode == HttpStatusCode.OK)
        {
            ValidPagesChecked.Add(PageId);
        }
    }
    // Code for HttpClient timeout handling
    catch (Exception)
    {
        Console.WriteLine($"Failed ID: {PageId}");
    }
}
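For reference, the catch block above stands in for my timeout handling, and address only contains the page id because the client already knows the site root. A minimal sketch of that setup (the values here are illustrative assumptions, not my exact configuration):

// Assumed setup (illustrative values): one shared HttpClient whose
// BaseAddress makes GetAsync("12345") resolve to
// https://www.example.com/12345, and whose Timeout surfaces as a
// TaskCanceledException in the catch block of CheckIfPageExistsAsync.
private static readonly HttpClient httpClient = new HttpClient
{
    BaseAddress = new Uri("https://www.example.com/"),
    Timeout = TimeSpan.FromSeconds(10)
};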
This method is called as follows in order to get a certain degree of parallelism:
public void Test()
{
    var tasks = new ConcurrentBag<Task>();
    var lastId = GetLastPageIdChecked();
    // Opens up 30 requests at a time because I found that's the upper
    // limit before getting hit by the rate limiter and receiving 429 errors
    Parallel.For(lastId + 1, lastId + 31, i =>
    {
        tasks.Add(CheckIfPageExistsAsync(i));
    });
    Task.WaitAll(tasks.ToArray());
    lastId += 30;
    Console.WriteLine("STEP");
    WriteLastPageIdChecked(lastId);
    WriteValidPageIdsList();
}
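Test() itself only processes a single batch of 30 ids; since it reads and persists the last checked id, the outer driver that advances through batches is just a loop along these lines (a hypothetical sketch, my actual driver differs only in its stop condition):

// Hypothetical driver: each call to Test() picks up where the previous
// one left off, because Test() reads and persists the last checked id.
public void Run(int batches)
{
    for (var batch = 0; batch < batches; batch++)
    {
        Test();
    }
}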
Now, from what I understand, starting the tasks through Parallel
should let the program decide by itself how many concurrent threads are active at the same time, and adding them all to a ConcurrentBag
lets me wait for all of them to finish before moving on to the next batch of pages to check. Since this whole operation is incredibly expensive time-wise, I'd like to know whether I've opted for a good approach to parallelism and asynchronous methods.
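To make the batching concrete: as far as I can tell, since CheckIfPageExistsAsync returns a Task as soon as it has started its request (the await happens inside it), the Parallel.For above should be equivalent to starting the 30 tasks with a plain sequential loop and then waiting on them, roughly like this sketch:

// Equivalent sketch of one batch: the loop body only *starts* each
// request, so task creation is cheap and the 30 HTTP calls still run
// concurrently while Task.WaitAll blocks for the slowest one.
var tasks = new List<Task>();
for (var i = lastId + 1; i <= lastId + 30; i++)
{
    tasks.Add(CheckIfPageExistsAsync(i));
}
Task.WaitAll(tasks.ToArray());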