
I have a website which offers pages in the format https://www.example.com/X, where X is a sequential, unique number that increases by one every time a user creates a page and is never reused, even if the user later deletes their page. Since the site doesn't offer a quick and painless way to know which of those pages are still up, I resorted to checking them one by one, contacting each through an HttpClient and analyzing HttpResponseMessage.StatusCode for 200 or 404 HTTP codes. My main method is as follows:

    private async Task CheckIfPageExistsAsync(int PageId)
    {
        string address = $"{ PageId }";
        try
        {
            var result = await httpClient.GetAsync(address);

            Console.WriteLine($"{ PageId } - { result.StatusCode }");

            if (result.StatusCode == HttpStatusCode.OK)
            {
                ValidPagesChecked.Add(PageId);
            }
        }
        //Code for HttpClient timeout handling
        catch (Exception)
        {
            Console.WriteLine($"Failed ID: { PageId }");
        }
    }

This code is called like this in order to have a certain degree of parallelism:

    public void Test()
    {
        var tasks = new ConcurrentBag<Task>();
        var lastId = GetLastPageIdChecked();
        //This opens 30 requests at a time because I found it's the upper limit before getting hit with a rate limiter and receiving 429 errors
        Parallel.For(lastId + 1, lastId + 31, i =>
        {
            tasks.Add(CheckIfPageExistsAsync(i));
        });
        Task.WaitAll(tasks.ToArray());

        lastId += 30;
        Console.WriteLine("STEP");

        WriteLastPageIdChecked(lastId);
        WriteValidPageIdsList();
    }

Now, from what I understand, starting the tasks through Parallel should let the runtime decide how many concurrent threads are active at the same time, and adding them all to a ConcurrentBag lets me wait for them all to finish before moving on to the next batch of pages to check. Since this whole operation is incredibly expensive time-wise, I'd like to know whether I've chosen a good approach to parallelism and asynchronous methods.

nicktheone
  • You have a site that you need to scrape? I am finding that hard to believe. – Martin K Feb 23 '20 at 20:41
  • Why do you think your approach is not good? – Navjot Singh Feb 23 '20 at 22:11
  • That is horribly inefficient. It would be better to create N threads and give each a block of K addresses to check. When a thread has checked its block, it requests a new block of addresses from the backlog. If you craft the backlog well there will be little to no locking, as you can prebalance each set of addresses into separate thread lists. The locking will only occur when a thread has chewed through everything and needs to steal work from another thread. – Kain0_0 Feb 23 '20 at 22:21
  • @MartinK why? It's for an MMO I play and I wanted to create a sort of unofficial census. – nicktheone Feb 23 '20 at 22:24
  • @NavjotSingh mainly the fact it's my first approach towards parallelism, on top of the fact I still have a hard time with asynchronous programming. – nicktheone Feb 23 '20 at 22:25
  • @Kain0_0 why though? I've read `Parallel` doesn't really create a new thread for every iteration. What it does instead is create a pool adequate to the workload and the resources of your system, balancing each thread's load and shuffling things around without the need to optimize at the low level. – nicktheone Feb 23 '20 at 22:32
  • And on top of that you are batching, which is another word for serialising on the longest workload. `Parallel`, when given the full set of work, may be able to organise the work efficiently, but batching places a rate limit on top of that. Also, my solution isn't optimising at the low level; it's a macro architecture for efficiently distributing work across multiple processors. – Kain0_0 Feb 23 '20 at 22:55
  • @Kain0_0 as you said yourself, batching is a form of rate limiting, which is exactly what I need in order to avoid being blocked out altogether from the server. As I wrote in the comment, around 30 requests in a very small window of time is the max I can go without being throttled or flat-out rejected by the server, and batching it like that ensures I can keep track of how many requests are still open. If I were to let the app go on freely it would immediately hit the limit, as it did the first time I implemented it. – nicktheone Feb 23 '20 at 23:02
  • Fair enough, then why the question? You don't even need to use `Parallel` here. A simple loop with a time window is sufficient and much less resource intensive on your end. – Kain0_0 Feb 23 '20 at 23:05
  • @Kain0_0 I’ll keep that in mind for next time. As to why I asked this question in the first place, it was mainly, as I said, because I had no confidence in having done a decent job, especially since I’ve always had issues grasping asynchronous programming. – nicktheone Feb 23 '20 at 23:12
  • @nicktheone you are thinking too much. Just go through the documentation carefully and you will be fine. – Navjot Singh Feb 24 '20 at 06:12
  • @nicktheone Since users are deleting and creating pages, why not use a Dictionary of sorts? It will give O(1) performance on lookup. When a user adds a page you store it in memory, and when they remove one you remove it from the dictionary. Generally, what is it that you are trying to achieve in a broader sense? – A_kat Feb 24 '20 at 08:38
  • @A_kat I should probably have worded it better, but it's not my website; otherwise I'd have more refined methods for acquiring data. Unfortunately I can't know when a user creates or deletes a page unless I scrape over them one by one. – nicktheone Feb 24 '20 at 08:56
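
For what it's worth, the "simple loop with a time window" idea floated in the comments could look roughly like the sketch below. It reuses CheckIfPageExistsAsync from the question; the CheckRangeAsync name, the MaxRequestsPerWindow constant and the ten-second window length are made up for the illustration, not anyone's actual code.

    // Sketch only: fires at most 30 requests per window, then waits out the
    // remainder of the window before starting the next batch.
    // Requires: using System; using System.Linq; using System.Threading.Tasks;
    private async Task CheckRangeAsync(int firstId, int lastId)
    {
        const int MaxRequestsPerWindow = 30;
        var windowLength = TimeSpan.FromSeconds(10);

        for (int start = firstId; start <= lastId; start += MaxRequestsPerWindow)
        {
            var windowStarted = DateTime.UtcNow;
            var count = Math.Min(MaxRequestsPerWindow, lastId - start + 1);

            // Start one batch of requests and wait for all of them to finish.
            await Task.WhenAll(Enumerable.Range(start, count).Select(CheckIfPageExistsAsync));

            // If the batch finished early, sleep until the window has elapsed,
            // so the server never sees more than MaxRequestsPerWindow per window.
            var elapsed = DateTime.UtcNow - windowStarted;
            if (elapsed < windowLength)
            {
                await Task.Delay(windowLength - elapsed);
            }
        }
    }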

1 Answer


The first rule when talking about performance is to measure. The built-in tools in Visual Studio are fairly competent, and in a pinch, adding a stopwatch or two can help reveal which part of the code is the bottleneck.
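
For instance, wrapping the two phases of the Test method in a System.Diagnostics.Stopwatch (a quick sketch, not part of the original code) would show whether the time goes into starting the requests or into waiting for the responses:

    var sw = System.Diagnostics.Stopwatch.StartNew();
    Parallel.For(lastId + 1, lastId + 31, i =>
    {
        tasks.Add(CheckIfPageExistsAsync(i));
    });
    Console.WriteLine($"Starting the requests: {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    Task.WaitAll(tasks.ToArray());
    Console.WriteLine($"Waiting for the responses: {sw.ElapsedMilliseconds} ms");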

Just looking at your code, I would expect each iteration of the parallel loop to complete very quickly, since httpClient.GetAsync should do fairly little work before it returns its task. So the majority of the time would be spent in the Task.WaitAll call. Therefore I would try replacing the Parallel.For with a regular loop and see if the parallelization is worth the effort. The actual web calls will still be done in parallel thanks to the async call, as long as the calls are not awaited until the end.
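
A regular-loop version of the question's Test body might look something like this (a sketch; only the loop changes, everything else is taken from the question):

    var tasks = new List<Task>();
    var lastId = GetLastPageIdChecked();

    // The requests still run concurrently: each call returns its task as soon
    // as GetAsync has started, and nothing is awaited until the end.
    for (int i = lastId + 1; i <= lastId + 30; i++)
    {
        tasks.Add(CheckIfPageExistsAsync(i));
    }
    Task.WaitAll(tasks.ToArray());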

A rule of thumb is that Parallel.For/ForEach is most useful when you are CPU limited, while async is most useful when you are IO limited.

I would also recommend taking a look at SemaphoreSlim for rate limiting; see How to limit the amount of concurrent async I/O operations for details. I would try something like this and see if it makes a difference.

        var semaphore = new SemaphoreSlim(MaxConcurrentCalls);
        var tasks = new List<Task>();
        for (int i = 0; i < NumberOfPagesToCheck; i++)
        {
            int pageId = i; // copy the loop variable so each task checks its own id
            await semaphore.WaitAsync();
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    await CheckIfPageExistsAsync(pageId);
                }
                finally
                {
                    // Release only once the request has completed, so at most
                    // MaxConcurrentCalls requests are in flight at any one time.
                    semaphore.Release();
                }
            }));
        }
        await Task.WhenAll(tasks);
JonasH
  • Very nice and thorough answer! Definitely what I was looking for because it gave me the pointers I needed in order to try and improve my algorithm. – nicktheone Feb 25 '20 at 19:30