REST API Scraping

Anti Snigger · Mar 5, 2024

Does anyone have experience with scraping sites that use a REST API? I'm working on stuff for a thread, and I'm struggling to get it working. I'm currently using Python, but realistically I can use anything if I need to.
I keep getting no-script errors at present and I'm not sure how to bypass this.

Geranium · Mar 5, 2024

Are you able to share the URL?

Anti Snigger · Mar 5, 2024

Geranium said:
Are you able to share the URL?

Various pages on Nebula. I want to be able to scrape the creator pages and ideally the full list of videos as a starting point

Spedestrian · Mar 5, 2024

Doubly Punished Snigger said:
I keep getting no-script errors at present and I'm not sure how to bypass this.

You're getting those because requests, or whatever you're scraping with, doesn't execute JavaScript, so you only ever see the page's noscript fallback. If you want to scrape sites that rely on JavaScript or do other weird crap then Selenium WebDriver is really useful. It basically lets you take a fully-featured browser like Firefox or Chrome and control it with your programming language of choice, so you can automate and scrape anything that works in a regular browser.
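
As a rough sketch of what that looks like in Python, assuming Selenium 4 plus Firefox and geckodriver are installed. The `h3` selector, the 15 second timeout, and both function names are guesses for illustration, not anything confirmed about Nebula's markup:

```python
def clean_texts(elements):
    """Return the stripped, non-empty text of each element, in order, without duplicates."""
    seen = set()
    texts = []
    for element in elements:
        text = element.text.strip()
        if text and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts


def scrape_texts(url, selector="h3", timeout=15):
    """Load url in headless Firefox, wait for matching elements, return their text."""
    # Imported here so clean_texts() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has rendered at least one matching element;
        # raises TimeoutException if nothing shows up within `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return clean_texts(elements)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `scrape_texts("https://nebula.tv/joescott")` would open a real (headless) browser, so expect it to be much slower than a plain HTTP request.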

Doubly Punished Snigger said:
Various pages on Nebula. I want to be able to scrape the creator pages and ideally the full list of videos as a starting point

I picked a random channel from their featured list and looked at the page source to see what was up. It seems like they have an RSS feed for every channel, and the RSS URL is easy to derive from the channel name. For example:

Channel URL: https://nebula.tv/joescott
Channel RSS URL: https://rss.nebula.app/video/channels/joescott.rss

So in this case, you'd probably be better served by just scraping their RSS feeds rather than setting up Selenium and learning how to use it. If you want to do more complicated scraping later then Selenium may still come in handy, but if you just want a list of videos then pulling the RSS feed is easier.
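
A minimal sketch of pulling one of those feeds with nothing but the standard library. It assumes the feed is plain RSS 2.0 with `<item>`, `<title>` and `<link>` elements, which I haven't verified for every channel, and the function names are made up for the example:

```python
import urllib.request
import xml.etree.ElementTree as ET


def rss_url_for(channel_url):
    """Derive the RSS URL from a channel URL, following the pattern shown above."""
    slug = channel_url.rstrip("/").rsplit("/", 1)[-1]
    return f"https://rss.nebula.app/video/channels/{slug}.rss"


def parse_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def fetch_videos(channel_url, timeout=30):
    """Download the channel's feed and return its (title, link) pairs."""
    with urllib.request.urlopen(rss_url_for(channel_url), timeout=timeout) as response:
        return parse_items(response.read())
```

`fetch_videos("https://nebula.tv/joescott")` would then give you whatever the feed currently lists, which may not be the channel's full back catalogue.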

Anti Snigger · Mar 5, 2024

I've already been looking at the RSS feeds, but even with the archives of them they're incomplete. I want a full content list as a starting point.
I'll take a look at Selenium though.

Sic Semper Tyrannosaurus · Jul 18, 2024

Old thread, but you could also use something like browserless. In its simplest config you run it as a container, POST your URL to it, and it returns the HTML document. Something like:

Code:

curl -XPOST http://browserless-url/content -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' -d '{"url": "https://nebula.tv/joescott"}'

Your mileage may vary by site (that nebula.tv page still comes back with a noscript error, probably down to some config option, since the next example works), but I use it for a variety of sites with Ruby to periodically fetch updates.
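
Since the OP is on Python, here's roughly the same request as that curl command using only the standard library. It's a sketch: `http://browserless-url` is the same placeholder as above, and the function names and timeout are my own choices:

```python
import json
import urllib.request


def build_content_request(base_url, page_url):
    """Build the same POST as the curl command: a JSON body sent to /content."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/content",
        data=body,
        headers={"Cache-Control": "no-cache", "Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(base_url, page_url, timeout=60):
    """Send the request and return the HTML document as text."""
    request = build_content_request(base_url, page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be something like `fetch_rendered_html("http://browserless-url", "https://nebula.tv/joescott")`, with the base URL swapped for wherever your container is listening.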

It's also got a variety of endpoints to make your life easier, like /scrape, which takes a selector and returns the matched results:

Code:

curl -XPOST http://browserless-url/scrape -H 'Cache-Control: no-cache' -H 'Content-Type: application/json' -d '{"url": "https://nebula.tv/joescott", "elements": [ {"selector": "h3" }]}'

That spits out some JSON for the nebula.tv page:

JSON:

{
  "data": [
    {
      "results": [
        {
          "attributes": [
            {
              "name": "title",
              "value": "When We Treated Mental Illness With An Ice Pick"
            },
            {
              "name": "class",
              "value": "css-std21b"
            }
          ],
          "height": 18,
          "html": "When We Treated Mental Illness With An Ice Pick",
          "left": 41,
          "text": "When We Treated Mental Illness With An Ice Pick",
          "top": 813.5,
          "width": 342
        },
...
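
Getting the titles back out of that in Python is a short loop. A minimal sketch, assuming the response always has the `data` / `results` / `text` shape shown above (the helper names are made up for illustration):

```python
def extract_texts(payload):
    """Collect the "text" field of every result under data[].results[]."""
    texts = []
    for entry in payload.get("data", []):
        for result in entry.get("results", []):
            text = result.get("text", "").strip()
            if text:
                texts.append(text)
    return texts


def attribute(result, name):
    """Return the value of a named attribute on one result, or None if it is absent."""
    for attr in result.get("attributes", []):
        if attr.get("name") == name:
            return attr.get("value")
    return None
```

Feed it the decoded response, e.g. `extract_texts(json.loads(body))`, and you get a flat list of whatever text matched the selector.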
