Anyone have a good webscraping tool?

Do you know XPath? Learn that first, then web scraping just becomes a matter of curling a page, loading it as an XML document, and querying away at the bits you need. Easy peasy.
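Roughly this, in Python with lxml (just a sketch; example.com and the //a/@href query are placeholders, not anything from the thread):
# fetch a page, load it as a document, query it with XPath (assumes lxml is installed)
from urllib.request import urlopen
from lxml import html

page = urlopen('https://example.com').read()    # the "curl" step
doc = html.fromstring(page)                     # load it as a document
for href in doc.xpath('//a/@href'):             # query away at the bits you need
    print(href)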
Sweet, thanks bro.
XPath isn't that hard; you can think of it as an alternative to CSS selector syntax if that helps.
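For instance, these pairs select more or less the same thing (made-up class names, just to show the shape of it):
# CSS  div.price > span   ~   XPath  //div[@class="price"]/span
# CSS  a[href]            ~   XPath  //a[@href]
from lxml import html
doc = html.fromstring('<div class="price"><span>19.99</span></div>')
print(doc.xpath('//div[@class="price"]/span/text()'))   # ['19.99']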
You can experiment with XPath in real time by installing a libxml package. That will give you a tool called xmllint which, among other things, will let you run XPath queries against files and print the result.
Of course, all this presumes your input isn't too soupy…
Don't forget learning regular expressions to fiddle with all the things that prevent a page from loading as XML. (Unclosed tags such as IMG are a big one)
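In Python, that sort of pre-cleaning looks something like this (a naive sketch; the sample markup is invented, and a real page will have more than IMG to worry about):
import re
raw = '<p>logo: <img src="logo.png"></p>'
# self-close bare <img ...> tags so a strict XML parser will accept them
# (crude: it skips tags that are already self-closed, but will trip over ">" inside attribute values)
fixed = re.sub(r'<img([^>]*)(?<!/)>', r'<img\1/>', raw, flags=re.IGNORECASE)
print(fixed)   # <p>logo: <img src="logo.png"/></p>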
Serious question: what percentage of webpages parse as valid XML? I'd bet some coin that very few. A lot of the "HTML" isn't even valid HTML...
Using regular expressions to handle a context-free language is asking for trouble.
Well yeah, don't try to actually handle the language that way.
xml = Regex.Replace(xml, @"<img[^>]+>", "");
(The [^>]+ assumes there's never a bare <img> in there; I guess a * would have done fine.)

Well, fewer than in the days when most pages were done by hand, I'd wager. If you're dealing with a professional site, they'll be running tests and such to notify them if their CMS or storefront or whatever is outputting bad HTML. On the other hand, some XML parsers also have a "soup" mode which will have some tolerance for malformed XML, and pretty much all of them will gladly tell you where and how things go wrong if they just give up - so you can use regex, yes, to fudge the source into something more correct before loading it.
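lxml's recover mode is one example of that kind of tolerance (a sketch; the broken markup is invented):
from lxml import etree
broken = '<ul><li>one<li>two</ul>'        # unclosed <li> tags - not well-formed XML
parser = etree.XMLParser(recover=True)    # "soup" mode: keep parsing past errors
root = etree.fromstring(broken, parser)
print(etree.tostring(root))               # a well-formed tree, though the nesting isn't what HTML rules would give
print(parser.error_log)                   # ...and a record of where and how things went wrong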
In the past ~10 years I've soured on the concept of "be liberal in what you accept, conservative in what you send". I strongly believe we would live in a vastly superior world if the browsers were allowed, nay - mandated! - by RFC to outright reject invalid input. I'm sad that the XHTML movement effectively went nowhere.
In C++, what earthly reason is there to replace a "=default" constructor with a blank one that explicitly calls the superclass' default constructor? It seems weird that someone specifically added this in after a review - am I missing something?
If you want to go down that road then you need a purpose-built library like Nokogiri or Beautiful Soup. You can't just feed HTML into a plain XML parser, I've tried. It works sometimes, but not well enough for a robust scraper you want to keep running.
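For what it's worth, this is the kind of thing a purpose-built parser shrugs off (made-up markup that a strict XML parser would reject outright):
from bs4 import BeautifulSoup
messy = '<ul><li>first item</li><li>second item'    # the tail of the markup is never closed
doc = BeautifulSoup(messy, 'html.parser')
for li in doc.find_all('li'):
    print(li.get_text())    # prints both items anyway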
You're not trying to parse the HTML language though. It's just a long string you want to extract some specific values from. Regular expressions are exactly what you want.
I'd say that a regex might have the same brittleness problems if the page structure changes, but you're right in that regex alone might be good enough depending on what you're trying to scrape.
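A sketch of that kind of extraction (the page snippet and the pattern are made up):
import re
page = '<span class="price">$1,299.00</span> <span class="stock">In stock</span>'
# grab whatever sits between the price span's ">" and the next "<";
# brittle if the markup changes, but often good enough
m = re.search(r'class="price">([^<]+)<', page)
if m:
    print(m.group(1))    # $1,299.00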
The problem with fancy things like XPath and CSS selectors is that some trivial change or error in the page nowhere near the data you actually want will fuck up your whole scraper.
Again, based on my experience, curl + grep is all you need for an effective scraper that won't keep breaking.
XHTML wasn't an actual superset of HTML, nor vice versa, if I remember correctly. Unless that's changed sometime recently?
XML and HTML have common roots but had slightly different rules (if you could say early HTML had any rules at all). XHTML is an effort to write HTML which conforms strictly to the rules of XML, such that if you had an XML parser, you also had an HTML parser. The big difference is that singleton tags such as <br> are valid like that in HTML but have to be self-closed in XHTML, e.g. <br /> - which is how I instinctually write HTML at this point anyway. So no, HTML is not strictly a subset of XHTML.

Besides, HTML allows for stuff such as not needing to close particular block tags, for example <p>. Whenever you start a new paragraph with a <p>, the previous one is implicitly closed.

That didn't sound right to me, at least for HTML5, so I looked it up, but at least according to the venerable developer.mozilla.org, you're right.
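You can watch that implicit close happen by feeding such a fragment to an HTML-aware parser (a sketch; lxml's HTML parser stands in for the browser here, and the fragment is made up):
from lxml import html
frag = '<div><p>first paragraph<p>second paragraph</div>'    # no </p> anywhere
doc = html.fromstring(frag)
print([p.text for p in doc.xpath('//p')])    # ['first paragraph', 'second paragraph']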
If you're on my team and I catch you writing HTML like that, I'm calling you into my office, though.

That's exactly my sentiment though.
I made this simple script with the BeautifulSoup 4 HTML parser a few years ago; it scrapes the "most popular searches" names from MercadoLibre and saves them as a CSV file with today's date as the filename, in dataframe format... If anyone has any cool noob tips, I'd be glad to have em.
Thanks bro, I'm new and teaching myself this stuff has been a real uphill battle for a retard like me. Thank you for your help.
Not the most useful or original scraper idea, but it helped me understand BeautifulSoup. You can adapt it to scrape whatever you want in whatever format you want: just change my_url and the variable masusadas = page_soup.findAll("li", {"class": "searches__item"}) to whatever you want to scrape. Can't upload the .py / .ipynb files here :/

We want to scrape every <li class="searches__item"> name inside the big box <andes-card searches>; using page_soup.findAll() we do that...
You can make sure the number of items makes sense using len(masusadas).
from datetime import date
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://tendencias.mercadolibre.cl'

# fetch the page and parse it
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# every trending search is an <li class="searches__item">
masusadas = page_soup.findAll("li", {"class": "searches__item"})
print(len(masusadas))   # sanity check: does the count look right?

# write the names out as one comma-separated line; filename is today's date
today = str(date.today())
filename = today + ".csv"
f = open(filename, "w")
for item in masusadas:
    f.write(item.text + " ,")
f.close()