Anyone have a good webscraping tool?

Do you know XPath? Learn that first, then web scraping just becomes a matter of curling a page, loading it as an XML document, and querying away at the bits you need. Easy peasy.
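Roughly this, in Python with lxml (just a sketch; example.com and the //a/@href query are placeholders, not anything from the thread):
# fetch a page, load it as a document, query it with XPath (assumes lxml is installed)
from urllib.request import urlopen
from lxml import html

page = urlopen('https://example.com').read()    # the "curl" step
doc = html.fromstring(page)                     # load it as a document
for href in doc.xpath('//a/@href'):             # query away at the bits you need
    print(href)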
Sweet, thanks bro.
XPath isn't that hard; you can think of it as an alternative to CSS selector syntax if that helps.
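For instance, these pairs select more or less the same thing (made-up class names, just to show the shape of it):
# CSS  div.price > span   ~   XPath  //div[@class="price"]/span
# CSS  a[href]            ~   XPath  //a[@href]
from lxml import html
doc = html.fromstring('<div class="price"><span>19.99</span></div>')
print(doc.xpath('//div[@class="price"]/span/text()'))   # ['19.99']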
You can experiment with XPath in real time by installing a libxml package. That will give you a tool called xmllint which, among other things, will let you run XPath queries against files and print the result.
Of course, all this presumes your input isn't too soupy…
Don't forget learning regular expressions to fiddle with all the things that prevent a page from loading as XML. (Unclosed tags such as IMG are a big one)
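In Python, that sort of pre-cleaning looks something like this (a naive sketch; the sample markup is invented, and a real page will have more than IMG to worry about):
import re
raw = '<p>logo: <img src="logo.png"></p>'
# self-close bare <img ...> tags so a strict XML parser will accept them
# (crude: it skips tags that are already self-closed, but will trip over ">" inside attribute values)
fixed = re.sub(r'<img([^>]*)(?<!/)>', r'<img\1/>', raw, flags=re.IGNORECASE)
print(fixed)   # <p>logo: <img src="logo.png"/></p>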
Serious question: what percentage of webpages parse as valid XML? I'd bet some coin that very few. A lot of the "HTML" isn't even valid HTML...
Using regular expressions to handle a context-free language is asking for trouble.
Well yeah, don't try to actually handle the language that way.
xml = Regex.Replace(xml, @"<img[^>]+>", "");
(The [^>]+ assumes there's never a bare <img> in there; I guess a * would have done fine.)

Well, fewer than in the days when most pages were done by hand, I'd wager. If you're dealing with a professional site, they'll be running tests and such to notify them if their CMS or storefront or whatever is outputting bad HTML. On the other hand, some XML parsers also have a "soup" mode which will have some tolerance for malformed XML, and pretty much all of them will gladly tell you where and how things go wrong if they just give up - so you can use regex, yes, to fudge the source into something more correct before loading it.
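lxml's recover mode is one example of that kind of tolerance (a sketch; the broken markup is invented):
from lxml import etree
broken = '<ul><li>one<li>two</ul>'        # unclosed <li> tags - not well-formed XML
parser = etree.XMLParser(recover=True)    # "soup" mode: keep parsing past errors
root = etree.fromstring(broken, parser)
print(etree.tostring(root))               # a well-formed tree, though the nesting isn't what HTML rules would give
print(parser.error_log)                   # ...and a record of where and how things went wrong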
In the past ~10 years I've soured on the concept of "be liberal in what you accept, conservative in what you send". I strongly believe we would live in a vastly superior world if the browsers were allowed, nay - mandated! - by RFC to outright reject invalid input. I'm sad that the XHTML movement effectively went nowhere.
In C++, what earthly reason is there to replace a "=default" constructor with a blank one that explicitly calls the superclass' default constructor? It seems weird that someone specifically added this in after a review - am I missing something?
If you want to go down that road then you need a purpose-built library like Nokogiri or Beautiful Soup. You can't just feed HTML into a plain XML parser, I've tried. It works sometimes, but not well enough for a robust scraper you want to keep running.
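For what it's worth, this is the kind of thing a purpose-built parser shrugs off (made-up markup that a strict XML parser would reject outright):
from bs4 import BeautifulSoup
messy = '<ul><li>first item</li><li>second item'    # the tail of the markup is never closed
doc = BeautifulSoup(messy, 'html.parser')
for li in doc.find_all('li'):
    print(li.get_text())    # prints both items anyway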
You're not trying to parse the HTML language though. It's just a long string you want to extract some specific values from. Regular expressions are exactly what you want.
I'd say that a regex might have the same brittleness problems if the page structure changes, but you're right in that regex alone might be good enough depending on what you're trying to scrape.
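A sketch of that kind of extraction (the page snippet and the pattern are made up):
import re
page = '<span class="price">$1,299.00</span> <span class="stock">In stock</span>'
# grab whatever sits between the price span's ">" and the next "<";
# brittle if the markup changes, but often good enough
m = re.search(r'class="price">([^<]+)<', page)
if m:
    print(m.group(1))    # $1,299.00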
The problem with fancy things like XPath and CSS selectors is that some trivial change or error in the page nowhere near the data you actually want will fuck up your whole scraper.
Again, based on my experience, curl + grep is all you need for an effective scraper that won't keep breaking.
XHTML wasn't an actual superset of HTML, nor vice versa, if I remember correctly. Unless that's changed sometime recently?
XML and HTML have common roots but had slightly different rules (if you could say early HTML had any rules at all). XHTML is an effort to write HTML which conforms strictly to the rules of XML, such that if you had an XML parser, you also had an HTML parser. The big difference is that singleton tags such as <br> are valid like that in HTML but have to be self-closed in XHTML, e.g. <br /> - which is how I instinctually write HTML at this point anyway. So no, HTML is not strictly a subset of XHTML.

Besides, HTML allows for stuff such as not needing to close particular block tags, for example <p>. Whenever you start a new paragraph with a <p>, the previous one is implicitly closed.

That didn't sound right to me, at least for HTML5, so I looked it up, but at least according to the venerable developer.mozilla.org, you're right.
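You can watch that implicit close happen by feeding such a fragment to an HTML-aware parser (a sketch; lxml's HTML parser stands in for the browser here, and the fragment is made up):
from lxml import html
frag = '<div><p>first paragraph<p>second paragraph</div>'    # no </p> anywhere
doc = html.fromstring(frag)
print([p.text for p in doc.xpath('//p')])    # ['first paragraph', 'second paragraph']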
If you're on my team and I catch you writing HTML like that, I'm calling you into my office, though.

That's exactly my sentiment though.
I made this simple script with the BeautifulSoup 4 HTML parser a few years ago; it scrapes the "most popular searches" names from MercadoLibre and saves them as a CSV file with today's date as the filename, in dataframe format... If anyone has any cool noob tips, I'd be glad to have em.
Thanks bro, I'm new and teaching myself this stuff has been a real uphill battle for a retard like me. Thank you for your help.
Not the most useful or original scraper idea, but it helped me understand BeautifulSoup. You can adapt it to scrape whatever you want in whatever format you want: just change my_url and the variable masusadas = page_soup.findAll("li", {"class": "searches__item"}) to whatever you want to scrape. Can't upload the .py / .ipynb files here :/

We want to scrape every <li class="searches__item"> name inside the big box <andes-card searches>; using page_soup.findAll() we do that...
You can make sure the number of items makes sense using len(masusadas).
from datetime import date
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://tendencias.mercadolibre.cl'

# fetch the page and parse it
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# every trending search is an <li class="searches__item">
masusadas = page_soup.findAll("li", {"class": "searches__item"})
print(len(masusadas))   # sanity check: does the count look right?

# write the names out as one comma-separated line; filename is today's date
today = str(date.today())
filename = today + ".csv"
f = open(filename, "w")
for item in masusadas:
    f.write(item.text + " ,")
f.close()