
Re: High Load Errors are annoying...

Posted: Fri Jan 18, 2008 2:17 am
by Murtak

Ok, then I will take a look :)

Re: High Load Errors are annoying...

Posted: Fri Jan 18, 2008 12:35 pm
by Zherog
Murtak wrote:
But since text is yucky you do not store the user as "Maj" but instead store Maj's unique number, say 104.


You're assuming two things:

First, that they use a database rather than text files. I think that's a safe assumption, but it's still an assumption.

Second, you're assuming they have a normalized database. I think that's a less safe assumption.
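
To make the normalization point concrete, here is a minimal sketch, assuming SQLite as a stand-in database and made-up table and column names (`users`, `posts`, `user_id`). Nothing in this thread says what the real schema looks like; the example only shows the idea of storing the number 104 instead of the name "Maj".

```python
import sqlite3

# Hypothetical schema: all table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        body TEXT NOT NULL
    );
""")

# The user name is stored once; posts refer to it by number only.
conn.execute("INSERT INTO users (id, name) VALUES (104, 'Maj')")
conn.execute("INSERT INTO posts (user_id, body) VALUES (104, 'What a pain.')")


def posts_with_author(db):
    """Join each post back to its author's name."""
    return db.execute(
        "SELECT users.name, posts.body FROM posts "
        "JOIN users ON users.id = posts.user_id ORDER BY posts.id"
    ).fetchall()


rows = posts_with_author(conn)
```

In a non-normalized layout the name would be repeated in every post row, and the join would not be needed.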

Re: High Load Errors are annoying...

Posted: Fri Jan 18, 2008 4:00 pm
by Murtak

True. But whatever form the data is in, it's still just text. The only trouble I am likely to run into is undocumented fields with cryptic values (say, a field named 'xplus' which denotes user status with 'b' being admin and '8' being a normal user). I am much more worried about encodings.
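
A rough sketch of how both worries could be handled, assuming the export's encoding is one of a short candidate list and using the made-up 'xplus' codes from above. The encoding list and the status table are guesses for illustration, not anything known about the real data.

```python
# Assumed candidates; the real export's encoding is unknown.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_field(raw: bytes) -> str:
    """Try each candidate encoding in order, strictest first."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails,
    # but the result may still be the wrong text.
    return raw.decode("latin-1")


# Hypothetical lookup for the cryptic 'xplus' example field.
USER_STATUS = {"b": "admin", "8": "normal"}


def user_status(code: str) -> str:
    """Translate a status code, flagging unknown values instead of guessing."""
    return USER_STATUS.get(code, "unknown:" + code)
```

Trying UTF-8 first is deliberate: it rejects most text that was written in a single-byte encoding, while the single-byte encodings accept almost anything.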

Are there any other boards you are interested in?

Re: High Load Errors are annoying...

Posted: Fri Jan 18, 2008 6:45 pm
by Zherog
phpBB is, in my opinion, the best looking free software out there.

Re: High Load Errors are annoying...

Posted: Sat Jan 19, 2008 5:42 pm
by Murtak
I downloaded and installed the phpBB stack and am trying to decipher its table structure. So far it does not look too bad. Does anyone have a bbb backup for me to chew on?

Re: High Load Errors are annoying...

Posted: Sat Jan 19, 2008 10:32 pm
by fbmf
Wish I did. My computer guru could not make it to Ft. Worth this weekend. Next weekend perhaps. Shit.

Game On,
fbmf

Re: High Load Errors are annoying...

Posted: Sun Jan 20, 2008 3:39 pm
by Surgo
Murtak: go to the bbboy sample board; you can download a backup from it.

Re: High Load Errors are annoying...

Posted: Sun Jan 20, 2008 6:26 pm
by Murtak

Just downloaded the sample backup. I'm going to try again later, but either my backup is corrupted or they are using encryption or some weird compression format.
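
One quick way to tell "compressed" from "corrupted or encrypted" is to look at the first few bytes of the file. The sketch below checks the well-known signatures for gzip, bzip2, zip and xz; whether the backup uses any of these is purely an assumption, and `sniff_format` is a hypothetical helper name.

```python
# Well-known file signatures (magic bytes) for common compression formats.
MAGIC = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"PK\x03\x04", "zip"),
    (b"\xfd7zXZ\x00", "xz"),
]


def sniff_format(head: bytes) -> str:
    """Guess a file's format from its leading bytes."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    if head:
        # Count printable ASCII plus tab, newline and carriage return.
        printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in head)
        if printable / len(head) > 0.95:
            return "text"
    return "unknown"
```

In practice one would pass in something like the first 512 bytes, read with `open(path, "rb").read(512)`. An "unknown" result is consistent with encryption, a proprietary format, or corruption; the check cannot tell those apart.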

Re: High Load Errors are annoying...

Posted: Sun Jan 27, 2008 4:51 pm
by Murtak
So I had another go over the weekend and I still can't get anything even loosely resembling readable text from the sample board backup.

Does anyone have a readable file for me to work on? I don't care whether it's bogus data, but what I get from the sample board won't work.

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 4:54 am
by Aycarus
Is it possible to parse the forums / messages directly from the HTML (that is, create a spider that reads every thread/page and saves the data into some sort of flatfile)? This would result in whispers and PMs being lost, but it is an option.

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 6:17 am
by Maj
I've had a programmer friend also take a look at the file, and he said it's undecipherable.

I think what we're going to have to do is upgrade to version 2 of the boards (in theory, you get a month free), and then take a look at the databases.

What a pain in the ass.

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 8:36 am
by Murtak
Aycarus wrote:
Is it possible to parse the forums / messages directly from the HTML (that is, create a spider that reads every thread/page and saves the data into some sort of flatfile)? This would result in whispers and PMs being lost, but it is an option.

Possible, yes.

However, with the unbelievably bad markup on these boards (HTML Tidy gives me 190 warnings on this single page), parsing it for content is going to be a nightmare.

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 3:40 pm
by Aycarus
Murtak wrote:
Aycarus wrote:
Is it possible to parse the forums / messages directly from the HTML (that is, create a spider that reads every thread/page and saves the data into some sort of flatfile)? This would result in whispers and PMs being lost, but it is an option.

Possible, yes.

However, with the unbelievably bad markup on these boards (HTML Tidy gives me 190 warnings on this single page), parsing it for content is going to be a nightmare.


It's doable. If you look at "view source" it's actually quite structured. I'm nearly certain I can write a parser that will take the boards and convert them to a flatfile DB, if you can take the flatfile DB and convert it to whatever phpBB (or whatever) needs. Let me know.
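
For a sense of what such a parser might look like, here is a small sketch using Python's standard `html.parser`, which tolerates unclosed and badly nested tags. The markup it expects (`<span class="author">` and `<span class="body">`) is entirely hypothetical; the real pages would need their own rules, and this is not the approach taken in any parser posted in this thread.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collects (author, text) pairs from spans with class 'author' / 'body'.

    The class names are hypothetical placeholders for whatever the real
    board markup uses. Nested <span> tags inside a body are not handled.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None   # name of the field currently being read, if any
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("author", "body"):
            self._field = cls
            self._current[cls] = ""

    def handle_data(self, data):
        if self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            finished = self._field
            self._field = None
            if finished == "body":
                # A body closes the post; emit it and start a fresh record.
                self.posts.append((
                    self._current.get("author", "").strip(),
                    self._current["body"].strip(),
                ))
                self._current = {}


def extract_posts(html: str):
    """Return a list of (author, text) tuples found in the page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Because the parser only reacts to the tags it cares about, stray `<font>` or unclosed `<td>` tags elsewhere on the page are simply ignored.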

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 5:20 pm
by Murtak

Well, if you can give me some sort of structured data I can give it a try. However I haven't written any spiders yet. Have you?

Oh, and if possible, it would be great if you could put your text files in YAML format, like so:

Code:

    post1:
      username: "Murtak"
      text: "blabla"


Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 5:30 pm
by Aycarus
I'd prefer XML myself since then you don't have collisions with quotation marks... or some sort of hybrid. Is the following format okay?

Code:

    post1:
      username: "Murtak"
      date: "10:36:11 Sun Feb 3 2008"
      subject: "Re: High Load Errors are annoying..."
      post: <POSTTEXT>[post text]</POSTTEXT>
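
A minimal sketch of writing one record in this hybrid format. The function names are hypothetical, and backslash-escaping quotation marks in the metadata fields is an assumption; the format above does not say how quotes inside a username or subject should be handled.

```python
def quote(value: str) -> str:
    """Wrap a metadata value in double quotes.

    Backslash-escaping is an assumption, not part of the agreed format.
    """
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def format_post(index, username, date, subject, text):
    """Serialize one post as a record in the hybrid YAML/XML layout."""
    return (
        f"post{index}:\n"
        f"  username: {quote(username)}\n"
        f"  date: {quote(date)}\n"
        f"  subject: {quote(subject)}\n"
        # The post text goes in unescaped, so it must not itself
        # contain the literal closing tag </POSTTEXT>.
        f"  post: <POSTTEXT>{text}</POSTTEXT>\n"
    )
```

Keeping the post body inside the tag pair means quotation marks in the text need no escaping, which is the collision the XML part is meant to avoid.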

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 6:28 pm
by Murtak
That should be fine.

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 6:34 pm
by Aycarus
Proof of concept:
BBBoyParser.cpp

Compile this program with g++ on the command line:

g++ -o BBBoyParser BBBoyParser.cpp

The parser takes a BBBoy .html file as input and outputs a "parsed" file in the aforementioned format. Not thoroughly tested, but it worked on at least one test page.

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 7:38 pm
by Aycarus
Does anybody know if one can configure their user_cp to display all pages of a thread or all threads of a forum on a single page? i.e. without having to click through multiple pages of the thread or forum?

Re: High Load Errors are annoying...

Posted: Sun Feb 03, 2008 11:53 pm
by Jacob_Orlove
I couldn't find anything to allow that, but it should be possible for an admin to set the # posts/page to a much higher number, which would do the trick for all but a few threads.

Re: High Load Errors are annoying...

Posted: Mon Feb 04, 2008 3:28 am
by Maj
There's a limit to the number that an admin can set (it's 50, I think, but I'm not positive of that). The more posts/page, though, the more likely you are to encounter the high load errors, according to support.

Re: High Load Errors are annoying...

Posted: Mon Feb 04, 2008 1:17 pm
by Crissa
I've learned three things from this thread...

...We still don't know why we use 'more' cpu time...
...tzor's mother did unspeakable things to him as a child...
...And bbboy must've lost their coders in the web crunch.

-Crissa

Re: High Load Errors are annoying...

Posted: Mon Feb 04, 2008 1:35 pm
by Aycarus
Seems I had to do the extra work and write the spider to take into account multiple pages on threads. So... this is what we can do:

- Spider all the HTML on nifty [done-ish]
- Run HTML => Flatfile parser [done - need a script to automate this]
- Run Flatfile DB => phpBB DB [in progress]
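
The first step, following a thread across its pages, could be sketched roughly as below. The "Next" link pattern is a guess at the markup, and `crawl_thread` and its `fetch` parameter are hypothetical names; this is an illustration of the idea, not the spider described above.

```python
import re

# Hypothetical "next page" link; the real board markup is not shown here.
NEXT_LINK = re.compile(r'<a href="([^"]+)">Next</a>')


def crawl_thread(first_url, fetch, max_pages=1000):
    """Collect the HTML of every page of a thread.

    fetch(url) must return the page's HTML as a string. Passing it in
    keeps the crawl logic testable without touching the network.
    """
    pages, seen, url = [], set(), first_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)            # guards against link loops
        html = fetch(url)
        pages.append(html)
        match = NEXT_LINK.search(html)
        url = match.group(1) if match else None
    return pages
```

For a real run, `fetch` could wrap `urllib.request.urlopen`, ideally with a short pause between requests, given the board's high load errors.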

I think it's totally manageable, and best of all, free. Tho... anyone else feel kinda treasonous for discussing the idea here?

Re: High Load Errors are annoying...

Posted: Mon Feb 04, 2008 1:44 pm
by tzor
Crissa wrote:
...We still don't know why we use 'more' cpu time...
...tzor's mother did unspeakable things to him as a child...
...And bbboy must've lost their coders in the web crunch.


...We still don't know if the cpu thing is true or just a vanilla lie
...well let's not speak about them, OK?
...they were outsourced to India during the outsourcing rush

Re: High Load Errors are annoying...

Posted: Mon Feb 04, 2008 3:41 pm
by Zherog
Aycarus wrote:
Tho... anyone else feel kinda treasonous for discussing the idea here?


No, not in the least. Their software sucks ass; their support sucks more ass; I don't mind telling them to their face (I did, but they opted to delete it and give me a warning), so I sure as hell don't mind saying it here.

As for spiders and such... I'll fully admit I know jack shit about html, xml, and so on. Wanna know about Oracle databases or Oracle Applications? Good chance I can help you. Wanna know about alphabet soup mark-up languages? No clue here.

Our current working theory over on Nifty is to convert to BbSuckass v2; that version uses an actual MySQL database, unlike the current BbSuckass version. Once we have the forums in a MySQL database, in theory it shouldn't be difficult to extract the data and insert it into phpBB (or another free forum package). The downside, as Maj said, is we'd have to write, test, and implement the conversion scripts in a month.

Maj's programmer friend she mentioned is helping out with the conversion. I'll be sure he gets a look at your crawler.

Re: High Load Errors are annoying...

Posted: Mon Feb 04, 2008 7:59 pm
by Aycarus
Zherog wrote:
Our current working theory over on Nifty is to convert to BbSuckass v2; that version uses an actual MySQL database, unlike the current BbSuckass version. Once we have the forums in a MySQL database, in theory it shouldn't be difficult to extract the data and insert it into phpBB (or another free forum package). The downside, as Maj said, is we'd have to write, test, and implement the conversion scripts in a month.


Inevitably their database formats will be different, which will probably be a pain in the ass when it comes to converting between the two. You'll also still have to go through the trouble of modifying the BBCode itself, since the two packages don't format it the same way. As a whole, it should be fun! :biggrin:
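
As an illustration of the BBCode part, here is a small sketch that renames tags with a lookup table. The source-side tag names (`[bold]`, `[italic]`, `[br]`) are hypothetical stand-ins, since the old board's actual dialect would have to be catalogued first; `[b]` and `[i]` are the usual phpBB forms.

```python
import re

# Hypothetical source-dialect tag names mapped to phpBB-style BBCode.
TAG_MAP = {"bold": "b", "italic": "i"}

# Matches simple tags such as [bold] or [/bold]; tags that carry
# attributes (for example [url=...]) are left untouched.
SIMPLE_TAG = re.compile(r"\[(/?)([A-Za-z]+)\]")


def convert_bbcode(text: str) -> str:
    """Rewrite simple tags according to TAG_MAP and expand [br] to newlines."""
    text = text.replace("[br]", "\n")

    def rename(match):
        slash, name = match.group(1), match.group(2).lower()
        # Unknown tags pass through unchanged apart from being lowercased.
        return f"[{slash}{TAG_MAP.get(name, name)}]"

    return SIMPLE_TAG.sub(rename, text)
```

A table-driven approach keeps the per-tag decisions in one place, which helps when the list of differences grows as more posts are checked.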

How much are you hoping to salvage, anyway? Converting the messages themselves is not too big of a problem... whispers will be essentially impossible... PMs are doable, but will require some thought.