Twitter posts retrieval script

**PeterAhlstrom** · March 2, 2012

Some of you might be interested in this, so here it is. Also, if you are good at code, maybe you can tell me where I am laughably inefficient.

First I get the timeline.

wget -O user_timeline.xml "http://api.twitter.com/1/statuses/user_timeline.xml?screen_name=brandsanderson&count=200&trim_user=true&since_id=168023980798779395"

(The since_id changes based on whatever I retrieved last time. Also, 200 posts is the max Twitter allows without OAuth.)

Then I do this:

/usr/bin/perl tweetthing.pl < user_timeline.xml > sorted.html

Here is my tweetthing.pl script:

#!/usr/bin/perl
use LWP::Simple;
use URI;
use URI::Find;
use Time::Piece;
use HTML::Entities;

# "my" needs parentheses to declare more than one variable at once
my ($bigbuf, $buf, $i);

# gather up all the input
while (read(STDIN, $buf, 1024)){
$bigbuf .= $buf;
}

#remove multiple spaces
#$bigbuf =~ s/\s+/ /g;

# split up the input into relevant tokens
my @parts = split(/<status>\n/, $bigbuf);
@parts = reverse @parts;
#remove last part, which is extraneous
pop(@parts);
# Add div tag to beginning of output document
print "<div class='brantweets'>";

foreach (@parts) {
my @brandontweet = split(/(<created_at>|<\/created_at>\n  <id>|<\/id>\n  <text>|<\/text>|<in_reply_to_status_id>|<\/in_reply_to_status_id>\n  <in_reply_to_user_id>|<\/in_reply_to_user_id>|<in_reply_to_screen_name>|<\/in_reply_to_screen_name>)/,$_);
#In the array brandontweet:
#Part 2 is the timestamp
#Part 4 is the status number
#Part 6 is the status
#Part 10 is the in-reply-to status number
#Part 12 is the in-reply-to user ID
#Part 16 is the in-reply-to screen name
my $brandonstatusid = $brandontweet[4];
#convert timezone to local
my $brandondatestamp = Time::Piece->strptime($brandontweet[2], "%a %b %d %H:%M:%S %z %Y");
my $brandondate = $brandondatestamp->strftime("%a %b %d");
#get rid of html entities
my $brandonstatus = decode_entities($brandontweet[6]);
#remove multiple spaces between sentences
$brandonstatus =~ s/\s+/ /g;
find_uris($brandonstatus, sub {
	my ($find_uri, $orig_uri) = @_;
	my $uri = URI->new( $orig_uri );
	$uri = $uri->canonical->as_string;
	return '<a href="' . $uri . '">' . $uri . '</a>';
});
my $fanuserid = $brandontweet[12];
my $fanusername = $brandontweet[16];
my $fanstatusid = $brandontweet[10];
#"ne" compares as strings; "!=" would compare the empty string as a number
if ($fanstatusid ne ""){
	my $url="http://api.twitter.com/1/statuses/show/".$fanstatusid.".xml";
	my @fantweet = split(/(<created_at>|<\/created_at>|<text>|<\/text>|<profile_image_url>|<\/profile_image_url>)/,get($url));
	#In the array fantweet, Part 6 is the status
	#Part 2 is the timestamp
	#Part 10 is the image URL
	#convert timezone to local
	my $fandatestamp = Time::Piece->strptime($fantweet[2], "%a %b %d %H:%M:%S %z %Y");
	my $fandate = $fandatestamp->strftime("%a %b %d");
	#get rid of html entities
	my $fanstatus = decode_entities($fantweet[6]);
	#remove multiple spaces between sentences
	$fanstatus =~ s/\s+/ /g;
	find_uris($fanstatus, sub {
   		my ($find_uri, $orig_uri) = @_;
   		my $uri = URI->new( $orig_uri );
   		$uri = $uri->canonical->as_string;
   		return '<a href="' . $uri . '">' . $uri . '</a>';
	});
	my $fanimage = $fantweet[10];
	print "<p><img src='".$fanimage."'><a href='http://twitter.com/".$fanusername."/status/".$fanstatusid."'><b>".$fanusername."</b></a> ".$fandate."<br/>".$fanstatus."</p>\n<blockquote><p class='brtw'><a href='http://twitter.com/BrandSanderson/status/".$brandonstatusid."'><b>BrandSanderson</b></a> ".$brandondate."<br/>".$brandonstatus."</p></blockquote>\n\n";
}
else{
	print "<p class='brtw'><a href='http://twitter.com/BrandSanderson/status/".$brandonstatusid."'><b>BrandSanderson</b></a> ".$brandondate."<br/>".$brandonstatus."</p>\n\n";
}
}

# Close div tag in output document
print "</div>";

Then here is the css I stick at the beginning of a post (sorted.html).

<style type="text/css">div.brantweets p {min-height:58px}div.brantweets img {float:left;border:0;margin:5px 5px 0 0;height:48px;width:48px}p.brtw {background:url(http://brandonsanderson.com/images/Llama_Face.png) no-repeat 0px 5px;padding:0 0 0 53px;}</style>

The max size post that Brandon's website allows is about 51k, so if I have more than 30somethingk collected, I make a new Twitter posts archive.

Sorry about the stretched screen...

EDIT: Oh yeah, after I have the sorted html file, I go through it manually and hook up the longer conversations, or when Brandon makes more than one reply to the same tweet. And I fix when he replies to the wrong person, etc. etc.

**KChan** · March 2, 2012

No worries about the screen. It actually reminded me that I needed to fix that particular element, which turned out to be a much bigger pain than I thought it would.

Anyways, it's all good now.

**Eric Peters** · March 2, 2012

Here's something to store the tweet archive in a simple TSV file. That way you can change the formatting later if you ever need to.

#!/usr/bin/perl

# echo "168023980798779395" > lastTweet.tsv
# touch tweetArchive.tsv

use XML::Simple;
use Data::Dumper;
my $xml = new XML::Simple('SuppressEmpty' => 1);

my $tmpFile = "tmpFile$$";
my $lastIdFile = "lastTweet.tsv";
my $archiveFile = "tweetArchive.tsv";

my $lastId = "";
open(FILE, "$lastIdFile") || die "couldn't open $lastIdFile: $!";
while(<FILE>) { $lastId .= $_ }
chomp($lastId);
close(FILE);
my $URL = "http://api.twitter.com/1/statuses/user_timeline.xml?screen_name=brandsanderson&count=200&trim_user=true&since_id=$lastId";

my $curlParams = " -s "; #silent, can add other parameters
my $curlCmd = 'curl '.$curlParams.'  -o "'.$tmpFile.'" "'.$URL.'"';
#print $curlCmd . "\n";
system($curlCmd);

open(ARCHIVE, ">>$archiveFile") || die "couldn't open archive file for appending: $!";
$data = $xml->XMLin($tmpFile);
unlink($tmpFile);
#print Dumper($data) . "\n";
my %statusHash = %{$data->{status}};

foreach my $id ( keys %statusHash ) {
 my $unode = $statusHash{$id};
 # one tab-separated line per tweet, status ID first
 print ARCHIVE join("\t",
($id
,$unode->{text}
,$unode->{truncated}
,$unode->{favorited}
,$unode->{in_reply_to_status_id}
,$unode->{in_reply_to_user_id}
,$unode->{in_reply_to_screen_name}
,$unode->{retweet_count}
,$unode->{user}->{name}
,$unode->{created_at}
)) . "\n";
 $lastId = $id if($id > $lastId);
}

close(ARCHIVE);

# Write out the last status ID
open(FILE, ">$lastIdFile") || die "couldn't open last id tracking file: $!";
print FILE $lastId . "\n";
close(FILE);

**PeterAhlstrom** · March 3, 2012

Eric, I'm not sure if your post is directed at me when it says "you"... Storing the tweets as tab-separated values is much much less useful than what I'm currently doing.

**Eric Peters** · March 5, 2012

> PeterAhlstrom said:
> Eric, I'm not sure if your post is directed at me when it says "you"... Storing the tweets as tab-separated values is much much less useful than what I'm currently doing.

I think it's VERY useful to store tab-separated values. That way you can generate random HTML archives at any point later on. *shrug* Each to his own.
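As a rough sketch of what "generate HTML later" could look like, the snippet below turns archive lines into paragraphs styled like the ones in the original script. It assumes only that the first two tab-separated columns are the status ID and the tweet text, as in the archive script above. The helper names `escape_html` and `tsv_line_to_html` and the sample lines are made up for illustration, and it does not look up the tweets being replied to.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn one archive line into one paragraph. Only the first two columns
# (status ID, text) are used; the rest of the line is ignored.
sub tsv_line_to_html {
    my ($line) = @_;
    chomp $line;
    my ($id, $text) = split /\t/, $line;
    return "<p class='brtw'><a href='http://twitter.com/BrandSanderson/status/$id'>"
         . "<b>BrandSanderson</b></a><br/>" . escape_html($text) . "</p>";
}

# Hypothetical archive lines standing in for tweetArchive.tsv.
my @archive = (
    "101\tFirst tweet & more\tfalse\tfalse",
    "102\tSecond tweet\tfalse\tfalse",
);

my @paragraphs = map { tsv_line_to_html($_) } @archive;
print "<div class='brantweets'>\n", join("\n", @paragraphs), "\n</div>\n";
```

To run it against the real archive, the `@archive` list could be replaced by the lines read from `tweetArchive.tsv`.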

**Joe ST** · March 5, 2012

> Eric Peters said:
> > PeterAhlstrom said:
> > Eric, I'm not sure if your post is directed at me when it says "you"... Storing the tweets as tab-separated values is much much less useful than what I'm currently doing.
>
> I think it's VERY useful to store tab-separated values. That way you can generate random HTML archives at any point later on. *shrug* Each to his own.

I think he means, you should keep them *all* in a local database, and then just syphon off what you want when you want them.

**PeterAhlstrom** · March 5, 2012

You can generate random html archives from TSV files if you have a script to do so. Which I don't. Anyway, to me, TSV seems less useful than the original XML, which is well tagged so I know exactly what each item is for.

I was actually hoping for a better way to do the stuff I use "split" for, the foreach @parts thing. But maybe that's a good way to do it already? This script is essentially the whole of my perl knowledge, and I don't even understand some of the stuff it does, like s/\s+/ /g; — esoteric stuff drives me nuts. I took some computer science courses in college, Java mostly, which I have almost entirely forgotten, so I just wing it when it comes to stuff like this and the various javascript stuff on Brandon's store pages.
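One possible alternative to the positional split, sketched under some assumptions: pull each field out by its tag name with a small helper, so the code says `$tweet{text}` instead of `$brandontweet[6]`. The helper name `field` and the shortened sample block are made up for illustration. The helper returns the first match only, so it relies on the status's own `<id>` coming before any nested one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text inside the first <tag>...</tag>, or "" if the tag is
# absent. \Q...\E keeps the tag name from being read as regex syntax.
sub field {
    my ($xml, $tag) = @_;
    if ($xml =~ m{<\Q$tag\E>(.*?)</\Q$tag\E>}s) {
        return $1;
    }
    return "";
}

# Hypothetical, shortened <status> block in the shape the script expects.
my $status = <<'XML';
  <created_at>Fri Mar 02 17:05:11 +0000 2012</created_at>
  <id>175000000000000001</id>
  <text>Thanks for reading!</text>
  <in_reply_to_status_id>175000000000000000</in_reply_to_status_id>
  <in_reply_to_screen_name>somefan</in_reply_to_screen_name>
XML

# Collect the fields into a hash keyed by tag name.
my %tweet;
foreach my $tag (qw(created_at id text in_reply_to_status_id in_reply_to_screen_name)) {
    $tweet{$tag} = field($status, $tag);
}

print "$tweet{id}: $tweet{text} (reply to $tweet{in_reply_to_screen_name})\n";
```

The outer `split(/<status>\n/, ...)` and the `foreach (@parts)` loop could stay as they are; only the long inner split and the numbered parts would change. A real XML parser, such as the XML::Simple module used in the TSV script above, is the sturdier option if the feed layout ever shifts.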

**Joe ST** · March 5, 2012

> PeterAhlstrom said:
> You can generate random html archives from TSV files if you have a script to do so. Which I don't. Anyway, to me, TSV seems less useful than the original XML, which is well tagged so I know exactly what each item is for.
>
> I was actually hoping for a better way to do the stuff I use "split" for, the foreach @parts thing. But maybe that's a good way to do it already? This script is essentially the whole of my perl knowledge, and I don't even understand some of the stuff it does, like s/\s+/ /g; esoteric stuff drives me nuts. I took some computer science courses in college, Java mostly, which I have almost entirely forgotten, so I just wing it when it comes to stuff like this and the various javascript stuff on Brandon's store pages.

I could try and comment it up if you want (to try and explain the more esoteric bits that I guess are just copypasta?), and maybe add some improvements... it's just I have no knowledge of perl, so I'd be just as much in the dark as you, lol

**PeterAhlstrom** · March 6, 2012

Hey, I'm totally open to using something else like php if it's better for the situation. It just has to be an end-to-end solution that does what this does already.

Yeah, there's a lot of copypasta in here. Something else I don't understand at all is the -> operations. Actually, for the esoteric stuff, I'd prefer a plain English alternative that works in the code. The most esoteric I'm up for is the ternary operator in javascript, and that's with reservations because every time I want to use it I have to look it up to remind myself of the syntax.

I do understand the %H:%M:%S part pretty well and the find_uris section, because they use terms that are easy to relate to the actual thing they do. Well, I don't know what @_ means. Besides one eye and a mouth.

OMG, I was just thinking how much I loved Hypercard back in the day, yet it was a shame it didn't support arrays, so I just searched and found out it DID support arrays:

> Yes, you can use variables in a manner similar to fields to simulate arrays. For example, you can say "line 1 of data", where "data" is the name of a local or global variable. For example:
>
> put 1 into line 1 of data
>
> put 2 into line 2 of data
>
> put 4 into line 3 of data
>
> put (line 1 of data) + (line 2 of data) + (line 3 of data) into message
>
> Of course, you can use variables instead of literals:
>
> put 3 into n
>
> put line n of data into message
>
> Instead of 'lines' you can use 'items':
>
> put "ABC" into item 2 of data
>
> For multiply-dimensioned arrays you can do:
>
> put "ABC" into item 2 of line 3 of data

If I'd known this back in the early 90s it would have made that Star Trek game I was making in Hypercard work much much better. Instead I had a hidden card with a ton of text fields on it and an algorithm to change a multiply-dimensioned array into a field number...

Edited March 6, 2012 by PeterAhlstrom

**Eric Peters** · March 6, 2012

> PeterAhlstrom said:
> You can generate random html archives from TSV files if you have a script to do so. Which I don't. Anyway, to me, TSV seems less useful than the original XML, which is well tagged so I know exactly what each item is for.
>
> I was actually hoping for a better way to do the stuff I use "split" for, the foreach @parts thing. But maybe that's a good way to do it already? This script is essentially the whole of my perl knowledge, and I don't even understand some of the stuff it does, like s/\s+/ /g; esoteric stuff drives me nuts. I took some computer science courses in college, Java mostly, which I have almost entirely forgotten, so I just wing it when it comes to stuff like this and the various javascript stuff on Brandon's store pages.

The leading s/ starts a search-and-replace. \s is the regex class that matches one whitespace character (space, tab, newline, etc.), and the + means "one or more times". The part between the second and third slashes, / /, is the replacement: a single space. The trailing g means "global", so every match in the string gets replaced rather than only the first one. The net effect is that every run of whitespace becomes one space.
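A small sketch of that substitution on its own, with and without the g flag. The sample string and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text with a tab, a newline, and runs of spaces.
my $text = "Signing  in\tSalt Lake\n  tonight.   See you there!";

# s/PATTERN/REPLACEMENT/FLAGS
#   \s  one whitespace character (space, tab, newline, ...)
#   +   one or more of the preceding item
#   g   "global": replace every match, not just the first
# (my $copy = $text) =~ ... edits a copy and leaves $text untouched.
(my $collapsed = $text) =~ s/\s+/ /g;

# Without g, only the first run of whitespace is replaced.
(my $first_only = $text) =~ s/\s+/ /;

print "$collapsed\n";
```

With g, the result is a single line with one space between words. Without it, only the double space after "Signing" is fixed and the tab and newline stay in place.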

I have much perl fu, let me know if you have any specific questions. Guess I wasn't quite sure what you were specifically asking for. I still believe the right approach is to store the archive of the "raw" tweets in some sort of data file (TSV, BDB, MySQL, etc.), because then you gain the flexibility of reformatting them later.

-Eric

EDIT:

> PeterAhlstrom said:
> Hey, I'm totally open to using something else like php if it's better for the situation. It just has to be an end-to-end solution that does what this does already.
>
> Yeah, there's a lot of copypasta in here. Something else I don't understand at all is the -> operations. Actually, for the esoteric stuff, I'd prefer a plain English alternative that works in the code. The most esoteric I'm up for is the ternary operator in javascript, and that's with reservations because every time I want to use it I have to look it up to remind myself of the syntax.
>
> I do understand the %H:%M:%S part pretty well and the find_uris section, because they use terms that are easy to relate to the actual thing they do. Well, I don't know what @_ means. Besides one eye and a mouth.
>
> OMG, I was just thinking how much I loved Hypercard back in the day, yet it was a shame it didn't support arrays, so I just searched and found out it DID support arrays: [...] If I'd known this back in the early 90s it would have made that Star Trek game I was making in Hypercard work much much better. Instead I had a hidden card with a ton of text fields on it and an algorithm to change a multiply-dimensioned array into a field number...

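$_ is the default variable: loops and built-ins such as foreach, while(<FILE>), and regex matches read from or write to it when you don't name a variable yourself. @_ is the array of arguments passed to the current subroutine, which is why the find_uris callback starts with my ($find_uri, $orig_uri) = @_;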

Good little article on it is: http://www.wellho.net/mouth/969_Perl-and-.html

Generally I like to do stuff like:

while(<FILE>) {
 chomp($_);
 my $line = $_;

 if($line =~ /blahblah/) {

 }
}

That way I can "save" the current line in a more friendly named variable. $_ and @_ belong to the same family of built-in special variables as $1, $2, $3, which hold the captured groups from the last successful regex match.
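Here is a minimal sketch that puts `$_`, `@_`, and `$1`/`$2` side by side, plus the `->` arrow asked about above. Everything in it (the `link_for` helper, the sample lines, the `$tweet` structure) is invented for illustration and is not part of either script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# @_ holds the arguments passed to the current subroutine.
sub link_for {
    my ($user, $id) = @_;          # copy the arguments into named variables
    return "http://twitter.com/$user/status/$id";
}

my @lines = ("id=101 user=alice", "no match here", "id=202 user=bob");
my @links;

# With no loop variable given, foreach puts each element in $_.
foreach (@lines) {
    my $line = $_;                 # save it under a friendlier name
    # Parentheses capture; after a successful match, $1 and $2 hold
    # the text matched by the first and second pair.
    if ($line =~ /id=(\d+) user=(\w+)/) {
        push @links, link_for($2, $1);
    }
}

# -> reaches through a reference: here, into a nested hash reference.
my $tweet = { text => "hello", user => { name => "alice" } };
my $name  = $tweet->{user}->{name};

print "$_\n" for @links;
print "$name\n";
```

The same arrow also calls methods, as in `URI->new(...)` and `$uri->canonical` in the original script, and `$unode->{user}->{name}` in the TSV script is the hash-reference form shown here.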

Edited March 6, 2012 by KChan
Doublepost

**KChan** · March 6, 2012

Eric, if you want to quote two different posts, please don't doublepost. We have a multi-quote feature for that instead. Thanks!

**Joe ST** · March 6, 2012

Hey there, after a bit of coding, I came up with this html page. I don't think it exactly duplicates the functionality of your script, and it's probably got plenty of bugs in it (in particular, it doesn't sort the tweets yet). I got this far and ran out of API allowance.

Basically I switched to the JSON outputs, used jQuery to JSON-P them into local variables, which I then iterate over and append the `<p class='bwst'><a href=twitter.com>...` lines to the DOM directly, rather than via 'nasty' strings. Whilst doing this, I get the reply-to tweets (recursively) and append each of those onto the DOM too.

I will then do something like `$('[data-date]').sort()` on the data-date attribute, leaving them all in correct date order.

I can also then maybe strip out the data-date attributes if you want.

tweets.html

**PeterAhlstrom** · March 6, 2012

Joe,

That looks interesting and promising. I can't figure out how to save its output. If I open it in a browser and look at the source or save as an html file, it just gives me your code, not the output of your code. Would it do the time zone shifting that the original code does? Mine also does automatic URL parsing.

Eric,

Reformatting takes too much time to do manually. That's why I cobbled together the script in the first place.

**Joe ST** · March 6, 2012

Yes, it can do url-parsing, tz-shifting, etc. I just can't put that stuff in atm, as I ran out of API requests lol.

Hmmm, the saving... good question, I can make it add a textarea at the bottom containing the source of the file, if you want. It won't be pretty printed though :\
